[rs6000] 64-bit integer loads/stores and FP instructions

Message ID 2055734.X6m1ecN8ti@polaris
State New
Headers show
Series
  • [rs6000] 64-bit integer loads/stores and FP instructions
Related show

Commit Message

Eric Botcazou Feb. 6, 2019, 10:08 p.m.
Hi,

as reported e.g. at https://gcc.gnu.org/ml/gcc-help/2018-11/msg00038.html, the 
7 series of compilers started to use FP instructions for simple 64-bit integer 
loads/stores in unexpected ways.  Consider:

uint64_t var;

void foo1 (uint64_t *p)
{
  var = *p;
}

void foo2 (uint64_t *p)
{
  *p = var;
}

void foo3 (void * p)
{
  var = *(uint64_t *)p;
}

void foo4 (void *p)
{
  *(uint64_t *)p = var;
}

At -O0, the 32-bit compiler uses a double lwz/stw pairs for the 4 functions.  
At -O1, it uses a lfd/stfd pair for foo2 and foo4, but neither for foo1 nor 
foo3.  At -O2, it again uses a double lwz/stw pairs for the 4 functions.

This was introduced by this change:
  https://gcc.gnu.org/ml/gcc-patches/2016-11/msg01993.html

Now, if you try with the GCC 8 compiler, then you get back the double lwz/stw 
pairs for the 4 functions at every optimization level.

The difference between GCC 7 and GCC 8 comes from this change:
  https://gcc.gnu.org/ml/gcc-patches/2018-06/msg00584.html
which seems to imply that the first change was problematic but which was not 
backported to the 7 branch.

So at a minimum I think that the second change should be backported to the 7 
branch; it applies (almost) cleanly and gives no regressions on PowerPC/Linux.

	Backport from mainline
	2018-06-11  Segher Boessenkool  <segher@kernel.crashing.org>

	PR target/85755
	* config/rs6000/rs6000.md (*movdi_internal32): Put constraint modifiers
	on the correct operand.
	(*movdi_internal64): Ditto.


But I also think that another part of the first change is problematic, namely 
the removal of the '*' modifier on the alternatives, which means that the 
register preference of the instruction isn't skewed towards integer registers 
any more.  That may be OK for native targets, but that is frowned upon for 
embedded targets, where people are really unhappy when the compiler emits 
floating-point instructions for pure integer code for no apparent reason.

So I suggest decoupling the (recent) native targets from the rest for this 
specific issue, see a proof of concept in the second patch.

Thoughts?

-- 
Eric Botcazou

Comments

Segher Boessenkool Feb. 7, 2019, 12:49 p.m. | #1
Hi!

On Wed, Feb 06, 2019 at 11:08:44PM +0100, Eric Botcazou wrote:
> as reported e.g. at https://gcc.gnu.org/ml/gcc-help/2018-11/msg00038.html, the 

> 7 series of compilers started to use FP instructions for simple 64-bit integer 

> loads/stores in unexpected ways.  Consider:


<snip>

> The difference between GCC 7 and GCC 8 comes from this change:

>   https://gcc.gnu.org/ml/gcc-patches/2018-06/msg00584.html

> which seems to imply that the first change was problematic but which was not 

> backported to the 7 branch.

> 

> So at a minimum I think that the second change should be backported to the 7 

> branch; it applies (almost) cleanly and gives no regressions on PowerPC/Linux.

> 

> 	Backport from mainline

> 	2018-06-11  Segher Boessenkool  <segher@kernel.crashing.org>

> 

> 	PR target/85755

> 	* config/rs6000/rs6000.md (*movdi_internal32): Put constraint modifiers

> 	on the correct operand.

> 	(*movdi_internal64): Ditto.


Backporting this is okay.  (It was not done because it does not affect
correctness).  What is the "almost", btw?

> But I also think that another part of the first change is problematic, namely 

> the removal of the '*' modifier on the alternatives, which means that the 

> register preference of the instruction isn't skewed towards integer registers 

> any more.


(In https://gcc.gnu.org/ml/gcc-patches/2016-11/txt4j9hLRLCzf.txt and you
presumably mean the *movdi_internal32 hunk, and the first line of it:

-         "=Y,        r,         r,         ?m,        ?*d,        ?*d,
+         "=Y,        r,         r,         ^m,        ^d,         ^d,

where the last three are mem<-reg, reg<-mem, reg<-reg, for fp regs).

This patch was a year before we switched to LRA, and for non-LRA ports *
does not do any register preferencing.

>  That may be OK for native targets, but that is frowned upon for 

> embedded targets, where people are really unhappy when the compiler emits 

> floating-point instructions for pure integer code for no apparent reason.


For 6xx/7xx people *wanted* to use FP loads and stores whenever possible,
because you get twice the bandwidth with them.

If you do not want the FP registers used (or FP insns used), use
-msoft-float, that's what it's for.

Patches are welcome, but complicating this already very complex code by
introducing an arbitrary distinction between "embedded and other" (or
actually, "server and other") isn't the way to go.  Sorry.

Could you open a PR please?


Segher
Eric Botcazou Feb. 8, 2019, 10:46 a.m. | #2
> Backporting this is okay.  (It was not done because it does not affect

> correctness).  What is the "almost", btw?


The predicate of operand #0 of movdi_internal32 is rs6000_nonimmediate_operand 
on the 7 branch and nonimmediate_operand on the 8 branch and later.

> (In https://gcc.gnu.org/ml/gcc-patches/2016-11/txt4j9hLRLCzf.txt and you

> presumably mean the *movdi_internal32 hunk, and the first line of it:

> 

> -         "=Y,        r,         r,         ?m,        ?*d,        ?*d,

> +         "=Y,        r,         r,         ^m,        ^d,         ^d,

> 

> where the last three are mem<-reg, reg<-mem, reg<-reg, for fp regs).

> 

> This patch was a year before we switched to LRA, and for non-LRA ports *

> does not do any register preferencing.


Are you sure?  See record_reg_classes in ira-costs.c:

		case '*':
		  /* Ignore the next letter for this pass.  */
		  c = *++p;
		  break;

> For 6xx/7xx people *wanted* to use FP loads and stores whenever possible,

> because you get twice the bandwidth with them.

> 

> If you do not want the FP registers used (or FP insns used), use

> -msoft-float, that's what it's for.


No disagreement, but you can't force people to use it when they use FP code in 
other parts of their software.  Believe me, we (and probably others) tried...

> Patches are welcome, but complicating this already very complex code by

> introducing an arbitrary distinction between "embedded and other" (or

> actually, "server and other") isn't the way to go.  Sorry.


Not clear what you mean by complicating here...  the approach is rather simple 
and precisely aimed at not disturbing the common path, i.e. server processors,
while preserving the historical path for other processors.  AFAIK nobody has 
evaluated the effects of the original change beyond POWER 7/8/9.

> Could you open a PR please?


Sure, but about what now? ;-)

-- 
Eric Botcazou
Segher Boessenkool Feb. 8, 2019, 4:08 p.m. | #3
On Fri, Feb 08, 2019 at 11:46:37AM +0100, Eric Botcazou wrote:
> > Backporting this is okay.  (It was not done because it does not affect

> > correctness).  What is the "almost", btw?

> 

> The predicate of operand #0 of movdi_internal32 is rs6000_nonimmediate_operand 

> on the 7 branch and nonimmediate_operand on the 8 branch and later.


Ah k, that is fine of course.

> > (In https://gcc.gnu.org/ml/gcc-patches/2016-11/txt4j9hLRLCzf.txt and you

> > presumably mean the *movdi_internal32 hunk, and the first line of it:

> > 

> > -         "=Y,        r,         r,         ?m,        ?*d,        ?*d,

> > +         "=Y,        r,         r,         ^m,        ^d,         ^d,

> > 

> > where the last three are mem<-reg, reg<-mem, reg<-reg, for fp regs).

> > 

> > This patch was a year before we switched to LRA, and for non-LRA ports *

> > does not do any register preferencing.

> 

> Are you sure?


No, I checked the docs so I wouldn't spout nonsense, but then I misread.
Oops.

> > For 6xx/7xx people *wanted* to use FP loads and stores whenever possible,

> > because you get twice the bandwidth with them.

> > 

> > If you do not want the FP registers used (or FP insns used), use

> > -msoft-float, that's what it's for.

> 

> No disagreement, but you can't force people to use it when they use FP code in 

> other parts of their software.  Believe me, we (and probably others) tried...


Of course.  And we try to use integer registers only in normal cases, these
days.

> > Patches are welcome, but complicating this already very complex code by

> > introducing an arbitrary distinction between "embedded and other" (or

> > actually, "server and other") isn't the way to go.  Sorry.

> 

> Not clear what you mean by complicating here...  the approach is rather simple 

> and precisely aimed at not disturbing the common path, i.e. server processors,

> while preserving the historical path for other processors.  AFAIK nobody has 

> evaluated the effects of the original change beyond POWER 7/8/9.


You make an arbitrary distinction between certain CPUs and duplicate patterns
based on that.

> > Could you open a PR please?

> 

> Sure, but about what now? ;-)


About the bug: it should behave the same as before, use GPRs only in your
testcase.


Segher
Eric Botcazou Feb. 11, 2019, 9:07 a.m. | #4
> You make an arbitrary distinction between certain CPUs and duplicate

> patterns based on that.


Sure, somewhat arbitrary, but that's a proof of concept and IMO better than 
treating every processor the same way.  The alternative would be to complicate 
further the existing patterns by means of the "enabled" attribute or somesuch, 
but I can try it too.

> About the bug: it should behave the same as before, use GPRs only in your

> testcase.


It's fixed by backporting the last part of PR target/85755 on the 7 branch, 
but I can certainly add the testcase to the gcc.target/powerpc testsuite.

-- 
Eric Botcazou
Segher Boessenkool Feb. 11, 2019, 1:16 p.m. | #5
On Mon, Feb 11, 2019 at 10:07:44AM +0100, Eric Botcazou wrote:
> > You make an arbitrary distinction between certain CPUs and duplicate

> > patterns based on that.

> 

> Sure, somewhat arbitrary, but that's a proof of concept and IMO better than 

> treating every processor the same way.  The alternative would be to complicate 

> further the existing patterns by means of the "enabled" attribute or somesuch, 

> but I can try it too.


No, we should allow both integer and floating point insns for integer stores
always.  We just get the cost estimates slightly wrong now, apparently.


Segher
Eric Botcazou Feb. 12, 2019, 10:55 a.m. | #6
> No, we should allow both integer and floating point insns for integer stores

> always.  We just get the cost estimates slightly wrong now, apparently.


Note that my proof of concept patch doesn't disallow them either...  So what 
do you suggest?  Just putting back the '*' modifiers in the DI patterns?  As a 
matter of fact there are still present in the SI pattern.

-- 
Eric Botcazou
Segher Boessenkool Feb. 12, 2019, 12:36 p.m. | #7
On Tue, Feb 12, 2019 at 11:55:24AM +0100, Eric Botcazou wrote:
> > No, we should allow both integer and floating point insns for integer stores

> > always.  We just get the cost estimates slightly wrong now, apparently.

> 

> Note that my proof of concept patch doesn't disallow them either...  So what 

> do you suggest?  Just putting back the '*' modifiers in the DI patterns?


Yeah, something like that.  It will need some serious testing, to make
sure we don't regress (including not regressing what that patch that took
them away was meant to do).  I can arrange some testing, will you do the
patch though?

> As a matter of fact there are still present in the SI pattern.


Yeah.  It might not hurt at all to put them back in the DI as well.
Here's hoping.

Thanks,


Segher
Eric Botcazou Feb. 14, 2019, 4:30 p.m. | #8
> Yeah, something like that.  It will need some serious testing, to make

> sure we don't regress (including not regressing what that patch that took

> them away was meant to do).  I can arrange some testing, will you do the

> patch though?


I can do the patch and also (correctness) testing for 32-bit Linux.

Another issue is the extent of the patch: practically speaking, putting back 
the '*' modifier before all the 'd' constraints would be sufficient, but the 
current setting is a bit inconsistent|*] so this could also be adjusted.

[*] For example, in the *movdi_internal32 pattern, 2 'wi' constraints have it 
but not the other 2.  Likewise for "wv'.

-- 
Eric Botcazou
Segher Boessenkool Feb. 19, 2019, 7:57 p.m. | #9
On Thu, Feb 14, 2019 at 05:30:52PM +0100, Eric Botcazou wrote:
> > Yeah, something like that.  It will need some serious testing, to make

> > sure we don't regress (including not regressing what that patch that took

> > them away was meant to do).  I can arrange some testing, will you do the

> > patch though?

> 

> I can do the patch and also (correctness) testing for 32-bit Linux.


Performance testing is important here, of course.

> Another issue is the extent of the patch: practically speaking, putting back 

> the '*' modifier before all the 'd' constraints would be sufficient, but the 

> current setting is a bit inconsistent|*] so this could also be adjusted.

> 

> [*] For example, in the *movdi_internal32 pattern, 2 'wi' constraints have it 

> but not the other 2.  Likewise for "wv'.


I think we should change as little as possible for 7/8/9 here, because this
is pretty fragile :-(

But for 10, yes, let's get more sanity.

(We'll have the "enabled" attribute for GCC 10, btw).


Segher

Patch

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 7d399a2227b..d48395666fc 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -4894,6 +4894,13 @@  rs6000_option_override_internal (bool global_init_p)
 				 || rs6000_tune == PROCESSOR_PPCE5500
 				 || rs6000_tune == PROCESSOR_PPCE6500);
 
+  TARGET_AVOID_FPU_FOR_INT_MOVES = (rs6000_tune != PROCESSOR_POWER4
+				    && rs6000_tune != PROCESSOR_POWER5
+				    && rs6000_tune != PROCESSOR_POWER6
+				    && rs6000_tune != PROCESSOR_POWER7
+				    && rs6000_tune != PROCESSOR_POWER8
+				    && rs6000_tune != PROCESSOR_POWER9);
+
   /* Allow debug switches to override the above settings.  These are set to -1
      in rs6000.opt to indicate the user hasn't directly set the switch.  */
   if (TARGET_ALWAYS_HINT >= 0)
diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index a6929d9641b..e5b8fde8ff0 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -8586,7 +8586,8 @@ 
           Oj,        wM,        OjwM,      Oj,        wM,         wS,
           wB"))]
 
-  "! TARGET_POWERPC64
+  "!TARGET_POWERPC64
+   && !TARGET_AVOID_FPU_FOR_INT_MOVES
    && (gpc_reg_operand (operands[0], DImode)
        || gpc_reg_operand (operands[1], DImode))"
   "@
@@ -8616,6 +8617,25 @@ 
                 vecsimple")
    (set_attr "size" "64")])
 
+;; This is the pre-GCC7 variant for integer-oriented CPUs without vector units.
+(define_insn "*movdi_internal32_basic"
+  [(set (match_operand:DI 0 "nonimmediate_operand" "=Y,r,r,?m,?*d,?*d,r")
+	(match_operand:DI 1 "input_operand" "r,Y,r,d,m,d,IJKnGHF"))]
+  "!TARGET_POWERPC64
+   && TARGET_AVOID_FPU_FOR_INT_MOVES
+   && (gpc_reg_operand (operands[0], DImode)
+       || gpc_reg_operand (operands[1], DImode))"
+  "@
+   #
+   #
+   #
+   stfd%U0%X0 %1,%0
+   lfd%U1%X1 %0,%1
+   fmr %0,%1
+   #"
+  [(set_attr "type" "store,load,*,fpstore,fpload,fpsimple,*")
+   (set_attr "size" "64")])
+
 (define_split
   [(set (match_operand:DI 0 "gpc_reg_operand")
 	(match_operand:DI 1 "const_int_operand"))]
@@ -8664,6 +8684,7 @@ 
                 wg,        r,         wj,        r"))]
 
   "TARGET_POWERPC64
+   && !TARGET_AVOID_FPU_FOR_INT_MOVES
    && (gpc_reg_operand (operands[0], DImode)
        || gpc_reg_operand (operands[1], DImode))"
   "@
@@ -8710,6 +8731,33 @@ 
                 8,         4,         4,         4,         4,          4,
                 4,         4,         4,         4")])
 
+;; This is the pre-GCC7 variant for integer-oriented CPUs without vector units.
+(define_insn "*movdi_internal64_basic"
+  [(set (match_operand:DI 0 "nonimmediate_operand" "=Y,r,r,r,r,r,?m,?*d,?*d,r,*h,*h,r,?*wg")
+	(match_operand:DI 1 "input_operand" "r,Y,r,I,L,nF,d,m,d,*h,r,0,*wg,r"))]
+  "TARGET_POWERPC64
+   && TARGET_AVOID_FPU_FOR_INT_MOVES
+   && (gpc_reg_operand (operands[0], DImode)
+       || gpc_reg_operand (operands[1], DImode))"
+  "@
+   std%U0%X0 %1,%0
+   ld%U1%X1 %0,%1
+   mr %0,%1
+   li %0,%1
+   lis %0,%v1
+   #
+   stfd%U0%X0 %1,%0
+   lfd%U1%X1 %0,%1
+   fmr %0,%1
+   mf%1 %0
+   mt%0 %1
+   nop
+   mftgpr %0,%1
+   mffgpr %0,%1"
+  [(set_attr "type" "store,load,*,*,*,*,fpstore,fpload,fpsimple,mfjmpr,mtjmpr,*,mftgpr,mffgpr")
+   (set_attr "size" "64")
+   (set_attr "length" "4,4,4,4,4,20,4,4,4,4,4,4,4,4")])
+
 ; Some DImode loads are best done as a load of -1 followed by a mask
 ; instruction.
 (define_split
diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
index ace8a477550..381a0c94788 100644
--- a/gcc/config/rs6000/rs6000.opt
+++ b/gcc/config/rs6000/rs6000.opt
@@ -106,12 +106,13 @@  unsigned int rs6000_debug
 
 ;; Whether to enable the -mfloat128 stuff without necessarily enabling the
 ;; __float128 keyword.
-TargetSave
-unsigned char x_TARGET_FLOAT128_TYPE
-
-Variable
+TargetVariable
 unsigned char TARGET_FLOAT128_TYPE
 
+;; Whether to avoid using the FPU or other units for integer moves if possible
+TargetVariable
+unsigned char TARGET_AVOID_FPU_FOR_INT_MOVES
+
 ;; This option existed in the past, but now is always on.
 mpowerpc
 Target RejectNegative Undocumented Ignore