hppa: PR middle-end/87256: Improved hppa_rtx_costs avoids synth_mult madness.

Message ID 004701d677ea$a2fb2930$e8f17b90$@nextmovesoftware.com

Commit Message

Roger Sayle Aug. 21, 2020, 6:41 p.m.
This is my proposed fix to PR middle-end/87256 where synth_mult takes an
unreasonable amount of CPU time determining an optimal sequence of
instructions to perform multiplications by (large) integer constants on
hppa.
One workaround, proposed in bugzilla, is to increase the size of the hash
table used to cache/reuse intermediate results.  This helps, but it merely
works around the (more subtle) underlying problem.

The real issue is that the hppa_rtx_costs function is providing wildly
inaccurate values (estimates) to the middle-end.  For example, (p*q)+(r*s)
would appear to be cheaper than a single multiplication.  Another
example is that "(ashiftrt:di regA regB)" is claimed to be only
COST_N_INSNS(1) when in fact the hppa backend actually generates:

ashrdi: ldi 32,%r28
        and %r24,%r28,%r28
        comib,= 0,%r28,.L6
        mtsar %r24
        subi 31,%r24,%r24
        extrs %r25,0,1,%r28
        mtsar %r24
        bv %r0(%r2)
        vextrs %r25,32,%r29
.L6:    uaddcm %r0,%r24,%r28
        vshd %r0,%r26,%r29
        subi 31,%r28,%r28
        mtsar %r28
        zdep %r25,30,31,%r19
        subi 31,%r24,%r24
        zvdep %r19,32,%r19
        mtsar %r24
        or %r19,%r29,%r29
        bv %r0(%r2)
        vextrs %r25,32,%r28

which is slightly more than a single instruction.
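For readers without a PA manual to hand, here is a rough portable-C sketch
(hypothetical helper name, not GCC code) of what that two-path expansion is
doing: split the 64-bit value into 32-bit halves and branch on whether the
shift count is 32 or more.  It should make clear why a variable DImode shift
is nowhere near a single instruction:

```c
#include <stdint.h>

/* Sketch of a 32-bit ashrdi expansion: the asm above selects one of two
   paths depending on whether bit 5 of the count is set (the "ldi 32;
   and; comib,=" test).  Assumes 0 <= n <= 63 and an arithmetic >> on
   signed integers, as GCC provides.  */
int64_t
ashrdi3_sketch (int64_t v, unsigned n)
{
  int32_t hi = (int32_t) (v >> 32);     /* high word (signed) */
  uint32_t lo = (uint32_t) v;           /* low word */
  if (n >= 32)
    {
      lo = (uint32_t) (hi >> (n - 32)); /* low word comes from high word */
      hi >>= 31;                        /* high word is just the sign */
    }
  else if (n > 0)
    {
      lo = (lo >> n) | ((uint32_t) hi << (32 - n));
      hi >>= n;
    }
  return (int64_t) (((uint64_t) (uint32_t) hi << 32) | lo);
}
```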

It turns out that simply tightening up the logic in hppa_rtx_costs to
return more reasonable values dramatically reduces the number of recursive
invocations in synth_mult for the test case in PR87256, and presumably
also produces faster code (which should be observable in benchmarks).

Unfortunately, once again this has only been tested via cross-compilers
to hppa-unknown-linux-gnu and hppa64-unknown-linux-gnu hosted on
x86_64-pc-linux-gnu, but I can confirm that cross-compilation is much faster.
Many thanks in advance to anyone who can bootstrap and regression test
this on real hardware.  In an ideal world, changes to rtx_costs should
be pretty safe: this function can't introduce any bugs, only expose those
that are already present (but possibly latent) elsewhere in the compiler.  Ha.


2020-08-21  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	PR middle-end/87256
	* config/pa/pa.c (hppa_rtx_costs_shadd_p): New helper function
	to check for coefficients supported by shNadd and shladd,l.
	(hppa_rtx_costs):  Rewrite to avoid using estimates based upon
	FACTOR and enable recursing deeper into RTL expressions.


Thanks again.
Roger
--
Roger Sayle
NextMove Software
Cambridge, UK

Comments

John David Anglin Aug. 25, 2020, 2:44 p.m. | #1
On 2020-08-21 2:41 p.m., Roger Sayle wrote:
> 2020-08-21  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
> 	PR middle-end/87256
> 	* config/pa/pa.c (hppa_rtx_costs_shadd_p): New helper function
> 	to check for coefficients supported by shNadd and shladd,l.
> 	(hppa_rtx_costs):  Rewrite to avoid using estimates based upon
> 	FACTOR and enable recursing deeper into RTL expressions.

This change is also fine.

The gcc.target/hppa/shadd-2.c test now fails because there are two additional sh1add instructions.
However, the total number of instructions is the same as before.

I confirmed that this change fixes PR middle-end/87256, so we can close the PR when the patch is installed.

Please install on trunk and gcc-10.  gcc-10 is the default compiler on Debian.

Thanks,
Dave

-- 
John David Anglin  dave.anglin@bell.net
Roger Sayle Aug. 26, 2020, 11:19 a.m. | #2
Hi Dave,

> This change is also fine.
>
> The gcc.target/hppa/shadd-2.c test now fails because there are two
> additional sh1add instructions.
> However, the total number of instructions is the same as before.

Just to be on the safe side, I took a deeper look into this.
We're now generating a slightly different sequence for multiply by 87.

Before:
        zdep %r28,28,29,%r20            ; r20 = r28*8
        ldo RR'default_target_regs-$global$(%r1),%r19
        sub %r20,%r28,%r20              ; r20 = r28*7
        sh2addl %r20,%r28,%r20          ; r20 = r28*29
        sh2addl %r20,%r19,%r19          ; r19 = addr+r28*116
        sub %r19,%r20,%r19              ; r19 = addr+r28*87

After:
        sh2addl %r28,%r28,%r19          ; r19 = r28*5
        ldo RR'default_target_regs-$global$(%r1),%r1
        sh2addl %r19,%r28,%r19          ; r19 = r28*21
        sh1addl %r19,%r28,%r19          ; r19 = r28*43
        sh1addl %r19,%r28,%r19          ; r19 = r28*87
        addl %r1,%r19,%r1               ; r1 = addr+r28*87

As you point out, both sequences have exactly the same length and rtx_costs.
I suspect the middle-end has a minor preference for simple shifts and adds.
If shadd-2.c is there to check we make use of sh?add instructions, then
we're making even more use of them now.

I agree it's reasonable to increase the scan-assembler-times count for
this test.
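For the avoidance of doubt, the arithmetic of both sequences can be
checked in plain C (hypothetical function names, addr base omitted;
recall that shNadd computes (x << N) + y in a single instruction):

```c
/* The new sequence: a pure chain of shift-and-adds.  */
unsigned
mul87_after (unsigned x)
{
  unsigned r = (x << 2) + x;    /* sh2addl: r = x*5  */
  r = (r << 2) + x;             /* sh2addl: r = x*21 */
  r = (r << 1) + x;             /* sh1addl: r = x*43 */
  r = (r << 1) + x;             /* sh1addl: r = x*87 */
  return r;
}

/* The old sequence, with the addr base dropped: 87 = 116 - 29.  */
unsigned
mul87_before (unsigned x)
{
  unsigned t = x << 3;          /* zdep:    t = x*8  */
  t = t - x;                    /* sub:     t = x*7  */
  t = (t << 2) + x;             /* sh2addl: t = x*29 */
  unsigned r = t << 2;          /* sh2addl: r = x*116 (base folded in
                                   by the real insn, omitted here) */
  return r - t;                 /* sub:     r = x*87 */
}
```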

Cheers,
Roger
--
Aldy Hernandez via Gcc-patches Aug. 26, 2020, 4:10 p.m. | #3
On Wed, 2020-08-26 at 12:19 +0100, Roger Sayle wrote:
> Hi Dave,
>
> > This change is also fine.
> >
> > The gcc.target/hppa/shadd-2.c test now fails because there are two
> > additional sh1add instructions.
> > However, the total number of instructions is the same as before.
>
> Just to be on the safe side, I took a deeper look into this.
> We're now generating a slightly different sequence for multiply by 87.
>
> Before:
>         zdep %r28,28,29,%r20            ; r20 = r28*8
>         ldo RR'default_target_regs-$global$(%r1),%r19
>         sub %r20,%r28,%r20              ; r20 = r28*7
>         sh2addl %r20,%r28,%r20          ; r20 = r28*29
>         sh2addl %r20,%r19,%r19          ; r19 = addr+r28*116
>         sub %r19,%r20,%r19              ; r19 = addr+r28*87
>
> After:
>         sh2addl %r28,%r28,%r19          ; r19 = r28*5
>         ldo RR'default_target_regs-$global$(%r1),%r1
>         sh2addl %r19,%r28,%r19          ; r19 = r28*21
>         sh1addl %r19,%r28,%r19          ; r19 = r28*43
>         sh1addl %r19,%r28,%r19          ; r19 = r28*87
>         addl %r1,%r19,%r1               ; r1 = addr+r28*87
>
> As you point out, both sequences have exactly the same length and rtx_costs.
> I suspect the middle-end has a minor preference for simple shifts and adds.
> If shadd-2.c is there to check we make use of sh?add instructions, then
> we're making even more use of them now.
>
> I agree it's reasonable to increase the scan-assembler-times count for
> this test.

Yes, those tests were there to catch failure to create shadd insns after some
changes in the combiner.  If we're generating more, then that's good enough for
those tests :-)

jeff

Patch

diff --git a/gcc/config/pa/pa.c b/gcc/config/pa/pa.c
index 07d3287..cc876e6 100644
--- a/gcc/config/pa/pa.c
+++ b/gcc/config/pa/pa.c
@@ -1492,6 +1492,33 @@  hppa_address_cost (rtx X, machine_mode mode ATTRIBUTE_UNUSED,
     }
 }
 
+/* Return true if X represents a (possibly non-canonical) shNadd pattern.
+   The machine mode of X is known to be SImode or DImode.  */
+
+static bool
+hppa_rtx_costs_shadd_p (rtx x)
+{
+  if (GET_CODE (x) != PLUS
+      || !REG_P (XEXP (x, 1)))
+    return false;
+  rtx op0 = XEXP (x, 0);
+  if (GET_CODE (op0) == ASHIFT
+      && CONST_INT_P (XEXP (op0, 1))
+      && REG_P (XEXP (op0, 0)))
+    {
+      unsigned HOST_WIDE_INT x = UINTVAL (XEXP (op0, 1));
+      return x == 1 || x == 2 || x == 3;
+    }
+  if (GET_CODE (op0) == MULT
+      && CONST_INT_P (XEXP (op0, 1))
+      && REG_P (XEXP (op0, 0)))
+    {
+      unsigned HOST_WIDE_INT x = UINTVAL (XEXP (op0, 1));
+      return x == 2 || x == 4 || x == 8;
+    }
+  return false;
+}
+
 /* Compute a (partial) cost for rtx X.  Return true if the complete
    cost has been computed, and false if subexpressions should be
    scanned.  In either case, *TOTAL contains the cost result.  */
@@ -1499,15 +1526,16 @@  hppa_address_cost (rtx X, machine_mode mode ATTRIBUTE_UNUSED,
 static bool
 hppa_rtx_costs (rtx x, machine_mode mode, int outer_code,
 		int opno ATTRIBUTE_UNUSED,
-		int *total, bool speed ATTRIBUTE_UNUSED)
+		int *total, bool speed)
 {
-  int factor;
   int code = GET_CODE (x);
 
   switch (code)
     {
     case CONST_INT:
-      if (INTVAL (x) == 0)
+      if (outer_code == SET)
+	*total = COSTS_N_INSNS (1);
+      else if (INTVAL (x) == 0)
 	*total = 0;
       else if (INT_14_BITS (x))
 	*total = 1;
@@ -1537,25 +1565,28 @@  hppa_rtx_costs (rtx x, machine_mode mode, int outer_code,
       if (GET_MODE_CLASS (mode) == MODE_FLOAT)
 	{
 	  *total = COSTS_N_INSNS (3);
-	  return true;
 	}
-
-      /* A mode size N times larger than SImode needs O(N*N) more insns.  */
-      factor = GET_MODE_SIZE (mode) / 4;
-      if (factor == 0)
-	factor = 1;
-
-      if (TARGET_PA_11 && !TARGET_DISABLE_FPREGS && !TARGET_SOFT_FLOAT)
-	*total = factor * factor * COSTS_N_INSNS (8);
+      else if (mode == DImode)
+	{
+          if (TARGET_PA_11 && !TARGET_DISABLE_FPREGS && !TARGET_SOFT_FLOAT)
+ 	    *total = COSTS_N_INSNS (32);
+	  else
+	    *total = COSTS_N_INSNS (80);
+	}
       else
-	*total = factor * factor * COSTS_N_INSNS (20);
-      return true;
+	{
+          if (TARGET_PA_11 && !TARGET_DISABLE_FPREGS && !TARGET_SOFT_FLOAT)
+ 	    *total = COSTS_N_INSNS (8);
+	  else
+	    *total = COSTS_N_INSNS (20);
+	}
+      return REG_P (XEXP (x, 0)) && REG_P (XEXP (x, 1));
 
     case DIV:
       if (GET_MODE_CLASS (mode) == MODE_FLOAT)
 	{
 	  *total = COSTS_N_INSNS (14);
-	  return true;
+	  return false;
 	}
       /* FALLTHRU */
 
@@ -1563,34 +1594,107 @@  hppa_rtx_costs (rtx x, machine_mode mode, int outer_code,
     case MOD:
     case UMOD:
       /* A mode size N times larger than SImode needs O(N*N) more insns.  */
-      factor = GET_MODE_SIZE (mode) / 4;
-      if (factor == 0)
-	factor = 1;
-
-      *total = factor * factor * COSTS_N_INSNS (60);
-      return true;
+      if (mode == DImode)
+        *total = COSTS_N_INSNS (240);
+      else
+        *total = COSTS_N_INSNS (60);
+      return REG_P (XEXP (x, 0)) && REG_P (XEXP (x, 1));
 
     case PLUS: /* this includes shNadd insns */
     case MINUS:
       if (GET_MODE_CLASS (mode) == MODE_FLOAT)
+	*total = COSTS_N_INSNS (3);
+      else if (mode == DImode)
 	{
-	  *total = COSTS_N_INSNS (3);
-	  return true;
+	  if (TARGET_64BIT)
+	    {
+	      *total = COSTS_N_INSNS (1);
+	      /* Handle shladd,l instructions.  */
+	      if (hppa_rtx_costs_shadd_p (x))
+		return true;
+	    }
+	  else
+	    *total = COSTS_N_INSNS (2);
 	}
-
-      /* A size N times larger than UNITS_PER_WORD needs N times as
-	 many insns, taking N times as long.  */
-      factor = GET_MODE_SIZE (mode) / UNITS_PER_WORD;
-      if (factor == 0)
-	factor = 1;
-      *total = factor * COSTS_N_INSNS (1);
-      return true;
+      else
+	{
+          *total = COSTS_N_INSNS (1);
+	  /* Handle shNadd instructions.  */
+	  if (hppa_rtx_costs_shadd_p (x))
+	    return true;
+	}
+      return REG_P (XEXP (x, 0))
+	     && (REG_P (XEXP (x, 1))
+		 || CONST_INT_P (XEXP (x, 1)));
 
     case ASHIFT:
+      if (mode == DImode)
+        {
+	  if (TARGET_64BIT)
+	    *total = COSTS_N_INSNS (3);
+	  else if (REG_P (XEXP (x, 0)) && CONST_INT_P (XEXP (x, 1)))
+	    {
+	      *total = COSTS_N_INSNS (2);
+	      return true;
+	    }
+	  else if (speed)
+	    *total = COSTS_N_INSNS (13);
+	  else
+	    *total = COSTS_N_INSNS (18);
+	}
+      else if (TARGET_64BIT)
+        *total = COSTS_N_INSNS (4);
+      else
+        *total = COSTS_N_INSNS (2);
+      return REG_P (XEXP (x, 0))
+	     && (REG_P (XEXP (x, 1))
+		 || CONST_INT_P (XEXP (x, 1)));
+
     case ASHIFTRT:
+      if (mode == DImode)
+        {
+	  if (TARGET_64BIT)
+	    *total = COSTS_N_INSNS (3);
+	  else if (REG_P (XEXP (x, 0)) && CONST_INT_P (XEXP (x, 1)))
+	    {
+	      *total = COSTS_N_INSNS (2);
+	      return true;
+	    }
+	  else if (speed)
+	    *total = COSTS_N_INSNS (14);
+	  else
+	    *total = COSTS_N_INSNS (19);
+	}
+      else if (TARGET_64BIT)
+        *total = COSTS_N_INSNS (4);
+      else
+        *total = COSTS_N_INSNS (2);
+      return REG_P (XEXP (x, 0))
+	     && (REG_P (XEXP (x, 1))
+		 || CONST_INT_P (XEXP (x, 1)));
+
     case LSHIFTRT:
-      *total = COSTS_N_INSNS (1);
-      return true;
+      if (mode == DImode)
+        {
+	  if (TARGET_64BIT)
+	    *total = COSTS_N_INSNS (2);
+	  else if (REG_P (XEXP (x, 0)) && CONST_INT_P (XEXP (x, 1)))
+	    {
+	      *total = COSTS_N_INSNS (2);
+	      return true;
+	    }
+	  else if (speed)
+	    *total = COSTS_N_INSNS (12);
+	  else
+	    *total = COSTS_N_INSNS (15);
+	}
+      else if (TARGET_64BIT)
+        *total = COSTS_N_INSNS (3);
+      else
+        *total = COSTS_N_INSNS (2);
+      return REG_P (XEXP (x, 0))
+	     && (REG_P (XEXP (x, 1))
+		 || CONST_INT_P (XEXP (x, 1)));
 
     default:
       return false;