[x86_64] PR target/11877: Use xor to write zero to memory with -Os

Message ID 001001d765ea$286661f0$793325d0$@nextmovesoftware.com
State New
Headers show
Series
  • [x86_64] PR target/11877: Use xor to write zero to memory with -Os
Related show

Commit Message

Roger Sayle June 20, 2021, 3:37 p.m.
The following patch attempts to resolve PR target/11877 (without
triggering PR/23102).  On x86_64, writing an SImode or DImode zero
to memory uses an instruction encoding that is larger than first
clearing a register (using xor) then writing that to memory.  Hence,
after reload, the peephole2 pass can determine if there's a suitable
free register, and if so, use that to shrink the code size with -Os.

To improve code size, and avoid inserting a large number of xor
instructions (PR target/23102), this patch makes use of peephole2's
efficient pattern matching to use a single temporary for a run of
consecutive writes.  In theory, one could do better still with a
new target-specific pass, gated on -Os, to shrink these instructions
(like stv), but that's probably overkill for the little remaining
space savings.

Evaluating this patch on the CSiBE benchmark (v2.1.1) results in a
0.26% code size improvement (3715273 bytes down to 3705477) on x86_64
with -Os [saving 1 byte every 400].  549 of 894 tests improve, two
tests grow larger.  Analysis of these 2 pathological cases reveals
that although peephole2's match_scratch prefers to use a call-clobbered
register (to avoid requiring a new stack frame), very rarely this
interacts with GCC's shrink wrapping optimization, which may previously
have avoided saving/restoring a call clobbered register, such as %eax,
in the calling function.

This patch has been tested on x86_64-pc-linux-gnu with a make bootstrap
and make -k check with no new failures.

Ok for mainline?


2021-06-20  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	PR target/11877
	* config/i386/i386.md: New define_peephole2s to shrink writing
	1, 2 or 4 consecutive zeros to memory when optimizing for size.

gcc/testsuite/ChangeLog
	PR target/11877
	* gcc.target/i386/pr11877.c: New test case.

--
Roger Sayle
NextMove Software
Cambridge, UK

/* PR target/11877 */
/* { dg-do compile } */
/* { dg-options "-Os" } */

void foo (long long *p)
{
  *p = 0;
}

void bar (int *p)
{
  *p = 0;
}

/* { dg-final { scan-assembler-times "xorl\[ \t\]" 2 } } */
/* { dg-final { scan-assembler-not "\\\$0," } } */

Comments

Aldy Hernandez via Gcc-patches June 21, 2021, 7:18 a.m. | #1
On Sun, Jun 20, 2021 at 5:37 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>

>

> The following patch attempts to resolve PR target/11877 (without

> triggering PR/23102).  On x86_64, writing an SImode or DImode zero

> to memory uses an instruction encoding that is larger than first

> clearing a register (using xor) then writing that to memory.  Hence,

> after reload, the peephole2 pass can determine if there's a suitable

> free register, and if so, use that to shrink the code size with -Os.

>

> To improve code size, and avoid inserting a large number of xor

> instructions (PR target/23102), this patch makes use of peephole2's

> efficient pattern matching to use a single temporary for a run of

> consecutive writes.  In theory, one could do better still with a

> new target-specific pass, gated on -Os, to shrink these instructions

> (like stv), but that's probably overkill for the little remaining

> space savings.


Agreed. Peephole2 pass runs before sched2 (and x86 targets do not use
sched1 pass), so there is a good chance these instructions stay
together until the new peephole2 converts them.

> Evaluating this patch on the CSiBE benchmark (v2.1.1) results in a

> 0.26% code size improvement (3715273 bytes down to 3705477) on x86_64

> with -Os [saving 1 byte every 400].  549 of 894 tests improve, two

> tests grow larger.  Analysis of these 2 pathological cases reveals

> that although peephole2's match_scratch prefers to use a call-clobbered

> register (to avoid requiring a new stack frame), very rarely this

> interacts with GCC's shrink wrapping optimization, which may previously

> have avoided saving/restoring a call clobbered register, such as %eax,

> in the calling function.

>

> This patch has been tested on x86_64-pc-linux-gnu with a make bootstrap

> and make -k check with no new failures.

>

> Ok for mainline?

>

>

> 2021-06-20  Roger Sayle  <roger@nextmovesoftware.com>

>

> gcc/ChangeLog

>         PR target/11877

>         * config/i386/i386.md: New define_peephole2s to shrink writing

>         1, 2 or 4 consecutive zeros to memory when optimizing for size.

>

> gcc/testsuite/ChangeLog

>         PR target/11877

>         * gcc.target/i386/pr11877.c: New test case.


OK.

Thanks,
Uros.
Aldy Hernandez via Gcc-patches June 21, 2021, 8:36 a.m. | #2
On Mon, Jun 21, 2021 at 09:18:28AM +0200, Uros Bizjak via Gcc-patches wrote:
> > 2021-06-20  Roger Sayle  <roger@nextmovesoftware.com>

> >

> > gcc/ChangeLog

> >         PR target/11877

> >         * config/i386/i386.md: New define_peephole2s to shrink writing

> >         1, 2 or 4 consecutive zeros to memory when optimizing for size.

> >

> > gcc/testsuite/ChangeLog

> >         PR target/11877

> >         * gcc.target/i386/pr11877.c: New test case.

> 

> OK.


It unfortunately doesn't extend well to larger memory clearing.
Consider e.g.
void
foo (int *p)
{
  p[0] = 0;
  p[7] = 0;
  p[23] = 0;
  p[41] = 0;
  p[48] = 0;
  p[59] = 0;
  p[69] = 0;
  p[78] = 0;
  p[83] = 0;
  p[89] = 0;
  p[98] = 0;
  p[121] = 0;
  p[132] = 0;
  p[143] = 0;
  p[154] = 0;
}
where with the patch we emit:
        xorl    %eax, %eax
        xorl    %edx, %edx
        xorl    %ecx, %ecx
        xorl    %esi, %esi
        xorl    %r8d, %r8d
        movl    %eax, (%rdi)
        movl    %eax, 28(%rdi)
        movl    %eax, 92(%rdi)
        movl    %eax, 164(%rdi)
        movl    %edx, 192(%rdi)
        movl    %edx, 236(%rdi)
        movl    %edx, 276(%rdi)
        movl    %edx, 312(%rdi)
        movl    %ecx, 332(%rdi)
        movl    %ecx, 356(%rdi)
        movl    %ecx, 392(%rdi)
        movl    %ecx, 484(%rdi)
        movl    %esi, 528(%rdi)
        movl    %esi, 572(%rdi)
        movl    %r8d, 616(%rdi)
Here is an incremental (so far untested) patch that emits:
        xorl    %eax, %eax
        movl    %eax, (%rdi)
        movl    %eax, 28(%rdi)
        movl    %eax, 92(%rdi)
        movl    %eax, 164(%rdi)
        movl    %eax, 192(%rdi)
        movl    %eax, 236(%rdi)
        movl    %eax, 276(%rdi)
        movl    %eax, 312(%rdi)
        movl    %eax, 332(%rdi)
        movl    %eax, 356(%rdi)
        movl    %eax, 392(%rdi)
        movl    %eax, 484(%rdi)
        movl    %eax, 528(%rdi)
        movl    %eax, 572(%rdi)
        movl    %eax, 616(%rdi)
instead:

2021-06-21  Jakub Jelinek  <jakub@redhat.com>

	PR target/11877
	* config/i386/i386-protos.h (ix86_zero_stores_peep2_p): Declare.
	* config/i386/i386.c (ix86_zero_stores_peep2_p): New function.
	* config/i386/i386.md (peephole2s for 1/2/4 stores of const0_rtx):
	Remove "" from match_operand.  Add peephole2s for 1/2/4 stores of
	const0_rtx following previous successful peep2s.

--- gcc/config/i386/i386-protos.h.jj	2021-06-07 09:24:57.696690116 +0200
+++ gcc/config/i386/i386-protos.h	2021-06-21 10:21:05.428887980 +0200
@@ -111,6 +111,7 @@ extern bool ix86_use_lea_for_mov (rtx_in
 extern bool ix86_avoid_lea_for_addr (rtx_insn *, rtx[]);
 extern void ix86_split_lea_for_addr (rtx_insn *, rtx[], machine_mode);
 extern bool ix86_lea_for_add_ok (rtx_insn *, rtx[]);
+extern bool ix86_zero_stores_peep2_p (rtx_insn *, rtx);
 extern bool ix86_vec_interleave_v2df_operator_ok (rtx operands[3], bool high);
 extern bool ix86_dep_by_shift_count (const_rtx set_insn, const_rtx use_insn);
 extern bool ix86_agi_dependent (rtx_insn *set_insn, rtx_insn *use_insn);
--- gcc/config/i386/i386.c.jj	2021-06-21 09:39:21.622487840 +0200
+++ gcc/config/i386/i386.c	2021-06-21 10:21:12.389794740 +0200
@@ -15186,6 +15186,33 @@ ix86_lea_for_add_ok (rtx_insn *insn, rtx
   return ix86_lea_outperforms (insn, regno0, regno1, regno2, 0, false);
 }
 
+/* Return true if insns before FIRST_INSN (which is of the form
+   (set (memory) (zero_operand)) are all also either in the
+   same form, or (set (zero_operand) (const_int 0)).  */
+
+bool
+ix86_zero_stores_peep2_p (rtx_insn *first_insn, rtx zero_operand)
+{
+  rtx_insn *insn = first_insn;
+  for (int count = 0; count < 512; count++)
+    {
+      insn = prev_nonnote_nondebug_insn_bb (insn);
+      if (!insn) 
+	return false;
+      rtx set = single_set (insn);
+      if (!set)
+	return false;
+      if (SET_SRC (set) == const0_rtx
+	  && rtx_equal_p (SET_DEST (set), zero_operand))
+	return true;
+      if (set != PATTERN (insn)
+	  || !rtx_equal_p (SET_SRC (set), zero_operand)
+	  || !memory_operand (SET_DEST (set), VOIDmode))
+	return false;
+    }
+  return false;
+}
+
 /* Return true if destination reg of SET_BODY is shift count of
    USE_BODY.  */
 
--- gcc/config/i386/i386.md.jj	2021-06-21 09:42:04.086303699 +0200
+++ gcc/config/i386/i386.md	2021-06-21 10:21:31.932532964 +0200
@@ -19360,10 +19360,10 @@ (define_peephole2
 ;; When optimizing for size, zeroing memory should use a register.
 (define_peephole2
   [(match_scratch:SWI48 0 "r")
-   (set (match_operand:SWI48 1 "memory_operand" "") (const_int 0))
-   (set (match_operand:SWI48 2 "memory_operand" "") (const_int 0))
-   (set (match_operand:SWI48 3 "memory_operand" "") (const_int 0))
-   (set (match_operand:SWI48 4 "memory_operand" "") (const_int 0))]
+   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))
+   (set (match_operand:SWI48 2 "memory_operand") (const_int 0))
+   (set (match_operand:SWI48 3 "memory_operand") (const_int 0))
+   (set (match_operand:SWI48 4 "memory_operand") (const_int 0))]
   "optimize_insn_for_size_p () && peep2_regno_dead_p (0, FLAGS_REG)"
   [(set (match_dup 1) (match_dup 0))
    (set (match_dup 2) (match_dup 0))
@@ -19375,8 +19375,8 @@ (define_peephole2
 
 (define_peephole2
   [(match_scratch:SWI48 0 "r")
-   (set (match_operand:SWI48 1 "memory_operand" "") (const_int 0))
-   (set (match_operand:SWI48 2 "memory_operand" "") (const_int 0))]
+   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))
+   (set (match_operand:SWI48 2 "memory_operand") (const_int 0))]
   "optimize_insn_for_size_p () && peep2_regno_dead_p (0, FLAGS_REG)"
   [(set (match_dup 1) (match_dup 0))
    (set (match_dup 2) (match_dup 0))]
@@ -19386,13 +19386,48 @@ (define_peephole2
 
 (define_peephole2
   [(match_scratch:SWI48 0 "r")
-   (set (match_operand:SWI48 1 "memory_operand" "") (const_int 0))]
+   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))]
   "optimize_insn_for_size_p () && peep2_regno_dead_p (0, FLAGS_REG)"
   [(set (match_dup 1) (match_dup 0))]
 {
   ix86_expand_clear (operands[0]);
 })
 
+(define_peephole2
+  [(set (match_operand:SWI48 5 "memory_operand")
+	(match_operand:SWI48 0 "general_reg_operand"))
+   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))
+   (set (match_operand:SWI48 2 "memory_operand") (const_int 0))
+   (set (match_operand:SWI48 3 "memory_operand") (const_int 0))
+   (set (match_operand:SWI48 4 "memory_operand") (const_int 0))]
+  "optimize_insn_for_size_p ()
+   && ix86_zero_stores_peep2_p (peep2_next_insn (0), operands[0])"
+  [(set (match_dup 5) (match_dup 0))
+   (set (match_dup 1) (match_dup 0))
+   (set (match_dup 2) (match_dup 0))
+   (set (match_dup 3) (match_dup 0))
+   (set (match_dup 4) (match_dup 0))])
+
+(define_peephole2
+  [(set (match_operand:SWI48 3 "memory_operand")
+	(match_operand:SWI48 0 "general_reg_operand"))
+   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))
+   (set (match_operand:SWI48 2 "memory_operand") (const_int 0))]
+  "optimize_insn_for_size_p ()
+   && ix86_zero_stores_peep2_p (peep2_next_insn (0), operands[0])"
+  [(set (match_dup 3) (match_dup 0))
+   (set (match_dup 1) (match_dup 0))
+   (set (match_dup 2) (match_dup 0))])
+
+(define_peephole2
+  [(set (match_operand:SWI48 2 "memory_operand")
+	(match_operand:SWI48 0 "general_reg_operand"))
+   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))]
+  "optimize_insn_for_size_p ()
+   && ix86_zero_stores_peep2_p (peep2_next_insn (0), operands[0])"
+  [(set (match_dup 2) (match_dup 0))
+   (set (match_dup 1) (match_dup 0))])
+
 ;; Reload dislikes loading constants directly into class_likely_spilled
 ;; hard registers.  Try to tidy things up here.
 (define_peephole2


	Jakub
Aldy Hernandez via Gcc-patches June 21, 2021, 9:19 a.m. | #3
On Mon, Jun 21, 2021 at 10:37 AM Jakub Jelinek via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>

> On Mon, Jun 21, 2021 at 09:18:28AM +0200, Uros Bizjak via Gcc-patches wrote:

> > > 2021-06-20  Roger Sayle  <roger@nextmovesoftware.com>

> > >

> > > gcc/ChangeLog

> > >         PR target/11877

> > >         * config/i386/i386.md: New define_peephole2s to shrink writing

> > >         1, 2 or 4 consecutive zeros to memory when optimizing for size.

> > >

> > > gcc/testsuite/ChangeLog

> > >         PR target/11877

> > >         * gcc.target/i386/pr11877.c: New test case.

> >

> > OK.

>

> It unfortunately doesn't extend well to larger memory clearing.

> Consider e.g.

> void

> foo (int *p)

> {

>   p[0] = 0;

>   p[7] = 0;

>   p[23] = 0;

>   p[41] = 0;

>   p[48] = 0;

>   p[59] = 0;

>   p[69] = 0;

>   p[78] = 0;

>   p[83] = 0;

>   p[89] = 0;

>   p[98] = 0;

>   p[121] = 0;

>   p[132] = 0;

>   p[143] = 0;

>   p[154] = 0;

> }

> where with the patch we emit:

>         xorl    %eax, %eax

>         xorl    %edx, %edx

>         xorl    %ecx, %ecx

>         xorl    %esi, %esi

>         xorl    %r8d, %r8d

>         movl    %eax, (%rdi)

>         movl    %eax, 28(%rdi)

>         movl    %eax, 92(%rdi)

>         movl    %eax, 164(%rdi)

>         movl    %edx, 192(%rdi)

>         movl    %edx, 236(%rdi)

>         movl    %edx, 276(%rdi)

>         movl    %edx, 312(%rdi)

>         movl    %ecx, 332(%rdi)

>         movl    %ecx, 356(%rdi)

>         movl    %ecx, 392(%rdi)

>         movl    %ecx, 484(%rdi)

>         movl    %esi, 528(%rdi)

>         movl    %esi, 572(%rdi)

>         movl    %r8d, 616(%rdi)

> Here is an incremental (so far untested) patch that emits:

>         xorl    %eax, %eax

>         movl    %eax, (%rdi)

>         movl    %eax, 28(%rdi)

>         movl    %eax, 92(%rdi)

>         movl    %eax, 164(%rdi)

>         movl    %eax, 192(%rdi)

>         movl    %eax, 236(%rdi)

>         movl    %eax, 276(%rdi)

>         movl    %eax, 312(%rdi)

>         movl    %eax, 332(%rdi)

>         movl    %eax, 356(%rdi)

>         movl    %eax, 392(%rdi)

>         movl    %eax, 484(%rdi)

>         movl    %eax, 528(%rdi)

>         movl    %eax, 572(%rdi)

>         movl    %eax, 616(%rdi)

> instead:

>

> 2021-06-21  Jakub Jelinek  <jakub@redhat.com>

>

>         PR target/11877

>         * config/i386/i386-protos.h (ix86_zero_stores_peep2_p): Declare.

>         * config/i386/i386.c (ix86_zero_stores_peep2_p): New function.

>         * config/i386/i386.md (peephole2s for 1/2/4 stores of const0_rtx):

>         Remove "" from match_operand.  Add peephole2s for 1/2/4 stores of

>         const0_rtx following previous successful peep2s.

>

> --- gcc/config/i386/i386-protos.h.jj    2021-06-07 09:24:57.696690116 +0200

> +++ gcc/config/i386/i386-protos.h       2021-06-21 10:21:05.428887980 +0200

> @@ -111,6 +111,7 @@ extern bool ix86_use_lea_for_mov (rtx_in

>  extern bool ix86_avoid_lea_for_addr (rtx_insn *, rtx[]);

>  extern void ix86_split_lea_for_addr (rtx_insn *, rtx[], machine_mode);

>  extern bool ix86_lea_for_add_ok (rtx_insn *, rtx[]);

> +extern bool ix86_zero_stores_peep2_p (rtx_insn *, rtx);

>  extern bool ix86_vec_interleave_v2df_operator_ok (rtx operands[3], bool high);

>  extern bool ix86_dep_by_shift_count (const_rtx set_insn, const_rtx use_insn);

>  extern bool ix86_agi_dependent (rtx_insn *set_insn, rtx_insn *use_insn);

> --- gcc/config/i386/i386.c.jj   2021-06-21 09:39:21.622487840 +0200

> +++ gcc/config/i386/i386.c      2021-06-21 10:21:12.389794740 +0200

> @@ -15186,6 +15186,33 @@ ix86_lea_for_add_ok (rtx_insn *insn, rtx

>    return ix86_lea_outperforms (insn, regno0, regno1, regno2, 0, false);

>  }

>

> +/* Return true if insns before FIRST_INSN (which is of the form

> +   (set (memory) (zero_operand)) are all also either in the

> +   same form, or (set (zero_operand) (const_int 0)).  */

> +

> +bool

> +ix86_zero_stores_peep2_p (rtx_insn *first_insn, rtx zero_operand)

> +{

> +  rtx_insn *insn = first_insn;

> +  for (int count = 0; count < 512; count++)


Can't the peephole add a note (reg_equal?) that the
SET_SRC of the previously matched store is zero?

That would avoid the need to walk here.

> +    {

> +      insn = prev_nonnote_nondebug_insn_bb (insn);

> +      if (!insn)

> +       return false;

> +      rtx set = single_set (insn);

> +      if (!set)

> +       return false;

> +      if (SET_SRC (set) == const0_rtx

> +         && rtx_equal_p (SET_DEST (set), zero_operand))

> +       return true;

> +      if (set != PATTERN (insn)

> +         || !rtx_equal_p (SET_SRC (set), zero_operand)

> +         || !memory_operand (SET_DEST (set), VOIDmode))

> +       return false;

> +    }

> +  return false;

> +}

> +

>  /* Return true if destination reg of SET_BODY is shift count of

>     USE_BODY.  */

>

> --- gcc/config/i386/i386.md.jj  2021-06-21 09:42:04.086303699 +0200

> +++ gcc/config/i386/i386.md     2021-06-21 10:21:31.932532964 +0200

> @@ -19360,10 +19360,10 @@ (define_peephole2

>  ;; When optimizing for size, zeroing memory should use a register.

>  (define_peephole2

>    [(match_scratch:SWI48 0 "r")

> -   (set (match_operand:SWI48 1 "memory_operand" "") (const_int 0))

> -   (set (match_operand:SWI48 2 "memory_operand" "") (const_int 0))

> -   (set (match_operand:SWI48 3 "memory_operand" "") (const_int 0))

> -   (set (match_operand:SWI48 4 "memory_operand" "") (const_int 0))]

> +   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))

> +   (set (match_operand:SWI48 2 "memory_operand") (const_int 0))

> +   (set (match_operand:SWI48 3 "memory_operand") (const_int 0))

> +   (set (match_operand:SWI48 4 "memory_operand") (const_int 0))]

>    "optimize_insn_for_size_p () && peep2_regno_dead_p (0, FLAGS_REG)"

>    [(set (match_dup 1) (match_dup 0))

>     (set (match_dup 2) (match_dup 0))

> @@ -19375,8 +19375,8 @@ (define_peephole2

>

>  (define_peephole2

>    [(match_scratch:SWI48 0 "r")

> -   (set (match_operand:SWI48 1 "memory_operand" "") (const_int 0))

> -   (set (match_operand:SWI48 2 "memory_operand" "") (const_int 0))]

> +   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))

> +   (set (match_operand:SWI48 2 "memory_operand") (const_int 0))]

>    "optimize_insn_for_size_p () && peep2_regno_dead_p (0, FLAGS_REG)"

>    [(set (match_dup 1) (match_dup 0))

>     (set (match_dup 2) (match_dup 0))]

> @@ -19386,13 +19386,48 @@ (define_peephole2

>

>  (define_peephole2

>    [(match_scratch:SWI48 0 "r")

> -   (set (match_operand:SWI48 1 "memory_operand" "") (const_int 0))]

> +   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))]

>    "optimize_insn_for_size_p () && peep2_regno_dead_p (0, FLAGS_REG)"

>    [(set (match_dup 1) (match_dup 0))]

>  {

>    ix86_expand_clear (operands[0]);

>  })

>

> +(define_peephole2

> +  [(set (match_operand:SWI48 5 "memory_operand")

> +       (match_operand:SWI48 0 "general_reg_operand"))

> +   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))

> +   (set (match_operand:SWI48 2 "memory_operand") (const_int 0))

> +   (set (match_operand:SWI48 3 "memory_operand") (const_int 0))

> +   (set (match_operand:SWI48 4 "memory_operand") (const_int 0))]

> +  "optimize_insn_for_size_p ()

> +   && ix86_zero_stores_peep2_p (peep2_next_insn (0), operands[0])"

> +  [(set (match_dup 5) (match_dup 0))

> +   (set (match_dup 1) (match_dup 0))

> +   (set (match_dup 2) (match_dup 0))

> +   (set (match_dup 3) (match_dup 0))

> +   (set (match_dup 4) (match_dup 0))])

> +

> +(define_peephole2

> +  [(set (match_operand:SWI48 3 "memory_operand")

> +       (match_operand:SWI48 0 "general_reg_operand"))

> +   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))

> +   (set (match_operand:SWI48 2 "memory_operand") (const_int 0))]

> +  "optimize_insn_for_size_p ()

> +   && ix86_zero_stores_peep2_p (peep2_next_insn (0), operands[0])"

> +  [(set (match_dup 3) (match_dup 0))

> +   (set (match_dup 1) (match_dup 0))

> +   (set (match_dup 2) (match_dup 0))])

> +

> +(define_peephole2

> +  [(set (match_operand:SWI48 2 "memory_operand")

> +       (match_operand:SWI48 0 "general_reg_operand"))

> +   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))]

> +  "optimize_insn_for_size_p ()

> +   && ix86_zero_stores_peep2_p (peep2_next_insn (0), operands[0])"

> +  [(set (match_dup 2) (match_dup 0))

> +   (set (match_dup 1) (match_dup 0))])

> +

>  ;; Reload dislikes loading constants directly into class_likely_spilled

>  ;; hard registers.  Try to tidy things up here.

>  (define_peephole2

>

>

>         Jakub

>
Aldy Hernandez via Gcc-patches June 21, 2021, 9:58 a.m. | #4
On Mon, Jun 21, 2021 at 11:19:12AM +0200, Richard Biener wrote:
> > --- gcc/config/i386/i386.c.jj   2021-06-21 09:39:21.622487840 +0200

> > +++ gcc/config/i386/i386.c      2021-06-21 10:21:12.389794740 +0200

> > @@ -15186,6 +15186,33 @@ ix86_lea_for_add_ok (rtx_insn *insn, rtx

> >    return ix86_lea_outperforms (insn, regno0, regno1, regno2, 0, false);

> >  }

> >

> > +/* Return true if insns before FIRST_INSN (which is of the form

> > +   (set (memory) (zero_operand)) are all also either in the

> > +   same form, or (set (zero_operand) (const_int 0)).  */

> > +

> > +bool

> > +ix86_zero_stores_peep2_p (rtx_insn *first_insn, rtx zero_operand)

> > +{

> > +  rtx_insn *insn = first_insn;

> > +  for (int count = 0; count < 512; count++)

> 

> Can't the peephole add a note (reg_equal?) that the

> SET_SRC of the previously matched store is zero?


I think REG_EQUAL is not valid, the documentation says that it should
be used on SET of a REG, which is not the case here - we have a MEM.

> That would avoid the need to walk here.


But we could do what I've done in
r11-7694-gd55ce33a34a8e33d17285228b32cf1e564241a70
- have int ix86_last_zero_store_uid;
set to INSN_UID of the last store emitted by the peephole2s and
then check that INSN_UID against the var.

	Jakub
Aldy Hernandez via Gcc-patches June 21, 2021, 10:14 a.m. | #5
On Mon, Jun 21, 2021 at 11:59 AM Jakub Jelinek <jakub@redhat.com> wrote:
>

> On Mon, Jun 21, 2021 at 11:19:12AM +0200, Richard Biener wrote:

> > > --- gcc/config/i386/i386.c.jj   2021-06-21 09:39:21.622487840 +0200

> > > +++ gcc/config/i386/i386.c      2021-06-21 10:21:12.389794740 +0200

> > > @@ -15186,6 +15186,33 @@ ix86_lea_for_add_ok (rtx_insn *insn, rtx

> > >    return ix86_lea_outperforms (insn, regno0, regno1, regno2, 0, false);

> > >  }

> > >

> > > +/* Return true if insns before FIRST_INSN (which is of the form

> > > +   (set (memory) (zero_operand)) are all also either in the

> > > +   same form, or (set (zero_operand) (const_int 0)).  */

> > > +

> > > +bool

> > > +ix86_zero_stores_peep2_p (rtx_insn *first_insn, rtx zero_operand)

> > > +{

> > > +  rtx_insn *insn = first_insn;

> > > +  for (int count = 0; count < 512; count++)

> >

> > Can't the peephole add a note (reg_equal?) that the

> > SET_SRC of the previously matched store is zero?

>

> I think REG_EQUAL is not valid, the documentation says that it should

> be used on SET of a REG, which is not the case here - we have a MEM.

>

> > That would avoid the need to walk here.

>

> But we could do what I've done in

> r11-7694-gd55ce33a34a8e33d17285228b32cf1e564241a70

> - have int ix86_last_zero_store_uid;

> set to INSN_UID of the last store emitted by the peephole2s and

> then check that INSN_UID against the var.


Hmm, or have reg_nonzero_bits_for_peephole2 () and maintain
that somehow ... (conservatively drop it when a SET is seen).

>         Jakub

>
Aldy Hernandez via Gcc-patches June 21, 2021, 10:28 a.m. | #6
On Mon, Jun 21, 2021 at 12:14:09PM +0200, Richard Biener wrote:
> > But we could do what I've done in

> > r11-7694-gd55ce33a34a8e33d17285228b32cf1e564241a70

> > - have int ix86_last_zero_store_uid;

> > set to INSN_UID of the last store emitted by the peephole2s and

> > then check that INSN_UID against the var.

> 

> Hmm, or have reg_nonzero_bits_for_peephole2 () and maintain

> that somehow ... (conservatively drop it when a SET is seen).


Maintaining something in peephole2 wouldn't be that easy because
of peephole2's rolling window, plus it would need to be done
in the generic code even when nothing but a single target in a specific case
needs that.

The following seems to work.

2021-06-21  Jakub Jelinek  <jakub@redhat.com>

	PR target/11877
	* config/i386/i386-protos.h (ix86_last_zero_store_uid): Declare.
	* config/i386/i386-expand.c (ix86_last_zero_store_uid): New variable.
	* config/i386/i386.c (ix86_expand_prologue): Clear it.
	* config/i386/i386.md (peephole2s for 1/2/4 stores of const0_rtx):
	Remove "" from match_operand.  Emit new insns using emit_move_insn and
	set ix86_last_zero_store_uid to INSN_UID of the last store.
	Add peephole2s for 1/2/4 stores of const0_rtx following previous
	successful peep2s.

--- gcc/config/i386/i386-protos.h.jj	2021-06-21 11:59:16.769693735 +0200
+++ gcc/config/i386/i386-protos.h	2021-06-21 12:01:47.875691930 +0200
@@ -111,6 +111,7 @@ extern bool ix86_use_lea_for_mov (rtx_in
 extern bool ix86_avoid_lea_for_addr (rtx_insn *, rtx[]);
 extern void ix86_split_lea_for_addr (rtx_insn *, rtx[], machine_mode);
 extern bool ix86_lea_for_add_ok (rtx_insn *, rtx[]);
+extern int ix86_last_zero_store_uid;
 extern bool ix86_vec_interleave_v2df_operator_ok (rtx operands[3], bool high);
 extern bool ix86_dep_by_shift_count (const_rtx set_insn, const_rtx use_insn);
 extern bool ix86_agi_dependent (rtx_insn *set_insn, rtx_insn *use_insn);
--- gcc/config/i386/i386-expand.c.jj	2021-06-21 09:39:21.604488082 +0200
+++ gcc/config/i386/i386-expand.c	2021-06-21 12:21:33.017977951 +0200
@@ -1316,6 +1316,9 @@ find_nearest_reg_def (rtx_insn *insn, in
   return false;
 }
 
+/* INSN_UID of the last insn emitted by zero store peephole2s.  */
+int ix86_last_zero_store_uid;
+
 /* Split lea instructions into a sequence of instructions
    which are executed on ALU to avoid AGU stalls.
    It is assumed that it is allowed to clobber flags register
--- gcc/config/i386/i386.c.jj	2021-06-21 09:39:21.622487840 +0200
+++ gcc/config/i386/i386.c	2021-06-21 12:06:54.049634337 +0200
@@ -8196,6 +8196,7 @@ ix86_expand_prologue (void)
   bool save_stub_call_needed;
   rtx static_chain = NULL_RTX;
 
+  ix86_last_zero_store_uid = 0;
   if (ix86_function_naked (current_function_decl))
     {
       if (flag_stack_usage_info)
--- gcc/config/i386/i386.md.jj	2021-06-21 09:42:04.086303699 +0200
+++ gcc/config/i386/i386.md	2021-06-21 12:14:10.411847549 +0200
@@ -19360,37 +19360,96 @@ (define_peephole2
 ;; When optimizing for size, zeroing memory should use a register.
 (define_peephole2
   [(match_scratch:SWI48 0 "r")
-   (set (match_operand:SWI48 1 "memory_operand" "") (const_int 0))
-   (set (match_operand:SWI48 2 "memory_operand" "") (const_int 0))
-   (set (match_operand:SWI48 3 "memory_operand" "") (const_int 0))
-   (set (match_operand:SWI48 4 "memory_operand" "") (const_int 0))]
+   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))
+   (set (match_operand:SWI48 2 "memory_operand") (const_int 0))
+   (set (match_operand:SWI48 3 "memory_operand") (const_int 0))
+   (set (match_operand:SWI48 4 "memory_operand") (const_int 0))]
   "optimize_insn_for_size_p () && peep2_regno_dead_p (0, FLAGS_REG)"
-  [(set (match_dup 1) (match_dup 0))
-   (set (match_dup 2) (match_dup 0))
-   (set (match_dup 3) (match_dup 0))
-   (set (match_dup 4) (match_dup 0))]
+  [(const_int 0)]
 {
   ix86_expand_clear (operands[0]);
+  emit_move_insn (operands[1], operands[0]);
+  emit_move_insn (operands[2], operands[0]);
+  emit_move_insn (operands[3], operands[0]);
+  ix86_last_zero_store_uid
+    = INSN_UID (emit_move_insn (operands[4], operands[0]));
+  DONE;
 })
 
 (define_peephole2
   [(match_scratch:SWI48 0 "r")
-   (set (match_operand:SWI48 1 "memory_operand" "") (const_int 0))
-   (set (match_operand:SWI48 2 "memory_operand" "") (const_int 0))]
+   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))
+   (set (match_operand:SWI48 2 "memory_operand") (const_int 0))]
   "optimize_insn_for_size_p () && peep2_regno_dead_p (0, FLAGS_REG)"
-  [(set (match_dup 1) (match_dup 0))
-   (set (match_dup 2) (match_dup 0))]
+  [(const_int 0)]
 {
   ix86_expand_clear (operands[0]);
+  emit_move_insn (operands[1], operands[0]);
+  ix86_last_zero_store_uid
+    = INSN_UID (emit_move_insn (operands[2], operands[0]));
+  DONE;
 })
 
 (define_peephole2
   [(match_scratch:SWI48 0 "r")
-   (set (match_operand:SWI48 1 "memory_operand" "") (const_int 0))]
+   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))]
   "optimize_insn_for_size_p () && peep2_regno_dead_p (0, FLAGS_REG)"
-  [(set (match_dup 1) (match_dup 0))]
+  [(const_int 0)]
 {
   ix86_expand_clear (operands[0]);
+  ix86_last_zero_store_uid
+    = INSN_UID (emit_move_insn (operands[1], operands[0]));
+  DONE;
+})
+
+(define_peephole2
+  [(set (match_operand:SWI48 5 "memory_operand")
+	(match_operand:SWI48 0 "general_reg_operand"))
+   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))
+   (set (match_operand:SWI48 2 "memory_operand") (const_int 0))
+   (set (match_operand:SWI48 3 "memory_operand") (const_int 0))
+   (set (match_operand:SWI48 4 "memory_operand") (const_int 0))]
+  "optimize_insn_for_size_p ()
+   && INSN_UID (peep2_next_insn (0)) == ix86_last_zero_store_uid"
+  [(const_int 0)]
+{
+  emit_move_insn (operands[5], operands[0]);
+  emit_move_insn (operands[1], operands[0]);
+  emit_move_insn (operands[2], operands[0]);
+  emit_move_insn (operands[3], operands[0]);
+  ix86_last_zero_store_uid
+    = INSN_UID (emit_move_insn (operands[4], operands[0]));
+  DONE;
+})
+
+(define_peephole2
+  [(set (match_operand:SWI48 3 "memory_operand")
+	(match_operand:SWI48 0 "general_reg_operand"))
+   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))
+   (set (match_operand:SWI48 2 "memory_operand") (const_int 0))]
+  "optimize_insn_for_size_p ()
+   && INSN_UID (peep2_next_insn (0)) == ix86_last_zero_store_uid"
+  [(const_int 0)]
+{
+  emit_move_insn (operands[3], operands[0]);
+  emit_move_insn (operands[1], operands[0]);
+  ix86_last_zero_store_uid
+    = INSN_UID (emit_move_insn (operands[2], operands[0]));
+  DONE;
+})
+
+(define_peephole2
+  [(set (match_operand:SWI48 2 "memory_operand")
+	(match_operand:SWI48 0 "general_reg_operand"))
+   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))]
+  "optimize_insn_for_size_p ()
+   && INSN_UID (peep2_next_insn (0)) == ix86_last_zero_store_uid"
+  [(const_int 0)]
+{
+  emit_move_insn (operands[2], operands[0]);
+  ix86_last_zero_store_uid
+    = INSN_UID (emit_move_insn (operands[1], operands[0]));
+  DONE;
 })
 
 ;; Reload dislikes loading constants directly into class_likely_spilled


	Jakub
Aldy Hernandez via Gcc-patches June 21, 2021, 12:01 p.m. | #7
On Mon, Jun 21, 2021 at 12:28 PM Jakub Jelinek <jakub@redhat.com> wrote:
>

> On Mon, Jun 21, 2021 at 12:14:09PM +0200, Richard Biener wrote:

> > > But we could do what I've done in

> > > r11-7694-gd55ce33a34a8e33d17285228b32cf1e564241a70

> > > - have int ix86_last_zero_store_uid;

> > > set to INSN_UID of the last store emitted by the peephole2s and

> > > then check that INSN_UID against the var.

> >

> > Hmm, or have reg_nonzero_bits_for_peephole2 () and maintain

> > that somehow ... (conservatively drop it when a SET is seen).

>

> Maintaining something in peephole2 wouldn't be that easy because

> of peephole2's rolling window, plus it would need to be done

> in the generic code even when nothing but a single target in a specific case

> needs that.

>

> The following seems to work.

>

> 2021-06-21  Jakub Jelinek  <jakub@redhat.com>

>

>         PR target/11877

>         * config/i386/i386-protos.h (ix86_last_zero_store_uid): Declare.

>         * config/i386/i386-expand.c (ix86_last_zero_store_uid): New variable.

>         * config/i386/i386.c (ix86_expand_prologue): Clear it.

>         * config/i386/i386.md (peephole2s for 1/2/4 stores of const0_rtx):

>         Remove "" from match_operand.  Emit new insns using emit_move_insn and

>         set ix86_last_zero_store_uid to INSN_UID of the last store.

>         Add peephole2s for 1/2/4 stores of const0_rtx following previous

>         successful peep2s.


LGTM.

Thanks,
Uros.

>

> --- gcc/config/i386/i386-protos.h.jj    2021-06-21 11:59:16.769693735 +0200

> +++ gcc/config/i386/i386-protos.h       2021-06-21 12:01:47.875691930 +0200

> @@ -111,6 +111,7 @@ extern bool ix86_use_lea_for_mov (rtx_in

>  extern bool ix86_avoid_lea_for_addr (rtx_insn *, rtx[]);

>  extern void ix86_split_lea_for_addr (rtx_insn *, rtx[], machine_mode);

>  extern bool ix86_lea_for_add_ok (rtx_insn *, rtx[]);

> +extern int ix86_last_zero_store_uid;

>  extern bool ix86_vec_interleave_v2df_operator_ok (rtx operands[3], bool high);

>  extern bool ix86_dep_by_shift_count (const_rtx set_insn, const_rtx use_insn);

>  extern bool ix86_agi_dependent (rtx_insn *set_insn, rtx_insn *use_insn);

> --- gcc/config/i386/i386-expand.c.jj    2021-06-21 09:39:21.604488082 +0200

> +++ gcc/config/i386/i386-expand.c       2021-06-21 12:21:33.017977951 +0200

> @@ -1316,6 +1316,9 @@ find_nearest_reg_def (rtx_insn *insn, in

>    return false;

>  }

>

> +/* INSN_UID of the last insn emitted by zero store peephole2s.  */

> +int ix86_last_zero_store_uid;

> +

>  /* Split lea instructions into a sequence of instructions

>     which are executed on ALU to avoid AGU stalls.

>     It is assumed that it is allowed to clobber flags register

> --- gcc/config/i386/i386.c.jj   2021-06-21 09:39:21.622487840 +0200

> +++ gcc/config/i386/i386.c      2021-06-21 12:06:54.049634337 +0200

> @@ -8196,6 +8196,7 @@ ix86_expand_prologue (void)

>    bool save_stub_call_needed;

>    rtx static_chain = NULL_RTX;

>

> +  ix86_last_zero_store_uid = 0;

>    if (ix86_function_naked (current_function_decl))

>      {

>        if (flag_stack_usage_info)

> --- gcc/config/i386/i386.md.jj  2021-06-21 09:42:04.086303699 +0200

> +++ gcc/config/i386/i386.md     2021-06-21 12:14:10.411847549 +0200

> @@ -19360,37 +19360,96 @@ (define_peephole2

>  ;; When optimizing for size, zeroing memory should use a register.

>  (define_peephole2

>    [(match_scratch:SWI48 0 "r")

> -   (set (match_operand:SWI48 1 "memory_operand" "") (const_int 0))

> -   (set (match_operand:SWI48 2 "memory_operand" "") (const_int 0))

> -   (set (match_operand:SWI48 3 "memory_operand" "") (const_int 0))

> -   (set (match_operand:SWI48 4 "memory_operand" "") (const_int 0))]

> +   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))

> +   (set (match_operand:SWI48 2 "memory_operand") (const_int 0))

> +   (set (match_operand:SWI48 3 "memory_operand") (const_int 0))

> +   (set (match_operand:SWI48 4 "memory_operand") (const_int 0))]

>    "optimize_insn_for_size_p () && peep2_regno_dead_p (0, FLAGS_REG)"

> -  [(set (match_dup 1) (match_dup 0))

> -   (set (match_dup 2) (match_dup 0))

> -   (set (match_dup 3) (match_dup 0))

> -   (set (match_dup 4) (match_dup 0))]

> +  [(const_int 0)]

>  {

>    ix86_expand_clear (operands[0]);

> +  emit_move_insn (operands[1], operands[0]);

> +  emit_move_insn (operands[2], operands[0]);

> +  emit_move_insn (operands[3], operands[0]);

> +  ix86_last_zero_store_uid

> +    = INSN_UID (emit_move_insn (operands[4], operands[0]));

> +  DONE;

>  })

>

>  (define_peephole2

>    [(match_scratch:SWI48 0 "r")

> -   (set (match_operand:SWI48 1 "memory_operand" "") (const_int 0))

> -   (set (match_operand:SWI48 2 "memory_operand" "") (const_int 0))]

> +   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))

> +   (set (match_operand:SWI48 2 "memory_operand") (const_int 0))]

>    "optimize_insn_for_size_p () && peep2_regno_dead_p (0, FLAGS_REG)"

> -  [(set (match_dup 1) (match_dup 0))

> -   (set (match_dup 2) (match_dup 0))]

> +  [(const_int 0)]

>  {

>    ix86_expand_clear (operands[0]);

> +  emit_move_insn (operands[1], operands[0]);

> +  ix86_last_zero_store_uid

> +    = INSN_UID (emit_move_insn (operands[2], operands[0]));

> +  DONE;

>  })

>

>  (define_peephole2

>    [(match_scratch:SWI48 0 "r")

> -   (set (match_operand:SWI48 1 "memory_operand" "") (const_int 0))]

> +   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))]

>    "optimize_insn_for_size_p () && peep2_regno_dead_p (0, FLAGS_REG)"

> -  [(set (match_dup 1) (match_dup 0))]

> +  [(const_int 0)]

>  {

>    ix86_expand_clear (operands[0]);

> +  ix86_last_zero_store_uid

> +    = INSN_UID (emit_move_insn (operands[1], operands[0]));

> +  DONE;

> +})

> +

> +(define_peephole2

> +  [(set (match_operand:SWI48 5 "memory_operand")

> +       (match_operand:SWI48 0 "general_reg_operand"))

> +   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))

> +   (set (match_operand:SWI48 2 "memory_operand") (const_int 0))

> +   (set (match_operand:SWI48 3 "memory_operand") (const_int 0))

> +   (set (match_operand:SWI48 4 "memory_operand") (const_int 0))]

> +  "optimize_insn_for_size_p ()

> +   && INSN_UID (peep2_next_insn (0)) == ix86_last_zero_store_uid"

> +  [(const_int 0)]

> +{

> +  emit_move_insn (operands[5], operands[0]);

> +  emit_move_insn (operands[1], operands[0]);

> +  emit_move_insn (operands[2], operands[0]);

> +  emit_move_insn (operands[3], operands[0]);

> +  ix86_last_zero_store_uid

> +    = INSN_UID (emit_move_insn (operands[4], operands[0]));

> +  DONE;

> +})

> +

> +(define_peephole2

> +  [(set (match_operand:SWI48 3 "memory_operand")

> +       (match_operand:SWI48 0 "general_reg_operand"))

> +   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))

> +   (set (match_operand:SWI48 2 "memory_operand") (const_int 0))]

> +  "optimize_insn_for_size_p ()

> +   && INSN_UID (peep2_next_insn (0)) == ix86_last_zero_store_uid"

> +  [(const_int 0)]

> +{

> +  emit_move_insn (operands[3], operands[0]);

> +  emit_move_insn (operands[1], operands[0]);

> +  ix86_last_zero_store_uid

> +    = INSN_UID (emit_move_insn (operands[2], operands[0]));

> +  DONE;

> +})

> +

> +(define_peephole2

> +  [(set (match_operand:SWI48 2 "memory_operand")

> +       (match_operand:SWI48 0 "general_reg_operand"))

> +   (set (match_operand:SWI48 1 "memory_operand") (const_int 0))]

> +  "optimize_insn_for_size_p ()

> +   && INSN_UID (peep2_next_insn (0)) == ix86_last_zero_store_uid"

> +  [(const_int 0)]

> +{

> +  emit_move_insn (operands[2], operands[0]);

> +  ix86_last_zero_store_uid

> +    = INSN_UID (emit_move_insn (operands[1], operands[0]));

> +  DONE;

>  })

>

>  ;; Reload dislikes loading constants directly into class_likely_spilled

>

>

>         Jakub

>

Patch

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 48532eb..2333261 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -19357,6 +19357,42 @@ 
   ix86_expand_clear (operands[1]);
 })
 
+;; When optimizing for size, zeroing memory should use a register.
+(define_peephole2
+  [(match_scratch:SWI48 0 "r")
+   (set (match_operand:SWI48 1 "memory_operand" "") (const_int 0))
+   (set (match_operand:SWI48 2 "memory_operand" "") (const_int 0))
+   (set (match_operand:SWI48 3 "memory_operand" "") (const_int 0))
+   (set (match_operand:SWI48 4 "memory_operand" "") (const_int 0))]
+  "optimize_insn_for_size_p () && peep2_regno_dead_p (0, FLAGS_REG)"
+  [(set (match_dup 1) (match_dup 0))
+   (set (match_dup 2) (match_dup 0))
+   (set (match_dup 3) (match_dup 0))
+   (set (match_dup 4) (match_dup 0))]
+{
+  ix86_expand_clear (operands[0]);
+})
+
+(define_peephole2
+  [(match_scratch:SWI48 0 "r")
+   (set (match_operand:SWI48 1 "memory_operand" "") (const_int 0))
+   (set (match_operand:SWI48 2 "memory_operand" "") (const_int 0))]
+  "optimize_insn_for_size_p () && peep2_regno_dead_p (0, FLAGS_REG)"
+  [(set (match_dup 1) (match_dup 0))
+   (set (match_dup 2) (match_dup 0))]
+{
+  ix86_expand_clear (operands[0]);
+})
+
+(define_peephole2
+  [(match_scratch:SWI48 0 "r")
+   (set (match_operand:SWI48 1 "memory_operand" "") (const_int 0))]
+  "optimize_insn_for_size_p () && peep2_regno_dead_p (0, FLAGS_REG)"
+  [(set (match_dup 1) (match_dup 0))]
+{
+  ix86_expand_clear (operands[0]);
+})
+
 ;; Reload dislikes loading constants directly into class_likely_spilled
 ;; hard registers.  Try to tidy things up here.
 (define_peephole2