[v4] Add QI vector mode support to by-pieces for memset

Message ID 20210721200216.256776-1-hjl.tools@gmail.com
State Superseded

Commit Message

Pierre-Marie de Rodat via Gcc-patches July 21, 2021, 8:02 p.m.
1. Replace scalar_int_mode with fixed_size_mode in the by-pieces
infrastructure to allow non-integer modes.
2. Rename widest_int_mode_for_size to widest_fixed_size_mode_for_size
so that it can return a QI vector mode for memset.
3. Add op_by_pieces_d::smallest_fixed_size_mode_for_size to return the
smallest integer or QI vector mode.
4. Remove clear_by_pieces_1 and use builtin_memset_read_str in
clear_by_pieces to support vector-mode broadcast.
5. Add lowpart_subreg_regno, a wrapper around simplify_subreg_regno that
uses subreg_lowpart_offset (mode, prev_mode) as the offset.
6. Add TARGET_GEN_MEMSET_SCRATCH_RTX so that the backend can use a hard
scratch register to avoid stack realignment when expanding memset.

gcc/

	PR middle-end/90773
	* builtins.c (builtin_memcpy_read_str): Change the mode argument
	from scalar_int_mode to fixed_size_mode.
	(builtin_strncpy_read_str): Likewise.
	(gen_memset_value_from_prev): New function.
	(gen_memset_broadcast): Likewise.
	(builtin_memset_read_str): Change the mode argument from
	scalar_int_mode to fixed_size_mode.  Use gen_memset_value_from_prev
	and gen_memset_broadcast.
	(builtin_memset_gen_str): Likewise.
	(try_store_by_multiple_pieces): Use by_pieces_constfn to declare
	constfun.
	* builtins.h (builtin_strncpy_read_str): Replace scalar_int_mode
	with fixed_size_mode.
	(builtin_memset_read_str): Likewise.
	* expr.c (widest_int_mode_for_size): Renamed to ...
	(widest_fixed_size_mode_for_size): Add a bool argument to
	indicate if QI vector mode can be used.
	(by_pieces_ninsns): Call widest_fixed_size_mode_for_size
	instead of widest_int_mode_for_size.
	(pieces_addr::adjust): Change the mode argument from
	scalar_int_mode to fixed_size_mode.
	(op_by_pieces_d): Make m_len read-only.  Add a bool member,
	m_qi_vector_mode, to indicate that QI vector mode can be used.
	(op_by_pieces_d::op_by_pieces_d): Add a bool argument to
	initialize m_qi_vector_mode.  Call widest_fixed_size_mode_for_size
	instead of widest_int_mode_for_size.
	(op_by_pieces_d::get_usable_mode): Change the mode argument from
	scalar_int_mode to fixed_size_mode.  Call
	widest_fixed_size_mode_for_size instead of
	widest_int_mode_for_size.
	(op_by_pieces_d::smallest_fixed_size_mode_for_size): New member
	function to return the smallest integer or QI vector mode.
	(op_by_pieces_d::run): Call widest_fixed_size_mode_for_size
	instead of widest_int_mode_for_size.  Call
	smallest_fixed_size_mode_for_size instead of
	smallest_int_mode_for_size.
	(store_by_pieces_d::store_by_pieces_d): Add a bool argument to
	indicate that QI vector mode can be used and pass it to
	op_by_pieces_d::op_by_pieces_d.
	(can_store_by_pieces): Call widest_fixed_size_mode_for_size
	instead of widest_int_mode_for_size.
	(store_by_pieces): Pass memsetp to
	store_by_pieces_d::store_by_pieces_d.
	(clear_by_pieces_1): Removed.
	(clear_by_pieces): Replace clear_by_pieces_1 with
	builtin_memset_read_str and pass true to store_by_pieces_d to
	support vector mode broadcast.
	(string_cst_read_str): Change the mode argument from
	scalar_int_mode to fixed_size_mode.
	* expr.h (by_pieces_constfn): Change scalar_int_mode to
	fixed_size_mode.
	(by_pieces_prev): Likewise.
	* rtl.h (lowpart_subreg_regno): New.
	* rtlanal.c (lowpart_subreg_regno): New.  A wrapper around
	simplify_subreg_regno.
	* target.def (gen_memset_scratch_rtx): New hook.
	* doc/tm.texi.in: Add TARGET_GEN_MEMSET_SCRATCH_RTX.
	* doc/tm.texi: Regenerated.

gcc/testsuite/

	* gcc.target/i386/pr100865-3.c: Expect vmovdqu8 instead of
	vmovdqu.
	* gcc.target/i386/pr100865-4b.c: Likewise.
---
 gcc/builtins.c                              | 180 ++++++++++++++++----
 gcc/builtins.h                              |   4 +-
 gcc/doc/tm.texi                             |   7 +
 gcc/doc/tm.texi.in                          |   2 +
 gcc/expr.c                                  | 168 ++++++++++++------
 gcc/expr.h                                  |   4 +-
 gcc/rtl.h                                   |   2 +
 gcc/rtlanal.c                               |  11 ++
 gcc/target.def                              |   9 +
 gcc/testsuite/gcc.target/i386/pr100865-3.c  |   2 +-
 gcc/testsuite/gcc.target/i386/pr100865-4b.c |   2 +-
 11 files changed, 303 insertions(+), 88 deletions(-)

-- 
2.31.1

Comments

Pierre-Marie de Rodat via Gcc-patches July 26, 2021, 6:42 p.m. | #1
"H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> +/* Callback routine for store_by_pieces.  Read GET_MODE_BITSIZE (MODE)
> +   bytes from constant string DATA + OFFSET and return it as target
> +   constant.  If PREV isn't nullptr, it has the RTL info from the
> +   previous iteration.  */
> +
> +rtx
> +builtin_memset_read_str (void *data, void *prev,
> +			 HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> +			 fixed_size_mode mode)
> +{
>    const char *c = (const char *) data;
> -  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
> +  unsigned int size = GET_MODE_SIZE (mode);
>  
> -  memset (p, *c, GET_MODE_SIZE (mode));
> +  rtx target = gen_memset_value_from_prev ((by_pieces_prev *) prev,
> +					   mode);
> +  if (target != nullptr)
> +    return target;
> +  rtx src = gen_int_mode (*c, QImode);
> +  target = gen_memset_broadcast (src, mode);
> +  if (target != nullptr)
> +    return target;
>  
> -  return c_readstr (p, mode);
> +  char *p = XALLOCAVEC (char, size);
> +
> +  memset (p, *c, size);
> +
> +  /* Vector mode should be handled by gen_memset_broadcast above.  */

Nit: s/Vector mode should/Vector modes should/

> +  return c_readstr (p, as_a <scalar_int_mode> (mode));
>  }
>  
>  /* Callback routine for store_by_pieces.  Return the RTL of a register
> @@ -6788,33 +6910,29 @@ builtin_memset_read_str (void *data, void *prevp,
>     nullptr, it has the RTL info from the previous iteration.  */
>  
>  static rtx
> -builtin_memset_gen_str (void *data, void *prevp,
> +builtin_memset_gen_str (void *data, void *prev,
>  			HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> -			scalar_int_mode mode)
> +			fixed_size_mode mode)
>  {
>    rtx target, coeff;
>    size_t size;
>    char *p;
>  
> -  by_pieces_prev *prev = (by_pieces_prev *) prevp;
> -  if (prev != nullptr && prev->data != nullptr)
> -    {
> -      /* Use the previous data in the same mode.  */
> -      if (prev->mode == mode)
> -	return prev->data;
> -
> -      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
> -      if (target != nullptr)
> -	return target;
> -    }
> -
>    size = GET_MODE_SIZE (mode);
>    if (size == 1)
>      return (rtx) data;
>  
> +  target = gen_memset_value_from_prev ((by_pieces_prev *) prev, mode);
> +  if (target != nullptr)
> +    return target;
> +
> +  target = gen_memset_broadcast ((rtx) data, mode);
> +  if (target != nullptr)
> +    return target;
> +
>    p = XALLOCAVEC (char, size);
>    memset (p, 1, size);
> -  coeff = c_readstr (p, mode);
> +  coeff = c_readstr (p, as_a <scalar_int_mode> (mode));

The same comment would be useful here.

>  
>    target = convert_to_mode (mode, (rtx) data, 1);
>    target = expand_mult (mode, target, coeff, NULL_RTX, 1);
> @@ -6918,7 +7036,7 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
>  			    &valc, align, true))
>      return false;
>  
> -  rtx (*constfun) (void *, void *, HOST_WIDE_INT, scalar_int_mode);
> +  by_pieces_constfn constfun;
>    void *constfundata;
>    if (val)
>      {
> diff --git a/gcc/builtins.h b/gcc/builtins.h
> index a64ece3f1cd..4b2ad766c61 100644
> --- a/gcc/builtins.h
> +++ b/gcc/builtins.h
> @@ -111,9 +111,9 @@ extern tree mathfn_built_in (tree, enum built_in_function fn);
>  extern tree mathfn_built_in (tree, combined_fn);
>  extern tree mathfn_built_in_type (combined_fn);
>  extern rtx builtin_strncpy_read_str (void *, void *, HOST_WIDE_INT,
> -				     scalar_int_mode);
> +				     fixed_size_mode);
>  extern rtx builtin_memset_read_str (void *, void *, HOST_WIDE_INT,
> -				    scalar_int_mode);
> +				    fixed_size_mode);
>  extern rtx expand_builtin_saveregs (void);
>  extern tree std_build_builtin_va_list (void);
>  extern tree std_fn_abi_va_list (tree);
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index c8f4abe3e41..14d387750a8 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -12149,6 +12149,13 @@ This function prepares to emit a conditional comparison within a sequence
>   @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the compares.
>  @end deftypefn
>  
> +@deftypefn {Target Hook} rtx TARGET_GEN_MEMSET_SCRATCH_RTX (machine_mode @var{mode})
> +This hook should return an rtx for scratch register in @var{mode} to

s/for scratch register/for a scratch register/

> +be used by memset broadcast.  The backend can use a hard scratch register


How about: s/be used by memset broadcast/be used when expanding memset calls/
since the hook is (IMO rightly) not restricted to vector modes.

> +to avoid stack realignment when expanding memset.  The default is
> +@code{gen_reg_rtx}.
> +@end deftypefn
> +
>  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})
>  This target hook returns a new value for the number of times @var{loop}
>  should be unrolled. The parameter @var{nunroll} is the number of times
> […]
> @@ -1446,7 +1511,10 @@ can_store_by_pieces (unsigned HOST_WIDE_INT len,
>        max_size = STORE_MAX_PIECES + 1;
>        while (max_size > 1 && l > 0)
>  	{
> -	  scalar_int_mode mode = widest_int_mode_for_size (max_size);
> +	  /* Since this can be called before virtual registers are ready
> +	     to use, avoid QI vector mode here.  */
> +	  fixed_size_mode mode
> +	    = widest_fixed_size_mode_for_size (max_size, false);

I think I might have asked this before, sorry, but: when is that true
and why does it matter?

Apart from that final question, the patch looks good with the changes
above (no need to repost).  I'd just like to know about the reason for
the “false” argument above before the patch goes in.

Thanks,
Richard
Pierre-Marie de Rodat via Gcc-patches July 26, 2021, 9:04 p.m. | #2
On Mon, Jul 26, 2021 at 11:42 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>

> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:

> > +/* Callback routine for store_by_pieces.  Read GET_MODE_BITSIZE (MODE)

> > +   bytes from constant string DATA + OFFSET and return it as target

> > +   constant.  If PREV isn't nullptr, it has the RTL info from the

> > +   previous iteration.  */

> > +

> > +rtx

> > +builtin_memset_read_str (void *data, void *prev,

> > +                      HOST_WIDE_INT offset ATTRIBUTE_UNUSED,

> > +                      fixed_size_mode mode)

> > +{

> >    const char *c = (const char *) data;

> > -  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));

> > +  unsigned int size = GET_MODE_SIZE (mode);

> >

> > -  memset (p, *c, GET_MODE_SIZE (mode));

> > +  rtx target = gen_memset_value_from_prev ((by_pieces_prev *) prev,

> > +                                        mode);

> > +  if (target != nullptr)

> > +    return target;

> > +  rtx src = gen_int_mode (*c, QImode);

> > +  target = gen_memset_broadcast (src, mode);

> > +  if (target != nullptr)

> > +    return target;

> >

> > -  return c_readstr (p, mode);

> > +  char *p = XALLOCAVEC (char, size);

> > +

> > +  memset (p, *c, size);

> > +

> > +  /* Vector mode should be handled by gen_memset_broadcast above.  */

>

> Nit: s/Vector mode should/Vector modes should/


Fixed.

> > +  return c_readstr (p, as_a <scalar_int_mode> (mode));

> >  }

> >

> >  /* Callback routine for store_by_pieces.  Return the RTL of a register

> > @@ -6788,33 +6910,29 @@ builtin_memset_read_str (void *data, void *prevp,

> >     nullptr, it has the RTL info from the previous iteration.  */

> >

> >  static rtx

> > -builtin_memset_gen_str (void *data, void *prevp,

> > +builtin_memset_gen_str (void *data, void *prev,

> >                       HOST_WIDE_INT offset ATTRIBUTE_UNUSED,

> > -                     scalar_int_mode mode)

> > +                     fixed_size_mode mode)

> >  {

> >    rtx target, coeff;

> >    size_t size;

> >    char *p;

> >

> > -  by_pieces_prev *prev = (by_pieces_prev *) prevp;

> > -  if (prev != nullptr && prev->data != nullptr)

> > -    {

> > -      /* Use the previous data in the same mode.  */

> > -      if (prev->mode == mode)

> > -     return prev->data;

> > -

> > -      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);

> > -      if (target != nullptr)

> > -     return target;

> > -    }

> > -

> >    size = GET_MODE_SIZE (mode);

> >    if (size == 1)

> >      return (rtx) data;

> >

> > +  target = gen_memset_value_from_prev ((by_pieces_prev *) prev, mode);

> > +  if (target != nullptr)

> > +    return target;

> > +

> > +  target = gen_memset_broadcast ((rtx) data, mode);

> > +  if (target != nullptr)

> > +    return target;

> > +

> >    p = XALLOCAVEC (char, size);

> >    memset (p, 1, size);

> > -  coeff = c_readstr (p, mode);

> > +  coeff = c_readstr (p, as_a <scalar_int_mode> (mode));

>

> The same comment would be useful here.


I added the same comment here.

> >

> >    target = convert_to_mode (mode, (rtx) data, 1);

> >    target = expand_mult (mode, target, coeff, NULL_RTX, 1);

> > @@ -6918,7 +7036,7 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,

> >                           &valc, align, true))

> >      return false;

> >

> > -  rtx (*constfun) (void *, void *, HOST_WIDE_INT, scalar_int_mode);

> > +  by_pieces_constfn constfun;

> >    void *constfundata;

> >    if (val)

> >      {

> > diff --git a/gcc/builtins.h b/gcc/builtins.h

> > index a64ece3f1cd..4b2ad766c61 100644

> > --- a/gcc/builtins.h

> > +++ b/gcc/builtins.h

> > @@ -111,9 +111,9 @@ extern tree mathfn_built_in (tree, enum built_in_function fn);

> >  extern tree mathfn_built_in (tree, combined_fn);

> >  extern tree mathfn_built_in_type (combined_fn);

> >  extern rtx builtin_strncpy_read_str (void *, void *, HOST_WIDE_INT,

> > -                                  scalar_int_mode);

> > +                                  fixed_size_mode);

> >  extern rtx builtin_memset_read_str (void *, void *, HOST_WIDE_INT,

> > -                                 scalar_int_mode);

> > +                                 fixed_size_mode);

> >  extern rtx expand_builtin_saveregs (void);

> >  extern tree std_build_builtin_va_list (void);

> >  extern tree std_fn_abi_va_list (tree);

> > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi

> > index c8f4abe3e41..14d387750a8 100644

> > --- a/gcc/doc/tm.texi

> > +++ b/gcc/doc/tm.texi

> > @@ -12149,6 +12149,13 @@ This function prepares to emit a conditional comparison within a sequence

> >   @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the compares.

> >  @end deftypefn

> >

> > +@deftypefn {Target Hook} rtx TARGET_GEN_MEMSET_SCRATCH_RTX (machine_mode @var{mode})

> > +This hook should return an rtx for scratch register in @var{mode} to

>

> s/for scratch register/for a scratch register/


Fixed.

> > +be used by memset broadcast.  The backend can use a hard scratch register

>

> How about: s/be used by memset broadcast/be used when expanding memset calls/

> since the hook is (IMO rightly) not restricted to vector modes.


Fixed.

> > +to avoid stack realignment when expanding memset.  The default is

> > +@code{gen_reg_rtx}.

> > +@end deftypefn

> > +

> >  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})

> >  This target hook returns a new value for the number of times @var{loop}

> >  should be unrolled. The parameter @var{nunroll} is the number of times

> > […]

> > @@ -1446,7 +1511,10 @@ can_store_by_pieces (unsigned HOST_WIDE_INT len,

> >        max_size = STORE_MAX_PIECES + 1;

> >        while (max_size > 1 && l > 0)

> >       {

> > -       scalar_int_mode mode = widest_int_mode_for_size (max_size);

> > +       /* Since this can be called before virtual registers are ready

> > +          to use, avoid QI vector mode here.  */

> > +       fixed_size_mode mode

> > +         = widest_fixed_size_mode_for_size (max_size, false);

>

> I think I might have asked this before, sorry, but: when is that true

> and why does it matter?


can_store_by_pieces may be called from value-prof.c:

value-prof.c:      if (!can_store_by_pieces (val, builtin_memset_read_str,
value-prof.c:      if (!can_store_by_pieces (val, builtin_memset_read_str,

before virtual registers can be used.  When true is passed to
widest_fixed_size_mode_for_size, virtual registers may be used to
expand memset to a broadcast, which leads to an ICE.  Since
can_store_by_pieces doesn't need to expand memset to a broadcast,
passing false here avoids the ICE.

> Apart from that final question, the patch looks good with the changes

> above (no need to repost).  I'd just like to know about the reason for

> the “false” argument above before the patch goes in.

>

> Thanks,

> Richard


Thanks.


-- 
H.J.
Pierre-Marie de Rodat via Gcc-patches July 26, 2021, 9:53 p.m. | #3
"H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> On Mon, Jul 26, 2021 at 11:42 AM Richard Sandiford

> <richard.sandiford@arm.com> wrote:

>>

>> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:

>> > +to avoid stack realignment when expanding memset.  The default is

>> > +@code{gen_reg_rtx}.

>> > +@end deftypefn

>> > +

>> >  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})

>> >  This target hook returns a new value for the number of times @var{loop}

>> >  should be unrolled. The parameter @var{nunroll} is the number of times

>> > […]

>> > @@ -1446,7 +1511,10 @@ can_store_by_pieces (unsigned HOST_WIDE_INT len,

>> >        max_size = STORE_MAX_PIECES + 1;

>> >        while (max_size > 1 && l > 0)

>> >       {

>> > -       scalar_int_mode mode = widest_int_mode_for_size (max_size);

>> > +       /* Since this can be called before virtual registers are ready

>> > +          to use, avoid QI vector mode here.  */

>> > +       fixed_size_mode mode

>> > +         = widest_fixed_size_mode_for_size (max_size, false);

>>

>> I think I might have asked this before, sorry, but: when is that true

>> and why does it matter?

>

> can_store_by_pieces may be called:

>

> value-prof.c:      if (!can_store_by_pieces (val, builtin_memset_read_str,

> value-prof.c:      if (!can_store_by_pieces (val, builtin_memset_read_str,

>

> before virtual registers can be used.   When true is passed to

> widest_fixed_size_mode_for_size,  virtual registers may be used

> to expand memset to broadcast, which leads to ICE.   Since for the

> purpose of can_store_by_pieces, we don't need to expand memset

> to broadcast and pass false here can avoid ICE.


Ah, I see, thanks.

That sounds like a problem in the way that the memset const function is
written though.  can_store_by_pieces is just a query function, so I don't
think it should be trying to create new registers for can_store_by_pieces,
even if it could.  At the same time, can_store_by_pieces should make the
same choices as the real expander would.

I think this means that:

- gen_memset_broadcast should be inlined into its callers, with the
  builtin_memset_read_str getting the CONST_INT_P case and
  builtin_memset_gen_str getting the variable case.

- builtin_memset_read_str should then stop at and return the
  gen_const_vec_duplicate when the prev argument is null.
  Only when prev is nonnull should it go on to call the hook
  and copy the constant to the register that the hook returns.

I admit that's uglier than the current version, but it looks like that's
what the current interface expects.

Thanks,
Richard
Pierre-Marie de Rodat via Gcc-patches July 26, 2021, 10:56 p.m. | #4
On Mon, Jul 26, 2021 at 2:53 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>

> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:

> > On Mon, Jul 26, 2021 at 11:42 AM Richard Sandiford

> > <richard.sandiford@arm.com> wrote:

> >>

> >> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:

> >> > +to avoid stack realignment when expanding memset.  The default is

> >> > +@code{gen_reg_rtx}.

> >> > +@end deftypefn

> >> > +

> >> >  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})

> >> >  This target hook returns a new value for the number of times @var{loop}

> >> >  should be unrolled. The parameter @var{nunroll} is the number of times

> >> > […]

> >> > @@ -1446,7 +1511,10 @@ can_store_by_pieces (unsigned HOST_WIDE_INT len,

> >> >        max_size = STORE_MAX_PIECES + 1;

> >> >        while (max_size > 1 && l > 0)

> >> >       {

> >> > -       scalar_int_mode mode = widest_int_mode_for_size (max_size);

> >> > +       /* Since this can be called before virtual registers are ready

> >> > +          to use, avoid QI vector mode here.  */

> >> > +       fixed_size_mode mode

> >> > +         = widest_fixed_size_mode_for_size (max_size, false);

> >>

> >> I think I might have asked this before, sorry, but: when is that true

> >> and why does it matter?

> >

> > can_store_by_pieces may be called:

> >

> > value-prof.c:      if (!can_store_by_pieces (val, builtin_memset_read_str,

> > value-prof.c:      if (!can_store_by_pieces (val, builtin_memset_read_str,

> >

> > before virtual registers can be used.   When true is passed to

> > widest_fixed_size_mode_for_size,  virtual registers may be used

> > to expand memset to broadcast, which leads to ICE.   Since for the

> > purpose of can_store_by_pieces, we don't need to expand memset

> > to broadcast and pass false here can avoid ICE.

>

> Ah, I see, thanks.

>

> That sounds like a problem in the way that the memset const function is

> written though.  can_store_by_pieces is just a query function, so I don't

> think it should be trying to create new registers for can_store_by_pieces,

> even if it could.  At the same time, can_store_by_pieces should make the

> same choices as the real expander would.

>

> I think this means that:

>

> - gen_memset_broadcast should be inlined into its callers, with the

>   builtin_memset_read_str getting the CONST_INT_P case and

>   builtin_memset_gen_str getting the variable case.

>

> - builtin_memset_read_str should then stop at and return the

>   gen_const_vec_duplicate when the prev argument is null.

>   Only when prev is nonnull should it go on to call the hook

>   and copy the constant to the register that the hook returns.


How about keeping gen_memset_broadcast and passing PREV to it:

  rtx target;
  if (CONST_INT_P (data))
    {
      rtx const_vec = gen_const_vec_duplicate (mode, data);
      if (prev == NULL)
        /* Return CONST_VECTOR when called by a query function.  */
        target = const_vec;
      else
        {
          /* Use the move expander with CONST_VECTOR.  */
          target = targetm.gen_memset_scratch_rtx (mode);
          emit_move_insn (target, const_vec);
        }
    }
  else
    {
      target = targetm.gen_memset_scratch_rtx (mode);
      class expand_operand ops[2];
      create_output_operand (&ops[0], target, mode);
      create_input_operand (&ops[1], data, QImode);
      expand_insn (icode, 2, ops);
      if (!rtx_equal_p (target, ops[0].value))
        emit_move_insn (target, ops[0].value);
    }

> I admit that's uglier than the current version, but it looks like that's

> what the current interface expects.

>

> Thanks,

> Richard




-- 
H.J.
Pierre-Marie de Rodat via Gcc-patches July 26, 2021, 11:19 p.m. | #5
On Mon, Jul 26, 2021 at 3:56 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>

> On Mon, Jul 26, 2021 at 2:53 PM Richard Sandiford

> <richard.sandiford@arm.com> wrote:

> >

> > "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:

> > > On Mon, Jul 26, 2021 at 11:42 AM Richard Sandiford

> > > <richard.sandiford@arm.com> wrote:

> > >>

> > >> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:

> > >> > +to avoid stack realignment when expanding memset.  The default is

> > >> > +@code{gen_reg_rtx}.

> > >> > +@end deftypefn

> > >> > +

> > >> >  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})

> > >> >  This target hook returns a new value for the number of times @var{loop}

> > >> >  should be unrolled. The parameter @var{nunroll} is the number of times

> > >> > […]

> > >> > @@ -1446,7 +1511,10 @@ can_store_by_pieces (unsigned HOST_WIDE_INT len,

> > >> >        max_size = STORE_MAX_PIECES + 1;

> > >> >        while (max_size > 1 && l > 0)

> > >> >       {

> > >> > -       scalar_int_mode mode = widest_int_mode_for_size (max_size);

> > >> > +       /* Since this can be called before virtual registers are ready

> > >> > +          to use, avoid QI vector mode here.  */

> > >> > +       fixed_size_mode mode

> > >> > +         = widest_fixed_size_mode_for_size (max_size, false);

> > >>

> > >> I think I might have asked this before, sorry, but: when is that true

> > >> and why does it matter?

> > >

> > > can_store_by_pieces may be called:

> > >

> > > value-prof.c:      if (!can_store_by_pieces (val, builtin_memset_read_str,

> > > value-prof.c:      if (!can_store_by_pieces (val, builtin_memset_read_str,

> > >

> > > before virtual registers can be used.   When true is passed to

> > > widest_fixed_size_mode_for_size,  virtual registers may be used

> > > to expand memset to broadcast, which leads to ICE.   Since for the

> > > purpose of can_store_by_pieces, we don't need to expand memset

> > > to broadcast and pass false here can avoid ICE.

> >

> > Ah, I see, thanks.

> >

> > That sounds like a problem in the way that the memset const function is

> > written though.  can_store_by_pieces is just a query function, so I don't

> > think it should be trying to create new registers for can_store_by_pieces,

> > even if it could.  At the same time, can_store_by_pieces should make the

> > same choices as the real expander would.

> >

> > I think this means that:

> >

> > - gen_memset_broadcast should be inlined into its callers, with the

> >   builtin_memset_read_str getting the CONST_INT_P case and

> >   builtin_memset_gen_str getting the variable case.

> >

> > - builtin_memset_read_str should then stop at and return the

> >   gen_const_vec_duplicate when the prev argument is null.


This doesn't work since can_store_by_pieces has

                  cst = (*constfun) (constfundata, nullptr, offset, mode);
                  if (!targetm.legitimate_constant_p (mode, cst))

and ix86_legitimate_constant_p only allows 0 or -1 for CONST_VECTOR,
so can_store_by_pieces doesn't work well for vector modes.

> >   Only when prev is nonnull should it go on to call the hook

> >   and copy the constant to the register that the hook returns.

>

> How about keeping gen_memset_broadcast and passing PREV to it:

>

>   rtx target;

>   if (CONST_INT_P (data))

>     {

>       rtx const_vec = gen_const_vec_duplicate (mode, data);

>       if (prev == NULL)

>         /* Return CONST_VECTOR when called by a query function.  */

>         target = const_vec;

>       else

>         {

>           /* Use the move expander with CONST_VECTOR.  */

>           target = targetm.gen_memset_scratch_rtx (mode);

>           emit_move_insn (target, const_vec);

>         }

>     }

>   else

>     {

>       target = targetm.gen_memset_scratch_rtx (mode);

>       class expand_operand ops[2];

>       create_output_operand (&ops[0], target, mode);

>       create_input_operand (&ops[1], data, QImode);

>       expand_insn (icode, 2, ops);

>       if (!rtx_equal_p (target, ops[0].value))

>         emit_move_insn (target, ops[0].value);

>     }

>

> > I admit that's uglier than the current version, but it looks like that's

> > what the current interface expects.

> >

> > Thanks,

> > Richard

>

>

>

> --

> H.J.




-- 
H.J.
Pierre-Marie de Rodat via Gcc-patches July 27, 2021, 3:31 p.m. | #6
On Mon, Jul 26, 2021 at 4:19 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>

> On Mon, Jul 26, 2021 at 3:56 PM H.J. Lu <hjl.tools@gmail.com> wrote:

> >

> > On Mon, Jul 26, 2021 at 2:53 PM Richard Sandiford

> > <richard.sandiford@arm.com> wrote:

> > >

> > > "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:

> > > > On Mon, Jul 26, 2021 at 11:42 AM Richard Sandiford

> > > > <richard.sandiford@arm.com> wrote:

> > > >>

> > > >> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:

> > > >> > +to avoid stack realignment when expanding memset.  The default is

> > > >> > +@code{gen_reg_rtx}.

> > > >> > +@end deftypefn

> > > >> > +

> > > >> >  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})

> > > >> >  This target hook returns a new value for the number of times @var{loop}

> > > >> >  should be unrolled. The parameter @var{nunroll} is the number of times

> > > >> > […]

> > > >> > @@ -1446,7 +1511,10 @@ can_store_by_pieces (unsigned HOST_WIDE_INT len,

> > > >> >        max_size = STORE_MAX_PIECES + 1;

> > > >> >        while (max_size > 1 && l > 0)

> > > >> >       {

> > > >> > -       scalar_int_mode mode = widest_int_mode_for_size (max_size);

> > > >> > +       /* Since this can be called before virtual registers are ready

> > > >> > +          to use, avoid QI vector mode here.  */

> > > >> > +       fixed_size_mode mode

> > > >> > +         = widest_fixed_size_mode_for_size (max_size, false);

> > > >>

> > > >> I think I might have asked this before, sorry, but: when is that true

> > > >> and why does it matter?

> > > >

> > > > can_store_by_pieces may be called:

> > > >

> > > > value-prof.c:      if (!can_store_by_pieces (val, builtin_memset_read_str,

> > > > value-prof.c:      if (!can_store_by_pieces (val, builtin_memset_read_str,

> > > >

> > > > before virtual registers can be used.   When true is passed to

> > > > widest_fixed_size_mode_for_size,  virtual registers may be used

> > > > to expand memset to broadcast, which leads to ICE.   Since for the

> > > > purpose of can_store_by_pieces, we don't need to expand memset

> > > > to broadcast and pass false here can avoid ICE.

> > >

> > > Ah, I see, thanks.

> > >

> > > That sounds like a problem in the way that the memset const function is

> > > written though.  can_store_by_pieces is just a query function, so I don't

> > > think it should be trying to create new registers for can_store_by_pieces,

> > > even if it could.  At the same time, can_store_by_pieces should make the

> > > same choices as the real expander would.

> > >

> > > I think this means that:

> > >

> > > - gen_memset_broadcast should be inlined into its callers, with the

> > >   builtin_memset_read_str getting the CONST_INT_P case and

> > >   builtin_memset_gen_str getting the variable case.

> > >

> > > - builtin_memset_read_str should then stop at and return the

> > >   gen_const_vec_duplicate when the prev argument is null.

>

> This doesn't work since can_store_by_pieces has

>

>                  cst = (*constfun) (constfundata, nullptr, offset, mode);

>                   if (!targetm.legitimate_constant_p (mode, cst))


We can add a target hook, targetm.legitimate_memset_constant_p, which
defaults to targetm.legitimate_constant_p.  Would that be acceptable?

> ix86_legitimate_constant_p only allows 0 or -1 for CONST_VECTOR.

> can_store_by_pieces doesn't work well for vector modes.

>

> > >   Only when prev is nonnull should it go on to call the hook

> > >   and copy the constant to the register that the hook returns.

> >

> > How about keeping gen_memset_broadcast and passing PREV to it:

> >

> >   rtx target;

> >   if (CONST_INT_P (data))

> >     {

> >       rtx const_vec = gen_const_vec_duplicate (mode, data);

> >       if (prev == NULL)

> >         /* Return CONST_VECTOR when called by a query function.  */

> >         target = const_vec;

> >       else

> >         {

> >           /* Use the move expander with CONST_VECTOR.  */

> >           target = targetm.gen_memset_scratch_rtx (mode);

> >           emit_move_insn (target, const_vec);

> >         }

> >     }

> >   else

> >     {

> >       target = targetm.gen_memset_scratch_rtx (mode);

> >       class expand_operand ops[2];

> >       create_output_operand (&ops[0], target, mode);

> >       create_input_operand (&ops[1], data, QImode);

> >       expand_insn (icode, 2, ops);

> >       if (!rtx_equal_p (target, ops[0].value))

> >         emit_move_insn (target, ops[0].value);

> >     }

> >

> > > I admit that's uglier than the current version, but it looks like that's

> > > what the current interface expects.

> > >

> > > Thanks,

> > > Richard

> >

> >

> >

> > --

> > H.J.

>

>

>

> --

> H.J.




-- 
H.J.
Pierre-Marie de Rodat via Gcc-patches July 28, 2021, 3:20 a.m. | #7
On Tue, Jul 27, 2021 at 8:31 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Mon, Jul 26, 2021 at 4:19 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Mon, Jul 26, 2021 at 3:56 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > On Mon, Jul 26, 2021 at 2:53 PM Richard Sandiford
> > > <richard.sandiford@arm.com> wrote:
> > > >
> > > > "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> > > > > On Mon, Jul 26, 2021 at 11:42 AM Richard Sandiford
> > > > > <richard.sandiford@arm.com> wrote:
> > > > >>
> > > > >> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> > > > >> > +to avoid stack realignment when expanding memset.  The default is
> > > > >> > +@code{gen_reg_rtx}.
> > > > >> > +@end deftypefn
> > > > >> > +
> > > > >> >  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})
> > > > >> >  This target hook returns a new value for the number of times @var{loop}
> > > > >> >  should be unrolled. The parameter @var{nunroll} is the number of times
> > > > >> > […]
> > > > >> > @@ -1446,7 +1511,10 @@ can_store_by_pieces (unsigned HOST_WIDE_INT len,
> > > > >> >        max_size = STORE_MAX_PIECES + 1;
> > > > >> >        while (max_size > 1 && l > 0)
> > > > >> >       {
> > > > >> > -       scalar_int_mode mode = widest_int_mode_for_size (max_size);
> > > > >> > +       /* Since this can be called before virtual registers are ready
> > > > >> > +          to use, avoid QI vector mode here.  */
> > > > >> > +       fixed_size_mode mode
> > > > >> > +         = widest_fixed_size_mode_for_size (max_size, false);
> > > > >>
> > > > >> I think I might have asked this before, sorry, but: when is that true
> > > > >> and why does it matter?
> > > > >
> > > > > can_store_by_pieces may be called:
> > > > >
> > > > > value-prof.c:      if (!can_store_by_pieces (val, builtin_memset_read_str,
> > > > > value-prof.c:      if (!can_store_by_pieces (val, builtin_memset_read_str,
> > > > >
> > > > > before virtual registers can be used.   When true is passed to
> > > > > widest_fixed_size_mode_for_size,  virtual registers may be used
> > > > > to expand memset to broadcast, which leads to ICE.   Since for the
> > > > > purpose of can_store_by_pieces, we don't need to expand memset
> > > > > to broadcast and pass false here can avoid ICE.
> > > >
> > > > Ah, I see, thanks.
> > > >
> > > > That sounds like a problem in the way that the memset const function is
> > > > written though.  can_store_by_pieces is just a query function, so I don't
> > > > think it should be trying to create new registers for can_store_by_pieces,
> > > > even if it could.  At the same time, can_store_by_pieces should make the
> > > > same choices as the real expander would.
> > > >
> > > > I think this means that:
> > > >
> > > > - gen_memset_broadcast should be inlined into its callers, with the
> > > >   builtin_memset_read_str getting the CONST_INT_P case and
> > > >   builtin_memset_gen_str getting the variable case.
> > > >
> > > > - builtin_memset_read_str should then stop at and return the
> > > >   gen_const_vec_duplicate when the prev argument is null.

> >
> > This doesn't work since can_store_by_pieces has
> >
> >                  cst = (*constfun) (constfundata, nullptr, offset, mode);
> >                   if (!targetm.legitimate_constant_p (mode, cst))
>
> We can add a target hook, targetm.legitimate_memset_constant_p,
> which defaults to targetm.legitimate_constant_p.  Will it be acceptable?

In the v5 patch,  I changed it to

                 cst = (*constfun) (constfundata, nullptr, offset, mode);
                  /* All CONST_VECTORs are legitimate if vec_duplicate
                     is supported.  */
                  if (!((memsetp
                         && VECTOR_MODE_P (mode)
                         && GET_MODE_INNER (mode) == QImode
                         && (optab_handler (vec_duplicate_optab, mode)
                             != CODE_FOR_nothing))
                        || targetm.legitimate_constant_p (mode, cst)))
                    return 0;

> > ix86_legitimate_constant_p only allows 0 or -1 for CONST_VECTOR.
> > can_store_by_pieces doesn't work well for vector modes.
> >

> > > >   Only when prev is nonnull should it go on to call the hook
> > > >   and copy the constant to the register that the hook returns.
> > >
> > > How about keeping gen_memset_broadcast and passing PREV to it:
> > >
> > >   rtx target;
> > >   if (CONST_INT_P (data))
> > >     {
> > >       rtx const_vec = gen_const_vec_duplicate (mode, data);
> > >       if (prev == NULL)
> > >         /* Return CONST_VECTOR when called by a query function.  */
> > >         target = const_vec;
> > >       else
> > >         {
> > >           /* Use the move expander with CONST_VECTOR.  */
> > >           target = targetm.gen_memset_scratch_rtx (mode);
> > >           emit_move_insn (target, const_vec);
> > >         }
> > >     }
> > >   else
> > >     {
> > >       target = targetm.gen_memset_scratch_rtx (mode);
> > >       class expand_operand ops[2];
> > >       create_output_operand (&ops[0], target, mode);
> > >       create_input_operand (&ops[1], data, QImode);
> > >       expand_insn (icode, 2, ops);
> > >       if (!rtx_equal_p (target, ops[0].value))
> > >         emit_move_insn (target, ops[0].value);
> > >     }
> > >
> > > > I admit that's uglier than the current version, but it looks like that's
> > > > what the current interface expects.
> > > >
> > > > Thanks,
> > > > Richard
> > >
> > > --
> > > H.J.
> >
> > --
> > H.J.
>
> --
> H.J.

-- 
H.J.
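The exchange above revolves around how can_store_by_pieces walks down piece sizes: it starts from a STORE_MAX_PIECES-style cap and repeatedly picks the widest mode that still fits the remaining length. As a rough, hypothetical illustration of that decomposition (this is a conceptual sketch, not GCC's actual implementation, and `split_by_pieces` is an invented name):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the piece-size walk used by the by-pieces
   infrastructure: repeatedly pick the widest power-of-two piece that
   fits both the remaining length and the cap, then fall back to
   narrower pieces for the tail.  Illustration only, not GCC code.  */
static size_t
split_by_pieces (size_t len, size_t max_piece, size_t *pieces, size_t cap)
{
  size_t n = 0;
  size_t size = max_piece;
  while (len > 0 && size > 0)
    {
      /* Emit as many stores of this width as fit.  */
      while (len >= size && n < cap)
	{
	  pieces[n++] = size;
	  len -= size;
	}
      size >>= 1;	/* Fall back to the next narrower mode.  */
    }
  return n;
}
```

For a length of 11 bytes with a 16-byte cap this yields pieces of 8, 2, and 1 bytes, the same shape of decomposition the real loop produces as it halves max_size.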
Pierre-Marie de Rodat via Gcc-patches July 30, 2021, 9:05 a.m. | #8
"H.J. Lu" <hjl.tools@gmail.com> writes:
> On Tue, Jul 27, 2021 at 8:31 AM H.J. Lu <hjl.tools@gmail.com> wrote:

>>

>> On Mon, Jul 26, 2021 at 4:19 PM H.J. Lu <hjl.tools@gmail.com> wrote:

>> >

>> > On Mon, Jul 26, 2021 at 3:56 PM H.J. Lu <hjl.tools@gmail.com> wrote:

>> > >

>> > > On Mon, Jul 26, 2021 at 2:53 PM Richard Sandiford

>> > > <richard.sandiford@arm.com> wrote:

>> > > >

>> > > > "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:

>> > > > > On Mon, Jul 26, 2021 at 11:42 AM Richard Sandiford

>> > > > > <richard.sandiford@arm.com> wrote:

>> > > > >>

>> > > > >> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:

>> > > > >> > +to avoid stack realignment when expanding memset.  The default is

>> > > > >> > +@code{gen_reg_rtx}.

>> > > > >> > +@end deftypefn

>> > > > >> > +

>> > > > >> >  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})

>> > > > >> >  This target hook returns a new value for the number of times @var{loop}

>> > > > >> >  should be unrolled. The parameter @var{nunroll} is the number of times

>> > > > >> > […]

>> > > > >> > @@ -1446,7 +1511,10 @@ can_store_by_pieces (unsigned HOST_WIDE_INT len,

>> > > > >> >        max_size = STORE_MAX_PIECES + 1;

>> > > > >> >        while (max_size > 1 && l > 0)

>> > > > >> >       {

>> > > > >> > -       scalar_int_mode mode = widest_int_mode_for_size (max_size);

>> > > > >> > +       /* Since this can be called before virtual registers are ready

>> > > > >> > +          to use, avoid QI vector mode here.  */

>> > > > >> > +       fixed_size_mode mode

>> > > > >> > +         = widest_fixed_size_mode_for_size (max_size, false);

>> > > > >>

>> > > > >> I think I might have asked this before, sorry, but: when is that true

>> > > > >> and why does it matter?

>> > > > >

>> > > > > can_store_by_pieces may be called:

>> > > > >

>> > > > > value-prof.c:      if (!can_store_by_pieces (val, builtin_memset_read_str,

>> > > > > value-prof.c:      if (!can_store_by_pieces (val, builtin_memset_read_str,

>> > > > >

>> > > > > before virtual registers can be used.   When true is passed to

>> > > > > widest_fixed_size_mode_for_size,  virtual registers may be used

>> > > > > to expand memset to broadcast, which leads to ICE.   Since for the

>> > > > > purpose of can_store_by_pieces, we don't need to expand memset

>> > > > > to broadcast and pass false here can avoid ICE.

>> > > >

>> > > > Ah, I see, thanks.

>> > > >

>> > > > That sounds like a problem in the way that the memset const function is

>> > > > written though.  can_store_by_pieces is just a query function, so I don't

>> > > > think it should be trying to create new registers for can_store_by_pieces,

>> > > > even if it could.  At the same time, can_store_by_pieces should make the

>> > > > same choices as the real expander would.

>> > > >

>> > > > I think this means that:

>> > > >

>> > > > - gen_memset_broadcast should be inlined into its callers, with the

>> > > >   builtin_memset_read_str getting the CONST_INT_P case and

>> > > >   builtin_memset_gen_str getting the variable case.

>> > > >

>> > > > - builtin_memset_read_str should then stop at and return the

>> > > >   gen_const_vec_duplicate when the prev argument is null.

>> >

>> > This doesn't work since can_store_by_pieces has

>> >

>> >                  cst = (*constfun) (constfundata, nullptr, offset, mode);

>> >                   if (!targetm.legitimate_constant_p (mode, cst))

>>

>> We can add a target hook, targetm.legitimate_memset_constant_p,

>> which defaults to targetm.legitimate_constant_p.  Will it be acceptable?

>

> In the v5 patch,  I changed it to
>
>                  cst = (*constfun) (constfundata, nullptr, offset, mode);
>                   /* All CONST_VECTORs are legitimate if vec_duplicate
>                      is supported.  */

Maybe “can be loaded” rather than “are legitimate”, since they're
not necessarily legitimate in the sense of legitimate_constant_p
(hence the patch).  Also, since we assume elsewhere that
vec_duplicate is a precondition for picking a vector mode,
I think we should do the same here (and note that in the comment).
So…

>                   if (!((memsetp
>                          && VECTOR_MODE_P (mode)
>                          && GET_MODE_INNER (mode) == QImode
>                          && (optab_handler (vec_duplicate_optab, mode)
>                              != CODE_FOR_nothing))

I think we need only the (memsetp && VECTOR_MODE_P (mode)) check.

This feels a bit of a hack TBH.  I think the same principles apply
to vectors and integers here: forcing the constant to memory is
still likely to be an optimisation, but is an extra overhead that
we should probably account for.

However, I agree this is probably the most practical way forward
at the moment.

Thanks,
Richard

>                         || targetm.legitimate_constant_p (mode, cst)))
>                     return 0;
>
>> > ix86_legitimate_constant_p only allows 0 or -1 for CONST_VECTOR.
>> > can_store_by_pieces doesn't work well for vector modes.
>> >
>> > > >   Only when prev is nonnull should it go on to call the hook
>> > > >   and copy the constant to the register that the hook returns.
>> > >
>> > > How about keeping gen_memset_broadcast and passing PREV to it:
>> > >
>> > >   rtx target;
>> > >   if (CONST_INT_P (data))
>> > >     {
>> > >       rtx const_vec = gen_const_vec_duplicate (mode, data);
>> > >       if (prev == NULL)
>> > >         /* Return CONST_VECTOR when called by a query function.  */
>> > >         target = const_vec;
>> > >       else
>> > >         {
>> > >           /* Use the move expander with CONST_VECTOR.  */
>> > >           target = targetm.gen_memset_scratch_rtx (mode);
>> > >           emit_move_insn (target, const_vec);
>> > >         }
>> > >     }
>> > >   else
>> > >     {
>> > >       target = targetm.gen_memset_scratch_rtx (mode);
>> > >       class expand_operand ops[2];
>> > >       create_output_operand (&ops[0], target, mode);
>> > >       create_input_operand (&ops[1], data, QImode);
>> > >       expand_insn (icode, 2, ops);
>> > >       if (!rtx_equal_p (target, ops[0].value))
>> > >         emit_move_insn (target, ops[0].value);
>> > >     }
>> > >
>> > > > I admit that's uglier than the current version, but it looks like that's
>> > > > what the current interface expects.
>> > > >
>> > > > Thanks,
>> > > > Richard
>> > >
>> > > --
>> > > H.J.
>> >
>> > --
>> > H.J.
>>
>> --
>> H.J.
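The constant-legitimacy discussion above hinges on what the memset constfn actually produces: the single fill byte replicated across the whole piece, whether by integer shifts for scalar modes or by gen_const_vec_duplicate for a QImode vector. As a hypothetical scalar sketch of that broadcast (illustration only, not the builtin_memset_read_str implementation; `broadcast_byte` is an invented name):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of what a memset constfn produces for an integer piece:
   the fill byte replicated across every byte of the mode, analogous
   to what gen_const_vec_duplicate builds for a QImode vector.
   Illustration only, not GCC code.  */
static uint64_t
broadcast_byte (uint8_t c, unsigned piece_bytes)
{
  uint64_t v = 0;
  for (unsigned i = 0; i < piece_bytes; i++)
    v = (v << 8) | c;	/* Append one more copy of the byte.  */
  return v;
}
```

For example, broadcasting 0xAB into an 8-byte piece gives 0xABABABABABABABAB, which is the constant the by-pieces store then writes in one instruction.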
Pierre-Marie de Rodat via Gcc-patches July 30, 2021, 9:06 a.m. | #9
"H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> On Mon, Jul 26, 2021 at 2:53 PM Richard Sandiford

> <richard.sandiford@arm.com> wrote:

>>

>> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:

>> > On Mon, Jul 26, 2021 at 11:42 AM Richard Sandiford

>> > <richard.sandiford@arm.com> wrote:

>> >>

>> >> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:

>> >> > +to avoid stack realignment when expanding memset.  The default is

>> >> > +@code{gen_reg_rtx}.

>> >> > +@end deftypefn

>> >> > +

>> >> >  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})

>> >> >  This target hook returns a new value for the number of times @var{loop}

>> >> >  should be unrolled. The parameter @var{nunroll} is the number of times

>> >> > […]

>> >> > @@ -1446,7 +1511,10 @@ can_store_by_pieces (unsigned HOST_WIDE_INT len,

>> >> >        max_size = STORE_MAX_PIECES + 1;

>> >> >        while (max_size > 1 && l > 0)

>> >> >       {

>> >> > -       scalar_int_mode mode = widest_int_mode_for_size (max_size);

>> >> > +       /* Since this can be called before virtual registers are ready

>> >> > +          to use, avoid QI vector mode here.  */

>> >> > +       fixed_size_mode mode

>> >> > +         = widest_fixed_size_mode_for_size (max_size, false);

>> >>

>> >> I think I might have asked this before, sorry, but: when is that true

>> >> and why does it matter?

>> >

>> > can_store_by_pieces may be called:

>> >

>> > value-prof.c:      if (!can_store_by_pieces (val, builtin_memset_read_str,

>> > value-prof.c:      if (!can_store_by_pieces (val, builtin_memset_read_str,

>> >

>> > before virtual registers can be used.   When true is passed to

>> > widest_fixed_size_mode_for_size,  virtual registers may be used

>> > to expand memset to broadcast, which leads to ICE.   Since for the

>> > purpose of can_store_by_pieces, we don't need to expand memset

>> > to broadcast and pass false here can avoid ICE.

>>

>> Ah, I see, thanks.

>>

>> That sounds like a problem in the way that the memset const function is

>> written though.  can_store_by_pieces is just a query function, so I don't

>> think it should be trying to create new registers for can_store_by_pieces,

>> even if it could.  At the same time, can_store_by_pieces should make the

>> same choices as the real expander would.

>>

>> I think this means that:

>>

>> - gen_memset_broadcast should be inlined into its callers, with the

>>   builtin_memset_read_str getting the CONST_INT_P case and

>>   builtin_memset_gen_str getting the variable case.

>>

>> - builtin_memset_read_str should then stop at and return the

>>   gen_const_vec_duplicate when the prev argument is null.

>>   Only when prev is nonnull should it go on to call the hook

>>   and copy the constant to the register that the hook returns.

>

> How about keeping gen_memset_broadcast and passing PREV to it:
>
>   rtx target;
>   if (CONST_INT_P (data))
>     {
>       rtx const_vec = gen_const_vec_duplicate (mode, data);
>       if (prev == NULL)
>         /* Return CONST_VECTOR when called by a query function.  */
>         target = const_vec;
>       else
>         {
>           /* Use the move expander with CONST_VECTOR.  */
>           target = targetm.gen_memset_scratch_rtx (mode);
>           emit_move_insn (target, const_vec);
>         }
>     }
>   else
>     {
>       target = targetm.gen_memset_scratch_rtx (mode);
>       class expand_operand ops[2];
>       create_output_operand (&ops[0], target, mode);
>       create_input_operand (&ops[1], data, QImode);
>       expand_insn (icode, 2, ops);
>       if (!rtx_equal_p (target, ops[0].value))
>         emit_move_insn (target, ops[0].value);
>     }

TBH I think that complicates the interface too much.  The constant
and non-constant cases are now very different.

Thanks,
Richard
Pierre-Marie de Rodat via Gcc-patches July 30, 2021, 12:38 p.m. | #10
On Fri, Jul 30, 2021 at 2:05 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>

> "H.J. Lu" <hjl.tools@gmail.com> writes:

> > On Tue, Jul 27, 2021 at 8:31 AM H.J. Lu <hjl.tools@gmail.com> wrote:

> >>

> >> On Mon, Jul 26, 2021 at 4:19 PM H.J. Lu <hjl.tools@gmail.com> wrote:

> >> >

> >> > On Mon, Jul 26, 2021 at 3:56 PM H.J. Lu <hjl.tools@gmail.com> wrote:

> >> > >

> >> > > On Mon, Jul 26, 2021 at 2:53 PM Richard Sandiford

> >> > > <richard.sandiford@arm.com> wrote:

> >> > > >

> >> > > > "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:

> >> > > > > On Mon, Jul 26, 2021 at 11:42 AM Richard Sandiford

> >> > > > > <richard.sandiford@arm.com> wrote:

> >> > > > >>

> >> > > > >> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:

> >> > > > >> > +to avoid stack realignment when expanding memset.  The default is

> >> > > > >> > +@code{gen_reg_rtx}.

> >> > > > >> > +@end deftypefn

> >> > > > >> > +

> >> > > > >> >  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})

> >> > > > >> >  This target hook returns a new value for the number of times @var{loop}

> >> > > > >> >  should be unrolled. The parameter @var{nunroll} is the number of times

> >> > > > >> > […]

> >> > > > >> > @@ -1446,7 +1511,10 @@ can_store_by_pieces (unsigned HOST_WIDE_INT len,

> >> > > > >> >        max_size = STORE_MAX_PIECES + 1;

> >> > > > >> >        while (max_size > 1 && l > 0)

> >> > > > >> >       {

> >> > > > >> > -       scalar_int_mode mode = widest_int_mode_for_size (max_size);

> >> > > > >> > +       /* Since this can be called before virtual registers are ready

> >> > > > >> > +          to use, avoid QI vector mode here.  */

> >> > > > >> > +       fixed_size_mode mode

> >> > > > >> > +         = widest_fixed_size_mode_for_size (max_size, false);

> >> > > > >>

> >> > > > >> I think I might have asked this before, sorry, but: when is that true

> >> > > > >> and why does it matter?

> >> > > > >

> >> > > > > can_store_by_pieces may be called:

> >> > > > >

> >> > > > > value-prof.c:      if (!can_store_by_pieces (val, builtin_memset_read_str,

> >> > > > > value-prof.c:      if (!can_store_by_pieces (val, builtin_memset_read_str,

> >> > > > >

> >> > > > > before virtual registers can be used.   When true is passed to

> >> > > > > widest_fixed_size_mode_for_size,  virtual registers may be used

> >> > > > > to expand memset to broadcast, which leads to ICE.   Since for the

> >> > > > > purpose of can_store_by_pieces, we don't need to expand memset

> >> > > > > to broadcast and pass false here can avoid ICE.

> >> > > >

> >> > > > Ah, I see, thanks.

> >> > > >

> >> > > > That sounds like a problem in the way that the memset const function is

> >> > > > written though.  can_store_by_pieces is just a query function, so I don't

> >> > > > think it should be trying to create new registers for can_store_by_pieces,

> >> > > > even if it could.  At the same time, can_store_by_pieces should make the

> >> > > > same choices as the real expander would.

> >> > > >

> >> > > > I think this means that:

> >> > > >

> >> > > > - gen_memset_broadcast should be inlined into its callers, with the

> >> > > >   builtin_memset_read_str getting the CONST_INT_P case and

> >> > > >   builtin_memset_gen_str getting the variable case.

> >> > > >

> >> > > > - builtin_memset_read_str should then stop at and return the

> >> > > >   gen_const_vec_duplicate when the prev argument is null.

> >> >

> >> > This doesn't work since can_store_by_pieces has

> >> >

> >> >                  cst = (*constfun) (constfundata, nullptr, offset, mode);

> >> >                   if (!targetm.legitimate_constant_p (mode, cst))

> >>

> >> We can add a target hook, targetm.legitimate_memset_constant_p,

> >> which defaults to targetm.legitimate_constant_p.  Will it be acceptable?

> >

> > In the v5 patch,  I changed it to
> >
> >                  cst = (*constfun) (constfundata, nullptr, offset, mode);
> >                   /* All CONST_VECTORs are legitimate if vec_duplicate
> >                      is supported.  */
>
> Maybe “can be loaded” rather than “are legitimate”, since they're

Fixed.

> not necessarily legitimate in the sense of legitimate_constant_p
> (hence the patch).  Also, since we assume elsewhere that
> vec_duplicate is a precondition for picking a vector mode,
> I think we should do the same here (and note that in the comment).

Fixed.

> So…
>
> >                   if (!((memsetp
> >                          && VECTOR_MODE_P (mode)
> >                          && GET_MODE_INNER (mode) == QImode
> >                          && (optab_handler (vec_duplicate_optab, mode)
> >                              != CODE_FOR_nothing))
>
> I think we need only the (memsetp && VECTOR_MODE_P (mode)) check.
>
> This feels a bit of a hack TBH.  I think the same principles apply
> to vectors and integers here: forcing the constant to memory is
> still likely to be an optimisation, but is an extra overhead that
> we should probably account for.

Fixed.

> However, I agree this is probably the most practical way forward
> at the moment.

I sent out the v6 patch.

Thanks.

> Thanks,
> Richard
>
> >                         || targetm.legitimate_constant_p (mode, cst)))
> >                     return 0;
> >
> >> > ix86_legitimate_constant_p only allows 0 or -1 for CONST_VECTOR.
> >> > can_store_by_pieces doesn't work well for vector modes.
> >> >
> >> > > >   Only when prev is nonnull should it go on to call the hook
> >> > > >   and copy the constant to the register that the hook returns.
> >> > >
> >> > > How about keeping gen_memset_broadcast and passing PREV to it:
> >> > >
> >> > >   rtx target;
> >> > >   if (CONST_INT_P (data))
> >> > >     {
> >> > >       rtx const_vec = gen_const_vec_duplicate (mode, data);
> >> > >       if (prev == NULL)
> >> > >         /* Return CONST_VECTOR when called by a query function.  */
> >> > >         target = const_vec;
> >> > >       else
> >> > >         {
> >> > >           /* Use the move expander with CONST_VECTOR.  */
> >> > >           target = targetm.gen_memset_scratch_rtx (mode);
> >> > >           emit_move_insn (target, const_vec);
> >> > >         }
> >> > >     }
> >> > >   else
> >> > >     {
> >> > >       target = targetm.gen_memset_scratch_rtx (mode);
> >> > >       class expand_operand ops[2];
> >> > >       create_output_operand (&ops[0], target, mode);
> >> > >       create_input_operand (&ops[1], data, QImode);
> >> > >       expand_insn (icode, 2, ops);
> >> > >       if (!rtx_equal_p (target, ops[0].value))
> >> > >         emit_move_insn (target, ops[0].value);
> >> > >     }
> >> > >
> >> > > > I admit that's uglier than the current version, but it looks like that's
> >> > > > what the current interface expects.
> >> > > >
> >> > > > Thanks,
> >> > > > Richard
> >> > >
> >> > > --
> >> > > H.J.
> >> >
> >> > --
> >> > H.J.
> >>
> >> --
> >> H.J.

-- 
H.J.
Pierre-Marie de Rodat via Gcc-patches July 30, 2021, 12:39 p.m. | #11
On Fri, Jul 30, 2021 at 2:06 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>

> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:

> > On Mon, Jul 26, 2021 at 2:53 PM Richard Sandiford

> > <richard.sandiford@arm.com> wrote:

> >>

> >> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:

> >> > On Mon, Jul 26, 2021 at 11:42 AM Richard Sandiford

> >> > <richard.sandiford@arm.com> wrote:

> >> >>

> >> >> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:

> >> >> > +to avoid stack realignment when expanding memset.  The default is

> >> >> > +@code{gen_reg_rtx}.

> >> >> > +@end deftypefn

> >> >> > +

> >> >> >  @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})

> >> >> >  This target hook returns a new value for the number of times @var{loop}

> >> >> >  should be unrolled. The parameter @var{nunroll} is the number of times

> >> >> > […]

> >> >> > @@ -1446,7 +1511,10 @@ can_store_by_pieces (unsigned HOST_WIDE_INT len,

> >> >> >        max_size = STORE_MAX_PIECES + 1;

> >> >> >        while (max_size > 1 && l > 0)

> >> >> >       {

> >> >> > -       scalar_int_mode mode = widest_int_mode_for_size (max_size);

> >> >> > +       /* Since this can be called before virtual registers are ready

> >> >> > +          to use, avoid QI vector mode here.  */

> >> >> > +       fixed_size_mode mode

> >> >> > +         = widest_fixed_size_mode_for_size (max_size, false);

> >> >>

> >> >> I think I might have asked this before, sorry, but: when is that true

> >> >> and why does it matter?

> >> >

> >> > can_store_by_pieces may be called:

> >> >

> >> > value-prof.c:      if (!can_store_by_pieces (val, builtin_memset_read_str,

> >> > value-prof.c:      if (!can_store_by_pieces (val, builtin_memset_read_str,

> >> >

> >> > before virtual registers can be used.   When true is passed to

> >> > widest_fixed_size_mode_for_size,  virtual registers may be used

> >> > to expand memset to broadcast, which leads to ICE.   Since for the

> >> > purpose of can_store_by_pieces, we don't need to expand memset

> >> > to broadcast and pass false here can avoid ICE.

> >>

> >> Ah, I see, thanks.

> >>

> >> That sounds like a problem in the way that the memset const function is

> >> written though.  can_store_by_pieces is just a query function, so I don't

> >> think it should be trying to create new registers for can_store_by_pieces,

> >> even if it could.  At the same time, can_store_by_pieces should make the

> >> same choices as the real expander would.

> >>

> >> I think this means that:

> >>

> >> - gen_memset_broadcast should be inlined into its callers, with the

> >>   builtin_memset_read_str getting the CONST_INT_P case and

> >>   builtin_memset_gen_str getting the variable case.

> >>

> >> - builtin_memset_read_str should then stop at and return the

> >>   gen_const_vec_duplicate when the prev argument is null.

> >>   Only when prev is nonnull should it go on to call the hook

> >>   and copy the constant to the register that the hook returns.

> >

> > How about keeping gen_memset_broadcast and passing PREV to it:
> >
> >   rtx target;
> >   if (CONST_INT_P (data))
> >     {
> >       rtx const_vec = gen_const_vec_duplicate (mode, data);
> >       if (prev == NULL)
> >         /* Return CONST_VECTOR when called by a query function.  */
> >         target = const_vec;
> >       else
> >         {
> >           /* Use the move expander with CONST_VECTOR.  */
> >           target = targetm.gen_memset_scratch_rtx (mode);
> >           emit_move_insn (target, const_vec);
> >         }
> >     }
> >   else
> >     {
> >       target = targetm.gen_memset_scratch_rtx (mode);
> >       class expand_operand ops[2];
> >       create_output_operand (&ops[0], target, mode);
> >       create_input_operand (&ops[1], data, QImode);
> >       expand_insn (icode, 2, ops);
> >       if (!rtx_equal_p (target, ops[0].value))
> >         emit_move_insn (target, ops[0].value);
> >     }
>
> TBH I think that complicates the interface too much.  The constant
> and non-constant cases are now very different.

I inlined gen_memset_broadcast in the v6 patch.

Thanks.

-- 
H.J.

Patch

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 170d776c410..26360b0b11b 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -3890,13 +3890,16 @@  expand_builtin_strnlen (tree exp, rtx target, machine_mode target_mode)
 
 static rtx
 builtin_memcpy_read_str (void *data, void *, HOST_WIDE_INT offset,
-			 scalar_int_mode mode)
+			 fixed_size_mode mode)
 {
   /* The REPresentation pointed to by DATA need not be a nul-terminated
      string but the caller guarantees it's large enough for MODE.  */
   const char *rep = (const char *) data;
 
-  return c_readstr (rep + offset, mode, /*nul_terminated=*/false);
+  /* The by-pieces infrastructure does not try to pick a vector mode
+     for memcpy expansion.  */
+  return c_readstr (rep + offset, as_a <scalar_int_mode> (mode),
+		    /*nul_terminated=*/false);
 }
 
 /* LEN specify length of the block of memcpy/memset operation.
@@ -6547,14 +6550,16 @@  expand_builtin_stpncpy (tree exp, rtx)
 
 rtx
 builtin_strncpy_read_str (void *data, void *, HOST_WIDE_INT offset,
-			  scalar_int_mode mode)
+			  fixed_size_mode mode)
 {
   const char *str = (const char *) data;
 
   if ((unsigned HOST_WIDE_INT) offset > strlen (str))
     return const0_rtx;
 
-  return c_readstr (str + offset, mode);
+  /* The by-pieces infrastructure does not try to pick a vector mode
+     for strncpy expansion.  */
+  return c_readstr (str + offset, as_a <scalar_int_mode> (mode));
 }
 
 /* Helper to check the sizes of sequences and the destination of calls
@@ -6755,30 +6760,147 @@  expand_builtin_strncpy (tree exp, rtx target)
   return NULL_RTX;
 }
 
-/* Callback routine for store_by_pieces.  Read GET_MODE_BITSIZE (MODE)
-   bytes from constant string DATA + OFFSET and return it as target
-   constant.  If PREV isn't nullptr, it has the RTL info from the
+/* Return the RTL of a register in MODE generated from PREV in the
    previous iteration.  */
 
-rtx
-builtin_memset_read_str (void *data, void *prevp,
-			 HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
-			 scalar_int_mode mode)
+static rtx
+gen_memset_value_from_prev (by_pieces_prev *prev, fixed_size_mode mode)
 {
-  by_pieces_prev *prev = (by_pieces_prev *) prevp;
+  rtx target = nullptr;
   if (prev != nullptr && prev->data != nullptr)
     {
       /* Use the previous data in the same mode.  */
       if (prev->mode == mode)
 	return prev->data;
+
+      fixed_size_mode prev_mode = prev->mode;
+
+      /* Don't use the previous data to write QImode if it is in a
+	 vector mode.  */
+      if (VECTOR_MODE_P (prev_mode) && mode == QImode)
+	return target;
+
+      rtx prev_rtx = prev->data;
+
+      if (REG_P (prev_rtx)
+	  && HARD_REGISTER_P (prev_rtx)
+	  && lowpart_subreg_regno (REGNO (prev_rtx), prev_mode, mode) < 0)
+	{
+	  /* This case occurs when PREV_MODE is a vector and when
+	     MODE is too small to store using vector operations.
+	     After register allocation, the code will need to move the
+	     lowpart of the vector register into a non-vector register.
+
+	     Also, the target has chosen to use a hard register
+	     instead of going with the default choice of using a
+	     pseudo register.  We should respect that choice and try to
+	     avoid creating a pseudo register with the same mode as the
+	     current hard register.
+
+	     In principle, we could just use a lowpart MODE subreg of
+	     the vector register.  However, the vector register mode might
+	     be too wide for non-vector registers, and we already know
+	     that the non-vector mode is too small for vector registers.
+	     It's therefore likely that we'd need to spill to memory in
+	     the vector mode and reload the non-vector value from there.
+
+	     Try to avoid that by reducing the vector register to the
+	     smallest size that it can hold.  This should increase the
+	     chances that non-vector registers can hold both the inner
+	     and outer modes of the subreg that we generate later.  */
+	  machine_mode m;
+	  fixed_size_mode candidate;
+	  FOR_EACH_MODE_IN_CLASS (m, GET_MODE_CLASS (mode))
+	    if (is_a<fixed_size_mode> (m, &candidate))
+	      {
+		if (GET_MODE_SIZE (candidate)
+		    >= GET_MODE_SIZE (prev_mode))
+		  break;
+		if (GET_MODE_SIZE (candidate) >= GET_MODE_SIZE (mode)
+		    && lowpart_subreg_regno (REGNO (prev_rtx),
+					     prev_mode, candidate) >= 0)
+		  {
+		    target = lowpart_subreg (candidate, prev_rtx,
+					     prev_mode);
+		    prev_rtx = target;
+		    prev_mode = candidate;
+		    break;
+		  }
+	      }
+	  if (target == nullptr)
+	    prev_rtx = copy_to_reg (prev_rtx);
+	}
+
+      target = lowpart_subreg (mode, prev_rtx, prev_mode);
     }
+  return target;
+}
+
+/* Return the RTL of a register in MODE broadcast from DATA.  */
+
+static rtx
+gen_memset_broadcast (rtx data, fixed_size_mode mode)
+{
+  /* Skip if MODE is not a vector mode.  */
+  if (!VECTOR_MODE_P (mode))
+    return nullptr;
 
+  gcc_assert (GET_MODE_INNER (mode) == QImode);
+
+  enum insn_code icode = optab_handler (vec_duplicate_optab, mode);
+
+  /* Support for vec_duplicate_optab is a precondition for picking a
+     vector mode for the memset expander.  */
+  gcc_assert (icode != CODE_FOR_nothing);
+
+  rtx target = targetm.gen_memset_scratch_rtx (mode);
+  if (CONST_INT_P (data))
+    {
+      /* Use the move expander with CONST_VECTOR.  */
+      rtx const_vec = gen_const_vec_duplicate (mode, data);
+      emit_move_insn (target, const_vec);
+    }
+  else
+    {
+      class expand_operand ops[2];
+      create_output_operand (&ops[0], target, mode);
+      create_input_operand (&ops[1], data, QImode);
+      expand_insn (icode, 2, ops);
+      if (!rtx_equal_p (target, ops[0].value))
+	emit_move_insn (target, ops[0].value);
+    }
+
+  return target;
+}
+
+/* Callback routine for store_by_pieces.  Read GET_MODE_BITSIZE (MODE)
+   bytes from constant string DATA + OFFSET and return it as target
+   constant.  If PREV isn't nullptr, it has the RTL info from the
+   previous iteration.  */
+
+rtx
+builtin_memset_read_str (void *data, void *prev,
+			 HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
+			 fixed_size_mode mode)
+{
   const char *c = (const char *) data;
-  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
+  unsigned int size = GET_MODE_SIZE (mode);
 
-  memset (p, *c, GET_MODE_SIZE (mode));
+  rtx target = gen_memset_value_from_prev ((by_pieces_prev *) prev,
+					   mode);
+  if (target != nullptr)
+    return target;
+  rtx src = gen_int_mode (*c, QImode);
+  target = gen_memset_broadcast (src, mode);
+  if (target != nullptr)
+    return target;
 
-  return c_readstr (p, mode);
+  char *p = XALLOCAVEC (char, size);
+
+  memset (p, *c, size);
+
+  /* Vector modes should have been handled by gen_memset_broadcast above.  */
+  return c_readstr (p, as_a <scalar_int_mode> (mode));
 }
 
 /* Callback routine for store_by_pieces.  Return the RTL of a register
@@ -6788,33 +6910,29 @@  builtin_memset_read_str (void *data, void *prevp,
    nullptr, it has the RTL info from the previous iteration.  */
 
 static rtx
-builtin_memset_gen_str (void *data, void *prevp,
+builtin_memset_gen_str (void *data, void *prev,
 			HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
-			scalar_int_mode mode)
+			fixed_size_mode mode)
 {
   rtx target, coeff;
   size_t size;
   char *p;
 
-  by_pieces_prev *prev = (by_pieces_prev *) prevp;
-  if (prev != nullptr && prev->data != nullptr)
-    {
-      /* Use the previous data in the same mode.  */
-      if (prev->mode == mode)
-	return prev->data;
-
-      target = simplify_gen_subreg (mode, prev->data, prev->mode, 0);
-      if (target != nullptr)
-	return target;
-    }
-
   size = GET_MODE_SIZE (mode);
   if (size == 1)
     return (rtx) data;
 
+  target = gen_memset_value_from_prev ((by_pieces_prev *) prev, mode);
+  if (target != nullptr)
+    return target;
+
+  target = gen_memset_broadcast ((rtx) data, mode);
+  if (target != nullptr)
+    return target;
+
   p = XALLOCAVEC (char, size);
   memset (p, 1, size);
-  coeff = c_readstr (p, mode);
+  coeff = c_readstr (p, as_a <scalar_int_mode> (mode));
 
   target = convert_to_mode (mode, (rtx) data, 1);
   target = expand_mult (mode, target, coeff, NULL_RTX, 1);
@@ -6918,7 +7036,7 @@  try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
 			    &valc, align, true))
     return false;
 
-  rtx (*constfun) (void *, void *, HOST_WIDE_INT, scalar_int_mode);
+  by_pieces_constfn constfun;
   void *constfundata;
   if (val)
     {
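
As context for the scalar fallback kept in builtin_memset_gen_str: the
non-vector path still broadcasts a byte by multiplying the value with a
coefficient whose every byte is 1 (the c_readstr of a memset (p, 1, size)
buffer).  A standalone C model of that trick (hypothetical names, not GCC
internals):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the scalar fallback in builtin_memset_gen_str: memset (p, 1,
   size) followed by c_readstr yields a coefficient with every byte set
   to 1, and multiplying the byte value by it replicates the byte across
   the wider integer mode.  */
static uint64_t
broadcast_byte (uint8_t c, unsigned size_in_bytes)
{
  uint64_t coeff = 0;
  for (unsigned i = 0; i < size_in_bytes; i++)
    coeff = (coeff << 8) | 1;	/* 0x01...01, one 1 per byte.  */
  return (uint64_t) c * coeff;
}
```

For example, broadcast_byte (0xab, 4) produces 0xabababab, matching what
the expand_mult call in the patched function computes for SImode.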
diff --git a/gcc/builtins.h b/gcc/builtins.h
index a64ece3f1cd..4b2ad766c61 100644
--- a/gcc/builtins.h
+++ b/gcc/builtins.h
@@ -111,9 +111,9 @@  extern tree mathfn_built_in (tree, enum built_in_function fn);
 extern tree mathfn_built_in (tree, combined_fn);
 extern tree mathfn_built_in_type (combined_fn);
 extern rtx builtin_strncpy_read_str (void *, void *, HOST_WIDE_INT,
-				     scalar_int_mode);
+				     fixed_size_mode);
 extern rtx builtin_memset_read_str (void *, void *, HOST_WIDE_INT,
-				    scalar_int_mode);
+				    fixed_size_mode);
 extern rtx expand_builtin_saveregs (void);
 extern tree std_build_builtin_va_list (void);
 extern tree std_fn_abi_va_list (tree);
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index c8f4abe3e41..14d387750a8 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -12149,6 +12149,13 @@  This function prepares to emit a conditional comparison within a sequence
  @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the compares.
 @end deftypefn
 
+@deftypefn {Target Hook} rtx TARGET_GEN_MEMSET_SCRATCH_RTX (machine_mode @var{mode})
+This hook should return an rtx for a scratch register in @var{mode} to
+be used by memset broadcast.  The backend can use a hard scratch
+register to avoid stack realignment when expanding memset.  The default
+is @code{gen_reg_rtx}.
+@end deftypefn
+
 @deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST (unsigned @var{nunroll}, class loop *@var{loop})
 This target hook returns a new value for the number of times @var{loop}
 should be unrolled. The parameter @var{nunroll} is the number of times
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 9c4b5016053..923b1009d09 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -7984,6 +7984,8 @@  lists.
 
 @hook TARGET_GEN_CCMP_NEXT
 
+@hook TARGET_GEN_MEMSET_SCRATCH_RTX
+
 @hook TARGET_LOOP_UNROLL_ADJUST
 
 @defmac POWI_MAX_MULTS
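
The expr.c changes generalize the by-pieces chunking loop from integer
modes to fixed-size (possibly QI vector) modes.  The overall strategy,
including the single overlapping tail store used when
targetm.overlap_op_by_pieces_p () holds, can be sketched as a byte-level
C model (hypothetical helper; byte sizes stand in for machine modes):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical byte-level model of op_by_pieces_d::run for memset.
   PIECE_MAX stands for the widest usable mode size (e.g. 16 for V16QI).
   The main loop issues stores in the widest piece; the tail is covered
   by one overlapping store of the smallest power-of-two piece not
   narrower than the remainder, mirroring
   smallest_fixed_size_mode_for_size plus the "gap" adjustment.  */
static void
set_by_pieces (uint8_t *dst, uint8_t c, size_t len, size_t piece_max)
{
  size_t off = 0;
  while (len - off >= piece_max)
    {
      memset (dst + off, c, piece_max);	/* One store in the wide mode.  */
      off += piece_max;
    }
  size_t rem = len - off;
  if (rem > 0)
    {
      size_t size = 1;
      while (size < rem)
	size <<= 1;
      if (size > len)			/* Cannot overlap before DST.  */
	size = rem;
      memset (dst + len - size, c, size);  /* Overlapping tail store.  */
    }
}
```

For len = 13 and piece_max = 8 this emits one 8-byte store at offset 0
and one 8-byte store at offset 5, overlapping bytes 5..7 rather than
splitting the tail into 4 + 1 byte stores.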
diff --git a/gcc/expr.c b/gcc/expr.c
index 6a4368113c4..238e78f5c4c 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -769,15 +769,36 @@  alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
   return align;
 }
 
-/* Return the widest integer mode that is narrower than SIZE bytes.  */
+/* Return the widest fixed-size mode narrower than SIZE bytes: a QI
+   vector mode if QI_VECTOR is true, otherwise an integer mode.  */
 
-static scalar_int_mode
-widest_int_mode_for_size (unsigned int size)
+static fixed_size_mode
+widest_fixed_size_mode_for_size (unsigned int size, bool qi_vector)
 {
-  scalar_int_mode result = NARROWEST_INT_MODE;
+  fixed_size_mode result = NARROWEST_INT_MODE;
 
   gcc_checking_assert (size > 1);
 
+  /* Use a QI vector mode only if SIZE is wider than a word.  */
+  if (qi_vector && size > UNITS_PER_WORD)
+    {
+      machine_mode mode;
+      fixed_size_mode candidate;
+      FOR_EACH_MODE_IN_CLASS (mode, MODE_VECTOR_INT)
+	if (is_a<fixed_size_mode> (mode, &candidate)
+	    && GET_MODE_INNER (candidate) == QImode)
+	  {
+	    if (GET_MODE_SIZE (candidate) >= size)
+	      break;
+	    if (optab_handler (vec_duplicate_optab, candidate)
+		!= CODE_FOR_nothing)
+	      result = candidate;
+	  }
+
+      if (result != NARROWEST_INT_MODE)
+	return result;
+    }
+
   opt_scalar_int_mode tmode;
   FOR_EACH_MODE_IN_CLASS (tmode, MODE_INT)
     if (GET_MODE_SIZE (tmode.require ()) < size)
@@ -815,13 +836,14 @@  by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
 		  unsigned int max_size, by_pieces_operation op)
 {
   unsigned HOST_WIDE_INT n_insns = 0;
-  scalar_int_mode mode;
+  fixed_size_mode mode;
 
   if (targetm.overlap_op_by_pieces_p () && op != COMPARE_BY_PIECES)
     {
       /* NB: Round up L and ALIGN to the widest integer mode for
 	 MAX_SIZE.  */
-      mode = widest_int_mode_for_size (max_size);
+      mode = widest_fixed_size_mode_for_size (max_size,
+					      op == SET_BY_PIECES);
       if (optab_handler (mov_optab, mode) != CODE_FOR_nothing)
 	{
 	  unsigned HOST_WIDE_INT up = ROUND_UP (l, GET_MODE_SIZE (mode));
@@ -835,7 +857,8 @@  by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
 
   while (max_size > 1 && l > 0)
     {
-      mode = widest_int_mode_for_size (max_size);
+      mode = widest_fixed_size_mode_for_size (max_size,
+					      op == SET_BY_PIECES);
       enum insn_code icode;
 
       unsigned int modesize = GET_MODE_SIZE (mode);
@@ -903,8 +926,7 @@  class pieces_addr
   void *m_cfndata;
 public:
   pieces_addr (rtx, bool, by_pieces_constfn, void *);
-  rtx adjust (scalar_int_mode, HOST_WIDE_INT,
-	      by_pieces_prev * = nullptr);
+  rtx adjust (fixed_size_mode, HOST_WIDE_INT, by_pieces_prev * = nullptr);
   void increment_address (HOST_WIDE_INT);
   void maybe_predec (HOST_WIDE_INT);
   void maybe_postinc (HOST_WIDE_INT);
@@ -1006,7 +1028,7 @@  pieces_addr::decide_autoinc (machine_mode ARG_UNUSED (mode), bool reverse,
    but we still modify the MEM's properties.  */
 
 rtx
-pieces_addr::adjust (scalar_int_mode mode, HOST_WIDE_INT offset,
+pieces_addr::adjust (fixed_size_mode mode, HOST_WIDE_INT offset,
 		     by_pieces_prev *prev)
 {
   if (m_constfn)
@@ -1060,11 +1082,14 @@  pieces_addr::maybe_postinc (HOST_WIDE_INT size)
 class op_by_pieces_d
 {
  private:
-  scalar_int_mode get_usable_mode (scalar_int_mode mode, unsigned int);
+  fixed_size_mode get_usable_mode (fixed_size_mode, unsigned int);
+  fixed_size_mode smallest_fixed_size_mode_for_size (unsigned int);
 
  protected:
   pieces_addr m_to, m_from;
-  unsigned HOST_WIDE_INT m_len;
+  /* Make m_len read-only so that smallest_fixed_size_mode_for_size
+     can use it to bound the returned mode size.  */
+  const unsigned HOST_WIDE_INT m_len;
   HOST_WIDE_INT m_offset;
   unsigned int m_align;
   unsigned int m_max_size;
@@ -1073,6 +1098,8 @@  class op_by_pieces_d
   bool m_push;
   /* True if targetm.overlap_op_by_pieces_p () returns true.  */
   bool m_overlap_op_by_pieces;
+  /* True if QI vector mode can be used.  */
+  bool m_qi_vector_mode;
 
   /* Virtual functions, overriden by derived classes for the specific
      operation.  */
@@ -1084,7 +1111,8 @@  class op_by_pieces_d
 
  public:
   op_by_pieces_d (rtx, bool, rtx, bool, by_pieces_constfn, void *,
-		  unsigned HOST_WIDE_INT, unsigned int, bool);
+		  unsigned HOST_WIDE_INT, unsigned int, bool,
+		  bool = false);
   void run ();
 };
 
@@ -1099,11 +1127,12 @@  op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load,
 				by_pieces_constfn from_cfn,
 				void *from_cfn_data,
 				unsigned HOST_WIDE_INT len,
-				unsigned int align, bool push)
+				unsigned int align, bool push,
+				bool qi_vector_mode)
   : m_to (to, to_load, NULL, NULL),
     m_from (from, from_load, from_cfn, from_cfn_data),
     m_len (len), m_max_size (MOVE_MAX_PIECES + 1),
-    m_push (push)
+    m_push (push), m_qi_vector_mode (qi_vector_mode)
 {
   int toi = m_to.get_addr_inc ();
   int fromi = m_from.get_addr_inc ();
@@ -1124,7 +1153,9 @@  op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load,
   if (by_pieces_ninsns (len, align, m_max_size, MOVE_BY_PIECES) > 2)
     {
       /* Find the mode of the largest comparison.  */
-      scalar_int_mode mode = widest_int_mode_for_size (m_max_size);
+      fixed_size_mode mode
+	= widest_fixed_size_mode_for_size (m_max_size,
+					   m_qi_vector_mode);
 
       m_from.decide_autoinc (mode, m_reverse, len);
       m_to.decide_autoinc (mode, m_reverse, len);
@@ -1139,8 +1170,8 @@  op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load,
 /* This function returns the largest usable integer mode for LEN bytes
    whose size is no bigger than size of MODE.  */
 
-scalar_int_mode
-op_by_pieces_d::get_usable_mode (scalar_int_mode mode, unsigned int len)
+fixed_size_mode
+op_by_pieces_d::get_usable_mode (fixed_size_mode mode, unsigned int len)
 {
   unsigned int size;
   do
@@ -1148,13 +1179,42 @@  op_by_pieces_d::get_usable_mode (scalar_int_mode mode, unsigned int len)
       size = GET_MODE_SIZE (mode);
       if (len >= size && prepare_mode (mode, m_align))
 	break;
-      /* NB: widest_int_mode_for_size checks SIZE > 1.  */
-      mode = widest_int_mode_for_size (size);
+      /* widest_fixed_size_mode_for_size checks SIZE > 1.  */
+      mode = widest_fixed_size_mode_for_size (size, m_qi_vector_mode);
     }
   while (1);
   return mode;
 }
 
+/* Return the smallest integer or QI vector mode that is not narrower
+   than SIZE bytes.  */
+
+fixed_size_mode
+op_by_pieces_d::smallest_fixed_size_mode_for_size (unsigned int size)
+{
+  /* Use a QI vector mode only if SIZE is wider than a word.  */
+  if (m_qi_vector_mode && size > UNITS_PER_WORD)
+    {
+      machine_mode mode;
+      fixed_size_mode candidate;
+      FOR_EACH_MODE_IN_CLASS (mode, MODE_VECTOR_INT)
+	if (is_a<fixed_size_mode> (mode, &candidate)
+	    && GET_MODE_INNER (candidate) == QImode)
+	  {
+	    /* Don't return a mode wider than M_LEN.  */
+	    if (GET_MODE_SIZE (candidate) > m_len)
+	      break;
+
+	    if (GET_MODE_SIZE (candidate) >= size
+		&& (optab_handler (vec_duplicate_optab, candidate)
+		    != CODE_FOR_nothing))
+	      return candidate;
+	  }
+    }
+
+  return smallest_int_mode_for_size (size * BITS_PER_UNIT);
+}
+
 /* This function contains the main loop used for expanding a block
    operation.  First move what we can in the largest integer mode,
    then go to successively smaller modes.  For every access, call
@@ -1166,9 +1226,12 @@  op_by_pieces_d::run ()
   if (m_len == 0)
     return;
 
-  /* NB: widest_int_mode_for_size checks M_MAX_SIZE > 1.  */
-  scalar_int_mode mode = widest_int_mode_for_size (m_max_size);
-  mode = get_usable_mode (mode, m_len);
+  unsigned HOST_WIDE_INT length = m_len;
+
+  /* widest_fixed_size_mode_for_size checks M_MAX_SIZE > 1.  */
+  fixed_size_mode mode
+    = widest_fixed_size_mode_for_size (m_max_size, m_qi_vector_mode);
+  mode = get_usable_mode (mode, length);
 
   by_pieces_prev to_prev = { nullptr, mode };
   by_pieces_prev from_prev = { nullptr, mode };
@@ -1178,7 +1241,7 @@  op_by_pieces_d::run ()
       unsigned int size = GET_MODE_SIZE (mode);
       rtx to1 = NULL_RTX, from1;
 
-      while (m_len >= size)
+      while (length >= size)
 	{
 	  if (m_reverse)
 	    m_offset -= size;
@@ -1201,22 +1264,22 @@  op_by_pieces_d::run ()
 	  if (!m_reverse)
 	    m_offset += size;
 
-	  m_len -= size;
+	  length -= size;
 	}
 
       finish_mode (mode);
 
-      if (m_len == 0)
+      if (length == 0)
 	return;
 
       if (!m_push && m_overlap_op_by_pieces)
 	{
 	  /* NB: Generate overlapping operations if it is not a stack
 	     push since stack push must not overlap.  Get the smallest
-	     integer mode for M_LEN bytes.  */
-	  mode = smallest_int_mode_for_size (m_len * BITS_PER_UNIT);
+	     fixed size mode for M_LEN bytes.  */
+	  mode = smallest_fixed_size_mode_for_size (length);
 	  mode = get_usable_mode (mode, GET_MODE_SIZE (mode));
-	  int gap = GET_MODE_SIZE (mode) - m_len;
+	  int gap = GET_MODE_SIZE (mode) - length;
 	  if (gap > 0)
 	    {
 	      /* If size of MODE > M_LEN, generate the last operation
@@ -1226,20 +1289,21 @@  op_by_pieces_d::run ()
 		m_offset += gap;
 	      else
 		m_offset -= gap;
-	      m_len += gap;
+	      length += gap;
 	    }
 	}
       else
 	{
-	  /* NB: widest_int_mode_for_size checks SIZE > 1.  */
-	  mode = widest_int_mode_for_size (size);
-	  mode = get_usable_mode (mode, m_len);
+	  /* widest_fixed_size_mode_for_size checks SIZE > 1.  */
+	  mode = widest_fixed_size_mode_for_size (size,
+						  m_qi_vector_mode);
+	  mode = get_usable_mode (mode, length);
 	}
     }
   while (1);
 
   /* The code above should have handled everything.  */
-  gcc_assert (!m_len);
+  gcc_assert (!length);
 }
 
 /* Derived class from op_by_pieces_d, providing support for block move
@@ -1355,9 +1419,10 @@  class store_by_pieces_d : public op_by_pieces_d
 
  public:
   store_by_pieces_d (rtx to, by_pieces_constfn cfn, void *cfn_data,
-		     unsigned HOST_WIDE_INT len, unsigned int align)
+		     unsigned HOST_WIDE_INT len, unsigned int align,
+		     bool qi_vector_mode)
     : op_by_pieces_d (to, false, NULL_RTX, true, cfn, cfn_data, len,
-		      align, false)
+		      align, false, qi_vector_mode)
   {
   }
   rtx finish_retmode (memop_ret);
@@ -1446,7 +1511,10 @@  can_store_by_pieces (unsigned HOST_WIDE_INT len,
       max_size = STORE_MAX_PIECES + 1;
       while (max_size > 1 && l > 0)
 	{
-	  scalar_int_mode mode = widest_int_mode_for_size (max_size);
+	  /* Since this can be called before virtual registers are ready
+	     to use, avoid QI vector mode here.  */
+	  fixed_size_mode mode
+	    = widest_fixed_size_mode_for_size (max_size, false);
 
 	  icode = optab_handler (mov_optab, mode);
 	  if (icode != CODE_FOR_nothing
@@ -1504,7 +1572,8 @@  store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
 		 memsetp ? SET_BY_PIECES : STORE_BY_PIECES,
 		 optimize_insn_for_speed_p ()));
 
-  store_by_pieces_d data (to, constfun, constfundata, len, align);
+  store_by_pieces_d data (to, constfun, constfundata, len, align,
+			  memsetp);
   data.run ();
 
   if (retmode != RETURN_BEGIN)
@@ -1513,15 +1582,6 @@  store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
     return to;
 }
 
-/* Callback routine for clear_by_pieces.
-   Return const0_rtx unconditionally.  */
-
-static rtx
-clear_by_pieces_1 (void *, void *, HOST_WIDE_INT, scalar_int_mode)
-{
-  return const0_rtx;
-}
-
 /* Generate several move instructions to clear LEN bytes of block TO.  (A MEM
    rtx with BLKmode).  ALIGN is maximum alignment we can assume.  */
 
@@ -1531,7 +1591,10 @@  clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
   if (len == 0)
     return;
 
-  store_by_pieces_d data (to, clear_by_pieces_1, NULL, len, align);
+  /* Use builtin_memset_read_str to support vector mode broadcast.  */
+  char c = 0;
+  store_by_pieces_d data (to, builtin_memset_read_str, &c, len, align,
+			  true);
   data.run ();
 }
 
@@ -5754,7 +5817,7 @@  emit_storent_insn (rtx to, rtx from)
 
 static rtx
 string_cst_read_str (void *data, void *, HOST_WIDE_INT offset,
-		     scalar_int_mode mode)
+		     fixed_size_mode mode)
 {
   tree str = (tree) data;
 
@@ -5769,10 +5832,13 @@  string_cst_read_str (void *data, void *, HOST_WIDE_INT offset,
       size_t l = TREE_STRING_LENGTH (str) - offset;
       memcpy (p, TREE_STRING_POINTER (str) + offset, l);
       memset (p + l, '\0', GET_MODE_SIZE (mode) - l);
-      return c_readstr (p, mode, false);
+      return c_readstr (p, as_a <scalar_int_mode> (mode), false);
     }
 
-  return c_readstr (TREE_STRING_POINTER (str) + offset, mode, false);
+  /* The by-pieces infrastructure does not try to pick a vector mode
+     for storing STRING_CST.  */
+  return c_readstr (TREE_STRING_POINTER (str) + offset,
+		    as_a <scalar_int_mode> (mode), false);
 }
 
 /* Generate code for computing expression EXP,
diff --git a/gcc/expr.h b/gcc/expr.h
index a4f44265759..94a85b40dca 100644
--- a/gcc/expr.h
+++ b/gcc/expr.h
@@ -108,13 +108,13 @@  enum block_op_methods
 };
 
 typedef rtx (*by_pieces_constfn) (void *, void *, HOST_WIDE_INT,
-				  scalar_int_mode);
+				  fixed_size_mode);
 
 /* The second pointer passed to by_pieces_constfn.  */
 struct by_pieces_prev
 {
   rtx data;
-  scalar_int_mode mode;
+  fixed_size_mode mode;
 };
 
 extern rtx emit_block_move (rtx, rtx, rtx, enum block_op_methods);
diff --git a/gcc/rtl.h b/gcc/rtl.h
index 5ffe90030fb..5fdcdfcd728 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -2460,6 +2460,8 @@  extern bool subreg_offset_representable_p (unsigned int, machine_mode,
 extern unsigned int subreg_regno (const_rtx);
 extern int simplify_subreg_regno (unsigned int, machine_mode,
 				  poly_uint64, machine_mode);
+extern int lowpart_subreg_regno (unsigned int, machine_mode,
+				 machine_mode);
 extern unsigned int subreg_nregs (const_rtx);
 extern unsigned int subreg_nregs_with_regno (unsigned int, const_rtx);
 extern unsigned HOST_WIDE_INT nonzero_bits (const_rtx, machine_mode);
diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
index ec7a062829c..3b8d88afd4d 100644
--- a/gcc/rtlanal.c
+++ b/gcc/rtlanal.c
@@ -4325,6 +4325,17 @@  simplify_subreg_regno (unsigned int xregno, machine_mode xmode,
   return (int) yregno;
 }
 
+/* A wrapper around simplify_subreg_regno that uses subreg_lowpart_offset
+   (xmode, ymode) as the offset.  */
+
+int
+lowpart_subreg_regno (unsigned int regno, machine_mode xmode,
+		      machine_mode ymode)
+{
+  poly_uint64 offset = subreg_lowpart_offset (xmode, ymode);
+  return simplify_subreg_regno (regno, xmode, offset, ymode);
+}
+
 /* Return the final regno that a subreg expression refers to.  */
 unsigned int
 subreg_regno (const_rtx x)
diff --git a/gcc/target.def b/gcc/target.def
index 2e40448e6c5..b801289cebd 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2726,6 +2726,15 @@  DEFHOOK
  rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, int cmp_code, tree op0, tree op1, int bit_code),
  NULL)
 
+DEFHOOK
+(gen_memset_scratch_rtx,
 "This hook should return an rtx for a scratch register in @var{mode} to\n\
be used by memset broadcast.  The backend can use a hard scratch\n\
register to avoid stack realignment when expanding memset.  The default\n\
is @code{gen_reg_rtx}.",
+ rtx, (machine_mode mode),
+ gen_reg_rtx)
+
 /* Return a new value for loop unroll size.  */
 DEFHOOK
 (loop_unroll_adjust,
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-3.c b/gcc/testsuite/gcc.target/i386/pr100865-3.c
index b6dbcf7809b..007e79f91b0 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-3.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-3.c
@@ -10,6 +10,6 @@  foo (void)
 }
 
 /* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
-/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu8\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
 /* { dg-final { scan-assembler-not "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" } } */
 /* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-4b.c b/gcc/testsuite/gcc.target/i386/pr100865-4b.c
index f41e6147b4c..1e50dc842bc 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-4b.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-4b.c
@@ -4,6 +4,6 @@ 
 #include "pr100865-4a.c"
 
 /* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
-/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%xmm\[0-9\]+, " 4 } } */
+/* { dg-final { scan-assembler-times "vmovdqu8\[\\t \]%xmm\[0-9\]+, " 4 } } */
 /* { dg-final { scan-assembler-not "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" } } */
 /* { dg-final { scan-assembler-not "vmovdqa" } } */
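
For readers checking the updated pr100865 scan patterns: with QI vector
mode support, a constant memset of a wide block expands to one byte
broadcast into a vector register (vpbroadcastb on x86) followed by
vector stores (hence the vmovdqu8 pattern).  A plain-C model of that
shape (an illustration, not GCC internals):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Plain-C model of the vector memset shape: broadcast a byte into a
   16-byte "register" once (vpbroadcastb), then store that register for
   each 16-byte piece (vmovdqu8).  VEC16 stands in for a V16QI scratch
   register obtained via targetm.gen_memset_scratch_rtx.  */
typedef struct { uint8_t b[16]; } vec16;

static vec16
vec_broadcast (uint8_t c)
{
  vec16 v;
  memset (v.b, c, sizeof v.b);
  return v;
}

static void
memset_v16 (uint8_t *dst, uint8_t c, size_t n16)
{
  vec16 v = vec_broadcast (c);		/* One broadcast ...  */
  for (size_t i = 0; i < n16; i++)
    memcpy (dst + i * 16, v.b, 16);	/* ... many vector stores.  */
}
```

The point of the scan-assembler changes is exactly this shape: one
vpbroadcastb from a GPR, then per-piece stores, with no extra
register-to-register broadcasts or vmovdqa spills.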