[v1] x86: Improve memset-vec-unaligned-erms.S

Message ID 20210520184404.2901975-1-goldstein.w.n@gmail.com
State New
Headers show
Series
  • [v1] x86: Improve memset-vec-unaligned-erms.S
Related show

Commit Message

Florian Weimer via Libc-alpha May 20, 2021, 6:44 p.m.
No bug. This commit makes a few small improvements to
memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64
instead of 128. Either alignment will perform equally well in a loop
and 128 just increases the odds of having to do an extra iteration
which can be significant overhead for small values. 2) Align some
targets and the loop. 3) Remove an ALU from the alignment process. 4)
Reorder the last 4x VEC so that they are stored after the loop. 5)
Move the condition for leq 8x VEC to before the alignment
process. test-memset and test-wmemset are both passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>

---
Tests where run on the following CPUs:

Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html

Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

All times are the geometric mean of N=50. The unit of time is
seconds.

"Cur" refers to the current implementation
"New" refers to this patches implementation

Performance data attached in memset-data.pdf
    
Some notes on the numbers:

I only included numbers for the proper VEC_SIZE for the corresponding
cpu.

skl -> avx2
icl -> evex
tgl -> evex            
    
The changes only affect sizes > 2 * VEC_SIZE. The performance
differences in the size <= 2 * VEC_SIZE come from changes in alignment
after linking (i.e ENTRY aligns to 16, but performance can be affected
by alignment % 64 or alignment % 4096) and generally affects
throughput only, not latency (i.e with an lfence to the benchmark loop
the deviations go away). Generally I think they can be ignored (both
positive and negative affects).

The interesting part of the data is in the medium size range [128,
1024] where the new implementation has a reasonable speedup. This is
especially pronounced when the more conservative alignment saves a
full loop iteration. The only significant exception is
skylake-avx2-erms case for size = 416, alignment = 416 where the
current implementation is meaningfully faster. I am unsure of the root
cause for this. The skylake-avx2 case only performs a bit worse in
this case which makes me think part of it is code alignment related,
though comparative to the speedup in other size/alignment
configurations it is still a trough.  Despite this, I still think the
numbers are overall an improvement.

As well due to aligning the loop (and possibly slightly more efficient
DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop
with subq -4 * VEC_SIZE) in the non-erms cases there is often a slight
improvement to the main loop for large sizes.

 .../multiarch/memset-vec-unaligned-erms.S     | 50 +++++++++++--------
 1 file changed, 28 insertions(+), 22 deletions(-)

-- 
2.25.1

Comments

Florian Weimer via Libc-alpha May 20, 2021, 6:47 p.m. | #1
On Thu, May 20, 2021 at 2:45 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>

> No bug. This commit makes a few small improvements to

> memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64

> instead of 128. Either alignment will perform equally well in a loop

> and 128 just increases the odds of having to do an extra iteration

> which can be significant overhead for small values. 2) Align some

> targets and the loop. 3) Remove an ALU from the alignment process. 4)

> Reorder the last 4x VEC so that they are stored after the loop. 5)

> Move the condition for leq 8x VEC to before the alignment

> process. test-memset and test-wmemset are both passing.

>

> Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>

> ---

> Tests where run on the following CPUs:

>

> Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

>

> Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html

>

> Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

>

> All times are the geometric mean of N=50. The unit of time is

> seconds.

>

> "Cur" refers to the current implementation

> "New" refers to this patches implementation

>

> Performance data attached in memset-data.pdf

>

> Some notes on the numbers:

>

> I only included numbers for the proper VEC_SIZE for the corresponding

> cpu.

>

> skl -> avx2

> icl -> evex

> tgl -> evex

>

> The changes only affect sizes > 2 * VEC_SIZE. The performance

> differences in the size <= 2 * VEC_SIZE come from changes in alignment

> after linking (i.e ENTRY aligns to 16, but performance can be affected

> by alignment % 64 or alignment % 4096) and generally affects

> throughput only, not latency (i.e with an lfence to the benchmark loop

> the deviations go away). Generally I think they can be ignored (both

> positive and negative affects).

>

> The interesting part of the data is in the medium size range [128,

> 1024] where the new implementation has a reasonable speedup. This is

> especially pronounced when the more conservative alignment saves a

> full loop iteration. The only significant exception is

> skylake-avx2-erms case for size = 416, alignment = 416 where the

> current implementation is meaningfully faster. I am unsure of the root

> cause for this. The skylake-avx2 case only performs a bit worse in

> this case which makes me think part of it is code alignment related,

> though comparative to the speedup in other size/alignment

> configurations it is still a trough.  Despite this, I still think the

> numbers are overall an improvement.

>

> As well due to aligning the loop (and possibly slightly more efficient

> DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop

> with subq -4 * VEC_SIZE) in the non-erms cases there is often a slight

> improvement to the main loop for large sizes.

>

>  .../multiarch/memset-vec-unaligned-erms.S     | 50 +++++++++++--------

>  1 file changed, 28 insertions(+), 22 deletions(-)

>

> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> index 08cfa49bd1..ff196844a0 100644

> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))

>         VMOVU   %VEC(0), (%rdi)

>         VZEROUPPER_RETURN

>

> +       .p2align 4

>  L(stosb_more_2x_vec):

>         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP

>         ja      L(stosb)

> +#else

> +       .p2align 4

>  #endif

>  L(more_2x_vec):

> -       cmpq  $(VEC_SIZE * 4), %rdx

> -       ja      L(loop_start)

> +       /* Stores to first 2x VEC before cmp as any path forward will

> +          require it.  */

>         VMOVU   %VEC(0), (%rdi)

>         VMOVU   %VEC(0), VEC_SIZE(%rdi)

> -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> +       cmpq    $(VEC_SIZE * 4), %rdx

> +       ja      L(loop_start)

>         VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

> +       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

>  L(return):

>  #if VEC_SIZE > 16

>         ZERO_UPPER_VEC_REGISTERS_RETURN

> @@ -192,28 +197,29 @@ L(return):

>  #endif

>

>  L(loop_start):

> -       leaq    (VEC_SIZE * 4)(%rdi), %rcx

> -       VMOVU   %VEC(0), (%rdi)

> -       andq    $-(VEC_SIZE * 4), %rcx

> -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> -       VMOVU   %VEC(0), VEC_SIZE(%rdi)

> -       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

>         VMOVU   %VEC(0), (VEC_SIZE * 2)(%rdi)

> -       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx)

>         VMOVU   %VEC(0), (VEC_SIZE * 3)(%rdi)

> -       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx)

> -       addq    %rdi, %rdx

> -       andq    $-(VEC_SIZE * 4), %rdx

> -       cmpq    %rdx, %rcx

> -       je      L(return)

> +       cmpq    $(VEC_SIZE * 8), %rdx

> +       jbe     L(loop_end)

> +       andq    $-(VEC_SIZE * 2), %rdi

> +       subq    $-(VEC_SIZE * 4), %rdi

> +       leaq    -(VEC_SIZE * 4)(%rax, %rdx), %rcx

> +       .p2align 4

>  L(loop):

> -       VMOVA   %VEC(0), (%rcx)

> -       VMOVA   %VEC(0), VEC_SIZE(%rcx)

> -       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rcx)

> -       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rcx)

> -       addq    $(VEC_SIZE * 4), %rcx

> -       cmpq    %rcx, %rdx

> -       jne     L(loop)

> +       VMOVA   %VEC(0), (%rdi)

> +       VMOVA   %VEC(0), VEC_SIZE(%rdi)

> +       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rdi)

> +       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rdi)

> +       subq    $-(VEC_SIZE * 4), %rdi

> +       cmpq    %rcx, %rdi

> +       jb      L(loop)

> +L(loop_end):

> +       /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.

> +              rdx as length is also unchanged.  */

> +       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx)

> +       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx)

> +       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx)

> +       VMOVU   %VEC(0), -VEC_SIZE(%rax, %rdx)

>         VZEROUPPER_SHORT_RETURN

>

>         .p2align 4

> --

> 2.25.1

>
Florian Weimer via Libc-alpha May 20, 2021, 7:39 p.m. | #2
On Thu, May 20, 2021 at 11:45 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>

> No bug. This commit makes a few small improvements to

> memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64

> instead of 128. Either alignment will perform equally well in a loop

> and 128 just increases the odds of having to do an extra iteration

> which can be significant overhead for small values. 2) Align some

> targets and the loop. 3) Remove an ALU from the alignment process. 4)

> Reorder the last 4x VEC so that they are stored after the loop. 5)

> Move the condition for leq 8x VEC to before the alignment

> process. test-memset and test-wmemset are both passing.

>

> Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>

> ---

> Tests where run on the following CPUs:

>

> Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

>

> Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html

>

> Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

>

> All times are the geometric mean of N=50. The unit of time is

> seconds.

>

> "Cur" refers to the current implementation

> "New" refers to this patches implementation

>

> Performance data attached in memset-data.pdf

>

> Some notes on the numbers:

>

> I only included numbers for the proper VEC_SIZE for the corresponding

> cpu.

>

> skl -> avx2

> icl -> evex

> tgl -> evex

>

> The changes only affect sizes > 2 * VEC_SIZE. The performance

> differences in the size <= 2 * VEC_SIZE come from changes in alignment

> after linking (i.e ENTRY aligns to 16, but performance can be affected

> by alignment % 64 or alignment % 4096) and generally affects

> throughput only, not latency (i.e with an lfence to the benchmark loop

> the deviations go away). Generally I think they can be ignored (both

> positive and negative affects).

>

> The interesting part of the data is in the medium size range [128,

> 1024] where the new implementation has a reasonable speedup. This is

> especially pronounced when the more conservative alignment saves a

> full loop iteration. The only significant exception is

> skylake-avx2-erms case for size = 416, alignment = 416 where the

> current implementation is meaningfully faster. I am unsure of the root

> cause for this. The skylake-avx2 case only performs a bit worse in

> this case which makes me think part of it is code alignment related,

> though comparative to the speedup in other size/alignment

> configurations it is still a trough.  Despite this, I still think the

> numbers are overall an improvement.

>

> As well due to aligning the loop (and possibly slightly more efficient

> DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop

> with subq -4 * VEC_SIZE) in the non-erms cases there is often a slight

> improvement to the main loop for large sizes.

>

>  .../multiarch/memset-vec-unaligned-erms.S     | 50 +++++++++++--------

>  1 file changed, 28 insertions(+), 22 deletions(-)

>

> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> index 08cfa49bd1..ff196844a0 100644

> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))

>         VMOVU   %VEC(0), (%rdi)

>         VZEROUPPER_RETURN

>

> +       .p2align 4

>  L(stosb_more_2x_vec):

>         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP

>         ja      L(stosb)

> +#else

> +       .p2align 4

>  #endif

>  L(more_2x_vec):

> -       cmpq  $(VEC_SIZE * 4), %rdx

> -       ja      L(loop_start)

> +       /* Stores to first 2x VEC before cmp as any path forward will

> +          require it.  */

>         VMOVU   %VEC(0), (%rdi)

>         VMOVU   %VEC(0), VEC_SIZE(%rdi)

> -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> +       cmpq    $(VEC_SIZE * 4), %rdx

> +       ja      L(loop_start)

>         VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

> +       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

>  L(return):

>  #if VEC_SIZE > 16

>         ZERO_UPPER_VEC_REGISTERS_RETURN

> @@ -192,28 +197,29 @@ L(return):

>  #endif

>

>  L(loop_start):

> -       leaq    (VEC_SIZE * 4)(%rdi), %rcx

> -       VMOVU   %VEC(0), (%rdi)

> -       andq    $-(VEC_SIZE * 4), %rcx

> -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> -       VMOVU   %VEC(0), VEC_SIZE(%rdi)

> -       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

>         VMOVU   %VEC(0), (VEC_SIZE * 2)(%rdi)

> -       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx)

>         VMOVU   %VEC(0), (VEC_SIZE * 3)(%rdi)

> -       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx)

> -       addq    %rdi, %rdx

> -       andq    $-(VEC_SIZE * 4), %rdx

> -       cmpq    %rdx, %rcx

> -       je      L(return)

> +       cmpq    $(VEC_SIZE * 8), %rdx

> +       jbe     L(loop_end)

> +       andq    $-(VEC_SIZE * 2), %rdi

> +       subq    $-(VEC_SIZE * 4), %rdi

> +       leaq    -(VEC_SIZE * 4)(%rax, %rdx), %rcx

> +       .p2align 4

>  L(loop):

> -       VMOVA   %VEC(0), (%rcx)

> -       VMOVA   %VEC(0), VEC_SIZE(%rcx)

> -       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rcx)

> -       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rcx)

> -       addq    $(VEC_SIZE * 4), %rcx

> -       cmpq    %rcx, %rdx

> -       jne     L(loop)

> +       VMOVA   %VEC(0), (%rdi)

> +       VMOVA   %VEC(0), VEC_SIZE(%rdi)

> +       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rdi)

> +       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rdi)

> +       subq    $-(VEC_SIZE * 4), %rdi

> +       cmpq    %rcx, %rdi

> +       jb      L(loop)

> +L(loop_end):

> +       /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.

> +              rdx as length is also unchanged.  */

> +       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx)

> +       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx)

> +       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx)

> +       VMOVU   %VEC(0), -VEC_SIZE(%rax, %rdx)

>         VZEROUPPER_SHORT_RETURN

>

>         .p2align 4

> --

> 2.25.1

>


LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>


Thanks.

-- 
H.J.
Florian Weimer via Libc-alpha May 20, 2021, 8:03 p.m. | #3
On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>

> On Thu, May 20, 2021 at 11:45 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:

> >

> > No bug. This commit makes a few small improvements to

> > memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64

> > instead of 128. Either alignment will perform equally well in a loop

> > and 128 just increases the odds of having to do an extra iteration

> > which can be significant overhead for small values. 2) Align some

> > targets and the loop. 3) Remove an ALU from the alignment process. 4)

> > Reorder the last 4x VEC so that they are stored after the loop. 5)

> > Move the condition for leq 8x VEC to before the alignment

> > process. test-memset and test-wmemset are both passing.

> >

> > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>

> > ---

> > Tests where run on the following CPUs:

> >

> > Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

> >

> > Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html

> >

> > Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

> >

> > All times are the geometric mean of N=50. The unit of time is

> > seconds.

> >

> > "Cur" refers to the current implementation

> > "New" refers to this patches implementation

> >

> > Performance data attached in memset-data.pdf

> >

> > Some notes on the numbers:

> >

> > I only included numbers for the proper VEC_SIZE for the corresponding

> > cpu.

> >

> > skl -> avx2

> > icl -> evex

> > tgl -> evex

> >

> > The changes only affect sizes > 2 * VEC_SIZE. The performance

> > differences in the size <= 2 * VEC_SIZE come from changes in alignment

> > after linking (i.e ENTRY aligns to 16, but performance can be affected

> > by alignment % 64 or alignment % 4096) and generally affects

> > throughput only, not latency (i.e with an lfence to the benchmark loop

> > the deviations go away). Generally I think they can be ignored (both

> > positive and negative affects).

> >

> > The interesting part of the data is in the medium size range [128,

> > 1024] where the new implementation has a reasonable speedup. This is

> > especially pronounced when the more conservative alignment saves a

> > full loop iteration. The only significant exception is

> > skylake-avx2-erms case for size = 416, alignment = 416 where the

> > current implementation is meaningfully faster. I am unsure of the root

> > cause for this. The skylake-avx2 case only performs a bit worse in

> > this case which makes me think part of it is code alignment related,

> > though comparative to the speedup in other size/alignment

> > configurations it is still a trough.  Despite this, I still think the

> > numbers are overall an improvement.

> >

> > As well due to aligning the loop (and possibly slightly more efficient

> > DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop

> > with subq -4 * VEC_SIZE) in the non-erms cases there is often a slight

> > improvement to the main loop for large sizes.

> >

> >  .../multiarch/memset-vec-unaligned-erms.S     | 50 +++++++++++--------

> >  1 file changed, 28 insertions(+), 22 deletions(-)

> >

> > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> > index 08cfa49bd1..ff196844a0 100644

> > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))

> >         VMOVU   %VEC(0), (%rdi)

> >         VZEROUPPER_RETURN

> >

> > +       .p2align 4

> >  L(stosb_more_2x_vec):

> >         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP

> >         ja      L(stosb)

> > +#else

> > +       .p2align 4

> >  #endif

> >  L(more_2x_vec):

> > -       cmpq  $(VEC_SIZE * 4), %rdx

> > -       ja      L(loop_start)

> > +       /* Stores to first 2x VEC before cmp as any path forward will

> > +          require it.  */

> >         VMOVU   %VEC(0), (%rdi)

> >         VMOVU   %VEC(0), VEC_SIZE(%rdi)

> > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> > +       cmpq    $(VEC_SIZE * 4), %rdx

> > +       ja      L(loop_start)

> >         VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

> > +       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> >  L(return):

> >  #if VEC_SIZE > 16

> >         ZERO_UPPER_VEC_REGISTERS_RETURN

> > @@ -192,28 +197,29 @@ L(return):

> >  #endif

> >

> >  L(loop_start):

> > -       leaq    (VEC_SIZE * 4)(%rdi), %rcx

> > -       VMOVU   %VEC(0), (%rdi)

> > -       andq    $-(VEC_SIZE * 4), %rcx

> > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> > -       VMOVU   %VEC(0), VEC_SIZE(%rdi)

> > -       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

> >         VMOVU   %VEC(0), (VEC_SIZE * 2)(%rdi)

> > -       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx)

> >         VMOVU   %VEC(0), (VEC_SIZE * 3)(%rdi)

> > -       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx)

> > -       addq    %rdi, %rdx

> > -       andq    $-(VEC_SIZE * 4), %rdx

> > -       cmpq    %rdx, %rcx

> > -       je      L(return)

> > +       cmpq    $(VEC_SIZE * 8), %rdx

> > +       jbe     L(loop_end)

> > +       andq    $-(VEC_SIZE * 2), %rdi

> > +       subq    $-(VEC_SIZE * 4), %rdi

> > +       leaq    -(VEC_SIZE * 4)(%rax, %rdx), %rcx

> > +       .p2align 4

> >  L(loop):

> > -       VMOVA   %VEC(0), (%rcx)

> > -       VMOVA   %VEC(0), VEC_SIZE(%rcx)

> > -       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rcx)

> > -       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rcx)

> > -       addq    $(VEC_SIZE * 4), %rcx

> > -       cmpq    %rcx, %rdx

> > -       jne     L(loop)

> > +       VMOVA   %VEC(0), (%rdi)

> > +       VMOVA   %VEC(0), VEC_SIZE(%rdi)

> > +       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rdi)

> > +       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rdi)

> > +       subq    $-(VEC_SIZE * 4), %rdi

> > +       cmpq    %rcx, %rdi

> > +       jb      L(loop)

> > +L(loop_end):

> > +       /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.

> > +              rdx as length is also unchanged.  */

> > +       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx)

> > +       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx)

> > +       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx)

> > +       VMOVU   %VEC(0), -VEC_SIZE(%rax, %rdx)

> >         VZEROUPPER_SHORT_RETURN

> >

> >         .p2align 4

> > --

> > 2.25.1

> >

>

> LGTM.


Awesome!

For future patches do you prefer performance numbers like this or
raw text? Or some other alternative?

>

> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

>

> Thanks.

>

> --

> H.J.
Florian Weimer via Libc-alpha May 20, 2021, 8:44 p.m. | #4
On Thu, May 20, 2021 at 1:03 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>

> On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com> wrote:

> >

> > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:

> > >

> > > No bug. This commit makes a few small improvements to

> > > memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64

> > > instead of 128. Either alignment will perform equally well in a loop

> > > and 128 just increases the odds of having to do an extra iteration

> > > which can be significant overhead for small values. 2) Align some

> > > targets and the loop. 3) Remove an ALU from the alignment process. 4)

> > > Reorder the last 4x VEC so that they are stored after the loop. 5)

> > > Move the condition for leq 8x VEC to before the alignment

> > > process. test-memset and test-wmemset are both passing.

> > >

> > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>

> > > ---

> > > Tests where run on the following CPUs:

> > >

> > > Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

> > >

> > > Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html

> > >

> > > Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

> > >

> > > All times are the geometric mean of N=50. The unit of time is

> > > seconds.

> > >

> > > "Cur" refers to the current implementation

> > > "New" refers to this patches implementation

> > >

> > > Performance data attached in memset-data.pdf

> > >

> > > Some notes on the numbers:

> > >

> > > I only included numbers for the proper VEC_SIZE for the corresponding

> > > cpu.

> > >

> > > skl -> avx2

> > > icl -> evex

> > > tgl -> evex

> > >

> > > The changes only affect sizes > 2 * VEC_SIZE. The performance

> > > differences in the size <= 2 * VEC_SIZE come from changes in alignment

> > > after linking (i.e ENTRY aligns to 16, but performance can be affected

> > > by alignment % 64 or alignment % 4096) and generally affects

> > > throughput only, not latency (i.e with an lfence to the benchmark loop

> > > the deviations go away). Generally I think they can be ignored (both

> > > positive and negative affects).

> > >

> > > The interesting part of the data is in the medium size range [128,

> > > 1024] where the new implementation has a reasonable speedup. This is

> > > especially pronounced when the more conservative alignment saves a

> > > full loop iteration. The only significant exception is

> > > skylake-avx2-erms case for size = 416, alignment = 416 where the

> > > current implementation is meaningfully faster. I am unsure of the root

> > > cause for this. The skylake-avx2 case only performs a bit worse in

> > > this case which makes me think part of it is code alignment related,

> > > though comparative to the speedup in other size/alignment

> > > configurations it is still a trough.  Despite this, I still think the

> > > numbers are overall an improvement.

> > >

> > > As well due to aligning the loop (and possibly slightly more efficient

> > > DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop

> > > with subq -4 * VEC_SIZE) in the non-erms cases there is often a slight

> > > improvement to the main loop for large sizes.

> > >

> > >  .../multiarch/memset-vec-unaligned-erms.S     | 50 +++++++++++--------

> > >  1 file changed, 28 insertions(+), 22 deletions(-)

> > >

> > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> > > index 08cfa49bd1..ff196844a0 100644

> > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))

> > >         VMOVU   %VEC(0), (%rdi)

> > >         VZEROUPPER_RETURN

> > >

> > > +       .p2align 4

> > >  L(stosb_more_2x_vec):

> > >         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP

> > >         ja      L(stosb)

> > > +#else

> > > +       .p2align 4

> > >  #endif

> > >  L(more_2x_vec):

> > > -       cmpq  $(VEC_SIZE * 4), %rdx

> > > -       ja      L(loop_start)

> > > +       /* Stores to first 2x VEC before cmp as any path forward will

> > > +          require it.  */

> > >         VMOVU   %VEC(0), (%rdi)

> > >         VMOVU   %VEC(0), VEC_SIZE(%rdi)

> > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> > > +       cmpq    $(VEC_SIZE * 4), %rdx

> > > +       ja      L(loop_start)

> > >         VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

> > > +       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> > >  L(return):

> > >  #if VEC_SIZE > 16

> > >         ZERO_UPPER_VEC_REGISTERS_RETURN

> > > @@ -192,28 +197,29 @@ L(return):

> > >  #endif

> > >

> > >  L(loop_start):

> > > -       leaq    (VEC_SIZE * 4)(%rdi), %rcx

> > > -       VMOVU   %VEC(0), (%rdi)

> > > -       andq    $-(VEC_SIZE * 4), %rcx

> > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> > > -       VMOVU   %VEC(0), VEC_SIZE(%rdi)

> > > -       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

> > >         VMOVU   %VEC(0), (VEC_SIZE * 2)(%rdi)

> > > -       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx)

> > >         VMOVU   %VEC(0), (VEC_SIZE * 3)(%rdi)

> > > -       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx)

> > > -       addq    %rdi, %rdx

> > > -       andq    $-(VEC_SIZE * 4), %rdx

> > > -       cmpq    %rdx, %rcx

> > > -       je      L(return)

> > > +       cmpq    $(VEC_SIZE * 8), %rdx

> > > +       jbe     L(loop_end)

> > > +       andq    $-(VEC_SIZE * 2), %rdi

> > > +       subq    $-(VEC_SIZE * 4), %rdi

> > > +       leaq    -(VEC_SIZE * 4)(%rax, %rdx), %rcx

> > > +       .p2align 4

> > >  L(loop):

> > > -       VMOVA   %VEC(0), (%rcx)

> > > -       VMOVA   %VEC(0), VEC_SIZE(%rcx)

> > > -       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rcx)

> > > -       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rcx)

> > > -       addq    $(VEC_SIZE * 4), %rcx

> > > -       cmpq    %rcx, %rdx

> > > -       jne     L(loop)

> > > +       VMOVA   %VEC(0), (%rdi)

> > > +       VMOVA   %VEC(0), VEC_SIZE(%rdi)

> > > +       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rdi)

> > > +       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rdi)

> > > +       subq    $-(VEC_SIZE * 4), %rdi

> > > +       cmpq    %rcx, %rdi

> > > +       jb      L(loop)

> > > +L(loop_end):

> > > +       /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.

> > > +              rdx as length is also unchanged.  */

> > > +       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx)

> > > +       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx)

> > > +       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx)

> > > +       VMOVU   %VEC(0), -VEC_SIZE(%rax, %rdx)

> > >         VZEROUPPER_SHORT_RETURN

> > >

> > >         .p2align 4

> > > --

> > > 2.25.1

> > >

> >

> > LGTM.

>

> Awesome!

>

> For future patches do you prefer performance numbers like this or

> raw text? Or some other alternative?


The current data format is fine.   Thanks.

> >

> > Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

> >

> > Thanks.

> >

> > --

> > H.J.




-- 
H.J.
Florian Weimer via Libc-alpha June 7, 2021, 2:35 a.m. | #5
On Thu, May 20, 2021 at 4:45 PM H.J. Lu <hjl.tools@gmail.com> wrote:

> On Thu, May 20, 2021 at 1:03 PM Noah Goldstein <goldstein.w.n@gmail.com>

> wrote:

> >

> > On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com> wrote:

> > >

> > > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein <

> goldstein.w.n@gmail.com> wrote:

> > > >

> > > > No bug. This commit makes a few small improvements to

> > > > memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64

> > > > instead of 128. Either alignment will perform equally well in a loop

> > > > and 128 just increases the odds of having to do an extra iteration

> > > > which can be significant overhead for small values. 2) Align some

> > > > targets and the loop. 3) Remove an ALU from the alignment process. 4)

> > > > Reorder the last 4x VEC so that they are stored after the loop. 5)

> > > > Move the condition for leq 8x VEC to before the alignment

> > > > process. test-memset and test-wmemset are both passing.

> > > >

> > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>

> > > > ---

> > > > Tests where run on the following CPUs:

> > > >

> > > > Skylake:

> https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

> > > >

> > > > Icelake:

> https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html

> > > >

> > > > Tigerlake:

> https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

> > > >

> > > > All times are the geometric mean of N=50. The unit of time is

> > > > seconds.

> > > >

> > > > "Cur" refers to the current implementation

> > > > "New" refers to this patches implementation

> > > >

> > > > Performance data attached in memset-data.pdf

> > > >

> > > > Some notes on the numbers:

> > > >

> > > > I only included numbers for the proper VEC_SIZE for the corresponding

> > > > cpu.

> > > >

> > > > skl -> avx2

> > > > icl -> evex

> > > > tgl -> evex

> > > >

> > > > The changes only affect sizes > 2 * VEC_SIZE. The performance

> > > > differences in the size <= 2 * VEC_SIZE come from changes in

> alignment

> > > > after linking (i.e ENTRY aligns to 16, but performance can be

> affected

> > > > by alignment % 64 or alignment % 4096) and generally affects

> > > > throughput only, not latency (i.e with an lfence to the benchmark

> loop

> > > > the deviations go away). Generally I think they can be ignored (both

> > > > positive and negative affects).

> > > >

> > > > The interesting part of the data is in the medium size range [128,

> > > > 1024] where the new implementation has a reasonable speedup. This is

> > > > especially pronounced when the more conservative alignment saves a

> > > > full loop iteration. The only significant exception is

> > > > skylake-avx2-erms case for size = 416, alignment = 416 where the

> > > > current implementation is meaningfully faster. I am unsure of the

> root

> > > > cause for this. The skylake-avx2 case only performs a bit worse in

> > > > this case which makes me think part of it is code alignment related,

> > > > though comparative to the speedup in other size/alignment

> > > > configurations it is still a trough.  Despite this, I still think the

> > > > numbers are overall an improvement.

> > > >

> > > > As well due to aligning the loop (and possibly slightly more

> efficient

> > > > DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop

> > > > with subq -4 * VEC_SIZE) in the non-erms cases there is often a

> slight

> > > > improvement to the main loop for large sizes.

> > > >

> > > >  .../multiarch/memset-vec-unaligned-erms.S     | 50

> +++++++++++--------

> > > >  1 file changed, 28 insertions(+), 22 deletions(-)

> > > >

> > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> > > > index 08cfa49bd1..ff196844a0 100644

> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> > > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset,

> unaligned_erms))

> > > >         VMOVU   %VEC(0), (%rdi)

> > > >         VZEROUPPER_RETURN

> > > >

> > > > +       .p2align 4

> > > >  L(stosb_more_2x_vec):

> > > >         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP

> > > >         ja      L(stosb)

> > > > +#else

> > > > +       .p2align 4

> > > >  #endif

> > > >  L(more_2x_vec):

> > > > -       cmpq  $(VEC_SIZE * 4), %rdx

> > > > -       ja      L(loop_start)

> > > > +       /* Stores to first 2x VEC before cmp as any path forward will

> > > > +          require it.  */

> > > >         VMOVU   %VEC(0), (%rdi)

> > > >         VMOVU   %VEC(0), VEC_SIZE(%rdi)

> > > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> > > > +       cmpq    $(VEC_SIZE * 4), %rdx

> > > > +       ja      L(loop_start)

> > > >         VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

> > > > +       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> > > >  L(return):

> > > >  #if VEC_SIZE > 16

> > > >         ZERO_UPPER_VEC_REGISTERS_RETURN

> > > > @@ -192,28 +197,29 @@ L(return):

> > > >  #endif

> > > >

> > > >  L(loop_start):

> > > > -       leaq    (VEC_SIZE * 4)(%rdi), %rcx

> > > > -       VMOVU   %VEC(0), (%rdi)

> > > > -       andq    $-(VEC_SIZE * 4), %rcx

> > > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> > > > -       VMOVU   %VEC(0), VEC_SIZE(%rdi)

> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

> > > >         VMOVU   %VEC(0), (VEC_SIZE * 2)(%rdi)

> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx)

> > > >         VMOVU   %VEC(0), (VEC_SIZE * 3)(%rdi)

> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx)

> > > > -       addq    %rdi, %rdx

> > > > -       andq    $-(VEC_SIZE * 4), %rdx

> > > > -       cmpq    %rdx, %rcx

> > > > -       je      L(return)

> > > > +       cmpq    $(VEC_SIZE * 8), %rdx

> > > > +       jbe     L(loop_end)

> > > > +       andq    $-(VEC_SIZE * 2), %rdi

> > > > +       subq    $-(VEC_SIZE * 4), %rdi

> > > > +       leaq    -(VEC_SIZE * 4)(%rax, %rdx), %rcx


If this overflows loop will exit first iteration. Is that an issue?

If so the following commits from me have the same bug:
https://sourceware.org/git/?p=glibc.git;a=commit;h=6abf27980a947f9b6e514d6b33b83059d39566ae
https://sourceware.org/git/?p=glibc.git;a=commit;h=4ad473e97acdc5f6d811755b67c09f2128a644ce
https://sourceware.org/git/?p=glibc.git;a=commit;h=16d12015c57701b08d7bbed6ec536641bcafb428

>


> > > +       .p2align 4

> > > >  L(loop):

> > > > -       VMOVA   %VEC(0), (%rcx)

> > > > -       VMOVA   %VEC(0), VEC_SIZE(%rcx)

> > > > -       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rcx)

> > > > -       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rcx)

> > > > -       addq    $(VEC_SIZE * 4), %rcx

> > > > -       cmpq    %rcx, %rdx

> > > > -       jne     L(loop)

> > > > +       VMOVA   %VEC(0), (%rdi)

> > > > +       VMOVA   %VEC(0), VEC_SIZE(%rdi)

> > > > +       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rdi)

> > > > +       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rdi)

> > > > +       subq    $-(VEC_SIZE * 4), %rdi

> > > > +       cmpq    %rcx, %rdi

> > > > +       jb      L(loop)


Issue because %rdi will not be below %rcx here.

> > > > +L(loop_end):

> > > > +       /* NB: rax is set as ptr in

> MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.

> > > > +              rdx as length is also unchanged.  */

> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx)

> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx)

> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx)

> > > > +       VMOVU   %VEC(0), -VEC_SIZE(%rax, %rdx)

> > > >         VZEROUPPER_SHORT_RETURN

> > > >

> > > >         .p2align 4

> > > > --

> > > > 2.25.1

> > > >

> > >

> > > LGTM.

> >

> > Awesome!

> >

> > For future patches do you prefer performance numbers like this or

> > raw text? Or some other alternative?

>

> The current data format is fine.   Thanks.

>

> > >

> > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

> > >

> > > Thanks.

> > >

> > > --

> > > H.J.

>

>

>

> --

> H.J.

>
Florian Weimer via Libc-alpha June 7, 2021, 2:47 a.m. | #6
On Sun, Jun 6, 2021 at 7:35 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>

>

>

> On Thu, May 20, 2021 at 4:45 PM H.J. Lu <hjl.tools@gmail.com> wrote:

>>

>> On Thu, May 20, 2021 at 1:03 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:

>> >

>> > On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com> wrote:

>> > >

>> > > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:

>> > > >

>> > > > No bug. This commit makes a few small improvements to

>> > > > memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64

>> > > > instead of 128. Either alignment will perform equally well in a loop

>> > > > and 128 just increases the odds of having to do an extra iteration

>> > > > which can be significant overhead for small values. 2) Align some

>> > > > targets and the loop. 3) Remove an ALU from the alignment process. 4)

>> > > > Reorder the last 4x VEC so that they are stored after the loop. 5)

>> > > > Move the condition for leq 8x VEC to before the alignment

>> > > > process. test-memset and test-wmemset are both passing.

>> > > >

>> > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>

>> > > > ---

>> > > > Tests where run on the following CPUs:

>> > > >

>> > > > Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

>> > > >

>> > > > Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html

>> > > >

>> > > > Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

>> > > >

>> > > > All times are the geometric mean of N=50. The unit of time is

>> > > > seconds.

>> > > >

>> > > > "Cur" refers to the current implementation

>> > > > "New" refers to this patches implementation

>> > > >

>> > > > Performance data attached in memset-data.pdf

>> > > >

>> > > > Some notes on the numbers:

>> > > >

>> > > > I only included numbers for the proper VEC_SIZE for the corresponding

>> > > > cpu.

>> > > >

>> > > > skl -> avx2

>> > > > icl -> evex

>> > > > tgl -> evex

>> > > >

>> > > > The changes only affect sizes > 2 * VEC_SIZE. The performance

>> > > > differences in the size <= 2 * VEC_SIZE come from changes in alignment

>> > > > after linking (i.e ENTRY aligns to 16, but performance can be affected

>> > > > by alignment % 64 or alignment % 4096) and generally affects

>> > > > throughput only, not latency (i.e with an lfence to the benchmark loop

>> > > > the deviations go away). Generally I think they can be ignored (both

>> > > > positive and negative affects).

>> > > >

>> > > > The interesting part of the data is in the medium size range [128,

>> > > > 1024] where the new implementation has a reasonable speedup. This is

>> > > > especially pronounced when the more conservative alignment saves a

>> > > > full loop iteration. The only significant exception is

>> > > > skylake-avx2-erms case for size = 416, alignment = 416 where the

>> > > > current implementation is meaningfully faster. I am unsure of the root

>> > > > cause for this. The skylake-avx2 case only performs a bit worse in

>> > > > this case which makes me think part of it is code alignment related,

>> > > > though comparative to the speedup in other size/alignment

>> > > > configurations it is still a trough.  Despite this, I still think the

>> > > > numbers are overall an improvement.

>> > > >

>> > > > As well due to aligning the loop (and possibly slightly more efficient

>> > > > DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop

>> > > > with subq -4 * VEC_SIZE) in the non-erms cases there is often a slight

>> > > > improvement to the main loop for large sizes.

>> > > >

>> > > >  .../multiarch/memset-vec-unaligned-erms.S     | 50 +++++++++++--------

>> > > >  1 file changed, 28 insertions(+), 22 deletions(-)

>> > > >

>> > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

>> > > > index 08cfa49bd1..ff196844a0 100644

>> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

>> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

>> > > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))

>> > > >         VMOVU   %VEC(0), (%rdi)

>> > > >         VZEROUPPER_RETURN

>> > > >

>> > > > +       .p2align 4

>> > > >  L(stosb_more_2x_vec):

>> > > >         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP

>> > > >         ja      L(stosb)

>> > > > +#else

>> > > > +       .p2align 4

>> > > >  #endif

>> > > >  L(more_2x_vec):

>> > > > -       cmpq  $(VEC_SIZE * 4), %rdx

>> > > > -       ja      L(loop_start)

>> > > > +       /* Stores to first 2x VEC before cmp as any path forward will

>> > > > +          require it.  */

>> > > >         VMOVU   %VEC(0), (%rdi)

>> > > >         VMOVU   %VEC(0), VEC_SIZE(%rdi)

>> > > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

>> > > > +       cmpq    $(VEC_SIZE * 4), %rdx

>> > > > +       ja      L(loop_start)

>> > > >         VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

>> > > > +       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

>> > > >  L(return):

>> > > >  #if VEC_SIZE > 16

>> > > >         ZERO_UPPER_VEC_REGISTERS_RETURN

>> > > > @@ -192,28 +197,29 @@ L(return):

>> > > >  #endif

>> > > >

>> > > >  L(loop_start):

>> > > > -       leaq    (VEC_SIZE * 4)(%rdi), %rcx

>> > > > -       VMOVU   %VEC(0), (%rdi)

>> > > > -       andq    $-(VEC_SIZE * 4), %rcx

>> > > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

>> > > > -       VMOVU   %VEC(0), VEC_SIZE(%rdi)

>> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

>> > > >         VMOVU   %VEC(0), (VEC_SIZE * 2)(%rdi)

>> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx)

>> > > >         VMOVU   %VEC(0), (VEC_SIZE * 3)(%rdi)

>> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx)

>> > > > -       addq    %rdi, %rdx

>> > > > -       andq    $-(VEC_SIZE * 4), %rdx

>> > > > -       cmpq    %rdx, %rcx

>> > > > -       je      L(return)

>> > > > +       cmpq    $(VEC_SIZE * 8), %rdx

>> > > > +       jbe     L(loop_end)

>> > > > +       andq    $-(VEC_SIZE * 2), %rdi

>> > > > +       subq    $-(VEC_SIZE * 4), %rdi

>> > > > +       leaq    -(VEC_SIZE * 4)(%rax, %rdx), %rcx

>

> If this overflows loop will exit first iteration. Is that an issue?


Please do following:

1.  Update memset assembly codes with
        check conditions for underwrite/overwrite.
        if true then branch to the HLT instruction.
2.  Update and run memset test to verify the test coverage for the condition.
3.  Update memset assembly codes to cover such conditions.

Thanks.

> If so the following commits from me have the same bug:

> https://sourceware.org/git/?p=glibc.git;a=commit;h=6abf27980a947f9b6e514d6b33b83059d39566ae

> https://sourceware.org/git/?p=glibc.git;a=commit;h=4ad473e97acdc5f6d811755b67c09f2128a644ce

> https://sourceware.org/git/?p=glibc.git;a=commit;h=16d12015c57701b08d7bbed6ec536641bcafb428

>>

>>

>>

>> > > > +       .p2align 4

>> > > >  L(loop):

>> > > > -       VMOVA   %VEC(0), (%rcx)

>> > > > -       VMOVA   %VEC(0), VEC_SIZE(%rcx)

>> > > > -       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rcx)

>> > > > -       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rcx)

>> > > > -       addq    $(VEC_SIZE * 4), %rcx

>> > > > -       cmpq    %rcx, %rdx

>> > > > -       jne     L(loop)

>> > > > +       VMOVA   %VEC(0), (%rdi)

>> > > > +       VMOVA   %VEC(0), VEC_SIZE(%rdi)

>> > > > +       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rdi)

>> > > > +       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rdi)

>> > > > +       subq    $-(VEC_SIZE * 4), %rdi

>> > > > +       cmpq    %rcx, %rdi

>> > > > +       jb      L(loop)

>

> Issue because %rdi will not be below %rcx here.

>>

>> > > > +L(loop_end):

>> > > > +       /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.

>> > > > +              rdx as length is also unchanged.  */

>> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx)

>> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx)

>> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx)

>> > > > +       VMOVU   %VEC(0), -VEC_SIZE(%rax, %rdx)

>> > > >         VZEROUPPER_SHORT_RETURN

>> > > >

>> > > >         .p2align 4

>> > > > --

>> > > > 2.25.1

>> > > >

>> > >

>> > > LGTM.

>> >

>> > Awesome!

>> >

>> > For future patches do you prefer performance numbers like this or

>> > raw text? Or some other alternative?

>>

>> The current data format is fine.   Thanks.

>>

>> > >

>> > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

>> > >

>> > > Thanks.

>> > >

>> > > --

>> > > H.J.

>>

>>

>>

>> --

>> H.J.




-- 
H.J.
Florian Weimer via Libc-alpha June 7, 2021, 3:05 a.m. | #7
On Sun, Jun 6, 2021 at 10:48 PM H.J. Lu <hjl.tools@gmail.com> wrote:

> On Sun, Jun 6, 2021 at 7:35 PM Noah Goldstein <goldstein.w.n@gmail.com>

> wrote:

> >

> >

> >

> > On Thu, May 20, 2021 at 4:45 PM H.J. Lu <hjl.tools@gmail.com> wrote:

> >>

> >> On Thu, May 20, 2021 at 1:03 PM Noah Goldstein <goldstein.w.n@gmail.com>

> wrote:

> >> >

> >> > On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com> wrote:

> >> > >

> >> > > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein <

> goldstein.w.n@gmail.com> wrote:

> >> > > >

> >> > > > No bug. This commit makes a few small improvements to

> >> > > > memset-vec-unaligned-erms.S. The changes are 1) only aligning to

> 64

> >> > > > instead of 128. Either alignment will perform equally well in a

> loop

> >> > > > and 128 just increases the odds of having to do an extra iteration

> >> > > > which can be significant overhead for small values. 2) Align some

> >> > > > targets and the loop. 3) Remove an ALU from the alignment

> process. 4)

> >> > > > Reorder the last 4x VEC so that they are stored after the loop. 5)

> >> > > > Move the condition for leq 8x VEC to before the alignment

> >> > > > process. test-memset and test-wmemset are both passing.

> >> > > >

> >> > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>

> >> > > > ---

> >> > > > Tests where run on the following CPUs:

> >> > > >

> >> > > > Skylake:

> https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

> >> > > >

> >> > > > Icelake:

> https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html

> >> > > >

> >> > > > Tigerlake:

> https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

> >> > > >

> >> > > > All times are the geometric mean of N=50. The unit of time is

> >> > > > seconds.

> >> > > >

> >> > > > "Cur" refers to the current implementation

> >> > > > "New" refers to this patches implementation

> >> > > >

> >> > > > Performance data attached in memset-data.pdf

> >> > > >

> >> > > > Some notes on the numbers:

> >> > > >

> >> > > > I only included numbers for the proper VEC_SIZE for the

> corresponding

> >> > > > cpu.

> >> > > >

> >> > > > skl -> avx2

> >> > > > icl -> evex

> >> > > > tgl -> evex

> >> > > >

> >> > > > The changes only affect sizes > 2 * VEC_SIZE. The performance

> >> > > > differences in the size <= 2 * VEC_SIZE come from changes in

> alignment

> >> > > > after linking (i.e ENTRY aligns to 16, but performance can be

> affected

> >> > > > by alignment % 64 or alignment % 4096) and generally affects

> >> > > > throughput only, not latency (i.e with an lfence to the benchmark

> loop

> >> > > > the deviations go away). Generally I think they can be ignored

> (both

> >> > > > positive and negative affects).

> >> > > >

> >> > > > The interesting part of the data is in the medium size range [128,

> >> > > > 1024] where the new implementation has a reasonable speedup. This

> is

> >> > > > especially pronounced when the more conservative alignment saves a

> >> > > > full loop iteration. The only significant exception is

> >> > > > skylake-avx2-erms case for size = 416, alignment = 416 where the

> >> > > > current implementation is meaningfully faster. I am unsure of the

> root

> >> > > > cause for this. The skylake-avx2 case only performs a bit worse in

> >> > > > this case which makes me think part of it is code alignment

> related,

> >> > > > though comparative to the speedup in other size/alignment

> >> > > > configurations it is still a trough.  Despite this, I still think

> the

> >> > > > numbers are overall an improvement.

> >> > > >

> >> > > > As well due to aligning the loop (and possibly slightly more

> efficient

> >> > > > DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop

> >> > > > with subq -4 * VEC_SIZE) in the non-erms cases there is often a

> slight

> >> > > > improvement to the main loop for large sizes.

> >> > > >

> >> > > >  .../multiarch/memset-vec-unaligned-erms.S     | 50

> +++++++++++--------

> >> > > >  1 file changed, 28 insertions(+), 22 deletions(-)

> >> > > >

> >> > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> >> > > > index 08cfa49bd1..ff196844a0 100644

> >> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> >> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> >> > > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset,

> unaligned_erms))

> >> > > >         VMOVU   %VEC(0), (%rdi)

> >> > > >         VZEROUPPER_RETURN

> >> > > >

> >> > > > +       .p2align 4

> >> > > >  L(stosb_more_2x_vec):

> >> > > >         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP

> >> > > >         ja      L(stosb)

> >> > > > +#else

> >> > > > +       .p2align 4

> >> > > >  #endif

> >> > > >  L(more_2x_vec):

> >> > > > -       cmpq  $(VEC_SIZE * 4), %rdx

> >> > > > -       ja      L(loop_start)

> >> > > > +       /* Stores to first 2x VEC before cmp as any path forward

> will

> >> > > > +          require it.  */

> >> > > >         VMOVU   %VEC(0), (%rdi)

> >> > > >         VMOVU   %VEC(0), VEC_SIZE(%rdi)

> >> > > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> >> > > > +       cmpq    $(VEC_SIZE * 4), %rdx

> >> > > > +       ja      L(loop_start)

> >> > > >         VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

> >> > > > +       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> >> > > >  L(return):

> >> > > >  #if VEC_SIZE > 16

> >> > > >         ZERO_UPPER_VEC_REGISTERS_RETURN

> >> > > > @@ -192,28 +197,29 @@ L(return):

> >> > > >  #endif

> >> > > >

> >> > > >  L(loop_start):

> >> > > > -       leaq    (VEC_SIZE * 4)(%rdi), %rcx

> >> > > > -       VMOVU   %VEC(0), (%rdi)

> >> > > > -       andq    $-(VEC_SIZE * 4), %rcx

> >> > > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> >> > > > -       VMOVU   %VEC(0), VEC_SIZE(%rdi)

> >> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

> >> > > >         VMOVU   %VEC(0), (VEC_SIZE * 2)(%rdi)

> >> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx)

> >> > > >         VMOVU   %VEC(0), (VEC_SIZE * 3)(%rdi)

> >> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx)

> >> > > > -       addq    %rdi, %rdx

> >> > > > -       andq    $-(VEC_SIZE * 4), %rdx

> >> > > > -       cmpq    %rdx, %rcx

> >> > > > -       je      L(return)

> >> > > > +       cmpq    $(VEC_SIZE * 8), %rdx

> >> > > > +       jbe     L(loop_end)

> >> > > > +       andq    $-(VEC_SIZE * 2), %rdi

> >> > > > +       subq    $-(VEC_SIZE * 4), %rdi

> >> > > > +       leaq    -(VEC_SIZE * 4)(%rax, %rdx), %rcx

> >

> > If this overflows loop will exit first iteration. Is that an issue?

>

> Please do following:

>

> 1.  Update memset assembly codes with

>         check conditions for underwrite/overwrite.

>         if true then branch to the HLT instruction.

>

codes? Just this file or others as well?

> 2.  Update and run memset test to verify the test coverage for the

> condition.

>

What is the desired result? Segfault?

> 3.  Update memset assembly codes to cover such conditions.

>

Memcmp as well? What about wmemset? Currently (and with previous versions
as well)
a value like 2^63 would cause similar behavior after the `salq $2, %rdx`

>

> Thanks.

>

> > If so the following commits from me have the same bug:

> >

> https://sourceware.org/git/?p=glibc.git;a=commit;h=6abf27980a947f9b6e514d6b33b83059d39566ae

> >

> https://sourceware.org/git/?p=glibc.git;a=commit;h=4ad473e97acdc5f6d811755b67c09f2128a644ce

> >

> https://sourceware.org/git/?p=glibc.git;a=commit;h=16d12015c57701b08d7bbed6ec536641bcafb428

> >>

> >>

> >>

> >> > > > +       .p2align 4

> >> > > >  L(loop):

> >> > > > -       VMOVA   %VEC(0), (%rcx)

> >> > > > -       VMOVA   %VEC(0), VEC_SIZE(%rcx)

> >> > > > -       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rcx)

> >> > > > -       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rcx)

> >> > > > -       addq    $(VEC_SIZE * 4), %rcx

> >> > > > -       cmpq    %rcx, %rdx

> >> > > > -       jne     L(loop)

> >> > > > +       VMOVA   %VEC(0), (%rdi)

> >> > > > +       VMOVA   %VEC(0), VEC_SIZE(%rdi)

> >> > > > +       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rdi)

> >> > > > +       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rdi)

> >> > > > +       subq    $-(VEC_SIZE * 4), %rdi

> >> > > > +       cmpq    %rcx, %rdi

> >> > > > +       jb      L(loop)

> >

> > Issue because %rdi will not be below %rcx here.

> >>

> >> > > > +L(loop_end):

> >> > > > +       /* NB: rax is set as ptr in

> MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.

> >> > > > +              rdx as length is also unchanged.  */

> >> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx)

> >> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx)

> >> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx)

> >> > > > +       VMOVU   %VEC(0), -VEC_SIZE(%rax, %rdx)

> >> > > >         VZEROUPPER_SHORT_RETURN

> >> > > >

> >> > > >         .p2align 4

> >> > > > --

> >> > > > 2.25.1

> >> > > >

> >> > >

> >> > > LGTM.

> >> >

> >> > Awesome!

> >> >

> >> > For future patches do you prefer performance numbers like this or

> >> > raw text? Or some other alternative?

> >>

> >> The current data format is fine.   Thanks.

> >>

> >> > >

> >> > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

> >> > >

> >> > > Thanks.

> >> > >

> >> > > --

> >> > > H.J.

> >>

> >>

> >>

> >> --

> >> H.J.

>

>

>

> --

> H.J.

>
Florian Weimer via Libc-alpha June 7, 2021, 3:20 a.m. | #8
On Sun, Jun 6, 2021 at 8:06 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>

>

>

> On Sun, Jun 6, 2021 at 10:48 PM H.J. Lu <hjl.tools@gmail.com> wrote:

>>

>> On Sun, Jun 6, 2021 at 7:35 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:

>> >

>> >

>> >

>> > On Thu, May 20, 2021 at 4:45 PM H.J. Lu <hjl.tools@gmail.com> wrote:

>> >>

>> >> On Thu, May 20, 2021 at 1:03 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:

>> >> >

>> >> > On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com> wrote:

>> >> > >

>> >> > > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:

>> >> > > >

>> >> > > > No bug. This commit makes a few small improvements to

>> >> > > > memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64

>> >> > > > instead of 128. Either alignment will perform equally well in a loop

>> >> > > > and 128 just increases the odds of having to do an extra iteration

>> >> > > > which can be significant overhead for small values. 2) Align some

>> >> > > > targets and the loop. 3) Remove an ALU from the alignment process. 4)

>> >> > > > Reorder the last 4x VEC so that they are stored after the loop. 5)

>> >> > > > Move the condition for leq 8x VEC to before the alignment

>> >> > > > process. test-memset and test-wmemset are both passing.

>> >> > > >

>> >> > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>

>> >> > > > ---

>> >> > > > Tests where run on the following CPUs:

>> >> > > >

>> >> > > > Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

>> >> > > >

>> >> > > > Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html

>> >> > > >

>> >> > > > Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

>> >> > > >

>> >> > > > All times are the geometric mean of N=50. The unit of time is

>> >> > > > seconds.

>> >> > > >

>> >> > > > "Cur" refers to the current implementation

>> >> > > > "New" refers to this patches implementation

>> >> > > >

>> >> > > > Performance data attached in memset-data.pdf

>> >> > > >

>> >> > > > Some notes on the numbers:

>> >> > > >

>> >> > > > I only included numbers for the proper VEC_SIZE for the corresponding

>> >> > > > cpu.

>> >> > > >

>> >> > > > skl -> avx2

>> >> > > > icl -> evex

>> >> > > > tgl -> evex

>> >> > > >

>> >> > > > The changes only affect sizes > 2 * VEC_SIZE. The performance

>> >> > > > differences in the size <= 2 * VEC_SIZE come from changes in alignment

>> >> > > > after linking (i.e ENTRY aligns to 16, but performance can be affected

>> >> > > > by alignment % 64 or alignment % 4096) and generally affects

>> >> > > > throughput only, not latency (i.e with an lfence to the benchmark loop

>> >> > > > the deviations go away). Generally I think they can be ignored (both

>> >> > > > positive and negative affects).

>> >> > > >

>> >> > > > The interesting part of the data is in the medium size range [128,

>> >> > > > 1024] where the new implementation has a reasonable speedup. This is

>> >> > > > especially pronounced when the more conservative alignment saves a

>> >> > > > full loop iteration. The only significant exception is

>> >> > > > skylake-avx2-erms case for size = 416, alignment = 416 where the

>> >> > > > current implementation is meaningfully faster. I am unsure of the root

>> >> > > > cause for this. The skylake-avx2 case only performs a bit worse in

>> >> > > > this case which makes me think part of it is code alignment related,

>> >> > > > though comparative to the speedup in other size/alignment

>> >> > > > configurations it is still a trough.  Despite this, I still think the

>> >> > > > numbers are overall an improvement.

>> >> > > >

>> >> > > > As well due to aligning the loop (and possibly slightly more efficient

>> >> > > > DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop

>> >> > > > with subq -4 * VEC_SIZE) in the non-erms cases there is often a slight

>> >> > > > improvement to the main loop for large sizes.

>> >> > > >

>> >> > > >  .../multiarch/memset-vec-unaligned-erms.S     | 50 +++++++++++--------

>> >> > > >  1 file changed, 28 insertions(+), 22 deletions(-)

>> >> > > >

>> >> > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

>> >> > > > index 08cfa49bd1..ff196844a0 100644

>> >> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

>> >> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

>> >> > > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))

>> >> > > >         VMOVU   %VEC(0), (%rdi)

>> >> > > >         VZEROUPPER_RETURN

>> >> > > >

>> >> > > > +       .p2align 4

>> >> > > >  L(stosb_more_2x_vec):

>> >> > > >         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP

>> >> > > >         ja      L(stosb)

>> >> > > > +#else

>> >> > > > +       .p2align 4

>> >> > > >  #endif

>> >> > > >  L(more_2x_vec):

>> >> > > > -       cmpq  $(VEC_SIZE * 4), %rdx

>> >> > > > -       ja      L(loop_start)

>> >> > > > +       /* Stores to first 2x VEC before cmp as any path forward will

>> >> > > > +          require it.  */

>> >> > > >         VMOVU   %VEC(0), (%rdi)

>> >> > > >         VMOVU   %VEC(0), VEC_SIZE(%rdi)

>> >> > > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

>> >> > > > +       cmpq    $(VEC_SIZE * 4), %rdx

>> >> > > > +       ja      L(loop_start)

>> >> > > >         VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

>> >> > > > +       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

>> >> > > >  L(return):

>> >> > > >  #if VEC_SIZE > 16

>> >> > > >         ZERO_UPPER_VEC_REGISTERS_RETURN

>> >> > > > @@ -192,28 +197,29 @@ L(return):

>> >> > > >  #endif

>> >> > > >

>> >> > > >  L(loop_start):

>> >> > > > -       leaq    (VEC_SIZE * 4)(%rdi), %rcx

>> >> > > > -       VMOVU   %VEC(0), (%rdi)

>> >> > > > -       andq    $-(VEC_SIZE * 4), %rcx

>> >> > > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

>> >> > > > -       VMOVU   %VEC(0), VEC_SIZE(%rdi)

>> >> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

>> >> > > >         VMOVU   %VEC(0), (VEC_SIZE * 2)(%rdi)

>> >> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx)

>> >> > > >         VMOVU   %VEC(0), (VEC_SIZE * 3)(%rdi)

>> >> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx)

>> >> > > > -       addq    %rdi, %rdx

>> >> > > > -       andq    $-(VEC_SIZE * 4), %rdx

>> >> > > > -       cmpq    %rdx, %rcx

>> >> > > > -       je      L(return)

>> >> > > > +       cmpq    $(VEC_SIZE * 8), %rdx

>> >> > > > +       jbe     L(loop_end)

>> >> > > > +       andq    $-(VEC_SIZE * 2), %rdi

>> >> > > > +       subq    $-(VEC_SIZE * 4), %rdi

>> >> > > > +       leaq    -(VEC_SIZE * 4)(%rax, %rdx), %rcx

>> >

>> > If this overflows loop will exit first iteration. Is that an issue?

>>

>> Please do following:

>>

>> 1.  Update memset assembly codes with

>>         check conditions for underwrite/overwrite.

>>         if true then branch to the HLT instruction.

>

> codes? Just this file or others as well?


All string/memory functions should work for all valid inputs.

>>

>> 2.  Update and run memset test to verify the test coverage for the condition.

>

> What is the desired result? Segfault?


For invalid inputs, anything can happen, including segfault.

>>

>> 3.  Update memset assembly codes to cover such conditions.

>

> Memcmp as well? What about wmemset? Currently (and with previous versions as well)

> a value like 2^63 would cause similar behavior after the `salq $2, %rdx`


Is this condition a valid input?  If not, there is nothing to do.

>>

>>

>> Thanks.

>>

>> > If so the following commits from me have the same bug:

>> > https://sourceware.org/git/?p=glibc.git;a=commit;h=6abf27980a947f9b6e514d6b33b83059d39566ae

>> > https://sourceware.org/git/?p=glibc.git;a=commit;h=4ad473e97acdc5f6d811755b67c09f2128a644ce

>> > https://sourceware.org/git/?p=glibc.git;a=commit;h=16d12015c57701b08d7bbed6ec536641bcafb428

>> >>

>> >>

>> >>

>> >> > > > +       .p2align 4

>> >> > > >  L(loop):

>> >> > > > -       VMOVA   %VEC(0), (%rcx)

>> >> > > > -       VMOVA   %VEC(0), VEC_SIZE(%rcx)

>> >> > > > -       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rcx)

>> >> > > > -       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rcx)

>> >> > > > -       addq    $(VEC_SIZE * 4), %rcx

>> >> > > > -       cmpq    %rcx, %rdx

>> >> > > > -       jne     L(loop)

>> >> > > > +       VMOVA   %VEC(0), (%rdi)

>> >> > > > +       VMOVA   %VEC(0), VEC_SIZE(%rdi)

>> >> > > > +       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rdi)

>> >> > > > +       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rdi)

>> >> > > > +       subq    $-(VEC_SIZE * 4), %rdi

>> >> > > > +       cmpq    %rcx, %rdi

>> >> > > > +       jb      L(loop)

>> >

>> > Issue because %rdi will not be below %rcx here.

>> >>

>> >> > > > +L(loop_end):

>> >> > > > +       /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.

>> >> > > > +              rdx as length is also unchanged.  */

>> >> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx)

>> >> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx)

>> >> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx)

>> >> > > > +       VMOVU   %VEC(0), -VEC_SIZE(%rax, %rdx)

>> >> > > >         VZEROUPPER_SHORT_RETURN

>> >> > > >

>> >> > > >         .p2align 4

>> >> > > > --

>> >> > > > 2.25.1

>> >> > > >

>> >> > >

>> >> > > LGTM.

>> >> >

>> >> > Awesome!

>> >> >

>> >> > For future patches do you prefer performance numbers like this or

>> >> > raw text? Or some other alternative?

>> >>

>> >> The current data format is fine.   Thanks.

>> >>

>> >> > >

>> >> > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

>> >> > >

>> >> > > Thanks.

>> >> > >

>> >> > > --

>> >> > > H.J.

>> >>

>> >>

>> >>

>> >> --

>> >> H.J.

>>

>>

>>

>> --

>> H.J.




-- 
H.J.
Florian Weimer via Libc-alpha June 7, 2021, 3:38 a.m. | #9
On Sun, Jun 6, 2021 at 11:21 PM H.J. Lu <hjl.tools@gmail.com> wrote:

> On Sun, Jun 6, 2021 at 8:06 PM Noah Goldstein <goldstein.w.n@gmail.com>

> wrote:

> >

> >

> >

> > On Sun, Jun 6, 2021 at 10:48 PM H.J. Lu <hjl.tools@gmail.com> wrote:

> >>

> >> On Sun, Jun 6, 2021 at 7:35 PM Noah Goldstein <goldstein.w.n@gmail.com>

> wrote:

> >> >

> >> >

> >> >

> >> > On Thu, May 20, 2021 at 4:45 PM H.J. Lu <hjl.tools@gmail.com> wrote:

> >> >>

> >> >> On Thu, May 20, 2021 at 1:03 PM Noah Goldstein <

> goldstein.w.n@gmail.com> wrote:

> >> >> >

> >> >> > On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com>

> wrote:

> >> >> > >

> >> >> > > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein <

> goldstein.w.n@gmail.com> wrote:

> >> >> > > >

> >> >> > > > No bug. This commit makes a few small improvements to

> >> >> > > > memset-vec-unaligned-erms.S. The changes are 1) only aligning

> to 64

> >> >> > > > instead of 128. Either alignment will perform equally well in

> a loop

> >> >> > > > and 128 just increases the odds of having to do an extra

> iteration

> >> >> > > > which can be significant overhead for small values. 2) Align

> some

> >> >> > > > targets and the loop. 3) Remove an ALU from the alignment

> process. 4)

> >> >> > > > Reorder the last 4x VEC so that they are stored after the

> loop. 5)

> >> >> > > > Move the condition for leq 8x VEC to before the alignment

> >> >> > > > process. test-memset and test-wmemset are both passing.

> >> >> > > >

> >> >> > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>

> >> >> > > > ---

> >> >> > > > Tests where run on the following CPUs:

> >> >> > > >

> >> >> > > > Skylake:

> https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

> >> >> > > >

> >> >> > > > Icelake:

> https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html

> >> >> > > >

> >> >> > > > Tigerlake:

> https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

> >> >> > > >

> >> >> > > > All times are the geometric mean of N=50. The unit of time is

> >> >> > > > seconds.

> >> >> > > >

> >> >> > > > "Cur" refers to the current implementation

> >> >> > > > "New" refers to this patches implementation

> >> >> > > >

> >> >> > > > Performance data attached in memset-data.pdf

> >> >> > > >

> >> >> > > > Some notes on the numbers:

> >> >> > > >

> >> >> > > > I only included numbers for the proper VEC_SIZE for the

> corresponding

> >> >> > > > cpu.

> >> >> > > >

> >> >> > > > skl -> avx2

> >> >> > > > icl -> evex

> >> >> > > > tgl -> evex

> >> >> > > >

> >> >> > > > The changes only affect sizes > 2 * VEC_SIZE. The performance

> >> >> > > > differences in the size <= 2 * VEC_SIZE come from changes in

> alignment

> >> >> > > > after linking (i.e ENTRY aligns to 16, but performance can be

> affected

> >> >> > > > by alignment % 64 or alignment % 4096) and generally affects

> >> >> > > > throughput only, not latency (i.e with an lfence to the

> benchmark loop

> >> >> > > > the deviations go away). Generally I think they can be ignored

> (both

> >> >> > > > positive and negative affects).

> >> >> > > >

> >> >> > > > The interesting part of the data is in the medium size range

> [128,

> >> >> > > > 1024] where the new implementation has a reasonable speedup.

> This is

> >> >> > > > especially pronounced when the more conservative alignment

> saves a

> >> >> > > > full loop iteration. The only significant exception is

> >> >> > > > skylake-avx2-erms case for size = 416, alignment = 416 where

> the

> >> >> > > > current implementation is meaningfully faster. I am unsure of

> the root

> >> >> > > > cause for this. The skylake-avx2 case only performs a bit

> worse in

> >> >> > > > this case which makes me think part of it is code alignment

> related,

> >> >> > > > though comparative to the speedup in other size/alignment

> >> >> > > > configurations it is still a trough.  Despite this, I still

> think the

> >> >> > > > numbers are overall an improvement.

> >> >> > > >

> >> >> > > > As well due to aligning the loop (and possibly slightly more

> efficient

> >> >> > > > DSB behavior with the replacement of addq 4 * VEC_SIZE in the

> loop

> >> >> > > > with subq -4 * VEC_SIZE) in the non-erms cases there is often

> a slight

> >> >> > > > improvement to the main loop for large sizes.

> >> >> > > >

> >> >> > > >  .../multiarch/memset-vec-unaligned-erms.S     | 50

> +++++++++++--------

> >> >> > > >  1 file changed, 28 insertions(+), 22 deletions(-)

> >> >> > > >

> >> >> > > > diff --git

> a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> >> >> > > > index 08cfa49bd1..ff196844a0 100644

> >> >> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> >> >> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> >> >> > > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset,

> unaligned_erms))

> >> >> > > >         VMOVU   %VEC(0), (%rdi)

> >> >> > > >         VZEROUPPER_RETURN

> >> >> > > >

> >> >> > > > +       .p2align 4

> >> >> > > >  L(stosb_more_2x_vec):

> >> >> > > >         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP

> >> >> > > >         ja      L(stosb)

> >> >> > > > +#else

> >> >> > > > +       .p2align 4

> >> >> > > >  #endif

> >> >> > > >  L(more_2x_vec):

> >> >> > > > -       cmpq  $(VEC_SIZE * 4), %rdx

> >> >> > > > -       ja      L(loop_start)

> >> >> > > > +       /* Stores to first 2x VEC before cmp as any path

> forward will

> >> >> > > > +          require it.  */

> >> >> > > >         VMOVU   %VEC(0), (%rdi)

> >> >> > > >         VMOVU   %VEC(0), VEC_SIZE(%rdi)

> >> >> > > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> >> >> > > > +       cmpq    $(VEC_SIZE * 4), %rdx

> >> >> > > > +       ja      L(loop_start)

> >> >> > > >         VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

> >> >> > > > +       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> >> >> > > >  L(return):

> >> >> > > >  #if VEC_SIZE > 16

> >> >> > > >         ZERO_UPPER_VEC_REGISTERS_RETURN

> >> >> > > > @@ -192,28 +197,29 @@ L(return):

> >> >> > > >  #endif

> >> >> > > >

> >> >> > > >  L(loop_start):

> >> >> > > > -       leaq    (VEC_SIZE * 4)(%rdi), %rcx

> >> >> > > > -       VMOVU   %VEC(0), (%rdi)

> >> >> > > > -       andq    $-(VEC_SIZE * 4), %rcx

> >> >> > > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> >> >> > > > -       VMOVU   %VEC(0), VEC_SIZE(%rdi)

> >> >> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

> >> >> > > >         VMOVU   %VEC(0), (VEC_SIZE * 2)(%rdi)

> >> >> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx)

> >> >> > > >         VMOVU   %VEC(0), (VEC_SIZE * 3)(%rdi)

> >> >> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx)

> >> >> > > > -       addq    %rdi, %rdx

> >> >> > > > -       andq    $-(VEC_SIZE * 4), %rdx

> >> >> > > > -       cmpq    %rdx, %rcx

> >> >> > > > -       je      L(return)

> >> >> > > > +       cmpq    $(VEC_SIZE * 8), %rdx

> >> >> > > > +       jbe     L(loop_end)

> >> >> > > > +       andq    $-(VEC_SIZE * 2), %rdi

> >> >> > > > +       subq    $-(VEC_SIZE * 4), %rdi

> >> >> > > > +       leaq    -(VEC_SIZE * 4)(%rax, %rdx), %rcx

> >> >

> >> > If this overflows loop will exit first iteration. Is that an issue?

> >>

> >> Please do following:

> >>

> >> 1.  Update memset assembly codes with

> >>         check conditions for underwrite/overwrite.

> >>         if true then branch to the HLT instruction.

> >

> > codes? Just this file or others as well?

>

> All string/memory functions should work for all valid inputs.

>

> >>

> >> 2.  Update and run memset test to verify the test coverage for the

> condition.

> >

> > What is the desired result? Segfault?

>

> For invalid inputs, anything can happen, including segfault.

>

> >>

> >> 3.  Update memset assembly codes to cover such conditions.

> >

> > Memcmp as well? What about wmemset? Currently (and with previous

> versions as well)

> > a value like 2^63 would cause similar behavior after the `salq $2, %rdx`

>

> Is this condition a valid input?  If not, there is nothing to do.

>


AFAIK and value length [0, SIZE_MAX] for either is a valid input for any
string/memory function
from the perspective of the standard. But I don't know if it has any
qualifiers.

As well I don't know what is meant to happen if the machine/OS is unable to
perform the necessary operations.
Normally you would see segfault. The difference from the 3 commits below is
essentially just that it won't
segfault.

That is already case, and has been for a while, certain inputs for many of
the wcsmbs will have roughly the
same behavior from `salq $2, %rdx`. For example wmemset(ptr, 0, 2^62 + 1)
will currently set 1 wchar then
return.


>

> >>

> >>

> >> Thanks.

> >>

> >> > If so the following commits from me have the same bug:

> >> >

> https://sourceware.org/git/?p=glibc.git;a=commit;h=6abf27980a947f9b6e514d6b33b83059d39566ae

> >> >

> https://sourceware.org/git/?p=glibc.git;a=commit;h=4ad473e97acdc5f6d811755b67c09f2128a644ce

> >> >

> https://sourceware.org/git/?p=glibc.git;a=commit;h=16d12015c57701b08d7bbed6ec536641bcafb428

> >> >>

> >> >>

> >> >>

> >> >> > > > +       .p2align 4

> >> >> > > >  L(loop):

> >> >> > > > -       VMOVA   %VEC(0), (%rcx)

> >> >> > > > -       VMOVA   %VEC(0), VEC_SIZE(%rcx)

> >> >> > > > -       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rcx)

> >> >> > > > -       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rcx)

> >> >> > > > -       addq    $(VEC_SIZE * 4), %rcx

> >> >> > > > -       cmpq    %rcx, %rdx

> >> >> > > > -       jne     L(loop)

> >> >> > > > +       VMOVA   %VEC(0), (%rdi)

> >> >> > > > +       VMOVA   %VEC(0), VEC_SIZE(%rdi)

> >> >> > > > +       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rdi)

> >> >> > > > +       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rdi)

> >> >> > > > +       subq    $-(VEC_SIZE * 4), %rdi

> >> >> > > > +       cmpq    %rcx, %rdi

> >> >> > > > +       jb      L(loop)

> >> >

> >> > Issue because %rdi will not be below %rcx here.

> >> >>

> >> >> > > > +L(loop_end):

> >> >> > > > +       /* NB: rax is set as ptr in

> MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.

> >> >> > > > +              rdx as length is also unchanged.  */

> >> >> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx)

> >> >> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx)

> >> >> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx)

> >> >> > > > +       VMOVU   %VEC(0), -VEC_SIZE(%rax, %rdx)

> >> >> > > >         VZEROUPPER_SHORT_RETURN

> >> >> > > >

> >> >> > > >         .p2align 4

> >> >> > > > --

> >> >> > > > 2.25.1

> >> >> > > >

> >> >> > >

> >> >> > > LGTM.

> >> >> >

> >> >> > Awesome!

> >> >> >

> >> >> > For future patches do you prefer performance numbers like this or

> >> >> > raw text? Or some other alternative?

> >> >>

> >> >> The current data format is fine.   Thanks.

> >> >>

> >> >> > >

> >> >> > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

> >> >> > >

> >> >> > > Thanks.

> >> >> > >

> >> >> > > --

> >> >> > > H.J.

> >> >>

> >> >>

> >> >>

> >> >> --

> >> >> H.J.

> >>

> >>

> >>

> >> --

> >> H.J.

>

>

>

> --

> H.J.

>
Florian Weimer via Libc-alpha June 7, 2021, 4:33 a.m. | #10
On Sun, Jun 6, 2021 at 8:39 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>

>

>

> On Sun, Jun 6, 2021 at 11:21 PM H.J. Lu <hjl.tools@gmail.com> wrote:

>>

>> On Sun, Jun 6, 2021 at 8:06 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:

>> >

>> >

>> >

>> > On Sun, Jun 6, 2021 at 10:48 PM H.J. Lu <hjl.tools@gmail.com> wrote:

>> >>

>> >> On Sun, Jun 6, 2021 at 7:35 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:

>> >> >

>> >> >

>> >> >

>> >> > On Thu, May 20, 2021 at 4:45 PM H.J. Lu <hjl.tools@gmail.com> wrote:

>> >> >>

>> >> >> On Thu, May 20, 2021 at 1:03 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:

>> >> >> >

>> >> >> > On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com> wrote:

>> >> >> > >

>> >> >> > > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:

>> >> >> > > >

>> >> >> > > > No bug. This commit makes a few small improvements to

>> >> >> > > > memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64

>> >> >> > > > instead of 128. Either alignment will perform equally well in a loop

>> >> >> > > > and 128 just increases the odds of having to do an extra iteration

>> >> >> > > > which can be significant overhead for small values. 2) Align some

>> >> >> > > > targets and the loop. 3) Remove an ALU from the alignment process. 4)

>> >> >> > > > Reorder the last 4x VEC so that they are stored after the loop. 5)

>> >> >> > > > Move the condition for leq 8x VEC to before the alignment

>> >> >> > > > process. test-memset and test-wmemset are both passing.

>> >> >> > > >

>> >> >> > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>

>> >> >> > > > ---

>> >> >> > > > Tests where run on the following CPUs:

>> >> >> > > >

>> >> >> > > > Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

>> >> >> > > >

>> >> >> > > > Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html

>> >> >> > > >

>> >> >> > > > Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

>> >> >> > > >

>> >> >> > > > All times are the geometric mean of N=50. The unit of time is

>> >> >> > > > seconds.

>> >> >> > > >

>> >> >> > > > "Cur" refers to the current implementation

>> >> >> > > > "New" refers to this patches implementation

>> >> >> > > >

>> >> >> > > > Performance data attached in memset-data.pdf

>> >> >> > > >

>> >> >> > > > Some notes on the numbers:

>> >> >> > > >

>> >> >> > > > I only included numbers for the proper VEC_SIZE for the corresponding

>> >> >> > > > cpu.

>> >> >> > > >

>> >> >> > > > skl -> avx2

>> >> >> > > > icl -> evex

>> >> >> > > > tgl -> evex

>> >> >> > > >

>> >> >> > > > The changes only affect sizes > 2 * VEC_SIZE. The performance

>> >> >> > > > differences in the size <= 2 * VEC_SIZE come from changes in alignment

>> >> >> > > > after linking (i.e ENTRY aligns to 16, but performance can be affected

>> >> >> > > > by alignment % 64 or alignment % 4096) and generally affects

>> >> >> > > > throughput only, not latency (i.e with an lfence to the benchmark loop

>> >> >> > > > the deviations go away). Generally I think they can be ignored (both

>> >> >> > > > positive and negative affects).

>> >> >> > > >

>> >> >> > > > The interesting part of the data is in the medium size range [128,

>> >> >> > > > 1024] where the new implementation has a reasonable speedup. This is

>> >> >> > > > especially pronounced when the more conservative alignment saves a

>> >> >> > > > full loop iteration. The only significant exception is

>> >> >> > > > skylake-avx2-erms case for size = 416, alignment = 416 where the

>> >> >> > > > current implementation is meaningfully faster. I am unsure of the root

>> >> >> > > > cause for this. The skylake-avx2 case only performs a bit worse in

>> >> >> > > > this case which makes me think part of it is code alignment related,

>> >> >> > > > though comparative to the speedup in other size/alignment

>> >> >> > > > configurations it is still a trough.  Despite this, I still think the

>> >> >> > > > numbers are overall an improvement.

>> >> >> > > >

>> >> >> > > > As well due to aligning the loop (and possibly slightly more efficient

>> >> >> > > > DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop

>> >> >> > > > with subq -4 * VEC_SIZE) in the non-erms cases there is often a slight

>> >> >> > > > improvement to the main loop for large sizes.

>> >> >> > > >

>> >> >> > > >  .../multiarch/memset-vec-unaligned-erms.S     | 50 +++++++++++--------

>> >> >> > > >  1 file changed, 28 insertions(+), 22 deletions(-)

>> >> >> > > >

>> >> >> > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

>> >> >> > > > index 08cfa49bd1..ff196844a0 100644

>> >> >> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

>> >> >> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

>> >> >> > > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))

>> >> >> > > >         VMOVU   %VEC(0), (%rdi)

>> >> >> > > >         VZEROUPPER_RETURN

>> >> >> > > >

>> >> >> > > > +       .p2align 4

>> >> >> > > >  L(stosb_more_2x_vec):

>> >> >> > > >         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP

>> >> >> > > >         ja      L(stosb)

>> >> >> > > > +#else

>> >> >> > > > +       .p2align 4

>> >> >> > > >  #endif

>> >> >> > > >  L(more_2x_vec):

>> >> >> > > > -       cmpq  $(VEC_SIZE * 4), %rdx

>> >> >> > > > -       ja      L(loop_start)

>> >> >> > > > +       /* Stores to first 2x VEC before cmp as any path forward will

>> >> >> > > > +          require it.  */

>> >> >> > > >         VMOVU   %VEC(0), (%rdi)

>> >> >> > > >         VMOVU   %VEC(0), VEC_SIZE(%rdi)

>> >> >> > > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

>> >> >> > > > +       cmpq    $(VEC_SIZE * 4), %rdx

>> >> >> > > > +       ja      L(loop_start)

>> >> >> > > >         VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

>> >> >> > > > +       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

>> >> >> > > >  L(return):

>> >> >> > > >  #if VEC_SIZE > 16

>> >> >> > > >         ZERO_UPPER_VEC_REGISTERS_RETURN

>> >> >> > > > @@ -192,28 +197,29 @@ L(return):

>> >> >> > > >  #endif

>> >> >> > > >

>> >> >> > > >  L(loop_start):

>> >> >> > > > -       leaq    (VEC_SIZE * 4)(%rdi), %rcx

>> >> >> > > > -       VMOVU   %VEC(0), (%rdi)

>> >> >> > > > -       andq    $-(VEC_SIZE * 4), %rcx

>> >> >> > > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

>> >> >> > > > -       VMOVU   %VEC(0), VEC_SIZE(%rdi)

>> >> >> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

>> >> >> > > >         VMOVU   %VEC(0), (VEC_SIZE * 2)(%rdi)

>> >> >> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx)

>> >> >> > > >         VMOVU   %VEC(0), (VEC_SIZE * 3)(%rdi)

>> >> >> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx)

>> >> >> > > > -       addq    %rdi, %rdx

>> >> >> > > > -       andq    $-(VEC_SIZE * 4), %rdx

>> >> >> > > > -       cmpq    %rdx, %rcx

>> >> >> > > > -       je      L(return)

>> >> >> > > > +       cmpq    $(VEC_SIZE * 8), %rdx

>> >> >> > > > +       jbe     L(loop_end)

>> >> >> > > > +       andq    $-(VEC_SIZE * 2), %rdi

>> >> >> > > > +       subq    $-(VEC_SIZE * 4), %rdi

>> >> >> > > > +       leaq    -(VEC_SIZE * 4)(%rax, %rdx), %rcx

>> >> >

>> >> > If this overflows loop will exit first iteration. Is that an issue?

>> >>

>> >> Please do following:

>> >>

>> >> 1.  Update memset assembly codes with

>> >>         check conditions for underwrite/overwrite.

>> >>         if true then branch to the HLT instruction.

>> >

>> > codes? Just this file or others as well?

>>

>> All string/memory functions should work for all valid inputs.

>>

>> >>

>> >> 2.  Update and run memset test to verify the test coverage for the condition.

>> >

>> > What is the desired result? Segfault?

>>

>> For invalid inputs, anything can happen, including segfault.

>>

>> >>

>> >> 3.  Update memset assembly codes to cover such conditions.

>> >

>> > Memcmp as well? What about wmemset? Currently (and with previous versions as well)

>> > a value like 2^63 would cause similar behavior after the `salq $2, %rdx`

>>

>> Is this condition a valid input?  If not, there is nothing to do.

>

>

> AFAIK and value length [0, SIZE_MAX] for either is a valid input for any string/memory function

> from the perspective of the standard. But I don't know if it has any qualifiers.

>

> As well I don't know what is meant to happen if the machine/OS is unable to perform the necessary operations.

> Normally you would see segfault. The difference from the 3 commits below is essentially just that it won't

> segfault.

>

> That is already case, and has been for a while, certain inputs for many of the wcsmbs will have roughly the

> same behavior from `salq $2, %rdx`. For example wmemset(ptr, 0, 2^62 + 1) will currently set 1 wchar then

> return.


Please construct a testcase which will return normally instead of
segfault if not fixed.  So on
x86-64, the expected behavior should be segfault.

>>

>>

>> >>

>> >>

>> >> Thanks.

>> >>

>> >> > If so the following commits from me have the same bug:

>> >> > https://sourceware.org/git/?p=glibc.git;a=commit;h=6abf27980a947f9b6e514d6b33b83059d39566ae

>> >> > https://sourceware.org/git/?p=glibc.git;a=commit;h=4ad473e97acdc5f6d811755b67c09f2128a644ce

>> >> > https://sourceware.org/git/?p=glibc.git;a=commit;h=16d12015c57701b08d7bbed6ec536641bcafb428

>> >> >>

>> >> >>

>> >> >>

>> >> >> > > > +       .p2align 4

>> >> >> > > >  L(loop):

>> >> >> > > > -       VMOVA   %VEC(0), (%rcx)

>> >> >> > > > -       VMOVA   %VEC(0), VEC_SIZE(%rcx)

>> >> >> > > > -       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rcx)

>> >> >> > > > -       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rcx)

>> >> >> > > > -       addq    $(VEC_SIZE * 4), %rcx

>> >> >> > > > -       cmpq    %rcx, %rdx

>> >> >> > > > -       jne     L(loop)

>> >> >> > > > +       VMOVA   %VEC(0), (%rdi)

>> >> >> > > > +       VMOVA   %VEC(0), VEC_SIZE(%rdi)

>> >> >> > > > +       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rdi)

>> >> >> > > > +       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rdi)

>> >> >> > > > +       subq    $-(VEC_SIZE * 4), %rdi

>> >> >> > > > +       cmpq    %rcx, %rdi

>> >> >> > > > +       jb      L(loop)

>> >> >

>> >> > Issue because %rdi will not be below %rcx here.

>> >> >>

>> >> >> > > > +L(loop_end):

>> >> >> > > > +       /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.

>> >> >> > > > +              rdx as length is also unchanged.  */

>> >> >> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx)

>> >> >> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx)

>> >> >> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx)

>> >> >> > > > +       VMOVU   %VEC(0), -VEC_SIZE(%rax, %rdx)

>> >> >> > > >         VZEROUPPER_SHORT_RETURN

>> >> >> > > >

>> >> >> > > >         .p2align 4

>> >> >> > > > --

>> >> >> > > > 2.25.1

>> >> >> > > >

>> >> >> > >

>> >> >> > > LGTM.

>> >> >> >

>> >> >> > Awesome!

>> >> >> >

>> >> >> > For future patches do you prefer performance numbers like this or

>> >> >> > raw text? Or some other alternative?

>> >> >>

>> >> >> The current data format is fine.   Thanks.

>> >> >>

>> >> >> > >

>> >> >> > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

>> >> >> > >

>> >> >> > > Thanks.

>> >> >> > >

>> >> >> > > --

>> >> >> > > H.J.

>> >> >>

>> >> >>

>> >> >>

>> >> >> --

>> >> >> H.J.

>> >>

>> >>

>> >>

>> >> --

>> >> H.J.

>>

>>

>>

>> --

>> H.J.




-- 
H.J.
Florian Weimer via Libc-alpha June 7, 2021, 7:12 a.m. | #11
On Mon, Jun 7, 2021 at 12:34 AM H.J. Lu <hjl.tools@gmail.com> wrote:

> On Sun, Jun 6, 2021 at 8:39 PM Noah Goldstein <goldstein.w.n@gmail.com>

> wrote:

> >

> >

> >

> > On Sun, Jun 6, 2021 at 11:21 PM H.J. Lu <hjl.tools@gmail.com> wrote:

> >>

> >> On Sun, Jun 6, 2021 at 8:06 PM Noah Goldstein <goldstein.w.n@gmail.com>

> wrote:

> >> >

> >> >

> >> >

> >> > On Sun, Jun 6, 2021 at 10:48 PM H.J. Lu <hjl.tools@gmail.com> wrote:

> >> >>

> >> >> On Sun, Jun 6, 2021 at 7:35 PM Noah Goldstein <

> goldstein.w.n@gmail.com> wrote:

> >> >> >

> >> >> >

> >> >> >

> >> >> > On Thu, May 20, 2021 at 4:45 PM H.J. Lu <hjl.tools@gmail.com>

> wrote:

> >> >> >>

> >> >> >> On Thu, May 20, 2021 at 1:03 PM Noah Goldstein <

> goldstein.w.n@gmail.com> wrote:

> >> >> >> >

> >> >> >> > On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com>

> wrote:

> >> >> >> > >

> >> >> >> > > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein <

> goldstein.w.n@gmail.com> wrote:

> >> >> >> > > >

> >> >> >> > > > No bug. This commit makes a few small improvements to

> >> >> >> > > > memset-vec-unaligned-erms.S. The changes are 1) only

> aligning to 64

> >> >> >> > > > instead of 128. Either alignment will perform equally well

> in a loop

> >> >> >> > > > and 128 just increases the odds of having to do an extra

> iteration

> >> >> >> > > > which can be significant overhead for small values. 2)

> Align some

> >> >> >> > > > targets and the loop. 3) Remove an ALU from the alignment

> process. 4)

> >> >> >> > > > Reorder the last 4x VEC so that they are stored after the

> loop. 5)

> >> >> >> > > > Move the condition for leq 8x VEC to before the alignment

> >> >> >> > > > process. test-memset and test-wmemset are both passing.

> >> >> >> > > >

> >> >> >> > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>

> >> >> >> > > > ---

> >> >> >> > > > Tests where run on the following CPUs:

> >> >> >> > > >

> >> >> >> > > > Skylake:

> https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

> >> >> >> > > >

> >> >> >> > > > Icelake:

> https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html

> >> >> >> > > >

> >> >> >> > > > Tigerlake:

> https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

> >> >> >> > > >

> >> >> >> > > > All times are the geometric mean of N=50. The unit of time

> is

> >> >> >> > > > seconds.

> >> >> >> > > >

> >> >> >> > > > "Cur" refers to the current implementation

> >> >> >> > > > "New" refers to this patches implementation

> >> >> >> > > >

> >> >> >> > > > Performance data attached in memset-data.pdf

> >> >> >> > > >

> >> >> >> > > > Some notes on the numbers:

> >> >> >> > > >

> >> >> >> > > > I only included numbers for the proper VEC_SIZE for the

> corresponding

> >> >> >> > > > cpu.

> >> >> >> > > >

> >> >> >> > > > skl -> avx2

> >> >> >> > > > icl -> evex

> >> >> >> > > > tgl -> evex

> >> >> >> > > >

> >> >> >> > > > The changes only affect sizes > 2 * VEC_SIZE. The

> performance

> >> >> >> > > > differences in the size <= 2 * VEC_SIZE come from changes

> in alignment

> >> >> >> > > > after linking (i.e ENTRY aligns to 16, but performance can

> be affected

> >> >> >> > > > by alignment % 64 or alignment % 4096) and generally affects

> >> >> >> > > > throughput only, not latency (i.e with an lfence to the

> benchmark loop

> >> >> >> > > > the deviations go away). Generally I think they can be

> ignored (both

> >> >> >> > > > positive and negative affects).

> >> >> >> > > >

> >> >> >> > > > The interesting part of the data is in the medium size

> range [128,

> >> >> >> > > > 1024] where the new implementation has a reasonable

> speedup. This is

> >> >> >> > > > especially pronounced when the more conservative alignment

> saves a

> >> >> >> > > > full loop iteration. The only significant exception is

> >> >> >> > > > skylake-avx2-erms case for size = 416, alignment = 416

> where the

> >> >> >> > > > current implementation is meaningfully faster. I am unsure

> of the root

> >> >> >> > > > cause for this. The skylake-avx2 case only performs a bit

> worse in

> >> >> >> > > > this case which makes me think part of it is code alignment

> related,

> >> >> >> > > > though comparative to the speedup in other size/alignment

> >> >> >> > > > configurations it is still a trough.  Despite this, I still

> think the

> >> >> >> > > > numbers are overall an improvement.

> >> >> >> > > >

> >> >> >> > > > As well due to aligning the loop (and possibly slightly

> more efficient

> >> >> >> > > > DSB behavior with the replacement of addq 4 * VEC_SIZE in

> the loop

> >> >> >> > > > with subq -4 * VEC_SIZE) in the non-erms cases there is

> often a slight

> >> >> >> > > > improvement to the main loop for large sizes.

> >> >> >> > > >

> >> >> >> > > >  .../multiarch/memset-vec-unaligned-erms.S     | 50

> +++++++++++--------

> >> >> >> > > >  1 file changed, 28 insertions(+), 22 deletions(-)

> >> >> >> > > >

> >> >> >> > > > diff --git

> a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> >> >> >> > > > index 08cfa49bd1..ff196844a0 100644

> >> >> >> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> >> >> >> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

> >> >> >> > > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset,

> unaligned_erms))

> >> >> >> > > >         VMOVU   %VEC(0), (%rdi)

> >> >> >> > > >         VZEROUPPER_RETURN

> >> >> >> > > >

> >> >> >> > > > +       .p2align 4

> >> >> >> > > >  L(stosb_more_2x_vec):

> >> >> >> > > >         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP

> >> >> >> > > >         ja      L(stosb)

> >> >> >> > > > +#else

> >> >> >> > > > +       .p2align 4

> >> >> >> > > >  #endif

> >> >> >> > > >  L(more_2x_vec):

> >> >> >> > > > -       cmpq  $(VEC_SIZE * 4), %rdx

> >> >> >> > > > -       ja      L(loop_start)

> >> >> >> > > > +       /* Stores to first 2x VEC before cmp as any path

> forward will

> >> >> >> > > > +          require it.  */

> >> >> >> > > >         VMOVU   %VEC(0), (%rdi)

> >> >> >> > > >         VMOVU   %VEC(0), VEC_SIZE(%rdi)

> >> >> >> > > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> >> >> >> > > > +       cmpq    $(VEC_SIZE * 4), %rdx

> >> >> >> > > > +       ja      L(loop_start)

> >> >> >> > > >         VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

> >> >> >> > > > +       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> >> >> >> > > >  L(return):

> >> >> >> > > >  #if VEC_SIZE > 16

> >> >> >> > > >         ZERO_UPPER_VEC_REGISTERS_RETURN

> >> >> >> > > > @@ -192,28 +197,29 @@ L(return):

> >> >> >> > > >  #endif

> >> >> >> > > >

> >> >> >> > > >  L(loop_start):

> >> >> >> > > > -       leaq    (VEC_SIZE * 4)(%rdi), %rcx

> >> >> >> > > > -       VMOVU   %VEC(0), (%rdi)

> >> >> >> > > > -       andq    $-(VEC_SIZE * 4), %rcx

> >> >> >> > > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

> >> >> >> > > > -       VMOVU   %VEC(0), VEC_SIZE(%rdi)

> >> >> >> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

> >> >> >> > > >         VMOVU   %VEC(0), (VEC_SIZE * 2)(%rdi)

> >> >> >> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx)

> >> >> >> > > >         VMOVU   %VEC(0), (VEC_SIZE * 3)(%rdi)

> >> >> >> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx)

> >> >> >> > > > -       addq    %rdi, %rdx

> >> >> >> > > > -       andq    $-(VEC_SIZE * 4), %rdx

> >> >> >> > > > -       cmpq    %rdx, %rcx

> >> >> >> > > > -       je      L(return)

> >> >> >> > > > +       cmpq    $(VEC_SIZE * 8), %rdx

> >> >> >> > > > +       jbe     L(loop_end)

> >> >> >> > > > +       andq    $-(VEC_SIZE * 2), %rdi

> >> >> >> > > > +       subq    $-(VEC_SIZE * 4), %rdi

> >> >> >> > > > +       leaq    -(VEC_SIZE * 4)(%rax, %rdx), %rcx

> >> >> >

> >> >> > If this overflows loop will exit first iteration. Is that an issue?

> >> >>

> >> >> Please do following:

> >> >>

> >> >> 1.  Update memset assembly codes with

> >> >>         check conditions for underwrite/overwrite.

> >> >>         if true then branch to the HLT instruction.

> >> >

> >> > codes? Just this file or others as well?

> >>

> >> All string/memory functions should work for all valid inputs.

> >>

> >> >>

> >> >> 2.  Update and run memset test to verify the test coverage for the

> condition.

> >> >

> >> > What is the desired result? Segfault?

> >>

> >> For invalid inputs, anything can happen, including segfault.

> >>

> >> >>

> >> >> 3.  Update memset assembly codes to cover such conditions.

> >> >

> >> > Memcmp as well? What about wmemset? Currently (and with previous

> versions as well)

> >> > a value like 2^63 would cause similar behavior after the `salq $2,

> %rdx`

> >>

> >> Is this condition a valid input?  If not, there is nothing to do.

> >

> >

> > AFAIK and value length [0, SIZE_MAX] for either is a valid input for any

> string/memory function

> > from the perspective of the standard. But I don't know if it has any

> qualifiers.

> >

> > As well I don't know what is meant to happen if the machine/OS is unable

> to perform the necessary operations.

> > Normally you would see segfault. The difference from the 3 commits below

> is essentially just that it won't

> > segfault.

> >

> > That is already case, and has been for a while, certain inputs for many

> of the wcsmbs will have roughly the

> > same behavior from `salq $2, %rdx`. For example wmemset(ptr, 0, 2^62 +

> 1) will currently set 1 wchar then

> > return.

>

> Please construct a testcase which will return normally instead of

> segfault if not fixed.  So on

> x86-64, the expected behavior should be segfault.

>


memcmp / test for it as well?

>

> >>

> >>

> >> >>

> >> >>

> >> >> Thanks.

> >> >>

> >> >> > If so the following commits from me have the same bug:

> >> >> >

> https://sourceware.org/git/?p=glibc.git;a=commit;h=6abf27980a947f9b6e514d6b33b83059d39566ae

> >> >> >

> https://sourceware.org/git/?p=glibc.git;a=commit;h=4ad473e97acdc5f6d811755b67c09f2128a644ce

> >> >> >

> https://sourceware.org/git/?p=glibc.git;a=commit;h=16d12015c57701b08d7bbed6ec536641bcafb428

> >> >> >>

> >> >> >>

> >> >> >>

> >> >> >> > > > +       .p2align 4

> >> >> >> > > >  L(loop):

> >> >> >> > > > -       VMOVA   %VEC(0), (%rcx)

> >> >> >> > > > -       VMOVA   %VEC(0), VEC_SIZE(%rcx)

> >> >> >> > > > -       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rcx)

> >> >> >> > > > -       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rcx)

> >> >> >> > > > -       addq    $(VEC_SIZE * 4), %rcx

> >> >> >> > > > -       cmpq    %rcx, %rdx

> >> >> >> > > > -       jne     L(loop)

> >> >> >> > > > +       VMOVA   %VEC(0), (%rdi)

> >> >> >> > > > +       VMOVA   %VEC(0), VEC_SIZE(%rdi)

> >> >> >> > > > +       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rdi)

> >> >> >> > > > +       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rdi)

> >> >> >> > > > +       subq    $-(VEC_SIZE * 4), %rdi

> >> >> >> > > > +       cmpq    %rcx, %rdi

> >> >> >> > > > +       jb      L(loop)

> >> >> >

> >> >> > Issue because %rdi will not be below %rcx here.

> >> >> >>

> >> >> >> > > > +L(loop_end):

> >> >> >> > > > +       /* NB: rax is set as ptr in

> MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.

> >> >> >> > > > +              rdx as length is also unchanged.  */

> >> >> >> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx)

> >> >> >> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx)

> >> >> >> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx)

> >> >> >> > > > +       VMOVU   %VEC(0), -VEC_SIZE(%rax, %rdx)

> >> >> >> > > >         VZEROUPPER_SHORT_RETURN

> >> >> >> > > >

> >> >> >> > > >         .p2align 4

> >> >> >> > > > --

> >> >> >> > > > 2.25.1

> >> >> >> > > >

> >> >> >> > >

> >> >> >> > > LGTM.

> >> >> >> >

> >> >> >> > Awesome!

> >> >> >> >

> >> >> >> > For future patches do you prefer performance numbers like this

> or

> >> >> >> > raw text? Or some other alternative?

> >> >> >>

> >> >> >> The current data format is fine.   Thanks.

> >> >> >>

> >> >> >> > >

> >> >> >> > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

> >> >> >> > >

> >> >> >> > > Thanks.

> >> >> >> > >

> >> >> >> > > --

> >> >> >> > > H.J.

> >> >> >>

> >> >> >>

> >> >> >>

> >> >> >> --

> >> >> >> H.J.

> >> >>

> >> >>

> >> >>

> >> >> --

> >> >> H.J.

> >>

> >>

> >>

> >> --

> >> H.J.

>

>

>

> --

> H.J.

>
Florian Weimer via Libc-alpha June 7, 2021, 12:54 p.m. | #12
On Mon, Jun 7, 2021 at 12:12 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>

>

>

> On Mon, Jun 7, 2021 at 12:34 AM H.J. Lu <hjl.tools@gmail.com> wrote:

>>

>> On Sun, Jun 6, 2021 at 8:39 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:

>> >

>> >

>> >

>> > On Sun, Jun 6, 2021 at 11:21 PM H.J. Lu <hjl.tools@gmail.com> wrote:

>> >>

>> >> On Sun, Jun 6, 2021 at 8:06 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:

>> >> >

>> >> >

>> >> >

>> >> > On Sun, Jun 6, 2021 at 10:48 PM H.J. Lu <hjl.tools@gmail.com> wrote:

>> >> >>

>> >> >> On Sun, Jun 6, 2021 at 7:35 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:

>> >> >> >

>> >> >> >

>> >> >> >

>> >> >> > On Thu, May 20, 2021 at 4:45 PM H.J. Lu <hjl.tools@gmail.com> wrote:

>> >> >> >>

>> >> >> >> On Thu, May 20, 2021 at 1:03 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:

>> >> >> >> >

>> >> >> >> > On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com> wrote:

>> >> >> >> > >

>> >> >> >> > > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:

>> >> >> >> > > >

>> >> >> >> > > > No bug. This commit makes a few small improvements to

>> >> >> >> > > > memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64

>> >> >> >> > > > instead of 128. Either alignment will perform equally well in a loop

>> >> >> >> > > > and 128 just increases the odds of having to do an extra iteration

>> >> >> >> > > > which can be significant overhead for small values. 2) Align some

>> >> >> >> > > > targets and the loop. 3) Remove an ALU from the alignment process. 4)

>> >> >> >> > > > Reorder the last 4x VEC so that they are stored after the loop. 5)

>> >> >> >> > > > Move the condition for leq 8x VEC to before the alignment

>> >> >> >> > > > process. test-memset and test-wmemset are both passing.

>> >> >> >> > > >

>> >> >> >> > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>

>> >> >> >> > > > ---

>> >> >> >> > > > Tests where run on the following CPUs:

>> >> >> >> > > >

>> >> >> >> > > > Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

>> >> >> >> > > >

>> >> >> >> > > > Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html

>> >> >> >> > > >

>> >> >> >> > > > Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

>> >> >> >> > > >

>> >> >> >> > > > All times are the geometric mean of N=50. The unit of time is

>> >> >> >> > > > seconds.

>> >> >> >> > > >

>> >> >> >> > > > "Cur" refers to the current implementation

>> >> >> >> > > > "New" refers to this patches implementation

>> >> >> >> > > >

>> >> >> >> > > > Performance data attached in memset-data.pdf

>> >> >> >> > > >

>> >> >> >> > > > Some notes on the numbers:

>> >> >> >> > > >

>> >> >> >> > > > I only included numbers for the proper VEC_SIZE for the corresponding

>> >> >> >> > > > cpu.

>> >> >> >> > > >

>> >> >> >> > > > skl -> avx2

>> >> >> >> > > > icl -> evex

>> >> >> >> > > > tgl -> evex

>> >> >> >> > > >

>> >> >> >> > > > The changes only affect sizes > 2 * VEC_SIZE. The performance

>> >> >> >> > > > differences in the size <= 2 * VEC_SIZE come from changes in alignment

>> >> >> >> > > > after linking (i.e ENTRY aligns to 16, but performance can be affected

>> >> >> >> > > > by alignment % 64 or alignment % 4096) and generally affects

>> >> >> >> > > > throughput only, not latency (i.e with an lfence to the benchmark loop

>> >> >> >> > > > the deviations go away). Generally I think they can be ignored (both

>> >> >> >> > > > positive and negative affects).

>> >> >> >> > > >

>> >> >> >> > > > The interesting part of the data is in the medium size range [128,

>> >> >> >> > > > 1024] where the new implementation has a reasonable speedup. This is

>> >> >> >> > > > especially pronounced when the more conservative alignment saves a

>> >> >> >> > > > full loop iteration. The only significant exception is

>> >> >> >> > > > skylake-avx2-erms case for size = 416, alignment = 416 where the

>> >> >> >> > > > current implementation is meaningfully faster. I am unsure of the root

>> >> >> >> > > > cause for this. The skylake-avx2 case only performs a bit worse in

>> >> >> >> > > > this case which makes me think part of it is code alignment related,

>> >> >> >> > > > though comparative to the speedup in other size/alignment

>> >> >> >> > > > configurations it is still a trough.  Despite this, I still think the

>> >> >> >> > > > numbers are overall an improvement.

>> >> >> >> > > >

>> >> >> >> > > > As well due to aligning the loop (and possibly slightly more efficient

>> >> >> >> > > > DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop

>> >> >> >> > > > with subq -4 * VEC_SIZE) in the non-erms cases there is often a slight

>> >> >> >> > > > improvement to the main loop for large sizes.

>> >> >> >> > > >

>> >> >> >> > > >  .../multiarch/memset-vec-unaligned-erms.S     | 50 +++++++++++--------

>> >> >> >> > > >  1 file changed, 28 insertions(+), 22 deletions(-)

>> >> >> >> > > >

>> >> >> >> > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

>> >> >> >> > > > index 08cfa49bd1..ff196844a0 100644

>> >> >> >> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

>> >> >> >> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S

>> >> >> >> > > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))

>> >> >> >> > > >         VMOVU   %VEC(0), (%rdi)

>> >> >> >> > > >         VZEROUPPER_RETURN

>> >> >> >> > > >

>> >> >> >> > > > +       .p2align 4

>> >> >> >> > > >  L(stosb_more_2x_vec):

>> >> >> >> > > >         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP

>> >> >> >> > > >         ja      L(stosb)

>> >> >> >> > > > +#else

>> >> >> >> > > > +       .p2align 4

>> >> >> >> > > >  #endif

>> >> >> >> > > >  L(more_2x_vec):

>> >> >> >> > > > -       cmpq  $(VEC_SIZE * 4), %rdx

>> >> >> >> > > > -       ja      L(loop_start)

>> >> >> >> > > > +       /* Stores to first 2x VEC before cmp as any path forward will

>> >> >> >> > > > +          require it.  */

>> >> >> >> > > >         VMOVU   %VEC(0), (%rdi)

>> >> >> >> > > >         VMOVU   %VEC(0), VEC_SIZE(%rdi)

>> >> >> >> > > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

>> >> >> >> > > > +       cmpq    $(VEC_SIZE * 4), %rdx

>> >> >> >> > > > +       ja      L(loop_start)

>> >> >> >> > > >         VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

>> >> >> >> > > > +       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

>> >> >> >> > > >  L(return):

>> >> >> >> > > >  #if VEC_SIZE > 16

>> >> >> >> > > >         ZERO_UPPER_VEC_REGISTERS_RETURN

>> >> >> >> > > > @@ -192,28 +197,29 @@ L(return):

>> >> >> >> > > >  #endif

>> >> >> >> > > >

>> >> >> >> > > >  L(loop_start):

>> >> >> >> > > > -       leaq    (VEC_SIZE * 4)(%rdi), %rcx

>> >> >> >> > > > -       VMOVU   %VEC(0), (%rdi)

>> >> >> >> > > > -       andq    $-(VEC_SIZE * 4), %rcx

>> >> >> >> > > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)

>> >> >> >> > > > -       VMOVU   %VEC(0), VEC_SIZE(%rdi)

>> >> >> >> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)

>> >> >> >> > > >         VMOVU   %VEC(0), (VEC_SIZE * 2)(%rdi)

>> >> >> >> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx)

>> >> >> >> > > >         VMOVU   %VEC(0), (VEC_SIZE * 3)(%rdi)

>> >> >> >> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx)

>> >> >> >> > > > -       addq    %rdi, %rdx

>> >> >> >> > > > -       andq    $-(VEC_SIZE * 4), %rdx

>> >> >> >> > > > -       cmpq    %rdx, %rcx

>> >> >> >> > > > -       je      L(return)

>> >> >> >> > > > +       cmpq    $(VEC_SIZE * 8), %rdx

>> >> >> >> > > > +       jbe     L(loop_end)

>> >> >> >> > > > +       andq    $-(VEC_SIZE * 2), %rdi

>> >> >> >> > > > +       subq    $-(VEC_SIZE * 4), %rdi

>> >> >> >> > > > +       leaq    -(VEC_SIZE * 4)(%rax, %rdx), %rcx

>> >> >> >

>> >> >> > If this overflows loop will exit first iteration. Is that an issue?

>> >> >>

>> >> >> Please do following:

>> >> >>

>> >> >> 1.  Update memset assembly codes with

>> >> >>         check conditions for underwrite/overwrite.

>> >> >>         if true then branch to the HLT instruction.

>> >> >

>> >> > codes? Just this file or others as well?

>> >>

>> >> All string/memory functions should work for all valid inputs.

>> >>

>> >> >>

>> >> >> 2.  Update and run memset test to verify the test coverage for the condition.

>> >> >

>> >> > What is the desired result? Segfault?

>> >>

>> >> For invalid inputs, anything can happen, including segfault.

>> >>

>> >> >>

>> >> >> 3.  Update memset assembly codes to cover such conditions.

>> >> >

>> >> > Memcmp as well? What about wmemset? Currently (and with previous versions as well)

>> >> > a value like 2^63 would cause similar behavior after the `salq $2, %rdx`

>> >>

>> >> Is this condition a valid input?  If not, there is nothing to do.

>> >

>> >

>> > AFAIK and value length [0, SIZE_MAX] for either is a valid input for any string/memory function

>> > from the perspective of the standard. But I don't know if it has any qualifiers.

>> >

>> > As well I don't know what is meant to happen if the machine/OS is unable to perform the necessary operations.

>> > Normally you would see segfault. The difference from the 3 commits below is essentially just that it won't

>> > segfault.

>> >

>> > That is already case, and has been for a while, certain inputs for many of the wcsmbs will have roughly the

>> > same behavior from `salq $2, %rdx`. For example wmemset(ptr, 0, 2^62 + 1) will currently set 1 wchar then

>> > return.

>>

>> Please construct a testcase which will return normally instead of

>> segfault if not fixed.  So on

>> x86-64, the expected behavior should be segfault.

>

>

> memcmp / test for it as well?


Yes.

>>

>>

>> >>

>> >>

>> >> >>

>> >> >>

>> >> >> Thanks.

>> >> >>

>> >> >> > If so the following commits from me have the same bug:

>> >> >> > https://sourceware.org/git/?p=glibc.git;a=commit;h=6abf27980a947f9b6e514d6b33b83059d39566ae

>> >> >> > https://sourceware.org/git/?p=glibc.git;a=commit;h=4ad473e97acdc5f6d811755b67c09f2128a644ce

>> >> >> > https://sourceware.org/git/?p=glibc.git;a=commit;h=16d12015c57701b08d7bbed6ec536641bcafb428

>> >> >> >>

>> >> >> >>

>> >> >> >>

>> >> >> >> > > > +       .p2align 4

>> >> >> >> > > >  L(loop):

>> >> >> >> > > > -       VMOVA   %VEC(0), (%rcx)

>> >> >> >> > > > -       VMOVA   %VEC(0), VEC_SIZE(%rcx)

>> >> >> >> > > > -       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rcx)

>> >> >> >> > > > -       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rcx)

>> >> >> >> > > > -       addq    $(VEC_SIZE * 4), %rcx

>> >> >> >> > > > -       cmpq    %rcx, %rdx

>> >> >> >> > > > -       jne     L(loop)

>> >> >> >> > > > +       VMOVA   %VEC(0), (%rdi)

>> >> >> >> > > > +       VMOVA   %VEC(0), VEC_SIZE(%rdi)

>> >> >> >> > > > +       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rdi)

>> >> >> >> > > > +       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rdi)

>> >> >> >> > > > +       subq    $-(VEC_SIZE * 4), %rdi

>> >> >> >> > > > +       cmpq    %rcx, %rdi

>> >> >> >> > > > +       jb      L(loop)

>> >> >> >

>> >> >> > Issue because %rdi will not be below %rcx here.

>> >> >> >>

>> >> >> >> > > > +L(loop_end):

>> >> >> >> > > > +       /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.

>> >> >> >> > > > +              rdx as length is also unchanged.  */

>> >> >> >> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx)

>> >> >> >> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx)

>> >> >> >> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx)

>> >> >> >> > > > +       VMOVU   %VEC(0), -VEC_SIZE(%rax, %rdx)

>> >> >> >> > > >         VZEROUPPER_SHORT_RETURN

>> >> >> >> > > >

>> >> >> >> > > >         .p2align 4

>> >> >> >> > > > --

>> >> >> >> > > > 2.25.1

>> >> >> >> > > >

>> >> >> >> > >

>> >> >> >> > > LGTM.

>> >> >> >> >

>> >> >> >> > Awesome!

>> >> >> >> >

>> >> >> >> > For future patches do you prefer performance numbers like this or

>> >> >> >> > raw text? Or some other alternative?

>> >> >> >>

>> >> >> >> The current data format is fine.   Thanks.

>> >> >> >>

>> >> >> >> > >

>> >> >> >> > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

>> >> >> >> > >

>> >> >> >> > > Thanks.

>> >> >> >> > >

>> >> >> >> > > --

>> >> >> >> > > H.J.

>> >> >> >>

>> >> >> >>

>> >> >> >>

>> >> >> >> --

>> >> >> >> H.J.

>> >> >>

>> >> >>

>> >> >>

>> >> >> --

>> >> >> H.J.

>> >>

>> >>

>> >>

>> >> --

>> >> H.J.

>>

>>

>>

>> --

>> H.J.




-- 
H.J.

Patch

diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 08cfa49bd1..ff196844a0 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -173,17 +173,22 @@  ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))
 	VMOVU	%VEC(0), (%rdi)
 	VZEROUPPER_RETURN
 
+	.p2align 4
 L(stosb_more_2x_vec):
 	cmp	__x86_rep_stosb_threshold(%rip), %RDX_LP
 	ja	L(stosb)
+#else
+	.p2align 4
 #endif
 L(more_2x_vec):
-	cmpq  $(VEC_SIZE * 4), %rdx
-	ja	L(loop_start)
+	/* Stores to first 2x VEC before cmp as any path forward will
+	   require it.  */
 	VMOVU	%VEC(0), (%rdi)
 	VMOVU	%VEC(0), VEC_SIZE(%rdi)
-	VMOVU	%VEC(0), -VEC_SIZE(%rdi,%rdx)
+	cmpq	$(VEC_SIZE * 4), %rdx
+	ja	L(loop_start)
 	VMOVU	%VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)
+	VMOVU	%VEC(0), -VEC_SIZE(%rdi,%rdx)
 L(return):
 #if VEC_SIZE > 16
 	ZERO_UPPER_VEC_REGISTERS_RETURN
@@ -192,28 +197,29 @@  L(return):
 #endif
 
 L(loop_start):
-	leaq	(VEC_SIZE * 4)(%rdi), %rcx
-	VMOVU	%VEC(0), (%rdi)
-	andq	$-(VEC_SIZE * 4), %rcx
-	VMOVU	%VEC(0), -VEC_SIZE(%rdi,%rdx)
-	VMOVU	%VEC(0), VEC_SIZE(%rdi)
-	VMOVU	%VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)
 	VMOVU	%VEC(0), (VEC_SIZE * 2)(%rdi)
-	VMOVU	%VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx)
 	VMOVU	%VEC(0), (VEC_SIZE * 3)(%rdi)
-	VMOVU	%VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx)
-	addq	%rdi, %rdx
-	andq	$-(VEC_SIZE * 4), %rdx
-	cmpq	%rdx, %rcx
-	je	L(return)
+	cmpq	$(VEC_SIZE * 8), %rdx
+	jbe	L(loop_end)
+	andq	$-(VEC_SIZE * 2), %rdi
+	subq	$-(VEC_SIZE * 4), %rdi
+	leaq	-(VEC_SIZE * 4)(%rax, %rdx), %rcx
+	.p2align 4
 L(loop):
-	VMOVA	%VEC(0), (%rcx)
-	VMOVA	%VEC(0), VEC_SIZE(%rcx)
-	VMOVA	%VEC(0), (VEC_SIZE * 2)(%rcx)
-	VMOVA	%VEC(0), (VEC_SIZE * 3)(%rcx)
-	addq	$(VEC_SIZE * 4), %rcx
-	cmpq	%rcx, %rdx
-	jne	L(loop)
+	VMOVA	%VEC(0), (%rdi)
+	VMOVA	%VEC(0), VEC_SIZE(%rdi)
+	VMOVA	%VEC(0), (VEC_SIZE * 2)(%rdi)
+	VMOVA	%VEC(0), (VEC_SIZE * 3)(%rdi)
+	subq	$-(VEC_SIZE * 4), %rdi
+	cmpq	%rcx, %rdi
+	jb	L(loop)
+L(loop_end):
+	/* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.
+	       rdx as length is also unchanged.  */
+	VMOVU	%VEC(0), -(VEC_SIZE * 4)(%rax, %rdx)
+	VMOVU	%VEC(0), -(VEC_SIZE * 3)(%rax, %rdx)
+	VMOVU	%VEC(0), -(VEC_SIZE * 2)(%rax, %rdx)
+	VMOVU	%VEC(0), -VEC_SIZE(%rax, %rdx)
 	VZEROUPPER_SHORT_RETURN
 
 	.p2align 4