[0/5] i386: Optimize for Jump Conditional Code Erratum

Message ID 20191112161905.10048-1-hjl.tools@gmail.com
Headers show
Series
  • i386: Optimize for Jump Conditional Code Erratum
Related show

Message

H.J. Lu Nov. 12, 2019, 4:19 p.m.
Microcode update for Jump Conditional Code Erratum may cause performance
loss for some workloads:

https://www.intel.com/content/www/us/en/support/articles/000055650.html

Here is the set of assembler patches to mitigate performance impact by
aligning branches within 32-byte boundary.  The impacted instructions
are:

  a. Conditional jump.
  b. Fused conditional jump.
  c. Unconditional jump.
  d. Call.
  e. Ret.
  f. Indirect jump and call.

The new -mbranches-within-32B-boundaries command-line option aligns
conditional jump, fused conditional jump and unconditional jump within
32-byte boundary.

md_cons_worker, which allows a backend to track hard-coded opcodes in
instruction stream, and md_generic_table_relax_frag, which allows a
backend to extend relax_frag, are added to implement this new feature.

H.J. Lu (5):
  gas: Add md_cons_worker
  gas: Add md_generic_table_relax_frag
  i386: Align branches within a fixed boundary
  i386: Add -mbranches-within-32B-boundaries
  i386: Add tests for -malign-branch-boundary and -malign-branch

 gas/config/tc-i386.c                          | 968 +++++++++++++++++-
 gas/config/tc-i386.h                          |  31 +
 gas/doc/c-i386.texi                           |  37 +
 gas/read.c                                    |   4 +
 gas/testsuite/gas/i386/align-branch-1.s       |  72 ++
 gas/testsuite/gas/i386/align-branch-1a.d      |  77 ++
 gas/testsuite/gas/i386/align-branch-1b.d      |  77 ++
 gas/testsuite/gas/i386/align-branch-1c.d      |  77 ++
 gas/testsuite/gas/i386/align-branch-1d.d      |  76 ++
 gas/testsuite/gas/i386/align-branch-1e.d      |  77 ++
 gas/testsuite/gas/i386/align-branch-1f.d      |  77 ++
 gas/testsuite/gas/i386/align-branch-1g.d      |  77 ++
 gas/testsuite/gas/i386/align-branch-1h.d      |  76 ++
 gas/testsuite/gas/i386/align-branch-2.s       |  49 +
 gas/testsuite/gas/i386/align-branch-2a.d      |  55 +
 gas/testsuite/gas/i386/align-branch-2b.d      |  55 +
 gas/testsuite/gas/i386/align-branch-2c.d      |  55 +
 gas/testsuite/gas/i386/align-branch-3.d       |  33 +
 gas/testsuite/gas/i386/align-branch-3.s       |  28 +
 gas/testsuite/gas/i386/align-branch-4.s       |  30 +
 gas/testsuite/gas/i386/align-branch-4a.d      |  36 +
 gas/testsuite/gas/i386/align-branch-4b.d      |  36 +
 gas/testsuite/gas/i386/align-branch-5.d       |  36 +
 gas/testsuite/gas/i386/align-branch-5.s       |  32 +
 gas/testsuite/gas/i386/i386.exp               |  37 +
 .../gas/i386/x86-64-align-branch-1.s          |  70 ++
 .../gas/i386/x86-64-align-branch-1a.d         |  75 ++
 .../gas/i386/x86-64-align-branch-1b.d         |  75 ++
 .../gas/i386/x86-64-align-branch-1c.d         |  75 ++
 .../gas/i386/x86-64-align-branch-1d.d         |  74 ++
 .../gas/i386/x86-64-align-branch-1e.d         |  74 ++
 .../gas/i386/x86-64-align-branch-1f.d         |  75 ++
 .../gas/i386/x86-64-align-branch-1g.d         |  75 ++
 .../gas/i386/x86-64-align-branch-1h.d         |  74 ++
 .../gas/i386/x86-64-align-branch-2.s          |  44 +
 .../gas/i386/x86-64-align-branch-2a.d         |  50 +
 .../gas/i386/x86-64-align-branch-2b.d         |  50 +
 .../gas/i386/x86-64-align-branch-2c.d         |  50 +
 .../gas/i386/x86-64-align-branch-3.d          |  32 +
 .../gas/i386/x86-64-align-branch-3.s          |  27 +
 .../gas/i386/x86-64-align-branch-4.s          |  27 +
 .../gas/i386/x86-64-align-branch-4a.d         |  33 +
 .../gas/i386/x86-64-align-branch-4b.d         |  33 +
 .../gas/i386/x86-64-align-branch-5.d          |  37 +
 gas/write.c                                   |   5 +
 45 files changed, 3259 insertions(+), 4 deletions(-)
 create mode 100644 gas/testsuite/gas/i386/align-branch-1.s
 create mode 100644 gas/testsuite/gas/i386/align-branch-1a.d
 create mode 100644 gas/testsuite/gas/i386/align-branch-1b.d
 create mode 100644 gas/testsuite/gas/i386/align-branch-1c.d
 create mode 100644 gas/testsuite/gas/i386/align-branch-1d.d
 create mode 100644 gas/testsuite/gas/i386/align-branch-1e.d
 create mode 100644 gas/testsuite/gas/i386/align-branch-1f.d
 create mode 100644 gas/testsuite/gas/i386/align-branch-1g.d
 create mode 100644 gas/testsuite/gas/i386/align-branch-1h.d
 create mode 100644 gas/testsuite/gas/i386/align-branch-2.s
 create mode 100644 gas/testsuite/gas/i386/align-branch-2a.d
 create mode 100644 gas/testsuite/gas/i386/align-branch-2b.d
 create mode 100644 gas/testsuite/gas/i386/align-branch-2c.d
 create mode 100644 gas/testsuite/gas/i386/align-branch-3.d
 create mode 100644 gas/testsuite/gas/i386/align-branch-3.s
 create mode 100644 gas/testsuite/gas/i386/align-branch-4.s
 create mode 100644 gas/testsuite/gas/i386/align-branch-4a.d
 create mode 100644 gas/testsuite/gas/i386/align-branch-4b.d
 create mode 100644 gas/testsuite/gas/i386/align-branch-5.d
 create mode 100644 gas/testsuite/gas/i386/align-branch-5.s
 create mode 100644 gas/testsuite/gas/i386/x86-64-align-branch-1.s
 create mode 100644 gas/testsuite/gas/i386/x86-64-align-branch-1a.d
 create mode 100644 gas/testsuite/gas/i386/x86-64-align-branch-1b.d
 create mode 100644 gas/testsuite/gas/i386/x86-64-align-branch-1c.d
 create mode 100644 gas/testsuite/gas/i386/x86-64-align-branch-1d.d
 create mode 100644 gas/testsuite/gas/i386/x86-64-align-branch-1e.d
 create mode 100644 gas/testsuite/gas/i386/x86-64-align-branch-1f.d
 create mode 100644 gas/testsuite/gas/i386/x86-64-align-branch-1g.d
 create mode 100644 gas/testsuite/gas/i386/x86-64-align-branch-1h.d
 create mode 100644 gas/testsuite/gas/i386/x86-64-align-branch-2.s
 create mode 100644 gas/testsuite/gas/i386/x86-64-align-branch-2a.d
 create mode 100644 gas/testsuite/gas/i386/x86-64-align-branch-2b.d
 create mode 100644 gas/testsuite/gas/i386/x86-64-align-branch-2c.d
 create mode 100644 gas/testsuite/gas/i386/x86-64-align-branch-3.d
 create mode 100644 gas/testsuite/gas/i386/x86-64-align-branch-3.s
 create mode 100644 gas/testsuite/gas/i386/x86-64-align-branch-4.s
 create mode 100644 gas/testsuite/gas/i386/x86-64-align-branch-4a.d
 create mode 100644 gas/testsuite/gas/i386/x86-64-align-branch-4b.d
 create mode 100644 gas/testsuite/gas/i386/x86-64-align-branch-5.d

-- 
2.21.0

Comments

Florian Weimer Nov. 14, 2019, 11:59 a.m. | #1
* H. J. Lu:

> Microcode update for Jump Conditional Code Erratum may cause performance

> loss for some workloads:

>

> https://www.intel.com/content/www/us/en/support/articles/000055650.html

>

> Here is the set of assembler patches to mitigate performance impact by

> aligning branches within 32-byte boundary.  The impacted instructions

> are:

>

>   a. Conditional jump.

>   b. Fused conditional jump.

>   c. Unconditional jump.

>   d. Call.

>   e. Ret.

>   f. Indirect jump and call.

>

> The new -mbranches-within-32B-boundaries command-line option aligns

> conditional jump, fused conditional jump and unconditional jump within

> 32-byte boundary.


Should this mitigation be enabled by default?

According to the whitepaper, this mitigation has some overhead on
non-affected CPUs (some Intel Atom-type CPUs are mentioned).  Is there a
way to avoid this overhead, so that the decision to enable this by
default is easier?

Thanks,
Florian
H.J. Lu Nov. 14, 2019, 6:20 p.m. | #2
On Thu, Nov 14, 2019 at 3:59 AM Florian Weimer <fweimer@redhat.com> wrote:
>

> * H. J. Lu:

>

> > Microcode update for Jump Conditional Code Erratum may cause performance

> > loss for some workloads:

> >

> > https://www.intel.com/content/www/us/en/support/articles/000055650.html

> >

> > Here is the set of assembler patches to mitigate performance impact by

> > aligning branches within 32-byte boundary.  The impacted instructions

> > are:

> >

> >   a. Conditional jump.

> >   b. Fused conditional jump.

> >   c. Unconditional jump.

> >   d. Call.

> >   e. Ret.

> >   f. Indirect jump and call.

> >

> > The new -mbranches-within-32B-boundaries command-line option aligns

> > conditional jump, fused conditional jump and unconditional jump within

> > 32-byte boundary.

>

> Should this mitigation be enabled by default?


We'd like to see it enabled as much as it can.   The potential issues are

1.  Some assembly codes, like Linux kernel, check the code size.   Add prefix
increases code size and may break such assembly codes.
2. Linker may re-write instructions, like TLS optimization.  Add prefixes may
cause linker error or incorrect linker output.

> According to the whitepaper, this mitigation has some overhead on

> non-affected CPUs (some Intel Atom-type CPUs are mentioned).  Is there a

> way to avoid this overhead, so that the decision to enable this by


Yes, we are looking into it.

> default is easier?

>

> Thanks,

> Florian

>



-- 
H.J.
Jeff Law Nov. 14, 2019, 7:15 p.m. | #3
On 11/14/19 11:20 AM, H.J. Lu wrote:
> On Thu, Nov 14, 2019 at 3:59 AM Florian Weimer <fweimer@redhat.com> wrote:

>>

>> * H. J. Lu:

>>

>>> Microcode update for Jump Conditional Code Erratum may cause performance

>>> loss for some workloads:

>>>

>>> https://www.intel.com/content/www/us/en/support/articles/000055650.html

>>>

>>> Here is the set of assembler patches to mitigate performance impact by

>>> aligning branches within 32-byte boundary.  The impacted instructions

>>> are:

>>>

>>>   a. Conditional jump.

>>>   b. Fused conditional jump.

>>>   c. Unconditional jump.

>>>   d. Call.

>>>   e. Ret.

>>>   f. Indirect jump and call.

>>>

>>> The new -mbranches-within-32B-boundaries command-line option aligns

>>> conditional jump, fused conditional jump and unconditional jump within

>>> 32-byte boundary.

>>

>> Should this mitigation be enabled by default?

> 

> We'd like to see it enabled as much as it can.   The potential issues are

> 

> 1.  Some assembly codes, like Linux kernel, check the code size.   Add prefix

> increases code size and may break such assembly codes.

ISTM the kernel could just turn off the flag if we were to turn it on by
default.

But I can't help but point out that if we do this, then we're forcing
everyone to pay a price in terms of runtime performance and codesize --
even if they're on a processor where this doesn't matter.

Additionally, I have yet to find any documentation which indicates with
better precision when this happens and what the consequences are when it
does happen.  That makes it impossible to know if there's any kind of
filtering we can do to avoid inserting to many nops/prefixes to ensure
branch alignment.  It's also exceedingly hard to assess the real world
impact of the bug which in turn makes it hard to assess the importance
of the mitigations.

> 2. Linker may re-write instructions, like TLS optimization.  Add prefixes may

> cause linker error or incorrect linker output.

> 

>> According to the whitepaper, this mitigation has some overhead on

>> non-affected CPUs (some Intel Atom-type CPUs are mentioned).  Is there a

>> way to avoid this overhead, so that the decision to enable this by

> 

> Yes, we are looking into it.

I don't want to open a can of worms here, but...

Aligning everything seems needlessly wasteful.  So at a high level I
wouldn't bother aligning anything in a cold code segment.   I'd probably
also avoid aligning things in code that isn't on the hot paths.

With some hackery we probably could identify the former.  The latter is
tougher because all the mitigation work is done in the assembler with no
cooperation from the compiler.  One might reasonably ask that the
compiler identify jumps that should be aligned, which we could do with a
pseudo-op or directive.

The other thing that comes to mind is branch target alignment.   If a
jump is going to need alignment, might it be better to instead insert
the nops/prefixes at the previous label in some cases?  This might be
particularly interesting is those nops/prefixes happen to increase the
alignment of the label to a nicer value.  If this happens with any
regularity, then we're killing two birds with one stone -- we're fixing
the alignment of the jump itself, but also improving the alignment of a
branch target which can be good for performance.

Jeff
Florian Weimer Nov. 14, 2019, 7:45 p.m. | #4
* Jeff Law:

> Additionally, I have yet to find any documentation which indicates with

> better precision when this happens and what the consequences are when it

> does happen.  That makes it impossible to know if there's any kind of

> filtering we can do to avoid inserting to many nops/prefixes to ensure

> branch alignment.  It's also exceedingly hard to assess the real world

> impact of the bug which in turn makes it hard to assess the importance

> of the mitigations.


My understanding is that the mitigation for the Jcc erratum and the
mitigation of the performance impact of the microupdate differs
considerably.

The performance mitigation is required for pretty much all control
flow instructions, and the criteria are reasonably clearly explained
in the Intel whitepaper, I think.

I suspect that if you do not run with a microcode update, the
mitigation would only have to be applied to certain Jcc opcodes, so
considerably fewer changes are needed.  I don't think there is public
guidance regarding the required mitigation for that.  But given that
it targets only Jcc instead of JMP/CALL/RET/Jcc, I expect that the
impact would be less.  And who knows, maybe only conditional jumps of
a certain encoding are affected.

> The other thing that comes to mind is branch target alignment.   If a

> jump is going to need alignment, might it be better to instead insert

> the nops/prefixes at the previous label in some cases?


Yes, that's exactly what's suggested in the whitepaper.  We could also
shift the function alignment itself, so that the entry point is not
aligned on a 16-byte boundary, if that happens to move the control
flow instructions away from the problematic 32-byte boundaries.
Although that would likely break some software that assumes 16-byte
function alignment.  There is at least one example of enterprise
software that uses the padding around certain glibc system call
wrappers to store their hot-patching trampolines, for example.
Jeff Law Nov. 14, 2019, 8:26 p.m. | #5
On 11/14/19 12:45 PM, Florian Weimer wrote:
[ ...]

> I suspect that if you do not run with a microcode update, the

> mitigation would only have to be applied to certain Jcc opcodes, so

> considerably fewer changes are needed.  I don't think there is public

> guidance regarding the required mitigation for that.  But given that

> it targets only Jcc instead of JMP/CALL/RET/Jcc, I expect that the

> impact would be less.  And who knows, maybe only conditional jumps of

> a certain encoding are affected.

This is what I was referring to.  Under what conditions does this
problem show up.  Yea, I know it's branches spanning cache lines, but
it's got to be more complex than that :-)


> 

>> The other thing that comes to mind is branch target alignment.   If a

>> jump is going to need alignment, might it be better to instead insert

>> the nops/prefixes at the previous label in some cases?

> 

> Yes, that's exactly what's suggested in the whitepaper.  We could also

> shift the function alignment itself, so that the entry point is not

> aligned on a 16-byte boundary, if that happens to move the control

> flow instructions away from the problematic 32-byte boundaries.

> Although that would likely break some software that assumes 16-byte

> function alignment.  There is at least one example of enterprise

> software that uses the padding around certain glibc system call

> wrappers to store their hot-patching trampolines, for example.

I  must have missed that.  Good to see my idea isn't totally off base.

Jeff
Fangrui Song Nov. 15, 2019, 12:16 a.m. | #6
On 2019-11-12, H.J. Lu wrote:
>Microcode update for Jump Conditional Code Erratum may cause performance

>loss for some workloads:

>

>https://www.intel.com/content/www/us/en/support/articles/000055650.html

>

>Here is the set of assembler patches to mitigate performance impact by

>aligning branches within 32-byte boundary.  The impacted instructions

>are:


So, a few questions.

1. Without the assembler mitigation, what is the performance hit with and
  without the microcode update?
2. What is the code size increase of this assembler mitigation?
3. Why is the jcc+fused+jmp set suggested? (-mbranches-within-32B-boundaries)
  What is the performance and code size impact with this set compared
  with the full set? Among "jcc+fused+call+jmp+ret+indirect", which one
  gives the largest hit?
4. Shall we default to -mbranches-within-32B-boundaries if the specified
   -march= or -mtune= may be affected by the erratum?
5. Do we need to increase the section alignment (sh_addralign)?
H.J. Lu Dec. 3, 2019, 8:19 p.m. | #7
On Thu, Nov 14, 2019 at 4:16 PM Fangrui Song <i@maskray.me> wrote:
>

> On 2019-11-12, H.J. Lu wrote:

> >Microcode update for Jump Conditional Code Erratum may cause performance

> >loss for some workloads:

> >

> >https://www.intel.com/content/www/us/en/support/articles/000055650.html

> >

> >Here is the set of assembler patches to mitigate performance impact by

> >aligning branches within 32-byte boundary.  The impacted instructions

> >are:

>

> So, a few questions.

>

> 1. Without the assembler mitigation, what is the performance hit with and

>   without the microcode update?


The JCC erratum microcode update will cause a greater number of misses
out of the Decoded ICache and subsequent switches to the legacy decode
pipeline. This occurs since branches that overlay or end on a 32-byte
boundary are unable to fill into the Decoded ICache.

The potential performance impact of the JCC erratum mitigation arises
from two different sources:

1.      A switch penalty that occurs when executing in the Decoded
ICache and switching over to the legacy decode pipeline.
2.      Inefficiencies that occur when executing from the legacy
decode pipeline that are potentially hidden by the Decoded ICache.

Intel has observed performance effects associated with the workaround
ranging from 0-4% on many industry-standard benchmarks. In
subcomponents of these benchmarks, Intel has observed outliers higher
than the 0-4% range. The effects on other workloads not observed by
Intel may vary. Intel has developed software-based tools to minimize
the impact on potentially affected applications and workloads.

> 2. What is the code size increase of this assembler mitigation?


We measured the increase in code size due to the addition of padding
to instructions to align branches correctly. The geomean increase in
code size is 3-4% with individual outliers of up to 5%.

> 3. Why is the jcc+fused+jmp set suggested? (-mbranches-within-32B-boundaries)

>   What is the performance and code size impact with this set compared

>   with the full set? Among "jcc+fused+call+jmp+ret+indirect", which one

>   gives the largest hit?


From our investigation, we observed jcc+fused+jmp has mitigated most
of the performance effect from the benchmarks with moderate code size
increase. We think it strikes an appropriate balance between
performance gain and code size increase for most workloads.  For other
cases, we provide separate options for users to explore additional
performance improvement.

> 4. Shall we default to -mbranches-within-32B-boundaries if the specified

>    -march= or -mtune= may be affected by the erratum?


No. It’s a performance mitigation for the microcode update not a
functional fix. While it can mitigate the potential performance effect
in most cases as we observed, it increases the code size and may harm
the performance in some cases. It may also impact the performance of
those architectures which are not affected by this JCC erratum.

Software mitigation cannot be applied in some scenarios where
application behavior is dependent on exact code size. In other words,
the inserted padding (prefix, nop) may break the assumption of code
size that the programmer has made.  We have observed such assumptions
in the compilation of the Linux kernel.

Therefore we do not enable it by default. The user should evaluate its
impact and make their own determination as to whether to enable the
software mitigation  knowing that when this option is enabled, the
performance impact may vary case-by-case.

> 5. Do we need to increase the section alignment (sh_addralign)?


The minimum section alignment is increased to 32 bytes to ensure
that branches can be properly aligned.

-- 
H.J.
Jeff Law Dec. 4, 2019, 12:01 a.m. | #8
On 12/3/19 1:19 PM, H.J. Lu wrote:
> 

>> 4. Shall we default to -mbranches-within-32B-boundaries if the specified

>>    -march= or -mtune= may be affected by the erratum?

> 

> No. It’s a performance mitigation for the microcode update not a

> functional fix. While it can mitigate the potential performance effect

> in most cases as we observed, it increases the code size and may harm

> the performance in some cases. It may also impact the performance of

> those architectures which are not affected by this JCC erratum.

> 

> Software mitigation cannot be applied in some scenarios where

> application behavior is dependent on exact code size. In other words,

> the inserted padding (prefix, nop) may break the assumption of code

> size that the programmer has made.  We have observed such assumptions

> in the compilation of the Linux kernel.

ISTM those cases (like the kernel startup code) could/should opt-opt,
possibly at the file level since IIRC it's just one assembly file where
the sizes of jumps are supposed to be fixed.

> 

> Therefore we do not enable it by default. The user should evaluate its

> impact and make their own determination as to whether to enable the

> software mitigation  knowing that when this option is enabled, the

> performance impact may vary case-by-case.

The problem with not enabling it by default is a distro would have to
inject the flag into their builds.  It's not uncommon for injection
mechanisms to not work on packages like gcc, glibc, etc.

Jeff
Fangrui Song Dec. 4, 2019, 4:45 a.m. | #9
On 2019-12-03, Jeff Law wrote:
>On 12/3/19 1:19 PM, H.J. Lu wrote:

>>

>>> 4. Shall we default to -mbranches-within-32B-boundaries if the specified

>>>    -march= or -mtune= may be affected by the erratum?

>>

>> No. It’s a performance mitigation for the microcode update not a

>> functional fix. While it can mitigate the potential performance effect

>> in most cases as we observed, it increases the code size and may harm

>> the performance in some cases. It may also impact the performance of

>> those architectures which are not affected by this JCC erratum.

>>

>> Software mitigation cannot be applied in some scenarios where

>> application behavior is dependent on exact code size. In other words,

>> the inserted padding (prefix, nop) may break the assumption of code

>> size that the programmer has made.  We have observed such assumptions

>> in the compilation of the Linux kernel.


Padding instructions with prefixes instead of inserting multi-byte NOPs
may be a generally-useful feature. This may be very useful if it can be
applied at basic-block level, especially for loops. It will be very nice
if profile guided optimizations can insert some directives to guide the
prefix placement in appropriate positions. This part should probably be
made a bit more general so that it can be reused by performance
improvement changes.

>ISTM those cases (like the kernel startup code) could/should opt-opt,

>possibly at the file level since IIRC it's just one assembly file where

>the sizes of jumps are supposed to be fixed.


Inserting prefixes at arbitrary positions can break assembly like:

.if . - label == 4

Such constructs are very rare. I've checked Linux kernel, there are
indeed a few tricky constructs, only one instance is relevant, though:

   // arch/arm/include/asm/assembler.h
   .if . - 9997b == 2
   // tools/testing/selftests/x86/sysret_rip.c (only this one is relevant)
   .ifne . - test_page - 4096
   // arch/powerpc/include/asm/head-64.h
   .if (. - name > (start) + (size) - name##_start)

>> Therefore we do not enable it by default. The user should evaluate its

>> impact and make their own determination as to whether to enable the

>> software mitigation  knowing that when this option is enabled, the

>> performance impact may vary case-by-case.

>The problem with not enabling it by default is a distro would have to

>inject the flag into their builds.  It's not uncommon for injection

>mechanisms to not work on packages like gcc, glibc, etc.


The code size increase (3-4%) is large. In gcc, if an optimization can
improve performance by a% at the cost of >a% code size increase, is it
considered as a good trade-off for -O2? -O1? -Os?

The performance decrease may not even be perceived for lots of software
in a distribution. Opt-in may be a good first choice when we still lack
statistics/feedback from users.

If we have profile information, we can teach GCC to insert some
directives at basic-block/function/file level to hint that jump
instructions in some code sequences need more care.
Jeff Law Dec. 4, 2019, 6:24 p.m. | #10
On 12/3/19 9:45 PM, Fangrui Song wrote:
> On 2019-12-03, Jeff Law wrote:

>> On 12/3/19 1:19 PM, H.J. Lu wrote:

>>>

>>>> 4. Shall we default to -mbranches-within-32B-boundaries if the

>>>> specified

>>>>    -march= or -mtune= may be affected by the erratum?

>>>

>>> No. It’s a performance mitigation for the microcode update not a

>>> functional fix. While it can mitigate the potential performance effect

>>> in most cases as we observed, it increases the code size and may harm

>>> the performance in some cases. It may also impact the performance of

>>> those architectures which are not affected by this JCC erratum.

>>>

>>> Software mitigation cannot be applied in some scenarios where

>>> application behavior is dependent on exact code size. In other words,

>>> the inserted padding (prefix, nop) may break the assumption of code

>>> size that the programmer has made.  We have observed such assumptions

>>> in the compilation of the Linux kernel.

> 

> Padding instructions with prefixes instead of inserting multi-byte NOPs

> may be a generally-useful feature. This may be very useful if it can be

> applied at basic-block level, especially for loops. It will be very nice

> if profile guided optimizations can insert some directives to guide the

> prefix placement in appropriate positions. This part should probably be

> made a bit more general so that it can be reused by performance

> improvement changes.

> 

>> ISTM those cases (like the kernel startup code) could/should opt-opt,

>> possibly at the file level since IIRC it's just one assembly file where

>> the sizes of jumps are supposed to be fixed.

> 

> Inserting prefixes at arbitrary positions can break assembly like:

> 

> .if . - label == 4

> 

> Such constructs are very rare. I've checked Linux kernel, there are

> indeed a few tricky constructs, only one instance is relevant, though:

Again, I consider this a niche case and if the option were turned on by
deafult, the kernel can either opt-out or fix its code to handle the new
sequences.  Yea, it happens and it's a factor in the overall decision
making process, but it's not a major factor in my mind.


> 

>>> Therefore we do not enable it by default. The user should evaluate its

>>> impact and make their own determination as to whether to enable the

>>> software mitigation  knowing that when this option is enabled, the

>>> performance impact may vary case-by-case.

>> The problem with not enabling it by default is a distro would have to

>> inject the flag into their builds.  It's not uncommon for injection

>> mechanisms to not work on packages like gcc, glibc, etc.

> 

> The code size increase (3-4%) is large. In gcc, if an optimization can

> improve performance by a% at the cost of >a% code size increase, is it

> considered as a good trade-off for -O2? -O1? -Os?

It's far from that simple from a distro standpoint.  WHen I look at
everything we have to balance I can argue for both choices -- there
isn't a clear winner.

FWIW, I'd give a 3-4% codesize regression to restore a comparable
performance regression.  But that's just where I land.  Others will
certainly have a differing opinion here.


> 

> The performance decrease may not even be perceived for lots of software

> in a distribution. Opt-in may be a good first choice when we still lack

> statistics/feedback from users.

> 

> If we have profile information, we can teach GCC to insert some

> directives at basic-block/function/file level to hint that jump

> instructions in some code sequences need more care.

I've actually already suggested this in the thread.

jeff
H.J. Lu Dec. 4, 2019, 11:36 p.m. | #11
On Tue, Dec 3, 2019 at 4:01 PM Jeff Law <law@redhat.com> wrote:
>

> On 12/3/19 1:19 PM, H.J. Lu wrote:

> >

> >> 4. Shall we default to -mbranches-within-32B-boundaries if the specified

> >>    -march= or -mtune= may be affected by the erratum?

> >

> > No. It’s a performance mitigation for the microcode update not a

> > functional fix. While it can mitigate the potential performance effect

> > in most cases as we observed, it increases the code size and may harm

> > the performance in some cases. It may also impact the performance of

> > those architectures which are not affected by this JCC erratum.

> >

> > Software mitigation cannot be applied in some scenarios where

> > application behavior is dependent on exact code size. In other words,

> > the inserted padding (prefix, nop) may break the assumption of code

> > size that the programmer has made.  We have observed such assumptions

> > in the compilation of the Linux kernel.

> ISTM those cases (like the kernel startup code) could/should opt-opt,

> possibly at the file level since IIRC it's just one assembly file where

> the sizes of jumps are supposed to be fixed.

>

> >

> > Therefore we do not enable it by default. The user should evaluate its

> > impact and make their own determination as to whether to enable the

> > software mitigation  knowing that when this option is enabled, the

> > performance impact may vary case-by-case.

> The problem with not enabling it by default is a distro would have to

> inject the flag into their builds.  It's not uncommon for injection

> mechanisms to not work on packages like gcc, glibc, etc.


I can certainly add an option to make it the default.   It can be
turned off with

-malign-branch-boundary=0

-- 
H.J.
H.J. Lu Dec. 4, 2019, 11:44 p.m. | #12
On Tue, Dec 3, 2019 at 8:45 PM Fangrui Song <i@maskray.me> wrote:
>

> On 2019-12-03, Jeff Law wrote:

> >On 12/3/19 1:19 PM, H.J. Lu wrote:

> >>

> >>> 4. Shall we default to -mbranches-within-32B-boundaries if the specified

> >>>    -march= or -mtune= may be affected by the erratum?

> >>

> >> No. It’s a performance mitigation for the microcode update not a

> >> functional fix. While it can mitigate the potential performance effect

> >> in most cases as we observed, it increases the code size and may harm

> >> the performance in some cases. It may also impact the performance of

> >> those architectures which are not affected by this JCC erratum.

> >>

> >> Software mitigation cannot be applied in some scenarios where

> >> application behavior is dependent on exact code size. In other words,

> >> the inserted padding (prefix, nop) may break the assumption of code

> >> size that the programmer has made.  We have observed such assumptions

> >> in the compilation of the Linux kernel.

>

> Padding instructions with prefixes instead of inserting multi-byte NOPs

> may be a generally-useful feature. This may be very useful if it can be

> applied at basic-block level, especially for loops. It will be very nice


This will require a new directive.

> if profile guided optimizations can insert some directives to guide the

> prefix placement in appropriate positions. This part should probably be

> made a bit more general so that it can be reused by performance

> improvement changes.


It can be extended to do it.  But I prefer to do it as a followup change after
it has been used for a while.
> >ISTM those cases (like the kernel startup code) could/should opt-opt,

> >possibly at the file level since IIRC it's just one assembly file where

> >the sizes of jumps are supposed to be fixed.

>

> Inserting prefixes at arbitrary positions can break assembly like:

>

> .if . - label == 4

>

> Such constructs are very rare. I've checked Linux kernel, there are

> indeed a few tricky constructs, only one instance is relevant, though:

>

>    // arch/arm/include/asm/assembler.h

>    .if . - 9997b == 2

>    // tools/testing/selftests/x86/sysret_rip.c (only this one is relevant)

>    .ifne . - test_page - 4096

>    // arch/powerpc/include/asm/head-64.h

>    .if (. - name > (start) + (size) - name##_start)

>

> >> Therefore we do not enable it by default. The user should evaluate its

> >> impact and make their own determination as to whether to enable the

> >> software mitigation  knowing that when this option is enabled, the

> >> performance impact may vary case-by-case.

> >The problem with not enabling it by default is a distro would have to

> >inject the flag into their builds.  It's not uncommon for injection

> >mechanisms to not work on packages like gcc, glibc, etc.

>

> The code size increase (3-4%) is large. In gcc, if an optimization can

> improve performance by a% at the cost of >a% code size increase, is it

> considered as a good trade-off for -O2? -O1? -Os?

>

> The performance decrease may not even be perceived for lots of software

> in a distribution. Opt-in may be a good first choice when we still lack

> statistics/feedback from users.

>

> If we have profile information, we can teach GCC to insert some

> directives at basic-block/function/file level to hint that jump

> instructions in some code sequences need more care.



-- 
H.J.