[v3] combine: perform jump threading at the end

Message ID 20180910105919.19602-1-iii@linux.ibm.com
State New
Series
  • [v3] combine: perform jump threading at the end

Commit Message

Ilya Leoshkevich Sept. 10, 2018, 10:59 a.m.
Consider the following RTL:

(code_label 11 10 26 4 2 (nil) [1 uses])
(note 26 11 12 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(insn 12 26 15 4 (set (reg:SI 65)
        (if_then_else:SI (eq (reg:CCZ 33 %cc)
                (const_int 0 [0]))
            (const_int 1 [0x1])
            (const_int 0 [0]))) "pr80080-4.c":9 1674 {*movsicc})
(insn 15 12 16 4 (parallel [
            (set (reg:CCZ 33 %cc)
                (compare:CCZ (reg:SI 65)
                    (const_int 0 [0])))
            (clobber (scratch:SI))
        ]) "pr80080-4.c":9 1216 {*tstsi_cconly_extimm})
(jump_insn 16 15 17 4 (set (pc)
        (if_then_else (ne (reg:CCZ 33 %cc)
                (const_int 0 [0]))
            (label_ref:DI 23)
            (pc))) "pr80080-4.c":9 1897 {*cjump_64})

Combine simplifies this into:

(code_label 11 10 26 4 2 (nil) [1 uses])
(note 26 11 12 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(note 12 26 15 4 NOTE_INSN_DELETED)
(note 15 12 16 4 NOTE_INSN_DELETED)
(jump_insn 16 15 17 4 (set (pc)
        (if_then_else (eq (reg:CCZ 33 %cc)
                (const_int 0 [0]))
            (label_ref:DI 23)
            (pc))) "pr80080-4.c":9 1897 {*cjump_64})

opening up the possibility to perform jump threading.  Since this
happens infrequently, perform jump threading only when there is a
changed basic block whose sole side effect is a trailing jump.

Also remove redundant usage of TV_JUMP, because rebuild_jump_labels ()
and cleanup_cfg () already have their own timevars.

gcc/ChangeLog:

2018-09-05  Ilya Leoshkevich  <iii@linux.ibm.com>

	PR target/80080
	* combine.c (is_single_jump_bb): New function.
	(struct combine_summary): New struct.
	(combine_instructions): Instead of returning
	new_direct_jump_p, fill struct combine_summary. In addition
	to the existing new_direct_jump_p, it contains a new
	new_single_jump_p field, which controls whether or not
	jump threading should be performed after combine.
	(rest_of_handle_combine): Perform jump threading if there is
	a possibility that it would be profitable.  Remove redundant
	usage of TV_JUMP.

gcc/testsuite/ChangeLog:

2018-09-05  Ilya Leoshkevich  <iii@linux.ibm.com>

	PR target/80080
	* gcc.target/s390/pr80080-4.c: New test.
---
 gcc/combine.c                             | 89 +++++++++++++++++++----
 gcc/testsuite/gcc.target/s390/pr80080-4.c | 16 ++++
 2 files changed, 89 insertions(+), 16 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/s390/pr80080-4.c

-- 
2.18.0

Comments

Segher Boessenkool Sept. 14, 2018, 9:35 p.m. | #1
Hi!

On Mon, Sep 10, 2018 at 12:59:19PM +0200, Ilya Leoshkevich wrote:
> Consider the following RTL:
> 
> (code_label 11 10 26 4 2 (nil) [1 uses])
> (note 26 11 12 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
> (insn 12 26 15 4 (set (reg:SI 65)
>         (if_then_else:SI (eq (reg:CCZ 33 %cc)
>                 (const_int 0 [0]))
>             (const_int 1 [0x1])
>             (const_int 0 [0]))) "pr80080-4.c":9 1674 {*movsicc})
> (insn 15 12 16 4 (parallel [
>             (set (reg:CCZ 33 %cc)
>                 (compare:CCZ (reg:SI 65)
>                     (const_int 0 [0])))
>             (clobber (scratch:SI))
>         ]) "pr80080-4.c":9 1216 {*tstsi_cconly_extimm})
> (jump_insn 16 15 17 4 (set (pc)
>         (if_then_else (ne (reg:CCZ 33 %cc)
>                 (const_int 0 [0]))
>             (label_ref:DI 23)
>             (pc))) "pr80080-4.c":9 1897 {*cjump_64})
> 
> Combine simplifies this into:
> 
> (code_label 11 10 26 4 2 (nil) [1 uses])
> (note 26 11 12 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
> (note 12 26 15 4 NOTE_INSN_DELETED)
> (note 15 12 16 4 NOTE_INSN_DELETED)
> (jump_insn 16 15 17 4 (set (pc)
>         (if_then_else (eq (reg:CCZ 33 %cc)
>                 (const_int 0 [0]))
>             (label_ref:DI 23)
>             (pc))) "pr80080-4.c":9 1897 {*cjump_64})
> 
> opening up the possibility to perform jump threading.  Since this
> happens infrequently, perform jump threading only when there is a
> changed basic block, whose sole side effect is a trailing jump.

So this happens because now there is *only* a conditional jump in this BB?

Could you please show generated code before and after this patch?
I mean generated assembler code.  What -S gives you.

> +/* Return true iff the only side effect of BB is its trailing jump_insn.  */
> +
> +static bool
> +is_single_jump_bb (basic_block bb)
> +{
> +  rtx_insn *end = BB_END (bb);
> +  rtx_insn *insn;
> +
> +  if (!JUMP_P (end))
> +    return false;
> +
> +  for (insn = BB_HEAD (bb); insn != end; insn = NEXT_INSN (insn))
> +    if (INSN_P (insn) && side_effects_p (PATTERN (insn)))
> +      return false;
> +  return true;
> +}

Hrm, so it is more than that.

Why does the existing jump threading not work for you; should it happen
at another time?


Segher
Ilya Leoshkevich Sept. 17, 2018, 8:50 a.m. | #2
> Am 14.09.2018 um 23:35 schrieb Segher Boessenkool <segher@kernel.crashing.org>:
> 
> Hi!
> 
> On Mon, Sep 10, 2018 at 12:59:19PM +0200, Ilya Leoshkevich wrote:
>> Consider the following RTL:
>> 
>> (code_label 11 10 26 4 2 (nil) [1 uses])
>> (note 26 11 12 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
>> (insn 12 26 15 4 (set (reg:SI 65)
>>        (if_then_else:SI (eq (reg:CCZ 33 %cc)
>>                (const_int 0 [0]))
>>            (const_int 1 [0x1])
>>            (const_int 0 [0]))) "pr80080-4.c":9 1674 {*movsicc})
>> (insn 15 12 16 4 (parallel [
>>            (set (reg:CCZ 33 %cc)
>>                (compare:CCZ (reg:SI 65)
>>                    (const_int 0 [0])))
>>            (clobber (scratch:SI))
>>        ]) "pr80080-4.c":9 1216 {*tstsi_cconly_extimm})
>> (jump_insn 16 15 17 4 (set (pc)
>>        (if_then_else (ne (reg:CCZ 33 %cc)
>>                (const_int 0 [0]))
>>            (label_ref:DI 23)
>>            (pc))) "pr80080-4.c":9 1897 {*cjump_64})
>> 
>> Combine simplifies this into:
>> 
>> (code_label 11 10 26 4 2 (nil) [1 uses])
>> (note 26 11 12 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
>> (note 12 26 15 4 NOTE_INSN_DELETED)
>> (note 15 12 16 4 NOTE_INSN_DELETED)
>> (jump_insn 16 15 17 4 (set (pc)
>>        (if_then_else (eq (reg:CCZ 33 %cc)
>>                (const_int 0 [0]))
>>            (label_ref:DI 23)
>>            (pc))) "pr80080-4.c":9 1897 {*cjump_64})
>> 
>> opening up the possibility to perform jump threading.  Since this
>> happens infrequently, perform jump threading only when there is a
>> changed basic block, whose sole side effect is a trailing jump.
> 
> So this happens because now there is *only* a conditional jump in this BB?

Yes.

> Could you please show generated code before and after this patch?

> I mean generated assembler code.  What -S gives you.


Before:

foo4:
.LFB0:
	.cfi_startproc
	lt	%r1,0(%r2)
	jne	.L2
	lhi	%r3,1
	cs	%r1,%r3,0(%r2)
.L2:
	jne	.L5
	br	%r14
.L5:
	jg	bar

After:

foo4:
.LFB0:
	.cfi_startproc
	lt	%r1,0(%r2)
	jne	.L4
	lhi	%r3,1
	cs	%r1,%r3,0(%r2)
	jne	.L4
	br	%r14
.L4:
	jg	bar

>> +/* Return true iff the only side effect of BB is its trailing jump_insn.  */
>> +
>> +static bool
>> +is_single_jump_bb (basic_block bb)
>> +{
>> +  rtx_insn *end = BB_END (bb);
>> +  rtx_insn *insn;
>> +
>> +  if (!JUMP_P (end))
>> +    return false;
>> +
>> +  for (insn = BB_HEAD (bb); insn != end; insn = NEXT_INSN (insn))
>> +    if (INSN_P (insn) && side_effects_p (PATTERN (insn)))
>> +      return false;
>> +  return true;
>> +}
> 
> Hrm, so it is more than that.

The intention of this check is to match the „Short circuit cases where
block B contains some side effects, as we can't safely bypass it“ check
in thread_jump ().  I've just noticed that I got it wrong: I should also
check whether the JUMP_INSN itself has side effects, as is done there.

> Why does the existing jump threading not work for you; should it happen
> at another time?

We call cleanup_cfg (CLEANUP_THREADING) only once - during the „jump“
pass, which happens before combine.  There is also „jump2“ pass, which
happens afterwards, and after discussion with Ulrich Weigand I tried to
move jump threading there.  While this change had the desired effect on
the testcase, the code got worse in other places.  Example from
GemsFDTD:

Before:

103e96c:       b9 04 00 31             lgr     %r3,%r1
103e970:       a7 18 00 00             lhi     %r1,0
103e974:       e3 20 f0 d0 00 0d       dsg     %r2,208(%r15)
103e97a:       b9 20 00 c3             cgr     %r12,%r3
103e97e:       a7 29 ff ff             lghi    %r2,-1
103e982:       ec 12 00 01 00 42       lochih  %r1,1
103e988:       e3 20 f0 f8 00 82       xg      %r2,248(%r15)
103e98e:       1a 15                   ar      %r1,%r5
103e990:       b9 e9 c0 72             sgrk    %r7,%r2,%r12
103e994:       b3 c1 00 e2             ldgr    %f14,%r2
103e998:       b9 f8 a0 21             ark     %r2,%r1,%r10
103e99c:       12 99                   ltr     %r9,%r9
103e99e:       a7 18 00 00             lhi     %r1,0
103e9a2:       ec 1c 00 01 00 42       lochile %r1,1
103e9a8:       50 10 f1 28             st      %r1,296(%r15)
103e9ac:       c2 9d 00 00 00 00       cfi     %r9,0
103e9b2:       c0 c4 00 01 28 4e       jgle    1063a4e
103e9b8:       e5 4c f1 08 00 00       mvhi    264(%r15),0
103e9be:       eb 14 00 03 00 0d       sllg    %r1,%r4,3

After:

103e96c:       b9 04 00 31             lgr     %r3,%r1       
103e970:       e5 4c f1 18 00 01       mvhi    280(%r15),1
103e976:       a7 18 00 00             lhi     %r1,0         
103e97a:       e3 20 f0 d0 00 0d       dsg     %r2,208(%r15) 
103e980:       b9 20 00 c3             cgr     %r12,%r3      
103e984:       a7 29 ff ff             lghi    %r2,-1        
103e988:       ec 12 00 01 00 42       lochih  %r1,1         
103e98e:       e3 20 f0 f8 00 82       xg      %r2,248(%r15) 
103e994:       1a 15                   ar      %r1,%r5       
103e996:       b3 c1 00 e2             ldgr    %f14,%r2
103e99a:       b9 e9 c0 72             sgrk    %r7,%r2,%r12  
103e99e:       b9 f8 a0 21             ark     %r2,%r1,%r10  
103e9a2:       c2 9d 00 00 00 00       cfi     %r9,0         
103e9a8:       c0 c4 00 01 28 51       jgle    1063a4a
103e9ae:       e5 4c f1 18 00 00       mvhi    280(%r15),0
103e9b4:       e5 4c f1 08 00 00       mvhi    264(%r15),0   
103e9ba:       eb 14 00 03 00 0d       sllg    %r1,%r4,3     

Diff:

                lgr     %rn,%rn
-               mvhi    m(%rx),1
                lhi     %rn,0

                ar      %rn,%rn
-               ldgr    %fn,%rn
                sgrk    %rn,%rn,%rn
+               ldgr    %fn,%rn
                ark     %rn,%rn,%rn
+               ltr     %rn,%rn
+               lhi     %rn,0
+               lochile %rn,1
+               st      %rn,m(%rx)
                cfi     %rn,0

                mvhi    m(%rx),0
-               mvhi    m(%rx),0
                sllg    %rn,%rn,3

The code used to compare %r9 with 0, jump based on the comparison
result, and set the local variable in the branch.  Now it compares %r9
with 0 twice.  I haven’t looked at the details yet, but apparently we’re
missing some optimization because jump threading is done too late.

> 
> Segher
Segher Boessenkool Sept. 17, 2018, 5:11 p.m. | #3
On Mon, Sep 17, 2018 at 10:50:58AM +0200, Ilya Leoshkevich wrote:
> > Am 14.09.2018 um 23:35 schrieb Segher Boessenkool <segher@kernel.crashing.org>:
> > Could you please show generated code before and after this patch?
> > I mean generated assembler code.  What -S gives you.
> 
> Before:
> 
> foo4:
> .LFB0:
> 	.cfi_startproc
> 	lt	%r1,0(%r2)
> 	jne	.L2
> 	lhi	%r3,1
> 	cs	%r1,%r3,0(%r2)
> .L2:
> 	jne	.L5
> 	br	%r14
> .L5:
> 	jg	bar
> 
> After:
> 
> foo4:
> .LFB0:
> 	.cfi_startproc
> 	lt	%r1,0(%r2)
> 	jne	.L4
> 	lhi	%r3,1
> 	cs	%r1,%r3,0(%r2)
> 	jne	.L4
> 	br	%r14
> .L4:
> 	jg	bar

Ah.  And a compiler of some weeks old gives

foo4:
.LFB0:
	.cfi_startproc
	lhi	%r3,0
	lhi	%r4,1
	cs	%r3,%r4,0(%r2)
	jne	.L4
	br	%r14
.L4:
	jg	bar

so this is all caused by the recent optimisation that avoids the "cs" if
it can.

> > Why does the existing jump threading not work for you; should it happen
> > at another time?
> 
> We call cleanup_cfg (CLEANUP_THREADING) only once - during the „jump“
> pass, which happens before combine.  There is also „jump2“ pass, which
> happens afterwards, and after discussion with Ulrich Weigand I tried to
> move jump threading there.  While this change had the desired effect on
> the testcase, the code got worse in another places. Example from
> GemsFDTD:
> 
> Before:
> 
> 103e96c:       b9 04 00 31             lgr     %r3,%r1
> 103e970:       a7 18 00 00             lhi     %r1,0
> 103e974:       e3 20 f0 d0 00 0d       dsg     %r2,208(%r15)
> 103e97a:       b9 20 00 c3             cgr     %r12,%r3
> 103e97e:       a7 29 ff ff             lghi    %r2,-1
> 103e982:       ec 12 00 01 00 42       lochih  %r1,1
> 103e988:       e3 20 f0 f8 00 82       xg      %r2,248(%r15)
> 103e98e:       1a 15                   ar      %r1,%r5
> 103e990:       b9 e9 c0 72             sgrk    %r7,%r2,%r12
> 103e994:       b3 c1 00 e2             ldgr    %f14,%r2
> 103e998:       b9 f8 a0 21             ark     %r2,%r1,%r10
> 103e99c:       12 99                   ltr     %r9,%r9
> 103e99e:       a7 18 00 00             lhi     %r1,0
> 103e9a2:       ec 1c 00 01 00 42       lochile %r1,1
> 103e9a8:       50 10 f1 28             st      %r1,296(%r15)
> 103e9ac:       c2 9d 00 00 00 00       cfi     %r9,0
> 103e9b2:       c0 c4 00 01 28 4e       jgle    1063a4e
> 103e9b8:       e5 4c f1 08 00 00       mvhi    264(%r15),0
> 103e9be:       eb 14 00 03 00 0d       sllg    %r1,%r4,3
> 
> After:
> 
> 103e96c:       b9 04 00 31             lgr     %r3,%r1
> 103e970:       e5 4c f1 18 00 01       mvhi    280(%r15),1
> 103e976:       a7 18 00 00             lhi     %r1,0
> 103e97a:       e3 20 f0 d0 00 0d       dsg     %r2,208(%r15)
> 103e980:       b9 20 00 c3             cgr     %r12,%r3
> 103e984:       a7 29 ff ff             lghi    %r2,-1
> 103e988:       ec 12 00 01 00 42       lochih  %r1,1
> 103e98e:       e3 20 f0 f8 00 82       xg      %r2,248(%r15)
> 103e994:       1a 15                   ar      %r1,%r5
> 103e996:       b3 c1 00 e2             ldgr    %f14,%r2
> 103e99a:       b9 e9 c0 72             sgrk    %r7,%r2,%r12
> 103e99e:       b9 f8 a0 21             ark     %r2,%r1,%r10
> 103e9a2:       c2 9d 00 00 00 00       cfi     %r9,0
> 103e9a8:       c0 c4 00 01 28 51       jgle    1063a4a
> 103e9ae:       e5 4c f1 18 00 00       mvhi    280(%r15),0
> 103e9b4:       e5 4c f1 08 00 00       mvhi    264(%r15),0
> 103e9ba:       eb 14 00 03 00 0d       sllg    %r1,%r4,3
> 
> Diff:
> 
>                 lgr     %rn,%rn
> -               mvhi    m(%rx),1
>                 lhi     %rn,0
> 
>                 ar      %rn,%rn
> -               ldgr    %fn,%rn
>                 sgrk    %rn,%rn,%rn
> +               ldgr    %fn,%rn
>                 ark     %rn,%rn,%rn
> +               ltr     %rn,%rn
> +               lhi     %rn,0
> +               lochile %rn,1
> +               st      %rn,m(%rx)
>                 cfi     %rn,0
> 
>                 mvhi    m(%rx),0
> -               mvhi    m(%rx),0
>                 sllg    %rn,%rn,3
> 
> The code used to compare %r9 with 0, jump based on the comparison
> result, and set the local variable in the branch.  Now it compares %r9
> with 0 twice - I didn’t look at the details yet, but apparently we’re
> missing some optimization because jump threading is done too late.

Yeah, jump2 is quite late.  I think you should do it before postreload_cse
or similar.

Btw, what percentage of files (or subroutines) does jump threading run
after combine, with your patch?


Segher
Ilya Leoshkevich Sept. 18, 2018, 10:22 a.m. | #4
> Am 17.09.2018 um 19:11 schrieb Segher Boessenkool <segher@kernel.crashing.org>:
> 
> On Mon, Sep 17, 2018 at 10:50:58AM +0200, Ilya Leoshkevich wrote:
>>> Am 14.09.2018 um 23:35 schrieb Segher Boessenkool <segher@kernel.crashing.org>:
>>> Could you please show generated code before and after this patch?
>>> I mean generated assembler code.  What -S gives you.
>> 
>> Before:
>> 
>> foo4:
>> .LFB0:
>> 	.cfi_startproc
>> 	lt	%r1,0(%r2)
>> 	jne	.L2
>> 	lhi	%r3,1
>> 	cs	%r1,%r3,0(%r2)
>> .L2:
>> 	jne	.L5
>> 	br	%r14
>> .L5:
>> 	jg	bar
>> 
>> After:
>> 
>> foo4:
>> .LFB0:
>> 	.cfi_startproc
>> 	lt	%r1,0(%r2)
>> 	jne	.L4
>> 	lhi	%r3,1
>> 	cs	%r1,%r3,0(%r2)
>> 	jne	.L4
>> 	br	%r14
>> .L4:
>> 	jg	bar
> 
> Ah.  And a compiler of some weeks old gives
> 
> foo4:
> .LFB0:
> 	.cfi_startproc
> 	lhi	%r3,0
> 	lhi	%r4,1
> 	cs	%r3,%r4,0(%r2)
> 	jne	.L4
> 	br	%r14
> .L4:
> 	jg	bar
> 
> so this is all caused by the recent optimisation that avoids the "cs" if
> it can.

Could you please try building with -march=z13?  I don’t see the „lt“ 
instruction in your output.  On z196+ we try to speed up the code by
jumping around the „cs“ when possible.
Ilya Leoshkevich Sept. 18, 2018, 11:29 a.m. | #5
> Am 17.09.2018 um 19:11 schrieb Segher Boessenkool <segher@kernel.crashing.org>:
> 
> On Mon, Sep 17, 2018 at 10:50:58AM +0200, Ilya Leoshkevich wrote:
>>> Am 14.09.2018 um 23:35 schrieb Segher Boessenkool <segher@kernel.crashing.org>:
>>> Why does the existing jump threading not work for you; should it happen
>>> at another time?
>> 
>> We call cleanup_cfg (CLEANUP_THREADING) only once - during the „jump“
>> pass, which happens before combine.  There is also „jump2“ pass, which
>> happens afterwards, and after discussion with Ulrich Weigand I tried to
>> move jump threading there.  While this change had the desired effect on
>> the testcase, the code got worse in another places.
> 
> Yeah, jump2 is quite late.  I think you should do it before postreload_cse
> or similar.

Thanks, I will give it a try.

> 
> Btw, what percentage of files (or subroutines) does jump threading run
> after combine, with your patch?
> 

I checked this on SPEC CPU 2006.

Turns out it’s not as good as I expected: the additional jump threading
is performed on 64% of the functions.  I think this is because I only
check whether PATTERNs have any side-effects, but I don’t look at
INSN_CODEs.  I'll try to change this and see whether I'll get a better
number.

Here are some other figures:

- The total number of cleanup_cfg calls: +3.3%.
- The total run time of cleanup_cfg:     +5.5%.
- The average run time of cleanup_cfg:   +2.1%.
Segher Boessenkool Sept. 18, 2018, 2:23 p.m. | #6
On Tue, Sep 18, 2018 at 12:22:14PM +0200, Ilya Leoshkevich wrote:
> > Am 17.09.2018 um 19:11 schrieb Segher Boessenkool <segher@kernel.crashing.org>:
> > 
> > On Mon, Sep 17, 2018 at 10:50:58AM +0200, Ilya Leoshkevich wrote:
> >>> Am 14.09.2018 um 23:35 schrieb Segher Boessenkool <segher@kernel.crashing.org>:
> >>> Could you please show generated code before and after this patch?
> >>> I mean generated assembler code.  What -S gives you.
> >> 
> >> Before:
> >> 
> >> foo4:
> >> .LFB0:
> >> 	.cfi_startproc
> >> 	lt	%r1,0(%r2)
> >> 	jne	.L2
> >> 	lhi	%r3,1
> >> 	cs	%r1,%r3,0(%r2)
> >> .L2:
> >> 	jne	.L5
> >> 	br	%r14
> >> .L5:
> >> 	jg	bar
> >> 
> >> After:
> >> 
> >> foo4:
> >> .LFB0:
> >> 	.cfi_startproc
> >> 	lt	%r1,0(%r2)
> >> 	jne	.L4
> >> 	lhi	%r3,1
> >> 	cs	%r1,%r3,0(%r2)
> >> 	jne	.L4
> >> 	br	%r14
> >> .L4:
> >> 	jg	bar
> > 
> > Ah.  And a compiler of some weeks old gives
> > 
> > foo4:
> > .LFB0:
> > 	.cfi_startproc
> > 	lhi	%r3,0
> > 	lhi	%r4,1
> > 	cs	%r3,%r4,0(%r2)
> > 	jne	.L4
> > 	br	%r14
> > .L4:
> > 	jg	bar
> > 
> > so this is all caused by the recent optimisation that avoids the "cs" if
> > it can.
> 
> Could you please try building with -march=z13?  I don’t see the „lt“
> instruction in your output.  On z196+ we try to speed up the code by
> jumping around the „cs“ when possible.

This was a few weeks old source (because it didn't bootstrap elsewhere).
This is s390-linux-gcc -Wall -W -march=z196 -O2 as in the PR (and the
compiler has largely default options).

-march=z13 has identical output on that compiler.



Segher
Segher Boessenkool Sept. 18, 2018, 2:30 p.m. | #7
On Tue, Sep 18, 2018 at 01:29:56PM +0200, Ilya Leoshkevich wrote:
> > Yeah, jump2 is quite late.  I think you should do it before postreload_cse
> > or similar.
> 
> Thanks, I will give it a try.
> 
> > Btw, what percentage of files (or subroutines) does jump threading run
> > after combine, with your patch?
> 
> I checked this on SPEC CPU 2006.
> 
> Turns out it’s not as good as I expected: the additional jump threading
> is performed on 64% of the functions.  I think this is because I only
> check whether PATTERNs have any side-effects, but I don’t look at
> INSN_CODEs.  I'll try to change this and see whether I'll get a better
> number.
> 
> Here are some other figures:
> 
> - The total number of cleanup_cfg calls: +3.3%.
> - The total run time of cleanup_cfg:     +5.5%.
> - The average run time of cleanup_cfg:   +2.1%.

Thanks for all those numbers!

So IMO it is not worth doing this only _some_ of the time after combine,
given it doesn't help much and requires awful code, and awful code
duplication (setting state no less, eww).

It is also clear that jump threading indeed is quite a bit more expensive
than the average cleanup_cfg.

But it is also clear that in the grand scheme of things doing it always
does not cost much.

I wonder if it also helps on other targets to do an extra jump threading
(say right before postreload_cse)?  That would pretty much seal the deal
I'd say.


Segher

Patch

diff --git a/gcc/combine.c b/gcc/combine.c
index a2649b6d5a1..65f5d7d092b 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -1139,13 +1139,42 @@  insn_a_feeds_b (rtx_insn *a, rtx_insn *b)
   return false;
 }
 
+/* Return true iff the only side effect of BB is its trailing jump_insn.  */
+
+static bool
+is_single_jump_bb (basic_block bb)
+{
+  rtx_insn *end = BB_END (bb);
+  rtx_insn *insn;
+
+  if (!JUMP_P (end))
+    return false;
+
+  for (insn = BB_HEAD (bb); insn != end; insn = NEXT_INSN (insn))
+    if (INSN_P (insn) && side_effects_p (PATTERN (insn)))
+      return false;
+  return true;
+}
+
+/* Summary of changes performed by the combiner.  */
+struct combine_summary {
+  /* True if the combiner has turned an indirect jump instruction into a direct
+     jump.  */
+  bool new_direct_jump_p;
+
+  /* True if the combiner changed at least one basic block in a way that it
+     ended up containing a single jump_insn.  */
+  bool new_single_jump_p;
+};
+
 /* Main entry point for combiner.  F is the first insn of the function.
    NREGS is the first unused pseudo-reg number.
 
-   Return nonzero if the combiner has turned an indirect jump
-   instruction into a direct jump.  */
-static int
-combine_instructions (rtx_insn *f, unsigned int nregs)
+   If performed changes satisfy certain criteria, set the corresponding fields
+   of SUMMARY to true.  */
+static void
+combine_instructions (rtx_insn *f, unsigned int nregs,
+		      struct combine_summary *summary)
 {
   rtx_insn *insn, *next;
   rtx_insn *prev;
@@ -1158,7 +1187,7 @@  combine_instructions (rtx_insn *f, unsigned int nregs)
   for (first = f; first && !NONDEBUG_INSN_P (first); )
     first = NEXT_INSN (first);
   if (!first)
-    return 0;
+    return;
 
   combine_attempts = 0;
   combine_merges = 0;
@@ -1251,6 +1280,7 @@  combine_instructions (rtx_insn *f, unsigned int nregs)
   FOR_EACH_BB_FN (this_basic_block, cfun)
     {
       rtx_insn *last_combined_insn = NULL;
+      bool bb_changed = false;
 
       /* Ignore instruction combination in basic blocks that are going to
 	 be removed as unreachable anyway.  See PR82386.  */
@@ -1302,6 +1332,7 @@  combine_instructions (rtx_insn *f, unsigned int nregs)
 				     last_combined_insn)) != 0)
 	      {
 		statistics_counter_event (cfun, "two-insn combine", 1);
+		bb_changed = true;
 		goto retry;
 	      }
 
@@ -1323,6 +1354,7 @@  combine_instructions (rtx_insn *f, unsigned int nregs)
 					   last_combined_insn)) != 0)
 		    {
 		      statistics_counter_event (cfun, "three-insn combine", 1);
+		      bb_changed = true;
 		      goto retry;
 		    }
 	      }
@@ -1349,7 +1381,10 @@  combine_instructions (rtx_insn *f, unsigned int nregs)
 		  if ((next = try_combine (insn, prev, nextlinks->insn,
 					   NULL, &new_direct_jump_p,
 					   last_combined_insn)) != 0)
-		    goto retry;
+		    {
+		      bb_changed = true;
+		      goto retry;
+		    }
 	    }
 
 	  /* Do the same for an insn that explicitly references CC0.  */
@@ -1363,13 +1398,19 @@  combine_instructions (rtx_insn *f, unsigned int nregs)
 	      if ((next = try_combine (insn, prev, NULL, NULL,
 				       &new_direct_jump_p,
 				       last_combined_insn)) != 0)
-		goto retry;
+		{
+		  bb_changed = true;
+		  goto retry;
+		}
 
 	      FOR_EACH_LOG_LINK (nextlinks, prev)
 		  if ((next = try_combine (insn, prev, nextlinks->insn,
 					   NULL, &new_direct_jump_p,
 					   last_combined_insn)) != 0)
+		  {
+		    bb_changed = true;
 		    goto retry;
+		  }
 	    }
 
 	  /* Finally, see if any of the insns that this insn links to
@@ -1387,7 +1428,10 @@  combine_instructions (rtx_insn *f, unsigned int nregs)
 		    && (next = try_combine (insn, links->insn,
 					    prev, NULL, &new_direct_jump_p,
 					    last_combined_insn)) != 0)
-		  goto retry;
+		  {
+		    bb_changed = true;
+		    goto retry;
+		  }
 	    }
 
 	  /* Try combining an insn with two different insns whose results it
@@ -1403,6 +1447,7 @@  combine_instructions (rtx_insn *f, unsigned int nregs)
 
 		  {
 		    statistics_counter_event (cfun, "three-insn combine", 1);
+		    bb_changed = true;
 		    goto retry;
 		  }
 
@@ -1431,6 +1476,7 @@  combine_instructions (rtx_insn *f, unsigned int nregs)
 					       last_combined_insn)) != 0)
 			{
 			  statistics_counter_event (cfun, "four-insn combine", 1);
+			  bb_changed = true;
 			  goto retry;
 			}
 		    /* I0, I1 -> I2, I2 -> I3.  */
@@ -1442,6 +1488,7 @@  combine_instructions (rtx_insn *f, unsigned int nregs)
 					       last_combined_insn)) != 0)
 			{
 			  statistics_counter_event (cfun, "four-insn combine", 1);
+			  bb_changed = true;
 			  goto retry;
 			}
 		  }
@@ -1459,6 +1506,7 @@  combine_instructions (rtx_insn *f, unsigned int nregs)
 					       last_combined_insn)) != 0)
 			{
 			  statistics_counter_event (cfun, "four-insn combine", 1);
+			  bb_changed = true;
 			  goto retry;
 			}
 		    /* I0 -> I1; I1, I2 -> I3.  */
@@ -1469,6 +1517,7 @@  combine_instructions (rtx_insn *f, unsigned int nregs)
 					       last_combined_insn)) != 0)
 			{
 			  statistics_counter_event (cfun, "four-insn combine", 1);
+			  bb_changed = true;
 			  goto retry;
 			}
 		  }
@@ -1510,6 +1559,7 @@  combine_instructions (rtx_insn *f, unsigned int nregs)
 		  if (next)
 		    {
 		      statistics_counter_event (cfun, "insn-with-note combine", 1);
+		      bb_changed = true;
 		      goto retry;
 		    }
 		  SET_SRC (set) = orig_src;
@@ -1523,6 +1573,10 @@  combine_instructions (rtx_insn *f, unsigned int nregs)
 retry:
 	  ;
 	}
+
+      if (flag_thread_jumps)
+	summary->new_single_jump_p
+	    |= bb_changed && is_single_jump_bb (this_basic_block);
     }
 
   default_rtl_profile ();
@@ -1557,7 +1611,7 @@  retry:
   /* Make recognizer allow volatile MEMs again.  */
   init_recog ();
 
-  return new_direct_jump_p;
+  summary->new_direct_jump_p = new_direct_jump_p;
 }
 
 /* Wipe the last_xxx fields of reg_stat in preparation for another pass.  */
@@ -14939,7 +14993,7 @@  dump_combine_total_stats (FILE *file)
 static unsigned int
 rest_of_handle_combine (void)
 {
-  int rebuild_jump_labels_after_combine;
+  struct combine_summary summary;
 
   df_set_flags (DF_LR_RUN_DCE + DF_DEFER_INSN_RESCAN);
   df_note_add_problem ();
@@ -14948,22 +15002,25 @@  rest_of_handle_combine (void)
   regstat_init_n_sets_and_refs ();
   reg_n_sets_max = max_reg_num ();
 
-  rebuild_jump_labels_after_combine
-    = combine_instructions (get_insns (), max_reg_num ());
+  memset (&summary, 0, sizeof (summary));
+  combine_instructions (get_insns (), max_reg_num (), &summary);
 
   /* Combining insns may have turned an indirect jump into a
      direct jump.  Rebuild the JUMP_LABEL fields of jumping
      instructions.  */
-  if (rebuild_jump_labels_after_combine)
+  if (summary.new_direct_jump_p)
     {
       if (dom_info_available_p (CDI_DOMINATORS))
 	free_dominance_info (CDI_DOMINATORS);
-      timevar_push (TV_JUMP);
       rebuild_jump_labels (get_insns ());
-      cleanup_cfg (0);
-      timevar_pop (TV_JUMP);
     }
 
+  /* Combining insns can change basic blocks in a way that they end up
+     containing a single jump_insn.  This creates an opportunity to improve
+     code with jump threading.  */
+  if (summary.new_direct_jump_p || summary.new_single_jump_p)
+      cleanup_cfg (summary.new_single_jump_p ? CLEANUP_THREADING : 0);
+
   regstat_free_n_sets_and_refs ();
   return 0;
 }
diff --git a/gcc/testsuite/gcc.target/s390/pr80080-4.c b/gcc/testsuite/gcc.target/s390/pr80080-4.c
new file mode 100644
index 00000000000..91d31ec7845
--- /dev/null
+++ b/gcc/testsuite/gcc.target/s390/pr80080-4.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-march=z196 -O2" } */
+
+extern void bar(int *mem);
+
+void foo4(int *mem)
+{
+  int oldval = 0;
+  if (!__atomic_compare_exchange_n (mem, (void *) &oldval, 1,
+				    1, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
+    {
+      bar (mem);
+    }
+}
+
+/* { dg-final { scan-assembler "\n\tlt\t.*\n\tjne\t(\\.L\\d+)\n(.*\n)*\tcs\t.*\n\tber\t%r14\n\\1:\n\tjg\tbar\n" } } */