[committed] amdgcn: Add fold_left_plus vector reductions

Message ID 12897a6f-ef73-5657-a67b-4b6118540972@codesourcery.com

Commit Message

Andrew Stubbs July 3, 2020, 10:11 a.m.
This patch implements a floating-point fold_left_plus vector pattern, 
which gives a significant speed-up in the BabelStream "dot" benchmark.

The GCN architecture can't actually do an in-order vector reduction any 
more efficiently than that equivalent scalar algorithm, so this is a bit 
of a cheat.  However, dividing the problem into threads using OpenACC or 
OpenMP has already broken the in-order semantics, so we may as well 
optimize the operation at the vector level too.

If the user has specifically sorted the input data in order to get a 
more correct FP result then using multiple threads is already the wrong 
thing to do. But, if the input data is in no particular numerical order 
then this optimization will give a correct answer much faster, albeit 
possibly a slightly different one each run.

Andrew
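
For context, the kind of loop this targets is a dot-product reduction such as the BabelStream "dot" kernel. A minimal C sketch of such a loop follows; the function and the OpenMP clause shown are illustrative assumptions, not taken from the benchmark source. The reduction clause already allows the partial sums to be combined in any order, which is why an any-order vector reduction is acceptable here.

    /* Illustrative sketch only: a dot-product reduction in the style of
       the BabelStream "dot" kernel.  The OpenMP reduction clause permits
       the partial sums to be combined in any order, so an any-order
       vector reduction gives a valid (if not bit-identical) result.  */
    double
    dot (const double *a, const double *b, int n)
    {
      double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
      return sum;
    }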

Comments

Andrew Stubbs July 3, 2020, 10:50 a.m. | #1
Now backported to OG10.

Andrew

On 03/07/2020 11:11, Andrew Stubbs wrote:
> This patch implements a floating-point fold_left_plus vector pattern,
> which gives a significant speed-up in the BabelStream "dot" benchmark.
>
> The GCN architecture can't actually do an in-order vector reduction any
> more efficiently than that equivalent scalar algorithm, so this is a bit
> of a cheat.  However, dividing the problem into threads using OpenACC or
> OpenMP has already broken the in-order semantics, so we may as well
> optimize the operation at the vector level too.
>
> If the user has specifically sorted the input data in order to get a
> more correct FP result then using multiple threads is already the wrong
> thing to do. But, if the input data is in no particular numerical order
> then this optimization will give a correct answer much faster, albeit
> possibly a slightly different one each run.
>
> Andrew

Richard Sandiford July 7, 2020, 11:03 a.m. | #2
Andrew Stubbs <ams@codesourcery.com> writes:
> This patch implements a floating-point fold_left_plus vector pattern,
> which gives a significant speed-up in the BabelStream "dot" benchmark.
>
> The GCN architecture can't actually do an in-order vector reduction any
> more efficiently than that equivalent scalar algorithm, so this is a bit
> of a cheat.  However, dividing the problem into threads using OpenACC or
> OpenMP has already broken the in-order semantics, so we may as well
> optimize the operation at the vector level too.
>
> If the user has specifically sorted the input data in order to get a
> more correct FP result then using multiple threads is already the wrong
> thing to do. But, if the input data is in no particular numerical order
> then this optimization will give a correct answer much faster, albeit
> possibly a slightly different one each run.

There doesn't seem to be anything GCN-specific here though.
If pragmas say that we can ignore associativity rules, we should apply
that in target-independent code rather than in each individual target.

Thanks,
Richard
Andrew Stubbs July 7, 2020, 11:44 a.m. | #3
On 07/07/2020 12:03, Richard Sandiford wrote:
> Andrew Stubbs <ams@codesourcery.com> writes:
>> This patch implements a floating-point fold_left_plus vector pattern,
>> which gives a significant speed-up in the BabelStream "dot" benchmark.
>>
>> The GCN architecture can't actually do an in-order vector reduction any
>> more efficiently than that equivalent scalar algorithm, so this is a bit
>> of a cheat.  However, dividing the problem into threads using OpenACC or
>> OpenMP has already broken the in-order semantics, so we may as well
>> optimize the operation at the vector level too.
>>
>> If the user has specifically sorted the input data in order to get a
>> more correct FP result then using multiple threads is already the wrong
>> thing to do. But, if the input data is in no particular numerical order
>> then this optimization will give a correct answer much faster, albeit
>> possibly a slightly different one each run.
>
> There doesn't seem to be anything GCN-specific here though.
> If pragmas say that we can ignore associativity rules, we should apply
> that in target-independent code rather than in each individual target.

Yes, I'm lazy. That, and I'm not sure what a target independent solution 
would look like.

Presumably we'd need something for both OpenMP and OpenACC, and it would 
need to be specific to certain operations (not just blanket 
-fassociative-math), which means the vectorizer (anywhere else?) would 
need to be taught about the new thing?

The nearest example I can think of is the force_vectorize flag that 
OpenMP "simd" and OpenACC "vector" already use (the latter being 
amdgcn-only as nvptx does its own OpenACC vectorization).

I'm also not completely convinced that this -- or other cases like it -- 
isn't simply a target-specific issue. Could it be harmful on other 
architectures?

Anyway, ultimately I don't have time to do much more here.

Andrew
Richard Sandiford July 9, 2020, 8:27 a.m. | #4
Andrew Stubbs <ams@codesourcery.com> writes:
> On 07/07/2020 12:03, Richard Sandiford wrote:
>> Andrew Stubbs <ams@codesourcery.com> writes:
>>> This patch implements a floating-point fold_left_plus vector pattern,
>>> which gives a significant speed-up in the BabelStream "dot" benchmark.
>>>
>>> The GCN architecture can't actually do an in-order vector reduction any
>>> more efficiently than that equivalent scalar algorithm, so this is a bit
>>> of a cheat.  However, dividing the problem into threads using OpenACC or
>>> OpenMP has already broken the in-order semantics, so we may as well
>>> optimize the operation at the vector level too.
>>>
>>> If the user has specifically sorted the input data in order to get a
>>> more correct FP result then using multiple threads is already the wrong
>>> thing to do. But, if the input data is in no particular numerical order
>>> then this optimization will give a correct answer much faster, albeit
>>> possibly a slightly different one each run.
>>
>> There doesn't seem to be anything GCN-specific here though.
>> If pragmas say that we can ignore associativity rules, we should apply
>> that in target-independent code rather than in each individual target.
>
> Yes, I'm lazy. That, and I'm not sure what a target independent solution
> would look like.
>
> Presumably we'd need something for both OpenMP and OpenACC, and it would
> need to be specific to certain operations (not just blanket
> -fassociative-math), which means the vectorizer (anywhere else?) would
> need to be taught about the new thing?
>
> The nearest example I can think of is the force_vectorize flag that
> OpenMP "simd" and OpenACC "vector" already use (the latter being
> amdgcn-only as nvptx does its own OpenACC vectorization).

Yeah, I guess we'd need a way of querying whether a given reduction
is by nature reassociative due to pragmas.  It would probably need
to name the specific reductions somehow.

Agree it doesn't sound easy.

> I'm also not completely convinced that this -- or other cases like it --
> isn't simply a target-specific issue. Could it be harmful on other
> architectures?

I'd hope not.  No target should prefer in-order reductions over
any-order reductions, since in-order implements any-order.

Thanks,
Richard

Patch

amdgcn: Add fold_left_plus vector reductions

These aren't real in-order reductions, because the ISA can't do that
quickly; rather, they provide a way to use the regular out-of-order
reductions in cases where that is good enough but the middle-end doesn't
know it.

	gcc/
	* config/gcn/gcn-valu.md (fold_left_plus_<mode>): New.

diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md
index 6d7fecaa12c..26559ff765e 100644
--- a/gcc/config/gcn/gcn-valu.md
+++ b/gcc/config/gcn/gcn-valu.md
@@ -3076,6 +3076,26 @@  (define_expand "reduc_<reduc_op>_scal_<mode>"
     DONE;
   })
 
+;; Warning: This "-ffast-math" implementation converts in-order reductions
+;;          into associative reductions. It's also used where OpenMP or
+;;          OpenACC parallelization has already broken the in-order semantics.
+(define_expand "fold_left_plus_<mode>"
+ [(match_operand:<SCALAR_MODE> 0 "register_operand")
+  (match_operand:<SCALAR_MODE> 1 "gcn_alu_operand")
+  (match_operand:V_FP 2 "gcn_alu_operand")]
+  "can_create_pseudo_p ()
+   && (flag_openacc || flag_openmp
+       || flag_associative_math)"
+  {
+    rtx dest = operands[0];
+    rtx scalar = operands[1];
+    rtx vector = operands[2];
+    rtx tmp = gen_reg_rtx (<SCALAR_MODE>mode);
+
+    emit_insn (gen_reduc_plus_scal_<mode> (tmp, vector));
+    emit_insn (gen_add<scalar_mode>3 (dest, scalar, tmp));
+    DONE;
+  })
 
 (define_insn "*<reduc_op>_dpp_shr_<mode>"
   [(set (match_operand:V_1REG 0 "register_operand"   "=v")
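
For a rough sense of what the new expander does, here is a C sketch of the semantics (an illustration only, not the code GCC generates): the vector operand is summed in no particular order, standing in for the reduc_plus_scal pattern, and the incoming scalar accumulator is added at the end.

    /* Semantic sketch of fold_left_plus_<mode> on GCN.  The loop below
       stands in for the reduc_plus_scal pattern, whose horizontal add is
       performed in no particular order; the final add folds in the
       scalar operand.  */
    double
    fold_left_plus_sketch (double scalar, const double *vec, int nelts)
    {
      double tmp = 0.0;
      for (int i = 0; i < nelts; i++)
        tmp += vec[i];        /* hardware combines lanes out of order */
      return scalar + tmp;
    }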