[og10] amdgcn: Add waitcnt after LDS write instructions

Message ID 20200629201653.147326-5-julian@codesourcery.com
State New
Headers show
Series
  • [og10] amdgcn: Add waitcnt after LDS write instructions
Related show

Commit Message

Julian Brown June 29, 2020, 8:16 p.m.
Data-share write (ds_write) instructions do not necessarily complete
the write to LDS immediately. When a write completes, LGKM_CNT is
decremented. For now, we wait until LGKM_CNT reaches zero after each
ds_write instruction.

This fixes a race condition in the case where LDS is read immediately
after being written. This can happen with broadcast operations.

OK for og10 branch?

Julian

ChangeLog

	gcc/
	* config/gcn/gcn-valu.md (scatter<mode>_insn_1offset_ds<exec_scatter>):
	Add waitcnt.
	(*mov<mode>_insn, *movti_insn): Add waitcnt to ds_write alternatives.
---
 gcc/config/gcn/gcn-valu.md | 2 +-
 gcc/config/gcn/gcn.md      | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

-- 
2.23.0

Comments

Andrew Stubbs June 29, 2020, 8:32 p.m. | #1
On 29/06/2020 21:16, Julian Brown wrote:
> Data-share write (ds_write) instructions do not necessarily complete

> the write to LDS immediately. When a write completes, LGKM_CNT is

> decremented. For now, we wait until LGKM_CNT reaches zero after each

> ds_write instruction.

> 

> This fixes a race condition in the case where LDS is read immediately

> after being written. This can happen with broadcast operations.

> 

> OK for og10 branch?


I'm not saying no (because this issue needs a fix), but the thought 
occurs that inserting one wait before the barrier might be better than 
inserting a wait after each and every write.

In particular, it seems logical that any barrier should be a memory 
barrier, so inserting it in the barrier pattern is not a big deal. IIRC, 
only OpenACC is using that anyway (OpenMP has explicit asm inserts in 
libgomp).

WDYT?

Andrew
Julian Brown June 29, 2020, 9:03 p.m. | #2
On Mon, 29 Jun 2020 21:32:41 +0100
Andrew Stubbs <ams@codesourcery.com> wrote:

> On 29/06/2020 21:16, Julian Brown wrote:

> > Data-share write (ds_write) instructions do not necessarily complete

> > the write to LDS immediately. When a write completes, LGKM_CNT is

> > decremented. For now, we wait until LGKM_CNT reaches zero after each

> > ds_write instruction.

> > 

> > This fixes a race condition in the case where LDS is read

> > immediately after being written. This can happen with broadcast

> > operations.

> > 

> > OK for og10 branch?  

> 

> I'm not saying no (because this issue needs a fix), but the thought 

> occurs that inserting one wait before the barrier might be better

> than inserting a wait after each and every write.

> 

> In particular, it seems logical that any barrier should be a memory 

> barrier, so inserting it in the barrier pattern is not a big deal.

> IIRC, only OpenACC is using that anyway (OpenMP has explicit asm

> inserts in libgomp).


I'd be happier with that idea if ds_{read,write} operations were *only*
used for broadcasting -- but they're not, they may also be used for
(some) gang-private variables and for reduction temporaries. I don't
have a test case for either of those at present demonstrating bad
behaviour with no waitcnt, but I guess it's theoretically possible for
there to be one, at least.

The "proper" solution is a general (& "optimal") waitcnt insertion
pass, I think, that works with other memory operations as well as the
DS ones.

Thanks,

Julian

Patch

diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md
index 6d7fecaa12c2..9dfaec1d0645 100644
--- a/gcc/config/gcn/gcn-valu.md
+++ b/gcc/config/gcn/gcn-valu.md
@@ -923,7 +923,7 @@ 
   {
     addr_space_t as = INTVAL (operands[3]);
     static char buf[200];
-    sprintf (buf, "ds_write%%b2\t%%0, %%2 offset:%%1%s",
+    sprintf (buf, "ds_write%%b2\t%%0, %%2 offset:%%1%s\;s_waitcnt\tlgkmcnt(0)",
 	     (AS_GDS_P (as) ? " gds" : ""));
     return buf;
   }
diff --git a/gcc/config/gcn/gcn.md b/gcc/config/gcn/gcn.md
index 8cfb3a85d256..e58669240c67 100644
--- a/gcc/config/gcn/gcn.md
+++ b/gcc/config/gcn/gcn.md
@@ -554,7 +554,7 @@ 
   flat_load_dword\t%0, %A1%O1%g1\;s_waitcnt\t0
   flat_store_dword\t%A0, %1%O0%g0
   v_mov_b32\t%0, %1
-  ds_write_b32\t%A0, %1%O0
+  ds_write_b32\t%A0, %1%O0\;s_waitcnt\tlgkmcnt(0)
   ds_read_b32\t%0, %A1%O1\;s_waitcnt\tlgkmcnt(0)
   s_mov_b32\t%0, %1
   global_load_dword\t%0, %A1%O1%g1\;s_waitcnt\tvmcnt(0)
@@ -582,7 +582,7 @@ 
   flat_load%o1\t%0, %A1%O1%g1\;s_waitcnt\t0
   flat_store%s0\t%A0, %1%O0%g0
   v_mov_b32\t%0, %1
-  ds_write%b0\t%A0, %1%O0
+  ds_write%b0\t%A0, %1%O0\;s_waitcnt\tlgkmcnt(0)
   ds_read%u1\t%0, %A1%O1\;s_waitcnt\tlgkmcnt(0)
   global_load%o1\t%0, %A1%O1%g1\;s_waitcnt\tvmcnt(0)
   global_store%s0\t%A0, %1%O0%g0"
@@ -611,7 +611,7 @@ 
   #
   flat_load_dwordx2\t%0, %A1%O1%g1\;s_waitcnt\t0
   flat_store_dwordx2\t%A0, %1%O0%g0
-  ds_write_b64\t%A0, %1%O0
+  ds_write_b64\t%A0, %1%O0\;s_waitcnt\tlgkmcnt(0)
   ds_read_b64\t%0, %A1%O1\;s_waitcnt\tlgkmcnt(0)
   global_load_dwordx2\t%0, %A1%O1%g1\;s_waitcnt\tvmcnt(0)
   global_store_dwordx2\t%A0, %1%O0%g0"
@@ -667,7 +667,7 @@ 
   #
   global_store_dwordx4\t%A0, %1%O0%g0
   global_load_dwordx4\t%0, %A1%O1%g1\;s_waitcnt\tvmcnt(0)
-  ds_write_b128\t%A0, %1%O0
+  ds_write_b128\t%A0, %1%O0\;s_waitcnt\tlgkmcnt(0)
   ds_read_b128\t%0, %A1%O1\;s_waitcnt\tlgkmcnt(0)"
   "reload_completed
    && REG_P (operands[0])