[og7] vector_length extension part 2: Generalize state propagation and synchronization

Message ID 823cc381-8752-14df-d6e2-0203de5da2fb@codesourcery.com
State New
Headers show
Series
  • [og7] vector_length extension part 2: Generalize state propagation and synchronization
Related show

Commit Message

Cesar Philippidis March 2, 2018, 4:55 p.m.
The attached patch generalizes the worker state propagation and
synchronization code to handle large vectors. When the vector_length is
larger than a CUDA warp, the nvptx BE will now use shared-memory to
spill-and-fill vector state when transitioning from vector-single mode
to vector partitioned.

In addition, nvptx_cta_sync and the corresponding nvptx_barsync insn,
have been extended to take a barrier ID and a thread count. The idea
here is to assign one barrier for each logical vector. Worker-single
synchronization is controlled by barrier 0. Therefore, the vector
barrier ID is set to tid.y+1 (because there's one vector unit per
worker) in nvptx_init_oacc_workers and placed into a register stored in
cfun->machine->sync_bar. If no workers are present, then the barrier ID
falls back to 0.

As a follow up patch will show, the nvptx BE falls back to using
vector_length = 32 when a vector loop is nested inside a worker loop.
This is because I observed that the PTX JIT does not reliable generate
SASS code to keep warps convergent in large vectors. While it works for
99% of the libgomp test cases, the ones that fail usually deadlock
because the PTX JIT generates BRA instructions for the vector code
instead of SSY/SYNC. At this point, I'm not sure if the nvptx is
generating back code, or if there is a bug in the PTX JIT. Hopefully,
Volta's warp sync functionality will resolve this problem regardless.

These changes are relatively straightforward and noncontroversial. I'll
commit this patch to openacc-gcc-7-branch once the other patches are
ready. There will be three more patches in this series.

Cesar

Comments

Tom de Vries March 21, 2018, 5:10 p.m. | #1
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> In addition, nvptx_cta_sync and the corresponding nvptx_barsync insn,

> have been extended to take a barrier ID and a thread count. The idea

> here is to assign one barrier for each logical vector. Worker-single

> synchronization is controlled by barrier 0. Therefore, the vector

> barrier ID is set to tid.y+1 (because there's one vector unit per

> worker) in nvptx_init_oacc_workers and placed into a register stored in

> cfun->machine->sync_bar. If no workers are present, then the barrier ID

> falls back to 0.


I compiled a worker loop before and after the patch series, and observed 
this change:
...
@@ -70,7 +71,7 @@
   $L2:
    // joining 2;
   $L5:
-  bar.sync 1;
+  bar.sync 0;
    // join 2;
    ret;
  }
...

AFAICT from your explanation above, that change is intentional.

Changing the code generation scheme for workers is fine, but obviously 
that should be a minimal, separate patch that we can bisect back to.

Thanks,
- Tom
Cesar Philippidis March 22, 2018, 3:59 a.m. | #2
On 03/21/2018 10:10 AM, Tom de Vries wrote:
> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:

>> In addition, nvptx_cta_sync and the corresponding nvptx_barsync insn,

>> have been extended to take a barrier ID and a thread count. The idea

>> here is to assign one barrier for each logical vector. Worker-single

>> synchronization is controlled by barrier 0. Therefore, the vector

>> barrier ID is set to tid.y+1 (because there's one vector unit per

>> worker) in nvptx_init_oacc_workers and placed into a register stored in

>> cfun->machine->sync_bar. If no workers are present, then the barrier ID

>> falls back to 0.

> 

> I compiled a worker loop before and after the patch series, and observed

> this change:

> ...

> @@ -70,7 +71,7 @@

>   $L2:

>    // joining 2;

>   $L5:

> -  bar.sync 1;

> +  bar.sync 0;

>    // join 2;

>    ret;

>  }

> ...

> 

> AFAICT from your explanation above, that change is intentional.

> 

> Changing the code generation scheme for workers is fine, but obviously

> that should be a minimal, separate patch that we can bisect back to.


That sounds reasonable. I'll apply this patch to og7 once testing has
completed. While all of the functionality it introduces is unnecessary
without the vector length changes, at least it can be applied independently.

Cesar
Update bar.sync usage

2018-03-21  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* config/nvptx/nvptx.c (nvptx_cta_sync): Change arguments to take
	in a lock and thread count.  Update call to gen_nvptx_barsync.
	(nvptx_single): Update call to nvptx_cta_sync.
	(nvptx_process_pars): Likewise.
	* config/nvptx/nvptx.md (nvptx_barsync): Adjust operands.

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index b7e3f59fed7..029628f8a0e 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -3936,13 +3936,14 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
   return empty;
 }
 
-/* Emit a CTA-level synchronization barrier.  We use different
-   markers for before and after synchronizations.  */
+/* Emit a CTA-level synchronization barrier (bar.sync).  LOCK is the
+   barrier number, which is an integer or a register.  THREADS is the
+   number of threads controlled by the barrier.  */
 
 static rtx
-nvptx_cta_sync (bool after)
+nvptx_cta_sync (rtx lock, int threads)
 {
-  return gen_nvptx_barsync (GEN_INT (after));
+  return gen_nvptx_barsync (lock, GEN_INT (threads));
 }
 
 #if WORKAROUND_PTXJIT_BUG
@@ -4192,6 +4193,8 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  /* Includes worker mode, do spill & fill.  By construction
 	     we should never have worker mode only. */
 	  broadcast_data_t data;
+	  rtx barrier = GEN_INT (0);
+	  int threads = 0;
 
 	  data.base = oacc_bcast_sym;
 	  data.ptr = 0;
@@ -4204,14 +4207,14 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 						    false),
 			    before);
 	  /* Barrier so other workers can see the write.  */
-	  emit_insn_before (nvptx_cta_sync (false), tail);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), tail);
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data,
 						    false), tail);
 	  /* This barrier is needed to avoid worker zero clobbering
 	     the broadcast buffer before all the other workers have
 	     had a chance to read this instance of it.  */
-	  emit_insn_before (nvptx_cta_sync (true), tail);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), tail);
 	}
 
       extract_insn (tail);
@@ -4328,12 +4331,14 @@ nvptx_process_pars (parallel *par)
       bool empty = nvptx_shared_propagate (true, is_call,
 					   par->forked_block, par->fork_insn,
 					   false);
+      rtx barrier = GEN_INT (0);
+      int threads = 0;
 
       if (!empty || !is_call)
 	{
 	  /* Insert begin and end synchronizations.  */
-	  emit_insn_before (nvptx_cta_sync (false), par->forked_insn);
-	  emit_insn_before (nvptx_cta_sync (true), par->join_insn);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), par->forked_insn);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), par->join_insn);
 	}
     }
   else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 2b4bcb3a45b..e638a13c366 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -1421,10 +1421,16 @@
   [(set_attr "atomic" "true")])
 
 (define_insn "nvptx_barsync"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
+  [(unspec_volatile [(match_operand:SI 0 "nvptx_nonmemory_operand" "Ri")
+		     (match_operand:SI 1 "const_int_operand")]
 		    UNSPECV_BARSYNC)]
   ""
-  "\\tbar.sync\\t%0;"
+  {
+    if (!REG_P (operands[0]))
+      return "\\tbar.sync\\t%0;";
+    else
+      return "\\tbar.sync\\t%0, %1;";
+  }
   [(set_attr "predicable" "false")])
 
 (define_insn "nvptx_nounroll"
Tom de Vries March 22, 2018, 1:43 p.m. | #3
On 03/22/2018 04:59 AM, Cesar Philippidis wrote:
> On 03/21/2018 10:10 AM, Tom de Vries wrote:

>> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:

>>> In addition, nvptx_cta_sync and the corresponding nvptx_barsync insn,

>>> have been extended to take a barrier ID and a thread count. The idea

>>> here is to assign one barrier for each logical vector. Worker-single

>>> synchronization is controlled by barrier 0. Therefore, the vector

>>> barrier ID is set to tid.y+1 (because there's one vector unit per

>>> worker) in nvptx_init_oacc_workers and placed into a register stored in

>>> cfun->machine->sync_bar. If no workers are present, then the barrier ID

>>> falls back to 0.

>>

>> I compiled a worker loop before and after the patch series, and observed

>> this change:

>> ...

>> @@ -70,7 +71,7 @@

>>    $L2:

>>     // joining 2;

>>    $L5:

>> -  bar.sync 1;

>> +  bar.sync 0;

>>     // join 2;

>>     ret;

>>   }

>> ...

>>

>> AFAICT from your explanation above, that change is intentional.

>>

>> Changing the code generation scheme for workers is fine, but obviously

>> that should be a minimal, separate patch that we can bisect back to.

> 

> That sounds reasonable. I'll apply this patch to og7 once testing has

> completed. While all of the functionality it introduces is unnecessary


In other words, the patch is not minimal.

Thanks,
- Tom

> without the vector length changes, at least it can be applied independently.

>
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index b7e3f59..16d846e 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -3936,13 +3936,13 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
   return empty;
 }
 
-/* Emit a CTA-level synchronization barrier.  We use different
-   markers for before and after synchronizations.  */
+/* Emit a CTA-level synchronization barrier (bar.sync).  LOCK is the
+   barrier number, which is an integer or a register.  */
 
 static rtx
-nvptx_cta_sync (bool after)
+nvptx_cta_sync (rtx lock)
 {
-  return gen_nvptx_barsync (GEN_INT (after));
+  return gen_nvptx_barsync (lock);
 }
 
 #if WORKAROUND_PTXJIT_BUG
@@ -4192,6 +4192,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  /* Includes worker mode, do spill & fill.  By construction
 	     we should never have worker mode only. */
 	  broadcast_data_t data;
+	  rtx barrier = GEN_INT (0);
 
 	  data.base = oacc_bcast_sym;
 	  data.ptr = 0;
@@ -4204,14 +4205,14 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 						    false),
 			    before);
 	  /* Barrier so other workers can see the write.  */
-	  emit_insn_before (nvptx_cta_sync (false), tail);
+	  emit_insn_before (nvptx_cta_sync (barrier), tail);
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data,
 						    false), tail);
 	  /* This barrier is needed to avoid worker zero clobbering
 	     the broadcast buffer before all the other workers have
 	     had a chance to read this instance of it.  */
-	  emit_insn_before (nvptx_cta_sync (true), tail);
+	  emit_insn_before (nvptx_cta_sync (barrier), tail);
 	}
 
       extract_insn (tail);
@@ -4328,12 +4329,13 @@ nvptx_process_pars (parallel *par)
       bool empty = nvptx_shared_propagate (true, is_call,
 					   par->forked_block, par->fork_insn,
 					   false);
+      rtx barrier = GEN_INT (0);
 
       if (!empty || !is_call)
 	{
 	  /* Insert begin and end synchronizations.  */
-	  emit_insn_before (nvptx_cta_sync (false), par->forked_insn);
-	  emit_insn_before (nvptx_cta_sync (true), par->join_insn);
+	  emit_insn_before (nvptx_cta_sync (barrier), par->forked_insn);
+	  emit_insn_before (nvptx_cta_sync (barrier), par->join_insn);
 	}
     }
   else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
Tom de Vries March 22, 2018, 2:23 p.m. | #4
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:

> 	(nvptx_declare_function_name): Emit a .maxntid directive hint and

> 	call nvptx_init_oacc_workers.


> +

> +  /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches.  */

> +  if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl))

> +      && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl)))

> +      s << ".maxntid " << cfun->machine->axis_dim[0] << ", "

> +	<< cfun->machine->axis_dim[1] << ", 1\n";

> +


This change:
...
  // BEGIN FUNCTION DEF: main$_omp_fn$0
  .entry main$_omp_fn$0 (.param .u64 %in_ar0)
+  .maxntid 32, 32, 1
...
needs to be an individual patch.


 > +  /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches.  */


'Help' is too strongly formulated, given that there's no clear link 
between the semantics of the directive, and the observed effect.

Use "seems to have the effect" or some such formulation.

Also, list in the comment a JIT driver version, and sm_ version and a 
testcase for which this is required.

Also, guard it with WORKAROUND_PTXJIT_BUG_3 (_2 is already taken in trunk.)

Thanks,
- Tom
Cesar Philippidis March 22, 2018, 2:26 p.m. | #5
On 03/22/2018 06:43 AM, Tom de Vries wrote:
> On 03/22/2018 04:59 AM, Cesar Philippidis wrote:

>> On 03/21/2018 10:10 AM, Tom de Vries wrote:


>>> Changing the code generation scheme for workers is fine, but obviously

>>> that should be a minimal, separate patch that we can bisect back to.

>>

>> That sounds reasonable. I'll apply this patch to og7 once testing has

>> completed. While all of the functionality it introduces is unnecessary

> 

> In other words, the patch is not minimal.


My intention was to reduce the size of the final vector length patch.
But I can commit this patch after testing as it's equivalent at this point.

Cesar
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index b7e3f59fed7..eff87732c4b 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4211,7 +4211,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  /* This barrier is needed to avoid worker zero clobbering
 	     the broadcast buffer before all the other workers have
 	     had a chance to read this instance of it.  */
-	  emit_insn_before (nvptx_cta_sync (true), tail);
+	  emit_insn_before (nvptx_cta_sync (false), tail);
 	}
 
       extract_insn (tail);
@@ -4333,7 +4333,7 @@ nvptx_process_pars (parallel *par)
 	{
 	  /* Insert begin and end synchronizations.  */
 	  emit_insn_before (nvptx_cta_sync (false), par->forked_insn);
-	  emit_insn_before (nvptx_cta_sync (true), par->join_insn);
+	  emit_insn_before (nvptx_cta_sync (false), par->join_insn);
 	}
     }
   else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
Tom de Vries March 22, 2018, 2:44 p.m. | #6
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> The attached patch generalizes the worker state propagation and

> synchronization code to handle large vectors. When the vector_length is

> larger than a CUDA warp, the nvptx BE will now use shared-memory to

> spill-and-fill vector state when transitioning from vector-single mode

> to vector partitioned.


I've compiled this test-case:
...
int
main (void)
{
   int a[10];
#pragma acc parallel loop worker
   for (int i = 0; i < 10; i++)
     a[i] = i;

   return 0;
}
...

without and with the patch series, and observed the following difference 
in generated ptx:
...
-.shared .align 8 .u8 __oacc_bcast[8];
+.shared .align 8 .u8 __oacc_bcast[264];
...

Why is the example using 33 times more shared memory space with the 
patch series applied?

Thanks,
- Tom
Cesar Philippidis March 22, 2018, 3:11 p.m. | #7
On 03/22/2018 07:23 AM, Tom de Vries wrote:
> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:

> 

>>     (nvptx_declare_function_name): Emit a .maxntid directive hint and

>>     call nvptx_init_oacc_workers.

> 

>> +

>> +  /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches.  */

>> +  if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl))

>> +      && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl)))

>> +      s << ".maxntid " << cfun->machine->axis_dim[0] << ", "

>> +    << cfun->machine->axis_dim[1] << ", 1\n";

>> +

> 

> This change:

> ...

>  // BEGIN FUNCTION DEF: main$_omp_fn$0

>  .entry main$_omp_fn$0 (.param .u64 %in_ar0)

> +  .maxntid 32, 32, 1

> ...

> needs to be an individual patch.


cfun->machine->axis_dims is something new to the vector length changes,
so I hard-coded .maxntid to size '32, 32, 1' for og7 as an interim solution.

>> +  /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches.  */

> 

> 'Help' is too strongly formulated, given that there's no clear link

> between the semantics of the directive, and the observed effect.

> 

> Use "seems to have the effect" or some such formulation.

> 

> Also, list in the comment a JIT driver version, and sm_ version and a

> testcase for which this is required.

> 

> Also, guard it with WORKAROUND_PTXJIT_BUG_3 (_2 is already taken in trunk.)


Sounds reasonable. I'll commit the patch to og7 once the regression
testing has completed.

Thanks,
Cesar
From b89ec8060de3affb94b580be3260381028d4c183 Mon Sep 17 00:00:00 2001
From: Cesar Philippidis <cesar@codesourcery.com>

Date: Thu, 22 Mar 2018 08:05:53 -0700
Subject: [PATCH] add .maxntid hint

---
 gcc/config/nvptx/nvptx.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index eff87732c4b..9fb2bcd6852 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -76,6 +76,7 @@
 #include "target-def.h"
 
 #define WORKAROUND_PTXJIT_BUG 1
+#define WORKAROUND_PTXJIT_BUG_3 1
 
 /* Define dimension sizes for known hardware.  */
 #define PTX_VECTOR_LENGTH 32
@@ -1219,6 +1220,15 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
      stream, in order to share the prototype writing code.  */
   std::stringstream s;
   write_fn_proto (s, true, name, decl);
+
+#if WORKAROUND_PTXJIT_BUG_3
+  /* Emitting a .maxntid seems to have the effect of encouraging the
+     PTX JIT emit SYNC branches.  */
+  if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl))
+      && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl)))
+      s << ".maxntid 32, 32, 1\n";
+#endif
+
   s << "{\n";
 
   bool return_in_mem = write_return_type (s, false, result_type);
-- 
2.14.3
Tom de Vries March 22, 2018, 4:18 p.m. | #8
On 03/22/2018 04:11 PM, Cesar Philippidis wrote:
> On 03/22/2018 07:23 AM, Tom de Vries wrote:

>> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:

>>

>>>      (nvptx_declare_function_name): Emit a .maxntid directive hint and

>>>      call nvptx_init_oacc_workers.

>>> +

>>> +  /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches.  */

>>> +  if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl))

>>> +      && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl)))

>>> +      s << ".maxntid " << cfun->machine->axis_dim[0] << ", "

>>> +    << cfun->machine->axis_dim[1] << ", 1\n";

>>> +

>> This change:

>> ...

>>   // BEGIN FUNCTION DEF: main$_omp_fn$0

>>   .entry main$_omp_fn$0 (.param .u64 %in_ar0)

>> +  .maxntid 32, 32, 1

>> ...

>> needs to be an individual patch.

> cfun->machine->axis_dims is something new to the vector length changes,

> so I hard-coded .maxntid to size '32, 32, 1' for og7 as an interim solution.

> 


That's obviously not good enough.

When I compile this test-case:
...
int
main (void)
{
   int a[10];
#pragma acc parallel num_workers (16)
#pragma acc loop worker
   for (int i = 0; i < 10; i++)
     a[i] = i;

   return 0;
}
...

I get:
...
  .maxntid 32, 16, 1
...

That's the change you need to isolate.

Thanks,
- Tom
Cesar Philippidis March 22, 2018, 4:58 p.m. | #9
On 03/22/2018 07:44 AM, Tom de Vries wrote:
> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:

>> The attached patch generalizes the worker state propagation and

>> synchronization code to handle large vectors. When the vector_length is

>> larger than a CUDA warp, the nvptx BE will now use shared-memory to

>> spill-and-fill vector state when transitioning from vector-single mode

>> to vector partitioned.

> 

> I've compiled this test-case:

> ...

> int

> main (void)

> {

>   int a[10];

> #pragma acc parallel loop worker

>   for (int i = 0; i < 10; i++)

>     a[i] = i;

> 

>   return 0;

> }

> ...

> 

> without and with the patch series, and observed the following difference

> in generated ptx:

> ...

> -.shared .align 8 .u8 __oacc_bcast[8];

> +.shared .align 8 .u8 __oacc_bcast[264];

> ...

> 

> Why is the example using 33 times more shared memory space with the

> patch series applied?


Because the nvptx BE wasn't taking into account that vector_length = 32
doesn't need to use shared-memory to broadcast variables.

That magic value of 33 was derived from nvptx_mach_max_workers () + 1.
When vector_length > 32, there needs to be nvptx_mach_max_workers ()
partitions for vector state propagation. There also needs to be a
shared-memory buffer for worker-state propagation, because I found
situations where some threads where still spilling and filling workers
before vector 0 transitioned vector-partitioned mode.

The attached, untested, patch should resolve that issue.

Cesar
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 3102c79bf96..f81fb0113d5 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4061,9 +4061,14 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
       if (oacc_bcast_partition < data.offset)
 	{
 	  int psize = data.offset;
+	  int pnum = 1;
+
+	  if (nvptx_mach_vector_length () > PTX_WARP_SIZE)
+	    pnum = nvptx_mach_max_workers () + 1;
+
 	  psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);
 	  oacc_bcast_partition = psize;
-	  oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1);
+	  oacc_bcast_size = psize * pnum;
 	}
     }
   return empty;
@@ -4348,9 +4353,14 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  if (oacc_bcast_partition < size)
 	    {
 	      int psize = size;
+	      int pnum = 1;
+
+	      if (nvptx_mach_vector_length () > PTX_WARP_SIZE)
+		pnum = nvptx_mach_max_workers () + 1;
+
 	      psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);
 	      oacc_bcast_partition = psize;
-	      oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1);
+	      oacc_bcast_size = psize * pnum;
 	    }
 
 	  data.offset = 0;
Cesar Philippidis March 22, 2018, 5:24 p.m. | #10
On 03/22/2018 09:18 AM, Tom de Vries wrote:

> That's obviously not good enough.

> 

> When I compile this test-case:

> ...

> int

> main (void)

> {

>   int a[10];

> #pragma acc parallel num_workers (16)

> #pragma acc loop worker

>   for (int i = 0; i < 10; i++)

>     a[i] = i;

> 

>   return 0;

> }

> ...

> 

> I get:

> ...

>  .maxntid 32, 16, 1

> ...

> 

> That's the change you need to isolate.


I attached an updated patch which incorporates the
cfun->machine->axis_dim changes. It now generates more precise arguments
for maxntid.

Cesar
From 11035dc92884146dc4d974156adcb260568db785 Mon Sep 17 00:00:00 2001
From: Cesar Philippidis <cesar@codesourcery.com>

Date: Thu, 22 Mar 2018 08:05:53 -0700
Subject: [PATCH] emit .maxntid hint

---
 gcc/config/nvptx/nvptx.c | 19 +++++++++++++++++++
 gcc/config/nvptx/nvptx.h |  2 ++
 2 files changed, 21 insertions(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index eff87732c4b..3958f71e995 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -76,6 +76,7 @@
 #include "target-def.h"
 
 #define WORKAROUND_PTXJIT_BUG 1
+#define WORKAROUND_PTXJIT_BUG_3 1
 
 /* Define dimension sizes for known hardware.  */
 #define PTX_VECTOR_LENGTH 32
@@ -1219,6 +1220,16 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
      stream, in order to share the prototype writing code.  */
   std::stringstream s;
   write_fn_proto (s, true, name, decl);
+
+#if WORKAROUND_PTXJIT_BUG_3
+  /* Emitting a .maxntid seems to have the effect of encouraging the
+     PTX JIT emit SYNC branches.  */
+  if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl))
+      && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl)))
+      s << ".maxntid " << cfun->machine->axis_dim[0] << ", "
+	<< cfun->machine->axis_dim[1] << ", 1\n";
+#endif
+
   s << "{\n";
 
   bool return_in_mem = write_return_type (s, false, result_type);
@@ -2831,6 +2842,11 @@ struct offload_attrs
   int max_workers;
 };
 
+/* Define entries for cfun->machine->axis_dim.  */
+
+#define MACH_VECTOR_LENGTH 0
+#define MACH_MAX_WORKERS 1
+
 struct parallel
 {
   /* Parent parallel.  */
@@ -4525,6 +4541,9 @@ nvptx_reorg (void)
 
       populate_offload_attrs (&oa);
 
+      cfun->machine->axis_dim[MACH_VECTOR_LENGTH] = oa.vector_length;
+      cfun->machine->axis_dim[MACH_MAX_WORKERS] = oa.max_workers;
+
       /* If there is worker neutering, there must be vector
 	 neutering.  Otherwise the hardware will fail.  */
       gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index 8a14507c88a..958516da604 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -226,6 +226,8 @@ struct GTY(()) machine_function
   int return_mode; /* Return mode of current fn.
 		      (machine_mode not defined yet.) */
   rtx axis_predicate[2]; /* Neutering predicates.  */
+  int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is
+		      vector_length, dim[1] is num_workers.   */
   rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */
   rtx unisimt_predicate; /* Predicate for -muniform-simt.  */
   rtx unisimt_location; /* Mask location for -muniform-simt.  */
-- 
2.14.3
Tom de Vries March 22, 2018, 5:39 p.m. | #11
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> +  rtx red_partition; /* Similar to bcast_partition, except for vector

> +			reductions.  */


Shouldn't this be in "[og7] vector_length extension part 3: reductions"?

Thanks,
- Tom
Cesar Philippidis March 22, 2018, 5:47 p.m. | #12
On 03/22/2018 10:39 AM, Tom de Vries wrote:
> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:

>> +  rtx red_partition; /* Similar to bcast_partition, except for vector

>> +            reductions.  */

> 

> Shouldn't this be in "[og7] vector_length extension part 3: reductions"?


Maybe. But keep in mind, with the exception of the bar.sync and maxntid
changes you requested, I don't think the vector length patch makes sense
to go in as individual hunks. Maybe I could split out the new
TARGET_GOACC_ADJUST_PARALLELISM hook in part 4 into a separate patch.
But, at the same time, if something isn't being used, what's the point
of going through that extra work?

Cesar
Tom de Vries March 22, 2018, 5:51 p.m. | #13
On 03/22/2018 06:24 PM, Cesar Philippidis wrote:
> On 03/22/2018 09:18 AM, Tom de Vries wrote:

> 

>> That's obviously not good enough.

>>

>> When I compile this test-case:

>> ...

>> int

>> main (void)

>> {

>>    int a[10];

>> #pragma acc parallel num_workers (16)

>> #pragma acc loop worker

>>    for (int i = 0; i < 10; i++)

>>      a[i] = i;

>>

>>    return 0;

>> }

>> ...

>>

>> I get:

>> ...

>>   .maxntid 32, 16, 1

>> ...

>>

>> That's the change you need to isolate.

> 

> I attached an updated patch which incorporates the

> cfun->machine->axis_dim changes. It now generates more precise arguments

> for maxntid.


I'll try this out.

Still, this doesn't address my request: "Also, list in the comment a JIT 
driver version, and sm_ version and a testcase for which this is required"

Thanks,
- Tom

> 

> Cesar

> 

> 

> 0001-emit-.maxntid-hint.patch

> 

> 

>  From 11035dc92884146dc4d974156adcb260568db785 Mon Sep 17 00:00:00 2001

> From: Cesar Philippidis <cesar@codesourcery.com>

> Date: Thu, 22 Mar 2018 08:05:53 -0700

> Subject: [PATCH] emit .maxntid hint

> 

> ---

>   gcc/config/nvptx/nvptx.c | 19 +++++++++++++++++++

>   gcc/config/nvptx/nvptx.h |  2 ++

>   2 files changed, 21 insertions(+)

> 

> diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c

> index eff87732c4b..3958f71e995 100644

> --- a/gcc/config/nvptx/nvptx.c

> +++ b/gcc/config/nvptx/nvptx.c

> @@ -76,6 +76,7 @@

>   #include "target-def.h"

>   

>   #define WORKAROUND_PTXJIT_BUG 1

> +#define WORKAROUND_PTXJIT_BUG_3 1

>   

>   /* Define dimension sizes for known hardware.  */

>   #define PTX_VECTOR_LENGTH 32

> @@ -1219,6 +1220,16 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)

>        stream, in order to share the prototype writing code.  */

>     std::stringstream s;

>     write_fn_proto (s, true, name, decl);

> +

> +#if WORKAROUND_PTXJIT_BUG_3

> +  /* Emitting a .maxntid seems to have the effect of encouraging the

> +     PTX JIT emit SYNC branches.  */

> +  if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl))

> +      && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl)))

> +      s << ".maxntid " << cfun->machine->axis_dim[0] << ", "

> +	<< cfun->machine->axis_dim[1] << ", 1\n";

> +#endif

> +

>     s << "{\n";

>   

>     bool return_in_mem = write_return_type (s, false, result_type);

> @@ -2831,6 +2842,11 @@ struct offload_attrs

>     int max_workers;

>   };

>   

> +/* Define entries for cfun->machine->axis_dim.  */

> +

> +#define MACH_VECTOR_LENGTH 0

> +#define MACH_MAX_WORKERS 1

> +

>   struct parallel

>   {

>     /* Parent parallel.  */

> @@ -4525,6 +4541,9 @@ nvptx_reorg (void)

>   

>         populate_offload_attrs (&oa);

>   

> +      cfun->machine->axis_dim[MACH_VECTOR_LENGTH] = oa.vector_length;

> +      cfun->machine->axis_dim[MACH_MAX_WORKERS] = oa.max_workers;

> +

>         /* If there is worker neutering, there must be vector

>   	 neutering.  Otherwise the hardware will fail.  */

>         gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))

> diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h

> index 8a14507c88a..958516da604 100644

> --- a/gcc/config/nvptx/nvptx.h

> +++ b/gcc/config/nvptx/nvptx.h

> @@ -226,6 +226,8 @@ struct GTY(()) machine_function

>     int return_mode; /* Return mode of current fn.

>   		      (machine_mode not defined yet.) */

>     rtx axis_predicate[2]; /* Neutering predicates.  */

> +  int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is

> +		      vector_length, dim[1] is num_workers.   */

>     rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */

>     rtx unisimt_predicate; /* Predicate for -muniform-simt.  */

>     rtx unisimt_location; /* Mask location for -muniform-simt.  */

>
Tom de Vries March 22, 2018, 5:59 p.m. | #14
On 03/22/2018 06:47 PM, Cesar Philippidis wrote:
> On 03/22/2018 10:39 AM, Tom de Vries wrote:

>> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:

>>> +  rtx red_partition; /* Similar to bcast_partition, except for vector

>>> +            reductions.  */

>>

>> Shouldn't this be in "[og7] vector_length extension part 3: reductions"?

> 

> Maybe. But keep in mind, with the exception of the bar.sync and maxntid

> changes you requested, I don't think the vector length patch makes sense

> to go in as individual hunks. Maybe I could split out the new

> TARGET_GOACC_ADJUST_PARALLELISM hook in part 4 into a separate patch.

> But, at the same time, if something isn't being used, what's the point

> of going through that extra work?


Because patches that are split into logically consistent parts are easy 
to review, and easy to analyze and fix or undo when bisected back to. 
And yes, that's extra work.

Thanks,
- Tom
Cesar Philippidis March 22, 2018, 7:04 p.m. | #15
On 03/22/2018 10:51 AM, Tom de Vries wrote:
> On 03/22/2018 06:24 PM, Cesar Philippidis wrote:

>> On 03/22/2018 09:18 AM, Tom de Vries wrote:

>>

>>> That's obviously not good enough.

>>>

>>> When I compile this test-case:

>>> ...

>>> int

>>> main (void)

>>> {

>>>    int a[10];

>>> #pragma acc parallel num_workers (16)

>>> #pragma acc loop worker

>>>    for (int i = 0; i < 10; i++)

>>>      a[i] = i;

>>>

>>>    return 0;

>>> }

>>> ...

>>>

>>> I get:

>>> ...

>>>   .maxntid 32, 16, 1

>>> ...

>>>

>>> That's the change you need to isolate.

>>

>> I attached an updated patch which incorporates the

>> cfun->machine->axis_dim changes. It now generates more precise arguments

>> for maxntid.

> 

> I'll try this out.

> 

> Still, this doesn't address my request: "Also, list in the comment a JIT

> driver version, and sm_ version and a testcase for which this is required"


I attached the test case where it used to fail without maxntid. But
after looking at again, the maxntid directive was probably masking that
other PTX JIT bug involving abort and exiting threads that you fixed.
And in fact, the test case works without the maxntid patch on my sm_60 GPU.

I'm going to retest the variable vector length changes without it and
see if it's still necessary. On one hand, maxntid should be fairly
innocuous, but I don't like how it can mask other PTX JIT bugs. At this
point, I'm leaning towards dropping it if does not impact the libgomp
regression test suite anymore. What do you want to do?

Cesar
/* This test was failing with nvptx offloading without the .maxntid
   PTX directive.  */

int i;
int main(void)
{
  int j, v;
  i = -1;
  j = -2;
  v = 0;

  j = -2;
  v = 0;
#pragma acc parallel present_or_copyout (v) copyout (i, j) vector_length(128)
  {
    i = 2;
    j = 1;
    if (i != 2 || j != 1)
      __builtin_abort ();
    v = 1;
  }
  if (v != 1 || i != 2 || j != 1)
    __builtin_abort ();
  i = -1;
  j = -2;
  v = 0;
#pragma acc parallel present_or_copyout (v) copy (i, j) vector_length(128)
  {
    if (i != -1 || j != -2)
      __builtin_abort ();
    i = 2;
    j = 1;
    if (i != 2 || j != 1)
      __builtin_abort ();
    v = 1;
  }
  if (v != 1 || i != 2 || j != 1)
    __builtin_abort ();

  return 0;
}
Tom de Vries March 23, 2018, 8:54 a.m. | #16
On 03/22/2018 08:04 PM, Cesar Philippidis wrote:
> I'm going to retest the variable vector length changes without it and

> see if it's still necessary. On one hand, maxntid should be fairly

> innocuous, but I don't like how it can mask other PTX JIT bugs. At this

> point, I'm leaning towards dropping it if does not impact the libgomp

> regression test suite anymore. What do you want to do?


If there is no observable difference in tests passing/failing, then we 
should drop it.

Thanks,
- Tom
Tom de Vries March 23, 2018, 1:12 p.m. | #17
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> @@ -4115,13 +4225,23 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)

>   	    pred = gen_reg_rtx (BImode);

>   	    cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER] = pred;

>   	  }

> -	

> +


It's fine to clean up whitespace, but please do that in separate patches.

Committed.

Thanks,
- Tom
[nvptx] Fix whitespace in nvptx_single

2018-03-23  Tom de Vries  <tom@codesourcery.com>

	* config/nvptx/nvptx.c (nvptx_single): Fix whitespace.

---
 gcc/config/nvptx/nvptx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index b7e3f59..50d7319 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4100,7 +4100,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	    pred = gen_reg_rtx (BImode);
 	    cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER] = pred;
 	  }
-	
+
 	rtx br;
 	if (mode == GOMP_DIM_VECTOR)
 	  br = gen_br_true (pred, label);
Tom de Vries March 23, 2018, 1:14 p.m. | #18
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> +/* Loop structure of the function.  The entire function is described as

> +   a NULL loop.  */

> +

>   struct parallel

>   {

>     /* Parent parallel.  */


You dropped this comment in "vector_length extension part 1: generalize 
function and variable names".

It's good to add it back, but that needs to be a separate patch.

Committed.

Thanks,
- Tom
[nvptx] Re-add removed struct parallel comment

2018-03-23  Tom de Vries  <tom@codesourcery.com>

	* config/nvptx/nvptx.c (struct parallel): Re-add comment.

---
 gcc/config/nvptx/nvptx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 50d7319..9873449 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -2831,6 +2831,9 @@ struct offload_attrs
   int max_workers;
 };
 
+/* Loop structure of the function.  The entire function is described as
+   a NULL loop.  */
+
 struct parallel
 {
   /* Parent parallel.  */
Tom de Vries March 23, 2018, 2:18 p.m. | #19
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md

> index 28ae263c867..ac2731233dd 100644

> --- a/gcc/config/nvptx/nvptx.md

> +++ b/gcc/config/nvptx/nvptx.md

> @@ -1418,10 +1418,16 @@

>     [(set_attr "atomic" "true")])

>   

>   (define_insn "nvptx_barsync"

> -  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]

> +  [(unspec_volatile [(match_operand:SI 0 "nvptx_nonmemory_operand" "Ri")

> +		     (match_operand:SI 1 "const_int_operand")]

>   		    UNSPECV_BARSYNC)]

>     ""

> -  "\\tbar.sync\\t%0;"

> +  {

> +    if (!REG_P (operands[0]))

> +      return "\\tbar.sync\\t%0;";

> +    else

> +      return "\\tbar.sync\\t%0, %1;";

> +  }

>     [(set_attr "predicable" "false")])


This is wrong. The first operand can be a register or a constant, and 
the second operand is independent. Whether or not we print the second 
operand is independent of whether the first is a register.

In this patch I've reserved INTVAL (operands[1]) == 0 for the "no second 
operand" case.

Committed.

Thanks,
- Tom
[nvptx] Add thread count parm to bar.sync

2018-03-23  Tom de Vries  <tom@codesourcery.com>

	* config/nvptx/nvptx.md (nvptx_barsync): Add and handle operand.
	* config/nvptx/nvptx.c (nvptx_cta_sync): Change arguments to take in a
	lock and thread count.  Update call to gen_nvptx_barsync.
	(nvptx_single, nvptx_process_pars): Update calls to nvptx_cta_sync.

---
 gcc/config/nvptx/nvptx.c  | 22 ++++++++++++++--------
 gcc/config/nvptx/nvptx.md | 10 ++++++++--
 3 files changed, 29 insertions(+), 10 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 12441cb..32f2efb 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -3939,13 +3939,14 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
   return empty;
 }
 
-/* Emit a CTA-level synchronization barrier.  We use different
-   markers for before and after synchronizations.  */
+/* Emit a CTA-level synchronization barrier (bar.sync).  LOCK is the
+   barrier number, which is an integer or a register.  THREADS is the
+   number of threads controlled by the barrier.  */
 
 static rtx
-nvptx_cta_sync (bool after)
+nvptx_cta_sync (rtx lock, int threads)
 {
-  return gen_nvptx_barsync (GEN_INT (after));
+  return gen_nvptx_barsync (lock, GEN_INT (threads));
 }
 
 #if WORKAROUND_PTXJIT_BUG
@@ -4195,6 +4196,8 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  /* Includes worker mode, do spill & fill.  By construction
 	     we should never have worker mode only. */
 	  broadcast_data_t data;
+	  rtx barrier = GEN_INT (0);
+	  int threads = 0;
 
 	  data.base = oacc_bcast_sym;
 	  data.ptr = 0;
@@ -4207,14 +4210,14 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 						    false),
 			    before);
 	  /* Barrier so other workers can see the write.  */
-	  emit_insn_before (nvptx_cta_sync (false), tail);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), tail);
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data,
 						    false), tail);
 	  /* This barrier is needed to avoid worker zero clobbering
 	     the broadcast buffer before all the other workers have
 	     had a chance to read this instance of it.  */
-	  emit_insn_before (nvptx_cta_sync (false), tail);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), tail);
 	}
 
       extract_insn (tail);
@@ -4331,12 +4334,15 @@ nvptx_process_pars (parallel *par)
       bool empty = nvptx_shared_propagate (true, is_call,
 					   par->forked_block, par->fork_insn,
 					   false);
+      rtx barrier = GEN_INT (0);
+      int threads = 0;
 
       if (!empty || !is_call)
 	{
 	  /* Insert begin and end synchronizations.  */
-	  emit_insn_before (nvptx_cta_sync (false), par->forked_insn);
-	  emit_insn_before (nvptx_cta_sync (false), par->join_insn);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads),
+			    par->forked_insn);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), par->join_insn);
 	}
     }
   else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 2b4bcb3a..2609222 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -1421,10 +1421,16 @@
   [(set_attr "atomic" "true")])
 
 (define_insn "nvptx_barsync"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
+  [(unspec_volatile [(match_operand:SI 0 "nvptx_nonmemory_operand" "Ri")
+		     (match_operand:SI 1 "const_int_operand")]
 		    UNSPECV_BARSYNC)]
   ""
-  "\\tbar.sync\\t%0;"
+  {
+    if (INTVAL (operands[1]) == 0)
+      return "\\tbar.sync\\t%0;";
+    else
+      return "\\tbar.sync\\t%0, %1;";
+  }
   [(set_attr "predicable" "false")])
 
 (define_insn "nvptx_nounroll"
Tom de Vries March 23, 2018, 2:22 p.m. | #20
On 03/22/2018 06:24 PM, Cesar Philippidis wrote:
> On 03/22/2018 09:18 AM, Tom de Vries wrote:

> 

>> That's obviously not good enough.

>>

>> When I compile this test-case:

>> ...

>> int

>> main (void)

>> {

>>    int a[10];

>> #pragma acc parallel num_workers (16)

>> #pragma acc loop worker

>>    for (int i = 0; i < 10; i++)

>>      a[i] = i;

>>

>>    return 0;

>> }

>> ...

>>

>> I get:

>> ...

>>   .maxntid 32, 16, 1

>> ...

>>

>> That's the change you need to isolate.

> 

> I attached an updated patch which incorporates the

> cfun->machine->axis_dim changes. It now generates more precise arguments

> for maxntid.

> 


Even with maxntid dropped, axis_dim is still used elsewhere in the patch 
series, so we can split off the introduction of axis_dim and helper 
functions in a separate patch.

Committed.

Thanks,
- Tom

> Cesar

> 

> 

> 0001-emit-.maxntid-hint.patch

> 

> 

>  From 11035dc92884146dc4d974156adcb260568db785 Mon Sep 17 00:00:00 2001

> From: Cesar Philippidis <cesar@codesourcery.com>

> Date: Thu, 22 Mar 2018 08:05:53 -0700

> Subject: [PATCH] emit .maxntid hint

> 

> ---

>   gcc/config/nvptx/nvptx.c | 19 +++++++++++++++++++

>   gcc/config/nvptx/nvptx.h |  2 ++

>   2 files changed, 21 insertions(+)

> 

> diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c

> index eff87732c4b..3958f71e995 100644

> --- a/gcc/config/nvptx/nvptx.c

> +++ b/gcc/config/nvptx/nvptx.c

> @@ -76,6 +76,7 @@

>   #include "target-def.h"

>   

>   #define WORKAROUND_PTXJIT_BUG 1

> +#define WORKAROUND_PTXJIT_BUG_3 1

>   

>   /* Define dimension sizes for known hardware.  */

>   #define PTX_VECTOR_LENGTH 32

> @@ -1219,6 +1220,16 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)

>        stream, in order to share the prototype writing code.  */

>     std::stringstream s;

>     write_fn_proto (s, true, name, decl);

> +

> +#if WORKAROUND_PTXJIT_BUG_3

> +  /* Emitting a .maxntid seems to have the effect of encouraging the

> +     PTX JIT emit SYNC branches.  */

> +  if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl))

> +      && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl)))

> +      s << ".maxntid " << cfun->machine->axis_dim[0] << ", "

> +	<< cfun->machine->axis_dim[1] << ", 1\n";

> +#endif

> +

>     s << "{\n";

>   

>     bool return_in_mem = write_return_type (s, false, result_type);

> @@ -2831,6 +2842,11 @@ struct offload_attrs

>     int max_workers;

>   };

>   

> +/* Define entries for cfun->machine->axis_dim.  */

> +

> +#define MACH_VECTOR_LENGTH 0

> +#define MACH_MAX_WORKERS 1

> +

>   struct parallel

>   {

>     /* Parent parallel.  */

> @@ -4525,6 +4541,9 @@ nvptx_reorg (void)

>   

>         populate_offload_attrs (&oa);

>   

> +      cfun->machine->axis_dim[MACH_VECTOR_LENGTH] = oa.vector_length;

> +      cfun->machine->axis_dim[MACH_MAX_WORKERS] = oa.max_workers;

> +

>         /* If there is worker neutering, there must be vector

>   	 neutering.  Otherwise the hardware will fail.  */

>         gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))

> diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h

> index 8a14507c88a..958516da604 100644

> --- a/gcc/config/nvptx/nvptx.h

> +++ b/gcc/config/nvptx/nvptx.h

> @@ -226,6 +226,8 @@ struct GTY(()) machine_function

>     int return_mode; /* Return mode of current fn.

>   		      (machine_mode not defined yet.) */

>     rtx axis_predicate[2]; /* Neutering predicates.  */

> +  int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is

> +		      vector_length, dim[1] is num_workers.   */

>     rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */

>     rtx unisimt_predicate; /* Predicate for -muniform-simt.  */

>     rtx unisimt_location; /* Mask location for -muniform-simt.  */

>
[nvptx] Add axis_dim

2018-03-23  Tom de Vries  <tom@codesourcery.com>

	* config/nvptx/nvptx.c (MACH_VECTOR_LENGTH, MACH_MAX_WORKERS): Define.
	(nvptx_mach_max_workers, nvptx_mach_vector_length): New function.
	(nvptx_reorg): Set function-specific axis_dim's.
	* config/nvptx/nvptx.h (struct machine_function): Add axis_dims.

---
 gcc/config/nvptx/nvptx.c | 20 ++++++++++++++++++++
 gcc/config/nvptx/nvptx.h |  2 ++
 3 files changed, 29 insertions(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 32f2efb..3cb33ae 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -2831,6 +2831,23 @@ struct offload_attrs
   int max_workers;
 };
 
+/* Define entries for cfun->machine->axis_dim.  */
+
+#define MACH_VECTOR_LENGTH 0
+#define MACH_MAX_WORKERS 1
+
+static int ATTRIBUTE_UNUSED
+nvptx_mach_max_workers ()
+{
+  return cfun->machine->axis_dim[MACH_MAX_WORKERS];
+}
+
+static int ATTRIBUTE_UNUSED
+nvptx_mach_vector_length ()
+{
+  return cfun->machine->axis_dim[MACH_VECTOR_LENGTH];
+}
+
 /* Loop structure of the function.  The entire function is described as
    a NULL loop.  */
 
@@ -4534,6 +4551,9 @@ nvptx_reorg (void)
 
       populate_offload_attrs (&oa);
 
+      cfun->machine->axis_dim[MACH_VECTOR_LENGTH] = oa.vector_length;
+      cfun->machine->axis_dim[MACH_MAX_WORKERS] = oa.max_workers;
+
       /* If there is worker neutering, there must be vector
 	 neutering.  Otherwise the hardware will fail.  */
       gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index 8a14507..784628e 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -226,6 +226,8 @@ struct GTY(()) machine_function
   int return_mode; /* Return mode of current fn.
 		      (machine_mode not defined yet.) */
   rtx axis_predicate[2]; /* Neutering predicates.  */
+  int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is
+		      vector_length, dim[1] is num_workers.  */
   rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */
   rtx unisimt_predicate; /* Predicate for -muniform-simt.  */
   rtx unisimt_location; /* Mask location for -muniform-simt.  */
Tom de Vries March 23, 2018, 4:21 p.m. | #21
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> +  if (cfun->machine->sync_bar)

> +    fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; "

> +	     "// vector synchronization barrier\n",

> +	     REGNO (cfun->machine->sync_bar));


I realize that atm we don't support large vector length when nesting a 
vector loop inside a worker loop, but ... if we did support that, and 
used a vector_length of 64, then with the "Maximum number of threads per 
block" of 1024 we have a possible 16 workers. And when using the maximum 
number of workers, we'll end up using logical barrier 16 (while we only 
have 0..15).

It would be good to have at least an assert detecting this situation.

Thanks,
- Tom
Tom de Vries March 30, 2018, 1:07 a.m. | #22
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> As a follow up patch will show, the nvptx BE falls back to using

> vector_length = 32 when a vector loop is nested inside a worker loop.


I disabled the fallback, and analyzed the vred2d-128.c illegal memory 
access execution failure.

I minimized that down to this ptx:
...
.shared .align 8 .u8 __oacc_bcast[176];

{
   {
     .reg .u32 %x;
     mov.u32 %x,%tid.x;
     setp.ne.u32 %r86,%x,0;
   }

   {
     .reg .u32 %tidy;
     .reg .u64 %t_bcast;
     .reg .u64 %y64;
     mov.u32 %tidy,%tid.y;
     cvt.u64.u32 %y64,%tidy;
     add.u64 %y64,%y64,1;
     cvta.shared.u64 %t_bcast,__oacc_bcast;
     mad.lo.u64 %r66,%y64,88,%t_bcast;
   }

   @ %r86 bra $L28;
   st.u32 [%r66+80],0;
  $L28:
   ret;
}
...

The ptx is called with 2 workers and 128 vector_length.

So, 2 workers mean %tid.y has values 0 and 1.
Then %y64 has values 1 and 2.
Then %r66 has values __oacc_bcast + (1 * 88) and __oacc_bcast + (2 * 88).
Then the st.u32 accesss __oacc_bcast + (1 * 88) + 80 and __oacc_bcast + 
(2 * 88) + 80.

So we're accessing memory at location 256, while the __oacc_bcast is 
only 176 bytes big.

I formulated this assert that AFAIU detects this situation in the compiler:
...
@@ -1125,6 +1125,8 @@ nvptx_init_axis_predicate (FILE *file, int regno, 
const char *name)
    fprintf (file, "\t}\n");
  }

+static int nvptx_mach_max_workers ();
+
  /* Emit code to initialize OpenACC worker broadcast and synchronization
     registers.  */

@@ -1148,6 +1150,7 @@ nvptx_init_oacc_workers (FILE *file)
                "// vector broadcast offset\n",
                REGNO (cfun->machine->bcast_partition),
                oacc_bcast_partition);
+      gcc_assert (oacc_bcast_partition * (nvptx_mach_max_workers () + 
1) <= oacc_bcast_size);
      }
    if (cfun->machine->sync_bar)
      fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; "
...

The assert is not triggered when the fallback is used.

Thanks,
- Tom
Tom de Vries March 30, 2018, 2:45 p.m. | #23
On 03/30/2018 03:07 AM, Tom de Vries wrote:
> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:

>> As a follow up patch will show, the nvptx BE falls back to using

>> vector_length = 32 when a vector loop is nested inside a worker loop.

> 

> I disabled the fallback, and analyzed the vred2d-128.c illegal memory 

> access execution failure.

> 

> I minimized that down to this ptx:

> ...

> .shared .align 8 .u8 __oacc_bcast[176];

> 

> {

>    {

>      .reg .u32 %x;

>      mov.u32 %x,%tid.x;

>      setp.ne.u32 %r86,%x,0;

>    }

> 

>    {

>      .reg .u32 %tidy;

>      .reg .u64 %t_bcast;

>      .reg .u64 %y64;

>      mov.u32 %tidy,%tid.y;

>      cvt.u64.u32 %y64,%tidy;

>      add.u64 %y64,%y64,1;

>      cvta.shared.u64 %t_bcast,__oacc_bcast;

>      mad.lo.u64 %r66,%y64,88,%t_bcast;

>    }

> 

>    @ %r86 bra $L28;

>    st.u32 [%r66+80],0;

>   $L28:

>    ret;

> }

> ...

> 

> The ptx is called with 2 workers and 128 vector_length.

> 

> So, 2 workers mean %tid.y has values 0 and 1.

> Then %y64 has values 1 and 2.

> Then %r66 has values __oacc_bcast + (1 * 88) and __oacc_bcast + (2 * 88).

> Then the st.u32 accesss __oacc_bcast + (1 * 88) + 80 and __oacc_bcast + 

> (2 * 88) + 80.

> 

> So we're accessing memory at location 256, while the __oacc_bcast is 

> only 176 bytes big.

> 

> I formulated this assert that AFAIU detects this situation in the compiler:

> ...

> @@ -1125,6 +1125,8 @@ nvptx_init_axis_predicate (FILE *file, int regno, 

> const char *name)

>     fprintf (file, "\t}\n");

>   }

> 

> +static int nvptx_mach_max_workers ();

> +

>   /* Emit code to initialize OpenACC worker broadcast and synchronization

>      registers.  */

> 

> @@ -1148,6 +1150,7 @@ nvptx_init_oacc_workers (FILE *file)

>                 "// vector broadcast offset\n",

>                 REGNO (cfun->machine->bcast_partition),

>                 oacc_bcast_partition);

> +      gcc_assert (oacc_bcast_partition * (nvptx_mach_max_workers () + 

> 1) <= oacc_bcast_size);

>       }

>     if (cfun->machine->sync_bar)

>       fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; "

> ...

> 

> The assert is not triggered when the fallback is used.


I've tracked the problem down to:
...
> -      if (oacc_bcast_size < data.offset)                                                                          

> -       oacc_bcast_size = data.offset;                                                                             

> +      if (oacc_bcast_partition < data.offset)                                                                     

> +       {                                                                                                          

> +         int psize = data.offset;                                                                                 

> +         psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);                                        

> +         oacc_bcast_partition = psize;                                                                            

> +         oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1);                                               

> +       }                                                                                                          

...

We hit this if clause for a first compiled function, with num_workers(1).

This sets oacc_bcast_partition and oacc_bcast_size as required for that 
functions.

Then we hit this if clause for a second compiled function, with 
num_workers (2).

We need oacc_bcast_size updated, but the 'oacc_bcast_partition < 
data.offset' is false, so the update doesn't happen.

I managed to fix this by making the code unconditional, and using MAX to 
update oacc_bcast_partition and oacc_bcast_size.

Thanks,
- Tom
Cesar Philippidis March 30, 2018, 3 p.m. | #24
On 03/30/2018 07:45 AM, Tom de Vries wrote:
> On 03/30/2018 03:07 AM, Tom de Vries wrote:

>> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:

>>> As a follow up patch will show, the nvptx BE falls back to using

>>> vector_length = 32 when a vector loop is nested inside a worker loop.

>>

>> I disabled the fallback, and analyzed the vred2d-128.c illegal memory

>> access execution failure.

>>

>> I minimized that down to this ptx:

>> ...

>> .shared .align 8 .u8 __oacc_bcast[176];

>>

>> {

>>    {

>>      .reg .u32 %x;

>>      mov.u32 %x,%tid.x;

>>      setp.ne.u32 %r86,%x,0;

>>    }

>>

>>    {

>>      .reg .u32 %tidy;

>>      .reg .u64 %t_bcast;

>>      .reg .u64 %y64;

>>      mov.u32 %tidy,%tid.y;

>>      cvt.u64.u32 %y64,%tidy;

>>      add.u64 %y64,%y64,1;

>>      cvta.shared.u64 %t_bcast,__oacc_bcast;

>>      mad.lo.u64 %r66,%y64,88,%t_bcast;

>>    }

>>

>>    @ %r86 bra $L28;

>>    st.u32 [%r66+80],0;

>>   $L28:

>>    ret;

>> }

>> ...

>>

>> The ptx is called with 2 workers and 128 vector_length.

>>

>> So, 2 workers mean %tid.y has values 0 and 1.

>> Then %y64 has values 1 and 2.

>> Then %r66 has values __oacc_bcast + (1 * 88) and __oacc_bcast + (2 * 88).

>> Then the st.u32 accesss __oacc_bcast + (1 * 88) + 80 and __oacc_bcast

>> + (2 * 88) + 80.

>>

>> So we're accessing memory at location 256, while the __oacc_bcast is

>> only 176 bytes big.

>>

>> I formulated this assert that AFAIU detects this situation in the

>> compiler:

>> ...

>> @@ -1125,6 +1125,8 @@ nvptx_init_axis_predicate (FILE *file, int

>> regno, const char *name)

>>     fprintf (file, "\t}\n");

>>   }

>>

>> +static int nvptx_mach_max_workers ();

>> +

>>   /* Emit code to initialize OpenACC worker broadcast and synchronization

>>      registers.  */

>>

>> @@ -1148,6 +1150,7 @@ nvptx_init_oacc_workers (FILE *file)

>>                 "// vector broadcast offset\n",

>>                 REGNO (cfun->machine->bcast_partition),

>>                 oacc_bcast_partition);

>> +      gcc_assert (oacc_bcast_partition * (nvptx_mach_max_workers () +

>> 1) <= oacc_bcast_size);

>>       }

>>     if (cfun->machine->sync_bar)

>>       fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; "

>> ...

>>

>> The assert is not triggered when the fallback is used.

> 

> I've tracked the problem down to:

> ...

>> -      if (oacc_bcast_size <

>> data.offset)                                                                         

>> -       oacc_bcast_size =

>> data.offset;                                                                            

>> +      if (oacc_bcast_partition <

>> data.offset)                                                                    

>> +      

>> {                                                                                                         

>> +         int psize =

>> data.offset;                                                                                

>> +         psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align

>> - 1);                                        +        

>> oacc_bcast_partition =

>> psize;                                                                           

>> +         oacc_bcast_size = psize * (nvptx_mach_max_workers () +

>> 1);                                               +      

>> }                                                                                                         

> 

> ...

> 

> We hit this if clause for a first compiled function, with num_workers(1).

> 

> This sets oacc_bcast_partition and oacc_bcast_size as required for that

> functions.

> 

> Then we hit this if clause for a second compiled function, with

> num_workers (2).

> 

> We need oacc_bcast_size updated, but the 'oacc_bcast_partition <

> data.offset' is false, so the update doesn't happen.

> 

> I managed to fix this by making the code unconditional, and using MAX to

> update oacc_bcast_partition and oacc_bcast_size.


It looks like that's fallout from this patch
<https://gcc.gnu.org/ml/gcc-patches/2018-03/msg01212.html>. I should
have checked that patch with the vector length fallback disabled.

Cesar
Tom de Vries March 30, 2018, 3:14 p.m. | #25
On 03/30/2018 05:00 PM, Cesar Philippidis wrote:
> I should

> have checked that patch with the vector length fallback disabled.


Right. The patch series introduces a lot of code that is not exercised.

I've added an -mlong-vector-in-workers option in my local branch and 
added 3 test-cases to exercise the code with fallback disabled everytime 
I run the libgomp tests.

Thanks,
- Tom
Tom de Vries April 3, 2018, 3 p.m. | #26
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> 	* config/nvptx/nvptx.c (oacc_bcast_partition): Declare.


One last thing: this variable needs to be reset to zero for every function.

Without this reset, we can generated different code for a function 
depending on whether there's another function in front or not.


> 	(populate_offload_attrs): Handle the situation where the default

> 	runtime geometry has not been initialized yet for reductions.


I've moved this bit to "vector_length extension part 4: target hooks and 
automatic parallelism".


Build on x86_64 with nvptx accelerator and tested libgomp.

Committed.

Thanks,
- Tom
[nvptx] Generalize state propagation and synchronization

2018-04-03  Cesar Philippidis  <cesar@codesourcery.com>
	    Tom de Vries  <tom@codesourcery.com>

	* config/nvptx/nvptx.c (oacc_bcast_partition): Declare.
	(nvptx_option_override): Init oacc_bcast_partition.
	(nvptx_init_oacc_workers): New function.
	(nvptx_declare_function_name): Call nvptx_init_oacc_workers.
	(nvptx_needs_shared_bcast): New function.
	(nvptx_find_par): Generalize to enable vectors to use shared-memory
	to propagate state.
	(nvptx_shared_propagate): Initialize vector bcast partition and
	synchronization state.
	(nvptx_single):  Generalize to enable vectors to use shared-memory
	to propagate state.
	(nvptx_process_pars): Likewise.
	* config/nvptx/nvptx.h (struct machine_function): Add
	bcast_partition and sync_bar members.

---
 gcc/config/nvptx/nvptx.c | 137 ++++++++++++++++++++++++++++++++++++++++++-----
 gcc/config/nvptx/nvptx.h |   4 ++
 2 files changed, 129 insertions(+), 12 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index d4ff730..0b46e13 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -133,6 +133,7 @@ static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
    memory.  It'd be nice if PTX supported common blocks, because then
    this could be shared across TUs (taking the largest size).  */
 static unsigned oacc_bcast_size;
+static unsigned oacc_bcast_partition;
 static unsigned oacc_bcast_align;
 static GTY(()) rtx oacc_bcast_sym;
 
@@ -157,6 +158,8 @@ static bool need_softstack_decl;
 /* True if any function references __nvptx_uni.  */
 static bool need_unisimt_decl;
 
+static int nvptx_mach_max_workers ();
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -210,6 +213,7 @@ nvptx_option_override (void)
   oacc_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, "__oacc_bcast");
   SET_SYMBOL_DATA_AREA (oacc_bcast_sym, DATA_AREA_SHARED);
   oacc_bcast_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+  oacc_bcast_partition = 0;
 
   worker_red_sym = gen_rtx_SYMBOL_REF (Pmode, "__worker_red");
   SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED);
@@ -1097,6 +1101,40 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
   fprintf (file, "\t}\n");
 }
 
+/* Emit code to initialize OpenACC worker broadcast and synchronization
+   registers.  */
+
+static void
+nvptx_init_oacc_workers (FILE *file)
+{
+  fprintf (file, "\t{\n");
+  fprintf (file, "\t\t.reg.u32\t%%tidy;\n");
+  if (cfun->machine->bcast_partition)
+    {
+      fprintf (file, "\t\t.reg.u64\t%%t_bcast;\n");
+      fprintf (file, "\t\t.reg.u64\t%%y64;\n");
+    }
+  fprintf (file, "\t\tmov.u32\t\t%%tidy, %%tid.y;\n");
+  if (cfun->machine->bcast_partition)
+    {
+      fprintf (file, "\t\tcvt.u64.u32\t%%y64, %%tidy;\n");
+      fprintf (file, "\t\tadd.u64\t\t%%y64, %%y64, 1; // vector ID\n");
+      fprintf (file, "\t\tcvta.shared.u64\t%%t_bcast, __oacc_bcast;\n");
+      fprintf (file, "\t\tmad.lo.u64\t%%r%d, %%y64, %d, %%t_bcast; "
+	       "// vector broadcast offset\n",
+	       REGNO (cfun->machine->bcast_partition),
+	       oacc_bcast_partition);
+    }
+  /* Verify oacc_bcast_size.  */
+  gcc_assert (oacc_bcast_partition * (nvptx_mach_max_workers () + 1)
+	      <= oacc_bcast_size);
+  if (cfun->machine->sync_bar)
+    fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; "
+	     "// vector synchronization barrier\n",
+	     REGNO (cfun->machine->sync_bar));
+  fprintf (file, "\t}\n");
+}
+
 /* Emit code to initialize predicate and master lane index registers for
    -muniform-simt code generation variant.  */
 
@@ -1323,6 +1361,8 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
   if (cfun->machine->unisimt_predicate
       || (cfun->machine->has_simtreg && !crtl->is_leaf))
     nvptx_init_unisimt_predicate (file);
+  if (cfun->machine->bcast_partition || cfun->machine->sync_bar)
+    nvptx_init_oacc_workers (file);
 }
 
 /* Output code for switching uniform-simt state.  ENTERING indicates whether
@@ -3000,6 +3040,19 @@ nvptx_split_blocks (bb_insn_map_t *map)
     }
 }
 
+/* Return true if MASK contains parallelism that requires shared
+   memory to broadcast.  */
+
+static bool
+nvptx_needs_shared_bcast (unsigned mask)
+{
+  bool worker = mask & GOMP_DIM_MASK (GOMP_DIM_WORKER);
+  bool large_vector = (mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+    && nvptx_mach_vector_length () != PTX_WARP_SIZE;
+
+  return worker || large_vector;
+}
+
 /* BLOCK is a basic block containing a head or tail instruction.
    Locate the associated prehead or pretail instruction, which must be
    in the single predecessor block.  */
@@ -3075,7 +3128,7 @@ nvptx_find_par (bb_insn_map_t *map, parallel *par, basic_block block)
 	    par = new parallel (par, mask);
 	    par->forked_block = block;
 	    par->forked_insn = end;
-	    if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+	    if (nvptx_needs_shared_bcast (mask))
 	      par->fork_insn
 		= nvptx_discover_pre (block, CODE_FOR_nvptx_fork);
 	  }
@@ -3090,7 +3143,7 @@ nvptx_find_par (bb_insn_map_t *map, parallel *par, basic_block block)
 	    gcc_assert (par->mask == mask);
 	    par->join_block = block;
 	    par->join_insn = end;
-	    if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+	    if (nvptx_needs_shared_bcast (mask))
 	      par->joining_insn
 		= nvptx_discover_pre (block, CODE_FOR_nvptx_joining);
 	    par = par->parent;
@@ -3947,11 +4000,33 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
   gcc_assert (empty == !data.offset);
   if (data.offset)
     {
+      rtx bcast_sym = oacc_bcast_sym;
+
       /* Stuff was emitted, initialize the base pointer now.  */
-      rtx init = gen_rtx_SET (data.base, oacc_bcast_sym);
+      if (vector && nvptx_mach_max_workers () > 1)
+	{
+	  if (!cfun->machine->bcast_partition)
+	    {
+	      /* It would be nice to place this register in
+		 DATA_AREA_SHARED.  */
+	      cfun->machine->bcast_partition = gen_reg_rtx (DImode);
+	    }
+	  if (!cfun->machine->sync_bar)
+	    cfun->machine->sync_bar = gen_reg_rtx (SImode);
+
+	  bcast_sym = cfun->machine->bcast_partition;
+	}
+
+      rtx init = gen_rtx_SET (data.base, bcast_sym);
       emit_insn_after (init, insn);
 
-      oacc_bcast_size = MAX (oacc_bcast_size, data.offset);
+      unsigned int psize = ROUND_UP (data.offset, oacc_bcast_align);
+      unsigned int pnum = (nvptx_mach_vector_length () > PTX_WARP_SIZE
+			   ? nvptx_mach_max_workers () + 1
+			   : 1);
+
+      oacc_bcast_partition = MAX (oacc_bcast_partition, psize);
+      oacc_bcast_size = MAX (oacc_bcast_size, psize * pnum);
     }
   return empty;
 }
@@ -4146,7 +4221,8 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
     {
       rtx pvar = XEXP (XEXP (cond_branch, 0), 0);
 
-      if (GOMP_DIM_MASK (GOMP_DIM_VECTOR) == mask)
+      if (GOMP_DIM_MASK (GOMP_DIM_VECTOR) == mask
+	  && nvptx_mach_vector_length () == PTX_WARP_SIZE)
 	{
 	  /* Vector mode only, do a shuffle.  */
 #if WORKAROUND_PTXJIT_BUG
@@ -4213,23 +4289,51 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  /* Includes worker mode, do spill & fill.  By construction
 	     we should never have worker mode only. */
 	  broadcast_data_t data;
+	  unsigned size = GET_MODE_SIZE (SImode);
+	  bool vector = true;
 	  rtx barrier = GEN_INT (0);
 	  int threads = 0;
 
+	  if (GOMP_DIM_MASK (GOMP_DIM_WORKER) == mask)
+	    vector = false;
+
 	  data.base = oacc_bcast_sym;
 	  data.ptr = 0;
 
-	  oacc_bcast_size = MAX (oacc_bcast_size, GET_MODE_SIZE (SImode));
+	  if (vector
+	      && nvptx_mach_max_workers () > 1
+	      && cfun->machine->bcast_partition)
+	    data.base = cfun->machine->bcast_partition;
+
+	  gcc_assert (data.base != NULL);
+
+	  unsigned int psize = ROUND_UP (size, oacc_bcast_align);
+	  unsigned int pnum = (nvptx_mach_vector_length () > PTX_WARP_SIZE
+			       ? nvptx_mach_max_workers () + 1
+			       : 1);
+
+	  oacc_bcast_partition = MAX (oacc_bcast_partition, psize);
+	  oacc_bcast_size = MAX (oacc_bcast_size, psize * pnum);
 
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_read, 0, &data,
-						    false),
+						    vector),
 			    before);
+
+	  if (vector
+	      && nvptx_mach_max_workers () > 1
+	      && cfun->machine->sync_bar)
+	    {
+	      barrier = cfun->machine->sync_bar;
+	      threads = nvptx_mach_vector_length ();
+	    }
+
 	  /* Barrier so other workers can see the write.  */
 	  emit_insn_before (nvptx_cta_sync (barrier, threads), tail);
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data,
-						    false), tail);
+						    vector),
+			    tail);
 	  /* This barrier is needed to avoid worker zero clobbering
 	     the broadcast buffer before all the other workers have
 	     had a chance to read this instance of it.  */
@@ -4342,17 +4446,26 @@ nvptx_process_pars (parallel *par)
     }
 
   bool is_call = (par->mask & GOMP_DIM_MASK (GOMP_DIM_MAX)) != 0;
-  
-  if (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+  bool worker = (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER));
+  bool large_vector = ((par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+		      && nvptx_mach_vector_length () > PTX_WARP_SIZE);
+
+  if (worker || large_vector)
     {
       nvptx_shared_propagate (false, is_call, par->forked_block,
-			      par->forked_insn, false);
+			      par->forked_insn, !worker);
       bool empty = nvptx_shared_propagate (true, is_call,
 					   par->forked_block, par->fork_insn,
-					   false);
+					   !worker);
       rtx barrier = GEN_INT (0);
       int threads = 0;
 
+      if (!worker && cfun->machine->sync_bar)
+	{
+	  barrier = cfun->machine->sync_bar;
+	  threads = nvptx_mach_vector_length ();
+	}
+
       if (!empty || !is_call)
 	{
 	  /* Insert begin and end synchronizations.  */
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index 784628e..fb9f04b 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -228,6 +228,10 @@ struct GTY(()) machine_function
   rtx axis_predicate[2]; /* Neutering predicates.  */
   int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is
 		      vector_length, dim[1] is num_workers.  */
+  rtx bcast_partition; /* Register containing the size of each
+			  vector's partition of share-memory used to
+			  broadcast state.  */
+  rtx sync_bar; /* Synchronization barrier ID for vectors.  */
   rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */
   rtx unisimt_predicate; /* Predicate for -muniform-simt.  */
   rtx unisimt_location; /* Mask location for -muniform-simt.  */
Tom de Vries April 5, 2018, 2:05 p.m. | #27
On 04/03/2018 05:00 PM, Tom de Vries wrote:
> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:

>>     * config/nvptx/nvptx.c (oacc_bcast_partition): Declare.

> 

> One last thing: this variable needs to be reset to zero for every function.

> 

> Without this reset, we can generated different code for a function 

> depending on whether there's another function in front or not.


In the previous commit, I set that variable in nvptx_option_override, 
but as I've found out that's not enough.

This patch does the init in nvptx_set_current_function.

Build x86_64 with nvptx accelerator and reg-tested libgomp.

Committed.

Thanks,
- Tom
[nvptx] Add per-function initialization of oacc_broadcast_partition

2018-04-05  Tom de Vries  <tom@codesourcery.com>

	* config/nvptx/nvptx.c (nvptx_set_current_function): Initialize
	oacc_broadcast_partition.

---
 gcc/config/nvptx/nvptx.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 0b46e13..009ca59 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -5962,6 +5962,7 @@ nvptx_set_current_function (tree fndecl)
 
   gangprivate_shared_hmap.empty ();
   nvptx_previous_fndecl = fndecl;
+  oacc_bcast_partition = 0;
 }
 
 #undef TARGET_OPTION_OVERRIDE
Tom de Vries April 5, 2018, 2:14 p.m. | #28
On 04/03/2018 05:00 PM, Tom de Vries wrote:
> +      unsigned int psize = ROUND_UP (data.offset, oacc_bcast_align);

> +      unsigned int pnum = (nvptx_mach_vector_length () > PTX_WARP_SIZE

> +			   ? nvptx_mach_max_workers () + 1

> +			   : 1);


This claims too much space for a simple long vector loop. Filed as 
PR85231 - "[og7, openacc, nvptx] Too much shared memory claimed for long 
vector length".

Thanks,
- Tom
Tom de Vries April 5, 2018, 4:33 p.m. | #29
On 03/30/2018 05:14 PM, Tom de Vries wrote:
> On 03/30/2018 05:00 PM, Cesar Philippidis wrote:

>> I should

>> have checked that patch with the vector length fallback disabled.

> 

> Right. The patch series introduces a lot of code that is not exercised.

> 

> I've added an -mlong-vector-in-workers option in my local branch and 

> added 3 test-cases to exercise the code with fallback disabled everytime 

> I run the libgomp tests.

> 


This patch adds that option.

Build x86_64 with nvptx accelerator and tested libgomp.

Committed.

Thanks,
- Tom
[nvptx] Add -mlong-vector-in-workers

2018-04-05  Tom de Vries  <tom@codesourcery.com>

	* config/nvptx/nvptx.c (nvptx_adjust_parallelism): Handle
	nvptx_long_vectors_in_workers.
	* config/nvptx/nvptx.opt (mlong-vector-in-workers): Add option.

	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-4.c: New test.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-5.c: New test.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-6.c: New test.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-8.c: New test.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-9.c: New test.

---
 gcc/config/nvptx/nvptx.c                           |  3 +-
 gcc/config/nvptx/nvptx.opt                         |  3 ++
 .../vector-length-128-4.c                          | 41 ++++++++++++++++++++
 .../vector-length-128-5.c                          | 42 +++++++++++++++++++++
 .../vector-length-128-6.c                          | 42 +++++++++++++++++++++
 .../vector-length-128-8.c                          | 44 ++++++++++++++++++++++
 .../vector-length-128-9.c                          | 44 ++++++++++++++++++++++
 7 files changed, 218 insertions(+), 1 deletion(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 595413a..b5e6dce 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -5397,7 +5397,8 @@ nvptx_adjust_parallelism (unsigned inner_mask, unsigned outer_mask)
      worker loop.  Therefore, fallback to setting vector_length to
      PTX_WARP_SIZE.  Hopefully this condition may be relaxed for
      sm_70+ targets.  */
-  if ((inner_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+  if (nvptx_long_vectors_in_workers == 0
+      && (inner_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
       && (outer_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)))
     {
       tree attr = tree_cons (get_identifier (NVPTX_GOACC_VL_WARP), NULL_TREE,
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index e2d64bd..f7f37ec 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -62,3 +62,6 @@ Enum(ptx_isa) String(sm_35) Value(PTX_ISA_SM35)
 misa=
 Target RejectNegative ToLower Joined Enum(ptx_isa) Var(ptx_isa_option) Init(PTX_ISA_SM30)
 Specify the version of the ptx ISA to use.
+
+mlong-vector-in-workers
+Target Var(nvptx_long_vectors_in_workers) Undocumented Init(0)
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-4.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-4.c
new file mode 100644
index 0000000..6d43f82
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-4.c
@@ -0,0 +1,41 @@
+/* { dg-do run { target openacc_nvidia_accel_selected } } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-mlong-vector-in-workers" } */
+/* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
+
+#include <stdlib.h>
+
+#define N 1024
+
+unsigned int a[N];
+unsigned int b[N];
+unsigned int c[N];
+unsigned int n = N;
+
+int
+main (void)
+{
+  for (unsigned int i = 0; i < n; ++i)
+    {
+      a[i] = i % 3;
+      b[i] = i % 5;
+    }
+
+#pragma acc parallel num_workers (2) vector_length (128) copyin (a,b) copyout (c)
+  {
+#pragma acc loop worker
+    for (unsigned int i = 0; i < 4; i++)
+#pragma acc loop vector
+      for (unsigned int j = 0; j < n / 4; j++)
+	c[(i * N / 4) + j] = a[(i * N / 4) + j] + b[(i * N / 4) + j];
+  }
+
+  for (unsigned int i = 0; i < n; ++i)
+    if (c[i] != (i % 3) + (i % 5))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 2, 128\\)" "oaccdevlow" } } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=2, vectors=128" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-5.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-5.c
new file mode 100644
index 0000000..661fdc7
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-5.c
@@ -0,0 +1,42 @@
+/* { dg-do run { target openacc_nvidia_accel_selected } } */
+/* { dg-additional-options "-fopenacc-dim=-:2:128" } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-mlong-vector-in-workers" } */
+/* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
+
+#include <stdlib.h>
+
+#define N 1024
+
+unsigned int a[N];
+unsigned int b[N];
+unsigned int c[N];
+unsigned int n = N;
+
+int
+main (void)
+{
+  for (unsigned int i = 0; i < n; ++i)
+    {
+      a[i] = i % 3;
+      b[i] = i % 5;
+    }
+
+#pragma acc parallel copyin (a,b) copyout (c)
+  {
+#pragma acc loop worker
+    for (unsigned int i = 0; i < 4; i++)
+#pragma acc loop vector
+      for (unsigned int j = 0; j < n / 4; j++)
+	c[(i * N / 4) + j] = a[(i * N / 4) + j] + b[(i * N / 4) + j];
+  }
+
+  for (unsigned int i = 0; i < n; ++i)
+    if (c[i] != (i % 3) + (i % 5))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 2, 128\\)" "oaccdevlow" } } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=2, vectors=128" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-6.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-6.c
new file mode 100644
index 0000000..91f611e
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-6.c
@@ -0,0 +1,42 @@
+/* { dg-do run { target openacc_nvidia_accel_selected } } */
+/* { dg-set-target-env-var "GOMP_OPENACC_DIM" ":2:" } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-mlong-vector-in-workers" } */
+/* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
+
+#include <stdlib.h>
+
+#define N 1024
+
+unsigned int a[N];
+unsigned int b[N];
+unsigned int c[N];
+unsigned int n = N;
+
+int
+main (void)
+{
+  for (unsigned int i = 0; i < n; ++i)
+    {
+      a[i] = i % 3;
+      b[i] = i % 5;
+    }
+
+#pragma acc parallel vector_length (128) copyin (a,b) copyout (c)
+  {
+#pragma acc loop worker
+    for (unsigned int i = 0; i < 4; i++)
+#pragma acc loop vector
+      for (unsigned int j = 0; j < n / 4; j++)
+	c[(i * N / 4) + j] = a[(i * N / 4) + j] + b[(i * N / 4) + j];
+  }
+
+  for (unsigned int i = 0; i < n; ++i)
+    if (c[i] != (i % 3) + (i % 5))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 128\\)" "oaccdevlow" } } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=2, vectors=128" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-8.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-8.c
new file mode 100644
index 0000000..6246067
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-8.c
@@ -0,0 +1,44 @@
+/* { dg-do run { target openacc_nvidia_accel_selected } } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-mlong-vector-in-workers" } */
+/* { dg-additional-options "-fopenacc-dim=-:-:-" } */
+/* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
+
+#include <stdlib.h>
+
+#define N 1024
+
+unsigned int a[N];
+unsigned int b[N];
+unsigned int c[N];
+unsigned int n = N;
+
+int
+main (void)
+{
+  for (unsigned int i = 0; i < n; ++i)
+    {
+      a[i] = i % 3;
+      b[i] = i % 5;
+    }
+
+#pragma acc parallel copyin (a,b) copyout (c)
+  {
+#pragma acc loop worker
+    for (unsigned int i = 0; i < 4; i++)
+#pragma acc loop vector
+      for (unsigned int j = 0; j < n / 4; j++)
+	c[(i * N / 4) + j] = a[(i * N / 4) + j] + b[(i * N / 4) + j];
+  }
+
+  for (unsigned int i = 0; i < n; ++i)
+    if (c[i] != (i % 3) + (i % 5))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-prune-output "using vector_length \\(32\\), ignoring runtime setting" } */
+  
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 32\\)" "oaccdevlow" } } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=32, vectors=32" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-9.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-9.c
new file mode 100644
index 0000000..2f8b4b7
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-9.c
@@ -0,0 +1,44 @@
+/* { dg-do run { target openacc_nvidia_accel_selected } } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-mlong-vector-in-workers" } */
+/* { dg-additional-options "-fopenacc-dim=-:8:-" } */
+/* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
+
+#include <stdlib.h>
+
+#define N 1024
+
+unsigned int a[N];
+unsigned int b[N];
+unsigned int c[N];
+unsigned int n = N;
+
+int
+main (void)
+{
+  for (unsigned int i = 0; i < n; ++i)
+    {
+      a[i] = i % 3;
+      b[i] = i % 5;
+    }
+
+#pragma acc parallel copyin (a,b) copyout (c)
+  {
+#pragma acc loop worker
+    for (unsigned int i = 0; i < 4; i++)
+#pragma acc loop vector
+      for (unsigned int j = 0; j < n / 4; j++)
+	c[(i * N / 4) + j] = a[(i * N / 4) + j] + b[(i * N / 4) + j];
+  }
+
+  for (unsigned int i = 0; i < n; ++i)
+    if (c[i] != (i % 3) + (i % 5))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-prune-output "using vector_length \\(32\\), ignoring runtime setting" } */
+  
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 8, 32\\)" "oaccdevlow" } } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=8, vectors=32" } */

Patch

2018-03-02  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* config/nvptx/nvptx.c (oacc_bcast_partition): Declare.
	(nvptx_init_axis_predicate): Initialize vector_red_partition.
	(nvptx_init_oacc_workers): New function.
	(nvptx_declare_function_name): Emit a .maxntid directive hint and
	call nvptx_init_oacc_workers.
	(MACH_VECTOR_LENGTH, MACH_MAX_WORKERS): Define.
	(nvptx_mach_max_workers): New function.
	(nvptx_mach_vector_length): New function.
	(nvptx_needs_shared_bcast): New function.
	(nvptx_find_par): Generalize to enable vectors to use shared-memory
	to propagate state.
	(nvptx_shared_propagate): Iniitalize vector bcast partition and
	synchronization state.
	(nvptx_cta_sync): Change arguments to take in a lock and thread count.
	Update call to gen_nvptx_barsync.
	(nvptx_single):  Generalize to enable vectors to use shared-memory
	to propagate state.
	(nvptx_process_pars): Likewise.
	(populate_offload_attrs): Handle the situation where the default
	runtime geometry has not been initialized yet for reductions.
	(nvptx_reorg): Set function-specific axis_dim's.
	* config/nvptx/nvptx.h (struct machine_function): Add axis_dims,
	bcast_partition, red_partition and sync_bar members.
	* config/nvptx/nvptx.md (nvptx_barsync): Adjust operands.

From 0a1dd1d85e47feeaa6f7a2e070baba69dadea444 Mon Sep 17 00:00:00 2001
From: Cesar Philippidis <cesar@codesourcery.com>
Date: Fri, 2 Mar 2018 07:39:25 -0800
Subject: [PATCH] bar and sync

---
 gcc/config/nvptx/nvptx.c  | 226 ++++++++++++++++++++++++++++++++++++++++------
 gcc/config/nvptx/nvptx.h  |   8 ++
 gcc/config/nvptx/nvptx.md |  10 +-
 3 files changed, 214 insertions(+), 30 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 9d77176c638..507c8671704 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -133,6 +133,7 @@  static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
    memory.  It'd be nice if PTX supported common blocks, because then
    this could be shared across TUs (taking the largest size).  */
 static unsigned oacc_bcast_size;
+static unsigned oacc_bcast_partition;
 static unsigned oacc_bcast_align;
 static GTY(()) rtx oacc_bcast_sym;
 
@@ -1104,8 +1105,53 @@  nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
 {
   fprintf (file, "\t{\n");
   fprintf (file, "\t\t.reg.u32\t%%%s;\n", name);
-  fprintf (file, "\t\tmov.u32\t%%%s, %%tid.%s;\n", name, name);
+  if (strcmp (name, "x") == 0 && cfun->machine->red_partition)
+    {
+      fprintf (file, "\t\t.reg.u64\t%%t_red;\n");
+      fprintf (file, "\t\t.reg.u64\t%%y64;\n");
+    }
+  fprintf (file, "\t\tmov.u32\t\t%%%s, %%tid.%s;\n", name, name);
   fprintf (file, "\t\tsetp.ne.u32\t%%r%d, %%%s, 0;\n", regno, name);
+  if (strcmp (name, "x") == 0 && cfun->machine->red_partition)
+    {
+      fprintf (file, "\t\tcvt.u64.u32\t%%y64, %%tid.y;\n");
+      fprintf (file, "\t\tcvta.shared.u64\t%%t_red, __vector_red;\n");
+      fprintf (file, "\t\tmad.lo.u64\t%%r%d, %%y64, %d, %%t_red; "
+	       "// vector reduction buffer\n",
+	       REGNO (cfun->machine->red_partition),
+	       vector_red_partition);
+    }
+  fprintf (file, "\t}\n");
+}
+
+/* Emit code to initialize OpenACC worker broadcast and synchronization
+   registers.  */
+
+static void
+nvptx_init_oacc_workers (FILE *file)
+{
+  fprintf (file, "\t{\n");
+  fprintf (file, "\t\t.reg.u32\t%%tidy;\n");
+  if (cfun->machine->bcast_partition)
+    {
+      fprintf (file, "\t\t.reg.u64\t%%t_bcast;\n");
+      fprintf (file, "\t\t.reg.u64\t%%y64;\n");
+    }
+  fprintf (file, "\t\tmov.u32\t\t%%tidy, %%tid.y;\n");
+  if (cfun->machine->bcast_partition)
+    {
+      fprintf (file, "\t\tcvt.u64.u32\t%%y64, %%tidy;\n");
+      fprintf (file, "\t\tadd.u64\t\t%%y64, %%y64, 1; // vector ID\n");
+      fprintf (file, "\t\tcvta.shared.u64\t%%t_bcast, __oacc_bcast;\n");
+      fprintf (file, "\t\tmad.lo.u64\t%%r%d, %%y64, %d, %%t_bcast; "
+	       "// vector broadcast offset\n",
+	       REGNO (cfun->machine->bcast_partition),
+	       oacc_bcast_partition);
+    }
+  if (cfun->machine->sync_bar)
+    fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; "
+	     "// vector synchronization barrier\n",
+	     REGNO (cfun->machine->sync_bar));
   fprintf (file, "\t}\n");
 }
 
@@ -1231,6 +1277,13 @@  nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
      stream, in order to share the prototype writing code.  */
   std::stringstream s;
   write_fn_proto (s, true, name, decl);
+
+  /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches.  */
+  if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl))
+      && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl)))
+      s << ".maxntid " << cfun->machine->axis_dim[0] << ", "
+	<< cfun->machine->axis_dim[1] << ", 1\n";
+
   s << "{\n";
 
   bool return_in_mem = write_return_type (s, false, result_type);
@@ -1341,6 +1394,8 @@  nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
   if (cfun->machine->unisimt_predicate
       || (cfun->machine->has_simtreg && !crtl->is_leaf))
     nvptx_init_unisimt_predicate (file);
+  if (cfun->machine->bcast_partition || cfun->machine->sync_bar)
+    nvptx_init_oacc_workers (file);
 }
 
 /* Output code for switching uniform-simt state.  ENTERING indicates whether
@@ -2849,6 +2904,26 @@  struct offload_attrs
   int max_workers;
 };
 
+/* Define entries for cfun->machine->axis_dim.  */
+
+#define MACH_VECTOR_LENGTH 0
+#define MACH_MAX_WORKERS 1
+
+static int
+nvptx_mach_max_workers ()
+{
+  return cfun->machine->axis_dim[MACH_MAX_WORKERS];
+}
+
+static int
+nvptx_mach_vector_length ()
+{
+  return cfun->machine->axis_dim[MACH_VECTOR_LENGTH];
+}
+
+/* Loop structure of the function.  The entire function is described as
+   a NULL loop.  */
+
 struct parallel
 {
   /* Parent parallel.  */
@@ -2996,6 +3071,19 @@  nvptx_split_blocks (bb_insn_map_t *map)
     }
 }
 
+/* Return true if MASK contains parallelism that requires shared
+   memory to broadcast.  */
+
+static bool
+nvptx_needs_shared_bcast (unsigned mask)
+{
+  bool worker = mask & GOMP_DIM_MASK (GOMP_DIM_WORKER);
+  bool large_vector = (mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+    && nvptx_mach_vector_length () != PTX_WARP_SIZE;
+
+  return worker || large_vector;
+}
+
 /* BLOCK is a basic block containing a head or tail instruction.
    Locate the associated prehead or pretail instruction, which must be
    in the single predecessor block.  */
@@ -3071,7 +3159,7 @@  nvptx_find_par (bb_insn_map_t *map, parallel *par, basic_block block)
 	    par = new parallel (par, mask);
 	    par->forked_block = block;
 	    par->forked_insn = end;
-	    if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+	    if (nvptx_needs_shared_bcast (mask))
 	      par->fork_insn
 		= nvptx_discover_pre (block, CODE_FOR_nvptx_fork);
 	  }
@@ -3086,7 +3174,7 @@  nvptx_find_par (bb_insn_map_t *map, parallel *par, basic_block block)
 	    gcc_assert (par->mask == mask);
 	    par->join_block = block;
 	    par->join_insn = end;
-	    if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+	    if (nvptx_needs_shared_bcast (mask))
 	      par->joining_insn
 		= nvptx_discover_pre (block, CODE_FOR_nvptx_joining);
 	    par = par->parent;
@@ -3944,23 +4032,45 @@  nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
   gcc_assert (empty == !data.offset);
   if (data.offset)
     {
+      rtx bcast_sym = oacc_bcast_sym;
+
       /* Stuff was emitted, initialize the base pointer now.  */
-      rtx init = gen_rtx_SET (data.base, oacc_bcast_sym);
+      if (vector && nvptx_mach_max_workers () > 1)
+	{
+	  if (!cfun->machine->bcast_partition)
+	    {
+	      /* It would be nice to place this register in
+		 DATA_AREA_SHARED.  */
+	      cfun->machine->bcast_partition = gen_reg_rtx (DImode);
+	    }
+	  if (!cfun->machine->sync_bar)
+	    cfun->machine->sync_bar = gen_reg_rtx (SImode);
+
+	  bcast_sym = cfun->machine->bcast_partition;
+	}
+
+      rtx init = gen_rtx_SET (data.base, bcast_sym);
       emit_insn_after (init, insn);
 
-      if (oacc_bcast_size < data.offset)
-	oacc_bcast_size = data.offset;
+      if (oacc_bcast_partition < data.offset)
+	{
+	  int psize = data.offset;
+	  psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);
+	  oacc_bcast_partition = psize;
+	  oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1);
+	}
     }
   return empty;
 }
 
-/* Emit a CTA-level synchronization barrier.  We use different
-   markers for before and after synchronizations.  */
+/* Emit a CTA-level synchronization barrier (bar.sync).  LOCK is the
+   barrier number, which is an integer or a register.  THREADS is the
+   number of threads controlled by the barrier.  */
 
 static rtx
-nvptx_cta_sync (bool after)
+nvptx_cta_sync (rtx lock, int threads)
 {
-  return gen_nvptx_barsync (GEN_INT (after));
+  return gen_nvptx_barsync (lock, GEN_INT (threads));
 }
 
 #if WORKAROUND_PTXJIT_BUG
@@ -4115,13 +4225,23 @@  nvptx_single (unsigned mask, basic_block from, basic_block to)
 	    pred = gen_reg_rtx (BImode);
 	    cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER] = pred;
 	  }
-	
+
 	rtx br;
 	if (mode == GOMP_DIM_VECTOR)
 	  br = gen_br_true (pred, label);
 	else
 	  br = gen_br_true_uni (pred, label);
-	emit_insn_before (br, head);
+
+	if (recog_memoized (head) == CODE_FOR_nvptx_forked
+	    && recog_memoized (NEXT_INSN (head)) == CODE_FOR_nvptx_barsync)
+	  {
+	    head = NEXT_INSN (head);
+	    emit_insn_after (br, head);
+	  }
+	else if (recog_memoized (head) == CODE_FOR_nvptx_barsync)
+	  emit_insn_after (br, head);
+	else
+	  emit_insn_before (br, head);
 
 	LABEL_NUSES (label)++;
 	if (tail_branch)
@@ -4135,7 +4255,8 @@  nvptx_single (unsigned mask, basic_block from, basic_block to)
     {
       rtx pvar = XEXP (XEXP (cond_branch, 0), 0);
 
-      if (GOMP_DIM_MASK (GOMP_DIM_VECTOR) == mask)
+      if (GOMP_DIM_MASK (GOMP_DIM_VECTOR) == mask
+	  && nvptx_mach_vector_length () == PTX_WARP_SIZE)
 	{
 	  /* Vector mode only, do a shuffle.  */
 #if WORKAROUND_PTXJIT_BUG
@@ -4202,26 +4323,55 @@  nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  /* Includes worker mode, do spill & fill.  By construction
 	     we should never have worker mode only. */
 	  broadcast_data_t data;
+	  unsigned size = GET_MODE_SIZE (SImode);
+	  bool vector = true;
+	  rtx barrier = GEN_INT (0);
+	  int threads = 0;
+
+	  if (GOMP_DIM_MASK (GOMP_DIM_WORKER) == mask)
+	    vector = false;
 
 	  data.base = oacc_bcast_sym;
 	  data.ptr = 0;
 
-	  if (oacc_bcast_size < GET_MODE_SIZE (SImode))
-	    oacc_bcast_size = GET_MODE_SIZE (SImode);
+	  if (vector
+	      && nvptx_mach_max_workers () > 1
+	      && cfun->machine->bcast_partition)
+	    data.base = cfun->machine->bcast_partition;
+
+	  gcc_assert (data.base != NULL);
+
+	  if (oacc_bcast_partition < size)
+	    {
+	      int psize = size;
+	      psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);
+	      oacc_bcast_partition = psize;
+	      oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1);
+	    }
 
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_read, 0, &data,
-						    false),
+						    vector),
 			    before);
+
+	  if (vector
+	      && nvptx_mach_max_workers () > 1
+	      && cfun->machine->sync_bar)
+	    {
+	      barrier = cfun->machine->sync_bar;
+	      threads = nvptx_mach_vector_length ();
+	    }
+
 	  /* Barrier so other workers can see the write.  */
-	  emit_insn_before (nvptx_cta_sync (false), tail);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), tail);
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data,
-						    false), tail);
+						    vector),
+			    tail);
 	  /* This barrier is needed to avoid worker zero clobbering
 	     the broadcast buffer before all the other workers have
 	     had a chance to read this instance of it.  */
-	  emit_insn_before (nvptx_cta_sync (true), tail);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), tail);
 	}
 
       extract_insn (tail);
@@ -4330,20 +4480,32 @@  nvptx_process_pars (parallel *par)
     }
 
   bool is_call = (par->mask & GOMP_DIM_MASK (GOMP_DIM_MAX)) != 0;
-  
-  if (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+  bool worker = (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER));
+  bool large_vector = ((par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+		      && nvptx_mach_vector_length () > PTX_WARP_SIZE);
+
+  if (worker || large_vector)
     {
       nvptx_shared_propagate (false, is_call, par->forked_block,
-			      par->forked_insn, false);
+			      par->forked_insn, !worker);
       bool empty = nvptx_shared_propagate (true, is_call,
 					   par->forked_block, par->fork_insn,
-					   false);
+					   !worker);
+      rtx barrier = GEN_INT (0);
+      int threads = 0;
+
+      if (!worker && cfun->machine->sync_bar)
+	{
+	  barrier = cfun->machine->sync_bar;
+	  threads = nvptx_mach_vector_length ();
+	}
 
       if (!empty || !is_call)
 	{
 	  /* Insert begin and end synchronizations.  */
-	  emit_insn_after (nvptx_cta_sync (false), par->forked_insn);
-	  emit_insn_before (nvptx_cta_sync (true), par->joining_insn);
+	  emit_insn_after (nvptx_cta_sync (barrier, threads), par->forked_insn);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads),
+			    par->joining_insn);
 	}
     }
   else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
@@ -4469,15 +4631,20 @@  populate_offload_attrs (offload_attrs *oa)
   if (oa->vector_length == 0)
     {
       /* FIXME: Need a more graceful way to handle large vector
-	 lengths in OpenACC routines.  */
+	 lengths in OpenACC routines and also -fopenacc-dims.  */
       if (!lookup_attribute ("omp target entrypoint",
 			     DECL_ATTRIBUTES (current_function_decl)))
 	oa->vector_length = PTX_WARP_SIZE;
-      else
+      else if (PTX_VECTOR_LENGTH != PTX_WARP_SIZE)
 	oa->vector_length = PTX_VECTOR_LENGTH;
     }
   if (oa->num_workers == 0)
-    oa->max_workers = PTX_CTA_SIZE / oa->vector_length;
+    {
+      if (oa->vector_length == 0)
+	oa->max_workers = PTX_WORKER_LENGTH;
+      else
+	oa->max_workers = PTX_CTA_SIZE / oa->vector_length;
+    }
   else
     oa->max_workers = oa->num_workers;
 }
@@ -4535,6 +4702,9 @@  nvptx_reorg (void)
 
       populate_offload_attrs (&oa);
 
+      cfun->machine->axis_dim[MACH_VECTOR_LENGTH] = oa.vector_length;
+      cfun->machine->axis_dim[MACH_MAX_WORKERS] = oa.max_workers;
+
       /* If there is worker neutering, there must be vector
 	 neutering.  Otherwise the hardware will fail.  */
       gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index 8a14507c88a..99943025a50 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -226,6 +226,14 @@  struct GTY(()) machine_function
   int return_mode; /* Return mode of current fn.
 		      (machine_mode not defined yet.) */
   rtx axis_predicate[2]; /* Neutering predicates.  */
+  int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is
+		      vector_length, dim[1] is num_workers.   */
+  rtx bcast_partition; /* Register containing the size of each
+			  vector's partition of share-memory used to
+			  broadcast state.  */
+  rtx red_partition; /* Similar to bcast_partition, except for vector
+			reductions.  */
+  rtx sync_bar; /* Synchronization barrier ID for vectors.  */
   rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */
   rtx unisimt_predicate; /* Predicate for -muniform-simt.  */
   rtx unisimt_location; /* Mask location for -muniform-simt.  */
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 28ae263c867..ac2731233dd 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -1418,10 +1418,16 @@ 
   [(set_attr "atomic" "true")])
 
 (define_insn "nvptx_barsync"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
+  [(unspec_volatile [(match_operand:SI 0 "nvptx_nonmemory_operand" "Ri")
+		     (match_operand:SI 1 "const_int_operand")]
 		    UNSPECV_BARSYNC)]
   ""
-  "\\tbar.sync\\t%0;"
+  {
+    if (!REG_P (operands[0]))
+      return "\\tbar.sync\\t%0;";
+    else
+      return "\\tbar.sync\\t%0, %1;";
+  }
   [(set_attr "predicable" "false")])
 
 (define_insn "nvptx_nounroll"
-- 
2.14.3