[og7] vector_length extension part 4: target hooks and automatic parallelism

Message ID 0e3891f9-aec9-2d34-f58f-6927c821d00d@codesourcery.com
State New
Headers show
Series
  • [og7] vector_length extension part 4: target hooks and automatic parallelism
Related show

Commit Message

Cesar Philippidis March 2, 2018, 7:18 p.m.
The attached patch adjusts the existing goacc validate_dims target hook
and introduces a new goacc adjust_parallelism target hook. Now that
vector length is no longer hard-coded to 32, there are four different
ways to set it:

  1) compiler default
  2) explicitly via the vector_length clause
  3) compile time using -fopenacc-dim or the GOMP_OPENACC_DIM
     environment variable
  4) fallback to vector_length = 32 due to insufficient parallelism

The compiler default is activated in the absence of 2) and 3). It is
controlled by the macro PTX_VECTOR_LENGTH in nvptx.c. While working on
this patch set, I had it set to 128 to get more test coverage. But in
order to maintain backwards compatibility with acc routines (which is
still a work in progress), I've kept the default vector length to 32.
Besides, large vector reductions are expected to run slower until the
parallel reduction finalizer is ready.

The new default_dims arguments to validate_dims represents is necessary
to accommodate option 3) from above. validate_dims is called after
oaccdevlow has assigned parallelism to each acc loop.

Prior to this patch, oaccdevlow automatically assigned parallelism to
acc loops using oacc_loop_fixed_partitions and
oacc_loop_auto_partitions. Both of those functions were
processor-agnostic. In the case of nvptx, due to the current limitations
in this patch set, the nvptx BE needs to fallback to using a
vector_length of 32 whenever a vector loop is nested inside a worker
loop. By supplying the parallelism mask for both the current loop and
the outer loops, the goacc adjust_parallelism hook allows the back ends
to fine tune any parallelism as necessary.

Inside the nvptx BE, nvptx_goacc_adjust_parallelism uses a new "nvptx vl
warp" function attribute to denote that the offloaded function must
fallback to using a vector length of 32. Later,
nvptx_goacc_validate_dims uses the attribute to adjust vector_length
accordingly.

Going forward, in addition to adding a new parallel reduction finalizer,
the nvptx BE would benefit from merging synchronization and reduction
code for combined worker-reduction loops, e.g.

  #pragma acc loop worker vector

At present, GCC partitions acc loops with internal function markers for
each level of parallelism associated with the loop. If a loop has both
worker and vector level parallelism, it will have a dummy outer worker
loop, and dummy inner vector loop. On CUDA hardware, there's no strong
difference between workers and vectors as CUDA blocks are a loose
collection of warps. Therefore, it would make more sense to merge the
two loops together into a special WV loop. That would at least require
some changes in the BE in addition to oacc_loop_{auto,fixed}_partitions.
There were some problems in the past where CUDA hardware would lock up
because the synchronization requirements for those two levels of
parallelism. Merging them ought to simplify the synchronization code and
enable the PTX JIT to generate better code.

Overall, the changes in this patch are mild. I'll apply it to
openacc-gcc-7-branch after Tom approves the reduction patch.

Cesar

Comments

Tom de Vries March 21, 2018, 3:49 p.m. | #1
On 03/02/2018 08:18 PM, Cesar Philippidis wrote:

> og7-vl-part4-hooks.diff


> diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c

> index 5642941c6a3..507c8671704 100644

> --- a/gcc/config/nvptx/nvptx.c

> +++ b/gcc/config/nvptx/nvptx.c

> @@ -5205,14 +5205,36 @@ nvptx_simt_vf ()

>     return PTX_WARP_SIZE;

>   }

>   

> +#define NVPTX_GOACC_VL_WARP "nvptx vl warp"

> +

> +/* Return true of the offloaded function needs a vector_length of

> +   PTX_WARP_SIZE.  */

> +

> +static bool

> +nvptx_goacc_needs_vl_warp ()

> +{

> +  tree attr = lookup_attribute (NVPTX_GOACC_VL_WARP,

> +				DECL_ATTRIBUTES (current_function_decl));

> +  return attr == NULL_TREE;

> +}

> +


I just wrote an example using "#pragma acc parallel vector_length (128)" 
and looked at the generated code. I found that the actual vector_length 
was still 32. I tracked this back to this function returning true.

I think we need "return attr != NULL_TREE".

Thanks,
- Tom
Cesar Philippidis March 21, 2018, 8:26 p.m. | #2
On 03/21/2018 08:49 AM, Tom de Vries wrote:
> On 03/02/2018 08:18 PM, Cesar Philippidis wrote:

> 

>> og7-vl-part4-hooks.diff

> 

>> diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c

>> index 5642941c6a3..507c8671704 100644

>> --- a/gcc/config/nvptx/nvptx.c

>> +++ b/gcc/config/nvptx/nvptx.c

>> @@ -5205,14 +5205,36 @@ nvptx_simt_vf ()

>>     return PTX_WARP_SIZE;

>>   }

>>   +#define NVPTX_GOACC_VL_WARP "nvptx vl warp"

>> +

>> +/* Return true of the offloaded function needs a vector_length of

>> +   PTX_WARP_SIZE.  */

>> +

>> +static bool

>> +nvptx_goacc_needs_vl_warp ()

>> +{

>> +  tree attr = lookup_attribute (NVPTX_GOACC_VL_WARP,

>> +                DECL_ATTRIBUTES (current_function_decl));

>> +  return attr == NULL_TREE;

>> +}

>> +

> 

> I just wrote an example using "#pragma acc parallel vector_length (128)"

> and looked at the generated code. I found that the actual vector_length

> was still 32. I tracked this back to this function returning true.

> 

> I think we need "return attr != NULL_TREE".


Yes. Good catch. I've added another test case for this.

Thanks,
Cesar
Tom de Vries March 26, 2018, 2:14 p.m. | #3
On 03/02/2018 08:18 PM, Cesar Philippidis wrote:
> diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c

> index ba3f4317f4e..f15ce6b8f8d 100644

> --- a/gcc/omp-offload.c

> +++ b/gcc/omp-offload.c

> @@ -626,7 +626,8 @@ oacc_parse_default_dims (const char *dims)

>      function.  */

>   

>   static void

> -oacc_validate_dims (tree fn, tree attrs, int *dims, int level, unsigned used)

> +oacc_validate_dims (tree fn, tree attrs, int *dims, int level, unsigned used,

> +		    int * ARG_UNUSED (default_dims))

>   {

>     tree purpose[GOMP_DIM_MAX];

>     unsigned ix;


> @@ -1604,7 +1616,8 @@ execute_oacc_device_lower ()

>       }

>   

>     int dims[GOMP_DIM_MAX];

> -  oacc_validate_dims (current_function_decl, attrs, dims, fn_level, used_mask);

> +  oacc_validate_dims (current_function_decl, attrs, dims, fn_level, used_mask,

> +		      NULL);

>   

>     if (dump_file)

>       {


What's the purpose of this unused parameter default_dims, that only ever 
gets to be NULL?

Thanks,
- Tom
Cesar Philippidis March 26, 2018, 2:25 p.m. | #4
On 03/26/2018 07:14 AM, Tom de Vries wrote:
> On 03/02/2018 08:18 PM, Cesar Philippidis wrote:

>> diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c

>> index ba3f4317f4e..f15ce6b8f8d 100644

>> --- a/gcc/omp-offload.c

>> +++ b/gcc/omp-offload.c

>> @@ -626,7 +626,8 @@ oacc_parse_default_dims (const char *dims)

>>      function.  */

>>     static void

>> -oacc_validate_dims (tree fn, tree attrs, int *dims, int level,

>> unsigned used)

>> +oacc_validate_dims (tree fn, tree attrs, int *dims, int level,

>> unsigned used,

>> +            int * ARG_UNUSED (default_dims))

>>   {

>>     tree purpose[GOMP_DIM_MAX];

>>     unsigned ix;

> 

>> @@ -1604,7 +1616,8 @@ execute_oacc_device_lower ()

>>       }

>>       int dims[GOMP_DIM_MAX];

>> -  oacc_validate_dims (current_function_decl, attrs, dims, fn_level,

>> used_mask);

>> +  oacc_validate_dims (current_function_decl, attrs, dims, fn_level,

>> used_mask,

>> +              NULL);

>>       if (dump_file)

>>       {

> 

> What's the purpose of this unused parameter default_dims, that only ever

> gets to be NULL?


That's stale and can be removed. In an earlier, and more complicated,
version of the patch I was still trying to get large vector lengths to
work with multiple workers.

I'll remove it from my patch.

Thanks,
Cesar
Tom de Vries March 26, 2018, 4:33 p.m. | #5
On 03/02/2018 08:18 PM, Cesar Philippidis wrote:
> introduces a new goacc adjust_parallelism target hook.


That's another separate patch.

Committed.

Thanks,
- Tom
[openacc] Add target hook TARGET_GOACC_ADJUST_PARALLELISM

2018-03-26  Cesar Philippidis  <cesar@codesourcery.com>
	    Tom de Vries  <tom@codesourcery.com>

	* doc/tm.texi.in: Add placeholder for TARGET_GOACC_ADJUST_PARALLELISM.
	* doc/tm.texi: Regenerate.
	* omp-offload.c (oacc_loop_fixed_partitions): Use the adjust_parallelism
	hook to modify this_mask.
	(oacc_loop_auto_partitions): Use the adjust_parallelism hook to modify
	this_mask and loop->mask.
	(default_goacc_adjust_parallelism): New function.
	* target.def (adjust_parallelism): New hook.
	* targhooks.h (default_goacc_adjust_parallelism): Declare.

---
 gcc/doc/tm.texi       |  6 ++++++
 gcc/doc/tm.texi.in    |  2 ++
 gcc/omp-offload.c     | 19 +++++++++++++++++++
 gcc/target.def        |  8 ++++++++
 gcc/targhooks.h       |  1 +
 6 files changed, 49 insertions(+)

diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 0fcb9c6..271eb4d 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5883,6 +5883,12 @@ This hook should return the maximum size of a particular dimension,
 or zero if unbounded.
 @end deftypefn
 
+@deftypefn {Target Hook} unsigned TARGET_GOACC_ADJUST_PARALLELISM (unsigned @var{this_mask}, unsigned @var{outer_mask})
+This hook allows the accelerator compiler to remove any unused
+parallelism exposed in the current loop @var{THIS_MASK}, and the
+enclosing loop @var{OUTER_MASK}.  It returns an adjusted mask.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_GOACC_FORK_JOIN (gcall *@var{call}, const int *@var{dims}, bool @var{is_fork})
 This hook can be used to convert IFN_GOACC_FORK and IFN_GOACC_JOIN
 function calls to target-specific gimple, or indicate whether they
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 4187da1..fc73ad1 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4298,6 +4298,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_GOACC_DIM_LIMIT
 
+@hook TARGET_GOACC_ADJUST_PARALLELISM
+
 @hook TARGET_GOACC_FORK_JOIN
 
 @hook TARGET_GOACC_REDUCTION
diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index ba3f431..aa4de24 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -1258,6 +1258,13 @@ oacc_loop_fixed_partitions (oacc_loop *loop, unsigned outer_mask)
 	}
     }
 
+  /* FIXME: Ideally, we should be coalescing parallelism here if the
+     hardware supports it.  E.g. Instead of partitioning a loop
+     across worker and vector axes, sometimes the hardware can
+     execute those loops together without resorting to placing
+     extra thread barriers.  */
+  this_mask = targetm.goacc.adjust_parallelism (this_mask, outer_mask);
+
   mask_all |= this_mask;
 
   if (loop->flags & OLF_TILE)
@@ -1349,6 +1356,7 @@ oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask,
 	  this_mask ^= loop->e_mask;
 	}
 
+      this_mask = targetm.goacc.adjust_parallelism (this_mask, outer_mask);
       loop->mask |= this_mask;
     }
 
@@ -1396,7 +1404,9 @@ oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask,
 			" to parallelize element loop");
 	}
 
+      loop->mask = targetm.goacc.adjust_parallelism (loop->mask, outer_mask);
       loop->mask |= this_mask;
+
       if (!loop->mask && noisy)
 	warning_at (loop->loc, 0,
 		    tiling
@@ -1774,6 +1784,15 @@ default_goacc_dim_limit (int ARG_UNUSED (axis))
 #endif
 }
 
+/* Default adjustment of loop parallelism is not required.  */
+
+unsigned
+default_goacc_adjust_parallelism (unsigned this_mask,
+				  unsigned ARG_UNUSED (outer_mask))
+{
+  return this_mask;
+}
+
 namespace {
 
 const pass_data pass_data_oacc_device_lower =
diff --git a/gcc/target.def b/gcc/target.def
index b302d36..c878fee 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1697,6 +1697,14 @@ int, (int axis),
 default_goacc_dim_limit)
 
 DEFHOOK
+(adjust_parallelism,
+"This hook allows the accelerator compiler to remove any unused\n\
+parallelism exposed in the current loop @var{THIS_MASK}, and the\n\
+enclosing loop @var{OUTER_MASK}.  It returns an adjusted mask.",
+unsigned, (unsigned this_mask, unsigned outer_mask),
+default_goacc_adjust_parallelism)
+
+DEFHOOK
 (fork_join,
 "This hook can be used to convert IFN_GOACC_FORK and IFN_GOACC_JOIN\n\
 function calls to target-specific gimple, or indicate whether they\n\
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 18070df..f4f6864 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -115,6 +115,7 @@ extern bool default_goacc_validate_dims (tree, int [], int);
 extern int default_goacc_dim_limit (int);
 extern bool default_goacc_fork_join (gcall *, const int [], bool);
 extern void default_goacc_reduction (gcall *);
+extern unsigned default_goacc_adjust_parallelism (unsigned, unsigned);
 
 /* These are here, and not in hooks.[ch], because not all users of
    hooks.h include tm.h, and thus we don't have CUMULATIVE_ARGS.  */
Tom de Vries March 26, 2018, 5:10 p.m. | #6
On 03/02/2018 08:18 PM, Cesar Philippidis wrote:
> The attached patch adjusts the existing goacc validate_dims target hook


This is overkill. All we need is a function
"int oacc_get_default_dim (int dim)".

Thanks,
- Tom
Tom de Vries March 27, 2018, 12:13 p.m. | #7
On 03/26/2018 06:33 PM, Tom de Vries wrote:
> +      loop->mask = targetm.goacc.adjust_parallelism (loop->mask, outer_mask);

>         loop->mask |= this_mask;


I committed the above, but the original:
...
> @@ -1397,6 +1407,8 @@ oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask,

>  	}

>  

>        loop->mask |= this_mask;

> +      loop->mask = targetm.goacc.adjust_parallelism (loop->mask, outer_mask);

> +

>        if (!loop->mask && noisy)

>  	warning_at (loop->loc, 0,

>  		    tiling

...
has the two loop->mask lines in the reverse order.

Fixed in attached patch.

Committed.

Thanks,
- Tom
[openacc] Fix adjust_parallism usage in oacc_loop_auto_partitions

2018-03-27  Tom de Vries  <tom@codesourcery.com>

	* omp-offload.c (oacc_loop_auto_partitions): Fix adjust_parallism usage.

---
 gcc/omp-offload.c     | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index aa4de24..ed17160 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -1404,8 +1404,8 @@ oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask,
 			" to parallelize element loop");
 	}
 
-      loop->mask = targetm.goacc.adjust_parallelism (loop->mask, outer_mask);
       loop->mask |= this_mask;
+      loop->mask = targetm.goacc.adjust_parallelism (loop->mask, outer_mask);
 
       if (!loop->mask && noisy)
 	warning_at (loop->loc, 0,
Tom de Vries April 5, 2018, 4:32 p.m. | #8
On 03/02/2018 08:18 PM, Cesar Philippidis wrote:
> The attached patch adjusts the existing goacc validate_dims target hook

> and introduces a new goacc adjust_parallelism target hook.


The attached patch now just introduces the nvptx_adjust_parallelism 
target hook implementation, which enables test-cases to start using the 
feature.

Build x86_64 with nvptx accelerator and tested libgomp.

Committed.

Thanks,
- Tom
[nvptx] Enable large vectors

2018-04-05  Cesar Philippidis  <cesar@codesourcery.com>
	    Tom de Vries  <tom@codesourcery.com>

	* omp-offload.c (oacc_get_default_dim): New function.
	* omp-offload.h (oacc_get_default_dim): Declare.
	* config/nvptx/nvptx.c (NVPTX_GOACC_VL_WARP): Define.
	(nvptx_goacc_needs_vl_warp): New function.
	(nvptx_goacc_validate_dims): Take larger vector lengths into
	account.
	(nvptx_adjust_parallelism): New function.
	(TARGET_GOACC_ADJUST_PARALLELISM): Define.
	(populate_offload_attrs): Handle the situation where the default
	runtime geometry has not been initialized yet for reductions.

	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-1.c: Expect
	vector length to be 128.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-10.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-2.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/vred2d-128.c: Same.
	* testsuite/libgomp.oacc-fortran/gemm.f90: Same.

---
 gcc/config/nvptx/nvptx.c                           | 148 +++++++++++++++++++--
 gcc/omp-offload.c                                  |   7 +
 gcc/omp-offload.h                                  |   2 +
 .../vector-length-128-1.c                          |   5 +-
 .../vector-length-128-10.c                         |   1 -
 .../vector-length-128-2.c                          |   5 +-
 .../libgomp.oacc-c-c++-common/vred2d-128.c         |   2 -
 libgomp/testsuite/libgomp.oacc-fortran/gemm.f90    |   1 -
 8 files changed, 153 insertions(+), 18 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 51bd69d..595413a 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -71,6 +71,7 @@
 #include "fold-const.h"
 #include "intl.h"
 #include "tree-hash-traits.h"
+#include "omp-offload.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -4634,15 +4635,20 @@ populate_offload_attrs (offload_attrs *oa)
   if (oa->vector_length == 0)
     {
       /* FIXME: Need a more graceful way to handle large vector
-	 lengths in OpenACC routines.  */
+	 lengths in OpenACC routines and also -fopenacc-dims.  */
       if (!lookup_attribute ("omp target entrypoint",
 			     DECL_ATTRIBUTES (current_function_decl)))
 	oa->vector_length = PTX_WARP_SIZE;
-      else
+      else if (PTX_VECTOR_LENGTH != PTX_WARP_SIZE)
 	oa->vector_length = PTX_VECTOR_LENGTH;
     }
   if (oa->num_workers == 0)
-    oa->max_workers = PTX_CTA_SIZE / oa->vector_length;
+    {
+      if (oa->vector_length == 0)
+	oa->max_workers = PTX_WORKER_LENGTH;
+      else
+	oa->max_workers = PTX_CTA_SIZE / oa->vector_length;
+    }
   else
     oa->max_workers = oa->num_workers;
 }
@@ -5193,6 +5199,19 @@ nvptx_simt_vf ()
   return PTX_WARP_SIZE;
 }
 
+#define NVPTX_GOACC_VL_WARP "nvptx vl warp"
+
+/* Return true of the offloaded function needs a vector_length of
+   PTX_WARP_SIZE.  */
+
+static bool
+nvptx_goacc_needs_vl_warp ()
+{
+  tree attr = lookup_attribute (NVPTX_GOACC_VL_WARP,
+				DECL_ATTRIBUTES (current_function_decl));
+  return attr != NULL_TREE;
+}
+
 /* Validate compute dimensions of an OpenACC offload or routine, fill
    in non-unity defaults.  FN_LEVEL indicates the level at which a
    routine might spawn a loop.  It is negative for non-routines.  If
@@ -5201,6 +5220,14 @@ nvptx_simt_vf ()
 static bool
 nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
 {
+  int default_vector_length = PTX_VECTOR_LENGTH;
+
+  /* For capability reasons, fallback to vl = 32 for runtime values.  */
+  if (dims[GOMP_DIM_VECTOR] == 0)
+    default_vector_length = PTX_WARP_SIZE;
+  else if (decl)
+    default_vector_length = oacc_get_default_dim (GOMP_DIM_VECTOR);
+
   /* Detect if a function is unsuitable for offloading.  */
   if (!flag_offload_force && decl)
     {
@@ -5225,18 +5252,20 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
 
   bool changed = false;
 
-  /* The vector size must be 32, unless this is a SEQ routine.  */
+  /* The vector size must be a positive multiple of the warp size,
+     unless this is a SEQ routine.  */
   if (fn_level <= GOMP_DIM_VECTOR && fn_level >= -1
       && dims[GOMP_DIM_VECTOR] >= 0
-      && dims[GOMP_DIM_VECTOR] != PTX_VECTOR_LENGTH)
+      && (dims[GOMP_DIM_VECTOR] % 32 != 0
+	  || dims[GOMP_DIM_VECTOR] == 0))
     {
       if (fn_level < 0 && dims[GOMP_DIM_VECTOR] >= 0)
 	warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0,
 		    dims[GOMP_DIM_VECTOR]
 		    ? G_("using vector_length (%d), ignoring %d")
 		    : G_("using vector_length (%d), ignoring runtime setting"),
-		    PTX_VECTOR_LENGTH, dims[GOMP_DIM_VECTOR]);
-      dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH;
+		    default_vector_length, dims[GOMP_DIM_VECTOR]);
+      dims[GOMP_DIM_VECTOR] = default_vector_length;
       changed = true;
     }
 
@@ -5250,16 +5279,77 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
       changed = true;
     }
 
+  /* Ensure that num_worker * vector_length < cta size.  */
+  if (dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR] > PTX_CTA_SIZE)
+    {
+      warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0,
+		  G_("using vector_length (%d), ignoring %d"),
+		  default_vector_length, dims[GOMP_DIM_VECTOR]);
+      dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
+      changed = true;
+    }
+
+  /* vector_length must not exceed PTX_CTA_SIZE.  */
+  if (dims[GOMP_DIM_VECTOR] >= PTX_CTA_SIZE)
+    {
+      int new_vector = PTX_CTA_SIZE;
+      if (decl)
+	new_vector = default_vector_length;
+      warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0,
+		  G_("using vector_length (%d), ignoring %d"),
+		  new_vector, dims[GOMP_DIM_VECTOR]);
+      dims[GOMP_DIM_VECTOR] = new_vector;
+      changed = true;
+    }
+
+  /* Set vector_length to default_vector_length if there are a sufficient
+     number of free threads in the CTA.  */
+  if (dims[GOMP_DIM_WORKER] > 0 && dims[GOMP_DIM_VECTOR] <= 0)
+    {
+      if (dims[GOMP_DIM_WORKER] * default_vector_length <= PTX_CTA_SIZE)
+	dims[GOMP_DIM_VECTOR] = default_vector_length;
+      else if (dims[GOMP_DIM_WORKER] * PTX_WARP_SIZE <= PTX_CTA_SIZE)
+	dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
+      else
+	error_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION,
+		  "vector_length must be at least 32");
+      changed = true;
+    }
+
+  /* Specify a default vector_length.  */
+  if (dims[GOMP_DIM_VECTOR] < 0)
+    {
+      dims[GOMP_DIM_VECTOR] = default_vector_length;
+      changed = true;
+    }
+
+  if (nvptx_goacc_needs_vl_warp () && dims[GOMP_DIM_VECTOR] != PTX_WARP_SIZE)
+    {
+      dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
+      changed = true;
+    }
+
   if (!decl)
     {
-      dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH;
+      bool new_vector = false;
+      if (dims[GOMP_DIM_VECTOR] <= 1)
+	{
+	  dims[GOMP_DIM_VECTOR] = default_vector_length;
+	  new_vector = true;
+	}
       if (dims[GOMP_DIM_WORKER] < 0)
 	dims[GOMP_DIM_WORKER] = PTX_DEFAULT_RUNTIME_DIM;
       if (dims[GOMP_DIM_GANG] < 0)
 	dims[GOMP_DIM_GANG] = PTX_DEFAULT_RUNTIME_DIM;
+      if (new_vector
+	  && dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR] > PTX_CTA_SIZE)
+	dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
       changed = true;
     }
 
+  gcc_assert (dims[GOMP_DIM_VECTOR] != 0);
+  gcc_assert (dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR] <= PTX_CTA_SIZE);
+
   return changed;
 }
 
@@ -5279,6 +5369,45 @@ nvptx_dim_limit (int axis)
   return 0;
 }
 
+/* Adjust the parallelism available to a loop given vector_length
+   associated with the offloaded function.  */
+
+static unsigned
+nvptx_adjust_parallelism (unsigned inner_mask, unsigned outer_mask)
+{
+  if (nvptx_goacc_needs_vl_warp ())
+    return inner_mask;
+
+  bool wv = (inner_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+    && (inner_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR));
+  offload_attrs oa;
+
+  populate_offload_attrs (&oa);
+
+  if (oa.vector_length == PTX_WARP_SIZE)
+    return inner_mask;
+
+  /* FIXME: This is overly conservative; worker and vector loop will
+     eventually be combined.  */
+  if (wv)
+    return inner_mask & ~GOMP_DIM_MASK (GOMP_DIM_WORKER);
+
+  /* It's difficult to guarantee that warps in large vector_lengths
+     will remain convergent when a vector loop is nested inside a
+     worker loop.  Therefore, fallback to setting vector_length to
+     PTX_WARP_SIZE.  Hopefully this condition may be relaxed for
+     sm_70+ targets.  */
+  if ((inner_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+      && (outer_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)))
+    {
+      tree attr = tree_cons (get_identifier (NVPTX_GOACC_VL_WARP), NULL_TREE,
+			      DECL_ATTRIBUTES (current_function_decl));
+      DECL_ATTRIBUTES (current_function_decl) = attr;
+    }
+
+  return inner_mask;
+}
+
 /* Determine whether fork & joins are needed.  */
 
 static bool
@@ -6169,6 +6298,9 @@ nvptx_set_current_function (tree fndecl)
 #undef TARGET_GOACC_DIM_LIMIT
 #define TARGET_GOACC_DIM_LIMIT nvptx_dim_limit
 
+#undef TARGET_GOACC_ADJUST_PARALLELISM
+#define TARGET_GOACC_ADJUST_PARALLELISM nvptx_adjust_parallelism
+
 #undef TARGET_GOACC_FORK_JOIN
 #define TARGET_GOACC_FORK_JOIN nvptx_goacc_fork_join
 
diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index ed17160..66c6212 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -551,6 +551,13 @@ oacc_xform_tile (gcall *call)
 static int oacc_default_dims[GOMP_DIM_MAX];
 static int oacc_min_dims[GOMP_DIM_MAX];
 
+int
+oacc_get_default_dim (int dim)
+{
+  gcc_assert (0 <= dim && dim < GOMP_DIM_MAX);
+  return oacc_default_dims[dim];
+}
+
 /* Parse the default dimension parameter.  This is a set of
    :-separated optional compute dimensions.  Each dimension is either
    a positive integer, or '-' for a dynamic value computed at
diff --git a/gcc/omp-offload.h b/gcc/omp-offload.h
index 528448b..014ee52 100644
--- a/gcc/omp-offload.h
+++ b/gcc/omp-offload.h
@@ -22,6 +22,8 @@ along with GCC; see the file COPYING3.  If not see
 #ifndef GCC_OMP_DEVICE_H
 #define GCC_OMP_DEVICE_H
 
+extern int oacc_get_default_dim (int dim);
+
 extern GTY(()) vec<tree, va_gc> *offload_funcs;
 extern GTY(()) vec<tree, va_gc> *offload_vars;
 
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-1.c
index fab5b0d..18d77cc 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-1.c
@@ -33,7 +33,6 @@ main (void)
 
   return 0;
 }
-/* { dg-prune-output "using vector_length \\(32\\), ignoring 128" } */
 
-/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 32\\)" "oaccdevlow" } } */
-/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=1, vectors=32" } */
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 128\\)" "oaccdevlow" } } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=1, vectors=128" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-10.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-10.c
index e46b5cf..0658cfd 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-10.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-10.c
@@ -37,4 +37,3 @@ main (void)
 
   return 0;
 }
-/* { dg-prune-output "using vector_length \\(32\\), ignoring 128" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-2.c
index cc6fd55..2ab6499 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-2.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-2.c
@@ -34,7 +34,6 @@ main (void)
 
   return 0;
 }
-/* { dg-prune-output "using vector_length \\(32\\), ignoring 128" } */
 
-/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 32\\)" "oaccdevlow" } } */
-/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=1, vectors=32" } */
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 128\\)" "oaccdevlow" } } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=1, vectors=128" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c
index 1dc5fe0..318c0e6 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c
@@ -42,8 +42,6 @@ gentest (test3, "acc parallel loop gang worker vector_length (128)",
 gentest (test4, "acc parallel loop",
 	 "acc loop reduction(+:t1) reduction(-:t2)")
 
-/* { dg-prune-output "using vector_length \\(32\\), ignoring 128" } */
-
 
 int
 main ()
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/gemm.f90 b/libgomp/testsuite/libgomp.oacc-fortran/gemm.f90
index 62b8a45..ad67dce 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/gemm.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/gemm.f90
@@ -39,7 +39,6 @@ subroutine openacc_sgemm_128 (m, n, k, alpha, a, b, beta, c)
   real :: temp
 
   !$acc parallel loop copy(c(1:m,1:n)) copyin(a(1:k,1:m),b(1:k,1:n)) vector_length (128)
-  ! { dg-prune-output "using vector_length \\(32\\), ignoring 128" }
   do j = 1, n
      !$acc loop
      do i = 1, m

Patch

2018-03-02  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* config/nvptx/nvptx.c (NVPTX_GOACC_VL_WARP): Define.
	(nvptx_goacc_needs_vl_warp): New function.
	(nvptx_goacc_validate_dims): Add new default_dims argument and take
	larger vector lengths into account.
	(nvptx_adjust_parallelism): New function.
	(TARGET_GOACC_ADJUST_PARALLELISM): Define.
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in: Add placeholder for TARGET_GOACC_ADJUST_PARALLELISM.
	* omp-offload.c (oacc_parse_default_dims): Update usage of the
	targetm.goacc_valdate_dims hook.
	(oacc_validate_dims): Add default_dims argument.
	(oacc_loop_fixed_partitions): Use the adjust_parallelism hook to
	modify this_mask.
	(oacc_loop_auto_partitions): Use the adjust_parallelism hook to
	modify this_mask and loop->mask.
	(execute_oacc_device_lower): Update call to oacc_validate_dims.
	(default_goacc_adjust_parallelism): New function.
	* target.def (validate_dims): Add new default_dims argument.
	(adjust_parallelism): New hook.
	* targhooks.h (default_goacc_validate_dims): Add new argument.
	(default_goacc_adjust_parallelism): Declare.

From 1ee16b267dfbb0a148e8ec3b83ca463c21cbac1d Mon Sep 17 00:00:00 2001
From: Cesar Philippidis <cesar@codesourcery.com>
Date: Fri, 2 Mar 2018 10:08:23 -0800
Subject: [PATCH] New target hooks

---
 gcc/config/nvptx/nvptx.c | 139 +++++++++++++++++++++++++++++++++++++++++++++--
 gcc/doc/tm.texi          |  15 +++--
 gcc/doc/tm.texi.in       |   2 +
 gcc/omp-offload.c        |  35 ++++++++++--
 gcc/target.def           |  17 ++++--
 gcc/targhooks.h          |   3 +-
 6 files changed, 190 insertions(+), 21 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 5642941c6a3..507c8671704 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -5205,14 +5205,36 @@  nvptx_simt_vf ()
   return PTX_WARP_SIZE;
 }
 
+#define NVPTX_GOACC_VL_WARP "nvptx vl warp"
+
+/* Return true of the offloaded function needs a vector_length of
+   PTX_WARP_SIZE.  */
+
+static bool
+nvptx_goacc_needs_vl_warp ()
+{
+  tree attr = lookup_attribute (NVPTX_GOACC_VL_WARP,
+				DECL_ATTRIBUTES (current_function_decl));
+  return attr == NULL_TREE;
+}
+
 /* Validate compute dimensions of an OpenACC offload or routine, fill
    in non-unity defaults.  FN_LEVEL indicates the level at which a
    routine might spawn a loop.  It is negative for non-routines.  If
    DECL is null, we are validating the default dimensions.  */
 
 static bool
-nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
+nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level,
+			   int default_dims[])
 {
+  int default_vector_length = PTX_VECTOR_LENGTH;
+
+  /* For capability reasons, fallback to vl = 32 for runtime values.  */
+  if (dims[GOMP_DIM_VECTOR] == 0)
+    default_vector_length = PTX_WARP_SIZE;
+  else if (default_dims)
+      default_vector_length = default_dims[GOMP_DIM_VECTOR];
+
   /* Detect if a function is unsuitable for offloading.  */
   if (!flag_offload_force && decl)
     {
@@ -5237,18 +5259,20 @@  nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
 
   bool changed = false;
 
-  /* The vector size must be 32, unless this is a SEQ routine.  */
+  /* The vector size must be a positive multiple of the warp size,
+     unless this is a SEQ routine.  */
   if (fn_level <= GOMP_DIM_VECTOR && fn_level >= -1
       && dims[GOMP_DIM_VECTOR] >= 0
-      && dims[GOMP_DIM_VECTOR] != PTX_VECTOR_LENGTH)
+      && (dims[GOMP_DIM_VECTOR] % 32 != 0
+	  || dims[GOMP_DIM_VECTOR] == 0))
     {
       if (fn_level < 0 && dims[GOMP_DIM_VECTOR] >= 0)
 	warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0,
 		    dims[GOMP_DIM_VECTOR]
 		    ? G_("using vector_length (%d), ignoring %d")
 		    : G_("using vector_length (%d), ignoring runtime setting"),
-		    PTX_VECTOR_LENGTH, dims[GOMP_DIM_VECTOR]);
-      dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH;
+		    default_vector_length, dims[GOMP_DIM_VECTOR]);
+      dims[GOMP_DIM_VECTOR] = default_vector_length;
       changed = true;
     }
 
@@ -5262,16 +5286,77 @@  nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
       changed = true;
     }
 
+  /* Ensure that num_worker * vector_length < cta size.  */
+  if (dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR] > PTX_CTA_SIZE)
+    {
+      warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0,
+		  G_("using vector_length (%d), ignoring %d"),
+		  default_vector_length, dims[GOMP_DIM_VECTOR]);
+      dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
+      changed = true;
+    }
+
+  /* vector_length must not exceed PTX_CTA_SIZE.  */
+  if (dims[GOMP_DIM_VECTOR] >= PTX_CTA_SIZE)
+    {
+      int new_vector = PTX_CTA_SIZE;
+      if (default_dims)
+	new_vector = default_vector_length;
+      warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0,
+		  G_("using vector_length (%d), ignoring %d"),
+		  new_vector, dims[GOMP_DIM_VECTOR]);
+      dims[GOMP_DIM_VECTOR] = new_vector;
+      changed = true;
+    }
+
+  /* Set vector_length to default_vector_length if there are a sufficient
+     number of free threads in the CTA.  */
+  if (dims[GOMP_DIM_WORKER] > 0 && dims[GOMP_DIM_VECTOR] <= 0)
+    {
+      if (dims[GOMP_DIM_WORKER] * default_vector_length <= PTX_CTA_SIZE)
+	dims[GOMP_DIM_VECTOR] = default_vector_length;
+      else if (dims[GOMP_DIM_WORKER] * PTX_WARP_SIZE <= PTX_CTA_SIZE)
+	dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
+      else
+	error_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION,
+		  "vector_length must be at least 32");
+      changed = true;
+    }
+
+  /* Specify a default vector_length.  */
+  if (dims[GOMP_DIM_VECTOR] < 0)
+    {
+      dims[GOMP_DIM_VECTOR] = default_vector_length;
+      changed = true;
+    }
+
+  if (nvptx_goacc_needs_vl_warp () && dims[GOMP_DIM_VECTOR] != PTX_WARP_SIZE)
+    {
+      dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
+      changed = true;
+    }
+
   if (!decl)
     {
-      dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH;
+      bool new_vector = false;
+      if (dims[GOMP_DIM_VECTOR] <= 1)
+	{
+	  dims[GOMP_DIM_VECTOR] = default_vector_length;
+	  new_vector = true;
+	}
       if (dims[GOMP_DIM_WORKER] < 0)
 	dims[GOMP_DIM_WORKER] = PTX_DEFAULT_RUNTIME_DIM;
       if (dims[GOMP_DIM_GANG] < 0)
 	dims[GOMP_DIM_GANG] = PTX_DEFAULT_RUNTIME_DIM;
+      if (new_vector
+	  && dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR] > PTX_CTA_SIZE)
+	dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
       changed = true;
     }
 
+  gcc_assert (dims[GOMP_DIM_VECTOR] != 0);
+  gcc_assert (dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR] <= PTX_CTA_SIZE);
+
   return changed;
 }
 
@@ -5291,6 +5376,45 @@  nvptx_dim_limit (int axis)
   return 0;
 }
 
+/* Adjust the parallelism available to a loop given vector_length
+   associated with the offloaded function.  */
+
+static unsigned
+nvptx_adjust_parallelism (unsigned inner_mask, unsigned outer_mask)
+{
+  if (nvptx_goacc_needs_vl_warp ())
+    return inner_mask;
+
+  bool wv = (inner_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+    && (inner_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR));
+  offload_attrs oa;
+
+  populate_offload_attrs (&oa);
+
+  if (oa.vector_length == PTX_WARP_SIZE)
+    return inner_mask;
+
+  /* FIXME: This is overly conservative; worker and vector loop will
+     eventually be combined.  */
+  if (wv)
+    return inner_mask & ~GOMP_DIM_MASK (GOMP_DIM_WORKER);
+
+  /* It's difficult to guarantee that warps in large vector_lengths
+     will remain convergent when a vector loop is nested inside a
+     worker loop.  Therefore, fallback to setting vector_length to
+     PTX_WARP_SIZE.  Hopefully this condition may be relaxed for
+     sm_70+ targets.  */
+  if ((inner_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+      && (outer_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)))
+    {
+      tree attr = tree_cons (get_identifier (NVPTX_GOACC_VL_WARP), NULL_TREE,
+			      DECL_ATTRIBUTES (current_function_decl));
+      DECL_ATTRIBUTES (current_function_decl) = attr;
+    }
+
+  return inner_mask;
+}
+
 /* Determine whether fork & joins are needed.  */
 
 static bool
@@ -6180,6 +6304,9 @@  nvptx_set_current_function (tree fndecl)
 #undef TARGET_GOACC_DIM_LIMIT
 #define TARGET_GOACC_DIM_LIMIT nvptx_dim_limit
 
+#undef TARGET_GOACC_ADJUST_PARALLELISM
+#define TARGET_GOACC_ADJUST_PARALLELISM nvptx_adjust_parallelism
+
 #undef TARGET_GOACC_FORK_JOIN
 #define TARGET_GOACC_FORK_JOIN nvptx_goacc_fork_join
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 0fcb9c64bf4..3028e438ddd 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5865,7 +5865,7 @@  to use it.
 Return number of threads in SIMT thread group on the target.
 @end deftypefn
 
-@deftypefn {Target Hook} bool TARGET_GOACC_VALIDATE_DIMS (tree @var{decl}, int *@var{dims}, int @var{fn_level})
+@deftypefn {Target Hook} bool TARGET_GOACC_VALIDATE_DIMS (tree @var{decl}, int *@var{dims}, int @var{fn_level}, int *@var{default_dims})
 This hook should check the launch dimensions provided for an OpenACC
 compute region, or routine.  Defaulted values are represented as -1
 and non-constant values as 0.  The @var{fn_level} is negative for the
@@ -5873,9 +5873,10 @@  function corresponding to the compute region.  For a routine is is the
 outermost level at which partitioned execution may be spawned.  The hook
 should verify non-default values.  If DECL is NULL, global defaults
 are being validated and unspecified defaults should be filled in.
-Diagnostics should be issued as appropriate.  Return
-true, if changes have been made.  You must override this hook to
-provide dimensions larger than 1.
+Diagnostics should be issued as appropriate.  The @var{default_dims}
+contain the user-specified default dims.  Return true, if changes have
+been made.  You must override this hook to provide dimensions larger
+than 1.
 @end deftypefn
 
 @deftypefn {Target Hook} int TARGET_GOACC_DIM_LIMIT (int @var{axis})
@@ -5883,6 +5884,12 @@  This hook should return the maximum size of a particular dimension,
 or zero if unbounded.
 @end deftypefn
 
+@deftypefn {Target Hook} unsigned TARGET_GOACC_ADJUST_PARALLELISM (unsigned @var{this_mask}, unsigned @var{outer_mask})
+This hook allows the accelerator compiler to remove any unused
+parallelism exposed in the current loop @var{THIS_MASK}, and the
+enclosing loop @var{OUTER_MASK}.  It returns an adjusted mask.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_GOACC_FORK_JOIN (gcall *@var{call}, const int *@var{dims}, bool @var{is_fork})
 This hook can be used to convert IFN_GOACC_FORK and IFN_GOACC_JOIN
 function calls to target-specific gimple, or indicate whether they
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 4187da139a9..fc73ad13e0a 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4298,6 +4298,8 @@  address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_GOACC_DIM_LIMIT
 
+@hook TARGET_GOACC_ADJUST_PARALLELISM
+
 @hook TARGET_GOACC_FORK_JOIN
 
 @hook TARGET_GOACC_REDUCTION
diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index ba3f4317f4e..f15ce6b8f8d 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -615,8 +615,8 @@  oacc_parse_default_dims (const char *dims)
     }
 
   /* Allow the backend to validate the dimensions.  */
-  targetm.goacc.validate_dims (NULL_TREE, oacc_default_dims, -1);
-  targetm.goacc.validate_dims (NULL_TREE, oacc_min_dims, -2);
+  targetm.goacc.validate_dims (NULL_TREE, oacc_default_dims, -1, NULL);
+  targetm.goacc.validate_dims (NULL_TREE, oacc_min_dims, -2, NULL);
 }
 
 /* Validate and update the dimensions for offloaded FN.  ATTRS is the
@@ -626,7 +626,8 @@  oacc_parse_default_dims (const char *dims)
    function.  */
 
 static void
-oacc_validate_dims (tree fn, tree attrs, int *dims, int level, unsigned used)
+oacc_validate_dims (tree fn, tree attrs, int *dims, int level, unsigned used,
+		    int * ARG_UNUSED (default_dims))
 {
   tree purpose[GOMP_DIM_MAX];
   unsigned ix;
@@ -675,7 +676,8 @@  oacc_validate_dims (tree fn, tree attrs, int *dims, int level, unsigned used)
 		      axes[ix], axes[ix]);
     }
 
-  bool changed = targetm.goacc.validate_dims (fn, dims, level);
+  bool changed = targetm.goacc.validate_dims (fn, dims, level,
+					      oacc_default_dims);
 
   /* Default anything left to 1 or a partitioned default.  */
   for (ix = 0; ix != GOMP_DIM_MAX; ix++)
@@ -1258,6 +1260,13 @@  oacc_loop_fixed_partitions (oacc_loop *loop, unsigned outer_mask)
 	}
     }
 
+  /* FIXME: Ideally, we should be coalescing parallelism here if the
+     hardware supports it.  E.g. Instead of partitioning a loop
+     across worker and vector axes, sometimes the hardware can
+     execute those loops together without resorting to placing
+     extra thread barriers.  */
+  this_mask = targetm.goacc.adjust_parallelism (this_mask, outer_mask);
+
   mask_all |= this_mask;
 
   if (loop->flags & OLF_TILE)
@@ -1349,6 +1358,7 @@  oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask,
 	  this_mask ^= loop->e_mask;
 	}
 
+      this_mask = targetm.goacc.adjust_parallelism (this_mask, outer_mask);
       loop->mask |= this_mask;
     }
 
@@ -1397,6 +1407,8 @@  oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask,
 	}
 
       loop->mask |= this_mask;
+      loop->mask = targetm.goacc.adjust_parallelism (loop->mask, outer_mask);
+
       if (!loop->mask && noisy)
 	warning_at (loop->loc, 0,
 		    tiling
@@ -1604,7 +1616,8 @@  execute_oacc_device_lower ()
     }
 
   int dims[GOMP_DIM_MAX];
-  oacc_validate_dims (current_function_decl, attrs, dims, fn_level, used_mask);
+  oacc_validate_dims (current_function_decl, attrs, dims, fn_level, used_mask,
+		      NULL);
 
   if (dump_file)
     {
@@ -1746,7 +1759,8 @@  execute_oacc_device_lower ()
 
 bool
 default_goacc_validate_dims (tree ARG_UNUSED (decl), int *dims,
-			     int ARG_UNUSED (fn_level))
+			     int ARG_UNUSED (fn_level),
+			     int * ARG_UNUSED (default_dims))
 {
   bool changed = false;
 
@@ -1774,6 +1788,15 @@  default_goacc_dim_limit (int ARG_UNUSED (axis))
 #endif
 }
 
+/* Default adjustment of loop parallelism is not required.  */
+
+unsigned
+default_goacc_adjust_parallelism (unsigned this_mask,
+				  unsigned ARG_UNUSED (outer_mask))
+{
+  return this_mask;
+}
+
 namespace {
 
 const pass_data pass_data_oacc_device_lower =
diff --git a/gcc/target.def b/gcc/target.def
index b302d3639da..aa7da2c1b2c 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1683,10 +1683,11 @@  function corresponding to the compute region.  For a routine is is the\n\
 outermost level at which partitioned execution may be spawned.  The hook\n\
 should verify non-default values.  If DECL is NULL, global defaults\n\
 are being validated and unspecified defaults should be filled in.\n\
-Diagnostics should be issued as appropriate.  Return\n\
-true, if changes have been made.  You must override this hook to\n\
-provide dimensions larger than 1.",
-bool, (tree decl, int *dims, int fn_level),
+Diagnostics should be issued as appropriate.  The @var{default_dims}\n\
+contain the user-specified default dims.  Return true, if changes have\n\
+been made.  You must override this hook to provide dimensions larger\n\
+than 1.",
+bool, (tree decl, int *dims, int fn_level, int *default_dims),
 default_goacc_validate_dims)
 
 DEFHOOK
@@ -1696,6 +1697,14 @@  or zero if unbounded.",
 int, (int axis),
 default_goacc_dim_limit)
 
+DEFHOOK
+(adjust_parallelism,
+"This hook allows the accelerator compiler to remove any unused\n\
+parallelism exposed in the current loop @var{THIS_MASK}, and the\n\
+enclosing loop @var{OUTER_MASK}.  It returns an adjusted mask.",
+unsigned, (unsigned this_mask, unsigned outer_mask),
+default_goacc_adjust_parallelism)
+
 DEFHOOK
 (fork_join,
 "This hook can be used to convert IFN_GOACC_FORK and IFN_GOACC_JOIN\n\
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 18070df7839..b60c72a38f1 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -111,10 +111,11 @@  extern void default_finish_cost (void *, unsigned *, unsigned *, unsigned *);
 extern void default_destroy_cost_data (void *);
 
 /* OpenACC hooks.  */
-extern bool default_goacc_validate_dims (tree, int [], int);
+extern bool default_goacc_validate_dims (tree, int [], int, int []);
 extern int default_goacc_dim_limit (int);
 extern bool default_goacc_fork_join (gcall *, const int [], bool);
 extern void default_goacc_reduction (gcall *);
+extern unsigned default_goacc_adjust_parallelism (unsigned, unsigned);
 
 /* These are here, and not in hooks.[ch], because not all users of
    hooks.h include tm.h, and thus we don't have CUMULATIVE_ARGS.  */
-- 
2.14.3