adjust default nvptx launch geometry for OpenACC offloaded regions

Message ID c36981ca-04bc-08df-0af7-c2e1a8201640@codesourcery.com
State New

Commit Message

Cesar Philippidis June 20, 2018, 9:59 p.m.
At present, the nvptx libgomp plugin does not take into account the
amount of shared resources on GPUs (mostly shared memory and register
usage) when selecting the default num_gangs and num_workers. In certain
situations, an OpenACC offloaded function can fail to launch if the GPU
does not have sufficient shared resources to accommodate all of the
threads in a CUDA block. This typically manifests when a PTX function
uses a lot of registers and num_workers is set too large, although it
can also happen if the shared-memory has been exhausted by the threads
in a vector.

This patch resolves that issue by adjusting num_workers based on the
amount of shared resources used by each thread. If worker parallelism
has been requested, libgomp will spawn as many workers as possible, up
to 32.
Without this patch, libgomp would always default to launching 32 workers
when worker parallelism is used.
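
As a rough illustration of the worker adjustment described above, the
following sketch caps num_workers by the register budget of one CUDA
block. It is not the code from this patch: the helper name
max_workers_for_regs and its parameters are hypothetical, and register
allocation granularity and shared memory are ignored for brevity.

```c
#include <assert.h>

/* Hypothetical helper, not part of libgomp: return the largest
   num_workers (at most 32) such that one CUDA block of
   num_workers * vector_length threads fits within the per-block
   register budget.  Assumption: registers are the only limiting
   resource; allocation granularity and shared memory are ignored.  */
static int
max_workers_for_regs (int regs_per_thread, int regs_per_block,
		      int vector_length)
{
  assert (regs_per_thread > 0 && vector_length > 0);

  /* Number of threads whose registers fit in one block.  */
  int max_threads = regs_per_block / regs_per_thread;
  int workers = max_threads / vector_length;

  if (workers > 32)
    workers = 32;
  if (workers < 1)
    workers = 1;
  return workers;
}
```

With a budget of 65536 registers per block and a vector length of 32,
a kernel that needs 128 registers per thread would be limited to 16
workers under these assumptions, instead of the previous default of 32.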

Besides worker parallelism, this patch also includes some
heuristics for selecting num_gangs. Before, the plugin would launch two
gangs per GPU multiprocessor. Now it follows the formula contained in
the "CUDA Occupancy Calculator" spreadsheet that's distributed with CUDA.

Is this patch OK for trunk?

Thanks,
Cesar

Comments

Tom de Vries June 20, 2018, 10:15 p.m. | #1
On 06/20/2018 11:59 PM, Cesar Philippidis wrote:
> Now it follows the formula contained in
> the "CUDA Occupancy Calculator" spreadsheet that's distributed with CUDA.

Any reason we're not using the cuda runtime functions to get the
occupancy (see PR85590 - [nvptx, libgomp, openacc] Use cuda runtime fns
to determine launch configuration in nvptx ) ?

Thanks,
- Tom
Cesar Philippidis June 21, 2018, 1:58 p.m. | #2
On 06/20/2018 03:15 PM, Tom de Vries wrote:
> On 06/20/2018 11:59 PM, Cesar Philippidis wrote:
>> Now it follows the formula contained in
>> the "CUDA Occupancy Calculator" spreadsheet that's distributed with CUDA.
>
> Any reason we're not using the cuda runtime functions to get the
> occupancy (see PR85590 - [nvptx, libgomp, openacc] Use cuda runtime fns
> to determine launch configuration in nvptx ) ?

There are two reasons:

  1) cuda_occupancy.h depends on the CUDA runtime to extract the device
     properties instead of the CUDA driver API. However, we can always
     teach libgomp how to populate the cudaDeviceProp struct using the
     driver API.

  2) CUDA is not always present on the build host, and that's why
     libgomp maintains its own cuda.h. So at the very least, this
     functionality would be good to have in libgomp as a fallback
     implementation; it's not good to have a program fail due to
     insufficient hardware resources when that is avoidable.

Cesar
Cesar Philippidis June 29, 2018, 5:12 p.m. | #3
Ping.

Cesar

Cesar Philippidis June 29, 2018, 9:22 p.m. | #4
On 06/29/2018 10:12 AM, Cesar Philippidis wrote:
> Ping.


While porting the vector length patches to trunk, I realized that I
mistakenly removed support for the environment variable GOMP_OPENACC_DIM
in this patch (thanks for adding those test cases, Tom!). I'll post an
updated version of this patch once I get the vector length patches
working with it.

Cesar

Tom de Vries July 2, 2018, 2:14 p.m. | #5
On 06/21/2018 03:58 PM, Cesar Philippidis wrote:
> On 06/20/2018 03:15 PM, Tom de Vries wrote:
>> On 06/20/2018 11:59 PM, Cesar Philippidis wrote:
>>> Now it follows the formula contained in
>>> the "CUDA Occupancy Calculator" spreadsheet that's distributed with CUDA.
>>
>> Any reason we're not using the cuda runtime functions to get the
>> occupancy (see PR85590 - [nvptx, libgomp, openacc] Use cuda runtime fns
>> to determine launch configuration in nvptx ) ?
>
> There are two reasons:
>
>   1) cuda_occupancy.h depends on the CUDA runtime to extract the device
>      properties instead of the CUDA driver API. However, we can always
>      teach libgomp how to populate the cudaDeviceProp struct using the
>      driver API.
>
>   2) CUDA is not always present on the build host, and that's why
>      libgomp maintains its own cuda.h. So at the very least, this
>      functionality would be good to have in libgomp as a fallback
>      implementation;

Libgomp maintains its own cuda.h to "allow building GCC with PTX
offloading even without CUDA being installed" (
https://gcc.gnu.org/ml/gcc-patches/2017-01/msg00980.html ).

The libgomp nvptx plugin however uses the cuda driver API to launch
kernels etc, so we can assume that's always available at launch time.
And according to the "CUDA Pro Tip: Occupancy API Simplifies Launch
Configuration", the occupancy API is also available in the driver API.

What we cannot assume to be available is the occupancy API pre cuda-6.5.
So it's fine to have a fallback for that (properly isolated in utility
functions), but for cuda 6.5 and up we want to use the occupancy API.
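
The split suggested above, with the fallback isolated behind its own
function, could be sketched roughly as follows. This is only an
illustration: the names choose_block_size, block_size_fn and the two
stubs are hypothetical, and the stubs return made-up values instead of
calling the CUDA driver. The threshold 6050 assumes the usual CUDA
version encoding of 1000 * major + 10 * minor.

```c
#include <stddef.h>

/* Hypothetical signature for a routine that picks a CUDA block size.  */
typedef int (*block_size_fn) (void);

/* Use the driver's occupancy API when the driver is new enough (CUDA
   6.5 is encoded as 6050) and the entry point is available; otherwise
   use the isolated fallback heuristic.  */
static int
choose_block_size (int driver_version, block_size_fn from_driver,
		   block_size_fn from_fallback)
{
  if (driver_version >= 6050 && from_driver != NULL)
    return from_driver ();
  return from_fallback ();
}

/* Stand-ins for the two strategies; the return values are made up.  */
static int
stub_driver (void)
{
  return 256;
}

static int
stub_fallback (void)
{
  return 1024;
}
```

Keeping the dispatch in one place means the rest of the launch code
does not need to know which strategy produced the block size.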

>      its not good to have program fail due to
>      insufficient hardware resources errors when it is avoidable.
>

Right, in fact there are two separate things you're trying to address
here: launch failure and occupancy heuristic, so split the patch.

Thanks,
- Tom
Cesar Philippidis July 2, 2018, 2:39 p.m. | #6
On 07/02/2018 07:14 AM, Tom de Vries wrote:
> On 06/21/2018 03:58 PM, Cesar Philippidis wrote:
>> On 06/20/2018 03:15 PM, Tom de Vries wrote:
>>> On 06/20/2018 11:59 PM, Cesar Philippidis wrote:
>>>> Now it follows the formula contained in
>>>> the "CUDA Occupancy Calculator" spreadsheet that's distributed with CUDA.
>>>
>>> Any reason we're not using the cuda runtime functions to get the
>>> occupancy (see PR85590 - [nvptx, libgomp, openacc] Use cuda runtime fns
>>> to determine launch configuration in nvptx ) ?
>>
>> There are two reasons:
>>
>>   1) cuda_occupancy.h depends on the CUDA runtime to extract the device
>>      properties instead of the CUDA driver API. However, we can always
>>      teach libgomp how to populate the cudaDeviceProp struct using the
>>      driver API.
>>
>>   2) CUDA is not always present on the build host, and that's why
>>      libgomp maintains its own cuda.h. So at the very least, this
>>      functionality would be good to have in libgomp as a fallback
>>      implementation;
>
> Libgomp maintains its own cuda.h to "allow building GCC with PTX
> offloading even without CUDA being installed" (
> https://gcc.gnu.org/ml/gcc-patches/2017-01/msg00980.html ).
>
> The libgomp nvptx plugin however uses the cuda driver API to launch
> kernels etc, so we can assume that's always available at launch time.
> And according to the "CUDA Pro Tip: Occupancy API Simplifies Launch
> Configuration", the occupancy API is also available in the driver API.

Thanks for the info. I was not aware that the CUDA driver API had a
thread occupancy calculator (it's described in section 4.18).

> What we cannot assume to be available is the occupancy API pre cuda-6.5.
> So it's fine to have a fallback for that (properly isolated in utility
> functions), but for cuda 6.5 and up we want to use the occupancy API.

That seems reasonable. I'll run some experiments with that. In the
meantime, would it be OK to make this fallback the default, then add
support for the driver occupancy calculator as a follow up?

>>      its not good to have program fail due to
>>      insufficient hardware resources errors when it is avoidable.
>>
>
> Right, in fact there are two separate things you're trying to address
> here: launch failure and occupancy heuristic, so split the patch.

ACK. I'll split those changes into separate patches.

By the way, do you have any preferences on how to break up the nvptx
vector length changes for trunk submission? I was planning on breaking
it down into four components - generic ME changes, tests, nvptx
reductions and the rest. Those two nvptx components are large, so I'll
probably break them down to smaller patches, but I'm not sure if it's
worthwhile to make them independent from one another with the use of a
lot of stub functions.

Cesar
Cesar Philippidis July 11, 2018, 7:13 p.m. | #7
On 07/02/2018 07:14 AM, Tom de Vries wrote:
> On 06/21/2018 03:58 PM, Cesar Philippidis wrote:
>> On 06/20/2018 03:15 PM, Tom de Vries wrote:
>>> On 06/20/2018 11:59 PM, Cesar Philippidis wrote:
>>>> Now it follows the formula contained in
>>>> the "CUDA Occupancy Calculator" spreadsheet that's distributed with CUDA.
>>>
>>> Any reason we're not using the cuda runtime functions to get the
>>> occupancy (see PR85590 - [nvptx, libgomp, openacc] Use cuda runtime fns
>>> to determine launch configuration in nvptx ) ?
>>
>> There are two reasons:
>>
>>   1) cuda_occupancy.h depends on the CUDA runtime to extract the device
>>      properties instead of the CUDA driver API. However, we can always
>>      teach libgomp how to populate the cudaDeviceProp struct using the
>>      driver API.
>>
>>   2) CUDA is not always present on the build host, and that's why
>>      libgomp maintains its own cuda.h. So at the very least, this
>>      functionality would be good to have in libgomp as a fallback
>>      implementation;
>
> Libgomp maintains its own cuda.h to "allow building GCC with PTX
> offloading even without CUDA being installed" (
> https://gcc.gnu.org/ml/gcc-patches/2017-01/msg00980.html ).
>
> The libgomp nvptx plugin however uses the cuda driver API to launch
> kernels etc, so we can assume that's always available at launch time.
> And according to the "CUDA Pro Tip: Occupancy API Simplifies Launch
> Configuration", the occupancy API is also available in the driver API.
>
> What we cannot assume to be available is the occupancy API pre cuda-6.5.
> So it's fine to have a fallback for that (properly isolated in utility
> functions), but for cuda 6.5 and up we want to use the occupancy API.

Here's revision 2 to the patch. I replaced all of my thread occupancy
heuristics with calls to the CUDA driver as you suggested. The
performance is worse than my heuristics, but that's to be expected
because the CUDA driver only guarantees the minimal launch geometry to
fully utilize the hardware, and not the optimal value. I'll
reintroduce my heuristics later as a follow up patch. The major
advantage of the CUDA thread occupancy calculator is that it allows the
runtime to select sensible default num_workers to avoid those annoying
runtime failures due to insufficient GPU hardware resources.

One thing that may stick out in this patch is how it probes for the
driver version instead of the API version. It turns out that the API
version corresponds to the SM version declared in the PTX sources,
whereas the driver version corresponds to the latest version of CUDA
supported by the driver. At least that's the case with driver version
396.24.
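
For readers unfamiliar with the version numbers involved, here is a
small sketch of how a CUDA version integer can be decoded. It assumes
the usual encoding of 1000 * major + 10 * minor; the helper names
cuda_version_major and cuda_version_minor are hypothetical and are not
part of the patch.

```c
/* Hypothetical helpers: split a CUDA version number, encoded as
   1000 * major + 10 * minor, into its components.  */
static int
cuda_version_major (int version)
{
  return version / 1000;
}

static int
cuda_version_minor (int version)
{
  return (version % 1000) / 10;
}
```

With this encoding, CUDA 6.5 is 6050, so a test of the form
"version >= 6050" admits 6.5 itself, while "version > 6050" does not.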

>>      its not good to have program fail due to
>>      insufficient hardware resources errors when it is avoidable.
>>
>
> Right, in fact there are two separate things you're trying to address
> here: launch failure and occupancy heuristic, so split the patch.

That hunk was small, so I included it with this patch. Although if you
insist, I can remove it.

Is this patch OK for trunk? I tested it on x86_64 with nvptx offloading.

Cesar
2018-07-XX  Cesar Philippidis  <cesar@codesourcery.com>
	    Tom de Vries  <tdevries@suse.de>

	gcc/
	* config/nvptx/nvptx.c (PTX_GANG_DEFAULT): Rename to ...
	(PTX_DEFAULT_RUNTIME_DIM): ... this.
	(nvptx_goacc_validate_dims): Set default worker and gang dims to
	PTX_DEFAULT_RUNTIME_DIM.
	(nvptx_dim_limit): Ignore GOMP_DIM_WORKER;

	libgomp/
	* plugin/cuda/cuda.h (CUoccupancyB2DSize): Declare.
	(cuOccupancyMaxPotentialBlockSizeWithFlags): Likewise.
	* plugin/plugin-nvptx.c (struct ptx_device): Add driver_version member.
	(nvptx_open_device): Set it.
	(nvptx_exec): Use the CUDA driver to both determine default num_gangs
	and num_workers, and error if the hardware doesn't have sufficient
	resources to launch a kernel.


diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 5608bee8a8d..c1946e75f42 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -5165,7 +5165,7 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
 /* Define dimension sizes for known hardware.  */
 #define PTX_VECTOR_LENGTH 32
 #define PTX_WORKER_LENGTH 32
-#define PTX_GANG_DEFAULT  0 /* Defer to runtime.  */
+#define PTX_DEFAULT_RUNTIME_DIM 0 /* Defer to runtime.  */
 
 /* Implement TARGET_SIMT_VF target hook: number of threads in a warp.  */
 
@@ -5214,9 +5214,9 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
     {
       dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH;
       if (dims[GOMP_DIM_WORKER] < 0)
-	dims[GOMP_DIM_WORKER] = PTX_WORKER_LENGTH;
+	dims[GOMP_DIM_WORKER] = PTX_DEFAULT_RUNTIME_DIM;
       if (dims[GOMP_DIM_GANG] < 0)
-	dims[GOMP_DIM_GANG] = PTX_GANG_DEFAULT;
+	dims[GOMP_DIM_GANG] = PTX_DEFAULT_RUNTIME_DIM;
       changed = true;
     }
 
@@ -5230,9 +5230,6 @@ nvptx_dim_limit (int axis)
 {
   switch (axis)
     {
-    case GOMP_DIM_WORKER:
-      return PTX_WORKER_LENGTH;
-
     case GOMP_DIM_VECTOR:
       return PTX_VECTOR_LENGTH;
 
diff --git a/libgomp/plugin/cuda/cuda.h b/libgomp/plugin/cuda/cuda.h
index 4799825bda2..1ee59db172c 100644
--- a/libgomp/plugin/cuda/cuda.h
+++ b/libgomp/plugin/cuda/cuda.h
@@ -44,6 +44,7 @@ typedef void *CUevent;
 typedef void *CUfunction;
 typedef void *CUlinkState;
 typedef void *CUmodule;
+typedef size_t (*CUoccupancyB2DSize)(int);
 typedef void *CUstream;
 
 typedef enum {
@@ -170,6 +171,9 @@ CUresult cuModuleGetGlobal (CUdeviceptr *, size_t *, CUmodule, const char *);
 CUresult cuModuleLoad (CUmodule *, const char *);
 CUresult cuModuleLoadData (CUmodule *, const void *);
 CUresult cuModuleUnload (CUmodule);
+CUresult cuOccupancyMaxPotentialBlockSizeWithFlags(int *, int *, CUfunction,
+						   CUoccupancyB2DSize, size_t,
+						   int, unsigned int);
 CUresult cuStreamCreate (CUstream *, unsigned);
 #define cuStreamDestroy cuStreamDestroy_v2
 CUresult cuStreamDestroy (CUstream);
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 89326e57741..5022e462a3d 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -94,6 +94,7 @@ CUDA_ONE_CALL (cuModuleGetGlobal)	\
 CUDA_ONE_CALL (cuModuleLoad)		\
 CUDA_ONE_CALL (cuModuleLoadData)	\
 CUDA_ONE_CALL (cuModuleUnload)		\
+CUDA_ONE_CALL (cuOccupancyMaxPotentialBlockSize) \
 CUDA_ONE_CALL (cuStreamCreate)		\
 CUDA_ONE_CALL (cuStreamDestroy)		\
 CUDA_ONE_CALL (cuStreamQuery)		\
@@ -414,6 +415,7 @@ struct ptx_device
   int num_sms;
   int regs_per_block;
   int regs_per_sm;
+  int driver_version;
 
   struct ptx_image_data *images;  /* Images loaded on device.  */
   pthread_mutex_t image_lock;     /* Lock for above list.  */
@@ -725,6 +727,7 @@ nvptx_open_device (int n)
   ptx_dev->ord = n;
   ptx_dev->dev = dev;
   ptx_dev->ctx_shared = false;
+  ptx_dev->driver_version = 0;
 
   r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &ctx_dev);
   if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
@@ -806,6 +809,9 @@ nvptx_open_device (int n)
   if (r != CUDA_SUCCESS)
     async_engines = 1;
 
+  CUDA_CALL_ERET (NULL, cuDriverGetVersion, &pi);
+  ptx_dev->driver_version = pi;
+
   ptx_dev->images = NULL;
   pthread_mutex_init (&ptx_dev->image_lock, NULL);
 
@@ -1120,6 +1126,7 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
   void *hp, *dp;
   struct nvptx_thread *nvthd = nvptx_thread ();
   const char *maybe_abort_msg = "(perhaps abort was called)";
+  int dev_size = nvthd->ptx_dev->num_sms;
 
   function = targ_fn->fn;
 
@@ -1140,8 +1147,7 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 
   if (seen_zero)
     {
-      /* See if the user provided GOMP_OPENACC_DIM environment
-	 variable to specify runtime defaults. */
+      /* Specify runtime defaults. */
       static int default_dims[GOMP_DIM_MAX];
 
       pthread_mutex_lock (&ptx_dev_lock);
@@ -1150,23 +1156,20 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 	  for (int i = 0; i < GOMP_DIM_MAX; ++i)
 	    default_dims[i] = GOMP_PLUGIN_acc_default_dim (i);
 
-	  int warp_size, block_size, dev_size, cpu_size;
+	  int warp_size, block_size, cpu_size;
 	  CUdevice dev = nvptx_thread()->ptx_dev->dev;
 	  /* 32 is the default for known hardware.  */
 	  int gang = 0, worker = 32, vector = 32;
-	  CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm;
+	  CUdevice_attribute cu_tpb, cu_ws, cu_tpm;
 
 	  cu_tpb = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK;
 	  cu_ws = CU_DEVICE_ATTRIBUTE_WARP_SIZE;
-	  cu_mpc = CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT;
 	  cu_tpm  = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR;
 
 	  if (CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &block_size, cu_tpb,
 				 dev) == CUDA_SUCCESS
 	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &warp_size, cu_ws,
 				    dev) == CUDA_SUCCESS
-	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &dev_size, cu_mpc,
-				    dev) == CUDA_SUCCESS
 	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &cpu_size, cu_tpm,
 				    dev) == CUDA_SUCCESS)
 	    {
@@ -1199,12 +1202,59 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 			     default_dims[GOMP_DIM_VECTOR]);
 	}
       pthread_mutex_unlock (&ptx_dev_lock);
+      int vectors = default_dims[GOMP_DIM_VECTOR];
+      int workers = default_dims[GOMP_DIM_WORKER];
+      int gangs = default_dims[GOMP_DIM_GANG];
+
+      if (nvptx_thread()->ptx_dev->driver_version > 6050)
+	{
+	  int grids, blocks;
+	  CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,
+			    &blocks, function, NULL, 0,
+			    dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);
+	  GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "
+			     "grid = %d, block = %d\n", grids, blocks);
+
+	  gangs = grids * dev_size;
+	  workers = blocks / vectors;
+	}
 
       for (i = 0; i != GOMP_DIM_MAX; i++)
 	if (!dims[i])
-	  dims[i] = default_dims[i];
+	  {
+	    switch (i)
+	      {
+	      case GOMP_DIM_GANG:
+		/* The constant 2 was emperically.  The justification
+		   behind it is to prevent the hardware from idling by
+		   throwing twice the amount of work that it can
+		   physically handle.  */
+		dims[i] = gangs;
+		break;
+	      case GOMP_DIM_WORKER:
+		dims[i] = workers;
+		break;
+	      case GOMP_DIM_VECTOR:
+		dims[i] = vectors;
+		break;
+	      default:
+		abort ();
+	      }
+	  }
     }
 
+  /* Check if the accelerator has sufficient hardware resources to
+     launch the offloaded kernel.  */
+  if (dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]
+      > targ_fn->max_threads_per_block)
+    GOMP_PLUGIN_fatal ("The Nvidia accelerator has insufficient resources to"
+		       " launch '%s' with num_workers = %d and vector_length ="
+		       " %d; recompile the program with 'num_workers = x and"
+		       " vector_length = y' on that offloaded region or "
+		       "'-fopenacc-dim=-:x:y' where x * y <= %d.\n",
+		       targ_fn->launch->fn, dims[GOMP_DIM_WORKER],
+		       dims[GOMP_DIM_VECTOR], targ_fn->max_threads_per_block);
+
   /* This reserves a chunk of a pre-allocated page of memory mapped on both
      the host and the device. HP is a host pointer to the new chunk, and DP is
      the corresponding device pointer.  */
Tom de Vries July 26, 2018, 11:59 a.m. | #8
> Content-Type: text/x-patch; name="trunk-libgomp-default-par.diff"
> Content-Transfer-Encoding: 7bit
> Content-Disposition: attachment; filename="trunk-libgomp-default-par.diff"

From https://gcc.gnu.org/contribute.html#patches :
...
We prefer patches posted as plain text or as MIME parts of type
text/x-patch or text/plain, disposition inline, encoded as 7bit or 8bit.
It is strongly discouraged to post patches as MIME parts of type
application/whatever, disposition attachment or encoded as base64 or
quoted-printable.
...

Please post with content-disposition inline instead of attachment (or,
as plain text).

Thanks,
- Tom
Tom de Vries July 26, 2018, 12:46 p.m. | #9
>> Right, in fact there are two separate things you're trying to address
>> here: launch failure and occupancy heuristic, so split the patch.

> That hunk was small, so I included it with this patch. Although if you
> insist, I can remove it.

Please, for future reference, always assume that I insist instead of
asking me, unless you have an argument to present why that is not a good
idea. And just to be clear here: "small" is not such an argument.

Please keep in mind ( https://gcc.gnu.org/contribute.html#patches ):
...
Don't mix together changes made for different reasons. Send them
individually.
...

> +  /* Check if the accelerator has sufficient hardware resources to
> +     launch the offloaded kernel.  */
> +  if (dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]
> +      > targ_fn->max_threads_per_block)
> +    GOMP_PLUGIN_fatal ("The Nvidia accelerator has insufficient resources to"
> +		       " launch '%s' with num_workers = %d and vector_length ="
> +		       " %d; recompile the program with 'num_workers = x and"
> +		       " vector_length = y' on that offloaded region or "
> +		       "'-fopenacc-dim=-:x:y' where x * y <= %d.\n",
> +		       targ_fn->launch->fn, dims[GOMP_DIM_WORKER],
> +		       dims[GOMP_DIM_VECTOR], targ_fn->max_threads_per_block);
> +

This is copied from the state on an openacc branch where vector-length
is variable, and the error message text doesn't make sense on current
trunk for that reason. Also, it suggests a syntax for fopenacc-dim
that's not supported on trunk.

Committed as attached.

Thanks,
- Tom
[libgomp, nvptx] Add error with recompilation hint for launch failure

Currently, when a kernel is launched with too many workers, it results in a cuda
launch failure.  This is triggered e.g. for parallel-loop-1.c at -O0 on a Quadro
M1200.

This patch detects this situation, and errors out with a hint on how to fix it.

Build and reg-tested on x86_64 with nvptx accelerator.

2018-07-26  Cesar Philippidis  <cesar@codesourcery.com>
	    Tom de Vries  <tdevries@suse.de>

	* plugin/plugin-nvptx.c (nvptx_exec): Error if the hardware doesn't have
	sufficient resources to launch a kernel, and give a hint on how to fix
	it.

---
 libgomp/plugin/plugin-nvptx.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 5d9b5151e95..3a4077a1315 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1204,6 +1204,21 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 	  dims[i] = default_dims[i];
     }
 
+  /* Check if the accelerator has sufficient hardware resources to
+     launch the offloaded kernel.  */
+  if (dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]
+      > targ_fn->max_threads_per_block)
+    {
+      int suggest_workers
+	= targ_fn->max_threads_per_block / dims[GOMP_DIM_VECTOR];
+      GOMP_PLUGIN_fatal ("The Nvidia accelerator has insufficient resources to"
+			 " launch '%s' with num_workers = %d; recompile the"
+			 " program with 'num_workers = %d' on that offloaded"
+			 " region or '-fopenacc-dim=:%d'",
+			 targ_fn->launch->fn, dims[GOMP_DIM_WORKER],
+			 suggest_workers, suggest_workers);
+    }
+
   /* This reserves a chunk of a pre-allocated page of memory mapped on both
      the host and the device. HP is a host pointer to the new chunk, and DP is
      the corresponding device pointer.  */
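
The check added by the committed hunk above can be read as the
following standalone sketch. The function name suggest_workers and its
return convention (0 when the launch fits) are hypothetical; only the
arithmetic mirrors the hunk.

```c
#include <assert.h>

/* Hypothetical restatement of the check above: return 0 if a block of
   num_workers * vector_length threads fits within
   max_threads_per_block, otherwise return the num_workers value that
   the error message would suggest.  */
static int
suggest_workers (int num_workers, int vector_length,
		 int max_threads_per_block)
{
  assert (vector_length > 0);

  if (num_workers * vector_length <= max_threads_per_block)
    return 0;
  return max_threads_per_block / vector_length;
}
```

For a kernel limited to 512 threads per block, a request for 32
workers of vector length 32 would be rejected under this sketch, with
num_workers = 16 as the suggested replacement.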
Cesar Philippidis July 26, 2018, 2:27 p.m. | #10
Hi Tom,

I see that you're reviewing the libgomp changes. Please disregard the
following hunk:

On 07/11/2018 12:13 PM, Cesar Philippidis wrote:
> @@ -1199,12 +1202,59 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>  			     default_dims[GOMP_DIM_VECTOR]);
>  	}
>        pthread_mutex_unlock (&ptx_dev_lock);
> +      int vectors = default_dims[GOMP_DIM_VECTOR];
> +      int workers = default_dims[GOMP_DIM_WORKER];
> +      int gangs = default_dims[GOMP_DIM_GANG];
> +
> +      if (nvptx_thread()->ptx_dev->driver_version > 6050)
> +	{
> +	  int grids, blocks;
> +	  CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,
> +			    &blocks, function, NULL, 0,
> +			    dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);
> +	  GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "
> +			     "grid = %d, block = %d\n", grids, blocks);
> +
> +	  gangs = grids * dev_size;
> +	  workers = blocks / vectors;
> +	}

I revisited this change yesterday and I noticed it was setting gangs
incorrectly. Basically, gangs should be set as follows

  gangs = grids * (blocks / warp_size);

or, to be closer to og8, as

  gangs = 2 * grids * (blocks / warp_size);

The use of that magic constant 2 is to prevent thread starvation. That's
a similar concept to the one behind make -j<2*#threads>.

Anyway, I'm still experimenting with that change. There are still some
discrepancies between the way that I select num_workers and how the
driver does. The driver appears to be a little bit more conservative,
but according to the thread occupancy calculator, that should yield
greater performance on GPUs.
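
To make the corrected formula concrete, here is a small sketch that
turns the two values reported by the occupancy query (the minimum grid
size and the block size) into gang and worker counts. The struct and
function names are hypothetical, and the oversubscribe parameter stands
for the constant 1 or 2 discussed above.

```c
#include <assert.h>

/* Hypothetical container for a launch geometry.  */
struct launch_dims
{
  int gangs;
  int workers;
  int vectors;
};

/* GRIDS and BLOCKS are the minimum grid size and the block size as
   reported by the occupancy query.  OVERSUBSCRIBE is 1 for the plain
   formula or 2 for the og8-like variant.  */
static struct launch_dims
dims_from_occupancy (int grids, int blocks, int warp_size,
		     int oversubscribe)
{
  assert (warp_size > 0);

  struct launch_dims d;
  d.vectors = warp_size;
  d.workers = blocks / warp_size;
  d.gangs = oversubscribe * grids * (blocks / warp_size);
  return d;
}
```

For example, a query result of grid = 10 and block = 1024 with a warp
size of 32 would give 32 workers and either 320 or 640 gangs, depending
on the constant chosen.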

I just wanted to give you a heads up because you seem to be working on this.

Thanks for all of your reviews!

By the way, are you now maintainer of the libgomp nvptx plugin?

Cesar
Tom de Vries July 26, 2018, 3:19 p.m. | #11
On 07/26/2018 04:27 PM, Cesar Philippidis wrote:
> Hi Tom,
>
> I see that you're reviewing the libgomp changes. Please disregard the
> following hunk:
>
> On 07/11/2018 12:13 PM, Cesar Philippidis wrote:
>> @@ -1199,12 +1202,59 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>>  			     default_dims[GOMP_DIM_VECTOR]);
>>  	}
>>        pthread_mutex_unlock (&ptx_dev_lock);
>> +      int vectors = default_dims[GOMP_DIM_VECTOR];
>> +      int workers = default_dims[GOMP_DIM_WORKER];
>> +      int gangs = default_dims[GOMP_DIM_GANG];
>> +
>> +      if (nvptx_thread()->ptx_dev->driver_version > 6050)
>> +	{
>> +	  int grids, blocks;
>> +	  CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,
>> +			    &blocks, function, NULL, 0,
>> +			    dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);
>> +	  GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "
>> +			     "grid = %d, block = %d\n", grids, blocks);
>> +
>> +	  gangs = grids * dev_size;
>> +	  workers = blocks / vectors;
>> +	}
>
> I revisited this change yesterday and I noticed it was setting gangs
> incorrectly. Basically, gangs should be set as follows
>
>   gangs = grids * (blocks / warp_size);
>
> or to be more closer to og8 as
>
>   gangs = 2 * grids * (blocks / warp_size);
>
> The use of that magic constant 2 is to prevent thread starvation. That's
> a similar concept behind make -j<2*#threads>.
>
> Anyway, I'm still experimenting with that change. There are still some
> discrepancies between the way that I select num_workers and how the
> driver does. The driver appears to be a little bit more conservative,
> but according to the thread occupancy calculator, that should yield
> greater performance on GPUs.
>
> I just wanted to give you a heads up because you seem to be working on this.
>

Ack, thanks for letting me know.

> Thanks for all of your reviews!
>
> By the way, are you now maintainer of the libgomp nvptx plugin?

I'm not sure if that's a separate thing.

AFAIU the responsibilities of the nvptx maintainer are:
- the nvptx backend (under supervision of the global maintainers)
- and anything nvptx-y in all other components (under supervision of the
  component and global maintainers)

So, I'd say I'm on the hook to review patches for the nvptx plugin in
libgomp.

Thanks,
- Tom
Tom de Vries July 30, 2018, 10:16 a.m. | #12
On 07/11/2018 09:13 PM, Cesar Philippidis wrote:
> 2018-07-XX  Cesar Philippidis  <cesar@codesourcery.com>
> 	    Tom de Vries  <tdevries@suse.de>
>
> 	gcc/
> 	* config/nvptx/nvptx.c (PTX_GANG_DEFAULT): Rename to ...
> 	(PTX_DEFAULT_RUNTIME_DIM): ... this.
> 	(nvptx_goacc_validate_dims): Set default worker and gang dims to
> 	PTX_DEFAULT_RUNTIME_DIM.
> 	(nvptx_dim_limit): Ignore GOMP_DIM_WORKER;

That's an independent patch.

Committed as below.

Thanks,
- Tom
[nvptx, offloading] Determine default workers at runtime

Currently, if the user doesn't specify the number of workers for an openacc
region, the compiler hardcodes it to a default value.

This patch removes this functionality, such that the libgomp runtime can decide
on a default value.

2018-07-27  Cesar Philippidis  <cesar@codesourcery.com>
	    Tom de Vries  <tdevries@suse.de>

	* config/nvptx/nvptx.c (PTX_GANG_DEFAULT): Rename to ...
	(PTX_DEFAULT_RUNTIME_DIM): ... this.
	(nvptx_goacc_validate_dims): Set default worker and gang dims to
	PTX_DEFAULT_RUNTIME_DIM.
	(nvptx_dim_limit): Ignore GOMP_DIM_WORKER.

---
 gcc/config/nvptx/nvptx.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 5608bee8a8d..c1946e75f42 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -5165,7 +5165,7 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
 /* Define dimension sizes for known hardware.  */
 #define PTX_VECTOR_LENGTH 32
 #define PTX_WORKER_LENGTH 32
-#define PTX_GANG_DEFAULT  0 /* Defer to runtime.  */
+#define PTX_DEFAULT_RUNTIME_DIM 0 /* Defer to runtime.  */
 
 /* Implement TARGET_SIMT_VF target hook: number of threads in a warp.  */
 
@@ -5214,9 +5214,9 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
     {
       dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH;
       if (dims[GOMP_DIM_WORKER] < 0)
-	dims[GOMP_DIM_WORKER] = PTX_WORKER_LENGTH;
+	dims[GOMP_DIM_WORKER] = PTX_DEFAULT_RUNTIME_DIM;
       if (dims[GOMP_DIM_GANG] < 0)
-	dims[GOMP_DIM_GANG] = PTX_GANG_DEFAULT;
+	dims[GOMP_DIM_GANG] = PTX_DEFAULT_RUNTIME_DIM;
       changed = true;
     }
 
@@ -5230,9 +5230,6 @@ nvptx_dim_limit (int axis)
 {
   switch (axis)
     {
-    case GOMP_DIM_WORKER:
-      return PTX_WORKER_LENGTH;
-
     case GOMP_DIM_VECTOR:
       return PTX_VECTOR_LENGTH;

Patch

2018-06-20  Cesar Philippidis  <cesar@codesourcery.com>

        gcc/
        * config/nvptx/nvptx.c (PTX_GANG_DEFAULT): Delete define.
        (PTX_DEFAULT_RUNTIME_DIM): New define.
        (nvptx_goacc_validate_dims): Use it to allow the runtime to
        dynamically allocate num_workers and num_gangs.
        (nvptx_dim_limit): Don't impose an arbitrary num_workers.

        libgomp/
        * plugin/plugin-nvptx.c (struct ptx_device): Add
        max_threads_per_block, warp_size, max_threads_per_multiprocessor,
        max_shared_memory_per_multiprocessor, binary_version,
        register_allocation_unit_size, register_allocation_granularity,
        compute_capability_major, compute_capability_minor members.
        (nvptx_open_device): Probe driver for those values.  Adjust
        regs_per_sm and max_shared_memory_per_multiprocessor for K80
        hardware.  Dynamically allocate default num_workers.
        (nvptx_exec): Don't probe the CUDA runtime for the hardware
        info.  Use the new variables inside targ_fn_descriptor and
        ptx_device instead.
        (GOMP_OFFLOAD_load_image): Set num_gangs,
        register_allocation_{unit_size,granularity}.  Adjust the
        default num_gangs.  Add diagnostic when the hardware cannot
        support the requested num_workers.
        * plugin/cuda/cuda.h (CUdevice_attribute): Add
        CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR,
        CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR.


diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 5608bee..c1946e7 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -5165,7 +5165,7 @@  nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
 /* Define dimension sizes for known hardware.  */
 #define PTX_VECTOR_LENGTH 32
 #define PTX_WORKER_LENGTH 32
-#define PTX_GANG_DEFAULT  0 /* Defer to runtime.  */
+#define PTX_DEFAULT_RUNTIME_DIM 0 /* Defer to runtime.  */
 
 /* Implement TARGET_SIMT_VF target hook: number of threads in a warp.  */
 
@@ -5214,9 +5214,9 @@  nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
     {
       dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH;
       if (dims[GOMP_DIM_WORKER] < 0)
-	dims[GOMP_DIM_WORKER] = PTX_WORKER_LENGTH;
+	dims[GOMP_DIM_WORKER] = PTX_DEFAULT_RUNTIME_DIM;
       if (dims[GOMP_DIM_GANG] < 0)
-	dims[GOMP_DIM_GANG] = PTX_GANG_DEFAULT;
+	dims[GOMP_DIM_GANG] = PTX_DEFAULT_RUNTIME_DIM;
       changed = true;
     }
 
@@ -5230,9 +5230,6 @@  nvptx_dim_limit (int axis)
 {
   switch (axis)
     {
-    case GOMP_DIM_WORKER:
-      return PTX_WORKER_LENGTH;
-
     case GOMP_DIM_VECTOR:
       return PTX_VECTOR_LENGTH;
 
diff --git a/libgomp/plugin/cuda/cuda.h b/libgomp/plugin/cuda/cuda.h
index 4799825..c7d50db 100644
--- a/libgomp/plugin/cuda/cuda.h
+++ b/libgomp/plugin/cuda/cuda.h
@@ -69,6 +69,8 @@  typedef enum {
   CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS = 31,
   CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,
   CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,
+  CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR = 75,
+  CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR = 76,
   CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
 } CUdevice_attribute;
 
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 89326e5..ada1df2 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -409,11 +409,25 @@  struct ptx_device
   bool map;
   bool concur;
   bool mkern;
-  int  mode;
+  int mode;
+  int compute_capability_major;
+  int compute_capability_minor;
   int clock_khz;
   int num_sms;
   int regs_per_block;
   int regs_per_sm;
+  int max_threads_per_block;
+  int warp_size;
+  int max_threads_per_multiprocessor;
+  int max_shared_memory_per_multiprocessor;
+
+  int binary_version;
+
+  /* register_allocation_unit_size and register_allocation_granularity
+     were extracted from the "Register Allocation Granularity" in
+     Nvidia's CUDA Occupancy Calculator spreadsheet.  */
+  int register_allocation_unit_size;
+  int register_allocation_granularity;
 
   struct ptx_image_data *images;  /* Images loaded on device.  */
   pthread_mutex_t image_lock;     /* Lock for above list.  */
@@ -725,6 +739,9 @@  nvptx_open_device (int n)
   ptx_dev->ord = n;
   ptx_dev->dev = dev;
   ptx_dev->ctx_shared = false;
+  ptx_dev->binary_version = 0;
+  ptx_dev->register_allocation_unit_size = 0;
+  ptx_dev->register_allocation_granularity = 0;
 
   r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &ctx_dev);
   if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
@@ -765,6 +782,14 @@  nvptx_open_device (int n)
   ptx_dev->mode = pi;
 
   CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+		  &pi, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev);
+  ptx_dev->compute_capability_major = pi;
+
+  CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+		  &pi, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev);
+  ptx_dev->compute_capability_minor = pi;
+
+  CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
 		  &pi, CU_DEVICE_ATTRIBUTE_INTEGRATED, dev);
   ptx_dev->mkern = pi;
 
@@ -794,13 +819,28 @@  nvptx_open_device (int n)
   ptx_dev->regs_per_sm = pi;
 
   CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+		  &pi, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, dev);
+  ptx_dev->max_threads_per_block = pi;
+
+  CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
 		  &pi, CU_DEVICE_ATTRIBUTE_WARP_SIZE, dev);
+  ptx_dev->warp_size = pi;
   if (pi != 32)
     {
       GOMP_PLUGIN_error ("Only warp size 32 is supported");
       return NULL;
     }
 
+  CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+		  &pi, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
+  ptx_dev->max_threads_per_multiprocessor = pi;
+
+  CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+		  &pi,
+		  CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR,
+		  dev);
+  ptx_dev->max_shared_memory_per_multiprocessor = pi;
+
   r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &async_engines,
 			 CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
   if (r != CUDA_SUCCESS)
@@ -809,6 +849,39 @@  nvptx_open_device (int n)
   ptx_dev->images = NULL;
   pthread_mutex_init (&ptx_dev->image_lock, NULL);
 
+  GOMP_PLUGIN_debug (0, "Nvidia device %d:\n\tGPU_OVERLAP = %d\n"
+		     "\tCAN_MAP_HOST_MEMORY = %d\n\tCONCURRENT_KERNELS = %d\n"
+		     "\tCOMPUTE_MODE = %d\n\tINTEGRATED = %d\n"
+		     "\tCU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR = %d\n"
+		     "\tCU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR = %d\n"
+		     "\tINTEGRATED = %d\n"
+		     "\tMAX_THREADS_PER_BLOCK = %d\n\tWARP_SIZE = %d\n"
+		     "\tMULTIPROCESSOR_COUNT = %d\n"
+		     "\tMAX_THREADS_PER_MULTIPROCESSOR = %d\n"
+		     "\tMAX_REGISTERS_PER_MULTIPROCESSOR = %d\n"
+		     "\tMAX_SHARED_MEMORY_PER_MULTIPROCESSOR = %d\n",
+		     ptx_dev->ord, ptx_dev->overlap, ptx_dev->map,
+		     ptx_dev->concur, ptx_dev->mode, ptx_dev->mkern,
+		     ptx_dev->compute_capability_major,
+		     ptx_dev->compute_capability_minor,
+		     ptx_dev->mkern, ptx_dev->max_threads_per_block,
+		     ptx_dev->warp_size, ptx_dev->num_sms,
+		     ptx_dev->max_threads_per_multiprocessor,
+		     ptx_dev->regs_per_sm,
+		     ptx_dev->max_shared_memory_per_multiprocessor);
+
+  /* K80 (SM_37) boards contain two physical GPUs.  Consequently they
+     report 2x larger values for MAX_REGISTERS_PER_MULTIPROCESSOR and
+     MAX_SHARED_MEMORY_PER_MULTIPROCESSOR.  Those values need to be
+     adjusted in order to allow nvptx_exec to select an
+     appropriate num_workers.  */
+  if (ptx_dev->compute_capability_major == 3
+      && ptx_dev->compute_capability_minor == 7)
+    {
+      ptx_dev->regs_per_sm /= 2;
+      ptx_dev->max_shared_memory_per_multiprocessor /= 2;
+    }
+
   if (!init_streams_for_device (ptx_dev, async_engines))
     return NULL;
 
@@ -1120,6 +1193,14 @@  nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
   void *hp, *dp;
   struct nvptx_thread *nvthd = nvptx_thread ();
   const char *maybe_abort_msg = "(perhaps abort was called)";
+  int cpu_size = nvptx_thread ()->ptx_dev->max_threads_per_multiprocessor;
+  int block_size = nvptx_thread ()->ptx_dev->max_threads_per_block;
+  int dev_size = nvptx_thread ()->ptx_dev->num_sms;
+  int warp_size = nvptx_thread ()->ptx_dev->warp_size;
+  int rf_size = nvptx_thread ()->ptx_dev->regs_per_sm;
+  int reg_unit_size = nvptx_thread ()->ptx_dev->register_allocation_unit_size;
+  int reg_granularity
+    = nvptx_thread ()->ptx_dev->register_allocation_granularity;
 
   function = targ_fn->fn;
 
@@ -1138,71 +1219,92 @@  nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
        seen_zero = 1;
     }
 
-  if (seen_zero)
-    {
-      /* See if the user provided GOMP_OPENACC_DIM environment
-	 variable to specify runtime defaults. */
-      static int default_dims[GOMP_DIM_MAX];
+  /* Calculate the optimal number of gangs for the current device.  */
+  int reg_used = targ_fn->regs_per_thread;
+  int reg_per_warp = ((reg_used * warp_size + reg_unit_size - 1)
+		      / reg_unit_size) * reg_unit_size;
+  int threads_per_sm = (rf_size / reg_per_warp / reg_granularity)
+    * reg_granularity * warp_size;
+  int threads_per_block = threads_per_sm > block_size
+    ? block_size : threads_per_sm;
 
-      pthread_mutex_lock (&ptx_dev_lock);
-      if (!default_dims[0])
-	{
-	  for (int i = 0; i < GOMP_DIM_MAX; ++i)
-	    default_dims[i] = GOMP_PLUGIN_acc_default_dim (i);
-
-	  int warp_size, block_size, dev_size, cpu_size;
-	  CUdevice dev = nvptx_thread()->ptx_dev->dev;
-	  /* 32 is the default for known hardware.  */
-	  int gang = 0, worker = 32, vector = 32;
-	  CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm;
-
-	  cu_tpb = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK;
-	  cu_ws = CU_DEVICE_ATTRIBUTE_WARP_SIZE;
-	  cu_mpc = CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT;
-	  cu_tpm  = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR;
-
-	  if (CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &block_size, cu_tpb,
-				 dev) == CUDA_SUCCESS
-	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &warp_size, cu_ws,
-				    dev) == CUDA_SUCCESS
-	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &dev_size, cu_mpc,
-				    dev) == CUDA_SUCCESS
-	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &cpu_size, cu_tpm,
-				    dev) == CUDA_SUCCESS)
-	    {
-	      GOMP_PLUGIN_debug (0, " warp_size=%d, block_size=%d,"
-				 " dev_size=%d, cpu_size=%d\n",
-				 warp_size, block_size, dev_size, cpu_size);
-	      gang = (cpu_size / block_size) * dev_size;
-	      worker = block_size / warp_size;
-	      vector = warp_size;
-	    }
+  threads_per_block /= warp_size;
 
-	  /* There is no upper bound on the gang size.  The best size
-	     matches the hardware configuration.  Logical gangs are
-	     scheduled onto physical hardware.  To maximize usage, we
-	     should guess a large number.  */
-	  if (default_dims[GOMP_DIM_GANG] < 1)
-	    default_dims[GOMP_DIM_GANG] = gang ? gang : 1024;
-	  /* The worker size must not exceed the hardware.  */
-	  if (default_dims[GOMP_DIM_WORKER] < 1
-	      || (default_dims[GOMP_DIM_WORKER] > worker && gang))
-	    default_dims[GOMP_DIM_WORKER] = worker;
-	  /* The vector size must exactly match the hardware.  */
-	  if (default_dims[GOMP_DIM_VECTOR] < 1
-	      || (default_dims[GOMP_DIM_VECTOR] != vector && gang))
-	    default_dims[GOMP_DIM_VECTOR] = vector;
-
-	  GOMP_PLUGIN_debug (0, " default dimensions [%d,%d,%d]\n",
-			     default_dims[GOMP_DIM_GANG],
-			     default_dims[GOMP_DIM_WORKER],
-			     default_dims[GOMP_DIM_VECTOR]);
-	}
-      pthread_mutex_unlock (&ptx_dev_lock);
+  if (threads_per_sm > cpu_size)
+    threads_per_sm = cpu_size;
 
+  /* Set default launch geometry.  */
+  static int default_dims[GOMP_DIM_MAX];
+  pthread_mutex_lock (&ptx_dev_lock);
+  if (!default_dims[0])
+    {
+      /* 32 is the default for known hardware.  */
+      int gang = 0, worker = 32, vector = 32;
+
+      gang = (cpu_size / block_size) * dev_size;
+      vector = warp_size;
+
+      /* If the user hasn't specified the number of gangs, determine
+	 it dynamically based on the hardware configuration.  */
+      if (default_dims[GOMP_DIM_GANG] == 0)
+	default_dims[GOMP_DIM_GANG] = -1;
+      /* The worker size must not exceed the hardware.  */
+      if (default_dims[GOMP_DIM_WORKER] < 1
+	  || (default_dims[GOMP_DIM_WORKER] > worker && gang))
+	default_dims[GOMP_DIM_WORKER] = -1;
+      /* The vector size must exactly match the hardware.  */
+      if (default_dims[GOMP_DIM_VECTOR] < 1
+	  || (default_dims[GOMP_DIM_VECTOR] != vector && gang))
+	default_dims[GOMP_DIM_VECTOR] = vector;
+
+      GOMP_PLUGIN_debug (0, " default dimensions [%d,%d,%d]\n",
+			 default_dims[GOMP_DIM_GANG],
+			 default_dims[GOMP_DIM_WORKER],
+			 default_dims[GOMP_DIM_VECTOR]);
+    }
+  pthread_mutex_unlock (&ptx_dev_lock);
+
+  if (seen_zero)
+    {
       for (i = 0; i != GOMP_DIM_MAX; i++)
-	if (!dims[i])
-	  dims[i] = default_dims[i];
+	if (!dims[i])
+	  {
+	    if (default_dims[i] > 0)
+	      dims[i] = default_dims[i];
+	    else
+	      switch (i) {
+	      case GOMP_DIM_GANG:
+		/* The constant 2 was chosen empirically.  The justification
+		   behind it is to prevent the hardware from idling by
+		   throwing twice the amount of work that it can
+		   physically handle.  */
+		dims[i] = (reg_granularity > 0)
+		  ? 2 * threads_per_sm / warp_size * dev_size
+		  : 2 * dev_size;
+		break;
+	      case GOMP_DIM_WORKER:
+		dims[i] = threads_per_block;
+		break;
+	      case GOMP_DIM_VECTOR:
+		dims[i] = warp_size;
+		break;
+	      default:
+		abort ();
+	      }
+	  }
+    }
+
+  /* Check if the accelerator has sufficient hardware resources to
+     launch the offloaded kernel.  */
+  if (dims[GOMP_DIM_WORKER] > 1)
+    {
+      if (reg_granularity > 0 && dims[GOMP_DIM_WORKER] > threads_per_block)
+	GOMP_PLUGIN_fatal ("The Nvidia accelerator has insufficient resources "
+			   "to launch '%s'; recompile the program with "
+			   "'num_workers = %d' on that offloaded region or "
+			   "'-fopenacc-dim=-:%d'.\n",
+			   targ_fn->launch->fn, threads_per_block,
+			   threads_per_block);
     }
 
   /* This reserves a chunk of a pre-allocated page of memory mapped on both
@@ -1870,6 +1972,39 @@  GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
       targ_fns->regs_per_thread = nregs;
       targ_fns->max_threads_per_block = mthrs;
 
+      if (!dev->binary_version)
+	{
+	  int val;
+	  CUDA_CALL_ERET (-1, cuFuncGetAttribute, &val,
+			  CU_FUNC_ATTRIBUTE_BINARY_VERSION, function);
+	  dev->binary_version = val;
+
+	  /* These values were obtained from the CUDA Occupancy Calculator
+	     spreadsheet.  */
+	  if (dev->binary_version == 20
+	      || dev->binary_version == 21)
+	    {
+	      dev->register_allocation_unit_size = 128;
+	      dev->register_allocation_granularity = 2;
+	    }
+	  else if (dev->binary_version == 60)
+	    {
+	      dev->register_allocation_unit_size = 256;
+	      dev->register_allocation_granularity = 2;
+	    }
+	  else if (dev->binary_version <= 70)
+	    {
+	      dev->register_allocation_unit_size = 256;
+	      dev->register_allocation_granularity = 4;
+	    }
+	  else
+	    {
+	      /* Fall back to -1 for unknown targets.  */
+	      dev->register_allocation_unit_size = -1;
+	      dev->register_allocation_granularity = -1;
+	    }
+	}
+
       targ_tbl->start = (uintptr_t) targ_fns;
       targ_tbl->end = targ_tbl->start + 1;
     }