[nvptx] Use CUDA driver API to select default runtime launch, geometry

Message ID 791625c9-911f-972d-ed4e-746dc5fe5f43@codesourcery.com

Commit Message

Cesar Philippidis July 31, 2018, 2:58 p.m.
The attached patch teaches libgomp how to use the CUDA thread occupancy
calculator built into the CUDA driver. Despite both being based on the
CUDA thread occupancy spreadsheet distributed with CUDA, the built-in
occupancy calculator differs from the occupancy calculator in og8 in two
key ways. First, og8 launches twice the number of gangs as the driver
thread occupancy calculator. This was my attempt at preventing threads
from idling; it operates on a principle similar to running 'make
-jN', where N is twice the number of CPU threads. Second, whereas og8
always attempts to maximize the CUDA block size, the driver may select a
smaller block, which effectively decreases num_workers.

In terms of performance, there isn't much of a difference between the
CUDA driver's occupancy calculator and og8's. On the tests that are
affected, the two are generally within a factor of two of one another,
with some tests running faster with the driver occupancy calculator and
others with og8's.

Unfortunately, support for the CUDA driver API isn't universal; it's
only available in CUDA version 6.5 (or 6050) and newer. In this patch,
I'm exploiting the fact that init_cuda_lib only checks for errors on the
last library function initialized. Therefore it guards the usage of

  cuOccupancyMaxPotentialBlockSizeWithFlags

by checking driver_version. If the driver occupancy calculator isn't
available, it falls back to the existing defaults. Maybe the og8 thread
occupancy would make a better default for older versions of CUDA, but
that's a patch for another day.

Is this patch OK for trunk? I bootstrapped and regression tested it
using x86_64 with nvptx offloading.

Thanks,
Cesar

Comments

Tom de Vries Aug. 1, 2018, 10:18 a.m. | #1
On 07/31/2018 04:58 PM, Cesar Philippidis wrote:
> The attached patch teaches libgomp how to use the CUDA thread occupancy
> calculator built into the CUDA driver. Despite both being based off the
> CUDA thread occupancy spreadsheet distributed with CUDA, the built in
> occupancy calculator differs from the occupancy calculator in og8 in two
> key ways. First, og8 launches twice the number of gangs as the driver
> thread occupancy calculator. This was my attempt at preventing threads
> from idling, and it operating on a similar principle of running 'make
> -jN', where N is twice the number of CPU threads.

You're saying the two methods are different, and that the difference
between the two methods is a factor two, which is a heuristic you added
yourself on top of one of the methods, which implies that in fact the
two methods are identical. Is my understanding correct here?

> Second, whereas og8
> always attempts to maximize the CUDA block size, the driver may select a
> smaller block, which effectively decreases num_workers.

So, do I understand it correctly that using the function
cuOccupancyMaxPotentialBlockSize gives us "minimum block size that can
achieve the maximum occupancy" or some such and og8 gives us "maximum
block size"?

> In terms of performance, there really isn't that much of a difference
> between the CUDA driver's occupancy calculator and og8's. However, on
> the tests that are impacted, they are generally within a factor of two
> from one another, with some tests running faster with the driver
> occupancy calculator and others with og8's.

Ack. Well, until we understand that in more detail, going with the
driver's occupancy calculator seems the right thing to do.

> Unfortunately, support for the CUDA driver API isn't universal; it's
> only available in CUDA version 6.5 (or 6050) and newer. In this patch,
> I'm exploiting the fact that init_cuda_lib only checks for errors on the
> last library function initialized.

That sounds incorrect to me. In init_cuda_lib I see:
...
# define CUDA_ONE_CALL(call) CUDA_ONE_CALL_1 (call)
# define CUDA_ONE_CALL_1(call) \
  cuda_lib.call = dlsym (h, #call);     \
  if (cuda_lib.call == NULL)            \
    return false;
  CUDA_CALLS
...
so in fact every library function is checked. Have you tested this with
pre 6-5 cuda?

I think we need to add and handle:
...
  CUDA_ONE_CALL_MAYBE_NULL (cuOccupancyMaxPotentialBlockSize)
...

> Therefore it guards the usage of
>
>   cuOccupancyMaxPotentialBlockSizeWithFlags
>
> by checking driver_version.

If we allow the cuOccupancyMaxPotentialBlockSize field to be NULL, we
can test for NULL, which seems a simpler solution than testing the version.

> If the driver occupancy calculator isn't
> available, it falls back to the existing defaults. Maybe the og8 thread
> occupancy would make a better default for older versions of CUDA, but
> that's a patch for another day.

Agreed.

> Is this patch OK for trunk?

The patch doesn't build in a setup with
--enable-offload-targets=nvptx-none and without cuda, that enables usage
of plugin/cuda/cuda.h:
...
/data/offload-nvptx/src/libgomp/plugin/plugin-nvptx.c:98:16: error:
‘cuOccupancyMaxPotentialBlockSize’ undeclared here (not in a function);
did you mean ‘cuOccupancyMaxPotentialBlockSizeWithFlags’?
 CUDA_ONE_CALL (cuOccupancyMaxPotentialBlockSize) \
...

> @@ -1220,11 +1227,39 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>
>        {
>  	bool default_dim_p[GOMP_DIM_MAX];
> +	int vectors = nvthd->ptx_dev->default_dims[GOMP_DIM_VECTOR];
> +	int workers = nvthd->ptx_dev->default_dims[GOMP_DIM_WORKER];
> +	int gangs = nvthd->ptx_dev->default_dims[GOMP_DIM_GANG];
> +
> +	/* The CUDA driver occupancy calculator is only available on
> +	   CUDA version 6.5 (6050) and newer.  */
> +	if (nvthd->ptx_dev->driver_version > 6050)
> +	  {
> +	    int grids, blocks;
> +	    CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,
> +			      &blocks, function, NULL, 0,
> +			      dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);
> +	    GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "
> +			       "grid = %d, block = %d\n", grids, blocks);
> +
> +	    if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_GANG) == 0)

You should use gomp_openacc_dims[0].

> +	      gangs = grids * (blocks / warp_size);

So, we launch with gangs == grids * workers ? Is that intentional?

> +
> +	    if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_WORKER) == 0)
> +	      workers = blocks / vectors;

Also, the new default calculation is not nicely separated from the
fallback default calculation.  I've updated the patch with a cleaner
separation, attached and built without cuda, but untested.

Thanks,
- Tom
diff --git a/libgomp/plugin/cuda/cuda.h b/libgomp/plugin/cuda/cuda.h
index 4799825bda2..67a475b695e 100644
--- a/libgomp/plugin/cuda/cuda.h
+++ b/libgomp/plugin/cuda/cuda.h
@@ -44,6 +44,7 @@ typedef void *CUevent;
 typedef void *CUfunction;
 typedef void *CUlinkState;
 typedef void *CUmodule;
+typedef size_t (*CUoccupancyB2DSize)(int);
 typedef void *CUstream;
 
 typedef enum {
@@ -170,6 +171,9 @@ CUresult cuModuleGetGlobal (CUdeviceptr *, size_t *, CUmodule, const char *);
 CUresult cuModuleLoad (CUmodule *, const char *);
 CUresult cuModuleLoadData (CUmodule *, const void *);
 CUresult cuModuleUnload (CUmodule);
+CUresult cuOccupancyMaxPotentialBlockSize (int *, int *, CUfunction,
+					   CUoccupancyB2DSize, size_t,
+					   int);
 CUresult cuStreamCreate (CUstream *, unsigned);
 #define cuStreamDestroy cuStreamDestroy_v2
 CUresult cuStreamDestroy (CUstream);
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index b6ec5f88d59..47518f1a73f 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -94,6 +94,7 @@ CUDA_ONE_CALL (cuModuleGetGlobal)	\
 CUDA_ONE_CALL (cuModuleLoad)		\
 CUDA_ONE_CALL (cuModuleLoadData)	\
 CUDA_ONE_CALL (cuModuleUnload)		\
+CUDA_ONE_CALL_MAYBE_NULL (cuOccupancyMaxPotentialBlockSize) \
 CUDA_ONE_CALL (cuStreamCreate)		\
 CUDA_ONE_CALL (cuStreamDestroy)		\
 CUDA_ONE_CALL (cuStreamQuery)		\
@@ -101,6 +102,7 @@ CUDA_ONE_CALL (cuStreamSynchronize)	\
 CUDA_ONE_CALL (cuStreamWaitEvent)
 # define CUDA_ONE_CALL(call) \
   __typeof (call) *call;
+# define CUDA_ONE_CALL_MAYBE_NULL(call) CUDA_ONE_CALL(call)
 struct cuda_lib_s {
   CUDA_CALLS
 } cuda_lib;
@@ -122,10 +124,12 @@ init_cuda_lib (void)
   if (h == NULL)
     return false;
 # undef CUDA_ONE_CALL
-# define CUDA_ONE_CALL(call) CUDA_ONE_CALL_1 (call)
-# define CUDA_ONE_CALL_1(call) \
+# undef CUDA_ONE_CALL_MAYBE_NULL
+# define CUDA_ONE_CALL(call) CUDA_ONE_CALL_1 (call, true)
+# define CUDA_ONE_CALL_MAYBE_NULL(call) CUDA_ONE_CALL_1 (call, false)
+# define CUDA_ONE_CALL_1(call, check)		\
   cuda_lib.call = dlsym (h, #call);	\
-  if (cuda_lib.call == NULL)		\
+  if (check && cuda_lib.call == NULL)		\
     return false;
   CUDA_CALLS
   cuda_lib_inited = true;
@@ -1221,10 +1225,44 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
       {
 	bool default_dim_p[GOMP_DIM_MAX];
 	for (i = 0; i != GOMP_DIM_MAX; i++)
+	  default_dim_p[i] = !dims[i];
+
+	if (cuda_lib.cuOccupancyMaxPotentialBlockSize == NULL)
+	  {
+	    for (i = 0; i != GOMP_DIM_MAX; i++)
+	      if (default_dim_p[i])
+		dims[i] = nvthd->ptx_dev->default_dims[i];
+	  }
+	else
 	  {
-	    default_dim_p[i] = !dims[i];
-	    if (default_dim_p[i])
-	      dims[i] = nvthd->ptx_dev->default_dims[i];
+	    int vectors = gomp_openacc_dims[GOMP_DIM_VECTOR];
+	    int workers = gomp_openacc_dims[GOMP_DIM_WORKER];
+	    int gangs = gomp_openacc_dims[GOMP_DIM_GANG];
+	    int grids, blocks;
+	    CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,
+			      &blocks, function, NULL, 0,
+			      dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);
+	    GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "
+			       "grid = %d, block = %d\n", grids, blocks);
+
+	    if (vectors == 0)
+	      vectors = warp_size;
+
+	    if (workers == 0)
+	      workers = blocks / vectors;
+
+	    if (gangs == 0)
+	      gangs = grids * workers;
+
+	    for (i = 0; i != GOMP_DIM_MAX; i++)
+	      if (default_dim_p[i])
+		switch (i)
+		  {
+		  case GOMP_DIM_GANG: dims[i] = gangs; break;
+		  case GOMP_DIM_WORKER: dims[i] = workers; break;
+		  case GOMP_DIM_VECTOR: dims[i] = vectors; break;
+		  default: GOMP_PLUGIN_fatal ("invalid dim");
+		  }
 	  }
 
 	if (default_dim_p[GOMP_DIM_VECTOR])
Cesar Philippidis Aug. 1, 2018, 2:01 p.m. | #2
On 08/01/2018 03:18 AM, Tom de Vries wrote:
> On 07/31/2018 04:58 PM, Cesar Philippidis wrote:
>> The attached patch teaches libgomp how to use the CUDA thread occupancy
>> calculator built into the CUDA driver. Despite both being based off the
>> CUDA thread occupancy spreadsheet distributed with CUDA, the built in
>> occupancy calculator differs from the occupancy calculator in og8 in two
>> key ways. First, og8 launches twice the number of gangs as the driver
>> thread occupancy calculator. This was my attempt at preventing threads
>> from idling, and it operating on a similar principle of running 'make
>> -jN', where N is twice the number of CPU threads.
>
> You're saying the two methods are different, and that the difference
> between the two methods is a factor two, which is a heuristic you added
> yourself on top of one of the methods, which implies that in fact the
> two methods are identical. Is my understanding correct here?

With the exception that og8 multiplies num_gangs by a factor of
two, those two algorithms are identical, at least with respect to gangs.

>> Second, whereas og8
>> always attempts to maximize the CUDA block size, the driver may select a
>> smaller block, which effectively decreases num_workers.
>
> So, do I understand it correctly that using the function
> cuOccupancyMaxPotentialBlockSize gives us "minimum block size that can
> achieve the maximum occupancy" or some such and og8 gives us "maximum
> block size"?

Correct.

>> In terms of performance, there really isn't that much of a difference
>> between the CUDA driver's occupancy calculator and og8's. However, on
>> the tests that are impacted, they are generally within a factor of two
>> from one another, with some tests running faster with the driver
>> occupancy calculator and others with og8's.
>
> Ack. Well, until we understand that in more detail, going with the
> driver's occupancy calculator seems the right thing to do.
>
>> Unfortunately, support for the CUDA driver API isn't universal; it's
>> only available in CUDA version 6.5 (or 6050) and newer. In this patch,
>> I'm exploiting the fact that init_cuda_lib only checks for errors on the
>> last library function initialized.
>
> That sounds incorrect to me. In init_cuda_lib I see:
> ...
> # define CUDA_ONE_CALL(call) CUDA_ONE_CALL_1 (call)
> # define CUDA_ONE_CALL_1(call) \
>   cuda_lib.call = dlsym (h, #call);     \
>   if (cuda_lib.call == NULL)            \
>     return false;
>   CUDA_CALLS
> ...
> so in fact every library function is checked. Have you tested this with
> pre 6-5 cuda?

I misread that. You're correct. So far, I've only tested this out with
CUDA 9.

> I think we need to add and handle:
> ...
>   CUDA_ONE_CALL_MAYBE_NULL (cuOccupancyMaxPotentialBlockSize)
> ...
>
>> Therefore it guards the usage of
>>
>>   cuOccupancyMaxPotentialBlockSizeWithFlags
>>
>> by checking driver_version.
>
> If we allow the cuOccupancyMaxPotentialBlockSize field to be NULL, we
> can test for NULL, which seems a simpler solution than testing the version.
>
>> If the driver occupancy calculator isn't
>> available, it falls back to the existing defaults. Maybe the og8 thread
>> occupancy would make a better default for older versions of CUDA, but
>> that's a patch for another day.
>
> Agreed.
>
>> Is this patch OK for trunk?
>
> The patch doesn't build in a setup with
> --enable-offload-targets=nvptx-none and without cuda, that enables usage
> of plugin/cuda/cuda.h:
> ...
> /data/offload-nvptx/src/libgomp/plugin/plugin-nvptx.c:98:16: error:
> ‘cuOccupancyMaxPotentialBlockSize’ undeclared here (not in a function);
> did you mean ‘cuOccupancyMaxPotentialBlockSizeWithFlags’?
>  CUDA_ONE_CALL (cuOccupancyMaxPotentialBlockSize) \
> ...
>
>> @@ -1220,11 +1227,39 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>>
>>        {
>>  	bool default_dim_p[GOMP_DIM_MAX];
>> +	int vectors = nvthd->ptx_dev->default_dims[GOMP_DIM_VECTOR];
>> +	int workers = nvthd->ptx_dev->default_dims[GOMP_DIM_WORKER];
>> +	int gangs = nvthd->ptx_dev->default_dims[GOMP_DIM_GANG];
>> +
>> +	/* The CUDA driver occupancy calculator is only available on
>> +	   CUDA version 6.5 (6050) and newer.  */
>> +	if (nvthd->ptx_dev->driver_version > 6050)
>> +	  {
>> +	    int grids, blocks;
>> +	    CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,
>> +			      &blocks, function, NULL, 0,
>> +			      dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);
>> +	    GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "
>> +			       "grid = %d, block = %d\n", grids, blocks);
>> +
>> +	    if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_GANG) == 0)
>
> You should use gomp_openacc_dims[0].
>
>> +	      gangs = grids * (blocks / warp_size);
>
> So, we launch with gangs == grids * workers ? Is that intentional?

Yes. At least that's what I've been using in og8. Setting num_gangs =
grids alone caused significant slow downs.

>> +
>> +	    if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_WORKER) == 0)
>> +	      workers = blocks / vectors;
>
> Also, the new default calculation is not nicely separated from the
> fallback default calculation.  I've updated the patch with a cleaner
> separation, attached and build without cuda but untested.

That looks nicer. Did you want to take over this patch?

Cesar
Tom de Vries Aug. 1, 2018, 2:12 p.m. | #3
On 08/01/2018 04:01 PM, Cesar Philippidis wrote:
> On 08/01/2018 03:18 AM, Tom de Vries wrote:
>> On 07/31/2018 04:58 PM, Cesar Philippidis wrote:
>>> The attached patch teaches libgomp how to use the CUDA thread occupancy
>>> calculator built into the CUDA driver. Despite both being based off the
>>> CUDA thread occupancy spreadsheet distributed with CUDA, the built in
>>> occupancy calculator differs from the occupancy calculator in og8 in two
>>> key ways. First, og8 launches twice the number of gangs as the driver
>>> thread occupancy calculator. This was my attempt at preventing threads
>>> from idling, and it operating on a similar principle of running 'make
>>> -jN', where N is twice the number of CPU threads.
>>
>> You're saying the two methods are different, and that the difference
>> between the two methods is a factor two, which is a heuristic you added
>> yourself on top of one of the methods, which implies that in fact the
>> two methods are identical. Is my understanding correct here?
>
> With the exception being that og8 multiples num_gangs by a factor of
> two, those two algorithms are identical, at least with respect to gangs.
>
>>> Second, whereas og8
>>> always attempts to maximize the CUDA block size, the driver may select a
>>> smaller block, which effectively decreases num_workers.
>>
>> So, do I understand it correctly that using the function
>> cuOccupancyMaxPotentialBlockSize gives us "minimum block size that can
>> achieve the maximum occupancy" or some such and og8 gives us "maximum
>> block size"?
>
> Correct.
>
>>> In terms of performance, there really isn't that much of a difference
>>> between the CUDA driver's occupancy calculator and og8's. However, on
>>> the tests that are impacted, they are generally within a factor of two
>>> from one another, with some tests running faster with the driver
>>> occupancy calculator and others with og8's.
>>
>> Ack. Well, until we understand that in more detail, going with the
>> driver's occupancy calculator seems the right thing to do.
>>
>>> Unfortunately, support for the CUDA driver API isn't universal; it's
>>> only available in CUDA version 6.5 (or 6050) and newer. In this patch,
>>> I'm exploiting the fact that init_cuda_lib only checks for errors on the
>>> last library function initialized.
>>
>> That sounds incorrect to me. In init_cuda_lib I see:
>> ...
>> # define CUDA_ONE_CALL(call) CUDA_ONE_CALL_1 (call)
>> # define CUDA_ONE_CALL_1(call) \
>>   cuda_lib.call = dlsym (h, #call);     \
>>   if (cuda_lib.call == NULL)            \
>>     return false;
>>   CUDA_CALLS
>> ...
>> so in fact every library function is checked. Have you tested this with
>> pre 6-5 cuda?
>
> I misread that. You're correct. So far, I've only tested this out with
> CUDA 9.
>
>> I think we need to add and handle:
>> ...
>>   CUDA_ONE_CALL_MAYBE_NULL (cuOccupancyMaxPotentialBlockSize)
>> ...
>>
>>> Therefore it guards the usage of
>>>
>>>   cuOccupancyMaxPotentialBlockSizeWithFlags
>>>
>>> by checking driver_version.
>>
>> If we allow the cuOccupancyMaxPotentialBlockSize field to be NULL, we
>> can test for NULL, which seems a simpler solution than testing the version.
>>
>>> If the driver occupancy calculator isn't
>>> available, it falls back to the existing defaults. Maybe the og8 thread
>>> occupancy would make a better default for older versions of CUDA, but
>>> that's a patch for another day.
>>
>> Agreed.
>>
>>> Is this patch OK for trunk?
>>
>> The patch doesn't build in a setup with
>> --enable-offload-targets=nvptx-none and without cuda, that enables usage
>> of plugin/cuda/cuda.h:
>> ...
>> /data/offload-nvptx/src/libgomp/plugin/plugin-nvptx.c:98:16: error:
>> ‘cuOccupancyMaxPotentialBlockSize’ undeclared here (not in a function);
>> did you mean ‘cuOccupancyMaxPotentialBlockSizeWithFlags’?
>>  CUDA_ONE_CALL (cuOccupancyMaxPotentialBlockSize) \
>> ...
>>
>>> @@ -1220,11 +1227,39 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>>>
>>>        {
>>>  	bool default_dim_p[GOMP_DIM_MAX];
>>> +	int vectors = nvthd->ptx_dev->default_dims[GOMP_DIM_VECTOR];
>>> +	int workers = nvthd->ptx_dev->default_dims[GOMP_DIM_WORKER];
>>> +	int gangs = nvthd->ptx_dev->default_dims[GOMP_DIM_GANG];
>>> +
>>> +	/* The CUDA driver occupancy calculator is only available on
>>> +	   CUDA version 6.5 (6050) and newer.  */
>>> +	if (nvthd->ptx_dev->driver_version > 6050)
>>> +	  {
>>> +	    int grids, blocks;
>>> +	    CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,
>>> +			      &blocks, function, NULL, 0,
>>> +			      dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);
>>> +	    GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "
>>> +			       "grid = %d, block = %d\n", grids, blocks);
>>> +
>>> +	    if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_GANG) == 0)
>>
>> You should use gomp_openacc_dims[0].
>>
>>> +	      gangs = grids * (blocks / warp_size);
>>
>> So, we launch with gangs == grids * workers ? Is that intentional?
>
> Yes. At least that's what I've been using in og8. Setting num_gangs =
> grids alone caused significant slow downs.

Well, what you're saying here is: increasing num_gangs increases
performance.

You don't explain why you multiply with workers specifically.

>>> +
>>> +	    if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_WORKER) == 0)
>>> +	      workers = blocks / vectors;
>>
>> Also, the new default calculation is not nicely separated from the
>> fallback default calculation.  I've updated the patch with a cleaner
>> separation, attached and build without cuda but untested.
>
> That looks nicer. Did you want to take over this patch?

No.

Thanks,
- Tom
Cesar Philippidis Aug. 1, 2018, 7:11 p.m. | #4
On 08/01/2018 07:12 AM, Tom de Vries wrote:

>>>> +	      gangs = grids * (blocks / warp_size);
>>>
>>> So, we launch with gangs == grids * workers ? Is that intentional?
>>
>> Yes. At least that's what I've been using in og8. Setting num_gangs =
>> grids alone caused significant slow downs.
>
> Well, what you're saying here is: increasing num_gangs increases
> performance.
>
> You don't explain why you multiply with workers specifically.

I set it that way because I think the occupancy calculator is
determining the occupancy of a single multiprocessor unit, rather than
the entire GPU. Looking at the og8 code again, I had

   num_gangs = 2 * threads_per_sm / warp_size * dev_size

which corresponds to

   2 * grids * blocks / warp_size

Because blocks is generally smaller than threads_per_block, the driver
occupancy calculator ends up launching fewer gangs.

I don't have a firm position on this default behavior. Perhaps we
should just set

  gang = grids

That's probably an improvement over what's there now.

Cesar
Tom de Vries Aug. 3, 2018, 3:22 p.m. | #5
On 08/01/2018 09:11 PM, Cesar Philippidis wrote:
> On 08/01/2018 07:12 AM, Tom de Vries wrote:
>>>>> +	      gangs = grids * (blocks / warp_size);
>>>>
>>>> So, we launch with gangs == grids * workers ? Is that intentional?
>>>
>>> Yes. At least that's what I've been using in og8. Setting num_gangs =
>>> grids alone caused significant slow downs.
>>
>> Well, what you're saying here is: increasing num_gangs increases
>> performance.
>>
>> You don't explain why you multiply with workers specifically.
>
> I set it that way because I think the occupancy calculator is
> determining the occupancy of a single multiprocessor unit, rather than
> the entire GPU. Looking at the og8 code again, I had
>
>    num_gangs = 2 * threads_per_sm / warp_size * dev_size
>
> which corresponds to
>
>    2 * grids * blocks / warp_size

I've done an experiment using the sample simpleOccupancy. The kernel is
small, so the blocks returned is the maximum: max_threads_per_block (1024).

The grids returned is 10, which I tentatively interpret as num_dev *
(max_threads_per_multi_processor / blocks). [ Where num_dev == 5, and
max_threads_per_multi_processor == 2048. ]

Substituting that into the og8 code, and equating
max_threads_per_multi_processor with threads_per_sm, I indeed get

num_gangs = 2 * grids * blocks / warp_size.

So with this extra information I see how you got there.

But I still see no rationale why blocks is used here, and I wonder
whether something like num_gangs = grids * 64 would give similar results.

Anyway, given that this is what is used on og8, I'm ok with using that,
so let's go with:
...
	      gangs = 2 * grids * (blocks / warp_size);
...
[ so, including the factor two you explicitly left out from the original
patch. Unless you see a pressing reason not to include it. ]

Can you repost after retesting? [ note: the updated patch I posted
earlier doesn't apply on trunk anymore due to the cuda-lib.def change. ]

Thanks,
- Tom
Cesar Philippidis Aug. 3, 2018, 3:37 p.m. | #6
On 08/03/2018 08:22 AM, Tom de Vries wrote:
> On 08/01/2018 09:11 PM, Cesar Philippidis wrote:
>> On 08/01/2018 07:12 AM, Tom de Vries wrote:
>>>>>> +	      gangs = grids * (blocks / warp_size);
>>>>>
>>>>> So, we launch with gangs == grids * workers ? Is that intentional?
>>>>
>>>> Yes. At least that's what I've been using in og8. Setting num_gangs =
>>>> grids alone caused significant slow downs.
>>>
>>> Well, what you're saying here is: increasing num_gangs increases
>>> performance.
>>>
>>> You don't explain why you multiply with workers specifically.
>>
>> I set it that way because I think the occupancy calculator is
>> determining the occupancy of a single multiprocessor unit, rather than
>> the entire GPU. Looking at the og8 code again, I had
>>
>>    num_gangs = 2 * threads_per_sm / warp_size * dev_size
>>
>> which corresponds to
>>
>>    2 * grids * blocks / warp_size
>
> I've done an experiment using the sample simpleOccupancy. The kernel is
> small, so the blocks returned is the maximum: max_threads_per_block (1024).
>
> The grids returned is 10, which I tentatively interpret as num_dev *
> (max_threads_per_multi_processor / blocks). [ Where num_dev == 5, and
> max_threads_per_multi_processor == 2048. ]
>
> Substituting that into the og8 code, and equating
> max_threads_per_multi_processor with threads_per_sm, I indeed get
>
> num_gangs = 2 * grids * blocks / warp_size.
>
> So with this extra information I see how you got there.
>
> But I still see no rationale why blocks is used here, and I wonder
> whether something like num_gangs = grids * 64 would give similar results.

My original intent was to keep the load proportional to the block size.
So, in the case where the block size is limited by shared memory or the
register-file capacity, the runtime wouldn't excessively over-assign
gangs to the multiprocessor units if their state is going to be swapped
out even more than necessary.

With that said, I could be wrong here. It would be nice if Nvidia
provided us with more insights into their hardware.

> Anyway, given that this is what is used on og8, I'm ok with using that,
> so let's go with:
> ...
> 	      gangs = 2 * grids * (blocks / warp_size);
> ...
> [ so, including the factor two you explicitly left out from the original
> patch. Unless you see a pressing reason not to include it. ]
>
> Can you repost after retesting? [ note: the updated patch I posted
> earlier doesn't apply on trunk anymore due to the cuda-lib.def change. ]

Thanks for looking into this. I got bogged down tracking a problem with
allocatable scalars in Fortran. I'll repost this patch after I've
tested it with an older version of CUDA (probably CUDA 5.5 using the
Nvidia driver 331.113 on a K40).

Cesar
Tom de Vries Aug. 4, 2018, 8:16 p.m. | #7
On 08/03/2018 05:37 PM, Cesar Philippidis wrote:
>> But I still see no rationale why blocks is used here, and I wonder
>> whether something like num_gangs = grids * 64 would give similar results.
>
> My original intent was to keep the load proportional to the block size.
> So, in the case were a block size is limited by shared-memory or the
> register file capacity, the runtime wouldn't excessively over assign
> gangs to the multiprocessor units if their state is going to be swapped
> out even more than necessary.

So, that's your rationale. Please add a comment describing this.

Thanks,
- Tom
Tom de Vries Aug. 7, 2018, 6:08 a.m. | #8
On 08/01/2018 12:18 PM, Tom de Vries wrote:

> I think we need to add and handle:
> ...
>   CUDA_ONE_CALL_MAYBE_NULL (cuOccupancyMaxPotentialBlockSize)
> ...

I realized that the patch I posted introducing CUDA_ONE_CALL_MAYBE_NULL
was incomplete, and needed to use the weak attribute in case of linking
against a concrete libcuda.so.

So, I've now committed a patch implementing just CUDA_ONE_CALL_MAYBE_NULL:
"[libgomp, nvptx] Handle CUDA_ONE_CALL_MAYBE_NULL" @
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00447.html . You can use
"CUDA_CALL_EXISTS (cuOccupancyMaxPotentialBlockSize)" to test for
existence of the function in the cuda driver API.

> The patch doesn't build in a setup with
> --enable-offload-targets=nvptx-none and without cuda, that enables usage
> of plugin/cuda/cuda.h:
> ...
> /data/offload-nvptx/src/libgomp/plugin/plugin-nvptx.c:98:16: error:
> ‘cuOccupancyMaxPotentialBlockSize’ undeclared here (not in a function);
> did you mean ‘cuOccupancyMaxPotentialBlockSizeWithFlags’?
>  CUDA_ONE_CALL (cuOccupancyMaxPotentialBlockSize) \
> ...

I've committed a patch "[libgomp, nvptx, --without-cuda-driver] Don't
use system cuda driver" @
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00348.html .

Using --without-cuda-driver should make it easy to build using the
dlopen interface without having to de-install the system libcuda.so.

Thanks,
- Tom
Cesar Philippidis Aug. 7, 2018, 1:52 p.m. | #9
On 08/06/2018 11:08 PM, Tom de Vries wrote:
> On 08/01/2018 12:18 PM, Tom de Vries wrote:
>> I think we need to add and handle:
>> ...
>>   CUDA_ONE_CALL_MAYBE_NULL (cuOccupancyMaxPotentialBlockSize)
>> ...
>
> I realized that the patch I posted introducing CUDA_ONE_CALL_MAYBE_NULL
> was incomplete, and needed to use the weak attribute in case of linking
> against a concrete libcuda.so.
>
> So, I've now committed a patch implementing just CUDA_ONE_CALL_MAYBE_NULL:
> "[libgomp, nvptx] Handle CUDA_ONE_CALL_MAYBE_NULL" @
> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00447.html . You can use
> "CUDA_CALL_EXISTS (cuOccupancyMaxPotentialBlockSize)" to test for
> existence of the function in the cuda driver API.

Sorry for taking so long getting this patch updated. It's a slow build
and test cycle getting older versions of CUDA to play nicely. So far,
I've managed to get CUDA 5.5 partially working with Nvidia driver
331.113 (which supports CUDA 6.0), in the sense that I spotted an error
with the patch; I realized that the cuda.h that ships with libgomp
emulates CUDA version 8.0. That led to problems using cuLinkAddData,
because that function gets remapped to cuLinkAddData_v2 in CUDA 6.5 and
newer.

That leads me to a question, do we really want to support older versions
of CUDA without using the system's CUDA header files?

>> The patch doesn't build in a setup with
>> --enable-offload-targets=nvptx-none and without cuda, that enables usage
>> of plugin/cuda/cuda.h:
>> ...
>> /data/offload-nvptx/src/libgomp/plugin/plugin-nvptx.c:98:16: error:
>> ‘cuOccupancyMaxPotentialBlockSize’ undeclared here (not in a function);
>> did you mean ‘cuOccupancyMaxPotentialBlockSizeWithFlags’?
>>  CUDA_ONE_CALL (cuOccupancyMaxPotentialBlockSize) \
>> ...
>
> I've committed a patch "[libgomp, nvptx, --without-cuda-driver] Don't
> use system cuda driver" @
> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00348.html .
>
> Using --without-cuda-driver should make it easy to build using the
> dlopen interface without having to de-install the system libcuda.so.

I attached an updated version of the CUDA driver patch, although I
haven't rebased it against your changes yet. It still needs to be tested
against CUDA 5.5 using the system's (Nvidia's) cuda.h, but I wanted to
give you an update.

Does this patch look OK, at least once testing completes? I removed the
tests for CUDA_ONE_CALL_MAYBE_NULL, because the newer CUDA API isn't
supported in the older drivers.

Cesar
From 7fc093da173543b43e1d83dd5fb9e00e2b92eb09 Mon Sep 17 00:00:00 2001
From: Cesar Philippidis <cesar@codesourcery.com>

Date: Thu, 26 Jul 2018 11:47:35 -0700
Subject: [PATCH] [nvptx] Use CUDA driver API to select default runtime launch
 geometry

	libgomp/
	plugin/cuda/cuda.h (CUoccupancyB2DSize): New typedef.
	(cuDriverGetVersion): Declare.
	(cuOccupancyMaxPotentialBlockSizeWithFlags): Declare.
	plugin/plugin-nvptx.c (CUDA_ONE_CALL): Add entries for
	cuDriverGetVersion and cuOccupancyMaxPotentialBlockSize.
	(ptx_device): Add driver_version member.
	(nvptx_open_device): Initialize it.
	(nvptx_exec): Use cuOccupancyMaxPotentialBlockSize to set the
	default num_gangs and num_workers when the driver supports it.
---
 libgomp/plugin/cuda-lib.def   |  2 ++
 libgomp/plugin/cuda/cuda.h    |  4 ++++
 libgomp/plugin/plugin-nvptx.c | 41 +++++++++++++++++++++++++++++++++--
 3 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index be8e3b3ec4d..f2433e1f0a9 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -2,6 +2,7 @@ CUDA_ONE_CALL (cuCtxCreate)
 CUDA_ONE_CALL (cuCtxDestroy)
 CUDA_ONE_CALL (cuCtxGetCurrent)
 CUDA_ONE_CALL (cuCtxGetDevice)
+CUDA_ONE_CALL (cuDriverGetVersion)
 CUDA_ONE_CALL (cuCtxPopCurrent)
 CUDA_ONE_CALL (cuCtxPushCurrent)
 CUDA_ONE_CALL (cuCtxSynchronize)
@@ -39,6 +40,7 @@ CUDA_ONE_CALL (cuModuleGetGlobal)
 CUDA_ONE_CALL (cuModuleLoad)
 CUDA_ONE_CALL (cuModuleLoadData)
 CUDA_ONE_CALL (cuModuleUnload)
+CUDA_ONE_CALL (cuOccupancyMaxPotentialBlockSize)
 CUDA_ONE_CALL (cuStreamCreate)
 CUDA_ONE_CALL (cuStreamDestroy)
 CUDA_ONE_CALL (cuStreamQuery)
diff --git a/libgomp/plugin/cuda/cuda.h b/libgomp/plugin/cuda/cuda.h
index 4799825bda2..3a790e688e0 100644
--- a/libgomp/plugin/cuda/cuda.h
+++ b/libgomp/plugin/cuda/cuda.h
@@ -44,6 +44,7 @@ typedef void *CUevent;
 typedef void *CUfunction;
 typedef void *CUlinkState;
 typedef void *CUmodule;
+typedef size_t (*CUoccupancyB2DSize)(int);
 typedef void *CUstream;
 
 typedef enum {
@@ -123,6 +124,7 @@ CUresult cuCtxSynchronize (void);
 CUresult cuDeviceGet (CUdevice *, int);
 CUresult cuDeviceGetAttribute (int *, CUdevice_attribute, CUdevice);
 CUresult cuDeviceGetCount (int *);
+CUresult cuDriverGetVersion(int *);
 CUresult cuEventCreate (CUevent *, unsigned);
 #define cuEventDestroy cuEventDestroy_v2
 CUresult cuEventDestroy (CUevent);
@@ -170,6 +172,8 @@ CUresult cuModuleGetGlobal (CUdeviceptr *, size_t *, CUmodule, const char *);
 CUresult cuModuleLoad (CUmodule *, const char *);
 CUresult cuModuleLoadData (CUmodule *, const void *);
 CUresult cuModuleUnload (CUmodule);
+CUresult cuOccupancyMaxPotentialBlockSize(int *, int *, CUfunction,
+					  CUoccupancyB2DSize, size_t, int);
 CUresult cuStreamCreate (CUstream *, unsigned);
 #define cuStreamDestroy cuStreamDestroy_v2
 CUresult cuStreamDestroy (CUstream);
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index cc465b4addb..f88c571a38e 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -54,6 +54,8 @@
 
 # define CUDA_ONE_CALL(call) \
   __typeof (call) *call;
+# define CUDA_ONE_CALL_MAYBE_NULL(call) \
+  __typeof (call) *call;
 struct cuda_lib_s {
 #include "cuda-lib.def"
 } cuda_lib;
@@ -80,7 +82,6 @@ init_cuda_lib (void)
   cuda_lib.call = dlsym (h, #call);	\
   if (cuda_lib.call == NULL)		\
     return false;
-#include "cuda-lib.def"
   cuda_lib_inited = true;
   return true;
 }
@@ -355,6 +356,7 @@ struct ptx_device
   int max_threads_per_block;
   int max_threads_per_multiprocessor;
   int default_dims[GOMP_DIM_MAX];
+  int driver_version;
 
   struct ptx_image_data *images;  /* Images loaded on device.  */
   pthread_mutex_t image_lock;     /* Lock for above list.  */
@@ -666,6 +668,7 @@ nvptx_open_device (int n)
   ptx_dev->ord = n;
   ptx_dev->dev = dev;
   ptx_dev->ctx_shared = false;
+  ptx_dev->driver_version = 0;
 
   r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &ctx_dev);
   if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
@@ -759,6 +762,9 @@ nvptx_open_device (int n)
   for (int i = 0; i != GOMP_DIM_MAX; i++)
     ptx_dev->default_dims[i] = 0;
 
+  CUDA_CALL_ERET (NULL, cuDriverGetVersion, &pi);
+  ptx_dev->driver_version = pi;
+
   ptx_dev->images = NULL;
   pthread_mutex_init (&ptx_dev->image_lock, NULL);
 
@@ -1152,11 +1158,42 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 
       {
 	bool default_dim_p[GOMP_DIM_MAX];
+	int vectors = nvthd->ptx_dev->default_dims[GOMP_DIM_VECTOR];
+	int workers = nvthd->ptx_dev->default_dims[GOMP_DIM_WORKER];
+	int gangs = nvthd->ptx_dev->default_dims[GOMP_DIM_GANG];
+
+	/* The CUDA driver occupancy calculator is only available on
+	   CUDA version 6.5 (6050) and newer.  */
+	if (nvthd->ptx_dev->driver_version > 6050)
+	  {
+	    int grids, blocks;
+	    CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,
+			      &blocks, function, NULL, 0,
+			      dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);
+	    GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "
+			       "grid = %d, block = %d\n", grids, blocks);
+
+	    /* Keep the num_gangs proportional to the block size.  The
+	       constant factor 2 is there to prevent threads from
+	       idling when there is sufficient work for them.  */
+	    if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_GANG) == 0)
+	      gangs = 2 * grids * (blocks / warp_size);
+
+	    if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_WORKER) == 0)
+	      workers = blocks / vectors;
+	  }
+
 	for (i = 0; i != GOMP_DIM_MAX; i++)
 	  {
 	    default_dim_p[i] = !dims[i];
 	    if (default_dim_p[i])
-	      dims[i] = nvthd->ptx_dev->default_dims[i];
+	      switch (i)
+		{
+		case GOMP_DIM_GANG: dims[i] = gangs; break;
+		case GOMP_DIM_WORKER: dims[i] = workers; break;
+		case GOMP_DIM_VECTOR: dims[i] = vectors; break;
+		default: GOMP_PLUGIN_fatal ("invalid dim");
+		}
 	  }
 
 	if (default_dim_p[GOMP_DIM_VECTOR])
-- 
2.17.1
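In rough terms, the default geometry chosen by the patch above reduces to a little arithmetic on the pair returned by cuOccupancyMaxPotentialBlockSize. The sketch below is a hypothetical standalone rendering of that arithmetic, with no CUDA dependency; "grids" and "blocks" stand in for the driver's suggested values, and warp_size is 32 on current NVIDIA GPUs. All names are illustrative, not the plugin's.

```c
#include <assert.h>

/* Standalone sketch of the launch-geometry arithmetic in the patch
   above.  "grids" and "blocks" stand in for the values returned by
   cuOccupancyMaxPotentialBlockSize.  */

static void
default_launch_dims (int grids, int blocks, int warp_size,
		     int *gangs, int *workers, int *vectors)
{
  /* Trunk only supports warp-sized vector lengths.  */
  *vectors = warp_size;

  /* Keep num_gangs proportional to the block size; the factor 2 is
     there to keep threads busy when some gangs finish early.  */
  *gangs = 2 * grids * (blocks / warp_size);

  /* A smaller suggested block effectively decreases num_workers.  */
  *workers = blocks / *vectors;
}
```

For example, a suggestion of 10 grids of 1024 threads yields 640 gangs of 32 workers with a vector length of 32.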
Cesar Philippidis Aug. 8, 2018, 2:09 p.m. | #10
On 08/07/2018 06:52 AM, Cesar Philippidis wrote:

> I attached an updated version of the CUDA driver patch, although I

> haven't rebased it against your changes yet. It still needs to be tested

> against CUDA 5.5 using the systems/Nvidia's cuda.h. But I wanted to give

> you an update.

> 

> Does this patch look OK, at least after testing competes? I removed the

> tests for CUDA_ONE_CALL_MAYBE_NULL, because the newer CUDA API isn't

> supported in the older drivers.


I've finally finished testing this patch. Aside from a couple of
regressions with CUDA 5.5 in libgomp.oacc-c-c++-common/lib-75.c,
lib-76.c and lib-79.c, the results came back clean.

This patch has been tested the following ways using a K40 GPU:

  * Using GCC's cuda.h with CUDA 9.2 drivers.
  * Using cuda.h from CUDA 5.5 and Nvidia drivers 331.133 (supports CUDA
    6.0) and the driver from CUDA 8.0.
  * Using cuda.h from CUDA 8.0.

As mentioned before, because GCC's cuda.h defines CUDA_VERSION as 8000,
there was a conflict with using it against CUDA 5.5, because of the
missing cuLinkAddData_v2 symbol.

Note how the usage of cuOccupancyMaxPotentialBlockSize is guarded by
checking CUDA_VERSION. I don't really like this, but it's a necessary
evil of maintaining backwards compatibility.
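To make the double guard concrete, here is a hypothetical illustration of the pattern: the compile-time CUDA_VERSION check keeps the plugin building against pre-6.5 headers, while the runtime check keeps it from calling an entry point the loaded driver does not provide. The function and parameter names are invented for illustration.

```c
#include <assert.h>

/* Hypothetical illustration of the two guards described above.  */

#ifndef CUDA_VERSION
#define CUDA_VERSION 8000	/* GCC's bundled cuda.h emulates CUDA 8.0.  */
#endif

static int
use_driver_occupancy (int driver_version, int have_symbol)
{
#if CUDA_VERSION >= 6050
  /* Headers are new enough; still defer to the runtime checks.  */
  return driver_version >= 6050 && have_symbol;
#else
  /* Compiled against pre-6.5 headers: always use the fallback.  */
  (void) driver_version;
  (void) have_symbol;
  return 0;
#endif
}
```

Either guard alone is insufficient: old headers would break the build, and an old driver would fail the dlsym lookup at runtime.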

Is this patch OK for trunk?

Thanks,
Cesar
[nvptx] Use CUDA driver API to select default runtime launch geometry

2018-08-YY  Cesar Philippidis  <cesar@codesourcery.com>

	libgomp/
	plugin/cuda/cuda.h (CUoccupancyB2DSize): New typedef.
	(cuDriverGetVersion): Declare.
	(cuOccupancyMaxPotentialBlockSizeWithFlags): Declare.
	plugin/plugin-nvptx.c (CUDA_ONE_CALL): Add entries for
	cuDriverGetVersion and cuOccupancyMaxPotentialBlockSize.
	(ptx_device): Add driver_version member.
	(nvptx_open_device): Initialize it.
	(nvptx_exec): Use cuOccupancyMaxPotentialBlockSize to set the
	default num_gangs and num_workers when the driver supports it.
---
 libgomp/plugin/cuda-lib.def   |  2 ++
 libgomp/plugin/cuda/cuda.h    |  4 ++++
 libgomp/plugin/plugin-nvptx.c | 40 +++++++++++++++++++++++++++++++++++++++-
 3 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index be8e3b3..f2433e1 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -2,6 +2,7 @@ CUDA_ONE_CALL (cuCtxCreate)
 CUDA_ONE_CALL (cuCtxDestroy)
 CUDA_ONE_CALL (cuCtxGetCurrent)
 CUDA_ONE_CALL (cuCtxGetDevice)
+CUDA_ONE_CALL (cuDriverGetVersion)
 CUDA_ONE_CALL (cuCtxPopCurrent)
 CUDA_ONE_CALL (cuCtxPushCurrent)
 CUDA_ONE_CALL (cuCtxSynchronize)
@@ -39,6 +40,7 @@ CUDA_ONE_CALL (cuModuleGetGlobal)
 CUDA_ONE_CALL (cuModuleLoad)
 CUDA_ONE_CALL (cuModuleLoadData)
 CUDA_ONE_CALL (cuModuleUnload)
+CUDA_ONE_CALL (cuOccupancyMaxPotentialBlockSize)
 CUDA_ONE_CALL (cuStreamCreate)
 CUDA_ONE_CALL (cuStreamDestroy)
 CUDA_ONE_CALL (cuStreamQuery)
diff --git a/libgomp/plugin/cuda/cuda.h b/libgomp/plugin/cuda/cuda.h
index 4799825..3a790e6 100644
--- a/libgomp/plugin/cuda/cuda.h
+++ b/libgomp/plugin/cuda/cuda.h
@@ -44,6 +44,7 @@ typedef void *CUevent;
 typedef void *CUfunction;
 typedef void *CUlinkState;
 typedef void *CUmodule;
+typedef size_t (*CUoccupancyB2DSize)(int);
 typedef void *CUstream;
 
 typedef enum {
@@ -123,6 +124,7 @@ CUresult cuCtxSynchronize (void);
 CUresult cuDeviceGet (CUdevice *, int);
 CUresult cuDeviceGetAttribute (int *, CUdevice_attribute, CUdevice);
 CUresult cuDeviceGetCount (int *);
+CUresult cuDriverGetVersion(int *);
 CUresult cuEventCreate (CUevent *, unsigned);
 #define cuEventDestroy cuEventDestroy_v2
 CUresult cuEventDestroy (CUevent);
@@ -170,6 +172,8 @@ CUresult cuModuleGetGlobal (CUdeviceptr *, size_t *, CUmodule, const char *);
 CUresult cuModuleLoad (CUmodule *, const char *);
 CUresult cuModuleLoadData (CUmodule *, const void *);
 CUresult cuModuleUnload (CUmodule);
+CUresult cuOccupancyMaxPotentialBlockSize(int *, int *, CUfunction,
+					  CUoccupancyB2DSize, size_t, int);
 CUresult cuStreamCreate (CUstream *, unsigned);
 #define cuStreamDestroy cuStreamDestroy_v2
 CUresult cuStreamDestroy (CUstream);
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 825470a..b0ccf0b 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -376,6 +376,7 @@ struct ptx_device
   int max_threads_per_block;
   int max_threads_per_multiprocessor;
   int default_dims[GOMP_DIM_MAX];
+  int driver_version;
 
   struct ptx_image_data *images;  /* Images loaded on device.  */
   pthread_mutex_t image_lock;     /* Lock for above list.  */
@@ -687,6 +688,7 @@ nvptx_open_device (int n)
   ptx_dev->ord = n;
   ptx_dev->dev = dev;
   ptx_dev->ctx_shared = false;
+  ptx_dev->driver_version = 0;
 
   r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &ctx_dev);
   if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
@@ -780,6 +782,9 @@ nvptx_open_device (int n)
   for (int i = 0; i != GOMP_DIM_MAX; i++)
     ptx_dev->default_dims[i] = 0;
 
+  CUDA_CALL_ERET (NULL, cuDriverGetVersion, &pi);
+  ptx_dev->driver_version = pi;
+
   ptx_dev->images = NULL;
   pthread_mutex_init (&ptx_dev->image_lock, NULL);
 
@@ -1173,11 +1178,44 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 
       {
 	bool default_dim_p[GOMP_DIM_MAX];
+	int vectors = nvthd->ptx_dev->default_dims[GOMP_DIM_VECTOR];
+	int workers = nvthd->ptx_dev->default_dims[GOMP_DIM_WORKER];
+	int gangs = nvthd->ptx_dev->default_dims[GOMP_DIM_GANG];
+
+	/* The CUDA driver occupancy calculator is only available on
+	   CUDA version 6.5 (6050) and newer.  */
+#if (CUDA_VERSION >= 6050)
+	if (nvthd->ptx_dev->driver_version > 6050)
+	  {
+	    int grids, blocks;
+	    CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,
+			      &blocks, function, NULL, 0,
+			      dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);
+	    GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "
+			       "grid = %d, block = %d\n", grids, blocks);
+
+	    /* Keep the num_gangs proportional to the block size.  The
+	       constant factor 2 is there to prevent threads from
+	       idling when there is sufficient work for them.  */
+	    if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_GANG) == 0)
+	      gangs = 2 * grids * (blocks / warp_size);
+
+	    if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_WORKER) == 0)
+	      workers = blocks / vectors;
+	  }
+#endif
+
 	for (i = 0; i != GOMP_DIM_MAX; i++)
 	  {
 	    default_dim_p[i] = !dims[i];
 	    if (default_dim_p[i])
-	      dims[i] = nvthd->ptx_dev->default_dims[i];
+	      switch (i)
+		{
+		case GOMP_DIM_GANG: dims[i] = gangs; break;
+		case GOMP_DIM_WORKER: dims[i] = workers; break;
+		case GOMP_DIM_VECTOR: dims[i] = vectors; break;
+		default: GOMP_PLUGIN_fatal ("invalid dim");
+		}
 	  }
 
 	if (default_dim_p[GOMP_DIM_VECTOR])
-- 
2.7.4
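One detail of the API worth spelling out: the CUoccupancyB2DSize typedef added to cuda.h in the patch is a callback mapping a candidate block size to the dynamic shared memory that block would need. The plugin passes NULL for it, fixing dynamic shared memory at 0 bytes, but a hypothetical callback could look like this (the function name and sizing policy are invented for illustration).

```c
#include <assert.h>
#include <stddef.h>

/* Matches the typedef added to cuda.h in the patch above.  */
typedef size_t (*CUoccupancyB2DSize)(int);

/* Hypothetical block-size-to-dynamic-shared-memory callback:
   e.g. one float of scratch space per thread.  */
static size_t
shmem_per_block (int block_size)
{
  return (size_t) block_size * sizeof (float);
}

static CUoccupancyB2DSize b2dsize = shmem_per_block;
```

The occupancy calculator would invoke such a callback for each candidate block size while searching for the size that maximizes multiprocessor occupancy.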
Tom de Vries Aug. 8, 2018, 3:19 p.m. | #11
On Wed, Aug 08, 2018 at 07:09:16AM -0700, Cesar Philippidis wrote:
> On 08/07/2018 06:52 AM, Cesar Philippidis wrote:

> 

> > I attached an updated version of the CUDA driver patch, although I

> > haven't rebased it against your changes yet. It still needs to be tested

> > against CUDA 5.5 using the systems/Nvidia's cuda.h. But I wanted to give

> > you an update.

> > 

> > Does this patch look OK, at least after testing competes? I removed the

> > tests for CUDA_ONE_CALL_MAYBE_NULL, because the newer CUDA API isn't

> > supported in the older drivers.

> 

> I've finally finished testing this patch. Besides for a couple of

> regressions with CUDA 5.5 in libgomp.oacc-c-c++-common/lib-75.c,

> lib-76.c and lib-79.c, the results came back clean.

> 

> This patch has been tested the following ways using a K40 GPU:

> 

>   * Using GCC's cuda.h with CUDA 9.2 drivers.

>   * Using cuda.h from CUDA 5.5 and Nvidia drivers 331.133 (supports CUDA

>     6.0) and the driver from CUDA 8.0.

>   * Using cuda.h from CUDA 8.0.

> 

> As mentioned before, because GCC's cuda.h defines CUDA_VERSION as 8000,

> there was a conflict with using it against CUDA 5.5, because of the

> missing cuLinkAddData_v2 symbol.

> 

> Note how the usage of cuOccupancyMaxPotentialBlockSize is guarded by

> checking for the version of CUDA_VERSION. I don't really like this, but

> it's a necessary evil of maintaining backwards compatibility.

> 

> Is this patch OK for trunk?

> 

> Thanks,

> Cesar


> [nvptx] Use CUDA driver API to select default runtime launch geometry

> 

> 2018-08-YY  Cesar Philippidis  <cesar@codesourcery.com>

> 

> 	libgomp/

> 	plugin/cuda/cuda.h (CUoccupancyB2DSize): New typedef.

> 	(cuDriverGetVersion): Declare.

> 	(cuOccupancyMaxPotentialBlockSizeWithFlags): Declare.

> 	plugin/plugin-nvptx.c (CUDA_ONE_CALL): Add entries for

> 	cuDriverGetVersion and cuOccupancyMaxPotentialBlockSize.

> 	(ptx_device): Add driver_version member.

> 	(nvptx_open_device): Initialize it.

> 	(nvptx_exec): Use cuOccupancyMaxPotentialBlockSize to set the

> 	default num_gangs and num_workers when the driver supports it.

> ---

>  libgomp/plugin/cuda-lib.def   |  2 ++

>  libgomp/plugin/cuda/cuda.h    |  4 ++++

>  libgomp/plugin/plugin-nvptx.c | 40 +++++++++++++++++++++++++++++++++++++++-

>  3 files changed, 45 insertions(+), 1 deletion(-)

> 

> diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def

> index be8e3b3..f2433e1 100644

> --- a/libgomp/plugin/cuda-lib.def

> +++ b/libgomp/plugin/cuda-lib.def

> @@ -2,6 +2,7 @@ CUDA_ONE_CALL (cuCtxCreate)

>  CUDA_ONE_CALL (cuCtxDestroy)

>  CUDA_ONE_CALL (cuCtxGetCurrent)

>  CUDA_ONE_CALL (cuCtxGetDevice)

> +CUDA_ONE_CALL (cuDriverGetVersion)


Don't use cuDriverGetVersion.

>  CUDA_ONE_CALL (cuCtxPopCurrent)

>  CUDA_ONE_CALL (cuCtxPushCurrent)

>  CUDA_ONE_CALL (cuCtxSynchronize)

> @@ -39,6 +40,7 @@ CUDA_ONE_CALL (cuModuleGetGlobal)

>  CUDA_ONE_CALL (cuModuleLoad)

>  CUDA_ONE_CALL (cuModuleLoadData)

>  CUDA_ONE_CALL (cuModuleUnload)

> +CUDA_ONE_CALL (cuOccupancyMaxPotentialBlockSize)


Use CUDA_ONE_CALL_MAYBE_NULL.

>  CUDA_ONE_CALL (cuStreamCreate)

>  CUDA_ONE_CALL (cuStreamDestroy)

>  CUDA_ONE_CALL (cuStreamQuery)

> diff --git a/libgomp/plugin/cuda/cuda.h b/libgomp/plugin/cuda/cuda.h

> index 4799825..3a790e6 100644

> --- a/libgomp/plugin/cuda/cuda.h

> +++ b/libgomp/plugin/cuda/cuda.h

> @@ -44,6 +44,7 @@ typedef void *CUevent;

>  typedef void *CUfunction;

>  typedef void *CUlinkState;

>  typedef void *CUmodule;

> +typedef size_t (*CUoccupancyB2DSize)(int);

>  typedef void *CUstream;

>  

>  typedef enum {

> @@ -123,6 +124,7 @@ CUresult cuCtxSynchronize (void);

>  CUresult cuDeviceGet (CUdevice *, int);

>  CUresult cuDeviceGetAttribute (int *, CUdevice_attribute, CUdevice);

>  CUresult cuDeviceGetCount (int *);

> +CUresult cuDriverGetVersion(int *);

>  CUresult cuEventCreate (CUevent *, unsigned);

>  #define cuEventDestroy cuEventDestroy_v2

>  CUresult cuEventDestroy (CUevent);

> @@ -170,6 +172,8 @@ CUresult cuModuleGetGlobal (CUdeviceptr *, size_t *, CUmodule, const char *);

>  CUresult cuModuleLoad (CUmodule *, const char *);

>  CUresult cuModuleLoadData (CUmodule *, const void *);

>  CUresult cuModuleUnload (CUmodule);

> +CUresult cuOccupancyMaxPotentialBlockSize(int *, int *, CUfunction,

> +					  CUoccupancyB2DSize, size_t, int);

>  CUresult cuStreamCreate (CUstream *, unsigned);

>  #define cuStreamDestroy cuStreamDestroy_v2

>  CUresult cuStreamDestroy (CUstream);

> diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c

> index 825470a..b0ccf0b 100644

> --- a/libgomp/plugin/plugin-nvptx.c

> +++ b/libgomp/plugin/plugin-nvptx.c

> @@ -376,6 +376,7 @@ struct ptx_device

>    int max_threads_per_block;

>    int max_threads_per_multiprocessor;

>    int default_dims[GOMP_DIM_MAX];

> +  int driver_version;

>  

>    struct ptx_image_data *images;  /* Images loaded on device.  */

>    pthread_mutex_t image_lock;     /* Lock for above list.  */

> @@ -687,6 +688,7 @@ nvptx_open_device (int n)

>    ptx_dev->ord = n;

>    ptx_dev->dev = dev;

>    ptx_dev->ctx_shared = false;

> +  ptx_dev->driver_version = 0;

>  

>    r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &ctx_dev);

>    if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)

> @@ -780,6 +782,9 @@ nvptx_open_device (int n)

>    for (int i = 0; i != GOMP_DIM_MAX; i++)

>      ptx_dev->default_dims[i] = 0;

>  

> +  CUDA_CALL_ERET (NULL, cuDriverGetVersion, &pi);

> +  ptx_dev->driver_version = pi;

> +

>    ptx_dev->images = NULL;

>    pthread_mutex_init (&ptx_dev->image_lock, NULL);

>  

> @@ -1173,11 +1178,44 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,

>  

>        {

>  	bool default_dim_p[GOMP_DIM_MAX];

> +	int vectors = nvthd->ptx_dev->default_dims[GOMP_DIM_VECTOR];

> +	int workers = nvthd->ptx_dev->default_dims[GOMP_DIM_WORKER];

> +	int gangs = nvthd->ptx_dev->default_dims[GOMP_DIM_GANG];

> +

> +	/* The CUDA driver occupancy calculator is only available on

> +	   CUDA version 6.5 (6050) and newer.  */

> +#if (CUDA_VERSION >= 6050)

> +	if (nvthd->ptx_dev->driver_version > 6050)


Use CUDA_CALL_EXISTS.

> +	  {

> +	    int grids, blocks;

> +	    CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,

> +			      &blocks, function, NULL, 0,

> +			      dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);

> +	    GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "

> +			       "grid = %d, block = %d\n", grids, blocks);

> +

> +	    /* Keep the num_gangs proportional to the block size.  The

> +	       constant factor 2 is there to prevent threads from

> +	       idling when there is sufficient work for them.  */


typo: sufficient -> insufficient

Also, reformulate the first part of rationale comment to say: "Keep the
num_gangs proportional to the block size, in order to ..." or some such, along
the lines of what you mentioned here (
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00289.html ).

> +	    if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_GANG) == 0)

> +	      gangs = 2 * grids * (blocks / warp_size);

> +

> +	    if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_WORKER) == 0)

> +	      workers = blocks / vectors;

> +	  }

> +#endif

> +

>  	for (i = 0; i != GOMP_DIM_MAX; i++)

>  	  {

>  	    default_dim_p[i] = !dims[i];

>  	    if (default_dim_p[i])

> -	      dims[i] = nvthd->ptx_dev->default_dims[i];

> +	      switch (i)

> +		{

> +		case GOMP_DIM_GANG: dims[i] = gangs; break;

> +		case GOMP_DIM_WORKER: dims[i] = workers; break;

> +		case GOMP_DIM_VECTOR: dims[i] = vectors; break;

> +		default: GOMP_PLUGIN_fatal ("invalid dim");

> +		}

>  	  }

>  


The new default calculation is not cleanly separated from the fallback default
calculation. I've already shown you how that should be done: (
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00027.html ).

Thanks,
- Tom
Cesar Philippidis Aug. 10, 2018, 6:39 p.m. | #12
On 08/08/2018 08:19 AM, Tom de Vries wrote:
> On Wed, Aug 08, 2018 at 07:09:16AM -0700, Cesar Philippidis wrote:

>> On 08/07/2018 06:52 AM, Cesar Philippidis wrote:


Thanks for the review. This version should address all of the following
remarks. However, one thing to note ...

>> [nvptx] Use CUDA driver API to select default runtime launch geometry

>>

>> 2018-08-YY  Cesar Philippidis  <cesar@codesourcery.com>

>>

>> 	libgomp/

>> 	plugin/cuda/cuda.h (CUoccupancyB2DSize): New typedef.

>> 	(cuDriverGetVersion): Declare.

>> 	(cuOccupancyMaxPotentialBlockSizeWithFlags): Declare.

>> 	plugin/plugin-nvptx.c (CUDA_ONE_CALL): Add entries for

>> 	cuDriverGetVersion and cuOccupancyMaxPotentialBlockSize.

>> 	(ptx_device): Add driver_version member.

>> 	(nvptx_open_device): Initialize it.

>> 	(nvptx_exec): Use cuOccupancyMaxPotentialBlockSize to set the

>> 	default num_gangs and num_workers when the driver supports it.

>> ---

>>  libgomp/plugin/cuda-lib.def   |  2 ++

>>  libgomp/plugin/cuda/cuda.h    |  4 ++++

>>  libgomp/plugin/plugin-nvptx.c | 40 +++++++++++++++++++++++++++++++++++++++-

>>  3 files changed, 45 insertions(+), 1 deletion(-)

>>

>> diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def

>> index be8e3b3..f2433e1 100644

>> --- a/libgomp/plugin/cuda-lib.def

>> +++ b/libgomp/plugin/cuda-lib.def

>> @@ -2,6 +2,7 @@ CUDA_ONE_CALL (cuCtxCreate)

>>  CUDA_ONE_CALL (cuCtxDestroy)

>>  CUDA_ONE_CALL (cuCtxGetCurrent)

>>  CUDA_ONE_CALL (cuCtxGetDevice)

>> +CUDA_ONE_CALL (cuDriverGetVersion)

> 

> Don't use cuDriverGetVersion.

> 

>>  CUDA_ONE_CALL (cuCtxPopCurrent)

>>  CUDA_ONE_CALL (cuCtxPushCurrent)

>>  CUDA_ONE_CALL (cuCtxSynchronize)

>> @@ -39,6 +40,7 @@ CUDA_ONE_CALL (cuModuleGetGlobal)

>>  CUDA_ONE_CALL (cuModuleLoad)

>>  CUDA_ONE_CALL (cuModuleLoadData)

>>  CUDA_ONE_CALL (cuModuleUnload)

>> +CUDA_ONE_CALL (cuOccupancyMaxPotentialBlockSize)

> 

> Use CUDA_ONE_CALL_MAYBE_NULL.

> 

>>  CUDA_ONE_CALL (cuStreamCreate)

>>  CUDA_ONE_CALL (cuStreamDestroy)

>>  CUDA_ONE_CALL (cuStreamQuery)

>> diff --git a/libgomp/plugin/cuda/cuda.h b/libgomp/plugin/cuda/cuda.h

>> index 4799825..3a790e6 100644

>> --- a/libgomp/plugin/cuda/cuda.h

>> +++ b/libgomp/plugin/cuda/cuda.h

>> @@ -44,6 +44,7 @@ typedef void *CUevent;

>>  typedef void *CUfunction;

>>  typedef void *CUlinkState;

>>  typedef void *CUmodule;

>> +typedef size_t (*CUoccupancyB2DSize)(int);

>>  typedef void *CUstream;

>>  

>>  typedef enum {

>> @@ -123,6 +124,7 @@ CUresult cuCtxSynchronize (void);

>>  CUresult cuDeviceGet (CUdevice *, int);

>>  CUresult cuDeviceGetAttribute (int *, CUdevice_attribute, CUdevice);

>>  CUresult cuDeviceGetCount (int *);

>> +CUresult cuDriverGetVersion(int *);

>>  CUresult cuEventCreate (CUevent *, unsigned);

>>  #define cuEventDestroy cuEventDestroy_v2

>>  CUresult cuEventDestroy (CUevent);

>> @@ -170,6 +172,8 @@ CUresult cuModuleGetGlobal (CUdeviceptr *, size_t *, CUmodule, const char *);

>>  CUresult cuModuleLoad (CUmodule *, const char *);

>>  CUresult cuModuleLoadData (CUmodule *, const void *);

>>  CUresult cuModuleUnload (CUmodule);

>> +CUresult cuOccupancyMaxPotentialBlockSize(int *, int *, CUfunction,

>> +					  CUoccupancyB2DSize, size_t, int);

>>  CUresult cuStreamCreate (CUstream *, unsigned);

>>  #define cuStreamDestroy cuStreamDestroy_v2

>>  CUresult cuStreamDestroy (CUstream);

>> diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c

>> index 825470a..b0ccf0b 100644

>> --- a/libgomp/plugin/plugin-nvptx.c

>> +++ b/libgomp/plugin/plugin-nvptx.c

>> @@ -376,6 +376,7 @@ struct ptx_device

>>    int max_threads_per_block;

>>    int max_threads_per_multiprocessor;

>>    int default_dims[GOMP_DIM_MAX];

>> +  int driver_version;

>>  

>>    struct ptx_image_data *images;  /* Images loaded on device.  */

>>    pthread_mutex_t image_lock;     /* Lock for above list.  */

>> @@ -687,6 +688,7 @@ nvptx_open_device (int n)

>>    ptx_dev->ord = n;

>>    ptx_dev->dev = dev;

>>    ptx_dev->ctx_shared = false;

>> +  ptx_dev->driver_version = 0;

>>  

>>    r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &ctx_dev);

>>    if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)

>> @@ -780,6 +782,9 @@ nvptx_open_device (int n)

>>    for (int i = 0; i != GOMP_DIM_MAX; i++)

>>      ptx_dev->default_dims[i] = 0;

>>  

>> +  CUDA_CALL_ERET (NULL, cuDriverGetVersion, &pi);

>> +  ptx_dev->driver_version = pi;

>> +

>>    ptx_dev->images = NULL;

>>    pthread_mutex_init (&ptx_dev->image_lock, NULL);

>>  

>> @@ -1173,11 +1178,44 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,

>>  

>>        {

>>  	bool default_dim_p[GOMP_DIM_MAX];

>> +	int vectors = nvthd->ptx_dev->default_dims[GOMP_DIM_VECTOR];

>> +	int workers = nvthd->ptx_dev->default_dims[GOMP_DIM_WORKER];

>> +	int gangs = nvthd->ptx_dev->default_dims[GOMP_DIM_GANG];

>> +


is that I modified the default value for vectors as follows

+	    int vectors = default_dim_p[GOMP_DIM_VECTOR]
+	      ? 0 : dims[GOMP_DIM_VECTOR];

Technically, trunk only supports warp-sized vectors, but the fallback
code already checks for the presence of vectors like so

+	    if (default_dim_p[GOMP_DIM_VECTOR])
+	      dims[GOMP_DIM_VECTOR]
+		= MIN (dims[GOMP_DIM_VECTOR],
+		       (targ_fn->max_threads_per_block / warp_size
+			* warp_size));

therefore, I had the cuOccupancyMaxPotentialBlockSize code path behave
the same. If you want, I can resubmit a patch without that change though.
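The fallback clamp quoted above amounts to rounding max_threads_per_block down to a whole number of warps and taking the minimum with the current vector length. A hypothetical standalone rendering (names are illustrative, not the plugin's):

```c
#include <assert.h>

/* Sketch of the fallback vector-length clamp: round
   max_threads_per_block down to a whole number of warps, then take the
   minimum with the current value.  */

static int
clamp_vector_dim (int current, int max_threads_per_block, int warp_size)
{
  int limit = max_threads_per_block / warp_size * warp_size;
  return current < limit ? current : limit;
}
```

So a default of 32 survives unchanged on any realistic device, while an oversized request is cut down to the largest warp multiple the kernel's resource usage allows.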

>> +	/* The CUDA driver occupancy calculator is only available on

>> +	   CUDA version 6.5 (6050) and newer.  */

>> +#if (CUDA_VERSION >= 6050)

>> +	if (nvthd->ptx_dev->driver_version > 6050)

> 

> Use CUDA_CALL_EXISTS.

> 

>> +	  {

>> +	    int grids, blocks;

>> +	    CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,

>> +			      &blocks, function, NULL, 0,

>> +			      dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);

>> +	    GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "

>> +			       "grid = %d, block = %d\n", grids, blocks);

>> +

>> +	    /* Keep the num_gangs proportional to the block size.  The

>> +	       constant factor 2 is there to prevent threads from

>> +	       idling when there is sufficient work for them.  */

> 

> typo: sufficient -> insufficient

> 

> Also, reformulate the first part of rationale comment to say: "Keep the

> num_gangs proportional to the block size, in order to ..." or some such, along

> the lines of what you mentioned here (

> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00289.html ).

> 

>> +	    if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_GANG) == 0)

>> +	      gangs = 2 * grids * (blocks / warp_size);

>> +

>> +	    if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_WORKER) == 0)

>> +	      workers = blocks / vectors;

>> +	  }

>> +#endif

>> +

>>  	for (i = 0; i != GOMP_DIM_MAX; i++)

>>  	  {

>>  	    default_dim_p[i] = !dims[i];

>>  	    if (default_dim_p[i])

>> -	      dims[i] = nvthd->ptx_dev->default_dims[i];

>> +	      switch (i)

>> +		{

>> +		case GOMP_DIM_GANG: dims[i] = gangs; break;

>> +		case GOMP_DIM_WORKER: dims[i] = workers; break;

>> +		case GOMP_DIM_VECTOR: dims[i] = vectors; break;

>> +		default: GOMP_PLUGIN_fatal ("invalid dim");

>> +		}

>>  	  }

>>  

> 

> The new default calculation is not cleanly separated from the fallback default

> calculation. I've already shown you how that should be done: (

> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00027.html ).


Is this patch OK for trunk? I tested it with CUDA 5.5, 8.0 and 9.0, with
and without --without-cuda-driver.

Thanks,
Cesar
[nvptx] Use CUDA driver API to select default runtime launch geometry

	PR target/85590

	libgomp/
	plugin/cuda/cuda.h (CUoccupancyB2DSize): New typedef.
	(cuOccupancyMaxPotentialBlockSizeWithFlags): Declare.
	plugin/cuda-lib.def (cuOccupancyMaxPotentialBlockSize): New
	CUDA_ONE_CALL_MAYBE_NULL.
	plugin/plugin-nvptx.c (CUDA_VERSION < 6050): Define
	CUoccupancyB2DSize and declare
	cuOccupancyMaxPotentialBlockSizeWithFlags.
	(nvptx_exec): Use cuOccupancyMaxPotentialBlockSize to set the
	default num_gangs and num_workers when the driver supports it.

---
 libgomp/plugin/cuda-lib.def   |  1 +
 libgomp/plugin/cuda/cuda.h    |  3 ++
 libgomp/plugin/plugin-nvptx.c | 77 +++++++++++++++++++++++++++++------
 3 files changed, 69 insertions(+), 12 deletions(-)

diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index 29028b504a0..b2a4c2154eb 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -41,6 +41,7 @@ CUDA_ONE_CALL (cuModuleGetGlobal)
 CUDA_ONE_CALL (cuModuleLoad)
 CUDA_ONE_CALL (cuModuleLoadData)
 CUDA_ONE_CALL (cuModuleUnload)
+CUDA_ONE_CALL_MAYBE_NULL (cuOccupancyMaxPotentialBlockSize)
 CUDA_ONE_CALL (cuStreamCreate)
 CUDA_ONE_CALL (cuStreamDestroy)
 CUDA_ONE_CALL (cuStreamQuery)
diff --git a/libgomp/plugin/cuda/cuda.h b/libgomp/plugin/cuda/cuda.h
index 4799825bda2..b4c1b29c5d8 100644
--- a/libgomp/plugin/cuda/cuda.h
+++ b/libgomp/plugin/cuda/cuda.h
@@ -44,6 +44,7 @@ typedef void *CUevent;
 typedef void *CUfunction;
 typedef void *CUlinkState;
 typedef void *CUmodule;
+typedef size_t (*CUoccupancyB2DSize)(int);
 typedef void *CUstream;
 
 typedef enum {
@@ -170,6 +171,8 @@ CUresult cuModuleGetGlobal (CUdeviceptr *, size_t *, CUmodule, const char *);
 CUresult cuModuleLoad (CUmodule *, const char *);
 CUresult cuModuleLoadData (CUmodule *, const void *);
 CUresult cuModuleUnload (CUmodule);
+CUresult cuOccupancyMaxPotentialBlockSize(int *, int *, CUfunction,
+					  CUoccupancyB2DSize, size_t, int);
 CUresult cuStreamCreate (CUstream *, unsigned);
 #define cuStreamDestroy cuStreamDestroy_v2
 CUresult cuStreamDestroy (CUstream);
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 6799a264976..9a4bc11e5fe 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -61,9 +61,12 @@ CUresult cuLinkAddData (CUlinkState, CUjitInputType, void *, size_t,
 			const char *, unsigned, CUjit_option *, void **);
 CUresult cuLinkCreate (unsigned, CUjit_option *, void **, CUlinkState *);
 #else
+typedef size_t (*CUoccupancyB2DSize)(int);
 CUresult cuLinkAddData_v2 (CUlinkState, CUjitInputType, void *, size_t,
 			   const char *, unsigned, CUjit_option *, void **);
 CUresult cuLinkCreate_v2 (unsigned, CUjit_option *, void **, CUlinkState *);
+CUresult cuOccupancyMaxPotentialBlockSize(int *, int *, CUfunction,
+					  CUoccupancyB2DSize, size_t, int);
 #endif
 
 #define DO_PRAGMA(x) _Pragma (#x)
@@ -1200,21 +1203,71 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
       {
 	bool default_dim_p[GOMP_DIM_MAX];
 	for (i = 0; i != GOMP_DIM_MAX; i++)
+	  default_dim_p[i] = !dims[i];
+
+	if (!CUDA_CALL_EXISTS (cuOccupancyMaxPotentialBlockSize))
 	  {
-	    default_dim_p[i] = !dims[i];
-	    if (default_dim_p[i])
-	      dims[i] = nvthd->ptx_dev->default_dims[i];
-	  }
+	    for (i = 0; i != GOMP_DIM_MAX; i++)
+	      {
+		if (default_dim_p[i])
+		  dims[i] = nvthd->ptx_dev->default_dims[i];
+	      }
 
-	if (default_dim_p[GOMP_DIM_VECTOR])
-	  dims[GOMP_DIM_VECTOR]
-	    = MIN (dims[GOMP_DIM_VECTOR],
-		   (targ_fn->max_threads_per_block / warp_size * warp_size));
+	    if (default_dim_p[GOMP_DIM_VECTOR])
+	      dims[GOMP_DIM_VECTOR]
+		= MIN (dims[GOMP_DIM_VECTOR],
+		       (targ_fn->max_threads_per_block / warp_size
+			* warp_size));
 
-	if (default_dim_p[GOMP_DIM_WORKER])
-	  dims[GOMP_DIM_WORKER]
-	    = MIN (dims[GOMP_DIM_WORKER],
-		   targ_fn->max_threads_per_block / dims[GOMP_DIM_VECTOR]);
+	    if (default_dim_p[GOMP_DIM_WORKER])
+	      dims[GOMP_DIM_WORKER]
+		= MIN (dims[GOMP_DIM_WORKER],
+		       targ_fn->max_threads_per_block / dims[GOMP_DIM_VECTOR]);
+	  }
+	else
+	  {
+	    int vectors = default_dim_p[GOMP_DIM_VECTOR]
+	      ? 0 : dims[GOMP_DIM_VECTOR];
+	    int workers = gomp_openacc_dims[GOMP_DIM_WORKER];
+	    int gangs = gomp_openacc_dims[GOMP_DIM_GANG];
+	    int grids, blocks;
+
+	    CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,
+			      &blocks, function, NULL, 0,
+			      dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);
+	    GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "
+			       "grid = %d, block = %d\n", grids, blocks);
+
+	    /* Keep the num_gangs proportional to the block size.  In
+	       the case where a block size is limited by shared-memory
+	       or the register file capacity, the runtime will not
+	       excessively over-assign gangs to the multiprocessor
+	       units if their state is going to be swapped out even
+	       more than necessary.  The constant factor 2 is there to
+	       prevent threads from idling when there is insufficient
+	       work for them.  */
+	    if (gangs == 0)
+	      gangs = 2 * grids * (blocks / warp_size);
+
+	    if (vectors == 0)
+	      vectors = warp_size;
+
+	    if (workers == 0)
+	      workers = blocks / vectors;
+
+	    for (i = 0; i != GOMP_DIM_MAX; i++)
+	      {
+		default_dim_p[i] = !dims[i];
+		if (default_dim_p[i])
+		  switch (i)
+		    {
+		    case GOMP_DIM_GANG: dims[i] = gangs; break;
+		    case GOMP_DIM_WORKER: dims[i] = workers; break;
+		    case GOMP_DIM_VECTOR: dims[i] = vectors; break;
+		    default: GOMP_PLUGIN_fatal ("invalid dim");
+		    }
+	      }
+	  }
       }
     }
 
-- 
2.17.1
Tom de Vries Aug. 13, 2018, 12:04 p.m. | #13
On 08/10/2018 08:39 PM, Cesar Philippidis wrote:
> is that I modified the default value for vectors as follows
> 
> +	    int vectors = default_dim_p[GOMP_DIM_VECTOR]
> +	      ? 0 : dims[GOMP_DIM_VECTOR];
> 
> Technically, trunk only supports warp-sized vectors, but the fallback
> code is already checking for the presence of vectors as so
> 
> +	    if (default_dim_p[GOMP_DIM_VECTOR])
> +	      dims[GOMP_DIM_VECTOR]
> +		= MIN (dims[GOMP_DIM_VECTOR],
> +		       (targ_fn->max_threads_per_block / warp_size
> +			* warp_size));

That code handles the case that the default vector size is bigger than
the function being launched allows, independently of whether that
default is calculated by the runtime or set by GOMP_OPENACC_DIM.

The GOMP_OPENACC_DIM part is forward compatible, given that currently
the compiler doesn't allow the runtime to choose the vector length, and
AFAICT that will remain the same after application of the submitted set
of vector_length patches.

> therefore, I had the cuOccupancyMaxPotentialBlockSize code path behave
> the same.

They don't behave the same. What you add here is ignoring
GOMP_OPENACC_DIM[GOMP_DIM_VECTOR], not handling it. That requires a comment.

Furthermore, by assigning dims[GOMP_DIM_VECTOR] at the start you break
the pattern of the code, which:
- first applies GOMP_OPENACC_DIM
- then further fills in defaults as required
- then applies defaults

I've rewritten this bit to fit the pattern. The result is not pretty,
but it'll do for now.  Changing the pattern may make things better
structured, but that is something we can do in a follow-up patch, and
want to do for all dimensions at once, not just for vector; otherwise
the code will become too convoluted.

Btw, I've also noticed that we don't handle a too high
GOMP_OPENACC_DIM[GOMP_DIM_WORKER]; I've added a TODO comment for this.

> If you want, I can resubmit a patch without that change though
> 0001-nvptx-Use-CUDA-driver-API-to-select-default-runtime-.patch
> 
> [nvptx] Use CUDA driver API to select default runtime launch geometry
> 
> 	PR target/85590
> 
> 	libgomp/
> 	plugin/cuda/cuda.h (CUoccupancyB2DSize): New typedef.

You forgot to add '* ' in front of all files.

> 	(cuOccupancyMaxPotentialBlockSizeWithFlags): Declare.

You've added cuOccupancyMaxPotentialBlockSize, not
cuOccupancyMaxPotentialBlockSizeWithFlags.

> 	plugin/cuda-lib.def (cuOccupancyMaxPotentialBlockSize): New
> 	CUDA_ONE_CALL_MAYBE_NULL.
> 	plugin/plugin-nvptx.c (CUDA_VERSION < 6050): Define
> 	CUoccupancyB2DSize and declare
> 	cuOccupancyMaxPotentialBlockSizeWithFlags.

Same here.

> 	(nvptx_exec): Use cuOccupancyMaxPotentialBlockSize to set the
> 	default num_gangs and num_workers when the driver supports it.
> 
> ---
>  libgomp/plugin/cuda-lib.def   |  1 +
>  libgomp/plugin/cuda/cuda.h    |  3 ++
>  libgomp/plugin/plugin-nvptx.c | 77 +++++++++++++++++++++++++++++------
>  3 files changed, 69 insertions(+), 12 deletions(-)
> 
> [...]
> 
> +	    int vectors = default_dim_p[GOMP_DIM_VECTOR]
> +	      ? 0 : dims[GOMP_DIM_VECTOR];

This is badly formatted. Please read
https://www.gnu.org/prep/standards/html_node/Formatting.html .

This is typically formatted like:
...
int vectors = (default_dim_p[GOMP_DIM_VECTOR]
               ? 0 : dims[GOMP_DIM_VECTOR]);
...
or:
...
int vectors = (default_dim_p[GOMP_DIM_VECTOR]
               ? 0
               : dims[GOMP_DIM_VECTOR]);
...

> [...]
> 
> +	    for (i = 0; i != GOMP_DIM_MAX; i++)
> +	      {
> +		default_dim_p[i] = !dims[i];

This is duplicate code.


> [...]

Committed as attached.

Thanks,
- Tom
[nvptx] Use CUDA driver API to select default runtime launch geometry

The CUDA driver API starting version 6.5 offers a set of runtime functions to
calculate several occupancy-related measures, as a replacement for the occupancy
calculator spreadsheet.

This patch adds a heuristic for default runtime launch geometry, based on the
new runtime function cuOccupancyMaxPotentialBlockSize.

Built on x86_64 with nvptx accelerator and ran libgomp testsuite.

2018-08-13  Cesar Philippidis  <cesar@codesourcery.com>
	    Tom de Vries  <tdevries@suse.de>

	PR target/85590
	* plugin/cuda/cuda.h (CUoccupancyB2DSize): New typedef.
	(cuOccupancyMaxPotentialBlockSize): Declare.
	* plugin/cuda-lib.def (cuOccupancyMaxPotentialBlockSize): New
	CUDA_ONE_CALL_MAYBE_NULL.
	* plugin/plugin-nvptx.c (CUDA_VERSION < 6050): Define
	CUoccupancyB2DSize and declare
	cuOccupancyMaxPotentialBlockSize.
	(nvptx_exec): Use cuOccupancyMaxPotentialBlockSize to set the
	default num_gangs and num_workers when the driver supports it.

---
 libgomp/plugin/cuda-lib.def   |  1 +
 libgomp/plugin/cuda/cuda.h    |  3 ++
 libgomp/plugin/plugin-nvptx.c | 83 ++++++++++++++++++++++++++++++++++++-------
 3 files changed, 75 insertions(+), 12 deletions(-)

diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index 29028b504a0..b2a4c2154eb 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -41,6 +41,7 @@ CUDA_ONE_CALL (cuModuleGetGlobal)
 CUDA_ONE_CALL (cuModuleLoad)
 CUDA_ONE_CALL (cuModuleLoadData)
 CUDA_ONE_CALL (cuModuleUnload)
+CUDA_ONE_CALL_MAYBE_NULL (cuOccupancyMaxPotentialBlockSize)
 CUDA_ONE_CALL (cuStreamCreate)
 CUDA_ONE_CALL (cuStreamDestroy)
 CUDA_ONE_CALL (cuStreamQuery)
diff --git a/libgomp/plugin/cuda/cuda.h b/libgomp/plugin/cuda/cuda.h
index 4799825bda2..b4c1b29c5d8 100644
--- a/libgomp/plugin/cuda/cuda.h
+++ b/libgomp/plugin/cuda/cuda.h
@@ -44,6 +44,7 @@ typedef void *CUevent;
 typedef void *CUfunction;
 typedef void *CUlinkState;
 typedef void *CUmodule;
+typedef size_t (*CUoccupancyB2DSize)(int);
 typedef void *CUstream;
 
 typedef enum {
@@ -170,6 +171,8 @@ CUresult cuModuleGetGlobal (CUdeviceptr *, size_t *, CUmodule, const char *);
 CUresult cuModuleLoad (CUmodule *, const char *);
 CUresult cuModuleLoadData (CUmodule *, const void *);
 CUresult cuModuleUnload (CUmodule);
+CUresult cuOccupancyMaxPotentialBlockSize(int *, int *, CUfunction,
+					  CUoccupancyB2DSize, size_t, int);
 CUresult cuStreamCreate (CUstream *, unsigned);
 #define cuStreamDestroy cuStreamDestroy_v2
 CUresult cuStreamDestroy (CUstream);
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 6799a264976..bae1b05ccaa 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -61,9 +61,12 @@ CUresult cuLinkAddData (CUlinkState, CUjitInputType, void *, size_t,
 			const char *, unsigned, CUjit_option *, void **);
 CUresult cuLinkCreate (unsigned, CUjit_option *, void **, CUlinkState *);
 #else
+typedef size_t (*CUoccupancyB2DSize)(int);
 CUresult cuLinkAddData_v2 (CUlinkState, CUjitInputType, void *, size_t,
 			   const char *, unsigned, CUjit_option *, void **);
 CUresult cuLinkCreate_v2 (unsigned, CUjit_option *, void **, CUlinkState *);
+CUresult cuOccupancyMaxPotentialBlockSize(int *, int *, CUfunction,
+					  CUoccupancyB2DSize, size_t, int);
 #endif
 
 #define DO_PRAGMA(x) _Pragma (#x)
@@ -1200,21 +1203,77 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
       {
 	bool default_dim_p[GOMP_DIM_MAX];
 	for (i = 0; i != GOMP_DIM_MAX; i++)
+	  default_dim_p[i] = !dims[i];
+
+	if (!CUDA_CALL_EXISTS (cuOccupancyMaxPotentialBlockSize))
 	  {
-	    default_dim_p[i] = !dims[i];
-	    if (default_dim_p[i])
-	      dims[i] = nvthd->ptx_dev->default_dims[i];
+	    for (i = 0; i != GOMP_DIM_MAX; i++)
+	      if (default_dim_p[i])
+		dims[i] = nvthd->ptx_dev->default_dims[i];
+
+	    if (default_dim_p[GOMP_DIM_VECTOR])
+	      dims[GOMP_DIM_VECTOR]
+		= MIN (dims[GOMP_DIM_VECTOR],
+		       (targ_fn->max_threads_per_block / warp_size
+			* warp_size));
+
+	    if (default_dim_p[GOMP_DIM_WORKER])
+	      dims[GOMP_DIM_WORKER]
+		= MIN (dims[GOMP_DIM_WORKER],
+		       targ_fn->max_threads_per_block / dims[GOMP_DIM_VECTOR]);
 	  }
+	else
+	  {
+	    /* Handle the case that the compiler allows the runtime to choose
+	       the vector-length conservatively, by ignoring
+	       gomp_openacc_dims[GOMP_DIM_VECTOR].  TODO: actually handle
+	       it.  */
+	    int vectors = 0;
+	    /* TODO: limit gomp_openacc_dims[GOMP_DIM_WORKER] such that
+	       gomp_openacc_dims[GOMP_DIM_WORKER] * actual_vectors does not
+	       exceed targ_fn->max_threads_per_block.  */
+	    int workers = gomp_openacc_dims[GOMP_DIM_WORKER];
+	    int gangs = gomp_openacc_dims[GOMP_DIM_GANG];
+	    int grids, blocks;
+
+	    CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,
+			      &blocks, function, NULL, 0,
+			      dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);
+	    GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "
+			       "grid = %d, block = %d\n", grids, blocks);
+
+	    /* Keep the num_gangs proportional to the block size.  In
+	       the case where a block size is limited by shared-memory
+	       or the register file capacity, the runtime will not
+	       excessively over-assign gangs to the multiprocessor
+	       units if their state is going to be swapped out even
+	       more than necessary.  The constant factor 2 is there to
+	       prevent threads from idling when there is insufficient
+	       work for them.  */
+	    if (gangs == 0)
+	      gangs = 2 * grids * (blocks / warp_size);
+
+	    if (vectors == 0)
+	      vectors = warp_size;
+
+	    if (workers == 0)
+	      {
+		int actual_vectors = (default_dim_p[GOMP_DIM_VECTOR]
+				      ? vectors
+				      : dims[GOMP_DIM_VECTOR]);
+		workers = blocks / actual_vectors;
+	      }
 
-	if (default_dim_p[GOMP_DIM_VECTOR])
-	  dims[GOMP_DIM_VECTOR]
-	    = MIN (dims[GOMP_DIM_VECTOR],
-		   (targ_fn->max_threads_per_block / warp_size * warp_size));
-
-	if (default_dim_p[GOMP_DIM_WORKER])
-	  dims[GOMP_DIM_WORKER]
-	    = MIN (dims[GOMP_DIM_WORKER],
-		   targ_fn->max_threads_per_block / dims[GOMP_DIM_VECTOR]);
+	    for (i = 0; i != GOMP_DIM_MAX; i++)
+	      if (default_dim_p[i])
+		switch (i)
+		  {
+		  case GOMP_DIM_GANG: dims[i] = gangs; break;
+		  case GOMP_DIM_WORKER: dims[i] = workers; break;
+		  case GOMP_DIM_VECTOR: dims[i] = vectors; break;
+		  default: GOMP_PLUGIN_fatal ("invalid dim");
+		  }
+	  }
       }
     }
Cesar Philippidis Aug. 13, 2018, 2:54 p.m. | #14
On 08/13/2018 05:04 AM, Tom de Vries wrote:
> On 08/10/2018 08:39 PM, Cesar Philippidis wrote:
>> is that I modified the default value for vectors as follows
>>
>> +	    int vectors = default_dim_p[GOMP_DIM_VECTOR]
>> +	      ? 0 : dims[GOMP_DIM_VECTOR];
>>
>> Technically, trunk only supports warp-sized vectors, but the fallback
>> code is already checking for the presence of vectors as so
>>
>> +	    if (default_dim_p[GOMP_DIM_VECTOR])
>> +	      dims[GOMP_DIM_VECTOR]
>> +		= MIN (dims[GOMP_DIM_VECTOR],
>> +		       (targ_fn->max_threads_per_block / warp_size
>> +			* warp_size));
>>
> 
> That code handles the case that the default vector size is bigger than
> the function being launched allows, independent from whether that
> default is calculated by the runtime, or set by GOMP_OPENACC_DIM.
> 
> The GOMP_OPENACC_DIM part is forward compatible, given that currently
> the compiler doesn't allow the runtime to choose the vector length, and
> AFAICT that will remain the same after application of the submitted set
> of vector_length patches.
> 
>> therefore, I had the cuOccupancyMaxPotentialBlockSize code path behave
>> the same.
> 
> They don't behave the same. What you add here is ignoring
> GOMP_OPENACC_DIM[GOMP_DIM_VECTOR], not handling it. That requires a comment.

I meant "same" in the sense that it inspects for a pre-defined vector
length value, not that it applies the vector length. I should have been
clearer.

> Furthermore, by assigning dims[GOMP_DIM_VECTOR] at the start you break
> the pattern of the code, which:
> - first applies GOMP_OPENACC_DIM
> - then further fills in defaults as required
> - then applies defaults
> I've rewritten this bit to fit the pattern. This result is not pretty,
> but it'll do for now.  Changing the pattern may make things better
> structured, but this is something we can do in a follow up patch, and
> want to do for all dimensions at once, not just for vector, otherwise
> the code will become too convoluted.
> 
> Btw, I've also noticed that we don't handle a too high
> GOMP_OPENACC_DIM[GOMP_DIM_WORKER], I've added a TODO comment for this.

That's why I set vectors to dims[GOMP_DIM_VECTOR] when set. However, I
do agree that this is a task for a follow-up patch.

> Committed as attached.

Thank you Tom!

Looking at my patch queue, there's only one more non-vector-length
related patch in there - Remove use of CUDA unified memory in libgomp
<https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01970.html>. Going
forward, how would you like to proceed with the nvptx BE vector length
changes?

Cesar
Tom de Vries Aug. 13, 2018, 3:08 p.m. | #15
On 08/13/2018 04:54 PM, Cesar Philippidis wrote:
> Going
> forward, how would you like to proceed with the nvptx BE vector length
> changes.

Do you have a branch available on github containing the patch series
you've submitted?

Thanks,
- Tom
Cesar Philippidis Aug. 13, 2018, 3:48 p.m. | #16
On 08/13/2018 08:08 AM, Tom de Vries wrote:
> On 08/13/2018 04:54 PM, Cesar Philippidis wrote:
>> Going
>> forward, how would you like to proceed with the nvptx BE vector length
>> changes.
> 
> Do you have a branch available on github containing the patch series
> you've submitted?

Yes, https://github.com/cesarjp/gcc/tree/trunk-og8-vl-private

Beware that I'm constantly rebasing that branch to keep my patches up to
date. All of the commit subject lines prefixed with [nvptx] touch the
nvptx BE. The [OpenACC] patches either involve platform-independent
code or libgomp.

Cesar

Patch

[nvptx] Use CUDA driver API to select default runtime launch geometry

2018-XX-YY  Cesar Philippidis  <cesar@codesourcery.com>
	libgomp/
	plugin/cuda/cuda.h (CUoccupancyB2DSize): New typedef.
	(cuDriverGetVersion): Declare.
	(cuOccupancyMaxPotentialBlockSizeWithFlags): Declare.
	plugin/plugin-nvptx.c (CUDA_ONE_CALL): Add entries for
	cuDriverGetVersion and cuOccupancyMaxPotentialBlockSize.
	(ptx_device): Add driver_version member.
	(nvptx_open_device): Initialize it.
	(nvptx_exec): Use cuOccupancyMaxPotentialBlockSize to set the
	default num_gangs and num_workers when the driver supports it.
---
 libgomp/plugin/cuda/cuda.h    |  5 +++++
 libgomp/plugin/plugin-nvptx.c | 37 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/libgomp/plugin/cuda/cuda.h b/libgomp/plugin/cuda/cuda.h
index 4799825..1fc694d 100644
--- a/libgomp/plugin/cuda/cuda.h
+++ b/libgomp/plugin/cuda/cuda.h
@@ -44,6 +44,7 @@ typedef void *CUevent;
 typedef void *CUfunction;
 typedef void *CUlinkState;
 typedef void *CUmodule;
+typedef size_t (*CUoccupancyB2DSize)(int);
 typedef void *CUstream;
 
 typedef enum {
@@ -123,6 +124,7 @@ CUresult cuCtxSynchronize (void);
 CUresult cuDeviceGet (CUdevice *, int);
 CUresult cuDeviceGetAttribute (int *, CUdevice_attribute, CUdevice);
 CUresult cuDeviceGetCount (int *);
+CUresult cuDriverGetVersion (int *);
 CUresult cuEventCreate (CUevent *, unsigned);
 #define cuEventDestroy cuEventDestroy_v2
 CUresult cuEventDestroy (CUevent);
@@ -170,6 +172,9 @@ CUresult cuModuleGetGlobal (CUdeviceptr *, size_t *, CUmodule, const char *);
 CUresult cuModuleLoad (CUmodule *, const char *);
 CUresult cuModuleLoadData (CUmodule *, const void *);
 CUresult cuModuleUnload (CUmodule);
+CUresult cuOccupancyMaxPotentialBlockSizeWithFlags (int *, int *, CUfunction,
+						    CUoccupancyB2DSize, size_t,
+						    int, unsigned int);
 CUresult cuStreamCreate (CUstream *, unsigned);
 #define cuStreamDestroy cuStreamDestroy_v2
 CUresult cuStreamDestroy (CUstream);
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index b6ec5f8..2647af6 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -63,6 +63,7 @@ CUDA_ONE_CALL (cuCtxSynchronize)	\
 CUDA_ONE_CALL (cuDeviceGet)		\
 CUDA_ONE_CALL (cuDeviceGetAttribute)	\
 CUDA_ONE_CALL (cuDeviceGetCount)	\
+CUDA_ONE_CALL (cuDriverGetVersion)	\
 CUDA_ONE_CALL (cuEventCreate)		\
 CUDA_ONE_CALL (cuEventDestroy)		\
 CUDA_ONE_CALL (cuEventElapsedTime)	\
@@ -94,6 +95,7 @@ CUDA_ONE_CALL (cuModuleGetGlobal)	\
 CUDA_ONE_CALL (cuModuleLoad)		\
 CUDA_ONE_CALL (cuModuleLoadData)	\
 CUDA_ONE_CALL (cuModuleUnload)		\
+CUDA_ONE_CALL (cuOccupancyMaxPotentialBlockSize) \
 CUDA_ONE_CALL (cuStreamCreate)		\
 CUDA_ONE_CALL (cuStreamDestroy)		\
 CUDA_ONE_CALL (cuStreamQuery)		\
@@ -423,6 +425,7 @@ struct ptx_device
   int max_threads_per_block;
   int max_threads_per_multiprocessor;
   int default_dims[GOMP_DIM_MAX];
+  int driver_version;
 
   struct ptx_image_data *images;  /* Images loaded on device.  */
   pthread_mutex_t image_lock;     /* Lock for above list.  */
@@ -734,6 +737,7 @@ nvptx_open_device (int n)
   ptx_dev->ord = n;
   ptx_dev->dev = dev;
   ptx_dev->ctx_shared = false;
+  ptx_dev->driver_version = 0;
 
   r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &ctx_dev);
   if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
@@ -827,6 +831,9 @@ nvptx_open_device (int n)
   for (int i = 0; i != GOMP_DIM_MAX; i++)
     ptx_dev->default_dims[i] = 0;
 
+  CUDA_CALL_ERET (NULL, cuDriverGetVersion, &pi);
+  ptx_dev->driver_version = pi;
+
   ptx_dev->images = NULL;
   pthread_mutex_init (&ptx_dev->image_lock, NULL);
 
@@ -1220,11 +1227,39 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 
       {
 	bool default_dim_p[GOMP_DIM_MAX];
+	int vectors = nvthd->ptx_dev->default_dims[GOMP_DIM_VECTOR];
+	int workers = nvthd->ptx_dev->default_dims[GOMP_DIM_WORKER];
+	int gangs = nvthd->ptx_dev->default_dims[GOMP_DIM_GANG];
+
+	/* The CUDA driver occupancy calculator is only available on
+	   CUDA version 6.5 (6050) and newer.  */
+	if (nvthd->ptx_dev->driver_version > 6050)
+	  {
+	    int grids, blocks;
+	    CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,
+			      &blocks, function, NULL, 0,
+			      dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);
+	    GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "
+			       "grid = %d, block = %d\n", grids, blocks);
+
+	    if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_GANG) == 0)
+	      gangs = grids * (blocks / warp_size);
+
+	    if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_WORKER) == 0)
+	      workers = blocks / vectors;
+	  }
+
 	for (i = 0; i != GOMP_DIM_MAX; i++)
 	  {
 	    default_dim_p[i] = !dims[i];
 	    if (default_dim_p[i])
-	      dims[i] = nvthd->ptx_dev->default_dims[i];
+	      switch (i)
+		{
+		case GOMP_DIM_GANG: dims[i] = gangs; break;
+		case GOMP_DIM_WORKER: dims[i] = workers; break;
+		case GOMP_DIM_VECTOR: dims[i] = vectors; break;
+		default: GOMP_PLUGIN_fatal ("invalid dim");
+		}
 	  }
 
 	if (default_dim_p[GOMP_DIM_VECTOR])
-- 
2.7.4