[2/6,OpenACC,libgomp] Async re-work, oacc-* parts

Message ID 12319572-dd02-c946-f2b9-9d047be9c707@mentor.com
State New
Headers show
Series
  • Async re-work
Related show

Commit Message

Chung-Lin Tang Sept. 25, 2018, 1:10 p.m.
Hi Thomas,
These are the OpenACC specific changes, mostly the re-implementation of async-related acc_* runtime
library API functions to use the new backend plugin interfaces, in a non-target specific way.

Thanks,
Chung-Lin


	* oacc-async.c (get_goacc_thread): New function.
	(get_goacc_thread_device): New function.
	(lookup_goacc_asyncqueue): New function.
	(get_goacc_asyncqueue): New function.
	(acc_async_test): Adjust code to use new async design.
	(acc_async_test_all): Likewise.
	(acc_wait): Likewise.
	(acc_wait_async): Likewise.
	(acc_wait_all): Likewise.
	(acc_wait_all_async): Likewise.
	(acc_get_default_async): New API function.
	(acc_set_default_async): Likewise.
	(goacc_async_unmap_tgt): New function.
	(goacc_async_copyout_unmap_vars): Likewise.
	(goacc_async_free): Likewise.
	(goacc_init_asyncqueues): Likewise.
	(goacc_fini_asyncqueues): Likewise.
	* oacc-cuda.c (acc_get_cuda_stream): Adjust code to use new async
	design.
	(acc_set_cuda_stream): Likewise.
	* oacc-host.c (host_openacc_exec): Adjust parameters, remove 'async'.
	(host_openacc_register_async_cleanup): Remove.
	(host_openacc_async_exec): New function.
	(host_openacc_async_test): Adjust parameters.
	(host_openacc_async_test_all): Remove.
	(host_openacc_async_wait): Remove.
	(host_openacc_async_wait_async): Remove.
	(host_openacc_async_wait_all): Remove.
	(host_openacc_async_wait_all_async): Remove.
	(host_openacc_async_set_async): Remove.
	(host_openacc_async_synchronize): New function.
	(host_openacc_async_serialize): New function.
	(host_openacc_async_host2dev): New function.
	(host_openacc_async_dev2host): New function.
	(host_openacc_async_queue_callback): New function.
	(host_openacc_async_construct): New function.
	(host_openacc_async_destruct): New function.
	(struct gomp_device_descr host_dispatch): Remove initialization of old
	interface, add intialization of new async sub-struct.
	* oacc-init.c (acc_shutdown_1): Adjust to use gomp_fini_device.
	(goacc_attach_host_thread_to_device): Remove old async code usage, add
	initialization of per-thread default_async.
	* oacc-int.h (struct goacc_thread): Add default_async field.
	(goacc_init_asyncqueues): New declaration.
	(goacc_fini_asyncqueues): Likewise.
	(goacc_async_copyout_unmap_vars): Likewise.
	(goacc_async_free): Likewise.
	(get_goacc_asyncqueue): Likewise.
	(lookup_goacc_asyncqueue): Likewise.

	* oacc-mem.c (memcpy_tofrom_device): Adjust code to use new async
	design.
	(present_create_copy): Likewise.
	(delete_copyout): Likewise.
	(update_dev_host): Likewise.
	(gomp_acc_insert_pointer): Add async parameter, adjust code to use new
	async design.
	(gomp_acc_remove_pointer): Adjust code to use new async design.
	* oacc-parallel.c (GOACC_parallel_keyed): Likewise.
	(GOACC_enter_exit_data): Likewise.
	(goacc_wait): Likewise.
	(GOACC_update): Likewise.
	* oacc-plugin.c (GOMP_PLUGIN_async_unmap_vars): Remove.

Comments

Thomas Schwinge Dec. 7, 2018, 11:32 a.m. | #1
Hi Chung-Lin!

On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> These are the OpenACC specific changes, mostly the re-implementation of async-related acc_* runtime

> library API functions to use the new backend plugin interfaces, in a non-target specific way.


(The patch was missing the "libgomp/oacc-int.h" changes, but I had no
problem replicating these from openacc-gcc-8-branch.)


Does the following make sense?

commit f86b403dbe6ed17afa8d157ec4089ff169a63680
Author: Thomas Schwinge <thomas@codesourcery.com>
Date:   Fri Dec 7 12:19:56 2018 +0100

    Don't create an asyncqueue just to then test/synchronize with it
---
 libgomp/oacc-async.c    | 12 ++++++++----
 libgomp/oacc-parallel.c |  4 +++-
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git libgomp/oacc-async.c libgomp/oacc-async.c
index 553082fe3d4a..c9b134ac3380 100644
--- libgomp/oacc-async.c
+++ libgomp/oacc-async.c
@@ -119,8 +119,11 @@ acc_async_test (int async)
   if (!thr || !thr->dev)
     gomp_fatal ("no device active");
 
-  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
-  return thr->dev->openacc.async.test_func (aq);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
+  if (!aq)
+    return 1;
+  else
+    return thr->dev->openacc.async.test_func (aq);
 }
 
 int
@@ -148,8 +151,9 @@ acc_wait (int async)
 
   struct goacc_thread *thr = get_goacc_thread ();
 
-  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
-  thr->dev->openacc.async.synchronize_func (aq);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
+  if (aq)
+    thr->dev->openacc.async.synchronize_func (aq);
 }
 
 /* acc_async_wait is an OpenACC 1.0 compatibility name for acc_wait.  */
diff --git libgomp/oacc-parallel.c libgomp/oacc-parallel.c
index 7b6a6e515018..0be48f98036f 100644
--- libgomp/oacc-parallel.c
+++ libgomp/oacc-parallel.c
@@ -491,7 +491,9 @@ goacc_wait (int async, int num_waits, va_list *ap)
     {
       int qid = va_arg (*ap, int);
       
-      goacc_aq aq = get_goacc_asyncqueue (qid);
+      goacc_aq aq = lookup_goacc_asyncqueue (thr, false, qid);
+      if (!aq)
+	continue;
       if (acc_dev->openacc.async.test_func (aq))
 	continue;
       if (async == acc_async_sync)


Grüße
 Thomas
Chung-Lin Tang Dec. 7, 2018, 2:19 p.m. | #2
On 2018/12/7 07:32 PM, Thomas Schwinge wrote:
> Hi Chung-Lin!

> 

> On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:

>> These are the OpenACC specific changes, mostly the re-implementation of async-related acc_* runtime

>> library API functions to use the new backend plugin interfaces, in a non-target specific way.

> 

> (The patch was missing the "libgomp/oacc-int.h" changes, but I had no

> problem replicating these from openacc-gcc-8-branch.)

> 

> 

> Does the following make sense?


I don't quite remember why I simply ensured asyncqueue creation here at the time,
maybe simply because it allowed simpler code at this level.

OTOH, the old logic is to GOMP_fatal upon such an unknown queue case, so maybe that's
the right thing to do (inside lookup_goacc_asyncqueue()), instead of silently allowing it to pass.

WDYT?

Chung-Lin


> commit f86b403dbe6ed17afa8d157ec4089ff169a63680

> Author: Thomas Schwinge <thomas@codesourcery.com>

> Date:   Fri Dec 7 12:19:56 2018 +0100

> 

>     Don't create an asyncqueue just to then test/synchronize with it

> ---

>  libgomp/oacc-async.c    | 12 ++++++++----

>  libgomp/oacc-parallel.c |  4 +++-

>  2 files changed, 11 insertions(+), 5 deletions(-)

> 

> diff --git libgomp/oacc-async.c libgomp/oacc-async.c

> index 553082fe3d4a..c9b134ac3380 100644

> --- libgomp/oacc-async.c

> +++ libgomp/oacc-async.c

> @@ -119,8 +119,11 @@ acc_async_test (int async)

>    if (!thr || !thr->dev)

>      gomp_fatal ("no device active");

>  

> -  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);

> -  return thr->dev->openacc.async.test_func (aq);

> +  goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);

> +  if (!aq)

> +    return 1;

> +  else

> +    return thr->dev->openacc.async.test_func (aq);

>  }

>  

>  int

> @@ -148,8 +151,9 @@ acc_wait (int async)

>  

>    struct goacc_thread *thr = get_goacc_thread ();

>  

> -  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);

> -  thr->dev->openacc.async.synchronize_func (aq);

> +  goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);

> +  if (aq)

> +    thr->dev->openacc.async.synchronize_func (aq);

>  }

>  

>  /* acc_async_wait is an OpenACC 1.0 compatibility name for acc_wait.  */

> diff --git libgomp/oacc-parallel.c libgomp/oacc-parallel.c

> index 7b6a6e515018..0be48f98036f 100644

> --- libgomp/oacc-parallel.c

> +++ libgomp/oacc-parallel.c

> @@ -491,7 +491,9 @@ goacc_wait (int async, int num_waits, va_list *ap)

>      {

>        int qid = va_arg (*ap, int);

>        

> -      goacc_aq aq = get_goacc_asyncqueue (qid);

> +      goacc_aq aq = lookup_goacc_asyncqueue (thr, false, qid);

> +      if (!aq)

> +	continue;

>        if (acc_dev->openacc.async.test_func (aq))

>  	continue;

>        if (async == acc_async_sync)

> 

> 

> Grüße

>  Thomas

> 

>
Thomas Schwinge Dec. 14, 2018, 2:11 p.m. | #3
Hi Chung-Lin!

On Fri, 7 Dec 2018 22:19:14 +0800, Chung-Lin Tang <cltang@pllab.cs.nthu.edu.tw> wrote:
> On 2018/12/7 07:32 PM, Thomas Schwinge wrote:

> > Does the following make sense?

> 

> I don't quite remember why I simply ensured asyncqueue creation here at the time,

> maybe simply because it allowed simpler code at this level.


Well, I think it's just overhead we can avoid.  ;-)

> OTOH, the old logic is to GOMP_fatal upon such an unknown queue case, so maybe that's

> the right thing to do (inside lookup_goacc_asyncqueue()), instead of silently allowing it to pass.

> 

> WDYT?


I argued and posted patches (or will post if not yet done) to make this
defined, valid behavior, <https://gcc.gnu.org/PR88407> "[OpenACC]
Correctly handle unseen async-arguments".  Please speak up soon if you
disagree.

Thus, I still propose that you include the following.

Please especially review the "libgomp/oacc-parallel.c:goacc_wait" change,
and confirm no corresponding "libgomp/oacc-parallel.c:GOACC_wait" change
to be done, because that code is structured differently.

commit c96c6607b77bdbf562f35209718d8b8c5705c603
Author: Thomas Schwinge <thomas@codesourcery.com>
Date:   Fri Dec 7 12:19:56 2018 +0100

    into async re-work: don't create an asyncqueue just to then test/synchronize with it
---
 libgomp/oacc-async.c    | 12 ++++++++----
 libgomp/oacc-parallel.c |  4 +++-
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git libgomp/oacc-async.c libgomp/oacc-async.c
index 553082fe3d4a..c9b134ac3380 100644
--- libgomp/oacc-async.c
+++ libgomp/oacc-async.c
@@ -119,8 +119,11 @@ acc_async_test (int async)
   if (!thr || !thr->dev)
     gomp_fatal ("no device active");
 
-  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
-  return thr->dev->openacc.async.test_func (aq);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
+  if (!aq)
+    return 1;
+  else
+    return thr->dev->openacc.async.test_func (aq);
 }
 
 int
@@ -148,8 +151,9 @@ acc_wait (int async)
 
   struct goacc_thread *thr = get_goacc_thread ();
 
-  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
-  thr->dev->openacc.async.synchronize_func (aq);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
+  if (aq)
+    thr->dev->openacc.async.synchronize_func (aq);
 }
 
 /* acc_async_wait is an OpenACC 1.0 compatibility name for acc_wait.  */
diff --git libgomp/oacc-parallel.c libgomp/oacc-parallel.c
index 2815a10f0386..9519abeccc2c 100644
--- libgomp/oacc-parallel.c
+++ libgomp/oacc-parallel.c
@@ -493,7 +493,9 @@ goacc_wait (int async, int num_waits, va_list *ap)
     {
       int qid = va_arg (*ap, int);
       
-      goacc_aq aq = get_goacc_asyncqueue (qid);
+      goacc_aq aq = lookup_goacc_asyncqueue (thr, false, qid);
+      if (!aq)
+	continue;
       if (acc_dev->openacc.async.test_func (aq))
 	continue;
       if (async == acc_async_sync)


Grüße
 Thomas
Thomas Schwinge Dec. 14, 2018, 2:17 p.m. | #4
Hi Chung-Lin!

On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> --- a/libgomp/oacc-async.c

> +++ b/libgomp/oacc-async.c


> +attribute_hidden struct goacc_asyncqueue *

> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)

> +{

> +  /* The special value acc_async_noval (-1) maps to the thread-specific

> +     default async stream.  */

> +  if (async == acc_async_noval)

> +    async = thr->default_async;

> +

> +  if (async == acc_async_sync)

> +    return NULL;

> +

> +  if (async < 0)

> +    gomp_fatal ("bad async %d", async);


To make this "resolve" part more obvious, that is, the translation from
the "async" argument to an "asyncqueue" array index:

> +  if (!create

> +      && (async >= dev->openacc.async.nasyncqueue

> +	  || !dev->openacc.async.asyncqueue[async]))

> +    return NULL;

> +[...]


..., I propose adding a "async2id" function for that, and then rename all
"asyncqueue[async]" to "asyncqueue[id]".

And, this also restores the current trunk behavior, so that
"acc_async_noval" gets its own, separate "asyncqueue".

commit e0d10cd744906c031af536bbf523ed6607370bf7
Author: Thomas Schwinge <thomas@codesourcery.com>
Date:   Wed Dec 12 15:22:29 2018 +0100

    into async re-work: libgomp/oacc-async.c:async2id
---
 libgomp/oacc-async.c | 58 +++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 39 insertions(+), 19 deletions(-)

diff --git libgomp/oacc-async.c libgomp/oacc-async.c
index c9b134ac3380..b091ba2460ac 100644
--- libgomp/oacc-async.c
+++ libgomp/oacc-async.c
@@ -54,53 +54,73 @@ get_goacc_thread_device (void)
   return thr->dev;
 }
 
-attribute_hidden struct goacc_asyncqueue *
-lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
+/* Translate from an OpenACC async-argument to an internal asyncqueue ID, or -1
+   if no asyncqueue is to be used.  */
+
+static int
+async2id (int async)
 {
-  /* The special value acc_async_noval (-1) maps to the thread-specific
-     default async stream.  */
-  if (async == acc_async_noval)
-    async = 0; //TODO thr->default_async;
+  if (!async_valid_p (async))
+    gomp_fatal ("invalid async-argument: %d", async);
 
   if (async == acc_async_sync)
+    return -1;
+  else if (async == acc_async_noval)
+    return 0;
+  else if (async >= 0)
+    return 1 + async;
+  else
+    __builtin_unreachable ();
+}
+
+/* Return the asyncqueue to be used for OpenACC async-argument ASYNC.  This
+   might return NULL if no asyncqueue is to be used.  Otherwise, if CREATE,
+   create the asyncqueue if it doesn't exist yet.  */
+
+attribute_hidden struct goacc_asyncqueue *
+lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
+{
+  int id = async2id (async);
+  if (id < 0)
     return NULL;
 
-  if (async < 0)
-    gomp_fatal ("bad async %d", async);
-
   struct gomp_device_descr *dev = thr->dev;
 
   if (!create
-      && (async >= dev->openacc.async.nasyncqueue
-	  || !dev->openacc.async.asyncqueue[async]))
+      && (id >= dev->openacc.async.nasyncqueue
+	  || !dev->openacc.async.asyncqueue[id]))
     return NULL;
 
   gomp_mutex_lock (&dev->openacc.async.lock);
-  if (async >= dev->openacc.async.nasyncqueue)
+  if (id >= dev->openacc.async.nasyncqueue)
     {
-      int diff = async + 1 - dev->openacc.async.nasyncqueue;
+      int diff = id + 1 - dev->openacc.async.nasyncqueue;
       dev->openacc.async.asyncqueue
 	= gomp_realloc (dev->openacc.async.asyncqueue,
-			sizeof (goacc_aq) * (async + 1));
+			sizeof (goacc_aq) * (id + 1));
       memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue,
 	      0, sizeof (goacc_aq) * diff);
-      dev->openacc.async.nasyncqueue = async + 1;
+      dev->openacc.async.nasyncqueue = id + 1;
     }
 
-  if (!dev->openacc.async.asyncqueue[async])
+  if (!dev->openacc.async.asyncqueue[id])
     {
-      dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func ();
+      dev->openacc.async.asyncqueue[id] = dev->openacc.async.construct_func ();
 
       /* Link new async queue into active list.  */
       goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));
-      n->aq = dev->openacc.async.asyncqueue[async];
+      n->aq = dev->openacc.async.asyncqueue[id];
       n->next = dev->openacc.async.active;
       dev->openacc.async.active = n;
     }
   gomp_mutex_unlock (&dev->openacc.async.lock);
-  return dev->openacc.async.asyncqueue[async];
+  return dev->openacc.async.asyncqueue[id];
 }
 
+/* Return the asyncqueue to be used for OpenACC async-argument ASYNC.  This
+   might return NULL if no asyncqueue is to be used.  Otherwise, create the
+   asyncqueue if it doesn't exist yet.  */
+
 attribute_hidden struct goacc_asyncqueue *
 get_goacc_asyncqueue (int async)
 {


Grüße
 Thomas
Thomas Schwinge Dec. 14, 2018, 2:32 p.m. | #5
Hi Chung-Lin!

On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
>  void

>  acc_wait_async (int async1, int async2)

>  {

> +  struct goacc_thread *thr = get_goacc_thread ();

>  

> +  goacc_aq aq2 = lookup_goacc_asyncqueue (thr, true, async2);

> +  goacc_aq aq1 = lookup_goacc_asyncqueue (thr, false, async1);

> +  if (!aq1)

> +    gomp_fatal ("invalid async 1");

> +  if (aq1 == aq2)

> +    gomp_fatal ("identical parameters");

>  

> +  thr->dev->openacc.async.synchronize_func (aq1);

> +  thr->dev->openacc.async.serialize_func (aq1, aq2);

>  }


Invoked as "acc_wait_async ([...], acc_async_sync)" (as used in a test
case that I'll soon submit/commit), we'll end up with "aq2 == NULL", and
will segfault in the nvptx "openacc.async.serialize_func".

Good to fix as follows?

commit 448ff855bd954a72b5edb19fc1f3d481833fcb59
Author: Thomas Schwinge <thomas@codesourcery.com>
Date:   Thu Dec 13 17:43:42 2018 +0100

    into async re-work: adjust for test case added in "[PR88484] OpenACC wait directive without wait argument but with async clause"
---
 libgomp/oacc-async.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git libgomp/oacc-async.c libgomp/oacc-async.c
index 7e61b5dc0a05..a38e42781aa0 100644
--- libgomp/oacc-async.c
+++ libgomp/oacc-async.c
@@ -196,7 +196,8 @@ acc_wait_async (int async1, int async2)
     gomp_fatal ("identical parameters");
 
   thr->dev->openacc.async.synchronize_func (aq1);
-  thr->dev->openacc.async.serialize_func (aq1, aq2);
+  if (aq2)
+    thr->dev->openacc.async.serialize_func (aq1, aq2);
 }
 
 void


Grüße
 Thomas
Chung-Lin Tang Dec. 14, 2018, 2:42 p.m. | #6
On 2018/12/14 10:32 PM, Thomas Schwinge wrote:
> Invoked as "acc_wait_async ([...], acc_async_sync)" (as used in a test

> case that I'll soon submit/commit), we'll end up with "aq2 == NULL", and

> will segfault in the nvptx "openacc.async.serialize_func".


What does "wait async(acc_async_sync)" supposed to mean? Instead of fixing
it here, will it make more sense to have the serialize_func hook to accommodate
the NULL asyncqueue?

Chung-Lin
Chung-Lin Tang Dec. 14, 2018, 2:52 p.m. | #7
On 2018/12/14 10:17 PM, Thomas Schwinge wrote:
> Hi Chung-Lin!

> 

> On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:

>> --- a/libgomp/oacc-async.c

>> +++ b/libgomp/oacc-async.c

> 

>> +attribute_hidden struct goacc_asyncqueue *

>> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)

>> +{

>> +  /* The special value acc_async_noval (-1) maps to the thread-specific

>> +     default async stream.  */

>> +  if (async == acc_async_noval)

>> +    async = thr->default_async;

>> +

>> +  if (async == acc_async_sync)

>> +    return NULL;

>> +

>> +  if (async < 0)

>> +    gomp_fatal ("bad async %d", async);

> 

> To make this "resolve" part more obvious, that is, the translation from

> the "async" argument to an "asyncqueue" array index:

> 

>> +  if (!create

>> +      && (async >= dev->openacc.async.nasyncqueue

>> +	  || !dev->openacc.async.asyncqueue[async]))

>> +    return NULL;

>> +[...]

> 

> ..., I propose adding a "async2id" function for that, and then rename all

> "asyncqueue[async]" to "asyncqueue[id]".


I don't think this is needed. This is the only place in the entire runtime that
does asyncqueue indexing, adding more conceptual layers of re-directed indexing
seems unneeded.

I do think the more descriptive comments are nice though.

> And, this also restores the current trunk behavior, so that

> "acc_async_noval" gets its own, separate "asyncqueue".


Is there a reason we need to restore that behavior right now?

Thanks,
Chung-Lin
Thomas Schwinge Dec. 14, 2018, 2:53 p.m. | #8
Hi Chung-Lin!

On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> --- a/libgomp/oacc-parallel.c

> +++ b/libgomp/oacc-parallel.c

> @@ -377,8 +360,6 @@ GOACC_enter_exit_data (int device, size_t mapnum,

>  	finalize = true;

>      }

>  

> -  acc_dev->openacc.async_set_async_func (async);

> -

>    /* Determine if this is an "acc enter data".  */

>    for (i = 0; i < mapnum; ++i)

>      {

> @@ -450,7 +431,7 @@ GOACC_enter_exit_data (int device, size_t mapnum,

>  	  else

>  	    {

>  	      gomp_acc_insert_pointer (pointer, &hostaddrs[i],

> -				       &sizes[i], &kinds[i]);

> +				       &sizes[i], &kinds[i], async);

>  	      /* Increment 'i' by two because OpenACC requires fortran

>  		 arrays to be contiguous, so each PSET is associated with

>  		 one of MAP_FORCE_ALLOC/MAP_FORCE_PRESET/MAP_FORCE_TO, and

> @@ -475,17 +456,17 @@ GOACC_enter_exit_data (int device, size_t mapnum,

>  		if (acc_is_present (hostaddrs[i], sizes[i]))

>  		  {

>  		    if (finalize)

> -		      acc_delete_finalize (hostaddrs[i], sizes[i]);

> +		      acc_delete_finalize_async (hostaddrs[i], sizes[i], async);

>  		    else

> -		      acc_delete (hostaddrs[i], sizes[i]);

> +		      acc_delete_async (hostaddrs[i], sizes[i], async);

>  		  }

>  		break;

>  	      case GOMP_MAP_FROM:

>  	      case GOMP_MAP_FORCE_FROM:

>  		if (finalize)

> -		  acc_copyout_finalize (hostaddrs[i], sizes[i]);

> +		  acc_copyout_finalize_async (hostaddrs[i], sizes[i], async);

>  		else

> -		  acc_copyout (hostaddrs[i], sizes[i]);

> +		  acc_copyout_async (hostaddrs[i], sizes[i], async);

>  		break;

>  	      default:

>  		gomp_fatal (">>>> GOACC_enter_exit_data UNHANDLED kind 0x%.2x",

> @@ -503,8 +484,6 @@ GOACC_enter_exit_data (int device, size_t mapnum,

>  	    i += pointer - 1;

>  	  }

>        }

> -

> -  acc_dev->openacc.async_set_async_func (acc_async_sync);

>  }


Additionally the following, or why not?  Please comment on the one TODO
which before your async re-work also was -- incorrectly? -- run
asynchronously?

commit 34c9ce65ad1f9865d0716d18c364d8c6928e694c
Author: Thomas Schwinge <thomas@codesourcery.com>
Date:   Fri Dec 14 14:34:17 2018 +0100

    into async re-work: more async function usage
---
 libgomp/oacc-parallel.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git libgomp/oacc-parallel.c libgomp/oacc-parallel.c
index 5a441c9efe38..91875c57fc97 100644
--- libgomp/oacc-parallel.c
+++ libgomp/oacc-parallel.c
@@ -413,11 +413,11 @@ GOACC_enter_exit_data (int device, size_t mapnum,
 		{
 		case GOMP_MAP_ALLOC:
 		case GOMP_MAP_FORCE_ALLOC:
-		  acc_create (hostaddrs[i], sizes[i]);
+		  acc_create_async (hostaddrs[i], sizes[i], async);
 		  break;
 		case GOMP_MAP_TO:
 		case GOMP_MAP_FORCE_TO:
-		  acc_copyin (hostaddrs[i], sizes[i]);
+		  acc_copyin_async (hostaddrs[i], sizes[i], async);
 		  break;
 		default:
 		  gomp_fatal (">>>> GOACC_enter_exit_data UNHANDLED kind 0x%.2x",
@@ -563,6 +563,8 @@ GOACC_update (int device, size_t mapnum,
 		 the value of the allocated device memory in the
 		 previous pointer.  */
 	      *(uintptr_t *) hostaddrs[i] = (uintptr_t)dptr;
+	      /* This is intentionally no calling acc_update_device_async,
+		 because TODO.  */
 	      acc_update_device (hostaddrs[i], sizeof (uintptr_t));
 
 	      /* Restore the host pointer.  */


Grüße
 Thomas
Thomas Schwinge Dec. 14, 2018, 2:56 p.m. | #9
Hi Chung-Lin!

On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> --- a/libgomp/oacc-async.c

> +++ b/libgomp/oacc-async.c


> +attribute_hidden struct goacc_asyncqueue *

> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)

> +{

> +  /* The special value acc_async_noval (-1) maps to the thread-specific

> +     default async stream.  */

> +  if (async == acc_async_noval)

> +    async = thr->default_async;

> +

> +  if (async == acc_async_sync)

> +    return NULL;

> +

> +  if (async < 0)

> +    gomp_fatal ("bad async %d", async);

> +

> +  struct gomp_device_descr *dev = thr->dev;

> +

> +  if (!create

> +      && (async >= dev->openacc.async.nasyncqueue

> +	  || !dev->openacc.async.asyncqueue[async]))

> +    return NULL;

> +


Doesn't this last block also have to be included in the lock you're
taking below?

> +  gomp_mutex_lock (&dev->openacc.async.lock);

> +  if (async >= dev->openacc.async.nasyncqueue)

> +    {

> +      int diff = async + 1 - dev->openacc.async.nasyncqueue;

> +      dev->openacc.async.asyncqueue

> +	= gomp_realloc (dev->openacc.async.asyncqueue,

> +			sizeof (goacc_aq) * (async + 1));

> +      memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue,

> +	      0, sizeof (goacc_aq) * diff);

> +      dev->openacc.async.nasyncqueue = async + 1;

> +    }

> +

> +  if (!dev->openacc.async.asyncqueue[async])

> +    {

> +      dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func ();

> +

> +      /* Link new async queue into active list.  */

> +      goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));

> +      n->aq = dev->openacc.async.asyncqueue[async];

> +      n->next = dev->openacc.async.active;

> +      dev->openacc.async.active = n;

> +    }

> +  gomp_mutex_unlock (&dev->openacc.async.lock);

> +  return dev->openacc.async.asyncqueue[async];

> +}


And then, some more concerns, as encoded in the following patch (but
please also continue reading below):

commit d2d6aaeca840debbec14e421be705ef56d444ac7
Author: Thomas Schwinge <thomas@codesourcery.com>
Date:   Wed Dec 12 15:57:30 2018 +0100

    into async re-work: locking concerns
---
 libgomp/oacc-async.c          | 18 +++++++++++++++---
 libgomp/plugin/plugin-nvptx.c |  6 ++++++
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git libgomp/oacc-async.c libgomp/oacc-async.c
index 89a405ebcdb1..68e4e65e8182 100644
--- libgomp/oacc-async.c
+++ libgomp/oacc-async.c
@@ -84,17 +84,21 @@ lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
   if (id < 0)
     return NULL;
 
+  struct goacc_asyncqueue *ret = NULL;
+
   struct gomp_device_descr *dev = thr->dev;
 
+  gomp_mutex_lock (&dev->openacc.async.lock);
+
   if (!create
       && (id >= dev->openacc.async.nasyncqueue
 	  || !dev->openacc.async.asyncqueue[id]))
-    return NULL;
+    goto out;
 
-  gomp_mutex_lock (&dev->openacc.async.lock);
   if (id >= dev->openacc.async.nasyncqueue)
     {
       int diff = id + 1 - dev->openacc.async.nasyncqueue;
+      // TODO gomp_realloc might call "gomp_fatal" with "&dev->openacc.async.lock" locked.  Might cause deadlock?
       dev->openacc.async.asyncqueue
 	= gomp_realloc (dev->openacc.async.asyncqueue,
 			sizeof (goacc_aq) * (id + 1));
@@ -105,16 +109,23 @@ lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
 
   if (!dev->openacc.async.asyncqueue[id])
     {
+      //TODO We have "&dev->openacc.async.lock" locked here, and if "openacc.async.construct_func" calls "GOMP_PLUGIN_fatal" (via "CUDA_CALL_ASSERT", for example), that might cause deadlock?
+      //TODO Change the interface to emit an error in the plugin, but then "return NULL", and we catch that here, unlock, and bail out?
       dev->openacc.async.asyncqueue[id] = dev->openacc.async.construct_func ();
 
       /* Link new async queue into active list.  */
+      // TODO gomp_malloc might call "gomp_fatal" with "&dev->openacc.async.lock" locked.  Might cause deadlock?
       goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));
       n->aq = dev->openacc.async.asyncqueue[id];
       n->next = dev->openacc.async.active;
       dev->openacc.async.active = n;
     }
+  ret = dev->openacc.async.asyncqueue[id];
+
+ out:
   gomp_mutex_unlock (&dev->openacc.async.lock);
-  return dev->openacc.async.asyncqueue[id];
+
+  return ret;
 }
 
 /* Return the asyncqueue to be used for OpenACC async-argument ASYNC.  This
@@ -305,6 +316,7 @@ goacc_fini_asyncqueues (struct gomp_device_descr *devicep)
       goacc_aq_list next;
       for (goacc_aq_list l = devicep->openacc.async.active; l; l = next)
 	{
+	  //TODO Can/should/must we "synchronize" here (how?), so as to make sure that no other operation on this asyncqueue is going on while/after we've destructed it here?
 	  ret &= devicep->openacc.async.destruct_func (l->aq);
 	  next = l->next;
 	  free (l);
diff --git libgomp/plugin/plugin-nvptx.c libgomp/plugin/plugin-nvptx.c
index 577ed39ef3f6..872e91f05e78 100644
--- libgomp/plugin/plugin-nvptx.c
+++ libgomp/plugin/plugin-nvptx.c
@@ -1389,6 +1389,7 @@ GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq)
   if (r == CUDA_ERROR_NOT_READY)
     return 0;
 
+  //TODO Is this safe to call, or might this cause deadlock if something's locked?
   GOMP_PLUGIN_error ("cuStreamQuery error: %s", cuda_error (r));
   return -1;
 }
@@ -1396,6 +1397,7 @@ GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq)
 void
 GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *aq)
 {
+  //TODO Is this safe to call, or might this cause deadlock if something's locked?
   CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);
 }
 
@@ -1404,6 +1406,7 @@ GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *aq1,
 				      struct goacc_asyncqueue *aq2)
 {
   CUevent e;
+  //TODO Are these safe to call, or might this cause deadlock if something's locked?
   CUDA_CALL_ASSERT (cuEventCreate, &e, CU_EVENT_DISABLE_TIMING);
   CUDA_CALL_ASSERT (cuEventRecord, e, aq1->cuda_stream);
   CUDA_CALL_ASSERT (cuStreamWaitEvent, aq2->cuda_stream, e, 0);
@@ -1413,6 +1416,7 @@ static void
 cuda_callback_wrapper (CUstream stream, CUresult res, void *ptr)
 {
   if (res != CUDA_SUCCESS)
+    //TODO Is this safe to call, or might this cause deadlock if something's locked?
     GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));
   struct nvptx_callback *cb = (struct nvptx_callback *) ptr;
   cb->fn (cb->ptr);
@@ -1424,10 +1428,12 @@ GOMP_OFFLOAD_openacc_async_queue_callback (struct goacc_asyncqueue *aq,
 					   void (*callback_fn)(void *),
 					   void *userptr)
 {
+  //TODO Is this safe to call, or might this cause deadlock if something's locked?
   struct nvptx_callback *b = GOMP_PLUGIN_malloc (sizeof (*b));
   b->fn = callback_fn;
   b->ptr = userptr;
   b->aq = aq;
+  //TODO Is this safe to call, or might this cause deadlock if something's locked?
   CUDA_CALL_ASSERT (cuStreamAddCallback, aq->cuda_stream,
 		    cuda_callback_wrapper, (void *) b, 0);
 }


But then, I wonder if we couldn't skip all that locking, if we moved the
"asyncqueue"s from "acc_dispatch_t" into "goacc_thread"?

commit c9282e058f67cb8f8ca1720d7f9e3fe0c04b6c89
Author: Thomas Schwinge <thomas@codesourcery.com>
Date:   Thu Dec 13 18:00:16 2018 +0100

    [TODO] into async re-work: move "asyncqueue"s from "acc_dispatch_t" into "goacc_thread"?
---
 libgomp/libgomp.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git libgomp/libgomp.h libgomp/libgomp.h
index 574fcd1ee4ad..09852589d2f1 100644
--- libgomp/libgomp.h
+++ libgomp/libgomp.h
@@ -949,6 +949,11 @@ typedef struct acc_dispatch_t
   __typeof (GOMP_OFFLOAD_openacc_exec) *exec_func;
 
   struct {
+    //TODO Why do these live in the "device" data structure, and not in the "per-thread" data structure?
+    //TODO Aren't they meant to be separate per thread?
+    //TODO That is, as far as I remember right now, OpenACC explicitly states that an asyncqueue doesn't entail any synchronization between different host threads.
+    //TODO Verify OpenACC.
+    //TODO With that moved into "goacc_thread", we could get rid of all the locking needed here?
     /* Once created and put into the "active" list, asyncqueues are then never
        destructed and removed from the "active" list, other than if the TODO
        device is shut down.  */

At this point, I will again (as in that other email) state that my
understanding of OpenACC is that an async queue does not entail any
inter-thread synchronization, so it would seem reasonable that all async
queues are separate per thread.


Grüße
 Thomas
Chung-Lin Tang Dec. 14, 2018, 3 p.m. | #10
On 2018/12/14 10:53 PM, Thomas Schwinge wrote:
> Additionally the following, or why not?  Please comment on the one TODO

> which before your async re-work also was -- incorrectly? -- run

> asynchronously?



> diff --git libgomp/oacc-parallel.c libgomp/oacc-parallel.c

> index 5a441c9efe38..91875c57fc97 100644

> --- libgomp/oacc-parallel.c

> +++ libgomp/oacc-parallel.c

> @@ -413,11 +413,11 @@ GOACC_enter_exit_data (int device, size_t mapnum,

>   		{

>   		case GOMP_MAP_ALLOC:

>   		case GOMP_MAP_FORCE_ALLOC:

> -		  acc_create (hostaddrs[i], sizes[i]);

> +		  acc_create_async (hostaddrs[i], sizes[i], async);

>   		  break;

>   		case GOMP_MAP_TO:

>   		case GOMP_MAP_FORCE_TO:

> -		  acc_copyin (hostaddrs[i], sizes[i]);

> +		  acc_copyin_async (hostaddrs[i], sizes[i], async);

>   		  break;

>   		default:


Yes! I think these were somehow missed by mistake. Thanks for catching!

>   		  gomp_fatal (">>>> GOACC_enter_exit_data UNHANDLED kind 0x%.2x",

> @@ -563,6 +563,8 @@ GOACC_update (int device, size_t mapnum,

>   		 the value of the allocated device memory in the

>   		 previous pointer.  */

>   	      *(uintptr_t *) hostaddrs[i] = (uintptr_t)dptr;

> +	      /* This is intentionally no calling acc_update_device_async,

> +		 because TODO.  */

>   	      acc_update_device (hostaddrs[i], sizeof (uintptr_t));

>   

>   	      /* Restore the host pointer.  */


I don't remember adding this piece of comment, it might have been Cesar I guess?
I'm not sure if there's any real reason not to use acc_update_device_async here...
Change and test to see?

Thanks,
Chung-Lin
Chung-Lin Tang Dec. 17, 2018, 11:03 a.m. | #11
On 2018/12/14 10:56 PM, Thomas Schwinge wrote:
> Hi Chung-Lin!

> 

> On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang<chunglin_tang@mentor.com>  wrote:

>> --- a/libgomp/oacc-async.c

>> +++ b/libgomp/oacc-async.c

>> +attribute_hidden struct goacc_asyncqueue *

>> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)

>> +{

>> +  /* The special value acc_async_noval (-1) maps to the thread-specific

>> +     default async stream.  */

>> +  if (async == acc_async_noval)

>> +    async = thr->default_async;

>> +

>> +  if (async == acc_async_sync)

>> +    return NULL;

>> +

>> +  if (async < 0)

>> +    gomp_fatal ("bad async %d", async);

>> +

>> +  struct gomp_device_descr *dev = thr->dev;

>> +

>> +  if (!create

>> +      && (async >= dev->openacc.async.nasyncqueue

>> +	  || !dev->openacc.async.asyncqueue[async]))

>> +    return NULL;

>> +

> Doesn't this last block also have to be included in the lock you're

> taking below?


Good catch, I'll revise this.

> -  gomp_mutex_lock (&dev->openacc.async.lock);

>     if (id >= dev->openacc.async.nasyncqueue)

>       {

>         int diff = id + 1 - dev->openacc.async.nasyncqueue;

> +      // TODO gomp_realloc might call "gomp_fatal" with "&dev->openacc.async.lock" locked.  Might cause deadlock?

>         dev->openacc.async.asyncqueue

>   	= gomp_realloc (dev->openacc.async.asyncqueue,

>   			sizeof (goacc_aq) * (id + 1));

> @@ -105,16 +109,23 @@ lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)

>   

>     if (!dev->openacc.async.asyncqueue[id])

>       {

> +      //TODO We have "&dev->openacc.async.lock" locked here, and if "openacc.async.construct_func" calls "GOMP_PLUGIN_fatal" (via "CUDA_CALL_ASSERT", for example), that might cause deadlock?

> +      //TODO Change the interface to emit an error in the plugin, but then "return NULL", and we catch that here, unlock, and bail out?

>         dev->openacc.async.asyncqueue[id] = dev->openacc.async.construct_func ();


> +  //TODO Are these safe to call, or might this cause deadlock if something's locked?

>     CUDA_CALL_ASSERT (cuEventCreate, &e, CU_EVENT_DISABLE_TIMING);

>     CUDA_CALL_ASSERT (cuEventRecord, e, aq1->cuda_stream);

>     CUDA_CALL_ASSERT (cuStreamWaitEvent, aq2->cuda_stream, e, 0);

> @@ -1413,6 +1416,7 @@ static void

>   cuda_callback_wrapper (CUstream stream, CUresult res, void *ptr)

>   {

>     if (res != CUDA_SUCCESS)

> +    //TODO Is this safe to call, or might this cause deadlock if something's locked?

>       GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));


The reason there are deadlocks from inside the plugin on GOMP_PLUGIN_fatal() is when we hold the
struct gomp_device_descr's *device* lock, which is also acquired when we execute atexit device shutdown handlers, hence the deadlock.

I don't think this is the case for the OpenACC entry points that grab at the openacc.async.* hooks,
though I can audit them again if deemed so.
> 

> But then, I wonder if we couldn't skip all that locking, if we moved the

> "asyncqueue"s from "acc_dispatch_t" into "goacc_thread"?

> 


>     struct {

> +    //TODO Why do these live in the "device" data structure, and not in the "per-thread" data structure?

> +    //TODO Aren't they meant to be separate per thread?

> +    //TODO That is, as far as I remember right now, OpenACC explicitly states that an asyncqueue doesn't entail any synchronization between different host threads.

> +    //TODO Verify OpenACC.

> +    //TODO With that moved into "goacc_thread", we could get rid of all the locking needed here?

>       /* Once created and put into the "active" list, asyncqueues are then never

>          destructed and removed from the "active" list, other than if the TODO

>          device is shut down.  */

> 

> At this point, I will again (as in that other email) state that my

> understanding of OpenACC is that an async queue does not entail any

> inter-thread synchronization, so it would seem reasonable that all async

> queues are separate per thread.


OpenACC 2.6 specification, section 2.16.1 "async clause (page 68, line 2029):

"If there are two or more host threads executing and sharing the same accelerator device,
two asynchronous operations with the same async-value will be enqueued on the same activity queue"

That said, I recall most (if not all) of the synchronization operations and behavior are all
defined to be with respect to operations of the local host thread only, so the spec mentioning interaction with
other host threads here may be moot, as there's no way meaningful way to synchronize with
the OpenACC activity of other host threads (though correct me if I forgot some API)

Also, CUDA streams do not seem to support local-thread-operation-only synchronization.
I remember this was an issue in the old implementation inside the nvptx plugin as well, and we
had hacks to somewhat alleviate it (i.e. calling streams "single" or "multi" threaded)

Well, another issue that we might want to bring up to the OpenACC committee :)
I agree that if async queues spaces were simply thread-local then things would be much simpler.

Thanks,
Chung-Lin
Thomas Schwinge Dec. 17, 2018, 1:52 p.m. | #12
Hi Chung-Lin!

On Fri, 14 Dec 2018 22:52:44 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> On 2018/12/14 10:17 PM, Thomas Schwinge wrote:

> > On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:

> >> --- a/libgomp/oacc-async.c

> >> +++ b/libgomp/oacc-async.c

> > 

> >> +attribute_hidden struct goacc_asyncqueue *

> >> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)

> >> +{

> >> +  /* The special value acc_async_noval (-1) maps to the thread-specific

> >> +     default async stream.  */

> >> +  if (async == acc_async_noval)

> >> +    async = thr->default_async;

> >> +

> >> +  if (async == acc_async_sync)

> >> +    return NULL;

> >> +

> >> +  if (async < 0)

> >> +    gomp_fatal ("bad async %d", async);

> > 

> > To make this "resolve" part more obvious, that is, the translation from

> > the "async" argument to an "asyncqueue" array index:

> > 

> >> +  if (!create

> >> +      && (async >= dev->openacc.async.nasyncqueue

> >> +	  || !dev->openacc.async.asyncqueue[async]))

> >> +    return NULL;

> >> +[...]

> > 

> > ..., I propose adding a "async2id" function for that, and then rename all

> > "asyncqueue[async]" to "asyncqueue[id]".

> 

> I don't think this is needed. This is the only place in the entire runtime that

> does asyncqueue indexing, adding more conceptual layers of re-directed indexing

> seems unneeded.


It makes the code better understandable?  Or, curious, why do you think
that the translation from an OpenACC async-argument to an internal
asyncqueue ID should not be a separate function?


> I do think the more descriptive comments are nice though.



> > And, this also restores the current trunk behavior, so that

> > "acc_async_noval" gets its own, separate "asyncqueue".

> 

> Is there a reason we need to restore that behavior right now?


Because otherwise that's a functional change ("regression") from the
current GCC trunk behavior, which I wouldn't expect in a re-work.


Grüße
 Thomas
Thomas Schwinge Dec. 17, 2018, 1:56 p.m. | #13
Hi Chung-Lin!

On Fri, 14 Dec 2018 22:42:28 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> On 2018/12/14 10:32 PM, Thomas Schwinge wrote:

> > Invoked as "acc_wait_async ([...], acc_async_sync)" (as used in a test

> > case that I'll soon submit/commit), we'll end up with "aq2 == NULL", and

> > will segfault in the nvptx "openacc.async.serialize_func".

> 

> What does "wait async(acc_async_sync)" supposed to mean?


In my understanding, that'll translate to just "wait" without an "async"
clause, thus synchronous with the local (host) thread.

> Instead of fixing

> it here, will it make more sense to have the serialize_func hook to accommodate

> the NULL asyncqueue?


Sure, that may make sense, yes.  Right: if there's no asyncqueue to
serialize with, then serialize/synchronize with the local (host) thread.


Grüße
 Thomas
Thomas Schwinge Dec. 17, 2018, 2:10 p.m. | #14
Hi Chung-Lin!

On Fri, 14 Dec 2018 23:00:57 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> On 2018/12/14 10:53 PM, Thomas Schwinge wrote:

> > Please comment on the one TODO

> > which before your async re-work also was -- incorrectly? -- run

> > asynchronously?


> > @@ -563,6 +563,8 @@ GOACC_update (int device, size_t mapnum,

> >   		 the value of the allocated device memory in the

> >   		 previous pointer.  */

> >   	      *(uintptr_t *) hostaddrs[i] = (uintptr_t)dptr;

> > +	      /* This is intentionally no calling acc_update_device_async,

> > +		 because TODO.  */

> >   	      acc_update_device (hostaddrs[i], sizeof (uintptr_t));

> >   

> >   	      /* Restore the host pointer.  */

| 	      *(uintptr_t *) hostaddrs[i] = t;
> 

> I don't remember adding this piece of comment, it might have been Cesar I guess?


That "TODO" comment, you mean?  It was me who just added that one, to
highlight this, to ask you to comment, whether it's feasible to turn that
"acc_update_device" into "acc_update_device_async", too, or if that has
intentionally not been done, given that before your async re-work that
also was -- incorrectly? -- run asynchronously.

> I'm not sure if there's any real reason not to use acc_update_device_async here...


My worry was that the data object being copied here ("*hostaddrs[i]") is
changed immediatelly after the "acc_update_device" call (now inlined
above), so if that update get enqueued for asynchronous execution, it
might then actually copy the value after "Restore the host pointer".

Given the code before and after it, maybe "acc_update_device" is
generally the wrong function to call here?  (That's, again, a separate
change, please.)

I have not yet researched where that code is coming from, and what it's
supposed to be used for.  But as part of the async re-work we should at
least understand that one, too.

> Change and test to see?


Generally, when testing asynchronous behavior, I'd be wary of reading too
much into any such test results ("still PASSes" -- by chance?).  Unless,
of course, there's then a clear regression somewhere ("now FAILs").

..., which there isn't, because adding a "gomp_fatal" in that code path,
that doesn't trigger a single time when running the libgomp testsuite...


Grüße
 Thomas
Thomas Schwinge Dec. 17, 2018, 2:32 p.m. | #15
Hi Chung-Lin!

On Mon, 17 Dec 2018 19:03:12 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> On 2018/12/14 10:56 PM, Thomas Schwinge wrote:

> > +  //TODO Are these safe to call, or might this cause deadlock if something's locked?

> >     CUDA_CALL_ASSERT (cuEventCreate, &e, CU_EVENT_DISABLE_TIMING);

> >     CUDA_CALL_ASSERT (cuEventRecord, e, aq1->cuda_stream);

> >     CUDA_CALL_ASSERT (cuStreamWaitEvent, aq2->cuda_stream, e, 0);

> > @@ -1413,6 +1416,7 @@ static void

> >   cuda_callback_wrapper (CUstream stream, CUresult res, void *ptr)

> >   {

> >     if (res != CUDA_SUCCESS)

> > +    //TODO Is this safe to call, or might this cause deadlock if something's locked?

> >       GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));

> 

> The reason there are deadlocks from inside the plugin on GOMP_PLUGIN_fatal() is when we hold the

> struct gomp_device_descr's *device* lock, which is also acquired when we execute atexit device shutdown handlers, hence the deadlock.

> 

> I don't think this is the case for the OpenACC entry points that grab at the openacc.async.* hooks,


Ah, OK, I see.  (But I thought that method of deadlock had been fixed by
some structural changes, to have plugin functions call the
non-terminating "GOMP_PLUGIN_error" and return some error, instead of
calling "GOMP_PLUGIN_fatal"?  I may be misremembering.  Or, that's
another TODO item for later, separately...  Or, if that's actually the
case, that this has been fixed in the way I described, then should these
functions also be changed accordingly: instead of "GOMP_PLUGIN_fatal"
call "GOMP_PLUGIN_error", and then return an error code?)

> though I can audit them again if deemed so.


My understanding had been that deadlock may happen if we're inside some
of these async/wait/serialize/synchronize functions, with "async" locked,
then run into an error, then libgomp prepares to abort, and at that time
shuts down the device, which will shut down the asyncqueues
("goacc_fini_asyncqueues"), which will again try to lock "async" -- which
it actually doesn't.  My misunderstanding, I guess?


> > But then, I wonder if we couldn't skip all that locking, if we moved the

> > "asyncqueue"s from "acc_dispatch_t" into "goacc_thread"?

> > 

> 

> >     struct {

> > +    //TODO Why do these live in the "device" data structure, and not in the "per-thread" data structure?

> > +    //TODO Aren't they meant to be separate per thread?

> > +    //TODO That is, as far as I remember right now, OpenACC explicitly states that an asyncqueue doesn't entail any synchronization between different host threads.

> > +    //TODO Verify OpenACC.

> > +    //TODO With that moved into "goacc_thread", we could get rid of all the locking needed here?

> >       /* Once created and put into the "active" list, asyncqueues are then never

> >          destructed and removed from the "active" list, other than if the TODO

> >          device is shut down.  */

> > 

> > At this point, I will again (as in that other email) state that my

> > understanding of OpenACC is that an async queue does not entail any

> > inter-thread synchronization, so it would seem reasonable that all async

> > queues are separate per thread.

> 

> OpenACC 2.6 specification, section 2.16.1 "async clause (page 68, line 2029):

> 

> "If there are two or more host threads executing and sharing the same accelerator device,

> two asynchronous operations with the same async-value will be enqueued on the same activity queue"


Right, but then, in the very next sentence, it goes on to state: "If the
threads are not synchronized with respect to each other, the operations
may be enqueued in either order and therefore may execute on the device
in either order".  So this, and given that:

> That said, I recall most (if not all) of the synchronization operations and behavior are all

> defined to be with respect to operations of the local host thread only, so the spec mentioning interaction with

> other host threads here may be moot, as there's no way meaningful way to synchronize with

> the OpenACC activity of other host threads (though correct me if I forgot some API)


..., I concluded something must be wrong in the OpenACC 2.6,
2.16.1. "async clause" text, and no such (host-side) inter-thread
synchronization can be expected from OpenACC "async"/"wait".  I've also
already got that on my list of things to clarify with the OpenACC
technical committee, later on.

> Also, CUDA streams do not seem to support local-thread-operation-only synchronization.

> I remember this was an issue in the old implementation inside the nvptx plugin as well, and we

> had hacks to somewhat alleviate it (i.e. calling streams "single" or "multi" threaded)


Right.

> Well, another issue that we might want to bring up to the OpenACC committee :)

> I agree that if async queues spaces were simply thread-local then things would be much simpler.


OK, so you agree with that, good.

And, no problem foreseeable about simply moving the asyncqueues into
"goacc_thread" -- and removing the "async" lock?


Grüße
 Thomas
Chung-Lin Tang Dec. 18, 2018, 9:35 a.m. | #16
On 2018/12/17 9:52 PM, Thomas Schwinge wrote:
> Hi Chung-Lin!

> 

> On Fri, 14 Dec 2018 22:52:44 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:

>> On 2018/12/14 10:17 PM, Thomas Schwinge wrote:

>>> On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:

>>>> --- a/libgomp/oacc-async.c

>>>> +++ b/libgomp/oacc-async.c

>>>

>>>> +attribute_hidden struct goacc_asyncqueue *

>>>> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)

>>>> +{

>>>> +  /* The special value acc_async_noval (-1) maps to the thread-specific

>>>> +     default async stream.  */

>>>> +  if (async == acc_async_noval)

>>>> +    async = thr->default_async;

>>>> +

>>>> +  if (async == acc_async_sync)

>>>> +    return NULL;

>>>> +

>>>> +  if (async < 0)

>>>> +    gomp_fatal ("bad async %d", async);

>>>

>>> To make this "resolve" part more obvious, that is, the translation from

>>> the "async" argument to an "asyncqueue" array index:

>>>

>>>> +  if (!create

>>>> +      && (async >= dev->openacc.async.nasyncqueue

>>>> +	  || !dev->openacc.async.asyncqueue[async]))

>>>> +    return NULL;

>>>> +[...]

>>>

>>> ..., I propose adding a "async2id" function for that, and then rename all

>>> "asyncqueue[async]" to "asyncqueue[id]".

>>

>> I don't think this is needed. This is the only place in the entire runtime that

>> does asyncqueue indexing, adding more conceptual layers of re-directed indexing

>> seems unneeded.

> 

> It makes the code better understandable?  Or, curious, why do you think

> that the translation from an OpenACC async-argument to an internal

> asyncqueue ID should not be a separate function?


Because the index is (1) not used elsewhere; nor supposed to really, lookup_goacc_asyncqueue()
is intended to be the centralized place for looking up async queues.
and (2) the special async number case handling here is really short, creating another
conceptual index-redirecting in the code feels like over-engineering.

>> I do think the more descriptive comments are nice though.

> 

> 

>>> And, this also restores the current trunk behavior, so that

>>> "acc_async_noval" gets its own, separate "asyncqueue".

>>

>> Is there a reason we need to restore that behavior right now?

> 

> Because otherwise that's a functional change ("regression") from the

> current GCC trunk behavior, which I wouldn't expect in a re-work.


Okay, but do take note that the acc_get/set_default_async is part of the upstreaming too.
The behavior change is due to that new 2.5 functionality, not really because I arbitrarily changed things.

Thanks,
Chung-Lin
Chung-Lin Tang Dec. 18, 2018, 10:02 a.m. | #17
On 2018/12/17 10:32 PM, Thomas Schwinge wrote:
>> The reason there are deadlocks from inside the plugin on GOMP_PLUGIN_fatal() is when we hold the

>> struct gomp_device_descr's*device*  lock, which is also acquired when we execute atexit device shutdown handlers, hence the deadlock.

>>

>> I don't think this is the case for the OpenACC entry points that grab at the openacc.async.* hooks,

> Ah, OK, I see.  (But I thought that method of deadlock had been fixed by

> some structural changes, to have plugin functions call the

> non-terminating "GOMP_PLUGIN_error" and return some error, instead of

> calling "GOMP_PLUGIN_fatal"?  I may be misremembering.  Or, that's

> another TODO item for later, separately...  Or, if that's actually the

> case, that this has been fixed in the way I described, then should these

> functions also be changed accordingly: instead of "GOMP_PLUGIN_fatal"

> call "GOMP_PLUGIN_error", and then return an error code?)


You remembered correctly, although...

>> though I can audit them again if deemed so.

> My understanding had been that deadlock may happen if we're inside some

> of these async/wait/serialize/synchronize functions, with "async" locked,

> then run into an error, then libgomp prepares to abort, and at that time

> shuts down the device, which will shut down the asyncqueues

> ("goacc_fini_asyncqueues"), which will again try to lock "async" -- which

> it actually doesn't.  My misunderstanding, I guess?


...at least now, you can see that goacc_fini_asyncqueues() does not attempt to
acquire devicep->openacc.async.lock when doing cleanup.

Come to think of it, that might be a bug there. :P

>> "If there are two or more host threads executing and sharing the same accelerator device,

>> two asynchronous operations with the same async-value will be enqueued on the same activity queue"

> Right, but then, in the very next sentence, it goes on to state: "If the

> threads are not synchronized with respect to each other, the operations

> may be enqueued in either order and therefore may execute on the device

> in either order".  So this, and given that:


I actually didn't care much about that next sentence, since it's just stating the obvious :)

It also seem to imply that the multiple host threads are enqueuing operations to the same async queue, hence further
corroborating that queues are device-wide, not thread.

>> That said, I recall most (if not all) of the synchronization operations and behavior are all

>> defined to be with respect to operations of the local host thread only, so the spec mentioning interaction with

>> other host threads here may be moot, as there's no way meaningful way to synchronize with

>> the OpenACC activity of other host threads (though correct me if I forgot some API)

> ..., I concluded something must be wrong in the OpenACC 2.6,

> 2.16.1. "async clause" text, and no such (host-side) inter-thread

> synchronization can be expected from OpenACC "async"/"wait".  I've also

> already got that on my list of things to clarify with the OpenACC

> technical committee, later on.


I just remembered, there does seem to be one area where device vs. thread-wide interpretation will be visible:
when using acc_get/set_cuda_stream(). Supposedly, given the specification's device-wide queue/stream model,
different host-threads should access the same CUDA stream when using acc_get/set_cuda_stream().
This will break if we made async queues to be thread-local.

>> Also, CUDA streams do not seem to support local-thread-operation-only synchronization.

>> I remember this was an issue in the old implementation inside the nvptx plugin as well, and we

>> had hacks to somewhat alleviate it (i.e. calling streams "single" or "multi" threaded)

> Right.

> 

>> Well, another issue that we might want to bring up to the OpenACC committee:)

>> I agree that if async queues spaces were simply thread-local then things would be much simpler.

> OK, so you agree with that, good.

> 

> And, no problem foreseeable about simply moving the asyncqueues into

> "goacc_thread" -- and removing the "async" lock?


I think we should still try to solve the potential deadlock problems, and stay close to the current
implementation just for now. We can ask the committee for further guidance later.

Chung-Lin
Thomas Schwinge Dec. 18, 2018, 11:44 a.m. | #18
Hi Chung-Lin!

On Tue, 18 Dec 2018 18:02:54 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> On 2018/12/17 10:32 PM, Thomas Schwinge wrote:

> >> The reason there are deadlocks from inside the plugin on GOMP_PLUGIN_fatal() is when we hold the

> >> struct gomp_device_descr's*device*  lock, which is also acquired when we execute atexit device shutdown handlers, hence the deadlock.

> >>

> >> I don't think this is the case for the OpenACC entry points that grab at the openacc.async.* hooks,

> > Ah, OK, I see.  (But I thought that method of deadlock had been fixed by

> > some structural changes, to have plugin functions call the

> > non-terminating "GOMP_PLUGIN_error" and return some error, instead of

> > calling "GOMP_PLUGIN_fatal"?  I may be misremembering.  Or, that's

> > another TODO item for later, separately...  Or, if that's actually the

> > case, that this has been fixed in the way I described, then should these

> > functions also be changed accordingly: instead of "GOMP_PLUGIN_fatal"

> > call "GOMP_PLUGIN_error", and then return an error code?)

> 

> You remembered correctly, although...

> 

> >> though I can audit them again if deemed so.

> > My understanding had been that deadlock may happen if we're inside some

> > of these async/wait/serialize/synchronize functions, with "async" locked,

> > then run into an error, then libgomp prepares to abort, and at that time

> > shuts down the device, which will shut down the asyncqueues

> > ("goacc_fini_asyncqueues"), which will again try to lock "async" -- which

> > it actually doesn't.  My misunderstanding, I guess?

> 

> ...at least now, you can see that goacc_fini_asyncqueues() does not attempt to

> acquire devicep->openacc.async.lock when doing cleanup.

> 

> Come to think of it, that might be a bug there. :P


Heh, I wondered about that, too.  ;-)

An asyncqueue as returned by "lookup_goacc_asyncqueue" itself is not
locked (and I suppose it shouldn't be, because that would be "too
much"?), so it may -- at least (only?) in a specially constructed test
case -- happen that an asyncqueue gets destructed
("goacc_fini_asyncqueues") while it's still in use?  (Don't know how the
CUDA Driver library thinks of that, for example.  Though, probably, that
scenario can only happen if the device used by a certain host thread is
shut down while an "async" operation is still running.

But, can we easily avoid that issue by calling
"openacc.async.synchronize_func" before "openacc.async.destruct_func"
(or, have the latter do that internally)?  Just have to make sure that
any such synchonization then doesn't raise (potentially) nested
"GOMP_PLUGIN_fatal" calls.  Hence the TODO comment I added in my "into
async re-work: locking concerns" commit, before the
"openacc.async.destruct_func" call: "Can/should/must we "synchronize"
here (how?), so as to make sure that no other operation on this
asyncqueue is going on while/after we've destructed it here?"

Probably an error-ignoring "cuStreamSynchronize" call before
"cuStreamDestroy" would be reasonable?


Oh, and don't we have another problem...  Consider an "acc_shutdown" run
by host thread 1, while another host thread 2 continues to use the
device-wide queues.  That "acc_shutdown" will call "gomp_fini_device",
which will call "goacc_fini_asyncqueues", which will happily destruct the
whole device-wide "async" data.  Just taking the "async" lock before
doing that won't solve the problem, as host thread 2 is supposed to
continue using the existing queues.  Reference counting required?

Anyway: I'm not asking you to fix that now.  "Fortunately", we're not at
all properly implementing OpenACC usage in context of multiple host
threads (such as created by OpenMP or the pthreads interface), so I'm
just noting that issue now, to be resolved later (as part of our internal
tracker issues CSTS-110 or CSTS-115).


Anyway:

> >> "If there are two or more host threads executing and sharing the same accelerator device,

> >> two asynchronous operations with the same async-value will be enqueued on the same activity queue"

> > Right, but then, in the very next sentence, it goes on to state: "If the

> > threads are not synchronized with respect to each other, the operations

> > may be enqueued in either order and therefore may execute on the device

> > in either order".  So this, and given that:

> 

> I actually didn't care much about that next sentence, since it's just stating the obvious :)


;-)

> It also seem to imply that the multiple host threads are enqueuing operations to the same async queue, hence further

> corroborating that queues are device-wide, not thread.


OK, that's your (certainly valid) interpretation; mine was to make our
life simpler:

> >> That said, I recall most (if not all) of the synchronization operations and behavior are all

> >> defined to be with respect to operations of the local host thread only, so the spec mentioning interaction with

> >> other host threads here may be moot, as there's no way meaningful way to synchronize with

> >> the OpenACC activity of other host threads (though correct me if I forgot some API)

> > ..., I concluded something must be wrong in the OpenACC 2.6,

> > 2.16.1. "async clause" text, and no such (host-side) inter-thread

> > synchronization can be expected from OpenACC "async"/"wait".  I've also

> > already got that on my list of things to clarify with the OpenACC

> > technical committee, later on.


... so that we could simply avoid all that locking, by moving all "async"
stuff into "goacc_thread" instead of device-wide.

But you're raising a valid point here:

> I just remembered, there does seem to be one area where device vs. thread-wide interpretation will be visible:

> when using acc_get/set_cuda_stream(). Supposedly, given the specification's device-wide queue/stream model,

> different host-threads should access the same CUDA stream when using acc_get/set_cuda_stream().

> This will break if we made async queues to be thread-local.


That's either (a) correct (in your interpretation of the device-wide
queue model), or (b) another OpenACC specification documentation issue
(in the host thread-local queue model).  ;-|

> >> Also, CUDA streams do not seem to support local-thread-operation-only synchronization.

> >> I remember this was an issue in the old implementation inside the nvptx plugin as well, and we

> >> had hacks to somewhat alleviate it (i.e. calling streams "single" or "multi" threaded)

> > Right.

> > 

> >> Well, another issue that we might want to bring up to the OpenACC committee:)

> >> I agree that if async queues spaces were simply thread-local then things would be much simpler.

> > OK, so you agree with that, good.

> > 

> > And, no problem foreseeable about simply moving the asyncqueues into

> > "goacc_thread" -- and removing the "async" lock?

> 

> I think we should still try to solve the potential deadlock problems, and stay close to the current

> implementation just for now.


OK.  Again, I was just trying to avoid that work.

> We can ask the committee for further guidance later.


ACK.


Oh, and I have another (possible?) interpretation of the OpenACC 2.6
specification: if the queues themselves indeed are device-wide, then
maybe the "active" queues list is private per host thread?  Because,
don't just all "test/wait all" operations (which are the ones to work on
the "active" queues list) have this "local (host) thread only" comment?
Is that the intention of the specification after all?

I'll leave it to you whether you want to look into that at this point --
if that might simplify some of the locking required maybe?


Grüße
 Thomas
Chung-Lin Tang Dec. 18, 2018, 3:06 p.m. | #19
On 2018/9/25 9:10 PM, Chung-Lin Tang wrote:
> Hi Thomas,

> These are the OpenACC specific changes, mostly the re-implementation of async-related acc_* runtime

> library API functions to use the new backend plugin interfaces, in a non-target specific way.

> 


Hi Thomas,
this part includes some of the lookup_goacc_asyncqueue fixes we talked about.
I am still thinking about how the queue lock problem should really be solved, so regard
this patch as just fixing some of the problems.
diff -ru trunk-orig/libgomp/oacc-async.c trunk-work/libgomp/oacc-async.c
--- trunk-orig/libgomp/oacc-async.c	2018-12-14 22:11:29.252251925 +0800
+++ trunk-work/libgomp/oacc-async.c	2018-12-18 22:19:51.923102938 +0800
@@ -70,12 +70,16 @@
 
   struct gomp_device_descr *dev = thr->dev;
 
+  gomp_mutex_lock (&dev->openacc.async.lock);
+
   if (!create
       && (async >= dev->openacc.async.nasyncqueue
 	  || !dev->openacc.async.asyncqueue[async]))
-    return NULL;
+    {
+      gomp_mutex_unlock (&dev->openacc.async.lock);
+      return NULL;
+    }
 
-  gomp_mutex_lock (&dev->openacc.async.lock);
   if (async >= dev->openacc.async.nasyncqueue)
     {
       int diff = async + 1 - dev->openacc.async.nasyncqueue;
@@ -91,6 +95,12 @@
     {
       dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func ();
 
+      if (!dev->openacc.async.asyncqueue[async])
+	{
+	  gomp_mutex_unlock (&dev->openacc.async.lock);
+	  gomp_fatal ("async %d creation failed", async);
+	}
+      
       /* Link new async queue into active list.  */
       goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));
       n->aq = dev->openacc.async.asyncqueue[async];
diff -ru trunk-orig/libgomp/oacc-host.c trunk-work/libgomp/oacc-host.c
--- trunk-orig/libgomp/oacc-host.c	2018-12-14 18:31:07.487203770 +0800
+++ trunk-work/libgomp/oacc-host.c	2018-12-18 22:23:26.771807667 +0800
@@ -266,6 +266,9 @@
 
       .exec_func = host_openacc_exec,
 
+      .create_thread_data_func = host_openacc_create_thread_data,
+      .destroy_thread_data_func = host_openacc_destroy_thread_data,
+
       .async = {
 	.construct_func = host_openacc_async_construct,
 	.destruct_func = host_openacc_async_destruct,
@@ -278,9 +281,6 @@
 	.host2dev_func = host_openacc_async_host2dev,
       },
 
-      .create_thread_data_func = host_openacc_create_thread_data,
-      .destroy_thread_data_func = host_openacc_destroy_thread_data,
-
       .cuda = {
 	.get_current_device_func = NULL,
 	.get_current_context_func = NULL,
diff -ru trunk-orig/libgomp/oacc-plugin.c trunk-work/libgomp/oacc-plugin.c
--- trunk-orig/libgomp/oacc-plugin.c	2018-12-14 18:31:07.491203745 +0800
+++ trunk-work/libgomp/oacc-plugin.c	2018-12-18 22:27:46.047722004 +0800
@@ -30,6 +30,13 @@
 #include "oacc-plugin.h"
 #include "oacc-int.h"
 
+void
+GOMP_PLUGIN_async_unmap_vars (void *ptr __attribute__((unused)),
+			      int async __attribute__((unused)))
+{
+  gomp_fatal ("invalid plugin function");
+}
+
 /* Return the target-specific part of the TLS data for the current thread.  */
 
 void *
diff -ru trunk-orig/libgomp/plugin/plugin-nvptx.c trunk-work/libgomp/plugin/plugin-nvptx.c
Index: libgomp/oacc-async.c
===================================================================
--- libgomp/oacc-async.c	(revision 267226)
+++ libgomp/oacc-async.c	(working copy)
@@ -27,10 +27,97 @@
    <http://www.gnu.org/licenses/>.  */
 
 #include <assert.h>
+#include <string.h>
 #include "openacc.h"
 #include "libgomp.h"
 #include "oacc-int.h"
 
+static struct goacc_thread *
+get_goacc_thread (void)
+{
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  return thr;
+}
+
+static struct gomp_device_descr *
+get_goacc_thread_device (void)
+{
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  return thr->dev;
+}
+
+attribute_hidden struct goacc_asyncqueue *
+lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
+{
+  /* The special value acc_async_noval (-1) maps to the thread-specific
+     default async stream.  */
+  if (async == acc_async_noval)
+    async = thr->default_async;
+
+  if (async == acc_async_sync)
+    return NULL;
+
+  if (async < 0)
+    gomp_fatal ("bad async %d", async);
+
+  struct gomp_device_descr *dev = thr->dev;
+
+  gomp_mutex_lock (&dev->openacc.async.lock);
+
+  if (!create
+      && (async >= dev->openacc.async.nasyncqueue
+	  || !dev->openacc.async.asyncqueue[async]))
+    {
+      gomp_mutex_unlock (&dev->openacc.async.lock);
+      return NULL;
+    }
+
+  if (async >= dev->openacc.async.nasyncqueue)
+    {
+      int diff = async + 1 - dev->openacc.async.nasyncqueue;
+      dev->openacc.async.asyncqueue
+	= gomp_realloc (dev->openacc.async.asyncqueue,
+			sizeof (goacc_aq) * (async + 1));
+      memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue,
+	      0, sizeof (goacc_aq) * diff);
+      dev->openacc.async.nasyncqueue = async + 1;
+    }
+
+  if (!dev->openacc.async.asyncqueue[async])
+    {
+      dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func ();
+
+      if (!dev->openacc.async.asyncqueue[async])
+	{
+	  gomp_mutex_unlock (&dev->openacc.async.lock);
+	  gomp_fatal ("async %d creation failed", async);
+	}
+      
+      /* Link new async queue into active list.  */
+      goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));
+      n->aq = dev->openacc.async.asyncqueue[async];
+      n->next = dev->openacc.async.active;
+      dev->openacc.async.active = n;
+    }
+  gomp_mutex_unlock (&dev->openacc.async.lock);
+  return dev->openacc.async.asyncqueue[async];
+}
+
+attribute_hidden struct goacc_asyncqueue *
+get_goacc_asyncqueue (int async)
+{
+  struct goacc_thread *thr = get_goacc_thread ();
+  return lookup_goacc_asyncqueue (thr, true, async);
+}
+
 int
 acc_async_test (int async)
 {
@@ -42,18 +129,25 @@ acc_async_test (int async)
   if (!thr || !thr->dev)
     gomp_fatal ("no device active");
 
-  return thr->dev->openacc.async_test_func (async);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
+  return thr->dev->openacc.async.test_func (aq);
 }
 
 int
 acc_async_test_all (void)
 {
-  struct goacc_thread *thr = goacc_thread ();
+  struct goacc_thread *thr = get_goacc_thread ();
 
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
-
-  return thr->dev->openacc.async_test_all_func ();
+  int ret = 1;
+  gomp_mutex_lock (&thr->dev->openacc.async.lock);
+  for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
+    if (!thr->dev->openacc.async.test_func (l->aq))
+      {
+	ret = 0;
+	break;
+      }
+  gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+  return ret;
 }
 
 void
@@ -62,12 +156,10 @@ acc_wait (int async)
   if (!async_valid_p (async))
     gomp_fatal ("invalid async argument: %d", async);
 
-  struct goacc_thread *thr = goacc_thread ();
+  struct goacc_thread *thr = get_goacc_thread ();
 
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
-
-  thr->dev->openacc.async_wait_func (async);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
+  thr->dev->openacc.async.synchronize_func (aq);
 }
 
 /* acc_async_wait is an OpenACC 1.0 compatibility name for acc_wait.  */
@@ -84,23 +176,28 @@ acc_async_wait (int async)
 void
 acc_wait_async (int async1, int async2)
 {
-  struct goacc_thread *thr = goacc_thread ();
+  struct goacc_thread *thr = get_goacc_thread ();
 
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
+  goacc_aq aq2 = lookup_goacc_asyncqueue (thr, true, async2);
+  goacc_aq aq1 = lookup_goacc_asyncqueue (thr, false, async1);
+  if (!aq1)
+    gomp_fatal ("invalid async 1");
+  if (aq1 == aq2)
+    gomp_fatal ("identical parameters");
 
-  thr->dev->openacc.async_wait_async_func (async1, async2);
+  thr->dev->openacc.async.synchronize_func (aq1);
+  thr->dev->openacc.async.serialize_func (aq1, aq2);
 }
 
 void
 acc_wait_all (void)
 {
-  struct goacc_thread *thr = goacc_thread ();
+  struct gomp_device_descr *dev = get_goacc_thread_device ();
 
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
-
-  thr->dev->openacc.async_wait_all_func ();
+  gomp_mutex_lock (&dev->openacc.async.lock);
+  for (goacc_aq_list l = dev->openacc.async.active; l; l = l->next)
+    dev->openacc.async.synchronize_func (l->aq);
+  gomp_mutex_unlock (&dev->openacc.async.lock);
 }
 
 /* acc_async_wait_all is an OpenACC 1.0 compatibility name for acc_wait_all.  */
@@ -120,10 +217,74 @@ acc_wait_all_async (int async)
   if (!async_valid_p (async))
     gomp_fatal ("invalid async argument: %d", async);
 
-  struct goacc_thread *thr = goacc_thread ();
+  struct goacc_thread *thr = get_goacc_thread ();
 
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
+  goacc_aq waiting_queue = lookup_goacc_asyncqueue (thr, true, async);
 
-  thr->dev->openacc.async_wait_all_async_func (async);
+  gomp_mutex_lock (&thr->dev->openacc.async.lock);
+  for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
+    {
+      thr->dev->openacc.async.synchronize_func (l->aq);
+      if (waiting_queue)
+	thr->dev->openacc.async.serialize_func (l->aq, waiting_queue);
+    }
+  gomp_mutex_unlock (&thr->dev->openacc.async.lock);
 }
+
+int
+acc_get_default_async (void)
+{
+  struct goacc_thread *thr = get_goacc_thread ();
+  return thr->default_async;
+}
+
+void
+acc_set_default_async (int async)
+{
+  if (async < acc_async_sync)
+    gomp_fatal ("invalid async argument: %d", async);
+
+  struct goacc_thread *thr = get_goacc_thread ();
+  thr->default_async = async;
+}
+
+attribute_hidden void
+goacc_async_free (struct gomp_device_descr *devicep,
+		  struct goacc_asyncqueue *aq, void *ptr)
+{
+  if (!aq)
+    free (ptr);
+  else
+    devicep->openacc.async.queue_callback_func (aq, free, ptr);
+}
+
+attribute_hidden void
+goacc_init_asyncqueues (struct gomp_device_descr *devicep)
+{
+  gomp_mutex_init (&devicep->openacc.async.lock);
+  devicep->openacc.async.nasyncqueue = 0;
+  devicep->openacc.async.asyncqueue = NULL;
+  devicep->openacc.async.active = NULL;
+}
+
+attribute_hidden bool
+goacc_fini_asyncqueues (struct gomp_device_descr *devicep)
+{
+  bool ret = true;
+  if (devicep->openacc.async.nasyncqueue > 0)
+    {
+      goacc_aq_list next;
+      for (goacc_aq_list l = devicep->openacc.async.active; l; l = next)
+	{
+	  ret &= devicep->openacc.async.destruct_func (l->aq);
+	  next = l->next;
+	  free (l);
+	}
+      free (devicep->openacc.async.asyncqueue);
+      devicep->openacc.async.nasyncqueue = 0;
+      devicep->openacc.async.asyncqueue = NULL;
+      devicep->openacc.async.active = NULL;
+    }
+  gomp_mutex_destroy (&devicep->openacc.async.lock);
+  return ret;
+}
Index: libgomp/oacc-cuda.c
===================================================================
--- libgomp/oacc-cuda.c	(revision 267226)
+++ libgomp/oacc-cuda.c	(working copy)
@@ -62,7 +62,11 @@ acc_get_cuda_stream (int async)
     return NULL;
 
   if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
-    return thr->dev->openacc.cuda.get_stream_func (async);
+    {
+      goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
+      if (aq)
+	return thr->dev->openacc.cuda.get_stream_func (aq);
+    }
  
   return NULL;
 }
@@ -79,8 +83,14 @@ acc_set_cuda_stream (int async, void *stream)
 
   thr = goacc_thread ();
 
+  int ret = -1;
   if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
-    return thr->dev->openacc.cuda.set_stream_func (async, stream);
+    {
+      goacc_aq aq = get_goacc_asyncqueue (async);
+      gomp_mutex_lock (&thr->dev->openacc.async.lock);
+      ret = thr->dev->openacc.cuda.set_stream_func (aq, stream);
+      gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+    }
 
-  return -1;
+  return ret;
 }
Index: libgomp/oacc-host.c
===================================================================
--- libgomp/oacc-host.c	(revision 267226)
+++ libgomp/oacc-host.c	(working copy)
@@ -140,8 +140,7 @@ host_openacc_exec (void (*fn) (void *),
 		   size_t mapnum __attribute__ ((unused)),
 		   void **hostaddrs,
 		   void **devaddrs __attribute__ ((unused)),
-		   int async __attribute__ ((unused)),
-		   unsigned *dims __attribute ((unused)),
+		   unsigned *dims __attribute__ ((unused)),
 		   void *targ_mem_desc __attribute__ ((unused)))
 {
   fn (hostaddrs);
@@ -148,49 +147,81 @@ host_openacc_exec (void (*fn) (void *),
 }
 
 static void
-host_openacc_register_async_cleanup (void *targ_mem_desc __attribute__ ((unused)),
-				     int async __attribute__ ((unused)))
+host_openacc_async_exec (void (*fn) (void *),
+			 size_t mapnum __attribute__ ((unused)),
+			 void **hostaddrs,
+			 void **devaddrs __attribute__ ((unused)),
+			 unsigned *dims __attribute__ ((unused)),
+			 void *targ_mem_desc __attribute__ ((unused)),
+			 struct goacc_asyncqueue *aq __attribute__ ((unused)))
 {
+  fn (hostaddrs);
 }
 
 static int
-host_openacc_async_test (int async __attribute__ ((unused)))
+host_openacc_async_test (struct goacc_asyncqueue *aq __attribute__ ((unused)))
 {
   return 1;
 }
 
-static int
-host_openacc_async_test_all (void)
+static void
+host_openacc_async_synchronize (struct goacc_asyncqueue *aq
+				__attribute__ ((unused)))
 {
-  return 1;
 }
 
 static void
-host_openacc_async_wait (int async __attribute__ ((unused)))
+host_openacc_async_serialize (struct goacc_asyncqueue *aq1
+			      __attribute__ ((unused)),
+			      struct goacc_asyncqueue *aq2
+			      __attribute__ ((unused)))
 {
 }
 
-static void
-host_openacc_async_wait_async (int async1 __attribute__ ((unused)),
-			       int async2 __attribute__ ((unused)))
+static bool
+host_openacc_async_host2dev (int ord __attribute__ ((unused)),
+			     void *dst __attribute__ ((unused)),
+			     const void *src __attribute__ ((unused)),
+			     size_t n __attribute__ ((unused)),
+			     struct goacc_asyncqueue *aq
+			     __attribute__ ((unused)))
 {
+  return true;
 }
 
-static void
-host_openacc_async_wait_all (void)
+static bool
+host_openacc_async_dev2host (int ord __attribute__ ((unused)),
+			     void *dst __attribute__ ((unused)),
+			     const void *src __attribute__ ((unused)),
+			     size_t n __attribute__ ((unused)),
+			     struct goacc_asyncqueue *aq
+			     __attribute__ ((unused)))
 {
+  return true;
 }
 
 static void
-host_openacc_async_wait_all_async (int async __attribute__ ((unused)))
+host_openacc_async_queue_callback (struct goacc_asyncqueue *aq
+				   __attribute__ ((unused)),
+				   void (*callback_fn)(void *)
+				   __attribute__ ((unused)),
+				   void *userptr __attribute__ ((unused)))
 {
 }
 
-static void
-host_openacc_async_set_async (int async __attribute__ ((unused)))
+static struct goacc_asyncqueue *
+host_openacc_async_construct (void)
 {
+  return NULL;
 }
 
+static bool
+host_openacc_async_destruct (struct goacc_asyncqueue *aq
+			     __attribute__ ((unused)))
+{
+  return true;
+}
+
 static void *
 host_openacc_create_thread_data (int ord __attribute__ ((unused)))
 {
@@ -235,19 +266,21 @@ static struct gomp_device_descr host_dispatch =
 
       .exec_func = host_openacc_exec,
 
-      .register_async_cleanup_func = host_openacc_register_async_cleanup,
-
-      .async_test_func = host_openacc_async_test,
-      .async_test_all_func = host_openacc_async_test_all,
-      .async_wait_func = host_openacc_async_wait,
-      .async_wait_async_func = host_openacc_async_wait_async,
-      .async_wait_all_func = host_openacc_async_wait_all,
-      .async_wait_all_async_func = host_openacc_async_wait_all_async,
-      .async_set_async_func = host_openacc_async_set_async,
-
       .create_thread_data_func = host_openacc_create_thread_data,
       .destroy_thread_data_func = host_openacc_destroy_thread_data,
 
+      .async = {
+	.construct_func = host_openacc_async_construct,
+	.destruct_func = host_openacc_async_destruct,
+	.test_func = host_openacc_async_test,
+	.synchronize_func = host_openacc_async_synchronize,
+	.serialize_func = host_openacc_async_serialize,
+	.queue_callback_func = host_openacc_async_queue_callback,
+	.exec_func = host_openacc_async_exec,
+	.dev2host_func = host_openacc_async_dev2host,
+	.host2dev_func = host_openacc_async_host2dev,
+      },
+
       .cuda = {
 	.get_current_device_func = NULL,
 	.get_current_context_func = NULL,
Index: libgomp/oacc-init.c
===================================================================
--- libgomp/oacc-init.c	(revision 267226)
+++ libgomp/oacc-init.c	(working copy)
@@ -309,7 +309,7 @@ acc_shutdown_1 (acc_device_t d)
       if (acc_dev->state == GOMP_DEVICE_INITIALIZED)
         {
 	  devices_active = true;
-	  ret &= acc_dev->fini_device_func (acc_dev->target_id);
+	  ret &= gomp_fini_device (acc_dev);
 	  acc_dev->state = GOMP_DEVICE_UNINITIALIZED;
 	}
       gomp_mutex_unlock (&acc_dev->lock);
@@ -426,8 +426,8 @@ goacc_attach_host_thread_to_device (int ord)
   
   thr->target_tls
     = acc_dev->openacc.create_thread_data_func (ord);
-  
-  acc_dev->openacc.async_set_async_func (acc_async_sync);
+
+  thr->default_async = acc_async_default;
 }
 
 /* OpenACC 2.0a (3.2.12, 3.2.13) doesn't specify whether the serialization of
Index: libgomp/oacc-int.h
===================================================================
--- libgomp/oacc-int.h	(revision 267226)
+++ libgomp/oacc-int.h	(working copy)
@@ -73,6 +73,9 @@ struct goacc_thread
 
   /* Target-specific data (used by plugin).  */
   void *target_tls;
+
+  /* Default OpenACC async queue for current thread, exported to plugin.  */
+  int default_async;
 };
 
 #if defined HAVE_TLS || defined USE_EMUTLS
@@ -99,6 +102,14 @@ void goacc_restore_bind (void);
 void goacc_lazy_initialize (void);
 void goacc_host_init (void);
 
+void goacc_init_asyncqueues (struct gomp_device_descr *);
+bool goacc_fini_asyncqueues (struct gomp_device_descr *);
+void goacc_async_free (struct gomp_device_descr *, struct goacc_asyncqueue *,
+		       void *);
+struct goacc_asyncqueue *get_goacc_asyncqueue (int);
+struct goacc_asyncqueue *lookup_goacc_asyncqueue (struct goacc_thread *, bool,
+						  int);
+
 static inline bool
 async_valid_stream_id_p (int async)
 {
Index: libgomp/oacc-mem.c
===================================================================
--- libgomp/oacc-mem.c	(revision 267226)
+++ libgomp/oacc-mem.c	(working copy)
@@ -172,18 +172,11 @@ memcpy_tofrom_device (bool from, void *d, void *h,
       return;
     }
 
-  if (async > acc_async_sync)
-    thr->dev->openacc.async_set_async_func (async);
-
-  bool ret = (from
-	      ? thr->dev->dev2host_func (thr->dev->target_id, h, d, s)
-	      : thr->dev->host2dev_func (thr->dev->target_id, d, h, s));
-
-  if (async > acc_async_sync)
-    thr->dev->openacc.async_set_async_func (acc_async_sync);
-
-  if (!ret)
-    gomp_fatal ("error in %s", libfnname);
+  goacc_aq aq = get_goacc_asyncqueue (async);
+  if (from)
+    gomp_copy_dev2host (thr->dev, aq, h, d, s);
+  else
+    gomp_copy_host2dev (thr->dev, aq, d, h, s, /* TODO: cbuf? */ NULL);
 }
 
 void
@@ -509,17 +502,13 @@ present_create_copy (unsigned f, void *h, size_t s
 
       gomp_mutex_unlock (&acc_dev->lock);
 
-      if (async > acc_async_sync)
-	acc_dev->openacc.async_set_async_func (async);
+      goacc_aq aq = get_goacc_asyncqueue (async);
 
-      tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s, &kinds, true,
-			   GOMP_MAP_VARS_OPENACC);
+      tgt = gomp_map_vars_async (acc_dev, aq, mapnum, &hostaddrs, NULL, &s,
+				 &kinds, true, GOMP_MAP_VARS_OPENACC);
       /* Initialize dynamic refcount.  */
       tgt->list[0].key->dynamic_refcount = 1;
 
-      if (async > acc_async_sync)
-	acc_dev->openacc.async_set_async_func (acc_async_sync);
-
       gomp_mutex_lock (&acc_dev->lock);
 
       d = tgt->to_free;
@@ -676,13 +665,9 @@ delete_copyout (unsigned f, void *h, size_t s, int
 
       if (f & FLAG_COPYOUT)
 	{
-	  if (async > acc_async_sync)
-	    acc_dev->openacc.async_set_async_func (async);
-	  acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
-	  if (async > acc_async_sync)
-	    acc_dev->openacc.async_set_async_func (acc_async_sync);
+	  goacc_aq aq = get_goacc_asyncqueue (async);
+	  gomp_copy_dev2host (acc_dev, aq, h, d, s);
 	}
-
       gomp_remove_var (acc_dev, n);
     }
 
@@ -765,17 +750,13 @@ update_dev_host (int is_dev, void *h, size_t s, in
   d = (void *) (n->tgt->tgt_start + n->tgt_offset
 		+ (uintptr_t) h - n->host_start);
 
-  if (async > acc_async_sync)
-    acc_dev->openacc.async_set_async_func (async);
+  goacc_aq aq = get_goacc_asyncqueue (async);
 
   if (is_dev)
-    acc_dev->host2dev_func (acc_dev->target_id, d, h, s);
+    gomp_copy_host2dev (acc_dev, aq, d, h, s, /* TODO: cbuf? */ NULL);
   else
-    acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
+    gomp_copy_dev2host (acc_dev, aq, h, d, s);
 
-  if (async > acc_async_sync)
-    acc_dev->openacc.async_set_async_func (acc_async_sync);
-
   gomp_mutex_unlock (&acc_dev->lock);
 }
 
@@ -805,7 +786,7 @@ acc_update_self_async (void *h, size_t s, int asyn
 
 void
 gomp_acc_insert_pointer (size_t mapnum, void **hostaddrs, size_t *sizes,
-			 void *kinds)
+			 void *kinds, int async)
 {
   struct target_mem_desc *tgt;
   struct goacc_thread *thr = goacc_thread ();
@@ -835,8 +816,9 @@ gomp_acc_insert_pointer (size_t mapnum, void **hos
     }
 
   gomp_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
-  tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs,
-		       NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
+  goacc_aq aq = get_goacc_asyncqueue (async);
+  tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs,
+			     NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
   gomp_debug (0, "  %s: mappings prepared\n", __FUNCTION__);
 
   /* Initialize dynamic refcount.  */
@@ -930,7 +912,10 @@ gomp_acc_remove_pointer (void *h, size_t s, bool f
       if (async < acc_async_noval)
 	gomp_unmap_vars (t, true);
       else
-	t->device_descr->openacc.register_async_cleanup_func (t, async);
+	{
+	  goacc_aq aq = get_goacc_asyncqueue (async);
+	  gomp_unmap_vars_async (t, true, aq);
+	}
     }
 
   gomp_mutex_unlock (&acc_dev->lock);
Index: libgomp/oacc-parallel.c
===================================================================
--- libgomp/oacc-parallel.c	(revision 267226)
+++ libgomp/oacc-parallel.c	(working copy)
@@ -208,8 +208,6 @@ GOACC_parallel_keyed (int device, void (*fn) (void
     }
   va_end (ap);
   
-  acc_dev->openacc.async_set_async_func (async);
-
   if (!(acc_dev->capabilities & GOMP_OFFLOAD_CAP_NATIVE_EXEC))
     {
       k.host_start = (uintptr_t) fn;
@@ -226,44 +224,29 @@ GOACC_parallel_keyed (int device, void (*fn) (void
   else
     tgt_fn = (void (*)) fn;
 
-  tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs, NULL, sizes, kinds, true,
-		       GOMP_MAP_VARS_OPENACC);
+  goacc_aq aq = get_goacc_asyncqueue (async);
 
+  tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs, NULL, sizes, kinds,
+			     true, GOMP_MAP_VARS_OPENACC);
+  
   devaddrs = gomp_alloca (sizeof (void *) * mapnum);
   for (i = 0; i < mapnum; i++)
     devaddrs[i] = (void *) (tgt->list[i].key->tgt->tgt_start
 			    + tgt->list[i].key->tgt_offset
 			    + tgt->list[i].offset);
-
-  acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
-			      async, dims, tgt);
-
-  /* If running synchronously, unmap immediately.  */
-  bool copyfrom = true;
-  if (async_synchronous_p (async))
-    gomp_unmap_vars (tgt, true);
+  if (aq == NULL)
+    {
+      acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+				  dims, tgt);
+      /* If running synchronously, unmap immediately.  */
+      gomp_unmap_vars (tgt, true);
+    }
   else
     {
-      bool async_unmap = false;
-      for (size_t i = 0; i < tgt->list_count; i++)
-	{
-	  splay_tree_key k = tgt->list[i].key;
-	  if (k && k->refcount == 1)
-	    {
-	      async_unmap = true;
-	      break;
-	    }
-	}
-      if (async_unmap)
-	tgt->device_descr->openacc.register_async_cleanup_func (tgt, async);
-      else
-	{
-	  copyfrom = false;
-	  gomp_unmap_vars (tgt, copyfrom);
-	}
+      acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+					dims, tgt, aq);
+      gomp_unmap_vars_async (tgt, true, aq);
     }
-
-  acc_dev->openacc.async_set_async_func (acc_async_sync);
 }
 
 /* Legacy entry point, only provide host execution.  */
@@ -372,8 +355,6 @@ GOACC_enter_exit_data (int device, size_t mapnum,
 	finalize = true;
     }
 
-  acc_dev->openacc.async_set_async_func (async);
-
   /* Determine if this is an "acc enter data".  */
   for (i = 0; i < mapnum; ++i)
     {
@@ -441,7 +422,7 @@ GOACC_enter_exit_data (int device, size_t mapnum,
 	  else
 	    {
 	      gomp_acc_insert_pointer (pointer, &hostaddrs[i],
-				       &sizes[i], &kinds[i]);
+				       &sizes[i], &kinds[i], async);
 	      /* Increment 'i' by two because OpenACC requires fortran
 		 arrays to be contiguous, so each PSET is associated with
 		 one of MAP_FORCE_ALLOC/MAP_FORCE_PRESET/MAP_FORCE_TO, and
@@ -466,17 +447,17 @@ GOACC_enter_exit_data (int device, size_t mapnum,
 		if (acc_is_present (hostaddrs[i], sizes[i]))
 		  {
 		    if (finalize)
-		      acc_delete_finalize (hostaddrs[i], sizes[i]);
+		      acc_delete_finalize_async (hostaddrs[i], sizes[i], async);
 		    else
-		      acc_delete (hostaddrs[i], sizes[i]);
+		      acc_delete_async (hostaddrs[i], sizes[i], async);
 		  }
 		break;
 	      case GOMP_MAP_FROM:
 	      case GOMP_MAP_FORCE_FROM:
 		if (finalize)
-		  acc_copyout_finalize (hostaddrs[i], sizes[i]);
+		  acc_copyout_finalize_async (hostaddrs[i], sizes[i], async);
 		else
-		  acc_copyout (hostaddrs[i], sizes[i]);
+		  acc_copyout_async (hostaddrs[i], sizes[i], async);
 		break;
 	      default:
 		gomp_fatal (">>>> GOACC_enter_exit_data UNHANDLED kind 0x%.2x",
@@ -494,8 +475,6 @@ GOACC_enter_exit_data (int device, size_t mapnum,
 	    i += pointer - 1;
 	  }
       }
-
-  acc_dev->openacc.async_set_async_func (acc_async_sync);
 }
 
 static void
@@ -508,17 +487,22 @@ goacc_wait (int async, int num_waits, va_list *ap)
     {
       int qid = va_arg (*ap, int);
       
-      if (acc_async_test (qid))
+      goacc_aq aq = get_goacc_asyncqueue (qid);
+      if (acc_dev->openacc.async.test_func (aq))
 	continue;
-
       if (async == acc_async_sync)
-	acc_wait (qid);
+	acc_dev->openacc.async.synchronize_func (aq);
       else if (qid == async)
-	;/* If we're waiting on the same asynchronous queue as we're
-	    launching on, the queue itself will order work as
-	    required, so there's no need to wait explicitly.  */
+	/* If we're waiting on the same asynchronous queue as we're
+	   launching on, the queue itself will order work as
+	   required, so there's no need to wait explicitly.  */
+	;
       else
-	acc_dev->openacc.async_wait_async_func (qid, async);
+	{
+	  goacc_aq aq2 = get_goacc_asyncqueue (async);
+	  acc_dev->openacc.async.synchronize_func (aq);
+	  acc_dev->openacc.async.serialize_func (aq, aq2);
+	}
     }
 }
 
@@ -548,8 +532,6 @@ GOACC_update (int device, size_t mapnum,
       va_end (ap);
     }
 
-  acc_dev->openacc.async_set_async_func (async);
-
   bool update_device = false;
   for (i = 0; i < mapnum; ++i)
     {
@@ -589,7 +571,7 @@ GOACC_update (int device, size_t mapnum,
 	  /* Fallthru  */
 	case GOMP_MAP_FORCE_TO:
 	  update_device = true;
-	  acc_update_device (hostaddrs[i], sizes[i]);
+	  acc_update_device_async (hostaddrs[i], sizes[i], async);
 	  break;
 
 	case GOMP_MAP_FROM:
@@ -601,7 +583,7 @@ GOACC_update (int device, size_t mapnum,
 	  /* Fallthru  */
 	case GOMP_MAP_FORCE_FROM:
 	  update_device = false;
-	  acc_update_self (hostaddrs[i], sizes[i]);
+	  acc_update_self_async (hostaddrs[i], sizes[i], async);
 	  break;
 
 	default:
@@ -609,8 +591,6 @@ GOACC_update (int device, size_t mapnum,
 	  break;
 	}
     }
-
-  acc_dev->openacc.async_set_async_func (acc_async_sync);
 }
 
 void
Index: libgomp/oacc-plugin.c
===================================================================
--- libgomp/oacc-plugin.c	(revision 267226)
+++ libgomp/oacc-plugin.c	(working copy)
@@ -31,14 +31,10 @@
 #include "oacc-int.h"
 
 void
-GOMP_PLUGIN_async_unmap_vars (void *ptr, int async)
+GOMP_PLUGIN_async_unmap_vars (void *ptr __attribute__((unused)),
+			      int async __attribute__((unused)))
 {
-  struct target_mem_desc *tgt = ptr;
-  struct gomp_device_descr *devicep = tgt->device_descr;
-
-  devicep->openacc.async_set_async_func (async);
-  gomp_unmap_vars (tgt, true);
-  devicep->openacc.async_set_async_func (acc_async_sync);
+  gomp_fatal ("invalid plugin function");
 }
 
 /* Return the target-specific part of the TLS data for the current thread.  */
Thomas Schwinge Dec. 18, 2018, 9:03 p.m. | #20
Hi Chung-Lin!

On Tue, 18 Dec 2018 23:06:38 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> this part includes some of the lookup_goacc_asyncqueue fixes we talked about.

> I am still thinking about how the queue lock problem should really be solved, so regard

> this patch as just fixing some of the problems.


Sure, thanks.

Two comments, though:

> --- libgomp/oacc-async.c	(revision 267226)

> +++ libgomp/oacc-async.c	(working copy)


> +attribute_hidden struct goacc_asyncqueue *

> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)

> +{

> +  /* The special value acc_async_noval (-1) maps to the thread-specific

> +     default async stream.  */

> +  if (async == acc_async_noval)

> +    async = thr->default_async;

> +

> +  if (async == acc_async_sync)

> +    return NULL;

> +

> +  if (async < 0)

> +    gomp_fatal ("bad async %d", async);

> +

> +  struct gomp_device_descr *dev = thr->dev;

> +

> +  gomp_mutex_lock (&dev->openacc.async.lock);

> +

> +  if (!create

> +      && (async >= dev->openacc.async.nasyncqueue

> +	  || !dev->openacc.async.asyncqueue[async]))

> +    {

> +      gomp_mutex_unlock (&dev->openacc.async.lock);

> +      return NULL;

> +    }

> +

> +  if (async >= dev->openacc.async.nasyncqueue)

> +    {

> +      int diff = async + 1 - dev->openacc.async.nasyncqueue;

> +      dev->openacc.async.asyncqueue

> +	= gomp_realloc (dev->openacc.async.asyncqueue,

> +			sizeof (goacc_aq) * (async + 1));

> +      memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue,

> +	      0, sizeof (goacc_aq) * diff);

> +      dev->openacc.async.nasyncqueue = async + 1;

> +    }

> +

> +  if (!dev->openacc.async.asyncqueue[async])

> +    {

> +      dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func ();

> +

> +      if (!dev->openacc.async.asyncqueue[async])

> +	{

> +	  gomp_mutex_unlock (&dev->openacc.async.lock);

> +	  gomp_fatal ("async %d creation failed", async);

> +	}


That will now always fail for host fallback, where
"host_openacc_async_construct" just always does "return NULL".

Actually, if the device doesn't support asyncqueues, this whole function
should turn into some kind of no-op, so that we don't again and again try
to create a new one for every call to "lookup_goacc_asyncqueue".

I'm attaching one possible solution.  I think it's fine to assume that
the majority of devices will support asyncqueues, and for those that
don't, this is just a one-time overhead per async-argument.  So, no
special handling required in "lookup_goacc_asyncqueue".

> +      /* Link new async queue into active list.  */

> +      goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));

> +      n->aq = dev->openacc.async.asyncqueue[async];

> +      n->next = dev->openacc.async.active;

> +      dev->openacc.async.active = n;

> +    }

> +  gomp_mutex_unlock (&dev->openacc.async.lock);


You still need to keep "async" locked during...

> +  return dev->openacc.async.asyncqueue[async];


... this dereference.

> +}



Oh, and:

> --- libgomp/oacc-plugin.c	(revision 267226)

> +++ libgomp/oacc-plugin.c	(working copy)

> @@ -31,14 +31,10 @@

>  #include "oacc-int.h"

>  

>  void

> -GOMP_PLUGIN_async_unmap_vars (void *ptr, int async)

> +GOMP_PLUGIN_async_unmap_vars (void *ptr __attribute__((unused)),

> +			      int async __attribute__((unused)))

>  {

> -  struct target_mem_desc *tgt = ptr;

> -  struct gomp_device_descr *devicep = tgt->device_descr;

> -

> -  devicep->openacc.async_set_async_func (async);

> -  gomp_unmap_vars (tgt, true);

> -  devicep->openacc.async_set_async_func (acc_async_sync);

> +  gomp_fatal ("invalid plugin function");

>  }


Please add a comment here, something like: "Obsolete entry point, no
longer used."


Grüße
 Thomas
From 4cb99c3691f95b6b299e7cb2603af36f723f9e8e Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>

Date: Tue, 18 Dec 2018 21:58:41 +0100
Subject: [PATCH] into async re-work: adjust host_openacc_async_construct

---
 libgomp/oacc-host.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 727f8866f45c..cfd8a24f0674 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -212,7 +212,8 @@ host_openacc_async_queue_callback (struct goacc_asyncqueue *aq
 static struct goacc_asyncqueue *
 host_openacc_async_construct (void)
 {
-  return NULL;
+  /* We have to return non-NULL here, but it's OK to use a dummy.  */
+  return (struct goacc_asyncqueue *) -1;
 }
 
 static bool
-- 
2.17.1
Chung-Lin Tang Dec. 21, 2018, 4:04 p.m. | #21
On 2018/12/19 5:03 AM, Thomas Schwinge wrote:
> Hi Chung-Lin!

> 

> On Tue, 18 Dec 2018 23:06:38 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:

>> this part includes some of the lookup_goacc_asyncqueue fixes we talked about.

>> I am still thinking about how the queue lock problem should really be solved, so regard

>> this patch as just fixing some of the problems.


Hi Thomas,
This is my solution to the queue lock stuff we talked about. I've only attached a diff to
be applied atop of the existing changes, though that may actually be easier to review.

Note that this is still in testing, which means it hasn't been tested :P, but I'm
posting to see if you have time to give it a look before the holidays.

Having the need for a lock on the async queues is quite irritating, especially
when the structure needed for managing them is quite simple.

Therefore, lets do away the need for locks entirely.

This patch makes the asyncqueue part of the device->openacc.async managed by lock-free
atomic operations; almost all of the complexity is contained in lookup_goacc_asyncqueue(),
so it should be not too complex. A descriptor and the queue array is allocated/exchanged
atomically when more storage is required, while in the common case a simple lookup is enough.
The fact that we manage asyncqueues by only expanding and never destroying asyncqueues
during the device lifetime also simplifies many things.

The current implementation may be not that optimized and clunky in some cases, but I think
this should be the way to implement what should be a fairly simple asyncqueue array and active
list. I'll update more on the status as testing proceeds.

(and about the other corners you noticed in the last mail, I'll get to that later...)

Thanks,
Chung-Lin







> Sure, thanks.

> 

> Two comments, though:

> 

>> --- libgomp/oacc-async.c	(revision 267226)

>> +++ libgomp/oacc-async.c	(working copy)

> 

>> +attribute_hidden struct goacc_asyncqueue *

>> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)

>> +{

>> +  /* The special value acc_async_noval (-1) maps to the thread-specific

>> +     default async stream.  */

>> +  if (async == acc_async_noval)

>> +    async = thr->default_async;

>> +

>> +  if (async == acc_async_sync)

>> +    return NULL;

>> +

>> +  if (async < 0)

>> +    gomp_fatal ("bad async %d", async);

>> +

>> +  struct gomp_device_descr *dev = thr->dev;

>> +

>> +  gomp_mutex_lock (&dev->openacc.async.lock);

>> +

>> +  if (!create

>> +      && (async >= dev->openacc.async.nasyncqueue

>> +	  || !dev->openacc.async.asyncqueue[async]))

>> +    {

>> +      gomp_mutex_unlock (&dev->openacc.async.lock);

>> +      return NULL;

>> +    }

>> +

>> +  if (async >= dev->openacc.async.nasyncqueue)

>> +    {

>> +      int diff = async + 1 - dev->openacc.async.nasyncqueue;

>> +      dev->openacc.async.asyncqueue

>> +	= gomp_realloc (dev->openacc.async.asyncqueue,

>> +			sizeof (goacc_aq) * (async + 1));

>> +      memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue,

>> +	      0, sizeof (goacc_aq) * diff);

>> +      dev->openacc.async.nasyncqueue = async + 1;

>> +    }

>> +

>> +  if (!dev->openacc.async.asyncqueue[async])

>> +    {

>> +      dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func ();

>> +

>> +      if (!dev->openacc.async.asyncqueue[async])

>> +	{

>> +	  gomp_mutex_unlock (&dev->openacc.async.lock);

>> +	  gomp_fatal ("async %d creation failed", async);

>> +	}

> 

> That will now always fail for host fallback, where

> "host_openacc_async_construct" just always does "return NULL".

> 

> Actually, if the device doesn't support asyncqueues, this whole function

> should turn into some kind of no-op, so that we don't again and again try

> to create a new one for every call to "lookup_goacc_asyncqueue".

> 

> I'm attaching one possible solution.  I think it's fine to assume that

> the majority of devices will support asyncqueues, and for those that

> don't, this is just a one-time overhead per async-argument.  So, no

> special handling required in "lookup_goacc_asyncqueue".

> 

>> +      /* Link new async queue into active list.  */

>> +      goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));

>> +      n->aq = dev->openacc.async.asyncqueue[async];

>> +      n->next = dev->openacc.async.active;

>> +      dev->openacc.async.active = n;

>> +    }

>> +  gomp_mutex_unlock (&dev->openacc.async.lock);

> 

> You still need to keep "async" locked during...

> 

>> +  return dev->openacc.async.asyncqueue[async];

> 

> ... this dereference.

> 

>> +}

> 

> 

> Oh, and:

> 

>> --- libgomp/oacc-plugin.c	(revision 267226)

>> +++ libgomp/oacc-plugin.c	(working copy)

>> @@ -31,14 +31,10 @@

>>   #include "oacc-int.h"

>>   

>>   void

>> -GOMP_PLUGIN_async_unmap_vars (void *ptr, int async)

>> +GOMP_PLUGIN_async_unmap_vars (void *ptr __attribute__((unused)),

>> +			      int async __attribute__((unused)))

>>   {

>> -  struct target_mem_desc *tgt = ptr;

>> -  struct gomp_device_descr *devicep = tgt->device_descr;

>> -

>> -  devicep->openacc.async_set_async_func (async);

>> -  gomp_unmap_vars (tgt, true);

>> -  devicep->openacc.async_set_async_func (acc_async_sync);

>> +  gomp_fatal ("invalid plugin function");

>>   }

> 

> Please add a comment here, something like: "Obsolete entry point, no

> longer used."

> 

> 

> Grüße

>   Thomas

> 

>
diff -ru trunk-orig/libgomp/libgomp.h trunk-work/libgomp/libgomp.h
--- trunk-orig/libgomp/libgomp.h	2018-12-20 23:24:20.040375724 +0800
+++ trunk-work/libgomp/libgomp.h	2018-12-21 23:32:47.938628954 +0800
@@ -937,6 +937,13 @@
 
 #include "splay-tree.h"
 
+struct goacc_asyncqueue_desc
+{
+  int nasyncqueue;
+  struct goacc_asyncqueue **asyncqueue;
+  struct goacc_asyncqueue_list *active;
+};
+
 typedef struct acc_dispatch_t
 {
   /* This is a linked list of data mapped using the
@@ -955,10 +962,7 @@
     *destroy_thread_data_func;
   
   struct {
-    gomp_mutex_t lock;
-    int nasyncqueue;
-    struct goacc_asyncqueue **asyncqueue;
-    struct goacc_asyncqueue_list *active;
+    struct goacc_asyncqueue_desc *aqdesc;
 
     __typeof (GOMP_OFFLOAD_openacc_async_construct) *construct_func;
     __typeof (GOMP_OFFLOAD_openacc_async_destruct) *destruct_func;
diff -ru trunk-orig/libgomp/oacc-async.c trunk-work/libgomp/oacc-async.c
--- trunk-orig/libgomp/oacc-async.c	2018-12-18 22:19:51.923102938 +0800
+++ trunk-work/libgomp/oacc-async.c	2018-12-21 23:40:36.088528890 +0800
@@ -70,45 +70,84 @@
 
   struct gomp_device_descr *dev = thr->dev;
 
-  gomp_mutex_lock (&dev->openacc.async.lock);
+  struct goacc_asyncqueue_desc *aqdesc = dev->openacc.async.aqdesc;
 
-  if (!create
-      && (async >= dev->openacc.async.nasyncqueue
-	  || !dev->openacc.async.asyncqueue[async]))
-    {
-      gomp_mutex_unlock (&dev->openacc.async.lock);
-      return NULL;
-    }
+  if (!create)
+    return async < aqdesc->nasyncqueue ? aqdesc->asyncqueue[async] : NULL;
 
-  if (async >= dev->openacc.async.nasyncqueue)
-    {
-      int diff = async + 1 - dev->openacc.async.nasyncqueue;
-      dev->openacc.async.asyncqueue
-	= gomp_realloc (dev->openacc.async.asyncqueue,
-			sizeof (goacc_aq) * (async + 1));
-      memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue,
-	      0, sizeof (goacc_aq) * diff);
-      dev->openacc.async.nasyncqueue = async + 1;
-    }
+  if (async < aqdesc->nasyncqueue && aqdesc->asyncqueue[async])
+    return aqdesc->asyncqueue[async];
 
-  if (!dev->openacc.async.asyncqueue[async])
-    {
-      dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func ();
+  struct goacc_asyncqueue *new_aq = dev->openacc.async.construct_func ();
+  const int inc_size = 8;
+  int new_sz = async + 1 + inc_size;
+  new_sz = (new_sz + inc_size - 1) & ~(inc_size - 1);
+
+  struct goacc_asyncqueue **new_aqvec =
+    gomp_malloc_cleared (sizeof (goacc_aq) * new_sz);
+  new_aqvec[async] = new_aq;
+
+  struct goacc_asyncqueue_desc *ndesc =
+    gomp_malloc (sizeof (struct goacc_asyncqueue_desc));
+  ndesc->asyncqueue = new_aqvec;
+  ndesc->nasyncqueue = new_sz;
+
+  goacc_aq_list nl = gomp_malloc (sizeof (struct goacc_asyncqueue_list));
+  nl->aq = new_aq;
+  ndesc->active = nl;
+
+  /* Loop to atomically create and add a new async queue, without using a lock,
+     though may possibly requiring creating and replacing with a new descriptor.
+     This may likely not be as optimized as it can be, although simplifies
+     other handling of async queues.  */
+  do {
+    int curr_num_aqs = aqdesc->nasyncqueue;
+    
+    if (async < curr_num_aqs)
+      {
+	/* Someone else expanded the asyncqeue array to enough size, free the
+	   allocated resources.  */
+	free (ndesc->asyncqueue);
+	free (ndesc);
+
+	/* Try to set the new aq to the right place again.  */
+	struct goacc_asyncqueue_desc *nullq = NULL;
+	if (__atomic_compare_exchange (&aqdesc->asyncqueue[async],
+				       &nullq, &new_aq, false,
+				       __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
+	  {
+	    /* Push new asyncqueue to front of active list.  */
+	    nl->next = aqdesc->active;
+	    while (!__atomic_compare_exchange (&aqdesc->active, &nl->next, &nl,
+					       true, __ATOMIC_ACQ_REL,
+					       __ATOMIC_ACQUIRE))
+	      ;
+	  }
+	else
+	  {
+	    /* Okay, someone else already created 'async', free/destroy
+	       everything allocated and return existing queue directly.  */
+	    if (!dev->openacc.async.destruct_func (new_aq))
+	      gomp_fatal ("error in async queue creation");
+	    free (nl);
+	  }
+	return aqdesc->asyncqueue[async];
+      }
 
-      if (!dev->openacc.async.asyncqueue[async])
-	{
-	  gomp_mutex_unlock (&dev->openacc.async.lock);
-	  gomp_fatal ("async %d creation failed", async);
-	}
-      
-      /* Link new async queue into active list.  */
-      goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));
-      n->aq = dev->openacc.async.asyncqueue[async];
-      n->next = dev->openacc.async.active;
-      dev->openacc.async.active = n;
-    }
-  gomp_mutex_unlock (&dev->openacc.async.lock);
-  return dev->openacc.async.asyncqueue[async];
+    /* Copy current asyncqueue contents*/
+    memcpy (ndesc->asyncqueue, aqdesc->asyncqueue,
+	    sizeof (struct goacc_asyncqueue *) * curr_num_aqs);
+    ndesc->active->next = aqdesc->active;
+
+  } while (!__atomic_compare_exchange (&dev->openacc.async.aqdesc,
+				       &aqdesc, &ndesc, false,
+				       __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE));
+
+  /* Free the old stale descriptor and resources in aqdesc.  */
+  free (aqdesc->asyncqueue);
+  free (aqdesc);
+
+  return dev->openacc.async.aqdesc->asyncqueue[async];
 }
 
 attribute_hidden struct goacc_asyncqueue *
@@ -139,14 +178,12 @@
   struct goacc_thread *thr = get_goacc_thread ();
 
   int ret = 1;
-  gomp_mutex_lock (&thr->dev->openacc.async.lock);
-  for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
+  for (goacc_aq_list l = thr->dev->openacc.async.aqdesc->active; l; l = l->next)
     if (!thr->dev->openacc.async.test_func (l->aq))
       {
 	ret = 0;
 	break;
       }
-  gomp_mutex_unlock (&thr->dev->openacc.async.lock);
   return ret;
 }
 
@@ -194,10 +231,8 @@
 {
   struct gomp_device_descr *dev = get_goacc_thread_device ();
 
-  gomp_mutex_lock (&dev->openacc.async.lock);
-  for (goacc_aq_list l = dev->openacc.async.active; l; l = l->next)
+  for (goacc_aq_list l = dev->openacc.async.aqdesc->active; l; l = l->next)
     dev->openacc.async.synchronize_func (l->aq);
-  gomp_mutex_unlock (&dev->openacc.async.lock);
 }
 
 /* acc_async_wait_all is an OpenACC 1.0 compatibility name for acc_wait_all.  */
@@ -221,14 +256,12 @@
 
   goacc_aq waiting_queue = lookup_goacc_asyncqueue (thr, true, async);
 
-  gomp_mutex_lock (&thr->dev->openacc.async.lock);
-  for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
+  for (goacc_aq_list l = thr->dev->openacc.async.aqdesc->active; l; l = l->next)
     {
       thr->dev->openacc.async.synchronize_func (l->aq);
       if (waiting_queue)
 	thr->dev->openacc.async.serialize_func (l->aq, waiting_queue);
     }
-  gomp_mutex_unlock (&thr->dev->openacc.async.lock);
 }
 
 int
@@ -261,30 +294,27 @@
 attribute_hidden void
 goacc_init_asyncqueues (struct gomp_device_descr *devicep)
 {
-  gomp_mutex_init (&devicep->openacc.async.lock);
-  devicep->openacc.async.nasyncqueue = 0;
-  devicep->openacc.async.asyncqueue = NULL;
-  devicep->openacc.async.active = NULL;
+  devicep->openacc.async.aqdesc =
+    gomp_malloc_cleared (sizeof (struct goacc_asyncqueue_desc));
 }
 
 attribute_hidden bool
 goacc_fini_asyncqueues (struct gomp_device_descr *devicep)
 {
+  struct goacc_asyncqueue_desc *aqdesc = devicep->openacc.async.aqdesc;
   bool ret = true;
-  if (devicep->openacc.async.nasyncqueue > 0)
+
+  if (aqdesc->nasyncqueue > 0)
     {
       goacc_aq_list next;
-      for (goacc_aq_list l = devicep->openacc.async.active; l; l = next)
+      for (goacc_aq_list l = aqdesc->active; l; l = next)
 	{
 	  ret &= devicep->openacc.async.destruct_func (l->aq);
 	  next = l->next;
 	  free (l);
 	}
-      free (devicep->openacc.async.asyncqueue);
-      devicep->openacc.async.nasyncqueue = 0;
-      devicep->openacc.async.asyncqueue = NULL;
-      devicep->openacc.async.active = NULL;
+      free (aqdesc->asyncqueue);
     }
-  gomp_mutex_destroy (&devicep->openacc.async.lock);
+  free (aqdesc);
   return ret;
 }
diff -ru trunk-orig/libgomp/oacc-cuda.c trunk-work/libgomp/oacc-cuda.c
--- trunk-orig/libgomp/oacc-cuda.c	2018-12-18 18:09:08.601913507 +0800
+++ trunk-work/libgomp/oacc-cuda.c	2018-12-21 23:34:10.024738834 +0800
@@ -87,9 +87,7 @@
   if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
     {
       goacc_aq aq = get_goacc_asyncqueue (async);
-      gomp_mutex_lock (&thr->dev->openacc.async.lock);
       ret = thr->dev->openacc.cuda.set_stream_func (aq, stream);
-      gomp_mutex_unlock (&thr->dev->openacc.async.lock);
     }
 
   return ret;
Thomas Schwinge Dec. 28, 2018, 12:07 p.m. | #22
Hi Chung-Lin!

On Sat, 22 Dec 2018 00:04:56 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> On 2018/12/19 5:03 AM, Thomas Schwinge wrote:

> > On Tue, 18 Dec 2018 23:06:38 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:

> >> this part includes some of the lookup_goacc_asyncqueue fixes we talked about.

> >> I am still thinking about how the queue lock problem should really be solved, so regard

> >> this patch as just fixing some of the problems.

> 

> This is my solution to the queue lock stuff we talked about. I've only attached a diff to

> be applied atop of the existing changes, though that may actually be easier to review.

> 

> Note that this is still in testing, which means it hasn't been tested :P, but I'm

> posting to see if you have time to give it a look before the holidays.

> 

> Having the need for a lock on the async queues is quite irritating, especially

> when the structure needed for managing them is quite simple.

>

> Therefore, lets do away the need for locks entirely.


OK, if that's easily possible, fine.  Though, I won't object to the
current "lock" version, if done properly in all the relevant places, as
pointed out.

(And, as that was an open issue, as far as I remember: I think its fine
to have "aq"s not protected by the async lock, because they will only be
destroyed by device shutdown, and at that point, all async operations
must be synchronized with the local (host) thread, so the are then "aq"s
idle.)

> This patch makes the asyncqueue part of the device->openacc.async managed by lock-free

> atomic operations; almost all of the complexity is contained in lookup_goacc_asyncqueue(),

> so it should be not too complex. A descriptor and the queue array is allocated/exchanged

> atomically when more storage is required, while in the common case a simple lookup is enough.

> The fact that we manage asyncqueues by only expanding and never destroying asyncqueues

> during the device lifetime also simplifies many things.

> 

> The current implementation may be not that optimized and clunky in some cases, but I think

> this should be the way to implement what should be a fairly simple asyncqueue array and active

> list.


OK conceptually, but from the (very quick only!) look that I had, I did
not completely understand the data structure you're adding there, and
don't you also have to use atomic instructions for reading the lock-free
data structure ("aqdesc" etc.)?

> I'll update more on the status as testing proceeds.

> 

> (and about the other corners you noticed in the last mail, I'll get to that later...)


OK, thanks.


Grüße
 Thomas


> >> --- libgomp/oacc-async.c	(revision 267226)

> >> +++ libgomp/oacc-async.c	(working copy)

> > 

> >> +attribute_hidden struct goacc_asyncqueue *

> >> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)

> >> +{

> >> +[...]

> >> +      /* Link new async queue into active list.  */

> >> +      goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));

> >> +      n->aq = dev->openacc.async.asyncqueue[async];

> >> +      n->next = dev->openacc.async.active;

> >> +      dev->openacc.async.active = n;

> >> +    }

> >> +  gomp_mutex_unlock (&dev->openacc.async.lock);

> > 

> > You still need to keep "async" locked during...

> > 

> >> +  return dev->openacc.async.asyncqueue[async];

> > 

> > ... this dereference.

> > 

> >> +}


> --- trunk-orig/libgomp/libgomp.h	2018-12-20 23:24:20.040375724 +0800

> +++ trunk-work/libgomp/libgomp.h	2018-12-21 23:32:47.938628954 +0800

> @@ -937,6 +937,13 @@

>  

>  #include "splay-tree.h"

>  

> +struct goacc_asyncqueue_desc

> +{

> +  int nasyncqueue;

> +  struct goacc_asyncqueue **asyncqueue;

> +  struct goacc_asyncqueue_list *active;

> +};

> +

>  typedef struct acc_dispatch_t

>  {

>    /* This is a linked list of data mapped using the

> @@ -955,10 +962,7 @@

>      *destroy_thread_data_func;

>    

>    struct {

> -    gomp_mutex_t lock;

> -    int nasyncqueue;

> -    struct goacc_asyncqueue **asyncqueue;

> -    struct goacc_asyncqueue_list *active;

> +    struct goacc_asyncqueue_desc *aqdesc;

>  

>      __typeof (GOMP_OFFLOAD_openacc_async_construct) *construct_func;

>      __typeof (GOMP_OFFLOAD_openacc_async_destruct) *destruct_func;

> diff -ru trunk-orig/libgomp/oacc-async.c trunk-work/libgomp/oacc-async.c

> --- trunk-orig/libgomp/oacc-async.c	2018-12-18 22:19:51.923102938 +0800

> +++ trunk-work/libgomp/oacc-async.c	2018-12-21 23:40:36.088528890 +0800

> @@ -70,45 +70,84 @@

>  

>    struct gomp_device_descr *dev = thr->dev;

>  

> -  gomp_mutex_lock (&dev->openacc.async.lock);

> +  struct goacc_asyncqueue_desc *aqdesc = dev->openacc.async.aqdesc;

>  

> -  if (!create

> -      && (async >= dev->openacc.async.nasyncqueue

> -	  || !dev->openacc.async.asyncqueue[async]))

> -    {

> -      gomp_mutex_unlock (&dev->openacc.async.lock);

> -      return NULL;

> -    }

> +  if (!create)

> +    return async < aqdesc->nasyncqueue ? aqdesc->asyncqueue[async] : NULL;

>  

> -  if (async >= dev->openacc.async.nasyncqueue)

> -    {

> -      int diff = async + 1 - dev->openacc.async.nasyncqueue;

> -      dev->openacc.async.asyncqueue

> -	= gomp_realloc (dev->openacc.async.asyncqueue,

> -			sizeof (goacc_aq) * (async + 1));

> -      memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue,

> -	      0, sizeof (goacc_aq) * diff);

> -      dev->openacc.async.nasyncqueue = async + 1;

> -    }

> +  if (async < aqdesc->nasyncqueue && aqdesc->asyncqueue[async])

> +    return aqdesc->asyncqueue[async];

>  

> -  if (!dev->openacc.async.asyncqueue[async])

> -    {

> -      dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func ();

> +  struct goacc_asyncqueue *new_aq = dev->openacc.async.construct_func ();

> +  const int inc_size = 8;

> +  int new_sz = async + 1 + inc_size;

> +  new_sz = (new_sz + inc_size - 1) & ~(inc_size - 1);

> +

> +  struct goacc_asyncqueue **new_aqvec =

> +    gomp_malloc_cleared (sizeof (goacc_aq) * new_sz);

> +  new_aqvec[async] = new_aq;

> +

> +  struct goacc_asyncqueue_desc *ndesc =

> +    gomp_malloc (sizeof (struct goacc_asyncqueue_desc));

> +  ndesc->asyncqueue = new_aqvec;

> +  ndesc->nasyncqueue = new_sz;

> +

> +  goacc_aq_list nl = gomp_malloc (sizeof (struct goacc_asyncqueue_list));

> +  nl->aq = new_aq;

> +  ndesc->active = nl;

> +

> +  /* Loop to atomically create and add a new async queue, without using a lock,

> +     though may possibly requiring creating and replacing with a new descriptor.

> +     This may likely not be as optimized as it can be, although simplifies

> +     other handling of async queues.  */

> +  do {

> +    int curr_num_aqs = aqdesc->nasyncqueue;

> +    

> +    if (async < curr_num_aqs)

> +      {

> +	/* Someone else expanded the asyncqeue array to enough size, free the

> +	   allocated resources.  */

> +	free (ndesc->asyncqueue);

> +	free (ndesc);

> +

> +	/* Try to set the new aq to the right place again.  */

> +	struct goacc_asyncqueue_desc *nullq = NULL;

> +	if (__atomic_compare_exchange (&aqdesc->asyncqueue[async],

> +				       &nullq, &new_aq, false,

> +				       __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))

> +	  {

> +	    /* Push new asyncqueue to front of active list.  */

> +	    nl->next = aqdesc->active;

> +	    while (!__atomic_compare_exchange (&aqdesc->active, &nl->next, &nl,

> +					       true, __ATOMIC_ACQ_REL,

> +					       __ATOMIC_ACQUIRE))

> +	      ;

> +	  }

> +	else

> +	  {

> +	    /* Okay, someone else already created 'async', free/destroy

> +	       everything allocated and return existing queue directly.  */

> +	    if (!dev->openacc.async.destruct_func (new_aq))

> +	      gomp_fatal ("error in async queue creation");

> +	    free (nl);

> +	  }

> +	return aqdesc->asyncqueue[async];

> +      }

>  

> -      if (!dev->openacc.async.asyncqueue[async])

> -	{

> -	  gomp_mutex_unlock (&dev->openacc.async.lock);

> -	  gomp_fatal ("async %d creation failed", async);

> -	}

> -      

> -      /* Link new async queue into active list.  */

> -      goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));

> -      n->aq = dev->openacc.async.asyncqueue[async];

> -      n->next = dev->openacc.async.active;

> -      dev->openacc.async.active = n;

> -    }

> -  gomp_mutex_unlock (&dev->openacc.async.lock);

> -  return dev->openacc.async.asyncqueue[async];

> +    /* Copy current asyncqueue contents*/

> +    memcpy (ndesc->asyncqueue, aqdesc->asyncqueue,

> +	    sizeof (struct goacc_asyncqueue *) * curr_num_aqs);

> +    ndesc->active->next = aqdesc->active;

> +

> +  } while (!__atomic_compare_exchange (&dev->openacc.async.aqdesc,

> +				       &aqdesc, &ndesc, false,

> +				       __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE));

> +

> +  /* Free the old stale descriptor and resources in aqdesc.  */

> +  free (aqdesc->asyncqueue);

> +  free (aqdesc);

> +

> +  return dev->openacc.async.aqdesc->asyncqueue[async];

>  }

>  

>  attribute_hidden struct goacc_asyncqueue *

> @@ -139,14 +178,12 @@

>    struct goacc_thread *thr = get_goacc_thread ();

>  

>    int ret = 1;

> -  gomp_mutex_lock (&thr->dev->openacc.async.lock);

> -  for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)

> +  for (goacc_aq_list l = thr->dev->openacc.async.aqdesc->active; l; l = l->next)

>      if (!thr->dev->openacc.async.test_func (l->aq))

>        {

>  	ret = 0;

>  	break;

>        }

> -  gomp_mutex_unlock (&thr->dev->openacc.async.lock);

>    return ret;

>  }

>  

> @@ -194,10 +231,8 @@

>  {

>    struct gomp_device_descr *dev = get_goacc_thread_device ();

>  

> -  gomp_mutex_lock (&dev->openacc.async.lock);

> -  for (goacc_aq_list l = dev->openacc.async.active; l; l = l->next)

> +  for (goacc_aq_list l = dev->openacc.async.aqdesc->active; l; l = l->next)

>      dev->openacc.async.synchronize_func (l->aq);

> -  gomp_mutex_unlock (&dev->openacc.async.lock);

>  }

>  

>  /* acc_async_wait_all is an OpenACC 1.0 compatibility name for acc_wait_all.  */

> @@ -221,14 +256,12 @@

>  

>    goacc_aq waiting_queue = lookup_goacc_asyncqueue (thr, true, async);

>  

> -  gomp_mutex_lock (&thr->dev->openacc.async.lock);

> -  for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)

> +  for (goacc_aq_list l = thr->dev->openacc.async.aqdesc->active; l; l = l->next)

>      {

>        thr->dev->openacc.async.synchronize_func (l->aq);

>        if (waiting_queue)

>  	thr->dev->openacc.async.serialize_func (l->aq, waiting_queue);

>      }

> -  gomp_mutex_unlock (&thr->dev->openacc.async.lock);

>  }

>  

>  int

> @@ -261,30 +294,27 @@

>  attribute_hidden void

>  goacc_init_asyncqueues (struct gomp_device_descr *devicep)

>  {

> -  gomp_mutex_init (&devicep->openacc.async.lock);

> -  devicep->openacc.async.nasyncqueue = 0;

> -  devicep->openacc.async.asyncqueue = NULL;

> -  devicep->openacc.async.active = NULL;

> +  devicep->openacc.async.aqdesc =

> +    gomp_malloc_cleared (sizeof (struct goacc_asyncqueue_desc));

>  }

>  

>  attribute_hidden bool

>  goacc_fini_asyncqueues (struct gomp_device_descr *devicep)

>  {

> +  struct goacc_asyncqueue_desc *aqdesc = devicep->openacc.async.aqdesc;

>    bool ret = true;

> -  if (devicep->openacc.async.nasyncqueue > 0)

> +

> +  if (aqdesc->nasyncqueue > 0)

>      {

>        goacc_aq_list next;

> -      for (goacc_aq_list l = devicep->openacc.async.active; l; l = next)

> +      for (goacc_aq_list l = aqdesc->active; l; l = next)

>  	{

>  	  ret &= devicep->openacc.async.destruct_func (l->aq);

>  	  next = l->next;

>  	  free (l);

>  	}

> -      free (devicep->openacc.async.asyncqueue);

> -      devicep->openacc.async.nasyncqueue = 0;

> -      devicep->openacc.async.asyncqueue = NULL;

> -      devicep->openacc.async.active = NULL;

> +      free (aqdesc->asyncqueue);

>      }

> -  gomp_mutex_destroy (&devicep->openacc.async.lock);

> +  free (aqdesc);

>    return ret;

>  }

> diff -ru trunk-orig/libgomp/oacc-cuda.c trunk-work/libgomp/oacc-cuda.c

> --- trunk-orig/libgomp/oacc-cuda.c	2018-12-18 18:09:08.601913507 +0800

> +++ trunk-work/libgomp/oacc-cuda.c	2018-12-21 23:34:10.024738834 +0800

> @@ -87,9 +87,7 @@

>    if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)

>      {

>        goacc_aq aq = get_goacc_asyncqueue (async);

> -      gomp_mutex_lock (&thr->dev->openacc.async.lock);

>        ret = thr->dev->openacc.cuda.set_stream_func (aq, stream);

> -      gomp_mutex_unlock (&thr->dev->openacc.async.lock);

>      }

>  

>    return ret;
Chung-Lin Tang Jan. 2, 2019, 12:46 p.m. | #23
Hi Thomas, Happy New Year,

On 2018/12/19 5:03 AM, Thomas Schwinge wrote:
>> +

>> +  if (!dev->openacc.async.asyncqueue[async])

>> +    {

>> +      dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func ();

>> +

>> +      if (!dev->openacc.async.asyncqueue[async])

>> +	{

>> +	  gomp_mutex_unlock (&dev->openacc.async.lock);

>> +	  gomp_fatal ("async %d creation failed", async);

>> +	}

> That will now always fail for host fallback, where

> "host_openacc_async_construct" just always does "return NULL".

> 

> Actually, if the device doesn't support asyncqueues, this whole function

> should turn into some kind of no-op, so that we don't again and again try

> to create a new one for every call to "lookup_goacc_asyncqueue".

> 

> I'm attaching one possible solution.  I think it's fine to assume that

> the majority of devices will support asyncqueues, and for those that

> don't, this is just a one-time overhead per async-argument.  So, no

> special handling required in "lookup_goacc_asyncqueue".


> --- a/libgomp/oacc-host.c

> +++ b/libgomp/oacc-host.c

> @@ -212,7 +212,8 @@ host_openacc_async_queue_callback (struct goacc_asyncqueue *aq

>   static struct goacc_asyncqueue *

>   host_openacc_async_construct (void)

>   {

> -  return NULL;

> +  /* We have to return non-NULL here, but it's OK to use a dummy.  */

> +  return (struct goacc_asyncqueue *) -1;

>   }


I'm not sure I understand the meaning of this? Is there any use to segfaulting somewhere else
due to this 0xffff... pointer?

A feature of a NULL asyncqueue should mean that it is simply synchronous, however this does somewhat
conflict with the case of async.construct_func() returning NULL on error...

Perhaps, again using an explicit success code as the return value (and return asyncqueue using
an out parameter)?

Chung-Lin
Chung-Lin Tang Jan. 5, 2019, 9:47 a.m. | #24
Hi Thomas,
this is the current version of the oacc-* parts of the Async Re-work patch.

I have reverted away from the earlier mentioned attempt of using lockless
techniques to manage the asyncqueues; it is really hard to do in a 100% correct
manner, unless we only use something like simple lists to manage them,
which probably makes lookup unacceptably slow.

For now, I have changed to use the conventional locking and success/fail return
codes for the synchronize/serialize hooks. I hope this is enough to pass
and get committed.

Thanks,
Chung-Lin
Index: oacc-async.c
===================================================================
--- oacc-async.c	(revision 267507)
+++ oacc-async.c	(working copy)
@@ -27,10 +27,99 @@
    <http://www.gnu.org/licenses/>.  */
 
 #include <assert.h>
+#include <string.h>
 #include "openacc.h"
 #include "libgomp.h"
 #include "oacc-int.h"
 
+static struct goacc_thread *
+get_goacc_thread (void)
+{
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  return thr;
+}
+
+static struct gomp_device_descr *
+get_goacc_thread_device (void)
+{
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  return thr->dev;
+}
+
+attribute_hidden struct goacc_asyncqueue *
+lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
+{
+  /* The special value acc_async_noval (-1) maps to the thread-specific
+     default async stream.  */
+  if (async == acc_async_noval)
+    async = thr->default_async;
+
+  if (async == acc_async_sync)
+    return NULL;
+
+  if (async < 0)
+    gomp_fatal ("bad async %d", async);
+
+  struct goacc_asyncqueue *ret_aq = NULL;
+  struct gomp_device_descr *dev = thr->dev;
+
+  gomp_mutex_lock (&dev->openacc.async.lock);
+
+  if (!create
+      && (async >= dev->openacc.async.nasyncqueue
+	  || !dev->openacc.async.asyncqueue[async]))
+    goto end;
+
+  if (async >= dev->openacc.async.nasyncqueue)
+    {
+      int diff = async + 1 - dev->openacc.async.nasyncqueue;
+      dev->openacc.async.asyncqueue
+	= gomp_realloc (dev->openacc.async.asyncqueue,
+			sizeof (goacc_aq) * (async + 1));
+      memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue,
+	      0, sizeof (goacc_aq) * diff);
+      dev->openacc.async.nasyncqueue = async + 1;
+    }
+
+  if (!dev->openacc.async.asyncqueue[async])
+    {
+      dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func ();
+
+      if (!dev->openacc.async.asyncqueue[async])
+	{
+	  gomp_mutex_unlock (&dev->openacc.async.lock);
+	  gomp_fatal ("async %d creation failed", async);
+	}
+      
+      /* Link new async queue into active list.  */
+      goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));
+      n->aq = dev->openacc.async.asyncqueue[async];
+      n->next = dev->openacc.async.active;
+      dev->openacc.async.active = n;
+    }
+
+  ret_aq = dev->openacc.async.asyncqueue[async];
+
+ end:
+  gomp_mutex_unlock (&dev->openacc.async.lock);
+  return ret_aq;
+}
+
+attribute_hidden struct goacc_asyncqueue *
+get_goacc_asyncqueue (int async)
+{
+  struct goacc_thread *thr = get_goacc_thread ();
+  return lookup_goacc_asyncqueue (thr, true, async);
+}
+
 int
 acc_async_test (int async)
 {
@@ -42,18 +131,25 @@ acc_async_test (int async)
   if (!thr || !thr->dev)
     gomp_fatal ("no device active");
 
-  return thr->dev->openacc.async_test_func (async);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
+  return thr->dev->openacc.async.test_func (aq);
 }
 
 int
 acc_async_test_all (void)
 {
-  struct goacc_thread *thr = goacc_thread ();
+  struct goacc_thread *thr = get_goacc_thread ();
 
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
-
-  return thr->dev->openacc.async_test_all_func ();
+  int ret = 1;
+  gomp_mutex_lock (&thr->dev->openacc.async.lock);
+  for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
+    if (!thr->dev->openacc.async.test_func (l->aq))
+      {
+	ret = 0;
+	break;
+      }
+  gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+  return ret;
 }
 
 void
@@ -62,12 +158,10 @@ acc_wait (int async)
   if (!async_valid_p (async))
     gomp_fatal ("invalid async argument: %d", async);
 
-  struct goacc_thread *thr = goacc_thread ();
+  struct goacc_thread *thr = get_goacc_thread ();
 
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
-
-  thr->dev->openacc.async_wait_func (async);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
+  thr->dev->openacc.async.synchronize_func (aq);
 }
 
 /* acc_async_wait is an OpenACC 1.0 compatibility name for acc_wait.  */
@@ -84,23 +178,34 @@ acc_async_wait (int async)
 void
 acc_wait_async (int async1, int async2)
 {
-  struct goacc_thread *thr = goacc_thread ();
+  struct goacc_thread *thr = get_goacc_thread ();
 
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
+  goacc_aq aq2 = lookup_goacc_asyncqueue (thr, true, async2);
+  goacc_aq aq1 = lookup_goacc_asyncqueue (thr, false, async1);
+  if (!aq1)
+    gomp_fatal ("invalid async 1");
+  if (aq1 == aq2)
+    gomp_fatal ("identical parameters");
 
-  thr->dev->openacc.async_wait_async_func (async1, async2);
+  if (!thr->dev->openacc.async.synchronize_func (aq1))
+    gomp_fatal ("wait on %d failed", async1);
+  if (!thr->dev->openacc.async.serialize_func (aq1, aq2))
+    gomp_fatal ("ordering of async ids %d and %d failed", async1, async2);
 }
 
 void
 acc_wait_all (void)
 {
-  struct goacc_thread *thr = goacc_thread ();
+  struct gomp_device_descr *dev = get_goacc_thread_device ();
 
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
+  bool ret = true;
+  gomp_mutex_lock (&dev->openacc.async.lock);
+  for (goacc_aq_list l = dev->openacc.async.active; l; l = l->next)
+    ret &= dev->openacc.async.synchronize_func (l->aq);
+  gomp_mutex_unlock (&dev->openacc.async.lock);
 
-  thr->dev->openacc.async_wait_all_func ();
+  if (!ret)
+    gomp_fatal ("wait all failed");
 }
 
 /* acc_async_wait_all is an OpenACC 1.0 compatibility name for acc_wait_all.  */
@@ -120,10 +225,80 @@ acc_wait_all_async (int async)
   if (!async_valid_p (async))
     gomp_fatal ("invalid async argument: %d", async);
 
-  struct goacc_thread *thr = goacc_thread ();
+  struct goacc_thread *thr = get_goacc_thread ();
 
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
+  goacc_aq waiting_queue = lookup_goacc_asyncqueue (thr, true, async);
 
-  thr->dev->openacc.async_wait_all_async_func (async);
+  bool ret = true;
+  gomp_mutex_lock (&thr->dev->openacc.async.lock);
+  for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
+    {
+      ret &= thr->dev->openacc.async.synchronize_func (l->aq);
+      if (waiting_queue)
+	ret &= thr->dev->openacc.async.serialize_func (l->aq, waiting_queue);
+    }
+  gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+
+  if (!ret)
+    gomp_fatal ("wait all async(%d) failed", async);
 }
+
+int
+acc_get_default_async (void)
+{
+  struct goacc_thread *thr = get_goacc_thread ();
+  return thr->default_async;
+}
+
+void
+acc_set_default_async (int async)
+{
+  if (async < acc_async_sync)
+    gomp_fatal ("invalid async argument: %d", async);
+
+  struct goacc_thread *thr = get_goacc_thread ();
+  thr->default_async = async;
+}
+
+attribute_hidden void
+goacc_async_free (struct gomp_device_descr *devicep,
+		  struct goacc_asyncqueue *aq, void *ptr)
+{
+  if (!aq)
+    free (ptr);
+  else
+    devicep->openacc.async.queue_callback_func (aq, free, ptr);
+}
+
+attribute_hidden void
+goacc_init_asyncqueues (struct gomp_device_descr *devicep)
+{
+  devicep->openacc.async.nasyncqueue = 0;
+  devicep->openacc.async.asyncqueue = NULL;
+  devicep->openacc.async.active = NULL;
+  gomp_mutex_init (&devicep->openacc.async.lock);
+}
+
+attribute_hidden bool
+goacc_fini_asyncqueues (struct gomp_device_descr *devicep)
+{
+  bool ret = true;
+  gomp_mutex_lock (&devicep->openacc.async.lock);
+  if (devicep->openacc.async.nasyncqueue > 0)
+    {
+      goacc_aq_list next;
+      for (goacc_aq_list l = devicep->openacc.async.active; l; l = next)
+	{
+	  ret &= devicep->openacc.async.destruct_func (l->aq);
+	  next = l->next;
+	  free (l);
+	}
+      free (devicep->openacc.async.asyncqueue);
+      devicep->openacc.async.nasyncqueue = 0;
+      devicep->openacc.async.asyncqueue = NULL;
+      devicep->openacc.async.active = NULL;
+    }
+  gomp_mutex_unlock (&devicep->openacc.async.lock);
+  gomp_mutex_destroy (&devicep->openacc.async.lock);
+  return ret;
+}
Index: oacc-cuda.c
===================================================================
--- oacc-cuda.c	(revision 267507)
+++ oacc-cuda.c	(working copy)
@@ -62,7 +62,11 @@ acc_get_cuda_stream (int async)
     return NULL;
 
   if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
-    return thr->dev->openacc.cuda.get_stream_func (async);
+    {
+      goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
+      if (aq)
+	return thr->dev->openacc.cuda.get_stream_func (aq);
+    }
  
   return NULL;
 }
@@ -79,8 +83,14 @@ acc_set_cuda_stream (int async, void *stream)
 
   thr = goacc_thread ();
 
+  int ret = -1;
   if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
-    return thr->dev->openacc.cuda.set_stream_func (async, stream);
+    {
+      goacc_aq aq = get_goacc_asyncqueue (async);
+      gomp_mutex_lock (&thr->dev->openacc.async.lock);
+      ret = thr->dev->openacc.cuda.set_stream_func (aq, stream);
+      gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+    }
 
-  return -1;
+  return ret;
 }
Index: oacc-host.c
===================================================================
--- oacc-host.c	(revision 267507)
+++ oacc-host.c	(working copy)
@@ -140,8 +140,7 @@ host_openacc_exec (void (*fn) (void *),
 		   size_t mapnum __attribute__ ((unused)),
 		   void **hostaddrs,
 		   void **devaddrs __attribute__ ((unused)),
-		   int async __attribute__ ((unused)),
-		   unsigned *dims __attribute ((unused)),
+		   unsigned *dims __attribute__ ((unused)),
 		   void *targ_mem_desc __attribute__ ((unused)))
 {
   fn (hostaddrs);
@@ -148,49 +147,83 @@ host_openacc_exec (void (*fn) (void *),
 }
 
 static void
-host_openacc_register_async_cleanup (void *targ_mem_desc __attribute__ ((unused)),
-				     int async __attribute__ ((unused)))
+host_openacc_async_exec (void (*fn) (void *),
+			 size_t mapnum __attribute__ ((unused)),
+			 void **hostaddrs,
+			 void **devaddrs __attribute__ ((unused)),
+			 unsigned *dims __attribute__ ((unused)),
+			 void *targ_mem_desc __attribute__ ((unused)),
+			 struct goacc_asyncqueue *aq __attribute__ ((unused)))
 {
+  fn (hostaddrs);
 }
 
 static int
-host_openacc_async_test (int async __attribute__ ((unused)))
+host_openacc_async_test (struct goacc_asyncqueue *aq __attribute__ ((unused)))
 {
   return 1;
 }
 
-static int
-host_openacc_async_test_all (void)
+static bool
+host_openacc_async_synchronize (struct goacc_asyncqueue *aq
+				__attribute__ ((unused)))
 {
-  return 1;
+  return true;
 }
 
-static void
-host_openacc_async_wait (int async __attribute__ ((unused)))
+static bool
+host_openacc_async_serialize (struct goacc_asyncqueue *aq1
+			      __attribute__ ((unused)),
+			      struct goacc_asyncqueue *aq2
+			      __attribute__ ((unused)))
 {
+  return true;
 }
 
-static void
-host_openacc_async_wait_async (int async1 __attribute__ ((unused)),
-			       int async2 __attribute__ ((unused)))
+static bool
+host_openacc_async_host2dev (int ord __attribute__ ((unused)),
+			     void *dst __attribute__ ((unused)),
+			     const void *src __attribute__ ((unused)),
+			     size_t n __attribute__ ((unused)),
+			     struct goacc_asyncqueue *aq
+			     __attribute__ ((unused)))
 {
+  return true;
 }
 
-static void
-host_openacc_async_wait_all (void)
+static bool
+host_openacc_async_dev2host (int ord __attribute__ ((unused)),
+			     void *dst __attribute__ ((unused)),
+			     const void *src __attribute__ ((unused)),
+			     size_t n __attribute__ ((unused)),
+			     struct goacc_asyncqueue *aq
+			     __attribute__ ((unused)))
 {
+  return true;
 }
 
 static void
-host_openacc_async_wait_all_async (int async __attribute__ ((unused)))
+host_openacc_async_queue_callback (struct goacc_asyncqueue *aq
+				   __attribute__ ((unused)),
+				   void (*callback_fn)(void *)
+				   __attribute__ ((unused)),
+				   void *userptr __attribute__ ((unused)))
 {
 }
 
-static void
-host_openacc_async_set_async (int async __attribute__ ((unused)))
+static struct goacc_asyncqueue *
+host_openacc_async_construct (void)
 {
+  return NULL;
 }
 
+static bool
+host_openacc_async_destruct (struct goacc_asyncqueue *aq
+			     __attribute__ ((unused)))
+{
+  return true;
+}
+
 static void *
 host_openacc_create_thread_data (int ord __attribute__ ((unused)))
 {
@@ -235,19 +268,21 @@ static struct gomp_device_descr host_dispatch =
 
       .exec_func = host_openacc_exec,
 
-      .register_async_cleanup_func = host_openacc_register_async_cleanup,
-
-      .async_test_func = host_openacc_async_test,
-      .async_test_all_func = host_openacc_async_test_all,
-      .async_wait_func = host_openacc_async_wait,
-      .async_wait_async_func = host_openacc_async_wait_async,
-      .async_wait_all_func = host_openacc_async_wait_all,
-      .async_wait_all_async_func = host_openacc_async_wait_all_async,
-      .async_set_async_func = host_openacc_async_set_async,
-
       .create_thread_data_func = host_openacc_create_thread_data,
       .destroy_thread_data_func = host_openacc_destroy_thread_data,
 
+      .async = {
+	.construct_func = host_openacc_async_construct,
+	.destruct_func = host_openacc_async_destruct,
+	.test_func = host_openacc_async_test,
+	.synchronize_func = host_openacc_async_synchronize,
+	.serialize_func = host_openacc_async_serialize,
+	.queue_callback_func = host_openacc_async_queue_callback,
+	.exec_func = host_openacc_async_exec,
+	.dev2host_func = host_openacc_async_dev2host,
+	.host2dev_func = host_openacc_async_host2dev,
+      },
+
       .cuda = {
 	.get_current_device_func = NULL,
 	.get_current_context_func = NULL,
Index: oacc-init.c
===================================================================
--- oacc-init.c	(revision 267507)
+++ oacc-init.c	(working copy)
@@ -309,7 +309,7 @@ acc_shutdown_1 (acc_device_t d)
       if (acc_dev->state == GOMP_DEVICE_INITIALIZED)
         {
 	  devices_active = true;
-	  ret &= acc_dev->fini_device_func (acc_dev->target_id);
+	  ret &= gomp_fini_device (acc_dev);
 	  acc_dev->state = GOMP_DEVICE_UNINITIALIZED;
 	}
       gomp_mutex_unlock (&acc_dev->lock);
@@ -426,8 +426,8 @@ goacc_attach_host_thread_to_device (int ord)
   
   thr->target_tls
     = acc_dev->openacc.create_thread_data_func (ord);
-  
-  acc_dev->openacc.async_set_async_func (acc_async_sync);
+
+  thr->default_async = acc_async_default;
 }
 
 /* OpenACC 2.0a (3.2.12, 3.2.13) doesn't specify whether the serialization of
Index: oacc-int.h
===================================================================
--- oacc-int.h	(revision 267507)
+++ oacc-int.h	(working copy)
@@ -73,6 +73,9 @@ struct goacc_thread
 
   /* Target-specific data (used by plugin).  */
   void *target_tls;
+
+  /* Default OpenACC async queue for current thread, exported to plugin.  */
+  int default_async;
 };
 
 #if defined HAVE_TLS || defined USE_EMUTLS
@@ -99,6 +102,14 @@ void goacc_restore_bind (void);
 void goacc_lazy_initialize (void);
 void goacc_host_init (void);
 
+void goacc_init_asyncqueues (struct gomp_device_descr *);
+bool goacc_fini_asyncqueues (struct gomp_device_descr *);
+void goacc_async_free (struct gomp_device_descr *, struct goacc_asyncqueue *,
+		       void *);
+struct goacc_asyncqueue *get_goacc_asyncqueue (int);
+struct goacc_asyncqueue *lookup_goacc_asyncqueue (struct goacc_thread *, bool,
+						  int);
+
 static inline bool
 async_valid_stream_id_p (int async)
 {
Index: oacc-mem.c
===================================================================
--- oacc-mem.c	(revision 267507)
+++ oacc-mem.c	(working copy)
@@ -172,18 +172,11 @@ memcpy_tofrom_device (bool from, void *d, void *h,
       return;
     }
 
-  if (async > acc_async_sync)
-    thr->dev->openacc.async_set_async_func (async);
-
-  bool ret = (from
-	      ? thr->dev->dev2host_func (thr->dev->target_id, h, d, s)
-	      : thr->dev->host2dev_func (thr->dev->target_id, d, h, s));
-
-  if (async > acc_async_sync)
-    thr->dev->openacc.async_set_async_func (acc_async_sync);
-
-  if (!ret)
-    gomp_fatal ("error in %s", libfnname);
+  goacc_aq aq = get_goacc_asyncqueue (async);
+  if (from)
+    gomp_copy_dev2host (thr->dev, aq, h, d, s);
+  else
+    gomp_copy_host2dev (thr->dev, aq, d, h, s, /* TODO: cbuf? */ NULL);
 }
 
 void
@@ -509,17 +502,13 @@ present_create_copy (unsigned f, void *h, size_t s
 
       gomp_mutex_unlock (&acc_dev->lock);
 
-      if (async > acc_async_sync)
-	acc_dev->openacc.async_set_async_func (async);
+      goacc_aq aq = get_goacc_asyncqueue (async);
 
-      tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s, &kinds, true,
-			   GOMP_MAP_VARS_OPENACC);
+      tgt = gomp_map_vars_async (acc_dev, aq, mapnum, &hostaddrs, NULL, &s,
+				 &kinds, true, GOMP_MAP_VARS_OPENACC);
       /* Initialize dynamic refcount.  */
       tgt->list[0].key->dynamic_refcount = 1;
 
-      if (async > acc_async_sync)
-	acc_dev->openacc.async_set_async_func (acc_async_sync);
-
       gomp_mutex_lock (&acc_dev->lock);
 
       d = tgt->to_free;
@@ -676,13 +665,9 @@ delete_copyout (unsigned f, void *h, size_t s, int
 
       if (f & FLAG_COPYOUT)
 	{
-	  if (async > acc_async_sync)
-	    acc_dev->openacc.async_set_async_func (async);
-	  acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
-	  if (async > acc_async_sync)
-	    acc_dev->openacc.async_set_async_func (acc_async_sync);
+	  goacc_aq aq = get_goacc_asyncqueue (async);
+	  gomp_copy_dev2host (acc_dev, aq, h, d, s);
 	}
-
       gomp_remove_var (acc_dev, n);
     }
 
@@ -765,17 +750,13 @@ update_dev_host (int is_dev, void *h, size_t s, in
   d = (void *) (n->tgt->tgt_start + n->tgt_offset
 		+ (uintptr_t) h - n->host_start);
 
-  if (async > acc_async_sync)
-    acc_dev->openacc.async_set_async_func (async);
+  goacc_aq aq = get_goacc_asyncqueue (async);
 
   if (is_dev)
-    acc_dev->host2dev_func (acc_dev->target_id, d, h, s);
+    gomp_copy_host2dev (acc_dev, aq, d, h, s, /* TODO: cbuf? */ NULL);
   else
-    acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
+    gomp_copy_dev2host (acc_dev, aq, h, d, s);
 
-  if (async > acc_async_sync)
-    acc_dev->openacc.async_set_async_func (acc_async_sync);
-
   gomp_mutex_unlock (&acc_dev->lock);
 }
 
@@ -805,7 +786,7 @@ acc_update_self_async (void *h, size_t s, int asyn
 
 void
 gomp_acc_insert_pointer (size_t mapnum, void **hostaddrs, size_t *sizes,
-			 void *kinds)
+			 void *kinds, int async)
 {
   struct target_mem_desc *tgt;
   struct goacc_thread *thr = goacc_thread ();
@@ -835,8 +816,9 @@ gomp_acc_insert_pointer (size_t mapnum, void **hos
     }
 
   gomp_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
-  tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs,
-		       NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
+  goacc_aq aq = get_goacc_asyncqueue (async);
+  tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs,
+			     NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
   gomp_debug (0, "  %s: mappings prepared\n", __FUNCTION__);
 
   /* Initialize dynamic refcount.  */
@@ -930,7 +912,10 @@ gomp_acc_remove_pointer (void *h, size_t s, bool f
       if (async < acc_async_noval)
 	gomp_unmap_vars (t, true);
       else
-	t->device_descr->openacc.register_async_cleanup_func (t, async);
+	{
+	  goacc_aq aq = get_goacc_asyncqueue (async);
+	  gomp_unmap_vars_async (t, true, aq);
+	}
     }
 
   gomp_mutex_unlock (&acc_dev->lock);
Index: oacc-parallel.c
===================================================================
--- oacc-parallel.c	(revision 267507)
+++ oacc-parallel.c	(working copy)
@@ -219,8 +219,6 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (voi
     }
   va_end (ap);
   
-  acc_dev->openacc.async_set_async_func (async);
-
   if (!(acc_dev->capabilities & GOMP_OFFLOAD_CAP_NATIVE_EXEC))
     {
       k.host_start = (uintptr_t) fn;
@@ -237,44 +235,29 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (voi
   else
     tgt_fn = (void (*)) fn;
 
-  tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs, NULL, sizes, kinds, true,
-		       GOMP_MAP_VARS_OPENACC);
+  goacc_aq aq = get_goacc_asyncqueue (async);
 
+  tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs, NULL, sizes, kinds,
+			     true, GOMP_MAP_VARS_OPENACC);
+  
   devaddrs = gomp_alloca (sizeof (void *) * mapnum);
   for (i = 0; i < mapnum; i++)
     devaddrs[i] = (void *) (tgt->list[i].key->tgt->tgt_start
 			    + tgt->list[i].key->tgt_offset
 			    + tgt->list[i].offset);
-
-  acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
-			      async, dims, tgt);
-
-  /* If running synchronously, unmap immediately.  */
-  bool copyfrom = true;
-  if (async_synchronous_p (async))
-    gomp_unmap_vars (tgt, true);
+  if (aq == NULL)
+    {
+      acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+				  dims, tgt);
+      /* If running synchronously, unmap immediately.  */
+      gomp_unmap_vars (tgt, true);
+    }
   else
     {
-      bool async_unmap = false;
-      for (size_t i = 0; i < tgt->list_count; i++)
-	{
-	  splay_tree_key k = tgt->list[i].key;
-	  if (k && k->refcount == 1)
-	    {
-	      async_unmap = true;
-	      break;
-	    }
-	}
-      if (async_unmap)
-	tgt->device_descr->openacc.register_async_cleanup_func (tgt, async);
-      else
-	{
-	  copyfrom = false;
-	  gomp_unmap_vars (tgt, copyfrom);
-	}
+      acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+					dims, tgt, aq);
+      gomp_unmap_vars_async (tgt, true, aq);
     }
-
-  acc_dev->openacc.async_set_async_func (acc_async_sync);
 }
 
 /* Legacy entry point, only provide host execution.  */
@@ -385,8 +368,6 @@ GOACC_enter_exit_data (int flags_m, size_t mapnum,
 	finalize = true;
     }
 
-  acc_dev->openacc.async_set_async_func (async);
-
   /* Determine if this is an "acc enter data".  */
   for (i = 0; i < mapnum; ++i)
     {
@@ -454,7 +435,7 @@ GOACC_enter_exit_data (int flags_m, size_t mapnum,
 	  else
 	    {
 	      gomp_acc_insert_pointer (pointer, &hostaddrs[i],
-				       &sizes[i], &kinds[i]);
+				       &sizes[i], &kinds[i], async);
 	      /* Increment 'i' by two because OpenACC requires fortran
 		 arrays to be contiguous, so each PSET is associated with
 		 one of MAP_FORCE_ALLOC/MAP_FORCE_PRESET/MAP_FORCE_TO, and
@@ -479,17 +460,17 @@ GOACC_enter_exit_data (int flags_m, size_t mapnum,
 		if (acc_is_present (hostaddrs[i], sizes[i]))
 		  {
 		    if (finalize)
-		      acc_delete_finalize (hostaddrs[i], sizes[i]);
+		      acc_delete_finalize_async (hostaddrs[i], sizes[i], async);
 		    else
-		      acc_delete (hostaddrs[i], sizes[i]);
+		      acc_delete_async (hostaddrs[i], sizes[i], async);
 		  }
 		break;
 	      case GOMP_MAP_FROM:
 	      case GOMP_MAP_FORCE_FROM:
 		if (finalize)
-		  acc_copyout_finalize (hostaddrs[i], sizes[i]);
+		  acc_copyout_finalize_async (hostaddrs[i], sizes[i], async);
 		else
-		  acc_copyout (hostaddrs[i], sizes[i]);
+		  acc_copyout_async (hostaddrs[i], sizes[i], async);
 		break;
 	      default:
 		gomp_fatal (">>>> GOACC_enter_exit_data UNHANDLED kind 0x%.2x",
@@ -507,8 +488,6 @@ GOACC_enter_exit_data (int flags_m, size_t mapnum,
 	    i += pointer - 1;
 	  }
       }
-
-  acc_dev->openacc.async_set_async_func (acc_async_sync);
 }
 
 static void
@@ -521,17 +500,22 @@ goacc_wait (int async, int num_waits, va_list *ap)
     {
       int qid = va_arg (*ap, int);
       
-      if (acc_async_test (qid))
+      goacc_aq aq = get_goacc_asyncqueue (qid);
+      if (acc_dev->openacc.async.test_func (aq))
 	continue;
-
       if (async == acc_async_sync)
-	acc_wait (qid);
+	acc_dev->openacc.async.synchronize_func (aq);
       else if (qid == async)
-	;/* If we're waiting on the same asynchronous queue as we're
-	    launching on, the queue itself will order work as
-	    required, so there's no need to wait explicitly.  */
+	/* If we're waiting on the same asynchronous queue as we're
+	   launching on, the queue itself will order work as
+	   required, so there's no need to wait explicitly.  */
+	;
       else
-	acc_dev->openacc.async_wait_async_func (qid, async);
+	{
+	  goacc_aq aq2 = get_goacc_asyncqueue (async);
+	  acc_dev->openacc.async.synchronize_func (aq);
+	  acc_dev->openacc.async.serialize_func (aq, aq2);
+	}
     }
 }
 
@@ -562,8 +546,6 @@ GOACC_update (int flags_m, size_t mapnum,
       va_end (ap);
     }
 
-  acc_dev->openacc.async_set_async_func (async);
-
   bool update_device = false;
   for (i = 0; i < mapnum; ++i)
     {
@@ -603,7 +585,7 @@ GOACC_update (int flags_m, size_t mapnum,
 	  /* Fallthru  */
 	case GOMP_MAP_FORCE_TO:
 	  update_device = true;
-	  acc_update_device (hostaddrs[i], sizes[i]);
+	  acc_update_device_async (hostaddrs[i], sizes[i], async);
 	  break;
 
 	case GOMP_MAP_FROM:
@@ -615,7 +597,7 @@ GOACC_update (int flags_m, size_t mapnum,
 	  /* Fallthru  */
 	case GOMP_MAP_FORCE_FROM:
 	  update_device = false;
-	  acc_update_self (hostaddrs[i], sizes[i]);
+	  acc_update_self_async (hostaddrs[i], sizes[i], async);
 	  break;
 
 	default:
@@ -623,8 +605,6 @@ GOACC_update (int flags_m, size_t mapnum,
 	  break;
 	}
     }
-
-  acc_dev->openacc.async_set_async_func (acc_async_sync);
 }
 
 void
Index: oacc-plugin.c
===================================================================
--- oacc-plugin.c	(revision 267507)
+++ oacc-plugin.c	(working copy)
@@ -30,15 +30,12 @@
 #include "oacc-plugin.h"
 #include "oacc-int.h"
 
+/* This plugin function is now obsolete.  */
 void
-GOMP_PLUGIN_async_unmap_vars (void *ptr, int async)
+GOMP_PLUGIN_async_unmap_vars (void *ptr __attribute__((unused)),
+			      int async __attribute__((unused)))
 {
-  struct target_mem_desc *tgt = ptr;
-  struct gomp_device_descr *devicep = tgt->device_descr;
-
-  devicep->openacc.async_set_async_func (async);
-  gomp_unmap_vars (tgt, true);
-  devicep->openacc.async_set_async_func (acc_async_sync);
+  gomp_fatal ("invalid plugin function");
 }
 
 /* Return the target-specific part of the TLS data for the current thread.  */
Thomas Schwinge Jan. 7, 2019, 2:14 p.m. | #25
Hi Chung-Lin!

On Wed, 2 Jan 2019 20:46:12 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> Hi Thomas, Happy New Year,


Thanks!  If I remember right, you still have a few weeks until "your" New
Year/Spring Festival, right?


> On 2018/12/19 5:03 AM, Thomas Schwinge wrote:

> >> +

> >> +  if (!dev->openacc.async.asyncqueue[async])

> >> +    {

> >> +      dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func ();

> >> +

> >> +      if (!dev->openacc.async.asyncqueue[async])

> >> +	{

> >> +	  gomp_mutex_unlock (&dev->openacc.async.lock);

> >> +	  gomp_fatal ("async %d creation failed", async);

> >> +	}

> > That will now always fail for host fallback, where

> > "host_openacc_async_construct" just always does "return NULL".

> > 

> > Actually, if the device doesn't support asyncqueues, this whole function

> > should turn into some kind of no-op, so that we don't again and again try

> > to create a new one for every call to "lookup_goacc_asyncqueue".

> > 

> > I'm attaching one possible solution.  I think it's fine to assume that

> > the majority of devices will support asyncqueues, and for those that

> > don't, this is just a one-time overhead per async-argument.  So, no

> > special handling required in "lookup_goacc_asyncqueue".

> 

> > --- a/libgomp/oacc-host.c

> > +++ b/libgomp/oacc-host.c

> > @@ -212,7 +212,8 @@ host_openacc_async_queue_callback (struct goacc_asyncqueue *aq

> >   static struct goacc_asyncqueue *

> >   host_openacc_async_construct (void)

> >   {

> > -  return NULL;

> > +  /* We have to return non-NULL here, but it's OK to use a dummy.  */

> > +  return (struct goacc_asyncqueue *) -1;

> >   }

> 

> I'm not sure I understand the meaning of this? Is there any use to segfaulting somewhere else

> due to this 0xffff... pointer?


There will be no such dereferencing (and thus no such segfault), as you
(quite nicely!) made this is an opaque data type to the generic code.
The concrete type is specific to, and only ever dereferenced inside each
plugin, and the "host plugin" never dereferences it, so returning minus
one here only serves as a non-NULL value/identifier to the generic code.

> A feature of a NULL asyncqueue should mean that it is simply synchronous


OK, then that should be documented, and as I mentioned above, the
"lookup" code be adjusted so that it doesn't again and again try to
create an asyncqueue when the "construct" function returns NULL.

> however this does somewhat

> conflict with the case of async.construct_func() returning NULL on error...

> 

> Perhaps, again using an explicit success code as the return value (and return asyncqueue using

> an out parameter)?


Sure, that's also fine.  I just did it as presented above, because of its
simplicity, and to avoid adjusting the "lookup" code, as mentioned above.


Grüße
 Thomas
Thomas Schwinge Jan. 7, 2019, 2:15 p.m. | #26
Hi Chung-Lin!

On Sat, 5 Jan 2019 17:47:10 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> this is the current version of the oacc-* parts of the Async Re-work patch.

> 

> I have reverted away from the earlier mentioned attempt of using lockless

> techniques to manage the asyncqueues; it is really hard to do in a 100% correct

> manner, unless we only use something like simple lists to manage them,

> which probably makes lookup unacceptably slow.

> 

> For now, I have changed to use the conventional locking and success/fail return

> codes for the synchronize/serialize hooks.


OK, thanks.


> I hope this is enough to pass

> and get committed.


Well, the "Properly handle wait clause with no arguments" changes still
need to be completed and go in first (to avoid introducing regressions),
and then I will have to see your whole set of changes that you intend to
commit: the bits you've incrementally posted still don't include several
of the changes I suggested and provided patches for (again, to avoid
introducing regressions).


But GCC now is in "regression and documentation fixes mode", so I fear
that it's too late now?


> --- oacc-async.c	(revision 267507)

> +++ oacc-async.c	(working copy)


> @@ -62,12 +158,10 @@ acc_wait (int async)


> +  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);

> +  thr->dev->openacc.async.synchronize_func (aq);


Have to check the result here?  Like you're doing here, for example:

>  acc_wait_async (int async1, int async2)

>  {


> +  if (!thr->dev->openacc.async.synchronize_func (aq1))

> +    gomp_fatal ("wait on %d failed", async1);

> +  if (!thr->dev->openacc.async.serialize_func (aq1, aq2))

> +    gomp_fatal ("ordering of async ids %d and %d failed", async1, async2);


> --- oacc-parallel.c	(revision 267507)

> +++ oacc-parallel.c	(working copy)


> @@ -521,17 +500,22 @@ goacc_wait (int async, int num_waits, va_list *ap)


>        if (async == acc_async_sync)

> -	acc_wait (qid);

> +	acc_dev->openacc.async.synchronize_func (aq);


Likewise?

>        else if (qid == async)

> -	;/* If we're waiting on the same asynchronous queue as we're

> -	    launching on, the queue itself will order work as

> -	    required, so there's no need to wait explicitly.  */

> +	/* If we're waiting on the same asynchronous queue as we're

> +	   launching on, the queue itself will order work as

> +	   required, so there's no need to wait explicitly.  */

> +	;

>        else

> -	acc_dev->openacc.async_wait_async_func (qid, async);

> +	{

> +	  goacc_aq aq2 = get_goacc_asyncqueue (async);

> +	  acc_dev->openacc.async.synchronize_func (aq);

> +	  acc_dev->openacc.async.serialize_func (aq, aq2);

> +	}


Likewise?


Also, I had to apply additional changes as attached, to make this build.


Grüße
 Thomas
From e4c187a4be46682a989165c38bc6a8d8324554b9 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>

Date: Mon, 7 Jan 2019 13:25:18 +0100
Subject: [PATCH] [WIP] into async re-work: complete
 GOMP_OFFLOAD_openacc_async_synchronize, GOMP_OFFLOAD_openacc_async_serialize
 interface changes

---
 libgomp/libgomp-plugin.h      |  4 ++--
 libgomp/plugin/plugin-nvptx.c | 29 +++++++++++++++++++++--------
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index e3c031a282a1..ce3ae125e208 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -115,8 +115,8 @@ extern void GOMP_OFFLOAD_openacc_destroy_thread_data (void *);
 extern struct goacc_asyncqueue *GOMP_OFFLOAD_openacc_async_construct (void);
 extern bool GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *);
 extern int GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *);
-extern void GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *);
-extern void GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *,
+extern bool GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *);
+extern bool GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *,
 						  struct goacc_asyncqueue *);
 extern void GOMP_OFFLOAD_openacc_async_queue_callback (struct goacc_asyncqueue *,
 						       void (*)(void *), void *);
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index f42cbf488a79..12f87ba7be4d 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1395,22 +1395,35 @@ GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq)
   return -1;
 }
 
-void
+bool
 GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *aq)
 {
-  //TODO Is this safe to call, or might this cause deadlock if something's locked?
-  CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);
+  CUresult r = CUDA_CALL_NOCHECK (cuStreamSynchronize, aq->cuda_stream);
+  return r == CUDA_SUCCESS;
 }
 
-void
+bool
 GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *aq1,
 				      struct goacc_asyncqueue *aq2)
 {
+  CUresult r;
   CUevent e;
-  //TODO Are these safe to call, or might this cause deadlock if something's locked?
-  CUDA_CALL_ASSERT (cuEventCreate, &e, CU_EVENT_DISABLE_TIMING);
-  CUDA_CALL_ASSERT (cuEventRecord, e, aq1->cuda_stream);
-  CUDA_CALL_ASSERT (cuStreamWaitEvent, aq2->cuda_stream, e, 0);
+  r = CUDA_CALL_NOCHECK (cuEventCreate, &e, CU_EVENT_DISABLE_TIMING);
+  if (r != CUDA_SUCCESS)
+    return false;
+  r = CUDA_CALL_NOCHECK (cuEventRecord, e, aq1->cuda_stream);
+  if (r != CUDA_SUCCESS)
+    {
+      //TODO "cuEventDestroy"?
+      return false;
+    }
+  r = CUDA_CALL_NOCHECK (cuStreamWaitEvent, aq2->cuda_stream, e, 0);
+  if (r != CUDA_SUCCESS)
+    {
+      //TODO "cuEventDestroy"?
+      return false;
+    }
+  return true;
 }
 
 static void
-- 
2.17.1
Chung-Lin Tang Jan. 8, 2019, 2:04 p.m. | #27
On 2019/1/7 10:15 AM, Thomas Schwinge wrote:
> Well, the "Properly handle wait clause with no arguments" changes still

> need to be completed and go in first (to avoid introducing regressions),

> and then I will have to see your whole set of changes that you intend to

> commit: the bits you've incrementally posted still don't include several

> of the changes I suggested and provided patches for (again, to avoid

> introducing regressions).


I'll look at that state again.

> But GCC now is in "regression and documentation fixes mode", so I fear

> that it's too late now?


Maybe...I don't know.

>> --- oacc-async.c	(revision 267507)

>> +++ oacc-async.c	(working copy)

>> @@ -62,12 +158,10 @@ acc_wait (int async)

>> +  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);

>> +  thr->dev->openacc.async.synchronize_func (aq);

> Have to check the result here?  Like you're doing here, for example:

> 

>>   acc_wait_async (int async1, int async2)

>>   {

>> +  if (!thr->dev->openacc.async.synchronize_func (aq1))

>> +    gomp_fatal ("wait on %d failed", async1);

>> +  if (!thr->dev->openacc.async.serialize_func (aq1, aq2))

>> +    gomp_fatal ("ordering of async ids %d and %d failed", async1, async2);

>> --- oacc-parallel.c	(revision 267507)

>> +++ oacc-parallel.c	(working copy)

>> @@ -521,17 +500,22 @@ goacc_wait (int async, int num_waits, va_list *ap)

>>         if (async == acc_async_sync)

>> -	acc_wait (qid);

>> +	acc_dev->openacc.async.synchronize_func (aq);

> Likewise?


Oh okay, I forgot about those sites.


> Also, I had to apply additional changes as attached, to make this build.

>


Oh I had those changes, but forgot to update the other patches. I'll resend those later too.

Thanks,
Chung-Lin

Patch

diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c
index a4e1863..68aaf19 100644
--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c
@@ -27,10 +27,87 @@ 
    <http://www.gnu.org/licenses/>.  */
 
 #include <assert.h>
+#include <string.h>
 #include "openacc.h"
 #include "libgomp.h"
 #include "oacc-int.h"
 
+static struct goacc_thread *
+get_goacc_thread (void)
+{
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  return thr;
+}
+
+static struct gomp_device_descr *
+get_goacc_thread_device (void)
+{
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  return thr->dev;
+}
+
+attribute_hidden struct goacc_asyncqueue *
+lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
+{
+  /* The special value acc_async_noval (-1) maps to the thread-specific
+     default async stream.  */
+  if (async == acc_async_noval)
+    async = thr->default_async;
+
+  if (async == acc_async_sync)
+    return NULL;
+
+  if (async < 0)
+    gomp_fatal ("bad async %d", async);
+
+  struct gomp_device_descr *dev = thr->dev;
+
+  if (!create
+      && (async >= dev->openacc.async.nasyncqueue
+	  || !dev->openacc.async.asyncqueue[async]))
+    return NULL;
+
+  gomp_mutex_lock (&dev->openacc.async.lock);
+  if (async >= dev->openacc.async.nasyncqueue)
+    {
+      int diff = async + 1 - dev->openacc.async.nasyncqueue;
+      dev->openacc.async.asyncqueue
+	= gomp_realloc (dev->openacc.async.asyncqueue,
+			sizeof (goacc_aq) * (async + 1));
+      memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue,
+	      0, sizeof (goacc_aq) * diff);
+      dev->openacc.async.nasyncqueue = async + 1;
+    }
+
+  if (!dev->openacc.async.asyncqueue[async])
+    {
+      dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func ();
+
+      /* Link new async queue into active list.  */
+      goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));
+      n->aq = dev->openacc.async.asyncqueue[async];
+      n->next = dev->openacc.async.active;
+      dev->openacc.async.active = n;
+    }
+  gomp_mutex_unlock (&dev->openacc.async.lock);
+  return dev->openacc.async.asyncqueue[async];
+}
+
+attribute_hidden struct goacc_asyncqueue *
+get_goacc_asyncqueue (int async)
+{
+  struct goacc_thread *thr = get_goacc_thread ();
+  return lookup_goacc_asyncqueue (thr, true, async);
+}
+
 int
 acc_async_test (int async)
 {
@@ -42,18 +119,25 @@  acc_async_test (int async)
   if (!thr || !thr->dev)
     gomp_fatal ("no device active");
 
-  return thr->dev->openacc.async_test_func (async);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
+  return thr->dev->openacc.async.test_func (aq);
 }
 
 int
 acc_async_test_all (void)
 {
-  struct goacc_thread *thr = goacc_thread ();
-
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
+  struct goacc_thread *thr = get_goacc_thread ();
 
-  return thr->dev->openacc.async_test_all_func ();
+  int ret = 1;
+  gomp_mutex_lock (&thr->dev->openacc.async.lock);
+  for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
+    if (!thr->dev->openacc.async.test_func (l->aq))
+      {
+	ret = 0;
+	break;
+      }
+  gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+  return ret;
 }
 
 void
@@ -62,12 +146,10 @@  acc_wait (int async)
   if (!async_valid_p (async))
     gomp_fatal ("invalid async argument: %d", async);
 
-  struct goacc_thread *thr = goacc_thread ();
-
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
+  struct goacc_thread *thr = get_goacc_thread ();
 
-  thr->dev->openacc.async_wait_func (async);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
+  thr->dev->openacc.async.synchronize_func (aq);
 }
 
 /* acc_async_wait is an OpenACC 1.0 compatibility name for acc_wait.  */
@@ -84,23 +166,28 @@  acc_async_wait (int async)
 void
 acc_wait_async (int async1, int async2)
 {
-  struct goacc_thread *thr = goacc_thread ();
+  struct goacc_thread *thr = get_goacc_thread ();
 
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
+  goacc_aq aq2 = lookup_goacc_asyncqueue (thr, true, async2);
+  goacc_aq aq1 = lookup_goacc_asyncqueue (thr, false, async1);
+  if (!aq1)
+    gomp_fatal ("invalid async 1");
+  if (aq1 == aq2)
+    gomp_fatal ("identical parameters");
 
-  thr->dev->openacc.async_wait_async_func (async1, async2);
+  thr->dev->openacc.async.synchronize_func (aq1);
+  thr->dev->openacc.async.serialize_func (aq1, aq2);
 }
 
 void
 acc_wait_all (void)
 {
-  struct goacc_thread *thr = goacc_thread ();
-
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
+  struct gomp_device_descr *dev = get_goacc_thread_device ();
 
-  thr->dev->openacc.async_wait_all_func ();
+  gomp_mutex_lock (&dev->openacc.async.lock);
+  for (goacc_aq_list l = dev->openacc.async.active; l; l = l->next)
+    dev->openacc.async.synchronize_func (l->aq);
+  gomp_mutex_unlock (&dev->openacc.async.lock);
 }
 
 /* acc_async_wait_all is an OpenACC 1.0 compatibility name for acc_wait_all.  */
@@ -120,10 +207,99 @@  acc_wait_all_async (int async)
   if (!async_valid_p (async))
     gomp_fatal ("invalid async argument: %d", async);
 
-  struct goacc_thread *thr = goacc_thread ();
+  struct goacc_thread *thr = get_goacc_thread ();
 
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
+  goacc_aq waiting_queue = lookup_goacc_asyncqueue (thr, true, async);
 
-  thr->dev->openacc.async_wait_all_async_func (async);
+  gomp_mutex_lock (&thr->dev->openacc.async.lock);
+  for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
+    {
+      thr->dev->openacc.async.synchronize_func (l->aq);
+      if (waiting_queue)
+	thr->dev->openacc.async.serialize_func (l->aq, waiting_queue);
+    }
+  gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+}
+
+int
+acc_get_default_async (void)
+{
+  struct goacc_thread *thr = get_goacc_thread ();
+  return thr->default_async;
+}
+
+void
+acc_set_default_async (int async)
+{
+  if (async < acc_async_sync)
+    gomp_fatal ("invalid async argument: %d", async);
+
+  struct goacc_thread *thr = get_goacc_thread ();
+  thr->default_async = async;
+}
+
+static void
+goacc_async_unmap_tgt (void *ptr)
+{
+  struct target_mem_desc *tgt = (struct target_mem_desc *) ptr;
+
+  if (tgt->refcount > 1)
+    tgt->refcount--;
+  else
+    gomp_unmap_tgt (tgt);
+}
+
+attribute_hidden void
+goacc_async_copyout_unmap_vars (struct target_mem_desc *tgt,
+				struct goacc_asyncqueue *aq)
+{
+  struct gomp_device_descr *devicep = tgt->device_descr;
+
+  /* Increment reference to delay freeing of device memory until callback
+     has triggered.  */
+  tgt->refcount++;
+  gomp_unmap_vars_async (tgt, true, aq);
+  devicep->openacc.async.queue_callback_func (aq, goacc_async_unmap_tgt,
+					      (void *) tgt);
+}
+
+attribute_hidden void
+goacc_async_free (struct gomp_device_descr *devicep,
+		  struct goacc_asyncqueue *aq, void *ptr)
+{
+  if (!aq)
+    free (ptr);
+  else
+    devicep->openacc.async.queue_callback_func (aq, free, ptr);
+}
+
+attribute_hidden void
+goacc_init_asyncqueues (struct gomp_device_descr *devicep)
+{
+  gomp_mutex_init (&devicep->openacc.async.lock);
+  devicep->openacc.async.nasyncqueue = 0;
+  devicep->openacc.async.asyncqueue = NULL;
+  devicep->openacc.async.active = NULL;
+}
+
+attribute_hidden bool
+goacc_fini_asyncqueues (struct gomp_device_descr *devicep)
+{
+  bool ret = true;
+  if (devicep->openacc.async.nasyncqueue > 0)
+    {
+      goacc_aq_list next;
+      for (goacc_aq_list l = devicep->openacc.async.active; l; l = next)
+	{
+	  ret &= devicep->openacc.async.destruct_func (l->aq);
+	  next = l->next;
+	  free (l);
+	}
+      free (devicep->openacc.async.asyncqueue);
+      devicep->openacc.async.nasyncqueue = 0;
+      devicep->openacc.async.asyncqueue = NULL;
+      devicep->openacc.async.active = NULL;
+    }
+  gomp_mutex_destroy (&devicep->openacc.async.lock);
+  return ret;
 }
diff --git a/libgomp/oacc-cuda.c b/libgomp/oacc-cuda.c
index 20774c1..0a842ea 100644
--- a/libgomp/oacc-cuda.c
+++ b/libgomp/oacc-cuda.c
@@ -62,7 +62,11 @@  acc_get_cuda_stream (int async)
     return NULL;
 
   if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
-    return thr->dev->openacc.cuda.get_stream_func (async);
+    {
+      goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
+      if (aq)
+	return thr->dev->openacc.cuda.get_stream_func (aq);
+    }
  
   return NULL;
 }
@@ -79,8 +83,14 @@  acc_set_cuda_stream (int async, void *stream)
 
   thr = goacc_thread ();
 
+  int ret = -1;
   if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
-    return thr->dev->openacc.cuda.set_stream_func (async, stream);
-
-  return -1;
+    {
+      goacc_aq aq = get_goacc_asyncqueue (async);
+      gomp_mutex_lock (&thr->dev->openacc.async.lock);
+      ret = thr->dev->openacc.cuda.set_stream_func (aq, stream);
+      gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+    }
+
+  return ret;
 }
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 2de3c37..53658c8 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -140,55 +140,86 @@  host_openacc_exec (void (*fn) (void *),
 		   size_t mapnum __attribute__ ((unused)),
 		   void **hostaddrs,
 		   void **devaddrs __attribute__ ((unused)),
-		   int async __attribute__ ((unused)),
-		   unsigned *dims __attribute ((unused)),
+		   unsigned *dims __attribute__ ((unused)),
 		   void *targ_mem_desc __attribute__ ((unused)))
 {
   fn (hostaddrs);
 }
 
 static void
-host_openacc_register_async_cleanup (void *targ_mem_desc __attribute__ ((unused)),
-				     int async __attribute__ ((unused)))
+host_openacc_async_exec (void (*fn) (void *),
+			 size_t mapnum __attribute__ ((unused)),
+			 void **hostaddrs,
+			 void **devaddrs __attribute__ ((unused)),
+			 unsigned *dims __attribute__ ((unused)),
+			 void *targ_mem_desc __attribute__ ((unused)),
+			 struct goacc_asyncqueue *aq __attribute__ ((unused)))
 {
+  fn (hostaddrs);
 }
 
 static int
-host_openacc_async_test (int async __attribute__ ((unused)))
+host_openacc_async_test (struct goacc_asyncqueue *aq __attribute__ ((unused)))
 {
   return 1;
 }
 
-static int
-host_openacc_async_test_all (void)
+static void
+host_openacc_async_synchronize (struct goacc_asyncqueue *aq
+				__attribute__ ((unused)))
 {
-  return 1;
 }
 
 static void
-host_openacc_async_wait (int async __attribute__ ((unused)))
+host_openacc_async_serialize (struct goacc_asyncqueue *aq1
+			      __attribute__ ((unused)),
+			      struct goacc_asyncqueue *aq2
+			      __attribute__ ((unused)))
 {
 }
 
-static void
-host_openacc_async_wait_async (int async1 __attribute__ ((unused)),
-			       int async2 __attribute__ ((unused)))
+static bool
+host_openacc_async_host2dev (int ord __attribute__ ((unused)),
+			     void *dst __attribute__ ((unused)),
+			     const void *src __attribute__ ((unused)),
+			     size_t n __attribute__ ((unused)),
+			     struct goacc_asyncqueue *aq
+			     __attribute__ ((unused)))
 {
+  return true;
 }
 
-static void
-host_openacc_async_wait_all (void)
+static bool
+host_openacc_async_dev2host (int ord __attribute__ ((unused)),
+			     void *dst __attribute__ ((unused)),
+			     const void *src __attribute__ ((unused)),
+			     size_t n __attribute__ ((unused)),
+			     struct goacc_asyncqueue *aq
+			     __attribute__ ((unused)))
 {
+  return true;
 }
 
 static void
-host_openacc_async_wait_all_async (int async __attribute__ ((unused)))
+host_openacc_async_queue_callback (struct goacc_asyncqueue *aq
+				   __attribute__ ((unused)),
+				   void (*callback_fn)(void *)
+				   __attribute__ ((unused)),
+				   void *userptr __attribute__ ((unused)))
 {
 }
 
-static void
-host_openacc_async_set_async (int async __attribute__ ((unused)))
+static struct goacc_asyncqueue *
+host_openacc_async_construct (void)
 {
+  return NULL;
+}
+
+static bool
+host_openacc_async_destruct (struct goacc_asyncqueue *aq
+			     __attribute__ ((unused)))
+{
+  return true;
 }
 
 static void *
@@ -235,15 +266,17 @@  static struct gomp_device_descr host_dispatch =
 
       .exec_func = host_openacc_exec,
 
-      .register_async_cleanup_func = host_openacc_register_async_cleanup,
-
-      .async_test_func = host_openacc_async_test,
-      .async_test_all_func = host_openacc_async_test_all,
-      .async_wait_func = host_openacc_async_wait,
-      .async_wait_async_func = host_openacc_async_wait_async,
-      .async_wait_all_func = host_openacc_async_wait_all,
-      .async_wait_all_async_func = host_openacc_async_wait_all_async,
-      .async_set_async_func = host_openacc_async_set_async,
+      .async = {
+	.construct_func = host_openacc_async_construct,
+	.destruct_func = host_openacc_async_destruct,
+	.test_func = host_openacc_async_test,
+	.synchronize_func = host_openacc_async_synchronize,
+	.serialize_func = host_openacc_async_serialize,
+	.queue_callback_func = host_openacc_async_queue_callback,
+	.exec_func = host_openacc_async_exec,
+	.dev2host_func = host_openacc_async_dev2host,
+	.host2dev_func = host_openacc_async_host2dev,
+      },
 
       .create_thread_data_func = host_openacc_create_thread_data,
       .destroy_thread_data_func = host_openacc_destroy_thread_data,
diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c
index 8db24b1..2c2f91c 100644
--- a/libgomp/oacc-init.c
+++ b/libgomp/oacc-init.c
@@ -309,7 +309,7 @@  acc_shutdown_1 (acc_device_t d)
       if (acc_dev->state == GOMP_DEVICE_INITIALIZED)
         {
 	  devices_active = true;
-	  ret &= acc_dev->fini_device_func (acc_dev->target_id);
+	  ret &= gomp_fini_device (acc_dev);
 	  acc_dev->state = GOMP_DEVICE_UNINITIALIZED;
 	}
       gomp_mutex_unlock (&acc_dev->lock);
@@ -426,8 +426,8 @@  goacc_attach_host_thread_to_device (int ord)
   
   thr->target_tls
     = acc_dev->openacc.create_thread_data_func (ord);
-  
-  acc_dev->openacc.async_set_async_func (acc_async_sync);
+
+  thr->default_async = acc_async_default;
 }
 
 /* OpenACC 2.0a (3.2.12, 3.2.13) doesn't specify whether the serialization of
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index 72414b7..07a2524 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -172,18 +172,11 @@  memcpy_tofrom_device (bool from, void *d, void *h, size_t s, int async,
       return;
     }
 
-  if (async > acc_async_sync)
-    thr->dev->openacc.async_set_async_func (async);
-
-  bool ret = (from
-	      ? thr->dev->dev2host_func (thr->dev->target_id, h, d, s)
-	      : thr->dev->host2dev_func (thr->dev->target_id, d, h, s));
-
-  if (async > acc_async_sync)
-    thr->dev->openacc.async_set_async_func (acc_async_sync);
-
-  if (!ret)
-    gomp_fatal ("error in %s", libfnname);
+  goacc_aq aq = get_goacc_asyncqueue (async);
+  if (from)
+    gomp_copy_dev2host (thr->dev, aq, h, d, s);
+  else
+    gomp_copy_host2dev (thr->dev, aq, d, h, s, /* TODO: cbuf? */ NULL);
 }
 
 void
@@ -509,17 +502,13 @@  present_create_copy (unsigned f, void *h, size_t s, int async)
 
       gomp_mutex_unlock (&acc_dev->lock);
 
-      if (async > acc_async_sync)
-	acc_dev->openacc.async_set_async_func (async);
+      goacc_aq aq = get_goacc_asyncqueue (async);
 
-      tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s, &kinds, true,
-			   GOMP_MAP_VARS_OPENACC);
+      tgt = gomp_map_vars_async (acc_dev, aq, mapnum, &hostaddrs, NULL, &s,
+				 &kinds, true, GOMP_MAP_VARS_OPENACC);
       /* Initialize dynamic refcount.  */
       tgt->list[0].key->dynamic_refcount = 1;
 
-      if (async > acc_async_sync)
-	acc_dev->openacc.async_set_async_func (acc_async_sync);
-
       gomp_mutex_lock (&acc_dev->lock);
 
       d = tgt->to_free;
@@ -673,13 +662,9 @@  delete_copyout (unsigned f, void *h, size_t s, int async, const char *libfnname)
 
       if (f & FLAG_COPYOUT)
 	{
-	  if (async > acc_async_sync)
-	    acc_dev->openacc.async_set_async_func (async);
-	  acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
-	  if (async > acc_async_sync)
-	    acc_dev->openacc.async_set_async_func (acc_async_sync);
+	  goacc_aq aq = get_goacc_asyncqueue (async);
+	  gomp_copy_dev2host (acc_dev, aq, h, d, s);
 	}
-
       gomp_remove_var (acc_dev, n);
     }
 
@@ -762,16 +747,12 @@  update_dev_host (int is_dev, void *h, size_t s, int async)
   d = (void *) (n->tgt->tgt_start + n->tgt_offset
 		+ (uintptr_t) h - n->host_start);
 
-  if (async > acc_async_sync)
-    acc_dev->openacc.async_set_async_func (async);
+  goacc_aq aq = get_goacc_asyncqueue (async);
 
   if (is_dev)
-    acc_dev->host2dev_func (acc_dev->target_id, d, h, s);
+    gomp_copy_host2dev (acc_dev, aq, d, h, s, /* TODO: cbuf? */ NULL);
   else
-    acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
-
-  if (async > acc_async_sync)
-    acc_dev->openacc.async_set_async_func (acc_async_sync);
+    gomp_copy_dev2host (acc_dev, aq, h, d, s);
 
   gomp_mutex_unlock (&acc_dev->lock);
 }
@@ -802,7 +783,7 @@  acc_update_self_async (void *h, size_t s, int async)
 
 void
 gomp_acc_insert_pointer (size_t mapnum, void **hostaddrs, size_t *sizes,
-			 void *kinds)
+			 void *kinds, int async)
 {
   struct target_mem_desc *tgt;
   struct goacc_thread *thr = goacc_thread ();
@@ -832,8 +813,9 @@  gomp_acc_insert_pointer (size_t mapnum, void **hostaddrs, size_t *sizes,
     }
 
   gomp_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
-  tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs,
-		       NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
+  goacc_aq aq = get_goacc_asyncqueue (async);
+  tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs,
+			     NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
   gomp_debug (0, "  %s: mappings prepared\n", __FUNCTION__);
 
   /* Initialize dynamic refcount.  */
@@ -927,7 +909,10 @@  gomp_acc_remove_pointer (void *h, size_t s, bool force_copyfrom, int async,
       if (async < acc_async_noval)
 	gomp_unmap_vars (t, true);
       else
-	t->device_descr->openacc.register_async_cleanup_func (t, async);
+	{
+	  goacc_aq aq = get_goacc_asyncqueue (async);
+	  goacc_async_copyout_unmap_vars (t, aq);
+	}
     }
 
   gomp_mutex_unlock (&acc_dev->lock);
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index bfe8876..07d0338 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -212,8 +212,6 @@  GOACC_parallel_keyed (int device, void (*fn) (void *),
     }
   va_end (ap);
   
-  acc_dev->openacc.async_set_async_func (async);
-
   if (!(acc_dev->capabilities & GOMP_OFFLOAD_CAP_NATIVE_EXEC))
     {
       k.host_start = (uintptr_t) fn;
@@ -230,43 +228,28 @@  GOACC_parallel_keyed (int device, void (*fn) (void *),
   else
     tgt_fn = (void (*)) fn;
 
-  tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs, NULL, sizes, kinds, true,
-		       GOMP_MAP_VARS_OPENACC);
+  goacc_aq aq = get_goacc_asyncqueue (async);
+
+  tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs, NULL, sizes, kinds,
+			     true, GOMP_MAP_VARS_OPENACC);
 
   devaddrs = gomp_alloca (sizeof (void *) * mapnum);
   for (i = 0; i < mapnum; i++)
     devaddrs[i] = (void *) (tgt->list[i].key->tgt->tgt_start
 			    + tgt->list[i].key->tgt_offset);
-
-  acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
-			      async, dims, tgt);
-
-  /* If running synchronously, unmap immediately.  */
-  bool copyfrom = true;
-  if (async_synchronous_p (async))
-    gomp_unmap_vars (tgt, true);
+  if (aq == NULL)
+    {
+      acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+				  dims, tgt);
+      /* If running synchronously, unmap immediately.  */
+      gomp_unmap_vars (tgt, true);
+    }
   else
     {
-      bool async_unmap = false;
-      for (size_t i = 0; i < tgt->list_count; i++)
-	{
-	  splay_tree_key k = tgt->list[i].key;
-	  if (k && k->refcount == 1)
-	    {
-	      async_unmap = true;
-	      break;
-	    }
-	}
-      if (async_unmap)
-	tgt->device_descr->openacc.register_async_cleanup_func (tgt, async);
-      else
-	{
-	  copyfrom = false;
-	  gomp_unmap_vars (tgt, copyfrom);
-	}
+      acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+					dims, tgt, aq);
+      goacc_async_copyout_unmap_vars (tgt, aq);
     }
-
-  acc_dev->openacc.async_set_async_func (acc_async_sync);
 }
 
 /* Legacy entry point, only provide host execution.  */
@@ -377,8 +360,6 @@  GOACC_enter_exit_data (int device, size_t mapnum,
 	finalize = true;
     }
 
-  acc_dev->openacc.async_set_async_func (async);
-
   /* Determine if this is an "acc enter data".  */
   for (i = 0; i < mapnum; ++i)
     {
@@ -450,7 +431,7 @@  GOACC_enter_exit_data (int device, size_t mapnum,
 	  else
 	    {
 	      gomp_acc_insert_pointer (pointer, &hostaddrs[i],
-				       &sizes[i], &kinds[i]);
+				       &sizes[i], &kinds[i], async);
 	      /* Increment 'i' by two because OpenACC requires fortran
 		 arrays to be contiguous, so each PSET is associated with
 		 one of MAP_FORCE_ALLOC/MAP_FORCE_PRESET/MAP_FORCE_TO, and
@@ -475,17 +456,17 @@  GOACC_enter_exit_data (int device, size_t mapnum,
 		if (acc_is_present (hostaddrs[i], sizes[i]))
 		  {
 		    if (finalize)
-		      acc_delete_finalize (hostaddrs[i], sizes[i]);
+		      acc_delete_finalize_async (hostaddrs[i], sizes[i], async);
 		    else
-		      acc_delete (hostaddrs[i], sizes[i]);
+		      acc_delete_async (hostaddrs[i], sizes[i], async);
 		  }
 		break;
 	      case GOMP_MAP_FROM:
 	      case GOMP_MAP_FORCE_FROM:
 		if (finalize)
-		  acc_copyout_finalize (hostaddrs[i], sizes[i]);
+		  acc_copyout_finalize_async (hostaddrs[i], sizes[i], async);
 		else
-		  acc_copyout (hostaddrs[i], sizes[i]);
+		  acc_copyout_async (hostaddrs[i], sizes[i], async);
 		break;
 	      default:
 		gomp_fatal (">>>> GOACC_enter_exit_data UNHANDLED kind 0x%.2x",
@@ -503,8 +484,6 @@  GOACC_enter_exit_data (int device, size_t mapnum,
 	    i += pointer - 1;
 	  }
       }
-
-  acc_dev->openacc.async_set_async_func (acc_async_sync);
 }
 
 static void
@@ -517,17 +496,22 @@  goacc_wait (int async, int num_waits, va_list *ap)
     {
       int qid = va_arg (*ap, int);
       
-      if (acc_async_test (qid))
+      goacc_aq aq = get_goacc_asyncqueue (qid);
+      if (acc_dev->openacc.async.test_func (aq))
 	continue;
-
       if (async == acc_async_sync)
-	acc_wait (qid);
+	acc_dev->openacc.async.synchronize_func (aq);
       else if (qid == async)
-	;/* If we're waiting on the same asynchronous queue as we're
-	    launching on, the queue itself will order work as
-	    required, so there's no need to wait explicitly.  */
+	/* If we're waiting on the same asynchronous queue as we're
+	   launching on, the queue itself will order work as
+	   required, so there's no need to wait explicitly.  */
+	;
       else
-	acc_dev->openacc.async_wait_async_func (qid, async);
+	{
+	  goacc_aq aq2 = get_goacc_asyncqueue (async);
+	  acc_dev->openacc.async.synchronize_func (aq);
+	  acc_dev->openacc.async.serialize_func (aq, aq2);
+	}
     }
 }
 
@@ -559,8 +543,6 @@  GOACC_update (int device, size_t mapnum,
   else if (num_waits == acc_async_noval)
     acc_wait_all_async (async);
 
-  acc_dev->openacc.async_set_async_func (async);
-
   bool update_device = false;
   for (i = 0; i < mapnum; ++i)
     {
@@ -600,7 +582,7 @@  GOACC_update (int device, size_t mapnum,
 	  /* Fallthru  */
 	case GOMP_MAP_FORCE_TO:
 	  update_device = true;
-	  acc_update_device (hostaddrs[i], sizes[i]);
+	  acc_update_device_async (hostaddrs[i], sizes[i], async);
 	  break;
 
 	case GOMP_MAP_FROM:
@@ -612,7 +594,7 @@  GOACC_update (int device, size_t mapnum,
 	  /* Fallthru  */
 	case GOMP_MAP_FORCE_FROM:
 	  update_device = false;
-	  acc_update_self (hostaddrs[i], sizes[i]);
+	  acc_update_self_async (hostaddrs[i], sizes[i], async);
 	  break;
 
 	default:
@@ -620,8 +602,6 @@  GOACC_update (int device, size_t mapnum,
 	  break;
 	}
     }
-
-  acc_dev->openacc.async_set_async_func (acc_async_sync);
 }
 
 void
@@ -638,7 +618,7 @@  GOACC_wait (int async, int num_waits, ...)
   else if (async == acc_async_sync)
     acc_wait_all ();
   else if (async == acc_async_noval)
-    goacc_thread ()->dev->openacc.async_wait_all_async_func (acc_async_noval);
+    acc_wait_all_async (async);
 }
 
 int
diff --git a/libgomp/oacc-plugin.c b/libgomp/oacc-plugin.c
index c04db90..a114cc7 100644
--- a/libgomp/oacc-plugin.c
+++ b/libgomp/oacc-plugin.c
@@ -30,17 +30,6 @@ 
 #include "oacc-plugin.h"
 #include "oacc-int.h"
 
-void
-GOMP_PLUGIN_async_unmap_vars (void *ptr, int async)
-{
-  struct target_mem_desc *tgt = ptr;
-  struct gomp_device_descr *devicep = tgt->device_descr;
-
-  devicep->openacc.async_set_async_func (async);
-  gomp_unmap_vars (tgt, true);
-  devicep->openacc.async_set_async_func (acc_async_sync);
-}
-
 /* Return the target-specific part of the TLS data for the current thread.  */
 
 void *
diff --git a/libgomp/openacc.h b/libgomp/openacc.h
index f61bb77..ede59d7 100644
--- a/libgomp/openacc.h
+++ b/libgomp/openacc.h
@@ -63,6 +63,7 @@  typedef enum acc_device_t {
 
 typedef enum acc_async_t {
   /* Keep in sync with include/gomp-constants.h.  */
+  acc_async_default = 0,
   acc_async_noval = -1,
   acc_async_sync  = -2
 } acc_async_t;
@@ -72,6 +73,8 @@  void acc_set_device_type (acc_device_t) __GOACC_NOTHROW;
 acc_device_t acc_get_device_type (void) __GOACC_NOTHROW;
 void acc_set_device_num (int, acc_device_t) __GOACC_NOTHROW;
 int acc_get_device_num (acc_device_t) __GOACC_NOTHROW;
+void acc_set_default_async (int) __GOACC_NOTHROW;
+int acc_get_default_async (void) __GOACC_NOTHROW;
 int acc_async_test (int) __GOACC_NOTHROW;
 int acc_async_test_all (void) __GOACC_NOTHROW;
 void acc_wait (int) __GOACC_NOTHROW;