[0/7,OpenACC,libgomp,v5,stage1] Async re-work

Message ID e1cccd27-4f07-6123-e9ee-82be74002c08@mentor.com
Headers show
Series
  • Async re-work
Related show

Message

Chung-Lin Tang Jan. 22, 2019, 2:52 p.m.
Hi, this is a rebase to current trunk and re-submission of the OpenACC Async
re-organization work, aiming to commit when stage1 re-opens. This is technically
the 2nd time I'm sending this whole patch series, but because I've named
partial revisions up to v4 by now, for clarity I will just call this entire set "v5".

Thomas, I hope I resolved all discussed issues in this current patch set. Please
kindly remind if I missed anything, as there were so many emails to re-check :)

The more detailed descriptions are in the individual patch submissions.

Thanks,
Chung-Lin

Comments

Thomas Schwinge Feb. 12, 2019, 3:07 p.m. | #1
Hi Chung-Lin!

Happy New Year now to you, too!  :-)


On Tue, 22 Jan 2019 22:52:09 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> Hi, this is a rebase to current trunk and re-submission of the OpenACC Async

> re-organization work, aiming to commit when stage1 re-opens.


Thanks!

> This is technically

> the 2nd time I'm sending this whole patch series, but because I've named

> partial revisions up to v4 by now, for clarity I will just call this entire set "v5".


As far as I'm concerned, these patches should all (with a few exceptions
to be split out, see below) be merged into one patch, because they
logically all belong together, as one piece: "async re-work".


> Thomas, I hope I resolved all discussed issues in this current patch set. Please

> kindly remind if I missed anything, as there were so many emails to re-check :)


I'm still waiting for you to commit the PR87924 "OpenACC wait clauses
without async-arguments" changes, as a prerequisite to this re-work,
<https://gcc.gnu.org/ml/gcc-patches/2019-01/msg01282.html>.


If we agree that we actually need such a thing (I'll have to re-read
Jakub's comments), please submit the 'GOMP_PLUGIN_IF_VERSION' changes
separately, with 'GOMP_PLUGIN_IF_VERSION' equal to 'GOMP_VERSION'
(initially).  As this then is only a kind of documentation update, this
might then go into trunk right now -- and even if not right now, should
still be done separately as a prerequisite patch to this re-work, which
will then just increment 'GOMP_PLUGIN_IF_VERSION'.

Maybe rename 'GOMP_PLUGIN_IF_VERSION' to 'GOMP_PLUGIN_VERSION', for
similarity with 'GOMP_VERSION'?

And, it's then a bit confusing that 'GOMP_PLUGIN_VERSION' is returned
from 'GOMP_OFFLOAD_version' functions (plus 'host_version'); we there got
"plugin" vs. "offload".  But I suppose we'll just live with that?

The 'GOMP_OFFLOAD_version' functions should then also get their source
code comments updated: "libgomp [plugin] version"?


Now, back to the actual async re-work.

I see you've incorporated some of the incremental patches I provided
(thanks!), but not all of them.  I don't know if you just missed (some
of) these, or actually object?


I had requested that the OpenACC 2.5 'default_async' changes be discussed
separately, after this re-work has gone in, so please remove these
changes from this patch series.  I've again attached "into async re-work:
revert default_async changes".


I had provided changes, "into async re-work: don't create an asyncqueue
just to then test/synchronize with it", again attached.  I had asked that
you 'Please especially review the "libgomp/oacc-parallel.c:goacc_wait"
change, and confirm no corresponding "libgomp/oacc-parallel.c:GOACC_wait"
change to be done, because that code is structured differently'.


I had requested that we maintain the current behavior, that
"acc_async_noval" stays in its own, separate asyncqueue, instead of
aliasing it to 'async(0)'.  I had proposed "into async re-work:
libgomp/oacc-async.c:async2id", again attached.

You said you don't like the 'async2id' function I'm adding there (I still
don't understand why), so I assume you'd then implement this
async-argument to queue ID translation in 'lookup_goacc_asyncqueue'
proper?


I had provided "[WIP] into async re-work: documentation", again attached,
as 'A little bit of documentation starter update for you to include.
Please make sure that all relevant functions have such comments addded'.


I'm again attaching my changes 'into async re-work: replicate
"[PR88407] [OpenACC] Correctly handle unseen async-arguments"', which --
I suppose -- are necessary to maintain the current GCC trunk behavior
(that is, avoid testsuite regressions).


I'm again attaching my changes 'into async re-work: replicate "[PR88370]
acc_get_cuda_stream/acc_set_cuda_stream: acc_async_sync,
acc_async_noval"', which -- I suppose -- are necessary to maintain the
current GCC trunk behavior (that is, avoid testsuite regressions).


I'm again attaching my changes 'into async re-work: adjust for test case
added in "[PR88484] OpenACC wait directive without wait argument but with
async clause"', which -- I suppose -- are necessary to maintain the
current GCC trunk behavior (that is, avoid testsuite regressions).

You suggested that "Instead of fixing it here, will it make more sense to
have the serialize_func hook to accommodate the NULL asyncqueue?", to
which I said "Sure, that may make sense, yes.  Right: if there's no
asyncqueue to serialize with, then serialize/synchronize with the local
(host) thread", but this has not yet been implemented, as far as I can
tell.


I'm again attaching my changes 'into async re-work: don't synchronize
with the local thread unless actually necessary', which is the behavior
that makes most sense to me, and I had asked 'Would you please review the
"TODO" comments, and again also especially review the
"libgomp/oacc-parallel.c:goacc_wait" change, and confirm no corresponding
"libgomp/oacc-parallel.c:GOACC_wait" change to be done, because that code
is structured differently'.


By means of a "TODO" comment that I added, I had asked you to verify in
your 'libgomp/oacc-parallel.c:GOACC_enter_exit_data' translation from
'async_set_async_func' function call to 'async' formal parameter,
whether/why one case deliberately has not been converted; again attaching
this remaining piece of 'into async re-work: more async function usage'.


> The more detailed descriptions are in the individual patch submissions.


I'll respond to a few items individually, later on.


Grüße
 Thomas
From da84271ae48b0d6882be3304b8b97028e08158a1 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>

Date: Thu, 6 Dec 2018 15:57:46 +0100
Subject: [PATCH 1/9] into async re-work: revert default_async changes

---
 include/gomp-constants.h                      |   1 -
 libgomp/libgomp.map                           |   4 -
 libgomp/oacc-async.c                          |  19 +-
 libgomp/oacc-init.c                           |   2 -
 libgomp/oacc-int.h                            |   3 -
 libgomp/openacc.f90                           |  22 +-
 libgomp/openacc.h                             |   3 -
 libgomp/openacc_lib.h                         |  13 -
 .../libgomp.oacc-c-c++-common/asyncwait-2.c   | 904 ------------------
 9 files changed, 2 insertions(+), 969 deletions(-)
 delete mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c

diff --git a/include/gomp-constants.h b/include/gomp-constants.h
index e37f1f9e9e4..f1e2ca3c75c 100644
--- a/include/gomp-constants.h
+++ b/include/gomp-constants.h
@@ -160,7 +160,6 @@ enum gomp_map_kind
 /* Asynchronous behavior.  Keep in sync with
    libgomp/{openacc.h,openacc.f90,openacc_lib.h}:acc_async_t.  */
 
-#define GOMP_ASYNC_DEFAULT		0
 #define GOMP_ASYNC_NOVAL		-1
 #define GOMP_ASYNC_SYNC			-2
 
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 8feec91f5a3..d8e2fd1818b 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -464,12 +464,8 @@ OACC_2.5 {
 	acc_delete_finalize_async_32_h_;
 	acc_delete_finalize_async_64_h_;
 	acc_delete_finalize_async_array_h_;
-	acc_get_default_async;
-	acc_get_default_async_h_;
 	acc_memcpy_from_device_async;
 	acc_memcpy_to_device_async;
-	acc_set_default_async;
-	acc_set_default_async_h_;
 	acc_update_device_async;
 	acc_update_device_async_32_h_;
 	acc_update_device_async_64_h_;
diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c
index 00bed7452af..ea5ae542ac3 100644
--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c
@@ -60,7 +60,7 @@ lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
   /* The special value acc_async_noval (-1) maps to the thread-specific
      default async stream.  */
   if (async == acc_async_noval)
-    async = thr->default_async;
+    async = 0; //TODO thr->default_async;
 
   if (async == acc_async_sync)
     return NULL;
@@ -246,23 +246,6 @@ acc_wait_all_async (int async)
     gomp_fatal ("wait all async(%d) failed", async);
 }
 
-int
-acc_get_default_async (void)
-{
-  struct goacc_thread *thr = get_goacc_thread ();
-  return thr->default_async;
-}
-
-void
-acc_set_default_async (int async)
-{
-  if (async < acc_async_sync)
-    gomp_fatal ("invalid async argument: %d", async);
-
-  struct goacc_thread *thr = get_goacc_thread ();
-  thr->default_async = async;
-}
-
 attribute_hidden void
 goacc_async_free (struct gomp_device_descr *devicep,
 		  struct goacc_asyncqueue *aq, void *ptr)
diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c
index a561792b243..28471e40ba0 100644
--- a/libgomp/oacc-init.c
+++ b/libgomp/oacc-init.c
@@ -426,8 +426,6 @@ goacc_attach_host_thread_to_device (int ord)
   
   thr->target_tls
     = acc_dev->openacc.create_thread_data_func (ord);
-
-  thr->default_async = acc_async_default;
 }
 
 /* OpenACC 2.0a (3.2.12, 3.2.13) doesn't specify whether the serialization of
diff --git a/libgomp/oacc-int.h b/libgomp/oacc-int.h
index b343a06472e..e4b6ea6b7db 100644
--- a/libgomp/oacc-int.h
+++ b/libgomp/oacc-int.h
@@ -73,9 +73,6 @@ struct goacc_thread
 
   /* Target-specific data (used by plugin).  */
   void *target_tls;
-
-  /* Default OpenACC async queue for current thread, exported to plugin.  */
-  int default_async;
 };
 
 #if defined HAVE_TLS || defined USE_EMUTLS
diff --git a/libgomp/openacc.f90 b/libgomp/openacc.f90
index 971c16f6239..bc205453f82 100644
--- a/libgomp/openacc.f90
+++ b/libgomp/openacc.f90
@@ -51,10 +51,9 @@ module openacc_kinds
 
   integer, parameter :: acc_handle_kind = int32
 
-  public :: acc_async_default, acc_async_noval, acc_async_sync
+  public :: acc_async_noval, acc_async_sync
 
   ! Keep in sync with include/gomp-constants.h.
-  integer (acc_handle_kind), parameter :: acc_async_default = 0
   integer (acc_handle_kind), parameter :: acc_async_noval = -1
   integer (acc_handle_kind), parameter :: acc_async_sync = -2
 
@@ -93,16 +92,6 @@ module openacc_internal
       integer (acc_device_kind) d
     end function
 
-    subroutine acc_set_default_async_h (a)
-      import
-      integer a
-    end subroutine
-
-    function acc_get_default_async_h ()
-      import
-      integer acc_get_default_async_h
-    end function
-
     function acc_async_test_h (a)
       logical acc_async_test_h
       integer a
@@ -731,7 +720,6 @@ module openacc
 
   public :: acc_get_num_devices, acc_set_device_type, acc_get_device_type
   public :: acc_set_device_num, acc_get_device_num, acc_async_test
-  public :: acc_set_default_async, acc_get_default_async
   public :: acc_async_test_all
   public :: acc_wait, acc_async_wait, acc_wait_async
   public :: acc_wait_all, acc_async_wait_all, acc_wait_all_async
@@ -764,14 +752,6 @@ module openacc
     procedure :: acc_get_device_num_h
   end interface
 
-  interface acc_set_default_async
-    procedure :: acc_set_default_async_h
-  end interface
-
-  interface acc_get_default_async
-    procedure :: acc_get_default_async_h
-  end interface
-
   interface acc_async_test
     procedure :: acc_async_test_h
   end interface
diff --git a/libgomp/openacc.h b/libgomp/openacc.h
index 381f74f39d0..1bbe6c90e7f 100644
--- a/libgomp/openacc.h
+++ b/libgomp/openacc.h
@@ -63,7 +63,6 @@ typedef enum acc_device_t {
 
 typedef enum acc_async_t {
   /* Keep in sync with include/gomp-constants.h.  */
-  acc_async_default = 0,
   acc_async_noval = -1,
   acc_async_sync  = -2
 } acc_async_t;
@@ -73,8 +72,6 @@ void acc_set_device_type (acc_device_t) __GOACC_NOTHROW;
 acc_device_t acc_get_device_type (void) __GOACC_NOTHROW;
 void acc_set_device_num (int, acc_device_t) __GOACC_NOTHROW;
 int acc_get_device_num (acc_device_t) __GOACC_NOTHROW;
-void acc_set_default_async (int) __GOACC_NOTHROW;
-int acc_get_default_async (void) __GOACC_NOTHROW;
 int acc_async_test (int) __GOACC_NOTHROW;
 int acc_async_test_all (void) __GOACC_NOTHROW;
 void acc_wait (int) __GOACC_NOTHROW;
diff --git a/libgomp/openacc_lib.h b/libgomp/openacc_lib.h
index 9fe47bbc48d..fbd8f5e3625 100644
--- a/libgomp/openacc_lib.h
+++ b/libgomp/openacc_lib.h
@@ -46,7 +46,6 @@
       integer, parameter :: acc_handle_kind = 4
 
 !     Keep in sync with include/gomp-constants.h.
-      integer (acc_handle_kind), parameter :: acc_async_default = 0
       integer (acc_handle_kind), parameter :: acc_async_noval = -1
       integer (acc_handle_kind), parameter :: acc_async_sync = -2
 
@@ -90,18 +89,6 @@
         end function
       end interface
 
-      interface acc_set_default_async
-        subroutine acc_set_default_async_h (a)
-          integer a
-        end subroutine
-      end interface
-
-      interface acc_get_default_async
-        function acc_get_default_async_h ()
-          integer acc_get_default_async_h
-        end function
-      end interface
-
       interface acc_async_test
         function acc_async_test_h (a)
           logical acc_async_test_h
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c
deleted file mode 100644
index 94205407d41..00000000000
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c
+++ /dev/null
@@ -1,904 +0,0 @@
-/* { dg-do run { target openacc_nvidia_accel_selected } } */
-/* { dg-additional-options "-lcuda" } */
-
-#include <openacc.h>
-#include <stdlib.h>
-#include <cuda.h>
-
-#include <stdio.h>
-#include <time.h>
-#include <sys/time.h>
-
-int
-main (int argc, char **argv)
-{
-    CUresult r;
-    CUstream stream1;
-    int N = 128; //1024 * 1024;
-    float *a, *b, *c, *d, *e;
-    int i;
-    int nbytes;
-
-    srand (time (NULL));
-    int s = rand () % 100;
-
-    acc_init (acc_device_nvidia);
-
-    nbytes = N * sizeof (float);
-
-    a = (float *) malloc (nbytes);
-    b = (float *) malloc (nbytes);
-    c = (float *) malloc (nbytes);
-    d = (float *) malloc (nbytes);
-    e = (float *) malloc (nbytes);
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 3.0;
-        b[i] = 0.0;
-    }
-
-    acc_set_default_async (s);
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = a[ii];
-    }
-
-#pragma acc wait
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 3.0)
-            abort ();
-
-        if (b[i] != 3.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 2.0;
-        b[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 2.0)
-            abort ();
-
-        if (b[i] != 2.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 3.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 3.0)
-            abort ();
-
-        if (b[i] != 9.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 2.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc parallel wait (s) async (s)
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 2.0)
-            abort ();
-
-        if (b[i] != 4.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-
-        if (e[i] != 11.0)
-            abort ();
-    }
-
-
-    r = cuStreamCreate (&stream1, CU_STREAM_NON_BLOCKING);
-    if (r != CUDA_SUCCESS)
-    {
-        fprintf (stderr, "cuStreamCreate failed: %d\n", r);
-        abort ();
-    }
-
-    acc_set_cuda_stream (1, stream1);
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 5.0;
-        b[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N], b[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 5.0)
-            abort ();
-
-        if (b[i] != 5.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 7.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 7.0)
-            abort ();
-
-        if (b[i] != 49.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 3.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc parallel wait (s) async (s)
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 3.0)
-            abort ();
-
-        if (b[i] != 9.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-
-        if (e[i] != 17.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 4.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc update host (a[0:N], b[0:N], c[0:N]) wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 4.0)
-            abort ();
-
-        if (b[i] != 16.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-    }
-
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 5.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc update host (a[0:N], b[0:N], c[0:N]) async
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 5.0)
-            abort ();
-
-        if (b[i] != 25.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 3.0;
-        b[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = a[ii];
-    }
-
-#pragma acc wait
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 3.0)
-            abort ();
-
-        if (b[i] != 3.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 2.0;
-        b[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 2.0)
-            abort ();
-
-        if (b[i] != 2.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 3.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 3.0)
-            abort ();
-
-        if (b[i] != 9.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 2.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc kernels wait (s) async (s)
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 2.0)
-            abort ();
-
-        if (b[i] != 4.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-
-        if (e[i] != 11.0)
-            abort ();
-    }
-
-
-    r = cuStreamCreate (&stream1, CU_STREAM_NON_BLOCKING);
-    if (r != CUDA_SUCCESS)
-    {
-        fprintf (stderr, "cuStreamCreate failed: %d\n", r);
-        abort ();
-    }
-
-    acc_set_cuda_stream (1, stream1);
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 5.0;
-        b[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N], b[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 5.0)
-            abort ();
-
-        if (b[i] != 5.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 7.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 7.0)
-            abort ();
-
-        if (b[i] != 49.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 3.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc kernels wait (s) async (s)
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 3.0)
-            abort ();
-
-        if (b[i] != 9.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-
-        if (e[i] != 17.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 4.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc update host (a[0:N], b[0:N], c[0:N]) wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 4.0)
-            abort ();
-
-        if (b[i] != 16.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-    }
-
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 5.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc update host (a[0:N], b[0:N], c[0:N]) async
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 5.0)
-            abort ();
-
-        if (b[i] != 25.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-    }
-
-    acc_shutdown (acc_device_nvidia);
-
-    return 0;
-}
-- 
2.17.1
From 571278c7ba9432824e3b35ffc04b48b2bc433201 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>

Date: Fri, 7 Dec 2018 12:19:56 +0100
Subject: [PATCH 2/9] into async re-work: don't create an asyncqueue just to
 then test/synchronize with it

---
 libgomp/oacc-async.c    | 16 +++++++++++-----
 libgomp/oacc-parallel.c |  5 ++++-
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c
index ea5ae542ac3..b281ceb8f18 100644
--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c
@@ -131,8 +131,11 @@ acc_async_test (int async)
   if (!thr || !thr->dev)
     gomp_fatal ("no device active");
 
-  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
-  return thr->dev->openacc.async.test_func (aq);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
+  if (!aq)
+    return 1;
+  else
+    return thr->dev->openacc.async.test_func (aq);
 }
 
 int
@@ -160,9 +163,12 @@ acc_wait (int async)
 
   struct goacc_thread *thr = get_goacc_thread ();
 
-  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
-  if (!thr->dev->openacc.async.synchronize_func (aq))
-    gomp_fatal ("wait on %d failed", async);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
+  if (aq)
+    {
+      if (!thr->dev->openacc.async.synchronize_func (aq))
+	gomp_fatal ("wait on %d failed", async);
+    }
 }
 
 /* acc_async_wait is an OpenACC 1.0 compatibility name for acc_wait.  */
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index 1d3ccca6580..994016346d4 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -508,7 +508,10 @@ goacc_wait (int async, int num_waits, va_list *ap)
       struct goacc_thread *thr = goacc_thread ();
       struct gomp_device_descr *acc_dev = thr->dev;
 
-      goacc_aq aq = get_goacc_asyncqueue (qid);
+      goacc_aq aq = lookup_goacc_asyncqueue (thr, false, qid);
+      if (!aq)
+	continue;
+
       if (acc_dev->openacc.async.test_func (aq))
 	continue;
       if (async == acc_async_sync)
-- 
2.17.1
From d43e30c6042d932d9dc6a72487d5e96739488f16 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>

Date: Wed, 12 Dec 2018 15:22:29 +0100
Subject: [PATCH 3/9] into async re-work: libgomp/oacc-async.c:async2id

---
 libgomp/oacc-async.c | 60 +++++++++++++++++++++++++++++---------------
 1 file changed, 40 insertions(+), 20 deletions(-)

diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c
index b281ceb8f18..fcb05f5226a 100644
--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c
@@ -54,19 +54,35 @@ get_goacc_thread_device (void)
   return thr->dev;
 }
 
-attribute_hidden struct goacc_asyncqueue *
-lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
+/* Translate from an OpenACC async-argument to an internal asyncqueue ID, or -1
+   if no asyncqueue is to be used.  */
+
+static int
+async2id (int async)
 {
-  /* The special value acc_async_noval (-1) maps to the thread-specific
-     default async stream.  */
-  if (async == acc_async_noval)
-    async = 0; //TODO thr->default_async;
+  if (!async_valid_p (async))
+    gomp_fatal ("invalid async-argument: %d", async);
 
   if (async == acc_async_sync)
-    return NULL;
+    return -1;
+  else if (async == acc_async_noval)
+    return 0;
+  else if (async >= 0)
+    return 1 + async;
+  else
+    __builtin_unreachable ();
+}
 
-  if (async < 0)
-    gomp_fatal ("bad async %d", async);
+/* Return the asyncqueue to be used for OpenACC async-argument ASYNC.  This
+   might return NULL if no asyncqueue is to be used.  Otherwise, if CREATE,
+   create the asyncqueue if it doesn't exist yet.  */
+
+attribute_hidden struct goacc_asyncqueue *
+lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
+{
+  int id = async2id (async);
+  if (id < 0)
+    return NULL;
 
   struct goacc_asyncqueue *ret_aq = NULL;
   struct gomp_device_descr *dev = thr->dev;
@@ -74,26 +90,26 @@ lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
   gomp_mutex_lock (&dev->openacc.async.lock);
 
   if (!create
-      && (async >= dev->openacc.async.nasyncqueue
-	  || !dev->openacc.async.asyncqueue[async]))
+      && (id >= dev->openacc.async.nasyncqueue
+	  || !dev->openacc.async.asyncqueue[id]))
     goto end;
 
-  if (async >= dev->openacc.async.nasyncqueue)
+  if (id >= dev->openacc.async.nasyncqueue)
     {
-      int diff = async + 1 - dev->openacc.async.nasyncqueue;
+      int diff = id + 1 - dev->openacc.async.nasyncqueue;
       dev->openacc.async.asyncqueue
 	= gomp_realloc (dev->openacc.async.asyncqueue,
-			sizeof (goacc_aq) * (async + 1));
+			sizeof (goacc_aq) * (id + 1));
       memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue,
 	      0, sizeof (goacc_aq) * diff);
-      dev->openacc.async.nasyncqueue = async + 1;
+      dev->openacc.async.nasyncqueue = id + 1;
     }
 
-  if (!dev->openacc.async.asyncqueue[async])
+  if (!dev->openacc.async.asyncqueue[id])
     {
-      dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func ();
+      dev->openacc.async.asyncqueue[id] = dev->openacc.async.construct_func ();
 
-      if (!dev->openacc.async.asyncqueue[async])
+      if (!dev->openacc.async.asyncqueue[id])
 	{
 	  gomp_mutex_unlock (&dev->openacc.async.lock);
 	  gomp_fatal ("async %d creation failed", async);
@@ -101,18 +117,22 @@ lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
       
       /* Link new async queue into active list.  */
       goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));
-      n->aq = dev->openacc.async.asyncqueue[async];
+      n->aq = dev->openacc.async.asyncqueue[id];
       n->next = dev->openacc.async.active;
       dev->openacc.async.active = n;
     }
 
-  ret_aq = dev->openacc.async.asyncqueue[async];
+  ret_aq = dev->openacc.async.asyncqueue[id];
 
  end:
   gomp_mutex_unlock (&dev->openacc.async.lock);
   return ret_aq;
 }
 
+/* Return the asyncqueue to be used for OpenACC async-argument ASYNC.  This
+   might return NULL if no asyncqueue is to be used.  Otherwise, create the
+   asyncqueue if it doesn't exist yet.  */
+
 attribute_hidden struct goacc_asyncqueue *
 get_goacc_asyncqueue (int async)
 {
-- 
2.17.1
From 98160fc42faa0fb8db87fd3d431e55460ca9c93f Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>

Date: Thu, 13 Dec 2018 17:59:46 +0100
Subject: [PATCH 4/9] [WIP] into async re-work: documentation

---
 libgomp/libgomp.h             | 3 +++
 libgomp/oacc-async.c          | 7 +++++++
 libgomp/plugin/plugin-nvptx.c | 4 ++--
 libgomp/target.c              | 3 +++
 4 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 0c0c9ac954f..4b6dfe425e2 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -960,6 +960,9 @@ typedef struct acc_dispatch_t
     *destroy_thread_data_func;
   
   struct {
+    /* Once created and put into the "active" list, asyncqueues are then never
+       destructed and removed from the "active" list, other than if the TODO
+       device is shut down.  */
     gomp_mutex_t lock;
     int nasyncqueue;
     struct goacc_asyncqueue **asyncqueue;
diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c
index fcb05f5226a..97d02dba6b3 100644
--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c
@@ -282,6 +282,10 @@ goacc_async_free (struct gomp_device_descr *devicep,
     devicep->openacc.async.queue_callback_func (aq, free, ptr);
 }
 
+/* This function initializes the asyncqueues for the device specified by
+   DEVICEP.  TODO DEVICEP must be locked on entry, and remains locked on
+   return.  */
+
 attribute_hidden void
 goacc_init_asyncqueues (struct gomp_device_descr *devicep)
 {
@@ -291,6 +295,9 @@ goacc_init_asyncqueues (struct gomp_device_descr *devicep)
   gomp_mutex_init (&devicep->openacc.async.lock);
 }
 
+/* This function finalizes the asyncqueues for the device specified by DEVICEP.
+   TODO DEVICEP must be locked on entry, and remains locked on return.  */
+
 attribute_hidden bool
 goacc_fini_asyncqueues (struct gomp_device_descr *devicep)
 {
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 22285f0de88..68aac268db0 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1371,14 +1371,14 @@ GOMP_OFFLOAD_openacc_cuda_get_current_context (void)
   return nvptx_get_current_cuda_context ();
 }
 
-/* NOTE: This returns a CUstream, not a ptx_stream pointer.  */
+/* This returns a CUstream.  */
 void *
 GOMP_OFFLOAD_openacc_cuda_get_stream (struct goacc_asyncqueue *aq)
 {
   return (void *) aq->cuda_stream;
 }
 
-/* NOTE: This takes a CUstream, not a ptx_stream pointer.  */
+/* This takes a CUstream.  */
 int
 GOMP_OFFLOAD_openacc_cuda_set_stream (struct goacc_asyncqueue *aq, void *stream)
 {
diff --git a/libgomp/target.c b/libgomp/target.c
index c7b62b83fbc..7aa93a2f46d 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -1538,6 +1538,9 @@ gomp_init_device (struct gomp_device_descr *devicep)
   devicep->state = GOMP_DEVICE_INITIALIZED;
 }
 
+/* This function finalizes the target device, specified by DEVICEP.  DEVICEP
+   must be locked on entry, and remains locked on return.  */
+
 attribute_hidden bool
 gomp_fini_device (struct gomp_device_descr *devicep)
 {
-- 
2.17.1
From 0e1dc8872190aef7890aa171ab7da245b14134eb Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>

Date: Wed, 12 Dec 2018 15:24:17 +0100
Subject: [PATCH 5/9] into async re-work: replicate "[PR88407] [OpenACC]
 Correctly handle unseen async-arguments"

---
 libgomp/oacc-async.c | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c
index 97d02dba6b3..8373456d8a5 100644
--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c
@@ -143,9 +143,6 @@ get_goacc_asyncqueue (int async)
 int
 acc_async_test (int async)
 {
-  if (!async_valid_p (async))
-    gomp_fatal ("invalid async argument: %d", async);
-
   struct goacc_thread *thr = goacc_thread ();
 
   if (!thr || !thr->dev)
@@ -178,9 +175,6 @@ acc_async_test_all (void)
 void
 acc_wait (int async)
 {
-  if (!async_valid_p (async))
-    gomp_fatal ("invalid async argument: %d", async);
-
   struct goacc_thread *thr = get_goacc_thread ();
 
   goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
@@ -207,11 +201,12 @@ acc_wait_async (int async1, int async2)
 {
   struct goacc_thread *thr = get_goacc_thread ();
 
-  goacc_aq aq2 = lookup_goacc_asyncqueue (thr, true, async2);
   goacc_aq aq1 = lookup_goacc_asyncqueue (thr, false, async1);
+  //TODO Is this correct also for acc_async_sync, assuming that in ths case, we'll always be synchronous anyway?
   if (!aq1)
-    gomp_fatal ("invalid async 1");
+    return;
 
+  goacc_aq aq2 = lookup_goacc_asyncqueue (thr, true, async2);
   /* An async queue is always synchronized with itself.  */
   if (aq1 == aq2)
     return;
@@ -251,9 +246,6 @@ acc_async_wait_all (void)
 void
 acc_wait_all_async (int async)
 {
-  if (!async_valid_p (async))
-    gomp_fatal ("invalid async argument: %d", async);
-
   struct goacc_thread *thr = get_goacc_thread ();
 
   goacc_aq waiting_queue = lookup_goacc_asyncqueue (thr, true, async);
-- 
2.17.1
From b8c83e9fb240bfa2c461c809d4f836c102543ac5 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>

Date: Wed, 12 Dec 2018 15:25:19 +0100
Subject: [PATCH 6/9] into async re-work: replicate "[PR88370]
 acc_get_cuda_stream/acc_set_cuda_stream: acc_async_sync, acc_async_noval"

---
 libgomp/oacc-cuda.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/libgomp/oacc-cuda.c b/libgomp/oacc-cuda.c
index d4aa762575d..1a6946c5dda 100644
--- a/libgomp/oacc-cuda.c
+++ b/libgomp/oacc-cuda.c
@@ -30,6 +30,7 @@
 #include "config.h"
 #include "libgomp.h"
 #include "oacc-int.h"
+#include <assert.h>
 
 void *
 acc_get_current_cuda_device (void)
@@ -87,6 +88,15 @@ acc_set_cuda_stream (int async, void *stream)
   if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
     {
       goacc_aq aq = get_goacc_asyncqueue (async);
+      /* Due to not using an asyncqueue for "acc_async_sync", this cannot be
+	 used to change the CUDA stream associated with "acc_async_sync".  */
+      if (!aq)
+	{
+	  assert (async == acc_async_sync);
+	  gomp_debug (0, "Refusing request to set CUDA stream associated"
+		      " with \"acc_async_sync\"\n");
+	  return 0;
+	}
       gomp_mutex_lock (&thr->dev->openacc.async.lock);
       ret = thr->dev->openacc.cuda.set_stream_func (aq, stream);
       gomp_mutex_unlock (&thr->dev->openacc.async.lock);
-- 
2.17.1
From 855028c3d8d35561384ac54d75173f201fd89bd7 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>

Date: Thu, 13 Dec 2018 17:43:42 +0100
Subject: [PATCH 7/9] into async re-work: adjust for test case added in
 "[PR88484] OpenACC wait directive without wait argument but with async
 clause"

---
 libgomp/oacc-async.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c
index 8373456d8a5..e0d35742eef 100644
--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c
@@ -213,8 +213,11 @@ acc_wait_async (int async1, int async2)
 
   if (!thr->dev->openacc.async.synchronize_func (aq1))
     gomp_fatal ("wait on %d failed", async1);
-  if (!thr->dev->openacc.async.serialize_func (aq1, aq2))
-    gomp_fatal ("ordering of async ids %d and %d failed", async1, async2);
+  if (aq2)
+    {
+      if (!thr->dev->openacc.async.serialize_func (aq1, aq2))
+	gomp_fatal ("ordering of async ids %d and %d failed", async1, async2);
+    }
 }
 
 void
-- 
2.17.1
From b240d3783e1f4cabbf9eadb08c9210e3a4132b0d Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>

Date: Thu, 13 Dec 2018 17:54:35 +0100
Subject: [PATCH 8/9] into async re-work: don't synchronize with the local
 thread unless actually necessary

---
 libgomp/oacc-async.c    | 12 +++++++++---
 libgomp/oacc-parallel.c |  2 --
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c
index e0d35742eef..292de005390 100644
--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c
@@ -211,13 +211,17 @@ acc_wait_async (int async1, int async2)
   if (aq1 == aq2)
     return;
 
-  if (!thr->dev->openacc.async.synchronize_func (aq1))
-    gomp_fatal ("wait on %d failed", async1);
   if (aq2)
     {
       if (!thr->dev->openacc.async.serialize_func (aq1, aq2))
 	gomp_fatal ("ordering of async ids %d and %d failed", async1, async2);
     }
+  else
+    {
+      //TODO Local thread synchronization.  Necessary for the "async2 == acc_async_sync" case, or can just skip?
+      if (!thr->dev->openacc.async.synchronize_func (aq1))
+	gomp_fatal ("wait on %d failed", async1);
+    }
 }
 
 void
@@ -257,9 +261,11 @@ acc_wait_all_async (int async)
   gomp_mutex_lock (&thr->dev->openacc.async.lock);
   for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
     {
-      ret &= thr->dev->openacc.async.synchronize_func (l->aq);
       if (waiting_queue)
 	ret &= thr->dev->openacc.async.serialize_func (l->aq, waiting_queue);
+      else
+	//TODO Local thread synchronization.  Necessary for the "async == acc_async_sync" case, or can just skip?
+	ret &= thr->dev->openacc.async.synchronize_func (l->aq);
     }
   gomp_mutex_unlock (&thr->dev->openacc.async.lock);
 
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index 994016346d4..e2c4bfc1a4e 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -527,8 +527,6 @@ goacc_wait (int async, int num_waits, va_list *ap)
       else
 	{
 	  goacc_aq aq2 = get_goacc_asyncqueue (async);
-	  if (!acc_dev->openacc.async.synchronize_func (aq))
-	    gomp_fatal ("wait(%d) failed", qid);
 	  if (!acc_dev->openacc.async.serialize_func (aq, aq2))
 	    gomp_fatal ("wait(%d) async(%d) failed", qid, async);
 	}
-- 
2.17.1
From 9568689778d598fe75f53f48268bc51ce72c2971 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>

Date: Fri, 14 Dec 2018 14:34:17 +0100
Subject: [PATCH 9/9] into async re-work: more async function usage

---
 libgomp/oacc-parallel.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index e2c4bfc1a4e..748c116c925 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -582,6 +582,8 @@ GOACC_update (int flags_m, size_t mapnum,
 		 the value of the allocated device memory in the
 		 previous pointer.  */
 	      *(uintptr_t *) hostaddrs[i] = (uintptr_t)dptr;
+	      /* This is intentionally no calling acc_update_device_async,
+		 because TODO.  */
 	      acc_update_device (hostaddrs[i], sizeof (uintptr_t));
 
 	      /* Restore the host pointer.  */
-- 
2.17.1