[WIP] Use functional parameters for data mappings in OpenACC child functions

Message ID 7bfb59de-a141-46b4-0c6d-91ecd5b2a766@codesourcery.com
State New

Commit Message

Cesar Philippidis Dec. 18, 2017, 10:58 p.m.
Jakub,

I'd like your thoughts on the following problem.

One of the offloading bottlenecks with GPU acceleration in OpenACC is
the nontrivial offloaded function invocation overhead. At present, GCC
generates code to pass a struct containing one field for each of the
data mappings used in the OMP child function. I'm guessing a struct is
used because pthread_create only accepts a single argument for new
threads. What I'd like to do is to create the child function with one
argument per data mapping. This has a number of advantages:

  1. No device memory needs to be managed for the child function data
     mapping struct.

  2. On PTX targets, the .param address space is cached. Using
     individual parameters for function arguments will allow the nvptx
     back end to generate a more relaxed "execution model" because the
     thread initialization code will be accessing cache memory instead
     of global memory.

  3. It was my hope that this would set a path to eliminate the
     GOMP_MAP_FIRSTPRIVATE_INT optimization, by replacing those mappings
     with the actual value directly.

1) is huge for programs, such as cloverleaf, which launch a lot of small
parallel regions a lot of times.

For the execution model in 2), OpenACC begins each parallel region in a
gang-redundant, worker-single and vector-single state. To transition
from a single-threaded (or single vector lane) state to a multi-threaded
partitioned state, GCC needs to emit code to propagate live variables,
both on the stack and in registers, to the spawned threads. A lot of loops,
including DGEMV from BLAS, can be executed in a fully-redundant state.
Executing code redundantly has the advantage of not requiring any state
transition code. The problem here is that a) the struct is in global
memory, and b) not all of the GPU threads are executing the same
instruction at the same time. Consequently, initializing each thread in
a fully redundant manner actually hurts performance. When I rewrote the
same test case passing the data mappings via individual parameters, that
optimization improved performance compared to GCC trunk's baseline.

Lastly, 3) is more of a simplification than anything else. I'm not too
concerned about this because those variables only get initialized once.
So long as they don't require a separate COPYIN data mapping, the
performance hit should be negligible.
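
To make the two calling conventions concrete, here is a minimal host-side
C sketch. It is not the code GCC generates; the names omp_data_t,
child_fn_struct, child_fn_params and compare_schemes are hypothetical and
exist only for illustration. It shows the same region body written once
against a single struct pointer and once against one pointer parameter
per data mapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the struct built today: one field per
   data mapping used in the offloaded region.  */
struct omp_data_t
{
  double *a;
  double *b;
  size_t *n;
};

/* Current scheme: the child function receives a single pointer and
   loads every mapping through it.  */
static void
child_fn_struct (void *omp_data_i)
{
  struct omp_data_t *d = (struct omp_data_t *) omp_data_i;
  for (size_t i = 0; i < *d->n; i++)
    d->b[i] = 2.0 * d->a[i];
}

/* Proposed scheme: one pointer parameter per data mapping, so no
   struct has to be allocated or read.  */
static void
child_fn_params (double *a, double *b, size_t *n)
{
  for (size_t i = 0; i < *n; i++)
    b[i] = 2.0 * a[i];
}

/* Run both variants on the same input; return 1 if the results agree
   and the last element has the expected value.  */
static int
compare_schemes (void)
{
  double a[4] = { 1.0, 2.0, 3.0, 4.0 };
  double b1[4] = { 0 }, b2[4] = { 0 };
  size_t n = 4;
  struct omp_data_t data = { a, b1, &n };

  child_fn_struct (&data);
  child_fn_params (a, b2, &n);

  for (size_t i = 0; i < n; i++)
    if (b1[i] != b2[i])
      return 0;
  return b1[3] == 8.0;
}
```

On the host both variants behave identically. The difference this mail
cares about only shows up on the accelerator, where the struct lives in
global memory and the parameters do not.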

In this first attempt at using parameters I taught lower_omp_target how
to create child functions for OpenACC parallel regions with individual
parameters for the data mappings instead of using a large struct. This
works for the most part, but I realized too late that pthread_create
only passes one argument to each thread it creates. It should be noted
that I left the kernels implementation as-is, using the global struct
argument because kernels in GCC is largely ineffective and it usually
falls back to executing code on the host CPU. Eventually, we want to
redo kernels, but not until we get the parallel code running efficiently.

For fallback host targets, libgomp is using libffi to pass arguments to
the offloaded functions. This works OK at the moment because the host
code is always single-threaded. Ideally, that would change in the
future, but I'm not aware of any immediate plans to do so.
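
As an illustration of why the host fallback needs something like libffi,
here is a small C sketch that deliberately does NOT use libffi. It
assumes, as in the patch, that every child function parameter is a
pointer, and it hard-codes the supported arities in a switch. The names
generic_fn, call_with_ptr_args, add_region and demo_host_call are
hypothetical. The point is only that the number of arguments is known at
run time, which a fixed switch cannot cover in general.

```c
#include <assert.h>
#include <stddef.h>

/* Generic function pointer type used to carry the child function.  */
typedef void (*generic_fn) (void);

/* Call FN with the NARGS pointers stored in ARGS.  Returns 0 on
   success and -1 if this sketch does not cover the arity.  */
static int
call_with_ptr_args (generic_fn fn, size_t nargs, void **args)
{
  switch (nargs)
    {
    case 0:
      fn ();
      return 0;
    case 1:
      ((void (*) (void *)) fn) (args[0]);
      return 0;
    case 2:
      ((void (*) (void *, void *)) fn) (args[0], args[1]);
      return 0;
    case 3:
      ((void (*) (void *, void *, void *)) fn) (args[0], args[1], args[2]);
      return 0;
    default:
      /* A real runtime cannot enumerate every arity like this.  */
      return -1;
    }
}

/* Hypothetical child function with three data mappings.  */
static void
add_region (void *a, void *b, void *out)
{
  *(int *) out = *(int *) a + *(int *) b;
}

/* Dispatch add_region through the run-time argument array.  */
static int
demo_host_call (void)
{
  int a = 2, b = 3, out = 0;
  void *args[3] = { &a, &b, &out };

  if (call_with_ptr_args ((generic_fn) add_region, 3, args) != 0)
    return -1;
  return out;
}
```

A region with more mappings than the switch anticipates falls into the
default case, which is the gap a general foreign-function-call mechanism
is meant to close.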

Question: is this approach acceptable for Stage 1 in May, or should I
make the offloaded function parameter expansion target-specific? I can
think of a couple of ways to make this target-specific:

  a. Create two child functions during lowering, one with individual
     parameters for the data mappings, and another which takes in a
     single struct. The latter then calls the former immediately on
     entry.

  b. Teach oaccdevlow to expand the incoming struct into individual
     parameters.
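
Option a. can be pictured with the following C sketch. It is only an
assumption about the general shape, not output from the patch, and
region_data, region_params, region_struct, launch_one_arg and run_region
are hypothetical names. The struct variant keeps the one-argument
signature that a pthread-style launcher expects and forwards to the
parameter variant on entry.

```c
#include <assert.h>

/* Hypothetical struct carrying the data mappings of one region.  */
struct region_data
{
  int *x;
  int *y;
  int *sum;
};

/* Variant with one parameter per data mapping; this holds the real
   body of the region.  */
static void
region_params (int *x, int *y, int *sum)
{
  *sum = *x + *y;
}

/* Variant with the single-struct signature.  It does nothing except
   unpack the struct and call the parameter variant on entry.  */
static void
region_struct (void *data)
{
  struct region_data *d = (struct region_data *) data;
  region_params (d->x, d->y, d->sum);
}

/* A launcher that only knows the one-argument convention, as a
   pthread-style interface would.  */
static void
launch_one_arg (void (*fn) (void *), void *arg)
{
  fn (arg);
}

/* Launch the region through the struct entry point and return the
   value it computed.  */
static int
run_region (void)
{
  int x = 40, y = 2, sum = 0;
  struct region_data d = { &x, &y, &sum };

  launch_one_arg (region_struct, &d);
  return sum;
}
```

Targets that can pass individual arguments would call region_params
directly, while the others would keep using region_struct.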

I'm concerned that b) is going to be a large pass. The SRA pass is
somewhat large at 5k lines. While this should be simpler, I'm not sure
by how much (probably a lot because it won't need to perform as much
analysis).

While this patch is functional, it's not complete. I still need to tweak
a couple of things in the runtime. But I don't want to spend too much
time on it if we decide to go with a different approach.

Any thoughts are welcome.

By the way, next we'll be working on increasing vector_length on nvptx
targets. In conjunction with that, we'll be simplifying the OpenACC
execution model in the nvptx BE, along with adding a new reduction
finalizer.

Cesar

Comments

Cesar Philippidis Dec. 21, 2017, 9:46 p.m. | #1
After thinking about this some more, I decided that it would be better
to expand the offloaded function arguments into individual parameters
during omp lowering, rather than writing a separate pass later on. I
don't see too many disadvantages to using libffi after a pthread is
spawned by the host. If anything, the pthread's use of libffi is the
equivalent of SRA being performed by the accelerator anyway.

I've committed this patch to openacc-gcc-7-branch.

Note that I had to xfail
libgomp.oacc-c-c++-common/combined-directives-1.c because I disabled
struct alias analysis on parallel regions. Unfortunately, that makes
kernels slightly less effective. But more often than not, kernels
regions fall back to host execution anyway.

Cesar
2017-12-21  Cesar Philippidis  <cesar@codesourcery.com>

	* Makefile.def: Make libgomp depend on libffi.
	* configure.ac: Likewise.
	* Makefile.in: Regenerate.
	* configure: Regenerate.

	gcc/fortran/
	* types.def (BT_FN_VOID_INT_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR):
	Define.

	gcc/
	* builtin-types.def (BT_FN_VOID_INT_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR):
	Define.
	* config/nvptx/nvptx.c (nvptx_expand_cmp_swap): Handle PARM_DECLs.
	* omp-builtins.def (BUILT_IN_GOACC_PARALLEL): Call
	GOACC_parallel_keyed_v2.
	* omp-expand.c (expand_omp_target): Update call to
	BUILT_IN_GOACC_PARALLEL.
	* omp-low.c (struct omp_context): Add parm_map member.
	(lookup_parm): New function.
	(build_receiver_ref): Lookup parm_map decls.
	(install_parm_decl): New function.
	(install_var_field): Install parm_map decl for OpenACC parallel region
	data clauses.
	(delete_omp_context): Clean parm_map.
	(scan_sharing_clauses): Install subarray variable mapping into parm_map.
	(create_omp_child_function): Defer creation of child function for
	OpenACC parallel regions.
	(scan_omp_target): Likewise.
	(append_decl_arg): New function.
	(lower_omp_target): Create a child offloaded function using one
	parameter per data mapping for OpenACC parallel regions.
	* tree-ssa-structalias.c (find_func_aliases_for_builtin_call):
	Ignore OpenACC parallel regions.
	(find_func_clobbers): Likewise.
	(ipa_pta_execute): Likewise.

	libgomp/
	* Makefile.am: Add libffi build dependency.
	* configure.ac: Likewise.
	* Makefile.in: Regenerate.
	* config.h.in: Regenerate.
	* configure: Regenerate.
	* libgomp-plugin.h: Define GOMP_OFFLOAD_openacc_exec_params and
	GOMP_OFFLOAD_openacc_async_exec_params.
	* libgomp.h (acc_dispatch_t): Use them here. 
	* libgomp.map (GOACC_parallel_keyed_v2): Declare.
	* libgomp_g.h (GOACC_parallel_keyed_v2): Likewise.
	* oacc-host.c (host_openacc_exec_params): New function.
	(host_openacc_async_exec_params): Likewise.
	* oacc-parallel.c (goacc_call_host_fn): Likewise.
	(GOACC_parallel_keyed_internal): Likewise.
	(GOACC_parallel_keyed): Wrapper for GOACC_parallel_keyed_internal.
	(GOACC_parallel_keyed_v2): Likewise.
	* plugin/plugin-nvptx.c (nvptx_exec): Replace CUDeviceptr dp parameter
	with void **kargs.
	(openacc_exec_internal): New function.
	(GOMP_OFFLOAD_openacc_exec_params): New function.
	(GOMP_OFFLOAD_openacc_exec): Update to call openacc_exec_internal.
	(openacc_async_exec_internal): New function.
	(GOMP_OFFLOAD_openacc_async_exec_params): New function.
	(GOMP_OFFLOAD_openacc_async_exec): Update call to
	openacc_async_exec_internal.
	* target.c (gomp_load_plugin_for_device): Handle
	openacc_exec_params and openacc_async_exec_params.
	* testsuite/Makefile.in: Regenerate.
	* testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c:
	Xfail on offloaded targets.


diff --git a/Makefile.def b/Makefile.def
index abfa9efe959..5e94062fa75 100644
--- a/Makefile.def
+++ b/Makefile.def
@@ -550,6 +550,7 @@ dependencies = { module=configure-target-libgo; on=all-target-libstdc++-v3; };
 dependencies = { module=all-target-libgo; on=all-target-libbacktrace; };
 dependencies = { module=all-target-libgo; on=all-target-libffi; };
 dependencies = { module=all-target-libgo; on=all-target-libatomic; };
+dependencies = { module=configure-target-libgomp; on=configure-target-libffi; };
 dependencies = { module=configure-target-libstdc++-v3; on=configure-target-libgomp; };
 dependencies = { module=configure-target-liboffloadmic; on=configure-target-libgomp; };
 dependencies = { module=configure-target-libsanitizer; on=all-target-libstdc++-v3; };
@@ -564,6 +565,7 @@ dependencies = { module=install-target-libgo; on=install-target-libatomic; };
 dependencies = { module=install-target-libgfortran; on=install-target-libquadmath; };
 dependencies = { module=install-target-libgfortran; on=install-target-libgcc; };
 dependencies = { module=install-target-libsanitizer; on=install-target-libstdc++-v3; };
+dependencies = { module=install-target-libgomp; on=install-target-libffi; };
 dependencies = { module=install-target-libsanitizer; on=install-target-libgcc; };
 dependencies = { module=install-target-libvtv; on=install-target-libstdc++-v3; };
 dependencies = { module=install-target-libvtv; on=install-target-libgcc; };
diff --git a/Makefile.in b/Makefile.in
index b824e0a0ca1..9b4497e3943 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -55803,6 +55803,7 @@ configure-target-libgo: maybe-all-target-libstdc++-v3
 all-target-libgo: maybe-all-target-libbacktrace
 all-target-libgo: maybe-all-target-libffi
 all-target-libgo: maybe-all-target-libatomic
+configure-target-libgomp: maybe-configure-target-libffi
 configure-target-libstdc++-v3: maybe-configure-target-libgomp
 
 configure-stage1-target-libstdc++-v3: maybe-configure-stage1-target-libgomp
@@ -55849,6 +55850,7 @@ install-target-libgo: maybe-install-target-libatomic
 install-target-libgfortran: maybe-install-target-libquadmath
 install-target-libgfortran: maybe-install-target-libgcc
 install-target-libsanitizer: maybe-install-target-libstdc++-v3
+install-target-libgomp: maybe-install-target-libffi
 install-target-libsanitizer: maybe-install-target-libgcc
 install-target-libvtv: maybe-install-target-libstdc++-v3
 install-target-libvtv: maybe-install-target-libgcc
diff --git a/configure b/configure
index 32a38633ad8..ed47944d8f9 100755
--- a/configure
+++ b/configure
@@ -3472,11 +3472,19 @@ case "${target}" in
   ft32-*-*)
     noconfigdirs="$noconfigdirs target-libffi"
     ;;
+  nvptx-*-*)
+    noconfigdirs="$noconfigdirs target-libffi"
+    ;;
   *-*-lynxos*)
     noconfigdirs="$noconfigdirs target-libffi"
     ;;
 esac
 
+libgomp_deps="target-libffi"
+if echo " ${noconfigdirs} " | grep " target-libffi " > /dev/null 2>&1 ; then
+   libgomp_deps=""
+fi
+
 # Disable the go frontend on systems where it is known to not work. Please keep
 # this in sync with contrib/config-list.mk.
 case "${target}" in
@@ -6460,6 +6468,15 @@ esac
 # $build_configdirs and $target_configdirs.
 # If we have the source for $noconfigdirs entries, add them to $notsupp.
 
+# libgomp depends on libffi.  Remove it from nonsupp if necessary.
+if ! (echo " $noconfigdirs " | grep " target-libgomp " >/dev/null 2>&1); then
+  if echo " $noconfigdirs " | grep " target-libffi " >/dev/null 2>&1; then
+    if test "x${libgomp_deps}" != x; then
+      noconfigdirs=`echo " $noconfigdirs " | sed -e "s/ target-libffi / /"`
+    fi
+  fi
+fi
+
 notsupp=""
 for dir in . $skipdirs $noconfigdirs ; do
   dirname=`echo $dir | sed -e s/target-//g -e s/build-//g`
diff --git a/configure.ac b/configure.ac
index 12377499295..a3b9e116a05 100644
--- a/configure.ac
+++ b/configure.ac
@@ -800,11 +800,19 @@ case "${target}" in
   ft32-*-*)
     noconfigdirs="$noconfigdirs target-libffi"
     ;;
+  nvptx-*-*)
+    noconfigdirs="$noconfigdirs target-libffi"
+    ;;
   *-*-lynxos*)
     noconfigdirs="$noconfigdirs target-libffi"
     ;;
 esac
 
+libgomp_deps="target-libffi"
+if echo " ${noconfigdirs} " | grep " target-libffi " > /dev/null 2>&1 ; then
+   libgomp_deps=""
+fi
+
 # Disable the go frontend on systems where it is known to not work. Please keep
 # this in sync with contrib/config-list.mk.
 case "${target}" in
@@ -2127,6 +2135,15 @@ esac
 # $build_configdirs and $target_configdirs.
 # If we have the source for $noconfigdirs entries, add them to $notsupp.
 
+# libgomp depends on libffi.  Remove it from nonsupp if necessary.
+if ! (echo " $noconfigdirs " | grep " target-libgomp " >/dev/null 2>&1); then
+  if echo " $noconfigdirs " | grep " target-libffi " >/dev/null 2>&1; then
+    if test "x${libgomp_deps}" != x; then
+      noconfigdirs=`echo " $noconfigdirs " | sed -e "s/ target-libffi / /"`
+    fi
+  fi
+fi
+
 notsupp=""
 for dir in . $skipdirs $noconfigdirs ; do
   dirname=`echo $dir | sed -e s/target-//g -e s/build-//g`
diff --git a/gcc/builtin-types.def b/gcc/builtin-types.def
index ac9894467ec..7f647c65162 100644
--- a/gcc/builtin-types.def
+++ b/gcc/builtin-types.def
@@ -763,6 +763,10 @@ DEF_FUNCTION_TYPE_VAR_6 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
 			 BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE,
 			 BT_PTR, BT_PTR, BT_PTR)
 
+DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
+			 BT_VOID, BT_INT, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE,
+			 BT_PTR, BT_PTR, BT_PTR)
+
 DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_INT_INT_VAR,
 			 BT_VOID, BT_INT, BT_SIZE, BT_PTR, BT_PTR,
 			 BT_PTR, BT_INT, BT_INT)
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index a7b4c09bf6c..55c7e3cbf90 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4737,6 +4737,10 @@ nvptx_expand_cmp_swap (tree exp, rtx target,
 			 NULL_RTX, mode, EXPAND_NORMAL);
   rtx pat;
 
+  /* 'mem' might be a PARM_DECL.  If so, convert it to a register.  */
+  if (!REG_P (mem))
+    mem = copy_to_mode_reg (GET_MODE (mem), mem);
+
   mem = gen_rtx_MEM (mode, mem);
   if (!REG_P (cmp))
     cmp = copy_to_mode_reg (mode, cmp);
diff --git a/gcc/fortran/types.def b/gcc/fortran/types.def
index 1f8a5a1277c..3c3ad69d848 100644
--- a/gcc/fortran/types.def
+++ b/gcc/fortran/types.def
@@ -252,3 +252,7 @@ DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_INT_INT_VAR,
 DEF_FUNCTION_TYPE_VAR_6 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
 			  BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE,
 			  BT_PTR, BT_PTR, BT_PTR)
+
+DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
+			  BT_VOID, BT_INT, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE,
+			  BT_PTR, BT_PTR, BT_PTR)
diff --git a/gcc/omp-builtins.def b/gcc/omp-builtins.def
index 69b73f4b8c4..a9ec667aa54 100644
--- a/gcc/omp-builtins.def
+++ b/gcc/omp-builtins.def
@@ -38,8 +38,8 @@ DEF_GOACC_BUILTIN (BUILT_IN_GOACC_DATA_END, "GOACC_data_end",
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_ENTER_EXIT_DATA, "GOACC_enter_exit_data",
 		   BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_INT_INT_VAR,
 		   ATTR_NOTHROW_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_PARALLEL, "GOACC_parallel_keyed",
-		   BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
+DEF_GOACC_BUILTIN (BUILT_IN_GOACC_PARALLEL, "GOACC_parallel_keyed_v2",
+		   BT_FN_VOID_INT_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
 		   ATTR_NOTHROW_LIST)
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_UPDATE, "GOACC_update",
 		   BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_INT_INT_VAR,
diff --git a/gcc/omp-expand.c b/gcc/omp-expand.c
index bf1f127d8d6..f674c74ec82 100644
--- a/gcc/omp-expand.c
+++ b/gcc/omp-expand.c
@@ -7097,19 +7097,21 @@ expand_omp_target (struct omp_region *region)
   gomp_target *entry_stmt;
   gimple *stmt;
   edge e;
-  bool offloaded, data_region;
+  bool offloaded, data_region, oacc_parallel;
 
   entry_stmt = as_a <gomp_target *> (last_stmt (region->entry));
   new_bb = region->entry;
+  oacc_parallel = false;
 
   offloaded = is_gimple_omp_offloaded (entry_stmt);
   switch (gimple_omp_target_kind (entry_stmt))
     {
+    case GF_OMP_TARGET_KIND_OACC_PARALLEL:
+      oacc_parallel = true;
     case GF_OMP_TARGET_KIND_REGION:
     case GF_OMP_TARGET_KIND_UPDATE:
     case GF_OMP_TARGET_KIND_ENTER_DATA:
     case GF_OMP_TARGET_KIND_EXIT_DATA:
-    case GF_OMP_TARGET_KIND_OACC_PARALLEL:
     case GF_OMP_TARGET_KIND_OACC_KERNELS:
     case GF_OMP_TARGET_KIND_OACC_UPDATE:
     case GF_OMP_TARGET_KIND_OACC_ENTER_EXIT_DATA:
@@ -7171,7 +7173,7 @@ expand_omp_target (struct omp_region *region)
 	 .OMP_DATA_I may have been converted into a different local
 	 variable.  In which case, we need to keep the assignment.  */
       tree data_arg = gimple_omp_target_data_arg (entry_stmt);
-      if (data_arg)
+      if (data_arg && !oacc_parallel)
 	{
 	  basic_block entry_succ_bb = single_succ (entry_bb);
 	  gimple_stmt_iterator gsi;
@@ -7489,6 +7491,11 @@ expand_omp_target (struct omp_region *region)
   /* The maximum number used by any start_ix, without varargs.  */
   auto_vec<tree, 11> args;
   args.quick_push (device);
+  if (start_ix == BUILT_IN_GOACC_PARALLEL)
+    {
+      tree use_params = oacc_parallel ? integer_one_node : integer_zero_node;
+      args.quick_push (use_params);
+    }
   if (offloaded)
     args.quick_push (build_fold_addr_expr (child_fn));
   args.quick_push (t1);
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index e790f0f1bb2..a2869e49ebd 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -89,6 +89,7 @@ struct omp_context
   /* Map variables to fields in a structure that allows communication
      between sending and receiving threads.  */
   splay_tree field_map;
+  splay_tree parm_map;
   tree record_type;
   tree sender_decl;
   tree receiver_decl;
@@ -321,6 +322,14 @@ maybe_lookup_decl (const_tree var, omp_context *ctx)
 }
 
 static inline tree
+lookup_parm (const_tree var, omp_context *ctx)
+{
+  splay_tree_node n;
+  n = splay_tree_lookup (ctx->parm_map, (splay_tree_key) var);
+  return (tree) n->value;
+}
+
+static inline tree
 lookup_field (tree var, omp_context *ctx)
 {
   splay_tree_node n;
@@ -501,15 +510,21 @@ build_receiver_ref (tree var, bool by_ref, omp_context *ctx)
 {
   tree x, field = lookup_field (var, ctx);
 
-  /* If the receiver record type was remapped in the child function,
-     remap the field into the new record type.  */
-  x = maybe_lookup_field (field, ctx);
-  if (x != NULL)
-    field = x;
+  if (is_oacc_parallel (ctx))
+    x = lookup_parm (var, ctx);
+  else
+    {
+      /* If the receiver record type was remapped in the child function,
+	 remap the field into the new record type.  */
+      x = maybe_lookup_field (field, ctx);
+      if (x != NULL)
+	field = x;
+
+      x = build_simple_mem_ref (ctx->receiver_decl);
+      TREE_THIS_NOTRAP (x) = 1;
+      x = omp_build_component_ref (x, field);
+    }
 
-  x = build_simple_mem_ref (ctx->receiver_decl);
-  TREE_THIS_NOTRAP (x) = 1;
-  x = omp_build_component_ref (x, field);
   if (by_ref)
     {
       x = build_simple_mem_ref (x);
@@ -644,6 +659,32 @@ build_sender_ref (tree var, omp_context *ctx)
   return build_sender_ref ((splay_tree_key) var, ctx);
 }
 
+static void
+install_parm_decl (tree var, tree type, omp_context *ctx)
+{
+  if (!is_oacc_parallel (ctx))
+    return;
+
+  splay_tree_key key = (splay_tree_key) var;
+  tree decl_name = NULL_TREE, t;
+  location_t loc = UNKNOWN_LOCATION;
+
+  if (DECL_P (var))
+    {
+      decl_name = get_identifier (get_name (var));
+      loc = DECL_SOURCE_LOCATION (var);
+    }
+  t = build_decl (loc, PARM_DECL, decl_name, type);
+  DECL_ARTIFICIAL (t) = 1;
+  DECL_NAMELESS (t) = 1;
+  DECL_ARG_TYPE (t) = type;
+  DECL_CONTEXT (t) = current_function_decl;
+  TREE_USED (t) = 1;
+  TREE_READONLY (t) = 1;
+
+  splay_tree_insert (ctx->parm_map, key, (splay_tree_value) t);
+}
+
 /* Add a new field for VAR inside the structure CTX->SENDER_DECL.  If
    BASE_POINTERS_RESTRICT, declare the field with restrict.  */
 
@@ -764,7 +805,10 @@ install_var_field (tree var, bool by_ref, int mask, omp_context *ctx,
     }
 
   if (mask & 1)
-    splay_tree_insert (ctx->field_map, key, (splay_tree_value) field);
+    {
+      splay_tree_insert (ctx->field_map, key, (splay_tree_value) field);
+      install_parm_decl (var, type, ctx);
+    }
   if ((mask & 2) && ctx->sfield_map)
     splay_tree_insert (ctx->sfield_map, key, (splay_tree_value) sfield);
 }
@@ -1068,6 +1112,8 @@ delete_omp_context (splay_tree_value value)
     splay_tree_delete (ctx->field_map);
   if (ctx->sfield_map)
     splay_tree_delete (ctx->sfield_map);
+  if (ctx->parm_map)
+    splay_tree_delete (ctx->parm_map);
 
   /* We hijacked DECL_ABSTRACT_ORIGIN earlier.  We need to clear it before
      it produces corrupt debug information.  */
@@ -1506,6 +1552,7 @@ scan_sharing_clauses (tree clauses, omp_context *ctx,
 		  insert_field_into_struct (ctx->record_type, field);
 		  splay_tree_insert (ctx->field_map, (splay_tree_key) decl,
 				     (splay_tree_value) field);
+		  install_parm_decl (decl, ptr_type_node, ctx);
 		}
 	    }
 	  break;
@@ -1800,10 +1847,13 @@ omp_maybe_offloaded_ctx (omp_context *ctx)
 }
 
 /* Build a decl for the omp child function.  It'll not contain a body
-   yet, just the bare decl.  */
+   yet, just the bare decl.  Unlike omp child functions, acc child
+   functions for parallel regions have one argument per data
+   mapping.  */
 
 static void
-create_omp_child_function (omp_context *ctx, bool task_copy)
+create_omp_child_function (omp_context *ctx, bool task_copy,
+			   unsigned int map_cnt = 0)
 {
   tree decl, type, name, t;
 
@@ -1825,6 +1875,13 @@ create_omp_child_function (omp_context *ctx, bool task_copy)
       type = build_function_type_list (void_type_node, ptr_type_node,
 				       cilk_var_type, cilk_var_type, NULL_TREE);
     }
+  else if (is_oacc_parallel (ctx))
+    {
+      tree *arg_types = (tree *) alloca (sizeof (tree) * map_cnt);
+      for (unsigned int i = 0; i < map_cnt; i++)
+	arg_types[i] = ptr_type_node;
+      type = build_function_type_array (void_type_node, map_cnt, arg_types);
+    }
   else
     type = build_function_type_list (void_type_node, ptr_type_node, NULL_TREE);
 
@@ -1899,35 +1956,37 @@ create_omp_child_function (omp_context *ctx, bool task_copy)
       DECL_ARGUMENTS (decl) = t;
     }
 
-  tree data_name = get_identifier (".omp_data_i");
-  t = build_decl (DECL_SOURCE_LOCATION (decl), PARM_DECL, data_name,
-		  ptr_type_node);
-  DECL_ARTIFICIAL (t) = 1;
-  DECL_NAMELESS (t) = 1;
-  DECL_ARG_TYPE (t) = ptr_type_node;
-  DECL_CONTEXT (t) = current_function_decl;
-  TREE_USED (t) = 1;
-  TREE_READONLY (t) = 1;
-  if (cilk_for_count)
-    DECL_CHAIN (t) = DECL_ARGUMENTS (decl);
-  DECL_ARGUMENTS (decl) = t;
-  if (!task_copy)
-    ctx->receiver_decl = t;
-  else
+  if (!is_oacc_parallel (ctx))
     {
-      t = build_decl (DECL_SOURCE_LOCATION (decl),
-		      PARM_DECL, get_identifier (".omp_data_o"),
+      tree data_name = get_identifier (".omp_data_i");
+      t = build_decl (DECL_SOURCE_LOCATION (decl), PARM_DECL, data_name,
 		      ptr_type_node);
       DECL_ARTIFICIAL (t) = 1;
       DECL_NAMELESS (t) = 1;
       DECL_ARG_TYPE (t) = ptr_type_node;
       DECL_CONTEXT (t) = current_function_decl;
       TREE_USED (t) = 1;
-      TREE_ADDRESSABLE (t) = 1;
-      DECL_CHAIN (t) = DECL_ARGUMENTS (decl);
+      TREE_READONLY (t) = 1;
+      if (cilk_for_count)
+	DECL_CHAIN (t) = DECL_ARGUMENTS (decl);
       DECL_ARGUMENTS (decl) = t;
+      if (!task_copy)
+	ctx->receiver_decl = t;
+      else
+	{
+	  t = build_decl (DECL_SOURCE_LOCATION (decl),
+			  PARM_DECL, get_identifier (".omp_data_o"),
+			  ptr_type_node);
+	  DECL_ARTIFICIAL (t) = 1;
+	  DECL_NAMELESS (t) = 1;
+	  DECL_ARG_TYPE (t) = ptr_type_node;
+	  DECL_CONTEXT (t) = current_function_decl;
+	  TREE_USED (t) = 1;
+	  TREE_ADDRESSABLE (t) = 1;
+	  DECL_CHAIN (t) = DECL_ARGUMENTS (decl);
+	  DECL_ARGUMENTS (decl) = t;
+	}
     }
-
   /* Allocate memory for the function structure.  The call to
      allocate_struct_function clobbers CFUN, so we need to restore
      it afterward.  */
@@ -2608,6 +2667,7 @@ scan_omp_target (gomp_target *stmt, omp_context *outer_ctx)
 
   ctx = new_omp_context (stmt, outer_ctx);
   ctx->field_map = splay_tree_new (splay_tree_compare_pointers, 0, 0);
+  ctx->parm_map = splay_tree_new (splay_tree_compare_pointers, 0, 0);
   ctx->default_kind = OMP_CLAUSE_DEFAULT_SHARED;
   ctx->record_type = lang_hooks.types.make_type (RECORD_TYPE);
   name = create_tmp_var_name (".omp_data_t");
@@ -2621,8 +2681,11 @@ scan_omp_target (gomp_target *stmt, omp_context *outer_ctx)
   bool base_pointers_restrict = false;
   if (offloaded)
     {
-      create_omp_child_function (ctx, false);
-      gimple_omp_target_set_child_fn (stmt, ctx->cb.dst_fn);
+      if (!is_oacc_parallel (ctx))
+	{
+	  create_omp_child_function (ctx, false);
+	  gimple_omp_target_set_child_fn (stmt, ctx->cb.dst_fn);
+	}
 
       base_pointers_restrict = omp_target_base_pointers_restrict_p (clauses);
       if (base_pointers_restrict
@@ -7921,6 +7984,18 @@ convert_from_firstprivate_int (tree var, tree orig_type, bool is_ref,
   return var;
 }
 
+static tree
+append_decl_arg (tree var, tree decl_args, omp_context *ctx)
+{
+  if (!is_oacc_parallel (ctx))
+    return NULL_TREE;
+
+  tree temp = lookup_parm (var, ctx);
+  DECL_CHAIN (temp) = decl_args;
+
+  return temp;
+}
+
 /* Lower the GIMPLE_OMP_TARGET in the current statement
    in GSI_P.  CTX holds context information for the directive.  */
 
@@ -7934,7 +8009,7 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
   gimple_seq tgt_body, olist, ilist, fplist, new_body;
   location_t loc = gimple_location (stmt);
   bool offloaded, data_region;
-  unsigned int map_cnt = 0;
+  unsigned int map_cnt = 0, init_cnt = 0;
 
   offloaded = is_gimple_omp_offloaded (stmt);
   switch (gimple_omp_target_kind (stmt))
@@ -7980,11 +8055,83 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
     }
   else if (data_region)
     tgt_body = gimple_omp_body (stmt);
-  child_fn = ctx->cb.dst_fn;
 
   push_gimplify_context ();
   fplist = NULL;
 
+  /* Determine init_cnt to finish initializing ctx.  */
+
+  if (is_oacc_parallel (ctx))
+    {
+      for (c = clauses; c ; c = OMP_CLAUSE_CHAIN (c))
+	switch (OMP_CLAUSE_CODE (c))
+	  {
+	    tree var;
+
+	  default:
+	    break;
+	  case OMP_CLAUSE_MAP:
+	  case OMP_CLAUSE_TO:
+	  case OMP_CLAUSE_FROM:
+	  init_oacc_firstprivate:
+	    var = OMP_CLAUSE_DECL (c);
+	    if (!DECL_P (var))
+	      {
+		if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_MAP
+		    || (!OMP_CLAUSE_MAP_ZERO_BIAS_ARRAY_SECTION (c)
+			&& (OMP_CLAUSE_MAP_KIND (c)
+			    != GOMP_MAP_FIRSTPRIVATE_POINTER)))
+		  init_cnt++;
+		continue;
+	      }
+
+	    if (DECL_SIZE (var)
+		&& TREE_CODE (DECL_SIZE (var)) != INTEGER_CST)
+	      {
+		tree var2 = DECL_VALUE_EXPR (var);
+		gcc_assert (TREE_CODE (var2) == INDIRECT_REF);
+		var2 = TREE_OPERAND (var2, 0);
+		gcc_assert (DECL_P (var2));
+		var = var2;
+	      }
+
+	    if (offloaded
+		&& OMP_CLAUSE_CODE (c) == OMP_CLAUSE_MAP
+		&& (OMP_CLAUSE_MAP_KIND (c) == GOMP_MAP_FIRSTPRIVATE_POINTER
+		    || (OMP_CLAUSE_MAP_KIND (c)
+			== GOMP_MAP_FIRSTPRIVATE_REFERENCE)))
+	      {
+		continue;
+	      }
+
+	    if (!maybe_lookup_field (var, ctx))
+	      continue;
+
+	    init_cnt++;
+	    break;
+
+	  case OMP_CLAUSE_FIRSTPRIVATE:
+	    if (is_oacc_parallel (ctx))
+	      goto init_oacc_firstprivate;
+	    init_cnt++;
+	    break;
+
+	  case OMP_CLAUSE_USE_DEVICE_PTR:
+	  case OMP_CLAUSE_IS_DEVICE_PTR:
+	    init_cnt++;
+	    break;
+	  }
+
+      /* Initialize the offloaded child function.  */
+
+      create_omp_child_function (ctx, false, init_cnt);
+      gimple_omp_target_set_child_fn (stmt, ctx->cb.dst_fn);
+    }
+
+  child_fn = ctx->cb.dst_fn;
+
+  /* Clause Pass 1: Scan and prepare sender decls VALUE_EXPRs for
+     usage on the child function.  */
   for (c = clauses; c ; c = OMP_CLAUSE_CHAIN (c))
     switch (OMP_CLAUSE_CODE (c))
       {
@@ -8247,6 +8394,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   if (offloaded)
     {
+      if (is_oacc_parallel (ctx))
+	gcc_assert (init_cnt == map_cnt);
       target_nesting_level++;
       lower_omp (&tgt_body, ctx);
       target_nesting_level--;
@@ -8293,6 +8442,7 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
       vec_alloc (vsize, map_cnt);
       vec_alloc (vkind, map_cnt);
       unsigned int map_idx = 0;
+      tree decl_args = NULL_TREE;
 
       for (c = clauses; c ; c = OMP_CLAUSE_CHAIN (c))
 	switch (OMP_CLAUSE_CODE (c))
@@ -8488,6 +8638,7 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 	    if (s == NULL_TREE)
 	      s = integer_one_node;
 	    s = fold_convert (size_type_node, s);
+	    decl_args = append_decl_arg (ovar, decl_args, ctx);
 	    purpose = size_int (map_idx++);
 	    CONSTRUCTOR_APPEND_ELT (vsize, purpose, s);
 	    if (TREE_CODE (s) != INTEGER_CST)
@@ -8628,6 +8779,7 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 	    else
 	      s = TYPE_SIZE_UNIT (TREE_TYPE (ovar));
 	    s = fold_convert (size_type_node, s);
+	    decl_args = append_decl_arg (ovar, decl_args, ctx);
 	    purpose = size_int (map_idx++);
 	    CONSTRUCTOR_APPEND_ELT (vsize, purpose, s);
 	    if (TREE_CODE (s) != INTEGER_CST)
@@ -8667,6 +8819,7 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 	      }
 	    gimplify_assign (x, var, &ilist);
 	    s = size_int (0);
+	    decl_args = append_decl_arg (ovar, decl_args, ctx);
 	    purpose = size_int (map_idx++);
 	    CONSTRUCTOR_APPEND_ELT (vsize, purpose, s);
 	    gcc_checking_assert (tkind
@@ -8679,6 +8832,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 	  }
 
       gcc_assert (map_idx == map_cnt);
+      if (is_oacc_parallel (ctx))
+	DECL_ARGUMENTS (child_fn) = nreverse (decl_args);
 
       DECL_INITIAL (TREE_VEC_ELT (t, 1))
 	= build_constructor (TREE_TYPE (TREE_VEC_ELT (t, 1)), vsize);
@@ -8717,9 +8872,12 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
     {
       t = build_fold_addr_expr_loc (loc, ctx->sender_decl);
       /* fixup_child_record_type might have changed receiver_decl's type.  */
-      t = fold_convert_loc (loc, TREE_TYPE (ctx->receiver_decl), t);
-      gimple_seq_add_stmt (&new_body,
-	  		   gimple_build_assign (ctx->receiver_decl, t));
+      if (!is_oacc_parallel (ctx))
+	{
+	  t = fold_convert_loc (loc, TREE_TYPE (ctx->receiver_decl), t);
+	  gimple_seq_add_stmt (&new_body,
+			       gimple_build_assign (ctx->receiver_decl, t));
+	}
     }
   gimple_seq_add_seq (&new_body, fplist);
 
diff --git a/gcc/tree-ssa-structalias.c b/gcc/tree-ssa-structalias.c
index aab6821e792..c23ddeb9c86 100644
--- a/gcc/tree-ssa-structalias.c
+++ b/gcc/tree-ssa-structalias.c
@@ -4618,6 +4618,7 @@ find_func_aliases_for_builtin_call (struct function *fn, gcall *t)
       case BUILT_IN_GOMP_PARALLEL:
       case BUILT_IN_GOACC_PARALLEL:
 	{
+	  bool oacc_parallel = false;
 	  if (in_ipa_mode)
 	    {
 	      unsigned int fnpos, argpos;
@@ -4631,13 +4632,17 @@ find_func_aliases_for_builtin_call (struct function *fn, gcall *t)
 		case BUILT_IN_GOACC_PARALLEL:
 		  /* __builtin_GOACC_parallel (device, fn, mapnum, hostaddrs,
 					       sizes, kinds, ...).  */
-		  fnpos = 1;
-		  argpos = 3;
+		  fnpos = 2;
+		  argpos = 4;
+		  oacc_parallel = gimple_call_arg (t, 1) == integer_one_node;
 		  break;
 		default:
 		  gcc_unreachable ();
 		}
 
+	      if (oacc_parallel)
+		break;
+
 	      tree fnarg = gimple_call_arg (t, fnpos);
 	      gcc_assert (TREE_CODE (fnarg) == ADDR_EXPR);
 	      tree fndecl = TREE_OPERAND (fnarg, 0);
@@ -5195,6 +5200,7 @@ find_func_clobbers (struct function *fn, gimple *origt)
 	      unsigned int fnpos, argpos;
 	      unsigned int implicit_use_args[2];
 	      unsigned int num_implicit_use_args = 0;
+	      bool oacc_parallel = false;
 	      switch (DECL_FUNCTION_CODE (decl))
 		{
 		case BUILT_IN_GOMP_PARALLEL:
@@ -5205,15 +5211,19 @@ find_func_clobbers (struct function *fn, gimple *origt)
 		case BUILT_IN_GOACC_PARALLEL:
 		  /* __builtin_GOACC_parallel (device, fn, mapnum, hostaddrs,
 					       sizes, kinds, ...).  */
-		  fnpos = 1;
-		  argpos = 3;
-		  implicit_use_args[num_implicit_use_args++] = 4;
+		  fnpos = 2;
+		  argpos = 4;
 		  implicit_use_args[num_implicit_use_args++] = 5;
+		  implicit_use_args[num_implicit_use_args++] = 6;
+		  oacc_parallel = gimple_call_arg (t, 1) == integer_one_node;
 		  break;
 		default:
 		  gcc_unreachable ();
 		}
 
+	      if (oacc_parallel)
+		break;
+
 	      tree fnarg = gimple_call_arg (t, fnpos);
 	      gcc_assert (TREE_CODE (fnarg) == ADDR_EXPR);
 	      tree fndecl = TREE_OPERAND (fnarg, 0);
@@ -7968,7 +7978,7 @@ ipa_pta_execute (void)
 		if (gimple_call_builtin_p (stmt, BUILT_IN_GOMP_PARALLEL))
 		  called_decl = TREE_OPERAND (gimple_call_arg (stmt, 0), 0);
 		else if (gimple_call_builtin_p (stmt, BUILT_IN_GOACC_PARALLEL))
-		  called_decl = TREE_OPERAND (gimple_call_arg (stmt, 1), 0);
+		  called_decl = TREE_OPERAND (gimple_call_arg (stmt, 2), 0);
 
 		if (called_decl != NULL_TREE
 		    && !fndecl_maybe_in_other_partition (called_decl))
diff --git a/libgomp/Makefile.am b/libgomp/Makefile.am
index 99ad2fd456d..4de30914d3d 100644
--- a/libgomp/Makefile.am
+++ b/libgomp/Makefile.am
@@ -13,9 +13,16 @@ search_path = $(addprefix $(top_srcdir)/config/, $(config_path)) $(top_srcdir) \
 fincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)$(MULTISUBDIR)/finclude
 libsubincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)/include
 
+LIBFFI = @LIBFFI@
+LIBFFIINCS = @LIBFFIINCS@
+
+if USE_LIBFFI
+libgomp_la_LIBADD = $(LIBFFI)
+endif
+
 vpath % $(strip $(search_path))
 
-AM_CPPFLAGS = $(addprefix -I, $(search_path))
+AM_CPPFLAGS = $(addprefix -I, $(search_path)) $(LIBFFIINCS)
 AM_CFLAGS = $(XCFLAGS)
 AM_LDFLAGS = $(XLDFLAGS) $(SECTION_LDFLAGS) $(OPT_LDFLAGS)
 
diff --git a/libgomp/Makefile.in b/libgomp/Makefile.in
index 7a84b5681e1..617615d4d52 100644
--- a/libgomp/Makefile.in
+++ b/libgomp/Makefile.in
@@ -171,7 +171,6 @@ libgomp_plugin_nvptx_la_LINK = $(LIBTOOL) --tag=CC \
 	$(libgomp_plugin_nvptx_la_LDFLAGS) $(LDFLAGS) -o $@
 @PLUGIN_NVPTX_TRUE@am_libgomp_plugin_nvptx_la_rpath = -rpath \
 @PLUGIN_NVPTX_TRUE@	$(toolexeclibdir)
-libgomp_la_LIBADD =
 @USE_FORTRAN_TRUE@am__objects_1 = openacc.lo
 am_libgomp_la_OBJECTS = alloc.lo atomic.lo barrier.lo critical.lo \
 	env.lo error.lo icv.lo icv-device.lo iter.lo iter_ull.lo \
@@ -279,6 +278,8 @@ INSTALL_SCRIPT = @INSTALL_SCRIPT@
 INSTALL_STRIP_PROGRAM = @INSTALL_STRIP_PROGRAM@
 LD = @LD@
 LDFLAGS = @LDFLAGS@
+LIBFFI = @LIBFFI@
+LIBFFIINCS = @LIBFFIINCS@
 LIBOBJS = @LIBOBJS@
 LIBS = @LIBS@
 LIBTOOL = @LIBTOOL@
@@ -410,7 +411,8 @@ search_path = $(addprefix $(top_srcdir)/config/, $(config_path)) $(top_srcdir) \
 
 fincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)$(MULTISUBDIR)/finclude
 libsubincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)/include
-AM_CPPFLAGS = $(addprefix -I, $(search_path))
+libgomp_la_LIBADD = $(LIBFFI)
+AM_CPPFLAGS = $(addprefix -I, $(search_path)) $(LIBFFIINCS)
 AM_CFLAGS = $(XCFLAGS)
 AM_LDFLAGS = $(XLDFLAGS) $(SECTION_LDFLAGS) $(OPT_LDFLAGS)
 toolexeclib_LTLIBRARIES = libgomp.la $(am__append_1) $(am__append_2)
diff --git a/libgomp/config.h.in b/libgomp/config.h.in
index 2f45aa74bbe..65e01c5376a 100644
--- a/libgomp/config.h.in
+++ b/libgomp/config.h.in
@@ -189,5 +189,8 @@
 /* Define to 1 if the target use emutls for thread-local storage. */
 #undef USE_EMUTLS
 
+/* Define to 1 if the target requires libffi to call the offloaded functions. */
+#undef USE_LIBFFI
+
 /* Version number of package */
 #undef VERSION
diff --git a/libgomp/configure b/libgomp/configure
index 11f5b0b1e1c..cc24a81372e 100755
--- a/libgomp/configure
+++ b/libgomp/configure
@@ -649,6 +649,10 @@ PLUGIN_NVPTX
 CUDA_DRIVER_LIB
 CUDA_DRIVER_INCLUDE
 offload_targets
+USE_LIBFFI_FALSE
+USE_LIBFFI_TRUE
+LIBFFIINCS
+LIBFFI
 libtool_VERSION
 ac_ct_FC
 FCFLAGS
@@ -2655,7 +2659,6 @@ else
 fi
 
 
-
 # -------
 # -------
 
@@ -11155,7 +11158,7 @@ else
   lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
   lt_status=$lt_dlunknown
   cat > conftest.$ac_ext <<_LT_EOF
-#line 11158 "configure"
+#line 11161 "configure"
 #include "confdefs.h"
 
 #if HAVE_DLFCN_H
@@ -11261,7 +11264,7 @@ else
   lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
   lt_status=$lt_dlunknown
   cat > conftest.$ac_ext <<_LT_EOF
-#line 11264 "configure"
+#line 11267 "configure"
 #include "confdefs.h"
 
 #if HAVE_DLFCN_H
@@ -15137,6 +15140,28 @@ $as_echo "#define LIBGOMP_OFFLOADED_ONLY 1" >>confdefs.h
 
 fi
 
+# Prepare libffi when necessary.
+
+LIBFFI=
+LIBFFIINCS=
+if test -d ../libffi; then
+
+$as_echo "#define USE_LIBFFI 1" >>confdefs.h
+
+   LIBFFI=../libffi/libffi_convenience.la
+   LIBFFIINCS='-I$(top_srcdir)/../libffi/include -I../libffi/include'
+fi
+
+
+ if test -d ../libffi; then
+  USE_LIBFFI_TRUE=
+  USE_LIBFFI_FALSE='#'
+else
+  USE_LIBFFI_TRUE='#'
+  USE_LIBFFI_FALSE=
+fi
+
+
 # Plugins for offload execution, configure.ac fragment.  -*- mode: autoconf -*-
 #
 # Copyright (C) 2014-2017 Free Software Foundation, Inc.
@@ -16960,6 +16985,10 @@ if test -z "${MAINTAINER_MODE_TRUE}" && test -z "${MAINTAINER_MODE_FALSE}"; then
   as_fn_error "conditional \"MAINTAINER_MODE\" was never defined.
 Usually this means the macro was only invoked conditionally." "$LINENO" 5
 fi
+if test -z "${USE_LIBFFI_TRUE}" && test -z "${USE_LIBFFI_FALSE}"; then
+  as_fn_error "conditional \"USE_LIBFFI\" was never defined.
+Usually this means the macro was only invoked conditionally." "$LINENO" 5
+fi
 if test -z "${PLUGIN_NVPTX_TRUE}" && test -z "${PLUGIN_NVPTX_FALSE}"; then
   as_fn_error "conditional \"PLUGIN_NVPTX\" was never defined.
 Usually this means the macro was only invoked conditionally." "$LINENO" 5
diff --git a/libgomp/configure.ac b/libgomp/configure.ac
index a42d4f08b4b..aa49577537e 100644
--- a/libgomp/configure.ac
+++ b/libgomp/configure.ac
@@ -28,7 +28,6 @@ LIBGOMP_ENABLE(generated-files-in-srcdir, no, ,
 AC_MSG_RESULT($enable_generated_files_in_srcdir)
 AM_CONDITIONAL(GENINSRC, test "$enable_generated_files_in_srcdir" = yes)
 
-
 # -------
 # -------
 
@@ -215,6 +214,19 @@ if test x$libgomp_offloaded_only = xyes; then
             [Define to 1 if building libgomp for an accelerator-only target.])
 fi
 
+# Prepare libffi when necessary.
+
+LIBFFI=
+LIBFFIINCS=
+if test -d ../libffi; then
+   AC_DEFINE(USE_LIBFFI, 1, [Define if we're to use libffi.])
+   LIBFFI=../libffi/libffi_convenience.la
+   LIBFFIINCS='-I$(top_srcdir)/../libffi/include -I../libffi/include'
+fi
+AC_SUBST(LIBFFI)
+AC_SUBST(LIBFFIINCS)
+AM_CONDITIONAL([USE_LIBFFI], [test -d ../libffi])
+
 m4_include([plugin/configfrag.ac])
 
 # Check for functions needed.
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index c025069b457..44097cfd56a 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -119,6 +119,13 @@ extern void GOMP_OFFLOAD_openacc_exec (void (*) (void *), size_t, void **,
 extern void GOMP_OFFLOAD_openacc_async_exec (void (*) (void *), size_t, void **,
 					     void **, unsigned *, void *,
 					     struct goacc_asyncqueue *);
+extern void GOMP_OFFLOAD_openacc_exec_params (void (*) (void *), size_t,
+					      void **, void **, unsigned *,
+					      void *);
+extern void GOMP_OFFLOAD_openacc_async_exec_params (void (*) (void *), size_t,
+						    void **, void **,
+						    unsigned *, void *,
+						    struct goacc_asyncqueue *);
 extern struct goacc_asyncqueue *GOMP_OFFLOAD_openacc_async_construct (void);
 extern bool GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *);
 extern int GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 59e7ca8b8c8..a31c83cc656 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -885,6 +885,7 @@ typedef struct acc_dispatch_t
 
   /* Execute.  */
   __typeof (GOMP_OFFLOAD_openacc_exec) *exec_func;
+  __typeof (GOMP_OFFLOAD_openacc_exec_params) *exec_params_func;
 
   struct {
     gomp_mutex_t lock;
@@ -900,6 +901,7 @@ typedef struct acc_dispatch_t
     __typeof (GOMP_OFFLOAD_openacc_async_queue_callback) *queue_callback_func;
 
     __typeof (GOMP_OFFLOAD_openacc_async_exec) *exec_func;
+    __typeof (GOMP_OFFLOAD_openacc_async_exec_params) *exec_params_func;
     __typeof (GOMP_OFFLOAD_openacc_async_host2dev) *host2dev_func;
     __typeof (GOMP_OFFLOAD_openacc_async_dev2host) *dev2host_func;
   } async;
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 546ac929a0e..7a49acc1dfe 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -461,8 +461,10 @@ GOACC_2.0.1 {
 GOACC_2.0.GOMP_4_BRANCH {
   global:
 	GOMP_set_offload_targets;
+	GOACC_parallel_keyed_v2;
 } GOACC_2.0.1;
 
+
 GOMP_PLUGIN_1.0 {
   global:
 	GOMP_PLUGIN_malloc;
diff --git a/libgomp/libgomp_g.h b/libgomp/libgomp_g.h
index 958ca6e9cc3..c40e67f2e80 100644
--- a/libgomp/libgomp_g.h
+++ b/libgomp/libgomp_g.h
@@ -298,6 +298,8 @@ extern void GOMP_teams (unsigned int, unsigned int);
 
 extern void GOACC_parallel_keyed (int, void (*) (void *), size_t,
 				  void **, size_t *, unsigned short *, ...);
+extern void GOACC_parallel_keyed_v2 (int, int, void (*) (void *), size_t,
+				  void **, size_t *, unsigned short *, ...);
 extern void GOACC_parallel (int, void (*) (void *), size_t, void **, size_t *,
 			    unsigned short *, int, int, int, int, int, ...);
 extern void GOACC_data_start (int, size_t, void **, size_t *,
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 3b2cafb2c55..5b4e34d7190 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -158,6 +158,30 @@ host_openacc_async_exec (void (*fn) (void *),
   fn (hostaddrs);
 }
 
+static void
+host_openacc_exec_params (void (*fn) (void *),
+			  size_t mapnum __attribute__ ((unused)),
+			  void **hostaddrs,
+			  void **devaddrs __attribute__ ((unused)),
+			  unsigned *dims __attribute__ ((unused)),
+			  void *targ_mem_desc __attribute__ ((unused)))
+{
+  fn (hostaddrs);
+}
+
+static void
+host_openacc_async_exec_params (void (*fn) (void *),
+				size_t mapnum __attribute__ ((unused)),
+				void **hostaddrs,
+				void **devaddrs __attribute__ ((unused)),
+				unsigned *dims __attribute__ ((unused)),
+				void *targ_mem_desc __attribute__ ((unused)),
+				struct goacc_asyncqueue *aq __attribute__ ((unused)))
+{
+  fn (hostaddrs);
+}
+
+
 static int
 host_openacc_async_test (struct goacc_asyncqueue *aq __attribute__ ((unused)))
 {
@@ -265,6 +289,7 @@ static struct gomp_device_descr host_dispatch =
       .data_environ = NULL,
 
       .exec_func = host_openacc_exec,
+      .exec_params_func = host_openacc_exec_params,
 
       .async = {
 	.construct_func = host_openacc_async_construct,
@@ -274,6 +299,7 @@ static struct gomp_device_descr host_dispatch =
 	.serialize_func = host_openacc_async_serialize,
 	.queue_callback_func = host_openacc_async_queue_callback,
 	.exec_func = host_openacc_async_exec,
+	.exec_params_func = host_openacc_async_exec_params,
 	.dev2host_func = host_openacc_async_dev2host,
 	.host2dev_func = host_openacc_async_host2dev,
       },
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index 1172d739ec7..3c5aa24b5f5 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -31,6 +31,9 @@
 #include "libgomp_g.h"
 #include "gomp-constants.h"
 #include "oacc-int.h"
+#if USE_LIBFFI
+# include "ffi.h"
+#endif
 #ifdef HAVE_INTTYPES_H
 # include <inttypes.h>  /* For PRIu64.  */
 #endif
@@ -104,19 +107,47 @@ handle_ftn_pointers (size_t mapnum, void **hostaddrs, size_t *sizes,
 
 static void goacc_wait (int async, int num_waits, va_list *ap);
 
+static void
+goacc_call_host_fn (void (*fn) (void *), size_t mapnum, void **hostaddrs,
+		    int params)
+{
+#ifdef USE_LIBFFI
+  ffi_cif cif;
+  ffi_type *arg_types[mapnum];
+  void *arg_values[mapnum];
+  ffi_arg result;
+  int i;
+
+  if (params)
+    {
+      for (i = 0; i < mapnum; i++)
+	{
+	  arg_types[i] = &ffi_type_pointer;
+	  arg_values[i] = &hostaddrs[i];
+	}
+
+      if (ffi_prep_cif (&cif, FFI_DEFAULT_ABI, mapnum,
+			&ffi_type_void, arg_types) == FFI_OK)
+	ffi_call (&cif, FFI_FN (fn), &result, arg_values);
+      else
+	abort ();
+    }
+  else
+#endif
+  fn (hostaddrs);
+}
 
 /* Launch a possibly offloaded function on DEVICE.  FN is the host fn
    address.  MAPNUM, HOSTADDRS, SIZES & KINDS  describe the memory
    blocks to be copied to/from the device.  Varadic arguments are
    keyed optional parameters terminated with a zero.  */
 
-void
-GOACC_parallel_keyed (int device, void (*fn) (void *),
-		      size_t mapnum, void **hostaddrs, size_t *sizes,
-		      unsigned short *kinds, ...)
+static void
+GOACC_parallel_keyed_internal (int device, int params, void (*fn) (void *),
+			       size_t mapnum, void **hostaddrs, size_t *sizes,
+			       unsigned short *kinds, va_list *ap)
 {
   bool host_fallback = device == GOMP_DEVICE_HOST_FALLBACK;
-  va_list ap;
   struct goacc_thread *thr;
   struct gomp_device_descr *acc_dev;
   struct target_mem_desc *tgt;
@@ -206,13 +237,13 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
       prof_info.device_type = acc_device_host;
       api_info.device_type = prof_info.device_type;
       goacc_save_and_set_bind (acc_device_host);
-      fn (hostaddrs);
+      goacc_call_host_fn (fn, mapnum, hostaddrs, params);
       goacc_restore_bind ();
       goto out;
     }
   else if (acc_device_type (acc_dev->type) == acc_device_host)
     {
-      fn (hostaddrs);
+      goacc_call_host_fn (fn, mapnum, hostaddrs, params);
       goto out;
     }
   else if (profiling_dispatch_p)
@@ -222,9 +253,8 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
   for (i = 0; i != GOMP_DIM_MAX; i++)
     dims[i] = 0;
 
-  va_start (ap, kinds);
   /* TODO: This will need amending when device_type is implemented.  */
-  while ((tag = va_arg (ap, unsigned)) != 0)
+  while ((tag = va_arg (*ap, unsigned)) != 0)
     {
       if (GOMP_LAUNCH_DEVICE (tag))
 	gomp_fatal ("device_type '%d' offload parameters, libgomp is too old",
@@ -238,7 +268,7 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
 
 	    for (i = 0; i != GOMP_DIM_MAX; i++)
 	      if (mask & GOMP_DIM_MASK (i))
-		dims[i] = va_arg (ap, unsigned);
+		dims[i] = va_arg (*ap, unsigned);
 	  }
 	  break;
 
@@ -248,7 +278,7 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
 	    async = GOMP_LAUNCH_OP (tag);
 
 	    if (async == GOMP_LAUNCH_OP_MAX)
-	      async = va_arg (ap, unsigned);
+	      async = va_arg (*ap, unsigned);
 
 	    if (profiling_dispatch_p)
 	      {
@@ -267,7 +297,7 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
 	    int num_waits = ((signed short) GOMP_LAUNCH_OP (tag));
 
 	    if (num_waits > 0)
-	      goacc_wait (async, num_waits, &ap);
+	      goacc_wait (async, num_waits, ap);
 	    else if (num_waits == acc_async_noval)
 	      acc_wait_all_async (async);
 	    break;
@@ -278,7 +308,6 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
 		      " libgomp is too old", GOMP_LAUNCH_CODE (tag));
 	}
     }
-  va_end (ap);
   
   if (!(acc_dev->capabilities & GOMP_OFFLOAD_CAP_NATIVE_EXEC))
     {
@@ -338,8 +367,12 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
 
   if (aq == NULL)
     {
-      acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
-				  dims, tgt);
+      if (params)
+	acc_dev->openacc.exec_params_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+					   dims, tgt);
+      else
+	acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+				    dims, tgt);
       if (profiling_dispatch_p)
 	{
 	  prof_info.event_type = acc_ev_exit_data_start;
@@ -362,8 +395,12 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
     }
   else
     {
-      acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
-					dims, tgt, aq);
+      if (params)
+	acc_dev->openacc.async.exec_params_func (tgt_fn, mapnum, hostaddrs,
+						 devaddrs, dims, tgt, aq);
+      else
+	acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs,
+					  devaddrs, dims, tgt, aq);
       goacc_async_copyout_unmap_vars (tgt, aq);
     }
 
@@ -381,6 +418,30 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
     }
 }
 
+void
+GOACC_parallel_keyed (int device, void (*fn) (void *),
+		      size_t mapnum, void **hostaddrs, size_t *sizes,
+		      unsigned short *kinds, ...)
+{
+  va_list ap;
+  va_start (ap, kinds);
+  GOACC_parallel_keyed_internal (device, 0, fn, mapnum, hostaddrs, sizes,
+				 kinds, &ap);
+  va_end (ap);
+}
+
+void
+GOACC_parallel_keyed_v2 (int device, int args, void (*fn) (void *),
+			 size_t mapnum, void **hostaddrs, size_t *sizes,
+			 unsigned short *kinds, ...)
+{
+  va_list ap;
+  va_start (ap, kinds);
+  GOACC_parallel_keyed_internal (device, args, fn, mapnum, hostaddrs, sizes,
+				 kinds, &ap);
+  va_end (ap);
+}
+
 /* Legacy entry point, only provide host execution.  */
 
 void
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 94abfe2036f..bdc0c30e1f5 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -697,12 +697,11 @@ link_ptx (CUmodule *module, const struct targ_ptx_obj *ptx_objs,
 static void
 nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 	    unsigned *dims, void *targ_mem_desc,
-	    CUdeviceptr dp, CUstream stream)
+	    void **kargs, CUstream stream)
 {
   struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
   CUfunction function;
   int i;
-  void *kargs[1];
   int cpu_size = nvptx_thread ()->ptx_dev->max_threads_per_multiprocessor;
   int block_size = nvptx_thread ()->ptx_dev->max_threads_per_block;
   int dev_size = nvptx_thread ()->ptx_dev->num_sms;
@@ -888,7 +887,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 					    api_info);
     }
   
-  kargs[0] = &dp;
   CUDA_CALL_ASSERT (cuLaunchKernel, function,
 		    dims[GOMP_DIM_GANG], 1, 1,
 		    dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
@@ -1293,22 +1291,29 @@ GOMP_OFFLOAD_free (int ord, void *ptr)
 	  && nvptx_free (ptr, ptx_devices[ord]));
 }
 
-void
-GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
-			   void **hostaddrs, void **devaddrs,
-			   unsigned *dims, void *targ_mem_desc)
+static void
+openacc_exec_internal (void (*fn) (void *), int params, size_t mapnum,
+		       void **hostaddrs, void **devaddrs,
+		       unsigned *dims, void *targ_mem_desc)
 {
   GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
 
-  void **hp = NULL;
+  void **hp = alloca (mapnum * sizeof (void *));
   CUdeviceptr dp = 0;
 
   if (mapnum > 0)
     {
-      hp = alloca (mapnum * sizeof (void *));
-      for (int i = 0; i < mapnum; i++)
-	hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
-      CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+      if (params)
+	{
+	  for (int i = 0; i < mapnum; i++)
+	    hp[i] = (devaddrs[i] ? &devaddrs[i] : &hostaddrs[i]);
+	}
+      else
+	{
+	  for (int i = 0; i < mapnum; i++)
+	    hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
+	  CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+	}
     }
 
   /* Copy the (device) pointers to arguments to the device (dp and hp might in
@@ -1333,7 +1338,8 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
       data_event_info.data_event.var_name = NULL; //TODO
       data_event_info.data_event.bytes = mapnum * sizeof (void *);
       data_event_info.data_event.host_ptr = hp;
-      data_event_info.data_event.device_ptr = (void *) dp;
+      if (!params)
+	data_event_info.data_event.device_ptr = (void *) dp;
 
       api_info->device_api = acc_device_api_cuda;
 
@@ -1341,7 +1347,7 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
 					    api_info);
     }
 
-  if (mapnum > 0)
+  if (!params && mapnum > 0)
     CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, (void *) hp,
 		      mapnum * sizeof (void *));
 
@@ -1353,8 +1359,15 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
 					    api_info);
     }
 
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-	      dp, NULL);
+  if (params)
+    nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+		hp, NULL);
+  else
+    {
+      void *kargs[1] = { &dp };
+      nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+		  kargs, NULL);
+    }
 
   CUresult r = cuStreamSynchronize (NULL);
   const char *maybe_abort_msg = "(perhaps abort was called)";
@@ -1363,7 +1376,27 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
 		       maybe_abort_msg);
   else if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
-  CUDA_CALL_ASSERT (cuMemFree, dp);
+
+  if (!params)
+    CUDA_CALL_ASSERT (cuMemFree, dp);
+}
+
+void
+GOMP_OFFLOAD_openacc_exec_params (void (*fn) (void *), size_t mapnum,
+			   void **hostaddrs, void **devaddrs,
+			   unsigned *dims, void *targ_mem_desc)
+{
+  openacc_exec_internal (fn, 1, mapnum, hostaddrs, devaddrs, dims,
+			 targ_mem_desc);
+}
+
+void
+GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
+			   void **hostaddrs, void **devaddrs,
+			   unsigned *dims, void *targ_mem_desc)
+{
+  openacc_exec_internal (fn, 0, mapnum, hostaddrs, devaddrs, dims,
+			 targ_mem_desc);
 }
 
 static void
@@ -1374,11 +1407,11 @@ cuda_free_argmem (void *ptr)
   free (block);
 }
 
-void
-GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
-				 void **hostaddrs, void **devaddrs,
-				 unsigned *dims, void *targ_mem_desc,
-				 struct goacc_asyncqueue *aq)
+static void
+openacc_async_exec_internal (void (*fn) (void *), int params, size_t mapnum,
+			     void **hostaddrs, void **devaddrs,
+			     unsigned *dims, void *targ_mem_desc,
+			     struct goacc_asyncqueue *aq)
 {
   GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
 
@@ -1388,11 +1421,20 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
 
   if (mapnum > 0)
     {
-      block = (void **) GOMP_PLUGIN_malloc ((mapnum + 2) * sizeof (void *));
-      hp = block + 2;
-      for (int i = 0; i < mapnum; i++)
-	hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
-      CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+      if (params)
+	{
+	  hp = alloca (sizeof (void *) * mapnum);
+	  for (int i = 0; i < mapnum; i++)
+	    hp[i] = (devaddrs[i] ? &devaddrs[i] : &hostaddrs[i]);
+	}
+      else
+	{
+	  block = (void **) GOMP_PLUGIN_malloc ((mapnum + 2) * sizeof (void *));
+	  hp = block + 2;
+	  for (int i = 0; i < mapnum; i++)
+	    hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
+	  CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+	}
     }
 
   /* Copy the (device) pointers to arguments to the device (dp and hp might in
@@ -1417,7 +1459,8 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
       data_event_info.data_event.var_name = NULL; //TODO
       data_event_info.data_event.bytes = mapnum * sizeof (void *);
       data_event_info.data_event.host_ptr = hp;
-      data_event_info.data_event.device_ptr = (void *) dp;
+      if (!params)
+	data_event_info.data_event.device_ptr = (void *) dp;
 
       api_info->device_api = acc_device_api_cuda;
 
@@ -1425,7 +1468,7 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
 					    api_info);
     }
 
-  if (mapnum > 0)
+  if (!params && mapnum > 0)
     {
       CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, dp, (void *) hp,
 			mapnum * sizeof (void *), aq->cuda_stream);
@@ -1443,14 +1486,42 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
       GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
 					    api_info);
     }
-  
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-	      dp, aq->cuda_stream);
 
-  if (mapnum > 0)
+  if (params)
+    nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+		hp, aq->cuda_stream);
+  else
+    {
+      void *kargs[1] = { &dp };
+      nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+		  kargs, aq->cuda_stream);
+    }
+
+  if (!params && mapnum > 0)
     GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
 }
 
+void
+GOMP_OFFLOAD_openacc_async_exec_params (void (*fn) (void *), size_t mapnum,
+				 void **hostaddrs, void **devaddrs,
+				 unsigned *dims, void *targ_mem_desc,
+				 struct goacc_asyncqueue *aq)
+{
+  openacc_async_exec_internal (fn, 1, mapnum, hostaddrs, devaddrs, dims,
+			       targ_mem_desc, aq);
+}
+
+void
+GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
+				 void **hostaddrs, void **devaddrs,
+				 unsigned *dims, void *targ_mem_desc,
+				 struct goacc_asyncqueue *aq)
+{
+  openacc_async_exec_internal (fn, 0, mapnum, hostaddrs, devaddrs, dims,
+			       targ_mem_desc, aq);
+}
+
+
 void *
 GOMP_OFFLOAD_openacc_create_thread_data (int ord)
 {
diff --git a/libgomp/target.c b/libgomp/target.c
index 336581d2196..10c5e34f378 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2908,6 +2908,7 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
   if (device->capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
     {
       if (!DLSYM_OPT (openacc.exec, openacc_exec)
+	  || !DLSYM_OPT (openacc.exec_params, openacc_exec_params)
 	  || !DLSYM_OPT (openacc.create_thread_data,
 			 openacc_create_thread_data)
 	  || !DLSYM_OPT (openacc.destroy_thread_data,
@@ -2920,6 +2921,7 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
 	  || !DLSYM_OPT (openacc.async.queue_callback,
 			 openacc_async_queue_callback)
 	  || !DLSYM_OPT (openacc.async.exec, openacc_async_exec)
+	  || !DLSYM_OPT (openacc.async.exec_params, openacc_async_exec_params)
 	  || !DLSYM_OPT (openacc.async.dev2host, openacc_async_dev2host)
 	  || !DLSYM_OPT (openacc.async.host2dev, openacc_async_host2dev))
 	{
diff --git a/libgomp/testsuite/Makefile.in b/libgomp/testsuite/Makefile.in
index 6edb7ae7ade..4d7f43abe3d 100644
--- a/libgomp/testsuite/Makefile.in
+++ b/libgomp/testsuite/Makefile.in
@@ -120,6 +120,8 @@ INSTALL_SCRIPT = @INSTALL_SCRIPT@
 INSTALL_STRIP_PROGRAM = @INSTALL_STRIP_PROGRAM@
 LD = @LD@
 LDFLAGS = @LDFLAGS@
+LIBFFI = @LIBFFI@
+LIBFFIINCS = @LIBFFIINCS@
 LIBOBJS = @LIBOBJS@
 LIBS = @LIBS@
 LIBTOOL = @LIBTOOL@
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
index dad6d13eb60..c6abc1d724a 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
@@ -1,6 +1,11 @@
 /* This test exercises combined directives.  */
 
+/* This test falls back to host execution because struct alias
+   analysis is deactivated on OpenACC parallel regions.  Consequently,
+   parloops can no longer disambiguate arrays a and b.  */
+
 /* { dg-do run } */
+/* { dg-xfail-if "n/a" { openacc_nvidia_accel_selected } { "-O2" } { "" } } */
 
 #include <stdlib.h>
Jakub Jelinek Dec. 21, 2017, 9:52 p.m. | #2
On Thu, Dec 21, 2017 at 01:46:56PM -0800, Cesar Philippidis wrote:
> After thinking about this some more, I decided that it would be better to
> expand the offloaded function arguments into individual parameters
> during omp lowering, rather than writing a separate pass later on. I
> don't see too many disadvantages of using libffi after a pthread is
> spawned by the host. If anything, the pthread's use of libffi is
> equivalent to performing SRA by the accelerator anyway.

I'll have a look at your previous patch; I just want to say that using
libffi for this, or changing anything for non-nvptx targets, seems to be
a very bad idea to me.

	Jakub
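
For readers following the disagreement, the two child-function calling conventions can be sketched in plain C. This is only an illustration under stated assumptions: struct omp_data, child_struct, child_params and call_host_fn are hypothetical names, not code from the patch. The fixed-arity switch in call_host_fn shows why host fallback needs some generic call mechanism (libffi in the patch) once the child function takes one parameter per mapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical data mappings for a region computing y[i] += 2 * x[i].  */
struct omp_data
{
  float *x;
  float *y;
  int *n;
};

/* Existing convention: a single pointer to a struct of mappings.  */
void
child_struct (void *data)
{
  struct omp_data *d = (struct omp_data *) data;
  assert (d != NULL);
  for (int i = 0; i < *d->n; i++)
    d->y[i] += 2.0f * d->x[i];
}

/* Proposed convention: one pointer parameter per data mapping.  */
void
child_params (void *x, void *y, void *n)
{
  float *xp = (float *) x;
  float *yp = (float *) y;
  for (int i = 0; i < *(int *) n; i++)
    yp[i] += 2.0f * xp[i];
}

/* Host fallback without libffi: dispatch on the mapping count.  Only
   the arities listed here are supported; returns 0 on success and -1
   for an unsupported arity.  */
int
call_host_fn (void (*fn) (void), size_t mapnum, void **hostaddrs)
{
  switch (mapnum)
    {
    case 1:
      ((void (*) (void *)) fn) (hostaddrs[0]);
      return 0;
    case 3:
      ((void (*) (void *, void *, void *)) fn) (hostaddrs[0],
						 hostaddrs[1],
						 hostaddrs[2]);
      return 0;
    default:
      return -1;
    }
}
```

A real region can have any number of mappings, so a switch like this cannot cover every case; that limitation is what the libffi dependency is meant to address on the host side.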
Tom de Vries Feb. 6, 2018, 11:46 a.m. | #3
On 12/21/2017 10:46 PM, Cesar Philippidis wrote:

> I've committed this patch to openacc-gcc-7-branch.

> diff --git a/gcc/omp-expand.c b/gcc/omp-expand.c
> index bf1f127d8d6..f674c74ec82 100644
> --- a/gcc/omp-expand.c
> +++ b/gcc/omp-expand.c

>     offloaded = is_gimple_omp_offloaded (entry_stmt);
>     switch (gimple_omp_target_kind (entry_stmt))
>       {
> +    case GF_OMP_TARGET_KIND_OACC_PARALLEL:
> +      oacc_parallel = true;
>       case GF_OMP_TARGET_KIND_REGION:
>       case GF_OMP_TARGET_KIND_UPDATE:
>       case GF_OMP_TARGET_KIND_ENTER_DATA:
>       case GF_OMP_TARGET_KIND_EXIT_DATA:
> -    case GF_OMP_TARGET_KIND_OACC_PARALLEL:
>       case GF_OMP_TARGET_KIND_OACC_KERNELS:
>       case GF_OMP_TARGET_KIND_OACC_UPDATE:
>       case GF_OMP_TARGET_KIND_OACC_ENTER_EXIT_DATA:

This broke openacc-gcc-7-branch bootstrap:
...
gcc/omp-expand.c: In function 'void expand_omp_target(omp_region*)':
gcc/omp-expand.c:7110:21: error: this statement may fall through [-Werror=implicit-fallthrough=]
        oacc_parallel = true;
        ~~~~~~~~~~~~~~^~~~~~
gcc/omp-expand.c:7111:5: note: here
      case GF_OMP_TARGET_KIND_REGION:
      ^~~~
...

Thanks,
- Tom
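
The warning Tom quotes fires because a case label with a statement runs into the next label without an annotation. A minimal sketch of the usual remedy follows, assuming the fall-through is intentional; enum target_kind and classify are hypothetical stand-ins, not code from omp-expand.c. GCC accepts a FALLTHRU comment placed directly before the next case label as such an annotation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the GF_OMP_TARGET_KIND_* values.  */
enum target_kind
{
  KIND_OACC_PARALLEL,
  KIND_REGION,
  KIND_DATA
};

/* Return true if KIND belongs to the first case group.  Set
   *OACC_PARALLEL only for the OpenACC parallel kind.  */
bool
classify (enum target_kind kind, bool *oacc_parallel)
{
  assert (oacc_parallel != NULL);
  *oacc_parallel = false;

  switch (kind)
    {
    case KIND_OACC_PARALLEL:
      *oacc_parallel = true;
      /* FALLTHRU */
    case KIND_REGION:
      return true;
    case KIND_DATA:
      return false;
    }
  return false;
}
```

With the comment in place, -Wimplicit-fallthrough no longer diagnoses the first case, so a -Werror bootstrap would not stop at that statement.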

Patch

2017-12-18  Cesar Philippidis  <cesar@codesourcery.com>

	* Makefile.def: Make libgomp depend on libffi.
	* configure.ac: Likewise.
	* Makefile.in: Regenerate.
	* configure: Regenerate.

	gcc/fortran/
	* types.def (BT_FN_VOID_INT_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR):
	Define.

	gcc/
	* builtin-types.def (BT_FN_VOID_INT_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR):
	Define.
	* config/nvptx/nvptx.c (nvptx_expand_cmp_swap): Handle PARM_DECLs.
	* omp-builtins.def (BUILT_IN_GOACC_PARALLEL): Call
	GOACC_parallel_keyed_v2.
	* omp-expand.c (expand_omp_target): Update call to
	BUILT_IN_GOACC_PARALLEL.
	* omp-low.c (struct omp_context): Add parm_map member.
	(lookup_parm): New function.
	(build_receiver_ref): Lookup parm_map decls.
	(install_parm_decl): New function.
	(install_var_field): Install parm_map decl for OpenACC parallel region
	data clauses.
	(delete_omp_context): Clean parm_map.
	(scan_sharing_clauses): Install subarray variable mapping into parm_map.
	(create_omp_child_function): Defer creation of child function for
	OpenACC parallel regions.
	(scan_omp_target): Likewise.
	(append_decl_arg): New function.
	(lower_omp_target): Create a child offloaded function using one
	parameter per data mapping for OpenACC parallel regions.
	* tree-ssa-structalias.c (find_func_aliases_for_builtin_call):
	Ignore OpenACC parallel regions.
	(find_func_clobbers): Likewise.
	(ipa_pta_execute): Likewise.

	libgomp/
	* Makefile.am: Add libffi build dependency.
	* configure.ac: Likewise.
	* Makefile.in: Regenerate.
	* config.h.in: Regenerate.
	* configure: Regenerate.
	* libgomp-plugin.h: Define GOMP_OFFLOAD_openacc_exec_params and
	GOMP_OFFLOAD_openacc_async_exec_params.
	* libgomp.h (acc_dispatch_t): Use them here.
	* libgomp.map (GOACC_parallel_keyed_v2): Declare.
	* libgomp_g.h (GOACC_parallel_keyed_v2): Likewise.
	* oacc-host.c (host_openacc_exec_params): New function.
	(host_openacc_async_exec_params): Likewise.
	* oacc-parallel.c (goacc_call_host_fn): Likewise.
	(GOACC_parallel_keyed_internal): Likewise.
	(GOACC_parallel_keyed): Wrapper for GOACC_parallel_keyed_internal.
	(GOACC_parallel_keyed_v2): Likewise.
	* plugin/plugin-nvptx.c (nvptx_exec): Replace CUDeviceptr dp parameter
	with void **kargs.
	(GOMP_OFFLOAD_openacc_exec_params): New function.
	(GOMP_OFFLOAD_openacc_exec): Update call to nvptx_exec.
	(cuda_free_argmem_params): New function.
	(GOMP_OFFLOAD_openacc_async_exec_params): Likewise.
	(GOMP_OFFLOAD_openacc_async_exec): Update call to nvptx_exec.
	* target.c (gomp_load_plugin_for_device): Handle
	openacc_exec_params and openacc_async_exec_params.
	* testsuite/Makefile.in: Regenerate.
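
The append_decl_arg and lower_omp_target entries above describe how the child function's parameter list is assembled: in the patch, each mapping's PARM_DECL is pushed onto the front of a DECL_CHAIN list, and the list is reversed with nreverse once all clauses have been visited so that the parameters follow mapping order. A minimal sketch of that prepend-then-reverse idiom, using a hypothetical struct parm in place of GCC trees:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a PARM_DECL; 'chain' plays the role of
   DECL_CHAIN.  */
struct parm
{
  const char *name;
  struct parm *chain;
};

/* Push P onto the front of ARGS and return the new head, mirroring
   what append_decl_arg does with DECL_CHAIN.  */
struct parm *
prepend_parm (struct parm *p, struct parm *args)
{
  assert (p != NULL);
  p->chain = args;
  return p;
}

/* Reverse the chain in place and return the new head, in the spirit
   of GCC's nreverse.  */
struct parm *
reverse_parms (struct parm *args)
{
  struct parm *prev = NULL;
  while (args != NULL)
    {
      struct parm *next = args->chain;
      args->chain = prev;
      prev = args;
      args = next;
    }
  return prev;
}
```

Prepending keeps each insertion constant-time, and a single reversal at the end restores the order in which the mappings were visited.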


diff --git a/Makefile.def b/Makefile.def
index abfa9efe959..5e94062fa75 100644
--- a/Makefile.def
+++ b/Makefile.def
@@ -550,6 +550,7 @@  dependencies = { module=configure-target-libgo; on=all-target-libstdc++-v3; };
 dependencies = { module=all-target-libgo; on=all-target-libbacktrace; };
 dependencies = { module=all-target-libgo; on=all-target-libffi; };
 dependencies = { module=all-target-libgo; on=all-target-libatomic; };
+dependencies = { module=configure-target-libgomp; on=configure-target-libffi; };
 dependencies = { module=configure-target-libstdc++-v3; on=configure-target-libgomp; };
 dependencies = { module=configure-target-liboffloadmic; on=configure-target-libgomp; };
 dependencies = { module=configure-target-libsanitizer; on=all-target-libstdc++-v3; };
@@ -564,6 +565,7 @@  dependencies = { module=install-target-libgo; on=install-target-libatomic; };
 dependencies = { module=install-target-libgfortran; on=install-target-libquadmath; };
 dependencies = { module=install-target-libgfortran; on=install-target-libgcc; };
 dependencies = { module=install-target-libsanitizer; on=install-target-libstdc++-v3; };
+dependencies = { module=install-target-libgomp; on=install-target-libffi; };
 dependencies = { module=install-target-libsanitizer; on=install-target-libgcc; };
 dependencies = { module=install-target-libvtv; on=install-target-libstdc++-v3; };
 dependencies = { module=install-target-libvtv; on=install-target-libgcc; };
diff --git a/Makefile.in b/Makefile.in
index b824e0a0ca1..9b4497e3943 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -55803,6 +55803,7 @@  configure-target-libgo: maybe-all-target-libstdc++-v3
 all-target-libgo: maybe-all-target-libbacktrace
 all-target-libgo: maybe-all-target-libffi
 all-target-libgo: maybe-all-target-libatomic
+configure-target-libgomp: maybe-configure-target-libffi
 configure-target-libstdc++-v3: maybe-configure-target-libgomp
 
 configure-stage1-target-libstdc++-v3: maybe-configure-stage1-target-libgomp
@@ -55849,6 +55850,7 @@  install-target-libgo: maybe-install-target-libatomic
 install-target-libgfortran: maybe-install-target-libquadmath
 install-target-libgfortran: maybe-install-target-libgcc
 install-target-libsanitizer: maybe-install-target-libstdc++-v3
+install-target-libgomp: maybe-install-target-libffi
 install-target-libsanitizer: maybe-install-target-libgcc
 install-target-libvtv: maybe-install-target-libstdc++-v3
 install-target-libvtv: maybe-install-target-libgcc
diff --git a/configure b/configure
index 32a38633ad8..ed47944d8f9 100755
--- a/configure
+++ b/configure
@@ -3472,11 +3472,19 @@  case "${target}" in
   ft32-*-*)
     noconfigdirs="$noconfigdirs target-libffi"
     ;;
+  nvptx-*-*)
+    noconfigdirs="$noconfigdirs target-libffi"
+    ;;
   *-*-lynxos*)
     noconfigdirs="$noconfigdirs target-libffi"
     ;;
 esac
 
+libgomp_deps="target-libffi"
+if echo " ${noconfigdirs} " | grep " target-libffi " > /dev/null 2>&1 ; then
+   libgomp_deps=""
+fi
+
 # Disable the go frontend on systems where it is known to not work. Please keep
 # this in sync with contrib/config-list.mk.
 case "${target}" in
@@ -6460,6 +6468,15 @@  esac
 # $build_configdirs and $target_configdirs.
 # If we have the source for $noconfigdirs entries, add them to $notsupp.
 
+# libgomp depends on libffi.  Remove it from noconfigdirs if necessary.
+if ! (echo " $noconfigdirs " | grep " target-libgomp " >/dev/null 2>&1); then
+  if echo " $noconfigdirs " | grep " target-libffi " >/dev/null 2>&1; then
+    if test "x${libgomp_deps}" != x; then
+      noconfigdirs=`echo " $noconfigdirs " | sed -e "s/ target-libffi / /"`
+    fi
+  fi
+fi
+
 notsupp=""
 for dir in . $skipdirs $noconfigdirs ; do
   dirname=`echo $dir | sed -e s/target-//g -e s/build-//g`
diff --git a/configure.ac b/configure.ac
index 12377499295..a3b9e116a05 100644
--- a/configure.ac
+++ b/configure.ac
@@ -800,11 +800,19 @@  case "${target}" in
   ft32-*-*)
     noconfigdirs="$noconfigdirs target-libffi"
     ;;
+  nvptx-*-*)
+    noconfigdirs="$noconfigdirs target-libffi"
+    ;;
   *-*-lynxos*)
     noconfigdirs="$noconfigdirs target-libffi"
     ;;
 esac
 
+libgomp_deps="target-libffi"
+if echo " ${noconfigdirs} " | grep " target-libffi " > /dev/null 2>&1 ; then
+   libgomp_deps=""
+fi
+
 # Disable the go frontend on systems where it is known to not work. Please keep
 # this in sync with contrib/config-list.mk.
 case "${target}" in
@@ -2127,6 +2135,15 @@  esac
 # $build_configdirs and $target_configdirs.
 # If we have the source for $noconfigdirs entries, add them to $notsupp.
 
+# libgomp depends on libffi.  Remove it from noconfigdirs if necessary.
+if ! (echo " $noconfigdirs " | grep " target-libgomp " >/dev/null 2>&1); then
+  if echo " $noconfigdirs " | grep " target-libffi " >/dev/null 2>&1; then
+    if test "x${libgomp_deps}" != x; then
+      noconfigdirs=`echo " $noconfigdirs " | sed -e "s/ target-libffi / /"`
+    fi
+  fi
+fi
+
 notsupp=""
 for dir in . $skipdirs $noconfigdirs ; do
   dirname=`echo $dir | sed -e s/target-//g -e s/build-//g`
diff --git a/gcc/builtin-types.def b/gcc/builtin-types.def
index ac9894467ec..7f647c65162 100644
--- a/gcc/builtin-types.def
+++ b/gcc/builtin-types.def
@@ -763,6 +763,10 @@  DEF_FUNCTION_TYPE_VAR_6 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
 			 BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE,
 			 BT_PTR, BT_PTR, BT_PTR)
 
+DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
+			 BT_VOID, BT_INT, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE,
+			 BT_PTR, BT_PTR, BT_PTR)
+
 DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_INT_INT_VAR,
 			 BT_VOID, BT_INT, BT_SIZE, BT_PTR, BT_PTR,
 			 BT_PTR, BT_INT, BT_INT)
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index dfb27efe704..8bc934f163b 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4731,6 +4731,10 @@  nvptx_expand_cmp_swap (tree exp, rtx target,
 			 NULL_RTX, mode, EXPAND_NORMAL);
   rtx pat;
 
+  /* 'mem' might be a PARM_DECL.  If so, convert it to a register.  */
+  if (!REG_P (mem))
+    mem = copy_to_mode_reg (GET_MODE (mem), mem);
+
   mem = gen_rtx_MEM (mode, mem);
   if (!REG_P (cmp))
     cmp = copy_to_mode_reg (mode, cmp);
diff --git a/gcc/fortran/types.def b/gcc/fortran/types.def
index 1f8a5a1277c..3c3ad69d848 100644
--- a/gcc/fortran/types.def
+++ b/gcc/fortran/types.def
@@ -252,3 +252,7 @@  DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_INT_INT_VAR,
 DEF_FUNCTION_TYPE_VAR_6 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
 			  BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE,
 			  BT_PTR, BT_PTR, BT_PTR)
+
+DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
+			  BT_VOID, BT_INT, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE,
+			  BT_PTR, BT_PTR, BT_PTR)
diff --git a/gcc/omp-builtins.def b/gcc/omp-builtins.def
index 69b73f4b8c4..a9ec667aa54 100644
--- a/gcc/omp-builtins.def
+++ b/gcc/omp-builtins.def
@@ -38,8 +38,8 @@  DEF_GOACC_BUILTIN (BUILT_IN_GOACC_DATA_END, "GOACC_data_end",
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_ENTER_EXIT_DATA, "GOACC_enter_exit_data",
 		   BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_INT_INT_VAR,
 		   ATTR_NOTHROW_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_PARALLEL, "GOACC_parallel_keyed",
-		   BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
+DEF_GOACC_BUILTIN (BUILT_IN_GOACC_PARALLEL, "GOACC_parallel_keyed_v2",
+		   BT_FN_VOID_INT_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
 		   ATTR_NOTHROW_LIST)
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_UPDATE, "GOACC_update",
 		   BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_INT_INT_VAR,
diff --git a/gcc/omp-expand.c b/gcc/omp-expand.c
index bf1f127d8d6..f674c74ec82 100644
--- a/gcc/omp-expand.c
+++ b/gcc/omp-expand.c
@@ -7097,19 +7097,21 @@  expand_omp_target (struct omp_region *region)
   gomp_target *entry_stmt;
   gimple *stmt;
   edge e;
-  bool offloaded, data_region;
+  bool offloaded, data_region, oacc_parallel;
 
   entry_stmt = as_a <gomp_target *> (last_stmt (region->entry));
   new_bb = region->entry;
+  oacc_parallel = false;
 
   offloaded = is_gimple_omp_offloaded (entry_stmt);
   switch (gimple_omp_target_kind (entry_stmt))
     {
+    case GF_OMP_TARGET_KIND_OACC_PARALLEL:
+      oacc_parallel = true;
     case GF_OMP_TARGET_KIND_REGION:
     case GF_OMP_TARGET_KIND_UPDATE:
     case GF_OMP_TARGET_KIND_ENTER_DATA:
     case GF_OMP_TARGET_KIND_EXIT_DATA:
-    case GF_OMP_TARGET_KIND_OACC_PARALLEL:
     case GF_OMP_TARGET_KIND_OACC_KERNELS:
     case GF_OMP_TARGET_KIND_OACC_UPDATE:
     case GF_OMP_TARGET_KIND_OACC_ENTER_EXIT_DATA:
@@ -7171,7 +7173,7 @@  expand_omp_target (struct omp_region *region)
 	 .OMP_DATA_I may have been converted into a different local
 	 variable.  In which case, we need to keep the assignment.  */
       tree data_arg = gimple_omp_target_data_arg (entry_stmt);
-      if (data_arg)
+      if (data_arg && !oacc_parallel)
 	{
 	  basic_block entry_succ_bb = single_succ (entry_bb);
 	  gimple_stmt_iterator gsi;
@@ -7489,6 +7491,11 @@  expand_omp_target (struct omp_region *region)
   /* The maximum number used by any start_ix, without varargs.  */
   auto_vec<tree, 11> args;
   args.quick_push (device);
+  if (start_ix == BUILT_IN_GOACC_PARALLEL)
+    {
+      tree use_params = oacc_parallel ? integer_one_node : integer_zero_node;
+      args.quick_push (use_params);
+    }
   if (offloaded)
     args.quick_push (build_fold_addr_expr (child_fn));
   args.quick_push (t1);
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index e790f0f1bb2..a2869e49ebd 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -89,6 +89,7 @@  struct omp_context
   /* Map variables to fields in a structure that allows communication
      between sending and receiving threads.  */
   splay_tree field_map;
+  splay_tree parm_map;
   tree record_type;
   tree sender_decl;
   tree receiver_decl;
@@ -321,6 +322,14 @@  maybe_lookup_decl (const_tree var, omp_context *ctx)
 }
 
 static inline tree
+lookup_parm (const_tree var, omp_context *ctx)
+{
+  splay_tree_node n;
+  n = splay_tree_lookup (ctx->parm_map, (splay_tree_key) var);
+  return (tree) n->value;
+}
+
+static inline tree
 lookup_field (tree var, omp_context *ctx)
 {
   splay_tree_node n;
@@ -501,15 +510,21 @@  build_receiver_ref (tree var, bool by_ref, omp_context *ctx)
 {
   tree x, field = lookup_field (var, ctx);
 
-  /* If the receiver record type was remapped in the child function,
-     remap the field into the new record type.  */
-  x = maybe_lookup_field (field, ctx);
-  if (x != NULL)
-    field = x;
+  if (is_oacc_parallel (ctx))
+    x = lookup_parm (var, ctx);
+  else
+    {
+      /* If the receiver record type was remapped in the child function,
+	 remap the field into the new record type.  */
+      x = maybe_lookup_field (field, ctx);
+      if (x != NULL)
+	field = x;
+
+      x = build_simple_mem_ref (ctx->receiver_decl);
+      TREE_THIS_NOTRAP (x) = 1;
+      x = omp_build_component_ref (x, field);
+    }
 
-  x = build_simple_mem_ref (ctx->receiver_decl);
-  TREE_THIS_NOTRAP (x) = 1;
-  x = omp_build_component_ref (x, field);
   if (by_ref)
     {
       x = build_simple_mem_ref (x);
@@ -644,6 +659,32 @@  build_sender_ref (tree var, omp_context *ctx)
   return build_sender_ref ((splay_tree_key) var, ctx);
 }
 
+static void
+install_parm_decl (tree var, tree type, omp_context *ctx)
+{
+  if (!is_oacc_parallel (ctx))
+    return;
+
+  splay_tree_key key = (splay_tree_key) var;
+  tree decl_name = NULL_TREE, t;
+  location_t loc = UNKNOWN_LOCATION;
+
+  if (DECL_P (var))
+    {
+      decl_name = get_identifier (get_name (var));
+      loc = DECL_SOURCE_LOCATION (var);
+    }
+  t = build_decl (loc, PARM_DECL, decl_name, type);
+  DECL_ARTIFICIAL (t) = 1;
+  DECL_NAMELESS (t) = 1;
+  DECL_ARG_TYPE (t) = type;
+  DECL_CONTEXT (t) = current_function_decl;
+  TREE_USED (t) = 1;
+  TREE_READONLY (t) = 1;
+
+  splay_tree_insert (ctx->parm_map, key, (splay_tree_value) t);
+}
+
 /* Add a new field for VAR inside the structure CTX->SENDER_DECL.  If
    BASE_POINTERS_RESTRICT, declare the field with restrict.  */
 
@@ -764,7 +805,10 @@  install_var_field (tree var, bool by_ref, int mask, omp_context *ctx,
     }
 
   if (mask & 1)
-    splay_tree_insert (ctx->field_map, key, (splay_tree_value) field);
+    {
+      splay_tree_insert (ctx->field_map, key, (splay_tree_value) field);
+      install_parm_decl (var, type, ctx);
+    }
   if ((mask & 2) && ctx->sfield_map)
     splay_tree_insert (ctx->sfield_map, key, (splay_tree_value) sfield);
 }
@@ -1068,6 +1112,8 @@  delete_omp_context (splay_tree_value value)
     splay_tree_delete (ctx->field_map);
   if (ctx->sfield_map)
     splay_tree_delete (ctx->sfield_map);
+  if (ctx->parm_map)
+    splay_tree_delete (ctx->parm_map);
 
   /* We hijacked DECL_ABSTRACT_ORIGIN earlier.  We need to clear it before
      it produces corrupt debug information.  */
@@ -1506,6 +1552,7 @@  scan_sharing_clauses (tree clauses, omp_context *ctx,
 		  insert_field_into_struct (ctx->record_type, field);
 		  splay_tree_insert (ctx->field_map, (splay_tree_key) decl,
 				     (splay_tree_value) field);
+		  install_parm_decl (decl, ptr_type_node, ctx);
 		}
 	    }
 	  break;
@@ -1800,10 +1847,13 @@  omp_maybe_offloaded_ctx (omp_context *ctx)
 }
 
 /* Build a decl for the omp child function.  It'll not contain a body
-   yet, just the bare decl.  */
+   yet, just the bare decl.  Unlike omp child functions, acc child
+   functions for parallel regions have one argument per data
+   mapping.  */
 
 static void
-create_omp_child_function (omp_context *ctx, bool task_copy)
+create_omp_child_function (omp_context *ctx, bool task_copy,
+			   unsigned int map_cnt = 0)
 {
   tree decl, type, name, t;
 
@@ -1825,6 +1875,13 @@  create_omp_child_function (omp_context *ctx, bool task_copy)
       type = build_function_type_list (void_type_node, ptr_type_node,
 				       cilk_var_type, cilk_var_type, NULL_TREE);
     }
+  else if (is_oacc_parallel (ctx))
+    {
+      tree *arg_types = (tree *) alloca (sizeof (tree) * map_cnt);
+      for (unsigned int i = 0; i < map_cnt; i++)
+	arg_types[i] = ptr_type_node;
+      type = build_function_type_array (void_type_node, map_cnt, arg_types);
+    }
   else
     type = build_function_type_list (void_type_node, ptr_type_node, NULL_TREE);
 
@@ -1899,35 +1956,37 @@  create_omp_child_function (omp_context *ctx, bool task_copy)
       DECL_ARGUMENTS (decl) = t;
     }
 
-  tree data_name = get_identifier (".omp_data_i");
-  t = build_decl (DECL_SOURCE_LOCATION (decl), PARM_DECL, data_name,
-		  ptr_type_node);
-  DECL_ARTIFICIAL (t) = 1;
-  DECL_NAMELESS (t) = 1;
-  DECL_ARG_TYPE (t) = ptr_type_node;
-  DECL_CONTEXT (t) = current_function_decl;
-  TREE_USED (t) = 1;
-  TREE_READONLY (t) = 1;
-  if (cilk_for_count)
-    DECL_CHAIN (t) = DECL_ARGUMENTS (decl);
-  DECL_ARGUMENTS (decl) = t;
-  if (!task_copy)
-    ctx->receiver_decl = t;
-  else
+  if (!is_oacc_parallel (ctx))
     {
-      t = build_decl (DECL_SOURCE_LOCATION (decl),
-		      PARM_DECL, get_identifier (".omp_data_o"),
+      tree data_name = get_identifier (".omp_data_i");
+      t = build_decl (DECL_SOURCE_LOCATION (decl), PARM_DECL, data_name,
 		      ptr_type_node);
       DECL_ARTIFICIAL (t) = 1;
       DECL_NAMELESS (t) = 1;
       DECL_ARG_TYPE (t) = ptr_type_node;
       DECL_CONTEXT (t) = current_function_decl;
       TREE_USED (t) = 1;
-      TREE_ADDRESSABLE (t) = 1;
-      DECL_CHAIN (t) = DECL_ARGUMENTS (decl);
+      TREE_READONLY (t) = 1;
+      if (cilk_for_count)
+	DECL_CHAIN (t) = DECL_ARGUMENTS (decl);
       DECL_ARGUMENTS (decl) = t;
+      if (!task_copy)
+	ctx->receiver_decl = t;
+      else
+	{
+	  t = build_decl (DECL_SOURCE_LOCATION (decl),
+			  PARM_DECL, get_identifier (".omp_data_o"),
+			  ptr_type_node);
+	  DECL_ARTIFICIAL (t) = 1;
+	  DECL_NAMELESS (t) = 1;
+	  DECL_ARG_TYPE (t) = ptr_type_node;
+	  DECL_CONTEXT (t) = current_function_decl;
+	  TREE_USED (t) = 1;
+	  TREE_ADDRESSABLE (t) = 1;
+	  DECL_CHAIN (t) = DECL_ARGUMENTS (decl);
+	  DECL_ARGUMENTS (decl) = t;
+	}
     }
-
   /* Allocate memory for the function structure.  The call to
      allocate_struct_function clobbers CFUN, so we need to restore
      it afterward.  */
@@ -2608,6 +2667,7 @@  scan_omp_target (gomp_target *stmt, omp_context *outer_ctx)
 
   ctx = new_omp_context (stmt, outer_ctx);
   ctx->field_map = splay_tree_new (splay_tree_compare_pointers, 0, 0);
+  ctx->parm_map = splay_tree_new (splay_tree_compare_pointers, 0, 0);
   ctx->default_kind = OMP_CLAUSE_DEFAULT_SHARED;
   ctx->record_type = lang_hooks.types.make_type (RECORD_TYPE);
   name = create_tmp_var_name (".omp_data_t");
@@ -2621,8 +2681,11 @@  scan_omp_target (gomp_target *stmt, omp_context *outer_ctx)
   bool base_pointers_restrict = false;
   if (offloaded)
     {
-      create_omp_child_function (ctx, false);
-      gimple_omp_target_set_child_fn (stmt, ctx->cb.dst_fn);
+      if (!is_oacc_parallel (ctx))
+	{
+	  create_omp_child_function (ctx, false);
+	  gimple_omp_target_set_child_fn (stmt, ctx->cb.dst_fn);
+	}
 
       base_pointers_restrict = omp_target_base_pointers_restrict_p (clauses);
       if (base_pointers_restrict
@@ -7921,6 +7984,18 @@  convert_from_firstprivate_int (tree var, tree orig_type, bool is_ref,
   return var;
 }
 
+static tree
+append_decl_arg (tree var, tree decl_args, omp_context *ctx)
+{
+  if (!is_oacc_parallel (ctx))
+    return NULL_TREE;
+
+  tree temp = lookup_parm (var, ctx);
+  DECL_CHAIN (temp) = decl_args;
+
+  return temp;
+}
+
 /* Lower the GIMPLE_OMP_TARGET in the current statement
    in GSI_P.  CTX holds context information for the directive.  */
 
@@ -7934,7 +8009,7 @@  lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
   gimple_seq tgt_body, olist, ilist, fplist, new_body;
   location_t loc = gimple_location (stmt);
   bool offloaded, data_region;
-  unsigned int map_cnt = 0;
+  unsigned int map_cnt = 0, init_cnt = 0;
 
   offloaded = is_gimple_omp_offloaded (stmt);
   switch (gimple_omp_target_kind (stmt))
@@ -7980,11 +8055,83 @@  lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
     }
   else if (data_region)
     tgt_body = gimple_omp_body (stmt);
-  child_fn = ctx->cb.dst_fn;
 
   push_gimplify_context ();
   fplist = NULL;
 
+  /* Determine init_cnt to finish initializing ctx.  */
+
+  if (is_oacc_parallel (ctx))
+    {
+      for (c = clauses; c ; c = OMP_CLAUSE_CHAIN (c))
+	switch (OMP_CLAUSE_CODE (c))
+	  {
+	    tree var;
+
+	  default:
+	    break;
+	  case OMP_CLAUSE_MAP:
+	  case OMP_CLAUSE_TO:
+	  case OMP_CLAUSE_FROM:
+	  init_oacc_firstprivate:
+	    var = OMP_CLAUSE_DECL (c);
+	    if (!DECL_P (var))
+	      {
+		if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_MAP
+		    || (!OMP_CLAUSE_MAP_ZERO_BIAS_ARRAY_SECTION (c)
+			&& (OMP_CLAUSE_MAP_KIND (c)
+			    != GOMP_MAP_FIRSTPRIVATE_POINTER)))
+		  init_cnt++;
+		continue;
+	      }
+
+	    if (DECL_SIZE (var)
+		&& TREE_CODE (DECL_SIZE (var)) != INTEGER_CST)
+	      {
+		tree var2 = DECL_VALUE_EXPR (var);
+		gcc_assert (TREE_CODE (var2) == INDIRECT_REF);
+		var2 = TREE_OPERAND (var2, 0);
+		gcc_assert (DECL_P (var2));
+		var = var2;
+	      }
+
+	    if (offloaded
+		&& OMP_CLAUSE_CODE (c) == OMP_CLAUSE_MAP
+		&& (OMP_CLAUSE_MAP_KIND (c) == GOMP_MAP_FIRSTPRIVATE_POINTER
+		    || (OMP_CLAUSE_MAP_KIND (c)
+			== GOMP_MAP_FIRSTPRIVATE_REFERENCE)))
+	      {
+		continue;
+	      }
+
+	    if (!maybe_lookup_field (var, ctx))
+	      continue;
+
+	    init_cnt++;
+	    break;
+
+	  case OMP_CLAUSE_FIRSTPRIVATE:
+	    if (is_oacc_parallel (ctx))
+	      goto init_oacc_firstprivate;
+	    init_cnt++;
+	    break;
+
+	  case OMP_CLAUSE_USE_DEVICE_PTR:
+	  case OMP_CLAUSE_IS_DEVICE_PTR:
+	    init_cnt++;
+	    break;
+	  }
+
+      /* Initialize the offloaded child function.  */
+
+      create_omp_child_function (ctx, false, init_cnt);
+      gimple_omp_target_set_child_fn (stmt, ctx->cb.dst_fn);
+    }
+
+  child_fn = ctx->cb.dst_fn;
+
+  /* Clause Pass 1: Scan and prepare sender decls VALUE_EXPRs for
+     usage on the child function.  */
   for (c = clauses; c ; c = OMP_CLAUSE_CHAIN (c))
     switch (OMP_CLAUSE_CODE (c))
       {
@@ -8247,6 +8394,8 @@  lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   if (offloaded)
     {
+      if (is_oacc_parallel (ctx))
+	gcc_assert (init_cnt == map_cnt);
       target_nesting_level++;
       lower_omp (&tgt_body, ctx);
       target_nesting_level--;
@@ -8293,6 +8442,7 @@  lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
       vec_alloc (vsize, map_cnt);
       vec_alloc (vkind, map_cnt);
       unsigned int map_idx = 0;
+      tree decl_args = NULL_TREE;
 
       for (c = clauses; c ; c = OMP_CLAUSE_CHAIN (c))
 	switch (OMP_CLAUSE_CODE (c))
@@ -8488,6 +8638,7 @@  lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 	    if (s == NULL_TREE)
 	      s = integer_one_node;
 	    s = fold_convert (size_type_node, s);
+	    decl_args = append_decl_arg (ovar, decl_args, ctx);
 	    purpose = size_int (map_idx++);
 	    CONSTRUCTOR_APPEND_ELT (vsize, purpose, s);
 	    if (TREE_CODE (s) != INTEGER_CST)
@@ -8628,6 +8779,7 @@  lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 	    else
 	      s = TYPE_SIZE_UNIT (TREE_TYPE (ovar));
 	    s = fold_convert (size_type_node, s);
+	    decl_args = append_decl_arg (ovar, decl_args, ctx);
 	    purpose = size_int (map_idx++);
 	    CONSTRUCTOR_APPEND_ELT (vsize, purpose, s);
 	    if (TREE_CODE (s) != INTEGER_CST)
@@ -8667,6 +8819,7 @@  lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 	      }
 	    gimplify_assign (x, var, &ilist);
 	    s = size_int (0);
+	    decl_args = append_decl_arg (ovar, decl_args, ctx);
 	    purpose = size_int (map_idx++);
 	    CONSTRUCTOR_APPEND_ELT (vsize, purpose, s);
 	    gcc_checking_assert (tkind
@@ -8679,6 +8832,8 @@  lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 	  }
 
       gcc_assert (map_idx == map_cnt);
+      if (is_oacc_parallel (ctx))
+	DECL_ARGUMENTS (child_fn) = nreverse (decl_args);
 
       DECL_INITIAL (TREE_VEC_ELT (t, 1))
 	= build_constructor (TREE_TYPE (TREE_VEC_ELT (t, 1)), vsize);
@@ -8717,9 +8872,12 @@  lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
     {
       t = build_fold_addr_expr_loc (loc, ctx->sender_decl);
       /* fixup_child_record_type might have changed receiver_decl's type.  */
-      t = fold_convert_loc (loc, TREE_TYPE (ctx->receiver_decl), t);
-      gimple_seq_add_stmt (&new_body,
-	  		   gimple_build_assign (ctx->receiver_decl, t));
+      if (!is_oacc_parallel (ctx))
+	{
+	  t = fold_convert_loc (loc, TREE_TYPE (ctx->receiver_decl), t);
+	  gimple_seq_add_stmt (&new_body,
+			       gimple_build_assign (ctx->receiver_decl, t));
+	}
     }
   gimple_seq_add_seq (&new_body, fplist);
 
diff --git a/gcc/tree-ssa-structalias.c b/gcc/tree-ssa-structalias.c
index aab6821e792..c23ddeb9c86 100644
--- a/gcc/tree-ssa-structalias.c
+++ b/gcc/tree-ssa-structalias.c
@@ -4618,6 +4618,7 @@  find_func_aliases_for_builtin_call (struct function *fn, gcall *t)
       case BUILT_IN_GOMP_PARALLEL:
       case BUILT_IN_GOACC_PARALLEL:
 	{
+	  bool oacc_parallel = false;
 	  if (in_ipa_mode)
 	    {
 	      unsigned int fnpos, argpos;
@@ -4631,13 +4632,17 @@  find_func_aliases_for_builtin_call (struct function *fn, gcall *t)
 		case BUILT_IN_GOACC_PARALLEL:
 		  /* __builtin_GOACC_parallel (device, fn, mapnum, hostaddrs,
 					       sizes, kinds, ...).  */
-		  fnpos = 1;
-		  argpos = 3;
+		  fnpos = 2;
+		  argpos = 4;
+		  oacc_parallel = gimple_call_arg (t, 1) == integer_one_node;
 		  break;
 		default:
 		  gcc_unreachable ();
 		}
 
+	      if (oacc_parallel)
+		break;
+
 	      tree fnarg = gimple_call_arg (t, fnpos);
 	      gcc_assert (TREE_CODE (fnarg) == ADDR_EXPR);
 	      tree fndecl = TREE_OPERAND (fnarg, 0);
@@ -5195,6 +5200,7 @@  find_func_clobbers (struct function *fn, gimple *origt)
 	      unsigned int fnpos, argpos;
 	      unsigned int implicit_use_args[2];
 	      unsigned int num_implicit_use_args = 0;
+	      bool oacc_parallel = false;
 	      switch (DECL_FUNCTION_CODE (decl))
 		{
 		case BUILT_IN_GOMP_PARALLEL:
@@ -5205,15 +5211,19 @@  find_func_clobbers (struct function *fn, gimple *origt)
 		case BUILT_IN_GOACC_PARALLEL:
 		  /* __builtin_GOACC_parallel (device, fn, mapnum, hostaddrs,
 					       sizes, kinds, ...).  */
-		  fnpos = 1;
-		  argpos = 3;
-		  implicit_use_args[num_implicit_use_args++] = 4;
+		  fnpos = 2;
+		  argpos = 4;
 		  implicit_use_args[num_implicit_use_args++] = 5;
+		  implicit_use_args[num_implicit_use_args++] = 6;
+		  oacc_parallel = gimple_call_arg (t, 1) == integer_one_node;
 		  break;
 		default:
 		  gcc_unreachable ();
 		}
 
+	      if (oacc_parallel)
+		break;
+
 	      tree fnarg = gimple_call_arg (t, fnpos);
 	      gcc_assert (TREE_CODE (fnarg) == ADDR_EXPR);
 	      tree fndecl = TREE_OPERAND (fnarg, 0);
@@ -7968,7 +7978,7 @@  ipa_pta_execute (void)
 		if (gimple_call_builtin_p (stmt, BUILT_IN_GOMP_PARALLEL))
 		  called_decl = TREE_OPERAND (gimple_call_arg (stmt, 0), 0);
 		else if (gimple_call_builtin_p (stmt, BUILT_IN_GOACC_PARALLEL))
-		  called_decl = TREE_OPERAND (gimple_call_arg (stmt, 1), 0);
+		  called_decl = TREE_OPERAND (gimple_call_arg (stmt, 2), 0);
 
 		if (called_decl != NULL_TREE
 		    && !fndecl_maybe_in_other_partition (called_decl))
diff --git a/libgomp/Makefile.am b/libgomp/Makefile.am
index 99ad2fd456d..4de30914d3d 100644
--- a/libgomp/Makefile.am
+++ b/libgomp/Makefile.am
@@ -13,9 +13,16 @@  search_path = $(addprefix $(top_srcdir)/config/, $(config_path)) $(top_srcdir) \
 fincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)$(MULTISUBDIR)/finclude
 libsubincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)/include
 
+LIBFFI = @LIBFFI@
+LIBFFIINCS = @LIBFFIINCS@
+
+if USE_LIBFFI
+libgomp_la_LIBADD = $(LIBFFI)
+endif
+
 vpath % $(strip $(search_path))
 
-AM_CPPFLAGS = $(addprefix -I, $(search_path))
+AM_CPPFLAGS = $(addprefix -I, $(search_path)) $(LIBFFIINCS)
 AM_CFLAGS = $(XCFLAGS)
 AM_LDFLAGS = $(XLDFLAGS) $(SECTION_LDFLAGS) $(OPT_LDFLAGS)
 
diff --git a/libgomp/Makefile.in b/libgomp/Makefile.in
index 7a84b5681e1..617615d4d52 100644
--- a/libgomp/Makefile.in
+++ b/libgomp/Makefile.in
@@ -171,7 +171,6 @@  libgomp_plugin_nvptx_la_LINK = $(LIBTOOL) --tag=CC \
 	$(libgomp_plugin_nvptx_la_LDFLAGS) $(LDFLAGS) -o $@
 @PLUGIN_NVPTX_TRUE@am_libgomp_plugin_nvptx_la_rpath = -rpath \
 @PLUGIN_NVPTX_TRUE@	$(toolexeclibdir)
-libgomp_la_LIBADD =
 @USE_FORTRAN_TRUE@am__objects_1 = openacc.lo
 am_libgomp_la_OBJECTS = alloc.lo atomic.lo barrier.lo critical.lo \
 	env.lo error.lo icv.lo icv-device.lo iter.lo iter_ull.lo \
@@ -279,6 +278,8 @@  INSTALL_SCRIPT = @INSTALL_SCRIPT@
 INSTALL_STRIP_PROGRAM = @INSTALL_STRIP_PROGRAM@
 LD = @LD@
 LDFLAGS = @LDFLAGS@
+LIBFFI = @LIBFFI@
+LIBFFIINCS = @LIBFFIINCS@
 LIBOBJS = @LIBOBJS@
 LIBS = @LIBS@
 LIBTOOL = @LIBTOOL@
@@ -410,7 +411,8 @@  search_path = $(addprefix $(top_srcdir)/config/, $(config_path)) $(top_srcdir) \
 
 fincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)$(MULTISUBDIR)/finclude
 libsubincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)/include
-AM_CPPFLAGS = $(addprefix -I, $(search_path))
+libgomp_la_LIBADD = $(LIBFFI)
+AM_CPPFLAGS = $(addprefix -I, $(search_path)) $(LIBFFIINCS)
 AM_CFLAGS = $(XCFLAGS)
 AM_LDFLAGS = $(XLDFLAGS) $(SECTION_LDFLAGS) $(OPT_LDFLAGS)
 toolexeclib_LTLIBRARIES = libgomp.la $(am__append_1) $(am__append_2)
diff --git a/libgomp/config.h.in b/libgomp/config.h.in
index 2f45aa74bbe..65e01c5376a 100644
--- a/libgomp/config.h.in
+++ b/libgomp/config.h.in
@@ -189,5 +189,8 @@ 
 /* Define to 1 if the target use emutls for thread-local storage. */
 #undef USE_EMUTLS
 
+/* Define to 1 if the target requires libffi to call the offloaded functions. */
+#undef USE_LIBFFI
+
 /* Version number of package */
 #undef VERSION
diff --git a/libgomp/configure b/libgomp/configure
index ac52d89c68e..fa0072e92be 100755
--- a/libgomp/configure
+++ b/libgomp/configure
@@ -649,6 +649,10 @@  PLUGIN_NVPTX
 CUDA_DRIVER_LIB
 CUDA_DRIVER_INCLUDE
 offload_targets
+USE_LIBFFI_FALSE
+USE_LIBFFI_TRUE
+LIBFFIINCS
+LIBFFI
 libtool_VERSION
 ac_ct_FC
 FCFLAGS
@@ -2655,7 +2659,6 @@  else
 fi
 
 
-
 # -------
 # -------
 
@@ -11155,7 +11158,7 @@  else
   lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
   lt_status=$lt_dlunknown
   cat > conftest.$ac_ext <<_LT_EOF
-#line 11158 "configure"
+#line 11161 "configure"
 #include "confdefs.h"
 
 #if HAVE_DLFCN_H
@@ -11261,7 +11264,7 @@  else
   lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
   lt_status=$lt_dlunknown
   cat > conftest.$ac_ext <<_LT_EOF
-#line 11264 "configure"
+#line 11267 "configure"
 #include "confdefs.h"
 
 #if HAVE_DLFCN_H
@@ -15137,6 +15140,28 @@  $as_echo "#define LIBGOMP_OFFLOADED_ONLY 1" >>confdefs.h
 
 fi
 
+# Prepare libffi when necessary.
+
+LIBFFI=
+LIBFFIINCS=
+if test -d ../libffi; then
+
+$as_echo "#define USE_LIBFFI 1" >>confdefs.h
+
+   LIBFFI=../libffi/libffi_convenience.la
+   LIBFFIINCS='-I$(top_srcdir)/../libffi/include -I../libffi/include'
+fi
+
+
+ if test -d ../libffi; then
+  USE_LIBFFI_TRUE=
+  USE_LIBFFI_FALSE='#'
+else
+  USE_LIBFFI_TRUE='#'
+  USE_LIBFFI_FALSE=
+fi
+
+
 # Plugins for offload execution, configure.ac fragment.  -*- mode: autoconf -*-
 #
 # Copyright (C) 2014-2017 Free Software Foundation, Inc.
@@ -16960,6 +16985,10 @@  if test -z "${MAINTAINER_MODE_TRUE}" && test -z "${MAINTAINER_MODE_FALSE}"; then
   as_fn_error "conditional \"MAINTAINER_MODE\" was never defined.
 Usually this means the macro was only invoked conditionally." "$LINENO" 5
 fi
+if test -z "${USE_LIBFFI_TRUE}" && test -z "${USE_LIBFFI_FALSE}"; then
+  as_fn_error "conditional \"USE_LIBFFI\" was never defined.
+Usually this means the macro was only invoked conditionally." "$LINENO" 5
+fi
 if test -z "${PLUGIN_NVPTX_TRUE}" && test -z "${PLUGIN_NVPTX_FALSE}"; then
   as_fn_error "conditional \"PLUGIN_NVPTX\" was never defined.
 Usually this means the macro was only invoked conditionally." "$LINENO" 5
diff --git a/libgomp/configure.ac b/libgomp/configure.ac
index a42d4f08b4b..aa49577537e 100644
--- a/libgomp/configure.ac
+++ b/libgomp/configure.ac
@@ -28,7 +28,6 @@  LIBGOMP_ENABLE(generated-files-in-srcdir, no, ,
 AC_MSG_RESULT($enable_generated_files_in_srcdir)
 AM_CONDITIONAL(GENINSRC, test "$enable_generated_files_in_srcdir" = yes)
 
-
 # -------
 # -------
 
@@ -215,6 +214,19 @@  if test x$libgomp_offloaded_only = xyes; then
             [Define to 1 if building libgomp for an accelerator-only target.])
 fi
 
+# Prepare libffi when necessary.
+
+LIBFFI=
+LIBFFIINCS=
+if test -d ../libffi; then
+   AC_DEFINE(USE_LIBFFI, 1, [Define if we're to use libffi.])
+   LIBFFI=../libffi/libffi_convenience.la
+   LIBFFIINCS='-I$(top_srcdir)/../libffi/include -I../libffi/include'
+fi
+AC_SUBST(LIBFFI)
+AC_SUBST(LIBFFIINCS)
+AM_CONDITIONAL([USE_LIBFFI], [test -d ../libffi])
+
 m4_include([plugin/configfrag.ac])
 
 # Check for functions needed.
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index c025069b457..44097cfd56a 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -119,6 +119,13 @@  extern void GOMP_OFFLOAD_openacc_exec (void (*) (void *), size_t, void **,
 extern void GOMP_OFFLOAD_openacc_async_exec (void (*) (void *), size_t, void **,
 					     void **, unsigned *, void *,
 					     struct goacc_asyncqueue *);
+extern void GOMP_OFFLOAD_openacc_exec_params (void (*) (void *), size_t,
+					      void **, void **, unsigned *,
+					      void *);
+extern void GOMP_OFFLOAD_openacc_async_exec_params (void (*) (void *), size_t,
+						    void **, void **,
+						    unsigned *, void *,
+						    struct goacc_asyncqueue *);
 extern struct goacc_asyncqueue *GOMP_OFFLOAD_openacc_async_construct (void);
 extern bool GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *);
 extern int GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 59e7ca8b8c8..a31c83cc656 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -885,6 +885,7 @@  typedef struct acc_dispatch_t
 
   /* Execute.  */
   __typeof (GOMP_OFFLOAD_openacc_exec) *exec_func;
+  __typeof (GOMP_OFFLOAD_openacc_exec_params) *exec_params_func;
 
   struct {
     gomp_mutex_t lock;
@@ -900,6 +901,7 @@  typedef struct acc_dispatch_t
     __typeof (GOMP_OFFLOAD_openacc_async_queue_callback) *queue_callback_func;
 
     __typeof (GOMP_OFFLOAD_openacc_async_exec) *exec_func;
+    __typeof (GOMP_OFFLOAD_openacc_async_exec_params) *exec_params_func;
     __typeof (GOMP_OFFLOAD_openacc_async_host2dev) *host2dev_func;
     __typeof (GOMP_OFFLOAD_openacc_async_dev2host) *dev2host_func;
   } async;
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 546ac929a0e..7a49acc1dfe 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -461,8 +461,10 @@  GOACC_2.0.1 {
 GOACC_2.0.GOMP_4_BRANCH {
   global:
 	GOMP_set_offload_targets;
+	GOACC_parallel_keyed_v2;
 } GOACC_2.0.1;
 
+
 GOMP_PLUGIN_1.0 {
   global:
 	GOMP_PLUGIN_malloc;
diff --git a/libgomp/libgomp_g.h b/libgomp/libgomp_g.h
index 958ca6e9cc3..c40e67f2e80 100644
--- a/libgomp/libgomp_g.h
+++ b/libgomp/libgomp_g.h
@@ -298,6 +298,8 @@  extern void GOMP_teams (unsigned int, unsigned int);
 
 extern void GOACC_parallel_keyed (int, void (*) (void *), size_t,
 				  void **, size_t *, unsigned short *, ...);
+extern void GOACC_parallel_keyed_v2 (int, int, void (*) (void *), size_t,
+				     void **, size_t *, unsigned short *, ...);
 extern void GOACC_parallel (int, void (*) (void *), size_t, void **, size_t *,
 			    unsigned short *, int, int, int, int, int, ...);
 extern void GOACC_data_start (int, size_t, void **, size_t *,
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 3b2cafb2c55..5b4e34d7190 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -158,6 +158,30 @@  host_openacc_async_exec (void (*fn) (void *),
   fn (hostaddrs);
 }
 
+static void
+host_openacc_exec_params (void (*fn) (void *),
+			  size_t mapnum __attribute__ ((unused)),
+			  void **hostaddrs,
+			  void **devaddrs __attribute__ ((unused)),
+			  unsigned *dims __attribute__ ((unused)),
+			  void *targ_mem_desc __attribute__ ((unused)))
+{
+  fn (hostaddrs);
+}
+
+static void
+host_openacc_async_exec_params (void (*fn) (void *),
+				size_t mapnum __attribute__ ((unused)),
+				void **hostaddrs,
+				void **devaddrs __attribute__ ((unused)),
+				unsigned *dims __attribute__ ((unused)),
+				void *targ_mem_desc __attribute__ ((unused)),
+				struct goacc_asyncqueue *aq __attribute__ ((unused)))
+{
+  fn (hostaddrs);
+}
+
+
 static int
 host_openacc_async_test (struct goacc_asyncqueue *aq __attribute__ ((unused)))
 {
@@ -265,6 +289,7 @@  static struct gomp_device_descr host_dispatch =
       .data_environ = NULL,
 
       .exec_func = host_openacc_exec,
+      .exec_params_func = host_openacc_exec_params,
 
       .async = {
 	.construct_func = host_openacc_async_construct,
@@ -274,6 +299,7 @@  static struct gomp_device_descr host_dispatch =
 	.serialize_func = host_openacc_async_serialize,
 	.queue_callback_func = host_openacc_async_queue_callback,
 	.exec_func = host_openacc_async_exec,
+	.exec_params_func = host_openacc_async_exec_params,
 	.dev2host_func = host_openacc_async_dev2host,
 	.host2dev_func = host_openacc_async_host2dev,
       },
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index 1172d739ec7..3c5aa24b5f5 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -31,6 +31,9 @@ 
 #include "libgomp_g.h"
 #include "gomp-constants.h"
 #include "oacc-int.h"
+#ifdef USE_LIBFFI
+# include "ffi.h"
+#endif
 #ifdef HAVE_INTTYPES_H
 # include <inttypes.h>  /* For PRIu64.  */
 #endif
@@ -104,19 +107,47 @@  handle_ftn_pointers (size_t mapnum, void **hostaddrs, size_t *sizes,
 
 static void goacc_wait (int async, int num_waits, va_list *ap);
 
+static void
+goacc_call_host_fn (void (*fn) (void *), size_t mapnum, void **hostaddrs,
+		    int params)
+{
+#ifdef USE_LIBFFI
+  ffi_cif cif;
+  ffi_type *arg_types[mapnum];
+  void *arg_values[mapnum];
+  ffi_arg result;
+  size_t i;
+
+  if (params)
+    {
+      for (i = 0; i < mapnum; i++)
+	{
+	  arg_types[i] = &ffi_type_pointer;
+	  arg_values[i] = &hostaddrs[i];
+	}
+
+      if (ffi_prep_cif (&cif, FFI_DEFAULT_ABI, mapnum,
+			&ffi_type_void, arg_types) == FFI_OK)
+	ffi_call (&cif, FFI_FN (fn), &result, arg_values);
+      else
+	abort ();
+    }
+  else
+#endif
+  fn (hostaddrs);
+}
 
 /* Launch a possibly offloaded function on DEVICE.  FN is the host fn
    address.  MAPNUM, HOSTADDRS, SIZES & KINDS  describe the memory
    blocks to be copied to/from the device.  Varadic arguments are
    keyed optional parameters terminated with a zero.  */
 
-void
-GOACC_parallel_keyed (int device, void (*fn) (void *),
-		      size_t mapnum, void **hostaddrs, size_t *sizes,
-		      unsigned short *kinds, ...)
+static void
+GOACC_parallel_keyed_internal (int device, int params, void (*fn) (void *),
+			       size_t mapnum, void **hostaddrs, size_t *sizes,
+			       unsigned short *kinds, va_list *ap)
 {
   bool host_fallback = device == GOMP_DEVICE_HOST_FALLBACK;
-  va_list ap;
   struct goacc_thread *thr;
   struct gomp_device_descr *acc_dev;
   struct target_mem_desc *tgt;
@@ -206,13 +237,13 @@  GOACC_parallel_keyed (int device, void (*fn) (void *),
       prof_info.device_type = acc_device_host;
       api_info.device_type = prof_info.device_type;
       goacc_save_and_set_bind (acc_device_host);
-      fn (hostaddrs);
+      goacc_call_host_fn (fn, mapnum, hostaddrs, params);
       goacc_restore_bind ();
       goto out;
     }
   else if (acc_device_type (acc_dev->type) == acc_device_host)
     {
-      fn (hostaddrs);
+      goacc_call_host_fn (fn, mapnum, hostaddrs, params);
       goto out;
     }
   else if (profiling_dispatch_p)
@@ -222,9 +253,8 @@  GOACC_parallel_keyed (int device, void (*fn) (void *),
   for (i = 0; i != GOMP_DIM_MAX; i++)
     dims[i] = 0;
 
-  va_start (ap, kinds);
   /* TODO: This will need amending when device_type is implemented.  */
-  while ((tag = va_arg (ap, unsigned)) != 0)
+  while ((tag = va_arg (*ap, unsigned)) != 0)
     {
       if (GOMP_LAUNCH_DEVICE (tag))
 	gomp_fatal ("device_type '%d' offload parameters, libgomp is too old",
@@ -238,7 +268,7 @@  GOACC_parallel_keyed (int device, void (*fn) (void *),
 
 	    for (i = 0; i != GOMP_DIM_MAX; i++)
 	      if (mask & GOMP_DIM_MASK (i))
-		dims[i] = va_arg (ap, unsigned);
+		dims[i] = va_arg (*ap, unsigned);
 	  }
 	  break;
 
@@ -248,7 +278,7 @@  GOACC_parallel_keyed (int device, void (*fn) (void *),
 	    async = GOMP_LAUNCH_OP (tag);
 
 	    if (async == GOMP_LAUNCH_OP_MAX)
-	      async = va_arg (ap, unsigned);
+	      async = va_arg (*ap, unsigned);
 
 	    if (profiling_dispatch_p)
 	      {
@@ -267,7 +297,7 @@  GOACC_parallel_keyed (int device, void (*fn) (void *),
 	    int num_waits = ((signed short) GOMP_LAUNCH_OP (tag));
 
 	    if (num_waits > 0)
-	      goacc_wait (async, num_waits, &ap);
+	      goacc_wait (async, num_waits, ap);
 	    else if (num_waits == acc_async_noval)
 	      acc_wait_all_async (async);
 	    break;
@@ -278,7 +308,6 @@  GOACC_parallel_keyed (int device, void (*fn) (void *),
 		      " libgomp is too old", GOMP_LAUNCH_CODE (tag));
 	}
     }
-  va_end (ap);
   
   if (!(acc_dev->capabilities & GOMP_OFFLOAD_CAP_NATIVE_EXEC))
     {
@@ -338,8 +367,12 @@  GOACC_parallel_keyed (int device, void (*fn) (void *),
 
   if (aq == NULL)
     {
-      acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
-				  dims, tgt);
+      if (params)
+	acc_dev->openacc.exec_params_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+					   dims, tgt);
+      else
+	acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+				    dims, tgt);
       if (profiling_dispatch_p)
 	{
 	  prof_info.event_type = acc_ev_exit_data_start;
@@ -362,8 +395,12 @@  GOACC_parallel_keyed (int device, void (*fn) (void *),
     }
   else
     {
-      acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
-					dims, tgt, aq);
+      if (params)
+	acc_dev->openacc.async.exec_params_func (tgt_fn, mapnum, hostaddrs,
+						 devaddrs, dims, tgt, aq);
+      else
+	acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs,
+					  devaddrs, dims, tgt, aq);
       goacc_async_copyout_unmap_vars (tgt, aq);
     }
 
@@ -381,6 +418,30 @@  GOACC_parallel_keyed (int device, void (*fn) (void *),
     }
 }
 
+void
+GOACC_parallel_keyed (int device, void (*fn) (void *),
+		      size_t mapnum, void **hostaddrs, size_t *sizes,
+		      unsigned short *kinds, ...)
+{
+  va_list ap;
+  va_start (ap, kinds);
+  GOACC_parallel_keyed_internal (device, 0, fn, mapnum, hostaddrs, sizes,
+				 kinds, &ap);
+  va_end (ap);
+}
+
+void
+GOACC_parallel_keyed_v2 (int device, int params, void (*fn) (void *),
+			 size_t mapnum, void **hostaddrs, size_t *sizes,
+			 unsigned short *kinds, ...)
+{
+  va_list ap;
+  va_start (ap, kinds);
+  GOACC_parallel_keyed_internal (device, params, fn, mapnum, hostaddrs, sizes,
+				 kinds, &ap);
+  va_end (ap);
+}
+
 /* Legacy entry point, only provide host execution.  */
 
 void
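The oacc-parallel.c change above keeps the old GOACC_parallel_keyed entry
point and adds a versioned one. Both forward their variadic arguments to a
shared internal function through a pointer to va_list. A minimal,
standalone sketch of that forwarding pattern follows. The names
sum_tags_internal, sum_tags and sum_tags_v2 are hypothetical and not part
of libgomp, and the "+ 1000" offset exists only so the two paths can be
told apart in this illustration.

```c
#include <stdarg.h>

/* Hypothetical stand-in for GOACC_parallel_keyed_internal: walks a
   zero-terminated list of unsigned tags through a va_list owned by
   the caller.  */
static unsigned
sum_tags_internal (int params, va_list *ap)
{
  unsigned tag, sum = 0;

  while ((tag = va_arg (*ap, unsigned)) != 0)
    sum += tag;

  /* Illustration only: offset the result when PARAMS is set so the
     two entry points are distinguishable.  */
  return params ? sum + 1000 : sum;
}

/* Original entry point: PARAMS is fixed to 0.  */
unsigned
sum_tags (int device, ...)
{
  va_list ap;
  unsigned result;

  va_start (ap, device);
  result = sum_tags_internal (0, &ap);
  va_end (ap);
  return result;
}

/* Versioned entry point: the caller chooses PARAMS.  */
unsigned
sum_tags_v2 (int device, int params, ...)
{
  va_list ap;
  unsigned result;

  va_start (ap, params);
  result = sum_tags_internal (params, &ap);
  va_end (ap);
  return result;
}
```

Passing a pointer to the va_list, rather than the va_list itself, lets the
internal function advance the caller's argument cursor, which is why each
wrapper can own the matching va_start and va_end pair.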
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 008b6d4e209..dfdd469660e 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -149,6 +149,13 @@  init_cuda_lib (void)
 #define CU_JIT_NEW_SM3X_OPT 15
 #endif
 
+/* It's not clear if cuLaunchKernel caches the kernel launch
+   parameters when that function is called.  If it does not, then the
+   runtime will need to.  Setting cache_kernel_args to 1 caches those
+   arguments, otherwise it does not.  */
+
+static const int cache_kernel_args = 0;
+
 /* Convenience macros for the frequently used CUDA library call and
    error handling sequence as well as CUDA library calls that
    do the error checking themselves or don't do it at all.  */
@@ -1033,12 +1040,11 @@  link_ptx (CUmodule *module, const struct targ_ptx_obj *ptx_objs,
 static void
 nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 	    unsigned *dims, void *targ_mem_desc,
-	    CUdeviceptr dp, CUstream stream)
+	    void **kargs, CUstream stream)
 {
   struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
   CUfunction function;
   int i;
-  void *kargs[1];
   int cpu_size = nvptx_thread ()->ptx_dev->max_threads_per_multiprocessor;
   int block_size = nvptx_thread ()->ptx_dev->max_threads_per_block;
   int dev_size = nvptx_thread ()->ptx_dev->num_sms;
@@ -1224,7 +1230,6 @@  nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 					    api_info);
     }
   
-  kargs[0] = &dp;
   CUDA_CALL_ASSERT (cuLaunchKernel, function,
 		    dims[GOMP_DIM_GANG], 1, 1,
 		    dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
@@ -1630,6 +1635,68 @@  GOMP_OFFLOAD_free (int ord, void *ptr)
 }
 
 void
+GOMP_OFFLOAD_openacc_exec_params (void (*fn) (void *), size_t mapnum,
+				  void **hostaddrs, void **devaddrs,
+				  unsigned *dims, void *targ_mem_desc)
+{
+  GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
+
+  void **hp = alloca (mapnum * sizeof (void *));
+
+  if (mapnum > 0)
+    for (int i = 0; i < mapnum; i++)
+      hp[i] = (devaddrs[i] ? &devaddrs[i] : &hostaddrs[i]);
+
+  /* Copy the (device) pointers to arguments to the device (hp might in
+     fact have the same value on a unified-memory system).  */
+  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+  acc_prof_info *prof_info = thr->prof_info;
+  acc_event_info data_event_info;
+  acc_api_info *api_info = thr->api_info;
+  bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
+  if (profiling_dispatch_p)
+    {
+      prof_info->event_type = acc_ev_enqueue_upload_start;
+
+      data_event_info.data_event.event_type = prof_info->event_type;
+      data_event_info.data_event.valid_bytes
+	= _ACC_DATA_EVENT_INFO_VALID_BYTES;
+      data_event_info.data_event.parent_construct
+	= acc_construct_parallel; //TODO
+      /* Always implicit for "data mapping arguments for cuLaunchKernel".  */
+      data_event_info.data_event.implicit = 1;
+      data_event_info.data_event.tool_info = NULL;
+      data_event_info.data_event.var_name = NULL; //TODO
+      data_event_info.data_event.bytes = mapnum * sizeof (void *);
+      data_event_info.data_event.host_ptr = hp;
+
+      api_info->device_api = acc_device_api_cuda;
+
+      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+					    api_info);
+    }
+
+  if (profiling_dispatch_p)
+    {
+      prof_info->event_type = acc_ev_enqueue_upload_end;
+      data_event_info.data_event.event_type = prof_info->event_type;
+      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+					    api_info);
+    }
+
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+	      hp, NULL);
+
+  CUresult r = cuStreamSynchronize (NULL);
+  const char *maybe_abort_msg = "(perhaps abort was called)";
+  if (r == CUDA_ERROR_LAUNCH_FAILED)
+    GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s %s\n", cuda_error (r),
+		       maybe_abort_msg);
+  else if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
+}
+
+void
 GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
 			   void **hostaddrs, void **devaddrs,
 			   unsigned *dims, void *targ_mem_desc)
@@ -1689,8 +1756,9 @@  GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
 					    api_info);
     }
 
+  void *kargs[1] = { &dp };
   nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-	      dp, NULL);
+	      kargs, NULL);
 
   CUresult r = cuStreamSynchronize (NULL);
   const char *maybe_abort_msg = "(perhaps abort was called)";
@@ -1703,6 +1771,92 @@  GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
 }
 
 static void
+cuda_free_argmem_params (void *ptr)
+{
+  void **block = (void **) ptr;
+  /* block[0] points into BLOCK itself (hp == block + 2); free only BLOCK.  */
+  free (block);
+}
+
+void
+GOMP_OFFLOAD_openacc_async_exec_params (void (*fn) (void *), size_t mapnum,
+					void **hostaddrs, void **devaddrs,
+					unsigned *dims, void *targ_mem_desc,
+					struct goacc_asyncqueue *aq)
+{
+  GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
+
+  void **hp = NULL;
+  void **block = NULL;
+
+  if (mapnum > 0)
+    {
+      if (cache_kernel_args > 0)
+	{
+	  block = (void **) GOMP_PLUGIN_malloc ((mapnum + 2) * sizeof (void *));
+	  hp = block + 2;
+	}
+      else
+	hp = alloca (sizeof (void *) * mapnum);
+
+      for (int i = 0; i < mapnum; i++)
+	hp[i] = (devaddrs[i] ? &devaddrs[i] : &hostaddrs[i]);
+    }
+
+  /* Copy the (device) pointers to arguments to the device (hp might in
+     fact have the same value on a unified-memory system).  */
+  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+  acc_prof_info *prof_info = thr->prof_info;
+  acc_event_info data_event_info;
+  acc_api_info *api_info = thr->api_info;
+  bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
+  if (profiling_dispatch_p)
+    {
+      prof_info->event_type = acc_ev_enqueue_upload_start;
+
+      data_event_info.data_event.event_type = prof_info->event_type;
+      data_event_info.data_event.valid_bytes
+	= _ACC_DATA_EVENT_INFO_VALID_BYTES;
+      data_event_info.data_event.parent_construct
+	= acc_construct_parallel; //TODO
+      /* Always implicit for "data mapping arguments for cuLaunchKernel".  */
+      data_event_info.data_event.implicit = 1;
+      data_event_info.data_event.tool_info = NULL;
+      data_event_info.data_event.var_name = NULL; //TODO
+      data_event_info.data_event.bytes = mapnum * sizeof (void *);
+      data_event_info.data_event.host_ptr = hp;
+
+      api_info->device_api = acc_device_api_cuda;
+
+      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+					    api_info);
+    }
+
+  if (cache_kernel_args && mapnum > 0)
+    {
+      block[0] = hp;
+
+      struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+      struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
+      block[1] = (void *) nvthd->ptx_dev;
+    }
+
+  if (profiling_dispatch_p)
+    {
+      prof_info->event_type = acc_ev_enqueue_upload_end;
+      data_event_info.data_event.event_type = prof_info->event_type;
+      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+					    api_info);
+    }
+  
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+	      hp, aq->cuda_stream);
+
+  if (cache_kernel_args && mapnum > 0)
+    GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem_params, block);
+}
+
+static void
 cuda_free_argmem (void *ptr)
 {
   void **block = (void **) ptr;
@@ -1779,9 +1933,10 @@  GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
       GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
 					    api_info);
     }
-  
+
+  void *kargs[1] = { &dp };
   nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-	      dp, aq->cuda_stream);
+	      kargs, aq->cuda_stream);
 
   if (mapnum > 0)
     GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
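The new *_exec_params hooks in plugin-nvptx.c above build the argument
array handed to cuLaunchKernel, whose kernelParams argument is an array of
pointers to the individual argument values. The sketch below isolates that
step in plain C with no CUDA dependency. build_kernel_args is a
hypothetical helper name, not a function in the patch.

```c
#include <stddef.h>

/* Hypothetical helper mirroring the loop in the *_exec_params hooks:
   each KARGS slot receives the address of the value to pass, taken
   from DEVADDRS when a device address exists and from HOSTADDRS
   otherwise.  */
void
build_kernel_args (size_t mapnum, void **hostaddrs, void **devaddrs,
		   void **kargs)
{
  for (size_t i = 0; i < mapnum; i++)
    kargs[i] = devaddrs[i] ? (void *) &devaddrs[i] : (void *) &hostaddrs[i];
}
```

Because every slot stores the address of an element of devaddrs or
hostaddrs, those arrays have to stay alive for as long as the argument
array is in use.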
diff --git a/libgomp/target.c b/libgomp/target.c
index 336581d2196..10c5e34f378 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2908,6 +2908,7 @@  gomp_load_plugin_for_device (struct gomp_device_descr *device,
   if (device->capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
     {
       if (!DLSYM_OPT (openacc.exec, openacc_exec)
+	  || !DLSYM_OPT (openacc.exec_params, openacc_exec_params)
 	  || !DLSYM_OPT (openacc.create_thread_data,
 			 openacc_create_thread_data)
 	  || !DLSYM_OPT (openacc.destroy_thread_data,
@@ -2920,6 +2921,7 @@  gomp_load_plugin_for_device (struct gomp_device_descr *device,
 	  || !DLSYM_OPT (openacc.async.queue_callback,
 			 openacc_async_queue_callback)
 	  || !DLSYM_OPT (openacc.async.exec, openacc_async_exec)
+	  || !DLSYM_OPT (openacc.async.exec_params, openacc_async_exec_params)
 	  || !DLSYM_OPT (openacc.async.dev2host, openacc_async_dev2host)
 	  || !DLSYM_OPT (openacc.async.host2dev, openacc_async_host2dev))
 	{
diff --git a/libgomp/testsuite/Makefile.in b/libgomp/testsuite/Makefile.in
index 6edb7ae7ade..4d7f43abe3d 100644
--- a/libgomp/testsuite/Makefile.in
+++ b/libgomp/testsuite/Makefile.in
@@ -120,6 +120,8 @@  INSTALL_SCRIPT = @INSTALL_SCRIPT@
 INSTALL_STRIP_PROGRAM = @INSTALL_STRIP_PROGRAM@
 LD = @LD@
 LDFLAGS = @LDFLAGS@
+LIBFFI = @LIBFFI@
+LIBFFIINCS = @LIBFFIINCS@
 LIBOBJS = @LIBOBJS@
 LIBS = @LIBS@
 LIBTOOL = @LIBTOOL@