[OpenACC] Don't mark OpenACC auto loops as independent inside acc parallel regions

Message ID 4035f554-768f-2b88-c1a0-374972f168db@codesourcery.com
State New

Commit Message

Cesar Philippidis Sept. 20, 2018, 4:49 p.m.
OpenACC has a concept of loop independence, in which independent loops
may be executed in parallel across gangs, workers and vectors. Inside
acc parallel regions, if a loop isn't explicitly marked seq or auto, it
is predetermined to be independent.

This patch corrects a bug where acc loops marked as auto were being
mistakenly promoted to independent. That's bad because the generated
code can produce bogus results if a dependency exists.

Note that this patch depends on the following patches for
-fopt-info-optimized-omp, which is used in a test case.

  * Add user-friendly OpenACC diagnostics regarding detected
    parallelism.
    https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01652.html

  * Correct the reported line number in fortran combined OpenACC
    directives
    https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01554.html

  * Correct the reported line number in c++ combined OpenACC directives
    https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01552.html

Is this OK for trunk? I bootstrapped and regtested on x86_64 Linux with
nvptx offloading.

Thanks,
Cesar

Comments

Julian Brown Dec. 4, 2018, 12:07 a.m. | #1
On Thu, 20 Sep 2018 09:49:43 -0700
Cesar Philippidis <cesar@codesourcery.com> wrote:

> OpenACC has a concept of loop independence, in which independent loops
> may be executed in parallel across gangs, workers and vectors. Inside
> acc parallel regions, if a loop isn't explicitly marked seq or auto,
> it is predetermined to be independent.
>
> This patch corrects a bug where acc loops marked as auto were being
> mistakenly promoted to independent. That's bad because it can generate
> bogus results if a dependency exists.
>
> Note that this patch depends on the following patches for
> -fopt-info-optimized-omp, which is used in a test case.
>
>   * Add user-friendly OpenACC diagnostics regarding detected
>     parallelism.
>     https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01652.html
>
>   * Correct the reported line number in fortran combined OpenACC
>     directives
>     https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01554.html
>
>   * Correct the reported line number in c++ combined OpenACC
>     directives
>     https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01552.html
>
> Is this OK for trunk? I bootstrapped and regtested on x86_64 Linux
> with nvptx offloading.


LGTM, FWIW.

Thanks,
Julian

Patch

[OpenACC] Don't mark OpenACC auto loops as independent inside acc parallel regions

2018-XX-YY  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* omp-low.c (lower_oacc_head_mark): Don't mark OpenACC auto
	loops as independent inside acc parallel regions.

	gcc/testsuite/
	* c-c++-common/goacc/loop-auto-1.c: Adjust test case to conform to
	the new behavior of the auto clause in OpenACC 2.5.
	* c-c++-common/goacc/loop-auto-2.c: Likewise.
	* gcc.dg/goacc/loop-processing-1.c: Likewise.
	* c-c++-common/goacc/loop-auto-3.c: New test.
	* gfortran.dg/goacc/loop-auto-1.f90: New test.

	libgomp/
	* testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c: Adjust test case
	to conform to the new behavior of the auto clause in OpenACC 2.5.

(cherry picked from gomp-4_0-branch r247569, 6d30b542f29)

---
 gcc/omp-low.c                                 |  5 +-
 .../c-c++-common/goacc/loop-auto-1.c          | 50 +++++------
 .../c-c++-common/goacc/loop-auto-2.c          |  4 +-
 .../c-c++-common/goacc/loop-auto-3.c          | 78 ++++++++++++++++
 .../gcc.dg/goacc/loop-processing-1.c          |  2 +-
 .../gfortran.dg/goacc/loop-auto-1.f90         | 88 +++++++++++++++++++
 .../libgomp.oacc-c-c++-common/loop-auto-1.c   | 20 ++---
 7 files changed, 207 insertions(+), 40 deletions(-)
 create mode 100644 gcc/testsuite/c-c++-common/goacc/loop-auto-3.c
 create mode 100644 gcc/testsuite/gfortran.dg/goacc/loop-auto-1.f90

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index fdabf67249b..24685fd012c 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -5647,9 +5647,10 @@  lower_oacc_head_mark (location_t loc, tree ddvar, tree clauses,
       tag |= OLF_GANG_STATIC;
     }
 
-  /* In a parallel region, loops are implicitly INDEPENDENT.  */
+  /* In a parallel region, loops without auto and seq clauses are
+     implicitly INDEPENDENT.  */
   omp_context *tgt = enclosing_target_ctx (ctx);
-  if (!tgt || is_oacc_parallel (tgt))
+  if ((!tgt || is_oacc_parallel (tgt)) && !(tag & (OLF_SEQ | OLF_AUTO)))
     tag |= OLF_INDEPENDENT;
 
   if (tag & OLF_TILE)
diff --git a/gcc/testsuite/c-c++-common/goacc/loop-auto-1.c b/gcc/testsuite/c-c++-common/goacc/loop-auto-1.c
index 124befc4002..dcad07f11c8 100644
--- a/gcc/testsuite/c-c++-common/goacc/loop-auto-1.c
+++ b/gcc/testsuite/c-c++-common/goacc/loop-auto-1.c
@@ -10,7 +10,7 @@  void Foo ()
 #pragma acc loop seq
 	for (int jx = 0; jx < 10; jx++) {}
 
-#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */
+#pragma acc loop auto independent /* { dg-warning "insufficient partitioning" } */
 	for (int jx = 0; jx < 10; jx++) {}
       }
 
@@ -20,7 +20,7 @@  void Foo ()
 #pragma acc loop auto
 	for (int jx = 0; jx < 10; jx++) {}
 
-#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */
+#pragma acc loop auto independent /* { dg-warning "insufficient partitioning" } */
 	for (int jx = 0; jx < 10; jx++)
 	  {
 #pragma acc loop vector
@@ -51,7 +51,7 @@  void Foo ()
 #pragma acc loop vector
 	for (int jx = 0; jx < 10; jx++)
 	  {
-#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */
+#pragma acc loop auto independent /* { dg-warning "insufficient partitioning" } */
 	    for (int kx = 0; kx < 10; kx++) {}
 	  }
 
@@ -64,27 +64,27 @@  void Foo ()
 
       }
     
-#pragma acc loop auto
+#pragma acc loop auto independent
     for (int ix = 0; ix < 10; ix++)
       {
-#pragma acc loop auto
+#pragma acc loop auto independent
 	for (int jx = 0; jx < 10; jx++)
 	  {
-#pragma acc loop auto
+#pragma acc loop auto independent
 	    for (int kx = 0; kx < 10; kx++) {}
 	  }
       }
 
-#pragma acc loop auto
+#pragma acc loop auto independent
     for (int ix = 0; ix < 10; ix++)
       {
-#pragma acc loop auto
+#pragma acc loop auto independent
 	for (int jx = 0; jx < 10; jx++)
 	  {
-#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */
+#pragma acc loop auto independent /* { dg-warning "insufficient partitioning" } */
 	    for (int kx = 0; kx < 10; kx++)
 	      {
-#pragma acc loop auto
+#pragma acc loop auto independent
 		for (int lx = 0; lx < 10; lx++) {}
 	      }
 	  }
@@ -101,7 +101,7 @@  void Gang (void)
 #pragma acc loop seq
 	for (int jx = 0; jx < 10; jx++) {}
 
-#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */
+#pragma acc loop auto independent /* { dg-warning "insufficient partitioning" } */
 	for (int jx = 0; jx < 10; jx++) {}
       }
 
@@ -111,7 +111,7 @@  void Gang (void)
 #pragma acc loop auto
 	for (int jx = 0; jx < 10; jx++) {}
 
-#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */
+#pragma acc loop auto independent /* { dg-warning "insufficient partitioning" } */
 	for (int jx = 0; jx < 10; jx++)
 	  {
 #pragma acc loop vector
@@ -142,7 +142,7 @@  void Gang (void)
 #pragma acc loop vector
 	for (int jx = 0; jx < 10; jx++)
 	  {
-#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */
+#pragma acc loop auto independent /* { dg-warning "insufficient partitioning" } */
 	    for (int kx = 0; kx < 10; kx++) {}
 	  }
 
@@ -176,7 +176,7 @@  void Worker (void)
 #pragma acc loop seq
 	for (int jx = 0; jx < 10; jx++) {}
 
-#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */
+#pragma acc loop auto independent /* { dg-warning "insufficient partitioning" } */
 	for (int jx = 0; jx < 10; jx++) {}
       }
 
@@ -186,7 +186,7 @@  void Worker (void)
 #pragma acc loop auto
 	for (int jx = 0; jx < 10; jx++) {}
 
-#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */
+#pragma acc loop auto independent /* { dg-warning "insufficient partitioning" } */
 	for (int jx = 0; jx < 10; jx++)
 	  {
 #pragma acc loop vector
@@ -194,20 +194,20 @@  void Worker (void)
 	  }
       }
 
-#pragma acc loop auto
+#pragma acc loop
     for (int ix = 0; ix < 10; ix++)
       {
-#pragma acc loop auto
+#pragma acc loop
 	for (int jx = 0; jx < 10; jx++) {}
       }
 
-#pragma acc loop auto
+#pragma acc loop
     for (int ix = 0; ix < 10; ix++)
       {
-#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */
+#pragma acc loop /* { dg-warning "insufficient partitioning" } */
 	for (int jx = 0; jx < 10; jx++)
 	  {
-#pragma acc loop auto
+#pragma acc loop
 	    for (int kx = 0; kx < 10; kx++) {}
 	  }
       }
@@ -222,17 +222,17 @@  void Vector (void)
 #pragma acc loop seq
 	for (int jx = 0; jx < 10; jx++) {}
 
-#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */
+#pragma acc loop auto independent /* { dg-warning "insufficient partitioning" } */
 	for (int jx = 0; jx < 10; jx++) {}
       }
 
-#pragma acc loop auto
+#pragma acc loop auto independent
     for (int ix = 0; ix < 10; ix++) {}
 
-#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */
+#pragma acc loop auto independent /* { dg-warning "insufficient partitioning" } */
     for (int ix = 0; ix < 10; ix++)
       {
-#pragma acc loop auto
+#pragma acc loop auto independent
 	for (int jx = 0; jx < 10; jx++) {}
       }
 }
@@ -240,6 +240,6 @@  void Vector (void)
 #pragma acc routine seq
 void Seq (void)
 {
-#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */
+#pragma acc loop auto independent /* { dg-warning "insufficient partitioning" } */
     for (int ix = 0; ix < 10; ix++) {}
 }
diff --git a/gcc/testsuite/c-c++-common/goacc/loop-auto-2.c b/gcc/testsuite/c-c++-common/goacc/loop-auto-2.c
index af3f0bddf2c..5aa36e93ab8 100644
--- a/gcc/testsuite/c-c++-common/goacc/loop-auto-2.c
+++ b/gcc/testsuite/c-c++-common/goacc/loop-auto-2.c
@@ -72,12 +72,12 @@  void Bad ()
 #pragma acc loop tile(*) gang vector
     for (int ix = 0; ix < 10; ix++)
       {
-	#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */
+	#pragma acc loop auto independent /* { dg-warning "insufficient partitioning" } */
 	for (int jx = 0; jx < 10; jx++)
 	  ;
       }
 
-#pragma acc loop tile(*) auto /* { dg-warning "insufficient partitioning" } */
+#pragma acc loop tile(*) auto independent /* { dg-warning "insufficient partitioning" } */
     for (int ix = 0; ix < 10; ix++)
       {
 	#pragma acc loop worker
diff --git a/gcc/testsuite/c-c++-common/goacc/loop-auto-3.c b/gcc/testsuite/c-c++-common/goacc/loop-auto-3.c
new file mode 100644
index 00000000000..5826967240e
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/loop-auto-3.c
@@ -0,0 +1,78 @@ 
+/* Ensure that the auto clause falls back to seq parallelism when the
+   OpenACC loop is not explicitly independent.  */
+
+/* { dg-additional-options "-fopt-info-optimized-omp" } */
+
+void
+test ()
+{
+  int i, j, k, l, n = 100;
+  
+#pragma acc parallel loop auto /* { dg-message "Detected parallelism <acc loop seq>" } */
+  for (i = 0; i < n; i++)
+#pragma acc loop auto independent /* { dg-message "Detected parallelism <acc loop gang>" } */
+    for (j = 0; j < n; j++)
+#pragma acc loop worker vector /* { dg-message "Detected parallelism <acc loop worker vector>" } */
+      for (k = 0; k < n; k++)
+	;
+
+#pragma acc parallel loop auto independent /* { dg-message "Detected parallelism <acc loop gang worker>" } */
+  for (i = 0; i < n; i++)
+#pragma acc loop auto /* { dg-message "Detected parallelism <acc loop seq>" } */
+    for (j = 0; j < n; j++)
+#pragma acc loop auto /* { dg-message "Detected parallelism <acc loop seq>" } */
+      for (k = 0; k < n; k++)
+#pragma acc loop auto independent /* { dg-message "Detected parallelism <acc loop vector>" } */
+	for (l = 0; l < n; l++)
+	  ;
+
+#pragma acc parallel loop gang /* { dg-message "Detected parallelism <acc loop gang>" } */
+  for (i = 0; i < n; i++)
+#pragma acc loop worker /* { dg-message "Detected parallelism <acc loop worker>" } */
+    for (j = 0; j < n; j++)
+#pragma acc loop vector /* { dg-message "Detected parallelism <acc loop vector>" } */
+      for (k = 0; k < n; k++)
+	{
+#pragma acc loop auto independent /* { dg-message "Detected parallelism <acc loop seq>" } */
+	  /* { dg-warning "insufficient partitioning available to parallelize loop" "" { target *-*-* } .-1 } */
+	  for (l = 0; l < n; l++)
+	    ;
+#pragma acc loop auto /* { dg-message "Detected parallelism <acc loop seq>" } */
+	  for (l = 0; l < n; l++)
+	    ;
+	}
+
+#pragma acc parallel loop /* { dg-message "Detected parallelism <acc loop seq>" } */
+  /* { dg-warning "insufficient partitioning available to parallelize loop" "" { target *-*-* } .-1 } */
+  for (i = 0; i < n; i++)
+    {
+#pragma acc loop gang worker /* { dg-message "Detected parallelism <acc loop gang worker>" } */
+      for (j = 0; j < n; j++)
+#pragma acc loop auto /* { dg-message "Detected parallelism <acc loop seq>" } */
+	for (k = 0; k < n; k++)
+	  {
+#pragma acc loop vector /* { dg-message "Detected parallelism <acc loop vector>" } */
+	    for (l = 0; l < n; l++)
+	      ;
+#pragma acc loop auto independent /* { dg-message "Detected parallelism <acc loop vector>" } */
+	    for (l = 0; l < n; l++)
+	      ;
+	  }
+#pragma acc loop worker /* { dg-message "Detected parallelism <acc loop worker>" } */
+      for (j = 0; j < n; j++)
+#pragma acc loop vector /* { dg-message "Detected parallelism <acc loop vector>" } */
+	for (k = 0; k < n; k++)
+	  ;
+    }
+
+#pragma acc parallel loop /* { dg-message "Detected parallelism <acc loop gang>" } */
+  for (i = 0; i < n; i++)
+#pragma acc loop /* { dg-message "Detected parallelism <acc loop worker>" } */
+    for (j = 0; j < n; j++)
+#pragma acc loop /* { dg-message "Detected parallelism <acc loop seq>" } */
+      /* { dg-warning "insufficient partitioning available to parallelize loop" "" { target *-*-* } .-1 } */
+      for (k = 0; k < n; k++)
+#pragma acc loop /* { dg-message "Detected parallelism <acc loop vector>" } */
+	  for (l = 0; l < n; l++)
+	    ;
+}
diff --git a/gcc/testsuite/gcc.dg/goacc/loop-processing-1.c b/gcc/testsuite/gcc.dg/goacc/loop-processing-1.c
index bd4c07e7d81..7ba61e53012 100644
--- a/gcc/testsuite/gcc.dg/goacc/loop-processing-1.c
+++ b/gcc/testsuite/gcc.dg/goacc/loop-processing-1.c
@@ -9,7 +9,7 @@  void vector_1 (int *ary, int size)
   {
 #pragma acc loop gang
     for (int jx = 0; jx < 1; jx++)
-#pragma acc loop auto
+#pragma acc loop auto independent
       for (int ix = 0; ix < size; ix++)
 	ary[ix] = place ();
   }
diff --git a/gcc/testsuite/gfortran.dg/goacc/loop-auto-1.f90 b/gcc/testsuite/gfortran.dg/goacc/loop-auto-1.f90
new file mode 100644
index 00000000000..a8fadf504bc
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/goacc/loop-auto-1.f90
@@ -0,0 +1,88 @@ 
+! Ensure that the auto clause falls back to seq parallelism when the
+! OpenACC loop is not explicitly independent.
+
+! { dg-additional-options "-fopt-info-optimized-omp" }
+
+program test
+  implicit none
+  integer, parameter :: n = 100
+  integer i, j, k, l
+  
+  !$acc parallel loop auto ! { dg-message "Detected parallelism <acc loop seq>" }
+  do i = 1, n
+     !$acc loop auto independent ! { dg-message "Detected parallelism <acc loop gang>" }
+     do j = 1, n
+        !$acc loop worker vector ! { dg-message "Detected parallelism <acc loop worker vector>" }
+        do k = 1, n
+        end do
+     end do
+  end do
+ 
+  !$acc parallel loop auto independent ! { dg-message "Detected parallelism <acc loop gang worker>" }
+  do i = 1, n
+     !$acc loop auto ! { dg-message "Detected parallelism <acc loop seq>" }
+     do j = 1, n
+        !$acc loop auto ! { dg-message "Detected parallelism <acc loop seq>" }
+        do k = 1, n
+           !$acc loop auto independent ! { dg-message "Detected parallelism <acc loop vector>" }
+           do l = 1, n
+           end do
+        end do
+     end do
+  end do
+
+  !$acc parallel loop gang ! { dg-message "Detected parallelism <acc loop gang>" }
+  do i = 1, n
+     !$acc loop worker ! { dg-message "Detected parallelism <acc loop worker>" }
+     do j = 1, n
+        !$acc loop vector ! { dg-message "Detected parallelism <acc loop vector>" }
+        do k = 1, n
+           !$acc loop auto independent ! { dg-message "Detected parallelism <acc loop seq>" }
+           ! { dg-warning "insufficient partitioning available to parallelize loop" "" { target *-*-* } .-1 }
+           do l = 1, n
+           end do
+           !$acc loop auto ! { dg-message "Detected parallelism <acc loop seq>" }
+           do l = 1, n
+           end do
+        end do
+     end do
+  end do
+  
+
+  !$acc parallel loop ! { dg-message "Detected parallelism <acc loop seq>" }
+  ! { dg-warning "insufficient partitioning available to parallelize loop" "" { target *-*-* } .-1 }
+  do i = 1, n
+     !$acc loop gang worker ! { dg-message "Detected parallelism <acc loop gang worker>" }
+     do j = 1, n
+        !$acc loop auto ! { dg-message "Detected parallelism <acc loop seq>" }
+	do k = 1, n
+          !$acc loop vector ! { dg-message "Detected parallelism <acc loop vector>" }
+          do l = 1, n
+          end do
+       end do
+       !$acc loop auto independent ! { dg-message "Detected parallelism <acc loop vector>" }
+       do l = 1, n
+       end do
+    end do
+    !$acc loop worker ! { dg-message "Detected parallelism <acc loop worker>" }
+    do j = 1, n
+       !$acc loop vector ! { dg-message "Detected parallelism <acc loop vector>" }
+       do k = 1, n
+       end do
+    end do
+  end do
+
+  !$acc parallel loop ! { dg-message "Detected parallelism <acc loop gang>" }
+  do i = 1, n
+     !$acc loop ! { dg-message "Detected parallelism <acc loop worker>" }
+     do j = 1, n
+        !$acc loop ! { dg-message "Detected parallelism <acc loop seq>" }
+        ! { dg-warning "insufficient partitioning available to parallelize loop" "" { target *-*-* } .-1 }
+        do k = 1, n
+           !$acc loop ! { dg-message "Detected parallelism <acc loop vector>" }
+           do l = 1, n
+           end do
+        end do
+     end do
+  end do
+end program test
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c
index 34bc57e51f5..cfb18d11acd 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c
@@ -101,7 +101,7 @@  int vector_1 (int *ary, int size)
   {
 #pragma acc loop gang
     for (int jx = 0; jx < 1; jx++)
-#pragma acc loop auto
+#pragma acc loop auto independent
       for (int ix = 0; ix < size; ix++)
 	ary[ix] = place ();
   }
@@ -117,7 +117,7 @@  int vector_2 (int *ary, int size)
   {
 #pragma acc loop worker
     for (int jx = 0; jx < size  / 64; jx++)
-#pragma acc loop auto
+#pragma acc loop auto independent
       for (int ix = 0; ix < 64; ix++)
 	ary[ix + jx * 64] = place ();
   }
@@ -133,7 +133,7 @@  int worker_1 (int *ary, int size)
   {
 #pragma acc loop gang
     for (int kx = 0; kx < 1; kx++)
-#pragma acc loop auto
+#pragma acc loop auto independent
       for (int jx = 0; jx <  size  / 64; jx++)
 #pragma acc loop vector
 	for (int ix = 0; ix < 64; ix++)
@@ -149,7 +149,7 @@  int gang_1 (int *ary, int size)
   
 #pragma acc parallel num_gangs (32) num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size)
   {
-#pragma acc loop auto
+#pragma acc loop auto independent
     for (int jx = 0; jx <  size  / 64; jx++)
 #pragma acc loop worker
       for (int ix = 0; ix < 64; ix++)
@@ -165,11 +165,11 @@  int gang_2 (int *ary, int size)
   
 #pragma acc parallel num_gangs (32) num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size)
   {
-#pragma acc loop auto
+#pragma acc loop auto independent
     for (int kx = 0; kx < size / (32 * 32); kx++)
-#pragma acc loop auto
+#pragma acc loop auto independent
       for (int jx = 0; jx <  32; jx++)
-#pragma acc loop auto
+#pragma acc loop auto independent
 	for (int ix = 0; ix < 32; ix++)
 	  ary[ix + jx * 32 + kx * 32 * 32] = place ();
   }
@@ -183,9 +183,9 @@  int gang_3 (int *ary, int size)
   
 #pragma acc parallel num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size)
   {
-#pragma acc loop auto
+#pragma acc loop auto independent
     for (int jx = 0; jx <  size  / 64; jx++)
-#pragma acc loop auto
+#pragma acc loop auto independent
       for (int ix = 0; ix < 64; ix++)
 	ary[ix + jx * 64] = place ();
   }
@@ -199,7 +199,7 @@  int gang_4 (int *ary, int size)
   
 #pragma acc parallel vector_length(32) copy(ary[0:size]) firstprivate (size)
   {
-#pragma acc loop auto
+#pragma acc loop auto independent
     for (int jx = 0; jx <  size; jx++)
       ary[jx] = place ();
   }
-- 
2.17.1