[og7] vector_length extension part 5: libgomp and tests

Message ID c024e89c-6bd8-2cc2-28a2-f5ca17b30aa8@codesourcery.com
State New
Headers show
Series
  • [og7] vector_length extension part 5: libgomp and tests
Related show

Commit Message

Cesar Philippidis March 2, 2018, 8:47 p.m.
The attached patch is the last one in the vector length extension
series. It consists of some tweaks to the libgomp nvptx plugin to
accommodate larger vectors along with two test cases.

I only added two test cases because there's really not much interesting
going on with longer vector lengths. We should eventually add more tests
cases to handle situations where the nvptx BE falls back to using a
shorter vector length. But right now GCC just makes those changes
silently. There is precedent for the nvptx BE to emit a warning when
vector length != 32, but that might be too verbose. On one hand, it
could be argued that the compiler should error if it cannot satisfy the
user's request. On the other hand, falling back to a smaller vector
length ensures correctness.

Thomas, do you have any thoughts on the warnings/errors or there lack of?

I'll apply this patch to openacc-gcc-7-branch once the reduction changes
have been approved.

Cesar

Comments

Thomas Schwinge March 16, 2018, 1:50 p.m. | #1
Hi!

On Fri, 2 Mar 2018 12:47:23 -0800, Cesar Philippidis <cesar@codesourcery.com> wrote:
> The attached patch is the last one in the vector length extension

> series. It consists of some tweaks to the libgomp nvptx plugin to

> accommodate larger vectors along with two test cases.

> 

> I only added two test cases because there's really not much interesting

> going on with longer vector lengths. We should eventually add more tests

> cases to handle situations where the nvptx BE falls back to using a

> shorter vector length. But right now GCC just makes those changes

> silently. There is precedent for the nvptx BE to emit a warning when

> vector length != 32, but that might be too verbose. On one hand, it

> could be argued that the compiler should error if it cannot satisfy the

> user's request. On the other hand, falling back to a smaller vector

> length ensures correctness.

> 

> Thomas, do you have any thoughts on the warnings/errors or there lack of?


Yeah, warning when "vector_length([bigger than 32])" is reduced to
"vector_length(32)" is probably too verbose/not useful.  But, it'd be
useful to have in "-fopt-info-omp"?


Grüße
 Thomas
Tom de Vries March 27, 2018, 12:16 p.m. | #2
On 03/02/2018 09:47 PM, Cesar Philippidis wrote:
> two test cases.


Committed as separate patch, while ignoring the warnings "using 
vector_length \\(32\\), ignoring 128".

Thanks,
- Tom
[openacc] Add vector_length 128 testcases

2018-03-27  Cesar Philippidis  <cesar@codesourcery.com>
	    Tom de Vries  <tom@codesourcery.com>

	* testsuite/libgomp.oacc-c-c++-common/vred2d-128.c: New test.
	* testsuite/libgomp.oacc-fortran/gemm.f90: New test.

---
 .../libgomp.oacc-c-c++-common/vred2d-128.c         |  57 +++++++++++
 libgomp/testsuite/libgomp.oacc-fortran/gemm.f90    | 109 +++++++++++++++++++++
 3 files changed, 172 insertions(+)

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c
new file mode 100644
index 0000000..1dc5fe0
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c
@@ -0,0 +1,57 @@
+/* Test large vector lengths.  */
+
+#include <assert.h>
+
+#define n 10000
+int a1[n], a2[n];
+
+#define gentest(name, outer, inner)		\
+  void name ()					\
+  {						\
+  long i, j, t1, t2, t3;			\
+  _Pragma(outer)				\
+  for (i = 0; i < n; i++)			\
+    {						\
+      t1 = 0;					\
+      t2 = 0;					\
+      _Pragma(inner)				\
+      for (j = i; j < n; j++)			\
+	{					\
+	  t1++;					\
+	  t2--;					\
+	}					\
+      a1[i] = t1;				\
+      a2[i] = t2;				\
+    }						\
+  for (i = 0; i < n; i++)			\
+    {						\
+      assert (a1[i] == n-i);			\
+      assert (a2[i] == -(n-i));			\
+    }						\
+  }						\
+
+gentest (test1, "acc parallel loop gang vector_length (128)",
+	 "acc loop vector reduction(+:t1) reduction(-:t2)")
+
+gentest (test2, "acc parallel loop gang vector_length (128)",
+	 "acc loop worker vector reduction(+:t1) reduction(-:t2)")
+
+gentest (test3, "acc parallel loop gang worker vector_length (128)",
+	 "acc loop vector reduction(+:t1) reduction(-:t2)")
+
+gentest (test4, "acc parallel loop",
+	 "acc loop reduction(+:t1) reduction(-:t2)")
+
+/* { dg-prune-output "using vector_length \\(32\\), ignoring 128" } */
+
+
+int
+main ()
+{
+  test1 ();
+  test2 ();
+  test3 ();
+  test4 ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/gemm.f90 b/libgomp/testsuite/libgomp.oacc-fortran/gemm.f90
new file mode 100644
index 0000000..62b8a45
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/gemm.f90
@@ -0,0 +1,109 @@
+! Exercise three levels of parallelism using SGEMM from BLAS.
+
+! { dg-additional-options "-fopenacc-dim=-:-:128" }
+
+! Implicitly set vector_length to 128 using -fopenacc-dim.
+subroutine openacc_sgemm (m, n, k, alpha, a, b, beta, c)
+  integer :: m, n, k
+  real :: alpha, beta
+  real :: a(k,*), b(k,*), c(m,*)
+
+  integer :: i, j, l
+  real :: temp
+
+  !$acc parallel loop copy(c(1:m,1:n)) copyin(a(1:k,1:m),b(1:k,1:n))
+  do j = 1, n
+     !$acc loop
+     do i = 1, m
+        temp = 0.0
+        !$acc loop reduction(+:temp)
+        do l = 1, k
+           temp = temp + a(l,i)*b(l,j)
+        end do
+        if(beta == 0.0) then
+           c(i,j) = alpha*temp
+        else
+           c(i,j) = alpha*temp + beta*c(i,j)
+        end if
+     end do
+  end do
+end subroutine openacc_sgemm
+
+! Explicitly set vector_length to 128 using a vector_length clause.
+subroutine openacc_sgemm_128 (m, n, k, alpha, a, b, beta, c)
+  integer :: m, n, k
+  real :: alpha, beta
+  real :: a(k,*), b(k,*), c(m,*)
+
+  integer :: i, j, l
+  real :: temp
+
+  !$acc parallel loop copy(c(1:m,1:n)) copyin(a(1:k,1:m),b(1:k,1:n)) vector_length (128)
+  ! { dg-prune-output "using vector_length \\(32\\), ignoring 128" }
+  do j = 1, n
+     !$acc loop
+     do i = 1, m
+        temp = 0.0
+        !$acc loop reduction(+:temp)
+        do l = 1, k
+           temp = temp + a(l,i)*b(l,j)
+        end do
+        if(beta == 0.0) then
+           c(i,j) = alpha*temp
+        else
+           c(i,j) = alpha*temp + beta*c(i,j)
+        end if
+     end do
+  end do
+end subroutine openacc_sgemm_128
+
+subroutine host_sgemm (m, n, k, alpha, a, b, beta, c)
+  integer :: m, n, k
+  real :: alpha, beta
+  real :: a(k,*), b(k,*), c(m,*)
+
+  integer :: i, j, l
+  real :: temp
+
+  do j = 1, n
+     do i = 1, m
+        temp = 0.0
+        do l = 1, k
+           temp = temp + a(l,i)*b(l,j)
+        end do
+        if(beta == 0.0) then
+           c(i,j) = alpha*temp
+        else
+           c(i,j) = alpha*temp + beta*c(i,j)
+        end if
+     end do
+  end do
+end subroutine host_sgemm
+
+program main
+  integer, parameter :: M = 100, N = 50, K = 2000
+  real :: a(K, M), b(K, N), c(M, N), d (M, N), e (M, N)
+  real alpha, beta
+  integer i, j
+
+  a(:,:) = 1.0
+  b(:,:) = 0.25
+
+  c(:,:) = 0.0
+  d(:,:) = 0.0
+  e(:,:) = 0.0
+
+  alpha = 1.05
+  beta = 1.25
+
+  call openacc_sgemm (M, N, K, alpha, a, b, beta, c)
+  call openacc_sgemm_128 (M, N, K, alpha, a, b, beta, d)
+  call host_sgemm (M, N, K, alpha, a, b, beta, e)
+
+  do i = 1, m
+     do j = 1, n
+        if (c(i,j) /= e(i,j)) call abort
+        if (d(i,j) /= e(i,j)) call abort
+     end do
+  end do
+end program main
Tom de Vries April 5, 2018, 4:36 p.m. | #3
On 03/02/2018 09:47 PM, Cesar Philippidis wrote:
> 	libgomp/

> 	* plugin/plugin-nvptx.c (nvptx_exec): Adjust calculations of

> 	workers and vectors.


I wrote a test case that triggers this code, and added it to this code.

Build x86_64 with nvptx accelerator and tested libgomp.

Committed.

Thanks,
- Tom
[nvptx] Handle large vectors in libgomp

2018-04-05  Cesar Philippidis  <cesar@codesourcery.com>
	    Tom de Vries  <tom@codesourcery.com>

	* plugin/plugin-nvptx.c (nvptx_exec): Adjust calculations of
	workers and vectors.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c: New test.

---
 libgomp/plugin/plugin-nvptx.c                      | 10 +++---
 .../vector-length-128-7.c                          | 41 ++++++++++++++++++++++
 2 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index bdc0c30..9b4768f 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -734,8 +734,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
   int threads_per_block = threads_per_sm > block_size
     ? block_size : threads_per_sm;
 
-  threads_per_block /= warp_size;
-
   if (threads_per_sm > cpu_size)
     threads_per_sm = cpu_size;
 
@@ -802,6 +800,10 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 
   if (seen_zero)
     {
+      int vectors = dims[GOMP_DIM_VECTOR] > 0
+	? dims[GOMP_DIM_VECTOR] : warp_size;
+      int workers = threads_per_block / vectors;
+
       for (i = 0; i != GOMP_DIM_MAX; i++)
 	if (!dims[i])
 	  {
@@ -819,10 +821,10 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 		  : 2 * dev_size;
 		break;
 	      case GOMP_DIM_WORKER:
-		dims[i] = threads_per_block;
+		dims[i] = workers;
 		break;
 	      case GOMP_DIM_VECTOR:
-		dims[i] = warp_size;
+		dims[i] = vectors;
 		break;
 	      default:
 		abort ();
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
new file mode 100644
index 0000000..60c264c
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
@@ -0,0 +1,41 @@
+/* { dg-do run { target openacc_nvidia_accel_selected } } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-mlong-vector-in-workers" } */
+/* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
+
+#include <stdlib.h>
+
+#define N 1024
+
+unsigned int a[N];
+unsigned int b[N];
+unsigned int c[N];
+unsigned int n = N;
+
+int
+main (void)
+{
+  for (unsigned int i = 0; i < n; ++i)
+    {
+      a[i] = i % 3;
+      b[i] = i % 5;
+    }
+
+#pragma acc parallel vector_length (128) copyin (a,b) copyout (c)
+  {
+#pragma acc loop worker
+    for (unsigned int i = 0; i < 4; i++)
+#pragma acc loop vector
+      for (unsigned int j = 0; j < n / 4; j++)
+	c[(i * N / 4) + j] = a[(i * N / 4) + j] + b[(i * N / 4) + j];
+  }
+
+  for (unsigned int i = 0; i < n; ++i)
+    if (c[i] != (i % 3) + (i % 5))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 128\\)" "oaccdevlow" } } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=8, vectors=128" } */

Patch

2018-03-02  Cesar Philippidis  <cesar@codesourcery.com>

	libgomp/
	* plugin/plugin-nvptx.c (nvptx_exec): Adjust calculations of
	workers and vectors.
	* testsuite/libgomp.oacc-c-c++-common/vred2d-128.c: New test.
	* testsuite/libgomp.oacc-fortran/gemm.f90: New test.


From 37e83baa6ae7e867e6203d7554e8381660488421 Mon Sep 17 00:00:00 2001
From: Cesar Philippidis <cesar@codesourcery.com>
Date: Fri, 2 Mar 2018 06:54:25 -0800
Subject: [PATCH] runtime changes and tests

---
 libgomp/plugin/plugin-nvptx.c                      |  10 +-
 .../libgomp.oacc-c-c++-common/vred2d-128.c         |  55 +++++++++++
 libgomp/testsuite/libgomp.oacc-fortran/gemm.f90    | 108 +++++++++++++++++++++
 3 files changed, 169 insertions(+), 4 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/gemm.f90

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index bdc0c30e1f5..9b4768f0e59 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -734,8 +734,6 @@  nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
   int threads_per_block = threads_per_sm > block_size
     ? block_size : threads_per_sm;
 
-  threads_per_block /= warp_size;
-
   if (threads_per_sm > cpu_size)
     threads_per_sm = cpu_size;
 
@@ -802,6 +800,10 @@  nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 
   if (seen_zero)
     {
+      int vectors = dims[GOMP_DIM_VECTOR] > 0
+	? dims[GOMP_DIM_VECTOR] : warp_size;
+      int workers = threads_per_block / vectors;
+
       for (i = 0; i != GOMP_DIM_MAX; i++)
 	if (!dims[i])
 	  {
@@ -819,10 +821,10 @@  nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 		  : 2 * dev_size;
 		break;
 	      case GOMP_DIM_WORKER:
-		dims[i] = threads_per_block;
+		dims[i] = workers;
 		break;
 	      case GOMP_DIM_VECTOR:
-		dims[i] = warp_size;
+		dims[i] = vectors;
 		break;
 	      default:
 		abort ();
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c
new file mode 100644
index 00000000000..0fbd30e77b9
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c
@@ -0,0 +1,55 @@ 
+/* Test large vector lengths.  */
+
+#include <assert.h>
+
+#define n 10000
+int a1[n], a2[n];
+
+#define gentest(name, outer, inner)		\
+  void name ()					\
+  {						\
+  long i, j, t1, t2, t3;			\
+  _Pragma(outer)				\
+  for (i = 0; i < n; i++)			\
+    {						\
+      t1 = 0;					\
+      t2 = 0;					\
+      _Pragma(inner)				\
+      for (j = i; j < n; j++)			\
+	{					\
+	  t1++;					\
+	  t2--;					\
+	}					\
+      a1[i] = t1;				\
+      a2[i] = t2;				\
+    }						\
+  for (i = 0; i < n; i++)			\
+    {						\
+      assert (a1[i] == n-i);			\
+      assert (a2[i] == -(n-i));			\
+    }						\
+  }						\
+  
+gentest (test1, "acc parallel loop gang vector_length (128)",
+	 "acc loop vector reduction(+:t1) reduction(-:t2)")
+
+gentest (test2, "acc parallel loop gang vector_length (128)",
+	 "acc loop worker vector reduction(+:t1) reduction(-:t2)")
+
+gentest (test3, "acc parallel loop gang worker vector_length (128)",
+	 "acc loop vector reduction(+:t1) reduction(-:t2)")
+
+gentest (test4, "acc parallel loop",
+	 "acc loop reduction(+:t1) reduction(-:t2)")
+
+
+int
+main ()
+{
+  test1 ();
+  test2 ();
+  test3 ();
+  test4 ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/gemm.f90 b/libgomp/testsuite/libgomp.oacc-fortran/gemm.f90
new file mode 100644
index 00000000000..ad67dce5cad
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/gemm.f90
@@ -0,0 +1,108 @@ 
+! Exercise three levels of parallelism using SGEMM from BLAS.
+
+! { dg-additional-options "-fopenacc-dim=-:-:128" }
+
+! Implicitly set vector_length to 128 using -fopenacc-dim.
+subroutine openacc_sgemm (m, n, k, alpha, a, b, beta, c)
+  integer :: m, n, k
+  real :: alpha, beta
+  real :: a(k,*), b(k,*), c(m,*)
+
+  integer :: i, j, l
+  real :: temp
+
+  !$acc parallel loop copy(c(1:m,1:n)) copyin(a(1:k,1:m),b(1:k,1:n))
+  do j = 1, n
+     !$acc loop
+     do i = 1, m
+        temp = 0.0
+        !$acc loop reduction(+:temp)
+        do l = 1, k
+           temp = temp + a(l,i)*b(l,j)
+        end do
+        if(beta == 0.0) then
+           c(i,j) = alpha*temp
+        else
+           c(i,j) = alpha*temp + beta*c(i,j)
+        end if
+     end do
+  end do
+end subroutine openacc_sgemm
+
+! Explicitly set vector_length to 128 using a vector_length clause.
+subroutine openacc_sgemm_128 (m, n, k, alpha, a, b, beta, c)
+  integer :: m, n, k
+  real :: alpha, beta
+  real :: a(k,*), b(k,*), c(m,*)
+
+  integer :: i, j, l
+  real :: temp
+
+  !$acc parallel loop copy(c(1:m,1:n)) copyin(a(1:k,1:m),b(1:k,1:n)) vector_length (128)
+  do j = 1, n
+     !$acc loop
+     do i = 1, m
+        temp = 0.0
+        !$acc loop reduction(+:temp)
+        do l = 1, k
+           temp = temp + a(l,i)*b(l,j)
+        end do
+        if(beta == 0.0) then
+           c(i,j) = alpha*temp
+        else
+           c(i,j) = alpha*temp + beta*c(i,j)
+        end if
+     end do
+  end do
+end subroutine openacc_sgemm_128
+
+subroutine host_sgemm (m, n, k, alpha, a, b, beta, c)
+  integer :: m, n, k
+  real :: alpha, beta
+  real :: a(k,*), b(k,*), c(m,*)
+
+  integer :: i, j, l
+  real :: temp
+
+  do j = 1, n
+     do i = 1, m
+        temp = 0.0
+        do l = 1, k
+           temp = temp + a(l,i)*b(l,j)
+        end do
+        if(beta == 0.0) then
+           c(i,j) = alpha*temp
+        else
+           c(i,j) = alpha*temp + beta*c(i,j)
+        end if
+     end do
+  end do
+end subroutine host_sgemm
+
+program main
+  integer, parameter :: M = 100, N = 50, K = 2000
+  real :: a(K, M), b(K, N), c(M, N), d (M, N), e (M, N)
+  real alpha, beta
+  integer i, j
+
+  a(:,:) = 1.0
+  b(:,:) = 0.25
+
+  c(:,:) = 0.0
+  d(:,:) = 0.0
+  e(:,:) = 0.0
+
+  alpha = 1.05
+  beta = 1.25
+
+  call openacc_sgemm (M, N, K, alpha, a, b, beta, c)
+  call openacc_sgemm_128 (M, N, K, alpha, a, b, beta, d)
+  call host_sgemm (M, N, K, alpha, a, b, beta, e)
+
+  do i = 1, m
+     do j = 1, n
+        if (c(i,j) /= e(i,j)) call abort
+        if (d(i,j) /= e(i,j)) call abort
+     end do
+  end do
+end program main
-- 
2.14.3