[NVPTX] Fix PR83920

Message ID e8e22ccf-2f25-1a8a-a599-b4a0b245c14d@codesourcery.com
State New
Headers show
Series
  • [NVPTX] Fix PR83920
Related show

Commit Message

Cesar Philippidis Jan. 17, 2018, 11:40 p.m.
In PR83920, I encountered a nvptx bug where live predicate variables
were clobbered before their value was broadcasted. Apparently, there
were problems in certain version of the CUDA driver where the JIT would
generate wrong code for shfl broadcasts. The attached patch teaches
nvptx_single not to apply that workaround if the predicate register is live.

Tom, does this patch look sane to you? I'm not sure if it defeats the
purpose of your original patch. Regardless, the live predicate registers
shouldn't be clobbered before they are used.

Unfortunately, I cannot reproduce the runtime failure with gemm example
in the PR, so I didn't include it in the patch. However, this patch does
fix the failure with da-1.c in og7. This patch does not cause any
regressions.

Is it OK for trunk?

Thanks,
Cesar

Comments

Tom de Vries Jan. 18, 2018, 1:27 p.m. | #1
On 01/18/2018 12:40 AM, Cesar Philippidis wrote:
> In PR83920, I encountered a nvptx bug where live predicate variables

> were clobbered before their value was broadcasted. 


Hi,

I've managed to reproduce the problem based on the description in the PR.

> Apparently, there

> were problems in certain version of the CUDA driver where the JIT would

> generate wrong code for shfl broadcasts. 


Correct. And there's a work around committed for the JIT problem, which 
you refer to in the next line (without introducing it first).

> The attached patch teaches

> nvptx_single not to apply that workaround if the predicate register is live.

>  > Tom, does this patch look sane to you?


The fact that the cond register is live at the start of the from bb does 
not mean that the register can't be set inside the bb.

Furthermore, the live info does not make a distinction between 
live-for-lane-0 and line-for-warp. So, if the condition reg is not set 
in the bb, but set only for lane-0 in a previous bb, then we still need 
to initialize lanes 1-31.

So, I don't think this is the way to address this bug.

> I'm not sure if it defeats the

> purpose of your original patch. 


In test cases mentioned above, it does.

> Regardless, the live predicate registers

> shouldn't be clobbered before they are used.

> 


There is a bug in the workaround, that's correct.

I think the way to address it is using a tmp .pred reg like so:
...
{
   .reg .u32 %x;
   mov.u32 %x,%tid.x;
   setp.ne.u32 %rnotvzero,%x,0;
}

{
   .reg .pred %rcond2;
   setp.eq.u32 %rcond2, 1, 0; // workaround

   @%rnotvzero bra Lskip;
   ...
   setp.<op>.<type> %rcond,op1,op2; // could be here, could be earlier
   mov.b1 %rcond2, %rcond; // used pseudo opcode mov.b1 for convenience
  Lskip:
   selp.u32 %rcondu32,1,0,%rcond2;
   shfl.idx.b32 %rcondu32,%rcondu32,0,31;
   setp.ne.u32 %rcond,%rcondu32,0;
}
...

> Unfortunately, I cannot reproduce the runtime failure with gemm example

> in the PR, so I didn't include it in the patch.


I'm managed to modify the test-case such that it reproduces the failure 
with trunk (see PR). So, that test-case should be included.

Thanks,
- Tom

> However, this patch does

> fix the failure with da-1.c in og7.

> This patch does not cause any

> regressions.

> 

> Is it OK for trunk?

> 

> Thanks,

> Cesar

> 

> 

> nvptx-jit-relax.diff

> 

> 

> diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c

> index 55c7e3c..698c574 100644

> --- a/gcc/config/nvptx/nvptx.c

> +++ b/gcc/config/nvptx/nvptx.c

> @@ -3957,6 +3957,7 @@ bb_first_real_insn (basic_block bb)

>   static void

>   nvptx_single (unsigned mask, basic_block from, basic_block to)

>   {

> +  bitmap live = DF_LIVE_IN (from);

>     rtx_insn *head = BB_HEAD (from);

>     rtx_insn *tail = BB_END (to);

>     unsigned skip_mask = mask;

> @@ -4126,8 +4127,9 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)

>   	     There is nothing in the PTX spec to suggest that this is wrong, or

>   	     to explain why the extra initialization is needed.  So, we classify

>   	     it as a JIT bug, and the extra initialization as workaround.  */

> -	  emit_insn_before (gen_movbi (pvar, const0_rtx),

> -			    bb_first_real_insn (from));

> +	  if (!bitmap_bit_p (live, REGNO (pvar)))

> +	    emit_insn_before (gen_movbi (pvar, const0_rtx),

> +			      bb_first_real_insn (from));

>   #endif

>   	  emit_insn_before (nvptx_gen_vcast (pvar), tail);

>   	}

>
Tom de Vries Jan. 19, 2018, 2:15 p.m. | #2
On 01/18/2018 02:27 PM, Tom de Vries wrote:
> On 01/18/2018 12:40 AM, Cesar Philippidis wrote:

>> In PR83920, I encountered a nvptx bug where live predicate variables

>> were clobbered before their value was broadcasted. 

> 

> Hi,

> 

> I've managed to reproduce the problem based on the description in the PR.


> I think the way to address it is using a tmp .pred reg like so:

> ...

> {

>    .reg .u32 %x;

>    mov.u32 %x,%tid.x;

>    setp.ne.u32 %rnotvzero,%x,0;

> }

> 

> {

>    .reg .pred %rcond2;

>    setp.eq.u32 %rcond2, 1, 0; // workaround

> 

>    @%rnotvzero bra Lskip;

>    ...

>    setp.<op>.<type> %rcond,op1,op2; // could be here, could be earlier

>    mov.b1 %rcond2, %rcond; // used pseudo opcode mov.b1 for convenience

>   Lskip:

>    selp.u32 %rcondu32,1,0,%rcond2;

>    shfl.idx.b32 %rcondu32,%rcondu32,0,31;

>    setp.ne.u32 %rcond,%rcondu32,0;

> }

> ...

> 


Hi,

this is the fix that I plan to commit (similar to the scheme listed 
above, but modified to keep the selp.u32 using rcond, which is easier in 
code generation).

Build and reg-tested on x86_64 with nvptx accelerator.

Richard, this is an 8 regression for the nvptx target. OK for stage 4 or 
defer to stage1?

Thanks,
- Tom
[nvptx] Fix bug in jit bug workaround

2018-01-19  Tom de Vries  <tom@codesourcery.com>
	    Cesar Philippidis  <cesar@codesourcery.com>

	PR target/83920

	* config/nvptx/nvptx.c (nvptx_single): Fix jit workaround.

	* testsuite/libgomp.oacc-c-c++-common/pr83920.c: New test.
	* testsuite/libgomp.oacc-fortran/pr83920.f90: New test.

---
 gcc/config/nvptx/nvptx.c                           | 28 +++++++++++++++++--
 .../testsuite/libgomp.oacc-c-c++-common/pr83920.c  | 32 ++++++++++++++++++++++
 libgomp/testsuite/libgomp.oacc-fortran/pr83920.f90 | 28 +++++++++++++++++++
 3 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 86fc13f4fc0..afb0e4dd185 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4096,9 +4096,33 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 
 	     There is nothing in the PTX spec to suggest that this is wrong, or
 	     to explain why the extra initialization is needed.  So, we classify
-	     it as a JIT bug, and the extra initialization as workaround.  */
-	  emit_insn_before (gen_movbi (pvar, const0_rtx),
+	     it as a JIT bug, and the extra initialization as workaround:
+
+		{
+		    .reg .u32 %x;
+		    mov.u32 %x,%tid.x;
+		    setp.ne.u32 %rnotvzero,%x,0;
+		}
+
+		+.reg .pred %rcond2;
+		+setp.eq.u32 %rcond2, 1, 0;
+
+		 @%rnotvzero bra Lskip;
+		 setp.<op>.<type> %rcond,op1,op2;
+		+mov.pred %rcond2, %rcond;
+		 Lskip:
+		+mov.pred %rcond, %rcond2;
+		 selp.u32 %rcondu32,1,0,%rcond;
+		 shfl.idx.b32 %rcondu32,%rcondu32,0,31;
+		 setp.ne.u32 %rcond,%rcondu32,0;
+	  */
+	  rtx_insn *label = PREV_INSN (tail);
+	  gcc_assert (label && LABEL_P (label));
+	  rtx tmp = gen_reg_rtx (BImode);
+	  emit_insn_before (gen_movbi (tmp, const0_rtx),
 			    bb_first_real_insn (from));
+	  emit_insn_before (gen_rtx_SET (tmp, pvar), label);
+	  emit_insn_before (gen_rtx_SET (pvar, tmp), tail);
 #endif
 	  emit_insn_before (nvptx_gen_vcast (pvar), tail);
 	}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/pr83920.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr83920.c
new file mode 100644
index 00000000000..6cd3b5d6f06
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr83920.c
@@ -0,0 +1,32 @@
+/* { dg-do run } */
+
+#include <stdlib.h>
+
+#define n 10
+
+static void __attribute__((noinline)) __attribute__((noclone))
+foo (int beta, int *c)
+{
+  #pragma acc parallel copy(c[0:(n * n) - 1]) num_gangs(2)
+  #pragma acc loop gang
+  for (int j = 0; j < n; ++j)
+    if (beta != 1)
+      {
+        #pragma acc loop vector
+	for (int i = 0; i < n; ++i)
+	  c[i + (j * n)] = 0;
+      }
+}
+
+int
+main (void)
+{
+  int c[n * n];
+
+  c[0] = 1;
+  foo (0, c);
+  if (c[0] != 0)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/pr83920.f90 b/libgomp/testsuite/libgomp.oacc-fortran/pr83920.f90
new file mode 100644
index 00000000000..34ad001abcd
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/pr83920.f90
@@ -0,0 +1,28 @@
+! { dg-do run }
+
+subroutine foo (BETA, C)
+  real ::  C(100,100)
+  integer :: i, j, l
+  real, parameter :: one = 1.0
+  real :: beta
+
+  !$acc parallel copy(c(1:100,1:100)) num_gangs(2)
+  !$acc loop gang
+  do j = 1, 100
+     if (beta /= one) then
+        !$acc loop vector
+        do i = 1, 100
+           C(i,j) = 0.0
+        end do
+     end if
+  end do
+  !$acc end parallel
+end subroutine foo
+
+program test_foo
+  real :: c(100,100), beta
+  beta = 0.0
+  c(:,:) = 1.0
+  call foo (beta, c)
+  if (c(1,1) /= 0.0) call abort ()
+end program test_foo
Richard Biener Jan. 19, 2018, 4:05 p.m. | #3
On January 19, 2018 3:15:45 PM GMT+01:00, Tom de Vries <Tom_deVries@mentor.com> wrote:
>On 01/18/2018 02:27 PM, Tom de Vries wrote:

>> On 01/18/2018 12:40 AM, Cesar Philippidis wrote:

>>> In PR83920, I encountered a nvptx bug where live predicate variables

>>> were clobbered before their value was broadcasted. 

>> 

>> Hi,

>> 

>> I've managed to reproduce the problem based on the description in the

>PR.

>

>> I think the way to address it is using a tmp .pred reg like so:

>> ...

>> {

>>    .reg .u32 %x;

>>    mov.u32 %x,%tid.x;

>>    setp.ne.u32 %rnotvzero,%x,0;

>> }

>> 

>> {

>>    .reg .pred %rcond2;

>>    setp.eq.u32 %rcond2, 1, 0; // workaround

>> 

>>    @%rnotvzero bra Lskip;

>>    ...

>>    setp.<op>.<type> %rcond,op1,op2; // could be here, could be

>earlier

>>    mov.b1 %rcond2, %rcond; // used pseudo opcode mov.b1 for

>convenience

>>   Lskip:

>>    selp.u32 %rcondu32,1,0,%rcond2;

>>    shfl.idx.b32 %rcondu32,%rcondu32,0,31;

>>    setp.ne.u32 %rcond,%rcondu32,0;

>> }

>> ...

>> 

>

>Hi,

>

>this is the fix that I plan to commit (similar to the scheme listed 

>above, but modified to keep the selp.u32 using rcond, which is easier

>in 

>code generation).

>

>Build and reg-tested on x86_64 with nvptx accelerator.

>

>Richard, this is an 8 regression for the nvptx target. OK for stage 4

>or 

>defer to stage1?


OK for stage 4.

Richard. 

>Thanks,

>- Tom

Patch

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 55c7e3c..698c574 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -3957,6 +3957,7 @@  bb_first_real_insn (basic_block bb)
 static void
 nvptx_single (unsigned mask, basic_block from, basic_block to)
 {
+  bitmap live = DF_LIVE_IN (from);
   rtx_insn *head = BB_HEAD (from);
   rtx_insn *tail = BB_END (to);
   unsigned skip_mask = mask;
@@ -4126,8 +4127,9 @@  nvptx_single (unsigned mask, basic_block from, basic_block to)
 	     There is nothing in the PTX spec to suggest that this is wrong, or
 	     to explain why the extra initialization is needed.  So, we classify
 	     it as a JIT bug, and the extra initialization as workaround.  */
-	  emit_insn_before (gen_movbi (pvar, const0_rtx),
-			    bb_first_real_insn (from));
+	  if (!bitmap_bit_p (live, REGNO (pvar)))
+	    emit_insn_before (gen_movbi (pvar, const0_rtx),
+			      bb_first_real_insn (from));
 #endif
 	  emit_insn_before (nvptx_gen_vcast (pvar), tail);
 	}