, PR target/81550, Rewrite PowerPC loop_align test so it still tests the original target hook

Message ID 20180124052755.GA8250@ibm-tiger.the-meissners.org
State New
Headers show
Series
  • , PR target/81550, Rewrite PowerPC loop_align test so it still tests the original target hook
Related show

Commit Message

Michael Meissner Jan. 24, 2018, 5:27 a.m.
The loop_align.c test has been broken for some time, since I put in patches to
eliminate some debug hooks (-mno-upper-regs-{df,di,sf}) that we deemed to no
longer be needed.

As Segher and I were discussing over private IRC, the root cause of this bug is
the compiler no long generates the BDNZ instruction for a count down loop,
instead it decrements the index in a GPR and does a branch/comparison on it.
In doing so, it now unrolls the loop twice, and and the resulting loop is too
big for the target hook TARGET_ASM_LOOP_ALIGN_MAX_SKIP.  This means the loop
isn't aligned to a 32 byte boundary.

While it is important to ultimately fix the code generation bug to once again
generate the BDNZ instruction, it may be more involved in fixing the bug.  So,
I decided to rewrite the test to be simpler, and the resulting code fits within
the 4-8 instruction window the target hook is looking for.

I have tested this on a little endian power8 system, and a big endian power8
system, doing both bootstrap builds and regression tests.  The only change to
the regression test is that loop_align.c now passes on little endian 64-bit and
big endian 64-bit (big endian 32-bit did not fail with the new changes).  Can I
install this on the trunk?  Back ports to GCC 7/6 are not needed, since the
original code works on those systems.

[gcc/testsuite]
2018-01-24  Michael Meissner  <meissner@linux.vnet.ibm.com>

	PR target/81550
	* gcc.target/powerpc/loop_align.c: Rewrite test so that the loop
	remains small enough that it tests the alignment of loops
	specified by the target hook TARGET_ASM_LOOP_ALIGN_MAX_SKIP.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.vnet.ibm.com, phone: +1 (978) 899-4797

Comments

Segher Boessenkool Jan. 24, 2018, 6:35 p.m. | #1
Hi!

On Wed, Jan 24, 2018 at 12:27:55AM -0500, Michael Meissner wrote:
> 

> As Segher and I were discussing over private IRC, the root cause of this bug is

> the compiler no long generates the BDNZ instruction for a count down loop,

> instead it decrements the index in a GPR and does a branch/comparison on it.


Yes, ivopts makes a bad decision (it uses stride 8 for all IVs, it should
keep one with stride -1 for the loop counter, for optimal code; it also
does three separate increments for the three memory accesses, which is
a bit excessive here).

> In doing so, it now unrolls the loop twice, and and the resulting loop is too

> big for the target hook TARGET_ASM_LOOP_ALIGN_MAX_SKIP.  This means the loop

> isn't aligned to a 32 byte boundary.


It's not really unrolling, it is bb-reorder copying an RTL block.  However,
even if you disable it you still get 9 insns on some configurations, so
your patch does not hide the problem :-(

Although, hrm, in your patch you also change "int i" to "long i"; that
alone seems to be enough to fix everything?  Could you check that please?


Segher
Michael Meissner Jan. 24, 2018, 8:19 p.m. | #2
On Wed, Jan 24, 2018 at 12:35:38PM -0600, Segher Boessenkool wrote:
> Hi!

> 

> On Wed, Jan 24, 2018 at 12:27:55AM -0500, Michael Meissner wrote:

> > 

> > As Segher and I were discussing over private IRC, the root cause of this bug is

> > the compiler no long generates the BDNZ instruction for a count down loop,

> > instead it decrements the index in a GPR and does a branch/comparison on it.

> 

> Yes, ivopts makes a bad decision (it uses stride 8 for all IVs, it should

> keep one with stride -1 for the loop counter, for optimal code; it also

> does three separate increments for the three memory accesses, which is

> a bit excessive here).

> 

> > In doing so, it now unrolls the loop twice, and and the resulting loop is too

> > big for the target hook TARGET_ASM_LOOP_ALIGN_MAX_SKIP.  This means the loop

> > isn't aligned to a 32 byte boundary.

> 

> It's not really unrolling, it is bb-reorder copying an RTL block.  However,

> even if you disable it you still get 9 insns on some configurations, so

> your patch does not hide the problem :-(

> 

> Although, hrm, in your patch you also change "int i" to "long i"; that

> alone seems to be enough to fix everything?  Could you check that please?


Changing i and n to either 'long' or 'long unsigned' makes the test work.

It is interesting that -mcpu=power7 -mbig does not seem to be able to create
LFDU and STFDU, but either setting cpu to power8/power9 or setting -mbig to
-mlittle or -m32 it can generate those instructions.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.vnet.ibm.com, phone: +1 (978) 899-4797
Michael Meissner Jan. 24, 2018, 10 p.m. | #3
On Wed, Jan 24, 2018 at 12:35:38PM -0600, Segher Boessenkool wrote:
> Hi!

> 

> On Wed, Jan 24, 2018 at 12:27:55AM -0500, Michael Meissner wrote:

> > 

> > As Segher and I were discussing over private IRC, the root cause of this bug is

> > the compiler no long generates the BDNZ instruction for a count down loop,

> > instead it decrements the index in a GPR and does a branch/comparison on it.

> 

> Yes, ivopts makes a bad decision (it uses stride 8 for all IVs, it should

> keep one with stride -1 for the loop counter, for optimal code; it also

> does three separate increments for the three memory accesses, which is

> a bit excessive here).

> 

> > In doing so, it now unrolls the loop twice, and and the resulting loop is too

> > big for the target hook TARGET_ASM_LOOP_ALIGN_MAX_SKIP.  This means the loop

> > isn't aligned to a 32 byte boundary.

> 

> It's not really unrolling, it is bb-reorder copying an RTL block.  However,

> even if you disable it you still get 9 insns on some configurations, so

> your patch does not hide the problem :-(

> 

> Although, hrm, in your patch you also change "int i" to "long i"; that

> alone seems to be enough to fix everything?  Could you check that please?


Replacing 'int' with 'unsigned long' allows the test to succeed once again.  I
have checked this on a big endian power8 (both 32-bit and 64-bit) and on a
little endian power8 (64-bit only), and it passes in all three environments.
Can I install this on the trunk?

[gcc/testsuite]
2018-01-24  Michael Meissner  <meissner@linux.vnet.ibm.com>

	PR target/81550
	* gcc.target/powerpc/loop_align.c: Use unsigned long for the loop
	index instead of int, which allows IVOPTs to properly optimize the
	loop.

Index: gcc/testsuite/gcc.target/powerpc/loop_align.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/loop_align.c	(revision 256992)
+++ gcc/testsuite/gcc.target/powerpc/loop_align.c	(working copy)
@@ -4,8 +4,8 @@
 /* { dg-options "-O2 -mcpu=power7 -falign-functions=16" } */
 /* { dg-final { scan-assembler ".p2align 5,,31" } } */
 
-void f(double *a, double *b, double *c, int n) {
-  int i;
+void f(double *a, double *b, double *c, unsigned long n) {
+  unsigned long i;
   for (i=0; i < n; i++)
     a[i] = b[i] + c[i];
 }

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.vnet.ibm.com, phone: +1 (978) 899-4797
Segher Boessenkool Jan. 25, 2018, 12:27 a.m. | #4
On Wed, Jan 24, 2018 at 03:19:00PM -0500, Michael Meissner wrote:
> On Wed, Jan 24, 2018 at 12:35:38PM -0600, Segher Boessenkool wrote:

> > Although, hrm, in your patch you also change "int i" to "long i"; that

> > alone seems to be enough to fix everything?  Could you check that please?

> 

> Changing i and n to either 'long' or 'long unsigned' makes the test work.

> 

> It is interesting that -mcpu=power7 -mbig does not seem to be able to create

> LFDU and STFDU, but either setting cpu to power8/power9 or setting -mbig to

> -mlittle or -m32 it can generate those instructions.


Yeah, dunno...  I suspect we have some target costs a bit wrong,
influencing the ivopts etc. decisions.

The auto_inc_dec pass says (-mcpu=power7 -mabi=elfv2 -mbig):

   23: r147:DF=[r126:DI]
found mem(23) *(r[126]+0)
trying SIMPLE_PRE_INC
cost failure old=16 new=408

(where -mlittle thinks it is fine; does not say what costs, but the 408
for -mbig looks suspicious of course -- sounds like the call to
rs6000_slow_unaligned_access in rs6000_rtx_costs misfired.  Yet another
reason to use insn_cost instead ;-) )


Segher
Segher Boessenkool Jan. 25, 2018, 12:51 a.m. | #5
On Wed, Jan 24, 2018 at 05:00:39PM -0500, Michael Meissner wrote:
> Replacing 'int' with 'unsigned long' allows the test to succeed once again.  I

> have checked this on a big endian power8 (both 32-bit and 64-bit) and on a

> little endian power8 (64-bit only), and it passes in all three environments.

> Can I install this on the trunk?


Yes, this is.  Please install on trunk.  Thanks!


Segher


> [gcc/testsuite]

> 2018-01-24  Michael Meissner  <meissner@linux.vnet.ibm.com>

> 

> 	PR target/81550

> 	* gcc.target/powerpc/loop_align.c: Use unsigned long for the loop

> 	index instead of int, which allows IVOPTs to properly optimize the

> 	loop.
Michael Meissner Jan. 27, 2018, 4:16 p.m. | #6
While we've 'fixed' the loop_align.c test to work with the current compiler, I
did some digging into the effects of subversion id 240482, which removed the
old debug options -m{,no-}upper-regs{,-df,-di,-sf}.  These options controlled
whether the DFmode, DImode, and SFmode types could go into the Altivec
registers.

In subversion id 240481, if you did not use the -mno-upper-regs* switches, on
power7 DFmode and DImode could go into the Altivec registers.  On power8,
SFmode can also go into the Altivec registers.

In subversion id 240481 for power7, the flags in the reg_addr array say that
DFmode does not support ++/-- operations, due to the Altivec registers not
having load and store with update variants.  Similarly for power8, the flags in
the reg_addr for SFmode says that it does not support ++/-- operations.

In subversion id 240482, I allowed ++/-- for DFmode in power7 and ++/-- for
SFmode in power8.

By allowing the ++/-- support for int indexes, it causes IV-OPTS to not
properly optimize the loop.

This patch restores the 240841 behavior of not allowing ++/-- for scalar
floating point types.  The loop looks like:

.L3:
        lfdx 0,4,9
        lfdx 12,5,9
        fadd 0,0,12
        stfdx 0,3,9
        addi 9,9,8
        bdnz .L3

Which is small enough to align the loop to 32 bytes.

However, it does cause slight changes in running Spec 2006 on a little endian
power8 system:

	471.omnetpp:	1.3% faster
	434.zeusmp:	1.9% slower
	435.gromacs:	2.0% faster
	482.sphinx3:	1.4% faster

I did the bootstrap and make check tests, and they all pass.  If I replace the
loop_align.c test with the old version it passes also.  Even with the zeusmp
regression, three other benchmarks speed up slightly, and the simple loop test
also is better.  Can I check this into the trunk?

[gcc]
2018-01-27  Michael Meissner  <meissner@linux.vnet.ibm.com>

	PR target/81550
	* config/rs6000/rs6000.c (rs6000_setup_reg_addr_masks): If
	floating point scalars are allowed in traditional Altivec
	registers, don't allow PRE_{INC,DEC,MODIFY} on those floating
	point types.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.vnet.ibm.com, phone: +1 (978) 899-4797
Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 257024)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -2990,6 +2990,8 @@ rs6000_setup_reg_addr_masks (void)
 		  && !VECTOR_MODE_P (m2)
 		  && !FLOAT128_VECTOR_P (m2)
 		  && !complex_p
+		  && (m != E_DFmode || !TARGET_VSX)
+		  && (m != E_SFmode || !TARGET_P8_VECTOR)
 		  && !small_int_vsx_p)
 		{
 		  addr_mask |= RELOAD_REG_PRE_INCDEC;
Segher Boessenkool Jan. 29, 2018, 10 p.m. | #7
Hi!

On Sat, Jan 27, 2018 at 11:16:00AM -0500, Michael Meissner wrote:
> By allowing the ++/-- support for int indexes, it causes IV-OPTS to not

> properly optimize the loop.

> 

> This patch restores the 240841 behavior of not allowing ++/-- for scalar

> floating point types.  The loop looks like:

> 

> .L3:

>         lfdx 0,4,9

>         lfdx 12,5,9

>         fadd 0,0,12

>         stfdx 0,3,9

>         addi 9,9,8

>         bdnz .L3

> 

> Which is small enough to align the loop to 32 bytes.


More importantly ;-), this is better than doing three increments in the
loop (as the other case does).  Which then as a bonus makes an IV choice
that allows bdnz win.

> However, it does cause slight changes in running Spec 2006 on a little endian

> power8 system:

> 

> 	471.omnetpp:	1.3% faster

> 	434.zeusmp:	1.9% slower

> 	435.gromacs:	2.0% faster

> 	482.sphinx3:	1.4% faster


That looks good.  Would be nice to figure out what causes the zeusmp
regression.

Also, with or without this patch, it seems we give pretty wrong costs
to ivopts.  We should investigate that, too (for GCC 9, unless there
is some obvious and simple thing we do wrong currently).

> 	PR target/81550

> 	* config/rs6000/rs6000.c (rs6000_setup_reg_addr_masks): If

> 	floating point scalars are allowed in traditional Altivec

> 	registers, don't allow PRE_{INC,DEC,MODIFY} on those floating

> 	point types.


> Index: gcc/config/rs6000/rs6000.c

> ===================================================================

> --- gcc/config/rs6000/rs6000.c	(revision 257024)

> +++ gcc/config/rs6000/rs6000.c	(working copy)

> @@ -2990,6 +2990,8 @@ rs6000_setup_reg_addr_masks (void)

>  		  && !VECTOR_MODE_P (m2)

>  		  && !FLOAT128_VECTOR_P (m2)

>  		  && !complex_p

> +		  && (m != E_DFmode || !TARGET_VSX)

> +		  && (m != E_SFmode || !TARGET_P8_VECTOR)

>  		  && !small_int_vsx_p)

>  		{

>  		  addr_mask |= RELOAD_REG_PRE_INCDEC;


(Maybe add a comment?)

This is okay for trunk.  Thanks!


Segher
Michael Meissner Jan. 29, 2018, 10:31 p.m. | #8
On Mon, Jan 29, 2018 at 04:00:59PM -0600, Segher Boessenkool wrote:
> (Maybe add a comment?)

> 

> This is okay for trunk.  Thanks!


This is the patch I applied:

2018-01-29  Michael Meissner  <meissner@linux.vnet.ibm.com>

	PR target/81550
	* config/rs6000/rs6000.c (rs6000_setup_reg_addr_masks): If DFmode
	and SFmode can go in Altivec registers (-mcpu=power7 for DFmode,
	-mcpu=power8 for SFmode) don't set the PRE_INCDEC or PRE_MODIFY
	flags.  This restores the settings used before the 2017-07-24.
	Turning off pre increment/decrement/modify allows IVOPTS to
	optimize DF/SF loops where the index is an int.

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 257165)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -2982,7 +2982,15 @@ rs6000_setup_reg_addr_masks (void)
 
 	      /* Figure out if we can do PRE_INC, PRE_DEC, or PRE_MODIFY
 		 addressing.  If we allow scalars into Altivec registers,
-		 don't allow PRE_INC, PRE_DEC, or PRE_MODIFY.  */
+		 don't allow PRE_INC, PRE_DEC, or PRE_MODIFY.
+
+		 For VSX systems, we don't allow update addressing for
+		 DFmode/SFmode if those registers can go in both the
+		 traditional floating point registers and Altivec registers.
+		 The load/store instructions for the Altivec registers do not
+		 have update forms.  If we allowed update addressing, it seems
+		 to break IV-OPT code using floating point if the index type is
+		 int instead of long (PR target/81550 and target/84042).  */
 
 	      if (TARGET_UPDATE
 		  && (rc == RELOAD_REG_GPR || rc == RELOAD_REG_FPR)
@@ -2990,6 +2998,8 @@ rs6000_setup_reg_addr_masks (void)
 		  && !VECTOR_MODE_P (m2)
 		  && !FLOAT128_VECTOR_P (m2)
 		  && !complex_p
+		  && (m != E_DFmode || !TARGET_VSX)
+		  && (m != E_SFmode || !TARGET_P8_VECTOR)
 		  && !small_int_vsx_p)
 		{
 		  addr_mask |= RELOAD_REG_PRE_INCDEC;

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.vnet.ibm.com, phone: +1 (978) 899-4797

Patch

Index: gcc/testsuite/gcc.target/powerpc/loop_align.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/loop_align.c	(revision 256992)
+++ gcc/testsuite/gcc.target/powerpc/loop_align.c	(working copy)
@@ -1,11 +1,16 @@ 
 /* { dg-do compile { target { powerpc*-*-* } } } */
 /* { dg-skip-if "" { powerpc*-*-darwin* powerpc-ibm-aix* } } */
 /* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power7" } } */
-/* { dg-options "-O2 -mcpu=power7 -falign-functions=16" } */
+/* { dg-options "-O2 -mcpu=power7 -falign-functions=16 -fno-reorder-blocks" } */
 /* { dg-final { scan-assembler ".p2align 5,,31" } } */
 
-void f(double *a, double *b, double *c, int n) {
-  int i;
+/* This test must be crafted so that the loop is less than 8 insns (and more
+   than 4) in order for the TARGET_ASM_LOOP_ALIGN_MAX_SKIP target hook to fire
+   and align the loop properly.  Originally the loop used double, and various
+   optimizations caused the loop to double in size, and fail the test.  */
+
+void f(long *a, long *b, long *c, long n) {
+  long i;
   for (i=0; i < n; i++)
     a[i] = b[i] + c[i];
 }