[PATCH/RFC] Simplify wrapped RTL op

Message ID 25250026-7a2b-64c1-4cdb-ec805a50f6be@linux.ibm.com
State New
Headers show
Series
  • [PATCH/RFC] Simplify wrapped RTL op
Related show

Commit Message

Robin Dapp Aug. 27, 2019, 9:12 a.m.
Hi,

as announced in the wrapped-binop gimple patch mail, on s390 we still
emit odd code in front of loops:

  void v1 (unsigned long *in, unsigned long *out, unsigned int n)
  {
    int i;
    for (i = 0; i < n; i++)
    {
      out[i] = in[i];
    }
   }

   -->

   aghi    %r1,-8
   srlg    %r1,%r1,3
   aghi    %r1,1

This is created by doloop after getting niter from the loop as n - 1 or
"n * 8 - 8" with a step width of 8.  Realizing s390's doloop pattern
compares against 1, we add 1 to niter resulting in the code above.

When going a similar route as with the gimple patch, something like

STORE_FLAG_VALUE
         is 1.  */

helps immediately, yet overflow/range information is not considered.  Do
we somehow guarantee that the niter-related we created until doloop do
not overflow?  I did not note something when looking through the code.
Granted, the simplification seems oddly specific and is probably not
useful for a wide range of targets and situations.


Another approach would be to store "niter+1" (== n) when niter (== n-1)
is calculated and, when we need to do the increment, use the niter+1
that we already have without needing to simplify (n - 8) >> 3 + 1.

Any comments on this?

The patch above bootstraps and test suite is without regressions on s390
fwiw.

Regards
 Robin

Comments

Segher Boessenkool Aug. 27, 2019, 10:09 a.m. | #1
On Tue, Aug 27, 2019 at 11:12:32AM +0200, Robin Dapp wrote:
> as announced in the wrapped-binop gimple patch mail, on s390 we still

> emit odd code in front of loops:


>    aghi    %r1,-8

>    srlg    %r1,%r1,3

>    aghi    %r1,1


This is done like this because %r1 might be 0.

We see this same problem on Power; there are quite a few PRs about it.

[ ... ]

> helps immediately, yet overflow/range information is not considered.


Yeah, and it has to be.

> Do

> we somehow guarantee that the niter-related we created until doloop do

> not overflow?  I did not note something when looking through the code.

> Granted, the simplification seems oddly specific and is probably not

> useful for a wide range of targets and situations.


You're at least the third target, and it's pretty annoying, and it tends
to cost more than two insns (because things can often be simplified
further after this).  It won't do super much for execution time, there
is a loop after this after all, a handful of insns executed once can't
be all that expensive relatively.

> Another approach would be to store "niter+1" (== n) when niter (== n-1)

> is calculated and, when we need to do the increment, use the niter+1

> that we already have without needing to simplify (n - 8) >> 3 + 1.

> 

> Any comments on this?

> 

> The patch above bootstraps and test suite is without regressions on s390

> fwiw.


When something similar was tried before there were regressions for
rs6000.  I'll find the PR later.

I was hoping that now that ivopts learns about doloops, this can be
handled better as well.  Ideally the doloop pass can move closer to
expand, and do much less analysis and work, all the heavy lifting has
been done already.


Segher
Segher Boessenkool Aug. 28, 2019, 7:05 a.m. | #2
On Tue, Aug 27, 2019 at 05:09:52AM -0500, Segher Boessenkool wrote:
> On Tue, Aug 27, 2019 at 11:12:32AM +0200, Robin Dapp wrote:

> > as announced in the wrapped-binop gimple patch mail, on s390 we still

> > emit odd code in front of loops:

> 

> >    aghi    %r1,-8

> >    srlg    %r1,%r1,3

> >    aghi    %r1,1

> 

> This is done like this because %r1 might be 0.

> 

> We see this same problem on Power; there are quite a few PRs about it.

> 

> [ ... ]

> 

> > helps immediately, yet overflow/range information is not considered.

> 

> Yeah, and it has to be.

> 

> > Do

> > we somehow guarantee that the niter-related we created until doloop do

> > not overflow?  I did not note something when looking through the code.

> > Granted, the simplification seems oddly specific and is probably not

> > useful for a wide range of targets and situations.

> 

> You're at least the third target, and it's pretty annoying, and it tends

> to cost more than two insns (because things can often be simplified

> further after this).  It won't do super much for execution time, there

> is a loop after this after all, a handful of insns executed once can't

> be all that expensive relatively.

> 

> > Another approach would be to store "niter+1" (== n) when niter (== n-1)

> > is calculated and, when we need to do the increment, use the niter+1

> > that we already have without needing to simplify (n - 8) >> 3 + 1.

> > 

> > Any comments on this?

> > 

> > The patch above bootstraps and test suite is without regressions on s390

> > fwiw.

> 

> When something similar was tried before there were regressions for

> rs6000.  I'll find the PR later.


PR37451.  Not clear what target that regressed on, btw.

> I was hoping that now that ivopts learns about doloops, this can be

> handled better as well.  Ideally the doloop pass can move closer to

> expand, and do much less analysis and work, all the heavy lifting has

> been done already.



Segher
Segher Boessenkool Aug. 28, 2019, 7:27 a.m. | #3
On Wed, Aug 28, 2019 at 02:05:58AM -0500, Segher Boessenkool wrote:
> On Tue, Aug 27, 2019 at 05:09:52AM -0500, Segher Boessenkool wrote:

> > On Tue, Aug 27, 2019 at 11:12:32AM +0200, Robin Dapp wrote:

> > > as announced in the wrapped-binop gimple patch mail, on s390 we still

> > > emit odd code in front of loops:

> > 

> > >    aghi    %r1,-8

> > >    srlg    %r1,%r1,3

> > >    aghi    %r1,1

> > 

> > This is done like this because %r1 might be 0.

> > 

> > We see this same problem on Power; there are quite a few PRs about it.

> > 

> > [ ... ]

> > 

> > > helps immediately, yet overflow/range information is not considered.

> > 

> > Yeah, and it has to be.

> > 

> > > Do

> > > we somehow guarantee that the niter-related we created until doloop do

> > > not overflow?  I did not note something when looking through the code.

> > > Granted, the simplification seems oddly specific and is probably not

> > > useful for a wide range of targets and situations.

> > 

> > You're at least the third target, and it's pretty annoying, and it tends

> > to cost more than two insns (because things can often be simplified

> > further after this).  It won't do super much for execution time, there

> > is a loop after this after all, a handful of insns executed once can't

> > be all that expensive relatively.

> > 

> > > Another approach would be to store "niter+1" (== n) when niter (== n-1)

> > > is calculated and, when we need to do the increment, use the niter+1

> > > that we already have without needing to simplify (n - 8) >> 3 + 1.

> > > 

> > > Any comments on this?

> > > 

> > > The patch above bootstraps and test suite is without regressions on s390

> > > fwiw.

> > 

> > When something similar was tried before there were regressions for

> > rs6000.  I'll find the PR later.

> 

> PR37451.  Not clear what target that regressed on, btw.


And PR55190 and PR67288 and probably more.

> > I was hoping that now that ivopts learns about doloops, this can be

> > handled better as well.  Ideally the doloop pass can move closer to

> > expand, and do much less analysis and work, all the heavy lifting has

> > been done already.



Segher
Robin Dapp Aug. 29, 2019, 9:08 a.m. | #4
>> PR37451.  Not clear what target that regressed on, btw.

> 

> And PR55190 and PR67288 and probably more.


Thanks for finding those.  So the hope is to get this fixed or rather
move towards a fix with the patch series that's currently reviewed which
injects some doloop knowledge into ivopts?

As said before, I was thinking of storing niter + 1 somewhere and use
this instead of doing the + 1 later when it cannot be simplified
anymore. But if we expect a larger rewrite anyway, it's probably not
worthwhile to pursue that now.

Regards
 Robin
Segher Boessenkool Aug. 29, 2019, 12:13 p.m. | #5
Hi Robin,

On Thu, Aug 29, 2019 at 11:08:11AM +0200, Robin Dapp wrote:
> >> PR37451.  Not clear what target that regressed on, btw.

> > 

> > And PR55190 and PR67288 and probably more.

> 

> Thanks for finding those.  So the hope is to get this fixed or rather

> move towards a fix with the patch series that's currently reviewed which

> injects some doloop knowledge into ivopts?


The long-term plan is to do more and more of the loop optimisation earlier,
at Gimple level, and to certainly do almost no analysis on the RTL, as done
currently.  But this is a longer term still somewhat vague plan.

Right now ivopts decides if a loop can use doloop, and costs stuff based
on that.  Next steps will be to communicate "please use / do not use doloop
here" down to RTL.  Perhaps the doloop expansion should move closer to the
expand pass as well.

And it's not just doloop -- the unrolling should be done earlier, too.

How much of this will ever work out, and how much of it will make GCC 10,
hrm I need a new crystal ball :-)

> As said before, I was thinking of storing niter + 1 somewhere and use

> this instead of doing the + 1 later when it cannot be simplified

> anymore. But if we expect a larger rewrite anyway, it's probably not

> worthwhile to pursue that now.


This sounds like a pretty simple short-term solution.  If it works, I'm
all for it :-)


Segher

Patch

diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 9359a3cdb4d..9c06c9b6ee9 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -2364,6 +2364,24 @@  simplify_binary_operation_1 (enum rtx_code code,
machine_mode mode,
                                                           in1, in2));
        }

+      /* Transform (plus (lshiftrt (plus A -C1) C2) C3) to (lshiftrt A C2)
+         if C1 == -C3 * (1 << C2).  */
+      if (CONST_SCALAR_INT_P (op1)
+         && GET_CODE (op0) == LSHIFTRT
+         && CONST_SCALAR_INT_P (XEXP (op0, 1))
+         && GET_CODE (XEXP (op0, 0)) == PLUS
+         && CONST_SCALAR_INT_P (XEXP (XEXP (op0, 0), 1)))
+       {
+         rtx c3 = op1;
+         rtx c2 = XEXP (op0, 1);
+         rtx c1 = XEXP (XEXP (op0, 0), 1);
+
+         rtx a = XEXP (XEXP (op0, 0), 0);
+
+         if (-INTVAL (c3) * (1 << INTVAL (c2)) == INTVAL (c1))
+           return simplify_gen_binary (LSHIFTRT, mode, a, c2);
+       }
+
       /* (plus (comparison A B) C) can become (neg (rev-comp A B)) if
         C is 1 and STORE_FLAG_VALUE is -1 or if C is -1 and