[committed,PR,tree-optimization/83410] Avoid some jump threads when parallelizing loops

Message ID c56e84d9-5e0a-5430-4d2f-0584c51e3012@redhat.com
State New
Headers show
Series
  • [committed,PR,tree-optimization/83410] Avoid some jump threads when parallelizing loops
Related show

Commit Message

Jeff Law Dec. 15, 2017, 4:19 p.m.
I hate this patch.

The fundamental problem we have is that there are times when we very
much want to thread jumps and yet there are other times when we do not.

To date we have been able to largely select between those two by looking
at the shape of the CFG and the jump thread to see how threading a
particular jump would impact the shape of the CFG (particularly loops in
the CFG).

In this BZ we have a loop like this:

        2
        |
        3<-------+
        |        |
        4<---+   |
       / \   |   |
      5   6  |   |
       \  /  |   |
         7   |   |
        / \  |   |
       8  11-+   |
      / \        |
     9  10-------+
     |
     E

We want to thread the path (6, 7) (7, 11).  ie, there's a path through
that inner loop where we don't need to do the loop test.  Here's an
example (from libgomp testsuite)

(N is 1500)

     for (j = 0; j < N; j++)
        {
          if (j > 500)
            {
              x[i][j] = i + j + 3;
              y[j] = i*j + 10;
            }
          else
            x[i][j] = x[i][j]*3;
        }



Threading (in effect) puts the loop test into both arms of the
conditional, then removes it from the ELSE arm because the condition is
always "keep looping" in that arm.

This plays poorly with loop parallelization -- I tried to follow what's
going on in there and just simply got lost.  I got as far as seeing that
the parallelization code thought there were loop carried dependencies.
At least that's how it looked to me.  But I don't see any loop carried
dependencies in the code.

You might naturally ask if threading this is actually wise.  It seems to
broadly fit into the cases where we throttle threading so as not to muck
around too much with the loop structure.  It's not terrible to detect a
CFG of this shape and avoid threading.

BUT....


You can have essentially the same shape CFG (not 100% the same, but the
same key characteristics), but where jump threading simplifies things in
ways that are good for the SLP vectorizer (vect/bb-slp-16.c) or where
jump threading avoids spurious warnings (graphite/scop-4.c)

Initially I thought I'd seen a key difference in the contents of the
latch block, but I think I was just up too late and mis-read the dump.

So I don't see anything in the CFG shape or the contents of the blocks
that can be reasonably analyzed at jump threading time.  Given we have
two opposite needs and no reasonable way I can find to select between
them, I'm resorting to a disgusting hack.  Namely to avoid threading
through the latch to another point in the loop *when loop
parallelization is enabled*.

Let me be clear.  This is a hack.  I don't like it, not one little bit.
But I don't see a way to resolve the regression without introducing
other regressions and the loop parallelization code is a total mystery
to me.

Bootstrapped on x86_64 and regression tested with and without graphite.
Confirmed it fixes the graphite related regressions mentioned in the BZ
on x86_64.

Committing to the trunk and hanging my head in shame.

Jeff
commit aa9f2e239944cc5baafdae431c821b900e7f37a9
Author: Jeff Law <law@torsion.usersys.redhat.com>
Date:   Fri Dec 15 11:09:50 2017 -0500

            PR tree-optimization/83410
            * tree-ssa-threadupdate.c (thread_block_1): Avoid certain jump
            threads when parallelizing loops.

Comments

Jakub Jelinek Dec. 15, 2017, 4:45 p.m. | #1
On Fri, Dec 15, 2017 at 09:19:14AM -0700, Jeff Law wrote:
> +	  /* Loop parallelization can be confused by the result of

> +	     threading through the loop exit test back into the loop.

> +	     However, theading those jumps seems to help other codes.

> +

> +	     I have been unable to find anything related to the shape of

> +	     the CFG, the contents of the affected blocks, etc which would

> +	     allow a more sensible test than what we're using below which

> +	     merely avoids the optimization when parallelizing loops.  */

> +	  if (flag_tree_parallelize_loops > 1)


Is there no jump threading (dom or vrp) after the parloops pass?
If there is, it would be nice to only do this if the parloops pass
has not been invoked yet.

	Jakub
Richard Biener Dec. 15, 2017, 5 p.m. | #2
On December 15, 2017 5:19:14 PM GMT+01:00, Jeff Law <law@redhat.com> wrote:
>I hate this patch.

>

>The fundamental problem we have is that there are times when we very

>much want to thread jumps and yet there are other times when we do not.

>

>To date we have been able to largely select between those two by

>looking

>at the shape of the CFG and the jump thread to see how threading a

>particular jump would impact the shape of the CFG (particularly loops

>in

>the CFG).

>

>In this BZ we have a loop like this:

>

>        2

>        |

>        3<-------+

>        |        |

>        4<---+   |

>       / \   |   |

>      5   6  |   |

>       \  /  |   |

>         7   |   |

>        / \  |   |

>       8  11-+   |

>      / \        |

>     9  10-------+

>     |

>     E

>

>We want to thread the path (6, 7) (7, 11).  ie, there's a path through

>that inner loop where we don't need to do the loop test.  Here's an

>example (from libgomp testsuite)

>

>(N is 1500)

>

>     for (j = 0; j < N; j++)

>        {

>          if (j > 500)

>            {

>              x[i][j] = i + j + 3;

>              y[j] = i*j + 10;

>            }

>          else

>            x[i][j] = x[i][j]*3;

>        }

>

>

>

>Threading (in effect) puts the loop test into both arms of the

>conditional, then removes it from the ELSE arm because the condition is

>always "keep looping" in that arm.

>

>This plays poorly with loop parallelization -- I tried to follow what's

>going on in there and just simply got lost.  I got as far as seeing

>that

>the parallelization code thought there were loop carried dependencies.

>At least that's how it looked to me.  But I don't see any loop carried

>dependencies in the code.


Hmm. I'll double check if I remember on Monday. 

>

>You might naturally ask if threading this is actually wise.  It seems

>to

>broadly fit into the cases where we throttle threading so as not to

>muck

>around too much with the loop structure.  It's not terrible to detect a

>CFG of this shape and avoid threading.

>

>BUT....

>

>

>You can have essentially the same shape CFG (not 100% the same, but the

>same key characteristics), but where jump threading simplifies things

>in

>ways that are good for the SLP vectorizer (vect/bb-slp-16.c) or where

>jump threading avoids spurious warnings (graphite/scop-4.c)

>

>Initially I thought I'd seen a key difference in the contents of the

>latch block, but I think I was just up too late and mis-read the dump.

>

>So I don't see anything in the CFG shape or the contents of the blocks

>that can be reasonably analyzed at jump threading time.  Given we have

>two opposite needs and no reasonable way I can find to select between

>them, I'm resorting to a disgusting hack.  Namely to avoid threading

>through the latch to another point in the loop *when loop

>parallelization is enabled*.

>

>Let me be clear.  This is a hack.  I don't like it, not one little bit.

>But I don't see a way to resolve the regression without introducing

>other regressions and the loop parallelization code is a total mystery

>to me.

>

>Bootstrapped on x86_64 and regression tested with and without graphite.

>Confirmed it fixes the graphite related regressions mentioned in the BZ

>on x86_64.

>

>Committing to the trunk and hanging my head in shame.


I'd not have worried much about auto parallekization or graphite here. It does look like a missed handling there. 

Richard. 


>Jeff
Jeff Law Dec. 15, 2017, 6:10 p.m. | #3
On 12/15/2017 09:45 AM, Jakub Jelinek wrote:
> On Fri, Dec 15, 2017 at 09:19:14AM -0700, Jeff Law wrote:

>> +	  /* Loop parallelization can be confused by the result of

>> +	     threading through the loop exit test back into the loop.

>> +	     However, theading those jumps seems to help other codes.

>> +

>> +	     I have been unable to find anything related to the shape of

>> +	     the CFG, the contents of the affected blocks, etc which would

>> +	     allow a more sensible test than what we're using below which

>> +	     merely avoids the optimization when parallelizing loops.  */

>> +	  if (flag_tree_parallelize_loops > 1)

> 

> Is there no jump threading (dom or vrp) after the parloops pass?

> If there is, it would be nice to only do this if the parloops pass

> has not been invoked yet.

That's precisely how it works -- there's an pre-existing guard.

So prior to the loop optimizers, we're conservative about potentially
mucking up the loop structure.  After the loop optimizers we allow more
aggressive threading.

JEff
Richard Biener Dec. 18, 2017, 10:15 a.m. | #4
On Fri, Dec 15, 2017 at 6:00 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On December 15, 2017 5:19:14 PM GMT+01:00, Jeff Law <law@redhat.com> wrote:

>>I hate this patch.

>>

>>The fundamental problem we have is that there are times when we very

>>much want to thread jumps and yet there are other times when we do not.

>>

>>To date we have been able to largely select between those two by

>>looking

>>at the shape of the CFG and the jump thread to see how threading a

>>particular jump would impact the shape of the CFG (particularly loops

>>in

>>the CFG).

>>

>>In this BZ we have a loop like this:

>>

>>        2

>>        |

>>        3<-------+

>>        |        |

>>        4<---+   |

>>       / \   |   |

>>      5   6  |   |

>>       \  /  |   |

>>         7   |   |

>>        / \  |   |

>>       8  11-+   |

>>      / \        |

>>     9  10-------+

>>     |

>>     E

>>

>>We want to thread the path (6, 7) (7, 11).  ie, there's a path through

>>that inner loop where we don't need to do the loop test.  Here's an

>>example (from libgomp testsuite)

>>

>>(N is 1500)

>>

>>     for (j = 0; j < N; j++)

>>        {

>>          if (j > 500)

>>            {

>>              x[i][j] = i + j + 3;

>>              y[j] = i*j + 10;

>>            }

>>          else

>>            x[i][j] = x[i][j]*3;

>>        }

>>

>>

>>

>>Threading (in effect) puts the loop test into both arms of the

>>conditional, then removes it from the ELSE arm because the condition is

>>always "keep looping" in that arm.

>>

>>This plays poorly with loop parallelization -- I tried to follow what's

>>going on in there and just simply got lost.  I got as far as seeing

>>that

>>the parallelization code thought there were loop carried dependencies.

>>At least that's how it looked to me.  But I don't see any loop carried

>>dependencies in the code.

>

> Hmm. I'll double check if I remember on Monday.


So I did and the ultimate reason GRAPHITE FAILs is because for
the loop with the threading applied we can no longer compute
number of iterations.  That's of course bad for _all_ loop optimizers,
not just GRAPHITE (we just don't have any other test coverage).
The main issue here is that after the threading is applied the
block with the loop exit no longer dominates the loop latch.
This means the loop should really be split but it also means we should
definitely avoid performing such threading before loop optimizers run,
and not just if parallelizing loops.

Can you adjust your patch this way?  The condition you check seems
to more or less match the case where we thread through the exit
test _into_ the loop?  The bad thing is really if in the resulting CFG
the exit test isn't dominating the latch (thus we have an exit test
that is not always executed):

bool
number_of_iterations_exit_assumptions (struct loop *loop, edge exit,
                                       struct tree_niter_desc *niter,
                                       gcond **at_stmt, bool every_iteration)
{
...
  safe = dominated_by_p (CDI_DOMINATORS, loop->latch, exit->src);

  if (every_iteration && !safe)
    return false;

and later passing it to

  if (!number_of_iterations_cond (loop, type, &iv0, code, &iv1, niter,
                                  loop_only_exit_p (loop, exit), safe))

which has

  if (!every_iteration
      && (!iv0->no_overflow || !iv1->no_overflow
          || code == NE_EXPR || code == EQ_EXPR))
    return false;

and in our case code is NE_EXPR (_16 != 10000).

I'm not sure if we eventually can lift this restriction if 'exit' is the only
exit from 'loop'.

The particular case can of course be handled given the exit test
and the test making it not dominating the latch are related
(operating on a related IV).

Richard.

>>

>>You might naturally ask if threading this is actually wise.  It seems

>>to

>>broadly fit into the cases where we throttle threading so as not to

>>muck

>>around too much with the loop structure.  It's not terrible to detect a

>>CFG of this shape and avoid threading.

>>

>>BUT....

>>

>>

>>You can have essentially the same shape CFG (not 100% the same, but the

>>same key characteristics), but where jump threading simplifies things

>>in

>>ways that are good for the SLP vectorizer (vect/bb-slp-16.c) or where

>>jump threading avoids spurious warnings (graphite/scop-4.c)

>>

>>Initially I thought I'd seen a key difference in the contents of the

>>latch block, but I think I was just up too late and mis-read the dump.

>>

>>So I don't see anything in the CFG shape or the contents of the blocks

>>that can be reasonably analyzed at jump threading time.  Given we have

>>two opposite needs and no reasonable way I can find to select between

>>them, I'm resorting to a disgusting hack.  Namely to avoid threading

>>through the latch to another point in the loop *when loop

>>parallelization is enabled*.

>>

>>Let me be clear.  This is a hack.  I don't like it, not one little bit.

>>But I don't see a way to resolve the regression without introducing

>>other regressions and the loop parallelization code is a total mystery

>>to me.

>>

>>Bootstrapped on x86_64 and regression tested with and without graphite.

>>Confirmed it fixes the graphite related regressions mentioned in the BZ

>>on x86_64.

>>

>>Committing to the trunk and hanging my head in shame.

>

> I'd not have worried much about auto parallekization or graphite here. It does look like a missed handling there.

>

> Richard.

>

>

>>Jeff

>
Jeff Law Dec. 18, 2017, 4:06 p.m. | #5
On 12/18/2017 03:15 AM, Richard Biener wrote:
> On Fri, Dec 15, 2017 at 6:00 PM, Richard Biener

> <richard.guenther@gmail.com> wrote:

>> On December 15, 2017 5:19:14 PM GMT+01:00, Jeff Law <law@redhat.com> wrote:

>>> I hate this patch.

>>>

>>> The fundamental problem we have is that there are times when we very

>>> much want to thread jumps and yet there are other times when we do not.

>>>

>>> To date we have been able to largely select between those two by

>>> looking

>>> at the shape of the CFG and the jump thread to see how threading a

>>> particular jump would impact the shape of the CFG (particularly loops

>>> in

>>> the CFG).

>>>

>>> In this BZ we have a loop like this:

>>>

>>>        2

>>>        |

>>>        3<-------+

>>>        |        |

>>>        4<---+   |

>>>       / \   |   |

>>>      5   6  |   |

>>>       \  /  |   |

>>>         7   |   |

>>>        / \  |   |

>>>       8  11-+   |

>>>      / \        |

>>>     9  10-------+

>>>     |

>>>     E

>>>

>>> We want to thread the path (6, 7) (7, 11).  ie, there's a path through

>>> that inner loop where we don't need to do the loop test.  Here's an

>>> example (from libgomp testsuite)

>>>

>>> (N is 1500)

>>>

>>>     for (j = 0; j < N; j++)

>>>        {

>>>          if (j > 500)

>>>            {

>>>              x[i][j] = i + j + 3;

>>>              y[j] = i*j + 10;

>>>            }

>>>          else

>>>            x[i][j] = x[i][j]*3;

>>>        }

>>>

>>>

>>>

>>> Threading (in effect) puts the loop test into both arms of the

>>> conditional, then removes it from the ELSE arm because the condition is

>>> always "keep looping" in that arm.

>>>

>>> This plays poorly with loop parallelization -- I tried to follow what's

>>> going on in there and just simply got lost.  I got as far as seeing

>>> that

>>> the parallelization code thought there were loop carried dependencies.

>>> At least that's how it looked to me.  But I don't see any loop carried

>>> dependencies in the code.

>>

>> Hmm. I'll double check if I remember on Monday.

> 

> So I did and the ultimate reason GRAPHITE FAILs is because for

> the loop with the threading applied we can no longer compute

> number of iterations.  That's of course bad for _all_ loop optimizers,

> not just GRAPHITE (we just don't have any other test coverage).

> The main issue here is that after the threading is applied the

> block with the loop exit no longer dominates the loop latch.

> This means the loop should really be split but it also means we should

> definitely avoid performing such threading before loop optimizers run,

> and not just if parallelizing loops.

> 

> Can you adjust your patch this way?  The condition you check seems

> to more or less match the case where we thread through the exit

> test _into_ the loop?  The bad thing is really if in the resulting CFG

> the exit test isn't dominating the latch (thus we have an exit test

> that is not always executed):

The problem is we'll end up regressing other vectorization cases.  I
looked at a variety of things WRT the shape of the CFG and ultimately
found that there wasn't anything I could use to distinguish the two
cases.   I'll look again though.

Jeff
Richard Biener Dec. 18, 2017, 7:54 p.m. | #6
On December 18, 2017 5:06:25 PM GMT+01:00, Jeff Law <law@redhat.com> wrote:
>On 12/18/2017 03:15 AM, Richard Biener wrote:

>> On Fri, Dec 15, 2017 at 6:00 PM, Richard Biener

>> <richard.guenther@gmail.com> wrote:

>>> On December 15, 2017 5:19:14 PM GMT+01:00, Jeff Law <law@redhat.com>

>wrote:

>>>> I hate this patch.

>>>>

>>>> The fundamental problem we have is that there are times when we

>very

>>>> much want to thread jumps and yet there are other times when we do

>not.

>>>>

>>>> To date we have been able to largely select between those two by

>>>> looking

>>>> at the shape of the CFG and the jump thread to see how threading a

>>>> particular jump would impact the shape of the CFG (particularly

>loops

>>>> in

>>>> the CFG).

>>>>

>>>> In this BZ we have a loop like this:

>>>>

>>>>        2

>>>>        |

>>>>        3<-------+

>>>>        |        |

>>>>        4<---+   |

>>>>       / \   |   |

>>>>      5   6  |   |

>>>>       \  /  |   |

>>>>         7   |   |

>>>>        / \  |   |

>>>>       8  11-+   |

>>>>      / \        |

>>>>     9  10-------+

>>>>     |

>>>>     E

>>>>

>>>> We want to thread the path (6, 7) (7, 11).  ie, there's a path

>through

>>>> that inner loop where we don't need to do the loop test.  Here's an

>>>> example (from libgomp testsuite)

>>>>

>>>> (N is 1500)

>>>>

>>>>     for (j = 0; j < N; j++)

>>>>        {

>>>>          if (j > 500)

>>>>            {

>>>>              x[i][j] = i + j + 3;

>>>>              y[j] = i*j + 10;

>>>>            }

>>>>          else

>>>>            x[i][j] = x[i][j]*3;

>>>>        }

>>>>

>>>>

>>>>

>>>> Threading (in effect) puts the loop test into both arms of the

>>>> conditional, then removes it from the ELSE arm because the

>condition is

>>>> always "keep looping" in that arm.

>>>>

>>>> This plays poorly with loop parallelization -- I tried to follow

>what's

>>>> going on in there and just simply got lost.  I got as far as seeing

>>>> that

>>>> the parallelization code thought there were loop carried

>dependencies.

>>>> At least that's how it looked to me.  But I don't see any loop

>carried

>>>> dependencies in the code.

>>>

>>> Hmm. I'll double check if I remember on Monday.

>> 

>> So I did and the ultimate reason GRAPHITE FAILs is because for

>> the loop with the threading applied we can no longer compute

>> number of iterations.  That's of course bad for _all_ loop

>optimizers,

>> not just GRAPHITE (we just don't have any other test coverage).

>> The main issue here is that after the threading is applied the

>> block with the loop exit no longer dominates the loop latch.

>> This means the loop should really be split but it also means we

>should

>> definitely avoid performing such threading before loop optimizers

>run,

>> and not just if parallelizing loops.

>> 

>> Can you adjust your patch this way?  The condition you check seems

>> to more or less match the case where we thread through the exit

>> test _into_ the loop?  The bad thing is really if in the resulting

>CFG

>> the exit test isn't dominating the latch (thus we have an exit test

>> that is not always executed):

>The problem is we'll end up regressing other vectorization cases.  I

>looked at a variety of things WRT the shape of the CFG and ultimately

>found that there wasn't anything I could use to distinguish the two

>cases.   I'll look again though.


I think a good sanity check would be to assert (or dump) whenever we change the return value of number_of_iterations_exit from before to after threading in case the block with the exit test is involved in the threading path. 

I guess we can't undo after the fact but it might help getting more testcases and thus an idea what to look for exactly. 

Maybe sth to train the very first AI inside GCC ;) 

Richard. 

>Jeff

Patch

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8830638d226..2d53e24b4c1 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,9 @@ 
+2017-12-12  Jeff Law  <law@redhat.com>
+
+	PR tree-optimization/83410
+	* tree-ssa-threadupdate.c (thread_block_1): Avoid certain jump
+	threads when parallelizing loops.
+
 2017-12-15  Jakub Jelinek  <jakub@redhat.com>
 
 	* tree-core.h (struct attribute_spec): Swap affects_type_identity and
diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index 045905eceb7..63ad8f9c953 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -1333,6 +1333,31 @@  thread_block_1 (basic_block bb, bool noloop_only, bool joiners)
 
 	  if (i != path->length ())
 	    continue;
+
+	  /* Loop parallelization can be confused by the result of
+	     threading through the loop exit test back into the loop.
+	     However, theading those jumps seems to help other codes.
+
+	     I have been unable to find anything related to the shape of
+	     the CFG, the contents of the affected blocks, etc which would
+	     allow a more sensible test than what we're using below which
+	     merely avoids the optimization when parallelizing loops.  */
+	  if (flag_tree_parallelize_loops > 1)
+	    {
+	      for (i = 1; i < path->length (); i++)
+	        if (bb->loop_father == e2->src->loop_father
+		    && loop_exits_from_bb_p (bb->loop_father,
+					     (*path)[i]->e->src)
+		    && !loop_exit_edge_p (bb->loop_father, e2))
+		  break;
+
+	      if (i != path->length ())
+		{
+		  delete_jump_thread_path (path);
+		  e->aux = NULL;
+		  continue;
+		}
+	    }
 	}
 
       /* Insert the outgoing edge into the hash table if it is not