[RFC,tree-vect] PR 88915: Further vectorize second loop when versioning

Message ID 28b81a90-250e-feb0-8c39-c3c1b62449e7@arm.com
State New
Headers show
Series
  • [RFC,tree-vect] PR 88915: Further vectorize second loop when versioning
Related show

Commit Message

Andre Vieira (lists) July 11, 2019, 4:13 p.m.
Hi Richard(s),

I am trying to tackle PR88915 and get GCC to further vectorize the 
"fallback" loop when doing loop versioning as it does when loop 
versioning is not required.

I have a prototype patch that needs further testing, but at first glance 
it seems to be achieving the desired outcome.
I was wondering whether you had any specific concerns with my current 
approach.

On top of this change I am looking at the iterations and alias checks 
generated for every "vectorized-version". I.e. with the above patch I see:
if (iterations_check_VF_0 () && alias_check_VF_0 ())
   vectorized_for_VF_0 ();
else if (iterations_check_VF_1 () && alias_check_VF_1 ())
   vectorized_for_VF_1 ();
...
else
   scalar_loop ();

The alias checks are not always short and may cause unnecessary 
performance hits. Instead I am now trying to change the checks to 
produce the following form:

if (iterations_check_VF_0 ())
{
   if (alias_check_VF_0 ())
    {
      vectorized_for_VF_0 ();
    }
   else
     goto VF_1_check;  // or scalar_loop
}
else if (iterations_check_VF_1 ())
   {
VF_1_check:

     if (alias_check_VF_1 ())
       vectorized_for_VF_1 ();
     else
       goto goto_VF_2_check; // or scalar_loop
   }
...
else
   scalar_loop ();


I am not yet sure whether to try the next VF after an alias check fail 
or to just fall back to scalar instead.

Cheers,
Andre

Comments

Richard Biener July 12, 2019, 10:19 a.m. | #1
On Thu, 11 Jul 2019, Andre Vieira (lists) wrote:

> Hi Richard(s),

> 

> I am trying to tackle PR88915 and get GCC to further vectorize the "fallback"

> loop when doing loop versioning as it does when loop versioning is not

> required.

> 

> I have a prototype patch that needs further testing, but at first glance it

> seems to be achieving the desired outcome.

> I was wondering whether you had any specific concerns with my current

> approach.

> 

> On top of this change I am looking at the iterations and alias checks

> generated for every "vectorized-version". I.e. with the above patch I see:

> if (iterations_check_VF_0 () && alias_check_VF_0 ())

>   vectorized_for_VF_0 ();

> else if (iterations_check_VF_1 () && alias_check_VF_1 ())

>   vectorized_for_VF_1 ();

> ...

> else

>   scalar_loop ();

> 

> The alias checks are not always short and may cause unnecessary performance

> hits. Instead I am now trying to change the checks to produce the following

> form:

> 

> if (iterations_check_VF_0 ())

> {

>   if (alias_check_VF_0 ())

>    {

>      vectorized_for_VF_0 ();

>    }

>   else

>     goto VF_1_check;  // or scalar_loop

> }

> else if (iterations_check_VF_1 ())

>   {

> VF_1_check:

> 

>     if (alias_check_VF_1 ())

>       vectorized_for_VF_1 ();

>     else

>       goto goto_VF_2_check; // or scalar_loop

>   }

> ...

> else

>   scalar_loop ();


I think for code-size reason it would make sense to do it like

  if (iterations_check_for_lowest_VF ())
    {
      if (alias_check_for_highest_VF ())
        {
          vectorized_for_highest_VF ();
          vectorized epilogues;
        }
    }

and make the vectorized_for_highest_VF loop skipped, falling through
to the vectorized epilogues, when the number of iterations isn't
enough to hit it.

The advantage is that this would just use the epilogue vectorization
code and it would avoid excessive code growth if you have many
VFs to consider (on x86 we now have 8 byte, 16 byte, 32 byte and
64 byte vectors...).  The disadvantage is of course that a small
number of loops will not enter the vector code at all - namely
those that would pass the alias check for lowest_VF but not the
one for highest_VF.  I'm sure this isn't a common situation and
in quite a number of cases we formulate the alias check in a way
that it isn't dependent on the VF anyways.  There's also possibly
an extra branch for the case the highest_VF loop isn't entered
(unless there already was a prologue loop).

> I am not yet sure whether to try the next VF after an alias check fail or to

> just fall back to scalar instead.


If you don't then there's no advantage to doing what I suggested?

Richard.

> 

> Cheers,

> Andre

> 


-- 
Richard Biener <rguenther@suse.de>
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)
Andre Vieira (lists) July 15, 2019, 10:36 a.m. | #2
On 12/07/2019 11:19, Richard Biener wrote:
> On Thu, 11 Jul 2019, Andre Vieira (lists) wrote:

> 

> 

> I think for code-size reason it would make sense to do it like

> 

>    if (iterations_check_for_lowest_VF ())

>      {

>        if (alias_check_for_highest_VF ())

>          {

>            vectorized_for_highest_VF ();

>            vectorized epilogues;

>          }

>      }

> 

> and make the vectorized_for_highest_VF loop skipped, falling through

> to the vectorized epilogues, when the number of iterations isn't

> enough to hit it.


Are you suggesting we only make the distinction between highest and 
lowest VF? Why not something like:

if (alias_check_for_highest_VF ())
{
   if (iterations_check_VF_0 ())
     goto VF_0;
   else if (iterations_check_VF_1 ())
     goto VF_1;
   else if (iterations_check_VF_2 ())
     goto VF_2;
   ...
VF_0:
  vectorized_for_vf_0();
VF_1:
  vectorized_for_vf_1();
VF_2:
  vectorized_for_vf_2();
...
}
else
{
   goto scalar_loop;
}

I'll go have a look at how to best do this. The benefit of the earlier 
approach is it was able to use a lot of the existing vectorizer code to 
get it done.

I have code that can split the condition and alias checks in 
'vect_loop_versioning'. For this approach I am considering keeping that 
bit of code and seeing if I can patch up the checks after vectorizing 
the epilogue further. I think initially I will just go with a "hacked 
up" way of passing down the bb with the iteration check and split the 
false edge every time we vectorize it further. Will keep you posted on 
progress. If you have any pointers/tips they are most welcome!

> 

> The advantage is that this would just use the epilogue vectorization

> code and it would avoid excessive code growth if you have many

> VFs to consider (on x86 we now have 8 byte, 16 byte, 32 byte and

> 64 byte vectors...).  The disadvantage is of course that a small

> number of loops will not enter the vector code at all - namely

> those that would pass the alias check for lowest_VF but not the

> one for highest_VF.  I'm sure this isn't a common situation and

> in quite a number of cases we formulate the alias check in a way

> that it isn't dependent on the VF anyways.


The code growth is indeed a factor and I can see the argument for 
choosing this approach over the other. Cases of such specific overlaps 
are most likely oddities rather than the common situation.



> There's also possibly

> an extra branch for the case the highest_VF loop isn't entered

> (unless there already was a prologue loop).

I don't understand this one, can you elaborate?

Cheers,
Andre
Richard Biener July 15, 2019, 10:54 a.m. | #3
On Mon, 15 Jul 2019, Andre Vieira (lists) wrote:

> 

> 

> On 12/07/2019 11:19, Richard Biener wrote:

> > On Thu, 11 Jul 2019, Andre Vieira (lists) wrote:

> > 

> > 

> > I think for code-size reason it would make sense to do it like

> > 

> >    if (iterations_check_for_lowest_VF ())

> >      {

> >        if (alias_check_for_highest_VF ())

> >          {

> >            vectorized_for_highest_VF ();

> >            vectorized epilogues;

> >          }

> >      }

> > 

> > and make the vectorized_for_highest_VF loop skipped, falling through

> > to the vectorized epilogues, when the number of iterations isn't

> > enough to hit it.

> 

> Are you suggesting we only make the distinction between highest and lowest VF?

> Why not something like:

> 

> if (alias_check_for_highest_VF ())

> {

>   if (iterations_check_VF_0 ())

>     goto VF_0;

>   else if (iterations_check_VF_1 ())

>     goto VF_1;

>   else if (iterations_check_VF_2 ())

>     goto VF_2;

>   ...

> VF_0:

>  vectorized_for_vf_0();

> VF_1:

>  vectorized_for_vf_1();

> VF_2:

>  vectorized_for_vf_2();

> ...

> }

> else

> {

>   goto scalar_loop;

> }


I think it will actually do it this way via the epilogue vectorization
path.
 
> I'll go have a look at how to best do this. The benefit of the earlier

> approach is it was able to use a lot of the existing vectorizer code to get it

> done.

>

> I have code that can split the condition and alias checks in

> 'vect_loop_versioning'. For this approach I am considering keeping that bit of

> code and seeing if I can patch up the checks after vectorizing the epilogue

> further. I think initially I will just go with a "hacked up" way of passing

> down the bb with the iteration check and split the false edge every time we

> vectorize it further. Will keep you posted on progress. If you have any

> pointers/tips they are most welcome!


I thought to somehow force the idea that we have a prologue loop
to the vectorizer so it creates the number-of-vectorized iterations
check and branch around the main (highest VF) vectorized loop.

> > 

> > The advantage is that this would just use the epilogue vectorization

> > code and it would avoid excessive code growth if you have many

> > VFs to consider (on x86 we now have 8 byte, 16 byte, 32 byte and

> > 64 byte vectors...).  The disadvantage is of course that a small

> > number of loops will not enter the vector code at all - namely

> > those that would pass the alias check for lowest_VF but not the

> > one for highest_VF.  I'm sure this isn't a common situation and

> > in quite a number of cases we formulate the alias check in a way

> > that it isn't dependent on the VF anyways.

> 

> The code growth is indeed a factor and I can see the argument for choosing

> this approach over the other. Cases of such specific overlaps are most likely

> oddities rather than the common situation.


Yeah, it also looks simplest to me (and a motivation to enable
epilogue vectorization by default).

> > There's also possibly

> > an extra branch for the case the highest_VF loop isn't entered

> > (unless there already was a prologue loop).

> I don't understand this one, can you elaborate?


The branch around the main vectorized loop I talked about above.
So I'd fool the versioning condition to use the lowest VF for
the iteration count checking and use the code that handles
zero-trip iteration count for the vector loop unconditionally.

In some way this makes checking the niter condition on the version
check pointless (at least if we have a really low lowest VF like
on x64 where it will likely be 2), so we may want to elide that
completely?  For the check to be "correct" we'd also need to
compute the lowest VF a vectorized epilogue is still profitable
(on x86 those will run once or never, but we can also end up
with say main AVX512 vectorization, and a single vectorized
epilogue with SSE2 if we somehow figure AVX256 vectorization
isn't profitable for it - we can also end up with non-vectorizable
epilogue).  So with the current setup how we vectorize epilogues
we maybe want to have a location of the version niter check we
can "patch up" later after (not) vectorizing the epilogue(s).

Richard.
Andre Vieira (lists) July 19, 2019, 11:13 a.m. | #4
On 15/07/2019 11:54, Richard Biener wrote:
> On Mon, 15 Jul 2019, Andre Vieira (lists) wrote:

> 

>>

>>

>> On 12/07/2019 11:19, Richard Biener wrote:

>>> On Thu, 11 Jul 2019, Andre Vieira (lists) wrote:

>>

>>

>> I have code that can split the condition and alias checks in

>> 'vect_loop_versioning'. For this approach I am considering keeping that bit of

>> code and seeing if I can patch up the checks after vectorizing the epilogue

>> further. I think initially I will just go with a "hacked up" way of passing

>> down the bb with the iteration check and split the false edge every time we

>> vectorize it further. Will keep you posted on progress. If you have any

>> pointers/tips they are most welc ome!

> 

> I thought to somehow force the idea that we have a prologue loop

> to the vectorizer so it creates the number-of-vectorized iterations

> check and branch around the main (highest VF) vectorized loop.

> 


Hmm I think I may have skimmed over this earlier. I am reading it now 
and am not entirely sure what you mean by this. Specifically the "number 
of vectorized iterations" or how a prologue loop plays a role in this.


>>>

>>> The advantage is that this would just use the epilogue vectorization

>>> code and it would avoid excessive code growth if you have many

>>> VFs to consider (on x86 we now have 8 byte, 16 byte, 32 byte and

>>> 64 byte vectors...).  The disadvantage is of course that a small

>>> number of loops will not enter the vector code at all - namely

>>> those that would pass the alias check for lowest_VF but not the

>>> one for highest_VF.  I'm sure this isn't a common situation and

>>> in quite a number of cases we formulate the alias check in a way

>>> that it isn't dependent on the VF anyways.

>>

>> The code growth is indeed a factor and I can see the argument for choosing

>> this approach over the other. Cases of such specific overlaps are most likely

>> oddities rather than the common situation.

> 

> Yeah, it also looks simplest to me (and a motivation to enable

> epilogue vectorization by default).

> 

>>> There's also possibly

>>> an extra branch for the case the highest_VF loop isn't entered

>>> (unless there already was a prologue loop).

>> I don't understand this one, can you elaborate?

> 

> The branch around the main vectorized loop I talked about above.

> So I'd fool the versioning condition to use the lowest VF for

> the iteration count checking and use the code that handles

> zero-trip iteration count for the vector loop unconditionally.


I guess this sheds some light on the comment above. And it definitely 
implies we would need to know the lowest VF when creating this 
condition. Which is tricky.
> 

> In some way this makes checking the niter condition on the version

> check pointless (at least if we have a really low lowest VF like

> on x64 where it will likely be 2), so we may want to elide that

> completely?  For the check to be "correct" we'd also need to

> compute the lowest VF a vectorized epilogue is still profitable

> (on x86 those will run once or never, but we can also end up

> with say main AVX512 vectorization, and a single vectorized

> epilogue with SSE2 if we somehow figure AVX256 vectorization

> isn't profitable for it - we can also end up with non-vectorizable

> epilogue).  So with the current setup how we vectorize epilogues

> we maybe want to have a location of the version niter check we

> can "patch up" later after (not) vectorizing the epilogue(s).


I think you come to the same conclusion here as I mentioned above. 
Somehow I wish I had understood this better when I first read it ... but 
eh such is life :)

I went on and continued hacking around the approach of splitting the 
niter and alias check I had earlier. I got it to work with a single 
loop. However, when dealing with nested loops I run into the problem 
that I'd need to sink the niter checks. Otherwise you could end up with 
an alias check and niter checks outside the outer loop. Where the 2nd 
and consequent VF niter checks point to the corresponding epilogues in 
the inner loop.  However, once those are done and it iterates over the 
outer-loop, it will go through the higher VF's first, leading to wrong 
behavior.

To illustrate what I mean, here is a very simplistic illustration of 
what is happening:

BB1: Alias check
BB2: niter check VF 32
BB3: niter check VF 16
BB4: Vectorized loop VF32
BB5: Vectorized loop VF16
BB6: Remaining epilogue scalar loop
BB7: Outer loop iteration (updates IV's and DRs of inner loop)
BB8: Scalar inner&outer loop

With edges:
BB1 -T-> BB2
BB1 -F-> BB8
BB2 -T-> BB4
BB2 -F-> BB3
BB3 -T-> BB5
BB3 -F-> BB8
BB4 -> BB5
BB5 -> BB6
BB6 -> BB7
BB7 -> BB4

Where -T-> is a True edge and -F-> is a False edge

So my first thought to solve this is to sink BB2 and BB3 into the loop 
for which BB7 is the latch.

I.e. make BB7 -> BB2

But then I would argue, it would be good to introduce a BB9:
BB1 -T-> BB9
BB9 -T-> BB2
BB9 -F-> BB8

Where BB9 checks that niter is at least the lowest VF.

Sorry if I am repeating what you were telling me to do all along :')

Cheers,
Andre

PS: I often find myself having to patch the DOMINATOR information, 
sometimes its easy to, but sometimes it can get pretty complicated. I 
wonder whether it would make sense to write something that traverses a 
loop and corrects this, if it doesn't exist already.


> 

> Richard.

>
Richard Biener July 19, 2019, 11:35 a.m. | #5
On Fri, 19 Jul 2019, Andre Vieira (lists) wrote:

> 

> 

> On 15/07/2019 11:54, Richard Biener wrote:

> > On Mon, 15 Jul 2019, Andre Vieira (lists) wrote:

> > 

> > > 

> > > 

> > > On 12/07/2019 11:19, Richard Biener wrote:

> > > > On Thu, 11 Jul 2019, Andre Vieira (lists) wrote:

> > > 

> > > 

> > > I have code that can split the condition and alias checks in

> > > 'vect_loop_versioning'. For this approach I am considering keeping that

> > > bit of

> > > code and seeing if I can patch up the checks after vectorizing the

> > > epilogue

> > > further. I think initially I will just go with a "hacked up" way of

> > > passing

> > > down the bb with the iteration check and split the false edge every time

> > > we

> > > vectorize it further. Will keep you posted on progress. If you have any

> > > pointers/tips they are most welc ome!

> > 

> > I thought to somehow force the idea that we have a prologue loop

> > to the vectorizer so it creates the number-of-vectorized iterations

> > check and branch around the main (highest VF) vectorized loop.

> > 

> 

> Hmm I think I may have skimmed over this earlier. I am reading it now and am

> not entirely sure what you mean by this. Specifically the "number of

> vectorized iterations" or how a prologue loop plays a role in this.


When there is no prologue then the versioning condition currently
ensures we always enter the vector loop.  Only if there is a prologue
is there a check and branch eventually skipping right to the epilogue.
If the versioning condition (now using a lower VF) doesn't guarantee
this we have to add this jump-around.

> 

> > > > 

> > > > The advantage is that this would just use the epilogue vectorization

> > > > code and it would avoid excessive code growth if you have many

> > > > VFs to consider (on x86 we now have 8 byte, 16 byte, 32 byte and

> > > > 64 byte vectors...).  The disadvantage is of course that a small

> > > > number of loops will not enter the vector code at all - namely

> > > > those that would pass the alias check for lowest_VF but not the

> > > > one for highest_VF.  I'm sure this isn't a common situation and

> > > > in quite a number of cases we formulate the alias check in a way

> > > > that it isn't dependent on the VF anyways.

> > > 

> > > The code growth is indeed a factor and I can see the argument for choosing

> > > this approach over the other. Cases of such specific overlaps are most

> > > likely

> > > oddities rather than the common situation.

> > 

> > Yeah, it also looks simplest to me (and a motivation to enable

> > epilogue vectorization by default).

> > 

> > > > There's also possibly

> > > > an extra branch for the case the highest_VF loop isn't entered

> > > > (unless there already was a prologue loop).

> > > I don't understand this one, can you elaborate?

> > 

> > The branch around the main vectorized loop I talked about above.

> > So I'd fool the versioning condition to use the lowest VF for

> > the iteration count checking and use the code that handles

> > zero-trip iteration count for the vector loop unconditionally.

> 

> I guess this sheds some light on the comment above. And it definitely implies

> we would need to know the lowest VF when creating this condition. Which is

> tricky.


We can simply use the smallest vector size supported by the target to
derive it from the actual VF, no?

> > 

> > In some way this makes checking the niter condition on the version

> > check pointless (at least if we have a really low lowest VF like

> > on x64 where it will likely be 2), so we may want to elide that

> > completely?  For the check to be "correct" we'd also need to

> > compute the lowest VF a vectorized epilogue is still profitable

> > (on x86 those will run once or never, but we can also end up

> > with say main AVX512 vectorization, and a single vectorized

> > epilogue with SSE2 if we somehow figure AVX256 vectorization

> > isn't profitable for it - we can also end up with non-vectorizable

> > epilogue).  So with the current setup how we vectorize epilogues

> > we maybe want to have a location of the version niter check we

> > can "patch up" later after (not) vectorizing the epilogue(s).

> 

> I think you come to the same conclusion here as I mentioned above. Somehow I

> wish I had understood this better when I first read it ... but eh such is life

> :)

> 

> I went on and continued hacking around the approach of splitting the niter and

> alias check I had earlier. I got it to work with a single loop. However, when

> dealing with nested loops I run into the problem that I'd need to sink the

> niter checks. Otherwise you could end up with an alias check and niter checks

> outside the outer loop. Where the 2nd and consequent VF niter checks point to

> the corresponding epilogues in the inner loop.  However, once those are done

> and it iterates over the outer-loop, it will go through the higher VF's first,

> leading to wrong behavior.

> 

> To illustrate what I mean, here is a very simplistic illustration of what is

> happening:

> 

> BB1: Alias check

> BB2: niter check VF 32

> BB3: niter check VF 16

> BB4: Vectorized loop VF32

> BB5: Vectorized loop VF16

> BB6: Remaining epilogue scalar loop

> BB7: Outer loop iteration (updates IV's and DRs of inner loop)

> BB8: Scalar inner&outer loop

> 

> With edges:

> BB1 -T-> BB2

> BB1 -F-> BB8

> BB2 -T-> BB4

> BB2 -F-> BB3

> BB3 -T-> BB5

> BB3 -F-> BB8

> BB4 -> BB5

> BB5 -> BB6

> BB6 -> BB7

> BB7 -> BB4

> 

> Where -T-> is a True edge and -F-> is a False edge

> 

> So my first thought to solve this is to sink BB2 and BB3 into the loop for

> which BB7 is the latch.

> 

> I.e. make BB7 -> BB2

> 

> But then I would argue, it would be good to introduce a BB9:

> BB1 -T-> BB9

> BB9 -T-> BB2

> BB9 -F-> BB8

> 

> Where BB9 checks that niter is at least the lowest VF.

> 

> Sorry if I am repeating what you were telling me to do all along :')


I'm not sure I understand - why would you have any check not inside
the outer loop?  Yes, we now eventually hoist versioning checks
but the VF checks for the individual variants should be around
the vectorized loop itself (so not really part of the versioning check).

> Cheers,

> Andre

> 

> PS: I often find myself having to patch the DOMINATOR information, sometimes

> its easy to, but sometimes it can get pretty complicated. I wonder whether it

> would make sense to write something that traverses a loop and corrects this,

> if it doesn't exist already.


There's iterate-fix-dominators, but unless you create new edges/blocks
manually rather than doing split-block/redirect-edge which should do
dominator updating for you.

Richard.

> 

> > 

> > Richard.

> > 

> 


-- 
Richard Biener <rguenther@suse.de>
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)
Andre Vieira (lists) July 19, 2019, 12:38 p.m. | #6
On 19/07/2019 12:35, Richard Biener wrote:
> On Fri, 19 Jul 2019, Andre Vieira (lists) wrote:

> 

>>

>>

>> On 15/07/2019 11:54, Richard Biener wrote:

>>> On Mon, 15 Jul 2019, Andre Vieira (lists) wrote:

>>>

>>>>

>>>>

>>>> On 12/07/2019 11:19, Richard Biener wrote:

>>>>> On Thu, 11 Jul 2019, Andre Vieira (lists) wrote:

>>>>

>>>>

>>>> I have code that can split the condition and alias checks in

>>>> 'vect_loop_versioning'. For this approach I am considering keeping that

>>>> bit of

>>>> code and seeing if I can patch up the checks after vectorizing the

>>>> epilogue

>>>> further. I think initially I will just go with a "hacked up" way of

>>>> passing

>>>> down the bb with the iteration check and split the false edge every time

>>>> we

>>>> vectorize it further. Will keep you posted on progress. If you have any

>>>> pointers/tips they are most welc ome!

>>>

>>> I thought to somehow force the idea that we have a prologue loop

>>> to the vectorizer so it creates the number-of-vectorized iterations

>>> check and branch around the main (highest VF) vectorized loop.

>>>

>>

>> Hmm I think I may have skimmed over this earlier. I am reading it now and am

>> not entirely sure what you mean by this. Specifically the "number of

>> vectorized iterations" or how a prologue loop plays a role in this.

> 

> When there is no prologue then the versioning condition currently

> ensures we always enter the vector loop.  Only if there is a prologue

> is there a check and branch eventually skipping right to the epilogue.

> If the versioning condition (now using a lower VF) doesn't guarantee

> this we have to add this jump-around.


Right, I haven't looked at the prologue path yet. I had a quick look and 
can't find where this branch skipping to the epilogue is constructed.  I 
will take a better look after I got my current example to work.

>>

>> I guess this sheds some light on the comment above. And it definitely implies

>> we would need to know the lowest VF when creating this condition. Which is

>> tricky.

> 

> We can simply use the smallest vector size supported by the target to

> derive it from the actual VF, no?


So I could wait to introduce this check after all epilogue vectorization 
is done, back track to the last niter check and duplicate that in the 
outer loop.

What I didn't want to do was use the smallest possible vector size for 
the target because I was under the impression that does not necessarily 
correspond to the smallest VF we may have for a loop, as the vectorizer 
may have decided not to vectorize for that vector size because of costs? 
If it I can assume this never happens, that once it starts to vectorize 
epilogues that it will keep vectorizing them for any vector size it 
knows off then yeah I can use that.


> I'm not sure I understand - why would you have any check not inside

> the outer loop?  Yes, we now eventually hoist versioning checks

> but the VF checks for the individual variants should be around

> the vectorized loop itself (so not really part of the versioning check).


Yeah I agree. I was just explaining what I had done wrong now.
> 

>> Cheers,

>> Andre

>>

>> PS: I often find myself having to patch the DOMINATOR information, sometimes

>> its easy to, but sometimes it can get pretty complicated. I wonder whether it

>> would make sense to write something that traverses a loop and corrects this,

>> if it doesn't exist already.

> 

> There's iterate-fix-dominators, but unless you create new edges/blocks

> manually rather than doing split-block/redirect-edge which should do

> dominator updating for you.


Ah I was doing everything manually after having some bad experiences 
with lv_add_condition_to_bb.  I will have a look at those thanks!

Cheers,
Andre

> 

> Richard.

> 

>>

>>>

>>> Richard.

>>>

>>

>
Andre Vieira (lists) July 30, 2019, 12:16 p.m. | #7
Hi Richard,

I believe this is in line with what you were explaining to me earlier. 
The one thing I haven't quite done here is the jump around for if there 
is no prolog peeling. Though I think skip_vectors introduces the jumping 
we need.

The main gist of it is:
1) if '--param vect-epilogues-nomask=1' is passed we refrain from adding 
the versioning threshold check when versioning a loop at first,
2) later we allow the vectorizer to use skip_vectors to basically check 
niters to be able to skip the vectorized loop on to the right epilogue,
3) after vectorizing the epilogue, we check whether this was a versioned 
loop and whether we skipped adding the versioning threshold, if so we 
add a threshold based on the last VF

There is a caveat here, if we don't vectorize the epilogue, because say 
our architecture doesn't know how, we will end up with a regression.
Let's say we have a loop for which our target can vectorize for 32x but 
not 16x, we get:

if (alias & threshold_for_16x ())
{
   if (niters > 31)
     vectorized_31 ();
   else
     scalar ();
}
else
  scalar ();

Imagine our iteration count is 18, all we hit is the scalar loop, but 
now we go through 1x alias and 2x niters. Whereas potentially we could 
have done with just 1x niters.

A mitigation for this could be to only add the threshold check if we 
actually vectorized the last loop, I believe a:
'if (LOOP_VINFO_VECTORIZABLE_P (loop_vinfo)) return NULL;' inside that 
new "else" in vect_transform_loop would do the trick. Though then we 
will still have that extra alias check...

I am going to hack around in the back-end to "fake" a situation like 
this and see what happens.

Another potential issue arises when the profitability threshold obtained 
from the cost model isn't guaranteed to be either LE or EQ to the 
versioning threshold. In that case there are two separate checks, which 
now we no longer will attempt to fold.

I am trying to find a way to test and benchmark these changes. 
Unfortunately I am having trouble doing this for AVX512 as I find that 
the old '--param vect_epilogues_nomask=1' already causes wrong codegen 
in SPEC2017 for the gcc and perlbench benchmarks.


Cheers,
Andre

On 19/07/2019 13:38, Andre Vieira (lists) wrote:
> 

> 

> On 19/07/2019 12:35, Richard Biener wrote:

>> On Fri, 19 Jul 2019, Andre Vieira (lists) wrote:

>>

>>>

>>>

>>> On 15/07/2019 11:54, Richard Biener wrote:

>>>> On Mon, 15 Jul 2019, Andre Vieira (lists) wrote:

>>>>

>>>>>

>>>>>

>>>>> On 12/07/2019 11:19, Richard Biener wrote:

>>>>>> On Thu, 11 Jul 2019, Andre Vieira (lists) wrote:

>>>>>

>>>>>

>>>>> I have code that can split the condition and alias checks in

>>>>> 'vect_loop_versioning'. For this approach I am considering keeping 

>>>>> that

>>>>> bit of

>>>>> code and seeing if I can patch up the checks after vectorizing the

>>>>> epilogue

>>>>> further. I think initially I will just go with a "hacked up" way of

>>>>> passing

>>>>> down the bb with the iteration check and split the false edge every 

>>>>> time

>>>>> we

>>>>> vectorize it further. Will keep you posted on progress. If you have 

>>>>> any

>>>>> pointers/tips they are most welc ome!

>>>>

>>>> I thought to somehow force the idea that we have a prologue loop

>>>> to the vectorizer so it creates the number-of-vectorized iterations

>>>> check and branch around the main (highest VF) vectorized loop.

>>>>

>>>

>>> Hmm I think I may have skimmed over this earlier. I am reading it now 

>>> and am

>>> not entirely sure what you mean by this. Specifically the "number of

>>> vectorized iterations" or how a prologue loop plays a role in this.

>>

>> When there is no prologue then the versioning condition currently

>> ensures we always enter the vector loop.  Only if there is a prologue

>> is there a check and branch eventually skipping right to the epilogue.

>> If the versioning condition (now using a lower VF) doesn't guarantee

>> this we have to add this jump-around.

> 

> Right, I haven't looked at the prologue path yet. I had a quick look and 

> can't find where this branch skipping to the epilogue is constructed.  I 

> will take a better look after I got my current example to work.

> 

>>>

>>> I guess this sheds some light on the comment above. And it definitely 

>>> implies

>>> we would need to know the lowest VF when creating this condition. 

>>> Which is

>>> tricky.

>>

>> We can simply use the smallest vector size supported by the target to

>> derive it from the actual VF, no?

> 

> So I could wait to introduce this check after all epilogue vectorization 

> is done, back track to the last niter check and duplicate that in the 

> outer loop.

> 

> What I didn't want to do was use the smallest possible vector size for 

> the target because I was under the impression that does not necessarily 

> correspond to the smallest VF we may have for a loop, as the vectorizer 

> may have decided not to vectorize for that vector size because of costs? 

> If it I can assume this never happens, that once it starts to vectorize 

> epilogues that it will keep vectorizing them for any vector size it 

> knows off then yeah I can use that.

> 

> 

>> I'm not sure I understand - why would you have any check not inside

>> the outer loop?  Yes, we now eventually hoist versioning checks

>> but the VF checks for the individual variants should be around

>> the vectorized loop itself (so not really part of the versioning check).

> 

> Yeah I agree. I was just explaining what I had done wrong now.

>>

>>> Cheers,

>>> Andre

>>>

>>> PS: I often find myself having to patch the DOMINATOR information, 

>>> sometimes

>>> its easy to, but sometimes it can get pretty complicated. I wonder 

>>> whether it

>>> would make sense to write something that traverses a loop and 

>>> corrects this,

>>> if it doesn't exist already.

>>

>> There's iterate-fix-dominators, but unless you create new edges/blocks

>> manually rather than doing split-block/redirect-edge which should do

>> dominator updating for you.

> 

> Ah I was doing everything manually after having some bad experiences 

> with lv_add_condition_to_bb.  I will have a look at those thanks!

> 

> Cheers,

> Andre

> 

>>

>> Richard.

>>

>>>

>>>>

>>>> Richard.

>>>>

>>>

>>
diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 0b0154ffd7bf031a005de993b101d9db6dd98c43..b1f1d2d6c6f8b48cce1a666dfbf8cf542f6857df 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -268,6 +268,10 @@ public:
      the basic-block from being collected but its index can still be
      reused.  */
   basic_block former_header;
+
+  /* Basic-block containing the checks for versioning this loop.  This is set
+     by `vect_loop_versioning'.  */
+  basic_block version_check_bb;
 };
 
 /* Set if the loop is known to be infinite.  */
diff --git a/gcc/cfgloop.c b/gcc/cfgloop.c
index 4ad1f658708f83dbd8789666c26d4bd056837bc6..47873eb76543c3bfa39c197a8f98eeaf132faf4c 100644
--- a/gcc/cfgloop.c
+++ b/gcc/cfgloop.c
@@ -355,6 +355,7 @@ alloc_loop (void)
   loop->nb_iterations_upper_bound = 0;
   loop->nb_iterations_likely_upper_bound = 0;
   loop->nb_iterations_estimate = 0;
+  loop->version_check_bb = NULL;
   return loop;
 }
 
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 5c25441c70a271f04730486e513437fffa75b7e3..b899a5cfb626d12125d0595b3bbbf161cb27d446 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2401,7 +2401,8 @@ class loop *
 vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 		 tree *niters_vector, tree *step_vector,
 		 tree *niters_vector_mult_vf_var, int th,
-		 bool check_profitability, bool niters_no_overflow)
+		 bool check_profitability, bool niters_no_overflow,
+		 bool vect_epilogues_nomask)
 {
   edge e, guard_e;
   tree type = TREE_TYPE (niters), guard_cond;
@@ -2474,7 +2475,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
 		      ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo),
 				  bound_prolog + bound_epilog)
-		      : !LOOP_REQUIRES_VERSIONING (loop_vinfo));
+		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+			 || vect_epilogues_nomask));
   /* Epilog loop must be executed if the number of iterations for epilog
      loop is known at compile time, otherwise we need to add a check at
      the end of vector loop and skip to the end of epilog loop.  */
@@ -2968,7 +2970,8 @@ vect_create_cond_for_alias_checks (loop_vec_info loop_vinfo, tree * cond_expr)
 class loop *
 vect_loop_versioning (loop_vec_info loop_vinfo,
 		      unsigned int th, bool check_profitability,
-		      poly_uint64 versioning_threshold)
+		      poly_uint64 versioning_threshold,
+		      bool vect_epilogues_nomask)
 {
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo), *nloop;
   class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
@@ -2995,7 +2998,7 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
     cond_expr = fold_build2 (GE_EXPR, boolean_type_node, scalar_loop_iters,
 			     build_int_cst (TREE_TYPE (scalar_loop_iters),
 					    th - 1));
-  if (maybe_ne (versioning_threshold, 0U))
+  if (maybe_ne (versioning_threshold, 0U) && !vect_epilogues_nomask)
     {
       tree expr = fold_build2 (GE_EXPR, boolean_type_node, scalar_loop_iters,
 			       build_int_cst (TREE_TYPE (scalar_loop_iters),
@@ -3199,6 +3202,9 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
       update_ssa (TODO_update_ssa);
     }
 
+  if (vect_epilogues_nomask)
+    LOOP_VINFO_LOOP (loop_vinfo)->version_check_bb = condition_bb;
+
   if (version_niter)
     {
       /* The versioned loop could be infinite, we need to clear existing
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index b0cbbac0cb5ba1ffce706715d3dbb9139063803d..6a7329d5506233c6974a86bd56697eacc0bbea5c 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1852,6 +1852,40 @@ vect_dissolve_slp_only_groups (loop_vec_info loop_vinfo)
     }
 }
 
+/* During peeling, we need to check if number of loop iterations is
+   enough for both peeled prolog loop and vector loop.  This check
+   can be merged along with threshold check of loop versioning, so
+   increase threshold for this case if necessary.  */
+
+static poly_uint64
+vec_calc_versioning_threshold (loop_vec_info loop_vinfo, poly_uint64 vf)
+{
+  poly_uint64 niters_th = 0;
+
+  if (!vect_use_loop_mask_for_alignment_p (loop_vinfo))
+    {
+      /* Niters for peeled prolog loop.  */
+      if (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) < 0)
+	{
+	  dr_vec_info *dr_info = LOOP_VINFO_UNALIGNED_DR (loop_vinfo);
+	  tree vectype = STMT_VINFO_VECTYPE (dr_info->stmt);
+	  niters_th += TYPE_VECTOR_SUBPARTS (vectype) - 1;
+	}
+      else
+	niters_th += LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
+    }
+
+  /* Niters for at least one iteration of vectorized loop.  */
+  if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    niters_th += vf;
+  /* One additional iteration because of peeling for gap.  */
+  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
+    niters_th += 1;
+
+  return niters_th;
+}
+
+
 /* Function vect_analyze_loop_2.
 
    Apply a set of analyses on LOOP, and create a loop_vec_info struct
@@ -2179,35 +2213,10 @@ start_over:
         }
     }
 
-  /* During peeling, we need to check if number of loop iterations is
-     enough for both peeled prolog loop and vector loop.  This check
-     can be merged along with threshold check of loop versioning, so
-     increase threshold for this case if necessary.  */
   if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
-    {
-      poly_uint64 niters_th = 0;
-
-      if (!vect_use_loop_mask_for_alignment_p (loop_vinfo))
-	{
-	  /* Niters for peeled prolog loop.  */
-	  if (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) < 0)
-	    {
-	      dr_vec_info *dr_info = LOOP_VINFO_UNALIGNED_DR (loop_vinfo);
-	      tree vectype = STMT_VINFO_VECTYPE (dr_info->stmt);
-	      niters_th += TYPE_VECTOR_SUBPARTS (vectype) - 1;
-	    }
-	  else
-	    niters_th += LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
-	}
-
-      /* Niters for at least one iteration of vectorized loop.  */
-      if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
-	niters_th += LOOP_VINFO_VECT_FACTOR (loop_vinfo);
-      /* One additional iteration because of peeling for gap.  */
-      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
-	niters_th += 1;
-      LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo) = niters_th;
-    }
+    LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo)
+      = vec_calc_versioning_threshold (loop_vinfo,
+				       LOOP_VINFO_VECT_FACTOR (loop_vinfo));
 
   gcc_assert (known_eq (vectorization_factor,
 			LOOP_VINFO_VECT_FACTOR (loop_vinfo)));
@@ -8530,7 +8539,9 @@ vect_transform_loop (loop_vec_info loop_vinfo)
 	}
       class loop *sloop
 	= vect_loop_versioning (loop_vinfo, th, check_profitability,
-				versioning_threshold);
+				versioning_threshold,
+				PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK));
+
       sloop->force_vectorize = false;
       check_profitability = false;
     }
@@ -8557,7 +8568,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   bool niters_no_overflow = loop_niters_no_overflow (loop_vinfo);
   epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector,
 			      &step_vector, &niters_vector_mult_vf, th,
-			      check_profitability, niters_no_overflow);
+			      check_profitability, niters_no_overflow,
+			      PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK));
   if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo)
       && LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo).initialized_p ())
     scale_loop_frequencies (LOOP_VINFO_SCALAR_LOOP (loop_vinfo),
@@ -8865,9 +8877,99 @@ vect_transform_loop (loop_vec_info loop_vinfo)
       /* We may need to if-convert epilogue to vectorize it.  */
       if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
 	tree_if_conversion (epilogue);
+
+      return epilogue;
     }
+  /* EPILOGUE is null, let's see if we did versioning on this loop and
+     whether we need to patch up the condition checks.  */
+  else
+    {
+      gimple_seq stmts;
+      gimple_stmt_iterator stmt_it;
+      gimple * old_check;
+      tree low_niters_check, new_check;
+      basic_block version_check_bb;
+      loop_vec_info orig_loop_vinfo;
+      poly_uint64 low_niters;
+
+      /* We only exclude the threshold check if we are vectorizing the
+	 epilogues.  */
+      if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK))
+	return NULL;
+
+      /* Find the top loop info as this will be the one that was versioned.  */
+      orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
+
+      while (orig_loop_vinfo &&
+	     LOOP_VINFO_ORIG_LOOP_INFO (orig_loop_vinfo) != NULL)
+	orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (orig_loop_vinfo);
+
+      if (!orig_loop_vinfo)
+	return NULL;
+
+      version_check_bb = LOOP_VINFO_LOOP (orig_loop_vinfo)->version_check_bb;
+
+      /* If we don't have a basic block to patch up then there is nothing to
+	 do.  */
+      if (!version_check_bb)
+	return NULL;
+
+      low_niters
+	= vec_calc_versioning_threshold (orig_loop_vinfo,
+					 LOOP_VINFO_VECT_FACTOR (loop_vinfo));
+
+      /* All conditions indicate we should add a check that the number of
+	 iterations is higher than the versioning threshold using the lowest
+	 vectorization factor to enter the vectorized version of the loop.  */
+      old_check = gsi_stmt (gsi_last_bb (version_check_bb));
+
+      /* Construct the check for the lowest threshold.  */
+      low_niters_check = LOOP_VINFO_NITERS (orig_loop_vinfo);
+      low_niters_check
+	= fold_build2 (GT_EXPR, boolean_type_node,
+		       low_niters_check,
+		       build_int_cst (TREE_TYPE (low_niters_check),
+				      low_niters - 1));
+
+      low_niters_check
+	= force_gimple_operand_1 (unshare_expr (low_niters_check),
+				  &stmts,
+				  is_gimple_condexpr,
+				  NULL_TREE);
+
+      stmt_it = gsi_last_bb (version_check_bb);
+      gsi_insert_seq_before (&stmt_it, stmts, GSI_SAME_STMT);
+
+
+      /* Get the last versioning check, and merge it with the new check.  */
+      gcc_assert (gimple_code (old_check) == GIMPLE_COND);
+      new_check
+	= fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
+		       low_niters_check,
+		       gimple_cond_lhs (old_check));
+
+      new_check
+	= force_gimple_operand_1 (unshare_expr (new_check),
+				  &stmts,
+				  is_gimple_condexpr,
+				  NULL_TREE);
+
+      stmt_it = gsi_last_bb (version_check_bb);
+      gsi_insert_seq_before (&stmt_it, stmts, GSI_SAME_STMT);
+
+      stmt_it = gsi_last_bb (version_check_bb);
+      gimple *stmt = gsi_stmt (stmt_it);
+
+      /* Set the new check as the condition for going into the vectorized loop.
+       */
+      gimple_cond_set_condition_from_tree (as_a <gcond *> (stmt),
+					   new_check);
+
+      /* Make sure we update def/use information.  */
+      update_stmt (gsi_stmt (gsi_last_bb (version_check_bb)));
 
-  return epilogue;
+      return NULL;
+    }
 }
 
 /* The code below is trying to perform simple optimization - revert
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 1456cde4c2c2dec7244c504d2c496248894a4f1e..be282a974a76617fbb0cf6f0e34b08d4d652c3df 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1481,9 +1481,10 @@ extern bool slpeel_can_duplicate_loop_p (const class loop *, const_edge);
 class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
 						     class loop *, edge);
 class loop *vect_loop_versioning (loop_vec_info, unsigned int, bool,
-				   poly_uint64);
+				   poly_uint64, bool);
 extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
-				     tree *, tree *, tree *, int, bool, bool);
+				    tree *, tree *, tree *, int, bool, bool,
+				    bool);
 extern void vect_prepare_for_masked_peels (loop_vec_info);
 extern dump_user_location_t find_loop_location (class loop *);
 extern bool vect_can_advance_ivs_p (loop_vec_info);
Andre Vieira (lists) July 30, 2019, 1:14 p.m. | #8
On 30/07/2019 13:16, Andre Vieira (lists) wrote:
> Hi Richard,

> 

> I believe this is in line with what you were explaining to me earlier. 

> The one thing I haven't quite done here is the jump around for if there 

> is no prolog peeling. Though I think skip_vectors introduces the jumping 

> we need.

> 

> The main gist of it is:

> 1) if '--param vect-epilogues-nomask=1' is passed we refrain from adding 

> the versioning threshold check when versioning a loop at first,

> 2) later we allow the vectorizer to use skip_vectors to basically check 

> niters to be able to skip the vectorized loop on to the right epilogue,

> 3) after vectorizing the epilogue, we check whether this was a versioned 

> loop and whether we skipped adding the versioning threshold, if so we 

> add a threshold based on the last VF

> 

> There is a caveat here, if we don't vectorize the epilogue, because say 

> our architecture doesn't know how, we will end up with a regression.

> Let's say we have a loop for which our target can vectorize for 32x but 

> not 16x, we get:

> 

> if (alias & threshold_for_16x ())

> {

>    if (niters > 31)

>      vectorized_31 ();

>    else

>      scalar ();

> }

> else

>   scalar ();

> 

> Imagine our iteration count is 18, all we hit is the scalar loop, but 

> now we go through 1x alias and 2x niters. Whereas potentially we could 

> have done with just 1x niters.

> 

> A mitigation for this could be to only add the threshold check if we 

> actually vectorized the last loop, I believe a:

> 'if (LOOP_VINFO_VECTORIZABLE_P (loop_vinfo)) return NULL;' inside that 

> new "else" in vect_transform_loop would do the trick. Though then we 

> will still have that extra alias check... >

> I am going to hack around in the back-end to "fake" a situation like 

> this and see what happens.


Right should have quickly tried this before sending the email, what 
happens is actually vect_transform_loop never gets called because 
try_vectorize_loop_1 will recognize it can't vectorize the epilogue. So 
we end up with the "mitigation" result I suggested, where we simply 
don't have a versioning threshold which might still not be ideal.

> 

> Another potential issue arises when the profitability threshold obtained 

> from the cost model isn't guaranteed to be either LE or EQ to the 

> versioning threshold. In that case there are two separate checks, which 

> now we no longer will attempt to fold.


And I just realized I don't take the cost model threshold into account 
when creating the new threshold check, I guess if it is ordered_p we 
should again take the max there between the new threshold check and the 
cost threshold for the new check.

> 

> I am trying to find a way to test and benchmark these changes. 

> Unfortunately I am having trouble doing this for AVX512 as I find that 

> the old '--param vect_epilogues_nomask=1' already causes wrong codegen 

> in SPEC2017 for the gcc and perlbench benchmarks.

> 

> 

> Cheers,

> Andre

> 

> On 19/07/2019 13:38, Andre Vieira (lists) wrote:

>>

>>

>> On 19/07/2019 12:35, Richard Biener wrote:

>>> On Fri, 19 Jul 2019, Andre Vieira (lists) wrote:

>>>

>>>>

>>>>

>>>> On 15/07/2019 11:54, Richard Biener wrote:

>>>>> On Mon, 15 Jul 2019, Andre Vieira (lists) wrote:

>>>>>

>>>>>>

>>>>>>

>>>>>> On 12/07/2019 11:19, Richard Biener wrote:

>>>>>>> On Thu, 11 Jul 2019, Andre Vieira (lists) wrote:

>>>>>>

>>>>>>

>>>>>> I have code that can split the condition and alias checks in

>>>>>> 'vect_loop_versioning'. For this approach I am considering keeping 

>>>>>> that

>>>>>> bit of

>>>>>> code and seeing if I can patch up the checks after vectorizing the

>>>>>> epilogue

>>>>>> further. I think initially I will just go with a "hacked up" way of

>>>>>> passing

>>>>>> down the bb with the iteration check and split the false edge 

>>>>>> every time

>>>>>> we

>>>>>> vectorize it further. Will keep you posted on progress. If you 

>>>>>> have any

>>>>>> pointers/tips they are most welc ome!

>>>>>

>>>>> I thought to somehow force the idea that we have a prologue loop

>>>>> to the vectorizer so it creates the number-of-vectorized iterations

>>>>> check and branch around the main (highest VF) vectorized loop.

>>>>>

>>>>

>>>> Hmm I think I may have skimmed over this earlier. I am reading it 

>>>> now and am

>>>> not entirely sure what you mean by this. Specifically the "number of

>>>> vectorized iterations" or how a prologue loop plays a role in this.

>>>

>>> When there is no prologue then the versioning condition currently

>>> ensures we always enter the vector loop.  Only if there is a prologue

>>> is there a check and branch eventually skipping right to the epilogue.

>>> If the versioning condition (now using a lower VF) doesn't guarantee

>>> this we have to add this jump-around.

>>

>> Right, I haven't looked at the prologue path yet. I had a quick look 

>> and can't find where this branch skipping to the epilogue is 

>> constructed.  I will take a better look after I got my current example 

>> to work.

>>

>>>>

>>>> I guess this sheds some light on the comment above. And it 

>>>> definitely implies

>>>> we would need to know the lowest VF when creating this condition. 

>>>> Which is

>>>> tricky.

>>>

>>> We can simply use the smallest vector size supported by the target to

>>> derive it from the actual VF, no?

>>

>> So I could wait to introduce this check after all epilogue 

>> vectorization is done, back track to the last niter check and 

>> duplicate that in the outer loop.

>>

>> What I didn't want to do was use the smallest possible vector size for 

>> the target because I was under the impression that does not 

>> necessarily correspond to the smallest VF we may have for a loop, as 

>> the vectorizer may have decided not to vectorize for that vector size 

>> because of costs? If it I can assume this never happens, that once it 

>> starts to vectorize epilogues that it will keep vectorizing them for 

>> any vector size it knows off then yeah I can use that.

>>

>>

>>> I'm not sure I understand - why would you have any check not inside

>>> the outer loop?  Yes, we now eventually hoist versioning checks

>>> but the VF checks for the individual variants should be around

>>> the vectorized loop itself (so not really part of the versioning check).

>>

>> Yeah I agree. I was just explaining what I had done wrong now.

>>>

>>>> Cheers,

>>>> Andre

>>>>

>>>> PS: I often find myself having to patch the DOMINATOR information, 

>>>> sometimes

>>>> its easy to, but sometimes it can get pretty complicated. I wonder 

>>>> whether it

>>>> would make sense to write something that traverses a loop and 

>>>> corrects this,

>>>> if it doesn't exist already.

>>>

>>> There's iterate-fix-dominators, but unless you create new edges/blocks

>>> manually rather than doing split-block/redirect-edge which should do

>>> dominator updating for you.

>>

>> Ah I was doing everything manually after having some bad experiences 

>> with lv_add_condition_to_bb.  I will have a look at those thanks!

>>

>> Cheers,

>> Andre

>>

>>>

>>> Richard.

>>>

>>>>

>>>>>

>>>>> Richard.

>>>>>

>>>>

>>>
Richard Biener Aug. 1, 2019, 3:26 p.m. | #9
On Tue, 30 Jul 2019, Andre Vieira (lists) wrote:

> 

> 

> On 30/07/2019 13:16, Andre Vieira (lists) wrote:

> > Hi Richard,

> > 

> > I believe this is in line with what you were explaining to me earlier. The

> > one thing I haven't quite done here is the jump around for if there is no

> > prolog peeling. Though I think skip_vectors introduces the jumping we need.

> > 

> > The main gist of it is:

> > 1) if '--param vect-epilogues-nomask=1' is passed we refrain from adding the

> > versioning threshold check when versioning a loop at first,

> > 2) later we allow the vectorizer to use skip_vectors to basically check

> > niters to be able to skip the vectorized loop on to the right epilogue,

> > 3) after vectorizing the epilogue, we check whether this was a versioned

> > loop and whether we skipped adding the versioning threshold, if so we add a

> > threshold based on the last VF

> > 

> > There is a caveat here, if we don't vectorize the epilogue, because say our

> > architecture doesn't know how, we will end up with a regression.

> > Let's say we have a loop for which our target can vectorize for 32x but not

> > 16x, we get:

> > 

> > if (alias & threshold_for_16x ())

> > {

> >    if (niters > 31)

> >      vectorized_31 ();

> >    else

> >      scalar ();

> > }

> > else

> >   scalar ();

> > 

> > Imagine our iteration count is 18, all we hit is the scalar loop, but now we

> > go through 1x alias and 2x niters. Whereas potentially we could have done

> > with just 1x niters.


True.  Note we should swap the checks to

  if (threshold_for_16x && alias)

> > A mitigation for this could be to only add the threshold check if we

> > actually vectorized the last loop, I believe a:

> > 'if (LOOP_VINFO_VECTORIZABLE_P (loop_vinfo)) return NULL;' inside that new

> > "else" in vect_transform_loop would do the trick. Though then we will still

> > have that extra alias check... >

> > I am going to hack around in the back-end to "fake" a situation like this

> > and see what happens.

> 

> Right should have quickly tried this before sending the email, what happens is

> actually vect_transform_loop never gets called because try_vectorize_loop_1

> will recognize it can't vectorize the epilogue. So we end up with the

> "mitigation" result I suggested, where we simply don't have a versioning

> threshold which might still not be ideal.


I think the main issue is how we have implemented epilogue vectorization.
Ideally when vectorizing the loop () we'd recognize all VFs we can handle
and thus know beforehand.  I think that's already done for the sake
of openmp simd now so doing this when epilogue vectorization is enabled
as well wouldn't be too bad - so then we know, at the time we do the
versioning, what and how many vectorized epilogues we create.  See
vect_analyze_loop and the loop over vector sizes.

> > 

> > Another potential issue arises when the profitability threshold obtained

> > from the cost model isn't guaranteed to be either LE or EQ to the versioning

> > threshold. In that case there are two separate checks, which now we no

> > longer will attempt to fold.

> 

> And I just realized I don't take the cost model threshold into account when

> creating the new threshold check, I guess if it is ordered_p we should again

> take the max there between the new threshold check and the cost threshold for

> the new check.


Also see https://gcc.gnu.org/ml/gcc-patches/2018-06/msg01397.html for
a patch that computes costs of each possible VF, with that we could
compute a better combined estimate for minimum number of iters
(also considering the extra jumps to reach the main VF/2 loop).

> > 

> > I am trying to find a way to test and benchmark these changes. Unfortunately

> > I am having trouble doing this for AVX512 as I find that the old '--param

> > vect_epilogues_nomask=1' already causes wrong codegen in SPEC2017 for the

> > gcc and perlbench benchmarks.


There's a reason it is not enabled by default.  But I'd like to
fix bugs it has so can you possibly reduce testcases and file
bugs for it?

Thanks,
Richard.

> > 

> > 

> > Cheers,

> > Andre

> > 

> > On 19/07/2019 13:38, Andre Vieira (lists) wrote:

> > > 

> > > 

> > > On 19/07/2019 12:35, Richard Biener wrote:

> > > > On Fri, 19 Jul 2019, Andre Vieira (lists) wrote:

> > > > 

> > > > > 

> > > > > 

> > > > > On 15/07/2019 11:54, Richard Biener wrote:

> > > > > > On Mon, 15 Jul 2019, Andre Vieira (lists) wrote:

> > > > > > 

> > > > > > > 

> > > > > > > 

> > > > > > > On 12/07/2019 11:19, Richard Biener wrote:

> > > > > > > > On Thu, 11 Jul 2019, Andre Vieira (lists) wrote:

> > > > > > > 

> > > > > > > 

> > > > > > > I have code that can split the condition and alias checks in

> > > > > > > 'vect_loop_versioning'. For this approach I am considering keeping

> > > > > > > that

> > > > > > > bit of

> > > > > > > code and seeing if I can patch up the checks after vectorizing the

> > > > > > > epilogue

> > > > > > > further. I think initially I will just go with a "hacked up" way

> > > > > > > of

> > > > > > > passing

> > > > > > > down the bb with the iteration check and split the false edge

> > > > > > > every time

> > > > > > > we

> > > > > > > vectorize it further. Will keep you posted on progress. If you

> > > > > > > have any

> > > > > > > pointers/tips they are most welc ome!

> > > > > > 

> > > > > > I thought to somehow force the idea that we have a prologue loop

> > > > > > to the vectorizer so it creates the number-of-vectorized iterations

> > > > > > check and branch around the main (highest VF) vectorized loop.

> > > > > > 

> > > > > 

> > > > > Hmm I think I may have skimmed over this earlier. I am reading it now

> > > > > and am

> > > > > not entirely sure what you mean by this. Specifically the "number of

> > > > > vectorized iterations" or how a prologue loop plays a role in this.

> > > > 

> > > > When there is no prologue then the versioning condition currently

> > > > ensures we always enter the vector loop.  Only if there is a prologue

> > > > is there a check and branch eventually skipping right to the epilogue.

> > > > If the versioning condition (now using a lower VF) doesn't guarantee

> > > > this we have to add this jump-around.

> > > 

> > > Right, I haven't looked at the prologue path yet. I had a quick look and

> > > can't find where this branch skipping to the epilogue is constructed.  I

> > > will take a better look after I got my current example to work.

> > > 

> > > > > 

> > > > > I guess this sheds some light on the comment above. And it definitely

> > > > > implies

> > > > > we would need to know the lowest VF when creating this condition.

> > > > > Which is

> > > > > tricky.

> > > > 

> > > > We can simply use the smallest vector size supported by the target to

> > > > derive it from the actual VF, no?

> > > 

> > > So I could wait to introduce this check after all epilogue vectorization

> > > is done, back track to the last niter check and duplicate that in the

> > > outer loop.

> > > 

> > > What I didn't want to do was use the smallest possible vector size for the

> > > target because I was under the impression that does not necessarily

> > > correspond to the smallest VF we may have for a loop, as the vectorizer

> > > may have decided not to vectorize for that vector size because of costs?

> > > If it I can assume this never happens, that once it starts to vectorize

> > > epilogues that it will keep vectorizing them for any vector size it knows

> > > off then yeah I can use that.

> > > 

> > > 

> > > > I'm not sure I understand - why would you have any check not inside

> > > > the outer loop?  Yes, we now eventually hoist versioning checks

> > > > but the VF checks for the individual variants should be around

> > > > the vectorized loop itself (so not really part of the versioning check).

> > > 

> > > Yeah I agree. I was just explaining what I had done wrong now.

> > > > 

> > > > > Cheers,

> > > > > Andre

> > > > > 

> > > > > PS: I often find myself having to patch the DOMINATOR information,

> > > > > sometimes

> > > > > its easy to, but sometimes it can get pretty complicated. I wonder

> > > > > whether it

> > > > > would make sense to write something that traverses a loop and corrects

> > > > > this,

> > > > > if it doesn't exist already.

> > > > 

> > > > There's iterate-fix-dominators, but unless you create new edges/blocks

> > > > manually rather than doing split-block/redirect-edge which should do

> > > > dominator updating for you.

> > > 

> > > Ah I was doing everything manually after having some bad experiences with

> > > lv_add_condition_to_bb.  I will have a look at those thanks!

> > > 

> > > Cheers,

> > > Andre

> > > 

> > > > 

> > > > Richard.

> > > > 

> > > > > 

> > > > > > 

> > > > > > Richard.

> > > > > > 

> > > > > 

> > > > 

> 


-- 
Richard Biener <rguenther@suse.de>
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)
Andre Vieira (lists) Aug. 5, 2019, 6:10 p.m. | #10
Hi Richard,

Thanks for the feedback! See comments inline.

On 01/08/2019 16:26, Richard Biener wrote:
> On Tue, 30 Jul 2019, Andre Vieira (lists) wrote:

> 

>>

>>

>> On 30/07/2019 13:16, Andre Vieira (lists) wrote:

>>> Hi Richard,

>>>

>>> I believe this is in line with what you were explaining to me earlier. The

>>> one thing I haven't quite done here is the jump around for if there is no

>>> prolog peeling. Though I think skip_vectors introduces the jumping we need.

>>>

>>> The main gist of it is:

>>> 1) if '--param vect-epilogues-nomask=1' is passed we refrain from adding the

>>> versioning threshold check when versioning a loop at first,

>>> 2) later we allow the vectorizer to use skip_vectors to basically check

>>> niters to be able to skip the vectorized loop on to the right epilogue,

>>> 3) after vectorizing the epilogue, we check whether this was a versioned

>>> loop and whether we skipped adding the versioning threshold, if so we add a

>>> threshold based on the last VF

>>>

>>> There is a caveat here, if we don't vectorize the epilogue, because say our

>>> architecture doesn't know how, we will end up with a regression.

>>> Let's say we have a loop for which our target can vectorize for 32x but not

>>> 16x, we get:

>>>

>>> if (alias & threshold_for_16x ())

>>> {

>>>     if (niters > 31)

>>>       vectorized_31 ();

>>>     else

>>>       scalar ();

>>> }

>>> else

>>>    scalar ();

>>>

>>> Imagine our iteration count is 18, all we hit is the scalar loop, but now we

>>> go through 1x alias and 2x niters. Whereas potentially we could have done

>>> with just 1x niters.

> 

> True.  Note we should swap the checks to

> 

>    if (threshold_for_16x && alias)


I agree, I just haven't figured out what the best way of doing it. I am 
trying TRUTH_ANDIF_EXPR now, but fold_build2 turns that into a 
TRUTH_AND_EXPR.  Can I assume it does that for aarch64 because it can 
use conditional compare to merge the two checks? To be fair the code 
generated for aarch64 for the two checks I am looking at doesn't look 
too bad.

I am pondering figuring out why fold decides to transform it back into a 
TRUTH_AND_EXPR. Otherwise I think I might just try splitting the basic 
block.  Though I am not entirely sure adding an extra jump in there is 
always a good thing.
> 

>>> A mitigation for this could be to only add the threshold check if we

>>> actually vectorized the last loop, I believe a:

>>> 'if (LOOP_VINFO_VECTORIZABLE_P (loop_vinfo)) return NULL;' inside that new

>>> "else" in vect_transform_loop would do the trick. Though then we will still

>>> have that extra alias check... >

>>> I am going to hack around in the back-end to "fake" a situation like this

>>> and see what happens.

>>

>> Right should have quickly tried this before sending the email, what happens is

>> actually vect_transform_loop never gets called because try_vectorize_loop_1

>> will recognize it can't vectorize the epilogue. So we end up with the

>> "mitigation" result I suggested, where we simply don't have a versioning

>> threshold which might still not be ideal.

> 

> I think the main issue is how we have implemented epilogue vectorization.

> Ideally when vectorizing the loop () we'd recognize all VFs we can handle

> and thus know beforehand.  I think that's already done for the sake

> of openmp simd now so doing this when epilogue vectorization is enabled

> as well wouldn't be too bad - so then we know, at the time we do the

> versioning, what and how many vectorized epilogues we create.  See

> vect_analyze_loop and the loop over vector sizes.

> 


I think I managed to do this by passing a boolean initialized to the 
PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK) to vect_analyze_loop and also 
changing the looping over vector_sizes to count the number of 
vectorizable loops whilst using the same code path as loop->simdlen if 
vect_epilogues_nomask is true.

Then I set vect_epilogues_nomask to false if the number of vectorized 
loops isn't higher than 1. I'll follow up with a new patch for you to 
see, which will most likely make more sense than what I just wrote up 
there ;)

>>>

>>> Another potential issue arises when the profitability threshold obtained

>>> from the cost model isn't guaranteed to be either LE or EQ to the versioning

>>> threshold. In that case there are two separate checks, which now we no

>>> longer will attempt to fold.

>>

>> And I just realized I don't take the cost model threshold into account when

>> creating the new threshold check, I guess if it is ordered_p we should again

>> take the max there between the new threshold check and the cost threshold for

>> the new check.

> 

> Also see https://gcc.gnu.org/ml/gcc-patches/2018-06/msg01397.html for

> a patch that computes costs of each possible VF, with that we could

> compute a better combined estimate for minimum number of iters

> (also considering the extra jumps to reach the main VF/2 loop).

> 

Started to look at this. Just to check something, this patch enables 
picking a lower VF loop if it determines that its single iteration cost 
is more than N times lower than the higher VF loop, where N is the 
ration between higher VF and lower VF no?

Sounds like a good check to have, but I don't see how this is related to 
my current changes. I mean other than that it could turn off epilogue 
vectorization altogether since it could decide to only vectorize once.

Did you mean to show me this as a way to incorporate costs for the extra 
jumps somehow? For instance, taking the ordered max of current 
versioning threshold and the SINGLE_ITERATION_COST + 1 (for the extra jump)?

I also noticed we purposefully only vectorize the epilogue once, with 
vect-epilogues-nomask=1, I am guessing that's just because nobody got 
around to it yet? Specially since even vectorizing it once goes wrong 
sometimes.
>>>

>>> I am trying to find a way to test and benchmark these changes. Unfortunately

>>> I am having trouble doing this for AVX512 as I find that the old '--param

>>> vect_epilogues_nomask=1' already causes wrong codegen in SPEC2017 for the

>>> gcc and perlbench benchmarks.

> 

> There's a reason it is not enabled by default.  But I'd like to

> fix bugs it has so can you possibly reduce testcases and file

> bugs for it?

> 


Yeah that makes sense, also options that are off by default tend to 
bitrot fast! I did have an initial look at it, but the gcc benchmark 
isn't an easy one to dissect. I will give it another go sometime this 
week, see if I can figure out at least what loop is being miscompiled.

Do you see any use in running any set of tests in the testsuite with the 
parameter for smaller testcases?

> Thanks,

> Richard.

> 

>>>

>>>

>>> Cheers,

>>> Andre

>>>

>>> On 19/07/2019 13:38, Andre Vieira (lists) wrote:

>>>>

>>>>

>>>> On 19/07/2019 12:35, Richard Biener wrote:

>>>>> On Fri, 19 Jul 2019, Andre Vieira (lists) wrote:

>>>>>

>>>>>>

>>>>>>

>>>>>> On 15/07/2019 11:54, Richard Biener wrote:

>>>>>>> On Mon, 15 Jul 2019, Andre Vieira (lists) wrote:

>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> On 12/07/2019 11:19, Richard Biener wrote:

>>>>>>>>> On Thu, 11 Jul 2019, Andre Vieira (lists) wrote:

>>>>>>>>

>>>>>>>>

>>>>>>>> I have code that can split the condition and alias checks in

>>>>>>>> 'vect_loop_versioning'. For this approach I am considering keeping

>>>>>>>> that

>>>>>>>> bit of

>>>>>>>> code and seeing if I can patch up the checks after vectorizing the

>>>>>>>> epilogue

>>>>>>>> further. I think initially I will just go with a "hacked up" way

>>>>>>>> of

>>>>>>>> passing

>>>>>>>> down the bb with the iteration check and split the false edge

>>>>>>>> every time

>>>>>>>> we

>>>>>>>> vectorize it further. Will keep you posted on progress. If you

>>>>>>>> have any

>>>>>>>> pointers/tips they are most welc ome!

>>>>>>>

>>>>>>> I thought to somehow force the idea that we have a prologue loop

>>>>>>> to the vectorizer so it creates the number-of-vectorized iterations

>>>>>>> check and branch around the main (highest VF) vectorized loop.

>>>>>>>

>>>>>>

>>>>>> Hmm I think I may have skimmed over this earlier. I am reading it now

>>>>>> and am

>>>>>> not entirely sure what you mean by this. Specifically the "number of

>>>>>> vectorized iterations" or how a prologue loop plays a role in this.

>>>>>

>>>>> When there is no prologue then the versioning condition currently

>>>>> ensures we always enter the vector loop.  Only if there is a prologue

>>>>> is there a check and branch eventually skipping right to the epilogue.

>>>>> If the versioning condition (now using a lower VF) doesn't guarantee

>>>>> this we have to add this jump-around.

>>>>

>>>> Right, I haven't looked at the prologue path yet. I had a quick look and

>>>> can't find where this branch skipping to the epilogue is constructed.  I

>>>> will take a better look after I got my current example to work.

>>>>

>>>>>>

>>>>>> I guess this sheds some light on the comment above. And it definitely

>>>>>> implies

>>>>>> we would need to know the lowest VF when creating this condition.

>>>>>> Which is

>>>>>> tricky.

>>>>>

>>>>> We can simply use the smallest vector size supported by the target to

>>>>> derive it from the actual VF, no?

>>>>

>>>> So I could wait to introduce this check after all epilogue vectorization

>>>> is done, back track to the last niter check and duplicate that in the

>>>> outer loop.

>>>>

>>>> What I didn't want to do was use the smallest possible vector size for the

>>>> target because I was under the impression that does not necessarily

>>>> correspond to the smallest VF we may have for a loop, as the vectorizer

>>>> may have decided not to vectorize for that vector size because of costs?

>>>> If it I can assume this never happens, that once it starts to vectorize

>>>> epilogues that it will keep vectorizing them for any vector size it knows

>>>> off then yeah I can use that.

>>>>

>>>>

>>>>> I'm not sure I understand - why would you have any check not inside

>>>>> the outer loop?  Yes, we now eventually hoist versioning checks

>>>>> but the VF checks for the individual variants should be around

>>>>> the vectorized loop itself (so not really part of the versioning check).

>>>>

>>>> Yeah I agree. I was just explaining what I had done wrong now.

>>>>>

>>>>>> Cheers,

>>>>>> Andre

>>>>>>

>>>>>> PS: I often find myself having to patch the DOMINATOR information,

>>>>>> sometimes

>>>>>> its easy to, but sometimes it can get pretty complicated. I wonder

>>>>>> whether it

>>>>>> would make sense to write something that traverses a loop and corrects

>>>>>> this,

>>>>>> if it doesn't exist already.

>>>>>

>>>>> There's iterate-fix-dominators, but unless you create new edges/blocks

>>>>> manually rather than doing split-block/redirect-edge which should do

>>>>> dominator updating for you.

>>>>

>>>> Ah I was doing everything manually after having some bad experiences with

>>>> lv_add_condition_to_bb.  I will have a look at those thanks!

>>>>

>>>> Cheers,

>>>> Andre

>>>>

>>>>>

>>>>> Richard.

>>>>>

>>>>>>

>>>>>>>

>>>>>>> Richard.

>>>>>>>

>>>>>>

>>>>>

>>

>

Patch

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 2f8ab106d03a8927087ee8038e08a825f6e1e237..04d874b70ddfb8a3f5175dcddf00fef6d33f3219 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -266,6 +266,8 @@  struct GTY ((chain_next ("%h.next"))) loop {
      the basic-block from being collected but its index can still be
      reused.  */
   basic_block former_header;
+
+  unsigned long max_vf_limit;
 };
 
 /* Set if the loop is known to be infinite.  */
diff --git a/gcc/cfgloop.c b/gcc/cfgloop.c
index f64326b944e630075ced7035937f4601a1cb6c66..07d633b678b52943d3ab82e8d61b80cd712431ac 100644
--- a/gcc/cfgloop.c
+++ b/gcc/cfgloop.c
@@ -355,6 +355,7 @@  alloc_loop (void)
   loop->nb_iterations_upper_bound = 0;
   loop->nb_iterations_likely_upper_bound = 0;
   loop->nb_iterations_estimate = 0;
+  loop->max_vf_limit = MAX_VECTORIZATION_FACTOR;
   return loop;
 }
 
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index bd8fffb1704787d0a611fc02ee29054422596cbb..89529138b9cefb7f822bca72da06df519eff1a28 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2968,7 +2968,8 @@  vect_create_cond_for_alias_checks (loop_vec_info loop_vinfo, tree * cond_expr)
 struct loop *
 vect_loop_versioning (loop_vec_info loop_vinfo,
 		      unsigned int th, bool check_profitability,
-		      poly_uint64 versioning_threshold)
+		      poly_uint64 versioning_threshold,
+		      vec<loop_p> &more_loops)
 {
   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo), *nloop;
   struct loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
@@ -3143,6 +3144,19 @@  vect_loop_versioning (loop_vec_info loop_vinfo,
       nloop = loop_version (loop_to_version, cond_expr, &condition_bb,
 			    prob, prob.invert (), prob, prob.invert (), true);
       gcc_assert (nloop);
+
+      if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+	{
+
+	  /* Add the scalar fallback loop to the MORE_LOOPS vector to be looked
+	     at later.  Also make sure it is never vectorized for the original
+	     vf by setting the limit of the maximum vf to the original vf minus
+	     one.  */
+	  nloop->max_vf_limit
+	    = constant_lower_bound (LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 1;
+	  more_loops.safe_push (nloop);
+	}
+
       nloop = get_loop_copy (loop);
 
       /* Kill off IFN_LOOP_VECTORIZED_CALL in the copy, nobody will
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index b49ab152012a5c7fe9cc0564e58d296447f9ffb1..081885c378200661237ef22d2b011fc775e21218 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1862,7 +1862,7 @@  vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
 {
   opt_result ok = opt_result::success ();
   int res;
-  unsigned int max_vf = MAX_VECTORIZATION_FACTOR;
+  unsigned int max_vf = LOOP_VINFO_LOOP (loop_vinfo)->max_vf_limit;
   poly_uint64 min_vf = 2;
 
   /* The first group of checks is independent of the vector size.  */
@@ -8468,7 +8468,7 @@  vect_transform_loop_stmt (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
    Returns scalar epilogue loop if any.  */
 
 struct loop *
-vect_transform_loop (loop_vec_info loop_vinfo)
+vect_transform_loop (loop_vec_info loop_vinfo, vec<loop_p> &more_loops)
 {
   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   struct loop *epilogue = NULL;
@@ -8530,7 +8530,8 @@  vect_transform_loop (loop_vec_info loop_vinfo)
 	}
       struct loop *sloop
 	= vect_loop_versioning (loop_vinfo, th, check_profitability,
-				versioning_threshold);
+				versioning_threshold, more_loops);
+
       sloop->force_vectorize = false;
       check_profitability = false;
     }
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index f7432f0584762fd28d54f2978dc59f2df443e991..53d66b72d3ba6e15681209153b57736630e40e3b 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1089,8 +1089,6 @@  STMT_VINFO_BB_VINFO (stmt_vec_info stmt_vinfo)
    conversion.  */
 #define MAX_INTERM_CVT_STEPS         3
 
-#define MAX_VECTORIZATION_FACTOR INT_MAX
-
 /* Nonzero if TYPE represents a (scalar) boolean type or type
    in the middle-end compatible with it (unsigned precision 1 integral
    types).  Used to determine which types should be vectorized as
@@ -1473,7 +1471,7 @@  extern bool slpeel_can_duplicate_loop_p (const struct loop *, const_edge);
 struct loop *slpeel_tree_duplicate_loop_to_edge_cfg (struct loop *,
 						     struct loop *, edge);
 struct loop *vect_loop_versioning (loop_vec_info, unsigned int, bool,
-				   poly_uint64);
+				   poly_uint64, vec<loop_p> &);
 extern struct loop *vect_do_peeling (loop_vec_info, tree, tree,
 				     tree *, tree *, tree *, int, bool, bool);
 extern void vect_prepare_for_masked_peels (loop_vec_info);
@@ -1614,7 +1612,7 @@  extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
 				unsigned int, tree, unsigned int);
 
 /* Drive for loop transformation stage.  */
-extern struct loop *vect_transform_loop (loop_vec_info);
+extern struct loop *vect_transform_loop (loop_vec_info, vec<loop_p> &);
 extern opt_loop_vec_info vect_analyze_loop_form (struct loop *,
 						 vec_info_shared *);
 extern bool vectorizable_live_operation (stmt_vec_info, gimple_stmt_iterator *,
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 325ef58722d21a65ab896a9358677b07111b060b..d63d532d5fe474904ff84b23912a2ed9cfd6194a 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -868,7 +868,8 @@  try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 		      unsigned *num_vectorized_loops,
 		      loop_p loop, loop_vec_info orig_loop_vinfo,
 		      gimple *loop_vectorized_call,
-		      gimple *loop_dist_alias_call)
+		      gimple *loop_dist_alias_call,
+		      vec<loop_p> &more_loops)
 {
   unsigned ret = 0;
   vec_info_shared shared;
@@ -979,7 +980,7 @@  try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 			 "loop vectorized using variable length vectors\n");
     }
 
-  loop_p new_loop = vect_transform_loop (loop_vinfo);
+  loop_p new_loop = vect_transform_loop (loop_vinfo, more_loops);
   (*num_vectorized_loops)++;
   /* Now that the loop has been vectorized, allow it to be unrolled
      etc.  */
@@ -1013,7 +1014,7 @@  try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
   /* Epilogue of vectorized loop must be vectorized too.  */
   if (new_loop)
     ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops,
-				 new_loop, loop_vinfo, NULL, NULL);
+				 new_loop, loop_vinfo, NULL, NULL, more_loops);
 
   return ret;
 }
@@ -1022,7 +1023,8 @@  try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 
 static unsigned
 try_vectorize_loop (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
-		    unsigned *num_vectorized_loops, loop_p loop)
+		    unsigned *num_vectorized_loops, loop_p loop,
+		    vec<loop_p> &more_loops)
 {
   if (!((flag_tree_loop_vectorize
 	 && optimize_loop_nest_for_speed_p (loop))
@@ -1032,7 +1034,8 @@  try_vectorize_loop (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
   return try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops,
 			       loop, NULL,
 			       vect_loop_vectorized_call (loop),
-			       vect_loop_dist_alias_call (loop));
+			       vect_loop_dist_alias_call (loop),
+			       more_loops);
 }
 
 
@@ -1051,6 +1054,7 @@  vectorize_loops (void)
   hash_table<simd_array_to_simduid> *simd_array_to_simduid_htab = NULL;
   bool any_ifcvt_loops = false;
   unsigned ret = 0;
+  auto_vec<loop_p> more_loops;
 
   vect_loops_num = number_of_loops (cfun);
 
@@ -1105,14 +1109,19 @@  vectorize_loops (void)
 		    vector_loop->dont_vectorize = true;
 		    ret |= try_vectorize_loop (simduid_to_vf_htab,
 					       &num_vectorized_loops,
-					       vector_loop);
+					       vector_loop,
+					       more_loops);
 		  }
 	      }
 	  }
       }
     else
       ret |= try_vectorize_loop (simduid_to_vf_htab, &num_vectorized_loops,
-				 loop);
+				 loop, more_loops);
+
+  while (!more_loops.is_empty ())
+    try_vectorize_loop (simduid_to_vf_htab, &num_vectorized_loops,
+			more_loops.pop (), more_loops);
 
   vect_location = dump_user_location_t ();
 
diff --git a/gcc/tree.h b/gcc/tree.h
index 3dce602dfbaca03f568e1c3638d56dfe3a3fd01c..b1c41131e9d1637784a1024d5c301252a06f89e1 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -22,6 +22,8 @@  along with GCC; see the file COPYING3.  If not see
 
 #include "tree-core.h"
 
+#define MAX_VECTORIZATION_FACTOR INT_MAX
+
 /* Convert a target-independent built-in function code to a combined_fn.  */
 
 inline combined_fn