[RFC,tree-vect] PR 88915: Further vectorize second loop when versioning

Message ID 28b81a90-250e-feb0-8c39-c3c1b62449e7@arm.com
State New
Headers show
Series
  • [RFC,tree-vect] PR 88915: Further vectorize second loop when versioning
Related show

Commit Message

Andre Vieira (lists) July 11, 2019, 4:13 p.m.
Hi Richard(s),

I am trying to tackle PR88915 and get GCC to further vectorize the 
"fallback" loop when doing loop versioning as it does when loop 
versioning is not required.

I have a prototype patch that needs further testing, but at first glance 
it seems to be achieving the desired outcome.
I was wondering whether you had any specific concerns with my current 
approach.

On top of this change I am looking at the iterations and alias checks 
generated for every "vectorized-version". I.e. with the above patch I see:
if (iterations_check_VF_0 () && alias_check_VF_0 ())
   vectorized_for_VF_0 ();
else if (iterations_check_VF_1 () && alias_check_VF_1 ())
   vectorized_for_VF_1 ();
...
else
   scalar_loop ();

The alias checks are not always short and may cause unnecessary 
performance hits. Instead I am now trying to change the checks to 
produce the following form:

if (iterations_check_VF_0 ())
{
   if (alias_check_VF_0 ())
    {
      vectorized_for_VF_0 ();
    }
   else
     goto VF_1_check;  // or scalar_loop
}
else if (iterations_check_VF_1 ())
   {
VF_1_check:

     if (alias_check_VF_1 ())
       vectorized_for_VF_1 ();
     else
       goto goto_VF_2_check; // or scalar_loop
   }
...
else
   scalar_loop ();


I am not yet sure whether to try the next VF after an alias check fail 
or to just fall back to scalar instead.

Cheers,
Andre

Comments

Richard Biener July 12, 2019, 10:19 a.m. | #1
On Thu, 11 Jul 2019, Andre Vieira (lists) wrote:

> Hi Richard(s),

> 

> I am trying to tackle PR88915 and get GCC to further vectorize the "fallback"

> loop when doing loop versioning as it does when loop versioning is not

> required.

> 

> I have a prototype patch that needs further testing, but at first glance it

> seems to be achieving the desired outcome.

> I was wondering whether you had any specific concerns with my current

> approach.

> 

> On top of this change I am looking at the iterations and alias checks

> generated for every "vectorized-version". I.e. with the above patch I see:

> if (iterations_check_VF_0 () && alias_check_VF_0 ())

>   vectorized_for_VF_0 ();

> else if (iterations_check_VF_1 () && alias_check_VF_1 ())

>   vectorized_for_VF_1 ();

> ...

> else

>   scalar_loop ();

> 

> The alias checks are not always short and may cause unnecessary performance

> hits. Instead I am now trying to change the checks to produce the following

> form:

> 

> if (iterations_check_VF_0 ())

> {

>   if (alias_check_VF_0 ())

>    {

>      vectorized_for_VF_0 ();

>    }

>   else

>     goto VF_1_check;  // or scalar_loop

> }

> else if (iterations_check_VF_1 ())

>   {

> VF_1_check:

> 

>     if (alias_check_VF_1 ())

>       vectorized_for_VF_1 ();

>     else

>       goto goto_VF_2_check; // or scalar_loop

>   }

> ...

> else

>   scalar_loop ();


I think for code-size reason it would make sense to do it like

  if (iterations_check_for_lowest_VF ())
    {
      if (alias_check_for_highest_VF ())
        {
          vectorized_for_highest_VF ();
          vectorized epilogues;
        }
    }

and make the vectorized_for_highest_VF loop skipped, falling through
to the vectorized epilogues, when the number of iterations isn't
enough to hit it.

The advantage is that this would just use the epilogue vectorization
code and it would avoid excessive code growth if you have many
VFs to consider (on x86 we now have 8 byte, 16 byte, 32 byte and
64 byte vectors...).  The disadvantage is of course that a small
number of loops will not enter the vector code at all - namely
those that would pass the alias check for lowest_VF but not the
one for highest_VF.  I'm sure this isn't a common situation and
in quite a number of cases we formulate the alias check in a way
that it isn't dependent on the VF anyways.  There's also possibly
an extra branch for the case the highest_VF loop isn't entered
(unless there already was a prologue loop).

> I am not yet sure whether to try the next VF after an alias check fail or to

> just fall back to scalar instead.


If you don't then there's no advantage to doing what I suggested?

Richard.

> 

> Cheers,

> Andre

> 


-- 
Richard Biener <rguenther@suse.de>
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)
Andre Vieira (lists) July 15, 2019, 10:36 a.m. | #2
On 12/07/2019 11:19, Richard Biener wrote:
> On Thu, 11 Jul 2019, Andre Vieira (lists) wrote:

> 

> 

> I think for code-size reason it would make sense to do it like

> 

>    if (iterations_check_for_lowest_VF ())

>      {

>        if (alias_check_for_highest_VF ())

>          {

>            vectorized_for_highest_VF ();

>            vectorized epilogues;

>          }

>      }

> 

> and make the vectorized_for_highest_VF loop skipped, falling through

> to the vectorized epilogues, when the number of iterations isn't

> enough to hit it.


Are you suggesting we only make the distinction between highest and 
lowest VF? Why not something like:

if (alias_check_for_highest_VF ())
{
   if (iterations_check_VF_0 ())
     goto VF_0;
   else if (iterations_check_VF_1 ())
     goto VF_1;
   else if (iterations_check_VF_2 ())
     goto VF_2;
   ...
VF_0:
  vectorized_for_vf_0();
VF_1:
  vectorized_for_vf_1();
VF_2:
  vectorized_for_vf_2();
...
}
else
{
   goto scalar_loop;
}

I'll go have a look at how to best do this. The benefit of the earlier 
approach is it was able to use a lot of the existing vectorizer code to 
get it done.

I have code that can split the condition and alias checks in 
'vect_loop_versioning'. For this approach I am considering keeping that 
bit of code and seeing if I can patch up the checks after vectorizing 
the epilogue further. I think initially I will just go with a "hacked 
up" way of passing down the bb with the iteration check and split the 
false edge every time we vectorize it further. Will keep you posted on 
progress. If you have any pointers/tips they are most welcome!

> 

> The advantage is that this would just use the epilogue vectorization

> code and it would avoid excessive code growth if you have many

> VFs to consider (on x86 we now have 8 byte, 16 byte, 32 byte and

> 64 byte vectors...).  The disadvantage is of course that a small

> number of loops will not enter the vector code at all - namely

> those that would pass the alias check for lowest_VF but not the

> one for highest_VF.  I'm sure this isn't a common situation and

> in quite a number of cases we formulate the alias check in a way

> that it isn't dependent on the VF anyways.


The code growth is indeed a factor and I can see the argument for 
choosing this approach over the other. Cases of such specific overlaps 
are most likely oddities rather than the common situation.



> There's also possibly

> an extra branch for the case the highest_VF loop isn't entered

> (unless there already was a prologue loop).

I don't understand this one, can you elaborate?

Cheers,
Andre
Richard Biener July 15, 2019, 10:54 a.m. | #3
On Mon, 15 Jul 2019, Andre Vieira (lists) wrote:

> 

> 

> On 12/07/2019 11:19, Richard Biener wrote:

> > On Thu, 11 Jul 2019, Andre Vieira (lists) wrote:

> > 

> > 

> > I think for code-size reason it would make sense to do it like

> > 

> >    if (iterations_check_for_lowest_VF ())

> >      {

> >        if (alias_check_for_highest_VF ())

> >          {

> >            vectorized_for_highest_VF ();

> >            vectorized epilogues;

> >          }

> >      }

> > 

> > and make the vectorized_for_highest_VF loop skipped, falling through

> > to the vectorized epilogues, when the number of iterations isn't

> > enough to hit it.

> 

> Are you suggesting we only make the distinction between highest and lowest VF?

> Why not something like:

> 

> if (alias_check_for_highest_VF ())

> {

>   if (iterations_check_VF_0 ())

>     goto VF_0;

>   else if (iterations_check_VF_1 ())

>     goto VF_1;

>   else if (iterations_check_VF_2 ())

>     goto VF_2;

>   ...

> VF_0:

>  vectorized_for_vf_0();

> VF_1:

>  vectorized_for_vf_1();

> VF_2:

>  vectorized_for_vf_2();

> ...

> }

> else

> {

>   goto scalar_loop;

> }


I think it will actually do it this way via the epilogue vectorization
path.
 
> I'll go have a look at how to best do this. The benefit of the earlier

> approach is it was able to use a lot of the existing vectorizer code to get it

> done.

>

> I have code that can split the condition and alias checks in

> 'vect_loop_versioning'. For this approach I am considering keeping that bit of

> code and seeing if I can patch up the checks after vectorizing the epilogue

> further. I think initially I will just go with a "hacked up" way of passing

> down the bb with the iteration check and split the false edge every time we

> vectorize it further. Will keep you posted on progress. If you have any

> pointers/tips they are most welcome!


I thought to somehow force the idea that we have a prologue loop
to the vectorizer so it creates the number-of-vectorized iterations
check and branch around the main (highest VF) vectorized loop.

> > 

> > The advantage is that this would just use the epilogue vectorization

> > code and it would avoid excessive code growth if you have many

> > VFs to consider (on x86 we now have 8 byte, 16 byte, 32 byte and

> > 64 byte vectors...).  The disadvantage is of course that a small

> > number of loops will not enter the vector code at all - namely

> > those that would pass the alias check for lowest_VF but not the

> > one for highest_VF.  I'm sure this isn't a common situation and

> > in quite a number of cases we formulate the alias check in a way

> > that it isn't dependent on the VF anyways.

> 

> The code growth is indeed a factor and I can see the argument for choosing

> this approach over the other. Cases of such specific overlaps are most likely

> oddities rather than the common situation.


Yeah, it also looks simplest to me (and a motivation to enable
epilogue vectorization by default).

> > There's also possibly

> > an extra branch for the case the highest_VF loop isn't entered

> > (unless there already was a prologue loop).

> I don't understand this one, can you elaborate?


The branch around the main vectorized loop I talked about above.
So I'd fool the versioning condition to use the lowest VF for
the iteration count checking and use the code that handles
zero-trip iteration count for the vector loop unconditionally.

In some way this makes checking the niter condition on the version
check pointless (at least if we have a really low lowest VF like
on x64 where it will likely be 2), so we may want to elide that
completely?  For the check to be "correct" we'd also need to
compute the lowest VF a vectorized epilogue is still profitable
(on x86 those will run once or never, but we can also end up
with say main AVX512 vectorization, and a single vectorized
epilogue with SSE2 if we somehow figure AVX256 vectorization
isn't profitable for it - we can also end up with non-vectorizable
epilogue).  So with the current setup how we vectorize epilogues
we maybe want to have a location of the version niter check we
can "patch up" later after (not) vectorizing the epilogue(s).

Richard.
Andre Vieira (lists) July 19, 2019, 11:13 a.m. | #4
On 15/07/2019 11:54, Richard Biener wrote:
> On Mon, 15 Jul 2019, Andre Vieira (lists) wrote:

> 

>>

>>

>> On 12/07/2019 11:19, Richard Biener wrote:

>>> On Thu, 11 Jul 2019, Andre Vieira (lists) wrote:

>>

>>

>> I have code that can split the condition and alias checks in

>> 'vect_loop_versioning'. For this approach I am considering keeping that bit of

>> code and seeing if I can patch up the checks after vectorizing the epilogue

>> further. I think initially I will just go with a "hacked up" way of passing

>> down the bb with the iteration check and split the false edge every time we

>> vectorize it further. Will keep you posted on progress. If you have any

>> pointers/tips they are most welc ome!

> 

> I thought to somehow force the idea that we have a prologue loop

> to the vectorizer so it creates the number-of-vectorized iterations

> check and branch around the main (highest VF) vectorized loop.

> 


Hmm I think I may have skimmed over this earlier. I am reading it now 
and am not entirely sure what you mean by this. Specifically the "number 
of vectorized iterations" or how a prologue loop plays a role in this.


>>>

>>> The advantage is that this would just use the epilogue vectorization

>>> code and it would avoid excessive code growth if you have many

>>> VFs to consider (on x86 we now have 8 byte, 16 byte, 32 byte and

>>> 64 byte vectors...).  The disadvantage is of course that a small

>>> number of loops will not enter the vector code at all - namely

>>> those that would pass the alias check for lowest_VF but not the

>>> one for highest_VF.  I'm sure this isn't a common situation and

>>> in quite a number of cases we formulate the alias check in a way

>>> that it isn't dependent on the VF anyways.

>>

>> The code growth is indeed a factor and I can see the argument for choosing

>> this approach over the other. Cases of such specific overlaps are most likely

>> oddities rather than the common situation.

> 

> Yeah, it also looks simplest to me (and a motivation to enable

> epilogue vectorization by default).

> 

>>> There's also possibly

>>> an extra branch for the case the highest_VF loop isn't entered

>>> (unless there already was a prologue loop).

>> I don't understand this one, can you elaborate?

> 

> The branch around the main vectorized loop I talked about above.

> So I'd fool the versioning condition to use the lowest VF for

> the iteration count checking and use the code that handles

> zero-trip iteration count for the vector loop unconditionally.


I guess this sheds some light on the comment above. And it definitely 
implies we would need to know the lowest VF when creating this 
condition. Which is tricky.
> 

> In some way this makes checking the niter condition on the version

> check pointless (at least if we have a really low lowest VF like

> on x64 where it will likely be 2), so we may want to elide that

> completely?  For the check to be "correct" we'd also need to

> compute the lowest VF a vectorized epilogue is still profitable

> (on x86 those will run once or never, but we can also end up

> with say main AVX512 vectorization, and a single vectorized

> epilogue with SSE2 if we somehow figure AVX256 vectorization

> isn't profitable for it - we can also end up with non-vectorizable

> epilogue).  So with the current setup how we vectorize epilogues

> we maybe want to have a location of the version niter check we

> can "patch up" later after (not) vectorizing the epilogue(s).


I think you come to the same conclusion here as I mentioned above. 
Somehow I wish I had understood this better when I first read it ... but 
eh such is life :)

I went on and continued hacking around the approach of splitting the 
niter and alias check I had earlier. I got it to work with a single 
loop. However, when dealing with nested loops I run into the problem 
that I'd need to sink the niter checks. Otherwise you could end up with 
an alias check and niter checks outside the outer loop. Where the 2nd 
and consequent VF niter checks point to the corresponding epilogues in 
the inner loop.  However, once those are done and it iterates over the 
outer-loop, it will go through the higher VF's first, leading to wrong 
behavior.

To illustrate what I mean, here is a very simplistic illustration of 
what is happening:

BB1: Alias check
BB2: niter check VF 32
BB3: niter check VF 16
BB4: Vectorized loop VF32
BB5: Vectorized loop VF16
BB6: Remaining epilogue scalar loop
BB7: Outer loop iteration (updates IV's and DRs of inner loop)
BB8: Scalar inner&outer loop

With edges:
BB1 -T-> BB2
BB1 -F-> BB8
BB2 -T-> BB4
BB2 -F-> BB3
BB3 -T-> BB5
BB3 -F-> BB8
BB4 -> BB5
BB5 -> BB6
BB6 -> BB7
BB7 -> BB4

Where -T-> is a True edge and -F-> is a False edge

So my first thought to solve this is to sink BB2 and BB3 into the loop 
for which BB7 is the latch.

I.e. make BB7 -> BB2

But then I would argue, it would be good to introduce a BB9:
BB1 -T-> BB9
BB9 -T-> BB2
BB9 -F-> BB8

Where BB9 checks that niter is at least the lowest VF.

Sorry if I am repeating what you were telling me to do all along :')

Cheers,
Andre

PS: I often find myself having to patch the DOMINATOR information, 
sometimes its easy to, but sometimes it can get pretty complicated. I 
wonder whether it would make sense to write something that traverses a 
loop and corrects this, if it doesn't exist already.


> 

> Richard.

>
Richard Biener July 19, 2019, 11:35 a.m. | #5
On Fri, 19 Jul 2019, Andre Vieira (lists) wrote:

> 

> 

> On 15/07/2019 11:54, Richard Biener wrote:

> > On Mon, 15 Jul 2019, Andre Vieira (lists) wrote:

> > 

> > > 

> > > 

> > > On 12/07/2019 11:19, Richard Biener wrote:

> > > > On Thu, 11 Jul 2019, Andre Vieira (lists) wrote:

> > > 

> > > 

> > > I have code that can split the condition and alias checks in

> > > 'vect_loop_versioning'. For this approach I am considering keeping that

> > > bit of

> > > code and seeing if I can patch up the checks after vectorizing the

> > > epilogue

> > > further. I think initially I will just go with a "hacked up" way of

> > > passing

> > > down the bb with the iteration check and split the false edge every time

> > > we

> > > vectorize it further. Will keep you posted on progress. If you have any

> > > pointers/tips they are most welc ome!

> > 

> > I thought to somehow force the idea that we have a prologue loop

> > to the vectorizer so it creates the number-of-vectorized iterations

> > check and branch around the main (highest VF) vectorized loop.

> > 

> 

> Hmm I think I may have skimmed over this earlier. I am reading it now and am

> not entirely sure what you mean by this. Specifically the "number of

> vectorized iterations" or how a prologue loop plays a role in this.


When there is no prologue then the versioning condition currently
ensures we always enter the vector loop.  Only if there is a prologue
is there a check and branch eventually skipping right to the epilogue.
If the versioning condition (now using a lower VF) doesn't guarantee
this we have to add this jump-around.

> 

> > > > 

> > > > The advantage is that this would just use the epilogue vectorization

> > > > code and it would avoid excessive code growth if you have many

> > > > VFs to consider (on x86 we now have 8 byte, 16 byte, 32 byte and

> > > > 64 byte vectors...).  The disadvantage is of course that a small

> > > > number of loops will not enter the vector code at all - namely

> > > > those that would pass the alias check for lowest_VF but not the

> > > > one for highest_VF.  I'm sure this isn't a common situation and

> > > > in quite a number of cases we formulate the alias check in a way

> > > > that it isn't dependent on the VF anyways.

> > > 

> > > The code growth is indeed a factor and I can see the argument for choosing

> > > this approach over the other. Cases of such specific overlaps are most

> > > likely

> > > oddities rather than the common situation.

> > 

> > Yeah, it also looks simplest to me (and a motivation to enable

> > epilogue vectorization by default).

> > 

> > > > There's also possibly

> > > > an extra branch for the case the highest_VF loop isn't entered

> > > > (unless there already was a prologue loop).

> > > I don't understand this one, can you elaborate?

> > 

> > The branch around the main vectorized loop I talked about above.

> > So I'd fool the versioning condition to use the lowest VF for

> > the iteration count checking and use the code that handles

> > zero-trip iteration count for the vector loop unconditionally.

> 

> I guess this sheds some light on the comment above. And it definitely implies

> we would need to know the lowest VF when creating this condition. Which is

> tricky.


We can simply use the smallest vector size supported by the target to
derive it from the actual VF, no?

> > 

> > In some way this makes checking the niter condition on the version

> > check pointless (at least if we have a really low lowest VF like

> > on x64 where it will likely be 2), so we may want to elide that

> > completely?  For the check to be "correct" we'd also need to

> > compute the lowest VF a vectorized epilogue is still profitable

> > (on x86 those will run once or never, but we can also end up

> > with say main AVX512 vectorization, and a single vectorized

> > epilogue with SSE2 if we somehow figure AVX256 vectorization

> > isn't profitable for it - we can also end up with non-vectorizable

> > epilogue).  So with the current setup how we vectorize epilogues

> > we maybe want to have a location of the version niter check we

> > can "patch up" later after (not) vectorizing the epilogue(s).

> 

> I think you come to the same conclusion here as I mentioned above. Somehow I

> wish I had understood this better when I first read it ... but eh such is life

> :)

> 

> I went on and continued hacking around the approach of splitting the niter and

> alias check I had earlier. I got it to work with a single loop. However, when

> dealing with nested loops I run into the problem that I'd need to sink the

> niter checks. Otherwise you could end up with an alias check and niter checks

> outside the outer loop. Where the 2nd and consequent VF niter checks point to

> the corresponding epilogues in the inner loop.  However, once those are done

> and it iterates over the outer-loop, it will go through the higher VF's first,

> leading to wrong behavior.

> 

> To illustrate what I mean, here is a very simplistic illustration of what is

> happening:

> 

> BB1: Alias check

> BB2: niter check VF 32

> BB3: niter check VF 16

> BB4: Vectorized loop VF32

> BB5: Vectorized loop VF16

> BB6: Remaining epilogue scalar loop

> BB7: Outer loop iteration (updates IV's and DRs of inner loop)

> BB8: Scalar inner&outer loop

> 

> With edges:

> BB1 -T-> BB2

> BB1 -F-> BB8

> BB2 -T-> BB4

> BB2 -F-> BB3

> BB3 -T-> BB5

> BB3 -F-> BB8

> BB4 -> BB5

> BB5 -> BB6

> BB6 -> BB7

> BB7 -> BB4

> 

> Where -T-> is a True edge and -F-> is a False edge

> 

> So my first thought to solve this is to sink BB2 and BB3 into the loop for

> which BB7 is the latch.

> 

> I.e. make BB7 -> BB2

> 

> But then I would argue, it would be good to introduce a BB9:

> BB1 -T-> BB9

> BB9 -T-> BB2

> BB9 -F-> BB8

> 

> Where BB9 checks that niter is at least the lowest VF.

> 

> Sorry if I am repeating what you were telling me to do all along :')


I'm not sure I understand - why would you have any check not inside
the outer loop?  Yes, we now eventually hoist versioning checks
but the VF checks for the individual variants should be around
the vectorized loop itself (so not really part of the versioning check).

> Cheers,

> Andre

> 

> PS: I often find myself having to patch the DOMINATOR information, sometimes

> its easy to, but sometimes it can get pretty complicated. I wonder whether it

> would make sense to write something that traverses a loop and corrects this,

> if it doesn't exist already.


There's iterate-fix-dominators, but unless you create new edges/blocks
manually rather than doing split-block/redirect-edge which should do
dominator updating for you.

Richard.

> 

> > 

> > Richard.

> > 

> 


-- 
Richard Biener <rguenther@suse.de>
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)
Andre Vieira (lists) July 19, 2019, 12:38 p.m. | #6
On 19/07/2019 12:35, Richard Biener wrote:
> On Fri, 19 Jul 2019, Andre Vieira (lists) wrote:

> 

>>

>>

>> On 15/07/2019 11:54, Richard Biener wrote:

>>> On Mon, 15 Jul 2019, Andre Vieira (lists) wrote:

>>>

>>>>

>>>>

>>>> On 12/07/2019 11:19, Richard Biener wrote:

>>>>> On Thu, 11 Jul 2019, Andre Vieira (lists) wrote:

>>>>

>>>>

>>>> I have code that can split the condition and alias checks in

>>>> 'vect_loop_versioning'. For this approach I am considering keeping that

>>>> bit of

>>>> code and seeing if I can patch up the checks after vectorizing the

>>>> epilogue

>>>> further. I think initially I will just go with a "hacked up" way of

>>>> passing

>>>> down the bb with the iteration check and split the false edge every time

>>>> we

>>>> vectorize it further. Will keep you posted on progress. If you have any

>>>> pointers/tips they are most welc ome!

>>>

>>> I thought to somehow force the idea that we have a prologue loop

>>> to the vectorizer so it creates the number-of-vectorized iterations

>>> check and branch around the main (highest VF) vectorized loop.

>>>

>>

>> Hmm I think I may have skimmed over this earlier. I am reading it now and am

>> not entirely sure what you mean by this. Specifically the "number of

>> vectorized iterations" or how a prologue loop plays a role in this.

> 

> When there is no prologue then the versioning condition currently

> ensures we always enter the vector loop.  Only if there is a prologue

> is there a check and branch eventually skipping right to the epilogue.

> If the versioning condition (now using a lower VF) doesn't guarantee

> this we have to add this jump-around.


Right, I haven't looked at the prologue path yet. I had a quick look and 
can't find where this branch skipping to the epilogue is constructed.  I 
will take a better look after I got my current example to work.

>>

>> I guess this sheds some light on the comment above. And it definitely implies

>> we would need to know the lowest VF when creating this condition. Which is

>> tricky.

> 

> We can simply use the smallest vector size supported by the target to

> derive it from the actual VF, no?


So I could wait to introduce this check after all epilogue vectorization 
is done, back track to the last niter check and duplicate that in the 
outer loop.

What I didn't want to do was use the smallest possible vector size for 
the target because I was under the impression that does not necessarily 
correspond to the smallest VF we may have for a loop, as the vectorizer 
may have decided not to vectorize for that vector size because of costs? 
If it I can assume this never happens, that once it starts to vectorize 
epilogues that it will keep vectorizing them for any vector size it 
knows off then yeah I can use that.


> I'm not sure I understand - why would you have any check not inside

> the outer loop?  Yes, we now eventually hoist versioning checks

> but the VF checks for the individual variants should be around

> the vectorized loop itself (so not really part of the versioning check).


Yeah I agree. I was just explaining what I had done wrong now.
> 

>> Cheers,

>> Andre

>>

>> PS: I often find myself having to patch the DOMINATOR information, sometimes

>> its easy to, but sometimes it can get pretty complicated. I wonder whether it

>> would make sense to write something that traverses a loop and corrects this,

>> if it doesn't exist already.

> 

> There's iterate-fix-dominators, but unless you create new edges/blocks

> manually rather than doing split-block/redirect-edge which should do

> dominator updating for you.


Ah I was doing everything manually after having some bad experiences 
with lv_add_condition_to_bb.  I will have a look at those thanks!

Cheers,
Andre

> 

> Richard.

> 

>>

>>>

>>> Richard.

>>>

>>

>

Patch

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 2f8ab106d03a8927087ee8038e08a825f6e1e237..04d874b70ddfb8a3f5175dcddf00fef6d33f3219 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -266,6 +266,8 @@  struct GTY ((chain_next ("%h.next"))) loop {
      the basic-block from being collected but its index can still be
      reused.  */
   basic_block former_header;
+
+  unsigned long max_vf_limit;
 };
 
 /* Set if the loop is known to be infinite.  */
diff --git a/gcc/cfgloop.c b/gcc/cfgloop.c
index f64326b944e630075ced7035937f4601a1cb6c66..07d633b678b52943d3ab82e8d61b80cd712431ac 100644
--- a/gcc/cfgloop.c
+++ b/gcc/cfgloop.c
@@ -355,6 +355,7 @@  alloc_loop (void)
   loop->nb_iterations_upper_bound = 0;
   loop->nb_iterations_likely_upper_bound = 0;
   loop->nb_iterations_estimate = 0;
+  loop->max_vf_limit = MAX_VECTORIZATION_FACTOR;
   return loop;
 }
 
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index bd8fffb1704787d0a611fc02ee29054422596cbb..89529138b9cefb7f822bca72da06df519eff1a28 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2968,7 +2968,8 @@  vect_create_cond_for_alias_checks (loop_vec_info loop_vinfo, tree * cond_expr)
 struct loop *
 vect_loop_versioning (loop_vec_info loop_vinfo,
 		      unsigned int th, bool check_profitability,
-		      poly_uint64 versioning_threshold)
+		      poly_uint64 versioning_threshold,
+		      vec<loop_p> &more_loops)
 {
   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo), *nloop;
   struct loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
@@ -3143,6 +3144,19 @@  vect_loop_versioning (loop_vec_info loop_vinfo,
       nloop = loop_version (loop_to_version, cond_expr, &condition_bb,
 			    prob, prob.invert (), prob, prob.invert (), true);
       gcc_assert (nloop);
+
+      if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+	{
+
+	  /* Add the scalar fallback loop to the MORE_LOOPS vector to be looked
+	     at later.  Also make sure it is never vectorized for the original
+	     vf by setting the limit of the maximum vf to the original vf minus
+	     one.  */
+	  nloop->max_vf_limit
+	    = constant_lower_bound (LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 1;
+	  more_loops.safe_push (nloop);
+	}
+
       nloop = get_loop_copy (loop);
 
       /* Kill off IFN_LOOP_VECTORIZED_CALL in the copy, nobody will
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index b49ab152012a5c7fe9cc0564e58d296447f9ffb1..081885c378200661237ef22d2b011fc775e21218 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1862,7 +1862,7 @@  vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
 {
   opt_result ok = opt_result::success ();
   int res;
-  unsigned int max_vf = MAX_VECTORIZATION_FACTOR;
+  unsigned int max_vf = LOOP_VINFO_LOOP (loop_vinfo)->max_vf_limit;
   poly_uint64 min_vf = 2;
 
   /* The first group of checks is independent of the vector size.  */
@@ -8468,7 +8468,7 @@  vect_transform_loop_stmt (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
    Returns scalar epilogue loop if any.  */
 
 struct loop *
-vect_transform_loop (loop_vec_info loop_vinfo)
+vect_transform_loop (loop_vec_info loop_vinfo, vec<loop_p> &more_loops)
 {
   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   struct loop *epilogue = NULL;
@@ -8530,7 +8530,8 @@  vect_transform_loop (loop_vec_info loop_vinfo)
 	}
       struct loop *sloop
 	= vect_loop_versioning (loop_vinfo, th, check_profitability,
-				versioning_threshold);
+				versioning_threshold, more_loops);
+
       sloop->force_vectorize = false;
       check_profitability = false;
     }
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index f7432f0584762fd28d54f2978dc59f2df443e991..53d66b72d3ba6e15681209153b57736630e40e3b 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1089,8 +1089,6 @@  STMT_VINFO_BB_VINFO (stmt_vec_info stmt_vinfo)
    conversion.  */
 #define MAX_INTERM_CVT_STEPS         3
 
-#define MAX_VECTORIZATION_FACTOR INT_MAX
-
 /* Nonzero if TYPE represents a (scalar) boolean type or type
    in the middle-end compatible with it (unsigned precision 1 integral
    types).  Used to determine which types should be vectorized as
@@ -1473,7 +1471,7 @@  extern bool slpeel_can_duplicate_loop_p (const struct loop *, const_edge);
 struct loop *slpeel_tree_duplicate_loop_to_edge_cfg (struct loop *,
 						     struct loop *, edge);
 struct loop *vect_loop_versioning (loop_vec_info, unsigned int, bool,
-				   poly_uint64);
+				   poly_uint64, vec<loop_p> &);
 extern struct loop *vect_do_peeling (loop_vec_info, tree, tree,
 				     tree *, tree *, tree *, int, bool, bool);
 extern void vect_prepare_for_masked_peels (loop_vec_info);
@@ -1614,7 +1612,7 @@  extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
 				unsigned int, tree, unsigned int);
 
 /* Drive for loop transformation stage.  */
-extern struct loop *vect_transform_loop (loop_vec_info);
+extern struct loop *vect_transform_loop (loop_vec_info, vec<loop_p> &);
 extern opt_loop_vec_info vect_analyze_loop_form (struct loop *,
 						 vec_info_shared *);
 extern bool vectorizable_live_operation (stmt_vec_info, gimple_stmt_iterator *,
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 325ef58722d21a65ab896a9358677b07111b060b..d63d532d5fe474904ff84b23912a2ed9cfd6194a 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -868,7 +868,8 @@  try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 		      unsigned *num_vectorized_loops,
 		      loop_p loop, loop_vec_info orig_loop_vinfo,
 		      gimple *loop_vectorized_call,
-		      gimple *loop_dist_alias_call)
+		      gimple *loop_dist_alias_call,
+		      vec<loop_p> &more_loops)
 {
   unsigned ret = 0;
   vec_info_shared shared;
@@ -979,7 +980,7 @@  try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 			 "loop vectorized using variable length vectors\n");
     }
 
-  loop_p new_loop = vect_transform_loop (loop_vinfo);
+  loop_p new_loop = vect_transform_loop (loop_vinfo, more_loops);
   (*num_vectorized_loops)++;
   /* Now that the loop has been vectorized, allow it to be unrolled
      etc.  */
@@ -1013,7 +1014,7 @@  try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
   /* Epilogue of vectorized loop must be vectorized too.  */
   if (new_loop)
     ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops,
-				 new_loop, loop_vinfo, NULL, NULL);
+				 new_loop, loop_vinfo, NULL, NULL, more_loops);
 
   return ret;
 }
@@ -1022,7 +1023,8 @@  try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 
 static unsigned
 try_vectorize_loop (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
-		    unsigned *num_vectorized_loops, loop_p loop)
+		    unsigned *num_vectorized_loops, loop_p loop,
+		    vec<loop_p> &more_loops)
 {
   if (!((flag_tree_loop_vectorize
 	 && optimize_loop_nest_for_speed_p (loop))
@@ -1032,7 +1034,8 @@  try_vectorize_loop (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
   return try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops,
 			       loop, NULL,
 			       vect_loop_vectorized_call (loop),
-			       vect_loop_dist_alias_call (loop));
+			       vect_loop_dist_alias_call (loop),
+			       more_loops);
 }
 
 
@@ -1051,6 +1054,7 @@  vectorize_loops (void)
   hash_table<simd_array_to_simduid> *simd_array_to_simduid_htab = NULL;
   bool any_ifcvt_loops = false;
   unsigned ret = 0;
+  auto_vec<loop_p> more_loops;
 
   vect_loops_num = number_of_loops (cfun);
 
@@ -1105,14 +1109,19 @@  vectorize_loops (void)
 		    vector_loop->dont_vectorize = true;
 		    ret |= try_vectorize_loop (simduid_to_vf_htab,
 					       &num_vectorized_loops,
-					       vector_loop);
+					       vector_loop,
+					       more_loops);
 		  }
 	      }
 	  }
       }
     else
       ret |= try_vectorize_loop (simduid_to_vf_htab, &num_vectorized_loops,
-				 loop);
+				 loop, more_loops);
+
+  while (!more_loops.is_empty ())
+    try_vectorize_loop (simduid_to_vf_htab, &num_vectorized_loops,
+			more_loops.pop (), more_loops);
 
   vect_location = dump_user_location_t ();
 
diff --git a/gcc/tree.h b/gcc/tree.h
index 3dce602dfbaca03f568e1c3638d56dfe3a3fd01c..b1c41131e9d1637784a1024d5c301252a06f89e1 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -22,6 +22,8 @@  along with GCC; see the file COPYING3.  If not see
 
 #include "tree-core.h"
 
+#define MAX_VECTORIZATION_FACTOR INT_MAX
+
 /* Convert a target-independent built-in function code to a combined_fn.  */
 
 inline combined_fn