Split load permutation

Message ID nycvar.YFH.7.76.2006191432570.4397@zhemvz.fhfr.qr
State New
Series
  • Split load permutation

Commit Message

Richard Biener June 19, 2020, 12:37 p.m.
This splits SLP load nodes with load permutation into a SLP
load node (with load "permutation" removing gaps) and a
lane permutation node.  The first and foremost goal of this
is to be able to have a single SLP node for each load group
so we can start making decisions about how to vectorize
them factoring in all loads of that group.  The second
goal is to eventually be able to optimize permutations by
pushing them down where they can be combined from multiple
children to the output.  We do have one very special case
handled by vect_attempt_slp_rearrange_stmts doing it all
the way down for reductions that are associative.

For example for

  l1 = a[0]; l2 = a[1];
  b[0] = l1; b[1] = l2;
  c[0] = l2; c[1] = l1;

we can avoid generating loads twice.  For

  l1 = a[0]; l2 = a[1]; l3 = a[2];
  b[0] = l1; b[1] = l2;
             c[0] = l2; c[1] = l3;

we will have a SLP load node with three lanes plus
two lane permutation nodes selecting two lanes each.  In
a loop context this will cause a VF of two and three
loads per vectorized loop iteration (plus two permutes)
while previously we used a VF of one with two loads
and no permutation per loop iteration.  In the new
scheme the number of loads is less but we pay with
permutes and a higher VF.
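
As a sketch (notation mine), the SLP graph for the second example
would then be:

  load node { a[0], a[1], a[2] }          <- single node for the group
     |                      |
  lane permute { 0, 1 }   lane permute { 1, 2 }
     |                      |
  stores to b[]           stores to c[]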

There is a (bad) interaction with determining the vectorization
factor, which for BB vectorization causes missed vectorizations
because the "load parts of a dataref group directly" optimization
is not (yet) reflected in the SLP graph.

There is a (bad) interaction with CTOR vectorization since we now
get confused about where to insert vectorized stmts when combining
two previously distinct SLP graphs.

My immediate focus is on the SSA verification FAILs, but the
other part points at a missing piece in this - a "pass"
that "optimizes" the SLP graph with respect to permutations
and loads, ultimately for example deciding between using
interleaving schemes, scalar loads, "SLP" + permutation,
load-lanes, etc.  This is also the thing that blocks
SLP-only vectorization (apart from teaching the few pieces
that cannot do SLP to do SLP).

I'm posting this mostly to make people think about how it fits
their future development and architecture features.

And of course to see whether there are already-formed ideas on
how to structure such an "optimization".

Thanks,
Richard.

PS: the patch doesn't bootstrap; it ICEs during the libgfortran build.
The C vect.exp FAILs are:

FAIL: gcc.dg/vect/pr59354.c (internal compiler error)
FAIL: gcc.dg/vect/pr88903-2.c (internal compiler error)
FAIL: gcc.dg/vect/vect-alias-check-10.c (internal compiler error)
FAIL: gcc.dg/vect/vect-alias-check-11.c (internal compiler error)
FAIL: gcc.dg/vect/vect-alias-check-12.c (internal compiler error)
FAIL: gcc.dg/vect/slp-23.c scan-tree-dump-times vect "vectorizing stmts using SLP" 2
FAIL: gcc.dg/vect/slp-43.c (internal compiler error)
FAIL: gcc.dg/vect/bb-slp-19.c scan-tree-dump-times slp2 "basic block vectorized" 1
FAIL: gcc.dg/vect/bb-slp-29.c scan-tree-dump-times slp1 "basic block vectorized" 1
FAIL: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "basic block vectorized" 1
FAIL: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "BB vectorization with gaps at the end of a load is not supported" 1
FAIL: gcc.dg/vect/bb-slp-subgroups-2.c (internal compiler error)
FAIL: gcc.dg/vect/bb-slp-subgroups-3.c (internal compiler error)

	* tree-vectorizer.h (vect_get_place_in_interleaving_chain):
	Add flag parameter defaulted to true.
	* tree-vect-slp.c (vect_get_place_in_interleaving_chain):
	Implement alternate mode not counting gaps.
	(vect_build_slp_tree_2): Split load nodes into load and
	lane permutation.
	(vect_attempt_slp_rearrange_stmts): Pass in the unroll
	factor to consider.
	(calculate_unrolling_factor): New function.
	(vect_analyze_slp_instance): Adjust.
	(vect_analyze_slp): Nail down the unroll factor before
	optimizing permutations in SLP reductions.
	(vect_make_slp_decision): Remove determining unroll
	factor here.
	(vect_schedule_slp_instance): Adjust where we emit vectorized
	stmts.
---
 gcc/tree-vect-slp.c   | 246 +++++++++++++++++++++++++++++++++---------
 gcc/tree-vectorizer.h |   3 +-
 2 files changed, 199 insertions(+), 50 deletions(-)

-- 
2.26.2

Comments

Richard Sandiford June 30, 2020, 11 a.m. | #1
Sorry for the slow response…

Richard Biener <rguenther@suse.de> writes:
> This splits SLP load nodes with load permutation into a SLP
> load node (with load "permutation" removing gaps) and a
> lane permutation node.  The first and foremost goal of this
> is to be able to have a single SLP node for each load group
> so we can start making decisions about how to vectorize
> them factoring in all loads of that group.  The second
> goal is to eventually be able to optimize permutations by
> pushing them down where they can be combined from multiple
> children to the output.  We do have one very special case
> handled by vect_attempt_slp_rearrange_stmts doing it all
> the way down for reductions that are associative.

Sounds great!

> For example for
>
>   l1 = a[0]; l2 = a[1];
>   b[0] = l1; b[1] = l2;
>   c[0] = l2; c[1] = l1;
>
> we can avoid generating loads twice.  For
>
>   l1 = a[0]; l2 = a[1]; l3 = a[2];
>   b[0] = l1; b[1] = l2;
>              c[0] = l2; c[1] = l3;
>
> we will have a SLP load node with three lanes plus
> two lane permutation nodes selecting two lanes each.  In
> a loop context this will cause a VF of two and three
> loads per vectorized loop iteration (plus two permutes)
> while previously we used a VF of one with two loads
> and no permutation per loop iteration.  In the new
> scheme the number of loads is less but we pay with
> permutes and a higher VF.
>
> There is a (bad) interaction with determining the vectorization
> factor, which for BB vectorization causes missed vectorizations
> because the "load parts of a dataref group directly" optimization
> is not (yet) reflected in the SLP graph.
>
> There is a (bad) interaction with CTOR vectorization since we now
> get confused about where to insert vectorized stmts when combining
> two previously distinct SLP graphs.
>
> My immediate focus is on the SSA verification FAILs, but the
> other part points at a missing piece in this - a "pass"
> that "optimizes" the SLP graph with respect to permutations
> and loads, ultimately for example deciding between using
> interleaving schemes, scalar loads, "SLP" + permutation,
> load-lanes, etc.  This is also the thing that blocks
> SLP-only vectorization (apart from teaching the few pieces
> that cannot do SLP to do SLP).
>
> I'm posting this mostly to make people think about how it fits
> their future development and architecture features.

Yeah, the interleaving scheme is something we'd very much like for SVE2,
where for some operations that produce multiple vectors, it would be better
to organise them on an even/odd division instead of a low/high division.
There are instructions both to produce and to consume values in odd/even
form, so in many cases no explicit permutation would be needed.
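
For a widening operation on four input elements the two layouts
would be (sketch, notation mine):

  low/high:  { x0, x1, x2, x3 } -> { r0, r1 } and { r2, r3 }
  even/odd:  { x0, x1, x2, x3 } -> { r0, r2 } and { r1, r3 }

where e.g. the SVE2 bottom/top instruction pairs work on the
even/odd form directly.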

I've also seen cases for downward loops where we reverse vectors after
loading and reverse them again before storing, even though the loop
could easily have operated on the reversed vectors directly.

Another thing we'd like for SVE in general is to allow vectors *with*
gaps throughout the SLP graph, since with predication it's simple to
ignore any inactive element where necessary.  This is often much cheaper
than packing the active elements together to remove gaps.  For example:

  a[0] += 1;
  a[2] += 1;
  a[3] += 1;

should just be a predicated load, add and store, with no shuffling.
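
That is, something like (pseudo-code sketch, predicate value mine):

  p = { 1, 0, 1, 1 }              // lane of a[1] inactive
  v = load a[0..3] masked by p
  v = v + 1 masked by p
  store v to a[0..3] masked by p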

Even on targets without predication, it might be better to leave
the gaps in for part of the graph, e.g. if two loads feed a single
permute.  So it would be good if the node representation allowed
arbitrary permutations, including with “dead” elements at any
position.  (I realise that isn't news and that keeping the gap
removal with the load was just “for now” :-))

I guess this to some extent feeds into a long-standing TODO to allow
“don't care” elements in things like vec_perm_indices.

Thanks,
Richard
Richard Biener June 30, 2020, 11:38 a.m. | #2
On Tue, 30 Jun 2020, Richard Sandiford wrote:

> Sorry for the slow response…

No problem.

> Richard Biener <rguenther@suse.de> writes:
> > This splits SLP load nodes with load permutation into a SLP
> > load node (with load "permutation" removing gaps) and a
> > lane permutation node.  The first and foremost goal of this
> > is to be able to have a single SLP node for each load group
> > so we can start making decisions about how to vectorize
> > them factoring in all loads of that group.  The second
> > goal is to eventually be able to optimize permutations by
> > pushing them down where they can be combined from multiple
> > children to the output.  We do have one very special case
> > handled by vect_attempt_slp_rearrange_stmts doing it all
> > the way down for reductions that are associative.
>
> Sounds great!
>
> > For example for
> >
> >   l1 = a[0]; l2 = a[1];
> >   b[0] = l1; b[1] = l2;
> >   c[0] = l2; c[1] = l1;
> >
> > we can avoid generating loads twice.  For
> >
> >   l1 = a[0]; l2 = a[1]; l3 = a[2];
> >   b[0] = l1; b[1] = l2;
> >              c[0] = l2; c[1] = l3;
> >
> > we will have a SLP load node with three lanes plus
> > two lane permutation nodes selecting two lanes each.  In
> > a loop context this will cause a VF of two and three
> > loads per vectorized loop iteration (plus two permutes)
> > while previously we used a VF of one with two loads
> > and no permutation per loop iteration.  In the new
> > scheme the number of loads is less but we pay with
> > permutes and a higher VF.
> >
> > There is a (bad) interaction with determining the vectorization
> > factor, which for BB vectorization causes missed vectorizations
> > because the "load parts of a dataref group directly" optimization
> > is not (yet) reflected in the SLP graph.
> >
> > There is a (bad) interaction with CTOR vectorization since we now
> > get confused about where to insert vectorized stmts when combining
> > two previously distinct SLP graphs.
> >
> > My immediate focus is on the SSA verification FAILs, but the
> > other part points at a missing piece in this - a "pass"
> > that "optimizes" the SLP graph with respect to permutations
> > and loads, ultimately for example deciding between using
> > interleaving schemes, scalar loads, "SLP" + permutation,
> > load-lanes, etc.  This is also the thing that blocks
> > SLP-only vectorization (apart from teaching the few pieces
> > that cannot do SLP to do SLP).
> >
> > I'm posting this mostly to make people think about how it fits
> > their future development and architecture features.
>
> Yeah, the interleaving scheme is something we'd very much like for SVE2,
> where for some operations that produce multiple vectors, it would be better
> to organise them on an even/odd division instead of a low/high division.
> There are instructions both to produce and to consume values in odd/even
> form, so in many cases no explicit permutation would be needed.
>
> I've also seen cases for downward loops where we reverse vectors after
> loading and reverse them again before storing, even though the loop
> could easily have operated on the reversed vectors directly.
>
> Another thing we'd like for SVE in general is to allow vectors *with*
> gaps throughout the SLP graph, since with predication it's simple to
> ignore any inactive element where necessary.  This is often much cheaper
> than packing the active elements together to remove gaps.  For example:
>
>   a[0] += 1;
>   a[2] += 1;
>   a[3] += 1;
>
> should just be a predicated load, add and store, with no shuffling.
>
> Even on targets without predication, it might be better to leave
> the gaps in for part of the graph, e.g. if two loads feed a single
> permute.  So it would be good if the node representation allowed
> arbitrary permutations, including with “dead” elements at any
> position.  (I realise that isn't news and that keeping the gap
> removal with the load was just “for now” :-))

Hmm, OK.  So I'm still experimenting with how to incrementally push
these kinds of changes to master.  Unfortunately it resists any
"serious" change because we've painted ourselves into a corner with
respect to load and store handling ;)  Well, it just means the change
will need to be bigger than anticipated.

As for inactive lanes, for SVE this means a mask register for each
operation, correct?  At the moment the SLP graph is a representation
of the scalar GIMPLE IL and we cannot really change that until we
commit to a vectorization factor and more.  So an inactive lane
is simply not there and a "load" is simply another way of building
up a vector from scalar stmt results much like those "built from scalars"
SLP nodes but we of course make them special because we have those
DR groups that are used.

So with the patch as posted the "gaps" are represented in the
load permutation of the "loads" which is where you could create
mask registers from and simply push them to SLP parents (up to
the point you get to a parent whose children have differing masks...).

I think we're going to have a set of post-processing steps on the
initial SLP graph for both "optimizing" where (and if) to do permutations
and whether to elide gap removal in favor of masks (and we'd then
annotate the SLP nodes with the active mask).

All of this requires pushing back some of the early decisions
(I've mostly identified determining the vectorization factor) but also
doing load/store classification early.  For example if we end up
with strided loads or stores such operations can fuse with any
permutation at no cost.
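
E.g. a strided load assembles each lane individually anyway, so
(sketch, stride S):

  load { p[0], p[S], p[2*S], p[3*S] } then permute { 2, 0, 3, 1 }

costs the same as directly loading

  { p[2*S], p[0], p[3*S], p[S] }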

At the moment I'm continuing to poke the code for the path of least
resistance to introducing at least parts of the machinery.
I'm targeting post-processing for merging of identical
permutes across operations as it appears for

  a[0] = b[1] + c[1];
  a[1] = b[0] + c[0];
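
where both operands want the same permute and we could instead
permute the result once (sketch):

  vb = { b[0], b[1] };  vc = { c[0], c[1] };   // loads in memory order
  vt = vb + vc;
  va = permute (vt, { 1, 0 });                 // one permute instead of two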

> I guess this to some extent feeds into a long-standing TODO to allow
> “don't care” elements in things like vec_perm_indices.

Hmm, so when there's no masking and we don't want to remove gaps
we'd replace the permutation removing the gaps with an operation
that turns the inactive lanes to zero (bitwise and) or all ones
(bitwise or).  Tracking the number of "vector" lanes compared
to the number of "real scalar" lanes in a SLP node might be enough
to handle this more or less transparently.
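
Something like (sketch, four lanes with one gap):

  v     = { a[0], gap, a[2], a[3] }
  v & m,  m = { ~0, 0, ~0, ~0 }   <- gap lane zero, neutral for plus/ior/xor
  v | m,  m = { 0, ~0, 0, 0 }     <- gap lane all ones, neutral for and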

Richard.
Richard Sandiford June 30, 2020, 2:50 p.m. | #3
Richard Biener <rguenther@suse.de> writes:
>> Another thing we'd like for SVE in general is to allow vectors *with*
>> gaps throughout the SLP graph, since with predication it's simple to
>> ignore any inactive element where necessary.  This is often much cheaper
>> than packing the active elements together to remove gaps.  For example:
>>
>>   a[0] += 1;
>>   a[2] += 1;
>>   a[3] += 1;
>>
>> should just be a predicated load, add and store, with no shuffling.
>>
>> Even on targets without predication, it might be better to leave
>> the gaps in for part of the graph, e.g. if two loads feed a single
>> permute.  So it would be good if the node representation allowed
>> arbitrary permutations, including with “dead” elements at any
>> position.  (I realise that isn't news and that keeping the gap
>> removal with the load was just “for now” :-))
>
> Hmm, OK.  So I'm still experimenting with how to incrementally push
> these kinds of changes to master.  Unfortunately it resists any
> "serious" change because we've painted ourselves into a corner with
> respect to load and store handling ;)  Well, it just means the change
> will need to be bigger than anticipated.
>
> As for inactive lanes, for SVE this means a mask register for each
> operation, correct?

Certainly for potentially trapping ops and reductions, yeah.
For others it really doesn't matter (and those wouldn't require
a target that supports predication).

So I don't think having gaps necessarily means we have a mask.
Being able to attach masks to nodes (perhaps independently of gaps)
would be useful though.

> At the moment the SLP graph is a representation
> of the scalar GIMPLE IL and we cannot really change that until we
> commit to a vectorization factor and more.  So an inactive lane
> is simply not there and a "load" is simply another way of building
> up a vector from scalar stmt results much like those "built from scalars"
> SLP nodes but we of course make them special because we have those
> DR groups that are used.
>
> So with the patch as posted the "gaps" are represented in the
> load permutation of the "loads" which is where you could create
> mask registers from and simply push them to SLP parents (up to
> the point you get to a parent whose children have differing masks...).

OK.  But I'd argue that's true of permutations too.  At the point
that the SLP graph just represents scalar gimple IL, we can simply
permute SLP_TREE_SCALAR_STMTS and not have any explicit permute
operations in the graph.

Permutations and gaps only come into play once we add more
vector-related information or restrictions.

> I think we're going to have a set of post-processing steps on the
> initial SLP graph for both "optimizing" where (and if) to do permutations
> and whether to elide gap removal in favor of masks (and we'd then
> annotate the SLP nodes with the active mask).

So we'd start out without any permutations and gaps, and then add
them as a post-processing step based on what the target can handle?
If so, sounds good to me FWIW.

> All of this requires pushing back some of the early decisions
> (I've mostly identified determining the vectorization factor) but also
> doing load/store classification early.  For example if we end up
> with strided loads or stores such operations can fuse with any
> permutation at no cost.
>
> At the moment I'm continuing to poke the code for the path of least
> resistance to introducing at least parts of the machinery.
> I'm targeting post-processing for merging of identical
> permutes across operations as it appears for
>
>   a[0] = b[1] + c[1];
>   a[1] = b[0] + c[0];
>
>> I guess this to some extent feeds into a long-standing TODO to allow
>> “don't care” elements in things like vec_perm_indices.
>
> Hmm, so when there's no masking and we don't want to remove gaps
> we'd replace the permutation removing the gaps with an operation
> that turns the inactive lanes to zero (bitwise and) or all ones
> (bitwise or).  Tracking the number of "vector" lanes compared
> to the number of "real scalar" lanes in a SLP node might be enough
> to handle this more or less transparently.

I was also thinking of cases like:

  a[0] += b[0];
  a[1] += c[1];
  a[2] += b[2];
  a[3] += c[3];

(for all targets).

If we can somehow be smart about the handling of “c” and load a vector
at c[0] rather than c[1], the rhs of the addition could be a blend of:

  b[0] ____ b[2] ____
  ____ c[1] ____ c[3]

Or, even if we can't:

  b[0] ____ b[2] ____
  c[1] ____ c[3] ____

is a natural permute on some targets (e.g. Arm ones).

This of course requires the usual proof that the other elements of b[]
and c[] can be safely loaded.

Thanks,
Richard
Richard Biener July 1, 2020, 8:52 a.m. | #4
On Tue, 30 Jun 2020, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> >> Another thing we'd like for SVE in general is to allow vectors *with*
> >> gaps throughout the SLP graph, since with predication it's simple to
> >> ignore any inactive element where necessary.  This is often much cheaper
> >> than packing the active elements together to remove gaps.  For example:
> >>
> >>   a[0] += 1;
> >>   a[2] += 1;
> >>   a[3] += 1;
> >>
> >> should just be a predicated load, add and store, with no shuffling.
> >>
> >> Even on targets without predication, it might be better to leave
> >> the gaps in for part of the graph, e.g. if two loads feed a single
> >> permute.  So it would be good if the node representation allowed
> >> arbitrary permutations, including with “dead” elements at any
> >> position.  (I realise that isn't news and that keeping the gap
> >> removal with the load was just “for now” :-))
> >
> > Hmm, OK.  So I'm still experimenting with how to incrementally push
> > these kinds of changes to master.  Unfortunately it resists any
> > "serious" change because we've painted ourselves into a corner with
> > respect to load and store handling ;)  Well, it just means the change
> > will need to be bigger than anticipated.
> >
> > As for inactive lanes, for SVE this means a mask register for each
> > operation, correct?
>
> Certainly for potentially trapping ops and reductions, yeah.
> For others it really doesn't matter (and those wouldn't require
> a target that supports predication).
>
> So I don't think having gaps necessarily means we have a mask.
> Being able to attach masks to nodes (perhaps independently of gaps)
> would be useful though.
>
> > At the moment the SLP graph is a representation
> > of the scalar GIMPLE IL and we cannot really change that until we
> > commit to a vectorization factor and more.  So an inactive lane
> > is simply not there and a "load" is simply another way of building
> > up a vector from scalar stmt results much like those "built from scalars"
> > SLP nodes but we of course make them special because we have those
> > DR groups that are used.
> >
> > So with the patch as posted the "gaps" are represented in the
> > load permutation of the "loads" which is where you could create
> > mask registers from and simply push them to SLP parents (up to
> > the point you get to a parent whose children have differing masks...).
>
> OK.  But I'd argue that's true of permutations too.  At the point
> that the SLP graph just represents scalar gimple IL, we can simply
> permute SLP_TREE_SCALAR_STMTS and not have any explicit permute
> operations in the graph.

That's true when the only permutes are in leaves.  The
SLP_TREE_SCALAR_STMTS impose an order on the lanes, so once
different parents need a different order we need explicit permute
nodes swapping SLP_TREE_SCALAR_STMTS.  Of course that's only
because we have combined multiple scalar stmts into a single
SLP node, which really is ...

> Permutations and gaps only come into play once we add more
> vector-related information or restrictions.

... what those "vector-related" restrictions are right now.
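
For example (hypothetical case):

  a[0] = x[0] + 1;  a[1] = x[1] + 1;   // wants lanes { x[0], x[1] }
  b[0] = x[1] * 2;  b[1] = x[0] * 2;   // wants lanes { x[1], x[0] }

sharing one load node for x[0] and x[1] means one of the two
parents needs an explicit permute node on top of it.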

> > I think we're going to have a set of post-processing steps on the
> > initial SLP graph for both "optimizing" where (and if) to do permutations
> > and whether to elide gap removal in favor of masks (and we'd then
> > annotate the SLP nodes with the active mask).
>
> So we'd start out without any permutations and gaps, and then add
> them as a post-processing step based on what the target can handle?
> If so, sounds good to me FWIW.

If you start with the notion of loads and stores being
compositions/extracts then yes - based on how we want to
vectorize those operations we'd expose any required permutation
explicitly so we can then combine & optimize those.

For the existing patches I tried to relate loads involving the
same data-ref group by introducing a shared SLP node accessing the
whole group, so that's not starting with no permutes, and it comes
with some issues.  If we don't relate loads from the same dataref
group early (during SLP discovery, that is), then we have to
try (again based on cost) doing that during that post-processing
which then becomes more complicated.

As said - this is work in progress and I'm experimenting in
multiple directions ...

> > All of this requires pushing back some of the early decisions
> > (I've mostly identified determining the vectorization factor) but also
> > doing load/store classification early.  For example if we end up
> > with strided loads or stores such operations can fuse with any
> > permutation at no cost.
> >
> > At the moment I'm continuing to poke the code for the path of least
> > resistance to introducing at least parts of the machinery.
> > I'm targeting post-processing for merging of identical
> > permutes across operations as it appears for
> >
> >   a[0] = b[1] + c[1];
> >   a[1] = b[0] + c[0];
> >
> >> I guess this to some extent feeds into a long-standing TODO to allow
> >> “don't care” elements in things like vec_perm_indices.
> >
> > Hmm, so when there's no masking and we don't want to remove gaps
> > we'd replace the permutation removing the gaps with an operation
> > that turns the inactive lanes to zero (bitwise and) or all ones
> > (bitwise or).  Tracking the number of "vector" lanes compared
> > to the number of "real scalar" lanes in a SLP node might be enough
> > to handle this more or less transparently.
>
> I was also thinking of cases like:
>
>   a[0] += b[0];
>   a[1] += c[1];
>   a[2] += b[2];
>   a[3] += c[3];
>
> (for all targets).
>
> If we can somehow be smart about the handling of “c” and load a vector
> at c[0] rather than c[1], the rhs of the addition could be a blend of:
>
>   b[0] ____ b[2] ____
>   ____ c[1] ____ c[3]
>
> Or, even if we can't:
>
>   b[0] ____ b[2] ____
>   c[1] ____ c[3] ____
>
> is a natural permute on some targets (e.g. Arm ones).
>
> This of course requires the usual proof that the other elements of b[]
> and c[] can be safely loaded.


Yeah, that's another good case (we'd currently outright reject
this because we refer to two different dataref groups ...).

Richard.

Patch

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 5c169f37022..68b0750db44 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -244,7 +244,8 @@  vect_contains_pattern_stmt_p (vec<stmt_vec_info> stmts)
 
 int
 vect_get_place_in_interleaving_chain (stmt_vec_info stmt_info,
-				      stmt_vec_info first_stmt_info)
+				      stmt_vec_info first_stmt_info,
+				      bool add_gaps)
 {
   stmt_vec_info next_stmt_info = first_stmt_info;
   int result = 0;
@@ -258,7 +259,7 @@  vect_get_place_in_interleaving_chain (stmt_vec_info stmt_info,
 	return result;
       next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
       if (next_stmt_info)
-	result += DR_GROUP_GAP (next_stmt_info);
+	result += add_gaps ? DR_GROUP_GAP (next_stmt_info) : 1;
     }
   while (next_stmt_info);
 
@@ -1243,27 +1244,104 @@  vect_build_slp_tree_2 (vec_info *vinfo,
 	}
       else
 	{
-	  *max_nunits = this_max_nunits;
-	  (*tree_size)++;
+	  /* We represent a grouped load as a vector build of all
+	     lanes of the load group in memory order (a load
+	     with possibly a load permutation compacting a group
+	     with gaps) plus a lane permutation node that selects
+	     and permutes the load group lanes.  That allows
+	     sharing of the vector build between different
+	     SLP sub-graphs like for
+		l1 = a[0]; l2 = a[1];
+		b[0] = l1; b[1] = l2;
+		c[0] = l2; c[1] = l1;
+	     where we can avoid generating loads twice.  For
+		l1 = a[0]; l2 = a[1]; l3 = a[2];
+		b[0] = l1; b[1] = l2;
+		c[0] = l2; c[1] = l3;
+	     we will have a SLP load node with three lanes plus
+	     two lane permutation nodes selecting two lanes each.
+
+	     ???  This has ripple-down effects for vectorization
+	     factor computation and bad interaction with vectorized
+	     stmt placing.  */
 	  node = vect_create_new_slp_node (stmts, 0);
 	  SLP_TREE_VECTYPE (node) = vectype;
 	  /* And compute the load permutation.  Whether it is actually
 	     a permutation depends on the unrolling factor which is
 	     decided later.  */
 	  vec<unsigned> load_permutation;
+	  vec<std::pair<unsigned, unsigned> > lane_permutation;
 	  int j;
 	  stmt_vec_info load_info;
+	  lane_permutation.create (group_size);
 	  load_permutation.create (group_size);
 	  stmt_vec_info first_stmt_info
 	    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (node)[0]);
-	  FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), j, load_info)
+	  for (load_info = first_stmt_info; load_info;
+	       load_info = DR_GROUP_NEXT_ELEMENT (load_info))
 	    {
 	      int load_place = vect_get_place_in_interleaving_chain
-		  (load_info, first_stmt_info);
+		  (load_info, first_stmt_info, true);
 	      gcc_assert (load_place != -1);
 	      load_permutation.safe_push (load_place);
 	    }
-	  SLP_TREE_LOAD_PERMUTATION (node) = load_permutation;
+	  bool any_lane_permute = false;
+	  FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), j, load_info)
+	    {
+	      int lane_place = vect_get_place_in_interleaving_chain
+		  (load_info, first_stmt_info, false);
+	      gcc_assert (lane_place != -1);
+	      lane_permutation.quick_push (std::make_pair (0, lane_place));
+	      if (lane_place != j)
+		any_lane_permute = true;
+	    }
+	  /* When there's a lane permutation or selection make a separate
+	     SLP node for the full group load in lane-order (but keep handling
+	     gaps with load permutations for now).  */
+	  if (!any_lane_permute && load_permutation.length () == group_size)
+	    {
+	      lane_permutation.release ();
+	      SLP_TREE_LOAD_PERMUTATION (node) = load_permutation;
+	    }
+	  else
+	    {
+	      /* Build a SLP node loading all members from the group
+		 in order with gaps not represented.
+		 ???  We easily end up with an odd number of lanes here
+		 requiring a larger VF.  In some cases this is desirable
+		 but as gaps are not explicitly represented here those
+		 present an issue (see dg.exp/vect/slp-23.c for example).
+		 That also can end up requiring unsupported load
+		 permutations.  */
+	      load_permutation.release ();
+	      SLP_TREE_LANE_PERMUTATION (node) = lane_permutation;
+	      SLP_TREE_CODE (node) = VEC_PERM_EXPR;
+	      vec<stmt_vec_info> loads;
+	      loads.create (group_size);
+	      for (load_info = first_stmt_info; load_info;
+		   load_info = DR_GROUP_NEXT_ELEMENT (load_info))
+		loads.safe_push (load_info);
+	      bool *lmatches = XALLOCAVEC (bool, loads.length ());
+	      slp_tree load_node = vect_build_slp_tree (vinfo, loads,
+							loads.length (),
+							&this_max_nunits,
+							lmatches, npermutes,
+							&this_tree_size,
+							bst_map);
+	      /* ???  For BB vect the above doesn't always work.  */
+	      if (!load_node)
+		{
+		  matches[0] = false;
+		  loads.release ();
+		  /* The caller still owns defs if we fail.  */
+		  SLP_TREE_SCALAR_STMTS (node) = vNULL;
+		  vect_free_slp_tree (node, false);
+		  return NULL;
+		}
+	      SLP_TREE_CHILDREN (node).safe_push (load_node);
+	    }
+	  *tree_size += this_tree_size + 1;
+	  *max_nunits = this_max_nunits;
 	  return node;
 	}
     }
@@ -1781,7 +1859,8 @@  vect_slp_rearrange_stmts (slp_tree node, unsigned int group_size,
    otherwise return false.  */
 
 static bool
-vect_attempt_slp_rearrange_stmts (slp_instance slp_instn)
+vect_attempt_slp_rearrange_stmts (slp_instance slp_instn,
+				  poly_uint64 unrolling_factor)
 {
   unsigned int i, j;
   unsigned int lidx;
@@ -1807,9 +1886,13 @@  vect_attempt_slp_rearrange_stmts (slp_instance slp_instn)
     }
 
   /* Check that the loads in the first sequence are different and there
-     are no gaps between them.  */
+     are no gaps between them.  Also check whether there is any permutation.  */
+  /* ???  We now would need to check lane permutations or more generally
+     apply optimization of permutations by pushing them up/down the
+     graph.  */
   auto_sbitmap load_index (group_size);
   bitmap_clear (load_index);
+  bool any = false;
   FOR_EACH_VEC_ELT (node->load_permutation, i, lidx)
     {
       if (lidx >= group_size)
@@ -1817,8 +1900,12 @@  vect_attempt_slp_rearrange_stmts (slp_instance slp_instn)
       if (bitmap_bit_p (load_index, lidx))
 	return false;
 
+      if (lidx != i)
+	any = true;
       bitmap_set_bit (load_index, lidx);
     }
+  if (!any)
+    return false;
   for (i = 0; i < group_size; i++)
     if (!bitmap_bit_p (load_index, i))
       return false;
@@ -1844,7 +1931,6 @@  vect_attempt_slp_rearrange_stmts (slp_instance slp_instn)
 			    node->load_permutation, visited);
 
   /* We are done, no actual permutations need to be generated.  */
-  poly_uint64 unrolling_factor = SLP_INSTANCE_UNROLLING_FACTOR (slp_instn);
   FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (slp_instn), i, node)
     {
       stmt_vec_info first_stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
@@ -1971,6 +2057,41 @@  calculate_unrolling_factor (poly_uint64 nunits, unsigned int group_size)
   return exact_div (common_multiple (nunits, group_size), group_size);
 }
 
+/* Since the group size can now change in the SLP graph the minimum
+   unroll factor needs to be decided on a node-by-node basis and a common
+   multiple be found for the overall unrolling.  */
+
+static poly_uint64
+calculate_unrolling_factor (slp_tree node,
+			    hash_map<slp_tree, poly_uint64> &visited)
+{
+  bool existed_p;
+  poly_uint64 &uf = visited.get_or_insert (node, &existed_p);
+  if (existed_p)
+    return uf;
+
+  poly_uint64 this_uf
+    = calculate_unrolling_factor (node->max_nunits, SLP_TREE_LANES (node));
+  uf = this_uf;
+
+  slp_tree child;
+  unsigned i;
+  FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
+    {
+      poly_uint64 child_uf = calculate_unrolling_factor (child, visited);
+      this_uf = force_common_multiple (child_uf, this_uf);
+    }
+  *visited.get (node) = this_uf;
+  return this_uf;
+}
+
+static poly_uint64
+calculate_unrolling_factor (slp_tree node)
+{
+  hash_map<slp_tree, poly_uint64> visited;
+  return calculate_unrolling_factor (node, visited);
+}
+
 /* Analyze an SLP instance starting from a group of grouped stores.  Call
    vect_build_slp_tree to build a tree of packed stmts if possible.
    Return FALSE if it's impossible to SLP any stmt in the loop.  */
@@ -2097,9 +2218,11 @@  vect_analyze_slp_instance (vec_info *vinfo,
 			      &tree_size, bst_map);
   if (node != NULL)
     {
+      /* ???  Instead of walking the SLP tree here record the minimum
+	 unroll factor instead of max_nunits in the SLP nodes?  */
       /* Calculate the unrolling factor based on the smallest type.  */
       poly_uint64 unrolling_factor
-	= calculate_unrolling_factor (max_nunits, group_size);
+	= calculate_unrolling_factor (node);
 
       if (maybe_ne (unrolling_factor, 1U)
 	  && is_a <bb_vec_info> (vinfo))
@@ -2339,8 +2462,28 @@  vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
       vect_free_slp_tree ((*it).second, false);
   delete bst_map;
 
-  /* Optimize permutations in SLP reductions.  */
   slp_instance instance;
+  /* Get the overall SLP induced unrolling factor.  */
+  poly_uint64 unrolling_factor = 1;
+  FOR_EACH_VEC_ELT (vinfo->slp_instances, i, instance)
+    {
+      /* All unroll factors have the form:
+
+	   GET_MODE_SIZE (vinfo->vector_mode) * X
+
+	 for some rational X, so they must have a common multiple.  */
+      unrolling_factor
+	= force_common_multiple (unrolling_factor,
+				 SLP_INSTANCE_UNROLLING_FACTOR (instance));
+
+    }
+  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
+    LOOP_VINFO_SLP_UNROLLING_FACTOR (loop_vinfo) = unrolling_factor;
+  else
+    /* For BB vectorization we removed any entries that need unrolling.  */
+    gcc_assert (known_eq (unrolling_factor, 1U));
+
+  /* Optimize permutations in SLP reductions.  */
   FOR_EACH_VEC_ELT (vinfo->slp_instances, i, instance)
     {
       slp_tree node = SLP_INSTANCE_TREE (instance);
@@ -2349,7 +2492,7 @@  vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
 	 In reduction chain the order of the loads is not important.  */
       if (!STMT_VINFO_DATA_REF (stmt_info)
 	  && !REDUC_GROUP_FIRST_ELEMENT (stmt_info))
-	vect_attempt_slp_rearrange_stmts (instance);
+	vect_attempt_slp_rearrange_stmts (instance, unrolling_factor);
     }
 
   /* Gather all loads in the SLP graph.  */
@@ -2434,7 +2577,6 @@  bool
 vect_make_slp_decision (loop_vec_info loop_vinfo)
 {
   unsigned int i;
-  poly_uint64 unrolling_factor = 1;
   vec<slp_instance> slp_instances = LOOP_VINFO_SLP_INSTANCES (loop_vinfo);
   slp_instance instance;
   int decided_to_slp = 0;
@@ -2444,15 +2586,6 @@  vect_make_slp_decision (loop_vec_info loop_vinfo)
   FOR_EACH_VEC_ELT (slp_instances, i, instance)
     {
       /* FORNOW: SLP if you can.  */
-      /* All unroll factors have the form:
-
-	   GET_MODE_SIZE (vinfo->vector_mode) * X
-
-	 for some rational X, so they must have a common multiple.  */
-      unrolling_factor
-	= force_common_multiple (unrolling_factor,
-				 SLP_INSTANCE_UNROLLING_FACTOR (instance));
-
       /* Mark all the stmts that belong to INSTANCE as PURE_SLP stmts.  Later we
 	 call vect_detect_hybrid_slp () to find stmts that need hybrid SLP and
 	 loop-based vectorization.  Such stmts will be marked as HYBRID.  */
@@ -2460,14 +2593,12 @@  vect_make_slp_decision (loop_vec_info loop_vinfo)
       decided_to_slp++;
     }
 
-  LOOP_VINFO_SLP_UNROLLING_FACTOR (loop_vinfo) = unrolling_factor;
-
   if (decided_to_slp && dump_enabled_p ())
     {
       dump_printf_loc (MSG_NOTE, vect_location,
 		       "Decided to SLP %d instances. Unrolling factor ",
 		       decided_to_slp);
-      dump_dec (MSG_NOTE, unrolling_factor);
+      dump_dec (MSG_NOTE, LOOP_VINFO_SLP_UNROLLING_FACTOR (loop_vinfo));
       dump_printf (MSG_NOTE, "\n");
     }
 
@@ -4205,33 +4336,50 @@  vect_schedule_slp_instance (vec_info *vinfo,
 		     "------>vectorizing SLP node starting from: %G",
 		     stmt_info->stmt);
 
-  /* Vectorized stmts go before the last scalar stmt which is where
-     all uses are ready.  */
-  stmt_vec_info last_stmt_info = vect_find_last_scalar_stmt_in_slp (node);
-  if (last_stmt_info)
-    si = gsi_for_stmt (last_stmt_info->stmt);
+  /* Vectorized stmts for leafs and nodes with datarefs go before
+     the last scalar stmt.  */
+  if (SLP_TREE_CHILDREN (node).is_empty ()
+      || (SLP_TREE_CODE (node) != VEC_PERM_EXPR
+	  && STMT_VINFO_DATA_REF (SLP_TREE_REPRESENTATIVE (node))))
+    {
+      stmt_vec_info last_stmt_info = vect_find_last_scalar_stmt_in_slp (node);
+      si = gsi_for_stmt (last_stmt_info->stmt);
+    }
   else
     {
-      /* Or if we do not have 1:1 matching scalar stmts emit after the
-	 children vectorized defs.  */
+      /* Else emit after the children vectorized defs.  */
       gimple *last_stmt = NULL;
       FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
-	/* ???  With only external defs the following breaks.  Note
-	   external defs do not get all vector init stmts generated
-	   at the same place.  */
-	if (SLP_TREE_DEF_TYPE (child) == vect_internal_def)
-	  {
-	    /* We are emitting all vectorized stmts in the same place and
-	       the last one is the last.
-	       ???  Unless we have a load permutation applied and that
-	       figures to re-use an earlier generated load.  */
-	    unsigned j;
-	    gimple *vstmt;
-	    FOR_EACH_VEC_ELT (SLP_TREE_VEC_STMTS (child), j, vstmt)
-	      if (!last_stmt
-		  || vect_stmt_dominates_stmt_p (last_stmt, vstmt))
-		last_stmt = vstmt;
-	  }
+	{
+	  if (SLP_TREE_DEF_TYPE (child) == vect_internal_def)
+	    {
+	      /* We are emitting all vectorized stmts in the same place and
+		 the last one is the last.
+		 ???  Unless we have a load permutation applied and that
+		 figures to re-use an earlier generated load.  */
+	      unsigned j;
+	      gimple *vstmt;
+	      FOR_EACH_VEC_ELT (SLP_TREE_VEC_STMTS (child), j, vstmt)
+		if (!last_stmt
+		    || vect_stmt_dominates_stmt_p (last_stmt, vstmt))
+		  last_stmt = vstmt;
+	    }
+	  else
+	    {
+	      /* For externals we have to look at all defs since their
+		 insertion place is decided per vector.  */
+	      unsigned j;
+	      tree vdef;
+	      FOR_EACH_VEC_ELT (SLP_TREE_VEC_DEFS (child), j, vdef)
+		if (TREE_CODE (vdef) == SSA_NAME)
+		  {
+		    gimple *vstmt = SSA_NAME_DEF_STMT (vdef);
+		    if (!last_stmt
+			|| vect_stmt_dominates_stmt_p (last_stmt, vstmt))
+		      last_stmt = vstmt;
+		  }
+	    }
+	}
       if (is_a <gphi *> (last_stmt))
 	si = gsi_after_labels (gimple_bb (last_stmt));
       else
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 32feec3e24b..07ab8d2c4e3 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2007,7 +2007,8 @@  extern bool can_duplicate_and_interleave_p (vec_info *, unsigned int, tree,
 					    tree * = NULL, tree * = NULL);
 extern void duplicate_and_interleave (vec_info *, gimple_seq *, tree,
 				      vec<tree>, unsigned int, vec<tree> &);
-extern int vect_get_place_in_interleaving_chain (stmt_vec_info, stmt_vec_info);
+extern int vect_get_place_in_interleaving_chain (stmt_vec_info, stmt_vec_info,
+						 bool = true);
 
 /* In tree-vect-patterns.c.  */
 /* Pattern recognition functions.