Message-ID: nycvar.YFH.7.76.2006191432570.4397@zhemvz.fhfr.qr

Sorry for the slow response…

Richard Biener <rguenther@suse.de> writes:
> This splits SLP load nodes with load permutation into a SLP
> load node (with load "permutation" removing gaps) and a
> lane permutation node.  The first and foremost goal of this
> is to be able to have a single SLP node for each load group
> so we can start making decisions about how to vectorize
> them factoring in all loads of that group.  The second
> goal is to eventually be able to optimize permutations by
> pushing them down where they can be combined from multiple
> children to the output.  We do have one very special case
> handled by vect_attempt_slp_rearrange_stmts doing it all
> the way down for reductions that are associative.

Sounds great!

> For example for
>
>   l1 = a[0]; l2 = a[1];
>   b[0] = l1; b[1] = l2;
>   c[0] = l2; c[1] = l1;
>
> we can avoid generating loads twice.  For
>
>   l1 = a[0]; l2 = a[1]; l3 = a[2];
>   b[0] = l1; b[1] = l2;
>   c[0] = l2; c[1] = l3;
>
> we will have a SLP load node with three lanes plus
> two lane permutation nodes selecting two lanes each.  In
> a loop context this will cause a VF of two and three
> loads per vectorized loop iteration (plus two permutes)
> while previously we used a VF of one with two loads
> and no permutation per loop iteration.  In the new
> scheme the number of loads is less but we pay with
> permutes and a higher VF.
>
> There is (bad) interaction with determining of the vectorization
> factor which for BB vectorization causes missed vectorizations
> because the "load parts of a dataref group directly" optimization
> is not (yet) reflected in the SLP graph.
>
> There is (bad) interaction with CTOR vectorization since we now
> get confused about where to insert vectorized stmts when combining
> two previously distinct SLP graphs.
> My immediate focus is on the SSA verification FAILs but the
> other part points at a missing piece in this - a "pass"
> that "optimizes" the SLP graph with respect to permutations
> and loads, ultimately for example deciding between using
> interleaving schemes, scalar loads, "SLP" + permutation,
> load-lanes, etc.; this is also the thing that blocks
> SLP-only (apart from teaching the few pieces that cannot do SLP
> to do SLP).
>
> I'm posting this mostly to make people think how it fits
> their future development and architecture features.

Yeah, the interleaving scheme is something we'd very much like for SVE2,
where for some operations that produce multiple vectors, it would be
better to organise them on an even/odd division instead of a low/high
division.  There are instructions both to produce and to consume values
in odd/even form, so in many cases no explicit permutation would be
needed.

I've also seen cases for downward loops where we reverse vectors after
loading and reverse them again before storing, even though the loop
could easily have operated on the reversed vectors directly.

Another thing we'd like for SVE in general is to allow vectors *with*
gaps throughout the SLP graph, since with predication it's simple to
ignore any inactive element where necessary.  This is often much cheaper
than packing the active elements together to remove gaps.  For example:

  a[0] += 1;
  a[2] += 1;
  a[3] += 1;

should just be a predicated load, add and store, with no shuffling.

Even on targets without predication, it might be better to leave
the gaps in for part of the graph, e.g. if two loads feed a single
permute.  So it would be good if the node representation allowed
arbitrary permutations, including with “dead” elements at any
position.  (I realise that isn't news and that keeping the gap
removal with the load was just “for now” :))

I guess this to some extent feeds into a longstanding TODO to allow
“don't care” elements in things like vec_perm_indices.

Thanks,
Richard
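[For readers less familiar with predication: the behaviour described above can be sketched in plain scalar C.  This is an illustration of the semantics only, not SVE intrinsics; the helper name `masked_add1`, the fixed 4-lane width and the byte-per-lane predicate are all invented for the example.]

```c
#include <assert.h>
#include <string.h>

#define VL 4  /* assumed vector length, for illustration only */

/* Simulate one predicated load + add + store over a 4-lane vector:
   active lanes are updated in place, inactive lanes (the "gaps")
   are left untouched and never shuffled away.  */
static void masked_add1 (int *a, const unsigned char pred[VL])
{
  int vec[VL];
  memcpy (vec, a, sizeof vec);   /* predicated load (gap lanes loaded but ignored) */
  for (int i = 0; i < VL; i++)
    if (pred[i])
      vec[i] += 1;               /* add only on active lanes */
  for (int i = 0; i < VL; i++)
    if (pred[i])
      a[i] = vec[i];             /* predicated store leaves a[1] intact */
}
```

With `pred = {1, 0, 1, 1}` this performs exactly the `a[0] += 1; a[2] += 1; a[3] += 1;` example with no gap-removing permutation.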
On Tue, 30 Jun 2020, Richard Sandiford wrote:

> Sorry for the slow response…

No problem.

> Richard Biener <rguenther@suse.de> writes:
> > This splits SLP load nodes with load permutation into a SLP
> > load node (with load "permutation" removing gaps) and a
> > lane permutation node.  The first and foremost goal of this
> > is to be able to have a single SLP node for each load group
> > so we can start making decisions about how to vectorize
> > them factoring in all loads of that group.  The second
> > goal is to eventually be able to optimize permutations by
> > pushing them down where they can be combined from multiple
> > children to the output.  We do have one very special case
> > handled by vect_attempt_slp_rearrange_stmts doing it all
> > the way down for reductions that are associative.
>
> Sounds great!
>
> > For example for
> >
> >   l1 = a[0]; l2 = a[1];
> >   b[0] = l1; b[1] = l2;
> >   c[0] = l2; c[1] = l1;
> >
> > we can avoid generating loads twice.  For
> >
> >   l1 = a[0]; l2 = a[1]; l3 = a[2];
> >   b[0] = l1; b[1] = l2;
> >   c[0] = l2; c[1] = l3;
> >
> > we will have a SLP load node with three lanes plus
> > two lane permutation nodes selecting two lanes each.  In
> > a loop context this will cause a VF of two and three
> > loads per vectorized loop iteration (plus two permutes)
> > while previously we used a VF of one with two loads
> > and no permutation per loop iteration.  In the new
> > scheme the number of loads is less but we pay with
> > permutes and a higher VF.
> >
> > There is (bad) interaction with determining of the vectorization
> > factor which for BB vectorization causes missed vectorizations
> > because the "load parts of a dataref group directly" optimization
> > is not (yet) reflected in the SLP graph.
> >
> > There is (bad) interaction with CTOR vectorization since we now
> > get confused about where to insert vectorized stmts when combining
> > two previously distinct SLP graphs.
> > My immediate focus is on the SSA verification FAILs but the
> > other part points at a missing piece in this - a "pass"
> > that "optimizes" the SLP graph with respect to permutations
> > and loads, ultimately for example deciding between using
> > interleaving schemes, scalar loads, "SLP" + permutation,
> > load-lanes, etc.; this is also the thing that blocks
> > SLP-only (apart from teaching the few pieces that cannot do SLP
> > to do SLP).
> >
> > I'm posting this mostly to make people think how it fits
> > their future development and architecture features.
>
> Yeah, the interleaving scheme is something we'd very much like for SVE2,
> where for some operations that produce multiple vectors, it would be
> better to organise them on an even/odd division instead of a low/high
> division.  There are instructions both to produce and to consume values
> in odd/even form, so in many cases no explicit permutation would be
> needed.
>
> I've also seen cases for downward loops where we reverse vectors after
> loading and reverse them again before storing, even though the loop
> could easily have operated on the reversed vectors directly.
>
> Another thing we'd like for SVE in general is to allow vectors *with*
> gaps throughout the SLP graph, since with predication it's simple to
> ignore any inactive element where necessary.  This is often much cheaper
> than packing the active elements together to remove gaps.  For example:
>
>   a[0] += 1;
>   a[2] += 1;
>   a[3] += 1;
>
> should just be a predicated load, add and store, with no shuffling.
>
> Even on targets without predication, it might be better to leave
> the gaps in for part of the graph, e.g. if two loads feed a single
> permute.  So it would be good if the node representation allowed
> arbitrary permutations, including with “dead” elements at any
> position.  (I realise that isn't news and that keeping the gap
> removal with the load was just “for now” :))

Hmm, OK.
So I'm still experimenting with how to incrementally push this kind of
change to master.  Unfortunately it resists any "serious" change
because we've painted us into a corner with respect to load and store
handling ;)  Well, it just means the change will need to be bigger
than anticipated.

As for inactive lanes, for SVE this means a mask register for each
operation, correct?  At the moment the SLP graph is a representation
of the scalar GIMPLE IL and we cannot really change that until we
commit to a vectorization factor and more.  So an inactive lane is
simply not there and a "load" is simply another way of building up a
vector from scalar stmt results, much like those "built from scalars"
SLP nodes, but we of course make them special because we have those
DR groups that are used.

So with the patch as posted the "gaps" are represented in the load
permutation of the "loads", which is where you could create mask
registers from and simply push them to SLP parents (up to the point
you get to a parent whose children have differing masks...).

I think we're going to have a set of postprocessing steps on the
initial SLP graph for both "optimizing" where (and if) to do
permutations and whether to elide gap removal in favor of masks (and
we'd then annotate the SLP nodes with the active mask).

All of this requires pushing back some of the early decisions (I've
mostly identified vectorization factor determining) but also doing
load/store classification early.  For example if we end up with
strided loads or stores such operations can fuse with any permutation
at no cost.

At the moment I'm continuing to poke the code for its least resistance
for introducing at least parts of the machinery.  I'm targeting
postprocessing for merging of identical permutes across operations
like it appears for

  a[0] = b[1] + c[1];
  a[1] = b[0] + c[0];

> I guess this to some extent feeds into a longstanding TODO to allow
> “don't care” elements in things like vec_perm_indices.
Hmm, so when there's no masking and we don't want to remove gaps we'd
replace the permutation removing the gaps with an operation that turns
the inactive lanes to zero (bitwise and) or all ones (bitwise or).
Tracking the number of "vector" lanes compared to the number of "real
scalar" lanes in a SLP node might be enough to handle this more or
less transparently.

Richard.
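[A scalar sketch of the bitwise-and variant described above, for illustration: instead of a permutation that packs active lanes together, the vector stays at full width and inactive lanes are forced to zero.  The helper name `zero_inactive` and the fixed 4-lane width are invented for the example; a real implementation would AND against a vector constant.]

```c
#include <assert.h>

#define VL 4  /* assumed vector length, for illustration only */

/* Replace the gap-removing permutation by a bitwise AND with a
   constant mask: all-ones for active lanes, zero for the gaps.
   The lane count stays the same; only the gap contents change.  */
static void zero_inactive (int vec[VL], const int active[VL])
{
  for (int i = 0; i < VL; i++)
    vec[i] &= active[i] ? -1 : 0;  /* -1 is all-ones in two's complement */
}
```

The bitwise-or variant would use `|=` with zero for active lanes and all-ones for the gaps instead.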
Richard Biener <rguenther@suse.de> writes:
>> Another thing we'd like for SVE in general is to allow vectors *with*
>> gaps throughout the SLP graph, since with predication it's simple to
>> ignore any inactive element where necessary.  This is often much cheaper
>> than packing the active elements together to remove gaps.  For example:
>>
>>   a[0] += 1;
>>   a[2] += 1;
>>   a[3] += 1;
>>
>> should just be a predicated load, add and store, with no shuffling.
>>
>> Even on targets without predication, it might be better to leave
>> the gaps in for part of the graph, e.g. if two loads feed a single
>> permute.  So it would be good if the node representation allowed
>> arbitrary permutations, including with “dead” elements at any
>> position.  (I realise that isn't news and that keeping the gap
>> removal with the load was just “for now” :))
>
> Hmm, OK.  So I'm still experimenting with how to incrementally push
> this kind of change to master.  Unfortunately it resists any
> "serious" change because we've painted us into a corner with respect
> to load and store handling ;)  Well, it just means the change will
> need to be bigger than anticipated.
>
> As for inactive lanes, for SVE this means a mask register for each
> operation, correct?

Certainly for potentially trapping ops and reductions, yeah.
For others it really doesn't matter (and those wouldn't require
a target that supports predication).

So I don't think having gaps necessarily means we have a mask.
Being able to attach masks to nodes (perhaps independently of gaps)
would be useful though.

> At the moment the SLP graph is a representation
> of the scalar GIMPLE IL and we cannot really change that until we
> commit to a vectorization factor and more.  So an inactive lane
> is simply not there and a "load" is simply another way of building
> up a vector from scalar stmt results much like those "built from scalars"
> SLP nodes but we of course make them special because we have those
> DR groups that are used.
> So with the patch as posted the "gaps" are represented in the
> load permutation of the "loads" which is where you could create
> mask registers from and simply push them to SLP parents (up to
> the point you get to a parent whose children have differing masks...).

OK.  But I'd argue that's true of permutations too.  At the point
that the SLP graph just represents scalar gimple IL, we can simply
permute SLP_TREE_SCALAR_STMTS and not have any explicit permute
operations in the graph.  Permutations and gaps only come into play
once we add more vector-related information or restrictions.

> I think we're going to have a set of postprocessing steps on the
> initial SLP graph for both "optimizing" where (and if) to do permutations
> and whether to elide gap removal in favor of masks (and we'd then
> annotate the SLP nodes with the active mask).

So we'd start out without any permutations and gaps, and then add them
as a postprocessing step based on what the target can handle?
If so, sounds good to me FWIW.

> All of this requires pushing back some of the early decisions
> (I've mostly identified vectorization factor determining) but also
> do load/store classification early.  For example if we end up
> with strided loads or stores such operations can fuse with any
> permutation at no cost.
>
> At the moment I'm continuing to poke the code for its least
> resistance for introducing at least parts of the machinery.
> I'm targeting postprocessing for merging of identical
> permutes across operations like it appears for
>
>   a[0] = b[1] + c[1];
>   a[1] = b[0] + c[0];
>
>> I guess this to some extent feeds into a longstanding TODO to allow
>> “don't care” elements in things like vec_perm_indices.
>
> Hmm, so when there's no masking and we don't want to remove gaps
> we'd replace the permutation removing the gaps with an operation
> that turns the inactive lanes to zero (bitwise and) or all ones
> (bitwise or).
> Tracking the number of "vector" lanes compared
> to the number of "real scalar" lanes in a SLP node might be enough
> to handle this more or less transparently.

I was also thinking of cases like:

  a[0] += b[0];
  a[1] += c[1];
  a[2] += b[2];
  a[3] += c[3];

(for all targets).  If we can somehow be smart about the handling of
“c” and load a vector at c[0] rather than c[1], the rhs of the
addition could be a blend of:

  b[0] ____ b[2] ____
  ____ c[1] ____ c[3]

Or, even if we can't:

  b[0] ____ b[2] ____
  c[1] ____ c[3] ____

is a natural permute on some targets (e.g. Arm ones).

This of course requires the usual proof that the other elements of b[]
and c[] can be safely loaded.

Thanks,
Richard
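[The blend case above can be mimicked in scalar C to make the lane selection explicit.  Illustrative only: `blended_add`, the fixed 4-lane width and the assumption that all of b[0..3] and c[0..3] may be loaded are invented for the example.]

```c
#include <assert.h>

#define VL 4  /* assumed vector length, for illustration only */

/* Build the rhs of the mixed additions as a blend of two full-width
   loads: lanes 0 and 2 come from b[], lanes 1 and 3 from c[],
   assuming b[0..3] and c[0..3] can both be safely loaded.  */
static void blended_add (int a[VL], const int b[VL], const int c[VL])
{
  int rhs[VL];
  for (int i = 0; i < VL; i++)
    rhs[i] = (i & 1) ? c[i] : b[i];  /* blend: b[0] c[1] b[2] c[3] */
  for (int i = 0; i < VL; i++)
    a[i] += rhs[i];                  /* one unmasked vector add */
}
```

The unused lanes of each load (b[1], b[3], c[0], c[2]) are the "____" slots in the diagrams: loaded but never consumed.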
On Tue, 30 Jun 2020, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> >> Another thing we'd like for SVE in general is to allow vectors *with*
> >> gaps throughout the SLP graph, since with predication it's simple to
> >> ignore any inactive element where necessary.  This is often much cheaper
> >> than packing the active elements together to remove gaps.  For example:
> >>
> >>   a[0] += 1;
> >>   a[2] += 1;
> >>   a[3] += 1;
> >>
> >> should just be a predicated load, add and store, with no shuffling.
> >>
> >> Even on targets without predication, it might be better to leave
> >> the gaps in for part of the graph, e.g. if two loads feed a single
> >> permute.  So it would be good if the node representation allowed
> >> arbitrary permutations, including with “dead” elements at any
> >> position.  (I realise that isn't news and that keeping the gap
> >> removal with the load was just “for now” :))
> >
> > Hmm, OK.  So I'm still experimenting with how to incrementally push
> > this kind of change to master.  Unfortunately it resists any
> > "serious" change because we've painted us into a corner with respect
> > to load and store handling ;)  Well, it just means the change will
> > need to be bigger than anticipated.
> >
> > As for inactive lanes, for SVE this means a mask register for each
> > operation, correct?
>
> Certainly for potentially trapping ops and reductions, yeah.
> For others it really doesn't matter (and those wouldn't require
> a target that supports predication).
>
> So I don't think having gaps necessarily means we have a mask.
> Being able to attach masks to nodes (perhaps independently of gaps)
> would be useful though.
>
> > At the moment the SLP graph is a representation
> > of the scalar GIMPLE IL and we cannot really change that until we
> > commit to a vectorization factor and more.
> > So an inactive lane
> > is simply not there and a "load" is simply another way of building
> > up a vector from scalar stmt results much like those "built from scalars"
> > SLP nodes but we of course make them special because we have those
> > DR groups that are used.
> >
> > So with the patch as posted the "gaps" are represented in the
> > load permutation of the "loads" which is where you could create
> > mask registers from and simply push them to SLP parents (up to
> > the point you get to a parent whose children have differing masks...).
>
> OK.  But I'd argue that's true of permutations too.  At the point
> that the SLP graph just represents scalar gimple IL, we can simply
> permute SLP_TREE_SCALAR_STMTS and not have any explicit permute
> operations in the graph.

That's true when the only permutes are in leafs.  The
SLP_TREE_SCALAR_STMTS impose an order of lanes, so once different
parents need a different order we need explicit permute nodes swapping
SLP_TREE_SCALAR_STMTS.  Of course that's only because we have combined
multiple scalar stmts into a single SLP node which really is ...

> Permutations and gaps only come into play once we add more
> vector-related information or restrictions.

... what those "vector-related" restrictions are right now.

> > I think we're going to have a set of postprocessing steps on the
> > initial SLP graph for both "optimizing" where (and if) to do permutations
> > and whether to elide gap removal in favor of masks (and we'd then
> > annotate the SLP nodes with the active mask).
>
> So we'd start out without any permutations and gaps, and then add
> them as a postprocessing step based on what the target can handle?
> If so, sounds good to me FWIW.

If you start with the notion of loads and stores being
compositions/extracts then yes - based on how we want to vectorize
those operations we'd expose any required permutation explicitly so we
can then combine & optimize those.
For the existing patches I tried to relate loads involving the same
dataref group by introducing a shared SLP node accessing the whole
group, so that's not starting with no permute and it comes with some
issues.  If we don't relate loads from the same dataref group early
(during SLP discovery, that is), then we have to try (again based on
cost) doing that during that postprocessing, which then becomes more
complicated.  As said, this is work in progress and I'm experimenting
in multiple directions ...

> > All of this requires pushing back some of the early decisions
> > (I've mostly identified vectorization factor determining) but also
> > do load/store classification early.  For example if we end up
> > with strided loads or stores such operations can fuse with any
> > permutation at no cost.
> >
> > At the moment I'm continuing to poke the code for its least
> > resistance for introducing at least parts of the machinery.
> > I'm targeting postprocessing for merging of identical
> > permutes across operations like it appears for
> >
> >   a[0] = b[1] + c[1];
> >   a[1] = b[0] + c[0];
> >
> >> I guess this to some extent feeds into a longstanding TODO to allow
> >> “don't care” elements in things like vec_perm_indices.
> >
> > Hmm, so when there's no masking and we don't want to remove gaps
> > we'd replace the permutation removing the gaps with an operation
> > that turns the inactive lanes to zero (bitwise and) or all ones
> > (bitwise or).  Tracking the number of "vector" lanes compared
> > to the number of "real scalar" lanes in a SLP node might be enough
> > to handle this more or less transparently.
>
> I was also thinking of cases like:
>
>   a[0] += b[0];
>   a[1] += c[1];
>   a[2] += b[2];
>   a[3] += c[3];
>
> (for all targets).
>
> If we can somehow be smart about the handling of “c” and load a vector
> at c[0] rather than c[1], the rhs of the addition could be a blend of:
>
>   b[0] ____ b[2] ____
>   ____ c[1] ____ c[3]
>
> Or, even if we can't:
>
>   b[0] ____ b[2] ____
>   c[1] ____ c[3] ____
>
> is a natural permute on some targets (e.g. Arm ones).
>
> This of course requires the usual proof that the other elements of b[]
> and c[] can be safely loaded.

Yeah, that's another good case (we'd currently outright reject this
because we refer to two different dataref groups ...).

Richard.
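[As an aside for readers: the permute-merging opportunity mentioned earlier in the thread, a[0] = b[1] + c[1]; a[1] = b[0] + c[0];, relies on the fact that lane-wise operations commute with lane permutations, so permute(b) + permute(c) equals permute(b + c).  A scalar sketch with an invented helper name and only two lanes:]

```c
#include <assert.h>

/* Instead of permuting both operands before the add (two permutes),
   add the unpermuted vectors and apply one permute to the result.  */
static void add_then_permute (int a[2], const int b[2], const int c[2])
{
  int sum[2] = { b[0] + c[0], b[1] + c[1] };  /* unpermuted vector add */
  a[0] = sum[1];                              /* single reversing permute */
  a[1] = sum[0];
}
```

The result matches computing a[0] = b[1] + c[1] and a[1] = b[0] + c[0] directly, but with one permute instead of two.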
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 5c169f37022..68b0750db44 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -244,7 +244,8 @@ vect_contains_pattern_stmt_p (vec<stmt_vec_info> stmts)
 
 int
 vect_get_place_in_interleaving_chain (stmt_vec_info stmt_info,
-                                      stmt_vec_info first_stmt_info)
+                                      stmt_vec_info first_stmt_info,
+                                      bool add_gaps)
 {
   stmt_vec_info next_stmt_info = first_stmt_info;
   int result = 0;
@@ -258,7 +259,7 @@ vect_get_place_in_interleaving_chain (stmt_vec_info stmt_info,
         return result;
       next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
       if (next_stmt_info)
-        result += DR_GROUP_GAP (next_stmt_info);
+        result += add_gaps ? DR_GROUP_GAP (next_stmt_info) : 1;
     }
   while (next_stmt_info);
@@ -1243,27 +1244,104 @@ vect_build_slp_tree_2 (vec_info *vinfo,
     }
   else
     {
-      *max_nunits = this_max_nunits;
-      (*tree_size)++;
+      /* We represent a grouped load as a vector build of all
+         lanes of the load group in memory order (a load
+         with possibly a load permutation compacting a group
+         with gaps) plus a lane permutation node that selects
+         and permutes the load group lanes.  That allows
+         sharing of the vector build between different
+         SLP subgraphs like for
+           l1 = a[0]; l2 = a[1];
+           b[0] = l1; b[1] = l2;
+           c[0] = l2; c[1] = l1;
+         where we can avoid generating loads twice.  For
+           l1 = a[0]; l2 = a[1]; l3 = a[2];
+           b[0] = l1; b[1] = l2;
+           c[0] = l2; c[1] = l3;
+         we will have a SLP load node with three lanes plus
+         two lane permutation nodes selecting two lanes each.
+
+         ??? This has ripple-down effects for vectorization
+         factor computation and bad interaction with vectorized
+         stmt placing.  */
       node = vect_create_new_slp_node (stmts, 0);
       SLP_TREE_VECTYPE (node) = vectype;
       /* And compute the load permutation.  Whether it is actually
          a permutation depends on the unrolling factor which is
          decided later.
         */
       vec<unsigned> load_permutation;
+      vec<std::pair<unsigned, unsigned> > lane_permutation;
       int j;
       stmt_vec_info load_info;
+      lane_permutation.create (group_size);
       load_permutation.create (group_size);
       stmt_vec_info first_stmt_info
         = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (node)[0]);
-      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), j, load_info)
+      for (load_info = first_stmt_info; load_info;
+           load_info = DR_GROUP_NEXT_ELEMENT (load_info))
         {
           int load_place = vect_get_place_in_interleaving_chain
-              (load_info, first_stmt_info);
+              (load_info, first_stmt_info, true);
           gcc_assert (load_place != -1);
           load_permutation.safe_push (load_place);
         }
-      SLP_TREE_LOAD_PERMUTATION (node) = load_permutation;
+      bool any_lane_permute = false;
+      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), j, load_info)
+        {
+          int lane_place = vect_get_place_in_interleaving_chain
+              (load_info, first_stmt_info, false);
+          gcc_assert (lane_place != -1);
+          lane_permutation.quick_push (std::make_pair (0, lane_place));
+          if (lane_place != j)
+            any_lane_permute = true;
+        }
+      /* When there's a lane permutation or selection make a separate
+         SLP node for the full group load in lane-order (but keep handling
+         gaps with load permutations for now).  */
+      if (!any_lane_permute && load_permutation.length () == group_size)
+        {
+          lane_permutation.release ();
+          SLP_TREE_LOAD_PERMUTATION (node) = load_permutation;
+        }
+      else
+        {
+          /* Build a SLP node loading all members from the group
+             in order with gaps not represented.
+             ??? We easily end up with an odd number of lanes here
+             requiring a larger VF.  In some cases this is desirable
+             but as gaps are not explicitly represented here those
+             present an issue (see dg.exp/vect/slp-23.c for example).
+             That also can end up requiring unsupported load
+             permutations.
+             */
+          load_permutation.release ();
+          SLP_TREE_LANE_PERMUTATION (node) = lane_permutation;
+          SLP_TREE_CODE (node) = VEC_PERM_EXPR;
+          vec<stmt_vec_info> loads;
+          loads.create (group_size);
+          for (load_info = first_stmt_info; load_info;
+               load_info = DR_GROUP_NEXT_ELEMENT (load_info))
+            loads.safe_push (load_info);
+          bool *lmatches = XALLOCAVEC (bool, loads.length ());
+          slp_tree load_node = vect_build_slp_tree (vinfo, loads,
+                                                    loads.length (),
+                                                    &this_max_nunits,
+                                                    lmatches, npermutes,
+                                                    &this_tree_size,
+                                                    bst_map);
+          /* ??? For BB vect the above doesn't always work.  */
+          if (!load_node)
+            {
+              matches[0] = false;
+              loads.release ();
+              /* The caller still owns defs if we fail.  */
+              SLP_TREE_SCALAR_STMTS (node) = vNULL;
+              vect_free_slp_tree (node, false);
+              return NULL;
+            }
+          SLP_TREE_CHILDREN (node).safe_push (load_node);
+        }
+      *tree_size += this_tree_size + 1;
+      *max_nunits = this_max_nunits;
       return node;
     }
 }
@@ -1781,7 +1859,8 @@ vect_slp_rearrange_stmts (slp_tree node, unsigned int group_size,
    otherwise return false.  */
 
 static bool
-vect_attempt_slp_rearrange_stmts (slp_instance slp_instn)
+vect_attempt_slp_rearrange_stmts (slp_instance slp_instn,
+                                  poly_uint64 unrolling_factor)
 {
   unsigned int i, j;
   unsigned int lidx;
@@ -1807,9 +1886,13 @@ vect_attempt_slp_rearrange_stmts (slp_instance slp_instn)
     }
 
   /* Check that the loads in the first sequence are different and there
-     are no gaps between them.  */
+     are no gaps between them.  Also check there is any permutation.  */
+  /* ??? We now would need to check lane permutations or more generally
+     apply optimization of permutations by pushing them up/down the
+     graph.
+     */
   auto_sbitmap load_index (group_size);
   bitmap_clear (load_index);
+  bool any = false;
   FOR_EACH_VEC_ELT (node->load_permutation, i, lidx)
     {
       if (lidx >= group_size)
         return false;
       if (bitmap_bit_p (load_index, lidx))
         return false;
+      if (lidx != i)
+        any = true;
       bitmap_set_bit (load_index, lidx);
     }
+  if (!any)
+    return false;
   for (i = 0; i < group_size; i++)
     if (!bitmap_bit_p (load_index, i))
       return false;
@@ -1844,7 +1931,6 @@ vect_attempt_slp_rearrange_stmts (slp_instance slp_instn)
                                  node->load_permutation, visited);
 
   /* We are done, no actual permutations need to be generated.  */
-  poly_uint64 unrolling_factor = SLP_INSTANCE_UNROLLING_FACTOR (slp_instn);
   FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (slp_instn), i, node)
     {
       stmt_vec_info first_stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
@@ -1971,6 +2057,41 @@ calculate_unrolling_factor (poly_uint64 nunits, unsigned int group_size)
   return exact_div (common_multiple (nunits, group_size), group_size);
 }
 
+/* Since the group size now can change in the SLP graph the minimum
+   unroll factor needs to be decided on a node-by-node base and a common
+   multiple be found for the overall unrolling.
+   */
+
+static poly_uint64
+calculate_unrolling_factor (slp_tree node,
+                            hash_map<slp_tree, poly_uint64> &visited)
+{
+  bool existed_p;
+  poly_uint64 &uf = visited.get_or_insert (node, &existed_p);
+  if (existed_p)
+    return uf;
+
+  poly_uint64 this_uf
+    = calculate_unrolling_factor (node->max_nunits, SLP_TREE_LANES (node));
+  uf = this_uf;
+
+  slp_tree child;
+  unsigned i;
+  FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
+    {
+      poly_uint64 child_uf = calculate_unrolling_factor (child, visited);
+      this_uf = force_common_multiple (child_uf, this_uf);
+    }
+  *visited.get (node) = this_uf;
+  return this_uf;
+}
+
+static poly_uint64
+calculate_unrolling_factor (slp_tree node)
+{
+  hash_map<slp_tree, poly_uint64> visited;
+  return calculate_unrolling_factor (node, visited);
+}
+
 /* Analyze an SLP instance starting from a group of grouped stores.  Call
    vect_build_slp_tree to build a tree of packed stmts if possible.
    Return FALSE if it's impossible to SLP any stmt in the loop.  */
@@ -2097,9 +2218,11 @@ vect_analyze_slp_instance (vec_info *vinfo,
                            &tree_size, bst_map);
   if (node != NULL)
     {
+      /* ??? Instead of walking the SLP tree here record the minimum
+         unroll factor instead of max_nunits in the SLP nodes?  */
       /* Calculate the unrolling factor based on the smallest type.  */
       poly_uint64 unrolling_factor
-        = calculate_unrolling_factor (max_nunits, group_size);
+        = calculate_unrolling_factor (node);
 
       if (maybe_ne (unrolling_factor, 1U)
          && is_a <bb_vec_info> (vinfo))
@@ -2339,8 +2462,28 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
       vect_free_slp_tree ((*it).second, false);
   delete bst_map;
 
-  /* Optimize permutations in SLP reductions.  */
   slp_instance instance;
+  /* Get the overall SLP induced unrolling factor.  */
+  poly_uint64 unrolling_factor = 1;
+  FOR_EACH_VEC_ELT (vinfo->slp_instances, i, instance)
+    {
+      /* All unroll factors have the form:
+
+           GET_MODE_SIZE (vinfo->vector_mode) * X
+
+         for some rational X, so they must have a common multiple.
+         */
+      unrolling_factor
+        = force_common_multiple (unrolling_factor,
+                                 SLP_INSTANCE_UNROLLING_FACTOR (instance));
+
+    }
+  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
+    LOOP_VINFO_SLP_UNROLLING_FACTOR (loop_vinfo) = unrolling_factor;
+  else
+    /* For BB vectorization we removed any entries that need unrolling.  */
+    gcc_assert (known_eq (unrolling_factor, 1U));
+
+  /* Optimize permutations in SLP reductions.  */
   FOR_EACH_VEC_ELT (vinfo->slp_instances, i, instance)
     {
       slp_tree node = SLP_INSTANCE_TREE (instance);
@@ -2349,7 +2492,7 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
          In reduction chain the order of the loads is not important.  */
       if (!STMT_VINFO_DATA_REF (stmt_info)
          && !REDUC_GROUP_FIRST_ELEMENT (stmt_info))
-        vect_attempt_slp_rearrange_stmts (instance);
+        vect_attempt_slp_rearrange_stmts (instance, unrolling_factor);
     }
 
   /* Gather all loads in the SLP graph.  */
@@ -2434,7 +2577,6 @@ bool
 vect_make_slp_decision (loop_vec_info loop_vinfo)
 {
   unsigned int i;
-  poly_uint64 unrolling_factor = 1;
   vec<slp_instance> slp_instances = LOOP_VINFO_SLP_INSTANCES (loop_vinfo);
   slp_instance instance;
   int decided_to_slp = 0;
@@ -2444,15 +2586,6 @@ vect_make_slp_decision (loop_vec_info loop_vinfo)
   FOR_EACH_VEC_ELT (slp_instances, i, instance)
     {
       /* FORNOW: SLP if you can.  */
-      /* All unroll factors have the form:
-
-           GET_MODE_SIZE (vinfo->vector_mode) * X
-
-         for some rational X, so they must have a common multiple.  */
-      unrolling_factor
-        = force_common_multiple (unrolling_factor,
-                                 SLP_INSTANCE_UNROLLING_FACTOR (instance));
-
       /* Mark all the stmts that belong to INSTANCE as PURE_SLP stmts.  Later we
          call vect_detect_hybrid_slp () to find stmts that need hybrid SLP and
          loop-based vectorization.  Such stmts will be marked as HYBRID.
         */
@@ -2460,14 +2593,12 @@ vect_make_slp_decision (loop_vec_info loop_vinfo)
       decided_to_slp++;
     }
 
-  LOOP_VINFO_SLP_UNROLLING_FACTOR (loop_vinfo) = unrolling_factor;
-
   if (decided_to_slp && dump_enabled_p ())
     {
       dump_printf_loc (MSG_NOTE, vect_location,
                       "Decided to SLP %d instances. Unrolling factor ",
                       decided_to_slp);
-      dump_dec (MSG_NOTE, unrolling_factor);
+      dump_dec (MSG_NOTE, LOOP_VINFO_SLP_UNROLLING_FACTOR (loop_vinfo));
       dump_printf (MSG_NOTE, "\n");
     }
 
@@ -4205,33 +4336,50 @@ vect_schedule_slp_instance (vec_info *vinfo,
                     "------>vectorizing SLP node starting from: %G",
                     stmt_info->stmt);
 
-  /* Vectorized stmts go before the last scalar stmt which is where
-     all uses are ready.  */
-  stmt_vec_info last_stmt_info = vect_find_last_scalar_stmt_in_slp (node);
-  if (last_stmt_info)
-    si = gsi_for_stmt (last_stmt_info->stmt);
+  /* Vectorized stmts for leafs and nodes with datarefs go before
+     the last scalar stmt.  */
+  if (SLP_TREE_CHILDREN (node).is_empty ()
+      || (SLP_TREE_CODE (node) != VEC_PERM_EXPR
+          && STMT_VINFO_DATA_REF (SLP_TREE_REPRESENTATIVE (node))))
+    {
+      stmt_vec_info last_stmt_info = vect_find_last_scalar_stmt_in_slp (node);
+      si = gsi_for_stmt (last_stmt_info->stmt);
+    }
   else
     {
-      /* Or if we do not have 1:1 matching scalar stmts emit after the
-         children vectorized defs.  */
+      /* Else emit after the children vectorized defs.  */
       gimple *last_stmt = NULL;
       FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
-        /* ??? With only external defs the following breaks.  Note
-           external defs do not get all vector init stmts generated
-           at the same place.  */
-        if (SLP_TREE_DEF_TYPE (child) == vect_internal_def)
-          {
-            /* We are emitting all vectorized stmts in the same place and
-               the last one is the last.
-               ??? Unless we have a load permutation applied and that
-               figures to reuse an earlier generated load.
-               */
-            unsigned j;
-            gimple *vstmt;
-            FOR_EACH_VEC_ELT (SLP_TREE_VEC_STMTS (child), j, vstmt)
-              if (!last_stmt
-                  || vect_stmt_dominates_stmt_p (last_stmt, vstmt))
-                last_stmt = vstmt;
-          }
+        {
+          if (SLP_TREE_DEF_TYPE (child) == vect_internal_def)
+            {
+              /* We are emitting all vectorized stmts in the same place and
+                 the last one is the last.
+                 ??? Unless we have a load permutation applied and that
+                 figures to reuse an earlier generated load.  */
+              unsigned j;
+              gimple *vstmt;
+              FOR_EACH_VEC_ELT (SLP_TREE_VEC_STMTS (child), j, vstmt)
+                if (!last_stmt
+                    || vect_stmt_dominates_stmt_p (last_stmt, vstmt))
+                  last_stmt = vstmt;
+            }
+          else
+            {
+              /* For externals we have to look at all defs since their
+                 insertion place is decided per vector.  */
+              unsigned j;
+              tree vdef;
+              FOR_EACH_VEC_ELT (SLP_TREE_VEC_DEFS (child), j, vdef)
+                if (TREE_CODE (vdef) == SSA_NAME)
+                  {
+                    gimple *vstmt = SSA_NAME_DEF_STMT (vdef);
+                    if (!last_stmt
+                        || vect_stmt_dominates_stmt_p (last_stmt, vstmt))
+                      last_stmt = vstmt;
+                  }
+            }
+        }
       if (is_a <gphi *> (last_stmt))
         si = gsi_after_labels (gimple_bb (last_stmt));
       else
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 32feec3e24b..07ab8d2c4e3 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2007,7 +2007,8 @@ extern bool can_duplicate_and_interleave_p (vec_info *, unsigned int, tree,
                                            tree * = NULL, tree * = NULL);
 extern void duplicate_and_interleave (vec_info *, gimple_seq *, tree,
                                      vec<tree>, unsigned int, vec<tree> &);
-extern int vect_get_place_in_interleaving_chain (stmt_vec_info, stmt_vec_info);
+extern int vect_get_place_in_interleaving_chain (stmt_vec_info, stmt_vec_info,
+                                                 bool = true);
 
 /* In tree-vect-patterns.c.  */
 /* Pattern recognition functions.