Message ID: f9ca7491d8f3c8e35a551fbbe71a0d51@linux.ibm.com
"Kewen.Lin" <linkw@linux.ibm.com> writes: > Hi Richard, > > Thanks for the review again! > > on 2020/7/25 上午12:21, Richard Sandiford wrote: >> "Kewen.Lin" <linkw@linux.ibm.com> writes: >> >> Thanks, the rearrangement of the existing code looks good. Could you >> split that and the new LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) stuff >> out into separate patches? >> > > Splitted to https://gcc.gnu.org/pipermail/gccpatches/2020July/550691.html. > > errr... that subject should be with prefix "[PATCH] vect:". > > [snip ...] > (Some comments in the snipped content will be done in v4) > >>> + here. */ >>> + >>> + /* For now we only operate lengthbased partial vectors on Power, >>> + which has constant VF all the time, we need some tweakings below >>> + if it doesn't hold in future. */ >>> + gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()); >> >> Where do you rely on this? There didn't seem to be any obvious >> to_constant uses. Since this is “only” a cost calculation, we should >> be using assumed_vf. > > Sorry for the confusion. This was intended for the poly things like > VF or nitems_per_ctrl which isn't constant during compilation time, > then get people's attention on the possible runtime cost on things like > scaling up for nitems_step etc. But I just realized that the computations > like the multiply with another constant can operate on the coefficient, > it looks there is no runtime cost then? If so, I think I thought too > much before. ;) > >>>  prologue_cost_vec.release (); >>>  epilogue_cost_vec.release (); >>> + (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt, scalar_stmt, >>> + NULL, NULL_TREE, 0, vect_prologue); >>> + (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, scalar_stmt, >>> + NULL, NULL_TREE, 0, vect_body); >> >> IMO this seems to be reproducing too much of the functions that you >> referred to. And the danger with that is that they could easily >> get out of sync later. > > Good point! The original intention was to model as possible as we can, > to avoid some bad decision due to some unmodeled pieces, like the case > the loop body is small and some computation become nonnegligible. > The unsync risks seems also applied for other codes. How about adding > some "note" comments in those functions? > > The updated v4 is attached by addressing your comments as well as Segher's > comments. > > BR, > Kewen >  > > gcc/ChangeLog: > > * config/rs6000/rs6000.c (rs6000_adjust_vect_cost_per_loop): New > function. > (rs6000_finish_cost): Call rs6000_adjust_vect_cost_per_loop. > * treevectloop.c (vect_estimate_min_profitable_iters): Add cost > modeling for vector with length. > * treevectloopmanip.c (vect_set_loop_controls_directly): Update > function comment. > * treevectstmts.c (vect_gen_len): Likewise. > > diff git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c > index 009afc5f894..86ef584e09b 100644 >  a/gcc/config/rs6000/rs6000.c > +++ b/gcc/config/rs6000/rs6000.c > @@ 5177,6 +5177,34 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count, > return retval; > } > > +/* For some target specific vectorization cost which can't be handled per stmt, > + we check the requisite conditions and adjust the vectorization cost > + accordingly if satisfied. One typical example is to model shift cost for > + vector with length by counting number of required lengths under condition > + LOOP_VINFO_FULLY_WITH_LENGTH_P. 
*/ > + > +static void > +rs6000_adjust_vect_cost_per_loop (rs6000_cost_data *data) > +{ > + struct loop *loop = data>loop_info; > + gcc_assert (loop); > + loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop); > + > + if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) > + { > + rgroup_controls *rgc; > + unsigned int num_vectors_m1; > + unsigned int shift_cnt = 0; > + FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc) > + if (rgc>type) > + /* Each length needs one shift to fill into bits 07. */ > + shift_cnt += num_vectors_m1 + 1; > + > + rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt, > + NULL, NULL_TREE, 0, vect_body); > + } > +} > + > /* Implement targetm.vectorize.finish_cost. */ > > static void > @@ 5186,7 +5214,10 @@ rs6000_finish_cost (void *data, unsigned *prologue_cost, > rs6000_cost_data *cost_data = (rs6000_cost_data*) data; > > if (cost_data>loop_info) >  rs6000_density_test (cost_data); > + { > + rs6000_adjust_vect_cost_per_loop (cost_data); > + rs6000_density_test (cost_data); > + } > > /* Don't vectorize minimumvectorizationfactor, simple copy loops > that require versioning for any reason. The vectorization is at > diff git a/gcc/treevectloopmanip.c b/gcc/treevectloopmanip.c > index 490e7befea3..9d0e3fc525e 100644 >  a/gcc/treevectloopmanip.c > +++ b/gcc/treevectloopmanip.c > @@ 412,7 +412,10 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm, > > This means that we cannot guarantee that such an induction variable > would ever hit a value that produces a set of allfalse masks or zero >  lengths for RGC. */ > + lengths for RGC. > + > + Note that please check cost modeling whether needed to be updated in > + function vect_estimate_min_profitable_iters if any changes. */ > > static tree > vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo, > diff git a/gcc/treevectloop.c b/gcc/treevectloop.c > index 06cde4b1da3..a00160a7f2d 100644 >  a/gcc/treevectloop.c > +++ b/gcc/treevectloop.c > @@ 3798,6 +3798,70 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, > (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks  1, > vector_stmt, NULL, NULL_TREE, 0, vect_body); > } > + else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) > + { > + /* Referring to the functions vect_set_loop_condition_partial_vectors > + and vect_set_loop_controls_directly, we need to generate each > + length in the prologue and in the loop body if required. Although > + there are some possible optimizations, we consider the worst case > + here. */ > + > + /* For wrap around checking. */ > + tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo); > + unsigned int compare_precision = TYPE_PRECISION (compare_type); > + widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo); > + > + bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo); > + bool need_iterate_p > + = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo) > + && !vect_known_niters_smaller_than_vf (loop_vinfo)); > + > + /* Init min/max, shift and minus cost relative to single > + scalar_stmt. For now we only use lengthbased partial vectors on > + Power, target specific cost tweaking may be needed for other > + ports in future. */ > + unsigned int min_max_cost = 2; > + unsigned int shift_cost = 1, minus_cost = 1; Please instead add a scalar_min_max to vect_cost_for_stmt, and use scalar_stmt for shift and minus. There shouldn't be any Power things hardcoded into targetindependent code. 
> + /* Init cost relative to single scalar_stmt. */ > + unsigned int prologue_cnt = 0; > + unsigned int body_cnt = 0; Maybe _stmts instead of _cnt, to make it clearer what we're counting. At first it looked like this was actually the raw cost. I guess with the above change we'd also want separate counters for min_max. > + rgroup_controls *rgc; > + unsigned int num_vectors_m1; > + FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc) > + if (rgc>type) > + { > + unsigned nitems = rgc>max_nscalars_per_iter * rgc>factor; > + > + /* May need one shift for nitems_total computation. */ > + if (nitems != 1 && !niters_known_p) > + prologue_cnt += shift_cost; > + > + /* Need to handle wrap around. */ > + if (iv_limit == 1 > +  (wi::min_precision (iv_limit * nitems, UNSIGNED) > + > compare_precision)) > + prologue_cnt += (min_max_cost + minus_cost); I think we should instead split this condition out into a subroutine that both this function and vect_set_loop_condition_partial_vectors can use. It would take the loop_vec_info and rgroup_controls as arguments. That means that we might recompute things a few more times, but this shouldn't be performancecritical. Thanks, Richard
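For reference, the subroutine suggested above could take a shape like the sketch below; the name vect_rgroup_iv_might_wrap_p and its exact placement are assumptions here, not something settled in this thread.  Both vect_estimate_min_profitable_iters and vect_set_loop_condition_partial_vectors would then call it instead of open-coding the check:

    /* Return true if the induction variable for rgroup RGC might wrap
       around in LOOP_VINFO, in which case an extra min/max and subtract
       are needed when computing the lengths.  A sketch only.  */

    static bool
    vect_rgroup_iv_might_wrap_p (loop_vec_info loop_vinfo, rgroup_controls *rgc)
    {
      tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
      unsigned int compare_precision = TYPE_PRECISION (compare_type);
      unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;

      /* -1 means the IV limit is unknown, so assume the worst.  */
      widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
      if (iv_limit == -1)
        return true;

      /* Otherwise check whether IV_LIMIT * NITEMS still fits in the
         type used for the comparisons.  */
      return (wi::min_precision (iv_limit * nitems, UNSIGNED)
              > compare_precision);
    }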
Hi Richard,

Thanks for the comments!

on 2020/7/27 9:40 PM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
[snip]
>> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
>> +      bool need_iterate_p
>> +        = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
>> +           && !vect_known_niters_smaller_than_vf (loop_vinfo));
>> +
>> +      /* Init min/max, shift and minus cost relative to single
>> +         scalar_stmt.  For now we only use length-based partial vectors on
>> +         Power, target specific cost tweaking may be needed for other
>> +         ports in future.  */
>> +      unsigned int min_max_cost = 2;
>> +      unsigned int shift_cost = 1, minus_cost = 1;
>
> Please instead add a scalar_min_max to vect_cost_for_stmt, and use
> scalar_stmt for shift and minus.  There shouldn't be any Power things
> hard-coded into target-independent code.
>

Agree!  It's not good to leave them there.  I thought to wait and see
whether other targets which support vector with length can reuse this,
or move it to target specific code then if not sharable.  But anyway
it doesn't look good, let's fix it.

I had some concerns about the vect_cost_for_stmt way, since it seems
to invite more computations similar to min/max to be added like this;
in the long term it probably leads to a situation like: scalar_min_max,
scalar_div_expr, scalar_mul_expr ... an enum (cost types) bloat, which
seems not good to maintain.

I noticed that the i386 port's ix86_add_stmt_cost will check
stmt_info->stmt, whether it is an assignment, and the subcode of the
expression; it provides the chance to check the statement at a finer
grain, not just as a generic scalar_stmt/vector_stmt.

For the case here, we don't have the stmt_info, but we know the type
of computation (expression).  How about extending the hook
add_stmt_cost with one extra tree_code argument?  By default it can be
some meaningless code; for needs like the one here, we specify the
tree_code as the code of the computation, like {MIN,MAX}_EXPR, then
the target specific add_stmt_cost can check this tree_code and adjust
the cost accordingly.

>> +      /* Init cost relative to single scalar_stmt.  */
>> +      unsigned int prologue_cnt = 0;
>> +      unsigned int body_cnt = 0;
>
> Maybe _stmts instead of _cnt, to make it clearer what we're counting.
> At first it looked like this was actually the raw cost.  I guess with
> the above change we'd also want separate counters for min_max.
>

Good point, will fix it in v5.

>> +      rgroup_controls *rgc;
>> +      unsigned int num_vectors_m1;
>> +      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
>> +        if (rgc->type)
>> +          {
>> +            unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
>> +
>> +            /* May need one shift for nitems_total computation.  */
>> +            if (nitems != 1 && !niters_known_p)
>> +              prologue_cnt += shift_cost;
>> +
>> +            /* Need to handle wrap-around.  */
>> +            if (iv_limit == -1
>> +                || (wi::min_precision (iv_limit * nitems, UNSIGNED)
>> +                    > compare_precision))
>> +              prologue_cnt += (min_max_cost + minus_cost);
>
> I think we should instead split this condition out into a subroutine
> that both this function and vect_set_loop_condition_partial_vectors
> can use.  It would take the loop_vec_info and rgroup_controls as
> arguments.

OK, will split it in v5.

BR,
Kewen
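To make the proposal concrete, the target-side check might look like the sketch below; the helper name and the doubling factor are illustrative assumptions based on the Power9 timing mentioned later in the thread, not part of any posted patch:

    /* Sketch: scale a scalar statement cost when the proposed extra
       tree_code argument says the statement is a min/max.  ERROR_MARK
       would be the default "no specific operation" value.  */

    static unsigned
    rs6000_adjust_stmt_cost_for_code (enum tree_code code, unsigned stmt_cost)
    {
      /* On Power9, scalar min/max take roughly twice the cycles of a
         simple operation such as a shift or subtract.  */
      if (code == MIN_EXPR || code == MAX_EXPR)
        return stmt_cost * 2;
      return stmt_cost;
    }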
"Kewen.Lin" <linkw@linux.ibm.com> writes: >>> + bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo); >>> + bool need_iterate_p >>> + = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo) >>> + && !vect_known_niters_smaller_than_vf (loop_vinfo)); >>> + >>> + /* Init min/max, shift and minus cost relative to single >>> + scalar_stmt. For now we only use lengthbased partial vectors on >>> + Power, target specific cost tweaking may be needed for other >>> + ports in future. */ >>> + unsigned int min_max_cost = 2; >>> + unsigned int shift_cost = 1, minus_cost = 1; >> >> Please instead add a scalar_min_max to vect_cost_for_stmt, and use >> scalar_stmt for shift and minus. There shouldn't be any Power things >> hardcoded into targetindependent code. >> > > Agree! It's not good to leave them there. I thought to wait and see > if other targets which support vector with length can reuse this, or > move it to target specific codes then if not sharable. But anyway > it looks not good, let's fix it. > > I had some concerns on vect_cost_for_stmt way, since it seems to allow > more computations similar to min/max to be added like this, in long > term it probably leads to the situtation like: scalar_min_max, > scalar_div_expr, scalar_mul_expr ... an enum (cost types) bloat, it > seems not good to maintain. I guess doing that doesn't seem so bad to me :) I think it's been a recurring problem that the current classification isn't finegrained enough for some cases. > I noticed that i386 port ix86_add_stmt_cost will check stmt_info>stmt, > whether is assignment and the subcode of the expression, it provides the > chance to check the statement more finegrain, not just as normal > scalar_stmt/vector_stmt. > > For the case here, we don't have the stmt_info, but we know the type > of computation(expression), how about to extend the hook add_stmt_cost > with one extra tree_code type argument, by default it can be some > unmeaningful code, for some needs like here, we specify the tree_code > as the code of computation, like {MIN,MAX}_EXPR, then target specific > add_stmt_cost can check this tree_code and adjust the cost accordingly. If we do that, I guess we should “promote” code_helper out of gimplematch.h and use that instead, so that we can handle internal and builtin functions too. Would like to hear Richard's opinion on the best way forward here. Thanks, Richard
On Fri, Jul 31, 2020 at 1:03 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
> >>> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
> >>> +      bool need_iterate_p
> >>> +        = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> >>> +           && !vect_known_niters_smaller_than_vf (loop_vinfo));
> >>> +
> >>> +      /* Init min/max, shift and minus cost relative to single
> >>> +         scalar_stmt.  For now we only use length-based partial vectors on
> >>> +         Power, target specific cost tweaking may be needed for other
> >>> +         ports in future.  */
> >>> +      unsigned int min_max_cost = 2;
> >>> +      unsigned int shift_cost = 1, minus_cost = 1;
> >>
> >> Please instead add a scalar_min_max to vect_cost_for_stmt, and use
> >> scalar_stmt for shift and minus.  There shouldn't be any Power things
> >> hard-coded into target-independent code.
> >>
> >
> > Agree!  It's not good to leave them there.  I thought to wait and see
> > whether other targets which support vector with length can reuse this,
> > or move it to target specific code then if not sharable.  But anyway
> > it doesn't look good, let's fix it.

In other generic places like this we simply use three generic scalar_stmt
costs.  At least I don't see how min_max should be different from it
when shift can be the same as minus.  Note this is also how we treat
vectorization of MAX_EXPR - the scalar cost is one scalar_stmt and the
vector cost is one vector_stmt.  As you say below, the add_stmt_cost
hook can adjust based on the actual GIMPLE stmt - if one is
available (which indeed it isn't here).

I'm somewhat lacking context here as well - we actually GIMPLE
code-generate the min/max/shift/minus and only the eventual
AND is deferred to the target, right?

> > I had some concerns about the vect_cost_for_stmt way, since it seems
> > to invite more computations similar to min/max to be added like this;
> > in the long term it probably leads to a situation like: scalar_min_max,
> > scalar_div_expr, scalar_mul_expr ... an enum (cost types) bloat, which
> > seems not good to maintain.
>
> I guess doing that doesn't seem so bad to me :)  I think it's been
> a recurring problem that the current classification isn't fine-grained
> enough for some cases.

But we eventually want to get rid of this classification enum in favor
of the add_stmt_cost hook ...

> > I noticed that the i386 port's ix86_add_stmt_cost will check
> > stmt_info->stmt, whether it is an assignment, and the subcode of the
> > expression; it provides the chance to check the statement at a finer
> > grain, not just as a generic scalar_stmt/vector_stmt.
> >
> > For the case here, we don't have the stmt_info, but we know the type
> > of computation (expression).  How about extending the hook
> > add_stmt_cost with one extra tree_code argument?  By default it can be
> > some meaningless code; for needs like the one here, we specify the
> > tree_code as the code of the computation, like {MIN,MAX}_EXPR, then
> > the target specific add_stmt_cost can check this tree_code and adjust
> > the cost accordingly.
>
> If we do that, I guess we should “promote” code_helper out of
> gimple-match.h and use that instead, so that we can handle
> internal and built-in functions too.
>
> Would like to hear Richard's opinion on the best way forward here.

I'd say defer this to a later patch and for now simply cost one scalar
stmt for MIN/MAX.  I agree that if we add a tree_code we want a
code_helper instead.  Note that I want to eventually have a
full SLP tree for the final code generation where all info should be
there (including SLP nodes for those min/max ops) and which the
target could traverse.  But I'm not sure if I can make enough progress
on that SLP-only thing for GCC 11 even...

Richard.

> Thanks,
> Richard
Hi Richards,

on 2020/7/31 7:20 PM, Richard Biener wrote:
> On Fri, Jul 31, 2020 at 1:03 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>> [snip]
>
> In other generic places like this we simply use three generic scalar_stmt
> costs.  At least I don't see how min_max should be different from it
> when shift can be the same as minus.  Note this is also how we treat

Yeah, normally they (min/max/minus/shift) are all taken as scalar_stmt,
except for fine-grained tuning like the i386 port, where they would still
use the same cost.  On Power9, implementing min/max takes double the
cycles of normal scalar operations like add/shift; I was trying to model
it at a finer grain since we probably generate a few min/max here, and
if the loop body cost is small, I was worried the decision wouldn't be
good enough.  But yeah, in other generic places a small loop could also
suffer from a similar cost being off; they are the same essentially.

> vectorization of MAX_EXPR - the scalar cost is one scalar_stmt and the
> vector cost is one vector_stmt.  As you say below, the add_stmt_cost
> hook can adjust based on the actual GIMPLE stmt - if one is
> available (which indeed it isn't here).
>
> I'm somewhat lacking context here as well - we actually GIMPLE
> code-generate the min/max/shift/minus and only the eventual
> AND is deferred to the target, right?
>

Yes, min/max/shift/minus are all GIMPLE code; targets like Power
will have their target specific cost for the shift which moves the
length to the high bits 0:7.

One typical case is as below:

  <bb 3> [local count: 105119324]:
  _26 = n_11(D) * 4;
  _37 = MAX_EXPR <_26, 16>;
  _38 = _37 + 18446744073709551600;
  _40 = MIN_EXPR <_26, 16>;

  <bb 4> [local count: 630715945]:
  # ivtmp_35 = PHI <0(3), ivtmp_36(4)>
  # loop_len_30 = PHI <_40(3), _44(4)>
  _19 = &MEM[base: a_12(D), index: ivtmp_35, offset: 0B];
  vect_24 = .LEN_LOAD (_19, 4B, loop_len_30);
  vect__3.7_23 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect_24);
  _1 = &MEM[base: b_13(D), index: ivtmp_35, offset: 0B];
  vect_17 = .LEN_LOAD (_1, 4B, loop_len_30);
  vect__5.10_9 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect_17);
  vect__7.11_8 = vect__5.10_9 + vect__3.7_23;
  vect_28 = VIEW_CONVERT_EXPR<vector(16) unsigned char>(vect__7.11_8);
  _2 = &MEM[base: c_14(D), index: ivtmp_35, offset: 0B];
  .LEN_STORE (_2, 4B, loop_len_30, vect_28);
  _42 = MIN_EXPR <ivtmp_35, _38>;
  _43 = _38 - _42;
  _44 = MIN_EXPR <_43, 16>;
  ivtmp_36 = ivtmp_35 + 16;
  if (ivtmp_35 < _38)
    goto <bb 4>; [83.33%]
  else
    goto <bb 5>; [16.67%]

>>> I had some concerns about the vect_cost_for_stmt way, since it seems
>>> to invite more computations similar to min/max to be added like this;
>>> in the long term it probably leads to a situation like: scalar_min_max,
>>> scalar_div_expr, scalar_mul_expr ... an enum (cost types) bloat, which
>>> seems not good to maintain.
>>
>> I guess doing that doesn't seem so bad to me :)  I think it's been
>> a recurring problem that the current classification isn't fine-grained
>> enough for some cases.
>
> But we eventually want to get rid of this classification enum in favor
> of the add_stmt_cost hook ...
>

Nice, sounds like each target has to handle it at a fine grain. :)
IIUC, the current modeling doesn't consider instruction dependencies,
execution resources, etc. the way scheduling does; even if all costs
were fine-grained enough, the decision could still be suboptimal.

>>> [snip]
>>
>> If we do that, I guess we should “promote” code_helper out of
>> gimple-match.h and use that instead, so that we can handle
>> internal and built-in functions too.
>>
>> Would like to hear Richard's opinion on the best way forward here.
>
> I'd say defer this to a later patch and for now simply cost one scalar
> stmt for MIN/MAX.  I agree that if we add a tree_code we want a
> code_helper instead.  Note that I want to eventually have a
> full SLP tree for the final code generation where all info should be
> there (including SLP nodes for those min/max ops) and which the
> target could traverse.  But I'm not sure if I can make enough progress
> on that SLP-only thing for GCC 11 even...
>

OK, I'm fine to take MIN/MAX as scalar_stmt here.  Thank both of you!
This new SLP framework looks very promising and powerful. :)

BR,
Kewen
On Fri, Jul 31, 2020 at 2:37 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Richards,
>
> on 2020/7/31 7:20 PM, Richard Biener wrote:
> > In other generic places like this we simply use three generic scalar_stmt
> > costs.  At least I don't see how min_max should be different from it
> > when shift can be the same as minus.  Note this is also how we treat
>
> Yeah, normally they (min/max/minus/shift) are all taken as scalar_stmt,
> except for fine-grained tuning like the i386 port, where they would still
> use the same cost.  On Power9, implementing min/max takes double the
> cycles of normal scalar operations like add/shift; I was trying to model
> it at a finer grain since we probably generate a few min/max here, and
> if the loop body cost is small, I was worried the decision wouldn't be
> good enough.  But yeah, in other generic places a small loop could also
> suffer from a similar cost being off; they are the same essentially.
>
> > vectorization of MAX_EXPR - the scalar cost is one scalar_stmt and the
> > vector cost is one vector_stmt.  As you say below, the add_stmt_cost
> > hook can adjust based on the actual GIMPLE stmt - if one is
> > available (which indeed it isn't here).
> >
> > I'm somewhat lacking context here as well - we actually GIMPLE
> > code-generate the min/max/shift/minus and only the eventual
> > AND is deferred to the target, right?
> >
>
> Yes, min/max/shift/minus are all GIMPLE code; targets like Power
> will have their target specific cost for the shift which moves the
> length to the high bits 0:7.
>
> One typical case is as below:
>
> [snip]
>
> >>> I had some concerns about the vect_cost_for_stmt way, since it seems
> >>> to invite more computations similar to min/max to be added like this;
> >>> in the long term it probably leads to a situation like: scalar_min_max,
> >>> scalar_div_expr, scalar_mul_expr ... an enum (cost types) bloat, which
> >>> seems not good to maintain.
> >>
> >> I guess doing that doesn't seem so bad to me :)  I think it's been
> >> a recurring problem that the current classification isn't fine-grained
> >> enough for some cases.
> >
> > But we eventually want to get rid of this classification enum in favor
> > of the add_stmt_cost hook ...
> >
>
> Nice, sounds like each target has to handle it at a fine grain. :)
> IIUC, the current modeling doesn't consider instruction dependencies,
> execution resources, etc. the way scheduling does; even if all costs
> were fine-grained enough, the decision could still be suboptimal.

That's what the finish_cost hook is for - the target can collect
all stmts during add_stmt and then apply adjustments at the end
(IIRC power does already for shift operation resource constraints).

> [snip]
> >
> > I'd say defer this to a later patch and for now simply cost one scalar
> > stmt for MIN/MAX.  I agree that if we add a tree_code we want a
> > code_helper instead.  Note that I want to eventually have a
> > full SLP tree for the final code generation where all info should be
> > there (including SLP nodes for those min/max ops) and which the
> > target could traverse.  But I'm not sure if I can make enough progress
> > on that SLP-only thing for GCC 11 even...
> >
>
> OK, I'm fine to take MIN/MAX as scalar_stmt here.  Thank both of you!
> This new SLP framework looks very promising and powerful. :)

As do all things that are still on paper^Win my head only ;)

Richard.

> BR,
> Kewen
on 2020/7/31 9:01 PM, Richard Biener wrote:
> On Fri, Jul 31, 2020 at 2:37 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi Richards,
>>
>> [snip]
>>
>> Nice, sounds like each target has to handle it at a fine grain. :)
>> IIUC, the current modeling doesn't consider instruction dependencies,
>> execution resources, etc. the way scheduling does; even if all costs
>> were fine-grained enough, the decision could still be suboptimal.
>
> That's what the finish_cost hook is for - the target can collect
> all stmts during add_stmt and then apply adjustments at the end
> (IIRC power does already for shift operation resource constraints).
>

oh, I forgot that we have that hook, though I had imagined some generic
code would do this.  Good to know we have a way to do it.  Nice!

BR,
Kewen
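As a concrete illustration of that add_stmt/finish_cost split - purely a sketch, with invented field names and an invented heuristic rather than the real rs6000_cost_data layout:

    /* Sketch: accumulate counts while stmts are costed, then apply a
       loop-wide resource adjustment once at finish_cost time.  */

    struct example_cost_data
    {
      unsigned prologue_cost, body_cost, epilogue_cost;
      unsigned nstmts;   /* All statements costed so far.  */
      unsigned nshifts;  /* Shift-like statements costed so far.  */
    };

    static void
    example_finish_cost (struct example_cost_data *data)
    {
      /* If shifts make up more than half of the body, assume the shift
         unit is the bottleneck and charge each shift once more.  */
      if (data->nstmts > 0 && 2 * data->nshifts > data->nstmts)
        data->body_cost += data->nshifts;
    }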
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 009afc5f894..86ef584e09b 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5177,6 +5177,34 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
   return retval;
 }
 
+/* For some target specific vectorization cost which can't be handled per stmt,
+   we check the requisite conditions and adjust the vectorization cost
+   accordingly if satisfied.  One typical example is to model shift cost for
+   vector with length by counting number of required lengths under condition
+   LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
+
+static void
+rs6000_adjust_vect_cost_per_loop (rs6000_cost_data *data)
+{
+  struct loop *loop = data->loop_info;
+  gcc_assert (loop);
+  loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
+
+  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      unsigned int shift_cnt = 0;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+        if (rgc->type)
+          /* Each length needs one shift to fill into bits 0-7.  */
+          shift_cnt += num_vectors_m1 + 1;
+
+      rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt,
+                            NULL, NULL_TREE, 0, vect_body);
+    }
+}
+
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
@@ -5186,7 +5214,10 @@ rs6000_finish_cost (void *data, unsigned *prologue_cost,
   rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
 
   if (cost_data->loop_info)
-    rs6000_density_test (cost_data);
+    {
+      rs6000_adjust_vect_cost_per_loop (cost_data);
+      rs6000_density_test (cost_data);
+    }
 
   /* Don't vectorize minimum-vectorization-factor, simple copy loops
      that require versioning for any reason.  The vectorization is at
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 490e7befea3..9d0e3fc525e 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -412,7 +412,10 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
 
    This means that we cannot guarantee that such an induction variable
    would ever hit a value that produces a set of all-false masks or zero
-   lengths for RGC.  */
+   lengths for RGC.
+
+   Note: please check whether the cost modeling in function
+   vect_estimate_min_profitable_iters needs updating if this changes.  */
 
 static tree
 vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 06cde4b1da3..a00160a7f2d 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3798,6 +3798,70 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
                             vector_stmt, NULL, NULL_TREE, 0, vect_body);
     }
+  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      /* Referring to the functions vect_set_loop_condition_partial_vectors
+         and vect_set_loop_controls_directly, we need to generate each
+         length in the prologue and in the loop body if required.  Although
+         there are some possible optimizations, we consider the worst case
+         here.  */
+
+      /* For wrap-around checking.  */
+      tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+      unsigned int compare_precision = TYPE_PRECISION (compare_type);
+      widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
+
+      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
+      bool need_iterate_p
+        = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+           && !vect_known_niters_smaller_than_vf (loop_vinfo));
+
+      /* Init min/max, shift and minus cost relative to single
+         scalar_stmt.  For now we only use length-based partial vectors on
+         Power, target specific cost tweaking may be needed for other
+         ports in future.  */
+      unsigned int min_max_cost = 2;
+      unsigned int shift_cost = 1, minus_cost = 1;
+
+      /* Init cost relative to single scalar_stmt.  */
+      unsigned int prologue_cnt = 0;
+      unsigned int body_cnt = 0;
+
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+        if (rgc->type)
+          {
+            unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
+
+            /* May need one shift for nitems_total computation.  */
+            if (nitems != 1 && !niters_known_p)
+              prologue_cnt += shift_cost;
+
+            /* Need to handle wrap-around.  */
+            if (iv_limit == -1
+                || (wi::min_precision (iv_limit * nitems, UNSIGNED)
+                    > compare_precision))
+              prologue_cnt += (min_max_cost + minus_cost);
+
+            /* Need to handle batch limit except for the 1st one.  */
+            prologue_cnt += (min_max_cost + minus_cost) * num_vectors_m1;
+
+            unsigned int num_vectors = num_vectors_m1 + 1;
+            /* Need to set up lengths in prologue, only one MIN required
+               since start index is zero.  */
+            prologue_cnt += min_max_cost * num_vectors;
+
+            /* Need to update lengths in body for next iteration.  */
+            if (need_iterate_p)
+              body_cnt += (2 * min_max_cost + minus_cost) * num_vectors;
+          }
+
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, prologue_cnt,
+                            scalar_stmt, NULL, NULL_TREE, 0, vect_prologue);
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, scalar_stmt,
+                            NULL, NULL_TREE, 0, vect_body);
+    }
 
   /* FORNOW: The scalar outside cost is incremented in one of the
      following ways:
@@ -3932,8 +3996,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     }
 
   /* ??? The "if" arm is written to handle all cases; see below for what
-     we would do for !LOOP_VINFO_FULLY_MASKED_P.  */
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+     we would do for !LOOP_VINFO_USING_PARTIAL_VECTORS_P.  */
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* Rewriting the condition above in terms of the number of
          vector iterations (vniters) rather than the number of
@@ -3960,7 +4024,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
         dump_printf (MSG_NOTE, "  Minimum number of vector iterations: %d\n",
                      min_vec_niters);
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
         {
           /* Now that we know the minimum number of vector iterations,
              find the minimum niters for which the scalar cost is larger:
@@ -4015,6 +4079,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
            && min_profitable_iters < (assumed_vf + peel_iters_prologue))
     /* We want the vectorized loop to execute at least once.  */
     min_profitable_iters = assumed_vf + peel_iters_prologue;
+  else if (min_profitable_iters < peel_iters_prologue)
+    /* For LOOP_VINFO_USING_PARTIAL_VECTORS_P, we need to ensure the
+       vectorized loop executes at least once.  */
+    min_profitable_iters = peel_iters_prologue;
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -4032,7 +4100,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 
   if (vec_outside_cost <= 0)
     min_profitable_estimate = 0;
-  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* This is a repeat of the code above, but with + SOC rather
          than - SOC.  */
@@ -4044,7 +4112,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       if (outside_overhead > 0)
         min_vec_niters = outside_overhead / saving_per_viter + 1;
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
         {
           int threshold = (vec_inside_cost * min_vec_niters
                            + vec_outside_cost
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 31af46ae19c..8550a252f44 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -12090,7 +12090,10 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
      min_of_start_and_end = min (START_INDEX, END_INDEX);
      left_len = END_INDEX - min_of_start_and_end;
      rhs = min (left_len, LEN_LIMIT);
-     LEN = rhs;  */
+     LEN = rhs;
+
+   Note: please check whether the cost modeling in function
+   vect_estimate_min_profitable_iters needs updating if this changes.  */
 
 gimple_seq
 vect_gen_len (tree len, tree start_index, tree end_index, tree len_limit)
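For clarity, the length computation described in that comment behaves like the scalar model below.  This is an illustrative sketch of the formula vect_gen_len emits as gimple, not code from the patch:

    #include <stdio.h>

    /* Scalar model of vect_gen_len's formula:
       LEN = min (END_INDEX - min (START_INDEX, END_INDEX), LEN_LIMIT).  */
    static unsigned long
    compute_len (unsigned long start_index, unsigned long end_index,
                 unsigned long len_limit)
    {
      unsigned long m = start_index < end_index ? start_index : end_index;
      unsigned long left_len = end_index - m;
      return left_len < len_limit ? left_len : len_limit;
    }

    int
    main (void)
    {
      /* A full iteration: 50 bytes left, 16-byte limit => 16.  */
      printf ("%lu\n", compute_len (0, 50, 16));
      /* The last iteration: only 2 bytes left => 2.  */
      printf ("%lu\n", compute_len (48, 50, 16));
      return 0;
    }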