[v3] vect/rs6000: Support vector with length cost modeling

Message ID 1aeabdc7-0cf4-055b-a3ec-74c283053cf5@linux.ibm.com
State New
Headers show
Series
  • [v3] vect/rs6000: Support vector with length cost modeling
Related show

Commit Message

Bill Schmidt via Gcc-patches July 22, 2020, 3:48 p.m.
Hi Richard,

Thanks for the review!

on 2020/7/22 下午5:11, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:

>> -  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))

>> -    {

>> -      peel_iters_prologue = 0;

>> -      peel_iters_epilogue = 0;

>> +      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))

>> +	{

>> +	  /* Calculate how many masks we need to generate.  */

>> +	  unsigned int num_masks = 0;

>> +	  rgroup_controls *rgm;

>> +	  unsigned int num_vectors_m1;

>> +	  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)

>> +	    if (rgm->type)

>> +	      num_masks += num_vectors_m1 + 1;

>> +	  gcc_assert (num_masks > 0);

>> +

>> +	  /* In the worst case, we need to generate each mask in the prologue

>> +	     and in the loop body.  One of the loop body mask instructions

>> +	     replaces the comparison in the scalar loop, and since we don't

>> +	     count the scalar comparison against the scalar body, we shouldn't

>> +	     count that vector instruction against the vector body either.

>> +

>> +	     Sometimes we can use unpacks instead of generating prologue

>> +	     masks and sometimes the prologue mask will fold to a constant,

>> +	     so the actual prologue cost might be smaller.  However, it's

>> +	     simpler and safer to use the worst-case cost; if this ends up

>> +	     being the tie-breaker between vectorizing or not, then it's

>> +	     probably better not to vectorize.  */

>> +	  (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks,

>> +				vector_stmt, NULL, NULL_TREE, 0, vect_prologue);

>> +	  (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,

>> +				vector_stmt, NULL, NULL_TREE, 0, vect_body);

>> +	}

>> +      else

>> +	{

>> +	  gcc_assert (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));

>> +

>> +	  /* Consider cost for LOOP_VINFO_PEELING_FOR_ALIGNMENT.  */

>> +	  if (npeel < 0)

>> +	    {

>> +	      peel_iters_prologue = assumed_vf / 2;

>> +	      /* See below, if peeled iterations are unknown, count a taken

>> +		 branch and a not taken branch per peeled loop.  */

>> +	      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,

>> +				    cond_branch_taken, NULL, NULL_TREE, 0,

>> +				    vect_prologue);

>> +	      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,

>> +				    cond_branch_not_taken, NULL, NULL_TREE, 0,

>> +				    vect_prologue);

>> +	    }

>> +	  else

>> +	    {

>> +	      peel_iters_prologue = npeel;

>> +	      if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))

>> +		/* See vect_get_known_peeling_cost, if peeled iterations are

>> +		   known but number of scalar loop iterations are unknown, count

>> +		   a taken branch per peeled loop.  */

>> +		(void) add_stmt_cost (loop_vinfo, target_cost_data, 1,

>> +				      cond_branch_taken, NULL, NULL_TREE, 0,

>> +				      vect_prologue);

>> +	    }

> 

> I think it'd be good to avoid duplicating this.  How about the

> following structure?

> 

>   if (vect_use_loop_mask_for_alignment_p (…))

>     {

>       peel_iters_prologue = 0;

>       peel_iters_epilogue = 0;

>     }

>   else if (npeel < 0)

>     {

>       … // A

>     }

>   else

>     {

>       …vect_get_known_peeling_cost stuff…

>     }

> 

> but in A and vect_get_known_peeling_cost, set peel_iters_epilogue to:

> 

>   LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0

> 

> for LOOP_VINFO_USING_PARTIAL_VECTORS_P, instead of setting it to

> whatever value we'd normally use.  Then wrap:

> 

>       (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,

> 			    NULL, NULL_TREE, 0, vect_epilogue);

>       (void) add_stmt_cost (loop_vinfo,

> 			    target_cost_data, 1, cond_branch_not_taken,

> 			    NULL, NULL_TREE, 0, vect_epilogue);

> 

> in !LOOP_VINFO_USING_PARTIAL_VECTORS_P and make the other vect_epilogue

> stuff in A conditional on peel_iters_epilogue != 0.

> 

> This will also remove the need for the existing LOOP_VINFO_FULLY_MASKED_P

> code:

> 

>       if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))

> 	{

> 	  /* We need to peel exactly one iteration.  */

> 	  peel_iters_epilogue += 1;

> 	  stmt_info_for_cost *si;

> 	  int j;

> 	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),

> 			    j, si)

> 	    (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,

> 				  si->kind, si->stmt_info, si->vectype,

> 				  si->misalign, vect_epilogue);

> 	}

> 

> Then, after the above, have:

> 

>   if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))

>     …add costs for mask overhead…

>   else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))

>     …add costs for lengths overhead…

> 

> So we'd have one block of code for estimating the prologue and epilogue

> peeling cost, and a separate block of code for the loop control overhead.

> 


It's a great idea, by following your subsequent suggestion to make the structure
like:

  - calculate peel_iters_prologue
  - calculate peel_iters_epilogue
  - add costs associated with peel_iters_prologue
  - add costs associated with peel_iters_epilogue
  - add costs related to branch taken/not_taken.

the updated v3 is attached.

Just bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
param vect-partial-vector-usage=1, I'll test it without partial vectors
setting, also on aarch64 later.

BR,
Kewen
-----
gcc/ChangeLog:

	* config/rs6000/rs6000.c (adjust_vect_cost_per_loop): New function.
	(rs6000_finish_cost): Call adjust_vect_cost_per_loop.
	* tree-vect-loop.c (vect_get_known_peeling_cost): Factor out some code
	to determine peel_iters_epilogue to function ...
	(vect_get_peel_iters_epilogue): ... this.  New function.
	(vect_estimate_min_profitable_iters):  Add cost modeling for vector
	with length, refactor cost calculation on peel_iters_prologue and
	peel_iters_epilogue.

Comments

Bill Schmidt via Gcc-patches July 22, 2020, 4:25 p.m. | #1
Hi,

Sorry, please ignore the previously attached file, which isn't the latest one
although almost are the same.   The latest tested is attached here.  

Sorry for the inconvenience.

BR,
Kewen

on 2020/7/22 下午11:48, Kewen.Lin via Gcc-patches wrote:
> 

> It's a great idea, by following your subsequent suggestion to make the structure

> like:

> 

>   - calculate peel_iters_prologue

>   - calculate peel_iters_epilogue

>   - add costs associated with peel_iters_prologue

>   - add costs associated with peel_iters_epilogue

>   - add costs related to branch taken/not_taken.

> 

> the updated v3 is attached.

> 

> Just bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit

> param vect-partial-vector-usage=1, I'll test it without partial vectors

> setting, also on aarch64 later.

> 

> BR,

> Kewen

> -----

> gcc/ChangeLog:

> 

> 	* config/rs6000/rs6000.c (adjust_vect_cost_per_loop): New function.

> 	(rs6000_finish_cost): Call adjust_vect_cost_per_loop.

> 	* tree-vect-loop.c (vect_get_known_peeling_cost): Factor out some code

> 	to determine peel_iters_epilogue to function ...

> 	(vect_get_peel_iters_epilogue): ... this.  New function.

> 	(vect_estimate_min_profitable_iters):  Add cost modeling for vector

> 	with length, refactor cost calculation on peel_iters_prologue and

> 	peel_iters_epilogue.

>
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 009afc5f894..d71f2bf1c16 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5177,6 +5177,34 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
   return retval;
 }
 
+/* For some target specific vectorization cost which can't be handled per stmt,
+   we check the requisite conditions and adjust the vectorization cost
+   accordingly if satisfied.  One typical example is to model shift cost for
+   vector with length by counting number of required lengths under condition
+   LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
+
+static void
+adjust_vect_cost_per_loop (rs6000_cost_data *data)
+{
+  struct loop *loop = data->loop_info;
+  gcc_assert (loop);
+  loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
+
+  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      unsigned int shift_cnt = 0;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+	if (rgc->type)
+	  /* Each length needs one shift to fill into bits 0-7.  */
+	  shift_cnt += (num_vectors_m1 + 1);
+
+      rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt,
+			    NULL, NULL_TREE, 0, vect_body);
+    }
+}
+
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
@@ -5186,7 +5214,10 @@ rs6000_finish_cost (void *data, unsigned *prologue_cost,
   rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
 
   if (cost_data->loop_info)
-    rs6000_density_test (cost_data);
+    {
+      adjust_vect_cost_per_loop (cost_data);
+      rs6000_density_test (cost_data);
+    }
 
   /* Don't vectorize minimum-vectorization-factor, simple copy loops
      that require versioning for any reason.  The vectorization is at
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index e933441b922..8746c5ae582 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3474,42 +3474,56 @@ vect_is_simple_reduction (loop_vec_info loop_info, stmt_vec_info phi_info,
   return NULL;
 }
 
-/* Calculate cost of peeling the loop PEEL_ITERS_PROLOGUE times.  */
-int
-vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
-                             int *peel_iters_epilogue,
-                             stmt_vector_for_cost *scalar_cost_vec,
-			     stmt_vector_for_cost *prologue_cost_vec,
-			     stmt_vector_for_cost *epilogue_cost_vec)
+/* Calculate how many iterations peeled for epilogue with information LOOP_VINFO
+   and PEEL_ITERS_PROLOGUE.  */
+
+static int
+vect_get_peel_iters_epilogue (loop_vec_info loop_vinfo, int peel_iters_prologue)
 {
-  int retval = 0;
   int assumed_vf = vect_vf_for_cost (loop_vinfo);
-
   if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     {
-      *peel_iters_epilogue = assumed_vf / 2;
       if (dump_enabled_p ())
-        dump_printf_loc (MSG_NOTE, vect_location,
+	dump_printf_loc (MSG_NOTE, vect_location,
 			 "cost model: epilogue peel iters set to vf/2 "
 			 "because loop iterations are unknown .\n");
-
-      /* If peeled iterations are known but number of scalar loop
-         iterations are unknown, count a taken branch per peeled loop.  */
-      retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken,
-				 NULL, NULL_TREE, 0, vect_prologue);
-      retval += record_stmt_cost (epilogue_cost_vec, 1, cond_branch_taken,
-				  NULL, NULL_TREE, 0, vect_epilogue);
+      return assumed_vf / 2;
     }
   else
     {
       int niters = LOOP_VINFO_INT_NITERS (loop_vinfo);
-      peel_iters_prologue = niters < peel_iters_prologue ?
-                            niters : peel_iters_prologue;
-      *peel_iters_epilogue = (niters - peel_iters_prologue) % assumed_vf;
+      int npeel_prol
+	= niters < peel_iters_prologue ? niters : peel_iters_prologue;
+      int npeel_epil = (niters - npeel_prol) % assumed_vf;
       /* If we need to peel for gaps, but no peeling is required, we have to
 	 peel VF iterations.  */
-      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && !*peel_iters_epilogue)
-	*peel_iters_epilogue = assumed_vf;
+      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && !npeel_epil)
+	npeel_epil = assumed_vf;
+      return npeel_epil;
+    }
+}
+
+/* Calculate cost of peeling the loop PEEL_ITERS_PROLOGUE times.  */
+int
+vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
+			     int *peel_iters_epilogue,
+			     stmt_vector_for_cost *scalar_cost_vec,
+			     stmt_vector_for_cost *prologue_cost_vec,
+			     stmt_vector_for_cost *epilogue_cost_vec)
+{
+  int retval = 0;
+
+  *peel_iters_epilogue
+    = vect_get_peel_iters_epilogue (loop_vinfo, peel_iters_prologue);
+
+  if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+    {
+      /* If peeled iterations are known but number of scalar loop
+	 iterations are unknown, count a taken branch per peeled loop.  */
+      retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken, NULL,
+				 NULL_TREE, 0, vect_prologue);
+      retval += record_stmt_cost (epilogue_cost_vec, 1, cond_branch_taken, NULL,
+				  NULL_TREE, 0, vect_epilogue);
     }
 
   stmt_info_for_cost *si;
@@ -3652,24 +3666,109 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
      TODO: Build an expression that represents peel_iters for prologue and
      epilogue to be used in a run-time test.  */
 
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  bool prol_need_br_taken_cost = false;
+  bool prol_need_br_not_taken_cost = false;
+
+  /* Calculate peel_iters_prologue.  */
+  if (vect_use_loop_mask_for_alignment_p (loop_vinfo))
+    peel_iters_prologue = 0;
+  else if (npeel < 0)
     {
-      peel_iters_prologue = 0;
-      peel_iters_epilogue = 0;
+      peel_iters_prologue = assumed_vf / 2;
+      if (dump_enabled_p ())
+	dump_printf (MSG_NOTE, "cost model: "
+			       "prologue peel iters set to vf/2.\n");
 
-      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
-	{
-	  /* We need to peel exactly one iteration.  */
-	  peel_iters_epilogue += 1;
-	  stmt_info_for_cost *si;
-	  int j;
-	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
-			    j, si)
-	    (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
-				  si->kind, si->stmt_info, si->vectype,
-				  si->misalign, vect_epilogue);
-	}
+      /* If peeled iterations are unknown, count a taken branch and a not taken
+	 branch per peeled loop. Even if scalar loop iterations are known,
+	 vector iterations are not known since peeled prologue iterations are
+	 not known. Hence guards remain the same.  */
+      prol_need_br_taken_cost = true;
+      prol_need_br_not_taken_cost = true;
+    }
+  else
+    {
+      peel_iters_prologue = npeel;
+      if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+	/* If peeled iterations are known but number of scalar loop
+	   iterations are unknown, count a taken branch per peeled loop.  */
+	prol_need_br_taken_cost = true;
+    }
+
+  bool epil_need_br_taken_cost = false;
+  bool epil_need_br_not_taken_cost = false;
 
+  /* Calculate peel_iters_epilogue.  */
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
+    /* We need to peel exactly one iteration for gaps.  */
+    peel_iters_epilogue = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0;
+  else if (npeel < 0)
+    {
+      /* If peeling for alignment is unknown, loop bound of main loop
+	 becomes unknown.  */
+      peel_iters_epilogue = assumed_vf / 2;
+      if (dump_enabled_p ())
+	dump_printf (MSG_NOTE, "cost model: "
+			       "epilogue peel iters set to vf/2 because "
+			       "peeling for alignment is unknown.\n");
+
+      /* See the same reason above in peel_iters_prologue calculation.  */
+      epil_need_br_taken_cost = true;
+      epil_need_br_not_taken_cost = true;
+    }
+  else
+    {
+      peel_iters_epilogue = vect_get_peel_iters_epilogue (loop_vinfo, npeel);
+      if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+	/* If peeled iterations are known but number of scalar loop
+	   iterations are unknown, count a taken branch per peeled loop.  */
+	epil_need_br_taken_cost = true;
+    }
+
+  stmt_info_for_cost *si;
+  int j;
+  /* Add costs associated with peel_iters_prologue.  */
+  if (peel_iters_prologue)
+    FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si)
+      {
+	(void) add_stmt_cost (loop_vinfo, target_cost_data,
+			      si->count * peel_iters_prologue, si->kind,
+			      si->stmt_info, si->vectype, si->misalign,
+			      vect_prologue);
+      }
+
+  /* Add costs associated with peel_iters_prologue.  */
+  if (peel_iters_epilogue)
+    FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si)
+      {
+	(void) add_stmt_cost (loop_vinfo, target_cost_data,
+			      si->count * peel_iters_epilogue, si->kind,
+			      si->stmt_info, si->vectype, si->misalign,
+			      vect_epilogue);
+      }
+
+  /* Add possible cond_branch_taken/cond_branch_not_taken cost.  */
+  if (prol_need_br_taken_cost)
+    (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
+			  NULL, NULL_TREE, 0, vect_prologue);
+
+  if (prol_need_br_not_taken_cost)
+    (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+			  cond_branch_not_taken, NULL, NULL_TREE, 0,
+			  vect_prologue);
+
+  if (epil_need_br_taken_cost)
+    (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
+			  NULL, NULL_TREE, 0, vect_epilogue);
+
+  if (epil_need_br_not_taken_cost)
+    (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+			  cond_branch_not_taken, NULL, NULL_TREE, 0,
+			  vect_epilogue);
+
+  /* Take care of special costs for rgroup controls of partial vectors.  */
+  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    {
       /* Calculate how many masks we need to generate.  */
       unsigned int num_masks = 0;
       rgroup_controls *rgm;
@@ -3691,93 +3790,79 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 	 simpler and safer to use the worst-case cost; if this ends up
 	 being the tie-breaker between vectorizing or not, then it's
 	 probably better not to vectorize.  */
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, num_masks, vector_stmt,
-			    NULL, NULL_TREE, 0, vect_prologue);
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, num_masks - 1, vector_stmt,
-			    NULL, NULL_TREE, 0, vect_body);
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks,
+			    vector_stmt, NULL, NULL_TREE, 0, vect_prologue);
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
+			    vector_stmt, NULL, NULL_TREE, 0, vect_body);
     }
   else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
     {
-      peel_iters_prologue = 0;
-      peel_iters_epilogue = 0;
-    }
-  else if (npeel < 0)
-    {
-      peel_iters_prologue = assumed_vf / 2;
-      if (dump_enabled_p ())
-	dump_printf (MSG_NOTE, "cost model: "
-		     "prologue peel iters set to vf/2.\n");
-
-      /* If peeling for alignment is unknown, loop bound of main loop becomes
-         unknown.  */
-      peel_iters_epilogue = assumed_vf / 2;
-      if (dump_enabled_p ())
-	dump_printf (MSG_NOTE, "cost model: "
-		     "epilogue peel iters set to vf/2 because "
-		     "peeling for alignment is unknown.\n");
+      /* Refer to the functions vect_set_loop_condition_partial_vectors
+	 and vect_set_loop_controls_directly, we need to generate each
+	 length in the prologue and in the loop body if required. Although
+	 there are some possible optimization, we consider the worst case
+	 here.  */
+
+      /* For now we only operate length-based partial vectors on Power,
+	 which has constant VF all the time, we need some tweakings below
+	 if it doesn't hold in future.  */
+      gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ());
+
+      /* For wrap around checking.  */
+      tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+      unsigned int compare_precision = TYPE_PRECISION (compare_type);
+      widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
+
+      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
+      bool need_iterate_p
+	= (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+	   && !vect_known_niters_smaller_than_vf (loop_vinfo));
+
+      /* Init min/max, shift and minus cost relative to single
+	 scalar_stmt. For now we only use length-based partial vectors on
+	 Power, target specific cost tweaking may be needed for other
+	 ports in future.  */
+      unsigned int min_max_cost = 2;
+      unsigned int shift_cost = 1, minus_cost = 1;
+
+      /* Init cost relative to single scalar_stmt.  */
+      unsigned int prol_cnt = 0;
+      unsigned int body_cnt = 0;
+
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+	if (rgc->type)
+	  {
+	    unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
 
-      /* If peeled iterations are unknown, count a taken branch and a not taken
-         branch per peeled loop. Even if scalar loop iterations are known,
-         vector iterations are not known since peeled prologue iterations are
-         not known. Hence guards remain the same.  */
-      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
-			    NULL, NULL_TREE, 0, vect_prologue);
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, 1, cond_branch_not_taken,
-			    NULL, NULL_TREE, 0, vect_prologue);
-      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
-			    NULL, NULL_TREE, 0, vect_epilogue);
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, 1, cond_branch_not_taken,
-			    NULL, NULL_TREE, 0, vect_epilogue);
-      stmt_info_for_cost *si;
-      int j;
-      FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si)
-	{
-	  (void) add_stmt_cost (loop_vinfo, target_cost_data,
-				si->count * peel_iters_prologue,
-				si->kind, si->stmt_info, si->vectype,
-				si->misalign,
-				vect_prologue);
-	  (void) add_stmt_cost (loop_vinfo, target_cost_data,
-				si->count * peel_iters_epilogue,
-				si->kind, si->stmt_info, si->vectype,
-				si->misalign,
-				vect_epilogue);
-	}
-    }
-  else
-    {
-      stmt_vector_for_cost prologue_cost_vec, epilogue_cost_vec;
-      stmt_info_for_cost *si;
-      int j;
-      void *data = LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
+	    /* Need one shift for niters_total computation.  */
+	    if (!niters_known_p && nitems != 1)
+	      prol_cnt += shift_cost;
 
-      prologue_cost_vec.create (2);
-      epilogue_cost_vec.create (2);
-      peel_iters_prologue = npeel;
+	    /* Need to handle wrap around.  */
+	    if (iv_limit == -1
+		|| (wi::min_precision (iv_limit * nitems, UNSIGNED)
+		    > compare_precision))
+	      prol_cnt += (min_max_cost + minus_cost);
 
-      (void) vect_get_known_peeling_cost (loop_vinfo, peel_iters_prologue,
-					  &peel_iters_epilogue,
-					  &LOOP_VINFO_SCALAR_ITERATION_COST
-					    (loop_vinfo),
-					  &prologue_cost_vec,
-					  &epilogue_cost_vec);
+	    /* Need to handle batch limit excepting for the 1st one.  */
+	    prol_cnt += (min_max_cost + minus_cost) * num_vectors_m1;
 
-      FOR_EACH_VEC_ELT (prologue_cost_vec, j, si)
-	(void) add_stmt_cost (loop_vinfo,
-			      data, si->count, si->kind, si->stmt_info,
-			      si->vectype, si->misalign, vect_prologue);
+	    unsigned int num_vectors = num_vectors_m1 + 1;
+	    /* Need to set up lengths in prologue, only one MIN required
+	       since start index is zero.  */
+	    prol_cnt += min_max_cost * num_vectors;
 
-      FOR_EACH_VEC_ELT (epilogue_cost_vec, j, si)
-	(void) add_stmt_cost (loop_vinfo,
-			      data, si->count, si->kind, si->stmt_info,
-			      si->vectype, si->misalign, vect_epilogue);
+	    /* Need to update lengths in body for next iteration.  */
+	    if (need_iterate_p)
+	      body_cnt += (2 * min_max_cost + minus_cost) * num_vectors;
+	  }
 
-      prologue_cost_vec.release ();
-      epilogue_cost_vec.release ();
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt, scalar_stmt,
+			    NULL, NULL_TREE, 0, vect_prologue);
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, scalar_stmt,
+			    NULL, NULL_TREE, 0, vect_body);
     }
 
   /* FORNOW: The scalar outside cost is incremented in one of the
@@ -3913,8 +3998,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     }
 
   /* ??? The "if" arm is written to handle all cases; see below for what
-     we would do for !LOOP_VINFO_FULLY_MASKED_P.  */
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+     we would do for !LOOP_VINFO_USING_PARTIAL_VECTORS_P.  */
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* Rewriting the condition above in terms of the number of
 	 vector iterations (vniters) rather than the number of
@@ -3941,7 +4026,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 	dump_printf (MSG_NOTE, "  Minimum number of vector iterations: %d\n",
 		     min_vec_niters);
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  /* Now that we know the minimum number of vector iterations,
 	     find the minimum niters for which the scalar cost is larger:
@@ -3996,6 +4081,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       && min_profitable_iters < (assumed_vf + peel_iters_prologue))
     /* We want the vectorized loop to execute at least once.  */
     min_profitable_iters = assumed_vf + peel_iters_prologue;
+  else if (min_profitable_iters < peel_iters_prologue)
+    /* For LOOP_VINFO_USING_PARTIAL_VECTORS_P, we need to ensure the
+       vectorized loop to execute at least once.  */
+    min_profitable_iters = peel_iters_prologue;
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -4013,7 +4102,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 
   if (vec_outside_cost <= 0)
     min_profitable_estimate = 0;
-  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* This is a repeat of the code above, but with + SOC rather
 	 than - SOC.  */
@@ -4025,7 +4114,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       if (outside_overhead > 0)
 	min_vec_niters = outside_overhead / saving_per_viter + 1;
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  int threshold = (vec_inside_cost * min_vec_niters
 			   + vec_outside_cost
Richard Sandiford July 24, 2020, 4:21 p.m. | #2
"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi,

>

> Sorry, please ignore the previously attached file, which isn't the latest one

> although almost are the same.   The latest tested is attached here.  

>

> Sorry for the inconvenience.

>

> BR,

> Kewen

>

> on 2020/7/22 下午11:48, Kewen.Lin via Gcc-patches wrote:

>> 

>> It's a great idea, by following your subsequent suggestion to make the structure

>> like:

>> 

>>   - calculate peel_iters_prologue

>>   - calculate peel_iters_epilogue

>>   - add costs associated with peel_iters_prologue

>>   - add costs associated with peel_iters_epilogue

>>   - add costs related to branch taken/not_taken.

>> 

>> the updated v3 is attached.

>> 

>> Just bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit

>> param vect-partial-vector-usage=1, I'll test it without partial vectors

>> setting, also on aarch64 later.

>> 

>> BR,

>> Kewen

>> -----

>> gcc/ChangeLog:

>> 

>> 	* config/rs6000/rs6000.c (adjust_vect_cost_per_loop): New function.

>> 	(rs6000_finish_cost): Call adjust_vect_cost_per_loop.

>> 	* tree-vect-loop.c (vect_get_known_peeling_cost): Factor out some code

>> 	to determine peel_iters_epilogue to function ...

>> 	(vect_get_peel_iters_epilogue): ... this.  New function.

>> 	(vect_estimate_min_profitable_iters):  Add cost modeling for vector

>> 	with length, refactor cost calculation on peel_iters_prologue and

>> 	peel_iters_epilogue.

>> 


Thanks, the rearrangement of the existing code looks good.  Could you
split that and the new LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) stuff
out into separate patches?

> -/* Calculate cost of peeling the loop PEEL_ITERS_PROLOGUE times.  */

> -int

> -vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,

> -                             int *peel_iters_epilogue,

> -                             stmt_vector_for_cost *scalar_cost_vec,

> -			     stmt_vector_for_cost *prologue_cost_vec,

> -			     stmt_vector_for_cost *epilogue_cost_vec)

> +/* Calculate how many iterations peeled for epilogue with information LOOP_VINFO

> +   and PEEL_ITERS_PROLOGUE.  */


Maybe:

/* Estimate the number of peeled epilogue iterations for LOOP_VINFO.
   PEEL_ITERS_PROLOGUE is the number of peeled prologue iterations,
   or -1 if not known.  */

> […]

> -      peel_iters_prologue = niters < peel_iters_prologue ?

> -                            niters : peel_iters_prologue;

> -      *peel_iters_epilogue = (niters - peel_iters_prologue) % assumed_vf;

> +      int npeel_prol

> +	= niters < peel_iters_prologue ? niters : peel_iters_prologue;

> +      int npeel_epil = (niters - npeel_prol) % assumed_vf;


Here and in the rest of the patch, I think we should stick with
the “ogue” rather than shorten it this much.

This is a comment about the original code, but it seems simpler to
write as either:

  if (peel_iters_prologue > niters)
    peel_iters_prologue = niters;

or:

  peel_iters_prologue = MIN (niters, peel_iters_prologue);

> […]

> +      /* Refer to the functions vect_set_loop_condition_partial_vectors


s/Refer/Referring/

> +	 and vect_set_loop_controls_directly, we need to generate each

> +	 length in the prologue and in the loop body if required. Although

> +	 there are some possible optimization, we consider the worst case


optimizations

> +	 here.  */

> +

> +      /* For now we only operate length-based partial vectors on Power,

> +	 which has constant VF all the time, we need some tweakings below

> +	 if it doesn't hold in future.  */

> +      gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ());


Where do you rely on this?  There didn't seem to be any obvious
to_constant uses.  Since this is “only” a cost calculation, we should
be using assumed_vf.

> +

> +      /* For wrap around checking.  */

> +      tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);

> +      unsigned int compare_precision = TYPE_PRECISION (compare_type);

> +      widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);

> +

> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);

> +      bool need_iterate_p

> +	= (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)

> +	   && !vect_known_niters_smaller_than_vf (loop_vinfo));

> +

> +      /* Init min/max, shift and minus cost relative to single

> +	 scalar_stmt. For now we only use length-based partial vectors on

> +	 Power, target specific cost tweaking may be needed for other

> +	 ports in future.  */

> +      unsigned int min_max_cost = 2;

> +      unsigned int shift_cost = 1, minus_cost = 1;

> +

> +      /* Init cost relative to single scalar_stmt.  */

> +      unsigned int prol_cnt = 0;

> +      unsigned int body_cnt = 0;

> +

> +      rgroup_controls *rgc;

> +      unsigned int num_vectors_m1;

> +      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)

> +	if (rgc->type)

> +	  {

> +	    unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;

>  

> -      /* If peeled iterations are unknown, count a taken branch and a not taken

> -         branch per peeled loop. Even if scalar loop iterations are known,

> -         vector iterations are not known since peeled prologue iterations are

> -         not known. Hence guards remain the same.  */

> -      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,

> -			    NULL, NULL_TREE, 0, vect_prologue);

> -      (void) add_stmt_cost (loop_vinfo,

> -			    target_cost_data, 1, cond_branch_not_taken,

> -			    NULL, NULL_TREE, 0, vect_prologue);

> -      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,

> -			    NULL, NULL_TREE, 0, vect_epilogue);

> -      (void) add_stmt_cost (loop_vinfo,

> -			    target_cost_data, 1, cond_branch_not_taken,

> -			    NULL, NULL_TREE, 0, vect_epilogue);

> -      stmt_info_for_cost *si;

> -      int j;

> -      FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si)

> -	{

> -	  (void) add_stmt_cost (loop_vinfo, target_cost_data,

> -				si->count * peel_iters_prologue,

> -				si->kind, si->stmt_info, si->vectype,

> -				si->misalign,

> -				vect_prologue);

> -	  (void) add_stmt_cost (loop_vinfo, target_cost_data,

> -				si->count * peel_iters_epilogue,

> -				si->kind, si->stmt_info, si->vectype,

> -				si->misalign,

> -				vect_epilogue);

> -	}

> -    }

> -  else

> -    {

> -      stmt_vector_for_cost prologue_cost_vec, epilogue_cost_vec;

> -      stmt_info_for_cost *si;

> -      int j;

> -      void *data = LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);

> +	    /* Need one shift for niters_total computation.  */

> +	    if (!niters_known_p && nitems != 1)

> +	      prol_cnt += shift_cost;

>  

> -      prologue_cost_vec.create (2);

> -      epilogue_cost_vec.create (2);

> -      peel_iters_prologue = npeel;

> +	    /* Need to handle wrap around.  */

> +	    if (iv_limit == -1

> +		|| (wi::min_precision (iv_limit * nitems, UNSIGNED)

> +		    > compare_precision))

> +	      prol_cnt += (min_max_cost + minus_cost);

>  

> -      (void) vect_get_known_peeling_cost (loop_vinfo, peel_iters_prologue,

> -					  &peel_iters_epilogue,

> -					  &LOOP_VINFO_SCALAR_ITERATION_COST

> -					    (loop_vinfo),

> -					  &prologue_cost_vec,

> -					  &epilogue_cost_vec);

> +	    /* Need to handle batch limit excepting for the 1st one.  */

> +	    prol_cnt += (min_max_cost + minus_cost) * num_vectors_m1;

>  

> -      FOR_EACH_VEC_ELT (prologue_cost_vec, j, si)

> -	(void) add_stmt_cost (loop_vinfo,

> -			      data, si->count, si->kind, si->stmt_info,

> -			      si->vectype, si->misalign, vect_prologue);

> +	    unsigned int num_vectors = num_vectors_m1 + 1;

> +	    /* Need to set up lengths in prologue, only one MIN required

> +	       since start index is zero.  */

> +	    prol_cnt += min_max_cost * num_vectors;

>  

> -      FOR_EACH_VEC_ELT (epilogue_cost_vec, j, si)

> -	(void) add_stmt_cost (loop_vinfo,

> -			      data, si->count, si->kind, si->stmt_info,

> -			      si->vectype, si->misalign, vect_epilogue);

> +	    /* Need to update lengths in body for next iteration.  */

> +	    if (need_iterate_p)

> +	      body_cnt += (2 * min_max_cost + minus_cost) * num_vectors;

> +	  }

>  

> -      prologue_cost_vec.release ();

> -      epilogue_cost_vec.release ();

> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt, scalar_stmt,

> +			    NULL, NULL_TREE, 0, vect_prologue);

> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, scalar_stmt,

> +			    NULL, NULL_TREE, 0, vect_body);


IMO this seems to be reproducing too much of the functions that you
referred to.  And the danger with that is that they could easily
get out of sync later.

Thanks,
Richard

Patch

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 009afc5f894..d71f2bf1c16 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5177,6 +5177,34 @@  rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
   return retval;
 }
 
+/* For some target specific vectorization cost which can't be handled per stmt,
+   we check the requisite conditions and adjust the vectorization cost
+   accordingly if satisfied.  One typical example is to model shift cost for
+   vector with length by counting number of required lengths under condition
+   LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
+
+static void
+adjust_vect_cost_per_loop (rs6000_cost_data *data)
+{
+  struct loop *loop = data->loop_info;
+  gcc_assert (loop);
+  loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
+
+  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      unsigned int shift_cnt = 0;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+	if (rgc->type)
+	  /* Each length needs one shift to fill into bits 0-7.  */
+	  shift_cnt += (num_vectors_m1 + 1);
+
+      rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt,
+			    NULL, NULL_TREE, 0, vect_body);
+    }
+}
+
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
@@ -5186,7 +5214,10 @@  rs6000_finish_cost (void *data, unsigned *prologue_cost,
   rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
 
   if (cost_data->loop_info)
-    rs6000_density_test (cost_data);
+    {
+      adjust_vect_cost_per_loop (cost_data);
+      rs6000_density_test (cost_data);
+    }
 
   /* Don't vectorize minimum-vectorization-factor, simple copy loops
      that require versioning for any reason.  The vectorization is at
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index e933441b922..c85b3ea1393 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3474,42 +3474,56 @@  vect_is_simple_reduction (loop_vec_info loop_info, stmt_vec_info phi_info,
   return NULL;
 }
 
-/* Calculate cost of peeling the loop PEEL_ITERS_PROLOGUE times.  */
-int
-vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
-                             int *peel_iters_epilogue,
-                             stmt_vector_for_cost *scalar_cost_vec,
-			     stmt_vector_for_cost *prologue_cost_vec,
-			     stmt_vector_for_cost *epilogue_cost_vec)
+/* Calculate how many iterations peeled for epilogue with information LOOP_VINFO
+   and PEEL_ITERS_PROLOGUE.  */
+
+static int
+vect_get_peel_iters_epilogue (loop_vec_info loop_vinfo, int peel_iters_prologue)
 {
-  int retval = 0;
   int assumed_vf = vect_vf_for_cost (loop_vinfo);
-
   if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     {
-      *peel_iters_epilogue = assumed_vf / 2;
       if (dump_enabled_p ())
-        dump_printf_loc (MSG_NOTE, vect_location,
+	dump_printf_loc (MSG_NOTE, vect_location,
 			 "cost model: epilogue peel iters set to vf/2 "
 			 "because loop iterations are unknown .\n");
-
-      /* If peeled iterations are known but number of scalar loop
-         iterations are unknown, count a taken branch per peeled loop.  */
-      retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken,
-				 NULL, NULL_TREE, 0, vect_prologue);
-      retval += record_stmt_cost (epilogue_cost_vec, 1, cond_branch_taken,
-				  NULL, NULL_TREE, 0, vect_epilogue);
+      return assumed_vf / 2;
     }
   else
     {
       int niters = LOOP_VINFO_INT_NITERS (loop_vinfo);
-      peel_iters_prologue = niters < peel_iters_prologue ?
-                            niters : peel_iters_prologue;
-      *peel_iters_epilogue = (niters - peel_iters_prologue) % assumed_vf;
+      int npeel_prol
+	= niters < peel_iters_prologue ? niters : peel_iters_prologue;
+      int npeel_epil = (niters - npeel_prol) % assumed_vf;
       /* If we need to peel for gaps, but no peeling is required, we have to
 	 peel VF iterations.  */
-      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && !*peel_iters_epilogue)
-	*peel_iters_epilogue = assumed_vf;
+      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && !npeel_epil)
+	npeel_epil = assumed_vf;
+      return npeel_epil;
+    }
+}
+
+/* Calculate cost of peeling the loop PEEL_ITERS_PROLOGUE times.  */
+int
+vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
+			     int *peel_iters_epilogue,
+			     stmt_vector_for_cost *scalar_cost_vec,
+			     stmt_vector_for_cost *prologue_cost_vec,
+			     stmt_vector_for_cost *epilogue_cost_vec)
+{
+  int retval = 0;
+
+  *peel_iters_epilogue
+    = vect_get_peel_iters_epilogue (loop_vinfo, peel_iters_prologue);
+
+  if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+    {
+      /* If peeled iterations are known but number of scalar loop
+	 iterations are unknown, count a taken branch per peeled loop.  */
+      retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken, NULL,
+				 NULL_TREE, 0, vect_prologue);
+      retval += record_stmt_cost (epilogue_cost_vec, 1, cond_branch_taken, NULL,
+				  NULL_TREE, 0, vect_epilogue);
     }
 
   stmt_info_for_cost *si;
@@ -3652,24 +3666,106 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
      TODO: Build an expression that represents peel_iters for prologue and
      epilogue to be used in a run-time test.  */
 
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  bool prol_need_br_taken_cost = false;
+  bool prol_need_br_not_taken_cost = false;
+
+  /* Calculate peel_iters_prologue.  */
+  if (vect_use_loop_mask_for_alignment_p (loop_vinfo))
+    peel_iters_prologue = 0;
+  else if (npeel < 0)
     {
-      peel_iters_prologue = 0;
-      peel_iters_epilogue = 0;
+      peel_iters_prologue = assumed_vf / 2;
+      if (dump_enabled_p ())
+	dump_printf (MSG_NOTE, "cost model: "
+			       "prologue peel iters set to vf/2.\n");
 
-      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
+      /* If peeled iterations are unknown, count a taken branch and a not taken
+	 branch per peeled loop. Even if scalar loop iterations are known,
+	 vector iterations are not known since peeled prologue iterations are
+	 not known. Hence guards remain the same.  */
+      prol_need_br_taken_cost = true;
+      prol_need_br_not_taken_cost = true;
+    }
+  else
+    peel_iters_prologue = npeel;
+
+  bool epil_need_br_taken_cost = false;
+  bool epil_need_br_not_taken_cost = false;
+
+  /* Calculate peel_iters_epilogue.  */
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
+    /* We need to peel exactly one iteration for gaps.  */
+    peel_iters_epilogue = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0;
+  else if (npeel < 0)
+    {
+      /* If peeling for alignment is unknown, loop bound of main loop
+	 becomes unknown.  */
+      peel_iters_epilogue = assumed_vf / 2;
+      if (dump_enabled_p ())
+	dump_printf (MSG_NOTE, "cost model: "
+			       "epilogue peel iters set to vf/2 because "
+			       "peeling for alignment is unknown.\n");
+
+      /* See the same reason above in peel_iters_prologue calculation.  */
+      epil_need_br_taken_cost = true;
+      epil_need_br_not_taken_cost = true;
+    }
+  else
+    {
+      peel_iters_epilogue = vect_get_peel_iters_epilogue (loop_vinfo, npeel);
+      if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
 	{
-	  /* We need to peel exactly one iteration.  */
-	  peel_iters_epilogue += 1;
-	  stmt_info_for_cost *si;
-	  int j;
-	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
-			    j, si)
-	    (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
-				  si->kind, si->stmt_info, si->vectype,
-				  si->misalign, vect_epilogue);
+	  /* If peeled iterations are known but number of scalar loop
+	     iterations are unknown, count a taken branch per peeled loop.  */
+	  prol_need_br_taken_cost = true;
+	  epil_need_br_taken_cost = true;
 	}
+    }
+
+  stmt_info_for_cost *si;
+  int j;
+  /* Add costs associated with peel_iters_prologue.  */
+  if (peel_iters_prologue)
+    FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si)
+      {
+	(void) add_stmt_cost (loop_vinfo, target_cost_data,
+			      si->count * peel_iters_prologue, si->kind,
+			      si->stmt_info, si->vectype, si->misalign,
+			      vect_prologue);
+      }
+
+  /* Add costs associated with peel_iters_prologue.  */
+  if (peel_iters_epilogue)
+    FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si)
+      {
+	(void) add_stmt_cost (loop_vinfo, target_cost_data,
+			      si->count * peel_iters_epilogue, si->kind,
+			      si->stmt_info, si->vectype, si->misalign,
+			      vect_epilogue);
+      }
+
+  /* Add possible cond_branch_taken/cond_branch_not_taken cost.  */
+  if (prol_need_br_taken_cost)
+    (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
+			  NULL, NULL_TREE, 0, vect_prologue);
 
+  if (prol_need_br_not_taken_cost)
+    (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+			  cond_branch_not_taken, NULL, NULL_TREE, 0,
+			  vect_prologue);
+
+  if (epil_need_br_taken_cost)
+    (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
+			  NULL, NULL_TREE, 0, vect_epilogue);
+
+  if (epil_need_br_not_taken_cost)
+    (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+			  cond_branch_not_taken, NULL, NULL_TREE, 0,
+			  vect_epilogue);
+
+  /* Take care of special costs for rgroup controls of partial vectors.  */
+  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    {
       /* Calculate how many masks we need to generate.  */
       unsigned int num_masks = 0;
       rgroup_controls *rgm;
@@ -3691,93 +3787,79 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 	 simpler and safer to use the worst-case cost; if this ends up
 	 being the tie-breaker between vectorizing or not, then it's
 	 probably better not to vectorize.  */
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, num_masks, vector_stmt,
-			    NULL, NULL_TREE, 0, vect_prologue);
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, num_masks - 1, vector_stmt,
-			    NULL, NULL_TREE, 0, vect_body);
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks,
+			    vector_stmt, NULL, NULL_TREE, 0, vect_prologue);
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
+			    vector_stmt, NULL, NULL_TREE, 0, vect_body);
     }
   else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
     {
-      peel_iters_prologue = 0;
-      peel_iters_epilogue = 0;
-    }
-  else if (npeel < 0)
-    {
-      peel_iters_prologue = assumed_vf / 2;
-      if (dump_enabled_p ())
-	dump_printf (MSG_NOTE, "cost model: "
-		     "prologue peel iters set to vf/2.\n");
-
-      /* If peeling for alignment is unknown, loop bound of main loop becomes
-         unknown.  */
-      peel_iters_epilogue = assumed_vf / 2;
-      if (dump_enabled_p ())
-	dump_printf (MSG_NOTE, "cost model: "
-		     "epilogue peel iters set to vf/2 because "
-		     "peeling for alignment is unknown.\n");
+      /* Refer to the functions vect_set_loop_condition_partial_vectors
+	 and vect_set_loop_controls_directly, we need to generate each
+	 length in the prologue and in the loop body if required. Although
+	 there are some possible optimization, we consider the worst case
+	 here.  */
+
+      /* For now we only operate length-based partial vectors on Power,
+	 which has constant VF all the time, we need some tweakings below
+	 if it doesn't hold in future.  */
+      gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ());
+
+      /* For wrap around checking.  */
+      tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+      unsigned int compare_precision = TYPE_PRECISION (compare_type);
+      widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
+
+      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
+      bool need_iterate_p
+	= (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+	   && !vect_known_niters_smaller_than_vf (loop_vinfo));
+
+      /* Init min/max, shift and minus cost relative to single
+	 scalar_stmt. For now we only use length-based partial vectors on
+	 Power, target specific cost tweaking may be needed for other
+	 ports in future.  */
+      unsigned int min_max_cost = 2;
+      unsigned int shift_cost = 1, minus_cost = 1;
+
+      /* Init cost relative to single scalar_stmt.  */
+      unsigned int prol_cnt = 0;
+      unsigned int body_cnt = 0;
+
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+	if (rgc->type)
+	  {
+	    unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
 
-      /* If peeled iterations are unknown, count a taken branch and a not taken
-         branch per peeled loop. Even if scalar loop iterations are known,
-         vector iterations are not known since peeled prologue iterations are
-         not known. Hence guards remain the same.  */
-      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
-			    NULL, NULL_TREE, 0, vect_prologue);
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, 1, cond_branch_not_taken,
-			    NULL, NULL_TREE, 0, vect_prologue);
-      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
-			    NULL, NULL_TREE, 0, vect_epilogue);
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, 1, cond_branch_not_taken,
-			    NULL, NULL_TREE, 0, vect_epilogue);
-      stmt_info_for_cost *si;
-      int j;
-      FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si)
-	{
-	  (void) add_stmt_cost (loop_vinfo, target_cost_data,
-				si->count * peel_iters_prologue,
-				si->kind, si->stmt_info, si->vectype,
-				si->misalign,
-				vect_prologue);
-	  (void) add_stmt_cost (loop_vinfo, target_cost_data,
-				si->count * peel_iters_epilogue,
-				si->kind, si->stmt_info, si->vectype,
-				si->misalign,
-				vect_epilogue);
-	}
-    }
-  else
-    {
-      stmt_vector_for_cost prologue_cost_vec, epilogue_cost_vec;
-      stmt_info_for_cost *si;
-      int j;
-      void *data = LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
+	    /* Need one shift for niters_total computation.  */
+	    if (!niters_known_p && nitems != 1)
+	      prol_cnt += shift_cost;
 
-      prologue_cost_vec.create (2);
-      epilogue_cost_vec.create (2);
-      peel_iters_prologue = npeel;
+	    /* Need to handle wrap around.  */
+	    if (iv_limit == -1
+		|| (wi::min_precision (iv_limit * nitems, UNSIGNED)
+		    > compare_precision))
+	      prol_cnt += (min_max_cost + minus_cost);
 
-      (void) vect_get_known_peeling_cost (loop_vinfo, peel_iters_prologue,
-					  &peel_iters_epilogue,
-					  &LOOP_VINFO_SCALAR_ITERATION_COST
-					    (loop_vinfo),
-					  &prologue_cost_vec,
-					  &epilogue_cost_vec);
+	    /* Need to handle batch limit excepting for the 1st one.  */
+	    prol_cnt += (min_max_cost + minus_cost) * num_vectors_m1;
 
-      FOR_EACH_VEC_ELT (prologue_cost_vec, j, si)
-	(void) add_stmt_cost (loop_vinfo,
-			      data, si->count, si->kind, si->stmt_info,
-			      si->vectype, si->misalign, vect_prologue);
+	    unsigned int num_vectors = num_vectors_m1 + 1;
+	    /* Need to set up lengths in prologue, only one MIN required
+	       since start index is zero.  */
+	    prol_cnt += min_max_cost * num_vectors;
 
-      FOR_EACH_VEC_ELT (epilogue_cost_vec, j, si)
-	(void) add_stmt_cost (loop_vinfo,
-			      data, si->count, si->kind, si->stmt_info,
-			      si->vectype, si->misalign, vect_epilogue);
+	    /* Need to update lengths in body for next iteration.  */
+	    if (need_iterate_p)
+	      body_cnt += (2 * min_max_cost + minus_cost) * num_vectors;
+	  }
 
-      prologue_cost_vec.release ();
-      epilogue_cost_vec.release ();
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt, scalar_stmt,
+			    NULL, NULL_TREE, 0, vect_prologue);
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, scalar_stmt,
+			    NULL, NULL_TREE, 0, vect_body);
     }
 
   /* FORNOW: The scalar outside cost is incremented in one of the
@@ -3913,8 +3995,8 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     }
 
   /* ??? The "if" arm is written to handle all cases; see below for what
-     we would do for !LOOP_VINFO_FULLY_MASKED_P.  */
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+     we would do for !LOOP_VINFO_USING_PARTIAL_VECTORS_P.  */
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* Rewriting the condition above in terms of the number of
 	 vector iterations (vniters) rather than the number of
@@ -3941,7 +4023,7 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 	dump_printf (MSG_NOTE, "  Minimum number of vector iterations: %d\n",
 		     min_vec_niters);
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  /* Now that we know the minimum number of vector iterations,
 	     find the minimum niters for which the scalar cost is larger:
@@ -3996,6 +4078,10 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       && min_profitable_iters < (assumed_vf + peel_iters_prologue))
     /* We want the vectorized loop to execute at least once.  */
     min_profitable_iters = assumed_vf + peel_iters_prologue;
+  else if (min_profitable_iters < peel_iters_prologue)
+    /* For LOOP_VINFO_USING_PARTIAL_VECTORS_P, we need to ensure the
+       vectorized loop to execute at least once.  */
+    min_profitable_iters = peel_iters_prologue;
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -4013,7 +4099,7 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 
   if (vec_outside_cost <= 0)
     min_profitable_estimate = 0;
-  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* This is a repeat of the code above, but with + SOC rather
 	 than - SOC.  */
@@ -4025,7 +4111,7 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       if (outside_overhead > 0)
 	min_vec_niters = outside_overhead / saving_per_viter + 1;
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  int threshold = (vec_inside_cost * min_vec_niters
 			   + vec_outside_cost