[1/2,vect] PR 88915: Vectorize epilogues when versioning loops

Message ID 385547e6-abbd-3633-ad69-d4fb6e604c97@arm.com
State New
Headers show
Series
  • Improve vectorization of epilogues
Related show

Commit Message

Andre Simoes Dias Vieira Aug. 23, 2019, 4:50 p.m.
Hi,

This patch is an improvement on my last RFC.  As you pointed out, we can 
do the vectorization analysis of the epilogues before doing the 
transformation, using the same approach as used by openmp simd.  I have 
not yet incorporated the cost tweaks for vectorizing the epilogue, I 
would like to do this in a subsequent patch, to make it easier to test 
the differences.

I currently disable the vectorization of epilogues when versioning for 
iterations.  This is simply because I do not completely understand how 
the assumptions are created and couldn't determine whether using 
skip_vectors with this would work.  If you don't think it is a problem 
or have a testcase to show it work I would gladly look at it.

Bootstrapped this and the next patch on x86_64 and 
aarch64-unknown-linux-gnu, with no regressions (after test changes in 
next patch).

gcc/ChangeLog:
2019-08-23  Andre Vieira  <andre.simoesdiasvieira@arm.com>

          PR 88915
          * gentype.c (main): Add poly_uint64 type to generator.
          * tree-vect-loop.c (vect_analyze_loop_2): Make it determine
          whether we vectorize epilogue loops.
          (vect_analyze_loop): Idem.
          (vect_transform_loop): Pass decision to vectorize epilogues
          to vect_do_peeling.
          * tree-vect-loop-manip.c (vect_do_peeling): Enable skip-vectors
          when doing loop versioning if we decided to vectorize epilogues.
          (vect-loop_versioning): Moved decision to check_profitability
          based on cost model.
          * tree-vectorizer.h (vect_loop_versioning, vect_do_peeling,
          vect_analyze_loop, vect_transform_loop): Update declarations.
          * tree-vectorizer.c: Include params.h
          (try_vectorize_loop_1): Initialize vect_epilogues_nomask
          to PARAM_VECT_EPILOGUES_NOMASK and pass it to vect_analyze_loop
          and vect_transform_loop.  Also make sure vectorizing epilogues
          does not count towards number of vectorized loops.

Comments

Richard Biener Aug. 26, 2019, 12:41 p.m. | #1
On Fri, 23 Aug 2019, Andre Vieira (lists) wrote:

> Hi,

> 

> This patch is an improvement on my last RFC.  As you pointed out, we can do

> the vectorization analysis of the epilogues before doing the transformation,

> using the same approach as used by openmp simd.  I have not yet incorporated

> the cost tweaks for vectorizing the epilogue, I would like to do this in a

> subsequent patch, to make it easier to test the differences.

> 

> I currently disable the vectorization of epilogues when versioning for

> iterations.  This is simply because I do not completely understand how the

> assumptions are created and couldn't determine whether using skip_vectors with

> this would work.  If you don't think it is a problem or have a testcase to

> show it work I would gladly look at it.


I don't think there's any problem.  Basically the versioning condition
is if (can_we_compute_niter), most of the time it is an extra
condition from niter analysis under which niter is for example zero.
This should also be the same with all vector sizes.

-               delete loop_vinfo;
+               {
+                 /* Set versioning threshold of the original LOOP_VINFO 
based
+                    on the last vectorization of the epilog.  */
+                 LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo)
+                   = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
+                 delete loop_vinfo;
+               }

I'm not sure this works reliably since the order we process vector
sizes is under target control and not necessarily decreasing.  I think
you want to keep track of the minimum here?  Preferably separately
I guess.

From what I see vect_analyze_loop_2 doesn't need vect_epilogues_nomask
and thus it doesn't change throughout the iteration.

       else
-       delete loop_vinfo;
+       {
+         /* Disable epilog vectorization if we can't determine the 
epilogs can
+            be vectorized.  */
+         *vect_epilogues_nomask &= vectorized_loops > 1;
+         delete loop_vinfo;
+       }

and this is a bit premature and instead it should be done
just before returning success?  Maybe also storing the
epilogue vector sizes we can handle in the loop_vinfo,
thereby representing !vect_epilogues_nomask if there are no
such sizes which also means that

@@ -1013,8 +1015,13 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> 
*&simduid_to_vf_htab,

   /* Epilogue of vectorized loop must be vectorized too.  */
   if (new_loop)
-    ret |= try_vectorize_loop_1 (simduid_to_vf_htab, 
num_vectorized_loops,
-                                new_loop, loop_vinfo, NULL, NULL);
+    {
+      /* Don't include vectorized epilogues in the "vectorized loops" 
count.
+       */
+      unsigned dont_count = *num_vectorized_loops;
+      ret |= try_vectorize_loop_1 (simduid_to_vf_htab, &dont_count,
+                                  new_loop, loop_vinfo, NULL, NULL);
+    }


can be optimized to not re-check all smaller sizes (but even assert
re-analysis succeeds to the original result for the actual transform).

Otherwise this looks reasonable to me.

Thanks,
Richard.

> 

> Bootstrapped this and the next patch on x86_64 and aarch64-unknown-linux-gnu,

> with no regressions (after test changes in next patch).

> 

> gcc/ChangeLog:

> 2019-08-23  Andre Vieira  <andre.simoesdiasvieira@arm.com>

> 

>          PR 88915

>          * gentype.c (main): Add poly_uint64 type to generator.

>          * tree-vect-loop.c (vect_analyze_loop_2): Make it determine

>          whether we vectorize epilogue loops.

>          (vect_analyze_loop): Idem.

>          (vect_transform_loop): Pass decision to vectorize epilogues

>          to vect_do_peeling.

>          * tree-vect-loop-manip.c (vect_do_peeling): Enable skip-vectors

>          when doing loop versioning if we decided to vectorize epilogues.

>          (vect-loop_versioning): Moved decision to check_profitability

>          based on cost model.

>          * tree-vectorizer.h (vect_loop_versioning, vect_do_peeling,

>          vect_analyze_loop, vect_transform_loop): Update declarations.

>          * tree-vectorizer.c: Include params.h

>          (try_vectorize_loop_1): Initialize vect_epilogues_nomask

>          to PARAM_VECT_EPILOGUES_NOMASK and pass it to vect_analyze_loop

>          and vect_transform_loop.  Also make sure vectorizing epilogues

>          does not count towards number of vectorized loops.

> 

> 


-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 247165 (AG München)
Jeff Law Sept. 4, 2019, 3:34 p.m. | #2
On 8/23/19 10:50 AM, Andre Vieira (lists) wrote:
> Hi,

> 

> This patch is an improvement on my last RFC.  As you pointed out, we can

> do the vectorization analysis of the epilogues before doing the

> transformation, using the same approach as used by openmp simd.  I have

> not yet incorporated the cost tweaks for vectorizing the epilogue, I

> would like to do this in a subsequent patch, to make it easier to test

> the differences.

> 

> I currently disable the vectorization of epilogues when versioning for

> iterations.  This is simply because I do not completely understand how

> the assumptions are created and couldn't determine whether using

> skip_vectors with this would work.  If you don't think it is a problem

> or have a testcase to show it work I would gladly look at it.

> 

> Bootstrapped this and the next patch on x86_64 and

> aarch64-unknown-linux-gnu, with no regressions (after test changes in

> next patch).

> 

> gcc/ChangeLog:

> 2019-08-23  Andre Vieira  <andre.simoesdiasvieira@arm.com>

> 

>          PR 88915

>          * gentype.c (main): Add poly_uint64 type to generator.

>          * tree-vect-loop.c (vect_analyze_loop_2): Make it determine

>          whether we vectorize epilogue loops.

>          (vect_analyze_loop): Idem.

>          (vect_transform_loop): Pass decision to vectorize epilogues

>          to vect_do_peeling.

>          * tree-vect-loop-manip.c (vect_do_peeling): Enable skip-vectors

>          when doing loop versioning if we decided to vectorize epilogues.

>          (vect-loop_versioning): Moved decision to check_profitability

>          based on cost model.

>          * tree-vectorizer.h (vect_loop_versioning, vect_do_peeling,

>          vect_analyze_loop, vect_transform_loop): Update declarations.

>          * tree-vectorizer.c: Include params.h

>          (try_vectorize_loop_1): Initialize vect_epilogues_nomask

>          to PARAM_VECT_EPILOGUES_NOMASK and pass it to vect_analyze_loop

>          and vect_transform_loop.  Also make sure vectorizing epilogues

>          does not count towards number of vectorized loops.

Nit.  In several places you use "epilog", proper spelling is "epilogue".



> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c

> index 173e6b51652fd023893b38da786ff28f827553b5..25c3fc8ff55e017ae0b971fa93ce8ce2a07cb94c 100644

> --- a/gcc/tree-vectorizer.c

> +++ b/gcc/tree-vectorizer.c

> @@ -1013,8 +1015,13 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,

>  

>    /* Epilogue of vectorized loop must be vectorized too.  */

>    if (new_loop)

> -    ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops,

> -				 new_loop, loop_vinfo, NULL, NULL);

> +    {

> +      /* Don't include vectorized epilogues in the "vectorized loops" count.

> +       */

> +      unsigned dont_count = *num_vectorized_loops;

> +      ret |= try_vectorize_loop_1 (simduid_to_vf_htab, &dont_count,

> +				   new_loop, loop_vinfo, NULL, NULL);

> +    }

Nit.  Don't wrap a comment with just the closing */ on its own line.
Instead wrap before "count" so that.

This is fine for the trunk after fixing those nits.

jeff
Andre Simoes Dias Vieira Sept. 18, 2019, 11:11 a.m. | #3
Hi Richard,

As I mentioned in the IRC channel, this is my current work in progress 
patch. It currently ICE's when vectorizing 
gcc/testsuite/gcc.c-torture/execute/nestfunc-2.c with '-O3' and '--param 
vect-epilogues-nomask=1'.

It ICE's because the epilogue loop (after if conversion) and main loop 
(before vectorization) are not the same, there are a bunch of extra BBs 
and the epilogue loop seems to need some cleaning up too.

Let me know if you see a way around this issue.

Cheers,
Andre
diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 0b0154ffd7bf031a005de993b101d9db6dd98c43..d01512ea46467f1cf77793bdc75b48e71b0b9641 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -21,6 +21,7 @@ along with GCC; see the file COPYING3.  If not see
 #define GCC_CFGLOOP_H
 
 #include "cfgloopmanip.h"
+#include "target.h"
 
 /* Structure to hold decision about unrolling/peeling.  */
 enum lpt_dec
@@ -268,6 +269,9 @@ public:
      the basic-block from being collected but its index can still be
      reused.  */
   basic_block former_header;
+
+  /* Keep track of vector sizes we know we can vectorize the epilogue with.  */
+  vector_sizes epilogue_vsizes;
 };
 
 /* Set if the loop is known to be infinite.  */
diff --git a/gcc/cfgloop.c b/gcc/cfgloop.c
index 4ad1f658708f83dbd8789666c26d4bd056837bc6..f3e81bcd00b3f125389aa15b12dc5201b3578d20 100644
--- a/gcc/cfgloop.c
+++ b/gcc/cfgloop.c
@@ -198,6 +198,7 @@ flow_loop_free (class loop *loop)
       exit->prev = exit;
     }
 
+  loop->epilogue_vsizes.release();
   ggc_free (loop->exits);
   ggc_free (loop);
 }
@@ -355,6 +356,7 @@ alloc_loop (void)
   loop->nb_iterations_upper_bound = 0;
   loop->nb_iterations_likely_upper_bound = 0;
   loop->nb_iterations_estimate = 0;
+  loop->epilogue_vsizes.create(8);
   return loop;
 }
 
diff --git a/gcc/gengtype.c b/gcc/gengtype.c
index 53317337cf8c8e8caefd6b819d28b3bba301e755..80fb6ef71465b24e034fa45d69fec56be6b2e7f8 100644
--- a/gcc/gengtype.c
+++ b/gcc/gengtype.c
@@ -5197,6 +5197,7 @@ main (int argc, char **argv)
       POS_HERE (do_scalar_typedef ("widest_int", &pos));
       POS_HERE (do_scalar_typedef ("int64_t", &pos));
       POS_HERE (do_scalar_typedef ("poly_int64", &pos));
+      POS_HERE (do_scalar_typedef ("poly_uint64", &pos));
       POS_HERE (do_scalar_typedef ("uint64_t", &pos));
       POS_HERE (do_scalar_typedef ("uint8", &pos));
       POS_HERE (do_scalar_typedef ("uintptr_t", &pos));
@@ -5206,6 +5207,7 @@ main (int argc, char **argv)
       POS_HERE (do_scalar_typedef ("machine_mode", &pos));
       POS_HERE (do_scalar_typedef ("fixed_size_mode", &pos));
       POS_HERE (do_scalar_typedef ("CONSTEXPR", &pos));
+      POS_HERE (do_scalar_typedef ("vector_sizes", &pos));
       POS_HERE (do_typedef ("PTR", 
 			    create_pointer (resolve_typedef ("void", &pos)),
 			    &pos));
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 5c25441c70a271f04730486e513437fffa75b7e3..b1c13dafdeeec8d95f00c232822d3ab9b11f4046 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -26,6 +26,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree.h"
 #include "gimple.h"
 #include "cfghooks.h"
+#include "tree-if-conv.h"
 #include "tree-pass.h"
 #include "ssa.h"
 #include "fold-const.h"
@@ -1730,6 +1731,7 @@ vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
 {
   unsigned int i;
   vec<data_reference_p> datarefs = LOOP_VINFO_DATAREFS (loop_vinfo);
+  vec<data_reference> datarefs_copy = loop_vinfo->shared->datarefs_copy;
   struct data_reference *dr;
 
   DUMP_VECT_SCOPE ("vect_update_inits_of_dr");
@@ -1756,6 +1758,12 @@ vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
       if (!STMT_VINFO_GATHER_SCATTER_P (dr_info->stmt))
 	vect_update_init_of_dr (dr, niters, code);
     }
+  FOR_EACH_VEC_ELT (datarefs_copy, i, dr)
+    {
+      dr_vec_info *dr_info = loop_vinfo->lookup_dr (dr);
+      if (!STMT_VINFO_GATHER_SCATTER_P (dr_info->stmt))
+	vect_update_init_of_dr (dr, niters, code);
+    }
 }
 
 /* For the information recorded in LOOP_VINFO prepare the loop for peeling
@@ -2409,6 +2417,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   profile_probability prob_prolog, prob_vector, prob_epilog;
   int estimated_vf;
   int prolog_peeling = 0;
+  bool vect_epilogues
+    = loop_vinfo->epilogue_vinfos.length () > 0;
   /* We currently do not support prolog peeling if the target alignment is not
      known at compile time.  'vect_gen_prolog_loop_niters' depends on the
      target alignment being constant.  */
@@ -2469,12 +2479,12 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   /* Prolog loop may be skipped.  */
   bool skip_prolog = (prolog_peeling != 0);
   /* Skip to epilog if scalar loop may be preferred.  It's only needed
-     when we peel for epilog loop and when it hasn't been checked with
-     loop versioning.  */
+     when we peel for epilog loop or when we loop version.  */
   bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
 		      ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo),
 				  bound_prolog + bound_epilog)
-		      : !LOOP_REQUIRES_VERSIONING (loop_vinfo));
+		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+			 || vect_epilogues));
   /* Epilog loop must be executed if the number of iterations for epilog
      loop is known at compile time, otherwise we need to add a check at
      the end of vector loop and skip to the end of epilog loop.  */
@@ -2586,6 +2596,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	}
       /* Peel epilog and put it on exit edge of loop.  */
       epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e);
+
       if (!epilog)
 	{
 	  dump_printf_loc (MSG_MISSED_OPTIMIZATION, loop_loc,
@@ -2966,9 +2977,7 @@ vect_create_cond_for_alias_checks (loop_vec_info loop_vinfo, tree * cond_expr)
    *COND_EXPR_STMT_LIST.  */
 
 class loop *
-vect_loop_versioning (loop_vec_info loop_vinfo,
-		      unsigned int th, bool check_profitability,
-		      poly_uint64 versioning_threshold)
+vect_loop_versioning (loop_vec_info loop_vinfo)
 {
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo), *nloop;
   class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
@@ -2988,10 +2997,15 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
   bool version_align = LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT (loop_vinfo);
   bool version_alias = LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo);
   bool version_niter = LOOP_REQUIRES_VERSIONING_FOR_NITERS (loop_vinfo);
+  poly_uint64 versioning_threshold
+    = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
   tree version_simd_if_cond
     = LOOP_REQUIRES_VERSIONING_FOR_SIMD_IF_COND (loop_vinfo);
+  unsigned th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
 
-  if (check_profitability)
+  if (th >= vect_vf_for_cost (loop_vinfo)
+      && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+      && !ordered_p (th, versioning_threshold))
     cond_expr = fold_build2 (GE_EXPR, boolean_type_node, scalar_loop_iters,
 			     build_int_cst (TREE_TYPE (scalar_loop_iters),
 					    th - 1));
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index b0cbbac0cb5ba1ffce706715d3dbb9139063803d..3c355eccc5bef6668456fddf485b4996f2d2fb38 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -885,6 +885,8 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
 	    }
 	}
     }
+
+  epilogue_vinfos.create (6);
 }
 
 /* Free all levels of MASKS.  */
@@ -960,6 +962,7 @@ _loop_vec_info::~_loop_vec_info ()
   release_vec_loop_masks (&masks);
   delete ivexpr_map;
   delete scan_map;
+  epilogue_vinfos.release ();
 
   loop->aux = NULL;
 }
@@ -1726,7 +1729,13 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
       return 0;
     }
 
-  HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop);
+  HOST_WIDE_INT estimated_niter = -1;
+
+  if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+    estimated_niter
+      = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1;
+  if (estimated_niter == -1)
+    estimated_niter = estimated_stmt_executions_int (loop);
   if (estimated_niter == -1)
     estimated_niter = likely_max_stmt_executions_int (loop);
   if (estimated_niter != -1
@@ -1864,6 +1873,7 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
   int res;
   unsigned int max_vf = MAX_VECTORIZATION_FACTOR;
   poly_uint64 min_vf = 2;
+  loop_vec_info orig_loop_vinfo = NULL;
 
   /* The first group of checks is independent of the vector size.  */
   fatal = true;
@@ -2183,9 +2193,12 @@ start_over:
      enough for both peeled prolog loop and vector loop.  This check
      can be merged along with threshold check of loop versioning, so
      increase threshold for this case if necessary.  */
-  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
+  if (LOOP_REQUIRES_VERSIONING (loop_vinfo)
+      || ((orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+	  && LOOP_REQUIRES_VERSIONING (orig_loop_vinfo)))
     {
       poly_uint64 niters_th = 0;
+      unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
 
       if (!vect_use_loop_mask_for_alignment_p (loop_vinfo))
 	{
@@ -2206,6 +2219,14 @@ start_over:
       /* One additional iteration because of peeling for gap.  */
       if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
 	niters_th += 1;
+
+      /*  Use the same condition as vect_transform_loop to decide when to use
+	  the cost to determine a versioning threshold.  */
+      if (th >= vect_vf_for_cost (loop_vinfo)
+	  && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+	  && ordered_p (th, niters_th))
+	niters_th = ordered_max (poly_uint64 (th), niters_th);
+
       LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo) = niters_th;
     }
 
@@ -2329,14 +2350,8 @@ again:
    be vectorized.  */
 opt_loop_vec_info
 vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
-		   vec_info_shared *shared)
+		   vec_info_shared *shared, vector_sizes vector_sizes)
 {
-  auto_vector_sizes vector_sizes;
-
-  /* Autodetect first vector size we try.  */
-  current_vector_size = 0;
-  targetm.vectorize.autovectorize_vector_sizes (&vector_sizes,
-						loop->simdlen != 0);
   unsigned int next_size = 0;
 
   DUMP_VECT_SCOPE ("analyze_loop_nest");
@@ -2357,6 +2372,9 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
   poly_uint64 autodetected_vector_size = 0;
   opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL);
   poly_uint64 first_vector_size = 0;
+  poly_uint64 lowest_th = 0;
+  unsigned vectorized_loops = 0;
+  bool vect_epilogues = !loop->simdlen && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK);
   while (1)
     {
       /* Check the CFG characteristics of the loop (nesting, entry/exit).  */
@@ -2375,24 +2393,53 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
 
       if (orig_loop_vinfo)
 	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo;
+      else if (vect_epilogues && first_loop_vinfo)
+	{
+	  LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
+	}
 
       opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts);
       if (res)
 	{
 	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
+	  vectorized_loops++;
 
-	  if (loop->simdlen
-	      && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
-			   (unsigned HOST_WIDE_INT) loop->simdlen))
+	  if ((loop->simdlen
+	       && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+			    (unsigned HOST_WIDE_INT) loop->simdlen))
+	      || vect_epilogues)
 	    {
 	      if (first_loop_vinfo == NULL)
 		{
 		  first_loop_vinfo = loop_vinfo;
+		  lowest_th
+		    = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo);
 		  first_vector_size = current_vector_size;
 		  loop->aux = NULL;
 		}
 	      else
-		delete loop_vinfo;
+		{
+		  /* Keep track of vector sizes that we know we can vectorize
+		     the epilogue with.  */
+		  if (vect_epilogues)
+		    {
+		      loop->epilogue_vsizes.reserve (1);
+		      loop->epilogue_vsizes.quick_push (current_vector_size);
+		      first_loop_vinfo->epilogue_vinfos.reserve (1);
+		      first_loop_vinfo->epilogue_vinfos.quick_push (loop_vinfo);
+		      LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
+		      poly_uint64 th
+			= LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
+		      gcc_assert (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+				  || maybe_ne (lowest_th, 0U));
+		      /* Keep track of the known smallest versioning threshold.
+		       */
+		      if (ordered_p (lowest_th, th))
+			lowest_th = ordered_min (lowest_th, th);
+		    }
+		  else
+		    delete loop_vinfo;
+		}
 	    }
 	  else
 	    {
@@ -2430,6 +2477,8 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
 		  dump_dec (MSG_NOTE, current_vector_size);
 		  dump_printf (MSG_NOTE, "\n");
 		}
+	      LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo) = lowest_th;
+
 	      return first_loop_vinfo;
 	    }
 	  else
@@ -8483,6 +8532,7 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   gimple *stmt;
   bool check_profitability = false;
   unsigned int th;
+  auto_vec<gimple *> orig_stmts;
 
   DUMP_VECT_SCOPE ("vec_transform_loop");
 
@@ -8497,11 +8547,11 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   if (th >= vect_vf_for_cost (loop_vinfo)
       && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_NOTE, vect_location,
-			 "Profitability threshold is %d loop iterations.\n",
-                         th);
-      check_profitability = true;
+	if (dump_enabled_p ())
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "Profitability threshold is %d loop iterations.\n",
+			   th);
+	check_profitability = true;
     }
 
   /* Make sure there exists a single-predecessor exit bb.  Do this before 
@@ -8519,18 +8569,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
 
   if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
     {
-      poly_uint64 versioning_threshold
-	= LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
-      if (check_profitability
-	  && ordered_p (poly_uint64 (th), versioning_threshold))
-	{
-	  versioning_threshold = ordered_max (poly_uint64 (th),
-					      versioning_threshold);
-	  check_profitability = false;
-	}
       class loop *sloop
-	= vect_loop_versioning (loop_vinfo, th, check_profitability,
-				versioning_threshold);
+	= vect_loop_versioning (loop_vinfo);
       sloop->force_vectorize = false;
       check_profitability = false;
     }
@@ -8558,6 +8598,66 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector,
 			      &step_vector, &niters_vector_mult_vf, th,
 			      check_profitability, niters_no_overflow);
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    epilogue = NULL;
+
+  if (loop_vinfo->epilogue_vinfos.length () == 0)
+    epilogue = NULL;
+
+  /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work
+     on niters already ajusted for the iterations of the prologue.  */
+  if (epilogue && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+      && known_eq (vf, lowest_vf))
+    {
+      vector_sizes vector_sizes = loop->epilogue_vsizes;
+      unsigned int next_size = 0;
+      unsigned HOST_WIDE_INT eiters
+	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
+	   - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
+      eiters
+	= eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
+      epilogue->nb_iterations_upper_bound = eiters - 1;
+      epilogue->any_upper_bound = true;
+
+      unsigned int ratio;
+      while (next_size < vector_sizes.length ()
+	     && !(constant_multiple_p (current_vector_size,
+				       vector_sizes[next_size], &ratio)
+		  && eiters >= lowest_vf / ratio))
+	next_size += 1;
+
+      if (next_size == vector_sizes.length ())
+	epilogue = NULL;
+    }
+
+  if (epilogue)
+    {
+      loop_vec_info epilogue_vinfo = loop_vinfo->epilogue_vinfos[0];
+      loop_vinfo->epilogue_vinfos.ordered_remove (0);
+      epilogue->aux = epilogue_vinfo;
+      LOOP_VINFO_LOOP (epilogue_vinfo) = epilogue;
+      epilogue->simduid = loop->simduid;
+
+      epilogue->force_vectorize = loop->force_vectorize;
+      epilogue->safelen = loop->safelen;
+      epilogue->dont_vectorize = false;
+
+      /* update stmts in stmt_vec_info for epilog */
+      gimple_stmt_iterator gsi;
+      gphi_iterator phi_gsi;
+      basic_block *bbs = get_loop_body (loop);
+
+      for (unsigned i = 0; i < loop->num_nodes; ++i)
+	{
+	  for (phi_gsi = gsi_start_phis (bbs[i]); !gsi_end_p (phi_gsi);
+	       gsi_next (&phi_gsi))
+	    orig_stmts.safe_push (phi_gsi.phi ());
+
+	  for (gsi = gsi_start_bb (bbs[i]); !gsi_end_p (gsi); gsi_next (&gsi))
+	    orig_stmts.safe_push (gsi_stmt (gsi));
+	}
+    }
+
   if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo)
       && LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo).initialized_p ())
     scale_loop_frequencies (LOOP_VINFO_SCALAR_LOOP (loop_vinfo),
@@ -8814,57 +8914,86 @@ vect_transform_loop (loop_vec_info loop_vinfo)
      since vectorized loop can have loop-carried dependencies.  */
   loop->safelen = 0;
 
-  /* Don't vectorize epilogue for epilogue.  */
-  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
-    epilogue = NULL;
-
-  if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK))
-    epilogue = NULL;
 
   if (epilogue)
     {
-      auto_vector_sizes vector_sizes;
-      targetm.vectorize.autovectorize_vector_sizes (&vector_sizes, false);
-      unsigned int next_size = 0;
+      if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
+	tree_if_conversion (epilogue);
 
-      /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work
-         on niters already ajusted for the iterations of the prologue.  */
-      if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
-	  && known_eq (vf, lowest_vf))
+      loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue);
+      hash_map<tree,tree> mapping;
+      auto_vec<stmt_vec_info> worklist;
+      basic_block *bbs = get_loop_body (epilogue);
+      gimple_stmt_iterator gsi;
+      gphi_iterator phi_gsi;
+      gimple * orig_stmt, * new_stmt;
+      stmt_vec_info stmt_vinfo = NULL;
+
+      LOOP_VINFO_BBS (epilogue_vinfo) = bbs;
+      for (unsigned i = 0; i < epilogue->num_nodes; ++i)
 	{
-	  unsigned HOST_WIDE_INT eiters
-	    = (LOOP_VINFO_INT_NITERS (loop_vinfo)
-	       - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
-	  eiters
-	    = eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
-	  epilogue->nb_iterations_upper_bound = eiters - 1;
-	  epilogue->any_upper_bound = true;
-
-	  unsigned int ratio;
-	  while (next_size < vector_sizes.length ()
-		 && !(constant_multiple_p (current_vector_size,
-					   vector_sizes[next_size], &ratio)
-		      && eiters >= lowest_vf / ratio))
-	    next_size += 1;
+	  for (phi_gsi = gsi_start_phis (bbs[i]); !gsi_end_p (phi_gsi);
+	       gsi_next (&phi_gsi))
+	    {
+	      gcc_assert (orig_stmts.length () > 0);
+	      orig_stmt = orig_stmts[0];
+	      orig_stmts.ordered_remove (0);
+	      new_stmt = phi_gsi.phi ();
+
+	      stmt_vinfo
+		= epilogue_vinfo->lookup_stmt (orig_stmt);
+
+	      STMT_VINFO_STMT (stmt_vinfo) = new_stmt;
+	      gimple_set_uid (new_stmt, gimple_uid (orig_stmt));
+
+	      mapping.put (gimple_phi_result (orig_stmt),
+			    gimple_phi_result (new_stmt));
+
+	      if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
+		worklist.safe_push (stmt_vinfo);
+	    }
+
+	  for (gsi = gsi_start_bb (bbs[i]); !gsi_end_p (gsi); gsi_next (&gsi))
+	    {
+	      gcc_assert (orig_stmts.length () > 0);
+	      orig_stmt = orig_stmts[0];
+	      orig_stmts.ordered_remove (0);
+	      new_stmt = gsi_stmt (gsi);
+
+	      stmt_vinfo
+		= epilogue_vinfo->lookup_stmt (orig_stmt);
+
+	      STMT_VINFO_STMT (stmt_vinfo) = new_stmt;
+	      gimple_set_uid (new_stmt, gimple_uid (orig_stmt));
+
+	      if (is_gimple_assign (orig_stmt))
+		{
+		  gcc_assert (is_gimple_assign (new_stmt));
+		  mapping.put (gimple_assign_lhs (orig_stmt),
+			      gimple_assign_lhs (new_stmt));
+		}
+
+	      if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
+		worklist.safe_push (stmt_vinfo);
+	    }
+	  gcc_assert (orig_stmts.is_empty ());
 	}
-      else
-	while (next_size < vector_sizes.length ()
-	       && maybe_lt (current_vector_size, vector_sizes[next_size]))
-	  next_size += 1;
 
-      if (next_size == vector_sizes.length ())
-	epilogue = NULL;
-    }
+      for (unsigned i = 0; i < worklist.length (); ++i)
+	{
+	  tree *new_t;
+	  gimple_seq seq = STMT_VINFO_PATTERN_DEF_SEQ (worklist[i]);
 
-  if (epilogue)
-    {
-      epilogue->force_vectorize = loop->force_vectorize;
-      epilogue->safelen = loop->safelen;
-      epilogue->dont_vectorize = false;
+	  while (seq)
+	    {
+	      for (unsigned j = 1; j < gimple_num_ops (seq); ++j)
+		if ((new_t = mapping.get(gimple_op (seq, j))))
+		  gimple_set_op (seq, j, *new_t);
+	      seq = seq->next;
+	    }
+	}
 
-      /* We may need to if-convert epilogue to vectorize it.  */
-      if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
-	tree_if_conversion (epilogue);
+      vect_analyze_scalar_cycles (epilogue_vinfo);
     }
 
   return epilogue;
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 1456cde4c2c2dec7244c504d2c496248894a4f1e..6e453d39190b362b6d5a01bc2167e10a617f91f9 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -564,6 +564,8 @@ public:
      this points to the original vectorized loop.  Otherwise NULL.  */
   _loop_vec_info *orig_loop_info;
 
+  vec<_loop_vec_info *> epilogue_vinfos;
+
 } *loop_vec_info;
 
 /* Access Functions.  */
@@ -1480,10 +1482,9 @@ extern void vect_set_loop_condition (class loop *, loop_vec_info,
 extern bool slpeel_can_duplicate_loop_p (const class loop *, const_edge);
 class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
 						     class loop *, edge);
-class loop *vect_loop_versioning (loop_vec_info, unsigned int, bool,
-				   poly_uint64);
+class loop *vect_loop_versioning (loop_vec_info);
 extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
-				     tree *, tree *, tree *, int, bool, bool);
+				    tree *, tree *, tree *, int, bool, bool);
 extern void vect_prepare_for_masked_peels (loop_vec_info);
 extern dump_user_location_t find_loop_location (class loop *);
 extern bool vect_can_advance_ivs_p (loop_vec_info);
@@ -1610,7 +1611,8 @@ extern bool check_reduction_path (dump_user_location_t, loop_p, gphi *, tree,
 /* Drive for loop analysis stage.  */
 extern opt_loop_vec_info vect_analyze_loop (class loop *,
 					    loop_vec_info,
-					    vec_info_shared *);
+					    vec_info_shared *,
+					    vector_sizes);
 extern tree vect_build_loop_niters (loop_vec_info, bool * = NULL);
 extern void vect_gen_vector_loop_niters (loop_vec_info, tree, tree *,
 					 tree *, bool);
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 173e6b51652fd023893b38da786ff28f827553b5..71bbf4fdf8dc7588c45a0e8feef9272b52c0c04c 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -875,6 +875,10 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
   vec_info_shared shared;
   auto_purge_vect_location sentinel;
   vect_location = find_loop_location (loop);
+  auto_vector_sizes auto_vector_sizes;
+  vector_sizes vector_sizes;
+  bool assert_versioning = false;
+
   if (LOCATION_LOCUS (vect_location.get_location_t ()) != UNKNOWN_LOCATION
       && dump_enabled_p ())
     dump_printf (MSG_NOTE | MSG_PRIORITY_INTERNALS,
@@ -882,10 +886,35 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 		 LOCATION_FILE (vect_location.get_location_t ()),
 		 LOCATION_LINE (vect_location.get_location_t ()));
 
+  /* If this is an epilogue, we already know what vector sizes we will use for
+     vectorization as the analyzis was part of the main vectorized loop.  Use
+     these instead of going through all vector sizes again.  */
+  if (orig_loop_vinfo
+      && !LOOP_VINFO_LOOP (orig_loop_vinfo)->epilogue_vsizes.is_empty ())
+    {
+      vector_sizes = LOOP_VINFO_LOOP (orig_loop_vinfo)->epilogue_vsizes;
+      assert_versioning = LOOP_REQUIRES_VERSIONING (orig_loop_vinfo);
+      current_vector_size = vector_sizes[0];
+    }
+  else
+    {
+      /* Autodetect first vector size we try.  */
+      current_vector_size = 0;
+
+      targetm.vectorize.autovectorize_vector_sizes (&auto_vector_sizes,
+						    loop->simdlen != 0);
+      vector_sizes = auto_vector_sizes;
+    }
+
   /* Try to analyze the loop, retaining an opt_problem if dump_enabled_p.  */
-  opt_loop_vec_info loop_vinfo
-    = vect_analyze_loop (loop, orig_loop_vinfo, &shared);
-  loop->aux = loop_vinfo;
+  opt_loop_vec_info loop_vinfo = opt_loop_vec_info::success (NULL);
+  if (loop_vec_info_for_loop (loop))
+    loop_vinfo = opt_loop_vec_info::success (loop_vec_info_for_loop (loop));
+  else
+    {
+      loop_vinfo = vect_analyze_loop (loop, orig_loop_vinfo, &shared, vector_sizes);
+      loop->aux = loop_vinfo;
+    }
 
   if (!loop_vinfo)
     if (dump_enabled_p ())
@@ -898,6 +927,10 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 
   if (!loop_vinfo || !LOOP_VINFO_VECTORIZABLE_P (loop_vinfo))
     {
+      /* If this loops requires versioning, make sure the analyzis done on the
+	 epilogue loops succeeds.  */
+      gcc_assert (!assert_versioning);
+
       /* Free existing information if loop is analyzed with some
 	 assumptions.  */
       if (loop_constraint_set_p (loop, LOOP_C_FINITE))
@@ -1013,8 +1046,13 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 
   /* Epilogue of vectorized loop must be vectorized too.  */
   if (new_loop)
-    ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops,
-				 new_loop, loop_vinfo, NULL, NULL);
+    {
+      /* Don't include vectorized epilogues in the "vectorized loops" count.
+       */
+      unsigned dont_count = *num_vectorized_loops;
+      ret |= try_vectorize_loop_1 (simduid_to_vf_htab, &dont_count,
+				   new_loop, loop_vinfo, NULL, NULL);
+    }
 
   return ret;
 }
Andre Simoes Dias Vieira Oct. 8, 2019, 1:16 p.m. | #4
Hi Richard,

As I mentioned in the IRC channel, I managed to get "most" of the 
regression testsuite working for x86_64 (avx512) and aarch64.

On x86_64 I get a failure that I can't explain, was hoping you might be 
able to have a look with me:
"PASS->FAIL: gcc.target/i386/vect-perm-odd-1.c execution test"

vect-perm-odd-1.exe segfaults and when I gdb it seems to be the first 
iteration of the main loop.  The tree dumps look alright, but I do 
notice the stack usage seems to change between --param 
vect-epilogue-nomask={0,1}.

Am I missing to update some field that may later lead to the amount of 
stack being used? I am confused, it could very well be that I am missing 
something obvious, I am not too familiar with x86's ISA. I will try to 
investigate further.

This patch needs further clean-up and more comments (or comment 
updates), but I thought I'd share current state to see if you can help 
me unblock.

Cheers,
Andre
diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 0b0154ffd7bf031a005de993b101d9db6dd98c43..d01512ea46467f1cf77793bdc75b48e71b0b9641 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -21,6 +21,7 @@ along with GCC; see the file COPYING3.  If not see
 #define GCC_CFGLOOP_H
 
 #include "cfgloopmanip.h"
+#include "target.h"
 
 /* Structure to hold decision about unrolling/peeling.  */
 enum lpt_dec
@@ -268,6 +269,9 @@ public:
      the basic-block from being collected but its index can still be
      reused.  */
   basic_block former_header;
+
+  /* Keep track of vector sizes we know we can vectorize the epilogue with.  */
+  vector_sizes epilogue_vsizes;
 };
 
 /* Set if the loop is known to be infinite.  */
diff --git a/gcc/cfgloop.c b/gcc/cfgloop.c
index 4ad1f658708f83dbd8789666c26d4bd056837bc6..f3e81bcd00b3f125389aa15b12dc5201b3578d20 100644
--- a/gcc/cfgloop.c
+++ b/gcc/cfgloop.c
@@ -198,6 +198,7 @@ flow_loop_free (class loop *loop)
       exit->prev = exit;
     }
 
+  loop->epilogue_vsizes.release();
   ggc_free (loop->exits);
   ggc_free (loop);
 }
@@ -355,6 +356,7 @@ alloc_loop (void)
   loop->nb_iterations_upper_bound = 0;
   loop->nb_iterations_likely_upper_bound = 0;
   loop->nb_iterations_estimate = 0;
+  loop->epilogue_vsizes.create(8);
   return loop;
 }
 
diff --git a/gcc/gengtype.c b/gcc/gengtype.c
index 53317337cf8c8e8caefd6b819d28b3bba301e755..80fb6ef71465b24e034fa45d69fec56be6b2e7f8 100644
--- a/gcc/gengtype.c
+++ b/gcc/gengtype.c
@@ -5197,6 +5197,7 @@ main (int argc, char **argv)
       POS_HERE (do_scalar_typedef ("widest_int", &pos));
       POS_HERE (do_scalar_typedef ("int64_t", &pos));
       POS_HERE (do_scalar_typedef ("poly_int64", &pos));
+      POS_HERE (do_scalar_typedef ("poly_uint64", &pos));
       POS_HERE (do_scalar_typedef ("uint64_t", &pos));
       POS_HERE (do_scalar_typedef ("uint8", &pos));
       POS_HERE (do_scalar_typedef ("uintptr_t", &pos));
@@ -5206,6 +5207,7 @@ main (int argc, char **argv)
       POS_HERE (do_scalar_typedef ("machine_mode", &pos));
       POS_HERE (do_scalar_typedef ("fixed_size_mode", &pos));
       POS_HERE (do_scalar_typedef ("CONSTEXPR", &pos));
+      POS_HERE (do_scalar_typedef ("vector_sizes", &pos));
       POS_HERE (do_typedef ("PTR", 
 			    create_pointer (resolve_typedef ("void", &pos)),
 			    &pos));
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 5c25441c70a271f04730486e513437fffa75b7e3..189f7458b1b20be06a9a20d3ee05e74bc176434c 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -26,6 +26,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree.h"
 #include "gimple.h"
 #include "cfghooks.h"
+#include "tree-if-conv.h"
 #include "tree-pass.h"
 #include "ssa.h"
 #include "fold-const.h"
@@ -1724,7 +1725,7 @@ vect_update_init_of_dr (struct data_reference *dr, tree niters, tree_code code)
    Apply vect_update_inits_of_dr to all accesses in LOOP_VINFO.
    CODE and NITERS are as for vect_update_inits_of_dr.  */
 
-static void
+void
 vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
 			  tree_code code)
 {
@@ -1736,19 +1737,7 @@ vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
 
   /* Adjust niters to sizetype and insert stmts on loop preheader edge.  */
   if (!types_compatible_p (sizetype, TREE_TYPE (niters)))
-    {
-      gimple_seq seq;
-      edge pe = loop_preheader_edge (LOOP_VINFO_LOOP (loop_vinfo));
-      tree var = create_tmp_var (sizetype, "prolog_loop_adjusted_niters");
-
-      niters = fold_convert (sizetype, niters);
-      niters = force_gimple_operand (niters, &seq, false, var);
-      if (seq)
-	{
-	  basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe, seq);
-	  gcc_assert (!new_bb);
-	}
-    }
+    niters = fold_convert (sizetype, niters);
 
   FOR_EACH_VEC_ELT (datarefs, i, dr)
     {
@@ -2401,14 +2390,18 @@ class loop *
 vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 		 tree *niters_vector, tree *step_vector,
 		 tree *niters_vector_mult_vf_var, int th,
-		 bool check_profitability, bool niters_no_overflow)
+		 bool check_profitability, bool niters_no_overflow,
+		 tree *advance)
 {
   edge e, guard_e;
-  tree type = TREE_TYPE (niters), guard_cond;
+  tree type = TREE_TYPE (niters), guard_cond, advance_guard = NULL;
   basic_block guard_bb, guard_to;
   profile_probability prob_prolog, prob_vector, prob_epilog;
   int estimated_vf;
   int prolog_peeling = 0;
+  bool vect_epilogues
+    = loop_vinfo->epilogue_vinfos.length () > 0
+    && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
   /* We currently do not support prolog peeling if the target alignment is not
      known at compile time.  'vect_gen_prolog_loop_niters' depends on the
      target alignment being constant.  */
@@ -2466,15 +2459,61 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   else
     niters_prolog = build_int_cst (type, 0);
 
+  loop_vec_info epilogue_vinfo = NULL;
+  if (vect_epilogues)
+    {
+      epilogue_vinfo = loop_vinfo->epilogue_vinfos[0];
+      loop_vinfo->epilogue_vinfos.ordered_remove (0);
+
+      /* Don't vectorize epilogues if not most inner loop or if you may need to
+	 peel the epilogue loop for alignment.  */
+      if (loop->inner != NULL
+	  || LOOP_VINFO_PEELING_FOR_ALIGNMENT (epilogue_vinfo))
+	vect_epilogues = false;
+
+    }
+
+  unsigned int lowest_vf = constant_lower_bound (vf);
+  bool epilogue_any_upper_bound = false;
+  unsigned HOST_WIDE_INT eiters = 0;
+  tree niters_vector_mult_vf;
+
+  /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work
+     on niters already ajusted for the iterations of the prologue.  */
+  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+      && known_eq (vf, lowest_vf))
+    {
+      vector_sizes vector_sizes = loop->epilogue_vsizes;
+      unsigned next_size = 0;
+      eiters = (LOOP_VINFO_INT_NITERS (loop_vinfo)
+	   - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
+
+      if (prolog_peeling > 0)
+	eiters -= prolog_peeling;
+      eiters
+	= eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
+      epilogue_any_upper_bound = true;
+
+      unsigned int ratio;
+      while (next_size < vector_sizes.length ()
+	     && !(constant_multiple_p (current_vector_size,
+				       vector_sizes[next_size], &ratio)
+		  && eiters >= lowest_vf / ratio))
+	next_size += 1;
+
+      if (next_size == vector_sizes.length ())
+	vect_epilogues = false;
+    }
+
   /* Prolog loop may be skipped.  */
   bool skip_prolog = (prolog_peeling != 0);
   /* Skip to epilog if scalar loop may be preferred.  It's only needed
-     when we peel for epilog loop and when it hasn't been checked with
-     loop versioning.  */
+     when we peel for epilog loop or when we loop version.  */
   bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
 		      ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo),
 				  bound_prolog + bound_epilog)
-		      : !LOOP_REQUIRES_VERSIONING (loop_vinfo));
+		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+			 || vect_epilogues));
   /* Epilog loop must be executed if the number of iterations for epilog
      loop is known at compile time, otherwise we need to add a check at
      the end of vector loop and skip to the end of epilog loop.  */
@@ -2503,7 +2542,17 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
     }
 
   dump_user_location_t loop_loc = find_loop_location (loop);
-  class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
+  class loop *scalar_loop;
+  if (vect_epilogues)
+    {
+      scalar_loop = get_loop_copy (loop);
+      LOOP_VINFO_SCALAR_LOOP (epilogue_vinfo)
+	= LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
+      LOOP_VINFO_SCALAR_LOOP (loop_vinfo) = NULL;
+    }
+  else
+   scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
+
   if (prolog_peeling)
     {
       e = loop_preheader_edge (loop);
@@ -2586,12 +2635,24 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	}
       /* Peel epilog and put it on exit edge of loop.  */
       epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e);
+
       if (!epilog)
 	{
 	  dump_printf_loc (MSG_MISSED_OPTIMIZATION, loop_loc,
 			   "slpeel_tree_duplicate_loop_to_edge_cfg failed.\n");
 	  gcc_unreachable ();
 	}
+
+      if (epilogue_any_upper_bound && prolog_peeling >= 0)
+	{
+	  epilog->any_upper_bound = true;
+	  epilog->nb_iterations_upper_bound = eiters + 1;
+	}
+      else if (prolog_peeling < 0)
+	{
+	  epilog->any_upper_bound = false;
+	}
+
       epilog->force_vectorize = false;
       slpeel_update_phi_nodes_for_loops (loop_vinfo, loop, epilog, false);
 
@@ -2608,6 +2669,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 						check_profitability);
 	  /* Build guard against NITERSM1 since NITERS may overflow.  */
 	  guard_cond = fold_build2 (LT_EXPR, boolean_type_node, nitersm1, t);
+	  advance_guard = guard_cond;
 	  guard_bb = anchor;
 	  guard_to = split_edge (loop_preheader_edge (epilog));
 	  guard_e = slpeel_add_loop_guard (guard_bb, guard_cond,
@@ -2635,7 +2697,6 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	}
 
       basic_block bb_before_epilog = loop_preheader_edge (epilog)->src;
-      tree niters_vector_mult_vf;
       /* If loop is peeled for non-zero constant times, now niters refers to
 	 orig_niters - prolog_peeling, it won't overflow even the orig_niters
 	 overflows.  */
@@ -2699,10 +2760,105 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
       adjust_vec_debug_stmts ();
       scev_reset ();
     }
+
+  if (vect_epilogues)
+    {
+      epilog->aux = epilogue_vinfo;
+      LOOP_VINFO_LOOP (epilogue_vinfo) = epilog;
+
+      loop_constraint_clear (epilog, LOOP_C_INFINITE);
+
+      /* We now must calculate the number of iterations for our epilogue.  */
+      tree cond_niters, niters;
+
+      /* Depending on whether we peel for gaps we take niters or niters - 1,
+	 we will refer to this as N - G, where both N and G are the NITERS and
+	 GAP for the original loop.  */
+      niters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+	? LOOP_VINFO_NITERSM1 (loop_vinfo)
+	: LOOP_VINFO_NITERS (loop_vinfo);
+
+      /* Here we build a vector factorization mask:
+	 vf_mask = ~(VF - 1), where VF is the Vectorization Factor.  */
+      tree vf_mask = build_int_cst (TREE_TYPE (niters),
+				    LOOP_VINFO_VECT_FACTOR (loop_vinfo));
+      vf_mask = fold_build2 (MINUS_EXPR, TREE_TYPE (vf_mask),
+			     vf_mask,
+			     build_one_cst (TREE_TYPE (vf_mask)));
+      vf_mask = fold_build1 (BIT_NOT_EXPR, TREE_TYPE (niters), vf_mask);
+
+      /* Here we calculate:
+	 niters = N - ((N-G) & ~(VF -1)) */
+      niters = fold_build2 (MINUS_EXPR, TREE_TYPE (niters),
+			    LOOP_VINFO_NITERS (loop_vinfo),
+			    fold_build2 (BIT_AND_EXPR, TREE_TYPE (niters),
+					 niters,
+					 vf_mask));
+
+      if (skip_vector)
+	{
+	  /* We do this by constructing:
+	     cond_niters = !do_we_enter_main_loop ? N + niters_prolog : niters
+	     we add npeel, the number of peeled iterations for alignment, to N
+	     in case we don't enter the main loop, has these have already been
+	     subtracted from N (the number of iterations of the main loop).
+	     Since the prolog peeling is also skipped if we skip the
+	     vectorization we must add them back.  */
+	  cond_niters
+	    = fold_build3 (COND_EXPR, TREE_TYPE (niters),
+			   advance_guard,
+			   fold_build2 (PLUS_EXPR, TREE_TYPE (niters),
+					LOOP_VINFO_NITERS (loop_vinfo),
+					fold_convert (TREE_TYPE (niters),
+						      niters_prolog)),
+			   niters);
+	}
+      else
+	cond_niters = niters;
+
+      LOOP_VINFO_NITERS (epilogue_vinfo) = cond_niters;
+      LOOP_VINFO_NITERSM1 (epilogue_vinfo)
+	= fold_build2 (MINUS_EXPR, TREE_TYPE (cond_niters),
+		       cond_niters, build_one_cst (TREE_TYPE (cond_niters)));
+
+      /* We now calculate the amount of iterations we must advance our
+         epilogue's data references by.
+	 Make sure to use sizetype here as we might use a negative constant
+	 if the loop peels for alignment.  If the target is 64-bit this can go
+	 wrong if the computation is not done in sizetype.  */
+      *advance = fold_convert (sizetype, niters);
+
+      *advance = fold_build2 (MINUS_EXPR, TREE_TYPE (*advance),
+			      *advance,
+			      fold_convert (sizetype,
+					    LOOP_VINFO_NITERS (loop_vinfo)));
+      *advance = fold_build2 (MINUS_EXPR, TREE_TYPE (*advance),
+			      build_zero_cst (TREE_TYPE (*advance)),
+			      *advance);
+
+      if (skip_vector)
+	{
+	  *advance
+	    = fold_build3 (COND_EXPR, TREE_TYPE (*advance),
+			   advance_guard,
+			   fold_build2 (MINUS_EXPR, TREE_TYPE (*advance),
+					build_zero_cst (TREE_TYPE (*advance)),
+					fold_convert (TREE_TYPE (*advance),
+						      niters_prolog)),
+			   *advance);
+	}
+
+      /* Redo the peeling for niter analysis as the NITERs and need for
+	 alignment have been updated to take the main loop into
+	 account.  */
+      LOOP_VINFO_PEELING_FOR_NITER (epilogue_vinfo) = false;
+      determine_peel_for_niter (epilogue_vinfo);
+    }
+
   adjust_vec.release ();
   free_original_copy_tables ();
 
-  return epilog;
+  return vect_epilogues ? epilog : NULL;
 }
 
 /* Function vect_create_cond_for_niters_checks.
@@ -2966,9 +3122,7 @@ vect_create_cond_for_alias_checks (loop_vec_info loop_vinfo, tree * cond_expr)
    *COND_EXPR_STMT_LIST.  */
 
 class loop *
-vect_loop_versioning (loop_vec_info loop_vinfo,
-		      unsigned int th, bool check_profitability,
-		      poly_uint64 versioning_threshold)
+vect_loop_versioning (loop_vec_info loop_vinfo)
 {
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo), *nloop;
   class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
@@ -2988,10 +3142,15 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
   bool version_align = LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT (loop_vinfo);
   bool version_alias = LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo);
   bool version_niter = LOOP_REQUIRES_VERSIONING_FOR_NITERS (loop_vinfo);
+  poly_uint64 versioning_threshold
+    = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
   tree version_simd_if_cond
     = LOOP_REQUIRES_VERSIONING_FOR_SIMD_IF_COND (loop_vinfo);
+  unsigned th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
 
-  if (check_profitability)
+  if (th >= vect_vf_for_cost (loop_vinfo)
+      && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+      && !ordered_p (th, versioning_threshold))
     cond_expr = fold_build2 (GE_EXPR, boolean_type_node, scalar_loop_iters,
 			     build_int_cst (TREE_TYPE (scalar_loop_iters),
 					    th - 1));
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index b0cbbac0cb5ba1ffce706715d3dbb9139063803d..6dbde0fe35c29d0357cf5c6e7ab5599957a8242a 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -885,6 +885,8 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
 	    }
 	}
     }
+
+  epilogue_vinfos.create (6);
 }
 
 /* Free all levels of MASKS.  */
@@ -960,6 +962,7 @@ _loop_vec_info::~_loop_vec_info ()
   release_vec_loop_masks (&masks);
   delete ivexpr_map;
   delete scan_map;
+  epilogue_vinfos.release ();
 
   loop->aux = NULL;
 }
@@ -1726,7 +1729,13 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
       return 0;
     }
 
-  HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop);
+  HOST_WIDE_INT estimated_niter = -1;
+
+  if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+    estimated_niter
+      = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1;
+  if (estimated_niter == -1)
+    estimated_niter = estimated_stmt_executions_int (loop);
   if (estimated_niter == -1)
     estimated_niter = likely_max_stmt_executions_int (loop);
   if (estimated_niter != -1
@@ -1852,6 +1861,56 @@ vect_dissolve_slp_only_groups (loop_vec_info loop_vinfo)
     }
 }
 
+
+/* Decides whether we need to create an epilogue loop to handle
+   remaining scalar iterations and sets PEELING_FOR_NITERS accordingly.  */
+
+void
+determine_peel_for_niter (loop_vec_info loop_vinfo)
+{
+
+  unsigned HOST_WIDE_INT const_vf;
+  HOST_WIDE_INT max_niter
+    = likely_max_stmt_executions_int (LOOP_VINFO_LOOP (loop_vinfo));
+
+  unsigned th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
+  if (!th && LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+    th = LOOP_VINFO_COST_MODEL_THRESHOLD (LOOP_VINFO_ORIG_LOOP_INFO
+					  (loop_vinfo));
+
+  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    /* The main loop handles all iterations.  */
+    LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
+  else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+	   && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
+    {
+      /* Work out the (constant) number of iterations that need to be
+	 peeled for reasons other than niters.  */
+      unsigned int peel_niter = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
+      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
+	peel_niter += 1;
+      if (!multiple_p (LOOP_VINFO_INT_NITERS (loop_vinfo) - peel_niter,
+		       LOOP_VINFO_VECT_FACTOR (loop_vinfo)))
+	LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = true;
+    }
+  else if (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
+	   /* ??? When peeling for gaps but not alignment, we could
+	      try to check whether the (variable) niters is known to be
+	      VF * N + 1.  That's something of a niche case though.  */
+	   || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+	   || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant (&const_vf)
+	   || ((tree_ctz (LOOP_VINFO_NITERS (loop_vinfo))
+		< (unsigned) exact_log2 (const_vf))
+	       /* In case of versioning, check if the maximum number of
+		  iterations is greater than th.  If they are identical,
+		  the epilogue is unnecessary.  */
+	       && (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+		   || ((unsigned HOST_WIDE_INT) max_niter
+		       > (th / const_vf) * const_vf))))
+    LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = true;
+}
+
+
 /* Function vect_analyze_loop_2.
 
    Apply a set of analyses on LOOP, and create a loop_vec_info struct
@@ -1864,6 +1923,7 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
   int res;
   unsigned int max_vf = MAX_VECTORIZATION_FACTOR;
   poly_uint64 min_vf = 2;
+  loop_vec_info orig_loop_vinfo = NULL;
 
   /* The first group of checks is independent of the vector size.  */
   fatal = true;
@@ -1979,7 +2039,6 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
   vect_compute_single_scalar_iteration_cost (loop_vinfo);
 
   poly_uint64 saved_vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
-  unsigned th;
 
   /* Check the SLP opportunities in the loop, analyze and build SLP trees.  */
   ok = vect_analyze_slp (loop_vinfo, *n_stmts);
@@ -2019,9 +2078,6 @@ start_over:
 		   LOOP_VINFO_INT_NITERS (loop_vinfo));
     }
 
-  HOST_WIDE_INT max_niter
-    = likely_max_stmt_executions_int (LOOP_VINFO_LOOP (loop_vinfo));
-
   /* Analyze the alignment of the data-refs in the loop.
      Fail if a data reference is found that cannot be vectorized.  */
 
@@ -2125,42 +2181,7 @@ start_over:
     return opt_result::failure_at (vect_location,
 				   "Loop costings not worthwhile.\n");
 
-  /* Decide whether we need to create an epilogue loop to handle
-     remaining scalar iterations.  */
-  th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
-
-  unsigned HOST_WIDE_INT const_vf;
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
-    /* The main loop handles all iterations.  */
-    LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
-  else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
-	   && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
-    {
-      /* Work out the (constant) number of iterations that need to be
-	 peeled for reasons other than niters.  */
-      unsigned int peel_niter = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
-      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
-	peel_niter += 1;
-      if (!multiple_p (LOOP_VINFO_INT_NITERS (loop_vinfo) - peel_niter,
-		       LOOP_VINFO_VECT_FACTOR (loop_vinfo)))
-	LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = true;
-    }
-  else if (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
-	   /* ??? When peeling for gaps but not alignment, we could
-	      try to check whether the (variable) niters is known to be
-	      VF * N + 1.  That's something of a niche case though.  */
-	   || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
-	   || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant (&const_vf)
-	   || ((tree_ctz (LOOP_VINFO_NITERS (loop_vinfo))
-		< (unsigned) exact_log2 (const_vf))
-	       /* In case of versioning, check if the maximum number of
-		  iterations is greater than th.  If they are identical,
-		  the epilogue is unnecessary.  */
-	       && (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
-		   || ((unsigned HOST_WIDE_INT) max_niter
-		       > (th / const_vf) * const_vf))))
-    LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = true;
-
+  determine_peel_for_niter (loop_vinfo);
   /* If an epilogue loop is required make sure we can create one.  */
   if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
       || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
@@ -2183,9 +2204,12 @@ start_over:
      enough for both peeled prolog loop and vector loop.  This check
      can be merged along with threshold check of loop versioning, so
      increase threshold for this case if necessary.  */
-  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
+  if (LOOP_REQUIRES_VERSIONING (loop_vinfo)
+      || ((orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+	  && LOOP_REQUIRES_VERSIONING (orig_loop_vinfo)))
     {
       poly_uint64 niters_th = 0;
+      unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
 
       if (!vect_use_loop_mask_for_alignment_p (loop_vinfo))
 	{
@@ -2206,6 +2230,14 @@ start_over:
       /* One additional iteration because of peeling for gap.  */
       if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
 	niters_th += 1;
+
+      /*  Use the same condition as vect_transform_loop to decide when to use
+	  the cost to determine a versioning threshold.  */
+      if (th >= vect_vf_for_cost (loop_vinfo)
+	  && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+	  && ordered_p (th, niters_th))
+	niters_th = ordered_max (poly_uint64 (th), niters_th);
+
       LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo) = niters_th;
     }
 
@@ -2329,14 +2361,8 @@ again:
    be vectorized.  */
 opt_loop_vec_info
 vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
-		   vec_info_shared *shared)
+		   vec_info_shared *shared, vector_sizes vector_sizes)
 {
-  auto_vector_sizes vector_sizes;
-
-  /* Autodetect first vector size we try.  */
-  current_vector_size = 0;
-  targetm.vectorize.autovectorize_vector_sizes (&vector_sizes,
-						loop->simdlen != 0);
   unsigned int next_size = 0;
 
   DUMP_VECT_SCOPE ("analyze_loop_nest");
@@ -2357,6 +2383,9 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
   poly_uint64 autodetected_vector_size = 0;
   opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL);
   poly_uint64 first_vector_size = 0;
+  poly_uint64 lowest_th = 0;
+  unsigned vectorized_loops = 0;
+  bool vect_epilogues = !loop->simdlen && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK);
   while (1)
     {
       /* Check the CFG characteristics of the loop (nesting, entry/exit).  */
@@ -2375,24 +2404,54 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
 
       if (orig_loop_vinfo)
 	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo;
+      else if (vect_epilogues && first_loop_vinfo)
+	{
+	  LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
+	}
 
       opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts);
       if (res)
 	{
 	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
+	  vectorized_loops++;
 
-	  if (loop->simdlen
-	      && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
-			   (unsigned HOST_WIDE_INT) loop->simdlen))
+	  if ((loop->simdlen
+	       && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+			    (unsigned HOST_WIDE_INT) loop->simdlen))
+	      || vect_epilogues)
 	    {
 	      if (first_loop_vinfo == NULL)
 		{
 		  first_loop_vinfo = loop_vinfo;
+		  lowest_th
+		    = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo);
 		  first_vector_size = current_vector_size;
 		  loop->aux = NULL;
 		}
 	      else
-		delete loop_vinfo;
+		{
+		  /* Keep track of vector sizes that we know we can vectorize
+		     the epilogue with.  */
+		  if (vect_epilogues)
+		    {
+		      loop->aux = NULL;
+		      loop->epilogue_vsizes.reserve (1);
+		      loop->epilogue_vsizes.quick_push (current_vector_size);
+		      first_loop_vinfo->epilogue_vinfos.reserve (1);
+		      first_loop_vinfo->epilogue_vinfos.quick_push (loop_vinfo);
+		      LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
+		      poly_uint64 th
+			= LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
+		      gcc_assert (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+				  || maybe_ne (lowest_th, 0U));
+		      /* Keep track of the known smallest versioning
+			 threshold.  */
+		      if (ordered_p (lowest_th, th))
+			lowest_th = ordered_min (lowest_th, th);
+		    }
+		  else
+		    delete loop_vinfo;
+		}
 	    }
 	  else
 	    {
@@ -2430,6 +2489,8 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
 		  dump_dec (MSG_NOTE, current_vector_size);
 		  dump_printf (MSG_NOTE, "\n");
 		}
+	      LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo) = lowest_th;
+
 	      return first_loop_vinfo;
 	    }
 	  else
@@ -8460,6 +8521,33 @@ vect_transform_loop_stmt (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
     *seen_store = stmt_info;
 }
 
+
+
+static tree
+replace_ops (tree op, hash_map<tree, tree> &mapping)
+{
+  if (!op)
+    return NULL;
+
+  tree *new_op;
+  tree ret = NULL;
+  for (int j = 0; j < TREE_OPERAND_LENGTH (op); ++j)
+    {
+      if ((new_op = mapping.get (TREE_OPERAND (op, j))))
+	{
+	  TREE_OPERAND (op, j) = *new_op;
+	  ret = *new_op;
+	}
+      else
+	ret = replace_ops (TREE_OPERAND (op, j), mapping);
+
+      if (ret)
+	return ret;
+    }
+
+  return NULL;
+}
+
 /* Function vect_transform_loop.
 
    The analysis phase has determined that the loop is vectorizable.
@@ -8483,6 +8571,9 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   gimple *stmt;
   bool check_profitability = false;
   unsigned int th;
+  auto_vec<gimple *> orig_stmts;
+  auto_vec<dr_vec_info *> gather_scatter_drs;
+  auto_vec<gimple *> gather_scatter_stmts;
 
   DUMP_VECT_SCOPE ("vec_transform_loop");
 
@@ -8497,11 +8588,11 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   if (th >= vect_vf_for_cost (loop_vinfo)
       && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_NOTE, vect_location,
-			 "Profitability threshold is %d loop iterations.\n",
-                         th);
-      check_profitability = true;
+	if (dump_enabled_p ())
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "Profitability threshold is %d loop iterations.\n",
+			   th);
+	check_profitability = true;
     }
 
   /* Make sure there exists a single-predecessor exit bb.  Do this before 
@@ -8519,18 +8610,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
 
   if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
     {
-      poly_uint64 versioning_threshold
-	= LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
-      if (check_profitability
-	  && ordered_p (poly_uint64 (th), versioning_threshold))
-	{
-	  versioning_threshold = ordered_max (poly_uint64 (th),
-					      versioning_threshold);
-	  check_profitability = false;
-	}
       class loop *sloop
-	= vect_loop_versioning (loop_vinfo, th, check_profitability,
-				versioning_threshold);
+	= vect_loop_versioning (loop_vinfo);
       sloop->force_vectorize = false;
       check_profitability = false;
     }
@@ -8555,9 +8636,58 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo) = niters;
   tree nitersm1 = unshare_expr (LOOP_VINFO_NITERSM1 (loop_vinfo));
   bool niters_no_overflow = loop_niters_no_overflow (loop_vinfo);
+  tree advance;
   epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector,
 			      &step_vector, &niters_vector_mult_vf, th,
-			      check_profitability, niters_no_overflow);
+			      check_profitability, niters_no_overflow,
+			      &advance);
+
+  if (epilogue)
+    {
+      basic_block *orig_bbs = get_loop_body (loop);
+      loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue);
+
+      gimple_stmt_iterator orig_gsi;
+      gphi_iterator orig_phi_gsi;
+      gimple *stmt;
+      stmt_vec_info stmt_vinfo;
+
+      /* The stmt_vec_info's of the epilogue were constructed for the main loop
+	 and need to be updated to refer to the cloned variables used in the
+	 epilogue loop.  We do this by assuming the original main loop and the
+	 epilogue loop are identical (aside the different SSA names).  This
+	 means we assume we can go through each BB in the loop and each STMT in
+	 each BB and map them 1:1, replacing the STMT_VINFO_STMT of each
+	 stmt_vec_info in the epilogue's loop_vec_info.  Here we only keep
+	 track of the original state of the main loop, before vectorization.
+	 After vectorization we proceed to update the epilogue's stmt_vec_infos
+	 information.  We also update the references in PATTERN_DEF_SEQ's,
+	 RELATED_STMT's and data_references.  Mainly the latter has to be
+	 updated after we are done vectorizing the main loop, as the
+	 data_references are shared between main and epilogue.  */
+      for (unsigned i = 0; i < loop->num_nodes; ++i)
+	{
+	  for (orig_phi_gsi = gsi_start_phis (orig_bbs[i]);
+	       !gsi_end_p (orig_phi_gsi); gsi_next (&orig_phi_gsi))
+	    orig_stmts.safe_push (orig_phi_gsi.phi ());
+	  for (orig_gsi = gsi_start_bb (orig_bbs[i]);
+	       !gsi_end_p (orig_gsi); gsi_next (&orig_gsi))
+	    {
+	      stmt = gsi_stmt (orig_gsi);
+	      orig_stmts.safe_push (stmt);
+	      stmt_vinfo  = epilogue_vinfo->lookup_stmt (stmt);
+	      /* Data references pointing to gather loads and scatter stores
+		 require special treatment because the address computation
+		 happens in a different gimple node, pointed to by DR_REF.  In
+		 contrast to normal loads and stores where we only need to
+		 update the offset of the data reference.  */
+	      if (stmt_vinfo
+		  && STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo))
+		gather_scatter_drs.safe_push (STMT_VINFO_DR_INFO (stmt_vinfo));
+	    }
+	}
+    }
+
   if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo)
       && LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo).initialized_p ())
     scale_loop_frequencies (LOOP_VINFO_SCALAR_LOOP (loop_vinfo),
@@ -8814,57 +8944,157 @@ vect_transform_loop (loop_vec_info loop_vinfo)
      since vectorized loop can have loop-carried dependencies.  */
   loop->safelen = 0;
 
-  /* Don't vectorize epilogue for epilogue.  */
-  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
-    epilogue = NULL;
-
-  if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK))
-    epilogue = NULL;
-
   if (epilogue)
     {
-      auto_vector_sizes vector_sizes;
-      targetm.vectorize.autovectorize_vector_sizes (&vector_sizes, false);
-      unsigned int next_size = 0;
 
-      /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work
-         on niters already ajusted for the iterations of the prologue.  */
-      if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
-	  && known_eq (vf, lowest_vf))
-	{
-	  unsigned HOST_WIDE_INT eiters
-	    = (LOOP_VINFO_INT_NITERS (loop_vinfo)
-	       - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
-	  eiters
-	    = eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
-	  epilogue->nb_iterations_upper_bound = eiters - 1;
-	  epilogue->any_upper_bound = true;
-
-	  unsigned int ratio;
-	  while (next_size < vector_sizes.length ()
-		 && !(constant_multiple_p (current_vector_size,
-					   vector_sizes[next_size], &ratio)
-		      && eiters >= lowest_vf / ratio))
-	    next_size += 1;
-	}
-      else
-	while (next_size < vector_sizes.length ()
-	       && maybe_lt (current_vector_size, vector_sizes[next_size]))
-	  next_size += 1;
+      loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue);
+      vect_update_inits_of_drs (epilogue_vinfo, advance, PLUS_EXPR);
 
-      if (next_size == vector_sizes.length ())
-	epilogue = NULL;
-    }
+      auto_vec<stmt_vec_info> pattern_worklist, related_worklist;
+      hash_map<tree,tree> mapping;
+      gimple * orig_stmt, * new_stmt;
+      gimple_stmt_iterator epilogue_gsi;
+      gphi_iterator epilogue_phi_gsi;
+      stmt_vec_info stmt_vinfo = NULL, related_vinfo;
+      basic_block *epilogue_bbs = get_loop_body (epilogue);
 
-  if (epilogue)
-    {
+      epilogue->simduid = loop->simduid;
       epilogue->force_vectorize = loop->force_vectorize;
       epilogue->safelen = loop->safelen;
       epilogue->dont_vectorize = false;
+      LOOP_VINFO_BBS (epilogue_vinfo) = epilogue_bbs;
+
+      /* We are done vectorizing the main loop, so now we update the epilogues
+	 stmt_vec_info's.  At the same time we set the gimple UID of each
+	 statement in the epilogue, as these are used to look them up in the
+	 epilogues loop_vec_info later.  We also keep track of what
+	 stmt_vec_info's have PATTERN_DEF_SEQ's and RELATED_STMT's that might
+	 need updating and we construct a mapping between variables defined in
+	 the main loop and their corresponding names in epilogue.  */
+      for (unsigned i = 0; i < loop->num_nodes; ++i)
+	{
+	  for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]);
+	       !gsi_end_p (epilogue_phi_gsi); gsi_next (&epilogue_phi_gsi))
+	    {
+	      orig_stmt = orig_stmts[0];
+	      orig_stmts.ordered_remove (0);
+	      new_stmt = epilogue_phi_gsi.phi ();
+
+	      stmt_vinfo
+		= epilogue_vinfo->lookup_stmt (orig_stmt);
+
+	      STMT_VINFO_STMT (stmt_vinfo) = new_stmt;
+	      gimple_set_uid (new_stmt, gimple_uid (orig_stmt));
+
+	      mapping.put (gimple_phi_result (orig_stmt),
+			    gimple_phi_result (new_stmt));
+
+	      if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
+		pattern_worklist.safe_push (stmt_vinfo);
+
+	      related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+	      while (related_vinfo && related_vinfo != stmt_vinfo)
+		{
+		  related_worklist.safe_push (related_vinfo);
+		  /* Set BB such that the assert in
+		    'get_initial_def_for_reduction' is able to determine that
+		    the BB of the related stmt is inside this loop.  */
+		  gimple_set_bb (STMT_VINFO_STMT (related_vinfo),
+				 gimple_bb (new_stmt));
+		  related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);
+		}
+	    }
+
+	  for (epilogue_gsi = gsi_start_bb (epilogue_bbs[i]);
+	       !gsi_end_p (epilogue_gsi); gsi_next (&epilogue_gsi))
+	    {
+	      orig_stmt = orig_stmts[0];
+	      orig_stmts.ordered_remove (0);
+	      new_stmt = gsi_stmt (epilogue_gsi);
+
+	      stmt_vinfo
+		= epilogue_vinfo->lookup_stmt (orig_stmt);
+
+	      STMT_VINFO_STMT (stmt_vinfo) = new_stmt;
+	      gimple_set_uid (new_stmt, gimple_uid (orig_stmt));
+
+	      if (is_gimple_assign (orig_stmt))
+		{
+		  gcc_assert (is_gimple_assign (new_stmt));
+		  mapping.put (gimple_assign_lhs (orig_stmt),
+			      gimple_assign_lhs (new_stmt));
+		}
+
+	      if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
+		pattern_worklist.safe_push (stmt_vinfo);
+
+	      related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+	      related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+	      while (related_vinfo && related_vinfo != stmt_vinfo)
+		{
+		  related_worklist.safe_push (related_vinfo);
+		  /* Set BB such that the assert in
+		    'get_initial_def_for_reduction' is able to determine that
+		    the BB of the related stmt is inside this loop.  */
+		  gimple_set_bb (STMT_VINFO_STMT (related_vinfo),
+				 gimple_bb (new_stmt));
+		  related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);
+		}
+	    }
+	  gcc_assert (orig_stmts.length () == 0);
+	}
+
+      /* The PATTERN_DEF_SEQ's in the epilogue were constructed using the
+	 original main loop and thus need to be updated to refer to the cloned
+	 variables used in the epilogue.  */
+      for (unsigned i = 0; i < pattern_worklist.length (); ++i)
+	{
+	  gimple_seq seq = STMT_VINFO_PATTERN_DEF_SEQ (pattern_worklist[i]);
+	  tree *new_op;
+
+	  while (seq)
+	    {
+	      for (unsigned j = 1; j < gimple_num_ops (seq); ++j)
+		{
+		  tree op = gimple_op (seq, j);
+		  if ((new_op = mapping.get(op)))
+		    gimple_set_op (seq, j, *new_op);
+		  else
+		    {
+		      op = unshare_expr (op);
+		      replace_ops (op, mapping);
+		      gimple_set_op (seq, j, op);
+		    }
+		}
+	      seq = seq->next;
+	    }
+	}
+
+      /* Just like the PATTERN_DEF_SEQ's the RELATED_STMT's also need to be
+	 updated.  */
+      for (unsigned i = 0; i < related_worklist.length (); ++i)
+	{
+	  tree *new_t;
+	  gimple * stmt = STMT_VINFO_STMT (related_worklist[i]);
+	  for (unsigned j = 1; j < gimple_num_ops (stmt); ++j)
+	    if ((new_t = mapping.get(gimple_op (stmt, j))))
+	      gimple_set_op (stmt, j, *new_t);
+	}
+
+      tree new_op;
+      for (unsigned i = 0; i < gather_scatter_drs.length (); ++i)
+	{
+	  dr_vec_info *dr_info = gather_scatter_drs[i];
+	  data_reference *dr = dr_info->dr;
+	  gcc_assert (dr);
+	  DR_REF (dr) = unshare_expr (DR_REF (dr));
+	  new_op = replace_ops (DR_REF (dr), mapping);
+	  if (new_op)
+	    DR_STMT (dr_info->dr) = SSA_NAME_DEF_STMT (new_op);
+	}
 
-      /* We may need to if-convert epilogue to vectorize it.  */
-      if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
-	tree_if_conversion (epilogue);
+      epilogue_vinfo->shared->datarefs_copy.release ();
+      epilogue_vinfo->shared->save_datarefs ();
     }
 
   return epilogue;
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 1456cde4c2c2dec7244c504d2c496248894a4f1e..9788c02535999e2e08cb03d1f20ddd80ff448d51 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -564,6 +564,8 @@ public:
      this points to the original vectorized loop.  Otherwise NULL.  */
   _loop_vec_info *orig_loop_info;
 
+  vec<_loop_vec_info *> epilogue_vinfos;
+
 } *loop_vec_info;
 
 /* Access Functions.  */
@@ -1480,13 +1482,15 @@ extern void vect_set_loop_condition (class loop *, loop_vec_info,
 extern bool slpeel_can_duplicate_loop_p (const class loop *, const_edge);
 class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
 						     class loop *, edge);
-class loop *vect_loop_versioning (loop_vec_info, unsigned int, bool,
-				   poly_uint64);
+class loop *vect_loop_versioning (loop_vec_info);
 extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
-				     tree *, tree *, tree *, int, bool, bool);
+				    tree *, tree *, tree *, int, bool, bool,
+				    tree *);
 extern void vect_prepare_for_masked_peels (loop_vec_info);
 extern dump_user_location_t find_loop_location (class loop *);
 extern bool vect_can_advance_ivs_p (loop_vec_info);
+extern void vect_update_inits_of_drs (loop_vec_info, tree, tree_code);
+
 
 /* In tree-vect-stmts.c.  */
 extern poly_uint64 current_vector_size;
@@ -1600,6 +1604,8 @@ extern tree vect_create_addr_base_for_vector_ref (stmt_vec_info, gimple_seq *,
 						  tree, tree = NULL_TREE);
 
 /* In tree-vect-loop.c.  */
+/* Used in tree-vect-loop-manip.c */
+extern void determine_peel_for_niter (loop_vec_info);
 /* FORNOW: Used in tree-parloops.c.  */
 extern stmt_vec_info vect_force_simple_reduction (loop_vec_info, stmt_vec_info,
 						  bool *, bool);
@@ -1610,7 +1616,8 @@ extern bool check_reduction_path (dump_user_location_t, loop_p, gphi *, tree,
 /* Drive for loop analysis stage.  */
 extern opt_loop_vec_info vect_analyze_loop (class loop *,
 					    loop_vec_info,
-					    vec_info_shared *);
+					    vec_info_shared *,
+					    vector_sizes);
 extern tree vect_build_loop_niters (loop_vec_info, bool * = NULL);
 extern void vect_gen_vector_loop_niters (loop_vec_info, tree, tree *,
 					 tree *, bool);
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 173e6b51652fd023893b38da786ff28f827553b5..71bbf4fdf8dc7588c45a0e8feef9272b52c0c04c 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -875,6 +875,10 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
   vec_info_shared shared;
   auto_purge_vect_location sentinel;
   vect_location = find_loop_location (loop);
+  auto_vector_sizes auto_vector_sizes;
+  vector_sizes vector_sizes;
+  bool assert_versioning = false;
+
   if (LOCATION_LOCUS (vect_location.get_location_t ()) != UNKNOWN_LOCATION
       && dump_enabled_p ())
     dump_printf (MSG_NOTE | MSG_PRIORITY_INTERNALS,
@@ -882,10 +886,35 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 		 LOCATION_FILE (vect_location.get_location_t ()),
 		 LOCATION_LINE (vect_location.get_location_t ()));
 
+  /* If this is an epilogue, we already know what vector sizes we will use for
+     vectorization as the analyzis was part of the main vectorized loop.  Use
+     these instead of going through all vector sizes again.  */
+  if (orig_loop_vinfo
+      && !LOOP_VINFO_LOOP (orig_loop_vinfo)->epilogue_vsizes.is_empty ())
+    {
+      vector_sizes = LOOP_VINFO_LOOP (orig_loop_vinfo)->epilogue_vsizes;
+      assert_versioning = LOOP_REQUIRES_VERSIONING (orig_loop_vinfo);
+      current_vector_size = vector_sizes[0];
+    }
+  else
+    {
+      /* Autodetect first vector size we try.  */
+      current_vector_size = 0;
+
+      targetm.vectorize.autovectorize_vector_sizes (&auto_vector_sizes,
+						    loop->simdlen != 0);
+      vector_sizes = auto_vector_sizes;
+    }
+
   /* Try to analyze the loop, retaining an opt_problem if dump_enabled_p.  */
-  opt_loop_vec_info loop_vinfo
-    = vect_analyze_loop (loop, orig_loop_vinfo, &shared);
-  loop->aux = loop_vinfo;
+  opt_loop_vec_info loop_vinfo = opt_loop_vec_info::success (NULL);
+  if (loop_vec_info_for_loop (loop))
+    loop_vinfo = opt_loop_vec_info::success (loop_vec_info_for_loop (loop));
+  else
+    {
+      loop_vinfo = vect_analyze_loop (loop, orig_loop_vinfo, &shared, vector_sizes);
+      loop->aux = loop_vinfo;
+    }
 
   if (!loop_vinfo)
     if (dump_enabled_p ())
@@ -898,6 +927,10 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 
   if (!loop_vinfo || !LOOP_VINFO_VECTORIZABLE_P (loop_vinfo))
     {
+      /* If this loops requires versioning, make sure the analyzis done on the
+	 epilogue loops succeeds.  */
+      gcc_assert (!assert_versioning);
+
       /* Free existing information if loop is analyzed with some
 	 assumptions.  */
       if (loop_constraint_set_p (loop, LOOP_C_FINITE))
@@ -1013,8 +1046,13 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 
   /* Epilogue of vectorized loop must be vectorized too.  */
   if (new_loop)
-    ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops,
-				 new_loop, loop_vinfo, NULL, NULL);
+    {
+      /* Don't include vectorized epilogues in the "vectorized loops" count.
+       */
+      unsigned dont_count = *num_vectorized_loops;
+      ret |= try_vectorize_loop_1 (simduid_to_vf_htab, &dont_count,
+				   new_loop, loop_vinfo, NULL, NULL);
+    }
 
   return ret;
 }
Richard Biener Oct. 9, 2019, 8:54 a.m. | #5
On Tue, 8 Oct 2019, Andre Vieira (lists) wrote:

> Hi Richard,

> 

> As I mentioned in the IRC channel, I managed to get "most" of the regression

> testsuite working for x86_64 (avx512) and aarch64.

> 

> On x86_64 I get a failure that I can't explain, was hoping you might be able

> to have a look with me:

> "PASS->FAIL: gcc.target/i386/vect-perm-odd-1.c execution test"

> 

> vect-perm-odd-1.exe segfaults and when I gdb it seems to be the first

> iteration of the main loop.  The tree dumps look alright, but I do notice the

> stack usage seems to change between --param vect-epilogue-nomask={0,1}.


So the issue is that we have

=> 0x0000000000400778 <+72>:    vmovdqa64 %zmm1,-0x40(%rax)

but the memory accessed is not appropriately aligned.  The vectorizer
sets DECL_USER_ALIGN on the stack local but somehow later it downs
it to 256:

Old value = 640
New value = 576
ensure_base_align (dr_info=0x526f788) at 
/tmp/trunk/gcc/tree-vect-stmts.c:6294
6294              DECL_USER_ALIGN (base_decl) = 1;
(gdb) l
6289          if (decl_in_symtab_p (base_decl))
6290            symtab_node::get (base_decl)->increase_alignment 
(align_base_to);
6291          else
6292            {
6293              SET_DECL_ALIGN (base_decl, align_base_to);
6294              DECL_USER_ALIGN (base_decl) = 1;
6295            }

this means vectorizing the epilogue modifies the DRs, in particular
the base alignment?

> Am I missing to update some field that may later lead to the amount of stack

> being used? I am confused, it could very well be that I am missing something

> obvious, I am not too familiar with x86's ISA. I will try to investigate

> further.

> 

> This patch needs further clean-up and more comments (or comment updates), but

> I thought I'd share current state to see if you can help me unblock.

> 

> Cheers,

> Andre

> 


-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 247165 (AG München)
Andre Simoes Dias Vieira Oct. 10, 2019, 1:50 p.m. | #6
Hi,

After all the discussions and respins I now believe this patch is close 
to what we envisioned.

This patch achieves two things when vect-epilogues-nomask=1:
1) It analyzes the original loop for each supported vector size and 
saves this analysis per loop, as well as the vector sizes we know we can 
vectorize the loop for.
2) When loop versioning it uses the 'skip_vector' code path to vectorize 
the epilogue, and uses the lowest versioning threshold between the main 
and epilogue's.

As side effects of this patch I also changed ensure_base_align to only 
update the alignment if the new alignment is lower than the current one. 
  This function already did that if the object was a symbol, now it 
behaves this way for any object.

I bootstrapped this patch with both vect-epilogues-nomask turned on and 
off on x86_64 (AVX512) and aarch64.  Regression tests looked good.

Is this OK for trunk?

gcc/ChangeLog:
2019-10-10  Andre Vieira  <andre.simoesdiasvieira@arm.com>

     PR 88915
     * cfgloop.h (loop): Add epilogue_vsizes member.
     * cfgloop.c (flow_loop_free): Release epilogue_vsizes.
     (alloc_loop): Initialize epilogue_vsizes.
     * gentype.c (main): Add poly_uint64 type and vector_sizes to
     generator.
     * tree-vect-loop.c (vect_get_loop_niters): Make externally visible.
     (_loop_vec_info): Initialize epilogue_vinfos.
     (~_loop_vec_info): Release epilogue_vinfos.
     (vect_analyze_loop_costing): Use knowledge of main VF to estimate
     number of iterations of epilogue.
     (determine_peel_for_niter): New. Outlined code to re-use in two
     places.
     (vect_analyze_loop_2): Adapt to analyse main loop for all supported
     vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
     versioning threshold needed for main loop.
     (vect_analyze_loop): Likewise.
     (replace_ops): New helper function.
     (vect_transform_loop): When vectorizing epilogues re-use analysis done
     on main loop and update necessary information.
     * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
     stmts on loop preheader edge.
     (vect_do_peeling): Enable skip-vectors when doing loop versioning if
     we decided to vectorize epilogues.  Update epilogues NITERS and
     construct ADVANCE to update epilogues data references where needed.
     (vect_loop_versioning): Moved decision to check_profitability
     based on cost model.
     * tree-vect-stmts.c (ensure_base_align): Only update alignment
     if new alignment is lower.
     * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos member.
     (vect_loop_versioning, vect_do_peeling, vect_get_loop_niters,
     vect_update_inits_of_drs, determine_peel_for_niter,
     vect_analyze_loop): Add or update declarations.
     * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
     create loop_vec_info's for epilogues when available.  Otherwise analyse
     epilogue separately.



Cheers,
Andre
diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 0b0154ffd7bf031a005de993b101d9db6dd98c43..d01512ea46467f1cf77793bdc75b48e71b0b9641 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -21,6 +21,7 @@ along with GCC; see the file COPYING3.  If not see
 #define GCC_CFGLOOP_H
 
 #include "cfgloopmanip.h"
+#include "target.h"
 
 /* Structure to hold decision about unrolling/peeling.  */
 enum lpt_dec
@@ -268,6 +269,9 @@ public:
      the basic-block from being collected but its index can still be
      reused.  */
   basic_block former_header;
+
+  /* Keep track of vector sizes we know we can vectorize the epilogue with.  */
+  vector_sizes epilogue_vsizes;
 };
 
 /* Set if the loop is known to be infinite.  */
diff --git a/gcc/cfgloop.c b/gcc/cfgloop.c
index 4ad1f658708f83dbd8789666c26d4bd056837bc6..f3e81bcd00b3f125389aa15b12dc5201b3578d20 100644
--- a/gcc/cfgloop.c
+++ b/gcc/cfgloop.c
@@ -198,6 +198,7 @@ flow_loop_free (class loop *loop)
       exit->prev = exit;
     }
 
+  loop->epilogue_vsizes.release();
   ggc_free (loop->exits);
   ggc_free (loop);
 }
@@ -355,6 +356,7 @@ alloc_loop (void)
   loop->nb_iterations_upper_bound = 0;
   loop->nb_iterations_likely_upper_bound = 0;
   loop->nb_iterations_estimate = 0;
+  loop->epilogue_vsizes.create(8);
   return loop;
 }
 
diff --git a/gcc/gengtype.c b/gcc/gengtype.c
index 53317337cf8c8e8caefd6b819d28b3bba301e755..80fb6ef71465b24e034fa45d69fec56be6b2e7f8 100644
--- a/gcc/gengtype.c
+++ b/gcc/gengtype.c
@@ -5197,6 +5197,7 @@ main (int argc, char **argv)
       POS_HERE (do_scalar_typedef ("widest_int", &pos));
       POS_HERE (do_scalar_typedef ("int64_t", &pos));
       POS_HERE (do_scalar_typedef ("poly_int64", &pos));
+      POS_HERE (do_scalar_typedef ("poly_uint64", &pos));
       POS_HERE (do_scalar_typedef ("uint64_t", &pos));
       POS_HERE (do_scalar_typedef ("uint8", &pos));
       POS_HERE (do_scalar_typedef ("uintptr_t", &pos));
@@ -5206,6 +5207,7 @@ main (int argc, char **argv)
       POS_HERE (do_scalar_typedef ("machine_mode", &pos));
       POS_HERE (do_scalar_typedef ("fixed_size_mode", &pos));
       POS_HERE (do_scalar_typedef ("CONSTEXPR", &pos));
+      POS_HERE (do_scalar_typedef ("vector_sizes", &pos));
       POS_HERE (do_typedef ("PTR", 
 			    create_pointer (resolve_typedef ("void", &pos)),
 			    &pos));
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 5c25441c70a271f04730486e513437fffa75b7e3..6349e4e808edfc0813ad1d0a1125420d9b0b260c 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1724,7 +1724,7 @@ vect_update_init_of_dr (struct data_reference *dr, tree niters, tree_code code)
    Apply vect_update_inits_of_dr to all accesses in LOOP_VINFO.
    CODE and NITERS are as for vect_update_inits_of_dr.  */
 
-static void
+void
 vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
 			  tree_code code)
 {
@@ -1734,21 +1734,12 @@ vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
 
   DUMP_VECT_SCOPE ("vect_update_inits_of_dr");
 
-  /* Adjust niters to sizetype and insert stmts on loop preheader edge.  */
+  /* Adjust niters to sizetype.  We used to insert the stmts on loop preheader
+     here, but since we might use these niters to update the epilogues niters
+     and data references we can't insert them here as this definition might not
+     always dominate its uses.  */
   if (!types_compatible_p (sizetype, TREE_TYPE (niters)))
-    {
-      gimple_seq seq;
-      edge pe = loop_preheader_edge (LOOP_VINFO_LOOP (loop_vinfo));
-      tree var = create_tmp_var (sizetype, "prolog_loop_adjusted_niters");
-
-      niters = fold_convert (sizetype, niters);
-      niters = force_gimple_operand (niters, &seq, false, var);
-      if (seq)
-	{
-	  basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe, seq);
-	  gcc_assert (!new_bb);
-	}
-    }
+    niters = fold_convert (sizetype, niters);
 
   FOR_EACH_VEC_ELT (datarefs, i, dr)
     {
@@ -2401,14 +2392,18 @@ class loop *
 vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 		 tree *niters_vector, tree *step_vector,
 		 tree *niters_vector_mult_vf_var, int th,
-		 bool check_profitability, bool niters_no_overflow)
+		 bool check_profitability, bool niters_no_overflow,
+		 tree *advance)
 {
   edge e, guard_e;
-  tree type = TREE_TYPE (niters), guard_cond;
+  tree type = TREE_TYPE (niters), guard_cond, vector_guard = NULL;
   basic_block guard_bb, guard_to;
   profile_probability prob_prolog, prob_vector, prob_epilog;
   int estimated_vf;
   int prolog_peeling = 0;
+  bool vect_epilogues
+    = loop_vinfo->epilogue_vinfos.length () > 0
+    && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
   /* We currently do not support prolog peeling if the target alignment is not
      known at compile time.  'vect_gen_prolog_loop_niters' depends on the
      target alignment being constant.  */
@@ -2466,15 +2461,62 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   else
     niters_prolog = build_int_cst (type, 0);
 
+  loop_vec_info epilogue_vinfo = NULL;
+  if (vect_epilogues)
+    {
+      epilogue_vinfo = loop_vinfo->epilogue_vinfos[0];
+      loop_vinfo->epilogue_vinfos.ordered_remove (0);
+
+      /* Don't vectorize epilogues if this is not the most inner loop or if
+	 the epilogue may need peeling for alignment as the vectorizer doesn't
+	 know how to handle these situations properly yet.  */
+      if (loop->inner != NULL
+	  || LOOP_VINFO_PEELING_FOR_ALIGNMENT (epilogue_vinfo))
+	vect_epilogues = false;
+
+    }
+
+  unsigned int lowest_vf = constant_lower_bound (vf);
+  bool epilogue_any_upper_bound = false;
+  unsigned HOST_WIDE_INT eiters = 0;
+  tree niters_vector_mult_vf;
+
+  /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work
+     on niters already ajusted for the iterations of the prologue.  */
+  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+      && known_eq (vf, lowest_vf))
+    {
+      vector_sizes vector_sizes = loop->epilogue_vsizes;
+      unsigned next_size = 0;
+      eiters = (LOOP_VINFO_INT_NITERS (loop_vinfo)
+	   - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
+
+      if (prolog_peeling > 0)
+	eiters -= prolog_peeling;
+      eiters
+	= eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
+      epilogue_any_upper_bound = true;
+
+      unsigned int ratio;
+      while (next_size < vector_sizes.length ()
+	     && !(constant_multiple_p (current_vector_size,
+				       vector_sizes[next_size], &ratio)
+		  && eiters >= lowest_vf / ratio))
+	next_size += 1;
+
+      if (next_size == vector_sizes.length ())
+	vect_epilogues = false;
+    }
+
   /* Prolog loop may be skipped.  */
   bool skip_prolog = (prolog_peeling != 0);
   /* Skip to epilog if scalar loop may be preferred.  It's only needed
-     when we peel for epilog loop and when it hasn't been checked with
-     loop versioning.  */
+     when we peel for epilog loop or when we loop version.  */
   bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
 		      ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo),
 				  bound_prolog + bound_epilog)
-		      : !LOOP_REQUIRES_VERSIONING (loop_vinfo));
+		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+			 || vect_epilogues));
   /* Epilog loop must be executed if the number of iterations for epilog
      loop is known at compile time, otherwise we need to add a check at
      the end of vector loop and skip to the end of epilog loop.  */
@@ -2503,7 +2545,25 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
     }
 
   dump_user_location_t loop_loc = find_loop_location (loop);
-  class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
+  class loop *scalar_loop;
+  /* If we are vectorizing the epilogue then we should use a copy of the
+     original main loop to vectorize.  This copy has already been if-converted
+     and is identical to the loop on which the analysis was done, making it
+     easier to update loop_vec_info, stmt_vec_info and dr_vec_info references
+     where needed.  */
+  if (vect_epilogues)
+    {
+      scalar_loop = get_loop_copy (loop);
+      /* Make sure to set the epilogue's epilogue scalar loop, such that we can
+	 we can use the original scalar loop as remaining epilogue if
+	 necessary.  */
+      LOOP_VINFO_SCALAR_LOOP (epilogue_vinfo)
+	= LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
+      LOOP_VINFO_SCALAR_LOOP (loop_vinfo) = NULL;
+    }
+  else
+   scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
+
   if (prolog_peeling)
     {
       e = loop_preheader_edge (loop);
@@ -2592,6 +2652,13 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 			   "slpeel_tree_duplicate_loop_to_edge_cfg failed.\n");
 	  gcc_unreachable ();
 	}
+
+      if (epilogue_any_upper_bound && prolog_peeling >= 0)
+	{
+	  epilog->any_upper_bound = true;
+	  epilog->nb_iterations_upper_bound = eiters + 1;
+	}
+
       epilog->force_vectorize = false;
       slpeel_update_phi_nodes_for_loops (loop_vinfo, loop, epilog, false);
 
@@ -2608,6 +2675,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 						check_profitability);
 	  /* Build guard against NITERSM1 since NITERS may overflow.  */
 	  guard_cond = fold_build2 (LT_EXPR, boolean_type_node, nitersm1, t);
+	  vector_guard = guard_cond;
 	  guard_bb = anchor;
 	  guard_to = split_edge (loop_preheader_edge (epilog));
 	  guard_e = slpeel_add_loop_guard (guard_bb, guard_cond,
@@ -2635,7 +2703,6 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	}
 
       basic_block bb_before_epilog = loop_preheader_edge (epilog)->src;
-      tree niters_vector_mult_vf;
       /* If loop is peeled for non-zero constant times, now niters refers to
 	 orig_niters - prolog_peeling, it won't overflow even the orig_niters
 	 overflows.  */
@@ -2699,10 +2766,108 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
       adjust_vec_debug_stmts ();
       scev_reset ();
     }
+
+  if (vect_epilogues)
+    {
+      epilog->aux = epilogue_vinfo;
+      LOOP_VINFO_LOOP (epilogue_vinfo) = epilog;
+
+      loop_constraint_clear (epilog, LOOP_C_INFINITE);
+
+      /* We now must calculate the number of iterations for our epilogue.  */
+      tree cond_niters, niters;
+
+      /* Depending on whether we peel for gaps we take niters or niters - 1,
+	 we will refer to this as N - G, where N and G are the NITERS and
+	 GAP for the original loop.  */
+      niters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+	? LOOP_VINFO_NITERSM1 (loop_vinfo)
+	: LOOP_VINFO_NITERS (loop_vinfo);
+
+      /* Here we build a vector factorization mask:
+	 vf_mask = ~(VF - 1), where VF is the Vectorization Factor.  */
+      tree vf_mask = build_int_cst (TREE_TYPE (niters),
+				    LOOP_VINFO_VECT_FACTOR (loop_vinfo));
+      vf_mask = fold_build2 (MINUS_EXPR, TREE_TYPE (vf_mask),
+			     vf_mask,
+			     build_one_cst (TREE_TYPE (vf_mask)));
+      vf_mask = fold_build1 (BIT_NOT_EXPR, TREE_TYPE (niters), vf_mask);
+
+      /* Here we calculate:
+	 niters = N - ((N-G) & ~(VF -1)) */
+      niters = fold_build2 (MINUS_EXPR, TREE_TYPE (niters),
+			    LOOP_VINFO_NITERS (loop_vinfo),
+			    fold_build2 (BIT_AND_EXPR, TREE_TYPE (niters),
+					 niters,
+					 vf_mask));
+
+      if (skip_vector)
+	{
+	  /* If it is not guaranteed we enter the main loop we need to
+	     make the niters of the epilogue conditional on entireing the main
+	     loop.  We do this by constructing:
+	     cond_niters = !do_we_enter_main_loop ? N + niters_prolog : niters
+	     we add niters_prolog, the number of peeled iterations in the
+	     prolog, to N in case we don't enter the main loop, as these have
+	     already been subtracted from N (the number of iterations of the
+	     main loop).  Since the prolog peeling is also skipped if we skip the
+	     main loop we must add those interations back.  */
+	  cond_niters
+	    = fold_build3 (COND_EXPR, TREE_TYPE (niters),
+			   vector_guard,
+			   fold_build2 (PLUS_EXPR, TREE_TYPE (niters),
+					LOOP_VINFO_NITERS (loop_vinfo),
+					fold_convert (TREE_TYPE (niters),
+						      niters_prolog)),
+			   niters);
+	}
+      else
+	cond_niters = niters;
+
+      LOOP_VINFO_NITERS (epilogue_vinfo) = cond_niters;
+      LOOP_VINFO_NITERSM1 (epilogue_vinfo)
+	= fold_build2 (MINUS_EXPR, TREE_TYPE (cond_niters),
+		       cond_niters, build_one_cst (TREE_TYPE (cond_niters)));
+
+      /* We now calculate the amount of iterations we must advance our
+	 epilogue's data references by.  Make sure to use sizetype here as
+	 otherwise the pointer computation may go wrong on targets with
+	 different pointer sizes to the used niters type.  */
+      *advance = fold_convert (sizetype, niters);
+
+      *advance = fold_build2 (MINUS_EXPR, TREE_TYPE (*advance),
+			      *advance,
+			      fold_convert (sizetype,
+					    LOOP_VINFO_NITERS (loop_vinfo)));
+      *advance = fold_build2 (MINUS_EXPR, TREE_TYPE (*advance),
+			      build_zero_cst (TREE_TYPE (*advance)),
+			      *advance);
+
+      if (skip_vector)
+	{
+	  /* If we are skipping the vectorized loop then we must roll back the
+	     data references by the amount we might have expected to peel in
+	     the, also skipped, prolog.  */
+	  *advance
+	    = fold_build3 (COND_EXPR, TREE_TYPE (*advance),
+			   vector_guard,
+			   fold_build2 (MINUS_EXPR, TREE_TYPE (*advance),
+					build_zero_cst (TREE_TYPE (*advance)),
+					fold_convert (TREE_TYPE (*advance),
+						      niters_prolog)),
+			   *advance);
+	}
+
+      /* Redo the peeling for niter analysis as the NITERs and alignment
+	 may have been updated to take the main loop into account.  */
+      LOOP_VINFO_PEELING_FOR_NITER (epilogue_vinfo) = false;
+      determine_peel_for_niter (epilogue_vinfo);
+    }
+
   adjust_vec.release ();
   free_original_copy_tables ();
 
-  return epilog;
+  return vect_epilogues ? epilog : NULL;
 }
 
 /* Function vect_create_cond_for_niters_checks.
@@ -2966,9 +3131,7 @@ vect_create_cond_for_alias_checks (loop_vec_info loop_vinfo, tree * cond_expr)
    *COND_EXPR_STMT_LIST.  */
 
 class loop *
-vect_loop_versioning (loop_vec_info loop_vinfo,
-		      unsigned int th, bool check_profitability,
-		      poly_uint64 versioning_threshold)
+vect_loop_versioning (loop_vec_info loop_vinfo)
 {
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo), *nloop;
   class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
@@ -2988,10 +3151,15 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
   bool version_align = LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT (loop_vinfo);
   bool version_alias = LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo);
   bool version_niter = LOOP_REQUIRES_VERSIONING_FOR_NITERS (loop_vinfo);
+  poly_uint64 versioning_threshold
+    = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
   tree version_simd_if_cond
     = LOOP_REQUIRES_VERSIONING_FOR_SIMD_IF_COND (loop_vinfo);
+  unsigned th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
 
-  if (check_profitability)
+  if (th >= vect_vf_for_cost (loop_vinfo)
+      && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+      && !ordered_p (th, versioning_threshold))
     cond_expr = fold_build2 (GE_EXPR, boolean_type_node, scalar_loop_iters,
 			     build_int_cst (TREE_TYPE (scalar_loop_iters),
 					    th - 1));
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index b0cbbac0cb5ba1ffce706715d3dbb9139063803d..5cba0bcf9df93bb25dcd37c8deeff601d3e64c8f 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -713,7 +713,7 @@ vect_fixup_scalar_cycles_with_patterns (loop_vec_info loop_vinfo)
    Return the loop exit condition.  */
 
 
-static gcond *
+gcond *
 vect_get_loop_niters (class loop *loop, tree *assumptions,
 		      tree *number_of_iterations, tree *number_of_iterationsm1)
 {
@@ -885,6 +885,8 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
 	    }
 	}
     }
+
+  epilogue_vinfos.create (6);
 }
 
 /* Free all levels of MASKS.  */
@@ -960,6 +962,7 @@ _loop_vec_info::~_loop_vec_info ()
   release_vec_loop_masks (&masks);
   delete ivexpr_map;
   delete scan_map;
+  epilogue_vinfos.release ();
 
   loop->aux = NULL;
 }
@@ -1726,7 +1729,13 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
       return 0;
     }
 
-  HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop);
+  HOST_WIDE_INT estimated_niter = -1;
+
+  if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+    estimated_niter
+      = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1;
+  if (estimated_niter == -1)
+    estimated_niter = estimated_stmt_executions_int (loop);
   if (estimated_niter == -1)
     estimated_niter = likely_max_stmt_executions_int (loop);
   if (estimated_niter != -1
@@ -1852,6 +1861,56 @@ vect_dissolve_slp_only_groups (loop_vec_info loop_vinfo)
     }
 }
 
+
+/* Decides whether we need to create an epilogue loop to handle
+   remaining scalar iterations and sets PEELING_FOR_NITERS accordingly.  */
+
+void
+determine_peel_for_niter (loop_vec_info loop_vinfo)
+{
+
+  unsigned HOST_WIDE_INT const_vf;
+  HOST_WIDE_INT max_niter
+    = likely_max_stmt_executions_int (LOOP_VINFO_LOOP (loop_vinfo));
+
+  unsigned th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
+  if (!th && LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+    th = LOOP_VINFO_COST_MODEL_THRESHOLD (LOOP_VINFO_ORIG_LOOP_INFO
+					  (loop_vinfo));
+
+  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    /* The main loop handles all iterations.  */
+    LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
+  else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+	   && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
+    {
+      /* Work out the (constant) number of iterations that need to be
+	 peeled for reasons other than niters.  */
+      unsigned int peel_niter = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
+      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
+	peel_niter += 1;
+      if (!multiple_p (LOOP_VINFO_INT_NITERS (loop_vinfo) - peel_niter,
+		       LOOP_VINFO_VECT_FACTOR (loop_vinfo)))
+	LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = true;
+    }
+  else if (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
+	   /* ??? When peeling for gaps but not alignment, we could
+	      try to check whether the (variable) niters is known to be
+	      VF * N + 1.  That's something of a niche case though.  */
+	   || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+	   || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant (&const_vf)
+	   || ((tree_ctz (LOOP_VINFO_NITERS (loop_vinfo))
+		< (unsigned) exact_log2 (const_vf))
+	       /* In case of versioning, check if the maximum number of
+		  iterations is greater than th.  If they are identical,
+		  the epilogue is unnecessary.  */
+	       && (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+		   || ((unsigned HOST_WIDE_INT) max_niter
+		       > (th / const_vf) * const_vf))))
+    LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = true;
+}
+
+
 /* Function vect_analyze_loop_2.
 
    Apply a set of analyses on LOOP, and create a loop_vec_info struct
@@ -1864,6 +1923,7 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
   int res;
   unsigned int max_vf = MAX_VECTORIZATION_FACTOR;
   poly_uint64 min_vf = 2;
+  loop_vec_info orig_loop_vinfo = NULL;
 
   /* The first group of checks is independent of the vector size.  */
   fatal = true;
@@ -1979,7 +2039,6 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
   vect_compute_single_scalar_iteration_cost (loop_vinfo);
 
   poly_uint64 saved_vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
-  unsigned th;
 
   /* Check the SLP opportunities in the loop, analyze and build SLP trees.  */
   ok = vect_analyze_slp (loop_vinfo, *n_stmts);
@@ -2019,9 +2078,6 @@ start_over:
 		   LOOP_VINFO_INT_NITERS (loop_vinfo));
     }
 
-  HOST_WIDE_INT max_niter
-    = likely_max_stmt_executions_int (LOOP_VINFO_LOOP (loop_vinfo));
-
   /* Analyze the alignment of the data-refs in the loop.
      Fail if a data reference is found that cannot be vectorized.  */
 
@@ -2125,42 +2181,7 @@ start_over:
     return opt_result::failure_at (vect_location,
 				   "Loop costings not worthwhile.\n");
 
-  /* Decide whether we need to create an epilogue loop to handle
-     remaining scalar iterations.  */
-  th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
-
-  unsigned HOST_WIDE_INT const_vf;
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
-    /* The main loop handles all iterations.  */
-    LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
-  else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
-	   && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
-    {
-      /* Work out the (constant) number of iterations that need to be
-	 peeled for reasons other than niters.  */
-      unsigned int peel_niter = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
-      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
-	peel_niter += 1;
-      if (!multiple_p (LOOP_VINFO_INT_NITERS (loop_vinfo) - peel_niter,
-		       LOOP_VINFO_VECT_FACTOR (loop_vinfo)))
-	LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = true;
-    }
-  else if (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
-	   /* ??? When peeling for gaps but not alignment, we could
-	      try to check whether the (variable) niters is known to be
-	      VF * N + 1.  That's something of a niche case though.  */
-	   || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
-	   || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant (&const_vf)
-	   || ((tree_ctz (LOOP_VINFO_NITERS (loop_vinfo))
-		< (unsigned) exact_log2 (const_vf))
-	       /* In case of versioning, check if the maximum number of
-		  iterations is greater than th.  If they are identical,
-		  the epilogue is unnecessary.  */
-	       && (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
-		   || ((unsigned HOST_WIDE_INT) max_niter
-		       > (th / const_vf) * const_vf))))
-    LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = true;
-
+  determine_peel_for_niter (loop_vinfo);
   /* If an epilogue loop is required make sure we can create one.  */
   if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
       || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
@@ -2183,9 +2204,12 @@ start_over:
      enough for both peeled prolog loop and vector loop.  This check
      can be merged along with threshold check of loop versioning, so
      increase threshold for this case if necessary.  */
-  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
+  if (LOOP_REQUIRES_VERSIONING (loop_vinfo)
+      || ((orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+	  && LOOP_REQUIRES_VERSIONING (orig_loop_vinfo)))
     {
       poly_uint64 niters_th = 0;
+      unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
 
       if (!vect_use_loop_mask_for_alignment_p (loop_vinfo))
 	{
@@ -2206,6 +2230,14 @@ start_over:
       /* One additional iteration because of peeling for gap.  */
       if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
 	niters_th += 1;
+
+      /*  Use the same condition as vect_transform_loop to decide when to use
+	  the cost to determine a versioning threshold.  */
+      if (th >= vect_vf_for_cost (loop_vinfo)
+	  && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+	  && ordered_p (th, niters_th))
+	niters_th = ordered_max (poly_uint64 (th), niters_th);
+
       LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo) = niters_th;
     }
 
@@ -2329,14 +2361,8 @@ again:
    be vectorized.  */
 opt_loop_vec_info
 vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
-		   vec_info_shared *shared)
+		   vec_info_shared *shared, vector_sizes vector_sizes)
 {
-  auto_vector_sizes vector_sizes;
-
-  /* Autodetect first vector size we try.  */
-  current_vector_size = 0;
-  targetm.vectorize.autovectorize_vector_sizes (&vector_sizes,
-						loop->simdlen != 0);
   unsigned int next_size = 0;
 
   DUMP_VECT_SCOPE ("analyze_loop_nest");
@@ -2357,6 +2383,9 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
   poly_uint64 autodetected_vector_size = 0;
   opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL);
   poly_uint64 first_vector_size = 0;
+  poly_uint64 lowest_th = 0;
+  unsigned vectorized_loops = 0;
+  bool vect_epilogues = !loop->simdlen && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK);
   while (1)
     {
       /* Check the CFG characteristics of the loop (nesting, entry/exit).  */
@@ -2375,24 +2404,52 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
 
       if (orig_loop_vinfo)
 	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo;
+      else if (vect_epilogues && first_loop_vinfo)
+	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
 
       opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts);
       if (res)
 	{
 	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
+	  vectorized_loops++;
 
-	  if (loop->simdlen
-	      && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
-			   (unsigned HOST_WIDE_INT) loop->simdlen))
+	  if ((loop->simdlen
+	       && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+			    (unsigned HOST_WIDE_INT) loop->simdlen))
+	      || vect_epilogues)
 	    {
 	      if (first_loop_vinfo == NULL)
 		{
 		  first_loop_vinfo = loop_vinfo;
+		  lowest_th
+		    = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo);
 		  first_vector_size = current_vector_size;
 		  loop->aux = NULL;
 		}
 	      else
-		delete loop_vinfo;
+		{
+		  /* Keep track of vector sizes that we know we can vectorize
+		     the epilogue with.  */
+		  if (vect_epilogues)
+		    {
+		      loop->aux = NULL;
+		      loop->epilogue_vsizes.reserve (1);
+		      loop->epilogue_vsizes.quick_push (current_vector_size);
+		      first_loop_vinfo->epilogue_vinfos.reserve (1);
+		      first_loop_vinfo->epilogue_vinfos.quick_push (loop_vinfo);
+		      LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
+		      poly_uint64 th
+			= LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
+		      gcc_assert (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+				  || maybe_ne (lowest_th, 0U));
+		      /* Keep track of the known smallest versioning
+			 threshold.  */
+		      if (ordered_p (lowest_th, th))
+			lowest_th = ordered_min (lowest_th, th);
+		    }
+		  else
+		    delete loop_vinfo;
+		}
 	    }
 	  else
 	    {
@@ -2430,6 +2487,8 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
 		  dump_dec (MSG_NOTE, current_vector_size);
 		  dump_printf (MSG_NOTE, "\n");
 		}
+	      LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo) = lowest_th;
+
 	      return first_loop_vinfo;
 	    }
 	  else
@@ -8460,6 +8519,34 @@ vect_transform_loop_stmt (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
     *seen_store = stmt_info;
 }
 
+/* Helper function to replace a SSA name in OP with its equivalent SSA name in
+   MAPPING.  */
+
+static tree
+replace_ops (tree op, hash_map<tree, tree> &mapping)
+{
+  if (!op)
+    return NULL;
+
+  tree *new_op;
+  tree ret = NULL;
+  for (int j = 0; j < TREE_OPERAND_LENGTH (op); ++j)
+    {
+      if ((new_op = mapping.get (TREE_OPERAND (op, j))))
+	{
+	  TREE_OPERAND (op, j) = *new_op;
+	  ret = *new_op;
+	}
+      else
+	ret = replace_ops (TREE_OPERAND (op, j), mapping);
+
+      if (ret)
+	return ret;
+    }
+
+  return NULL;
+}
+
 /* Function vect_transform_loop.
 
    The analysis phase has determined that the loop is vectorizable.
@@ -8483,6 +8570,10 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   gimple *stmt;
   bool check_profitability = false;
   unsigned int th;
+  auto_vec<gimple *> orig_stmts;
+  auto_vec<dr_vec_info *> gather_scatter_drs;
+  auto_vec<dr_vec_info *> drs;
+  auto_vec<gimple *> gather_scatter_stmts;
 
   DUMP_VECT_SCOPE ("vec_transform_loop");
 
@@ -8497,11 +8588,11 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   if (th >= vect_vf_for_cost (loop_vinfo)
       && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_NOTE, vect_location,
-			 "Profitability threshold is %d loop iterations.\n",
-                         th);
-      check_profitability = true;
+	if (dump_enabled_p ())
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "Profitability threshold is %d loop iterations.\n",
+			   th);
+	check_profitability = true;
     }
 
   /* Make sure there exists a single-predecessor exit bb.  Do this before 
@@ -8519,18 +8610,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
 
   if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
     {
-      poly_uint64 versioning_threshold
-	= LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
-      if (check_profitability
-	  && ordered_p (poly_uint64 (th), versioning_threshold))
-	{
-	  versioning_threshold = ordered_max (poly_uint64 (th),
-					      versioning_threshold);
-	  check_profitability = false;
-	}
       class loop *sloop
-	= vect_loop_versioning (loop_vinfo, th, check_profitability,
-				versioning_threshold);
+	= vect_loop_versioning (loop_vinfo);
       sloop->force_vectorize = false;
       check_profitability = false;
     }
@@ -8555,9 +8636,64 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo) = niters;
   tree nitersm1 = unshare_expr (LOOP_VINFO_NITERSM1 (loop_vinfo));
   bool niters_no_overflow = loop_niters_no_overflow (loop_vinfo);
+  tree advance;
   epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector,
 			      &step_vector, &niters_vector_mult_vf, th,
-			      check_profitability, niters_no_overflow);
+			      check_profitability, niters_no_overflow,
+			      &advance);
+
+  if (epilogue)
+    {
+      basic_block *orig_bbs = get_loop_body (loop);
+      loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue);
+
+      gimple_stmt_iterator orig_gsi;
+      gphi_iterator orig_phi_gsi;
+      gimple *stmt;
+      stmt_vec_info stmt_vinfo;
+      dr_vec_info *dr_vinfo;
+
+      /* The stmt_vec_info's of the epilogue were constructed for the main loop
+	 and need to be updated to refer to the cloned variables used in the
+	 epilogue loop.  We do this by assuming the original main loop and the
+	 epilogue loop are identical (aside the different SSA names).  This
+	 means we assume we can go through each BB in the loop and each STMT in
+	 each BB and map them 1:1, replacing the STMT_VINFO_STMT of each
+	 stmt_vec_info in the epilogue's loop_vec_info.  Here we only keep
+	 track of the original state of the main loop, before vectorization.
+	 After vectorization we proceed to update the epilogue's stmt_vec_infos
+	 information.  We also update the references in PATTERN_DEF_SEQ's,
+	 RELATED_STMT's and data_references.  Mainly the latter has to be
+	 updated after we are done vectorizing the main loop, as the
+	 data_references are shared between main and epilogue.  */
+      for (unsigned i = 0; i < loop->num_nodes; ++i)
+	{
+	  for (orig_phi_gsi = gsi_start_phis (orig_bbs[i]);
+	       !gsi_end_p (orig_phi_gsi); gsi_next (&orig_phi_gsi))
+	    orig_stmts.safe_push (orig_phi_gsi.phi ());
+	  for (orig_gsi = gsi_start_bb (orig_bbs[i]);
+	       !gsi_end_p (orig_gsi); gsi_next (&orig_gsi))
+	    {
+	      stmt = gsi_stmt (orig_gsi);
+	      orig_stmts.safe_push (stmt);
+	      stmt_vinfo  = epilogue_vinfo->lookup_stmt (stmt);
+	      /* Data references pointing to gather loads and scatter stores
+		 require special treatment because the address computation
+		 happens in a different gimple node, pointed to by DR_REF.  In
+		 contrast to normal loads and stores where we only need to
+		 update the offset of the data reference.  */
+	      if (stmt_vinfo != NULL
+		  && stmt_vinfo->dr_aux.stmt == stmt_vinfo)
+		{
+		  dr_vinfo = STMT_VINFO_DR_INFO (stmt_vinfo);
+		  if (STMT_VINFO_GATHER_SCATTER_P (dr_vinfo->stmt))
+		    gather_scatter_drs.safe_push (dr_vinfo);
+		  drs.safe_push (dr_vinfo);
+		}
+	    }
+	}
+    }
+
   if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo)
       && LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo).initialized_p ())
     scale_loop_frequencies (LOOP_VINFO_SCALAR_LOOP (loop_vinfo),
@@ -8814,58 +8950,168 @@ vect_transform_loop (loop_vec_info loop_vinfo)
      since vectorized loop can have loop-carried dependencies.  */
   loop->safelen = 0;
 
-  /* Don't vectorize epilogue for epilogue.  */
-  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
-    epilogue = NULL;
-
-  if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK))
-    epilogue = NULL;
-
   if (epilogue)
     {
-      auto_vector_sizes vector_sizes;
-      targetm.vectorize.autovectorize_vector_sizes (&vector_sizes, false);
-      unsigned int next_size = 0;
 
-      /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work
-         on niters already ajusted for the iterations of the prologue.  */
-      if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
-	  && known_eq (vf, lowest_vf))
-	{
-	  unsigned HOST_WIDE_INT eiters
-	    = (LOOP_VINFO_INT_NITERS (loop_vinfo)
-	       - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
-	  eiters
-	    = eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
-	  epilogue->nb_iterations_upper_bound = eiters - 1;
-	  epilogue->any_upper_bound = true;
-
-	  unsigned int ratio;
-	  while (next_size < vector_sizes.length ()
-		 && !(constant_multiple_p (current_vector_size,
-					   vector_sizes[next_size], &ratio)
-		      && eiters >= lowest_vf / ratio))
-	    next_size += 1;
-	}
-      else
-	while (next_size < vector_sizes.length ()
-	       && maybe_lt (current_vector_size, vector_sizes[next_size]))
-	  next_size += 1;
+      loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue);
+      vect_update_inits_of_drs (epilogue_vinfo, advance, PLUS_EXPR);
 
-      if (next_size == vector_sizes.length ())
-	epilogue = NULL;
-    }
+      auto_vec<stmt_vec_info> pattern_worklist, related_worklist;
+      hash_map<tree,tree> mapping;
+      gimple * orig_stmt, * new_stmt;
+      gimple_stmt_iterator epilogue_gsi;
+      gphi_iterator epilogue_phi_gsi;
+      stmt_vec_info stmt_vinfo = NULL, related_vinfo;
+      basic_block *epilogue_bbs = get_loop_body (epilogue);
 
-  if (epilogue)
-    {
+      epilogue->simduid = loop->simduid;
       epilogue->force_vectorize = loop->force_vectorize;
       epilogue->safelen = loop->safelen;
       epilogue->dont_vectorize = false;
+      LOOP_VINFO_BBS (epilogue_vinfo) = epilogue_bbs;
+
+      /* We are done vectorizing the main loop, so now we update the epilogues
+	 stmt_vec_info's.  At the same time we set the gimple UID of each
+	 statement in the epilogue, as these are used to look them up in the
+	 epilogues loop_vec_info later.  We also keep track of what
+	 stmt_vec_info's have PATTERN_DEF_SEQ's and RELATED_STMT's that might
+	 need updating and we construct a mapping between variables defined in
+	 the main loop and their corresponding names in epilogue.  */
+      for (unsigned i = 0; i < loop->num_nodes; ++i)
+	{
+	  for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]);
+	       !gsi_end_p (epilogue_phi_gsi); gsi_next (&epilogue_phi_gsi))
+	    {
+	      orig_stmt = orig_stmts[0];
+	      orig_stmts.ordered_remove (0);
+	      new_stmt = epilogue_phi_gsi.phi ();
 
-      /* We may need to if-convert epilogue to vectorize it.  */
-      if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
-	tree_if_conversion (epilogue);
-    }
+	      stmt_vinfo
+		= epilogue_vinfo->lookup_stmt (orig_stmt);
+
+	      STMT_VINFO_STMT (stmt_vinfo) = new_stmt;
+	      gimple_set_uid (new_stmt, gimple_uid (orig_stmt));
+
+	      mapping.put (gimple_phi_result (orig_stmt),
+			    gimple_phi_result (new_stmt));
+
+	      if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
+		pattern_worklist.safe_push (stmt_vinfo);
+
+	      related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+	      while (related_vinfo && related_vinfo != stmt_vinfo)
+		{
+		  related_worklist.safe_push (related_vinfo);
+		  /* Set BB such that the assert in
+		    'get_initial_def_for_reduction' is able to determine that
+		    the BB of the related stmt is inside this loop.  */
+		  gimple_set_bb (STMT_VINFO_STMT (related_vinfo),
+				 gimple_bb (new_stmt));
+		  related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);
+		}
+	    }
+
+	  for (epilogue_gsi = gsi_start_bb (epilogue_bbs[i]);
+	       !gsi_end_p (epilogue_gsi); gsi_next (&epilogue_gsi))
+	    {
+	      orig_stmt = orig_stmts[0];
+	      orig_stmts.ordered_remove (0);
+	      new_stmt = gsi_stmt (epilogue_gsi);
+
+	      stmt_vinfo
+		= epilogue_vinfo->lookup_stmt (orig_stmt);
+
+	      STMT_VINFO_STMT (stmt_vinfo) = new_stmt;
+	      gimple_set_uid (new_stmt, gimple_uid (orig_stmt));
+
+	      if (is_gimple_assign (orig_stmt))
+		{
+		  gcc_assert (is_gimple_assign (new_stmt));
+		  mapping.put (gimple_assign_lhs (orig_stmt),
+			      gimple_assign_lhs (new_stmt));
+		}
+
+	      if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
+		pattern_worklist.safe_push (stmt_vinfo);
+
+	      related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+	      related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+	      while (related_vinfo && related_vinfo != stmt_vinfo)
+		{
+		  related_worklist.safe_push (related_vinfo);
+		  /* Set BB such that the assert in
+		    'get_initial_def_for_reduction' is able to determine that
+		    the BB of the related stmt is inside this loop.  */
+		  gimple_set_bb (STMT_VINFO_STMT (related_vinfo),
+				 gimple_bb (new_stmt));
+		  related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);
+		}
+	    }
+	  gcc_assert (orig_stmts.length () == 0);
+	}
+
+      /* The PATTERN_DEF_SEQ's in the epilogue were constructed using the
+	 original main loop and thus need to be updated to refer to the cloned
+	 variables used in the epilogue.  */
+      for (unsigned i = 0; i < pattern_worklist.length (); ++i)
+	{
+	  gimple_seq seq = STMT_VINFO_PATTERN_DEF_SEQ (pattern_worklist[i]);
+	  tree *new_op;
+
+	  while (seq)
+	    {
+	      for (unsigned j = 1; j < gimple_num_ops (seq); ++j)
+		{
+		  tree op = gimple_op (seq, j);
+		  if ((new_op = mapping.get(op)))
+		    gimple_set_op (seq, j, *new_op);
+		  else
+		    {
+		      op = unshare_expr (op);
+		      replace_ops (op, mapping);
+		      gimple_set_op (seq, j, op);
+		    }
+		}
+	      seq = seq->next;
+	    }
+	}
+
+      /* Just like the PATTERN_DEF_SEQ's the RELATED_STMT's also need to be
+	 updated.  */
+      for (unsigned i = 0; i < related_worklist.length (); ++i)
+	{
+	  tree *new_t;
+	  gimple * stmt = STMT_VINFO_STMT (related_worklist[i]);
+	  for (unsigned j = 1; j < gimple_num_ops (stmt); ++j)
+	    if ((new_t = mapping.get(gimple_op (stmt, j))))
+	      gimple_set_op (stmt, j, *new_t);
+	}
+
+      tree new_op;
+      /* Data references for gather loads and scatter stores do not use the
+	 updated offset we set using ADVANCE.  Instead we have to make sure the
+	 reference in the data references point to the corresponding copy of
+	 the original in the epilogue.  */
+      for (unsigned i = 0; i < gather_scatter_drs.length (); ++i)
+	{
+	  dr_vec_info *dr_vinfo = gather_scatter_drs[i];
+	  data_reference *dr = dr_vinfo->dr;
+	  gcc_assert (dr);
+	  DR_REF (dr) = unshare_expr (DR_REF (dr));
+	  new_op = replace_ops (DR_REF (dr), mapping);
+	  if (new_op)
+	    DR_STMT (dr_vinfo->dr) = SSA_NAME_DEF_STMT (new_op);
+	}
+
+	  /* The vector size of the epilogue is smaller than that of the main loop
+	     so the alignment is either the same or lower. This means the dr will
+	     thus by definition be aligned.  */
+	  for (unsigned i = 0; i <drs.length (); ++i)
+	    drs[i]->base_misaligned = false;
+
+	  epilogue_vinfo->shared->datarefs_copy.release ();
+	  epilogue_vinfo->shared->save_datarefs ();
+	}
 
   return epilogue;
 }
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 601a6f55fbff388c89f88d994e790aebf2bf960e..201549da6c0cbae0797a23ae1b8967b9895505e9 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -6288,7 +6288,7 @@ ensure_base_align (dr_vec_info *dr_info)
 
       if (decl_in_symtab_p (base_decl))
 	symtab_node::get (base_decl)->increase_alignment (align_base_to);
-      else
+      else if (DECL_ALIGN (base_decl) < align_base_to)
 	{
 	  SET_DECL_ALIGN (base_decl, align_base_to);
           DECL_USER_ALIGN (base_decl) = 1;
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 1456cde4c2c2dec7244c504d2c496248894a4f1e..00ab80544f6a7ffac8f62f09f2b2ba099b24d83e 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -564,6 +564,8 @@ public:
      this points to the original vectorized loop.  Otherwise NULL.  */
   _loop_vec_info *orig_loop_info;
 
+  vec<_loop_vec_info *> epilogue_vinfos;
+
 } *loop_vec_info;
 
 /* Access Functions.  */
@@ -1480,13 +1482,16 @@ extern void vect_set_loop_condition (class loop *, loop_vec_info,
 extern bool slpeel_can_duplicate_loop_p (const class loop *, const_edge);
 class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
 						     class loop *, edge);
-class loop *vect_loop_versioning (loop_vec_info, unsigned int, bool,
-				   poly_uint64);
+class loop *vect_loop_versioning (loop_vec_info);
 extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
-				     tree *, tree *, tree *, int, bool, bool);
+				    tree *, tree *, tree *, int, bool, bool,
+				    tree *);
 extern void vect_prepare_for_masked_peels (loop_vec_info);
 extern dump_user_location_t find_loop_location (class loop *);
 extern bool vect_can_advance_ivs_p (loop_vec_info);
+extern gcond * vect_get_loop_niters (class loop *, tree *, tree *, tree *);
+extern void vect_update_inits_of_drs (loop_vec_info, tree, tree_code);
+
 
 /* In tree-vect-stmts.c.  */
 extern poly_uint64 current_vector_size;
@@ -1600,6 +1605,8 @@ extern tree vect_create_addr_base_for_vector_ref (stmt_vec_info, gimple_seq *,
 						  tree, tree = NULL_TREE);
 
 /* In tree-vect-loop.c.  */
+/* Used in tree-vect-loop-manip.c */
+extern void determine_peel_for_niter (loop_vec_info);
 /* FORNOW: Used in tree-parloops.c.  */
 extern stmt_vec_info vect_force_simple_reduction (loop_vec_info, stmt_vec_info,
 						  bool *, bool);
@@ -1610,7 +1617,8 @@ extern bool check_reduction_path (dump_user_location_t, loop_p, gphi *, tree,
 /* Drive for loop analysis stage.  */
 extern opt_loop_vec_info vect_analyze_loop (class loop *,
 					    loop_vec_info,
-					    vec_info_shared *);
+					    vec_info_shared *,
+					    vector_sizes);
 extern tree vect_build_loop_niters (loop_vec_info, bool * = NULL);
 extern void vect_gen_vector_loop_niters (loop_vec_info, tree, tree *,
 					 tree *, bool);
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 173e6b51652fd023893b38da786ff28f827553b5..71bbf4fdf8dc7588c45a0e8feef9272b52c0c04c 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -875,6 +875,10 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
   vec_info_shared shared;
   auto_purge_vect_location sentinel;
   vect_location = find_loop_location (loop);
+  auto_vector_sizes auto_vector_sizes;
+  vector_sizes vector_sizes;
+  bool assert_versioning = false;
+
   if (LOCATION_LOCUS (vect_location.get_location_t ()) != UNKNOWN_LOCATION
       && dump_enabled_p ())
     dump_printf (MSG_NOTE | MSG_PRIORITY_INTERNALS,
@@ -882,10 +886,35 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 		 LOCATION_FILE (vect_location.get_location_t ()),
 		 LOCATION_LINE (vect_location.get_location_t ()));
 
+  /* If this is an epilogue, we already know what vector sizes we will use for
+     vectorization as the analyzis was part of the main vectorized loop.  Use
+     these instead of going through all vector sizes again.  */
+  if (orig_loop_vinfo
+      && !LOOP_VINFO_LOOP (orig_loop_vinfo)->epilogue_vsizes.is_empty ())
+    {
+      vector_sizes = LOOP_VINFO_LOOP (orig_loop_vinfo)->epilogue_vsizes;
+      assert_versioning = LOOP_REQUIRES_VERSIONING (orig_loop_vinfo);
+      current_vector_size = vector_sizes[0];
+    }
+  else
+    {
+      /* Autodetect first vector size we try.  */
+      current_vector_size = 0;
+
+      targetm.vectorize.autovectorize_vector_sizes (&auto_vector_sizes,
+						    loop->simdlen != 0);
+      vector_sizes = auto_vector_sizes;
+    }
+
   /* Try to analyze the loop, retaining an opt_problem if dump_enabled_p.  */
-  opt_loop_vec_info loop_vinfo
-    = vect_analyze_loop (loop, orig_loop_vinfo, &shared);
-  loop->aux = loop_vinfo;
+  opt_loop_vec_info loop_vinfo = opt_loop_vec_info::success (NULL);
+  if (loop_vec_info_for_loop (loop))
+    loop_vinfo = opt_loop_vec_info::success (loop_vec_info_for_loop (loop));
+  else
+    {
+      loop_vinfo = vect_analyze_loop (loop, orig_loop_vinfo, &shared, vector_sizes);
+      loop->aux = loop_vinfo;
+    }
 
   if (!loop_vinfo)
     if (dump_enabled_p ())
@@ -898,6 +927,10 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 
   if (!loop_vinfo || !LOOP_VINFO_VECTORIZABLE_P (loop_vinfo))
     {
+      /* If this loops requires versioning, make sure the analyzis done on the
+	 epilogue loops succeeds.  */
+      gcc_assert (!assert_versioning);
+
       /* Free existing information if loop is analyzed with some
 	 assumptions.  */
       if (loop_constraint_set_p (loop, LOOP_C_FINITE))
@@ -1013,8 +1046,13 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 
   /* Epilogue of vectorized loop must be vectorized too.  */
   if (new_loop)
-    ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops,
-				 new_loop, loop_vinfo, NULL, NULL);
+    {
+      /* Don't include vectorized epilogues in the "vectorized loops" count.
+       */
+      unsigned dont_count = *num_vectorized_loops;
+      ret |= try_vectorize_loop_1 (simduid_to_vf_htab, &dont_count,
+				   new_loop, loop_vinfo, NULL, NULL);
+    }
 
   return ret;
 }
Richard Biener Oct. 11, 2019, 12:57 p.m. | #7
On Thu, 10 Oct 2019, Andre Vieira (lists) wrote:

> Hi,

> 

> After all the discussions and respins I now believe this patch is close to

> what we envisioned.

> 

> This patch achieves two things when vect-epilogues-nomask=1:

> 1) It analyzes the original loop for each supported vector size and saves this

> analysis per loop, as well as the vector sizes we know we can vectorize the

> loop for.

> 2) When loop versioning it uses the 'skip_vector' code path to vectorize the

> epilogue, and uses the lowest versioning threshold between the main and

> epilogue's.

> 

> As side effects of this patch I also changed ensure_base_align to only update

> the alignment if the new alignment is lower than the current one.  This

> function already did that if the object was a symbol, now it behaves this way

> for any object.

> 

> I bootstrapped this patch with both vect-epilogues-nomask turned on and off on

> x86_64 (AVX512) and aarch64.  Regression tests looked good.

> 

> Is this OK for trunk?


+
+  /* Keep track of vector sizes we know we can vectorize the epilogue 
with.  */
+  vector_sizes epilogue_vsizes;
 };

please don't enlarge struct loop, instead track this somewhere
in the vectorizer (in loop_vinfo?  I see you already have
epilogue_vinfos there - so the loop_vinfo simply lacks 
convenient access to the vector_size?)  I don't see any
use that could be trivially adjusted to look at a loop_vinfo
member instead.

For the vect_update_inits_of_drs this means that we'd possibly
do less CSE.  Not sure if really an issue.

You use LOOP_VINFO_EPILOGUE_P sometimes and sometimes
LOOP_VINFO_ORIG_LOOP_INFO, please change predicates to
LOOP_VINFO_EPILOGUE_P.

@@ -2466,15 +2461,62 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree 
niters, tree nitersm1,
   else
     niters_prolog = build_int_cst (type, 0);
     
+  loop_vec_info epilogue_vinfo = NULL;
+  if (vect_epilogues)
+    { 
...
+       vect_epilogues = false;
+    }
+

I don't understand what all this does - it clearly needs a comment.
Maybe the overall comment of the function should be amended with
an overview of how we handle [multiple] epilogue loop vectorization?

+
+      if (epilogue_any_upper_bound && prolog_peeling >= 0)
+       {
+         epilog->any_upper_bound = true;
+         epilog->nb_iterations_upper_bound = eiters + 1;
+       }
+

comment missing.  How can prolog_peeling be < 0?  We likely
didn't set the upper bound because we don't know it in the
case we skipped the vector loop (skip_vector)?  So make sure
to not introduce wrong-code issues here - maybe do this
optimization as followup?

 class loop *
-vect_loop_versioning (loop_vec_info loop_vinfo,
-                     unsigned int th, bool check_profitability,
-                     poly_uint64 versioning_threshold)
+vect_loop_versioning (loop_vec_info loop_vinfo)
 { 
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo), *nloop;
   class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
@@ -2988,10 +3151,15 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
   bool version_align = LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT 
(loop_vinfo);
   bool version_alias = LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo);
   bool version_niter = LOOP_REQUIRES_VERSIONING_FOR_NITERS (loop_vinfo);
+  poly_uint64 versioning_threshold
+    = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
   tree version_simd_if_cond
     = LOOP_REQUIRES_VERSIONING_FOR_SIMD_IF_COND (loop_vinfo);
+  unsigned th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);

-  if (check_profitability)
+  if (th >= vect_vf_for_cost (loop_vinfo)
+      && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+      && !ordered_p (th, versioning_threshold))
     cond_expr = fold_build2 (GE_EXPR, boolean_type_node, 
scalar_loop_iters,
                             build_int_cst (TREE_TYPE (scalar_loop_iters),
                                            th - 1));

split out this refactoring - preapproved.

@@ -1726,7 +1729,13 @@ vect_analyze_loop_costing (loop_vec_info 
loop_vinfo)
       return 0;
     }

-  HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop);
+  HOST_WIDE_INT estimated_niter = -1;
+
+  if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+    estimated_niter
+      = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1;
+  if (estimated_niter == -1)
+    estimated_niter = estimated_stmt_executions_int (loop);
   if (estimated_niter == -1)
     estimated_niter = likely_max_stmt_executions_int (loop);
   if (estimated_niter != -1

it's clearer if the old code is completely in a else {} path
even though vect_vf_for_cost - 1 should never be -1.

+/* Decides whether we need to create an epilogue loop to handle
+   remaining scalar iterations and sets PEELING_FOR_NITERS accordingly.  
*/
+      
+void                  
+determine_peel_for_niter (loop_vec_info loop_vinfo)
+{   
+  

extra vertical space

+  unsigned HOST_WIDE_INT const_vf;
+  HOST_WIDE_INT max_niter 

if it's a 1:1 copy outlined then split it out - preapproved
(so further reviews get smaller patches ;))  I'd add a
LOOP_VINFO_PEELING_FOR_NITER () = false as final else
since that's what we do by default?

-  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
+  if (LOOP_REQUIRES_VERSIONING (loop_vinfo)
+      || ((orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+         && LOOP_REQUIRES_VERSIONING (orig_loop_vinfo)))

not sure why we need to do this for epilouges?

+
+      /*  Use the same condition as vect_transform_loop to decide when to 
use
+         the cost to determine a versioning threshold.  */
+      if (th >= vect_vf_for_cost (loop_vinfo)
+         && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+         && ordered_p (th, niters_th))
+       niters_th = ordered_max (poly_uint64 (th), niters_th);

that's an independent change, right?  Please split out, it's
pre-approved if it tests OK separately.

+static tree
+replace_ops (tree op, hash_map<tree, tree> &mapping)
+{

I'm quite sure I've seen such beast elsewhere ;)  simplify_replace_tree
comes up first (not a 1:1 match but hints at a possible tree
sharing issue in your variant).

@@ -8497,11 +8588,11 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   if (th >= vect_vf_for_cost (loop_vinfo)
       && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     {
-      if (dump_enabled_p ())
-       dump_printf_loc (MSG_NOTE, vect_location,
-                        "Profitability threshold is %d loop 
iterations.\n",
-                         th);
-      check_profitability = true;
+       if (dump_enabled_p ())
+         dump_printf_loc (MSG_NOTE, vect_location,
+                          "Profitability threshold is %d loop 
iterations.\n",
+                          th);
+       check_profitability = true;
     }

   /* Make sure there exists a single-predecessor exit bb.  Do this before

obvious (separate)

+  tree advance;
   epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, 
&niters_vector,
                              &step_vector, &niters_vector_mult_vf, th,
-                             check_profitability, niters_no_overflow);
+                             check_profitability, niters_no_overflow,
+                             &advance);
+
+  if (epilogue)
+    {
+      basic_block *orig_bbs = get_loop_body (loop);
+      loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue);
...

please record this in vect_do_peeling itself and store the
orig_stmts/drs/etc. in the epilogue loop_vinfo and ...

+      /* We are done vectorizing the main loop, so now we update the 
epilogues
+        stmt_vec_info's.  At the same time we set the gimple UID of each
+        statement in the epilogue, as these are used to look them up in 
the
+        epilogues loop_vec_info later.  We also keep track of what
...

split this out to a new function.  I wonder why you need to record
the DRs, are they not available via ->datarefs and lookup_dr ()?

diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 
601a6f55fbff388c89f88d994e790aebf2bf960e..201549da6c0cbae0797a23ae1b8967b9895505e9 
100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -6288,7 +6288,7 @@ ensure_base_align (dr_vec_info *dr_info)

       if (decl_in_symtab_p (base_decl))
        symtab_node::get (base_decl)->increase_alignment (align_base_to);
-      else
+      else if (DECL_ALIGN (base_decl) < align_base_to)
        {
          SET_DECL_ALIGN (base_decl, align_base_to);
           DECL_USER_ALIGN (base_decl) = 1;

split out - preapproved.

Still have to go over the main loop doing the analysis/transform.

Thanks, it looks really promising (albeit exepectedly ugly due to
the data rewriting).

Richard.


> gcc/ChangeLog:

> 2019-10-10  Andre Vieira  <andre.simoesdiasvieira@arm.com>

> 

>     PR 88915

>     * cfgloop.h (loop): Add epilogue_vsizes member.

>     * cfgloop.c (flow_loop_free): Release epilogue_vsizes.

>     (alloc_loop): Initialize epilogue_vsizes.

>     * gentype.c (main): Add poly_uint64 type and vector_sizes to

>     generator.

>     * tree-vect-loop.c (vect_get_loop_niters): Make externally visible.

>     (_loop_vec_info): Initialize epilogue_vinfos.

>     (~_loop_vec_info): Release epilogue_vinfos.

>     (vect_analyze_loop_costing): Use knowledge of main VF to estimate

>     number of iterations of epilogue.

>     (determine_peel_for_niter): New. Outlined code to re-use in two

>     places.

>     (vect_analyze_loop_2): Adapt to analyse main loop for all supported

>     vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest

>     versioning threshold needed for main loop.

>     (vect_analyze_loop): Likewise.

>     (replace_ops): New helper function.

>     (vect_transform_loop): When vectorizing epilogues re-use analysis done

>     on main loop and update necessary information.

>     * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert

>     stmts on loop preheader edge.

>     (vect_do_peeling): Enable skip-vectors when doing loop versioning if

>     we decided to vectorize epilogues.  Update epilogues NITERS and

>     construct ADVANCE to update epilogues data references where needed.

>     (vect_loop_versioning): Moved decision to check_profitability

>     based on cost model.

>     * tree-vect-stmts.c (ensure_base_align): Only update alignment

>     if new alignment is lower.

>     * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos member.

>     (vect_loop_versioning, vect_do_peeling, vect_get_loop_niters,

>     vect_update_inits_of_drs, determine_peel_for_niter,

>     vect_analyze_loop): Add or update declarations.

>     * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already

>     create loop_vec_info's for epilogues when available.  Otherwise analyse

>     epilogue separately.

> 

> 

> 

> Cheers,

> Andre

> 


-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 247165 (AG München)
Andre Simoes Dias Vieira Oct. 22, 2019, 12:48 p.m. | #8
Hi Richi,

See inline responses to your comments.

On 11/10/2019 13:57, Richard Biener wrote:
> On Thu, 10 Oct 2019, Andre Vieira (lists) wrote:

> 

>> Hi,

>>


> 

> +

> +  /* Keep track of vector sizes we know we can vectorize the epilogue

> with.  */

> +  vector_sizes epilogue_vsizes;

>   };

> 

> please don't enlarge struct loop, instead track this somewhere

> in the vectorizer (in loop_vinfo?  I see you already have

> epilogue_vinfos there - so the loop_vinfo simply lacks

> convenient access to the vector_size?)  I don't see any

> use that could be trivially adjusted to look at a loop_vinfo

> member instead.


Done.
> 

> For the vect_update_inits_of_drs this means that we'd possibly

> do less CSE.  Not sure if really an issue.


CSE of what exactly? You are afraid we are repeating a calculation here 
we have done elsewhere before?

> 

> You use LOOP_VINFO_EPILOGUE_P sometimes and sometimes

> LOOP_VINFO_ORIG_LOOP_INFO, please change predicates to

> LOOP_VINFO_EPILOGUE_P.


I checked and the points where I use LOOP_VINFO_ORIG_LOOP_INFO is 
because I then use the resulting loop info.  If there are cases you feel 
strongly about let me know.
> 

> @@ -2466,15 +2461,62 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree

> niters, tree nitersm1,

>     else

>       niters_prolog = build_int_cst (type, 0);

>       

> +  loop_vec_info epilogue_vinfo = NULL;

> +  if (vect_epilogues)

> +    {

> ...

> +       vect_epilogues = false;

> +    }

> +

> 

> I don't understand what all this does - it clearly needs a comment.

> Maybe the overall comment of the function should be amended with

> an overview of how we handle [multiple] epilogue loop vectorization?


I added more comments both here and on top of the function.  Hopefully 
it is a bit clearer now, but it might need some tweaking.

> 

> +

> +      if (epilogue_any_upper_bound && prolog_peeling >= 0)

> +       {

> +         epilog->any_upper_bound = true;

> +         epilog->nb_iterations_upper_bound = eiters + 1;

> +       }

> +

> 

> comment missing.  How can prolog_peeling be < 0?  We likely

> didn't set the upper bound because we don't know it in the

> case we skipped the vector loop (skip_vector)?  So make sure

> to not introduce wrong-code issues here - maybe do this

> optimization as followup?n

> 


So the reason for this code wasn't so much an optimization as it was for 
correctness.  But I was mistaken, the failure I was seeing without this 
code was not because of this code, but rather being hidden by it. The 
problem I was seeing was that a prolog was being created using the 
original loop copy, rather than the scalar loop, leading to MASK_LOAD 
and MASK_STORE being left in the scalar prolog, leading to expand ICEs. 
I have fixed that issue by making sure the SCALAR_LOOP is used for 
prolog peeling and either the loop copy or SCALAR loop for epilogue 
peeling depending on whether we will be vectorizing said epilogue.


> @@ -1726,7 +1729,13 @@ vect_analyze_loop_costing (loop_vec_info

> loop_vinfo)

>         return 0;

>       }

> 

> -  HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop);

> +  HOST_WIDE_INT estimated_niter = -1;

> +

> +  if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))

> +    estimated_niter

> +      = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1;

> +  if (estimated_niter == -1)

> +    estimated_niter = estimated_stmt_executions_int (loop);

>     if (estimated_niter == -1)

>       estimated_niter = likely_max_stmt_executions_int (loop);

>     if (estimated_niter != -1

> 

> it's clearer if the old code is completely in a else {} path

> even though vect_vf_for_cost - 1 should never be -1.

> 

Done for the == -1 cases, need to keep the != -1 outside of course.
> -  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))

> +  if (LOOP_REQUIRES_VERSIONING (loop_vinfo)

> +      || ((orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))

> +         && LOOP_REQUIRES_VERSIONING (orig_loop_vinfo)))

> 

> not sure why we need to do this for epilouges?

> 


This is because we want to compute the versioning threshold for 
epilogues such that we can use the minimum versioning threshold when 
versioning the main loop.  The reason we need to ask we need to ask the 
original main loop is partially because of code in 
'vect_analyze_data_ref_dependences' that chooses to not do DR dependence 
analysis and thus never fills LOOP_VINFO_MAY_ALIAS_DDRS for the 
epilogues loop_vinfo and as a consequence LOOP_VINFO_COMP_ALIAS_DDRS is 
always 0.

The piece of code is preceded by this comment:
   /* For epilogues we either have no aliases or alias versioning
      was applied to original loop.  Therefore we may just get max_vf
      using VF of original loop.  */

I have added some comments to make it clearer.
> 

> +static tree

> +replace_ops (tree op, hash_map<tree, tree> &mapping)

> +{

> 

> I'm quite sure I've seen such beast elsewhere ;)  simplify_replace_tree

> comes up first (not a 1:1 match but hints at a possible tree

> sharing issue in your variant).

> 


The reason I couldn't use simplify_replace_tree is because I didn't know 
what the "OLD" value is at the time I want to call it.  Basically I want 
to check whether an SSA name is a key in MAPPING and if so replace it 
with the corresponding VALUE.

I have changed simplify_replace_tree such that valueize can take a 
context parameter. I replaced one use of replace_ops with it and the 
other I specialized as I found that it was always a MEM_REF and we 
needed to replace the address it was dereferencing.

> 

> +  tree advance;

>     epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1,

> &niters_vector,

>                                &step_vector, &niters_vector_mult_vf, th,

> -                             check_profitability, niters_no_overflow);

> +                             check_profitability, niters_no_overflow,

> +                             &advance);

> +

> +  if (epilogue)

> +    {

> +      basic_block *orig_bbs = get_loop_body (loop);

> +      loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue);

> ...

> 

> orig_stmts/drs/etc. in the epilogue loop_vinfo and ...

> 

> +      /* We are done vectorizing the main loop, so now we update the

> epilogues

> +        stmt_vec_info's.  At the same time we set the gimple UID of each

> +        statement in the epilogue, as these are used to look them up in

> the

> +        epilogues loop_vec_info later.  We also keep track of what

> ...

> 

> split this out to a new function.  I wonder why you need to record

> the DRs, are they not available via ->datarefs and lookup_dr ()?


lookup_dr may no longer work at this point. I found that for some memory 
accesses by the time I got to this point, the DR_STMT of the 
data_reference pointed to a scalar statement that no longer existed and 
the lookup_dr to that data reference ICE's.  I can't make this update 
before we transform the loop because the data references are shared, so 
I decided to capture the dr_vec_info's instead. Apparently we don't ever 
do a lookup_dr past this point, which I must admit is surprising.

> Still have to go over the main loop doing the analysis/transform.

> 

> Thanks, it looks really promising (albeit exepectedly ugly due to

> the data rewriting).

> 


Yeah, though I feel like now that I have put it away into functions it 
makes it look cleaner.  That vect_transform_loop function was getting 
too big!

Is this OK for trunk?

gcc/ChangeLog:
2019-10-22  Andre Vieira  <andre.simoesdiasvieira@arm.com>

     PR 88915
     * gentype.c (main): Add poly_uint64 type and vector_sizes to
     generator.
     * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.
     * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter
     and make the valueize function pointer also take a void pointer.
     * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap
     around vn_valueize, to call it without a context.
     (process_bb): Use vn_valueize_wrapper instead of vn_valueize.
     * tree-vect-loop.c (vect_get_loop_niters): Make externally visible.
     (_loop_vec_info): Initialize epilogue_vinfos.
     (~_loop_vec_info): Release epilogue_vinfos.
     (vect_analyze_loop_costing): Use knowledge of main VF to estimate
     number of iterations of epilogue.
     (vect_analyze_loop_2): Adapt to analyse main loop for all supported
     vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
     versioning threshold needed for main loop.
     (vect_analyze_loop): Likewise.
     (find_in_mapping): New helper function.
     (update_epilogue_loop_vinfo): New function.
     (vect_transform_loop): When vectorizing epilogues re-use analysis done
     on main loop and call update_epilogue_loop_vinfo to update it.
     * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
     stmts on loop preheader edge.
     (vect_do_peeling): Enable skip-vectors when doing loop versioning if
     we decided to vectorize epilogues.  Update epilogues NITERS and
     construct ADVANCE to update epilogues data references where needed.
     * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos,
     epilogue_vsizes and update_epilogue_vinfo members.
     (LOOP_VINFO_UP_STMTS, LOOP_VINFO_UP_GT_DRS, LOOP_VINFO_UP_DRS,
      LOOP_VINFO_EPILOGUE_SIZES): Define MACROs.
     (vect_do_peeling, vect_get_loop_niters, vect_update_inits_of_drs,
      determine_peel_for_niter, vect_analyze_loop): Add or update 
declarations.
     * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
     created loop_vec_info's for epilogues when available.  Otherwise 
analyse
     epilogue separately.
diff --git a/gcc/gengtype.c b/gcc/gengtype.c
index 53317337cf8c8e8caefd6b819d28b3bba301e755..80fb6ef71465b24e034fa45d69fec56be6b2e7f8 100644
--- a/gcc/gengtype.c
+++ b/gcc/gengtype.c
@@ -5197,6 +5197,7 @@ main (int argc, char **argv)
       POS_HERE (do_scalar_typedef ("widest_int", &pos));
       POS_HERE (do_scalar_typedef ("int64_t", &pos));
       POS_HERE (do_scalar_typedef ("poly_int64", &pos));
+      POS_HERE (do_scalar_typedef ("poly_uint64", &pos));
       POS_HERE (do_scalar_typedef ("uint64_t", &pos));
       POS_HERE (do_scalar_typedef ("uint8", &pos));
       POS_HERE (do_scalar_typedef ("uintptr_t", &pos));
@@ -5206,6 +5207,7 @@ main (int argc, char **argv)
       POS_HERE (do_scalar_typedef ("machine_mode", &pos));
       POS_HERE (do_scalar_typedef ("fixed_size_mode", &pos));
       POS_HERE (do_scalar_typedef ("CONSTEXPR", &pos));
+      POS_HERE (do_scalar_typedef ("vector_sizes", &pos));
       POS_HERE (do_typedef ("PTR", 
 			    create_pointer (resolve_typedef ("void", &pos)),
 			    &pos));
diff --git a/gcc/tree-ssa-loop-niter.h b/gcc/tree-ssa-loop-niter.h
index 4454c1ac78e02228047511a9e0214c82946855b8..aec6225125ce42ab0e4dbc930fc1a93862e6e267 100644
--- a/gcc/tree-ssa-loop-niter.h
+++ b/gcc/tree-ssa-loop-niter.h
@@ -53,7 +53,9 @@ extern bool scev_probably_wraps_p (tree, tree, tree, gimple *,
 				   class loop *, bool);
 extern void free_numbers_of_iterations_estimates (class loop *);
 extern void free_numbers_of_iterations_estimates (function *);
-extern tree simplify_replace_tree (tree, tree, tree, tree (*)(tree) = NULL);
+extern tree simplify_replace_tree (tree, tree,
+				   tree, tree (*)(tree, void *) = NULL,
+				   void * = NULL);
 extern void substitute_in_loop_info (class loop *, tree, tree);
 
 #endif /* GCC_TREE_SSA_LOOP_NITER_H */
diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index cd2ced369719c37afd4aac08ff360719d7702e42..db666f019808850ed3a4aeef1a454a7ae2c65ef2 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -1935,7 +1935,7 @@ number_of_iterations_cond (class loop *loop,
 
 tree
 simplify_replace_tree (tree expr, tree old, tree new_tree,
-		       tree (*valueize) (tree))
+		       tree (*valueize) (tree, void*), void *context)
 {
   unsigned i, n;
   tree ret = NULL_TREE, e, se;
@@ -1951,7 +1951,7 @@ simplify_replace_tree (tree expr, tree old, tree new_tree,
     {
       if (TREE_CODE (expr) == SSA_NAME)
 	{
-	  new_tree = valueize (expr);
+	  new_tree = valueize (expr, context);
 	  if (new_tree != expr)
 	    return new_tree;
 	}
@@ -1967,7 +1967,7 @@ simplify_replace_tree (tree expr, tree old, tree new_tree,
   for (i = 0; i < n; i++)
     {
       e = TREE_OPERAND (expr, i);
-      se = simplify_replace_tree (e, old, new_tree, valueize);
+      se = simplify_replace_tree (e, old, new_tree, valueize, context);
       if (e == se)
 	continue;
 
diff --git a/gcc/tree-ssa-sccvn.c b/gcc/tree-ssa-sccvn.c
index 57331ab44dc78c16d97065cd28e8c4cdcbf8d96e..0abe3fb8453ecf2e25ff55c5c9846663f68f7c8c 100644
--- a/gcc/tree-ssa-sccvn.c
+++ b/gcc/tree-ssa-sccvn.c
@@ -309,6 +309,10 @@ static vn_tables_t valid_info;
 /* Valueization hook.  Valueize NAME if it is an SSA name, otherwise
    just return it.  */
 tree (*vn_valueize) (tree);
+tree vn_valueize_wrapper (tree t, void* context ATTRIBUTE_UNUSED)
+{
+  return vn_valueize (t);
+}
 
 
 /* This represents the top of the VN lattice, which is the universal
@@ -6407,7 +6411,7 @@ process_bb (rpo_elim &avail, basic_block bb,
       if (bb->loop_father->nb_iterations)
 	bb->loop_father->nb_iterations
 	  = simplify_replace_tree (bb->loop_father->nb_iterations,
-				   NULL_TREE, NULL_TREE, vn_valueize);
+				   NULL_TREE, NULL_TREE, &vn_valueize_wrapper);
     }
 
   /* Value-number all defs in the basic-block.  */
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index a2902267c62889a63af09d121a631e6d8c6f69d5..cd13d46a6a85f1f0111e97d0877feb33e401e45d 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1724,7 +1724,7 @@ vect_update_init_of_dr (struct data_reference *dr, tree niters, tree_code code)
    Apply vect_update_inits_of_dr to all accesses in LOOP_VINFO.
    CODE and NITERS are as for vect_update_inits_of_dr.  */
 
-static void
+void
 vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
 			  tree_code code)
 {
@@ -1734,21 +1734,12 @@ vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
 
   DUMP_VECT_SCOPE ("vect_update_inits_of_dr");
 
-  /* Adjust niters to sizetype and insert stmts on loop preheader edge.  */
+  /* Adjust niters to sizetype.  We used to insert the stmts on loop preheader
+     here, but since we might use these niters to update the epilogues niters
+     and data references we can't insert them here as this definition might not
+     always dominate its uses.  */
   if (!types_compatible_p (sizetype, TREE_TYPE (niters)))
-    {
-      gimple_seq seq;
-      edge pe = loop_preheader_edge (LOOP_VINFO_LOOP (loop_vinfo));
-      tree var = create_tmp_var (sizetype, "prolog_loop_adjusted_niters");
-
-      niters = fold_convert (sizetype, niters);
-      niters = force_gimple_operand (niters, &seq, false, var);
-      if (seq)
-	{
-	  basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe, seq);
-	  gcc_assert (!new_bb);
-	}
-    }
+    niters = fold_convert (sizetype, niters);
 
   FOR_EACH_VEC_ELT (datarefs, i, dr)
     {
@@ -2391,7 +2382,22 @@ slpeel_update_phi_nodes_for_lcssa (class loop *epilog)
 
    Note this function peels prolog and epilog only if it's necessary,
    as well as guards.
-   Returns created epilogue or NULL.
+   This function returns the epilogue loop if a decision was made to vectorize
+   it, otherwise NULL.
+
+   The analysis resulting in this epilogue loop's loop_vec_info was performed
+   in the same vect_analyze_loop call as the main loop's.  At that time
+   vect_analyze_loop constructs a list of accepted loop_vec_info's for lower
+   vectorization factors than the main loop.  This list is stored in the main
+   loop's loop_vec_info in the 'epilogue_vinfo' member.  Everytime we decide to
+   vectorize the epilogue loop for a lower vectorization factor,  the
+   loop_vec_info sitting at the top of the epilogue_vinfo list is removed,
+   updated and linked to the epilogue loop.  This is later used to vectorize
+   the epilogue.  The reason the loop_vec_info needs updating is that it was
+   constructed based on the original main loop, and the epilogue loop is a
+   copy of this loop, so all links pointing to statements in the original loop
+   need updating.  Furthermore, these loop_vec_info's share the
+   data_reference's records, which will also need to be updated.
 
    TODO: Guard for prefer_scalar_loop should be emitted along with
    versioning conditions if loop versioning is needed.  */
@@ -2401,14 +2407,18 @@ class loop *
 vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 		 tree *niters_vector, tree *step_vector,
 		 tree *niters_vector_mult_vf_var, int th,
-		 bool check_profitability, bool niters_no_overflow)
+		 bool check_profitability, bool niters_no_overflow,
+		 tree *advance)
 {
   edge e, guard_e;
-  tree type = TREE_TYPE (niters), guard_cond;
+  tree type = TREE_TYPE (niters), guard_cond, vector_guard = NULL;
   basic_block guard_bb, guard_to;
   profile_probability prob_prolog, prob_vector, prob_epilog;
   int estimated_vf;
   int prolog_peeling = 0;
+  bool vect_epilogues
+    = loop_vinfo->epilogue_vinfos.length () > 0
+    && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
   /* We currently do not support prolog peeling if the target alignment is not
      known at compile time.  'vect_gen_prolog_loop_niters' depends on the
      target alignment being constant.  */
@@ -2466,15 +2476,65 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   else
     niters_prolog = build_int_cst (type, 0);
 
+  loop_vec_info epilogue_vinfo = NULL;
+  if (vect_epilogues)
+    {
+      /* Take the next epilogue_vinfo to vectorize for.  */
+      epilogue_vinfo = loop_vinfo->epilogue_vinfos[0];
+      loop_vinfo->epilogue_vinfos.ordered_remove (0);
+
+      /* Don't vectorize epilogues if this is not the most inner loop or if
+	 the epilogue may need peeling for alignment as the vectorizer doesn't
+	 know how to handle these situations properly yet.  */
+      if (loop->inner != NULL
+	  || LOOP_VINFO_PEELING_FOR_ALIGNMENT (epilogue_vinfo))
+	vect_epilogues = false;
+
+    }
+
+  tree niters_vector_mult_vf;
+  unsigned int lowest_vf = constant_lower_bound (vf);
+  /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work
+     on niters already ajusted for the iterations of the prologue.  */
+  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+      && known_eq (vf, lowest_vf))
+    {
+      loop_vec_info orig_loop_vinfo;
+      if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+	orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
+      else
+	orig_loop_vinfo = loop_vinfo;
+      vector_sizes vector_sizes = LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo);
+      unsigned next_size = 0;
+      unsigned HOST_WIDE_INT eiters
+	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
+	   - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
+
+      if (prolog_peeling > 0)
+	eiters -= prolog_peeling;
+      eiters
+	= eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
+
+      unsigned int ratio;
+      while (next_size < vector_sizes.length ()
+	     && !(constant_multiple_p (current_vector_size,
+				       vector_sizes[next_size], &ratio)
+		  && eiters >= lowest_vf / ratio))
+	next_size += 1;
+
+      if (next_size == vector_sizes.length ())
+	vect_epilogues = false;
+    }
+
   /* Prolog loop may be skipped.  */
   bool skip_prolog = (prolog_peeling != 0);
   /* Skip to epilog if scalar loop may be preferred.  It's only needed
-     when we peel for epilog loop and when it hasn't been checked with
-     loop versioning.  */
+     when we peel for epilog loop or when we loop version.  */
   bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
 		      ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo),
 				  bound_prolog + bound_epilog)
-		      : !LOOP_REQUIRES_VERSIONING (loop_vinfo));
+		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+			 || vect_epilogues));
   /* Epilog loop must be executed if the number of iterations for epilog
      loop is known at compile time, otherwise we need to add a check at
      the end of vector loop and skip to the end of epilog loop.  */
@@ -2504,6 +2564,13 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 
   dump_user_location_t loop_loc = find_loop_location (loop);
   class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
+  if (vect_epilogues)
+    /* Make sure to set the epilogue's epilogue scalar loop, such that we can
+       we can use the original scalar loop as remaining epilogue if
+       necessary.  */
+    LOOP_VINFO_SCALAR_LOOP (epilogue_vinfo)
+      = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
+
   if (prolog_peeling)
     {
       e = loop_preheader_edge (loop);
@@ -2584,14 +2651,22 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 			   "loop can't be duplicated to exit edge.\n");
 	  gcc_unreachable ();
 	}
-      /* Peel epilog and put it on exit edge of loop.  */
-      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e);
+      /* Peel epilog and put it on exit edge of loop.  If we are vectorizing
+	 said epilog then we should use a copy of the main loop as a starting
+	 point.  This loop may have been already had some preliminary
+	 transformations to allow for more optimal vectorizationg, for example
+	 if-conversion.  If we are not vectorizing the epilog then we should
+	 use the scalar loop as the transformations mentioned above make less
+	 or no sense when not vectorizing.  */
+      epilog = vect_epilogues ? get_loop_copy (loop) : scalar_loop;
+      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e);
       if (!epilog)
 	{
 	  dump_printf_loc (MSG_MISSED_OPTIMIZATION, loop_loc,
 			   "slpeel_tree_duplicate_loop_to_edge_cfg failed.\n");
 	  gcc_unreachable ();
 	}
+
       epilog->force_vectorize = false;
       slpeel_update_phi_nodes_for_loops (loop_vinfo, loop, epilog, false);
 
@@ -2608,6 +2683,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 						check_profitability);
 	  /* Build guard against NITERSM1 since NITERS may overflow.  */
 	  guard_cond = fold_build2 (LT_EXPR, boolean_type_node, nitersm1, t);
+	  vector_guard = guard_cond;
 	  guard_bb = anchor;
 	  guard_to = split_edge (loop_preheader_edge (epilog));
 	  guard_e = slpeel_add_loop_guard (guard_bb, guard_cond,
@@ -2635,7 +2711,6 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	}
 
       basic_block bb_before_epilog = loop_preheader_edge (epilog)->src;
-      tree niters_vector_mult_vf;
       /* If loop is peeled for non-zero constant times, now niters refers to
 	 orig_niters - prolog_peeling, it won't overflow even the orig_niters
 	 overflows.  */
@@ -2699,10 +2774,163 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
       adjust_vec_debug_stmts ();
       scev_reset ();
     }
+
+  if (vect_epilogues)
+    {
+      epilog->aux = epilogue_vinfo;
+      LOOP_VINFO_LOOP (epilogue_vinfo) = epilog;
+
+      loop_constraint_clear (epilog, LOOP_C_INFINITE);
+
+      /* We now must calculate the number of iterations for our epilogue.  */
+      tree cond_niters, niters;
+
+      /* Depending on whether we peel for gaps we take niters or niters - 1,
+	 we will refer to this as N - G, where N and G are the NITERS and
+	 GAP for the original loop.  */
+      niters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+	? LOOP_VINFO_NITERSM1 (loop_vinfo)
+	: LOOP_VINFO_NITERS (loop_vinfo);
+
+      /* Here we build a vector factorization mask:
+	 vf_mask = ~(VF - 1), where VF is the Vectorization Factor.  */
+      tree vf_mask = build_int_cst (TREE_TYPE (niters),
+				    LOOP_VINFO_VECT_FACTOR (loop_vinfo));
+      vf_mask = fold_build2 (MINUS_EXPR, TREE_TYPE (vf_mask),
+			     vf_mask,
+			     build_one_cst (TREE_TYPE (vf_mask)));
+      vf_mask = fold_build1 (BIT_NOT_EXPR, TREE_TYPE (niters), vf_mask);
+
+      /* Here we calculate:
+	 niters = N - ((N-G) & ~(VF -1)) */
+      niters = fold_build2 (MINUS_EXPR, TREE_TYPE (niters),
+			    LOOP_VINFO_NITERS (loop_vinfo),
+			    fold_build2 (BIT_AND_EXPR, TREE_TYPE (niters),
+					 niters,
+					 vf_mask));
+
+      if (skip_vector)
+	{
+	  /* If it is not guaranteed we enter the main loop we need to
+	     make the niters of the epilogue conditional on entireing the main
+	     loop.  We do this by constructing:
+	     cond_niters = !do_we_enter_main_loop ? N + niters_prolog : niters
+	     we add niters_prolog, the number of peeled iterations in the
+	     prolog, to N in case we don't enter the main loop, as these have
+	     already been subtracted from N (the number of iterations of the
+	     main loop).  Since the prolog peeling is also skipped if we skip the
+	     main loop we must add those interations back.  */
+	  cond_niters
+	    = fold_build3 (COND_EXPR, TREE_TYPE (niters),
+			   vector_guard,
+			   fold_build2 (PLUS_EXPR, TREE_TYPE (niters),
+					LOOP_VINFO_NITERS (loop_vinfo),
+					fold_convert (TREE_TYPE (niters),
+						      niters_prolog)),
+			   niters);
+	}
+      else
+	cond_niters = niters;
+
+      LOOP_VINFO_NITERS (epilogue_vinfo) = cond_niters;
+      LOOP_VINFO_NITERSM1 (epilogue_vinfo)
+	= fold_build2 (MINUS_EXPR, TREE_TYPE (cond_niters),
+		       cond_niters, build_one_cst (TREE_TYPE (cond_niters)));
+
+      /* We now calculate the amount of iterations we must advance our
+	 epilogue's data references by.  Make sure to use sizetype here as
+	 otherwise the pointer computation may go wrong on targets with
+	 different pointer sizes to the used niters type.  */
+      *advance = fold_convert (sizetype, niters);
+
+      *advance = fold_build2 (MINUS_EXPR, TREE_TYPE (*advance),
+			      *advance,
+			      fold_convert (sizetype,
+					    LOOP_VINFO_NITERS (loop_vinfo)));
+      *advance = fold_build2 (MINUS_EXPR, TREE_TYPE (*advance),
+			      build_zero_cst (TREE_TYPE (*advance)),
+			      *advance);
+
+      if (skip_vector)
+	{
+	  /* If we are skipping the vectorized loop then we must roll back the
+	     data references by the amount we might have expected to peel in
+	     the, also skipped, prolog.  */
+	  *advance
+	    = fold_build3 (COND_EXPR, TREE_TYPE (*advance),
+			   vector_guard,
+			   fold_build2 (MINUS_EXPR, TREE_TYPE (*advance),
+					build_zero_cst (TREE_TYPE (*advance)),
+					fold_convert (TREE_TYPE (*advance),
+						      niters_prolog)),
+			   *advance);
+	}
+
+      /* Redo the peeling for niter analysis as the NITERs and alignment
+	 may have been updated to take the main loop into account.  */
+      determine_peel_for_niter (epilogue_vinfo);
+    }
+
   adjust_vec.release ();
   free_original_copy_tables ();
 
-  return epilog;
+  if (vect_epilogues)
+    {
+      basic_block *bbs = get_loop_body (loop);
+      loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilog);
+
+      LOOP_VINFO_UP_STMTS (epilogue_vinfo).create (0);
+      LOOP_VINFO_UP_GT_DRS (epilogue_vinfo).create (0);
+      LOOP_VINFO_UP_DRS (epilogue_vinfo).create (0);
+
+      gimple_stmt_iterator gsi;
+      gphi_iterator phi_gsi;
+      gimple *stmt;
+      stmt_vec_info stmt_vinfo;
+      dr_vec_info *dr_vinfo;
+
+      /* The stmt_vec_info's of the epilogue were constructed for the main loop
+	 and need to be updated to refer to the cloned variables used in the
+	 epilogue loop.  We do this by assuming the original main loop and the
+	 epilogue loop are identical (aside the different SSA names).  This
+	 means we assume we can go through each BB in the loop and each STMT in
+	 each BB and map them 1:1, replacing the STMT_VINFO_STMT of each
+	 stmt_vec_info in the epilogue's loop_vec_info.  Here we only keep
+	 track of the original state of the main loop, before vectorization.
+	 After vectorization we proceed to update the epilogue's stmt_vec_infos
+	 information.  We also update the references in PATTERN_DEF_SEQ's,
+	 RELATED_STMT's and data_references.  Mainly the latter has to be
+	 updated after we are done vectorizing the main loop, as the
+	 data_references are shared between main and epilogue.  */
+      for (unsigned i = 0; i < loop->num_nodes; ++i)
+	{
+	  for (phi_gsi = gsi_start_phis (bbs[i]);
+	       !gsi_end_p (phi_gsi); gsi_next (&phi_gsi))
+	    LOOP_VINFO_UP_STMTS (epilogue_vinfo).safe_push (phi_gsi.phi ());
+	  for (gsi = gsi_start_bb (bbs[i]);
+	       !gsi_end_p (gsi); gsi_next (&gsi))
+	    {
+	      stmt = gsi_stmt (gsi);
+	      LOOP_VINFO_UP_STMTS (epilogue_vinfo).safe_push (stmt);
+	      stmt_vinfo  = epilogue_vinfo->lookup_stmt (stmt);
+	      if (stmt_vinfo != NULL
+		  && stmt_vinfo->dr_aux.stmt == stmt_vinfo)
+		{
+		  dr_vinfo = STMT_VINFO_DR_INFO (stmt_vinfo);
+		  /* Data references pointing to gather loads and scatter stores
+		     require special treatment because the address computation
+		     happens in a different gimple node, pointed to by DR_REF.
+		     In contrast to normal loads and stores where we only need
+		     to update the offset of the data reference.  */
+		  if (STMT_VINFO_GATHER_SCATTER_P (dr_vinfo->stmt))
+		    LOOP_VINFO_UP_GT_DRS (epilogue_vinfo).safe_push (dr_vinfo);
+		  LOOP_VINFO_UP_DRS (epilogue_vinfo).safe_push (dr_vinfo);
+		}
+	    }
+	}
+    }
+
+  return vect_epilogues ? epilog : NULL;
 }
 
 /* Function vect_create_cond_for_niters_checks.
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 72b80f46b1a9fa0bc8392809c286b5fac9a74451..81a5576a13004248d15db80145652d37f432c695 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -715,7 +715,7 @@ vect_fixup_scalar_cycles_with_patterns (loop_vec_info loop_vinfo)
    Return the loop exit condition.  */
 
 
-static gcond *
+gcond *
 vect_get_loop_niters (class loop *loop, tree *assumptions,
 		      tree *number_of_iterations, tree *number_of_iterationsm1)
 {
@@ -886,6 +886,9 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
 	    }
 	}
     }
+
+  epilogue_vinfos.create (6);
+  epilogue_vsizes.create (8);
 }
 
 /* Free all levels of MASKS.  */
@@ -910,6 +913,8 @@ _loop_vec_info::~_loop_vec_info ()
   release_vec_loop_masks (&masks);
   delete ivexpr_map;
   delete scan_map;
+  epilogue_vinfos.release ();
+  epilogue_vsizes.release ();
 
   loop->aux = NULL;
 }
@@ -1683,9 +1688,20 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
       return 0;
     }
 
-  HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop);
-  if (estimated_niter == -1)
-    estimated_niter = likely_max_stmt_executions_int (loop);
+  HOST_WIDE_INT estimated_niter;
+
+  /* If we are vectorizing an epilogue then we know the maximum number of
+     scalar iterations it will cover is at least one lower than the
+     vectorization factor of the main loop.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    estimated_niter
+      = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1;
+  else
+    {
+      estimated_niter = estimated_stmt_executions_int (loop);
+      if (estimated_niter == -1)
+	estimated_niter = likely_max_stmt_executions_int (loop);
+    }
   if (estimated_niter != -1
       && ((unsigned HOST_WIDE_INT) estimated_niter
 	  < MAX (th, (unsigned) min_profitable_estimate)))
@@ -1872,6 +1888,15 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
   int res;
   unsigned int max_vf = MAX_VECTORIZATION_FACTOR;
   poly_uint64 min_vf = 2;
+  loop_vec_info orig_loop_vinfo = NULL;
+
+  /* If we are dealing with an epilogue then orig_loop_vinfo points to the
+     loop_vec_info of the first vectorized loop.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
+  else
+    orig_loop_vinfo = loop_vinfo;
+  gcc_assert (orig_loop_vinfo);
 
   /* The first group of checks is independent of the vector size.  */
   fatal = true;
@@ -2151,8 +2176,18 @@ start_over:
   /* During peeling, we need to check if number of loop iterations is
      enough for both peeled prolog loop and vector loop.  This check
      can be merged along with threshold check of loop versioning, so
-     increase threshold for this case if necessary.  */
-  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
+     increase threshold for this case if necessary.
+
+     If we are analyzing an epilogue we still want to check what it's
+     versioning threshold would be.  If we decide to vectorize the epilogues we
+     will want to use the lowest versioning threshold of all epilogues and main
+     loop.  This will enable us to enter a vectorized epilogue even when
+     versioning the loop.  We can't simply check whether the epilogue requires
+     versioning though since we may have skipped some versioning checks when
+     analyzing the epilogue. For instance, checks for alias versioning will be
+     skipped when dealing with epilogues as we assume we already checked them
+     for the main loop.  So instead we always check the 'orig_loop_vinfo'.  */
+  if (LOOP_REQUIRES_VERSIONING (orig_loop_vinfo))
     {
       poly_uint64 niters_th = 0;
       unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
@@ -2307,14 +2342,8 @@ again:
    be vectorized.  */
 opt_loop_vec_info
 vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
-		   vec_info_shared *shared)
+		   vec_info_shared *shared, vector_sizes vector_sizes)
 {
-  auto_vector_sizes vector_sizes;
-
-  /* Autodetect first vector size we try.  */
-  current_vector_size = 0;
-  targetm.vectorize.autovectorize_vector_sizes (&vector_sizes,
-						loop->simdlen != 0);
   unsigned int next_size = 0;
 
   DUMP_VECT_SCOPE ("analyze_loop_nest");
@@ -2335,6 +2364,9 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
   poly_uint64 autodetected_vector_size = 0;
   opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL);
   poly_uint64 first_vector_size = 0;
+  poly_uint64 lowest_th = 0;
+  unsigned vectorized_loops = 0;
+  bool vect_epilogues = !loop->simdlen && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK);
   while (1)
     {
       /* Check the CFG characteristics of the loop (nesting, entry/exit).  */
@@ -2353,24 +2385,52 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
 
       if (orig_loop_vinfo)
 	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo;
+      else if (vect_epilogues && first_loop_vinfo)
+	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
 
       opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts);
       if (res)
 	{
 	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
+	  vectorized_loops++;
 
-	  if (loop->simdlen
-	      && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
-			   (unsigned HOST_WIDE_INT) loop->simdlen))
+	  if ((loop->simdlen
+	       && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+			    (unsigned HOST_WIDE_INT) loop->simdlen))
+	      || vect_epilogues)
 	    {
 	      if (first_loop_vinfo == NULL)
 		{
 		  first_loop_vinfo = loop_vinfo;
+		  lowest_th
+		    = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo);
 		  first_vector_size = current_vector_size;
 		  loop->aux = NULL;
 		}
 	      else
-		delete loop_vinfo;
+		{
+		  /* Keep track of vector sizes that we know we can vectorize
+		     the epilogue with.  */
+		  if (vect_epilogues)
+		    {
+		      loop->aux = NULL;
+		      first_loop_vinfo->epilogue_vsizes.reserve (1);
+		      first_loop_vinfo->epilogue_vsizes.quick_push (current_vector_size);
+		      first_loop_vinfo->epilogue_vinfos.reserve (1);
+		      first_loop_vinfo->epilogue_vinfos.quick_push (loop_vinfo);
+		      LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
+		      poly_uint64 th
+			= LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
+		      gcc_assert (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+				  || maybe_ne (lowest_th, 0U));
+		      /* Keep track of the known smallest versioning
+			 threshold.  */
+		      if (ordered_p (lowest_th, th))
+			lowest_th = ordered_min (lowest_th, th);
+		    }
+		  else
+		    delete loop_vinfo;
+		}
 	    }
 	  else
 	    {
@@ -2408,6 +2468,8 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
 		  dump_dec (MSG_NOTE, current_vector_size);
 		  dump_printf (MSG_NOTE, "\n");
 		}
+	      LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo) = lowest_th;
+
 	      return first_loop_vinfo;
 	    }
 	  else
@@ -8128,6 +8190,188 @@ vect_transform_loop_stmt (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
     *seen_store = stmt_info;
 }
 
+/* Helper function to pass to simplify_replace_tree to enable replacing tree's
+   in the hash_map with its corresponding values.  */
+static tree
+find_in_mapping (tree t, void *context)
+{
+  hash_map<tree,tree>* mapping = (hash_map<tree, tree>*) context;
+
+  tree *value = mapping->get (t);
+  return value ? *value : t;
+}
+
+static void
+update_epilogue_loop_vinfo (class loop *epilogue, tree advance)
+{
+  loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue);
+  auto_vec<stmt_vec_info> pattern_worklist, related_worklist;
+  hash_map<tree,tree> mapping;
+  gimple *orig_stmt, *new_stmt;
+  gimple_stmt_iterator epilogue_gsi;
+  gphi_iterator epilogue_phi_gsi;
+  stmt_vec_info stmt_vinfo = NULL, related_vinfo;
+  basic_block *epilogue_bbs = get_loop_body (epilogue);
+
+  LOOP_VINFO_BBS (epilogue_vinfo) = epilogue_bbs;
+
+  vect_update_inits_of_drs (epilogue_vinfo, advance, PLUS_EXPR);
+
+
+  /* We are done vectorizing the main loop, so now we update the epilogues
+     stmt_vec_info's.  At the same time we set the gimple UID of each
+     statement in the epilogue, as these are used to look them up in the
+     epilogues loop_vec_info later.  We also keep track of what
+     stmt_vec_info's have PATTERN_DEF_SEQ's and RELATED_STMT's that might
+     need updating and we construct a mapping between variables defined in
+     the main loop and their corresponding names in epilogue.  */
+  for (unsigned i = 0; i < epilogue->num_nodes; ++i)
+    {
+      for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]);
+	   !gsi_end_p (epilogue_phi_gsi); gsi_next (&epilogue_phi_gsi))
+	{
+	  orig_stmt = LOOP_VINFO_UP_STMTS (epilogue_vinfo)[0];
+	  LOOP_VINFO_UP_STMTS (epilogue_vinfo).ordered_remove (0);
+	  new_stmt = epilogue_phi_gsi.phi ();
+
+	  stmt_vinfo
+	    = epilogue_vinfo->lookup_stmt (orig_stmt);
+
+	  STMT_VINFO_STMT (stmt_vinfo) = new_stmt;
+	  gimple_set_uid (new_stmt, gimple_uid (orig_stmt));
+
+	  mapping.put (gimple_phi_result (orig_stmt),
+			gimple_phi_result (new_stmt));
+
+	  if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
+	    pattern_worklist.safe_push (stmt_vinfo);
+
+	  related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+	  while (related_vinfo && related_vinfo != stmt_vinfo)
+	    {
+	      related_worklist.safe_push (related_vinfo);
+	      /* Set BB such that the assert in
+		'get_initial_def_for_reduction' is able to determine that
+		the BB of the related stmt is inside this loop.  */
+	      gimple_set_bb (STMT_VINFO_STMT (related_vinfo),
+			     gimple_bb (new_stmt));
+	      related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);
+	    }
+	}
+
+      for (epilogue_gsi = gsi_start_bb (epilogue_bbs[i]);
+	   !gsi_end_p (epilogue_gsi); gsi_next (&epilogue_gsi))
+	{
+	  orig_stmt = LOOP_VINFO_UP_STMTS (epilogue_vinfo)[0];
+	  LOOP_VINFO_UP_STMTS (epilogue_vinfo).ordered_remove (0);
+	  new_stmt = gsi_stmt (epilogue_gsi);
+
+	  stmt_vinfo
+	    = epilogue_vinfo->lookup_stmt (orig_stmt);
+
+	  STMT_VINFO_STMT (stmt_vinfo) = new_stmt;
+	  gimple_set_uid (new_stmt, gimple_uid (orig_stmt));
+
+	  if (is_gimple_assign (orig_stmt))
+	    {
+	      gcc_assert (is_gimple_assign (new_stmt));
+	      mapping.put (gimple_assign_lhs (orig_stmt),
+			  gimple_assign_lhs (new_stmt));
+	    }
+
+	  if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
+	    pattern_worklist.safe_push (stmt_vinfo);
+
+	  related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+	  related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+	  while (related_vinfo && related_vinfo != stmt_vinfo)
+	    {
+	      related_worklist.safe_push (related_vinfo);
+	      /* Set BB such that the assert in
+		'get_initial_def_for_reduction' is able to determine that
+		the BB of the related stmt is inside this loop.  */
+	      gimple_set_bb (STMT_VINFO_STMT (related_vinfo),
+			     gimple_bb (new_stmt));
+	      related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);
+	    }
+	}
+      gcc_assert (LOOP_VINFO_UP_STMTS (epilogue_vinfo).length () == 0);
+    }
+
+  /* The PATTERN_DEF_SEQ's in the epilogue were constructed using the
+     original main loop and thus need to be updated to refer to the cloned
+     variables used in the epilogue.  */
+  for (unsigned i = 0; i < pattern_worklist.length (); ++i)
+    {
+      gimple_seq seq = STMT_VINFO_PATTERN_DEF_SEQ (pattern_worklist[i]);
+      tree *new_op;
+
+      while (seq)
+	{
+	  for (unsigned j = 1; j < gimple_num_ops (seq); ++j)
+	    {
+	      tree op = gimple_op (seq, j);
+	      if ((new_op = mapping.get(op)))
+		gimple_set_op (seq, j, *new_op);
+	      else
+		{
+		  op = simplify_replace_tree (op, NULL_TREE, NULL_TREE,
+					 &find_in_mapping, &mapping);
+		  gimple_set_op (seq, j, op);
+		}
+	    }
+	  seq = seq->next;
+	}
+    }
+
+  /* Just like the PATTERN_DEF_SEQ's the RELATED_STMT's also need to be
+     updated.  */
+  for (unsigned i = 0; i < related_worklist.length (); ++i)
+    {
+      tree *new_t;
+      gimple * stmt = STMT_VINFO_STMT (related_worklist[i]);
+      for (unsigned j = 1; j < gimple_num_ops (stmt); ++j)
+	if ((new_t = mapping.get(gimple_op (stmt, j))))
+	  gimple_set_op (stmt, j, *new_t);
+    }
+
+  tree *new_op;
+  /* Data references for gather loads and scatter stores do not use the
+     updated offset we set using ADVANCE.  Instead we have to make sure the
+     reference in the data references point to the corresponding copy of
+     the original in the epilogue.  */
+  for (unsigned i = 0; i < LOOP_VINFO_UP_GT_DRS (epilogue_vinfo).length (); ++i)
+    {
+      dr_vec_info *dr_vinfo = LOOP_VINFO_UP_GT_DRS (epilogue_vinfo)[i];
+      data_reference *dr = dr_vinfo->dr;
+      gcc_assert (dr);
+      gcc_assert (TREE_CODE (DR_REF (dr)) == MEM_REF);
+      new_op = mapping.get (TREE_OPERAND (DR_REF (dr), 0));
+
+      if (new_op)
+	{
+	  DR_REF (dr) = unshare_expr (DR_REF (dr));
+	  TREE_OPERAND (DR_REF (dr), 0) = *new_op;
+	  DR_STMT (dr_vinfo->dr) = SSA_NAME_DEF_STMT (*new_op);
+	}
+    }
+
+  /* The vector size of the epilogue is smaller than that of the main loop
+     so the alignment is either the same or lower. This means the dr will
+     thus by definition be aligned.  */
+  for (unsigned i = 0; i < LOOP_VINFO_UP_DRS (epilogue_vinfo).length (); ++i)
+    LOOP_VINFO_UP_DRS (epilogue_vinfo)[i]->base_misaligned = false;
+
+
+  LOOP_VINFO_UP_STMTS (epilogue_vinfo).release ();
+  LOOP_VINFO_UP_GT_DRS (epilogue_vinfo).release ();
+  LOOP_VINFO_UP_DRS (epilogue_vinfo).release ();
+
+  epilogue_vinfo->shared->datarefs_copy.release ();
+  epilogue_vinfo->shared->save_datarefs ();
+}
+
+
 /* Function vect_transform_loop.
 
    The analysis phase has determined that the loop is vectorizable.
@@ -8165,11 +8409,11 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   if (th >= vect_vf_for_cost (loop_vinfo)
       && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_NOTE, vect_location,
-			 "Profitability threshold is %d loop iterations.\n",
-                         th);
-      check_profitability = true;
+	if (dump_enabled_p ())
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "Profitability threshold is %d loop iterations.\n",
+			   th);
+	check_profitability = true;
     }
 
   /* Make sure there exists a single-predecessor exit bb.  Do this before 
@@ -8213,9 +8457,13 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo) = niters;
   tree nitersm1 = unshare_expr (LOOP_VINFO_NITERSM1 (loop_vinfo));
   bool niters_no_overflow = loop_niters_no_overflow (loop_vinfo);
+  tree advance;
+
   epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector,
 			      &step_vector, &niters_vector_mult_vf, th,
-			      check_profitability, niters_no_overflow);
+			      check_profitability, niters_no_overflow,
+			      &advance);
+
   if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo)
       && LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo).initialized_p ())
     scale_loop_frequencies (LOOP_VINFO_SCALAR_LOOP (loop_vinfo),
@@ -8474,57 +8722,14 @@ vect_transform_loop (loop_vec_info loop_vinfo)
      since vectorized loop can have loop-carried dependencies.  */
   loop->safelen = 0;
 
-  /* Don't vectorize epilogue for epilogue.  */
-  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
-    epilogue = NULL;
-
-  if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK))
-    epilogue = NULL;
-
   if (epilogue)
     {
-      auto_vector_sizes vector_sizes;
-      targetm.vectorize.autovectorize_vector_sizes (&vector_sizes, false);
-      unsigned int next_size = 0;
-
-      /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work
-         on niters already ajusted for the iterations of the prologue.  */
-      if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
-	  && known_eq (vf, lowest_vf))
-	{
-	  unsigned HOST_WIDE_INT eiters
-	    = (LOOP_VINFO_INT_NITERS (loop_vinfo)
-	       - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
-	  eiters
-	    = eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
-	  epilogue->nb_iterations_upper_bound = eiters - 1;
-	  epilogue->any_upper_bound = true;
-
-	  unsigned int ratio;
-	  while (next_size < vector_sizes.length ()
-		 && !(constant_multiple_p (current_vector_size,
-					   vector_sizes[next_size], &ratio)
-		      && eiters >= lowest_vf / ratio))
-	    next_size += 1;
-	}
-      else
-	while (next_size < vector_sizes.length ()
-	       && maybe_lt (current_vector_size, vector_sizes[next_size]))
-	  next_size += 1;
+      update_epilogue_loop_vinfo (epilogue, advance);
 
-      if (next_size == vector_sizes.length ())
-	epilogue = NULL;
-    }
-
-  if (epilogue)
-    {
+      epilogue->simduid = loop->simduid;
       epilogue->force_vectorize = loop->force_vectorize;
       epilogue->safelen = loop->safelen;
       epilogue->dont_vectorize = false;
-
-      /* We may need to if-convert epilogue to vectorize it.  */
-      if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
-	tree_if_conversion (epilogue);
     }
 
   return epilogue;
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index bdb6b87c7b2d61302c33b071f737ecea41c06d33..fecd22f14bf03edc39ef325d3d80bf258b99603d 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -561,8 +561,26 @@ public:
      this points to the original vectorized loop.  Otherwise NULL.  */
   _loop_vec_info *orig_loop_info;
 
+  vec<_loop_vec_info *> epilogue_vinfos;
+
+  /* Keep track of vector sizes we know we can vectorize the epilogue with.
+     Only the first vectorized loop keeps track of these, for all its possible
+     epilogues.  */
+  vector_sizes epilogue_vsizes;
+
+  struct
+  {
+    vec<gimple *, va_heap, vl_ptr> orig_stmts;
+    vec<dr_vec_info *,va_heap, vl_ptr> gather_scatter_drs;
+    vec<dr_vec_info *,va_heap, vl_ptr> drs;
+  } update_epilogue_vinfo;
+
 } *loop_vec_info;
 
+#define LOOP_VINFO_UP_STMTS(L)	(L)->update_epilogue_vinfo.orig_stmts
+#define LOOP_VINFO_UP_GT_DRS(L)	(L)->update_epilogue_vinfo.gather_scatter_drs
+#define LOOP_VINFO_UP_DRS(L)	(L)->update_epilogue_vinfo.drs
+
 /* Access Functions.  */
 #define LOOP_VINFO_LOOP(L)                 (L)->loop
 #define LOOP_VINFO_BBS(L)                  (L)->bbs
@@ -613,6 +631,7 @@ public:
 #define LOOP_VINFO_SINGLE_SCALAR_ITERATION_COST(L) (L)->single_scalar_iteration_cost
 #define LOOP_VINFO_ORIG_LOOP_INFO(L)       (L)->orig_loop_info
 #define LOOP_VINFO_SIMD_IF_COND(L)         (L)->simd_if_cond
+#define LOOP_VINFO_EPILOGUE_SIZES(L)	   (L)->epilogue_vsizes
 
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L)	\
   ((L)->may_misalign_stmts.length () > 0)
@@ -1516,10 +1535,14 @@ class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
 						     class loop *, edge);
 class loop *vect_loop_versioning (loop_vec_info);
 extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
-				     tree *, tree *, tree *, int, bool, bool);
+				    tree *, tree *, tree *, int, bool, bool,
+				    tree *);
 extern void vect_prepare_for_masked_peels (loop_vec_info);
 extern dump_user_location_t find_loop_location (class loop *);
 extern bool vect_can_advance_ivs_p (loop_vec_info);
+extern gcond * vect_get_loop_niters (class loop *, tree *, tree *, tree *);
+extern void vect_update_inits_of_drs (loop_vec_info, tree, tree_code);
+
 
 /* In tree-vect-stmts.c.  */
 extern poly_uint64 current_vector_size;
@@ -1627,6 +1650,8 @@ extern tree vect_create_addr_base_for_vector_ref (stmt_vec_info, gimple_seq *,
 						  tree, tree = NULL_TREE);
 
 /* In tree-vect-loop.c.  */
+/* Used in tree-vect-loop-manip.c */
+extern void determine_peel_for_niter (loop_vec_info);
 extern widest_int vect_iv_limit_for_full_masking (loop_vec_info loop_vinfo);
 /* Used in gimple-loop-interchange.c and tree-parloops.c.  */
 extern bool check_reduction_path (dump_user_location_t, loop_p, gphi *, tree,
@@ -1634,7 +1659,8 @@ extern bool check_reduction_path (dump_user_location_t, loop_p, gphi *, tree,
 /* Drive for loop analysis stage.  */
 extern opt_loop_vec_info vect_analyze_loop (class loop *,
 					    loop_vec_info,
-					    vec_info_shared *);
+					    vec_info_shared *,
+					    vector_sizes);
 extern tree vect_build_loop_niters (loop_vec_info, bool * = NULL);
 extern void vect_gen_vector_loop_niters (loop_vec_info, tree, tree *,
 					 tree *, bool);
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 3e8637f070d5cd526d6626d2b7ba1c5f9243ce0a..4dbb03cdcc8e9612083136d3ef9b5b16d6e30b13 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -875,6 +875,10 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
   vec_info_shared shared;
   auto_purge_vect_location sentinel;
   vect_location = find_loop_location (loop);
+  auto_vector_sizes auto_vector_sizes;
+  vector_sizes vector_sizes;
+  bool assert_versioning = false;
+
   if (LOCATION_LOCUS (vect_location.get_location_t ()) != UNKNOWN_LOCATION
       && dump_enabled_p ())
     dump_printf (MSG_NOTE | MSG_PRIORITY_INTERNALS,
@@ -882,10 +886,35 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 		 LOCATION_FILE (vect_location.get_location_t ()),
 		 LOCATION_LINE (vect_location.get_location_t ()));
 
+  /* If this is an epilogue, we already know what vector sizes we will use for
+     vectorization as the analyzis was part of the main vectorized loop.  Use
+     these instead of going through all vector sizes again.  */
+  if (orig_loop_vinfo
+      && !LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo).is_empty ())
+    {
+      vector_sizes = LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo);
+      assert_versioning = LOOP_REQUIRES_VERSIONING (orig_loop_vinfo);
+      current_vector_size = vector_sizes[0];
+    }
+  else
+    {
+      /* Autodetect first vector size we try.  */
+      current_vector_size = 0;
+
+      targetm.vectorize.autovectorize_vector_sizes (&auto_vector_sizes,
+						    loop->simdlen != 0);
+      vector_sizes = auto_vector_sizes;
+    }
+
   /* Try to analyze the loop, retaining an opt_problem if dump_enabled_p.  */
-  opt_loop_vec_info loop_vinfo
-    = vect_analyze_loop (loop, orig_loop_vinfo, &shared);
-  loop->aux = loop_vinfo;
+  opt_loop_vec_info loop_vinfo = opt_loop_vec_info::success (NULL);
+  if (loop_vec_info_for_loop (loop))
+    loop_vinfo = opt_loop_vec_info::success (loop_vec_info_for_loop (loop));
+  else
+    {
+      loop_vinfo = vect_analyze_loop (loop, orig_loop_vinfo, &shared, vector_sizes);
+      loop->aux = loop_vinfo;
+    }
 
   if (!loop_vinfo)
     if (dump_enabled_p ())
@@ -898,6 +927,10 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 
   if (!loop_vinfo || !LOOP_VINFO_VECTORIZABLE_P (loop_vinfo))
     {
+      /* If this loops requires versioning, make sure the analyzis done on the
+	 epilogue loops succeeds.  */
+      gcc_assert (!assert_versioning);
+
       /* Free existing information if loop is analyzed with some
 	 assumptions.  */
       if (loop_constraint_set_p (loop, LOOP_C_FINITE))
@@ -1013,8 +1046,13 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 
   /* Epilogue of vectorized loop must be vectorized too.  */
   if (new_loop)
-    ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops,
-				 new_loop, loop_vinfo, NULL, NULL);
+    {
+      /* Don't include vectorized epilogues in the "vectorized loops" count.
+       */
+      unsigned dont_count = *num_vectorized_loops;
+      ret |= try_vectorize_loop_1 (simduid_to_vf_htab, &dont_count,
+				   new_loop, loop_vinfo, NULL, NULL);
+    }
 
   return ret;
 }
Richard Biener Oct. 22, 2019, 1:56 p.m. | #9
On Tue, 22 Oct 2019, Andre Vieira (lists) wrote:

> Hi Richi,

> 

> See inline responses to your comments.

> 

> On 11/10/2019 13:57, Richard Biener wrote:

> > On Thu, 10 Oct 2019, Andre Vieira (lists) wrote:

> > 

> >> Hi,

> >>

> 

> > 

> > +

> > +  /* Keep track of vector sizes we know we can vectorize the epilogue

> > with.  */

> > +  vector_sizes epilogue_vsizes;

> >   };

> > 

> > please don't enlarge struct loop, instead track this somewhere

> > in the vectorizer (in loop_vinfo?  I see you already have

> > epilogue_vinfos there - so the loop_vinfo simply lacks

> > convenient access to the vector_size?)  I don't see any

> > use that could be trivially adjusted to look at a loop_vinfo

> > member instead.

> 

> Done.

> > 

> > For the vect_update_inits_of_drs this means that we'd possibly

> > do less CSE.  Not sure if really an issue.

> 

> CSE of what exactly? You are afraid we are repeating a calculation here we

> have done elsewhere before?


All uses of those inits now possibly get the expression instead of
just the SSA name we inserted code for once.  But as said, we'll see.

> > 

> > You use LOOP_VINFO_EPILOGUE_P sometimes and sometimes

> > LOOP_VINFO_ORIG_LOOP_INFO, please change predicates to

> > LOOP_VINFO_EPILOGUE_P.

> 

> I checked and the points where I use LOOP_VINFO_ORIG_LOOP_INFO is because I

> then use the resulting loop info.  If there are cases you feel strongly about

> let me know.


Not too strongly, no.

> > 

> > @@ -2466,15 +2461,62 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree

> > niters, tree nitersm1,

> >     else

> >       niters_prolog = build_int_cst (type, 0);

> >       

> > +  loop_vec_info epilogue_vinfo = NULL;

> > +  if (vect_epilogues)

> > +    {

> > ...

> > +       vect_epilogues = false;

> > +    }

> > +

> > 

> > I don't understand what all this does - it clearly needs a comment.

> > Maybe the overall comment of the function should be amended with

> > an overview of how we handle [multiple] epilogue loop vectorization?

> 

> I added more comments both here and on top of the function.  Hopefully it is a

> bit clearer now, but it might need some tweaking.

> 

> > 

> > +

> > +      if (epilogue_any_upper_bound && prolog_peeling >= 0)

> > +       {

> > +         epilog->any_upper_bound = true;

> > +         epilog->nb_iterations_upper_bound = eiters + 1;

> > +       }

> > +

> > 

> > comment missing.  How can prolog_peeling be < 0?  We likely

> > didn't set the upper bound because we don't know it in the

> > case we skipped the vector loop (skip_vector)?  So make sure

> > to not introduce wrong-code issues here - maybe do this

> > optimization as followup?n

> > 

> 

> So the reason for this code wasn't so much an optimization as it was for

> correctness.  But I was mistaken, the failure I was seeing without this code

> was not because of this code, but rather being hidden by it. The problem I was

> seeing was that a prolog was being created using the original loop copy,

> rather than the scalar loop, leading to MASK_LOAD and MASK_STORE being left in

> the scalar prolog, leading to expand ICEs. I have fixed that issue by making

> sure the SCALAR_LOOP is used for prolog peeling and either the loop copy or

> SCALAR loop for epilogue peeling depending on whether we will be vectorizing

> said epilogue.


OK.

> 

> > @@ -1726,7 +1729,13 @@ vect_analyze_loop_costing (loop_vec_info

> > loop_vinfo)

> >         return 0;

> >       }

> > 

> > -  HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop);

> > +  HOST_WIDE_INT estimated_niter = -1;

> > +

> > +  if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))

> > +    estimated_niter

> > +      = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1;

> > +  if (estimated_niter == -1)

> > +    estimated_niter = estimated_stmt_executions_int (loop);

> >     if (estimated_niter == -1)

> >       estimated_niter = likely_max_stmt_executions_int (loop);

> >     if (estimated_niter != -1

> > 

> > it's clearer if the old code is completely in a else {} path

> > even though vect_vf_for_cost - 1 should never be -1.

> > 

> Done for the == -1 cases, need to keep the != -1 outside of course.

> > -  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))

> > +  if (LOOP_REQUIRES_VERSIONING (loop_vinfo)

> > +      || ((orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))

> > +         && LOOP_REQUIRES_VERSIONING (orig_loop_vinfo)))

> > 

> > not sure why we need to do this for epilouges?

> > 

> 

> This is because we want to compute the versioning threshold for epilogues such

> that we can use the minimum versioning threshold when versioning the main

> loop.  The reason we need to ask we need to ask the original main loop is

> partially because of code in 'vect_analyze_data_ref_dependences' that chooses

> to not do DR dependence analysis and thus never fills

> LOOP_VINFO_MAY_ALIAS_DDRS for the epilogues loop_vinfo and as a consequence

> LOOP_VINFO_COMP_ALIAS_DDRS is always 0.

> 

> The piece of code is preceded by this comment:

>   /* For epilogues we either have no aliases or alias versioning

>      was applied to original loop.  Therefore we may just get max_vf

>      using VF of original loop.  */

> 

> I have added some comments to make it clearer.

> > 

> > +static tree

> > +replace_ops (tree op, hash_map<tree, tree> &mapping)

> > +{

> > 

> > I'm quite sure I've seen such beast elsewhere ;)  simplify_replace_tree

> > comes up first (not a 1:1 match but hints at a possible tree

> > sharing issue in your variant).

> > 

> 

> The reason I couldn't use simplify_replace_tree is because I didn't know what

> the "OLD" value is at the time I want to call it.  Basically I want to check

> whether an SSA name is a key in MAPPING and if so replace it with the

> corresponding VALUE.

> 

> I have changed simplify_replace_tree such that valueize can take a context

> parameter. I replaced one use of replace_ops with it and the other I

> specialized as I found that it was always a MEM_REF and we needed to replace

> the address it was dereferencing.

> 

> > 

> > +  tree advance;

> >     epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1,

> > &niters_vector,

> >                                &step_vector, &niters_vector_mult_vf, th,

> > -                             check_profitability, niters_no_overflow);

> > +                             check_profitability, niters_no_overflow,

> > +                             &advance);

> > +

> > +  if (epilogue)

> > +    {

> > +      basic_block *orig_bbs = get_loop_body (loop);

> > +      loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue);

> > ...

> > 

> > orig_stmts/drs/etc. in the epilogue loop_vinfo and ...

> > 

> > +      /* We are done vectorizing the main loop, so now we update the

> > epilogues

> > +        stmt_vec_info's.  At the same time we set the gimple UID of each

> > +        statement in the epilogue, as these are used to look them up in

> > the

> > +        epilogues loop_vec_info later.  We also keep track of what

> > ...

> > 

> > split this out to a new function.  I wonder why you need to record

> > the DRs, are they not available via ->datarefs and lookup_dr ()?

> 

> lookup_dr may no longer work at this point. I found that for some memory

> accesses by the time I got to this point, the DR_STMT of the data_reference

> pointed to a scalar statement that no longer existed and the lookup_dr to that

> data reference ICE's.  I can't make this update before we transform the loop

> because the data references are shared, so I decided to capture the

> dr_vec_info's instead. Apparently we don't ever do a lookup_dr past this

> point, which I must admit is surprising.


OK, as long as this fixup code is well isolated we can see how to
make it prettier later ;)  But yes, we have some vectorizer transforms
that remove old stmts (bad).  At least that's true for stores, we
could probably delay actual (scalar) stmt removal until the whole
series of loop + epilogue vectorization is finished.

As said, let's try as followup.

> > Still have to go over the main loop doing the analysis/transform.

> > 

> > Thanks, it looks really promising (albeit exepectedly ugly due to

> > the data rewriting).

> > 

> 

> Yeah, though I feel like now that I have put it away into functions it makes

> it look cleaner.  That vect_transform_loop function was getting too big!

> 

> Is this OK for trunk?


You probably no longer need the gentype.c hunk.

+}
+
+static void
+update_epilogue_loop_vinfo (class loop *epilogue, tree advance)

function comment missing

+
+
+  /* We are done vectorizing the main loop, so now we update the 
epilogues

too much vertical space.

+  /* We are done vectorizing the main loop, so now we update the 
epilogues
+     stmt_vec_info's.  At the same time we set the gimple UID of each
+     statement in the epilogue, as these are used to look them up in the
+     epilogues loop_vec_info later.  We also keep track of what
+     stmt_vec_info's have PATTERN_DEF_SEQ's and RELATED_STMT's that might
+     need updating and we construct a mapping between variables defined 
in
+     the main loop and their corresponding names in epilogue.  */
+  for (unsigned i = 0; i < epilogue->num_nodes; ++i)

so for the following code I wonder if you can make use of the
fact that loop copying also copies UIDs, so you should be able
to match stmts via their UIDs and get at the other loop infos
stmt_info by the copy loop stmt UID.

I wonder why you need no modification for the SLP tree?

Otherwise the patch looks OK.

Thanks,
Richard.

> gcc/ChangeLog:

> 2019-10-22  Andre Vieira  <andre.simoesdiasvieira@arm.com>

> 

>     PR 88915

>     * gentype.c (main): Add poly_uint64 type and vector_sizes to

>     generator.

>     * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.

>     * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter

>     and make the valueize function pointer also take a void pointer.

>     * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap

>     around vn_valueize, to call it without a context.

>     (process_bb): Use vn_valueize_wrapper instead of vn_valueize.

>     * tree-vect-loop.c (vect_get_loop_niters): Make externally visible.

>     (_loop_vec_info): Initialize epilogue_vinfos.

>     (~_loop_vec_info): Release epilogue_vinfos.

>     (vect_analyze_loop_costing): Use knowledge of main VF to estimate

>     number of iterations of epilogue.

>     (vect_analyze_loop_2): Adapt to analyse main loop for all supported

>     vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest

>     versioning threshold needed for main loop.

>     (vect_analyze_loop): Likewise.

>     (find_in_mapping): New helper function.

>     (update_epilogue_loop_vinfo): New function.

>     (vect_transform_loop): When vectorizing epilogues re-use analysis done

>     on main loop and call update_epilogue_loop_vinfo to update it.

>     * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert

>     stmts on loop preheader edge.

>     (vect_do_peeling): Enable skip-vectors when doing loop versioning if

>     we decided to vectorize epilogues.  Update epilogues NITERS and

>     construct ADVANCE to update epilogues data references where needed.

>     * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos,

>     epilogue_vsizes and update_epilogue_vinfo members.

>     (LOOP_VINFO_UP_STMTS, LOOP_VINFO_UP_GT_DRS, LOOP_VINFO_UP_DRS,

>     LOOP_VINFO_EPILOGUE_SIZES): Define MACROs.

>     (vect_do_peeling, vect_get_loop_niters, vect_update_inits_of_drs,

>      determine_peel_for_niter, vect_analyze_loop): Add or update declarations.

>     * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already

>     created loop_vec_info's for epilogues when available.  Otherwise 

> analyse

>     epilogue separately.

> 


-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
Richard Sandiford Oct. 22, 2019, 5:52 p.m. | #10
Thanks for doing this.  Hope this message doesn't cover too much old
ground or duplicate too much...

"Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
> @@ -2466,15 +2476,65 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,

>    else

>      niters_prolog = build_int_cst (type, 0);

>  

> +  loop_vec_info epilogue_vinfo = NULL;

> +  if (vect_epilogues)

> +    {

> +      /* Take the next epilogue_vinfo to vectorize for.  */

> +      epilogue_vinfo = loop_vinfo->epilogue_vinfos[0];

> +      loop_vinfo->epilogue_vinfos.ordered_remove (0);

> +

> +      /* Don't vectorize epilogues if this is not the most inner loop or if

> +	 the epilogue may need peeling for alignment as the vectorizer doesn't

> +	 know how to handle these situations properly yet.  */

> +      if (loop->inner != NULL

> +	  || LOOP_VINFO_PEELING_FOR_ALIGNMENT (epilogue_vinfo))

> +	vect_epilogues = false;

> +

> +    }


Nit: excess blank line before "}".  Sorry if this was discussed before,
but what's the reason for delaying the check for "loop->inner" to
this point, rather than doing it in vect_analyze_loop?

> +

> +  tree niters_vector_mult_vf;

> +  unsigned int lowest_vf = constant_lower_bound (vf);

> +  /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work

> +     on niters already ajusted for the iterations of the prologue.  */


Pre-existing typo: adjusted.  But...

> +  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)

> +      && known_eq (vf, lowest_vf))

> +    {

> +      loop_vec_info orig_loop_vinfo;

> +      if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))

> +	orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);

> +      else

> +	orig_loop_vinfo = loop_vinfo;

> +      vector_sizes vector_sizes = LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo);

> +      unsigned next_size = 0;

> +      unsigned HOST_WIDE_INT eiters

> +	= (LOOP_VINFO_INT_NITERS (loop_vinfo)

> +	   - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));

> +

> +      if (prolog_peeling > 0)

> +	eiters -= prolog_peeling;


...is that comment still true?  We're now subtracting the peeling
amount here.

Might be worth asserting prolog_peeling >= 0, just to emphasise
that we can't get here for variable peeling amounts, and then subtract
prolog_peeling unconditionally (assuming that's the right thing to do).

> +      eiters

> +	= eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);

> +

> +      unsigned int ratio;

> +      while (next_size < vector_sizes.length ()

> +	     && !(constant_multiple_p (current_vector_size,

> +				       vector_sizes[next_size], &ratio)

> +		  && eiters >= lowest_vf / ratio))

> +	next_size += 1;

> +

> +      if (next_size == vector_sizes.length ())

> +	vect_epilogues = false;

> +    }

> +

>    /* Prolog loop may be skipped.  */

>    bool skip_prolog = (prolog_peeling != 0);

>    /* Skip to epilog if scalar loop may be preferred.  It's only needed

> -     when we peel for epilog loop and when it hasn't been checked with

> -     loop versioning.  */

> +     when we peel for epilog loop or when we loop version.  */

>    bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)

>  		      ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo),

>  				  bound_prolog + bound_epilog)

> -		      : !LOOP_REQUIRES_VERSIONING (loop_vinfo));

> +		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)

> +			 || vect_epilogues));


The comment update looks wrong here: without epilogues, we don't need
the skip when loop versioning, because loop versioning ensures that we
have at least one vector iteration.

(I think "it" was supposed to mean "skipping to the epilogue" rather
than the epilogue loop itself, in case that's the confusion.)

It'd be good to mention the epilogue condition in the comment too.

> @@ -2504,6 +2564,13 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,

>  

>    dump_user_location_t loop_loc = find_loop_location (loop);

>    class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);

> +  if (vect_epilogues)

> +    /* Make sure to set the epilogue's epilogue scalar loop, such that we can

> +       we can use the original scalar loop as remaining epilogue if

> +       necessary.  */


Double "we can".

> +    LOOP_VINFO_SCALAR_LOOP (epilogue_vinfo)

> +      = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);

> +

>    if (prolog_peeling)

>      {

>        e = loop_preheader_edge (loop);

> @@ -2584,14 +2651,22 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,

>  			   "loop can't be duplicated to exit edge.\n");

>  	  gcc_unreachable ();

>  	}

> -      /* Peel epilog and put it on exit edge of loop.  */

> -      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e);

> +      /* Peel epilog and put it on exit edge of loop.  If we are vectorizing

> +	 said epilog then we should use a copy of the main loop as a starting

> +	 point.  This loop may have been already had some preliminary


s/been//

> +	 transformations to allow for more optimal vectorizationg, for example


typo: vectorizationg

> +	 if-conversion.  If we are not vectorizing the epilog then we should

> +	 use the scalar loop as the transformations mentioned above make less

> +	 or no sense when not vectorizing.  */

> +      epilog = vect_epilogues ? get_loop_copy (loop) : scalar_loop;

> +      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e);

>        if (!epilog)

>  	{

>  	  dump_printf_loc (MSG_MISSED_OPTIMIZATION, loop_loc,

>  			   "slpeel_tree_duplicate_loop_to_edge_cfg failed.\n");

>  	  gcc_unreachable ();

>  	}

> +

>        epilog->force_vectorize = false;

>        slpeel_update_phi_nodes_for_loops (loop_vinfo, loop, epilog, false);

>  

> [...]

> @@ -2699,10 +2774,163 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,

>        adjust_vec_debug_stmts ();

>        scev_reset ();

>      }

> +

> +  if (vect_epilogues)

> +    {

> +      epilog->aux = epilogue_vinfo;

> +      LOOP_VINFO_LOOP (epilogue_vinfo) = epilog;

> +

> +      loop_constraint_clear (epilog, LOOP_C_INFINITE);

> +

> +      /* We now must calculate the number of iterations for our epilogue.  */

> +      tree cond_niters, niters;

> +

> +      /* Depending on whether we peel for gaps we take niters or niters - 1,

> +	 we will refer to this as N - G, where N and G are the NITERS and

> +	 GAP for the original loop.  */

> +      niters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)

> +	? LOOP_VINFO_NITERSM1 (loop_vinfo)

> +	: LOOP_VINFO_NITERS (loop_vinfo);

> +

> +      /* Here we build a vector factorization mask:

> +	 vf_mask = ~(VF - 1), where VF is the Vectorization Factor.  */

> +      tree vf_mask = build_int_cst (TREE_TYPE (niters),

> +				    LOOP_VINFO_VECT_FACTOR (loop_vinfo));

> +      vf_mask = fold_build2 (MINUS_EXPR, TREE_TYPE (vf_mask),

> +			     vf_mask,

> +			     build_one_cst (TREE_TYPE (vf_mask)));

> +      vf_mask = fold_build1 (BIT_NOT_EXPR, TREE_TYPE (niters), vf_mask);

> +

> +      /* Here we calculate:

> +	 niters = N - ((N-G) & ~(VF -1)) */

> +      niters = fold_build2 (MINUS_EXPR, TREE_TYPE (niters),

> +			    LOOP_VINFO_NITERS (loop_vinfo),

> +			    fold_build2 (BIT_AND_EXPR, TREE_TYPE (niters),

> +					 niters,

> +					 vf_mask));


Might be a daft question, sorry, but why does this need to be so
complicated?  Couldn't we just use the final value of the main loop's
IV to calculate how many iterations are left?

The current code wouldn't for example work for non-power-of-2 SVE vectors.
vect_set_loop_condition_unmasked is structured to cope with that case
(in length-agnostic mode only), even when an epilogue is needed.

> [...]

> -  return epilog;

> +  if (vect_epilogues)

> +    {

> +      basic_block *bbs = get_loop_body (loop);

> +      loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilog);

> +

> +      LOOP_VINFO_UP_STMTS (epilogue_vinfo).create (0);

> +      LOOP_VINFO_UP_GT_DRS (epilogue_vinfo).create (0);

> +      LOOP_VINFO_UP_DRS (epilogue_vinfo).create (0);

> +

> +      gimple_stmt_iterator gsi;

> +      gphi_iterator phi_gsi;

> +      gimple *stmt;

> +      stmt_vec_info stmt_vinfo;

> +      dr_vec_info *dr_vinfo;

> +

> +      /* The stmt_vec_info's of the epilogue were constructed for the main loop

> +	 and need to be updated to refer to the cloned variables used in the

> +	 epilogue loop.  We do this by assuming the original main loop and the

> +	 epilogue loop are identical (aside the different SSA names).  This

> +	 means we assume we can go through each BB in the loop and each STMT in

> +	 each BB and map them 1:1, replacing the STMT_VINFO_STMT of each

> +	 stmt_vec_info in the epilogue's loop_vec_info.  Here we only keep

> +	 track of the original state of the main loop, before vectorization.

> +	 After vectorization we proceed to update the epilogue's stmt_vec_infos

> +	 information.  We also update the references in PATTERN_DEF_SEQ's,

> +	 RELATED_STMT's and data_references.  Mainly the latter has to be

> +	 updated after we are done vectorizing the main loop, as the

> +	 data_references are shared between main and epilogue.  */

> +      for (unsigned i = 0; i < loop->num_nodes; ++i)

> +	{

> +	  for (phi_gsi = gsi_start_phis (bbs[i]);

> +	       !gsi_end_p (phi_gsi); gsi_next (&phi_gsi))

> +	    LOOP_VINFO_UP_STMTS (epilogue_vinfo).safe_push (phi_gsi.phi ());

> +	  for (gsi = gsi_start_bb (bbs[i]);

> +	       !gsi_end_p (gsi); gsi_next (&gsi))

> +	    {

> +	      stmt = gsi_stmt (gsi);

> +	      LOOP_VINFO_UP_STMTS (epilogue_vinfo).safe_push (stmt);

> +	      stmt_vinfo  = epilogue_vinfo->lookup_stmt (stmt);


Nit: double space before "=".

> +	      if (stmt_vinfo != NULL

> +		  && stmt_vinfo->dr_aux.stmt == stmt_vinfo)

> +		{

> +		  dr_vinfo = STMT_VINFO_DR_INFO (stmt_vinfo);

> +		  /* Data references pointing to gather loads and scatter stores

> +		     require special treatment because the address computation

> +		     happens in a different gimple node, pointed to by DR_REF.

> +		     In contrast to normal loads and stores where we only need

> +		     to update the offset of the data reference.  */

> +		  if (STMT_VINFO_GATHER_SCATTER_P (dr_vinfo->stmt))

> +		    LOOP_VINFO_UP_GT_DRS (epilogue_vinfo).safe_push (dr_vinfo);

> +		  LOOP_VINFO_UP_DRS (epilogue_vinfo).safe_push (dr_vinfo);

> +		}

> +	    }

> +	}

> +    }

> +

> +  return vect_epilogues ? epilog : NULL;

>  }

>  

>  /* Function vect_create_cond_for_niters_checks.

> [...]

> @@ -2151,8 +2176,18 @@ start_over:

>    /* During peeling, we need to check if number of loop iterations is

>       enough for both peeled prolog loop and vector loop.  This check

>       can be merged along with threshold check of loop versioning, so

> -     increase threshold for this case if necessary.  */

> -  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))

> +     increase threshold for this case if necessary.

> +

> +     If we are analyzing an epilogue we still want to check what it's


s/it's/its/

> +     versioning threshold would be.  If we decide to vectorize the epilogues we

> +     will want to use the lowest versioning threshold of all epilogues and main

> +     loop.  This will enable us to enter a vectorized epilogue even when

> +     versioning the loop.  We can't simply check whether the epilogue requires

> +     versioning though since we may have skipped some versioning checks when

> +     analyzing the epilogue. For instance, checks for alias versioning will be


Nit: should be two spaces after ".".

> +     skipped when dealing with epilogues as we assume we already checked them

> +     for the main loop.  So instead we always check the 'orig_loop_vinfo'.  */

> +  if (LOOP_REQUIRES_VERSIONING (orig_loop_vinfo))

>      {

>        poly_uint64 niters_th = 0;

>        unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);

> @@ -2307,14 +2342,8 @@ again:

>     be vectorized.  */

>  opt_loop_vec_info

>  vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,

> -		   vec_info_shared *shared)

> +		   vec_info_shared *shared, vector_sizes vector_sizes)

>  {

> -  auto_vector_sizes vector_sizes;

> -

> -  /* Autodetect first vector size we try.  */

> -  current_vector_size = 0;

> -  targetm.vectorize.autovectorize_vector_sizes (&vector_sizes,

> -						loop->simdlen != 0);

>    unsigned int next_size = 0;

>  

>    DUMP_VECT_SCOPE ("analyze_loop_nest");

> @@ -2335,6 +2364,9 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,

>    poly_uint64 autodetected_vector_size = 0;

>    opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL);

>    poly_uint64 first_vector_size = 0;

> +  poly_uint64 lowest_th = 0;

> +  unsigned vectorized_loops = 0;

> +  bool vect_epilogues = !loop->simdlen && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK);

>    while (1)

>      {

>        /* Check the CFG characteristics of the loop (nesting, entry/exit).  */

> @@ -2353,24 +2385,52 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,

>  

>        if (orig_loop_vinfo)

>  	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo;

> +      else if (vect_epilogues && first_loop_vinfo)

> +	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;

>  

>        opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts);

>        if (res)

>  	{

>  	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;

> +	  vectorized_loops++;

>  

> -	  if (loop->simdlen

> -	      && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),

> -			   (unsigned HOST_WIDE_INT) loop->simdlen))

> +	  if ((loop->simdlen

> +	       && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),

> +			    (unsigned HOST_WIDE_INT) loop->simdlen))

> +	      || vect_epilogues)

>  	    {

>  	      if (first_loop_vinfo == NULL)

>  		{

>  		  first_loop_vinfo = loop_vinfo;

> +		  lowest_th

> +		    = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo);

>  		  first_vector_size = current_vector_size;

>  		  loop->aux = NULL;

>  		}

>  	      else

> -		delete loop_vinfo;

> +		{

> +		  /* Keep track of vector sizes that we know we can vectorize

> +		     the epilogue with.  */

> +		  if (vect_epilogues)

> +		    {

> +		      loop->aux = NULL;

> +		      first_loop_vinfo->epilogue_vsizes.reserve (1);

> +		      first_loop_vinfo->epilogue_vsizes.quick_push (current_vector_size);

> +		      first_loop_vinfo->epilogue_vinfos.reserve (1);

> +		      first_loop_vinfo->epilogue_vinfos.quick_push (loop_vinfo);


I've messed you around, sorry, but the patches I committed this weekend
mean we now store the vector size in the loop_vinfo.  It'd be good to
avoid a separate epilogue_vsizes array if possible.

> +		      LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;

> +		      poly_uint64 th

> +			= LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);

> +		      gcc_assert (!LOOP_REQUIRES_VERSIONING (loop_vinfo)

> +				  || maybe_ne (lowest_th, 0U));

> +		      /* Keep track of the known smallest versioning

> +			 threshold.  */

> +		      if (ordered_p (lowest_th, th))

> +			lowest_th = ordered_min (lowest_th, th);

> +		    }

> +		  else

> +		    delete loop_vinfo;

> +		}

>  	    }

>  	  else

>  	    {

> @@ -2408,6 +2468,8 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,

>  		  dump_dec (MSG_NOTE, current_vector_size);

>  		  dump_printf (MSG_NOTE, "\n");

>  		}

> +	      LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo) = lowest_th;

> +

>  	      return first_loop_vinfo;

>  	    }

>  	  else

> @@ -8128,6 +8190,188 @@ vect_transform_loop_stmt (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,

>      *seen_store = stmt_info;

>  }

>  

> +/* Helper function to pass to simplify_replace_tree to enable replacing tree's

> +   in the hash_map with its corresponding values.  */

> +static tree

> +find_in_mapping (tree t, void *context)

> +{

> +  hash_map<tree,tree>* mapping = (hash_map<tree, tree>*) context;

> +

> +  tree *value = mapping->get (t);

> +  return value ? *value : t;

> +}

> +

> +static void

> +update_epilogue_loop_vinfo (class loop *epilogue, tree advance)

> +{

> +  loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue);

> +  auto_vec<stmt_vec_info> pattern_worklist, related_worklist;

> +  hash_map<tree,tree> mapping;

> +  gimple *orig_stmt, *new_stmt;

> +  gimple_stmt_iterator epilogue_gsi;

> +  gphi_iterator epilogue_phi_gsi;

> +  stmt_vec_info stmt_vinfo = NULL, related_vinfo;

> +  basic_block *epilogue_bbs = get_loop_body (epilogue);

> +

> +  LOOP_VINFO_BBS (epilogue_vinfo) = epilogue_bbs;

> +

> +  vect_update_inits_of_drs (epilogue_vinfo, advance, PLUS_EXPR);

> +

> +

> +  /* We are done vectorizing the main loop, so now we update the epilogues

> +     stmt_vec_info's.  At the same time we set the gimple UID of each


"epilogue's stmt_vec_infos"

> +     statement in the epilogue, as these are used to look them up in the

> +     epilogues loop_vec_info later.  We also keep track of what


epilogue's

> +     stmt_vec_info's have PATTERN_DEF_SEQ's and RELATED_STMT's that might


PATTERN_DEF_SEQs and RELATED_STMTs

> +     need updating and we construct a mapping between variables defined in

> +     the main loop and their corresponding names in epilogue.  */

> +  for (unsigned i = 0; i < epilogue->num_nodes; ++i)

> +    {

> +      for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]);

> +	   !gsi_end_p (epilogue_phi_gsi); gsi_next (&epilogue_phi_gsi))

> +	{

> +	  orig_stmt = LOOP_VINFO_UP_STMTS (epilogue_vinfo)[0];

> +	  LOOP_VINFO_UP_STMTS (epilogue_vinfo).ordered_remove (0);

> +	  new_stmt = epilogue_phi_gsi.phi ();

> +

> +	  stmt_vinfo

> +	    = epilogue_vinfo->lookup_stmt (orig_stmt);


Nit: fits one line.

> +

> +	  STMT_VINFO_STMT (stmt_vinfo) = new_stmt;

> +	  gimple_set_uid (new_stmt, gimple_uid (orig_stmt));

> +

> +	  mapping.put (gimple_phi_result (orig_stmt),

> +			gimple_phi_result (new_stmt));


Nit: indented too far.

> +

> +	  if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))

> +	    pattern_worklist.safe_push (stmt_vinfo);

> +

> +	  related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);

> +	  while (related_vinfo && related_vinfo != stmt_vinfo)

> +	    {

> +	      related_worklist.safe_push (related_vinfo);

> +	      /* Set BB such that the assert in

> +		'get_initial_def_for_reduction' is able to determine that

> +		the BB of the related stmt is inside this loop.  */

> +	      gimple_set_bb (STMT_VINFO_STMT (related_vinfo),

> +			     gimple_bb (new_stmt));

> +	      related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);

> +	    }

> +	}

> +

> +      for (epilogue_gsi = gsi_start_bb (epilogue_bbs[i]);

> +	   !gsi_end_p (epilogue_gsi); gsi_next (&epilogue_gsi))

> +	{

> +	  orig_stmt = LOOP_VINFO_UP_STMTS (epilogue_vinfo)[0];

> +	  LOOP_VINFO_UP_STMTS (epilogue_vinfo).ordered_remove (0);

> +	  new_stmt = gsi_stmt (epilogue_gsi);

> +

> +	  stmt_vinfo

> +	    = epilogue_vinfo->lookup_stmt (orig_stmt);


Fits on one line.

> +

> +	  STMT_VINFO_STMT (stmt_vinfo) = new_stmt;

> +	  gimple_set_uid (new_stmt, gimple_uid (orig_stmt));

> +

> +	  if (is_gimple_assign (orig_stmt))

> +	    {

> +	      gcc_assert (is_gimple_assign (new_stmt));

> +	      mapping.put (gimple_assign_lhs (orig_stmt),

> +			  gimple_assign_lhs (new_stmt));

> +	    }


Why just assigns?  Don't we need to handle calls too?

Maybe just use gimple_get_lhs here.

> +

> +	  if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))

> +	    pattern_worklist.safe_push (stmt_vinfo);

> +

> +	  related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);

> +	  related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);

> +	  while (related_vinfo && related_vinfo != stmt_vinfo)

> +	    {

> +	      related_worklist.safe_push (related_vinfo);

> +	      /* Set BB such that the assert in

> +		'get_initial_def_for_reduction' is able to determine that

> +		the BB of the related stmt is inside this loop.  */

> +	      gimple_set_bb (STMT_VINFO_STMT (related_vinfo),

> +			     gimple_bb (new_stmt));

> +	      related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);

> +	    }

> +	}

> +      gcc_assert (LOOP_VINFO_UP_STMTS (epilogue_vinfo).length () == 0);

> +    }

> +

> +  /* The PATTERN_DEF_SEQ's in the epilogue were constructed using the


PATTERN_DEF_SEQs

> +     original main loop and thus need to be updated to refer to the cloned

> +     variables used in the epilogue.  */

> +  for (unsigned i = 0; i < pattern_worklist.length (); ++i)

> +    {

> +      gimple_seq seq = STMT_VINFO_PATTERN_DEF_SEQ (pattern_worklist[i]);

> +      tree *new_op;

> +

> +      while (seq)

> +	{

> +	  for (unsigned j = 1; j < gimple_num_ops (seq); ++j)

> +	    {

> +	      tree op = gimple_op (seq, j);

> +	      if ((new_op = mapping.get(op)))

> +		gimple_set_op (seq, j, *new_op);

> +	      else

> +		{

> +		  op = simplify_replace_tree (op, NULL_TREE, NULL_TREE,

> +					 &find_in_mapping, &mapping);

> +		  gimple_set_op (seq, j, op);

> +		}

> +	    }

> +	  seq = seq->next;

> +	}

> +    }

> +

> +  /* Just like the PATTERN_DEF_SEQ's the RELATED_STMT's also need to be


as above

> +     updated.  */

> +  for (unsigned i = 0; i < related_worklist.length (); ++i)

> +    {

> +      tree *new_t;

> +      gimple * stmt = STMT_VINFO_STMT (related_worklist[i]);

> +      for (unsigned j = 1; j < gimple_num_ops (stmt); ++j)

> +	if ((new_t = mapping.get(gimple_op (stmt, j))))


These days I think:

	if (tree *new_t = mapping.get(gimple_op (stmt, j)))

is preferred.

> +	  gimple_set_op (stmt, j, *new_t);

> +    }

> +

> +  tree *new_op;

> +  /* Data references for gather loads and scatter stores do not use the

> +     updated offset we set using ADVANCE.  Instead we have to make sure the

> +     reference in the data references point to the corresponding copy of

> +     the original in the epilogue.  */

> +  for (unsigned i = 0; i < LOOP_VINFO_UP_GT_DRS (epilogue_vinfo).length (); ++i)

> +    {

> +      dr_vec_info *dr_vinfo = LOOP_VINFO_UP_GT_DRS (epilogue_vinfo)[i];

> +      data_reference *dr = dr_vinfo->dr;

> +      gcc_assert (dr);

> +      gcc_assert (TREE_CODE (DR_REF (dr)) == MEM_REF);

> +      new_op = mapping.get (TREE_OPERAND (DR_REF (dr), 0));

> +

> +      if (new_op)


Likewise:

      if (tree *new_op = mapping.get (TREE_OPERAND (DR_REF (dr), 0)))

here.

> +	{

> +	  DR_REF (dr) = unshare_expr (DR_REF (dr));

> +	  TREE_OPERAND (DR_REF (dr), 0) = *new_op;

> +	  DR_STMT (dr_vinfo->dr) = SSA_NAME_DEF_STMT (*new_op);

> +	}

> +    }

> +

> +  /* The vector size of the epilogue is smaller than that of the main loop

> +     so the alignment is either the same or lower. This means the dr will

> +     thus by definition be aligned.  */

> +  for (unsigned i = 0; i < LOOP_VINFO_UP_DRS (epilogue_vinfo).length (); ++i)

> +    LOOP_VINFO_UP_DRS (epilogue_vinfo)[i]->base_misaligned = false;

> +

> +

> +  LOOP_VINFO_UP_STMTS (epilogue_vinfo).release ();

> +  LOOP_VINFO_UP_GT_DRS (epilogue_vinfo).release ();

> +  LOOP_VINFO_UP_DRS (epilogue_vinfo).release ();

> +

> +  epilogue_vinfo->shared->datarefs_copy.release ();

> +  epilogue_vinfo->shared->save_datarefs ();

> +}

> +

> +

>  /* Function vect_transform_loop.

>  

>     The analysis phase has determined that the loop is vectorizable.

> [...]

> @@ -882,10 +886,35 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,

>  		 LOCATION_FILE (vect_location.get_location_t ()),

>  		 LOCATION_LINE (vect_location.get_location_t ()));

>  

> +  /* If this is an epilogue, we already know what vector sizes we will use for

> +     vectorization as the analyzis was part of the main vectorized loop.  Use

> +     these instead of going through all vector sizes again.  */

> +  if (orig_loop_vinfo

> +      && !LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo).is_empty ())

> +    {

> +      vector_sizes = LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo);

> +      assert_versioning = LOOP_REQUIRES_VERSIONING (orig_loop_vinfo);

> +      current_vector_size = vector_sizes[0];

> +    }

> +  else

> +    {

> +      /* Autodetect first vector size we try.  */

> +      current_vector_size = 0;

> +

> +      targetm.vectorize.autovectorize_vector_sizes (&auto_vector_sizes,

> +						    loop->simdlen != 0);

> +      vector_sizes = auto_vector_sizes;

> +    }

> +

>    /* Try to analyze the loop, retaining an opt_problem if dump_enabled_p.  */

> -  opt_loop_vec_info loop_vinfo

> -    = vect_analyze_loop (loop, orig_loop_vinfo, &shared);

> -  loop->aux = loop_vinfo;

> +  opt_loop_vec_info loop_vinfo = opt_loop_vec_info::success (NULL);

> +  if (loop_vec_info_for_loop (loop))

> +    loop_vinfo = opt_loop_vec_info::success (loop_vec_info_for_loop (loop));

> +  else

> +    {

> +      loop_vinfo = vect_analyze_loop (loop, orig_loop_vinfo, &shared, vector_sizes);

> +      loop->aux = loop_vinfo;

> +    }


I don't really understand what this is doing for the epilogue case.
Do we call vect_analyze_loop again?  Are vector_sizes[1:] significant
for epilogues?

Thanks,
Richard
Andre Simoes Dias Vieira Oct. 25, 2019, 4:18 p.m. | #11
On 22/10/2019 14:56, Richard Biener wrote:
> On Tue, 22 Oct 2019, Andre Vieira (lists) wrote:

> 

>> Hi Richi,

>>

>> See inline responses to your comments.

>>

>> On 11/10/2019 13:57, Richard Biener wrote:

>>> On Thu, 10 Oct 2019, Andre Vieira (lists) wrote:

>>>

>>>> Hi,

>>>>

>>

>>>

>>> +

>>> +  /* Keep track of vector sizes we know we can vectorize the epilogue

>>> with.  */

>>> +  vector_sizes epilogue_vsizes;

>>>    };

>>>

>>> please don't enlarge struct loop, instead track this somewhere

>>> in the vectorizer (in loop_vinfo?  I see you already have

>>> epilogue_vinfos there - so the loop_vinfo simply lacks

>>> convenient access to the vector_size?)  I don't see any

>>> use that could be trivially adjusted to look at a loop_vinfo

>>> member instead.

>>

>> Done.

>>>

>>> For the vect_update_inits_of_drs this means that we'd possibly

>>> do less CSE.  Not sure if really an issue.

>>

>> CSE of what exactly? You are afraid we are repeating a calculation here we

>> have done elsewhere before?

> 

> All uses of those inits now possibly get the expression instead of

> just the SSA name we inserted code for once.  But as said, we'll see.

> 


This code changed after some comments from Richard Sandiford.

> +  /* We are done vectorizing the main loop, so now we update the

> epilogues

> +     stmt_vec_info's.  At the same time we set the gimple UID of each

> +     statement in the epilogue, as these are used to look them up in the

> +     epilogues loop_vec_info later.  We also keep track of what

> +     stmt_vec_info's have PATTERN_DEF_SEQ's and RELATED_STMT's that might

> +     need updating and we construct a mapping between variables defined

> in

> +     the main loop and their corresponding names in epilogue.  */

> +  for (unsigned i = 0; i < epilogue->num_nodes; ++i)

> 

> so for the following code I wonder if you can make use of the

> fact that loop copying also copies UIDs, so you should be able

> to match stmts via their UIDs and get at the other loop infos

> stmt_info by the copy loop stmt UID.

> 

> I wonder why you need no modification for the SLP tree?

> 

I checked with Tamar and the SLP tree works with the position of 
operands and not SSA_NAMES.  So we should be fine.
Andre Simoes Dias Vieira Oct. 25, 2019, 4:18 p.m. | #12
On 22/10/2019 18:52, Richard Sandiford wrote:
> Thanks for doing this.  Hope this message doesn't cover too much old

> ground or duplicate too much...

> 

> "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:

>> @@ -2466,15 +2476,65 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,

>>     else

>>       niters_prolog = build_int_cst (type, 0);

>>   

>> +  loop_vec_info epilogue_vinfo = NULL;

>> +  if (vect_epilogues)

>> +    {

>> +      /* Take the next epilogue_vinfo to vectorize for.  */

>> +      epilogue_vinfo = loop_vinfo->epilogue_vinfos[0];

>> +      loop_vinfo->epilogue_vinfos.ordered_remove (0);

>> +

>> +      /* Don't vectorize epilogues if this is not the most inner loop or if

>> +	 the epilogue may need peeling for alignment as the vectorizer doesn't

>> +	 know how to handle these situations properly yet.  */

>> +      if (loop->inner != NULL

>> +	  || LOOP_VINFO_PEELING_FOR_ALIGNMENT (epilogue_vinfo))

>> +	vect_epilogues = false;

>> +

>> +    }

> 

> Nit: excess blank line before "}".  Sorry if this was discussed before,

> but what's the reason for delaying the check for "loop->inner" to

> this point, rather than doing it in vect_analyze_loop?


Done.
> 

>> +

>> +  tree niters_vector_mult_vf;

>> +  unsigned int lowest_vf = constant_lower_bound (vf);

>> +  /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work

>> +     on niters already ajusted for the iterations of the prologue.  */

> 

> Pre-existing typo: adjusted.  But...

> 

>> +  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)

>> +      && known_eq (vf, lowest_vf))

>> +    {

>> +      loop_vec_info orig_loop_vinfo;

>> +      if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))

>> +	orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);

>> +      else

>> +	orig_loop_vinfo = loop_vinfo;

>> +      vector_sizes vector_sizes = LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo);

>> +      unsigned next_size = 0;

>> +      unsigned HOST_WIDE_INT eiters

>> +	= (LOOP_VINFO_INT_NITERS (loop_vinfo)

>> +	   - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));

>> +

>> +      if (prolog_peeling > 0)

>> +	eiters -= prolog_peeling;

> 

> ...is that comment still true?  We're now subtracting the peeling

> amount here.


It is not, "adjusted" the comment ;)

> Might be worth asserting prolog_peeling >= 0, just to emphasise

> that we can't get here for variable peeling amounts, and then subtract

> prolog_peeling unconditionally (assuming that's the right thing to do).

> 

Can't assert as LOOP_VINFO_NITERS_KNOWN_P can be true even with 
prolog_peeling < 0, since we still know the constant number of scalar 
iterations, we just don't know how many vector iterations will be 
performed due to the runtime peeling. I will however, not reject 
vectorizing the epilogue, when we don't know how much we are peeling.
>> +      eiters

>> +	= eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);

>> +

>> +      unsigned int ratio;

>> +      while (next_size < vector_sizes.length ()

>> +	     && !(constant_multiple_p (current_vector_size,

>> +				       vector_sizes[next_size], &ratio)

>> +		  && eiters >= lowest_vf / ratio))

>> +	next_size += 1;

>> +

>> +      if (next_size == vector_sizes.length ())

>> +	vect_epilogues = false;

>> +    }

>> +

>>     /* Prolog loop may be skipped.  */

>>     bool skip_prolog = (prolog_peeling != 0);

>>     /* Skip to epilog if scalar loop may be preferred.  It's only needed

>> -     when we peel for epilog loop and when it hasn't been checked with

>> -     loop versioning.  */

>> +     when we peel for epilog loop or when we loop version.  */

>>     bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)

>>   		      ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo),

>>   				  bound_prolog + bound_epilog)

>> -		      : !LOOP_REQUIRES_VERSIONING (loop_vinfo));

>> +		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)

>> +			 || vect_epilogues));

> 

> The comment update looks wrong here: without epilogues, we don't need

> the skip when loop versioning, because loop versioning ensures that we

> have at least one vector iteration.

> 

> (I think "it" was supposed to mean "skipping to the epilogue" rather

> than the epilogue loop itself, in case that's the confusion.)

> 

> It'd be good to mention the epilogue condition in the comment too.

> 


Rewrote comment, hopefully this now better reflects reality.

>> +

>> +  if (vect_epilogues)

>> +    {

>> +      epilog->aux = epilogue_vinfo;

>> +      LOOP_VINFO_LOOP (epilogue_vinfo) = epilog;

>> +

>> +      loop_constraint_clear (epilog, LOOP_C_INFINITE);

>> +

>> +      /* We now must calculate the number of iterations for our epilogue.  */

>> +      tree cond_niters, niters;

>> +

>> +      /* Depending on whether we peel for gaps we take niters or niters - 1,

>> +	 we will refer to this as N - G, where N and G are the NITERS and

>> +	 GAP for the original loop.  */

>> +      niters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)

>> +	? LOOP_VINFO_NITERSM1 (loop_vinfo)

>> +	: LOOP_VINFO_NITERS (loop_vinfo);

>> +

>> +      /* Here we build a vector factorization mask:

>> +	 vf_mask = ~(VF - 1), where VF is the Vectorization Factor.  */

>> +      tree vf_mask = build_int_cst (TREE_TYPE (niters),

>> +				    LOOP_VINFO_VECT_FACTOR (loop_vinfo));

>> +      vf_mask = fold_build2 (MINUS_EXPR, TREE_TYPE (vf_mask),

>> +			     vf_mask,

>> +			     build_one_cst (TREE_TYPE (vf_mask)));

>> +      vf_mask = fold_build1 (BIT_NOT_EXPR, TREE_TYPE (niters), vf_mask);

>> +

>> +      /* Here we calculate:

>> +	 niters = N - ((N-G) & ~(VF -1)) */

>> +      niters = fold_build2 (MINUS_EXPR, TREE_TYPE (niters),

>> +			    LOOP_VINFO_NITERS (loop_vinfo),

>> +			    fold_build2 (BIT_AND_EXPR, TREE_TYPE (niters),

>> +					 niters,

>> +					 vf_mask));

> 

> Might be a daft question, sorry, but why does this need to be so

> complicated?  Couldn't we just use the final value of the main loop's

> IV to calculate how many iterations are left?

> 

> The current code wouldn't for example work for non-power-of-2 SVE vectors.

> vect_set_loop_condition_unmasked is structured to cope with that case

> (in length-agnostic mode only), even when an epilogue is needed.


Good call, as we discussed I changed my approach here. Rather than using 
a conditional expression to guard against skipping the main loop, I now 
use a phi-node to carry the IV.  This actually already exists, so I am 
duplicating here, but I didn't know what the best way was to "grab" this 
existing IV.


>> +     skipped when dealing with epilogues as we assume we already checked them

>> +     for the main loop.  So instead we always check the 'orig_loop_vinfo'.  */

>> +  if (LOOP_REQUIRES_VERSIONING (orig_loop_vinfo))

>>       {

>>         poly_uint64 niters_th = 0;

>>         unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);

>> @@ -2307,14 +2342,8 @@ again:

>>      be vectorized.  */

>>   opt_loop_vec_info

>>   vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,

>> -		   vec_info_shared *shared)

>> +		   vec_info_shared *shared, vector_sizes vector_sizes)

>>   {

>> -  auto_vector_sizes vector_sizes;

>> -

>> -  /* Autodetect first vector size we try.  */

>> -  current_vector_size = 0;

>> -  targetm.vectorize.autovectorize_vector_sizes (&vector_sizes,

>> -						loop->simdlen != 0);

>>     unsigned int next_size = 0;

>>   

>>     DUMP_VECT_SCOPE ("analyze_loop_nest");

>> @@ -2335,6 +2364,9 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,

>>     poly_uint64 autodetected_vector_size = 0;

>>     opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL);

>>     poly_uint64 first_vector_size = 0;

>> +  poly_uint64 lowest_th = 0;

>> +  unsigned vectorized_loops = 0;

>> +  bool vect_epilogues = !loop->simdlen && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK);

>>     while (1)

>>       {

>>         /* Check the CFG characteristics of the loop (nesting, entry/exit).  */

>> @@ -2353,24 +2385,52 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,

>>   

>>         if (orig_loop_vinfo)

>>   	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo;

>> +      else if (vect_epilogues && first_loop_vinfo)

>> +	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;

>>   

>>         opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts);

>>         if (res)

>>   	{

>>   	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;

>> +	  vectorized_loops++;

>>   

>> -	  if (loop->simdlen

>> -	      && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),

>> -			   (unsigned HOST_WIDE_INT) loop->simdlen))

>> +	  if ((loop->simdlen

>> +	       && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),

>> +			    (unsigned HOST_WIDE_INT) loop->simdlen))

>> +	      || vect_epilogues)

>>   	    {

>>   	      if (first_loop_vinfo == NULL)

>>   		{

>>   		  first_loop_vinfo = loop_vinfo;

>> +		  lowest_th

>> +		    = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo);

>>   		  first_vector_size = current_vector_size;

>>   		  loop->aux = NULL;

>>   		}

>>   	      else

>> -		delete loop_vinfo;

>> +		{

>> +		  /* Keep track of vector sizes that we know we can vectorize

>> +		     the epilogue with.  */

>> +		  if (vect_epilogues)

>> +		    {

>> +		      loop->aux = NULL;

>> +		      first_loop_vinfo->epilogue_vsizes.reserve (1);

>> +		      first_loop_vinfo->epilogue_vsizes.quick_push (current_vector_size);

>> +		      first_loop_vinfo->epilogue_vinfos.reserve (1);

>> +		      first_loop_vinfo->epilogue_vinfos.quick_push (loop_vinfo);

> 

> I've messed you around, sorry, but the patches I committed this weekend

> mean we now store the vector size in the loop_vinfo.  It'd be good to

> avoid a separate epilogue_vsizes array if possible.


Rebased. Actually quite happy with that, makes for a cleaner patch on my 
end :)
> 

>> +

>> +	  stmt_vinfo

>> +	    = epilogue_vinfo->lookup_stmt (orig_stmt);

> 

> Nit: fits one line.

> 

>> +

>> +	  STMT_VINFO_STMT (stmt_vinfo) = new_stmt;

>> +	  gimple_set_uid (new_stmt, gimple_uid (orig_stmt));

>> +

>> +	  mapping.put (gimple_phi_result (orig_stmt),

>> +			gimple_phi_result (new_stmt));

> 

> Nit: indented too far.

> 

>> +

>> +	  if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))

>> +	    pattern_worklist.safe_push (stmt_vinfo);

>> +

>> +	  related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);

>> +	  while (related_vinfo && related_vinfo != stmt_vinfo)

>> +	    {

>> +	      related_worklist.safe_push (related_vinfo);

>> +	      /* Set BB such that the assert in

>> +		'get_initial_def_for_reduction' is able to determine that

>> +		the BB of the related stmt is inside this loop.  */

>> +	      gimple_set_bb (STMT_VINFO_STMT (related_vinfo),

>> +			     gimple_bb (new_stmt));

>> +	      related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);

>> +	    }

>> +	}

>> +

>> +      for (epilogue_gsi = gsi_start_bb (epilogue_bbs[i]);

>> +	   !gsi_end_p (epilogue_gsi); gsi_next (&epilogue_gsi))

>> +	{

>> +	  orig_stmt = LOOP_VINFO_UP_STMTS (epilogue_vinfo)[0];

>> +	  LOOP_VINFO_UP_STMTS (epilogue_vinfo).ordered_remove (0);

>> +	  new_stmt = gsi_stmt (epilogue_gsi);

>> +

>> +	  stmt_vinfo

>> +	    = epilogue_vinfo->lookup_stmt (orig_stmt);

> 

> Fits on one line.

> 

>> +

>> +	  STMT_VINFO_STMT (stmt_vinfo) = new_stmt;

>> +	  gimple_set_uid (new_stmt, gimple_uid (orig_stmt));

>> +

>> +	  if (is_gimple_assign (orig_stmt))

>> +	    {

>> +	      gcc_assert (is_gimple_assign (new_stmt));

>> +	      mapping.put (gimple_assign_lhs (orig_stmt),

>> +			  gimple_assign_lhs (new_stmt));

>> +	    }

> 

> Why just assigns?  Don't we need to handle calls too?

> 

> Maybe just use gimple_get_lhs here.


Changed.
>> @@ -882,10 +886,35 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,

>>   		 LOCATION_FILE (vect_location.get_location_t ()),

>>   		 LOCATION_LINE (vect_location.get_location_t ()));

>>   

>> +  /* If this is an epilogue, we already know what vector sizes we will use for

>> +     vectorization as the analyzis was part of the main vectorized loop.  Use

>> +     these instead of going through all vector sizes again.  */

>> +  if (orig_loop_vinfo

>> +      && !LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo).is_empty ())

>> +    {

>> +      vector_sizes = LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo);

>> +      assert_versioning = LOOP_REQUIRES_VERSIONING (orig_loop_vinfo);

>> +      current_vector_size = vector_sizes[0];

>> +    }

>> +  else

>> +    {

>> +      /* Autodetect first vector size we try.  */

>> +      current_vector_size = 0;

>> +

>> +      targetm.vectorize.autovectorize_vector_sizes (&auto_vector_sizes,

>> +						    loop->simdlen != 0);

>> +      vector_sizes = auto_vector_sizes;

>> +    }

>> +

>>     /* Try to analyze the loop, retaining an opt_problem if dump_enabled_p.  */

>> -  opt_loop_vec_info loop_vinfo

>> -    = vect_analyze_loop (loop, orig_loop_vinfo, &shared);

>> -  loop->aux = loop_vinfo;

>> +  opt_loop_vec_info loop_vinfo = opt_loop_vec_info::success (NULL);

>> +  if (loop_vec_info_for_loop (loop))

>> +    loop_vinfo = opt_loop_vec_info::success (loop_vec_info_for_loop (loop));

>> +  else

>> +    {

>> +      loop_vinfo = vect_analyze_loop (loop, orig_loop_vinfo, &shared, vector_sizes);

>> +      loop->aux = loop_vinfo;

>> +    }

> 

> I don't really understand what this is doing for the epilogue case.

> Do we call vect_analyze_loop again?  Are vector_sizes[1:] significant

> for epilogues?


The vector sizes code here is no longer needed after your patch. The 
loop_vec_info is just checking whether loop already has one set (which 
is the case for epilogues) and use that, or if not then analyse it 
(which is the case for the first vectorization).  I'll add some comments.
> 

> Thanks,

> Richard

>
Andre Simoes Dias Vieira Oct. 25, 2019, 4:20 p.m. | #13
Hi,

This is the reworked patch after your comments.

I have moved the epilogue check into the analysis form disguised under 
'!epilogue_vinfos.is_empty ()'.  This because I realized that I am doing 
the "lowest threshold" check there.

The only place where we may reject an epilogue_vinfo is when we know the 
number of scalar iterations and we realize the number of iterations left 
after the main loop are not enough to enter the vectorized epilogue so 
we optimize away that code-gen.  The only way we know this to be true is 
if the number of scalar iterations are known and the peeling for 
alignment is known. So we know we will enter the main loop regardless, 
so whether the threshold we use is for a lower VF or not it shouldn't 
matter as much, I would even like to think that check isn't done, but I 
am not sure... Might be worth checking as an optimization.


Is this OK for trunk?

gcc/ChangeLog:
2019-10-25  Andre Vieira  <andre.simoesdiasvieira@arm.com>

     PR 88915
     * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.
     * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter
     and make the valueize function pointer also take a void pointer.
     * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap
     around vn_valueize, to call it without a context.
     (process_bb): Use vn_valueize_wrapper instead of vn_valueize.
     * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos.
     (~_loop_vec_info): Release epilogue_vinfos.
     (vect_analyze_loop_costing): Use knowledge of main VF to estimate
     number of iterations of epilogue.
     (vect_analyze_loop_2): Adapt to analyse main loop for all supported
     vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
     versioning threshold needed for main loop.
     (vect_analyze_loop): Likewise.
     (find_in_mapping): New helper function.
     (update_epilogue_loop_vinfo): New function.
     (vect_transform_loop): When vectorizing epilogues re-use analysis done
     on main loop and call update_epilogue_loop_vinfo to update it.
     * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
     stmts on loop preheader edge.
     (vect_do_peeling): Enable skip-vectors when doing loop versioning if
     we decided to vectorize epilogues.  Update epilogues NITERS and
     construct ADVANCE to update epilogues data references where needed.
     * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos.
     (vect_do_peeling, vect_update_inits_of_drs,
      determine_peel_for_niter, vect_analyze_loop): Add or update 
declarations.
     * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
     created loop_vec_info's for epilogues when available.  Otherwise 
analyse
     epilogue separately.



Cheers,
Andre
diff --git a/gcc/tree-ssa-loop-niter.h b/gcc/tree-ssa-loop-niter.h
index 4454c1ac78e02228047511a9e0214c82946855b8..aec6225125ce42ab0e4dbc930fc1a93862e6e267 100644
--- a/gcc/tree-ssa-loop-niter.h
+++ b/gcc/tree-ssa-loop-niter.h
@@ -53,7 +53,9 @@ extern bool scev_probably_wraps_p (tree, tree, tree, gimple *,
 				   class loop *, bool);
 extern void free_numbers_of_iterations_estimates (class loop *);
 extern void free_numbers_of_iterations_estimates (function *);
-extern tree simplify_replace_tree (tree, tree, tree, tree (*)(tree) = NULL);
+extern tree simplify_replace_tree (tree, tree,
+				   tree, tree (*)(tree, void *) = NULL,
+				   void * = NULL);
 extern void substitute_in_loop_info (class loop *, tree, tree);
 
 #endif /* GCC_TREE_SSA_LOOP_NITER_H */
diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index cd2ced369719c37afd4aac08ff360719d7702e42..db666f019808850ed3a4aeef1a454a7ae2c65ef2 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -1935,7 +1935,7 @@ number_of_iterations_cond (class loop *loop,
 
 tree
 simplify_replace_tree (tree expr, tree old, tree new_tree,
-		       tree (*valueize) (tree))
+		       tree (*valueize) (tree, void*), void *context)
 {
   unsigned i, n;
   tree ret = NULL_TREE, e, se;
@@ -1951,7 +1951,7 @@ simplify_replace_tree (tree expr, tree old, tree new_tree,
     {
       if (TREE_CODE (expr) == SSA_NAME)
 	{
-	  new_tree = valueize (expr);
+	  new_tree = valueize (expr, context);
 	  if (new_tree != expr)
 	    return new_tree;
 	}
@@ -1967,7 +1967,7 @@ simplify_replace_tree (tree expr, tree old, tree new_tree,
   for (i = 0; i < n; i++)
     {
       e = TREE_OPERAND (expr, i);
-      se = simplify_replace_tree (e, old, new_tree, valueize);
+      se = simplify_replace_tree (e, old, new_tree, valueize, context);
       if (e == se)
 	continue;
 
diff --git a/gcc/tree-ssa-sccvn.c b/gcc/tree-ssa-sccvn.c
index 57331ab44dc78c16d97065cd28e8c4cdcbf8d96e..0abe3fb8453ecf2e25ff55c5c9846663f68f7c8c 100644
--- a/gcc/tree-ssa-sccvn.c
+++ b/gcc/tree-ssa-sccvn.c
@@ -309,6 +309,10 @@ static vn_tables_t valid_info;
 /* Valueization hook.  Valueize NAME if it is an SSA name, otherwise
    just return it.  */
 tree (*vn_valueize) (tree);
+tree vn_valueize_wrapper (tree t, void* context ATTRIBUTE_UNUSED)
+{
+  return vn_valueize (t);
+}
 
 
 /* This represents the top of the VN lattice, which is the universal
@@ -6407,7 +6411,7 @@ process_bb (rpo_elim &avail, basic_block bb,
       if (bb->loop_father->nb_iterations)
 	bb->loop_father->nb_iterations
 	  = simplify_replace_tree (bb->loop_father->nb_iterations,
-				   NULL_TREE, NULL_TREE, vn_valueize);
+				   NULL_TREE, NULL_TREE, &vn_valueize_wrapper);
     }
 
   /* Value-number all defs in the basic-block.  */
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index b3246bc7a099e491e5c2fd32835dc5c848931d0a..dffb40ec9999a0363e53b1748af2fdcf270710ff 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1726,7 +1726,7 @@ vect_update_init_of_dr (struct data_reference *dr, tree niters, tree_code code)
    Apply vect_update_inits_of_dr to all accesses in LOOP_VINFO.
    CODE and NITERS are as for vect_update_inits_of_dr.  */
 
-static void
+void
 vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
 			  tree_code code)
 {
@@ -1736,21 +1736,12 @@ vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
 
   DUMP_VECT_SCOPE ("vect_update_inits_of_dr");
 
-  /* Adjust niters to sizetype and insert stmts on loop preheader edge.  */
+  /* Adjust niters to sizetype.  We used to insert the stmts on loop preheader
+     here, but since we might use these niters to update the epilogues niters
+     and data references we can't insert them here as this definition might not
+     always dominate its uses.  */
   if (!types_compatible_p (sizetype, TREE_TYPE (niters)))
-    {
-      gimple_seq seq;
-      edge pe = loop_preheader_edge (LOOP_VINFO_LOOP (loop_vinfo));
-      tree var = create_tmp_var (sizetype, "prolog_loop_adjusted_niters");
-
-      niters = fold_convert (sizetype, niters);
-      niters = force_gimple_operand (niters, &seq, false, var);
-      if (seq)
-	{
-	  basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe, seq);
-	  gcc_assert (!new_bb);
-	}
-    }
+    niters = fold_convert (sizetype, niters);
 
   FOR_EACH_VEC_ELT (datarefs, i, dr)
     {
@@ -2393,7 +2384,22 @@ slpeel_update_phi_nodes_for_lcssa (class loop *epilog)
 
    Note this function peels prolog and epilog only if it's necessary,
    as well as guards.
-   Returns created epilogue or NULL.
+   This function returns the epilogue loop if a decision was made to vectorize
+   it, otherwise NULL.
+
+   The analysis resulting in this epilogue loop's loop_vec_info was performed
+   in the same vect_analyze_loop call as the main loop's.  At that time
+   vect_analyze_loop constructs a list of accepted loop_vec_info's for lower
+   vectorization factors than the main loop.  This list is stored in the main
+   loop's loop_vec_info in the 'epilogue_vinfos' member.  Everytime we decide to
+   vectorize the epilogue loop for a lower vectorization factor,  the
+   loop_vec_info sitting at the top of the epilogue_vinfos list is removed,
+   updated and linked to the epilogue loop.  This is later used to vectorize
+   the epilogue.  The reason the loop_vec_info needs updating is that it was
+   constructed based on the original main loop, and the epilogue loop is a
+   copy of this loop, so all links pointing to statements in the original loop
+   need updating.  Furthermore, these loop_vec_infos share the
+   data_reference's records, which will also need to be updated.
 
    TODO: Guard for prefer_scalar_loop should be emitted along with
    versioning conditions if loop versioning is needed.  */
@@ -2403,7 +2409,8 @@ class loop *
 vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 		 tree *niters_vector, tree *step_vector,
 		 tree *niters_vector_mult_vf_var, int th,
-		 bool check_profitability, bool niters_no_overflow)
+		 bool check_profitability, bool niters_no_overflow,
+		 tree *advance, drs_init_vec &orig_drs_init)
 {
   edge e, guard_e;
   tree type = TREE_TYPE (niters), guard_cond;
@@ -2411,6 +2418,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   profile_probability prob_prolog, prob_vector, prob_epilog;
   int estimated_vf;
   int prolog_peeling = 0;
+  bool vect_epilogues = loop_vinfo->epilogue_vinfos.length () > 0;
   /* We currently do not support prolog peeling if the target alignment is not
      known at compile time.  'vect_gen_prolog_loop_niters' depends on the
      target alignment being constant.  */
@@ -2464,19 +2472,73 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   int bound_prolog = 0;
   if (prolog_peeling)
     niters_prolog = vect_gen_prolog_loop_niters (loop_vinfo, anchor,
-						 &bound_prolog);
+						  &bound_prolog);
   else
     niters_prolog = build_int_cst (type, 0);
 
+  loop_vec_info epilogue_vinfo = NULL;
+  if (vect_epilogues)
+    {
+      epilogue_vinfo = loop_vinfo->epilogue_vinfos[0];
+      loop_vinfo->epilogue_vinfos.ordered_remove (0);
+    }
+
+  tree niters_vector_mult_vf = NULL_TREE;
+  /* Saving NITERs before the loop, as this may be changed by prologue.  */
+  tree before_loop_niters = LOOP_VINFO_NITERS (loop_vinfo);
+  edge update_e = NULL, skip_e = NULL;
+  unsigned int lowest_vf = constant_lower_bound (vf);
+  /* If we know the number of scalar iterations for the main loop we should
+     check whether after the main loop there are enough iterations left over
+     for the epilogue.  */
+  if (vect_epilogues
+      && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+      && prolog_peeling >= 0
+      && known_eq (vf, lowest_vf))
+    {
+      unsigned HOST_WIDE_INT eiters
+	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
+	   - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
+
+      eiters -= prolog_peeling;
+      eiters
+	= eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
+
+      unsigned int ratio;
+      while (!(constant_multiple_p (loop_vinfo->vector_size,
+				    epilogue_vinfo->vector_size, &ratio)
+	       && eiters >= lowest_vf / ratio))
+	{
+	  delete epilogue_vinfo;
+	  epilogue_vinfo = NULL;
+	  if (loop_vinfo->epilogue_vinfos.length () == 0)
+	    {
+	      vect_epilogues = false;
+	      break;
+	    }
+	  epilogue_vinfo = loop_vinfo->epilogue_vinfos[0];
+	  loop_vinfo->epilogue_vinfos.ordered_remove (0);
+	}
+    }
   /* Prolog loop may be skipped.  */
   bool skip_prolog = (prolog_peeling != 0);
-  /* Skip to epilog if scalar loop may be preferred.  It's only needed
-     when we peel for epilog loop and when it hasn't been checked with
-     loop versioning.  */
+  /* Skip this loop to epilog when there are not enough iterations to enter this
+     vectorized loop.  If true we should perform runtime checks on the NITERS
+     to check whether we should skip the current vectorized loop.  If we know
+     the number of scalar iterations we may choose to add a runtime check if
+     this number "maybe" smaller than the number of iterations required
+     when we know the number of scalar iterations may potentially
+     be smaller than the number of iterations required to enter this loop, for
+     this we use the upper bounds on the prolog and epilog peeling.  When we
+     don't know the number of iterations and don't require versioning it is
+     because we have asserted that there are enough scalar iterations to enter
+     the main loop, so this skip is not necessary.  When we are versioning then
+     we only add such a skip if we have chosen to vectorize the epilogue.  */
   bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
 		      ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo),
 				  bound_prolog + bound_epilog)
-		      : !LOOP_REQUIRES_VERSIONING (loop_vinfo));
+		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+			 || vect_epilogues));
   /* Epilog loop must be executed if the number of iterations for epilog
      loop is known at compile time, otherwise we need to add a check at
      the end of vector loop and skip to the end of epilog loop.  */
@@ -2506,6 +2568,12 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 
   dump_user_location_t loop_loc = find_loop_location (loop);
   class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
+  if (vect_epilogues)
+    /* Make sure to set the epilogue's epilogue scalar loop, such that we can
+       use the original scalar loop as remaining epilogue if necessary.  */
+    LOOP_VINFO_SCALAR_LOOP (epilogue_vinfo)
+      = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
+
   if (prolog_peeling)
     {
       e = loop_preheader_edge (loop);
@@ -2552,6 +2620,15 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	  scale_bbs_frequencies (&bb_after_prolog, 1, prob_prolog);
 	  scale_loop_profile (prolog, prob_prolog, bound_prolog);
 	}
+
+      /* Save original inits for each data_reference before advancing them with
+	 NITERS_PROLOG.  */
+      unsigned int i;
+      struct data_reference *dr;
+      vec<data_reference_p> datarefs = loop_vinfo->shared->datarefs;
+      FOR_EACH_VEC_ELT (datarefs, i, dr)
+	orig_drs_init.safe_push (std::make_pair (dr, DR_OFFSET (dr)));
+
       /* Update init address of DRs.  */
       vect_update_inits_of_drs (loop_vinfo, niters_prolog, PLUS_EXPR);
       /* Update niters for vector loop.  */
@@ -2586,8 +2663,15 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 			   "loop can't be duplicated to exit edge.\n");
 	  gcc_unreachable ();
 	}
-      /* Peel epilog and put it on exit edge of loop.  */
-      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e);
+      /* Peel epilog and put it on exit edge of loop.  If we are vectorizing
+	 said epilog then we should use a copy of the main loop as a starting
+	 point.  This loop may have already had some preliminary transformations
+	 to allow for more optimal vectorization, for example if-conversion.
+	 If we are not vectorizing the epilog then we should use the scalar loop
+	 as the transformations mentioned above make less or no sense when not
+	 vectorizing.  */
+      epilog = vect_epilogues ? get_loop_copy (loop) : scalar_loop;
+      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e);
       if (!epilog)
 	{
 	  dump_printf_loc (MSG_MISSED_OPTIMIZATION, loop_loc,
@@ -2616,6 +2700,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 					   guard_to, guard_bb,
 					   prob_vector.invert (),
 					   irred_flag);
+	  skip_e = guard_e;
 	  e = EDGE_PRED (guard_to, 0);
 	  e = (e != guard_e ? e : EDGE_PRED (guard_to, 1));
 	  slpeel_update_phi_nodes_for_guard1 (first_loop, epilog, guard_e, e);
@@ -2637,7 +2722,6 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	}
 
       basic_block bb_before_epilog = loop_preheader_edge (epilog)->src;
-      tree niters_vector_mult_vf;
       /* If loop is peeled for non-zero constant times, now niters refers to
 	 orig_niters - prolog_peeling, it won't overflow even the orig_niters
 	 overflows.  */
@@ -2660,7 +2744,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
       /* Update IVs of original loop as if they were advanced by
 	 niters_vector_mult_vf steps.  */
       gcc_checking_assert (vect_can_advance_ivs_p (loop_vinfo));
-      edge update_e = skip_vector ? e : loop_preheader_edge (epilog);
+      update_e = skip_vector ? e : loop_preheader_edge (epilog);
       vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf,
 					update_e);
 
@@ -2701,10 +2785,75 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
       adjust_vec_debug_stmts ();
       scev_reset ();
     }
+
+  if (vect_epilogues)
+    {
+      epilog->aux = epilogue_vinfo;
+      LOOP_VINFO_LOOP (epilogue_vinfo) = epilog;
+
+      loop_constraint_clear (epilog, LOOP_C_INFINITE);
+
+      /* We now must calculate the number of NITERS performed by the previous
+	 loop and EPILOGUE_NITERS to be performed by the epilogue.  */
+      tree niters = fold_build2 (PLUS_EXPR, TREE_TYPE (niters_vector_mult_vf),
+				 niters_prolog, niters_vector_mult_vf);
+
+      /* If skip_vector we may skip the previous loop, we insert a phi-node to
+	 determine whether we are coming from the previous vectorized loop
+	 using the update_e edge or the skip_vector basic block using the
+	 skip_e edge.  */
+      if (skip_vector)
+	{
+	  gcc_assert (update_e != NULL && skip_e != NULL);
+	  gphi *new_phi = create_phi_node (make_ssa_name (TREE_TYPE (niters)),
+					   update_e->dest);
+	  tree new_ssa = make_ssa_name (TREE_TYPE (niters));
+	  gimple *stmt = gimple_build_assign (new_ssa, niters);
+	  gimple_stmt_iterator gsi;
+	  if (TREE_CODE (niters_vector_mult_vf) == SSA_NAME
+	      && SSA_NAME_DEF_STMT (niters_vector_mult_vf)->bb != NULL)
+	    {
+	      gsi = gsi_for_stmt (SSA_NAME_DEF_STMT (niters_vector_mult_vf));
+	      gsi_insert_after (&gsi, stmt, GSI_NEW_STMT);
+	    }
+	  else
+	    {
+	      gsi = gsi_last_bb (update_e->src);
+	      gsi_insert_before (&gsi, stmt, GSI_NEW_STMT);
+	    }
+
+	  niters = new_ssa;
+	  add_phi_arg (new_phi, niters, update_e, UNKNOWN_LOCATION);
+	  add_phi_arg (new_phi, build_zero_cst (TREE_TYPE (niters)), skip_e,
+		       UNKNOWN_LOCATION);
+	  niters = PHI_RESULT (new_phi);
+	}
+
+      /* Subtract the number of iterations performed by the vectorized loop
+	 from the number of total iterations.  */
+      tree epilogue_niters = fold_build2 (MINUS_EXPR, TREE_TYPE (niters),
+					  before_loop_niters,
+					  niters);
+
+      LOOP_VINFO_NITERS (epilogue_vinfo) = epilogue_niters;
+      LOOP_VINFO_NITERSM1 (epilogue_vinfo)
+	= fold_build2 (MINUS_EXPR, TREE_TYPE (epilogue_niters),
+		       epilogue_niters,
+		       build_one_cst (TREE_TYPE (epilogue_niters)));
+
+      /* Set ADVANCE to the number of iterations performed by the previous
+	 loop and its prologue.  */
+      *advance = niters;
+
+      /* Redo the peeling for niter analysis as the NITERs and alignment
+	 may have been updated to take the main loop into account.  */
+      determine_peel_for_niter (epilogue_vinfo);
+    }
+
   adjust_vec.release ();
   free_original_copy_tables ();
 
-  return epilog;
+  return vect_epilogues ? epilog : NULL;
 }
 
 /* Function vect_create_cond_for_niters_checks.
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index e22d2dd7abbf43aa0c8707b9422b90612188ad2a..f7f471b98189698dcad1f8d56313314b094f4035 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -885,6 +885,8 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
 	    }
 	}
     }
+
+  epilogue_vinfos.create (6);
 }
 
 /* Free all levels of MASKS.  */
@@ -909,6 +911,7 @@ _loop_vec_info::~_loop_vec_info ()
   release_vec_loop_masks (&masks);
   delete ivexpr_map;
   delete scan_map;
+  epilogue_vinfos.release ();
 
   loop->aux = NULL;
 }
@@ -1682,9 +1685,20 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
       return 0;
     }
 
-  HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop);
-  if (estimated_niter == -1)
-    estimated_niter = likely_max_stmt_executions_int (loop);
+  HOST_WIDE_INT estimated_niter;
+
+  /* If we are vectorizing an epilogue then we know the maximum number of
+     scalar iterations it will cover is at least one lower than the
+     vectorization factor of the main loop.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    estimated_niter
+      = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1;
+  else
+    {
+      estimated_niter = estimated_stmt_executions_int (loop);
+      if (estimated_niter == -1)
+	estimated_niter = likely_max_stmt_executions_int (loop);
+    }
   if (estimated_niter != -1
       && ((unsigned HOST_WIDE_INT) estimated_niter
 	  < MAX (th, (unsigned) min_profitable_estimate)))
@@ -1871,6 +1885,15 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
   int res;
   unsigned int max_vf = MAX_VECTORIZATION_FACTOR;
   poly_uint64 min_vf = 2;
+  loop_vec_info orig_loop_vinfo = NULL;
+
+  /* If we are dealing with an epilogue then orig_loop_vinfo points to the
+     loop_vec_info of the first vectorized loop.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
+  else
+    orig_loop_vinfo = loop_vinfo;
+  gcc_assert (orig_loop_vinfo);
 
   /* The first group of checks is independent of the vector size.  */
   fatal = true;
@@ -2150,8 +2173,18 @@ start_over:
   /* During peeling, we need to check if number of loop iterations is
      enough for both peeled prolog loop and vector loop.  This check
      can be merged along with threshold check of loop versioning, so
-     increase threshold for this case if necessary.  */
-  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
+     increase threshold for this case if necessary.
+
+     If we are analyzing an epilogue we still want to check what its
+     versioning threshold would be.  If we decide to vectorize the epilogues we
+     will want to use the lowest versioning threshold of all epilogues and main
+     loop.  This will enable us to enter a vectorized epilogue even when
+     versioning the loop.  We can't simply check whether the epilogue requires
+     versioning though since we may have skipped some versioning checks when
+     analyzing the epilogue.  For instance, checks for alias versioning will be
+     skipped when dealing with epilogues as we assume we already checked them
+     for the main loop.  So instead we always check the 'orig_loop_vinfo'.  */
+  if (LOOP_REQUIRES_VERSIONING (orig_loop_vinfo))
     {
       poly_uint64 niters_th = 0;
       unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
@@ -2344,6 +2377,14 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
   poly_uint64 autodetected_vector_size = 0;
   opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL);
   poly_uint64 next_vector_size = 0;
+  poly_uint64 lowest_th = 0;
+  unsigned vectorized_loops = 0;
+
+  /* Only vectorize epilogues if PARAM_VECT_EPILOGUES_NOMASK is enabled, this
+     is not a simd loop and it is the most inner loop.  */
+  bool vect_epilogues
+    = !loop->simdlen && loop->inner == NULL
+      && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK);
   while (1)
     {
       /* Check the CFG characteristics of the loop (nesting, entry/exit).  */
@@ -2363,6 +2404,8 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
 
       if (orig_loop_vinfo)
 	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo;
+      else if (vect_epilogues && first_loop_vinfo)
+	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
 
       opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts);
       if (next_size == 0)
@@ -2371,18 +2414,43 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
       if (res)
 	{
 	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
+	  vectorized_loops++;
 
-	  if (loop->simdlen
-	      && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
-			   (unsigned HOST_WIDE_INT) loop->simdlen))
+	  if ((loop->simdlen
+	       && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+			    (unsigned HOST_WIDE_INT) loop->simdlen))
+	      || vect_epilogues)
 	    {
 	      if (first_loop_vinfo == NULL)
 		{
 		  first_loop_vinfo = loop_vinfo;
+		  lowest_th
+		    = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo);
 		  loop->aux = NULL;
 		}
 	      else
-		delete loop_vinfo;
+		{
+		  /* Keep track of vector sizes that we know we can vectorize
+		     the epilogue with.  Only vectorize first epilogue.  */
+		  if (vect_epilogues
+		      && first_loop_vinfo->epilogue_vinfos.is_empty ())
+		    {
+		      loop->aux = NULL;
+		      first_loop_vinfo->epilogue_vinfos.reserve (1);
+		      first_loop_vinfo->epilogue_vinfos.quick_push (loop_vinfo);
+		      LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
+		      poly_uint64 th
+			= LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
+		      gcc_assert (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+				  || maybe_ne (lowest_th, 0U));
+		      /* Keep track of the known smallest versioning
+			 threshold.  */
+		      if (ordered_p (lowest_th, th))
+			lowest_th = ordered_min (lowest_th, th);
+		    }
+		  else
+		    delete loop_vinfo;
+		}
 	    }
 	  else
 	    {
@@ -2416,6 +2484,8 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
 		  dump_dec (MSG_NOTE, first_loop_vinfo->vector_size);
 		  dump_printf (MSG_NOTE, "\n");
 		}
+	      LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo) = lowest_th;
+
 	      return first_loop_vinfo;
 	    }
 	  else
@@ -7925,6 +7995,211 @@ vect_transform_loop_stmt (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
     *seen_store = stmt_info;
 }
 
+/* Helper function to pass to simplify_replace_tree to enable replacing tree's
+   in the hash_map with its corresponding values.  */
+
+static tree
+find_in_mapping (tree t, void *context)
+{
+  hash_map<tree,tree>* mapping = (hash_map<tree, tree>*) context;
+
+  tree *value = mapping->get (t);
+  return value ? *value : t;
+}
+
+/* Update EPILOGUE's loop_vec_info.  EPILOGUE was constructed as a copy of the
+   original loop that has now been vectorized.
+
+   The inits of the data_references need to be advanced with the number of
+   iterations of the main loop.  This has been computed in vect_do_peeling and
+   is stored in parameter ADVANCE.  We first restore the data_references
+   initial offset with the values recored in ORIG_DRS_INIT.
+
+   Since the loop_vec_info of this EPILOGUE was constructed for the original
+   loop, its stmt_vec_infos all point to the original statements.  These need
+   to be updated to point to their corresponding copies as well as the SSA_NAMES
+   in their PATTERN_DEF_SEQs and RELATED_STMTs.
+
+   The data_reference's connections also need to be updated.  Their
+   corresponding dr_vec_info need to be reconnected to the EPILOGUE's
+   stmt_vec_infos, their statements need to point to their corresponding copy,
+   if they are gather loads or scatter stores then their reference needs to be
+   updated to point to its corresponding copy and finally we set
+   'base_misaligned' to false as we have already peeled for alignment in the
+   prologue of the main loop.  */
+
+static void
+update_epilogue_loop_vinfo (class loop *epilogue, tree advance,
+			    drs_init_vec &orig_drs_init)
+{
+  loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue);
+  auto_vec<stmt_vec_info> pattern_worklist, related_worklist;
+  hash_map<tree,tree> mapping;
+  gimple *orig_stmt, *new_stmt;
+  gimple_stmt_iterator epilogue_gsi;
+  gphi_iterator epilogue_phi_gsi;
+  stmt_vec_info stmt_vinfo = NULL, related_vinfo;
+  basic_block *epilogue_bbs = get_loop_body (epilogue);
+
+  LOOP_VINFO_BBS (epilogue_vinfo) = epilogue_bbs;
+
+  /* Restore original data_reference's offset, before the previous loop and its
+     prologue.  */
+  std::pair<data_reference*, tree> *dr_init;
+  unsigned i;
+  for (i = 0; orig_drs_init.iterate (i, &dr_init); i++)
+    DR_OFFSET (dr_init->first) = dr_init->second;
+
+  /* Advance data_reference's with the number of iterations of the previous
+     loop and its prologue.  */
+  vect_update_inits_of_drs (epilogue_vinfo, advance, PLUS_EXPR);
+
+
+  /* The EPILOGUE loop is a copy of the original loop so they share the same
+     gimple UIDs.  In this loop we update the loop_vec_info of the EPILOGUE to
+     point to the copied statements.  We also create a mapping of all LHS' in
+     the original loop and all the LHS' in the EPILOGUE and create worklists to
+     update teh STMT_VINFO_PATTERN_DEF_SEQs and STMT_VINFO_RELATED_STMTs.  */
+  for (unsigned i = 0; i < epilogue->num_nodes; ++i)
+    {
+      for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]);
+	   !gsi_end_p (epilogue_phi_gsi); gsi_next (&epilogue_phi_gsi))
+	{
+	  new_stmt = epilogue_phi_gsi.phi ();
+
+	  gcc_assert (gimple_uid (new_stmt) > 0);
+	  stmt_vinfo
+	    = epilogue_vinfo->stmt_vec_infos[gimple_uid (new_stmt) - 1];
+
+	  orig_stmt = STMT_VINFO_STMT (stmt_vinfo);
+	  STMT_VINFO_STMT (stmt_vinfo) = new_stmt;
+
+	  mapping.put (gimple_phi_result (orig_stmt),
+		       gimple_phi_result (new_stmt));
+
+	  if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
+	    pattern_worklist.safe_push (stmt_vinfo);
+
+	  related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+	  while (related_vinfo && related_vinfo != stmt_vinfo)
+	    {
+	      related_worklist.safe_push (related_vinfo);
+	      /* Set BB such that the assert in
+		'get_initial_def_for_reduction' is able to determine that
+		the BB of the related stmt is inside this loop.  */
+	      gimple_set_bb (STMT_VINFO_STMT (related_vinfo),
+			     gimple_bb (new_stmt));
+	      related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);
+	    }
+	}
+
+      for (epilogue_gsi = gsi_start_bb (epilogue_bbs[i]);
+	   !gsi_end_p (epilogue_gsi); gsi_next (&epilogue_gsi))
+	{
+	  new_stmt = gsi_stmt (epilogue_gsi);
+
+	  gcc_assert (gimple_uid (new_stmt) > 0);
+	  stmt_vinfo
+	    = epilogue_vinfo->stmt_vec_infos[gimple_uid (new_stmt) - 1];
+
+	  orig_stmt = STMT_VINFO_STMT (stmt_vinfo);
+	  STMT_VINFO_STMT (stmt_vinfo) = new_stmt;
+
+	  if (tree old_lhs = gimple_get_lhs (orig_stmt))
+	    mapping.put (old_lhs, gimple_get_lhs (new_stmt));
+
+	  if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
+	    pattern_worklist.safe_push (stmt_vinfo);
+
+	  related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+	  while (related_vinfo && related_vinfo != stmt_vinfo)
+	    {
+	      related_worklist.safe_push (related_vinfo);
+	      /* Set BB such that the assert in
+		'get_initial_def_for_reduction' is able to determine that
+		the BB of the related stmt is inside this loop.  */
+	      gimple_set_bb (STMT_VINFO_STMT (related_vinfo),
+			     gimple_bb (new_stmt));
+	      related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);
+	    }
+	}
+    }
+
+  /* The PATTERN_DEF_SEQs in the epilogue were constructed using the
+     original main loop and thus need to be updated to refer to the cloned
+     variables used in the epilogue.  */
+  for (unsigned i = 0; i < pattern_worklist.length (); ++i)
+    {
+      gimple_seq seq = STMT_VINFO_PATTERN_DEF_SEQ (pattern_worklist[i]);
+      tree *new_op;
+
+      while (seq)
+	{
+	  for (unsigned j = 1; j < gimple_num_ops (seq); ++j)
+	    {
+	      tree op = gimple_op (seq, j);
+	      if ((new_op = mapping.get(op)))
+		gimple_set_op (seq, j, *new_op);
+	      else
+		{
+		  op = simplify_replace_tree (op, NULL_TREE, NULL_TREE,
+					 &find_in_mapping, &mapping);
+		  gimple_set_op (seq, j, op);
+		}
+	    }
+	  seq = seq->next;
+	}
+    }
+
+  /* Just like the PATTERN_DEF_SEQs the RELATED_STMTs also need to be
+     updated.  */
+  for (unsigned i = 0; i < related_worklist.length (); ++i)
+    {
+      gimple * stmt = STMT_VINFO_STMT (related_worklist[i]);
+      for (unsigned j = 1; j < gimple_num_ops (stmt); ++j)
+	if (tree *new_t = mapping.get(gimple_op (stmt, j)))
+	  gimple_set_op (stmt, j, *new_t);
+    }
+
+  struct data_reference *dr;
+  vec<data_reference_p> datarefs = epilogue_vinfo->shared->datarefs;
+  FOR_EACH_VEC_ELT (datarefs, i, dr)
+    {
+      orig_stmt = DR_STMT (dr);
+      gcc_assert (gimple_uid (orig_stmt) > 0);
+      stmt_vinfo = epilogue_vinfo->stmt_vec_infos[gimple_uid (orig_stmt) - 1];
+      /* Data references for gather loads and scatter stores do not use the
+	 updated offset we set using ADVANCE.  Instead we have to make sure the
+	 reference in the data references point to the corresponding copy of
+	 the original in the epilogue.  */
+      if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo))
+	{
+	  int j;
+	  if (TREE_CODE (DR_REF (dr)) == MEM_REF)
+	    j = 0;
+	  else if (TREE_CODE (DR_REF (dr)) == ARRAY_REF)
+	    j = 1;
+	  else
+	    gcc_unreachable ();
+
+	  if (tree *new_op = mapping.get (TREE_OPERAND (DR_REF (dr), j)))
+	    {
+	      DR_REF (dr) = unshare_expr (DR_REF (dr));
+	      TREE_OPERAND (DR_REF (dr), j) = *new_op;
+	    }
+	}
+      DR_STMT (dr) = STMT_VINFO_STMT (stmt_vinfo);
+      stmt_vinfo->dr_aux.stmt = stmt_vinfo;
+      /* The vector size of the epilogue is smaller than that of the main loop
+	 so the alignment is either the same or lower. This means the dr will
+	 thus by definition be aligned.  */
+      STMT_VINFO_DR_INFO (stmt_vinfo)->base_misaligned = false;
+    }
+
+  epilogue_vinfo->shared->datarefs_copy.release ();
+  epilogue_vinfo->shared->save_datarefs ();
+}
+
 /* Function vect_transform_loop.
 
    The analysis phase has determined that the loop is vectorizable.
@@ -7962,11 +8237,11 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   if (th >= vect_vf_for_cost (loop_vinfo)
       && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_NOTE, vect_location,
-			 "Profitability threshold is %d loop iterations.\n",
-                         th);
-      check_profitability = true;
+	if (dump_enabled_p ())
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "Profitability threshold is %d loop iterations.\n",
+			   th);
+	check_profitability = true;
     }
 
   /* Make sure there exists a single-predecessor exit bb.  Do this before 
@@ -8010,9 +8285,14 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo) = niters;
   tree nitersm1 = unshare_expr (LOOP_VINFO_NITERSM1 (loop_vinfo));
   bool niters_no_overflow = loop_niters_no_overflow (loop_vinfo);
+  tree advance;
+  drs_init_vec orig_drs_init;
+
   epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector,
 			      &step_vector, &niters_vector_mult_vf, th,
-			      check_profitability, niters_no_overflow);
+			      check_profitability, niters_no_overflow,
+			      &advance, orig_drs_init);
+
   if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo)
       && LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo).initialized_p ())
     scale_loop_frequencies (LOOP_VINFO_SCALAR_LOOP (loop_vinfo),
@@ -8271,57 +8551,14 @@ vect_transform_loop (loop_vec_info loop_vinfo)
      since vectorized loop can have loop-carried dependencies.  */
   loop->safelen = 0;
 
-  /* Don't vectorize epilogue for epilogue.  */
-  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
-    epilogue = NULL;
-
-  if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK))
-    epilogue = NULL;
-
   if (epilogue)
     {
-      auto_vector_sizes vector_sizes;
-      targetm.vectorize.autovectorize_vector_sizes (&vector_sizes, false);
-      unsigned int next_size = 0;
-
-      /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work
-         on niters already ajusted for the iterations of the prologue.  */
-      if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
-	  && known_eq (vf, lowest_vf))
-	{
-	  unsigned HOST_WIDE_INT eiters
-	    = (LOOP_VINFO_INT_NITERS (loop_vinfo)
-	       - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
-	  eiters
-	    = eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
-	  epilogue->nb_iterations_upper_bound = eiters - 1;
-	  epilogue->any_upper_bound = true;
-
-	  unsigned int ratio;
-	  while (next_size < vector_sizes.length ()
-		 && !(constant_multiple_p (loop_vinfo->vector_size,
-					   vector_sizes[next_size], &ratio)
-		      && eiters >= lowest_vf / ratio))
-	    next_size += 1;
-	}
-      else
-	while (next_size < vector_sizes.length ()
-	       && maybe_lt (loop_vinfo->vector_size, vector_sizes[next_size]))
-	  next_size += 1;
-
-      if (next_size == vector_sizes.length ())
-	epilogue = NULL;
-    }
+      update_epilogue_loop_vinfo (epilogue, advance, orig_drs_init);
 
-  if (epilogue)
-    {
+      epilogue->simduid = loop->simduid;
       epilogue->force_vectorize = loop->force_vectorize;
       epilogue->safelen = loop->safelen;
       epilogue->dont_vectorize = false;
-
-      /* We may need to if-convert epilogue to vectorize it.  */
-      if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
-	tree_if_conversion (epilogue);
     }
 
   return epilogue;
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 56be28b0cc5a77412f996e70636b08d5b615813e..71b5f380e2c91a7a551f6e26920bb17809abedf0 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -26,6 +26,7 @@ typedef class _stmt_vec_info *stmt_vec_info;
 #include "tree-data-ref.h"
 #include "tree-hash-traits.h"
 #include "target.h"
+#include <utility>
 
 /* Used for naming of new temporaries.  */
 enum vect_var_kind {
@@ -456,6 +457,8 @@ struct rgroup_masks {
 
 typedef auto_vec<rgroup_masks> vec_loop_masks;
 
+typedef auto_vec<std::pair<data_reference*, tree> > drs_init_vec;
+
 /*-----------------------------------------------------------------*/
 /* Info on vectorized loops.                                       */
 /*-----------------------------------------------------------------*/
@@ -639,6 +642,10 @@ public:
      this points to the original vectorized loop.  Otherwise NULL.  */
   _loop_vec_info *orig_loop_info;
 
+  /* Used to store loop_vec_infos of epilogues of this loop during
+     analysis.  */
+  vec<_loop_vec_info *> epilogue_vinfos;
+
 } *loop_vec_info;
 
 /* Access Functions.  */
@@ -1589,10 +1596,12 @@ class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
 						     class loop *, edge);
 class loop *vect_loop_versioning (loop_vec_info);
 extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
-				     tree *, tree *, tree *, int, bool, bool);
+				    tree *, tree *, tree *, int, bool, bool,
+				    tree *, drs_init_vec &);
 extern void vect_prepare_for_masked_peels (loop_vec_info);
 extern dump_user_location_t find_loop_location (class loop *);
 extern bool vect_can_advance_ivs_p (loop_vec_info);
+extern void vect_update_inits_of_drs (loop_vec_info, tree, tree_code);
 
 /* In tree-vect-stmts.c.  */
 extern tree get_vectype_for_scalar_type (vec_info *, tree);
@@ -1700,6 +1709,8 @@ extern tree vect_create_addr_base_for_vector_ref (stmt_vec_info, gimple_seq *,
 
 /* In tree-vect-loop.c.  */
 extern widest_int vect_iv_limit_for_full_masking (loop_vec_info loop_vinfo);
+/* Used in tree-vect-loop-manip.c */
+extern void determine_peel_for_niter (loop_vec_info);
 /* Used in gimple-loop-interchange.c and tree-parloops.c.  */
 extern bool check_reduction_path (dump_user_location_t, loop_p, gphi *, tree,
 				  enum tree_code);
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 30dcc442c4c440c44ef3ba29a03182834229ba35..8e02647c7bad6ce4a92a225a4d37f82439f771ae 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -874,6 +874,7 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
   vec_info_shared shared;
   auto_purge_vect_location sentinel;
   vect_location = find_loop_location (loop);
+
   if (LOCATION_LOCUS (vect_location.get_location_t ()) != UNKNOWN_LOCATION
       && dump_enabled_p ())
     dump_printf (MSG_NOTE | MSG_PRIORITY_INTERNALS,
@@ -881,10 +882,17 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 		 LOCATION_FILE (vect_location.get_location_t ()),
 		 LOCATION_LINE (vect_location.get_location_t ()));
 
-  /* Try to analyze the loop, retaining an opt_problem if dump_enabled_p.  */
-  opt_loop_vec_info loop_vinfo
-    = vect_analyze_loop (loop, orig_loop_vinfo, &shared);
-  loop->aux = loop_vinfo;
+  opt_loop_vec_info loop_vinfo = opt_loop_vec_info::success (NULL);
+  /* In the case of epilogue vectorization the loop already has its
+     loop_vec_info set, we do not require to analyze the loop in this case.  */
+  if (loop_vec_info vinfo = loop_vec_info_for_loop (loop))
+    loop_vinfo = opt_loop_vec_info::success (vinfo);
+  else
+    {
+      /* Try to analyze the loop, retaining an opt_problem if dump_enabled_p.  */
+      loop_vinfo = vect_analyze_loop (loop, orig_loop_vinfo, &shared);
+      loop->aux = loop_vinfo;
+    }
 
   if (!loop_vinfo)
     if (dump_enabled_p ())
@@ -1012,8 +1020,13 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 
   /* Epilogue of vectorized loop must be vectorized too.  */
   if (new_loop)
-    ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops,
-				 new_loop, loop_vinfo, NULL, NULL);
+    {
+      /* Don't include vectorized epilogues in the "vectorized loops" count.
+       */
+      unsigned dont_count = *num_vectorized_loops;
+      ret |= try_vectorize_loop_1 (simduid_to_vf_htab, &dont_count,
+				   new_loop, loop_vinfo, NULL, NULL);
+    }
 
   return ret;
 }
Richard Biener Oct. 28, 2019, 12:48 p.m. | #14
On Fri, 25 Oct 2019, Andre Vieira (lists) wrote:

> 

> 

> On 22/10/2019 14:56, Richard Biener wrote:

> > On Tue, 22 Oct 2019, Andre Vieira (lists) wrote:

> > 

> >> Hi Richi,

> >>

> >> See inline responses to your comments.

> >>

> >> On 11/10/2019 13:57, Richard Biener wrote:

> >>> On Thu, 10 Oct 2019, Andre Vieira (lists) wrote:

> >>>

> >>>> Hi,

> >>>>

> >>

> >>>

> >>> +

> >>> +  /* Keep track of vector sizes we know we can vectorize the epilogue

> >>> with.  */

> >>> +  vector_sizes epilogue_vsizes;

> >>>    };

> >>>

> >>> please don't enlarge struct loop, instead track this somewhere

> >>> in the vectorizer (in loop_vinfo?  I see you already have

> >>> epilogue_vinfos there - so the loop_vinfo simply lacks

> >>> convenient access to the vector_size?)  I don't see any

> >>> use that could be trivially adjusted to look at a loop_vinfo

> >>> member instead.

> >>

> >> Done.

> >>>

> >>> For the vect_update_inits_of_drs this means that we'd possibly

> >>> do less CSE.  Not sure if really an issue.

> >>

> >> CSE of what exactly? You are afraid we are repeating a calculation here we

> >> have done elsewhere before?

> > 

> > All uses of those inits now possibly get the expression instead of

> > just the SSA name we inserted code for once.  But as said, we'll see.

> > 

> 

> This code changed after some comments from Richard Sandiford.

> 

> > +  /* We are done vectorizing the main loop, so now we update the

> > epilogues

> > +     stmt_vec_info's.  At the same time we set the gimple UID of each

> > +     statement in the epilogue, as these are used to look them up in the

> > +     epilogues loop_vec_info later.  We also keep track of what

> > +     stmt_vec_info's have PATTERN_DEF_SEQ's and RELATED_STMT's that might

> > +     need updating and we construct a mapping between variables defined

> > in

> > +     the main loop and their corresponding names in epilogue.  */

> > +  for (unsigned i = 0; i < epilogue->num_nodes; ++i)

> > 

> > so for the following code I wonder if you can make use of the

> > fact that loop copying also copies UIDs, so you should be able

> > to match stmts via their UIDs and get at the other loop infos

> > stmt_info by the copy loop stmt UID.

> > 

> > I wonder why you need no modification for the SLP tree?

> > 

> I checked with Tamar and the SLP tree works with the position of operands and

> not SSA_NAMES.  So we should be fine.


There's now SLP_TREE_SCALAR_OPS but only for invariants so I guess
we should indeed be fine here.  Everything else is already
stmt_infos which you patch with the new underlying stmts.

Richard.
Richard Biener Oct. 28, 2019, 2:16 p.m. | #15
On Fri, 25 Oct 2019, Andre Vieira (lists) wrote:

> Hi,

> 

> This is the reworked patch after your comments.

> 

> I have moved the epilogue check into the analysis form disguised under

> '!epilogue_vinfos.is_empty ()'.  This because I realized that I am doing the

> "lowest threshold" check there.

> 

> The only place where we may reject an epilogue_vinfo is when we know the

> number of scalar iterations and we realize the number of iterations left after

> the main loop are not enough to enter the vectorized epilogue so we optimize

> away that code-gen.  The only way we know this to be true is if the number of

> scalar iterations are known and the peeling for alignment is known. So we know

> we will enter the main loop regardless, so whether the threshold we use is for

> a lower VF or not it shouldn't matter as much, I would even like to think that

> check isn't done, but I am not sure... Might be worth checking as an

> optimization.

> 

> 

> Is this OK for trunk?


+      for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]);
+          !gsi_end_p (epilogue_phi_gsi); gsi_next (&epilogue_phi_gsi))
+       {
..
+         if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
+           pattern_worklist.safe_push (stmt_vinfo);
+
+         related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+         while (related_vinfo && related_vinfo != stmt_vinfo)
+           {

I think PHIs cannot have patterns.  You can assert
that STMT_VINFO_RELATED_STMT is NULL I think.

+         related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+         while (related_vinfo && related_vinfo != stmt_vinfo)
+           {
+             related_worklist.safe_push (related_vinfo);
+             /* Set BB such that the assert in
+               'get_initial_def_for_reduction' is able to determine that
+               the BB of the related stmt is inside this loop.  */
+             gimple_set_bb (STMT_VINFO_STMT (related_vinfo),
+                            gimple_bb (new_stmt));
+             related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);
+           }

do we really keep references to "nested" patterns?  Thus, do you
need this loop?

+  /* The PATTERN_DEF_SEQs in the epilogue were constructed using the
+     original main loop and thus need to be updated to refer to the 
cloned
+     variables used in the epilogue.  */
+  for (unsigned i = 0; i < pattern_worklist.length (); ++i)
+    {
...
+                 op = simplify_replace_tree (op, NULL_TREE, NULL_TREE,
+                                        &find_in_mapping, &mapping);
+                 gimple_set_op (seq, j, op);

you do this for the pattern-def seq but not for the related one.
I guess you ran into this for COND_EXPR conditions.  I wondered
to use a shared worklist for both the def-seq and the main pattern
stmt or at least to split out the replacement so you can share it.

+      /* Data references for gather loads and scatter stores do not use 
the
+        updated offset we set using ADVANCE.  Instead we have to make 
sure the
+        reference in the data references point to the corresponding copy 
of
+        the original in the epilogue.  */
+      if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo))
+       {
+         int j;
+         if (TREE_CODE (DR_REF (dr)) == MEM_REF)
+           j = 0;
+         else if (TREE_CODE (DR_REF (dr)) == ARRAY_REF)
+           j = 1;
+         else
+           gcc_unreachable ();
+
+         if (tree *new_op = mapping.get (TREE_OPERAND (DR_REF (dr), j)))
+           {
+             DR_REF (dr) = unshare_expr (DR_REF (dr));
+             TREE_OPERAND (DR_REF (dr), j) = *new_op;
+           }

huh, do you really only ever see MEM_REF or ARRAY_REF here?
I would guess using simplify_replace_tree is safer.
There's also DR_BASE_ADDRESS - we seem to leave the DRs partially
updated, is that correct?

Otherwise looks OK to me.

Thanks,
Richard.


> gcc/ChangeLog:

> 2019-10-25  Andre Vieira  <andre.simoesdiasvieira@arm.com>

> 

>     PR 88915

>     * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.

>     * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter

>     and make the valueize function pointer also take a void pointer.

>     * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap

>     around vn_valueize, to call it without a context.

>     (process_bb): Use vn_valueize_wrapper instead of vn_valueize.

>     * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos.

>     (~_loop_vec_info): Release epilogue_vinfos.

>     (vect_analyze_loop_costing): Use knowledge of main VF to estimate

>     number of iterations of epilogue.

>     (vect_analyze_loop_2): Adapt to analyse main loop for all supported

>     vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest

>     versioning threshold needed for main loop.

>     (vect_analyze_loop): Likewise.

>     (find_in_mapping): New helper function.

>     (update_epilogue_loop_vinfo): New function.

>     (vect_transform_loop): When vectorizing epilogues re-use analysis done

>     on main loop and call update_epilogue_loop_vinfo to update it.

>     * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert

>     stmts on loop preheader edge.

>     (vect_do_peeling): Enable skip-vectors when doing loop versioning if

>     we decided to vectorize epilogues.  Update epilogues NITERS and

>     construct ADVANCE to update epilogues data references where needed.

>     * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos.

>     (vect_do_peeling, vect_update_inits_of_drs,

>      determine_peel_for_niter, vect_analyze_loop): Add or update declarations.

>     * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already

>     created loop_vec_info's for epilogues when available.  Otherwise 

> analyse

>     epilogue separately.

> 

> 

> 

> Cheers,

> Andre

> 


-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
Andre Simoes Dias Vieira Oct. 28, 2019, 6:31 p.m. | #16
Hi,

Reworked according to your comments, see inline for clarification.

Is this OK for trunk?

gcc/ChangeLog:
2019-10-28  Andre Vieira  <andre.simoesdiasvieira@arm.com>

     PR 88915
     * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.
     * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter
     and make the valueize function pointer also take a void pointer.
     * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap
     around vn_valueize, to call it without a context.
     (process_bb): Use vn_valueize_wrapper instead of vn_valueize.
     * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos.
     (~_loop_vec_info): Release epilogue_vinfos.
     (vect_analyze_loop_costing): Use knowledge of main VF to estimate
     number of iterations of epilogue.
     (vect_analyze_loop_2): Adapt to analyse main loop for all supported
     vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
     versioning threshold needed for main loop.
     (vect_analyze_loop): Likewise.
     (find_in_mapping): New helper function.
     (update_epilogue_loop_vinfo): New function.
     (vect_transform_loop): When vectorizing epilogues re-use analysis done
     on main loop and call update_epilogue_loop_vinfo to update it.
     * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
     stmts on loop preheader edge.
     (vect_do_peeling): Enable skip-vectors when doing loop versioning if
     we decided to vectorize epilogues.  Update epilogues NITERS and
     construct ADVANCE to update epilogues data references where needed.
     * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos.
     (vect_do_peeling, vect_update_inits_of_drs,
      determine_peel_for_niter, vect_analyze_loop): Add or update 
declarations.
     * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
     created loop_vec_info's for epilogues when available.  Otherwise 
analyse
     epilogue separately.



Cheers,
Andre

On 28/10/2019 14:16, Richard Biener wrote:
> On Fri, 25 Oct 2019, Andre Vieira (lists) wrote:

> 

>> Hi,

>>

>> This is the reworked patch after your comments.

>>

>> I have moved the epilogue check into the analysis form disguised under

>> '!epilogue_vinfos.is_empty ()'.  This because I realized that I am doing the

>> "lowest threshold" check there.

>>

>> The only place where we may reject an epilogue_vinfo is when we know the

>> number of scalar iterations and we realize the number of iterations left after

>> the main loop are not enough to enter the vectorized epilogue so we optimize

>> away that code-gen.  The only way we know this to be true is if the number of

>> scalar iterations are known and the peeling for alignment is known. So we know

>> we will enter the main loop regardless, so whether the threshold we use is for

>> a lower VF or not it shouldn't matter as much, I would even like to think that

>> check isn't done, but I am not sure... Might be worth checking as an

>> optimization.

>>

>>

>> Is this OK for trunk?

> 

> +      for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]);

> +          !gsi_end_p (epilogue_phi_gsi); gsi_next (&epilogue_phi_gsi))

> +       {

> ..

> +         if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))

> +           pattern_worklist.safe_push (stmt_vinfo);

> +

> +         related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);

> +         while (related_vinfo && related_vinfo != stmt_vinfo)

> +           {

> 

> I think PHIs cannot have patterns.  You can assert

> that STMT_VINFO_RELATED_STMT is NULL I think.


Done.
> 

> +         related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);

> +         while (related_vinfo && related_vinfo != stmt_vinfo)

> +           {

> +             related_worklist.safe_push (related_vinfo);

> +             /* Set BB such that the assert in

> +               'get_initial_def_for_reduction' is able to determine that

> +               the BB of the related stmt is inside this loop.  */

> +             gimple_set_bb (STMT_VINFO_STMT (related_vinfo),

> +                            gimple_bb (new_stmt));

> +             related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);

> +           }

> 

> do we really keep references to "nested" patterns?  Thus, do you

> need this loop?


Changed and added asserts.  They didn't trigger so I suppose you are 
right, I didn't know at the time whether it was possible, so I just 
operated on the side of caution.  Can remove the asserts and so on if 
you want.
> 

> +  /* The PATTERN_DEF_SEQs in the epilogue were constructed using the

> +     original main loop and thus need to be updated to refer to the

> cloned

> +     variables used in the epilogue.  */

> +  for (unsigned i = 0; i < pattern_worklist.length (); ++i)

> +    {

> ...

> +                 op = simplify_replace_tree (op, NULL_TREE, NULL_TREE,

> +                                        &find_in_mapping, &mapping);

> +                 gimple_set_op (seq, j, op);

> 

> you do this for the pattern-def seq but not for the related one.

> I guess you ran into this for COND_EXPR conditions.  I wondered

> to use a shared worklist for both the def-seq and the main pattern

> stmt or at least to split out the replacement so you can share it.


I think that was it yeah, reworked it now to use the same list. Less 
code, thanks!
> 

> +      /* Data references for gather loads and scatter stores do not use

> the

> +        updated offset we set using ADVANCE.  Instead we have to make

> sure the

> +        reference in the data references point to the corresponding copy

> of

> +        the original in the epilogue.  */

> +      if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo))

> +       {

> +         int j;

> +         if (TREE_CODE (DR_REF (dr)) == MEM_REF)

> +           j = 0;

> +         else if (TREE_CODE (DR_REF (dr)) == ARRAY_REF)

> +           j = 1;

> +         else

> +           gcc_unreachable ();

> +

> +         if (tree *new_op = mapping.get (TREE_OPERAND (DR_REF (dr), j)))

> +           {

> +             DR_REF (dr) = unshare_expr (DR_REF (dr));

> +             TREE_OPERAND (DR_REF (dr), j) = *new_op;

> +           }

> 

> huh, do you really only ever see MEM_REF or ARRAY_REF here?

> I would guess using simplify_replace_tree is safer.

> There's also DR_BASE_ADDRESS - we seem to leave the DRs partially

> updated, is that correct?


Yeah can use simplify_replace_tree indeed.  And I have changed it so it 
updates DR_BASE_ADDRESS.  I think DR_BASE_ADDRESS never actually changed 
in the way we use data_references... Either way, replacing them if they 
do change is cleaner and more future proof.
> 

> Otherwise looks OK to me.

> 

> Thanks,

> Richard.

> 

> 

>> gcc/ChangeLog:

>> 2019-10-25  Andre Vieira  <andre.simoesdiasvieira@arm.com>

>>

>>      PR 88915

>>      * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.

>>      * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter

>>      and make the valueize function pointer also take a void pointer.

>>      * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap

>>      around vn_valueize, to call it without a context.

>>      (process_bb): Use vn_valueize_wrapper instead of vn_valueize.

>>      * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos.

>>      (~_loop_vec_info): Release epilogue_vinfos.

>>      (vect_analyze_loop_costing): Use knowledge of main VF to estimate

>>      number of iterations of epilogue.

>>      (vect_analyze_loop_2): Adapt to analyse main loop for all supported

>>      vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest

>>      versioning threshold needed for main loop.

>>      (vect_analyze_loop): Likewise.

>>      (find_in_mapping): New helper function.

>>      (update_epilogue_loop_vinfo): New function.

>>      (vect_transform_loop): When vectorizing epilogues re-use analysis done

>>      on main loop and call update_epilogue_loop_vinfo to update it.

>>      * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert

>>      stmts on loop preheader edge.

>>      (vect_do_peeling): Enable skip-vectors when doing loop versioning if

>>      we decided to vectorize epilogues.  Update epilogues NITERS and

>>      construct ADVANCE to update epilogues data references where needed.

>>      * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos.

>>      (vect_do_peeling, vect_update_inits_of_drs,

>>       determine_peel_for_niter, vect_analyze_loop): Add or update declarations.

>>      * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already

>>      created loop_vec_info's for epilogues when available.  Otherwise

>> analyse

>>      epilogue separately.

>>

>>

>>

>> Cheers,

>> Andre

>>

>
diff --git a/gcc/tree-ssa-loop-niter.h b/gcc/tree-ssa-loop-niter.h
index 4454c1ac78e02228047511a9e0214c82946855b8..aec6225125ce42ab0e4dbc930fc1a93862e6e267 100644
--- a/gcc/tree-ssa-loop-niter.h
+++ b/gcc/tree-ssa-loop-niter.h
@@ -53,7 +53,9 @@ extern bool scev_probably_wraps_p (tree, tree, tree, gimple *,
 				   class loop *, bool);
 extern void free_numbers_of_iterations_estimates (class loop *);
 extern void free_numbers_of_iterations_estimates (function *);
-extern tree simplify_replace_tree (tree, tree, tree, tree (*)(tree) = NULL);
+extern tree simplify_replace_tree (tree, tree,
+				   tree, tree (*)(tree, void *) = NULL,
+				   void * = NULL);
 extern void substitute_in_loop_info (class loop *, tree, tree);
 
 #endif /* GCC_TREE_SSA_LOOP_NITER_H */
diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index cd2ced369719c37afd4aac08ff360719d7702e42..db666f019808850ed3a4aeef1a454a7ae2c65ef2 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -1935,7 +1935,7 @@ number_of_iterations_cond (class loop *loop,
 
 tree
 simplify_replace_tree (tree expr, tree old, tree new_tree,
-		       tree (*valueize) (tree))
+		       tree (*valueize) (tree, void*), void *context)
 {
   unsigned i, n;
   tree ret = NULL_TREE, e, se;
@@ -1951,7 +1951,7 @@ simplify_replace_tree (tree expr, tree old, tree new_tree,
     {
       if (TREE_CODE (expr) == SSA_NAME)
 	{
-	  new_tree = valueize (expr);
+	  new_tree = valueize (expr, context);
 	  if (new_tree != expr)
 	    return new_tree;
 	}
@@ -1967,7 +1967,7 @@ simplify_replace_tree (tree expr, tree old, tree new_tree,
   for (i = 0; i < n; i++)
     {
       e = TREE_OPERAND (expr, i);
-      se = simplify_replace_tree (e, old, new_tree, valueize);
+      se = simplify_replace_tree (e, old, new_tree, valueize, context);
       if (e == se)
 	continue;
 
diff --git a/gcc/tree-ssa-sccvn.c b/gcc/tree-ssa-sccvn.c
index 57331ab44dc78c16d97065cd28e8c4cdcbf8d96e..0abe3fb8453ecf2e25ff55c5c9846663f68f7c8c 100644
--- a/gcc/tree-ssa-sccvn.c
+++ b/gcc/tree-ssa-sccvn.c
@@ -309,6 +309,10 @@ static vn_tables_t valid_info;
 /* Valueization hook.  Valueize NAME if it is an SSA name, otherwise
    just return it.  */
 tree (*vn_valueize) (tree);
+tree vn_valueize_wrapper (tree t, void* context ATTRIBUTE_UNUSED)
+{
+  return vn_valueize (t);
+}
 
 
 /* This represents the top of the VN lattice, which is the universal
@@ -6407,7 +6411,7 @@ process_bb (rpo_elim &avail, basic_block bb,
       if (bb->loop_father->nb_iterations)
 	bb->loop_father->nb_iterations
 	  = simplify_replace_tree (bb->loop_father->nb_iterations,
-				   NULL_TREE, NULL_TREE, vn_valueize);
+				   NULL_TREE, NULL_TREE, &vn_valueize_wrapper);
     }
 
   /* Value-number all defs in the basic-block.  */
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index b3246bc7a099e491e5c2fd32835dc5c848931d0a..dffb40ec9999a0363e53b1748af2fdcf270710ff 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1726,7 +1726,7 @@ vect_update_init_of_dr (struct data_reference *dr, tree niters, tree_code code)
    Apply vect_update_inits_of_dr to all accesses in LOOP_VINFO.
    CODE and NITERS are as for vect_update_inits_of_dr.  */
 
-static void
+void
 vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
 			  tree_code code)
 {
@@ -1736,21 +1736,12 @@ vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
 
   DUMP_VECT_SCOPE ("vect_update_inits_of_dr");
 
-  /* Adjust niters to sizetype and insert stmts on loop preheader edge.  */
+  /* Adjust niters to sizetype.  We used to insert the stmts on loop preheader
+     here, but since we might use these niters to update the epilogues niters
+     and data references we can't insert them here as this definition might not
+     always dominate its uses.  */
   if (!types_compatible_p (sizetype, TREE_TYPE (niters)))
-    {
-      gimple_seq seq;
-      edge pe = loop_preheader_edge (LOOP_VINFO_LOOP (loop_vinfo));
-      tree var = create_tmp_var (sizetype, "prolog_loop_adjusted_niters");
-
-      niters = fold_convert (sizetype, niters);
-      niters = force_gimple_operand (niters, &seq, false, var);
-      if (seq)
-	{
-	  basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe, seq);
-	  gcc_assert (!new_bb);
-	}
-    }
+    niters = fold_convert (sizetype, niters);
 
   FOR_EACH_VEC_ELT (datarefs, i, dr)
     {
@@ -2393,7 +2384,22 @@ slpeel_update_phi_nodes_for_lcssa (class loop *epilog)
 
    Note this function peels prolog and epilog only if it's necessary,
    as well as guards.
-   Returns created epilogue or NULL.
+   This function returns the epilogue loop if a decision was made to vectorize
+   it, otherwise NULL.
+
+   The analysis resulting in this epilogue loop's loop_vec_info was performed
+   in the same vect_analyze_loop call as the main loop's.  At that time
+   vect_analyze_loop constructs a list of accepted loop_vec_info's for lower
+   vectorization factors than the main loop.  This list is stored in the main
+   loop's loop_vec_info in the 'epilogue_vinfos' member.  Everytime we decide to
+   vectorize the epilogue loop for a lower vectorization factor,  the
+   loop_vec_info sitting at the top of the epilogue_vinfos list is removed,
+   updated and linked to the epilogue loop.  This is later used to vectorize
+   the epilogue.  The reason the loop_vec_info needs updating is that it was
+   constructed based on the original main loop, and the epilogue loop is a
+   copy of this loop, so all links pointing to statements in the original loop
+   need updating.  Furthermore, these loop_vec_infos share the
+   data_reference's records, which will also need to be updated.
 
    TODO: Guard for prefer_scalar_loop should be emitted along with
    versioning conditions if loop versioning is needed.  */
@@ -2403,7 +2409,8 @@ class loop *
 vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 		 tree *niters_vector, tree *step_vector,
 		 tree *niters_vector_mult_vf_var, int th,
-		 bool check_profitability, bool niters_no_overflow)
+		 bool check_profitability, bool niters_no_overflow,
+		 tree *advance, drs_init_vec &orig_drs_init)
 {
   edge e, guard_e;
   tree type = TREE_TYPE (niters), guard_cond;
@@ -2411,6 +2418,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   profile_probability prob_prolog, prob_vector, prob_epilog;
   int estimated_vf;
   int prolog_peeling = 0;
+  bool vect_epilogues = loop_vinfo->epilogue_vinfos.length () > 0;
   /* We currently do not support prolog peeling if the target alignment is not
      known at compile time.  'vect_gen_prolog_loop_niters' depends on the
      target alignment being constant.  */
@@ -2464,19 +2472,73 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   int bound_prolog = 0;
   if (prolog_peeling)
     niters_prolog = vect_gen_prolog_loop_niters (loop_vinfo, anchor,
-						 &bound_prolog);
+						  &bound_prolog);
   else
     niters_prolog = build_int_cst (type, 0);
 
+  loop_vec_info epilogue_vinfo = NULL;
+  if (vect_epilogues)
+    {
+      epilogue_vinfo = loop_vinfo->epilogue_vinfos[0];
+      loop_vinfo->epilogue_vinfos.ordered_remove (0);
+    }
+
+  tree niters_vector_mult_vf = NULL_TREE;
+  /* Saving NITERs before the loop, as this may be changed by prologue.  */
+  tree before_loop_niters = LOOP_VINFO_NITERS (loop_vinfo);
+  edge update_e = NULL, skip_e = NULL;
+  unsigned int lowest_vf = constant_lower_bound (vf);
+  /* If we know the number of scalar iterations for the main loop we should
+     check whether after the main loop there are enough iterations left over
+     for the epilogue.  */
+  if (vect_epilogues
+      && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+      && prolog_peeling >= 0
+      && known_eq (vf, lowest_vf))
+    {
+      unsigned HOST_WIDE_INT eiters
+	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
+	   - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
+
+      eiters -= prolog_peeling;
+      eiters
+	= eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
+
+      unsigned int ratio;
+      while (!(constant_multiple_p (loop_vinfo->vector_size,
+				    epilogue_vinfo->vector_size, &ratio)
+	       && eiters >= lowest_vf / ratio))
+	{
+	  delete epilogue_vinfo;
+	  epilogue_vinfo = NULL;
+	  if (loop_vinfo->epilogue_vinfos.length () == 0)
+	    {
+	      vect_epilogues = false;
+	      break;
+	    }
+	  epilogue_vinfo = loop_vinfo->epilogue_vinfos[0];
+	  loop_vinfo->epilogue_vinfos.ordered_remove (0);
+	}
+    }
   /* Prolog loop may be skipped.  */
   bool skip_prolog = (prolog_peeling != 0);
-  /* Skip to epilog if scalar loop may be preferred.  It's only needed
-     when we peel for epilog loop and when it hasn't been checked with
-     loop versioning.  */
+  /* Skip this loop to epilog when there are not enough iterations to enter this
+     vectorized loop.  If true we should perform runtime checks on the NITERS
+     to check whether we should skip the current vectorized loop.  If we know
+     the number of scalar iterations we may choose to add a runtime check if
+     this number "maybe" smaller than the number of iterations required
+     when we know the number of scalar iterations may potentially
+     be smaller than the number of iterations required to enter this loop, for
+     this we use the upper bounds on the prolog and epilog peeling.  When we
+     don't know the number of iterations and don't require versioning it is
+     because we have asserted that there are enough scalar iterations to enter
+     the main loop, so this skip is not necessary.  When we are versioning then
+     we only add such a skip if we have chosen to vectorize the epilogue.  */
   bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
 		      ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo),
 				  bound_prolog + bound_epilog)
-		      : !LOOP_REQUIRES_VERSIONING (loop_vinfo));
+		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+			 || vect_epilogues));
   /* Epilog loop must be executed if the number of iterations for epilog
      loop is known at compile time, otherwise we need to add a check at
      the end of vector loop and skip to the end of epilog loop.  */
@@ -2506,6 +2568,12 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 
   dump_user_location_t loop_loc = find_loop_location (loop);
   class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
+  if (vect_epilogues)
+    /* Make sure to set the epilogue's epilogue scalar loop, such that we can
+       use the original scalar loop as remaining epilogue if necessary.  */
+    LOOP_VINFO_SCALAR_LOOP (epilogue_vinfo)
+      = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
+
   if (prolog_peeling)
     {
       e = loop_preheader_edge (loop);
@@ -2552,6 +2620,15 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	  scale_bbs_frequencies (&bb_after_prolog, 1, prob_prolog);
 	  scale_loop_profile (prolog, prob_prolog, bound_prolog);
 	}
+
+      /* Save original inits for each data_reference before advancing them with
+	 NITERS_PROLOG.  */
+      unsigned int i;
+      struct data_reference *dr;
+      vec<data_reference_p> datarefs = loop_vinfo->shared->datarefs;
+      FOR_EACH_VEC_ELT (datarefs, i, dr)
+	orig_drs_init.safe_push (std::make_pair (dr, DR_OFFSET (dr)));
+
       /* Update init address of DRs.  */
       vect_update_inits_of_drs (loop_vinfo, niters_prolog, PLUS_EXPR);
       /* Update niters for vector loop.  */
@@ -2586,8 +2663,15 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 			   "loop can't be duplicated to exit edge.\n");
 	  gcc_unreachable ();
 	}
-      /* Peel epilog and put it on exit edge of loop.  */
-      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e);
+      /* Peel epilog and put it on exit edge of loop.  If we are vectorizing
+	 said epilog then we should use a copy of the main loop as a starting
+	 point.  This loop may have already had some preliminary transformations
+	 to allow for more optimal vectorization, for example if-conversion.
+	 If we are not vectorizing the epilog then we should use the scalar loop
+	 as the transformations mentioned above make less or no sense when not
+	 vectorizing.  */
+      epilog = vect_epilogues ? get_loop_copy (loop) : scalar_loop;
+      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e);
       if (!epilog)
 	{
 	  dump_printf_loc (MSG_MISSED_OPTIMIZATION, loop_loc,
@@ -2616,6 +2700,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 					   guard_to, guard_bb,
 					   prob_vector.invert (),
 					   irred_flag);
+	  skip_e = guard_e;
 	  e = EDGE_PRED (guard_to, 0);
 	  e = (e != guard_e ? e : EDGE_PRED (guard_to, 1));
 	  slpeel_update_phi_nodes_for_guard1 (first_loop, epilog, guard_e, e);
@@ -2637,7 +2722,6 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	}
 
       basic_block bb_before_epilog = loop_preheader_edge (epilog)->src;
-      tree niters_vector_mult_vf;
       /* If loop is peeled for non-zero constant times, now niters refers to
 	 orig_niters - prolog_peeling, it won't overflow even the orig_niters
 	 overflows.  */
@@ -2660,7 +2744,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
       /* Update IVs of original loop as if they were advanced by
 	 niters_vector_mult_vf steps.  */
       gcc_checking_assert (vect_can_advance_ivs_p (loop_vinfo));
-      edge update_e = skip_vector ? e : loop_preheader_edge (epilog);
+      update_e = skip_vector ? e : loop_preheader_edge (epilog);
       vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf,
 					update_e);
 
@@ -2701,10 +2785,75 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
       adjust_vec_debug_stmts ();
       scev_reset ();
     }
+
+  if (vect_epilogues)
+    {
+      epilog->aux = epilogue_vinfo;
+      LOOP_VINFO_LOOP (epilogue_vinfo) = epilog;
+
+      loop_constraint_clear (epilog, LOOP_C_INFINITE);
+
+      /* We now must calculate the number of NITERS performed by the previous
+	 loop and EPILOGUE_NITERS to be performed by the epilogue.  */
+      tree niters = fold_build2 (PLUS_EXPR, TREE_TYPE (niters_vector_mult_vf),
+				 niters_prolog, niters_vector_mult_vf);
+
+      /* If skip_vector we may skip the previous loop, we insert a phi-node to
+	 determine whether we are coming from the previous vectorized loop
+	 using the update_e edge or the skip_vector basic block using the
+	 skip_e edge.  */
+      if (skip_vector)
+	{
+	  gcc_assert (update_e != NULL && skip_e != NULL);
+	  gphi *new_phi = create_phi_node (make_ssa_name (TREE_TYPE (niters)),
+					   update_e->dest);
+	  tree new_ssa = make_ssa_name (TREE_TYPE (niters));
+	  gimple *stmt = gimple_build_assign (new_ssa, niters);
+	  gimple_stmt_iterator gsi;
+	  if (TREE_CODE (niters_vector_mult_vf) == SSA_NAME
+	      && SSA_NAME_DEF_STMT (niters_vector_mult_vf)->bb != NULL)
+	    {
+	      gsi = gsi_for_stmt (SSA_NAME_DEF_STMT (niters_vector_mult_vf));
+	      gsi_insert_after (&gsi, stmt, GSI_NEW_STMT);
+	    }
+	  else
+	    {
+	      gsi = gsi_last_bb (update_e->src);
+	      gsi_insert_before (&gsi, stmt, GSI_NEW_STMT);
+	    }
+
+	  niters = new_ssa;
+	  add_phi_arg (new_phi, niters, update_e, UNKNOWN_LOCATION);
+	  add_phi_arg (new_phi, build_zero_cst (TREE_TYPE (niters)), skip_e,
+		       UNKNOWN_LOCATION);
+	  niters = PHI_RESULT (new_phi);
+	}
+
+      /* Subtract the number of iterations performed by the vectorized loop
+	 from the number of total iterations.  */
+      tree epilogue_niters = fold_build2 (MINUS_EXPR, TREE_TYPE (niters),
+					  before_loop_niters,
+					  niters);
+
+      LOOP_VINFO_NITERS (epilogue_vinfo) = epilogue_niters;
+      LOOP_VINFO_NITERSM1 (epilogue_vinfo)
+	= fold_build2 (MINUS_EXPR, TREE_TYPE (epilogue_niters),
+		       epilogue_niters,
+		       build_one_cst (TREE_TYPE (epilogue_niters)));
+
+      /* Set ADVANCE to the number of iterations performed by the previous
+	 loop and its prologue.  */
+      *advance = niters;
+
+      /* Redo the peeling for niter analysis as the NITERs and alignment
+	 may have been updated to take the main loop into account.  */
+      determine_peel_for_niter (epilogue_vinfo);
+    }
+
   adjust_vec.release ();
   free_original_copy_tables ();
 
-  return epilog;
+  return vect_epilogues ? epilog : NULL;
 }
 
 /* Function vect_create_cond_for_niters_checks.
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index e22d2dd7abbf43aa0c8707b9422b90612188ad2a..1feff2820c7d2f54ca285db699fd68e682f2bac9 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -885,6 +885,8 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
 	    }
 	}
     }
+
+  epilogue_vinfos.create (6);
 }
 
 /* Free all levels of MASKS.  */
@@ -909,6 +911,7 @@ _loop_vec_info::~_loop_vec_info ()
   release_vec_loop_masks (&masks);
   delete ivexpr_map;
   delete scan_map;
+  epilogue_vinfos.release ();
 
   loop->aux = NULL;
 }
@@ -1682,9 +1685,20 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
       return 0;
     }
 
-  HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop);
-  if (estimated_niter == -1)
-    estimated_niter = likely_max_stmt_executions_int (loop);
+  HOST_WIDE_INT estimated_niter;
+
+  /* If we are vectorizing an epilogue then we know the maximum number of
+     scalar iterations it will cover is at least one lower than the
+     vectorization factor of the main loop.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    estimated_niter
+      = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1;
+  else
+    {
+      estimated_niter = estimated_stmt_executions_int (loop);
+      if (estimated_niter == -1)
+	estimated_niter = likely_max_stmt_executions_int (loop);
+    }
   if (estimated_niter != -1
       && ((unsigned HOST_WIDE_INT) estimated_niter
 	  < MAX (th, (unsigned) min_profitable_estimate)))
@@ -1871,6 +1885,15 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
   int res;
   unsigned int max_vf = MAX_VECTORIZATION_FACTOR;
   poly_uint64 min_vf = 2;
+  loop_vec_info orig_loop_vinfo = NULL;
+
+  /* If we are dealing with an epilogue then orig_loop_vinfo points to the
+     loop_vec_info of the first vectorized loop.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
+  else
+    orig_loop_vinfo = loop_vinfo;
+  gcc_assert (orig_loop_vinfo);
 
   /* The first group of checks is independent of the vector size.  */
   fatal = true;
@@ -2150,8 +2173,18 @@ start_over:
   /* During peeling, we need to check if number of loop iterations is
      enough for both peeled prolog loop and vector loop.  This check
      can be merged along with threshold check of loop versioning, so
-     increase threshold for this case if necessary.  */
-  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
+     increase threshold for this case if necessary.
+
+     If we are analyzing an epilogue we still want to check what its
+     versioning threshold would be.  If we decide to vectorize the epilogues we
+     will want to use the lowest versioning threshold of all epilogues and main
+     loop.  This will enable us to enter a vectorized epilogue even when
+     versioning the loop.  We can't simply check whether the epilogue requires
+     versioning though since we may have skipped some versioning checks when
+     analyzing the epilogue.  For instance, checks for alias versioning will be
+     skipped when dealing with epilogues as we assume we already checked them
+     for the main loop.  So instead we always check the 'orig_loop_vinfo'.  */
+  if (LOOP_REQUIRES_VERSIONING (orig_loop_vinfo))
     {
       poly_uint64 niters_th = 0;
       unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
@@ -2344,6 +2377,14 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
   poly_uint64 autodetected_vector_size = 0;
   opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL);
   poly_uint64 next_vector_size = 0;
+  poly_uint64 lowest_th = 0;
+  unsigned vectorized_loops = 0;
+
+  /* Only vectorize epilogues if PARAM_VECT_EPILOGUES_NOMASK is enabled, this
+     is not a simd loop and it is the most inner loop.  */
+  bool vect_epilogues
+    = !loop->simdlen && loop->inner == NULL
+      && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK);
   while (1)
     {
       /* Check the CFG characteristics of the loop (nesting, entry/exit).  */
@@ -2363,6 +2404,8 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
 
       if (orig_loop_vinfo)
 	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo;
+      else if (vect_epilogues && first_loop_vinfo)
+	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
 
       opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts);
       if (next_size == 0)
@@ -2371,18 +2414,43 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
       if (res)
 	{
 	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
+	  vectorized_loops++;
 
-	  if (loop->simdlen
-	      && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
-			   (unsigned HOST_WIDE_INT) loop->simdlen))
+	  if ((loop->simdlen
+	       && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+			    (unsigned HOST_WIDE_INT) loop->simdlen))
+	      || vect_epilogues)
 	    {
 	      if (first_loop_vinfo == NULL)
 		{
 		  first_loop_vinfo = loop_vinfo;
+		  lowest_th
+		    = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo);
 		  loop->aux = NULL;
 		}
 	      else
-		delete loop_vinfo;
+		{
+		  /* Keep track of vector sizes that we know we can vectorize
+		     the epilogue with.  Only vectorize first epilogue.  */
+		  if (vect_epilogues
+		      && first_loop_vinfo->epilogue_vinfos.is_empty ())
+		    {
+		      loop->aux = NULL;
+		      first_loop_vinfo->epilogue_vinfos.reserve (1);
+		      first_loop_vinfo->epilogue_vinfos.quick_push (loop_vinfo);
+		      LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
+		      poly_uint64 th
+			= LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
+		      gcc_assert (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+				  || maybe_ne (lowest_th, 0U));
+		      /* Keep track of the known smallest versioning
+			 threshold.  */
+		      if (ordered_p (lowest_th, th))
+			lowest_th = ordered_min (lowest_th, th);
+		    }
+		  else
+		    delete loop_vinfo;
+		}
 	    }
 	  else
 	    {
@@ -2416,6 +2484,8 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
 		  dump_dec (MSG_NOTE, first_loop_vinfo->vector_size);
 		  dump_printf (MSG_NOTE, "\n");
 		}
+	      LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo) = lowest_th;
+
 	      return first_loop_vinfo;
 	    }
 	  else
@@ -7925,6 +7995,189 @@ vect_transform_loop_stmt (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
     *seen_store = stmt_info;
 }
 
+/* Helper function to pass to simplify_replace_tree to enable replacing tree's
+   in the hash_map with its corresponding values.  */
+
+static tree
+find_in_mapping (tree t, void *context)
+{
+  hash_map<tree,tree>* mapping = (hash_map<tree, tree>*) context;
+
+  tree *value = mapping->get (t);
+  return value ? *value : t;
+}
+
+/* Update EPILOGUE's loop_vec_info.  EPILOGUE was constructed as a copy of the
+   original loop that has now been vectorized.
+
+   The inits of the data_references need to be advanced with the number of
+   iterations of the main loop.  This has been computed in vect_do_peeling and
+   is stored in parameter ADVANCE.  We first restore the data_references
+   initial offset with the values recored in ORIG_DRS_INIT.
+
+   Since the loop_vec_info of this EPILOGUE was constructed for the original
+   loop, its stmt_vec_infos all point to the original statements.  These need
+   to be updated to point to their corresponding copies as well as the SSA_NAMES
+   in their PATTERN_DEF_SEQs and RELATED_STMTs.
+
+   The data_reference's connections also need to be updated.  Their
+   corresponding dr_vec_info need to be reconnected to the EPILOGUE's
+   stmt_vec_infos, their statements need to point to their corresponding copy,
+   if they are gather loads or scatter stores then their reference needs to be
+   updated to point to its corresponding copy and finally we set
+   'base_misaligned' to false as we have already peeled for alignment in the
+   prologue of the main loop.  */
+
+static void
+update_epilogue_loop_vinfo (class loop *epilogue, tree advance,
+			    drs_init_vec &orig_drs_init)
+{
+  loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue);
+  auto_vec<gimple *> stmt_worklist;
+  hash_map<tree,tree> mapping;
+  gimple *orig_stmt, *new_stmt;
+  gimple_stmt_iterator epilogue_gsi;
+  gphi_iterator epilogue_phi_gsi;
+  stmt_vec_info stmt_vinfo = NULL, related_vinfo;
+  basic_block *epilogue_bbs = get_loop_body (epilogue);
+
+  LOOP_VINFO_BBS (epilogue_vinfo) = epilogue_bbs;
+
+  /* Restore original data_reference's offset, before the previous loop and its
+     prologue.  */
+  std::pair<data_reference*, tree> *dr_init;
+  unsigned i;
+  for (i = 0; orig_drs_init.iterate (i, &dr_init); i++)
+    DR_OFFSET (dr_init->first) = dr_init->second;
+
+  /* Advance data_reference's with the number of iterations of the previous
+     loop and its prologue.  */
+  vect_update_inits_of_drs (epilogue_vinfo, advance, PLUS_EXPR);
+
+
+  /* The EPILOGUE loop is a copy of the original loop so they share the same
+     gimple UIDs.  In this loop we update the loop_vec_info of the EPILOGUE to
+     point to the copied statements.  We also create a mapping of all LHS' in
+     the original loop and all the LHS' in the EPILOGUE and create worklists to
+     update teh STMT_VINFO_PATTERN_DEF_SEQs and STMT_VINFO_RELATED_STMTs.  */
+  for (unsigned i = 0; i < epilogue->num_nodes; ++i)
+    {
+      for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]);
+	   !gsi_end_p (epilogue_phi_gsi); gsi_next (&epilogue_phi_gsi))
+	{
+	  new_stmt = epilogue_phi_gsi.phi ();
+
+	  gcc_assert (gimple_uid (new_stmt) > 0);
+	  stmt_vinfo
+	    = epilogue_vinfo->stmt_vec_infos[gimple_uid (new_stmt) - 1];
+
+	  orig_stmt = STMT_VINFO_STMT (stmt_vinfo);
+	  STMT_VINFO_STMT (stmt_vinfo) = new_stmt;
+
+	  mapping.put (gimple_phi_result (orig_stmt),
+		       gimple_phi_result (new_stmt));
+	  /* PHI nodes can not have patterns or related statements.  */
+	  gcc_assert (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo) == NULL
+		      && STMT_VINFO_RELATED_STMT (stmt_vinfo) == NULL);
+	}
+
+      for (epilogue_gsi = gsi_start_bb (epilogue_bbs[i]);
+	   !gsi_end_p (epilogue_gsi); gsi_next (&epilogue_gsi))
+	{
+	  new_stmt = gsi_stmt (epilogue_gsi);
+
+	  gcc_assert (gimple_uid (new_stmt) > 0);
+	  stmt_vinfo
+	    = epilogue_vinfo->stmt_vec_infos[gimple_uid (new_stmt) - 1];
+
+	  orig_stmt = STMT_VINFO_STMT (stmt_vinfo);
+	  STMT_VINFO_STMT (stmt_vinfo) = new_stmt;
+
+	  if (tree old_lhs = gimple_get_lhs (orig_stmt))
+	    mapping.put (old_lhs, gimple_get_lhs (new_stmt));
+
+	  if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
+	    {
+	      gimple_seq seq = STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo);
+	      while (seq)
+		{
+		  stmt_worklist.safe_push (seq);
+		  seq = seq->next;
+		}
+
+	    }
+
+	  related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+	  if (related_vinfo != NULL && related_vinfo != stmt_vinfo)
+	    {
+	      gimple *stmt = STMT_VINFO_STMT (related_vinfo);
+	      stmt_worklist.safe_push (stmt);
+	      /* Set BB such that the assert in
+		'get_initial_def_for_reduction' is able to determine that
+		the BB of the related stmt is inside this loop.  */
+	      gimple_set_bb (stmt,
+			     gimple_bb (new_stmt));
+	      related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);
+	      gcc_assert (related_vinfo == NULL
+			  || related_vinfo == stmt_vinfo);
+	    }
+	}
+    }
+
+  /* The PATTERN_DEF_SEQs and RELATED_STMTs in the epilogue were constructed
+     using the original main loop and thus need to be updated to refer to the
+     cloned variables used in the epilogue.  */
+  for (unsigned i = 0; i < stmt_worklist.length (); ++i)
+    {
+      gimple *stmt = stmt_worklist[i];
+      tree *new_op;
+
+      for (unsigned j = 1; j < gimple_num_ops (stmt); ++j)
+	{
+	  tree op = gimple_op (stmt, j);
+	  if ((new_op = mapping.get(op)))
+	    gimple_set_op (stmt, j, *new_op);
+	  else
+	    {
+	      op = simplify_replace_tree (op, NULL_TREE, NULL_TREE,
+				     &find_in_mapping, &mapping);
+	      gimple_set_op (stmt, j, op);
+	    }
+	}
+    }
+
+  struct data_reference *dr;
+  vec<data_reference_p> datarefs = epilogue_vinfo->shared->datarefs;
+  FOR_EACH_VEC_ELT (datarefs, i, dr)
+    {
+      orig_stmt = DR_STMT (dr);
+      gcc_assert (gimple_uid (orig_stmt) > 0);
+      stmt_vinfo = epilogue_vinfo->stmt_vec_infos[gimple_uid (orig_stmt) - 1];
+      /* Data references for gather loads and scatter stores do not use the
+	 updated offset we set using ADVANCE.  Instead we have to make sure the
+	 reference in the data references point to the corresponding copy of
+	 the original in the epilogue.  */
+      if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo))
+	{
+	  DR_REF (dr)
+	    = simplify_replace_tree (DR_REF (dr), NULL_TREE, NULL_TREE,
+				     &find_in_mapping, &mapping);
+	  DR_BASE_ADDRESS (dr)
+	    = simplify_replace_tree (DR_BASE_ADDRESS (dr), NULL_TREE, NULL_TREE,
+				     &find_in_mapping, &mapping);
+	}
+      DR_STMT (dr) = STMT_VINFO_STMT (stmt_vinfo);
+      stmt_vinfo->dr_aux.stmt = stmt_vinfo;
+      /* The vector size of the epilogue is smaller than that of the main loop
+	 so the alignment is either the same or lower. This means the dr will
+	 thus by definition be aligned.  */
+      STMT_VINFO_DR_INFO (stmt_vinfo)->base_misaligned = false;
+    }
+
+  epilogue_vinfo->shared->datarefs_copy.release ();
+  epilogue_vinfo->shared->save_datarefs ();
+}
+
 /* Function vect_transform_loop.
 
    The analysis phase has determined that the loop is vectorizable.
@@ -7962,11 +8215,11 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   if (th >= vect_vf_for_cost (loop_vinfo)
       && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_NOTE, vect_location,
-			 "Profitability threshold is %d loop iterations.\n",
-                         th);
-      check_profitability = true;
+	if (dump_enabled_p ())
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "Profitability threshold is %d loop iterations.\n",
+			   th);
+	check_profitability = true;
     }
 
   /* Make sure there exists a single-predecessor exit bb.  Do this before 
@@ -8010,9 +8263,14 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo) = niters;
   tree nitersm1 = unshare_expr (LOOP_VINFO_NITERSM1 (loop_vinfo));
   bool niters_no_overflow = loop_niters_no_overflow (loop_vinfo);
+  tree advance;
+  drs_init_vec orig_drs_init;
+
   epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector,
 			      &step_vector, &niters_vector_mult_vf, th,
-			      check_profitability, niters_no_overflow);
+			      check_profitability, niters_no_overflow,
+			      &advance, orig_drs_init);
+
   if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo)
       && LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo).initialized_p ())
     scale_loop_frequencies (LOOP_VINFO_SCALAR_LOOP (loop_vinfo),
@@ -8271,57 +8529,14 @@ vect_transform_loop (loop_vec_info loop_vinfo)
      since vectorized loop can have loop-carried dependencies.  */
   loop->safelen = 0;
 
-  /* Don't vectorize epilogue for epilogue.  */
-  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
-    epilogue = NULL;
-
-  if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK))
-    epilogue = NULL;
-
   if (epilogue)
     {
-      auto_vector_sizes vector_sizes;
-      targetm.vectorize.autovectorize_vector_sizes (&vector_sizes, false);
-      unsigned int next_size = 0;
-
-      /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work
-         on niters already ajusted for the iterations of the prologue.  */
-      if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
-	  && known_eq (vf, lowest_vf))
-	{
-	  unsigned HOST_WIDE_INT eiters
-	    = (LOOP_VINFO_INT_NITERS (loop_vinfo)
-	       - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
-	  eiters
-	    = eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
-	  epilogue->nb_iterations_upper_bound = eiters - 1;
-	  epilogue->any_upper_bound = true;
-
-	  unsigned int ratio;
-	  while (next_size < vector_sizes.length ()
-		 && !(constant_multiple_p (loop_vinfo->vector_size,
-					   vector_sizes[next_size], &ratio)
-		      && eiters >= lowest_vf / ratio))
-	    next_size += 1;
-	}
-      else
-	while (next_size < vector_sizes.length ()
-	       && maybe_lt (loop_vinfo->vector_size, vector_sizes[next_size]))
-	  next_size += 1;
+      update_epilogue_loop_vinfo (epilogue, advance, orig_drs_init);
 
-      if (next_size == vector_sizes.length ())
-	epilogue = NULL;
-    }
-
-  if (epilogue)
-    {
+      epilogue->simduid = loop->simduid;
       epilogue->force_vectorize = loop->force_vectorize;
       epilogue->safelen = loop->safelen;
       epilogue->dont_vectorize = false;
-
-      /* We may need to if-convert epilogue to vectorize it.  */
-      if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
-	tree_if_conversion (epilogue);
     }
 
   return epilogue;
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 56be28b0cc5a77412f996e70636b08d5b615813e..71b5f380e2c91a7a551f6e26920bb17809abedf0 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -26,6 +26,7 @@ typedef class _stmt_vec_info *stmt_vec_info;
 #include "tree-data-ref.h"
 #include "tree-hash-traits.h"
 #include "target.h"
+#include <utility>
 
 /* Used for naming of new temporaries.  */
 enum vect_var_kind {
@@ -456,6 +457,8 @@ struct rgroup_masks {
 
 typedef auto_vec<rgroup_masks> vec_loop_masks;
 
+typedef auto_vec<std::pair<data_reference*, tree> > drs_init_vec;
+
 /*-----------------------------------------------------------------*/
 /* Info on vectorized loops.                                       */
 /*-----------------------------------------------------------------*/
@@ -639,6 +642,10 @@ public:
      this points to the original vectorized loop.  Otherwise NULL.  */
   _loop_vec_info *orig_loop_info;
 
+  /* Used to store loop_vec_infos of epilogues of this loop during
+     analysis.  */
+  vec<_loop_vec_info *> epilogue_vinfos;
+
 } *loop_vec_info;
 
 /* Access Functions.  */
@@ -1589,10 +1596,12 @@ class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
 						     class loop *, edge);
 class loop *vect_loop_versioning (loop_vec_info);
 extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
-				     tree *, tree *, tree *, int, bool, bool);
+				    tree *, tree *, tree *, int, bool, bool,
+				    tree *, drs_init_vec &);
 extern void vect_prepare_for_masked_peels (loop_vec_info);
 extern dump_user_location_t find_loop_location (class loop *);
 extern bool vect_can_advance_ivs_p (loop_vec_info);
+extern void vect_update_inits_of_drs (loop_vec_info, tree, tree_code);
 
 /* In tree-vect-stmts.c.  */
 extern tree get_vectype_for_scalar_type (vec_info *, tree);
@@ -1700,6 +1709,8 @@ extern tree vect_create_addr_base_for_vector_ref (stmt_vec_info, gimple_seq *,
 
 /* In tree-vect-loop.c.  */
 extern widest_int vect_iv_limit_for_full_masking (loop_vec_info loop_vinfo);
+/* Used in tree-vect-loop-manip.c */
+extern void determine_peel_for_niter (loop_vec_info);
 /* Used in gimple-loop-interchange.c and tree-parloops.c.  */
 extern bool check_reduction_path (dump_user_location_t, loop_p, gphi *, tree,
 				  enum tree_code);
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 30dcc442c4c440c44ef3ba29a03182834229ba35..8e02647c7bad6ce4a92a225a4d37f82439f771ae 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -874,6 +874,7 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
   vec_info_shared shared;
   auto_purge_vect_location sentinel;
   vect_location = find_loop_location (loop);
+
   if (LOCATION_LOCUS (vect_location.get_location_t ()) != UNKNOWN_LOCATION
       && dump_enabled_p ())
     dump_printf (MSG_NOTE | MSG_PRIORITY_INTERNALS,
@@ -881,10 +882,17 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 		 LOCATION_FILE (vect_location.get_location_t ()),
 		 LOCATION_LINE (vect_location.get_location_t ()));
 
-  /* Try to analyze the loop, retaining an opt_problem if dump_enabled_p.  */
-  opt_loop_vec_info loop_vinfo
-    = vect_analyze_loop (loop, orig_loop_vinfo, &shared);
-  loop->aux = loop_vinfo;
+  opt_loop_vec_info loop_vinfo = opt_loop_vec_info::success (NULL);
+  /* In the case of epilogue vectorization the loop already has its
+     loop_vec_info set, we do not require to analyze the loop in this case.  */
+  if (loop_vec_info vinfo = loop_vec_info_for_loop (loop))
+    loop_vinfo = opt_loop_vec_info::success (vinfo);
+  else
+    {
+      /* Try to analyze the loop, retaining an opt_problem if dump_enabled_p.  */
+      loop_vinfo = vect_analyze_loop (loop, orig_loop_vinfo, &shared);
+      loop->aux = loop_vinfo;
+    }
 
   if (!loop_vinfo)
     if (dump_enabled_p ())
@@ -1012,8 +1020,13 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 
   /* Epilogue of vectorized loop must be vectorized too.  */
   if (new_loop)
-    ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops,
-				 new_loop, loop_vinfo, NULL, NULL);
+    {
+      /* Don't include vectorized epilogues in the "vectorized loops" count.
+       */
+      unsigned dont_count = *num_vectorized_loops;
+      ret |= try_vectorize_loop_1 (simduid_to_vf_htab, &dont_count,
+				   new_loop, loop_vinfo, NULL, NULL);
+    }
 
   return ret;
 }
Richard Biener Oct. 29, 2019, 11:48 a.m. | #17
On Mon, 28 Oct 2019, Andre Vieira (lists) wrote:

> Hi,

> 

> Reworked according to your comments, see inline for clarification.

> 

> Is this OK for trunk?


+             gimple_seq seq = STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo);
+             while (seq)
+               {
+                 stmt_worklist.safe_push (seq);
+                 seq = seq->next;
+               }

you're supposed to do to the following, not access the ->next
implementation detail:

    for (gimple_stmt_iterator gsi = gsi_start (seq); !gsi_end_p (gsi); 
gsi_next (&gsi))
      stmt_worklist.safe_push (gsi_stmt (gsi));


+      /* Data references for gather loads and scatter stores do not use 
the
+        updated offset we set using ADVANCE.  Instead we have to make 
sure
the
+        reference in the data references point to the corresponding copy 
of
+        the original in the epilogue.  */
+      if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo))
+       {
+         DR_REF (dr)
+           = simplify_replace_tree (DR_REF (dr), NULL_TREE, NULL_TREE,
+                                    &find_in_mapping, &mapping);
+         DR_BASE_ADDRESS (dr)
+           = simplify_replace_tree (DR_BASE_ADDRESS (dr), NULL_TREE,
NULL_TREE,
+                                    &find_in_mapping, &mapping);
+       }

Hmm.  So for other DRs we account for the previous vector loop
by adjusting DR_OFFSET?  But STMT_VINFO_GATHER_SCATTER_P ends up
using (unconditionally) DR_REF here?  In that case it seems
best to adjust DR_REF only but NULL out DR_BASE_ADDRESS and
DR_OFFSET?  I wonder how prologue peeling deals with
STMT_VINFO_GATHER_SCATTER_P ... I see the caller of
vect_update_init_of_dr there does nothing for STMT_VINFO_GATHER_SCATTER_P.

I wonder if (as followup to not delay this further) we can
"offload" all the DR adjustment by storing ADVANCE in dr_vec_info
and accounting for that when we create the dataref pointers in
vectorizable_load/store?  That way we could avoid saving/restoring
DR_OFFSET as well. 

So, the patch is OK with the sequence iteration fixed.  I think
sorting out the above can be done as followup.

Thanks,
Richard.

> gcc/ChangeLog:

> 2019-10-28  Andre Vieira  <andre.simoesdiasvieira@arm.com>

> 

>     PR 88915

>     * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.

>     * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter

>     and make the valueize function pointer also take a void pointer.

>     * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap

>     around vn_valueize, to call it without a context.

>     (process_bb): Use vn_valueize_wrapper instead of vn_valueize.

>     * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos.

>     (~_loop_vec_info): Release epilogue_vinfos.

>     (vect_analyze_loop_costing): Use knowledge of main VF to estimate

>     number of iterations of epilogue.

>     (vect_analyze_loop_2): Adapt to analyse main loop for all supported

>     vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest

>     versioning threshold needed for main loop.

>     (vect_analyze_loop): Likewise.

>     (find_in_mapping): New helper function.

>     (update_epilogue_loop_vinfo): New function.

>     (vect_transform_loop): When vectorizing epilogues re-use analysis done

>     on main loop and call update_epilogue_loop_vinfo to update it.

>     * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert

>     stmts on loop preheader edge.

>     (vect_do_peeling): Enable skip-vectors when doing loop versioning if

>     we decided to vectorize epilogues.  Update epilogues NITERS and

>     construct ADVANCE to update epilogues data references where needed.

>     * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos.

>     (vect_do_peeling, vect_update_inits_of_drs,

>      determine_peel_for_niter, vect_analyze_loop): Add or update declarations.

>     * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already

>     created loop_vec_info's for epilogues when available.  Otherwise 

> analyse

>     epilogue separately.

> 

> 

> 

> Cheers,

> Andre

> 

> On 28/10/2019 14:16, Richard Biener wrote:

> > On Fri, 25 Oct 2019, Andre Vieira (lists) wrote:

> > 

> >> Hi,

> >>

> >> This is the reworked patch after your comments.

> >>

> >> I have moved the epilogue check into the analysis form disguised under

> >> '!epilogue_vinfos.is_empty ()'.  This because I realized that I am doing

> >> the

> >> "lowest threshold" check there.

> >>

> >> The only place where we may reject an epilogue_vinfo is when we know the

> >> number of scalar iterations and we realize the number of iterations left

> >> after

> >> the main loop are not enough to enter the vectorized epilogue so we

> >> optimize

> >> away that code-gen.  The only way we know this to be true is if the number

> >> of

> >> scalar iterations are known and the peeling for alignment is known. So we

> >> know

> >> we will enter the main loop regardless, so whether the threshold we use is

> >> for

> >> a lower VF or not it shouldn't matter as much, I would even like to think

> >> that

> >> check isn't done, but I am not sure... Might be worth checking as an

> >> optimization.

> >>

> >>

> >> Is this OK for trunk?

> > 

> > +      for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]);

> > +          !gsi_end_p (epilogue_phi_gsi); gsi_next (&epilogue_phi_gsi))

> > +       {

> > ..

> > +         if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))

> > +           pattern_worklist.safe_push (stmt_vinfo);

> > +

> > +         related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);

> > +         while (related_vinfo && related_vinfo != stmt_vinfo)

> > +           {

> > 

> > I think PHIs cannot have patterns.  You can assert

> > that STMT_VINFO_RELATED_STMT is NULL I think.

> 

> Done.

> > 

> > +         related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);

> > +         while (related_vinfo && related_vinfo != stmt_vinfo)

> > +           {

> > +             related_worklist.safe_push (related_vinfo);

> > +             /* Set BB such that the assert in

> > +               'get_initial_def_for_reduction' is able to determine that

> > +               the BB of the related stmt is inside this loop.  */

> > +             gimple_set_bb (STMT_VINFO_STMT (related_vinfo),

> > +                            gimple_bb (new_stmt));

> > +             related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);

> > +           }

> > 

> > do we really keep references to "nested" patterns?  Thus, do you

> > need this loop?

> 

> Changed and added asserts.  They didn't trigger so I suppose you are right, I

> didn't know at the time whether it was possible, so I just operated on the

> side of caution.  Can remove the asserts and so on if you want.

> > 

> > +  /* The PATTERN_DEF_SEQs in the epilogue were constructed using the

> > +     original main loop and thus need to be updated to refer to the

> > cloned

> > +     variables used in the epilogue.  */

> > +  for (unsigned i = 0; i < pattern_worklist.length (); ++i)

> > +    {

> > ...

> > +                 op = simplify_replace_tree (op, NULL_TREE, NULL_TREE,

> > +                                        &find_in_mapping, &mapping);

> > +                 gimple_set_op (seq, j, op);

> > 

> > you do this for the pattern-def seq but not for the related one.

> > I guess you ran into this for COND_EXPR conditions.  I wondered

> > to use a shared worklist for both the def-seq and the main pattern

> > stmt or at least to split out the replacement so you can share it.

> 

> I think that was it yeah, reworked it now to use the same list. Less code,

> thanks!

> > 

> > +      /* Data references for gather loads and scatter stores do not use

> > the

> > +        updated offset we set using ADVANCE.  Instead we have to make

> > sure the

> > +        reference in the data references point to the corresponding copy

> > of

> > +        the original in the epilogue.  */

> > +      if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo))

> > +       {

> > +         int j;

> > +         if (TREE_CODE (DR_REF (dr)) == MEM_REF)

> > +           j = 0;

> > +         else if (TREE_CODE (DR_REF (dr)) == ARRAY_REF)

> > +           j = 1;

> > +         else

> > +           gcc_unreachable ();

> > +

> > +         if (tree *new_op = mapping.get (TREE_OPERAND (DR_REF (dr), j)))

> > +           {

> > +             DR_REF (dr) = unshare_expr (DR_REF (dr));

> > +             TREE_OPERAND (DR_REF (dr), j) = *new_op;

> > +           }

> > 

> > huh, do you really only ever see MEM_REF or ARRAY_REF here?

> > I would guess using simplify_replace_tree is safer.

> > There's also DR_BASE_ADDRESS - we seem to leave the DRs partially

> > updated, is that correct?

> 

> Yeah can use simplify_replace_tree indeed.  And I have changed it so it

> updates DR_BASE_ADDRESS.  I think DR_BASE_ADDRESS never actually changed in

> the way we use data_references... Either way, replacing them if they do change

> is cleaner and more future proof.

> > 

> > Otherwise looks OK to me.

> > 

> > Thanks,

> > Richard.

> > 

> > 

> >> gcc/ChangeLog:

> >> 2019-10-25  Andre Vieira  <andre.simoesdiasvieira@arm.com>

> >>

> >>      PR 88915

> >>      * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.

> >>      * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter

> >>      and make the valueize function pointer also take a void pointer.

> >>      * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap

> >>      around vn_valueize, to call it without a context.

> >>      (process_bb): Use vn_valueize_wrapper instead of vn_valueize.

> >>      * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos.

> >>      (~_loop_vec_info): Release epilogue_vinfos.

> >>      (vect_analyze_loop_costing): Use knowledge of main VF to estimate

> >>      number of iterations of epilogue.

> >>      (vect_analyze_loop_2): Adapt to analyse main loop for all supported

> >>      vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest

> >>      versioning threshold needed for main loop.

> >>      (vect_analyze_loop): Likewise.

> >>      (find_in_mapping): New helper function.

> >>      (update_epilogue_loop_vinfo): New function.

> >>      (vect_transform_loop): When vectorizing epilogues re-use analysis done

> >>      on main loop and call update_epilogue_loop_vinfo to update it.

> >>      * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert

> >>      stmts on loop preheader edge.

> >>      (vect_do_peeling): Enable skip-vectors when doing loop versioning if

> >>      we decided to vectorize epilogues.  Update epilogues NITERS and

> >>      construct ADVANCE to update epilogues data references where needed.

> >>      * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos.

> >>      (vect_do_peeling, vect_update_inits_of_drs,

> >>      determine_peel_for_niter, vect_analyze_loop): Add or update

> >>      declarations.

> >>      * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already

> >>      created loop_vec_info's for epilogues when available.  Otherwise

> >> analyse

> >>      epilogue separately.

> >>

> >>

> >>

> >> Cheers,

> >> Andre

> >>

> > 

> 

> 


-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

Patch

diff --git a/gcc/gengtype.c b/gcc/gengtype.c
index 53317337cf8c8e8caefd6b819d28b3bba301e755..56ffa08a7dee54837441f0c743f8c0faa285c74b 100644
--- a/gcc/gengtype.c
+++ b/gcc/gengtype.c
@@ -5197,6 +5197,7 @@  main (int argc, char **argv)
       POS_HERE (do_scalar_typedef ("widest_int", &pos));
       POS_HERE (do_scalar_typedef ("int64_t", &pos));
       POS_HERE (do_scalar_typedef ("poly_int64", &pos));
+      POS_HERE (do_scalar_typedef ("poly_uint64", &pos));
       POS_HERE (do_scalar_typedef ("uint64_t", &pos));
       POS_HERE (do_scalar_typedef ("uint8", &pos));
       POS_HERE (do_scalar_typedef ("uintptr_t", &pos));
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 5c25441c70a271f04730486e513437fffa75b7e3..3b5f14c45b5b9b601120c6776734bbafefe1e178 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2401,7 +2401,8 @@  class loop *
 vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 		 tree *niters_vector, tree *step_vector,
 		 tree *niters_vector_mult_vf_var, int th,
-		 bool check_profitability, bool niters_no_overflow)
+		 bool check_profitability, bool niters_no_overflow,
+		 bool vect_epilogues_nomask)
 {
   edge e, guard_e;
   tree type = TREE_TYPE (niters), guard_cond;
@@ -2474,7 +2475,8 @@  vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
 		      ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo),
 				  bound_prolog + bound_epilog)
-		      : !LOOP_REQUIRES_VERSIONING (loop_vinfo));
+		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+			 || vect_epilogues_nomask));
   /* Epilog loop must be executed if the number of iterations for epilog
      loop is known at compile time, otherwise we need to add a check at
      the end of vector loop and skip to the end of epilog loop.  */
@@ -2966,9 +2968,7 @@  vect_create_cond_for_alias_checks (loop_vec_info loop_vinfo, tree * cond_expr)
    *COND_EXPR_STMT_LIST.  */
 
 class loop *
-vect_loop_versioning (loop_vec_info loop_vinfo,
-		      unsigned int th, bool check_profitability,
-		      poly_uint64 versioning_threshold)
+vect_loop_versioning (loop_vec_info loop_vinfo)
 {
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo), *nloop;
   class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
@@ -2988,10 +2988,15 @@  vect_loop_versioning (loop_vec_info loop_vinfo,
   bool version_align = LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT (loop_vinfo);
   bool version_alias = LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo);
   bool version_niter = LOOP_REQUIRES_VERSIONING_FOR_NITERS (loop_vinfo);
+  poly_uint64 versioning_threshold
+    = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
   tree version_simd_if_cond
     = LOOP_REQUIRES_VERSIONING_FOR_SIMD_IF_COND (loop_vinfo);
+  unsigned th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
 
-  if (check_profitability)
+  if (th >= vect_vf_for_cost (loop_vinfo)
+      && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+      && !ordered_p (th, versioning_threshold))
     cond_expr = fold_build2 (GE_EXPR, boolean_type_node, scalar_loop_iters,
 			     build_int_cst (TREE_TYPE (scalar_loop_iters),
 					    th - 1));
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index b0cbbac0cb5ba1ffce706715d3dbb9139063803d..305ee2b06eabde9091049da829e6fc93161aa13f 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1858,7 +1858,8 @@  vect_dissolve_slp_only_groups (loop_vec_info loop_vinfo)
    for it.  The different analyses will record information in the
    loop_vec_info struct.  */
 static opt_result
-vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
+vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts,
+		     bool *vect_epilogues_nomask)
 {
   opt_result ok = opt_result::success ();
   int res;
@@ -2179,6 +2180,11 @@  start_over:
         }
     }
 
+  /* Disable epilogue vectorization if versioning is required because of the
+     iteration count.  TODO: Needs investigation as to whether it is possible
+     to vectorize epilogues in this case.  */
+  *vect_epilogues_nomask &= !LOOP_REQUIRES_VERSIONING_FOR_NITERS (loop_vinfo);
+
   /* During peeling, we need to check if number of loop iterations is
      enough for both peeled prolog loop and vector loop.  This check
      can be merged along with threshold check of loop versioning, so
@@ -2186,6 +2192,7 @@  start_over:
   if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
     {
       poly_uint64 niters_th = 0;
+      unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
 
       if (!vect_use_loop_mask_for_alignment_p (loop_vinfo))
 	{
@@ -2206,6 +2213,14 @@  start_over:
       /* One additional iteration because of peeling for gap.  */
       if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
 	niters_th += 1;
+
+      /*  Use the same condition as vect_transform_loop to decide when to use
+	  the cost to determine a versioning threshold.  */
+      if (th >= vect_vf_for_cost (loop_vinfo)
+	  && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+	  && ordered_p (th, niters_th))
+	niters_th = ordered_max (poly_uint64 (th), niters_th);
+
       LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo) = niters_th;
     }
 
@@ -2329,7 +2344,7 @@  again:
    be vectorized.  */
 opt_loop_vec_info
 vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
-		   vec_info_shared *shared)
+		   vec_info_shared *shared, bool *vect_epilogues_nomask)
 {
   auto_vector_sizes vector_sizes;
 
@@ -2357,6 +2372,7 @@  vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
   poly_uint64 autodetected_vector_size = 0;
   opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL);
   poly_uint64 first_vector_size = 0;
+  unsigned vectorized_loops = 0;
   while (1)
     {
       /* Check the CFG characteristics of the loop (nesting, entry/exit).  */
@@ -2376,14 +2392,17 @@  vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
       if (orig_loop_vinfo)
 	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo;
 
-      opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts);
+      opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts,
+					    vect_epilogues_nomask);
       if (res)
 	{
 	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
+	  vectorized_loops++;
 
-	  if (loop->simdlen
-	      && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
-			   (unsigned HOST_WIDE_INT) loop->simdlen))
+	  if ((loop->simdlen
+	       && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+			    (unsigned HOST_WIDE_INT) loop->simdlen))
+	      || *vect_epilogues_nomask)
 	    {
 	      if (first_loop_vinfo == NULL)
 		{
@@ -2392,7 +2411,13 @@  vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
 		  loop->aux = NULL;
 		}
 	      else
-		delete loop_vinfo;
+		{
+		  /* Set versioning threshold of the original LOOP_VINFO based
+		     on the last vectorization of the epilog.  */
+		  LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo)
+		    = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
+		  delete loop_vinfo;
+		}
 	    }
 	  else
 	    {
@@ -2401,7 +2426,12 @@  vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
 	    }
 	}
       else
-	delete loop_vinfo;
+	{
+	  /* Disable epilog vectorization if we can't determine the epilogs can
+	     be vectorized.  */
+	  *vect_epilogues_nomask &= vectorized_loops > 1;
+	  delete loop_vinfo;
+	}
 
       if (next_size == 0)
 	autodetected_vector_size = current_vector_size;
@@ -8468,7 +8498,7 @@  vect_transform_loop_stmt (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
    Returns scalar epilogue loop if any.  */
 
 class loop *
-vect_transform_loop (loop_vec_info loop_vinfo)
+vect_transform_loop (loop_vec_info loop_vinfo, bool vect_epilogues_nomask)
 {
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   class loop *epilogue = NULL;
@@ -8497,11 +8527,11 @@  vect_transform_loop (loop_vec_info loop_vinfo)
   if (th >= vect_vf_for_cost (loop_vinfo)
       && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_NOTE, vect_location,
-			 "Profitability threshold is %d loop iterations.\n",
-                         th);
-      check_profitability = true;
+	if (dump_enabled_p ())
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "Profitability threshold is %d loop iterations.\n",
+			   th);
+	check_profitability = true;
     }
 
   /* Make sure there exists a single-predecessor exit bb.  Do this before 
@@ -8519,18 +8549,8 @@  vect_transform_loop (loop_vec_info loop_vinfo)
 
   if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
     {
-      poly_uint64 versioning_threshold
-	= LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
-      if (check_profitability
-	  && ordered_p (poly_uint64 (th), versioning_threshold))
-	{
-	  versioning_threshold = ordered_max (poly_uint64 (th),
-					      versioning_threshold);
-	  check_profitability = false;
-	}
       class loop *sloop
-	= vect_loop_versioning (loop_vinfo, th, check_profitability,
-				versioning_threshold);
+	= vect_loop_versioning (loop_vinfo);
       sloop->force_vectorize = false;
       check_profitability = false;
     }
@@ -8557,7 +8577,8 @@  vect_transform_loop (loop_vec_info loop_vinfo)
   bool niters_no_overflow = loop_niters_no_overflow (loop_vinfo);
   epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector,
 			      &step_vector, &niters_vector_mult_vf, th,
-			      check_profitability, niters_no_overflow);
+			      check_profitability, niters_no_overflow,
+			      vect_epilogues_nomask);
   if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo)
       && LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo).initialized_p ())
     scale_loop_frequencies (LOOP_VINFO_SCALAR_LOOP (loop_vinfo),
@@ -8818,7 +8839,7 @@  vect_transform_loop (loop_vec_info loop_vinfo)
   if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
     epilogue = NULL;
 
-  if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK))
+  if (!vect_epilogues_nomask)
     epilogue = NULL;
 
   if (epilogue)
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 1456cde4c2c2dec7244c504d2c496248894a4f1e..e87170c592036a6f3f5330e1ebf5d125441861a6 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1480,10 +1480,10 @@  extern void vect_set_loop_condition (class loop *, loop_vec_info,
 extern bool slpeel_can_duplicate_loop_p (const class loop *, const_edge);
 class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
 						     class loop *, edge);
-class loop *vect_loop_versioning (loop_vec_info, unsigned int, bool,
-				   poly_uint64);
+class loop *vect_loop_versioning (loop_vec_info);
 extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
-				     tree *, tree *, tree *, int, bool, bool);
+				    tree *, tree *, tree *, int, bool, bool,
+				    bool);
 extern void vect_prepare_for_masked_peels (loop_vec_info);
 extern dump_user_location_t find_loop_location (class loop *);
 extern bool vect_can_advance_ivs_p (loop_vec_info);
@@ -1610,7 +1610,8 @@  extern bool check_reduction_path (dump_user_location_t, loop_p, gphi *, tree,
 /* Drive for loop analysis stage.  */
 extern opt_loop_vec_info vect_analyze_loop (class loop *,
 					    loop_vec_info,
-					    vec_info_shared *);
+					    vec_info_shared *,
+					    bool *);
 extern tree vect_build_loop_niters (loop_vec_info, bool * = NULL);
 extern void vect_gen_vector_loop_niters (loop_vec_info, tree, tree *,
 					 tree *, bool);
@@ -1622,7 +1623,7 @@  extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
 				unsigned int, tree, unsigned int);
 
 /* Drive for loop transformation stage.  */
-extern class loop *vect_transform_loop (loop_vec_info);
+extern class loop *vect_transform_loop (loop_vec_info, bool);
 extern opt_loop_vec_info vect_analyze_loop_form (class loop *,
 						 vec_info_shared *);
 extern bool vectorizable_live_operation (stmt_vec_info, gimple_stmt_iterator *,
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 173e6b51652fd023893b38da786ff28f827553b5..25c3fc8ff55e017ae0b971fa93ce8ce2a07cb94c 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -61,6 +61,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "tree.h"
 #include "gimple.h"
 #include "predict.h"
+#include "params.h"
 #include "tree-pass.h"
 #include "ssa.h"
 #include "cgraph.h"
@@ -875,6 +876,7 @@  try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
   vec_info_shared shared;
   auto_purge_vect_location sentinel;
   vect_location = find_loop_location (loop);
+  bool vect_epilogues_nomask = PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK);
   if (LOCATION_LOCUS (vect_location.get_location_t ()) != UNKNOWN_LOCATION
       && dump_enabled_p ())
     dump_printf (MSG_NOTE | MSG_PRIORITY_INTERNALS,
@@ -884,7 +886,7 @@  try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 
   /* Try to analyze the loop, retaining an opt_problem if dump_enabled_p.  */
   opt_loop_vec_info loop_vinfo
-    = vect_analyze_loop (loop, orig_loop_vinfo, &shared);
+    = vect_analyze_loop (loop, orig_loop_vinfo, &shared, &vect_epilogues_nomask);
   loop->aux = loop_vinfo;
 
   if (!loop_vinfo)
@@ -980,7 +982,7 @@  try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 			 "loop vectorized using variable length vectors\n");
     }
 
-  loop_p new_loop = vect_transform_loop (loop_vinfo);
+  loop_p new_loop = vect_transform_loop (loop_vinfo, vect_epilogues_nomask);
   (*num_vectorized_loops)++;
   /* Now that the loop has been vectorized, allow it to be unrolled
      etc.  */
@@ -1013,8 +1015,13 @@  try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 
   /* Epilogue of vectorized loop must be vectorized too.  */
   if (new_loop)
-    ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops,
-				 new_loop, loop_vinfo, NULL, NULL);
+    {
+      /* Don't include vectorized epilogues in the "vectorized loops" count.
+       */
+      unsigned dont_count = *num_vectorized_loops;
+      ret |= try_vectorize_loop_1 (simduid_to_vf_htab, &dont_count,
+				   new_loop, loop_vinfo, NULL, NULL);
+    }
 
   return ret;
 }