[5/7,v3] vect: Support vector load/store with length in vectorizer

Message ID 68981da8-f7a1-4c95-6d64-2a1d8b748b9f@linux.ibm.com

Commit Message

Kewen.Lin via Gcc-patches June 2, 2020, 9:03 a.m.
Hi Richard,

on 2020/5/29 4:32 PM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:

>> on 2020/5/27 6:02 PM, Richard Sandiford wrote:

>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:

>>>> Hi Richard,

>>>>


Snip ...

>>

>> Thanks a lot for your detailed explanation!  This proposal looks good

>> based on the current implementation of both masking and length.  I may

>> think too much, but I had a bit concern as below when some targets have

>> both masking and length supports in future, such as ppc adds masking

>> support like SVE.

>>

>> I assumed that you meant each vectorizable_* routine should record the

>> objs for any available partial vectorisation approaches.  If one target

>> supports both, we would have both recorded but decide not to do partial

>> vectorisation finally since both have records.  The target can disable

>> length like through optab to resolve it, but there is one possibility

>> that the masking support can be imperfect initially since ISA support

>> could be gradual, it further leads some vectorizable_* check or final

>> verification to fail for masking, and length approach may work here but

>> it gets disabled.  We can miss to use partial vectorisation here.

>>

>> The other assumption is that each vectorizable_* routine record the 

>> first available partial vectorisation approach, let's assume masking

>> takes preference, then it's fine to record just one here even if one

>> target supports both approaches, but we still have the possiblity to

>> miss the partial vectorisation chance as some check/verify fail with

>> masking but fine with length.

>>

>> Does this concern make sense?

> 

> There's nothing to stop us using masks and lengths in the same loop

> in future if we need to.  It would “just” be a case of setting up both

> the masks and the lengths in vect_set_loop_condition.  But the point is

> that doing that would be extra code, and there's no point writing that

> extra code until it's needed.

> 

> If some future arch does support both mask-based and length-based

> approaches, I think that's even less reason to make a binary choice

> between them.  How we prioritise the length and mask approaches when

> both are available is something that we'll have to decide at the time.

> 

> If your concern is that the arch might support masked operations

> without wanting them to be used for loop control, we could test for

> that case by checking whether while_ult_optab is implemented.

> 

> Thanks,

> Richard

> 


Thanks for your further explanation.  As you pointed out, my concern
is just one case of mixing mask-based and length-based approaches.  I
didn't realize that and thought we would still use only one approach
per loop, which was short-sighted.

The v3 patch is attached and uses can_partial_vect_p.  In regression
testing with an explicit vect-with-length-scope setting, I saw several
reduction failures, so I updated vectorizable_condition to set
can_partial_vect_p to false for !EXTRACT_LAST_REDUCTION, following your
guidance that it should either record something or clear
can_partial_vect_p.
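
Roughly, the guard is along these lines (sketch only; the surrounding
context in vectorizable_condition is omitted and simplified):

  if (loop_vinfo && LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo))
    {
      if (reduction_type == EXTRACT_LAST_REDUCTION)
        /* Record the mask as before.  */
        vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
                               ncopies * vec_num, vectype, NULL);
      else
        /* Nothing is recorded for this statement, so partial
           vectorization has to be disabled for the whole loop.  */
        LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
    }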

Bootstrapped/regtested on powerpc64le-linux-gnu P9; no notable failures
were found, even with explicit vect-with-length-scope settings.

But I hit one regression failure on aarch64-linux-gnu, as shown below:

PASS->FAIL: gcc.target/aarch64/sve/reduc_8.c -march=armv8.2-a+sve  scan-assembler-not \\tcmpeq\\tp[0-9]+\\.s,

It's caused by the change to vectorizable_condition; without that change,
the outer loop can be fully masked.  The reduction_type is
TREE_CODE_REDUCTION here, so can_partial_vect_p gets cleared.

From the optimized dump, the previous IR looks fine.  The reduction is
done in the inner loop, but we are checking partial vectorisation for
the outer loop.  I'm not sure whether adjusting the current guard for
this case is reasonable.  Could you give some insights?  Thanks in
advance!

BR,
Kewen
------
gcc/ChangeLog

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* doc/invoke.texi (vect-with-length-scope): Document new option.
	* params.opt (vect-with-length-scope): New.
	* tree-vect-loop-manip.c (vect_set_loop_mask): Renamed to ...
	(vect_set_loop_mask_or_len): ... this.  Update variable names
	accordingly.
	(vect_maybe_permute_loop_masks): Replace rgroup_masks with rgroup_objs.
	(vect_set_loop_masks_directly): Renamed to ...
	(vect_set_loop_objs_directly): ... this.  Extend the support to cover
	vector with length, call vect_gen_len for length, replace rgroup_masks
	with rgroup_objs, replace vect_set_loop_mask with
	vect_set_loop_mask_or_len.
	(vect_set_loop_condition_masked): Renamed to ...
	(vect_set_loop_condition_partial): ... this.  Extend the support to
	cover length-based partial vectorization, replace rgroup_masks with
	rgroup_objs, replace vect_iv_limit_for_full_masking with
	vect_iv_limit_for_partial_vect.
	(vect_set_loop_condition_unmasked): Renamed to ...
	(vect_set_loop_condition_normal): ... this.
	(vect_set_loop_condition): Replace vect_set_loop_condition_masked with
	vect_set_loop_condition_partial, replace
	vect_set_loop_condition_unmasked with vect_set_loop_condition_normal.
	(vect_gen_vector_loop_niters): Use LOOP_VINFO_PARTIAL_VECT_P for
	partial vectorization case instead of LOOP_VINFO_FULLY_MASKED_P.
	(vect_do_peeling): Use LOOP_VINFO_PARTIAL_VECT_P for partial
	vectorization case instead of LOOP_VINFO_FULLY_MASKED_P, adjust for
	epilogue handling for length-based partial vectorization.
	* tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
	fully_with_length_p and epil_partial_vect_p, replace can_fully_mask_p
	with can_partial_vect_p.
	(release_vec_loop_masks): Replace rgroup_masks with rgroup_objs.
	(release_vec_loop_lens): New function.
	(_loop_vec_info::~_loop_vec_info): Use it to free the loop lens.
	(can_produce_all_loop_masks_p): Replace rgroup_masks with rgroup_objs.
	(vect_get_max_nscalars_per_iter): Likewise.
	(min_prec_for_max_niters): New function.  Factored out from ...
	(vect_verify_full_masking): ... this.  Replace
	vect_iv_limit_for_full_masking with vect_iv_limit_for_partial_vect.
	(vect_verify_loop_lens): New function.
	(vect_analyze_loop_costing): Use LOOP_VINFO_PARTIAL_VECT_P for partial
	vectorization case instead of LOOP_VINFO_FULLY_MASKED_P.
	(determine_peel_for_niter): Likewise.
	(vect_analyze_loop_2): Replace LOOP_VINFO_CAN_FULLY_MASK_P with
	LOOP_VINFO_CAN_PARTIAL_VECT_P, replace LOOP_VINFO_FULLY_MASKED_P with
	LOOP_VINFO_PARTIAL_VECT_P.  Check loop-wide reasons for disabling loops
	with length.  Make the final decision about using vector access with
	length or not.  Disable LOOP_VINFO_CAN_PARTIAL_VECT_P if both
	mask-based and length-based approaches are recorded.  Mark the
	epilogue to go with the length-based approach if suitable.
	(vect_analyze_loop): Add handling for the epilogue of a loop that is
	marked to use the partial vectorization approach.
	(vect_estimate_min_profitable_iters): Replace rgroup_masks with
	rgroup_objs.  Adjust for loop with length-based partial vectorization.
	(vectorizable_reduction): Replace LOOP_VINFO_CAN_FULLY_MASK_P with
	LOOP_VINFO_CAN_PARTIAL_VECT_P, adjust some dumpings.
	(vectorizable_live_operation): Likewise.
	(vect_record_loop_mask): Replace rgroup_masks with rgroup_objs.
	(vect_get_loop_mask): Likewise.
	(vect_record_loop_len): New function.
	(vect_get_loop_len): Likewise.
	(vect_transform_loop): Use LOOP_VINFO_PARTIAL_VECT_P for partial
	vectorization case instead of LOOP_VINFO_FULLY_MASKED_P.
	(vect_iv_limit_for_full_masking): Renamed to ...
	(vect_iv_limit_for_partial_vect): ... here. 
	* tree-vect-stmts.c (permute_vec_elements):
	(check_load_store_masking): Renamed to ...
	(check_load_store_partial_vect): ... here.  Add length-based partial
	vectorization checks.
	(vectorizable_operation): Replace LOOP_VINFO_CAN_FULLY_MASK_P with
	LOOP_VINFO_CAN_PARTIAL_VECT_P.
	(vectorizable_store): Replace check_load_store_masking with
	check_load_store_partial_vect.  Add handling for length-based partial
	vectorization.
	(vectorizable_load): Likewise.
	(vectorizable_condition): Replace LOOP_VINFO_CAN_FULLY_MASK_P with
	LOOP_VINFO_CAN_PARTIAL_VECT_P.  Guard partial vectorization reduction
	only for EXTRACT_LAST_REDUCTION.
	(vect_gen_len): New function.
	* tree-vectorizer.h (struct rgroup_masks): Renamed to ...
	(struct rgroup_objs): ... this.  Add anonymous unions for the fields
	max_nscalars_per_iter and mask_type.
	(vec_loop_lens): New typedef.
	(_loop_vec_info): Add lens, fully_with_length_p and
	epil_partial_vect_p.  Rename can_fully_mask_p to can_partial_vect_p.
	(LOOP_VINFO_CAN_FULLY_MASK_P): Renamed to ...
	(LOOP_VINFO_CAN_PARTIAL_VECT_P): ... this.
	(LOOP_VINFO_FULLY_WITH_LENGTH_P): New macro.
	(LOOP_VINFO_EPIL_PARTIAL_VECT_P): Likewise.
	(LOOP_VINFO_LENS): Likewise.
	(LOOP_VINFO_PARTIAL_VECT_P): Likewise.
	(vect_iv_limit_for_full_masking): Renamed to ...
	(vect_iv_limit_for_partial_vect): ... this.
	(vect_record_loop_len): New declaration.
	(vect_get_loop_len): Likewise.
	(vect_gen_len): Likewise.

Comments

Richard Sandiford June 2, 2020, 11:50 a.m. | #1
"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi Richard,

>

> on 2020/5/29 4:32 PM, Richard Sandiford wrote:

>> "Kewen.Lin" <linkw@linux.ibm.com> writes:

>>> on 2020/5/27 6:02 PM, Richard Sandiford wrote:

>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:

>>>>> Hi Richard,

>>>>>

>

> Snip ...

>

>>>

>>> Thanks a lot for your detailed explanation!  This proposal looks good

>>> based on the current implementation of both masking and length.  I may

>>> think too much, but I had a bit concern as below when some targets have

>>> both masking and length supports in future, such as ppc adds masking

>>> support like SVE.

>>>

>>> I assumed that you meant each vectorizable_* routine should record the

>>> objs for any available partial vectorisation approaches.  If one target

>>> supports both, we would have both recorded but decide not to do partial

>>> vectorisation finally since both have records.  The target can disable

>>> length like through optab to resolve it, but there is one possibility

>>> that the masking support can be imperfect initially since ISA support

>>> could be gradual, it further leads some vectorizable_* check or final

>>> verification to fail for masking, and length approach may work here but

>>> it gets disabled.  We can miss to use partial vectorisation here.

>>>

>>> The other assumption is that each vectorizable_* routine record the 

>>> first available partial vectorisation approach, let's assume masking

>>> takes preference, then it's fine to record just one here even if one

>>> target supports both approaches, but we still have the possiblity to

>>> miss the partial vectorisation chance as some check/verify fail with

>>> masking but fine with length.

>>>

>>> Does this concern make sense?

>> 

>> There's nothing to stop us using masks and lengths in the same loop

>> in future if we need to.  It would “just” be a case of setting up both

>> the masks and the lengths in vect_set_loop_condition.  But the point is

>> that doing that would be extra code, and there's no point writing that

>> extra code until it's needed.

>> 

>> If some future arch does support both mask-based and length-based

>> approaches, I think that's even less reason to make a binary choice

>> between them.  How we prioritise the length and mask approaches when

>> both are available is something that we'll have to decide at the time.

>> 

>> If your concern is that the arch might support masked operations

>> without wanting them to be used for loop control, we could test for

>> that case by checking whether while_ult_optab is implemented.

>> 

>> Thanks,

>> Richard

>> 

>

> Thanks for your further expalanation, as you pointed out, my concern

> is just one case of mixing mask-based and length-based.  I didn't

> realize it and thought we still used one approach for one loop at the

> time, but it's senseless.

>

> The v3 patch attached to use can_partial_vect_p.  In the regression

> testing with explicit vect-with-length-scope setting, I saw several

> reduction failures, updated vectorizable_condition to set

> can_partial_vect_p to false for !EXTRACT_LAST_REDUCTION under your

> guidance to ensure it either records sth. or clearing

> can_partial_vect_p.

>

> Bootstrapped/regtested on powerpc64le-linux-gnu P9 and no remarkable

> failures found even with explicit vect-with-length-scope settings.

>

> But I met one regression failure on aarch64-linux-gnu as below:

>

> PASS->FAIL: gcc.target/aarch64/sve/reduc_8.c -march=armv8.2-a+sve  scan-assembler-not \\tcmpeq\\tp[0-9]+\\.s,

>

> It's caused by vectorizable_condition's change, without the change,

> it can use fully-masking for the outer loop.  The reduction_type is

> TREE_CODE_REDUCTION here, so can_partial_vect_p gets cleared.

>

> From the optimized dumping, the previous IRs look fine.  It's doing

> reduction for inner loop, but we are checking partial vectorisation

> for the outer loop.  I'm not sure whether to adjust the current

> guard is reasonable for this case.  Could you help to give some

> insights?  Thanks in advance!

>

> BR,

> Kewen

> ------

> gcc/ChangeLog


It would be easier to review, and perhaps easier to bisect,
if some of the mechanical changes were split out.  E.g.:

- Rename can_fully_mask_p to can_use_partial_vectors_p.

- Rename fully_masked_p to using_partial_vectors_p.

- Rename things related to rgroup_masks.  I think “rgroup_controls”
  or “rgroup_guards” might be more descriptive than “rgroup_objs”.

These should be fairly mechanical changes and can happen ahead of
the main series.  It'll then be easier to see what's different
for masks and lengths, separately from the more mechanical stuff.

As far as:

+  union
+  {
+    /* The type of mask to use, based on the highest nS recorded above.  */
+    tree mask_type;
+    /* Any vector type to use these lengths.  */
+    tree vec_type;
+  };

goes, some parts of the code seem to use mask_type for lengths too,
which I'm a bit nervous about.  I think we should either be consistent
about which union field we use (always mask_type for masks, always
vec_type for lengths) or we should just rename mask_type to something
more generic.  Just "type" might be good enough with a suitable comment.
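
E.g. something like this (field and structure names purely illustrative):

  struct rgroup_controls {
    /* The largest nS for all members of the rgroup.  */
    unsigned int max_nscalars_per_iter;
    /* The type of the controls: a mask type when using masks, a vector
       type when using lengths.  */
    tree type;
    /* The controls themselves, one per vector (nV in total).  */
    vec<tree> controls;
  };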

>  {

>    tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);

>    tree iv_type = LOOP_VINFO_MASK_IV_TYPE (loop_vinfo);

> -  tree mask_type = rgm->mask_type;

> -  unsigned int nscalars_per_iter = rgm->max_nscalars_per_iter;

> -  poly_uint64 nscalars_per_mask = TYPE_VECTOR_SUBPARTS (mask_type);

> +

> +  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);

> +  if (!vect_for_masking)

> +    {

> +      /* Obtain target supported length type.  */

> +      scalar_int_mode len_mode = targetm.vectorize.length_mode;

> +      unsigned int len_prec = GET_MODE_PRECISION (len_mode);

> +      compare_type = build_nonstandard_integer_type (len_prec, true);

> +      /* Simply set iv_type as same as compare_type.  */

> +      iv_type = compare_type;


This might not be the best time to bring this up :-) but it seems
odd to be asking the target for the induction variable type here.
I got the impression that the hook was returning DImode, whereas
the PowerPC instructions only looked at the low 8 bits of the length.
If so, forcing a naturally 32-bit IV to DImode would insert extra
sign/zero extensions, even though the input to the length intrinsics
would have been happy with the 32-bit IV.

I think it would make sense to ask the target for its minimum
precision P (which would be 8 bits if the above is correct).
The starting point would then be the maximum of:

- this P
- the IV's natural precision
- the precision needed to hold:
    the maximum number of scalar iterations multiplied by the scale factor
    (to convert scalar counts to bytes)

If the IV might wrap at that precision without producing all-zero lengths,
it would be worth doubling the precision to avoid the wrapping issue,
provided that we don't go beyond BITS_PER_WORD.
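
Sketched as code, the calculation might look something like this (names
purely illustrative, not something in the patch):

  static unsigned int
  min_len_iv_precision (unsigned int target_min_prec, unsigned int iv_prec,
                        widest_int max_niters, unsigned int factor,
                        bool might_wrap_p)
  {
    /* Precision needed to hold the byte count max_niters * factor.  */
    unsigned int niters_prec
      = wi::min_precision (max_niters * factor, UNSIGNED);
    unsigned int prec = MAX (target_min_prec, MAX (iv_prec, niters_prec));
    /* Doubling avoids the IV wrapping to an all-zero length, but don't
       go beyond BITS_PER_WORD.  */
    if (might_wrap_p && prec * 2 <= BITS_PER_WORD)
      prec *= 2;
    return prec;
  }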

> +  tree obj_type = rgo->mask_type;

> +  /* Here, take nscalars_per_iter as nbytes_per_iter for length.  */

> +  unsigned int nscalars_per_iter = rgo->max_nscalars_per_iter;


I think whether we count scalars or count bytes is really a separate
decision that shouldn't be tied directly to using lengths.  Length-based
loads and stores on other arches might want to count scalars too.
I'm not saying you should add support for that (it wouldn't be tested),
but I think we should avoid structuring the code to make it harder to
add in future.

So I think nscalars_per_iter should always count scalars and anything
length-based should be separate.  Would it make sense to store the
length scale factor as a separate field?  I.e. using the terms
above the rgroup_masks comment, the length IV step is:

   factor * nS * VF == factor * nV * nL

That way, applying the factor becomes separate from lengths vs. masks.
The factor would also be useful in calculating the IV precision above.
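
For instance, with 4-byte scalars stored in V4SI vectors (factor == 4,
nL == 4), an rgroup with nS == 2 and VF == 4 needs nV == 2 vectors per
vector iteration, so the length IV would step by
factor * nS * VF == 4 * 2 * 4 == 32 bytes, which equals factor * nV * nL.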

> [...]

> -/* Make LOOP iterate NITERS times using masking and WHILE_ULT calls.

> -   LOOP_VINFO describes the vectorization of LOOP.  NITERS is the

> -   number of iterations of the original scalar loop that should be

> -   handled by the vector loop.  NITERS_MAYBE_ZERO and FINAL_IV are

> -   as for vect_set_loop_condition.

> +/* Make LOOP iterate NITERS times using objects like masks (and

> +   WHILE_ULT calls) or lengths.  LOOP_VINFO describes the vectorization

> +   of LOOP.  NITERS is the number of iterations of the original scalar

> +   loop that should be handled by the vector loop.  NITERS_MAYBE_ZERO

> +   and FINAL_IV are as for vect_set_loop_condition.

>  

>     Insert the branch-back condition before LOOP_COND_GSI and return the

>     final gcond.  */

>  

>  static gcond *

> -vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo,

> -				tree niters, tree final_iv,

> -				bool niters_maybe_zero,

> -				gimple_stmt_iterator loop_cond_gsi)

> +vect_set_loop_condition_partial (class loop *loop, loop_vec_info loop_vinfo,

> +				 tree niters, tree final_iv,

> +				 bool niters_maybe_zero,

> +				 gimple_stmt_iterator loop_cond_gsi)

>  {

>    gimple_seq preheader_seq = NULL;

>    gimple_seq header_seq = NULL;

>  

> +  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);

> +

>    tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);

> +  if (!vect_for_masking)

> +    {

> +      /* Obtain target supported length type as compare_type.  */

> +      scalar_int_mode len_mode = targetm.vectorize.length_mode;

> +      unsigned len_prec = GET_MODE_PRECISION (len_mode);

> +      compare_type = build_nonstandard_integer_type (len_prec, true);


Same comment as above about the choice of IV type.  We shouldn't
recalculate this multiple times.  It would be better to calculate
it upfront and store it in the loop_vinfo.

> @@ -2567,7 +2622,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,

>    if (vect_epilogues

>        && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)

>        && prolog_peeling >= 0

> -      && known_eq (vf, lowest_vf))

> +      && known_eq (vf, lowest_vf)

> +      && !LOOP_VINFO_FULLY_WITH_LENGTH_P (epilogue_vinfo))


Why's this check needed?

>      {

>        unsigned HOST_WIDE_INT eiters

>  	= (LOOP_VINFO_INT_NITERS (loop_vinfo)

> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c

> index 80e33b61be7..99e6cb904ba 100644

> --- a/gcc/tree-vect-loop.c

> +++ b/gcc/tree-vect-loop.c

> @@ -813,8 +813,10 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)

>      vec_outside_cost (0),

>      vec_inside_cost (0),

>      vectorizable (false),

> -    can_fully_mask_p (true),

> +    can_partial_vect_p (true),


I think “can_use_partial_vectors_p” reads better

>      fully_masked_p (false),

> +    fully_with_length_p (false),


I think it would be better if these two were a single flag
(using_partial_vectors_p), with masking vs. lengths being derived
information.

> +    epil_partial_vect_p (false),

>      peeling_for_gaps (false),

>      peeling_for_niter (false),

>      no_data_dependencies (false),

> @@ -880,13 +882,25 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)

>  void

>  release_vec_loop_masks (vec_loop_masks *masks)

>  {

> -  rgroup_masks *rgm;

> +  rgroup_objs *rgm;

>    unsigned int i;

>    FOR_EACH_VEC_ELT (*masks, i, rgm)

> -    rgm->masks.release ();

> +    rgm->objs.release ();

>    masks->release ();

>  }

>  

> +/* Free all levels of LENS.  */

> +

> +void

> +release_vec_loop_lens (vec_loop_lens *lens)

> +{

> +  rgroup_objs *rgl;

> +  unsigned int i;

> +  FOR_EACH_VEC_ELT (*lens, i, rgl)

> +    rgl->objs.release ();

> +  lens->release ();

> +}

> +


There's no need to duplicate this function.

The overall approach looks good though.  I just think we need to work
through the details a bit more.

Thanks,
Richard
Segher Boessenkool June 2, 2020, 5:01 p.m. | #2
On Tue, Jun 02, 2020 at 12:50:25PM +0100, Richard Sandiford wrote:
> This might not be the best time to bring this up :-) but it seems

> odd to be asking the target for the induction variable type here.

> I got the impression that the hook was returning DImode, whereas

> the PowerPC instructions only looked at the low 8 bits of the length.


It's the top(!) 8 bits of the register actually (the other 56 bits are
"do not care" bits).

> If so, forcing a naturally 32-bit IV to DImode would insert extra

> sign/zero extensions, even though the input to the length intrinsics

> would have been happy with the 32-bit IV.


It's a shift left always.  All bits beyond the 8 drop out.


Thanks for the great reviews Richard, much appreciated!


Segher
Kewen.Lin via Gcc-patches June 3, 2020, 6:33 a.m. | #3
Hi Richard,

Thanks a lot for your great comments!

on 2020/6/2 7:50 PM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:

>> Hi Richard,

>>

>> on 2020/5/29 4:32 PM, Richard Sandiford wrote:

>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:

>>>> on 2020/5/27 6:02 PM, Richard Sandiford wrote:

>>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:

>>>>>> Hi Richard,

>>>>>>


snip ...

> 

> It would be easier to review, and perhaps easier to bisect,

> if some of the mechanical changes were split out.  E.g.:

> 

> - Rename can_fully_mask_p to can_use_partial_vectors_p.

> 

> - Rename fully_masked_p to using_partial_vectors_p.

> 

> - Rename things related to rgroup_masks.  I think “rgroup_controls”

>   or “rgroup_guards” might be more descriptive than “rgroup_objs”.

> 

> These should be fairly mechanical changes and can happen ahead of

> the main series.  It'll then be easier to see what's different

> for masks and lengths, separately from the more mechanical stuff.

> 


Good suggestion.  My fault, I should have done it before. 
Will split it into some NFC patches.

> As far as:

> 

> +  union

> +  {

> +    /* The type of mask to use, based on the highest nS recorded above.  */

> +    tree mask_type;

> +    /* Any vector type to use these lengths.  */

> +    tree vec_type;

> +  };

> 

> goes, some parts of the code seem to use mask_type for lengths too,

> which I'm a bit nervous about.  I think we should either be consistent

> about which union field we use (always mask_type for masks, always

> vec_type for lengths) or we should just rename mask_type to something

> more generic.  Just "type" might be good enough with a suitable comment.


Will fix it.

> 

>>  {

>>    tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);

>>    tree iv_type = LOOP_VINFO_MASK_IV_TYPE (loop_vinfo);

>> -  tree mask_type = rgm->mask_type;

>> -  unsigned int nscalars_per_iter = rgm->max_nscalars_per_iter;

>> -  poly_uint64 nscalars_per_mask = TYPE_VECTOR_SUBPARTS (mask_type);

>> +

>> +  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);

>> +  if (!vect_for_masking)

>> +    {

>> +      /* Obtain target supported length type.  */

>> +      scalar_int_mode len_mode = targetm.vectorize.length_mode;

>> +      unsigned int len_prec = GET_MODE_PRECISION (len_mode);

>> +      compare_type = build_nonstandard_integer_type (len_prec, true);

>> +      /* Simply set iv_type as same as compare_type.  */

>> +      iv_type = compare_type;

> 

> This might not be the best time to bring this up :-) but it seems

> odd to be asking the target for the induction variable type here.

> I got the impression that the hook was returning DImode, whereas

> the PowerPC instructions only looked at the low 8 bits of the length.

> If so, forcing a naturally 32-bit IV to DImode would insert extra

> sign/zero extensions, even though the input to the length intrinsics

> would have been happy with the 32-bit IV.

> 


Good point, I'll check it with some cases.  As Segher pointed out, the
length goes in bits 0-7 (the top byte, unusual I admit), and these
vector-with-length instructions are only available in 64-bit mode.  IIUC,
with the current setting the extra sign/zero extensions would only appear
in the pre-header?  At the time I thought an IV with less precision than
the length mode would have to be converted later for the length anyway,
so it seemed simplest to just use the length mode.

> I think it would make sense to ask the target for its minimum

> precision P (which would be 8 bits if the above is correct).

> The starting point would then be the maximum of:

> 

> - this P

> - the IV's natural precision

> - the precision needed to hold:

>     the maximum number of scalar iterations multiplied by the scale factor

>     (to convert scalar counts to bytes)

> 

> If the IV might wrap at that precision without producing all-zero lengths,

> it would be worth doubling the precision to avoid the wrapping issue,

> provided that we don't go beyond BITS_PER_WORD.

> 

Thanks! Will think/test more on this part.

>> +  tree obj_type = rgo->mask_type;

>> +  /* Here, take nscalars_per_iter as nbytes_per_iter for length.  */

>> +  unsigned int nscalars_per_iter = rgo->max_nscalars_per_iter;

> 

> I think whether we count scalars or count bytes is really a separate

> decision that shouldn't be tied directly to using lengths.  Length-based

> loads and stores on other arches might want to count scalars too.

> I'm not saying you should add support for that (it wouldn't be tested),

> but I think we should avoid structuring the code to make it harder to

> add in future.

> 


It makes sense, will update it.

> So I think nscalars_per_iter should always count scalars and anything

> length-based should be separate.  Would it make sense to store the

> length scale factor as a separate field?  I.e. using the terms

> above the rgroup_masks comment, the length IV step is:

> 

>    factor * nS * VF == factor * nV * nL

> 


Yeah, factor * nS is what we want for the length-based approach (in
bytes), and factor * nL would be the vector size in bytes.

> That way, applying the factor becomes separate from lengths vs. masks.

> The factor would also be useful in calculating the IV precision above.

> 


Yeah, nice!

>> [...]

>> -/* Make LOOP iterate NITERS times using masking and WHILE_ULT calls.

>> -   LOOP_VINFO describes the vectorization of LOOP.  NITERS is the

>> -   number of iterations of the original scalar loop that should be

>> -   handled by the vector loop.  NITERS_MAYBE_ZERO and FINAL_IV are

>> -   as for vect_set_loop_condition.

>> +/* Make LOOP iterate NITERS times using objects like masks (and

>> +   WHILE_ULT calls) or lengths.  LOOP_VINFO describes the vectorization

>> +   of LOOP.  NITERS is the number of iterations of the original scalar

>> +   loop that should be handled by the vector loop.  NITERS_MAYBE_ZERO

>> +   and FINAL_IV are as for vect_set_loop_condition.

>>  

>>     Insert the branch-back condition before LOOP_COND_GSI and return the

>>     final gcond.  */

>>  

>>  static gcond *

>> -vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo,

>> -				tree niters, tree final_iv,

>> -				bool niters_maybe_zero,

>> -				gimple_stmt_iterator loop_cond_gsi)

>> +vect_set_loop_condition_partial (class loop *loop, loop_vec_info loop_vinfo,

>> +				 tree niters, tree final_iv,

>> +				 bool niters_maybe_zero,

>> +				 gimple_stmt_iterator loop_cond_gsi)

>>  {

>>    gimple_seq preheader_seq = NULL;

>>    gimple_seq header_seq = NULL;

>>  

>> +  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);

>> +

>>    tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);

>> +  if (!vect_for_masking)

>> +    {

>> +      /* Obtain target supported length type as compare_type.  */

>> +      scalar_int_mode len_mode = targetm.vectorize.length_mode;

>> +      unsigned len_prec = GET_MODE_PRECISION (len_mode);

>> +      compare_type = build_nonstandard_integer_type (len_prec, true);

> 

> Same comment as above about the choice of IV type.  We shouldn't

> recalculate this multiple times.  It would be better to calculate

> it upfront and store it in the loop_vinfo.


OK.

> 

>> @@ -2567,7 +2622,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,

>>    if (vect_epilogues

>>        && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)

>>        && prolog_peeling >= 0

>> -      && known_eq (vf, lowest_vf))

>> +      && known_eq (vf, lowest_vf)

>> +      && !LOOP_VINFO_FULLY_WITH_LENGTH_P (epilogue_vinfo))

> 

> Why's this check needed?

> 


It's mainly for the length-based epilogue handling.

       while (!(constant_multiple_p
                (GET_MODE_SIZE (loop_vinfo->vector_mode),
                 GET_MODE_SIZE (epilogue_vinfo->vector_mode), &ratio)
                && eiters >= lowest_vf / ratio + epilogue_gaps))

This "if" part checks whether the remaining eiters are enough for the
epilogue; if eiters is less than the epilogue's lowest_vf, it backs out
of vectorizing the epilogue.  But for partial vectors that should be
acceptable, since they can deal with a partial iteration.  Probably I
should use using_partial_vectors_p here instead of
LOOP_VINFO_FULLY_WITH_LENGTH_P; although masking won't have the
possibility to handle the epilogue, the concept would be the same.
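
With that rename, the condition would become something like (macro name
assumed from the suggested using_partial_vectors_p rename):

      && known_eq (vf, lowest_vf)
      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (epilogue_vinfo))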

>>      {

>>        unsigned HOST_WIDE_INT eiters

>>  	= (LOOP_VINFO_INT_NITERS (loop_vinfo)

>> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c

>> index 80e33b61be7..99e6cb904ba 100644

>> --- a/gcc/tree-vect-loop.c

>> +++ b/gcc/tree-vect-loop.c

>> @@ -813,8 +813,10 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)

>>      vec_outside_cost (0),

>>      vec_inside_cost (0),

>>      vectorizable (false),

>> -    can_fully_mask_p (true),

>> +    can_partial_vect_p (true),

> 

> I think “can_use_partial_vectors_p” reads better


Will update with it.

> 

>>      fully_masked_p (false),

>> +    fully_with_length_p (false),

> 

> I think it would be better if these two were a single flag

> (using_partial_vectors_p), with masking vs. lengths being derived

> information.

> 


Will update it.

>> +    epil_partial_vect_p (false),

>>      peeling_for_gaps (false),

>>      peeling_for_niter (false),

>>      no_data_dependencies (false),

>> @@ -880,13 +882,25 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)

>>  void

>>  release_vec_loop_masks (vec_loop_masks *masks)

>>  {

>> -  rgroup_masks *rgm;

>> +  rgroup_objs *rgm;

>>    unsigned int i;

>>    FOR_EACH_VEC_ELT (*masks, i, rgm)

>> -    rgm->masks.release ();

>> +    rgm->objs.release ();

>>    masks->release ();

>>  }

>>  

>> +/* Free all levels of LENS.  */

>> +

>> +void

>> +release_vec_loop_lens (vec_loop_lens *lens)

>> +{

>> +  rgroup_objs *rgl;

>> +  unsigned int i;

>> +  FOR_EACH_VEC_ELT (*lens, i, rgl)

>> +    rgl->objs.release ();

>> +  lens->release ();

>> +}

>> +

> 

> There's no need to duplicate this function.

> 


Good catch, will rename and merge them.

BR,
Kewen

> The overall approach looks good though.  I just think we need to work

> through the details a bit more.

> 

> Thanks,

> Richard

>

Patch

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 8b9935dfe65..ac765feab13 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -13079,6 +13079,13 @@  by the copy loop headers pass.
 @item vect-epilogues-nomask
 Enable loop epilogue vectorization using smaller vector size.
 
+@item vect-with-length-scope
+Control the scope of vector memory access with length exploitation.  0 means we
+don't exploit any vector memory access with length, 1 means we only exploit
+vector memory access with length for those loops whose iteration count is
+less than VF, such as a very small loop or an epilogue, 2 means we want to
+exploit vector memory access with length for any loop if possible.
+
 @item slp-max-insns-in-bb
 Maximum number of instructions in basic block to be
 considered for SLP vectorization.
diff --git a/gcc/params.opt b/gcc/params.opt
index 4aec480798b..d4309101067 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -964,4 +964,8 @@  Bound on number of runtime checks inserted by the vectorizer's loop versioning f
 Common Joined UInteger Var(param_vect_max_version_for_alignment_checks) Init(6) Param Optimization
 Bound on number of runtime checks inserted by the vectorizer's loop versioning for alignment check.
 
+-param=vect-with-length-scope=
+Common Joined UInteger Var(param_vect_with_length_scope) Init(0) IntegerRange(0, 2) Param Optimization
+Control the vector with length exploitation scope.
+
 ; This comment is to ensure we retain the blank line above.
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 8c5e696b995..0a5770c7d28 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -256,17 +256,17 @@  adjust_phi_and_debug_stmts (gimple *update_phi, edge e, tree new_def)
 			gimple_bb (update_phi));
 }
 
-/* Define one loop mask MASK from loop LOOP.  INIT_MASK is the value that
-   the mask should have during the first iteration and NEXT_MASK is the
+/* Define one loop mask/length OBJ from loop LOOP.  INIT_OBJ is the value that
+   the mask/length should have during the first iteration and NEXT_OBJ is the
    value that it should have on subsequent iterations.  */
 
 static void
-vect_set_loop_mask (class loop *loop, tree mask, tree init_mask,
-		    tree next_mask)
+vect_set_loop_mask_or_len (class loop *loop, tree obj, tree init_obj,
+			   tree next_obj)
 {
-  gphi *phi = create_phi_node (mask, loop->header);
-  add_phi_arg (phi, init_mask, loop_preheader_edge (loop), UNKNOWN_LOCATION);
-  add_phi_arg (phi, next_mask, loop_latch_edge (loop), UNKNOWN_LOCATION);
+  gphi *phi = create_phi_node (obj, loop->header);
+  add_phi_arg (phi, init_obj, loop_preheader_edge (loop), UNKNOWN_LOCATION);
+  add_phi_arg (phi, next_obj, loop_latch_edge (loop), UNKNOWN_LOCATION);
 }
 
 /* Add SEQ to the end of LOOP's preheader block.  */
@@ -320,8 +320,8 @@  interleave_supported_p (vec_perm_indices *indices, tree vectype,
    latter.  Return true on success, adding any new statements to SEQ.  */
 
 static bool
-vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_masks *dest_rgm,
-			       rgroup_masks *src_rgm)
+vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_objs *dest_rgm,
+			       rgroup_objs *src_rgm)
 {
   tree src_masktype = src_rgm->mask_type;
   tree dest_masktype = dest_rgm->mask_type;
@@ -338,10 +338,10 @@  vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_masks *dest_rgm,
       machine_mode dest_mode = insn_data[icode1].operand[0].mode;
       gcc_assert (dest_mode == insn_data[icode2].operand[0].mode);
       tree unpack_masktype = vect_halve_mask_nunits (src_masktype, dest_mode);
-      for (unsigned int i = 0; i < dest_rgm->masks.length (); ++i)
+      for (unsigned int i = 0; i < dest_rgm->objs.length (); ++i)
 	{
-	  tree src = src_rgm->masks[i / 2];
-	  tree dest = dest_rgm->masks[i];
+	  tree src = src_rgm->objs[i / 2];
+	  tree dest = dest_rgm->objs[i];
 	  tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
 			    ? VEC_UNPACK_HI_EXPR
 			    : VEC_UNPACK_LO_EXPR);
@@ -371,10 +371,10 @@  vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_masks *dest_rgm,
       tree masks[2];
       for (unsigned int i = 0; i < 2; ++i)
 	masks[i] = vect_gen_perm_mask_checked (src_masktype, indices[i]);
-      for (unsigned int i = 0; i < dest_rgm->masks.length (); ++i)
+      for (unsigned int i = 0; i < dest_rgm->objs.length (); ++i)
 	{
-	  tree src = src_rgm->masks[i / 2];
-	  tree dest = dest_rgm->masks[i];
+	  tree src = src_rgm->objs[i / 2];
+	  tree dest = dest_rgm->objs[i];
 	  gimple *stmt = gimple_build_assign (dest, VEC_PERM_EXPR,
 					      src, src, masks[i & 1]);
 	  gimple_seq_add_stmt (seq, stmt);
@@ -384,60 +384,80 @@  vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_masks *dest_rgm,
   return false;
 }
 
-/* Helper for vect_set_loop_condition_masked.  Generate definitions for
-   all the masks in RGM and return a mask that is nonzero when the loop
+/* Helper for vect_set_loop_condition_partial.  Generate definitions for
+   all the objs in RGO and return a obj that is nonzero when the loop
    needs to iterate.  Add any new preheader statements to PREHEADER_SEQ.
    Use LOOP_COND_GSI to insert code before the exit gcond.
 
-   RGM belongs to loop LOOP.  The loop originally iterated NITERS
+   RGO belongs to loop LOOP.  The loop originally iterated NITERS
    times and has been vectorized according to LOOP_VINFO.
 
    If NITERS_SKIP is nonnull, the first iteration of the vectorized loop
    starts with NITERS_SKIP dummy iterations of the scalar loop before
-   the real work starts.  The mask elements for these dummy iterations
+   the real work starts.  The obj elements for these dummy iterations
    must be 0, to ensure that the extra iterations do not have an effect.
 
    It is known that:
 
-     NITERS * RGM->max_nscalars_per_iter
+     NITERS * RGO->max_nscalars_per_iter
 
    does not overflow.  However, MIGHT_WRAP_P says whether an induction
    variable that starts at 0 and has step:
 
-     VF * RGM->max_nscalars_per_iter
+     VF * RGO->max_nscalars_per_iter
 
    might overflow before hitting a value above:
 
-     (NITERS + NITERS_SKIP) * RGM->max_nscalars_per_iter
+     (NITERS + NITERS_SKIP) * RGO->max_nscalars_per_iter
 
    This means that we cannot guarantee that such an induction variable
-   would ever hit a value that produces a set of all-false masks for RGM.  */
+   would ever hit a value that produces a set of all-false masks or
+   zero byte length for RGO.  */
 
 static tree
-vect_set_loop_masks_directly (class loop *loop, loop_vec_info loop_vinfo,
+vect_set_loop_objs_directly (class loop *loop, loop_vec_info loop_vinfo,
 			      gimple_seq *preheader_seq,
 			      gimple_stmt_iterator loop_cond_gsi,
-			      rgroup_masks *rgm, tree niters, tree niters_skip,
+			      rgroup_objs *rgo, tree niters, tree niters_skip,
 			      bool might_wrap_p)
 {
   tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
   tree iv_type = LOOP_VINFO_MASK_IV_TYPE (loop_vinfo);
-  tree mask_type = rgm->mask_type;
-  unsigned int nscalars_per_iter = rgm->max_nscalars_per_iter;
-  poly_uint64 nscalars_per_mask = TYPE_VECTOR_SUBPARTS (mask_type);
+
+  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
+  if (!vect_for_masking)
+    {
+      /* Obtain target supported length type.  */
+      scalar_int_mode len_mode = targetm.vectorize.length_mode;
+      unsigned int len_prec = GET_MODE_PRECISION (len_mode);
+      compare_type = build_nonstandard_integer_type (len_prec, true);
+      /* Simply set iv_type as same as compare_type.  */
+      iv_type = compare_type;
+    }
+
+  tree obj_type = rgo->mask_type;
+  /* Here, take nscalars_per_iter as nbytes_per_iter for length.  */
+  unsigned int nscalars_per_iter = rgo->max_nscalars_per_iter;
+  poly_uint64 nscalars_per_obj = TYPE_VECTOR_SUBPARTS (obj_type);
+  poly_uint64 vector_size = GET_MODE_SIZE (TYPE_MODE (obj_type));
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  tree vec_size = NULL_TREE;
+  /* For length, we probably need vec_size to check length in range.  */
+  if (!vect_for_masking)
+    vec_size = build_int_cst (compare_type, vector_size);
 
   /* Calculate the maximum number of scalar values that the rgroup
      handles in total, the number that it handles for each iteration
      of the vector loop, and the number that it should skip during the
-     first iteration of the vector loop.  */
+     first iteration of the vector loop.  For vector with length, take
+     scalar values as bytes.  */
   tree nscalars_total = niters;
   tree nscalars_step = build_int_cst (iv_type, vf);
   tree nscalars_skip = niters_skip;
   if (nscalars_per_iter != 1)
     {
-      /* We checked before choosing to use a fully-masked loop that these
-	 multiplications don't overflow.  */
+      /* We checked before choosing to use a fully-masked or fully with length
+	 loop that these multiplications don't overflow.  */
       tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
       tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
       nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
@@ -541,28 +561,28 @@  vect_set_loop_masks_directly (class loop *loop, loop_vec_info loop_vinfo,
   test_index = gimple_convert (&test_seq, compare_type, test_index);
   gsi_insert_seq_before (test_gsi, test_seq, GSI_SAME_STMT);
 
-  /* Provide a definition of each mask in the group.  */
-  tree next_mask = NULL_TREE;
-  tree mask;
+  /* Provide a definition of each obj in the group.  */
+  tree next_obj = NULL_TREE;
+  tree obj;
   unsigned int i;
-  FOR_EACH_VEC_ELT_REVERSE (rgm->masks, i, mask)
+  poly_uint64 batch_cnt = vect_for_masking ? nscalars_per_obj : vector_size;
+  FOR_EACH_VEC_ELT_REVERSE (rgo->objs, i, obj)
     {
-      /* Previous masks will cover BIAS scalars.  This mask covers the
+      /* Previous objs will cover BIAS scalars.  This obj covers the
 	 next batch.  */
-      poly_uint64 bias = nscalars_per_mask * i;
+      poly_uint64 bias = batch_cnt * i;
       tree bias_tree = build_int_cst (compare_type, bias);
-      gimple *tmp_stmt;
 
       /* See whether the first iteration of the vector loop is known
-	 to have a full mask.  */
+	 to have a full mask or length.  */
       poly_uint64 const_limit;
       bool first_iteration_full
 	= (poly_int_tree_p (first_limit, &const_limit)
-	   && known_ge (const_limit, (i + 1) * nscalars_per_mask));
+	   && known_ge (const_limit, (i + 1) * batch_cnt));
 
       /* Rather than have a new IV that starts at BIAS and goes up to
 	 TEST_LIMIT, prefer to use the same 0-based IV for each mask
-	 and adjust the bound down by BIAS.  */
+	 or length and adjust the bound down by BIAS.  */
       tree this_test_limit = test_limit;
       if (i != 0)
 	{
@@ -574,9 +594,9 @@  vect_set_loop_masks_directly (class loop *loop, loop_vec_info loop_vinfo,
 					  bias_tree);
 	}
 
-      /* Create the initial mask.  First include all scalars that
+      /* Create the initial obj.  First include all scalars that
 	 are within the loop limit.  */
-      tree init_mask = NULL_TREE;
+      tree init_obj = NULL_TREE;
       if (!first_iteration_full)
 	{
 	  tree start, end;
@@ -598,9 +618,18 @@  vect_set_loop_masks_directly (class loop *loop, loop_vec_info loop_vinfo,
 	      end = first_limit;
 	    }
 
-	  init_mask = make_temp_ssa_name (mask_type, NULL, "max_mask");
-	  tmp_stmt = vect_gen_while (init_mask, start, end);
-	  gimple_seq_add_stmt (preheader_seq, tmp_stmt);
+	  if (vect_for_masking)
+	    {
+	      init_obj = make_temp_ssa_name (obj_type, NULL, "max_mask");
+	      gimple *tmp_stmt = vect_gen_while (init_obj, start, end);
+	      gimple_seq_add_stmt (preheader_seq, tmp_stmt);
+	    }
+	  else
+	    {
+	      init_obj = make_temp_ssa_name (compare_type, NULL, "max_len");
+	      gimple_seq seq = vect_gen_len (init_obj, start, end, vec_size);
+	      gimple_seq_add_seq (preheader_seq, seq);
+	    }
 	}
 
       /* Now AND out the bits that are within the number of skipped
@@ -610,51 +639,76 @@  vect_set_loop_masks_directly (class loop *loop, loop_vec_info loop_vinfo,
 	  && !(poly_int_tree_p (nscalars_skip, &const_skip)
 	       && known_le (const_skip, bias)))
 	{
-	  tree unskipped_mask = vect_gen_while_not (preheader_seq, mask_type,
+	  tree unskipped_mask = vect_gen_while_not (preheader_seq, obj_type,
 						    bias_tree, nscalars_skip);
-	  if (init_mask)
-	    init_mask = gimple_build (preheader_seq, BIT_AND_EXPR, mask_type,
-				      init_mask, unskipped_mask);
+	  if (init_obj)
+	    init_obj = gimple_build (preheader_seq, BIT_AND_EXPR, obj_type,
+				      init_obj, unskipped_mask);
 	  else
-	    init_mask = unskipped_mask;
+	    init_obj = unskipped_mask;
+	  gcc_assert (vect_for_masking);
 	}
 
-      if (!init_mask)
-	/* First iteration is full.  */
-	init_mask = build_minus_one_cst (mask_type);
+      /* First iteration is full.  */
+      if (!init_obj)
+	{
+	  if (vect_for_masking)
+	    init_obj = build_minus_one_cst (obj_type);
+	  else
+	    init_obj = vec_size;
+	}
 
-      /* Get the mask value for the next iteration of the loop.  */
-      next_mask = make_temp_ssa_name (mask_type, NULL, "next_mask");
-      gcall *call = vect_gen_while (next_mask, test_index, this_test_limit);
-      gsi_insert_before (test_gsi, call, GSI_SAME_STMT);
+      /* Get the obj value for the next iteration of the loop.  */
+      if (vect_for_masking)
+	{
+	  next_obj = make_temp_ssa_name (obj_type, NULL, "next_mask");
+	  gcall *call = vect_gen_while (next_obj, test_index, this_test_limit);
+	  gsi_insert_before (test_gsi, call, GSI_SAME_STMT);
+	}
+      else
+	{
+	  next_obj = make_temp_ssa_name (compare_type, NULL, "next_len");
+	  tree end = this_test_limit;
+	  gimple_seq seq = vect_gen_len (next_obj, test_index, end, vec_size);
+	  gsi_insert_seq_before (test_gsi, seq, GSI_SAME_STMT);
+	}
 
-      vect_set_loop_mask (loop, mask, init_mask, next_mask);
+      vect_set_loop_mask_or_len (loop, obj, init_obj, next_obj);
     }
-  return next_mask;
+  return next_obj;
 }
 
-/* Make LOOP iterate NITERS times using masking and WHILE_ULT calls.
-   LOOP_VINFO describes the vectorization of LOOP.  NITERS is the
-   number of iterations of the original scalar loop that should be
-   handled by the vector loop.  NITERS_MAYBE_ZERO and FINAL_IV are
-   as for vect_set_loop_condition.
+/* Make LOOP iterate NITERS times using objects like masks (and
+   WHILE_ULT calls) or lengths.  LOOP_VINFO describes the vectorization
+   of LOOP.  NITERS is the number of iterations of the original scalar
+   loop that should be handled by the vector loop.  NITERS_MAYBE_ZERO
+   and FINAL_IV are as for vect_set_loop_condition.
 
    Insert the branch-back condition before LOOP_COND_GSI and return the
    final gcond.  */
 
 static gcond *
-vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo,
-				tree niters, tree final_iv,
-				bool niters_maybe_zero,
-				gimple_stmt_iterator loop_cond_gsi)
+vect_set_loop_condition_partial (class loop *loop, loop_vec_info loop_vinfo,
+				 tree niters, tree final_iv,
+				 bool niters_maybe_zero,
+				 gimple_stmt_iterator loop_cond_gsi)
 {
   gimple_seq preheader_seq = NULL;
   gimple_seq header_seq = NULL;
 
+  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
+
   tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
+  if (!vect_for_masking)
+    {
+      /* Obtain target supported length type as compare_type.  */
+      scalar_int_mode len_mode = targetm.vectorize.length_mode;
+      unsigned len_prec = GET_MODE_PRECISION (len_mode);
+      compare_type = build_nonstandard_integer_type (len_prec, true);
+    }
   unsigned int compare_precision = TYPE_PRECISION (compare_type);
-  tree orig_niters = niters;
 
+  tree orig_niters = niters;
   /* Type of the initial value of NITERS.  */
   tree ni_actual_type = TREE_TYPE (niters);
   unsigned int ni_actual_precision = TYPE_PRECISION (ni_actual_type);
@@ -677,42 +731,45 @@  vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo,
   else
     niters = gimple_convert (&preheader_seq, compare_type, niters);
 
-  widest_int iv_limit = vect_iv_limit_for_full_masking (loop_vinfo);
+  widest_int iv_limit = vect_iv_limit_for_partial_vect (loop_vinfo);
 
-  /* Iterate over all the rgroups and fill in their masks.  We could use
-     the first mask from any rgroup for the loop condition; here we
+  /* Iterate over all the rgroups and fill in their objs.  We could use
+     the first obj from any rgroup for the loop condition; here we
      arbitrarily pick the last.  */
-  tree test_mask = NULL_TREE;
-  rgroup_masks *rgm;
+  tree test_obj = NULL_TREE;
+  rgroup_objs *rgo;
   unsigned int i;
-  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
-  FOR_EACH_VEC_ELT (*masks, i, rgm)
-    if (!rgm->masks.is_empty ())
+  auto_vec<rgroup_objs> *objs = vect_for_masking
+				  ? &LOOP_VINFO_MASKS (loop_vinfo)
+				  : &LOOP_VINFO_LENS (loop_vinfo);
+
+  FOR_EACH_VEC_ELT (*objs, i, rgo)
+    if (!rgo->objs.is_empty ())
       {
 	/* First try using permutes.  This adds a single vector
 	   instruction to the loop for each mask, but needs no extra
 	   loop invariants or IVs.  */
 	unsigned int nmasks = i + 1;
-	if ((nmasks & 1) == 0)
+	if (vect_for_masking && (nmasks & 1) == 0)
 	  {
-	    rgroup_masks *half_rgm = &(*masks)[nmasks / 2 - 1];
-	    if (!half_rgm->masks.is_empty ()
-		&& vect_maybe_permute_loop_masks (&header_seq, rgm, half_rgm))
+	    rgroup_objs *half_rgo = &(*objs)[nmasks / 2 - 1];
+	    if (!half_rgo->objs.is_empty ()
+		&& vect_maybe_permute_loop_masks (&header_seq, rgo, half_rgo))
 	      continue;
 	  }
 
 	/* See whether zero-based IV would ever generate all-false masks
-	   before wrapping around.  */
+	   or zero byte length before wrapping around.  */
 	bool might_wrap_p
 	  = (iv_limit == -1
-	     || (wi::min_precision (iv_limit * rgm->max_nscalars_per_iter,
+	     || (wi::min_precision (iv_limit * rgo->max_nscalars_per_iter,
 				    UNSIGNED)
 		 > compare_precision));
 
-	/* Set up all masks for this group.  */
-	test_mask = vect_set_loop_masks_directly (loop, loop_vinfo,
+	/* Set up all masks/lengths for this group.  */
+	test_obj = vect_set_loop_objs_directly (loop, loop_vinfo,
 						  &preheader_seq,
-						  loop_cond_gsi, rgm,
+						  loop_cond_gsi, rgo,
 						  niters, niters_skip,
 						  might_wrap_p);
       }
@@ -724,8 +781,8 @@  vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo,
   /* Get a boolean result that tells us whether to iterate.  */
   edge exit_edge = single_exit (loop);
   tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? EQ_EXPR : NE_EXPR;
-  tree zero_mask = build_zero_cst (TREE_TYPE (test_mask));
-  gcond *cond_stmt = gimple_build_cond (code, test_mask, zero_mask,
+  tree zero_obj = build_zero_cst (TREE_TYPE (test_obj));
+  gcond *cond_stmt = gimple_build_cond (code, test_obj, zero_obj,
 					NULL_TREE, NULL_TREE);
   gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
 
@@ -748,13 +805,12 @@  vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo,
 }
 
 /* Like vect_set_loop_condition, but handle the case in which there
-   are no loop masks.  */
+   are no loop masks/lengths.  */
 
 static gcond *
-vect_set_loop_condition_unmasked (class loop *loop, tree niters,
-				  tree step, tree final_iv,
-				  bool niters_maybe_zero,
-				  gimple_stmt_iterator loop_cond_gsi)
+vect_set_loop_condition_normal (class loop *loop, tree niters, tree step,
+			      tree final_iv, bool niters_maybe_zero,
+			      gimple_stmt_iterator loop_cond_gsi)
 {
   tree indx_before_incr, indx_after_incr;
   gcond *cond_stmt;
@@ -912,14 +968,14 @@  vect_set_loop_condition (class loop *loop, loop_vec_info loop_vinfo,
   gcond *orig_cond = get_loop_exit_condition (loop);
   gimple_stmt_iterator loop_cond_gsi = gsi_for_stmt (orig_cond);
 
-  if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
-    cond_stmt = vect_set_loop_condition_masked (loop, loop_vinfo, niters,
-						final_iv, niters_maybe_zero,
-						loop_cond_gsi);
+  if (loop_vinfo && LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo))
+    cond_stmt
+      = vect_set_loop_condition_partial (loop, loop_vinfo, niters, final_iv,
+					 niters_maybe_zero, loop_cond_gsi);
   else
-    cond_stmt = vect_set_loop_condition_unmasked (loop, niters, step,
-						  final_iv, niters_maybe_zero,
-						  loop_cond_gsi);
+    cond_stmt
+      = vect_set_loop_condition_normal (loop, niters, step, final_iv,
+					niters_maybe_zero, loop_cond_gsi);
 
   /* Remove old loop exit test.  */
   stmt_vec_info orig_cond_info;
@@ -1938,8 +1994,7 @@  vect_gen_vector_loop_niters (loop_vec_info loop_vinfo, tree niters,
     ni_minus_gap = niters;
 
   unsigned HOST_WIDE_INT const_vf;
-  if (vf.is_constant (&const_vf)
-      && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  if (vf.is_constant (&const_vf) && !LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo))
     {
       /* Create: niters >> log2(vf) */
       /* If it's known that niters == number of latch executions + 1 doesn't
@@ -2471,7 +2526,7 @@  vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   poly_uint64 bound_epilog = 0;
-  if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+  if (!LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo)
       && LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
     bound_epilog += vf - 1;
   if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
@@ -2567,7 +2622,8 @@  vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   if (vect_epilogues
       && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
       && prolog_peeling >= 0
-      && known_eq (vf, lowest_vf))
+      && known_eq (vf, lowest_vf)
+      && !LOOP_VINFO_FULLY_WITH_LENGTH_P (epilogue_vinfo))
     {
       unsigned HOST_WIDE_INT eiters
 	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 80e33b61be7..99e6cb904ba 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -813,8 +813,10 @@  _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     vec_outside_cost (0),
     vec_inside_cost (0),
     vectorizable (false),
-    can_fully_mask_p (true),
+    can_partial_vect_p (true),
     fully_masked_p (false),
+    fully_with_length_p (false),
+    epil_partial_vect_p (false),
     peeling_for_gaps (false),
     peeling_for_niter (false),
     no_data_dependencies (false),
@@ -880,13 +882,25 @@  _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
 void
 release_vec_loop_masks (vec_loop_masks *masks)
 {
-  rgroup_masks *rgm;
+  rgroup_objs *rgm;
   unsigned int i;
   FOR_EACH_VEC_ELT (*masks, i, rgm)
-    rgm->masks.release ();
+    rgm->objs.release ();
   masks->release ();
 }
 
+/* Free all levels of LENS.  */
+
+void
+release_vec_loop_lens (vec_loop_lens *lens)
+{
+  rgroup_objs *rgl;
+  unsigned int i;
+  FOR_EACH_VEC_ELT (*lens, i, rgl)
+    rgl->objs.release ();
+  lens->release ();
+}
+
 /* Free all memory used by the _loop_vec_info, as well as all the
    stmt_vec_info structs of all the stmts in the loop.  */
 
@@ -895,6 +909,7 @@  _loop_vec_info::~_loop_vec_info ()
   free (bbs);
 
   release_vec_loop_masks (&masks);
+  release_vec_loop_lens (&lens);
   delete ivexpr_map;
   delete scan_map;
   epilogue_vinfos.release ();
@@ -935,7 +950,7 @@  cse_and_gimplify_to_preheader (loop_vec_info loop_vinfo, tree expr)
 static bool
 can_produce_all_loop_masks_p (loop_vec_info loop_vinfo, tree cmp_type)
 {
-  rgroup_masks *rgm;
+  rgroup_objs *rgm;
   unsigned int i;
   FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm)
     if (rgm->mask_type != NULL_TREE
@@ -954,12 +969,40 @@  vect_get_max_nscalars_per_iter (loop_vec_info loop_vinfo)
 {
   unsigned int res = 1;
   unsigned int i;
-  rgroup_masks *rgm;
+  rgroup_objs *rgm;
   FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm)
     res = MAX (res, rgm->max_nscalars_per_iter);
   return res;
 }
 
+/* Calculate the minimum number of bits necessary to represent the maximum
+   iteration count of the loop with loop_vec_info LOOP_VINFO, scaled by a
+   given factor FACTOR.  */
+
+static unsigned
+min_prec_for_max_niters (loop_vec_info loop_vinfo, unsigned int factor)
+{
+  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+
+  /* Get the maximum number of iterations that is representable
+     in the counter type.  */
+  tree ni_type = TREE_TYPE (LOOP_VINFO_NITERSM1 (loop_vinfo));
+  widest_int max_ni = wi::to_widest (TYPE_MAX_VALUE (ni_type)) + 1;
+
+  /* Get a more refined estimate for the number of iterations.  */
+  widest_int max_back_edges;
+  if (max_loop_iterations (loop, &max_back_edges))
+    max_ni = wi::smin (max_ni, max_back_edges + 1);
+
+  /* Account for FACTOR, e.g. each mask bit being replicated FACTOR times.  */
+  max_ni *= factor;
+
+  /* Work out how many bits we need to represent the limit.  */
+  unsigned int min_ni_width = wi::min_precision (max_ni, UNSIGNED);
+
+  return min_ni_width;
+}
+
 /* Each statement in LOOP_VINFO can be masked where necessary.  Check
    whether we can actually generate the masks required.  Return true if so,
    storing the type of the scalar IV in LOOP_VINFO_MASK_COMPARE_TYPE.  */
@@ -967,7 +1010,6 @@  vect_get_max_nscalars_per_iter (loop_vec_info loop_vinfo)
 static bool
 vect_verify_full_masking (loop_vec_info loop_vinfo)
 {
-  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   unsigned int min_ni_width;
   unsigned int max_nscalars_per_iter
     = vect_get_max_nscalars_per_iter (loop_vinfo);
@@ -978,27 +1020,14 @@  vect_verify_full_masking (loop_vec_info loop_vinfo)
   if (LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
     return false;
 
-  /* Get the maximum number of iterations that is representable
-     in the counter type.  */
-  tree ni_type = TREE_TYPE (LOOP_VINFO_NITERSM1 (loop_vinfo));
-  widest_int max_ni = wi::to_widest (TYPE_MAX_VALUE (ni_type)) + 1;
-
-  /* Get a more refined estimate for the number of iterations.  */
-  widest_int max_back_edges;
-  if (max_loop_iterations (loop, &max_back_edges))
-    max_ni = wi::smin (max_ni, max_back_edges + 1);
-
-  /* Account for rgroup masks, in which each bit is replicated N times.  */
-  max_ni *= max_nscalars_per_iter;
-
   /* Work out how many bits we need to represent the limit.  */
-  min_ni_width = wi::min_precision (max_ni, UNSIGNED);
+  min_ni_width = min_prec_for_max_niters (loop_vinfo, max_nscalars_per_iter);
 
   /* Find a scalar mode for which WHILE_ULT is supported.  */
   opt_scalar_int_mode cmp_mode_iter;
   tree cmp_type = NULL_TREE;
   tree iv_type = NULL_TREE;
-  widest_int iv_limit = vect_iv_limit_for_full_masking (loop_vinfo);
+  widest_int iv_limit = vect_iv_limit_for_partial_vect (loop_vinfo);
   unsigned int iv_precision = UINT_MAX;
 
   if (iv_limit != -1)
@@ -1056,6 +1085,33 @@  vect_verify_full_masking (loop_vec_info loop_vinfo)
   return true;
 }
 
+/* Check whether we can use vector access with length based on precision
+   comparison.  So far, to keep it simple, we only allow the case in which
+   the precision of the target-supported length is no less than the
+   precision required by the loop niters.  */
+
+static bool
+vect_verify_loop_lens (loop_vec_info loop_vinfo)
+{
+  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+
+  if (LOOP_VINFO_LENS (loop_vinfo).is_empty ())
+    return false;
+
+  /* The entry with the largest nV should have the maximum bytes per iter.  */
+  rgroup_objs *rgl = &(*lens)[lens->length () - 1];
+
+  /* Work out how many bits we need to represent the limit.  */
+  unsigned int min_ni_width
+    = min_prec_for_max_niters (loop_vinfo, rgl->nbytes_per_iter);
+
+  unsigned len_bits = GET_MODE_PRECISION (targetm.vectorize.length_mode);
+  if (len_bits < min_ni_width)
+    return false;
+
+  return true;
+}
+
 /* Calculate the cost of one scalar iteration of the loop.  */
 static void
 vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
@@ -1628,9 +1684,9 @@  vect_analyze_loop_costing (loop_vec_info loop_vinfo)
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   unsigned int assumed_vf = vect_vf_for_cost (loop_vinfo);
 
-  /* Only fully-masked loops can have iteration counts less than the
-     vectorization factor.  */
-  if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  /* Only loops vectorized fully with masks or with lengths can have
+     iteration counts less than the vectorization factor.  */
+  if (!LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo))
     {
       if (known_niters_smaller_than_vf (loop_vinfo))
 	{
@@ -1858,7 +1914,7 @@  determine_peel_for_niter (loop_vec_info loop_vinfo)
     th = LOOP_VINFO_COST_MODEL_THRESHOLD (LOOP_VINFO_ORIG_LOOP_INFO
 					  (loop_vinfo));
 
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  if (LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo))
     /* The main loop handles all iterations.  */
     LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
   else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
@@ -2047,7 +2103,7 @@  vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
       vect_optimize_slp (loop_vinfo);
     }
 
-  bool saved_can_fully_mask_p = LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo);
+  bool saved_can_partial_vect_p = LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo);
 
   /* We don't expect to have to roll back to anything other than an empty
      set of rgroups.  */
@@ -2129,10 +2185,24 @@  start_over:
       return ok;
     }
 
+  /* For now we don't expect to mix the masking and length approaches in one
+     loop, so disable partial vectorization if both have been recorded.  */
+  if (LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo)
+      && !LOOP_VINFO_MASKS (loop_vinfo).is_empty ()
+      && !LOOP_VINFO_LENS (loop_vinfo).is_empty ())
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use a partially vectorized loop because we"
+			 " don't expect to mix partial vectorization"
+			 " approaches for the same loop.\n");
+      LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
+    }
+
   /* Decide whether to use a fully-masked loop for this vectorization
      factor.  */
   LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
-    = (LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo)
+    = (LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo)
        && vect_verify_full_masking (loop_vinfo));
   if (dump_enabled_p ())
     {
@@ -2144,6 +2214,50 @@  start_over:
 			 "not using a fully-masked loop.\n");
     }
 
+  /* Decide whether to use vector access with length.  */
+  LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+    = (LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo)
+       && vect_verify_loop_lens (loop_vinfo));
+
+  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+      && (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+	  || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use vector access with length because peeling"
+			 " for alignment or gaps is required.\n");
+      LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = false;
+    }
+
+  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      if (param_vect_with_length_scope == 0)
+	LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = false;
+      /* The epilogue and the cases whose niters are known to be smaller
+	 than VF can still use vector access with length fully.  */
+      else if (param_vect_with_length_scope == 1
+	       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+	       && !known_niters_smaller_than_vf (loop_vinfo))
+	{
+	  LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = false;
+	  LOOP_VINFO_EPIL_PARTIAL_VECT_P (loop_vinfo) = true;
+	}
+    }
+  else
+    /* Always set it to false in case a previous try set it.  */
+    LOOP_VINFO_EPIL_PARTIAL_VECT_P (loop_vinfo) = false;
+
+  if (dump_enabled_p ())
+    {
+      if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+	dump_printf_loc (MSG_NOTE, vect_location, "using vector access with"
+						  " length for the loop fully.\n");
+      else
+	dump_printf_loc (MSG_NOTE, vect_location, "not using vector access with"
+						  " length for the loop fully.\n");
+    }
+
   /* If epilog loop is required because of data accesses with gaps,
      one additional iteration needs to be peeled.  Check if there is
      enough iterations for vectorization.  */
@@ -2163,7 +2277,7 @@  start_over:
   /* If we're vectorizing an epilogue loop, we either need a fully-masked
      loop or a loop that has a lower VF than the main loop.  */
   if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
-      && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+      && !LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo)
       && maybe_ge (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
 		   LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo)))
     return opt_result::failure_at (vect_location,
@@ -2362,12 +2476,13 @@  again:
     = init_cost (LOOP_VINFO_LOOP (loop_vinfo));
   /* Reset accumulated rgroup information.  */
   release_vec_loop_masks (&LOOP_VINFO_MASKS (loop_vinfo));
+  release_vec_loop_lens (&LOOP_VINFO_LENS (loop_vinfo));
   /* Reset assorted flags.  */
   LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
   LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) = false;
   LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = 0;
   LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo) = 0;
-  LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = saved_can_fully_mask_p;
+  LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = saved_can_partial_vect_p;
 
   goto start_over;
 }
@@ -2646,8 +2761,11 @@  vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 	      if (ordered_p (lowest_th, th))
 		lowest_th = ordered_min (lowest_th, th);
 	    }
-	  else
-	    delete loop_vinfo;
+	  else
+	    {
+	      delete loop_vinfo;
+	      loop_vinfo = opt_loop_vec_info::success (NULL);
+	    }
 
 	  /* Only vectorize epilogues if PARAM_VECT_EPILOGUES_NOMASK is
 	     enabled, SIMDUID is not set, it is the innermost loop and we have
@@ -2672,6 +2789,7 @@  vect_analyze_loop (class loop *loop, vec_info_shared *shared)
       else
 	{
 	  delete loop_vinfo;
+	  loop_vinfo = opt_loop_vec_info::success (NULL);
 	  if (fatal)
 	    {
 	      gcc_checking_assert (first_loop_vinfo == NULL);
@@ -2679,6 +2797,22 @@  vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 	    }
 	}
 
+      /* Handle the case in which the original loop can use partial
+	 vectorization, but we only want to adopt it for the epilogue.
+	 The retry should use the same vector mode as the original loop.  */
+      if (vect_epilogues && loop_vinfo
+	  && LOOP_VINFO_EPIL_PARTIAL_VECT_P (loop_vinfo))
+	{
+	  gcc_assert (LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo)
+		      && !LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo));
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "***** Re-trying analysis with same vector mode"
+			     " %s for epilogue with partial vectorization.\n",
+			     GET_MODE_NAME (loop_vinfo->vector_mode));
+	  continue;
+	}
+
       if (mode_i < vector_modes.length ()
 	  && VECTOR_MODE_P (autodetected_vector_mode)
 	  && (related_vector_mode (vector_modes[mode_i],
@@ -3493,7 +3627,7 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 
       /* Calculate how many masks we need to generate.  */
       unsigned int num_masks = 0;
-      rgroup_masks *rgm;
+      rgroup_objs *rgm;
       unsigned int num_vectors_m1;
       FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
 	if (rgm->mask_type)
@@ -3519,6 +3653,11 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 			    target_cost_data, num_masks - 1, vector_stmt,
 			    NULL, NULL_TREE, 0, vect_body);
     }
+  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      peel_iters_prologue = 0;
+      peel_iters_epilogue = 0;
+    }
   else if (npeel < 0)
     {
       peel_iters_prologue = assumed_vf / 2;
@@ -3808,7 +3947,7 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 		 "  Calculated minimum iters for profitability: %d\n",
 		 min_profitable_iters);
 
-  if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+  if (!LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo)
       && min_profitable_iters < (assumed_vf + peel_iters_prologue))
     /* We want the vectorized loop to execute at least once.  */
     min_profitable_iters = assumed_vf + peel_iters_prologue;
@@ -6761,6 +6900,7 @@  vectorizable_reduction (loop_vec_info loop_vinfo,
     dump_printf_loc (MSG_NOTE, vect_location,
 		     "using an in-order (fold-left) reduction.\n");
   STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
+
   /* All but single defuse-cycle optimized, lane-reducing and fold-left
      reductions go through their own vectorizable_* routines.  */
   if (!single_defuse_cycle
@@ -6779,7 +6919,7 @@  vectorizable_reduction (loop_vec_info loop_vinfo,
       STMT_VINFO_DEF_TYPE (vect_orig_stmt (tem)) = vect_internal_def;
       STMT_VINFO_DEF_TYPE (tem) = vect_internal_def;
     }
-  else if (loop_vinfo && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
+  else if (loop_vinfo && LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo))
     {
       vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
       internal_fn cond_fn = get_conditional_internal_fn (code);
@@ -6792,9 +6932,9 @@  vectorizable_reduction (loop_vec_info loop_vinfo,
 	{
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			     "can't use a fully-masked loop because no"
+			     "can't use a partially vectorized loop because no"
 			     " conditional operation is available.\n");
-	  LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	  LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
 	}
       else if (reduction_type == FOLD_LEFT_REDUCTION
 	       && reduc_fn == IFN_LAST
@@ -6804,9 +6944,9 @@  vectorizable_reduction (loop_vec_info loop_vinfo,
 	{
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			     "can't use a fully-masked loop because no"
+			     "can't use a partially vectorized loop because no"
 			     " conditional operation is available.\n");
-	  LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	  LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
 	}
       else
 	vect_record_loop_mask (loop_vinfo, masks, ncopies * vec_num,
@@ -8005,33 +8145,33 @@  vectorizable_live_operation (loop_vec_info loop_vinfo,
   if (!vec_stmt_p)
     {
       /* No transformation required.  */
-      if (LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
+      if (LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo))
 	{
 	  if (!direct_internal_fn_supported_p (IFN_EXTRACT_LAST, vectype,
 					       OPTIMIZE_FOR_SPEED))
 	    {
 	      if (dump_enabled_p ())
 		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-				 "can't use a fully-masked loop because "
+				 "can't use a partially vectorized loop because "
 				 "the target doesn't support extract last "
 				 "reduction.\n");
-	      LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	      LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
 	    }
 	  else if (slp_node)
 	    {
 	      if (dump_enabled_p ())
 		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-				 "can't use a fully-masked loop because an "
-				 "SLP statement is live after the loop.\n");
-	      LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+				 "can't use a partially vectorized loop because "
+				 "an SLP statement is live after the loop.\n");
+	      LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
 	    }
 	  else if (ncopies > 1)
 	    {
 	      if (dump_enabled_p ())
 		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-				 "can't use a fully-masked loop because"
+				 "can't use a partially vectorized loop because"
 				 " ncopies is greater than 1.\n");
-	      LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	      LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
 	    }
 	  else
 	    {
@@ -8041,6 +8181,7 @@  vectorizable_live_operation (loop_vec_info loop_vinfo,
 				     1, vectype, NULL);
 	    }
 	}
+
       return true;
     }
 
@@ -8285,7 +8426,7 @@  vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
   gcc_assert (nvectors != 0);
   if (masks->length () < nvectors)
     masks->safe_grow_cleared (nvectors);
-  rgroup_masks *rgm = &(*masks)[nvectors - 1];
+  rgroup_objs *rgm = &(*masks)[nvectors - 1];
   /* The number of scalars per iteration and the number of vectors are
      both compile-time constants.  */
   unsigned int nscalars_per_iter
@@ -8316,24 +8457,24 @@  tree
 vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
 		    unsigned int nvectors, tree vectype, unsigned int index)
 {
-  rgroup_masks *rgm = &(*masks)[nvectors - 1];
+  rgroup_objs *rgm = &(*masks)[nvectors - 1];
   tree mask_type = rgm->mask_type;
 
   /* Populate the rgroup's mask array, if this is the first time we've
      used it.  */
-  if (rgm->masks.is_empty ())
+  if (rgm->objs.is_empty ())
     {
-      rgm->masks.safe_grow_cleared (nvectors);
+      rgm->objs.safe_grow_cleared (nvectors);
       for (unsigned int i = 0; i < nvectors; ++i)
 	{
 	  tree mask = make_temp_ssa_name (mask_type, NULL, "loop_mask");
 	  /* Provide a dummy definition until the real one is available.  */
 	  SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
-	  rgm->masks[i] = mask;
+	  rgm->objs[i] = mask;
 	}
     }
 
-  tree mask = rgm->masks[index];
+  tree mask = rgm->objs[index];
   if (maybe_ne (TYPE_VECTOR_SUBPARTS (mask_type),
 		TYPE_VECTOR_SUBPARTS (vectype)))
     {
@@ -8354,6 +8495,66 @@  vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
   return mask;
 }
 
+/* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS
+   lengths for vector access with length, each of which controls a vector
+   of type VECTYPE.  */
+
+void
+vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
+		       unsigned int nvectors, tree vectype)
+{
+  gcc_assert (nvectors != 0);
+  if (lens->length () < nvectors)
+    lens->safe_grow_cleared (nvectors);
+  rgroup_objs *rgl = &(*lens)[nvectors - 1];
+
+  /* The number of scalars per iteration, their total size in bytes and the
+     number of vectors are all compile-time constants.  */
+  poly_uint64 vector_size = GET_MODE_SIZE (TYPE_MODE (vectype));
+  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  unsigned int nbytes_per_iter
+    = exact_div (nvectors * vector_size, vf).to_constant ();
+
+  /* Entries associated with the same number of vectors should have the
+     same number of bytes per iteration.  */
+  if (!rgl->vec_type)
+    {
+      rgl->vec_type = vectype;
+      rgl->nbytes_per_iter = nbytes_per_iter;
+    }
+  else
+    gcc_assert (rgl->nbytes_per_iter == nbytes_per_iter);
+}
+
+/* Given a complete set of lengths LENS, extract length number INDEX for an
+   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.  */
+
+tree
+vect_get_loop_len (vec_loop_lens *lens, unsigned int nvectors, unsigned int index)
+{
+  rgroup_objs *rgl = &(*lens)[nvectors - 1];
+
+  /* Populate the rgroup's len array, if this is the first time we've
+     used it.  */
+  if (rgl->objs.is_empty ())
+    {
+      rgl->objs.safe_grow_cleared (nvectors);
+      for (unsigned int i = 0; i < nvectors; ++i)
+	{
+	  scalar_int_mode len_mode = targetm.vectorize.length_mode;
+	  unsigned int len_prec = GET_MODE_PRECISION (len_mode);
+	  tree len_type = build_nonstandard_integer_type (len_prec, true);
+	  tree len = make_temp_ssa_name (len_type, NULL, "loop_len");
+
+	  /* Provide a dummy definition until the real one is available.  */
+	  SSA_NAME_DEF_STMT (len) = gimple_build_nop ();
+	  rgl->objs[i] = len;
+	}
+    }
+
+  return rgl->objs[index];
+}
+
 /* Scale profiling counters by estimation for LOOP which is vectorized
    by factor VF.  */
 
@@ -8713,7 +8914,7 @@  vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   if (niters_vector == NULL_TREE)
     {
       if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
-	  && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+	  && !LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo)
 	  && known_eq (lowest_vf, vf))
 	{
 	  niters_vector
@@ -8881,7 +9082,7 @@  vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
 
   /* True if the final iteration might not handle a full vector's
      worth of scalar iterations.  */
-  bool final_iter_may_be_partial = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
+  bool final_iter_may_be_partial = LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo);
   /* The minimum number of iterations performed by the epilogue.  This
      is 1 when peeling for gaps because we always need a final scalar
      iteration.  */
@@ -9184,12 +9385,14 @@  optimize_mask_stores (class loop *loop)
 }
 
 /* Decide whether it is possible to use a zero-based induction variable
-   when vectorizing LOOP_VINFO with a fully-masked loop.  If it is,
-   return the value that the induction variable must be able to hold
-   in order to ensure that the loop ends with an all-false mask.
+   when vectorizing LOOP_VINFO with a fully-masked or fully-with-length
+   loop.  If it is, return the value that the induction variable must
+   be able to hold in order to ensure that the loop ends with an
+   all-false mask or a zero length.
    Return -1 otherwise.  */
+
 widest_int
-vect_iv_limit_for_full_masking (loop_vec_info loop_vinfo)
+vect_iv_limit_for_partial_vect (loop_vec_info loop_vinfo)
 {
   tree niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index e7822c44951..1bd2d2bd581 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -1771,9 +1771,9 @@  static tree permute_vec_elements (vec_info *, tree, tree, tree, stmt_vec_info,
 				  gimple_stmt_iterator *);
 
 /* Check whether a load or store statement in the loop described by
-   LOOP_VINFO is possible in a fully-masked loop.  This is testing
-   whether the vectorizer pass has the appropriate support, as well as
-   whether the target does.
+   LOOP_VINFO is possible in a fully-masked or fully-with-length loop.
+   This is testing whether the vectorizer pass has the appropriate support,
+   as well as whether the target does.
 
    VLS_TYPE says whether the statement is a load or store and VECTYPE
    is the type of the vector being loaded or stored.  MEMORY_ACCESS_TYPE
@@ -1783,14 +1783,14 @@  static tree permute_vec_elements (vec_info *, tree, tree, tree, stmt_vec_info,
    its arguments.  If the load or store is conditional, SCALAR_MASK is the
    condition under which it occurs.
 
-   Clear LOOP_VINFO_CAN_FULLY_MASK_P if a fully-masked loop is not
-   supported, otherwise record the required mask types.  */
+   Clear LOOP_VINFO_CAN_PARTIAL_VECT_P if a fully-masked or fully-with-length
+   loop is not supported, otherwise record the required masks or lengths.  */
 
 static void
-check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
-			  vec_load_store_type vls_type, int group_size,
-			  vect_memory_access_type memory_access_type,
-			  gather_scatter_info *gs_info, tree scalar_mask)
+check_load_store_partial_vect (loop_vec_info loop_vinfo, tree vectype,
+			       vec_load_store_type vls_type, int group_size,
+			       vect_memory_access_type memory_access_type,
+			       gather_scatter_info *gs_info, tree scalar_mask)
 {
   /* Invariant loads need no special support.  */
   if (memory_access_type == VMAT_INVARIANT)
@@ -1807,10 +1807,10 @@  check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
 	{
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			     "can't use a fully-masked loop because the"
-			     " target doesn't have an appropriate masked"
+			     "can't use a partially vectorized loop because"
+			     " the target doesn't have an appropriate"
 			     " load/store-lanes instruction.\n");
-	  LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	  LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
 	  return;
 	}
       unsigned int ncopies = vect_get_num_copies (loop_vinfo, vectype);
@@ -1830,10 +1830,10 @@  check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
 	{
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			     "can't use a fully-masked loop because the"
-			     " target doesn't have an appropriate masked"
+			     "can't use a partially vectorized loop because"
+			     " the target doesn't have an appropriate"
 			     " gather load or scatter store instruction.\n");
-	  LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	  LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
 	  return;
 	}
       unsigned int ncopies = vect_get_num_copies (loop_vinfo, vectype);
@@ -1848,35 +1848,61 @@  check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
 	 scalar loop.  We need more work to support other mappings.  */
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "can't use a fully-masked loop because an access"
-			 " isn't contiguous.\n");
-      LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+			 "can't use a partially vectorized loop because an"
+			 " access isn't contiguous.\n");
+      LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
       return;
     }
 
-  machine_mode mask_mode;
-  if (!VECTOR_MODE_P (vecmode)
-      || !targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
-      || !can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
+  if (!VECTOR_MODE_P (vecmode))
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "can't use a fully-masked loop because the target"
-			 " doesn't have the appropriate masked load or"
-			 " store.\n");
-      LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+			 "can't use a partially vectorized loop because of"
+			 " the unexpected mode.\n");
+      LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
       return;
     }
-  /* We might load more scalars than we need for permuting SLP loads.
-     We checked in get_group_load_store_type that the extra elements
-     don't leak into a new vector.  */
+
   poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   unsigned int nvectors;
-  if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
-    vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype, scalar_mask);
-  else
-    gcc_unreachable ();
+  machine_mode mask_mode;
+  bool partial_vectorized_p = false;
+  if (targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
+      && can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
+    {
+      /* We might load more scalars than we need for permuting SLP loads.
+	 We checked in get_group_load_store_type that the extra elements
+	 don't leak into a new vector.  */
+      if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
+	vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype,
+			       scalar_mask);
+      else
+	gcc_unreachable ();
+      partial_vectorized_p = true;
+    }
+
+  optab op = is_load ? lenload_optab : lenstore_optab;
+  if (convert_optab_handler (op, vecmode, targetm.vectorize.length_mode))
+    {
+      vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+      if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
+	vect_record_loop_len (loop_vinfo, lens, nvectors, vectype);
+      else
+	gcc_unreachable ();
+      partial_vectorized_p = true;
+    }
+
+  if (!partial_vectorized_p)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use a partially vectorized loop because the"
+			 " target doesn't have the appropriate partially"
+			 " vectorized load or store.\n");
+      LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
+    }
 }
 
 /* Return the mask input to a masked load or store.  VEC_MASK is the vectorized
@@ -6187,7 +6213,7 @@  vectorizable_operation (vec_info *vinfo,
 	 should only change the active lanes of the reduction chain,
 	 keeping the inactive lanes as-is.  */
       if (loop_vinfo
-	  && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo)
+	  && LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo)
 	  && reduc_idx >= 0)
 	{
 	  if (cond_fn == IFN_LAST
@@ -6198,7 +6224,7 @@  vectorizable_operation (vec_info *vinfo,
 		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
 				 "can't use a fully-masked loop because no"
 				 " conditional operation is available.\n");
-	      LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	      LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
 	    }
 	  else
 	    vect_record_loop_mask (loop_vinfo, masks, ncopies * vec_num,
@@ -7527,10 +7553,10 @@  vectorizable_store (vec_info *vinfo,
     {
       STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) = memory_access_type;
 
-      if (loop_vinfo
-	  && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
-	check_load_store_masking (loop_vinfo, vectype, vls_type, group_size,
-				  memory_access_type, &gs_info, mask);
+      if (loop_vinfo && LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo))
+	check_load_store_partial_vect (loop_vinfo, vectype, vls_type,
+				       group_size, memory_access_type, &gs_info,
+				       mask);
 
       if (slp_node
 	  && !vect_maybe_update_slp_op_vectype (SLP_TREE_CHILDREN (slp_node)[0],
@@ -8068,6 +8094,15 @@  vectorizable_store (vec_info *vinfo,
     = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
        ? &LOOP_VINFO_MASKS (loop_vinfo)
        : NULL);
+
+  vec_loop_lens *loop_lens
+    = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+	 ? &LOOP_VINFO_LENS (loop_vinfo)
+	 : NULL);
+
+  /* Shouldn't use the length-based approach if fully masked.  */
+  gcc_assert (!loop_lens || !loop_masks);
+
   /* Targets with store-lane instructions must not require explicit
      realignment.  vect_supportable_dr_alignment always returns either
      dr_aligned or dr_unaligned_supported for masked operations.  */
@@ -8320,10 +8355,15 @@  vectorizable_store (vec_info *vinfo,
 	      unsigned HOST_WIDE_INT align;
 
 	      tree final_mask = NULL_TREE;
+	      tree final_len = NULL_TREE;
 	      if (loop_masks)
 		final_mask = vect_get_loop_mask (gsi, loop_masks,
 						 vec_num * ncopies,
 						 vectype, vec_num * j + i);
+	      else if (loop_lens)
+		final_len = vect_get_loop_len (loop_lens, vec_num * ncopies,
+					       vec_num * j + i);
+
 	      if (vec_mask)
 		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
 						      vec_mask, gsi);
@@ -8403,6 +8443,17 @@  vectorizable_store (vec_info *vinfo,
 		  new_stmt_info
 		    = vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
 		}
+	      else if (final_len)
+		{
+		  align = least_bit_hwi (misalign | align);
+		  tree ptr = build_int_cst (ref_type, align);
+		  gcall *call
+		    = gimple_build_call_internal (IFN_LEN_STORE, 4, dataref_ptr,
+						  ptr, final_len, vec_oprnd);
+		  gimple_call_set_nothrow (call, true);
+		  new_stmt_info
+		    = vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
+		}
 	      else
 		{
 		  data_ref = fold_build2 (MEM_REF, vectype,
@@ -8834,10 +8885,10 @@  vectorizable_load (vec_info *vinfo,
       if (!slp)
 	STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) = memory_access_type;
 
-      if (loop_vinfo
-	  && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
-	check_load_store_masking (loop_vinfo, vectype, VLS_LOAD, group_size,
-				  memory_access_type, &gs_info, mask);
+      if (loop_vinfo && LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo))
+	check_load_store_partial_vect (loop_vinfo, vectype, VLS_LOAD,
+				       group_size, memory_access_type, &gs_info,
+				       mask);
 
       STMT_VINFO_TYPE (stmt_info) = load_vec_info_type;
       vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type,
@@ -8937,6 +8988,7 @@  vectorizable_load (vec_info *vinfo,
 
       gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
       gcc_assert (!nested_in_vect_loop);
+      gcc_assert (!LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
 
       if (grouped_load)
 	{
@@ -9234,6 +9286,15 @@  vectorizable_load (vec_info *vinfo,
     = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
        ? &LOOP_VINFO_MASKS (loop_vinfo)
        : NULL);
+
+  vec_loop_lens *loop_lens
+    = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+	 ? &LOOP_VINFO_LENS (loop_vinfo)
+	 : NULL);
+
+  /* Shouldn't use the length-based approach if fully masked.  */
+  gcc_assert (!loop_lens || !loop_masks);
+
   /* Targets with store-lane instructions must not require explicit
      realignment.  vect_supportable_dr_alignment always returns either
      dr_aligned or dr_unaligned_supported for masked operations.  */
@@ -9555,15 +9616,20 @@  vectorizable_load (vec_info *vinfo,
 	  for (i = 0; i < vec_num; i++)
 	    {
 	      tree final_mask = NULL_TREE;
+	      tree final_len = NULL_TREE;
 	      if (loop_masks
 		  && memory_access_type != VMAT_INVARIANT)
 		final_mask = vect_get_loop_mask (gsi, loop_masks,
 						 vec_num * ncopies,
 						 vectype, vec_num * j + i);
+	      else if (loop_lens && memory_access_type != VMAT_INVARIANT)
+		final_len = vect_get_loop_len (loop_lens, vec_num * ncopies,
+					       vec_num * j + i);
 	      if (vec_mask)
 		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
 						      vec_mask, gsi);
 
+
 	      if (i > 0)
 		dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
 					       gsi, stmt_info, bump);
@@ -9629,6 +9695,18 @@  vectorizable_load (vec_info *vinfo,
 			new_stmt = call;
 			data_ref = NULL_TREE;
 		      }
+		    else if (final_len)
+		      {
+			align = least_bit_hwi (misalign | align);
+			tree ptr = build_int_cst (ref_type, align);
+			gcall *call
+			  = gimple_build_call_internal (IFN_LEN_LOAD, 3,
+							dataref_ptr, ptr,
+							final_len);
+			gimple_call_set_nothrow (call, true);
+			new_stmt = call;
+			data_ref = NULL_TREE;
+		      }
 		    else
 		      {
 			tree ltype = vectype;
@@ -10279,11 +10357,16 @@  vectorizable_condition (vec_info *vinfo,
 	  return false;
 	}
 
-      if (loop_vinfo
-	  && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo)
-	  && reduction_type == EXTRACT_LAST_REDUCTION)
-	vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
-			       ncopies * vec_num, vectype, NULL);
+      /* For reductions, only EXTRACT_LAST_REDUCTION is supported so far.  */
+      if (loop_vinfo && for_reduction
+	  && LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo))
+	{
+	  if (reduction_type == EXTRACT_LAST_REDUCTION)
+	    vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
+				   ncopies * vec_num, vectype, NULL);
+	  else
+	    LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
+	}
 
       STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
       vect_model_simple_cost (vinfo, stmt_info, ncopies, dts, ndts, slp_node,
@@ -12480,3 +12563,35 @@  vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
   *nunits_vectype_out = nunits_vectype;
   return opt_result::success ();
 }
+
+/* Generate and return a statement sequence that sets vector length LEN to:
+
+   min_of_start_and_end = min (START_INDEX, END_INDEX);
+   left_bytes = END_INDEX - min_of_start_and_end;
+   rhs = min (left_bytes, VECTOR_SIZE);
+   LEN = rhs;
+
+   TODO: For now the rs6000 vector load/store with length instructions only
+   look at 8 bits of the length, which means a LEFT_BYTES larger than 255
+   can't simply be saturated to the vector size.  A target hook can be
+   provided if other ports don't have this restriction.  */
+
+gimple_seq
+vect_gen_len (tree len, tree start_index, tree end_index, tree vector_size)
+{
+  gimple_seq stmts = NULL;
+  tree len_type = TREE_TYPE (len);
+  gcc_assert (TREE_TYPE (start_index) == len_type);
+
+  tree min = fold_build2 (MIN_EXPR, len_type, start_index, end_index);
+  tree left_bytes = fold_build2 (MINUS_EXPR, len_type, end_index, min);
+  left_bytes = fold_build2 (MIN_EXPR, len_type, left_bytes, vector_size);
+
+  tree rhs = force_gimple_operand (left_bytes, &stmts, true, NULL_TREE);
+  gimple *new_stmt = gimple_build_assign (len, rhs);
+  gimple_stmt_iterator i = gsi_last (stmts);
+  gsi_insert_after_without_update (&i, new_stmt, GSI_CONTINUE_LINKING);
+
+  return stmts;
+}
+
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 2eb3ab5d280..9d84766d724 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -461,20 +461,32 @@  is_a_helper <_bb_vec_info *>::test (vec_info *i)
    first level being indexed by nV - 1 (since nV == 0 doesn't exist) and
    the second being indexed by the mask index 0 <= i < nV.  */
 
-/* The masks needed by rgroups with nV vectors, according to the
-   description above.  */
-struct rgroup_masks {
-  /* The largest nS for all rgroups that use these masks.  */
-  unsigned int max_nscalars_per_iter;
-
-  /* The type of mask to use, based on the highest nS recorded above.  */
-  tree mask_type;
+/* The masks or lengths (collectively called objects) needed by rgroups
+   with nV vectors, according to the description above.  */
+struct rgroup_objs {
+  union
+  {
+    /* The largest nS for all rgroups that use these masks.  */
+    unsigned int max_nscalars_per_iter;
+    /* The total number of bytes processed per scalar iteration.  */
+    unsigned int nbytes_per_iter;
+  };
 
-  /* A vector of nV masks, in iteration order.  */
-  vec<tree> masks;
+  union
+  {
+    /* The type of mask to use, based on the highest nS recorded above.  */
+    tree mask_type;
+    /* One of the vector types that use these lengths.  */
+    tree vec_type;
+  };
+
+  /* A vector of nV objs, in iteration order.  */
+  vec<tree> objs;
 };
 
-typedef auto_vec<rgroup_masks> vec_loop_masks;
+typedef auto_vec<rgroup_objs> vec_loop_masks;
+
+typedef auto_vec<rgroup_objs> vec_loop_lens;
 
 typedef auto_vec<std::pair<data_reference*, tree> > drs_init_vec;
 
@@ -523,6 +535,10 @@  public:
      on inactive scalars.  */
   vec_loop_masks masks;
 
+  /* The lengths that a loop using length-based partial vectorization
+     should use to avoid operating on inactive scalars.  */
+  vec_loop_lens lens;
+
   /* Set of scalar conditions that have loop mask applied.  */
   scalar_cond_masked_set_type scalar_cond_masked_set;
 
@@ -620,12 +636,20 @@  public:
   /* Is the loop vectorizable? */
   bool vectorizable;
 
-  /* Records whether we still have the option of using a fully-masked loop.  */
-  bool can_fully_mask_p;
+  /* Records whether we can use partial vectorization approaches for this
+     loop.  For now the masking and length approaches are supported.  */
+  bool can_partial_vect_p;
 
   /* True if have decided to use a fully-masked loop.  */
   bool fully_masked_p;
 
+  /* True if we have decided to use lengths to vectorize the loop fully.  */
+  bool fully_with_length_p;
+
+  /* Records whether we can use partial vectorization for the epilogue of
+     this loop.  For now only the length approach is supported there.  */
+  bool epil_partial_vect_p;
+
   /* When we have grouped data accesses with gaps, we may introduce invalid
      memory accesses.  We peel the last iteration of the loop to prevent
      this.  */
@@ -687,8 +711,11 @@  public:
 #define LOOP_VINFO_COST_MODEL_THRESHOLD(L) (L)->th
 #define LOOP_VINFO_VERSIONING_THRESHOLD(L) (L)->versioning_threshold
 #define LOOP_VINFO_VECTORIZABLE_P(L)       (L)->vectorizable
-#define LOOP_VINFO_CAN_FULLY_MASK_P(L)     (L)->can_fully_mask_p
+#define LOOP_VINFO_CAN_PARTIAL_VECT_P(L)   (L)->can_partial_vect_p
 #define LOOP_VINFO_FULLY_MASKED_P(L)       (L)->fully_masked_p
+#define LOOP_VINFO_FULLY_WITH_LENGTH_P(L)  (L)->fully_with_length_p
+#define LOOP_VINFO_EPIL_PARTIAL_VECT_P(L)  (L)->epil_partial_vect_p
+#define LOOP_VINFO_LENS(L)                 (L)->lens
 #define LOOP_VINFO_VECT_FACTOR(L)          (L)->vectorization_factor
 #define LOOP_VINFO_MAX_VECT_FACTOR(L)      (L)->max_vectorization_factor
 #define LOOP_VINFO_MASKS(L)                (L)->masks
@@ -741,6 +768,10 @@  public:
    || LOOP_REQUIRES_VERSIONING_FOR_NITERS (L)		\
    || LOOP_REQUIRES_VERSIONING_FOR_SIMD_IF_COND (L))
 
+/* Whether the loop operates on partial vectors.  */
+#define LOOP_VINFO_PARTIAL_VECT_P(L)                                           \
+  (LOOP_VINFO_FULLY_MASKED_P (L) || LOOP_VINFO_FULLY_WITH_LENGTH_P (L))
+
 #define LOOP_VINFO_NITERS_KNOWN_P(L)          \
   (tree_fits_shwi_p ((L)->num_iters) && tree_to_shwi ((L)->num_iters) > 0)
 
@@ -1824,7 +1855,7 @@  extern tree vect_create_addr_base_for_vector_ref (vec_info *,
 						  tree, tree = NULL_TREE);
 
 /* In tree-vect-loop.c.  */
-extern widest_int vect_iv_limit_for_full_masking (loop_vec_info loop_vinfo);
+extern widest_int vect_iv_limit_for_partial_vect (loop_vec_info loop_vinfo);
 /* Used in tree-vect-loop-manip.c */
 extern void determine_peel_for_niter (loop_vec_info);
 /* Used in gimple-loop-interchange.c and tree-parloops.c.  */
@@ -1842,6 +1873,10 @@  extern void vect_record_loop_mask (loop_vec_info, vec_loop_masks *,
 				   unsigned int, tree, tree);
 extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
 				unsigned int, tree, unsigned int);
+extern void vect_record_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
+				  tree);
+extern tree vect_get_loop_len (vec_loop_lens *, unsigned int, unsigned int);
+extern gimple_seq vect_gen_len (tree, tree, tree, tree);
 extern stmt_vec_info info_for_reduction (vec_info *, stmt_vec_info);
 
 /* Drive for loop transformation stage.  */