PR90424 - lowpart vector set recognition

Message ID alpine.LSU.2.20.1905141438260.10704@zhemvz.fhfr.qr
State New
Headers show
Series
  • PR90424 - lowpart vector set recognition
Related show

Commit Message

Richard Biener May 14, 2019, 12:49 p.m.
The following makes SSA rewrite (update-address-taken) recognize
sets of aligned sub-vectors in aligned position
(v2qi into v16qi, but esp. v8qi into v16qi).  It uses the
BIT_INSERT_EXPR support for this, enabling that for vector
typed values.  This makes us turn for example

typedef unsigned char v16qi __attribute__((vector_size(16)));
v16qi load (const void *p)
{
  v16qi r;
  __builtin_memcpy (&r, p, 8);
  return r;
}

into the following

load (const void * p)
{
  v16qi r;
  long unsigned int _3;
  v16qi _5;
  vector(8) unsigned char _7;

  <bb 2> :
  _3 = MEM[(char * {ref-all})p_2(D)];
  _7 = VIEW_CONVERT_EXPR<vector(8) unsigned char>(_3);
  r_9 = BIT_INSERT_EXPR <r_8(D), _7, 0 (64 bits)>;
  _5 = r_9;
  return _5;

this isn't yet nicely expanded since the BIT_INSERT_EXPR
expansion simply goes through store_bit_field and there's
no vector-mode vec_set.

Similar as to the single-element insert SSA rewrite already
handles the transform is conditional on the involved
vector types having non-BLKmode.  This is somewhat bad
since the transform is supposed to enable SSA optimizations
by rewriting memory vectors into SSA form.  Since splitting
of larger generic vectors happens very much later only
this pessimizes their use.  But the BIT_INSERT_EXPR
expansion doesn't cope with BLKmode entities (source or
destination).

Extending BIT_INSERT_EXPR this way seems natural given
the support of CONSTRUCTORs with smaller vectors.
BIT_FIELD_REF isn't particularly restricted so can be
used to extract sub-vectors as well.

Code generation is as bad as before (RTL expansion eventually
spills) but SSA optimizations are enabled on less trivial
testcases.

Boostrap / regtest running on x86_64-unknown-linux-gnu.

Comments?

Richard.

2019-05-14  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/90424
	* tree-ssa.c (non_rewritable_lvalue_p): Handle inserts from
	aligned subvectors.
	(execute_update_addresses_taken): Likewise.
	* tree-cfg.c (verify_gimple_assign_ternary): Likewise.

	* g++.target/i386/pr90424-1.C: New testcase.
	* g++.target/i386/pr90424-2.C: Likewise.

Comments

Richard Sandiford May 14, 2019, 1:27 p.m. | #1
Richard Biener <rguenther@suse.de> writes:
> The following makes SSA rewrite (update-address-taken) recognize

> sets of aligned sub-vectors in aligned position

> (v2qi into v16qi, but esp. v8qi into v16qi).  It uses the

> BIT_INSERT_EXPR support for this, enabling that for vector

> typed values.  This makes us turn for example

>

> typedef unsigned char v16qi __attribute__((vector_size(16)));

> v16qi load (const void *p)

> {

>   v16qi r;

>   __builtin_memcpy (&r, p, 8);

>   return r;

> }

>

> into the following

>

> load (const void * p)

> {

>   v16qi r;

>   long unsigned int _3;

>   v16qi _5;

>   vector(8) unsigned char _7;

>

>   <bb 2> :

>   _3 = MEM[(char * {ref-all})p_2(D)];

>   _7 = VIEW_CONVERT_EXPR<vector(8) unsigned char>(_3);

>   r_9 = BIT_INSERT_EXPR <r_8(D), _7, 0 (64 bits)>;

>   _5 = r_9;

>   return _5;

>

> this isn't yet nicely expanded since the BIT_INSERT_EXPR

> expansion simply goes through store_bit_field and there's

> no vector-mode vec_set.

>

> Similar as to the single-element insert SSA rewrite already

> handles the transform is conditional on the involved

> vector types having non-BLKmode.  This is somewhat bad

> since the transform is supposed to enable SSA optimizations

> by rewriting memory vectors into SSA form.  Since splitting

> of larger generic vectors happens very much later only

> this pessimizes their use.  But the BIT_INSERT_EXPR

> expansion doesn't cope with BLKmode entities (source or

> destination).

>

> Extending BIT_INSERT_EXPR this way seems natural given

> the support of CONSTRUCTORs with smaller vectors.

> BIT_FIELD_REF isn't particularly restricted so can be

> used to extract sub-vectors as well.

>

> Code generation is as bad as before (RTL expansion eventually

> spills) but SSA optimizations are enabled on less trivial

> testcases.

>

> Boostrap / regtest running on x86_64-unknown-linux-gnu.

>

> Comments?

>

> Richard.

>

> 2019-05-14  Richard Biener  <rguenther@suse.de>

>

> 	PR tree-optimization/90424

> 	* tree-ssa.c (non_rewritable_lvalue_p): Handle inserts from

> 	aligned subvectors.

> 	(execute_update_addresses_taken): Likewise.

> 	* tree-cfg.c (verify_gimple_assign_ternary): Likewise.

>

> 	* g++.target/i386/pr90424-1.C: New testcase.

> 	* g++.target/i386/pr90424-2.C: Likewise.

>

> Index: gcc/tree-ssa.c

> ===================================================================

> --- gcc/tree-ssa.c	(revision 271155)

> +++ gcc/tree-ssa.c	(working copy)

> @@ -1521,14 +1521,28 @@ non_rewritable_lvalue_p (tree lhs)

>        if (DECL_P (decl)

>  	  && VECTOR_TYPE_P (TREE_TYPE (decl))

>  	  && TYPE_MODE (TREE_TYPE (decl)) != BLKmode

> -	  && operand_equal_p (TYPE_SIZE_UNIT (TREE_TYPE (lhs)),

> -			      TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (decl))), 0)

> +	  && multiple_of_p (sizetype,

> +			    TYPE_SIZE_UNIT (TREE_TYPE (decl)),

> +			    TYPE_SIZE_UNIT (TREE_TYPE (lhs)))

>  	  && known_ge (mem_ref_offset (lhs), 0)

>  	  && known_gt (wi::to_poly_offset (TYPE_SIZE_UNIT (TREE_TYPE (decl))),

>  		       mem_ref_offset (lhs))

>  	  && multiple_of_p (sizetype, TREE_OPERAND (lhs, 1),

>  			    TYPE_SIZE_UNIT (TREE_TYPE (lhs))))

> -	return false;

> +	{

> +	  if (operand_equal_p (TYPE_SIZE_UNIT (TREE_TYPE (lhs)),

> +			       TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (decl))),

> +			       0))

> +	    return false;

> +	  /* For sub-vector inserts the insert vector mode has to be

> +	     supported.  */

> +	  tree vtype = build_vector_type

> +	      (TREE_TYPE (TREE_TYPE (decl)),

> +	       tree_to_uhwi (TYPE_SIZE (TREE_TYPE (lhs)))

> +	       / tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (decl)))));


AFAICT nothing guarantees tree_fits_uhwi_p for the lhs, so this isn't
poly-int clean.  Is there a guarantee that lhs is a multiple of the
element size even for integers?  Or are we just relying on a vector
of 0 elements being rejected?

Maybe something like:

	  tree elt_type = TREE_TYPE (TREE_TYPE (decl));
	  unsigned int elt_bits = tree_to_uhwi (TYPE_SIZE (elt_type));
	  poly_uint64 lhs_bits, nelts;
	  if (poly_int_tree_p (TYPE_SIZE (TREE_TYPE (lhs)), &lhs_bits)
	      && multiple_p (lhs_bits, elt_bits, &nelts))
            {
	      tree vtype = build_vector_type (elt_type, nelts);

?  

Thanks,
Richard
Richard Biener May 14, 2019, 2:02 p.m. | #2
On Tue, 14 May 2019, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:

> > The following makes SSA rewrite (update-address-taken) recognize

> > sets of aligned sub-vectors in aligned position

> > (v2qi into v16qi, but esp. v8qi into v16qi).  It uses the

> > BIT_INSERT_EXPR support for this, enabling that for vector

> > typed values.  This makes us turn for example

> >

> > typedef unsigned char v16qi __attribute__((vector_size(16)));

> > v16qi load (const void *p)

> > {

> >   v16qi r;

> >   __builtin_memcpy (&r, p, 8);

> >   return r;

> > }

> >

> > into the following

> >

> > load (const void * p)

> > {

> >   v16qi r;

> >   long unsigned int _3;

> >   v16qi _5;

> >   vector(8) unsigned char _7;

> >

> >   <bb 2> :

> >   _3 = MEM[(char * {ref-all})p_2(D)];

> >   _7 = VIEW_CONVERT_EXPR<vector(8) unsigned char>(_3);

> >   r_9 = BIT_INSERT_EXPR <r_8(D), _7, 0 (64 bits)>;

> >   _5 = r_9;

> >   return _5;

> >

> > this isn't yet nicely expanded since the BIT_INSERT_EXPR

> > expansion simply goes through store_bit_field and there's

> > no vector-mode vec_set.

> >

> > Similar as to the single-element insert SSA rewrite already

> > handles the transform is conditional on the involved

> > vector types having non-BLKmode.  This is somewhat bad

> > since the transform is supposed to enable SSA optimizations

> > by rewriting memory vectors into SSA form.  Since splitting

> > of larger generic vectors happens very much later only

> > this pessimizes their use.  But the BIT_INSERT_EXPR

> > expansion doesn't cope with BLKmode entities (source or

> > destination).

> >

> > Extending BIT_INSERT_EXPR this way seems natural given

> > the support of CONSTRUCTORs with smaller vectors.

> > BIT_FIELD_REF isn't particularly restricted so can be

> > used to extract sub-vectors as well.

> >

> > Code generation is as bad as before (RTL expansion eventually

> > spills) but SSA optimizations are enabled on less trivial

> > testcases.

> >

> > Boostrap / regtest running on x86_64-unknown-linux-gnu.

> >

> > Comments?

> >

> > Richard.

> >

> > 2019-05-14  Richard Biener  <rguenther@suse.de>

> >

> > 	PR tree-optimization/90424

> > 	* tree-ssa.c (non_rewritable_lvalue_p): Handle inserts from

> > 	aligned subvectors.

> > 	(execute_update_addresses_taken): Likewise.

> > 	* tree-cfg.c (verify_gimple_assign_ternary): Likewise.

> >

> > 	* g++.target/i386/pr90424-1.C: New testcase.

> > 	* g++.target/i386/pr90424-2.C: Likewise.

> >

> > Index: gcc/tree-ssa.c

> > ===================================================================

> > --- gcc/tree-ssa.c	(revision 271155)

> > +++ gcc/tree-ssa.c	(working copy)

> > @@ -1521,14 +1521,28 @@ non_rewritable_lvalue_p (tree lhs)

> >        if (DECL_P (decl)

> >  	  && VECTOR_TYPE_P (TREE_TYPE (decl))

> >  	  && TYPE_MODE (TREE_TYPE (decl)) != BLKmode

> > -	  && operand_equal_p (TYPE_SIZE_UNIT (TREE_TYPE (lhs)),

> > -			      TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (decl))), 0)

> > +	  && multiple_of_p (sizetype,

> > +			    TYPE_SIZE_UNIT (TREE_TYPE (decl)),

> > +			    TYPE_SIZE_UNIT (TREE_TYPE (lhs)))

> >  	  && known_ge (mem_ref_offset (lhs), 0)

> >  	  && known_gt (wi::to_poly_offset (TYPE_SIZE_UNIT (TREE_TYPE (decl))),

> >  		       mem_ref_offset (lhs))

> >  	  && multiple_of_p (sizetype, TREE_OPERAND (lhs, 1),

> >  			    TYPE_SIZE_UNIT (TREE_TYPE (lhs))))

> > -	return false;

> > +	{

> > +	  if (operand_equal_p (TYPE_SIZE_UNIT (TREE_TYPE (lhs)),

> > +			       TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (decl))),

> > +			       0))

> > +	    return false;

> > +	  /* For sub-vector inserts the insert vector mode has to be

> > +	     supported.  */

> > +	  tree vtype = build_vector_type

> > +	      (TREE_TYPE (TREE_TYPE (decl)),

> > +	       tree_to_uhwi (TYPE_SIZE (TREE_TYPE (lhs)))

> > +	       / tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (decl)))));

> 

> AFAICT nothing guarantees tree_fits_uhwi_p for the lhs, so this isn't

> poly-int clean.  Is there a guarantee that lhs is a multiple of the

> element size even for integers?  Or are we just relying on a vector

> of 0 elements being rejected?


That it is a multiple of the element size is tested indirectly by

> > +     && multiple_of_p (sizetype, 

> > +                       TYPE_SIZE_UNIT (TREE_TYPE (decl)),

> > +                       TYPE_SIZE_UNIT (TREE_TYPE (lhs)))


given 'decl' has vector type.  That also guarantees non-zero size?

But yes, I guess TYPE_SIZE_UNIT might be a poly-int.

> Maybe something like:

> 

> 	  tree elt_type = TREE_TYPE (TREE_TYPE (decl));

> 	  unsigned int elt_bits = tree_to_uhwi (TYPE_SIZE (elt_type));

> 	  poly_uint64 lhs_bits, nelts;

> 	  if (poly_int_tree_p (TYPE_SIZE (TREE_TYPE (lhs)), &lhs_bits)

> 	      && multiple_p (lhs_bits, elt_bits, &nelts))

>             {

> 	      tree vtype = build_vector_type (elt_type, nelts);


This should work.

Of course it even more so makes duplicating all the bits in
two places ugly :/

I guess I might take it as opportunity to refactor the pass
(as excuse to not work on that SLP thing...).

Thanks,
Richard.
Richard Sandiford May 14, 2019, 2:10 p.m. | #3
Richard Biener <rguenther@suse.de> writes:
> On Tue, 14 May 2019, Richard Sandiford wrote:

>

>> Richard Biener <rguenther@suse.de> writes:

>> > The following makes SSA rewrite (update-address-taken) recognize

>> > sets of aligned sub-vectors in aligned position

>> > (v2qi into v16qi, but esp. v8qi into v16qi).  It uses the

>> > BIT_INSERT_EXPR support for this, enabling that for vector

>> > typed values.  This makes us turn for example

>> >

>> > typedef unsigned char v16qi __attribute__((vector_size(16)));

>> > v16qi load (const void *p)

>> > {

>> >   v16qi r;

>> >   __builtin_memcpy (&r, p, 8);

>> >   return r;

>> > }

>> >

>> > into the following

>> >

>> > load (const void * p)

>> > {

>> >   v16qi r;

>> >   long unsigned int _3;

>> >   v16qi _5;

>> >   vector(8) unsigned char _7;

>> >

>> >   <bb 2> :

>> >   _3 = MEM[(char * {ref-all})p_2(D)];

>> >   _7 = VIEW_CONVERT_EXPR<vector(8) unsigned char>(_3);

>> >   r_9 = BIT_INSERT_EXPR <r_8(D), _7, 0 (64 bits)>;

>> >   _5 = r_9;

>> >   return _5;

>> >

>> > this isn't yet nicely expanded since the BIT_INSERT_EXPR

>> > expansion simply goes through store_bit_field and there's

>> > no vector-mode vec_set.

>> >

>> > Similar as to the single-element insert SSA rewrite already

>> > handles the transform is conditional on the involved

>> > vector types having non-BLKmode.  This is somewhat bad

>> > since the transform is supposed to enable SSA optimizations

>> > by rewriting memory vectors into SSA form.  Since splitting

>> > of larger generic vectors happens very much later only

>> > this pessimizes their use.  But the BIT_INSERT_EXPR

>> > expansion doesn't cope with BLKmode entities (source or

>> > destination).

>> >

>> > Extending BIT_INSERT_EXPR this way seems natural given

>> > the support of CONSTRUCTORs with smaller vectors.

>> > BIT_FIELD_REF isn't particularly restricted so can be

>> > used to extract sub-vectors as well.

>> >

>> > Code generation is as bad as before (RTL expansion eventually

>> > spills) but SSA optimizations are enabled on less trivial

>> > testcases.

>> >

>> > Boostrap / regtest running on x86_64-unknown-linux-gnu.

>> >

>> > Comments?

>> >

>> > Richard.

>> >

>> > 2019-05-14  Richard Biener  <rguenther@suse.de>

>> >

>> > 	PR tree-optimization/90424

>> > 	* tree-ssa.c (non_rewritable_lvalue_p): Handle inserts from

>> > 	aligned subvectors.

>> > 	(execute_update_addresses_taken): Likewise.

>> > 	* tree-cfg.c (verify_gimple_assign_ternary): Likewise.

>> >

>> > 	* g++.target/i386/pr90424-1.C: New testcase.

>> > 	* g++.target/i386/pr90424-2.C: Likewise.

>> >

>> > Index: gcc/tree-ssa.c

>> > ===================================================================

>> > --- gcc/tree-ssa.c	(revision 271155)

>> > +++ gcc/tree-ssa.c	(working copy)

>> > @@ -1521,14 +1521,28 @@ non_rewritable_lvalue_p (tree lhs)

>> >        if (DECL_P (decl)

>> >  	  && VECTOR_TYPE_P (TREE_TYPE (decl))

>> >  	  && TYPE_MODE (TREE_TYPE (decl)) != BLKmode

>> > -	  && operand_equal_p (TYPE_SIZE_UNIT (TREE_TYPE (lhs)),

>> > -			      TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (decl))), 0)

>> > +	  && multiple_of_p (sizetype,

>> > +			    TYPE_SIZE_UNIT (TREE_TYPE (decl)),

>> > +			    TYPE_SIZE_UNIT (TREE_TYPE (lhs)))

>> >  	  && known_ge (mem_ref_offset (lhs), 0)

>> >  	  && known_gt (wi::to_poly_offset (TYPE_SIZE_UNIT (TREE_TYPE (decl))),

>> >  		       mem_ref_offset (lhs))

>> >  	  && multiple_of_p (sizetype, TREE_OPERAND (lhs, 1),

>> >  			    TYPE_SIZE_UNIT (TREE_TYPE (lhs))))

>> > -	return false;

>> > +	{

>> > +	  if (operand_equal_p (TYPE_SIZE_UNIT (TREE_TYPE (lhs)),

>> > +			       TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (decl))),

>> > +			       0))

>> > +	    return false;

>> > +	  /* For sub-vector inserts the insert vector mode has to be

>> > +	     supported.  */

>> > +	  tree vtype = build_vector_type

>> > +	      (TREE_TYPE (TREE_TYPE (decl)),

>> > +	       tree_to_uhwi (TYPE_SIZE (TREE_TYPE (lhs)))

>> > +	       / tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (decl)))));

>> 

>> AFAICT nothing guarantees tree_fits_uhwi_p for the lhs, so this isn't

>> poly-int clean.  Is there a guarantee that lhs is a multiple of the

>> element size even for integers?  Or are we just relying on a vector

>> of 0 elements being rejected?

>

> That it is a multiple of the element size is tested indirectly by

>

>> > +     && multiple_of_p (sizetype, 

>> > +                       TYPE_SIZE_UNIT (TREE_TYPE (decl)),

>> > +                       TYPE_SIZE_UNIT (TREE_TYPE (lhs)))

>

> given 'decl' has vector type.  That also guarantees non-zero size?


I was thinking of the case where the element size is still bigger
than lhs, so the division would end up being 0.  Maybe one of those
conditions indirectly prevents that though.

> But yes, I guess TYPE_SIZE_UNIT might be a poly-int.

>

>> Maybe something like:

>> 

>> 	  tree elt_type = TREE_TYPE (TREE_TYPE (decl));

>> 	  unsigned int elt_bits = tree_to_uhwi (TYPE_SIZE (elt_type));

>> 	  poly_uint64 lhs_bits, nelts;

>> 	  if (poly_int_tree_p (TYPE_SIZE (TREE_TYPE (lhs)), &lhs_bits)

>> 	      && multiple_p (lhs_bits, elt_bits, &nelts))

>>             {

>> 	      tree vtype = build_vector_type (elt_type, nelts);

>

> This should work.

>

> Of course it even more so makes duplicating all the bits in

> two places ugly :/

>

> I guess I might take it as opportunity to refactor the pass


Sounds good :-)

> (as excuse to not work on that SLP thing...).


except for that bit.

>

> Thanks,

> Richard.
Richard Biener May 15, 2019, 1:01 p.m. | #4
On Tue, 14 May 2019, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:

> > On Tue, 14 May 2019, Richard Sandiford wrote:

> >

> >> Richard Biener <rguenther@suse.de> writes:

> >> > The following makes SSA rewrite (update-address-taken) recognize

> >> > sets of aligned sub-vectors in aligned position

> >> > (v2qi into v16qi, but esp. v8qi into v16qi).  It uses the

> >> > BIT_INSERT_EXPR support for this, enabling that for vector

> >> > typed values.  This makes us turn for example

> >> >

> >> > typedef unsigned char v16qi __attribute__((vector_size(16)));

> >> > v16qi load (const void *p)

> >> > {

> >> >   v16qi r;

> >> >   __builtin_memcpy (&r, p, 8);

> >> >   return r;

> >> > }

> >> >

> >> > into the following

> >> >

> >> > load (const void * p)

> >> > {

> >> >   v16qi r;

> >> >   long unsigned int _3;

> >> >   v16qi _5;

> >> >   vector(8) unsigned char _7;

> >> >

> >> >   <bb 2> :

> >> >   _3 = MEM[(char * {ref-all})p_2(D)];

> >> >   _7 = VIEW_CONVERT_EXPR<vector(8) unsigned char>(_3);

> >> >   r_9 = BIT_INSERT_EXPR <r_8(D), _7, 0 (64 bits)>;

> >> >   _5 = r_9;

> >> >   return _5;

> >> >

> >> > this isn't yet nicely expanded since the BIT_INSERT_EXPR

> >> > expansion simply goes through store_bit_field and there's

> >> > no vector-mode vec_set.

> >> >

> >> > Similar as to the single-element insert SSA rewrite already

> >> > handles the transform is conditional on the involved

> >> > vector types having non-BLKmode.  This is somewhat bad

> >> > since the transform is supposed to enable SSA optimizations

> >> > by rewriting memory vectors into SSA form.  Since splitting

> >> > of larger generic vectors happens very much later only

> >> > this pessimizes their use.  But the BIT_INSERT_EXPR

> >> > expansion doesn't cope with BLKmode entities (source or

> >> > destination).

> >> >

> >> > Extending BIT_INSERT_EXPR this way seems natural given

> >> > the support of CONSTRUCTORs with smaller vectors.

> >> > BIT_FIELD_REF isn't particularly restricted so can be

> >> > used to extract sub-vectors as well.

> >> >

> >> > Code generation is as bad as before (RTL expansion eventually

> >> > spills) but SSA optimizations are enabled on less trivial

> >> > testcases.

> >> >

> >> > Boostrap / regtest running on x86_64-unknown-linux-gnu.

> >> >

> >> > Comments?

> >> >

> >> > Richard.

> >> >

> >> > 2019-05-14  Richard Biener  <rguenther@suse.de>

> >> >

> >> > 	PR tree-optimization/90424

> >> > 	* tree-ssa.c (non_rewritable_lvalue_p): Handle inserts from

> >> > 	aligned subvectors.

> >> > 	(execute_update_addresses_taken): Likewise.

> >> > 	* tree-cfg.c (verify_gimple_assign_ternary): Likewise.

> >> >

> >> > 	* g++.target/i386/pr90424-1.C: New testcase.

> >> > 	* g++.target/i386/pr90424-2.C: Likewise.

> >> >

> >> > Index: gcc/tree-ssa.c

> >> > ===================================================================

> >> > --- gcc/tree-ssa.c	(revision 271155)

> >> > +++ gcc/tree-ssa.c	(working copy)

> >> > @@ -1521,14 +1521,28 @@ non_rewritable_lvalue_p (tree lhs)

> >> >        if (DECL_P (decl)

> >> >  	  && VECTOR_TYPE_P (TREE_TYPE (decl))

> >> >  	  && TYPE_MODE (TREE_TYPE (decl)) != BLKmode

> >> > -	  && operand_equal_p (TYPE_SIZE_UNIT (TREE_TYPE (lhs)),

> >> > -			      TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (decl))), 0)

> >> > +	  && multiple_of_p (sizetype,

> >> > +			    TYPE_SIZE_UNIT (TREE_TYPE (decl)),

> >> > +			    TYPE_SIZE_UNIT (TREE_TYPE (lhs)))

> >> >  	  && known_ge (mem_ref_offset (lhs), 0)

> >> >  	  && known_gt (wi::to_poly_offset (TYPE_SIZE_UNIT (TREE_TYPE (decl))),

> >> >  		       mem_ref_offset (lhs))

> >> >  	  && multiple_of_p (sizetype, TREE_OPERAND (lhs, 1),

> >> >  			    TYPE_SIZE_UNIT (TREE_TYPE (lhs))))

> >> > -	return false;

> >> > +	{

> >> > +	  if (operand_equal_p (TYPE_SIZE_UNIT (TREE_TYPE (lhs)),

> >> > +			       TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (decl))),

> >> > +			       0))

> >> > +	    return false;

> >> > +	  /* For sub-vector inserts the insert vector mode has to be

> >> > +	     supported.  */

> >> > +	  tree vtype = build_vector_type

> >> > +	      (TREE_TYPE (TREE_TYPE (decl)),

> >> > +	       tree_to_uhwi (TYPE_SIZE (TREE_TYPE (lhs)))

> >> > +	       / tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (decl)))));

> >> 

> >> AFAICT nothing guarantees tree_fits_uhwi_p for the lhs, so this isn't

> >> poly-int clean.  Is there a guarantee that lhs is a multiple of the

> >> element size even for integers?  Or are we just relying on a vector

> >> of 0 elements being rejected?

> >

> > That it is a multiple of the element size is tested indirectly by

> >

> >> > +     && multiple_of_p (sizetype, 

> >> > +                       TYPE_SIZE_UNIT (TREE_TYPE (decl)),

> >> > +                       TYPE_SIZE_UNIT (TREE_TYPE (lhs)))

> >

> > given 'decl' has vector type.  That also guarantees non-zero size?

> 

> I was thinking of the case where the element size is still bigger

> than lhs, so the division would end up being 0.  Maybe one of those

> conditions indirectly prevents that though.

> 

> > But yes, I guess TYPE_SIZE_UNIT might be a poly-int.

> >

> >> Maybe something like:

> >> 

> >> 	  tree elt_type = TREE_TYPE (TREE_TYPE (decl));

> >> 	  unsigned int elt_bits = tree_to_uhwi (TYPE_SIZE (elt_type));

> >> 	  poly_uint64 lhs_bits, nelts;

> >> 	  if (poly_int_tree_p (TYPE_SIZE (TREE_TYPE (lhs)), &lhs_bits)

> >> 	      && multiple_p (lhs_bits, elt_bits, &nelts))

> >>             {

> >> 	      tree vtype = build_vector_type (elt_type, nelts);

> >

> > This should work.

> >

> > Of course it even more so makes duplicating all the bits in

> > two places ugly :/

> >

> > I guess I might take it as opportunity to refactor the pass

> 

> Sounds good :-)

> 

> > (as excuse to not work on that SLP thing...).

> 

> except for that bit.


The following seems to work for me then ;)

Bootstrap / regtest running on x86_64-unknown-linux-gnu.

To who is listening: the refactoring would queue up stmts we need
to rewrite as discovered from non_rewritable_lvalue_p or
non_rewritable_mem_ref_base together with a transform kind.
During transform we then simply walk those, not re-analyzing
everything.  Probably easyhack but given the ugliness of the code
it might be easy to introduce mistakes as well ;)

Richard.

2019-05-14  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/90424
	* tree-ssa.c (non_rewritable_lvalue_p): Handle inserts from
	aligned subvectors.
	(execute_update_addresses_taken): Likewise.
	* tree-cfg.c (verify_gimple_assign_ternary): Likewise.

	* g++.target/i386/pr90424-1.C: New testcase.
	* g++.target/i386/pr90424-2.C: Likewise.

Index: gcc/tree-ssa.c
===================================================================
--- gcc/tree-ssa.c	(revision 271203)
+++ gcc/tree-ssa.c	(working copy)
@@ -1521,14 +1521,29 @@ non_rewritable_lvalue_p (tree lhs)
       if (DECL_P (decl)
 	  && VECTOR_TYPE_P (TREE_TYPE (decl))
 	  && TYPE_MODE (TREE_TYPE (decl)) != BLKmode
-	  && operand_equal_p (TYPE_SIZE_UNIT (TREE_TYPE (lhs)),
-			      TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (decl))), 0)
 	  && known_ge (mem_ref_offset (lhs), 0)
 	  && known_gt (wi::to_poly_offset (TYPE_SIZE_UNIT (TREE_TYPE (decl))),
 		       mem_ref_offset (lhs))
 	  && multiple_of_p (sizetype, TREE_OPERAND (lhs, 1),
 			    TYPE_SIZE_UNIT (TREE_TYPE (lhs))))
-	return false;
+	{
+	  poly_uint64 lhs_bits, nelts;
+	  if (poly_int_tree_p (TYPE_SIZE (TREE_TYPE (lhs)), &lhs_bits)
+	      && multiple_p (lhs_bits,
+			     tree_to_uhwi
+			       (TYPE_SIZE (TREE_TYPE (TREE_TYPE (decl)))),
+			     &nelts))
+	    {
+	      if (known_eq (nelts, 1))
+		return false;
+	      /* For sub-vector inserts the insert vector mode has to be
+		 supported.  */
+	      tree vtype = build_vector_type (TREE_TYPE (TREE_TYPE (decl)),
+					      nelts);
+	      if (TYPE_MODE (vtype) != BLKmode)
+		return false;
+	    }
+	}
     }
 
   /* A vector-insert using a BIT_FIELD_REF is rewritable using
@@ -1866,20 +1881,30 @@ execute_update_addresses_taken (void)
 		    && bitmap_bit_p (suitable_for_renaming, DECL_UID (sym))
 		    && VECTOR_TYPE_P (TREE_TYPE (sym))
 		    && TYPE_MODE (TREE_TYPE (sym)) != BLKmode
-		    && operand_equal_p (TYPE_SIZE_UNIT (TREE_TYPE (lhs)),
-					TYPE_SIZE_UNIT
-					  (TREE_TYPE (TREE_TYPE (sym))), 0)
-		    && tree_fits_uhwi_p (TREE_OPERAND (lhs, 1))
-		    && tree_int_cst_lt (TREE_OPERAND (lhs, 1),
-					TYPE_SIZE_UNIT (TREE_TYPE (sym)))
-		    && (tree_to_uhwi (TREE_OPERAND (lhs, 1))
-			% tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (lhs)))) == 0)
+		    && known_ge (mem_ref_offset (lhs), 0)
+		    && known_gt (wi::to_poly_offset
+				   (TYPE_SIZE_UNIT (TREE_TYPE (sym))),
+				 mem_ref_offset (lhs))
+		    && multiple_of_p (sizetype,
+				      TREE_OPERAND (lhs, 1),
+				      TYPE_SIZE_UNIT (TREE_TYPE (lhs))))
 		  {
 		    tree val = gimple_assign_rhs1 (stmt);
 		    if (! types_compatible_p (TREE_TYPE (val),
 					      TREE_TYPE (TREE_TYPE (sym))))
 		      {
-			tree tem = make_ssa_name (TREE_TYPE (TREE_TYPE (sym)));
+			poly_uint64 lhs_bits, nelts;
+			tree temtype = TREE_TYPE (TREE_TYPE (sym));
+			if (poly_int_tree_p (TYPE_SIZE (TREE_TYPE (lhs)),
+					     &lhs_bits)
+			    && multiple_p (lhs_bits,
+					   tree_to_uhwi
+					     (TYPE_SIZE (TREE_TYPE
+							   (TREE_TYPE (sym)))),
+					   &nelts)
+			    && maybe_ne (nelts, 1))
+			  temtype = build_vector_type (temtype, nelts);
+			tree tem = make_ssa_name (temtype);
 			gimple *pun
 			  = gimple_build_assign (tem,
 						 build1 (VIEW_CONVERT_EXPR,
Index: gcc/tree-cfg.c
===================================================================
--- gcc/tree-cfg.c	(revision 271203)
+++ gcc/tree-cfg.c	(working copy)
@@ -4263,8 +4263,17 @@ verify_gimple_assign_ternary (gassign *s
 	}
       if (! ((INTEGRAL_TYPE_P (rhs1_type)
 	      && INTEGRAL_TYPE_P (rhs2_type))
+	     /* Vector element insert.  */
 	     || (VECTOR_TYPE_P (rhs1_type)
-		 && types_compatible_p (TREE_TYPE (rhs1_type), rhs2_type))))
+		 && types_compatible_p (TREE_TYPE (rhs1_type), rhs2_type))
+	     /* Aligned sub-vector insert.  */
+	     || (VECTOR_TYPE_P (rhs1_type)
+		 && VECTOR_TYPE_P (rhs2_type)
+		 && types_compatible_p (TREE_TYPE (rhs1_type),
+					TREE_TYPE (rhs2_type))
+		 && multiple_p (TYPE_VECTOR_SUBPARTS (rhs1_type),
+				TYPE_VECTOR_SUBPARTS (rhs2_type))
+		 && multiple_of_p (bitsizetype, rhs3, TYPE_SIZE (rhs2_type)))))
 	{
 	  error ("not allowed type combination in BIT_INSERT_EXPR");
 	  debug_generic_expr (rhs1_type);
Index: gcc/testsuite/g++.target/i386/pr90424-1.C
===================================================================
--- gcc/testsuite/g++.target/i386/pr90424-1.C	(nonexistent)
+++ gcc/testsuite/g++.target/i386/pr90424-1.C	(working copy)
@@ -0,0 +1,32 @@
+/* { dg-do compile { target c++11 } } */
+/* { dg-options "-O2 -msse2 -fdump-tree-optimized" } */
+
+template <class T>
+using V [[gnu::vector_size(16)]] = T;
+
+template <class T, unsigned M = sizeof(V<T>)>
+V<T> load(const void *p) {
+  using W = V<T>;
+  W r;
+  __builtin_memcpy(&r, p, M);
+  return r;
+}
+
+// movq or movsd
+template V<char> load<char, 8>(const void *);     // bad
+template V<short> load<short, 8>(const void *);   // bad
+template V<int> load<int, 8>(const void *);       // bad
+template V<long> load<long, 8>(const void *);     // good
+// the following is disabled because V2SF isn't a supported mode
+// template V<float> load<float, 8>(const void *);   // bad
+template V<double> load<double, 8>(const void *); // good (movsd?)
+
+// movd or movss
+template V<char> load<char, 4>(const void *);   // bad
+template V<short> load<short, 4>(const void *); // bad
+template V<int> load<int, 4>(const void *);     // good
+template V<float> load<float, 4>(const void *); // good
+
+/* We should end up with one load and one insert for each function.  */
+/* { dg-final { scan-tree-dump-times "BIT_INSERT_EXPR" 9 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "MEM" 9 "optimized" } } */
Index: gcc/testsuite/g++.target/i386/pr90424-2.C
===================================================================
--- gcc/testsuite/g++.target/i386/pr90424-2.C	(nonexistent)
+++ gcc/testsuite/g++.target/i386/pr90424-2.C	(working copy)
@@ -0,0 +1,31 @@
+/* { dg-do compile { target c++11 } } */
+/* { dg-options "-O2 -msse2 -fdump-tree-optimized" } */
+
+template <class T>
+using V [[gnu::vector_size(16)]] = T;
+
+template <class T, unsigned M = sizeof(V<T>)>
+V<T> load(const void *p) {
+  V<T> r = {};
+  __builtin_memcpy(&r, p, M);
+  return r;
+}
+
+// movq or movsd
+template V<char> load<char, 8>(const void *);     // bad
+template V<short> load<short, 8>(const void *);   // bad
+template V<int> load<int, 8>(const void *);       // bad
+template V<long> load<long, 8>(const void *);     // good
+// the following is disabled because V2SF isn't a supported mode
+// template V<float> load<float, 8>(const void *);   // bad
+template V<double> load<double, 8>(const void *); // good (movsd?)
+
+// movd or movss
+template V<char> load<char, 4>(const void *);   // bad
+template V<short> load<short, 4>(const void *); // bad
+template V<int> load<int, 4>(const void *);     // good
+template V<float> load<float, 4>(const void *); // good
+
+/* We should end up with one load and one insert for each function.  */
+/* { dg-final { scan-tree-dump-times "BIT_INSERT_EXPR" 9 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "MEM" 9 "optimized" } } */

Patch

Index: gcc/tree-ssa.c
===================================================================
--- gcc/tree-ssa.c	(revision 271155)
+++ gcc/tree-ssa.c	(working copy)
@@ -1521,14 +1521,28 @@  non_rewritable_lvalue_p (tree lhs)
       if (DECL_P (decl)
 	  && VECTOR_TYPE_P (TREE_TYPE (decl))
 	  && TYPE_MODE (TREE_TYPE (decl)) != BLKmode
-	  && operand_equal_p (TYPE_SIZE_UNIT (TREE_TYPE (lhs)),
-			      TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (decl))), 0)
+	  && multiple_of_p (sizetype,
+			    TYPE_SIZE_UNIT (TREE_TYPE (decl)),
+			    TYPE_SIZE_UNIT (TREE_TYPE (lhs)))
 	  && known_ge (mem_ref_offset (lhs), 0)
 	  && known_gt (wi::to_poly_offset (TYPE_SIZE_UNIT (TREE_TYPE (decl))),
 		       mem_ref_offset (lhs))
 	  && multiple_of_p (sizetype, TREE_OPERAND (lhs, 1),
 			    TYPE_SIZE_UNIT (TREE_TYPE (lhs))))
-	return false;
+	{
+	  if (operand_equal_p (TYPE_SIZE_UNIT (TREE_TYPE (lhs)),
+			       TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (decl))),
+			       0))
+	    return false;
+	  /* For sub-vector inserts the insert vector mode has to be
+	     supported.  */
+	  tree vtype = build_vector_type
+	      (TREE_TYPE (TREE_TYPE (decl)),
+	       tree_to_uhwi (TYPE_SIZE (TREE_TYPE (lhs)))
+	       / tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (decl)))));
+	  if (TYPE_MODE (vtype) != BLKmode)
+	    return false;
+	}
     }
 
   /* A vector-insert using a BIT_FIELD_REF is rewritable using
@@ -1866,9 +1880,9 @@  execute_update_addresses_taken (void)
 		    && bitmap_bit_p (suitable_for_renaming, DECL_UID (sym))
 		    && VECTOR_TYPE_P (TREE_TYPE (sym))
 		    && TYPE_MODE (TREE_TYPE (sym)) != BLKmode
-		    && operand_equal_p (TYPE_SIZE_UNIT (TREE_TYPE (lhs)),
-					TYPE_SIZE_UNIT
-					  (TREE_TYPE (TREE_TYPE (sym))), 0)
+		    && multiple_of_p (sizetype,
+				      TYPE_SIZE_UNIT (TREE_TYPE (sym)),
+				      TYPE_SIZE_UNIT (TREE_TYPE (lhs)))
 		    && tree_fits_uhwi_p (TREE_OPERAND (lhs, 1))
 		    && tree_int_cst_lt (TREE_OPERAND (lhs, 1),
 					TYPE_SIZE_UNIT (TREE_TYPE (sym)))
@@ -1879,7 +1893,16 @@  execute_update_addresses_taken (void)
 		    if (! types_compatible_p (TREE_TYPE (val),
 					      TREE_TYPE (TREE_TYPE (sym))))
 		      {
-			tree tem = make_ssa_name (TREE_TYPE (TREE_TYPE (sym)));
+			tree temtype = TREE_TYPE (TREE_TYPE (sym));
+			if (!operand_equal_p (TYPE_SIZE
+					        (TREE_TYPE (TREE_TYPE (sym))),
+					      TYPE_SIZE (TREE_TYPE (lhs)), 0))
+			  temtype = build_vector_type
+			    (temtype,
+			     tree_to_uhwi (TYPE_SIZE (TREE_TYPE (lhs)))
+			     / tree_to_uhwi (TYPE_SIZE (TREE_TYPE
+							  (TREE_TYPE (sym)))));
+			tree tem = make_ssa_name (temtype);
 			gimple *pun
 			  = gimple_build_assign (tem,
 						 build1 (VIEW_CONVERT_EXPR,
Index: gcc/tree-cfg.c
===================================================================
--- gcc/tree-cfg.c	(revision 271155)
+++ gcc/tree-cfg.c	(working copy)
@@ -4263,8 +4263,17 @@  verify_gimple_assign_ternary (gassign *s
 	}
       if (! ((INTEGRAL_TYPE_P (rhs1_type)
 	      && INTEGRAL_TYPE_P (rhs2_type))
+	     /* Vector element insert.  */
 	     || (VECTOR_TYPE_P (rhs1_type)
-		 && types_compatible_p (TREE_TYPE (rhs1_type), rhs2_type))))
+		 && types_compatible_p (TREE_TYPE (rhs1_type), rhs2_type))
+	     /* Aligned sub-vector insert.  */
+	     || (VECTOR_TYPE_P (rhs1_type)
+		 && VECTOR_TYPE_P (rhs2_type)
+		 && types_compatible_p (TREE_TYPE (rhs1_type),
+					TREE_TYPE (rhs2_type))
+		 && multiple_p (TYPE_VECTOR_SUBPARTS (rhs1_type),
+				TYPE_VECTOR_SUBPARTS (rhs2_type))
+		 && multiple_of_p (bitsizetype, rhs3, TYPE_SIZE (rhs2_type)))))
 	{
 	  error ("not allowed type combination in BIT_INSERT_EXPR");
 	  debug_generic_expr (rhs1_type);
Index: gcc/testsuite/g++.target/i386/pr90424-1.C
===================================================================
--- gcc/testsuite/g++.target/i386/pr90424-1.C	(nonexistent)
+++ gcc/testsuite/g++.target/i386/pr90424-1.C	(working copy)
@@ -0,0 +1,32 @@ 
+/* { dg-do compile { target c++11 } } */
+/* { dg-options "-O2 -msse2 -fdump-tree-optimized" } */
+
+template <class T>
+using V [[gnu::vector_size(16)]] = T;
+
+template <class T, unsigned M = sizeof(V<T>)>
+V<T> load(const void *p) {
+  using W = V<T>;
+  W r;
+  __builtin_memcpy(&r, p, M);
+  return r;
+}
+
+// movq or movsd
+template V<char> load<char, 8>(const void *);     // bad
+template V<short> load<short, 8>(const void *);   // bad
+template V<int> load<int, 8>(const void *);       // bad
+template V<long> load<long, 8>(const void *);     // good
+// the following is disabled because V2SF isn't a supported mode
+// template V<float> load<float, 8>(const void *);   // bad
+template V<double> load<double, 8>(const void *); // good (movsd?)
+
+// movd or movss
+template V<char> load<char, 4>(const void *);   // bad
+template V<short> load<short, 4>(const void *); // bad
+template V<int> load<int, 4>(const void *);     // good
+template V<float> load<float, 4>(const void *); // good
+
+/* We should end up with one load and one insert for each function.  */
+/* { dg-final { scan-tree-dump-times "BIT_INSERT_EXPR" 9 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "MEM" 9 "optimized" } } */
Index: gcc/testsuite/g++.target/i386/pr90424-2.C
===================================================================
--- gcc/testsuite/g++.target/i386/pr90424-2.C	(nonexistent)
+++ gcc/testsuite/g++.target/i386/pr90424-2.C	(working copy)
@@ -0,0 +1,31 @@ 
+/* { dg-do compile { target c++11 } } */
+/* { dg-options "-O2 -msse2 -fdump-tree-optimized" } */
+
+template <class T>
+using V [[gnu::vector_size(16)]] = T;
+
+template <class T, unsigned M = sizeof(V<T>)>
+V<T> load(const void *p) {
+  V<T> r = {};
+  __builtin_memcpy(&r, p, M);
+  return r;
+}
+
+// movq or movsd
+template V<char> load<char, 8>(const void *);     // bad
+template V<short> load<short, 8>(const void *);   // bad
+template V<int> load<int, 8>(const void *);       // bad
+template V<long> load<long, 8>(const void *);     // good
+// the following is disabled because V2SF isn't a supported mode
+// template V<float> load<float, 8>(const void *);   // bad
+template V<double> load<double, 8>(const void *); // good (movsd?)
+
+// movd or movss
+template V<char> load<char, 4>(const void *);   // bad
+template V<short> load<short, 4>(const void *); // bad
+template V<int> load<int, 4>(const void *);     // good
+template V<float> load<float, 4>(const void *); // good
+
+/* We should end up with one load and one insert for each function.  */
+/* { dg-final { scan-tree-dump-times "BIT_INSERT_EXPR" 9 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "MEM" 9 "optimized" } } */