RFA: vectorizer patches 1/2 : WIDEN_MULT_PLUS support

Message ID 5BE7D7FC.8060805@riscy-ip.com
State New
Headers show
Series
  • RFA: vectorizer patches 1/2 : WIDEN_MULT_PLUS support
Related show

Commit Message

Joern Wolfgang Rennecke Nov. 11, 2018, 7:19 a.m.
Our target (eSi-RISC) doesn't have DOT_PROD_EXPR or WIDEN_SUM_EXPR
operations in the standard vector modes; however, it has a vectorized
WIDEN_MULT_PLUS_EXPR implementation with a double-vector output, which
works just as well, with a little help from the compiler - as
implemented in these patches.

Bootstrapped and regtested on i686-pc-linux-gnu.
2018-11-07  Joern Rennecke  <joern.rennecke@riscy-ip.com>

	* tree-cfg.c (verify_gimple_assign_ternary):
	Allow vector arguments to WIDEN_MULT_{PLUS,MINUS}_EXPR.
	* tree-vect-data-refs.c (vect_get_smallest_scalar_type):
	Treat WIDEN_MULT_PLUS_EXPR like WIDEN_SUM_EXPR.
	* tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
	Get VECTYPE from STMT_VINFO_VECTYPE.
	(vectorizable_reduction): Fix result vector type for
	WIDEN_MULT_PLUS_EXPR.
	Use optab_handler or convert_optab_handler as needed.
	* tree-vect-patterns.c (vect_supportable_widen_optab_p): New function.
	(vect_recog_dot_prod_pattern): If DOT_PROD_EXPR can't be expanded
	directly, try to use WIDEN_MULT_PLUS_EXPR instead.
	(vect_recog_widen_sum_pattern): If no WIDEN_SUM pattern is available,
	try WIDEN_MULT_PLUS.
	* tree-vect-stmts.c (vect_get_vector_types_for_stmt):
	Allow vector size input/output mismatch for reduction.
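For context (an editorial sketch, not part of the patch), this is the shape of widening reduction loop these patterns recognize: narrow 8-bit operands multiplied into a wide 32-bit accumulator, which the vectorizer matches as DOT_PROD_EXPR, or, with these patches, as WIDEN_MULT_PLUS_EXPR when no dot_prod optab exists:

```c
#include <stddef.h>

/* Widening dot-product reduction: narrow (8-bit) inputs, wide (32-bit)
   accumulator.  The vectorizer's pattern recognizer matches this shape;
   on targets without a dot_prod optab, the patch retries the match with
   WIDEN_MULT_PLUS_EXPR.  */
int
dot_prod (const signed char *a, const signed char *b, size_t n)
{
  int sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}
```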

Comments

Richard Biener Nov. 12, 2018, 2:30 p.m. | #1
On Sun, Nov 11, 2018 at 8:21 AM Joern Wolfgang Rennecke
<joern.rennecke@riscy-ip.com> wrote:
>
> Our target (eSi-RISC) doesn't have DOT_PROD_EXPR or WIDEN_SUM_EXPR
> operations in the standard vector modes; however, it has a vectorized
> WIDEN_MULT_PLUS_EXPR implementation with a double-vector output, which
> works just as well, with a little help from the compiler - as
> implemented in these patches.


I guess I already asked this question when WIDEN_MULT_PLUS_EXPR was
introduced - but isn't that fully contained within a DOT_PROD_EXPR?

Some comments on the patch.

+  tree vecotype
+    = build_vector_type (otype, GET_MODE_NUNITS (TYPE_MODE (vecitype)));

TYPE_VECTOR_SUBPARTS (vecitype)

You want to pass in the half/full types and use get_vectype_for_scalar_type
which also makes sure the target supports the vector type.

I think you want to extend and re-use supportable_widening_operation
here anyways.

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c       (revision 266008)
+++ gcc/tree-vect-stmts.c       (working copy)
@@ -10638,7 +10638,11 @@ vect_get_vector_types_for_stmt (stmt_vec
                                   scalar_type);

   if (maybe_ne (GET_MODE_SIZE (TYPE_MODE (vectype)),
-               GET_MODE_SIZE (TYPE_MODE (nunits_vectype))))
+               GET_MODE_SIZE (TYPE_MODE (nunits_vectype)))
+      /* Reductions that use a widening reduction would show
+        a mismatch but that's already been checked to be OK.  */
+      && STMT_VINFO_DEF_TYPE (stmt_info) != vect_reduction_def)
+
     return opt_result::failure_at (stmt,
                                   "not vectorized: different sized vector "
                                   "types in statement, %T and %T\n",

that change doesn't look good.

> Bootstrapped and regtested on i686-pc-linux-gnu.
Joern Wolfgang Rennecke Nov. 13, 2018, 11:09 p.m. | #2
On 12/11/18 14:30, Richard Biener wrote:
> I guess I already asked this question when WIDEN_MULT_PLUS_EXPR was
> introduced - but isn't that fully contained within a DOT_PROD_EXPR?

I'm not sure what exactly you mean here.
A mailing list search to find that post was unsuccessful.  The earliest
it found for widen_sum_expr and richard and (biener or guenther) was
from October 2012, and WIDEN_SUM_EXPR already existed then, as was
visible in the context of the patch under discussion.

From the perspective of expressing scalar operations, or expressing
reduction operations on (originally) scalars in the vectorizer,
WIDEN_SUM_EXPR is redundant, as we could use DOT_PROD_EXPR with one
constant input.
For describing the vectorization of narrow vector operations,
WIDEN_SUM_EXPR would be more suitable, as we wouldn't need to bring in
matrix operations, but the current non-widening semantics suffer from
being insufficiently precisely defined in what they are actually
summing up.

From the perspective of providing optabs, there are two aspects to
consider: First, if a dot_prod optab were to be used for expanding
WIDEN_SUM_EXPR (by that or any other name), a target that implements
WIDEN_SUM but not DOT_PROD in general would have to have a dot_prod
expander that requires a vector of constant ones for one of its
operands.  And then the vectorizer, when it wants to consider
DOT_PROD_EXPR for two vector variables, would have to test the operand
predicates of the expander.  That sounds like a lot more trouble than
the little reduction in the set of optabs is worth.
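To illustrate the redundancy argument with scalar semantics (an editorial sketch, not code from the thread): a widening sum is exactly a dot product whose second operand is all ones:

```c
/* Scalar semantics sketch: WIDEN_SUM_EXPR sums narrow elements into a
   wider accumulator; DOT_PROD_EXPR with a constant all-ones second
   operand computes the same value.  */
int
widen_sum (const short *a, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    sum += a[i];                /* WIDEN_SUM_EXPR shape */
  return sum;
}

int
dot_prod_ones (const short *a, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    sum += a[i] * 1;            /* DOT_PROD_EXPR with an all-ones operand */
  return sum;
}
```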

>
> Some comments on the patch.
>
> +  tree vecotype
> +    = build_vector_type (otype, GET_MODE_NUNITS (TYPE_MODE (vecitype)));
>
> TYPE_VECTOR_SUBPARTS (vecitype)
>
> You want to pass in the half/full types and use get_vectype_for_scalar_type
> which also makes sure the target supports the vector type.

WIDEN_MULT_PLUS is special on our target in that it creates double-sized
vectors.  To get a full-sized vector in the reduction loop, you have to
have a double-sized vector result.  We already make the reduction
special in that the result can vary in size: it's scalar for an ordinary
DOT_PROD_EXPR, while a vector for other operations.

>
> I think you want to extend and re-use supportable_widening_operation
> here anyways.

I see.  Other targets generally use even/odd or lo/hi vectorized
widening mult operations.  ia64/vect.md, s390/vector.md, spu/spu.md and
tilegx/tilegx.md have mult_lo / mult_even patterns, but no dot_prod
pattern, so if I go via an enhanced supportable_widening_operation, my
patch should become relevant to these platforms.

To make it cover our target as well, I'll have to make it able, if the
optabs indicate that, to specify a single WIDEN_MULT_PLUS_EXPR with a
double-sized vector result instead of a split into two halves.


>
> Index: gcc/tree-vect-stmts.c
> ===================================================================
> --- gcc/tree-vect-stmts.c       (revision 266008)
> +++ gcc/tree-vect-stmts.c       (working copy)
> @@ -10638,7 +10638,11 @@ vect_get_vector_types_for_stmt (stmt_vec
>                                     scalar_type);
>
>     if (maybe_ne (GET_MODE_SIZE (TYPE_MODE (vectype)),
> -               GET_MODE_SIZE (TYPE_MODE (nunits_vectype))))
> +               GET_MODE_SIZE (TYPE_MODE (nunits_vectype)))
> +      /* Reductions that use a widening reduction would show
> +        a mismatch but that's already been checked to be OK.  */
> +      && STMT_VINFO_DEF_TYPE (stmt_info) != vect_reduction_def)
> +
>       return opt_result::failure_at (stmt,
>                                     "not vectorized: different sized vector "
>                                     "types in statement, %T and %T\n",
>
> that change doesn't look good.


Would it become acceptable if I made it specific to WIDEN_MULT_PLUS_EXPR ?
Richard Biener Nov. 14, 2018, 9:53 a.m. | #3
On Wed, Nov 14, 2018 at 12:09 AM Joern Wolfgang Rennecke
<joern.rennecke@riscy-ip.com> wrote:
>
>
> On 12/11/18 14:30, Richard Biener wrote:
> > I guess I already asked this question when WIDEN_MULT_PLUS_EXPR was
> > introduced - but isn't that fully contained within a DOT_PROD_EXPR?
>
> I'm not sure what exactly you mean here.
> A mailing list search to find that post was unsuccessful.  The earliest
> it found for widen_sum_expr and richard and (biener or guenther) was
> from October 2012, and WIDEN_SUM_EXPR already existed then, as was
> visible in the context of the patch under discussion.
>
> From the perspective of expressing scalar operations, or expressing
> reduction operations on (originally) scalars in the vectorizer,
> WIDEN_SUM_EXPR is redundant, as we could use DOT_PROD_EXPR with one
> constant input.
> For describing the vectorization of narrow vector operations,
> WIDEN_SUM_EXPR would be more suitable, as we wouldn't need to bring in
> matrix operations, but the current non-widening semantics suffer from
> being insufficiently precisely defined in what they are actually
> summing up.
>
> From the perspective of providing optabs, there are two aspects to
> consider: First, if a dot_prod optab were to be used for expanding
> WIDEN_SUM_EXPR (by that or any other name), a target that implements
> WIDEN_SUM but not DOT_PROD in general would have to have a dot_prod
> expander that requires a vector of constant ones for one of its
> operands.  And then the vectorizer, when it wants to consider
> DOT_PROD_EXPR for two vector variables, would have to test the operand
> predicates of the expander.  That sounds like a lot more trouble than
> the little reduction in the set of optabs is worth.
>
> >
> > Some comments on the patch.
> >
> > +  tree vecotype
> > +    = build_vector_type (otype, GET_MODE_NUNITS (TYPE_MODE (vecitype)));
> >
> > TYPE_VECTOR_SUBPARTS (vecitype)
> >
> > You want to pass in the half/full types and use get_vectype_for_scalar_type
> > which also makes sure the target supports the vector type.
>
> WIDEN_MULT_PLUS is special on our target in that it creates double-sized
> vectors.


Are there really double-size vectors or does the target simply produce
the output in two vectors?  Usually targets have WIDEN_MULT_PLUS_LOW/HIGH
or _EVEN/ODD split operations.  Or, like - what I now remember - for the
DOT_PROD_EXPR optab, the target already reduces element pairs of the
result vector (unspecified which ones) so the result vector is of the same
size as the inputs.

That is, if your target produces two vectors you may instead want to hide
that fact by claiming you support DOT_PROD_EXPR and expanding
it to the widen-mult-plus plus reducing (adding) the two result vectors to
get a single one.

The vectorizer cannot really deal with multiple sizes, thus for example
a V4SI * V4SI + V4DI operation and that all those tree codes are exposed
as "scalar" is sth that continues to confuse me but is mainly done because
at pattern recognition time there's only the scalars.
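A sketch of the pairwise in-vector reduction Richard describes (modeled on x86 pmaddwd-like behavior; the lane pairing here is illustrative, since which element pairs get reduced is unspecified):

```c
/* DOT_PROD_EXPR-style operation: two V8HI inputs produce a V4SI result
   of the same byte size, because each output lane already sums a pair
   of adjacent widened products.  */
void
dot_prod_step (const short a[8], const short b[8], int out[4])
{
  for (int i = 0; i < 4; i++)
    out[i] = a[2 * i] * b[2 * i] + a[2 * i + 1] * b[2 * i + 1];
}
```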

>  To get a full-sized vector in the reduction loop, you have to have
> a double-sized vector result.  We already make the reduction special in
> that the result can vary in size: it's scalar for an ordinary
> DOT_PROD_EXPR, while a vector for other operations.
>
> > I think you want to extend and re-use supportable_widening_operation
> > here anyways.
>
> I see.  Other targets generally use even/odd or lo/hi vectorized
> widening mult operations.  ia64/vect.md, s390/vector.md, spu/spu.md and
> tilegx/tilegx.md have mult_lo / mult_even patterns, but no dot_prod
> pattern, so if I go via an enhanced supportable_widening_operation, my
> patch should become relevant to these platforms.
>
> To make it cover our target as well, I'll have to make it able, if the
> optabs indicate that, to specify a single WIDEN_MULT_PLUS_EXPR with a
> double-sized vector result instead of a split into two halves.


There doesn't seem to be any target providing vector variants for any of
the [us][us]madd/sub optabs so I believe those were added for scalar
operations only.

> grep [us\"][us]madd config/*/*.md

config/arc/arc.md:(define_expand "umaddsidi4"
config/arc/arc.md:(define_insn_and_split "umaddsidi4_split"
config/arm/arm.md:(define_expand "umaddsidi4"
config/avr/avr.md:;; "*usmaddqihi4"
config/avr/avr.md:;; "*sumaddqihi4"
config/avr/avr.md:;; "umaddqihi4.uconst"
config/avr/avr.md:(define_insn_and_split "*sumaddqihi4.uconst"
config/avr/avr.md:   ; *sumaddqihi4
config/csky/csky_insn_dsp.md:(define_insn "umaddsidi4"
config/mips/mips-fixed.md:(define_insn "ssmaddsqdq4"
config/nds32/nds32-dspext.md:(define_insn "smaddhidi"
config/nds32/nds32-dspext.md:(define_insn "smaddhidi2"
config/tilegx/tilegx.md:(define_insn "umaddsidi4"
config/tilepro/tilepro.md:(define_insn "umaddhisi4"

> grep [us\"][us]msub config/*/*.md

config/avr/avr.md:;; "*usmsubqihi4"
config/avr/avr.md:;; "*sumsubqihi4"
config/avr/avr.md:(define_insn_and_split "*sumsubqihi4.uconst"
config/avr/avr.md:   ; *sumsubqihi4
config/csky/csky_insn_dsp.md:(define_insn "umsubsidi4"
config/mips/mips-fixed.md:(define_insn "ssmsubsqdq4"

For vectorization I would advise to provide expansion patterns
for codes that are already supported, in your case DOT_PROD_EXPR.

Richard.

>
> >
> > Index: gcc/tree-vect-stmts.c
> > ===================================================================
> > --- gcc/tree-vect-stmts.c       (revision 266008)
> > +++ gcc/tree-vect-stmts.c       (working copy)
> > @@ -10638,7 +10638,11 @@ vect_get_vector_types_for_stmt (stmt_vec
> >                                     scalar_type);
> >
> >     if (maybe_ne (GET_MODE_SIZE (TYPE_MODE (vectype)),
> > -               GET_MODE_SIZE (TYPE_MODE (nunits_vectype))))
> > +               GET_MODE_SIZE (TYPE_MODE (nunits_vectype)))
> > +      /* Reductions that use a widening reduction would show
> > +        a mismatch but that's already been checked to be OK.  */
> > +      && STMT_VINFO_DEF_TYPE (stmt_info) != vect_reduction_def)
> > +
> >       return opt_result::failure_at (stmt,
> >                                     "not vectorized: different sized vector "
> >                                     "types in statement, %T and %T\n",
> >
> > that change doesn't look good.
>
> Would it become acceptable if I made it specific to WIDEN_MULT_PLUS_EXPR ?
>
Joern Wolfgang Rennecke Nov. 14, 2018, 3:21 p.m. | #4
On 14/11/18 09:53, Richard Biener wrote:
>> WIDEN_MULT_PLUS is special on our target in that it creates double-sized
>> vectors.
> Are there really double-size vectors or does the target simply produce
> the output in two vectors?  Usually targets have WIDEN_MULT_PLUS_LOW/HIGH
> or _EVEN/ODD split operations.  Or, like - what I now remember - for the
> DOT_PROD_EXPR optab, the target already reduces element pairs of the
> result vector (unspecified which ones) so the result vector is of the
> same size as the inputs.

The output of widening multiply and widening multiply-add is stored
in two consecutive registers.  So, they can be used as separate
vectors, but you can't choose the register numbers independently.
OTOH, you can treat them together as a double-sized vector, but
without any extra alignment requirements over a single-sized vector.
>
> That is, if your target produces two vectors you may instead want to hide
> that fact by claiming you support DOT_PROD_EXPR and expanding
> it to the widen-mult-plus plus reducing (adding) the two result vectors to
> get a single one.

Doing a part of the reduction in the loop is a bit pointless.

I have tried another approach: to advertise the WIDEN_MULT_PLUS
and WIDEN_MULT operations as LO/HI part operations of the double
vector size, and also add fake double-vector patterns for move, widening
and add (they get expanded or split to single-vector patterns).
That seems to work for the dot product; it's like the code is unrolled
by a factor of two.  There are a few drawbacks though:
- the tree optimizer creates separate WIDEN_MULT and PLUS expressions,
and it is left to the combiner to clean that up.  That combination and
register allocation might be a bit fragile.
- if the input isn't known to be aligned to the doubled vector size, a
run-time check is inserted to use an unvectorized loop if there is no
excess alignment.
- auto-increment for the loads is lost.  I can probably fix this by
keeping double-sized loads around for longer or with some
special-purpose pass, but both might have some other drawbacks.  But
there's actually a configuration option for an instruction to load
multiple vector registers with register-indirect or auto-increment, so
there is some merit to having a pattern for it.
- the code size is larger.
- vectorization will fail if any other code is mixed in for which no
double-vector patterns are provided.
- this approach uses SUBREGs in ways that are not safe according to the
documentation.
But then, other ports like i386 and aarch64 little-endian do that too.
I think it is now (since we have SUBREG_BYTE) safe to have subregs of
registers with hard reg sizes larger than UNITS_PER_WORD, as long as you
refer to entire hard registers.  Maybe we could change the
documentation?
AFAICT, there are also only four places that need to be patched to make
a lowpart access with a SUBREG of such a hard register safe.  I'm trying
this at the moment; it was just a few hours late for the phase 1->3
deadline.

I suppose for WIDEN_SUM_EXPR, I'd have to have one double-vector-sized
pattern that adds the products of the two input vectors into the double
output vector, and leave the rtl loop optimizer to get the constant pool
load of the all-ones vector out of the loop.  But again, there'll be
issues with excess alignment requirements and code size.
>
> The vectorizer cannot really deal with multiple sizes, thus for example
> a V4SI * V4SI + V4DI operation and that all those tree codes are exposed
> as "scalar" is sth that continues to confuse me but is mainly done because
> at pattern recognition time there's only the scalars.

Well, the vectorizer makes an exception for reductions, as it'll allow
maintaining either a vector or a scalar during the loop, so why not
allow other sizes for that value as well?  It's all hidden in the final
reduction emitted by the epilogue.
> For vectorization I would advise to provide expansion patterns for
> codes that are already supported, in your case DOT_PROD_EXPR.

With vector size doubling, it seems to work better with LO/HI multiply
and PLUS (and let the combiner take the strain).
Without... for a straight expansion, there is little point.  The
previous sum is in one register, the multiply results are spread over
two registers, and DOT_PROD_EXPR is supposed to yield a scalar.  Even
with a reduction instruction to sum up two registers, you need another
instruction to add up all three, so a minimum of three instructions.
LO/HI multiply can be fudged by doing a full multiply and picking half
the result, and cse should reduce that to one multiply.  Again, two
adds are needed, because the reduction variable is too narrow to use
widening multiply-add.
There may be some merit to DOT_PROD_EXPR if I make it do something
strange.  But there's no easy way to use a special purpose mode, since
there's no matching reduction pattern for a DOT_PROD_EXPR, and the
reduction for a WIDEN_SUM_EXPR is not readily distinguishable from the
one for a non-widening summation with the same output vector mode.
I could use a special kind of hard register that's really another view
of a group of vector registers and which is reserved for this purpose
unless eliminated; the elimination is blocked when there is a statement
that uses these registers because the expander for the DOT_PROD_EXPR /
WIDEN_SUM_EXPR sticks the actually used hard registers somewhere, and
if the special 'hard reg' can't be obtained, another, more expensive
pattern (suitably indicated in the constraints) is used... but that's a
lot of hair.
It's probably easier to write a special-purpose ssa pass to patch up the
type of the reduction variable, and insert that pass to run after the
vectorizer: widen the variable when entering the loop, reduce it when
exiting; if the loop is not understood, a more expensive pattern with
standard reduction variable width is used.
In which case, the value of DOT_PROD_EXPR / WIDEN_SUM_EXPR is that they
are somewhat special and thus stick out (or in other words, you can take
a bit of time to verify you got something interesting when you find
them).
Richard Biener Nov. 15, 2018, 12:17 p.m. | #5
On Wed, Nov 14, 2018 at 4:21 PM Joern Wolfgang Rennecke
<joern.rennecke@riscy-ip.com> wrote:
>
>
> On 14/11/18 09:53, Richard Biener wrote:
> >> WIDEN_MULT_PLUS is special on our target in that it creates double-sized
> >> vectors.
> > Are there really double-size vectors or does the target simply produce
> > the output in two vectors?  Usually targets have WIDEN_MULT_PLUS_LOW/HIGH
> > or _EVEN/ODD split operations.  Or, like - what I now remember - for the
> > DOT_PROD_EXPR optab, the target already reduces element pairs of the
> > result vector (unspecified which ones) so the result vector is of the
> > same size as the inputs.
> The output of widening multiply and widening multiply-add is stored
> in two consecutive registers.  So, they can be used as separate
> vectors, but you can't choose the register numbers independently.
> OTOH, you can treat them together as a double-sized vector, but
> without any extra alignment requirements over a single-sized vector.
> >
> > That is, if your target produces two vectors you may instead want to hide
> > that fact by claiming you support DOT_PROD_EXPR and expanding
> > it to the widen-mult-plus plus reducing (adding) the two result vectors to
> > get a single one.
>
> Doing a part of the reduction in the loop is a bit pointless.


The advantage is you can work with lower vectorization factor and thus
less unrolling (smaller code).

> I have tried another approach: to advertise the WIDEN_MULT_PLUS
> and WIDEN_MULT operations as LO/HI part operations of the double
> vector size, and also add fake double-vector patterns for move, widening
> and add (they get expanded or split to single-vector patterns).
> That seems to work for the dot product; it's like the code is unrolled
> by a factor of two.


Yeah, that would work as well.  As said, the vectorizer cannot really
handle the doubled vector size (means: you sooner or later will run into
issues).

>  There are a few drawbacks though:
> - the tree optimizer creates separate WIDEN_MULT and PLUS expressions,
> and it is left to the combiner to clean that up.  That combination and
> register allocation might be a bit fragile.
> - if the input isn't known to be aligned to the doubled vector size, a
> run-time check is inserted to use an unvectorized loop if there is no
> excess alignment.
> - auto-increment for the loads is lost.  I can probably fix this by
> keeping double-sized loads around for longer or with some
> special-purpose pass, but both might have some other drawbacks.  But
> there's actually a configuration option for an instruction to load
> multiple vector registers with register-indirect or auto-increment, so
> there is some merit to having a pattern for it.
> - the code size is larger.
> - vectorization will fail if any other code is mixed in for which no
> double-vector patterns are provided.
> - this approach uses SUBREGs in ways that are not safe according to the
> documentation.
> But then, other ports like i386 and aarch64 little-endian do that too.


I think they do it if they really have such instructions in the ISA
(they also have those that do the in-loop reduction to half of the result
vector size -- DOT_PROD_EXPR).

> I think it is now (since we have SUBREG_BYTE) safe to have subregs of
> registers with hard reg sizes larger than UNITS_PER_WORD, as long as
> you refer to entire hard registers.  Maybe we could change the
> documentation?
> AFAICT, there are also only four places that need to be patched to make
> a lowpart access with a SUBREG of such a hard register safe.  I'm trying
> this at the moment; it was just a few hours late for the phase 1->3
> deadline.
>
> I suppose for WIDEN_SUM_EXPR, I'd have to have one double-vector-sized
> pattern that adds the products of the two input vectors into the double
> output vector, and leave the rtl loop optimizer to get the constant
> pool load of the all-ones vector out of the loop.  But again, there'll
> be issues with excess alignment requirements and code size.


I think going the DOT_PROD_EXPR way is a lot easier.  You simply
expand the additional (in-loop) sum.  The only drawback I see is that
this might be slower code.

So yes, the _LO/_HI way maps better to hardware, but you rely on
CSE to remove the redundant instruction if you implement _LO/_HI
as doing the full operation and just taking one of the result vectors.

> >
> > The vectorizer cannot really deal with multiple sizes, thus for example
> > a V4SI * V4SI + V4DI operation and that all those tree codes are exposed
> > as "scalar" is sth that continues to confuse me but is mainly done because
> > at pattern recognition time there's only the scalars.
> Well, the vectorizer makes an exception for reductions as it'll allow
> maintaining either a vector or a scalar during the loop, so why not
> allow other sizes for that value as well?


It's not implemented ;)

>  It's all hidden in the final reduction emitted by the epilogue.
> > For vectorization I would advise to provide expansion patterns for
> > codes that are already supported, in your case DOT_PROD_EXPR.
> With vector size doubling, it seems to work better with LO/HI multiply
> and PLUS (and let the combiner take the strain).
> Without... for a straight expansion, there is little point.  The
> previous sum is in one register, the multiply results are spread over
> two registers, and DOT_PROD_EXPR is supposed to yield a scalar.  Even
> with a reduction instruction to sum up two registers, you need another
> instruction to add up all three, so a minimum of three instructions.


No, DOT_PROD_EXPR yields a vector of the same size as the inputs.
That means it has to reduce the N element result vector to a M element
one to match that constraint.  For example on x86 pmaddw is an
instruction that does this.

That is, the overhead for you is doing a single vector add to combine
the two vector results to one.
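As an editorial sketch of that suggestion (the lane layout of the register pair is hypothetical): expand DOT_PROD_EXPR as the double-width widening multiply followed by one vector add that folds the two result halves back into a single accumulator vector:

```c
/* Hide the double-width result behind DOT_PROD_EXPR: the widening
   multiply fills two result "vectors" (a register pair on the target),
   and a single vector add folds them into one accumulator vector.  */
void
dot_prod_via_wmp (const short a[8], const short b[8], int acc[4])
{
  int lo[4], hi[4];
  for (int i = 0; i < 4; i++)      /* widening multiply, paired results */
    {
      lo[i] = a[i] * b[i];
      hi[i] = a[i + 4] * b[i + 4];
    }
  for (int i = 0; i < 4; i++)      /* the one extra in-loop vector add */
    acc[i] += lo[i] + hi[i];
}
```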

> LO/HI multiply can be fudged by doing a full multiply and picking half
> the result, and cse should reduce that to one multiply.  Again, two
> adds are needed, because the reduction variable is too narrow to use
> widening multiply-add.
> There may be some merit to DOT_PROD_EXPR if I make it do something
> strange.  But there's no easy way to use a special purpose mode, since
> there's no matching reduction pattern for a DOT_PROD_EXPR, and the
> reduction for a WIDEN_SUM_EXPR is not readily distinguishable from the
> one for a non-widening summation with the same output vector mode.
> I could use a special kind of hard register that's really another view
> of a group of vector registers and which is reserved for this purpose
> unless eliminated; the elimination is blocked when there is a statement
> that uses these registers because the expander for the DOT_PROD_EXPR /
> WIDEN_SUM_EXPR sticks the actually used hard registers somewhere, and
> if the special 'hard reg' can't be obtained, another, more expensive
> pattern (suitably indicated in the constraints) is used... but that's
> a lot of hair.
> It's probably easier to write a special-purpose ssa pass to patch up
> the type of the reduction variable, and insert that pass to run after
> the vectorizer: widen the variable when entering the loop, reduce it
> when exiting; if the loop is not understood, a more expensive pattern
> with standard reduction variable width is used.
> In which case, the value of DOT_PROD_EXPR / WIDEN_SUM_EXPR is that
> they are somewhat special and thus stick out (or in other words, you
> can take a bit of time to verify you got something interesting when
> you find them).
>

Patch

Index: gcc/tree-cfg.c
===================================================================
--- gcc/tree-cfg.c	(revision 266008)
+++ gcc/tree-cfg.c	(working copy)
@@ -4162,6 +4162,22 @@  verify_gimple_assign_ternary (gassign *s
     {
     case WIDEN_MULT_PLUS_EXPR:
     case WIDEN_MULT_MINUS_EXPR:
+      if (VECTOR_TYPE_P (lhs_type)
+	  && VECTOR_TYPE_P (rhs1_type)
+	  && VECTOR_TYPE_P (rhs2_type)
+	  && VECTOR_TYPE_P (rhs3_type)
+	  && !maybe_ne (TYPE_VECTOR_SUBPARTS (lhs_type),
+			TYPE_VECTOR_SUBPARTS (rhs1_type))
+	  && !maybe_ne (TYPE_VECTOR_SUBPARTS (lhs_type),
+			TYPE_VECTOR_SUBPARTS (rhs2_type))
+	  && !maybe_ne (TYPE_VECTOR_SUBPARTS (lhs_type),
+			TYPE_VECTOR_SUBPARTS (rhs3_type)))
+	{
+	  lhs_type = TREE_TYPE (lhs_type);
+	  rhs1_type = TREE_TYPE (rhs1_type);
+	  rhs2_type = TREE_TYPE (rhs2_type);
+	  rhs3_type = TREE_TYPE (rhs3_type);
+	}
       if ((!INTEGRAL_TYPE_P (rhs1_type)
 	   && !FIXED_POINT_TYPE_P (rhs1_type))
 	  || !useless_type_conversion_p (rhs1_type, rhs2_type)
Index: gcc/tree-vect-data-refs.c
===================================================================
--- gcc/tree-vect-data-refs.c	(revision 266008)
+++ gcc/tree-vect-data-refs.c	(working copy)
@@ -135,6 +135,7 @@  vect_get_smallest_scalar_type (stmt_vec_
       && (gimple_assign_cast_p (assign)
 	  || gimple_assign_rhs_code (assign) == DOT_PROD_EXPR
 	  || gimple_assign_rhs_code (assign) == WIDEN_SUM_EXPR
+	  || gimple_assign_rhs_code (assign) == WIDEN_MULT_PLUS_EXPR
 	  || gimple_assign_rhs_code (assign) == WIDEN_MULT_EXPR
 	  || gimple_assign_rhs_code (assign) == WIDEN_LSHIFT_EXPR
 	  || gimple_assign_rhs_code (assign) == FLOAT_EXPR))
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	(revision 266008)
+++ gcc/tree-vect-loop.c	(working copy)
@@ -3978,7 +3978,7 @@  get_initial_def_for_reduction (stmt_vec_
   loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   tree scalar_type = TREE_TYPE (init_val);
-  tree vectype = get_vectype_for_scalar_type (scalar_type);
+  tree vectype = STMT_VINFO_VECTYPE (stmt_vinfo);
   enum tree_code code = gimple_assign_rhs_code (stmt_vinfo->stmt);
   tree def_for_init;
   tree init_def;
@@ -4001,6 +4001,7 @@  get_initial_def_for_reduction (stmt_vec_
     {
     case WIDEN_SUM_EXPR:
     case DOT_PROD_EXPR:
+    case WIDEN_MULT_PLUS_EXPR:
     case SAD_EXPR:
     case PLUS_EXPR:
     case MINUS_EXPR:
@@ -6043,6 +6044,21 @@  vectorizable_reduction (stmt_vec_info st
 	    slp_node_instance->reduc_phis = slp_node;
 
 	  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
+
+	  stmt_vec_info reduc = STMT_VINFO_REDUC_DEF (stmt_info);
+	  if (reduc
+	      && (reduc = STMT_VINFO_RELATED_STMT (reduc))
+	      && gimple_assign_rhs_code (reduc->stmt) == WIDEN_MULT_PLUS_EXPR)
+	    {
+	      poly_uint64 nelem
+		= (TYPE_VECTOR_SUBPARTS
+		    (get_vectype_for_scalar_type
+		      (TREE_TYPE (gimple_assign_rhs1 (reduc->stmt)))));
+	      STMT_VINFO_VECTYPE (stmt_info)
+		= (build_vector_type
+		    (TREE_TYPE (gimple_assign_lhs (reduc->stmt)), nelem));
+	    }
+
 	  return true;
 	}
 
@@ -6529,7 +6545,13 @@  vectorizable_reduction (stmt_vec_info st
           return false;
         }
 
-      if (optab_handler (optab, vec_mode) == CODE_FOR_nothing)
+      enum insn_code icode;
+      if (convert_optab_p (optab))
+	icode
+	  = convert_optab_handler (optab, TYPE_MODE (vectype_out), vec_mode);
+      else
+	icode = optab_handler (optab, vec_mode);
+      if (icode == CODE_FOR_nothing)
         {
           if (dump_enabled_p ())
             dump_printf (MSG_NOTE, "op not supported by target.\n");
Index: gcc/tree-vect-patterns.c
===================================================================
--- gcc/tree-vect-patterns.c	(revision 266008)
+++ gcc/tree-vect-patterns.c	(working copy)
@@ -212,6 +212,49 @@  vect_supportable_direct_optab_p (tree ot
   return true;
 }
 
+/* Likewise for widening operations, like WIDEN_MULT_PLUS_EXPR:
+   Return true if the target supports a vector version of CODE,
+   where CODE is known to map to a 'conversion' optab.  ITYPE specifies
+   the type of (some of) the (narrower) scalar inputs and OTYPE specifies
+   the type of the scalar result.
+
+   The number of elements of the input vector is the same as the number
+   of elements of the output vector, and operand 0 of the target pattern
+   must match the latter.
+
+   When returning true, set *VECOTYPE_OUT to the vector version of OTYPE.
+   Also set *VECITYPE_OUT to the vector version of ITYPE if VECITYPE_OUT
+   is nonnull.  */
+static bool
+vect_supportable_widen_optab_p (tree otype, tree_code code,
+				 tree itype, tree *vecotype_out,
+				 tree *vecitype_out = NULL)
+{
+  tree vecitype = get_vectype_for_scalar_type (itype);
+  if (!vecitype)
+    return false;
+
+  tree vecotype
+    = build_vector_type (otype, GET_MODE_NUNITS (TYPE_MODE (vecitype)));
+  if (!vecotype)
+    return false;
+
+  optab optab = optab_for_tree_code (code, vecitype, optab_default);
+  if (!optab)
+    return false;
+
+  insn_code icode
+    = convert_optab_handler (optab, TYPE_MODE (vecotype), TYPE_MODE (vecitype));
+  if (icode == CODE_FOR_nothing
+      || insn_data[icode].operand[0].mode != TYPE_MODE (vecotype))
+    return false;
+
+  *vecotype_out = vecotype;
+  if (vecitype_out)
+    *vecitype_out = vecitype;
+  return true;
+}
+
 /* Round bit precision PRECISION up to a full element.  */
 
 static unsigned int
@@ -953,9 +996,17 @@  vect_recog_dot_prod_pattern (stmt_vec_in
   vect_pattern_detected ("vect_recog_dot_prod_pattern", last_stmt);
 
   tree half_vectype;
+  tree_code code = DOT_PROD_EXPR;
   if (!vect_supportable_direct_optab_p (type, DOT_PROD_EXPR, half_type,
 					type_out, &half_vectype))
-    return NULL;
+    {
+      /* We won't be able to expand DOT_PROD_EXPR with the current vector
+	 size, so try WIDEN_MULT_PLUS_EXPR instead.  */
+      code = WIDEN_MULT_PLUS_EXPR;
+      if (!vect_supportable_widen_optab_p (type, code, half_type,
+					   type_out, &half_vectype))
+        return NULL;
+    }
 
   /* Get the inputs in the appropriate types.  */
   tree mult_oprnd[2];
@@ -963,7 +1014,7 @@  vect_recog_dot_prod_pattern (stmt_vec_in
 		       unprom0, half_vectype);
 
   var = vect_recog_temp_ssa_var (type, NULL);
-  pattern_stmt = gimple_build_assign (var, DOT_PROD_EXPR,
+  pattern_stmt = gimple_build_assign (var, code,
 				      mult_oprnd[0], mult_oprnd[1], oprnd1);
 
   return pattern_stmt;
@@ -1440,12 +1491,27 @@  vect_recog_widen_sum_pattern (stmt_vec_i
 
   vect_pattern_detected ("vect_recog_widen_sum_pattern", last_stmt);
 
-  if (!vect_supportable_direct_optab_p (type, WIDEN_SUM_EXPR, unprom0.type,
-					type_out))
-    return NULL;
+  tree half_vectype;
 
-  var = vect_recog_temp_ssa_var (type, NULL);
-  pattern_stmt = gimple_build_assign (var, WIDEN_SUM_EXPR, unprom0.op, oprnd1);
+  if (vect_supportable_direct_optab_p
+       (type, WIDEN_SUM_EXPR, unprom0.type, type_out))
+    {
+      var = vect_recog_temp_ssa_var (type, NULL);
+      pattern_stmt
+	= gimple_build_assign (var, WIDEN_SUM_EXPR, unprom0.op, oprnd1);
+    }
+  /* If WIDEN_SUM_EXPR can't be expanded with the current vector size,
+     try WIDEN_MULT_PLUS_EXPR instead.  */
+  else if (vect_supportable_widen_optab_p (type, WIDEN_MULT_PLUS_EXPR,
+					   unprom0.type, type_out,
+					   &half_vectype))
+    {
+      var = vect_recog_temp_ssa_var (type, NULL);
+      tree one = build_int_cst (unprom0.type, 1);
+      pattern_stmt = gimple_build_assign (var, WIDEN_MULT_PLUS_EXPR,
+					  unprom0.op, one, oprnd1);
+    }
+  else
+    return NULL;
 
   return pattern_stmt;
 }
Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	(revision 266008)
+++ gcc/tree-vect-stmts.c	(working copy)
@@ -10638,7 +10638,10 @@  vect_get_vector_types_for_stmt (stmt_vec
 				   scalar_type);
 
   if (maybe_ne (GET_MODE_SIZE (TYPE_MODE (vectype)),
-		GET_MODE_SIZE (TYPE_MODE (nunits_vectype))))
+		GET_MODE_SIZE (TYPE_MODE (nunits_vectype)))
+      /* Reductions that use a widening operation would show
+	 a mismatch, but that has already been checked to be OK.  */
+      && STMT_VINFO_DEF_TYPE (stmt_info) != vect_reduction_def)
     return opt_result::failure_at (stmt,
 				   "not vectorized: different sized vector "
 				   "types in statement, %T and %T\n",