[PR,81616] Deferring FMA transformations in tight loops

Message ID ri6mv2knkeg.fsf@suse.cz
State New
Headers show
Series
  • [PR,81616] Deferring FMA transformations in tight loops
Related show

Commit Message

Martin Jambor Dec. 15, 2017, 2:19 p.m.
Hello,

the patch below prevents creation if fused-multiply-and-add instructions
in the widening_mul gimple pass on the Zen-based AMD CPUs and as a
result fixes regressions of native znver1 tuning when compared to
generic tuning in:

  - the matrix.c testcase of PR 81616 (straightforward matrix
    multiplication) at -O2 and -O3 which is currently 60% (!),

  - SPEC 2006 454.calculix at -O2, which is currently over 20%, and

  - SPEC 2017 510.parest at -O2 and -Ofast, which is currently also
    about 20% in both cases.

The basic idea is to detect loops in the following form:

    <bb 6>
    # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
    ...
    _65 = _14 * _16;
    accumulator_66 = _65 + accumulator_111;

and prevents from creating FMA for it.  Because at least in the parest
and calculix cases it has to, it also deals with more than one chain of
FMA candidates that feed the next one's addend:


    <bb 6>
    # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
    ...
    _65 = _14 * _16;
    accumulator_55 = _65 + accumulator_111;
    _65 = _24 * _36;
    accumulator_66 = _65 + accumulator_55;

Unfortunately, to really get rid of the calculix regression, the
algorithm cannot just look at one BB at a time but also has to work for
cases like the following:

     1  void mult(void)
     2  {
     3      int i, j, k, l;
     4  
     5     for(i=0; i<SIZE; ++i)
     6     {
     7        for(j=0; j<SIZE; ++j)
     8        {
     9           for(l=0; l<SIZE; l+=10)
    10           {
    11               c[i][j] += a[i][l] * b[k][l];
    12               for(k=1; k<10; ++k)
    13               {
    14                   c[i][j] += a[i][l+k] * b[k][l+k];
    15               }
    16  
    17           }
    18        }
    19     }
    20  }

where the FMA on line 14 feeds into the one on line 11 in an
encompassing loop.  Therefore I have changed the structure of the pass
to work in reverse dominance order and it keeps a hash set of results of
rejected FMAs candidates which it checks when looking at PHI nodes of
the current BB.  Without this reorganization, calculix was still 8%
slower with native tuning than with generic one.

When the deferring mechanism realizes that in the current BB, the FMA
candidates do not all form a one chain tight-loop like in the examples
above, it goes back to all the previously deferred candidates (in the
current BB only) and performs the transformation.

The main reason is to keep the patch conservative (and also simple), but
it also means that the following function is not affected and is still
20% slower when compiled with native march and tuning compared to the
generic one:

     1  void mult(struct s *p1, struct s *p2)
     2  {
     3     int i, j, k;
     4  
     5     for(i=0; i<SIZE; ++i)
     6     {
     7        for(j=0; j<SIZE; ++j)
     8        {
     9           for(k=0; k<SIZE; ++k)
    10           {
    11              p1->c[i][j] += p1->a[i][k] * p1->b[k][j];
    12              p2->c[i][j] += p2->a[i][k] * p2->b[k][j];
    13           }
    14        }
    15     }
    16  }

I suppose that the best optimization for the above would be to split the
loops, but one could probably construct at least an artificial testcase
where the FMAs would keep enough locality that it is not the case.  The
mechanism can be easily extended to keep track of not just one chain but
a few, preferably as a followup, if people think it makes sense.

An interesting observation is that the matrix multiplication does not
suffer the penalty when compiled with -O3 -mprefer-vector-width=256.
Apparently the 256 vector processing can hide the latency penalty when
internally it is split into two halves.  The same goes for 512 bit
vectors.  That is why the patch leaves those be - well, there is a param
for the threshold which is set to zero for everybody but znver1.  If
maintainers of any other architecture suspect that their FMAs might
suffer similar latency problem, they can easily try tweaking that
parameter and see what happens with the matrix multiplication example.

I have bootstrapped and tested the patch on x86_64-linux (as it is and
also with the param set to a 256 by default to make it trigger).  I have
also measured run-times of all benchmarks in SPEC 2006 FP and SPEC 2017
FPrate and the only changes are the big improvements of calculix and
parest.

After I address any comments and/or suggestions, would it be OK for
trunk?

Thanks,

Martin


2017-12-13  Martin Jambor  <mjambor@suse.cz>

	PR target/81616
	* params.def: New parameter PARAM_AVOID_FMA_MAX_BITS.
	* tree-ssa-math-opts.c: Include domwalk.h.
	(convert_mult_to_fma_1): New function.
	(fma_transformation_info): New type.
	(fma_deferring_state): Likewise.
	(cancel_fma_deferring): New function.
	(result_of_phi): Likewise.
	(last_fma_candidate_feeds_initial_phi): Likewise.
	(convert_mult_to_fma): Added deferring logic, split actual
	transformation to convert_mult_to_fma_1.
	(math_opts_dom_walker): New type.
	(math_opts_dom_walker::after_dom_children): New method, body moved
	here from pass_optimize_widening_mul::execute, added deferring logic
	bits.
	(pass_optimize_widening_mul::execute): Moved most of code to
	math_opts_dom_walker::after_dom_children.
	* config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS): New.
	* config/i386/i386.c (ix86_option_override_internal): Added
	maybe_setting of PARAM_AVOID_FMA_MAX_BITS.
---
 gcc/config/i386/i386.c       |   5 +
 gcc/config/i386/x86-tune.def |   4 +
 gcc/params.def               |   5 +
 gcc/tree-ssa-math-opts.c     | 521 ++++++++++++++++++++++++++++++++-----------
 4 files changed, 407 insertions(+), 128 deletions(-)

-- 
2.15.1

Comments

Richard Biener Dec. 18, 2017, 11:20 a.m. | #1
On Fri, Dec 15, 2017 at 3:19 PM, Martin Jambor <mjambor@suse.cz> wrote:
>

> Hello,

>

> the patch below prevents creation if fused-multiply-and-add instructions

> in the widening_mul gimple pass on the Zen-based AMD CPUs and as a

> result fixes regressions of native znver1 tuning when compared to

> generic tuning in:

>

>   - the matrix.c testcase of PR 81616 (straightforward matrix

>     multiplication) at -O2 and -O3 which is currently 60% (!),

>

>   - SPEC 2006 454.calculix at -O2, which is currently over 20%, and

>

>   - SPEC 2017 510.parest at -O2 and -Ofast, which is currently also

>     about 20% in both cases.

>

> The basic idea is to detect loops in the following form:

>

>     <bb 6>

>     # accumulator_111 = PHI <0.0(5), accumulator_66(6)>

>     ...

>     _65 = _14 * _16;

>     accumulator_66 = _65 + accumulator_111;

>

> and prevents from creating FMA for it.  Because at least in the parest

> and calculix cases it has to, it also deals with more than one chain of

> FMA candidates that feed the next one's addend:

>

>

>     <bb 6>

>     # accumulator_111 = PHI <0.0(5), accumulator_66(6)>

>     ...

>     _65 = _14 * _16;

>     accumulator_55 = _65 + accumulator_111;

>     _65 = _24 * _36;

>     accumulator_66 = _65 + accumulator_55;

>

> Unfortunately, to really get rid of the calculix regression, the

> algorithm cannot just look at one BB at a time but also has to work for

> cases like the following:

>

>      1  void mult(void)

>      2  {

>      3      int i, j, k, l;

>      4

>      5     for(i=0; i<SIZE; ++i)

>      6     {

>      7        for(j=0; j<SIZE; ++j)

>      8        {

>      9           for(l=0; l<SIZE; l+=10)

>     10           {

>     11               c[i][j] += a[i][l] * b[k][l];

>     12               for(k=1; k<10; ++k)

>     13               {

>     14                   c[i][j] += a[i][l+k] * b[k][l+k];

>     15               }

>     16

>     17           }

>     18        }

>     19     }

>     20  }

>

> where the FMA on line 14 feeds into the one on line 11 in an

> encompassing loop.  Therefore I have changed the structure of the pass

> to work in reverse dominance order and it keeps a hash set of results of

> rejected FMAs candidates which it checks when looking at PHI nodes of

> the current BB.  Without this reorganization, calculix was still 8%

> slower with native tuning than with generic one.

>

> When the deferring mechanism realizes that in the current BB, the FMA

> candidates do not all form a one chain tight-loop like in the examples

> above, it goes back to all the previously deferred candidates (in the

> current BB only) and performs the transformation.

>

> The main reason is to keep the patch conservative (and also simple), but

> it also means that the following function is not affected and is still

> 20% slower when compiled with native march and tuning compared to the

> generic one:

>

>      1  void mult(struct s *p1, struct s *p2)

>      2  {

>      3     int i, j, k;

>      4

>      5     for(i=0; i<SIZE; ++i)

>      6     {

>      7        for(j=0; j<SIZE; ++j)

>      8        {

>      9           for(k=0; k<SIZE; ++k)

>     10           {

>     11              p1->c[i][j] += p1->a[i][k] * p1->b[k][j];

>     12              p2->c[i][j] += p2->a[i][k] * p2->b[k][j];

>     13           }

>     14        }

>     15     }

>     16  }


So here we apply store-motion to p[12]->c[i][j] so we see a
reduction in SSA?  That is, you are really looking for
SSA SCCs that involve only FMA candidates?  There's a SCC
finding code in tree-ssa-sccvn.c, DFS (), but a straight-forward
use->def walk over the adds should find a SCC just involving the
adds?  That is, rather than doing all the defering stuff I wonder
if it is easier to simply gather the SCC if there is any and add
"disqualifications", like to each other member?  So when seeing

  x += a1 * b1;
  x += a2 * b2;
  x += a3 * b3;
...

we'd still generate a FMA for each 2nd statement?  Basically
do some scheduling of as many mults as needed to conver
the latency of the first FMA.

To go back to the Zen specific issue - so FMA has a higher overall
latency than mul + add (?) and (or?) if there's more than one mul being
accumulated then the 2nd mul can execute in parallel with the first
add (in case dependences allow).  I wonder if there's an easy way to
query the DFA for the latency of FMA vs. ADD and/or if the --param
(or target hook?) should rather tell us sth about this latency difference?

Zen IIRC can execute two FMA128 in parallel (thus one FMA256 per cycle).
That makes it an odd observation that avx256 can "fix" the issue.  WRT
the above suggestion the situation needs to be clarified.

> I suppose that the best optimization for the above would be to split the

> loops, but one could probably construct at least an artificial testcase

> where the FMAs would keep enough locality that it is not the case.  The

> mechanism can be easily extended to keep track of not just one chain but

> a few, preferably as a followup, if people think it makes sense.

>

> An interesting observation is that the matrix multiplication does not

> suffer the penalty when compiled with -O3 -mprefer-vector-width=256.

> Apparently the 256 vector processing can hide the latency penalty when

> internally it is split into two halves.  The same goes for 512 bit

> vectors.  That is why the patch leaves those be - well, there is a param

> for the threshold which is set to zero for everybody but znver1.  If

> maintainers of any other architecture suspect that their FMAs might

> suffer similar latency problem, they can easily try tweaking that

> parameter and see what happens with the matrix multiplication example.

>

> I have bootstrapped and tested the patch on x86_64-linux (as it is and

> also with the param set to a 256 by default to make it trigger).  I have

> also measured run-times of all benchmarks in SPEC 2006 FP and SPEC 2017

> FPrate and the only changes are the big improvements of calculix and

> parest.

>

> After I address any comments and/or suggestions, would it be OK for

> trunk?


What does the patch do to bwaves whose innermost loop is a sequence
of FMAs (slightly complicated by reassoc widening the sum)?

So for bwaves we currently have loads feeding the multiplies which might
cause enough latency to make the FMA issue not matter and/or there
are intermediate non-FMA adds that might hide the issue.  Forcing
a full FMA chain is slowing Zen down significantly though (see my report).

Another issue is are FMAs not discovered anyway by combine later?
Possibly not on x86 since all patterns use FMA rather than (plus (mult ...)).

One minor issue below (otherwise I was just looking at the high-level parts).

Richard.

> Thanks,

>

> Martin

>

>

> 2017-12-13  Martin Jambor  <mjambor@suse.cz>

>

>         PR target/81616

>         * params.def: New parameter PARAM_AVOID_FMA_MAX_BITS.

>         * tree-ssa-math-opts.c: Include domwalk.h.

>         (convert_mult_to_fma_1): New function.

>         (fma_transformation_info): New type.

>         (fma_deferring_state): Likewise.

>         (cancel_fma_deferring): New function.

>         (result_of_phi): Likewise.

>         (last_fma_candidate_feeds_initial_phi): Likewise.

>         (convert_mult_to_fma): Added deferring logic, split actual

>         transformation to convert_mult_to_fma_1.

>         (math_opts_dom_walker): New type.

>         (math_opts_dom_walker::after_dom_children): New method, body moved

>         here from pass_optimize_widening_mul::execute, added deferring logic

>         bits.

>         (pass_optimize_widening_mul::execute): Moved most of code to

>         math_opts_dom_walker::after_dom_children.

>         * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS): New.

>         * config/i386/i386.c (ix86_option_override_internal): Added

>         maybe_setting of PARAM_AVOID_FMA_MAX_BITS.

> ---

>  gcc/config/i386/i386.c       |   5 +

>  gcc/config/i386/x86-tune.def |   4 +

>  gcc/params.def               |   5 +

>  gcc/tree-ssa-math-opts.c     | 521 ++++++++++++++++++++++++++++++++-----------

>  4 files changed, 407 insertions(+), 128 deletions(-)

>

> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c

> index e323102cef5..224544fe04f 100644

> --- a/gcc/config/i386/i386.c

> +++ b/gcc/config/i386/i386.c

> @@ -4888,6 +4888,11 @@ ix86_option_override_internal (bool main_args_p,

>         (cf_protection_level) (opts->x_flag_cf_protection | CF_SET);

>      }

>

> +  if (ix86_tune_features [X86_TUNE_AVOID_128FMA_CHAINS])

> +    maybe_set_param_value (PARAM_AVOID_FMA_MAX_BITS, 128,

> +                          opts->x_param_values,

> +                          opts_set->x_param_values);

> +

>    return true;

>  }

>

> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def

> index 25f28e3cfc1..1b6f5f8816b 100644

> --- a/gcc/config/i386/x86-tune.def

> +++ b/gcc/config/i386/x86-tune.def

> @@ -399,6 +399,10 @@ DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_pshufb",

>  DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",

>            m_SILVERMONT | m_INTEL)

>

> +/* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or

> +   smaller FMA chain.  */

> +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1)

> +

>  /*****************************************************************************/

>  /* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */

>  /*****************************************************************************/

> diff --git a/gcc/params.def b/gcc/params.def

> index 923ebc8e66c..dd6193b44c2 100644

> --- a/gcc/params.def

> +++ b/gcc/params.def

> @@ -1317,6 +1317,11 @@ DEFPARAM(PARAM_UNROLL_JAM_MAX_UNROLL,

>          "Maximum unroll factor for the unroll-and-jam transformation.",

>          4, 0, 0)

>

> +DEFPARAM(PARAM_AVOID_FMA_MAX_BITS,

> +        "avoid-fma-max-bits",

> +        "Maximum number of bits for which we avoid creating FMAs.",

> +        0, 0, 512)

> +

>  /*

>

>  Local variables:

> diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c

> index 8db12f5e1cd..90a9f5359d6 100644

> --- a/gcc/tree-ssa-math-opts.c

> +++ b/gcc/tree-ssa-math-opts.c

> @@ -115,6 +115,7 @@ along with GCC; see the file COPYING3.  If not see

>  #include "optabs-libfuncs.h"

>  #include "tree-eh.h"

>  #include "targhooks.h"

> +#include "domwalk.h"

>

>  /* This structure represents one basic block that either computes a

>     division, or is a common dominator for basic block that compute a

> @@ -2637,17 +2638,218 @@ convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple *stmt,

>    return true;

>  }

>

> +/* Given a result MUL_RESULT which is a result of a multiplication of OP1 and

> +   OP2 and which we know is used in statements that can be, together with the

> +   multiplication, converted to FMAs, perform the transformation.  */

> +

> +static void

> +convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)

> +{

> +  tree type = TREE_TYPE (mul_result);

> +  gimple *use_stmt;

> +  imm_use_iterator imm_iter;

> +  gassign *fma_stmt;

> +

> +  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, mul_result)

> +    {

> +      gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt);

> +      enum tree_code use_code;

> +      tree addop, mulop1 = op1, result = mul_result;

> +      bool negate_p = false;

> +

> +      if (is_gimple_debug (use_stmt))

> +       continue;

> +

> +      use_code = gimple_assign_rhs_code (use_stmt);

> +      if (use_code == NEGATE_EXPR)

> +       {

> +         result = gimple_assign_lhs (use_stmt);

> +         use_operand_p use_p;

> +         gimple *neguse_stmt;

> +         single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt);

> +         gsi_remove (&gsi, true);

> +         release_defs (use_stmt);

> +

> +         use_stmt = neguse_stmt;

> +         gsi = gsi_for_stmt (use_stmt);

> +         use_code = gimple_assign_rhs_code (use_stmt);

> +         negate_p = true;

> +       }

> +

> +      if (gimple_assign_rhs1 (use_stmt) == result)

> +       {

> +         addop = gimple_assign_rhs2 (use_stmt);

> +         /* a * b - c -> a * b + (-c)  */

> +         if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)

> +           addop = force_gimple_operand_gsi (&gsi,

> +                                             build1 (NEGATE_EXPR,

> +                                                     type, addop),

> +                                             true, NULL_TREE, true,

> +                                             GSI_SAME_STMT);

> +       }

> +      else

> +       {

> +         addop = gimple_assign_rhs1 (use_stmt);

> +         /* a - b * c -> (-b) * c + a */

> +         if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)

> +           negate_p = !negate_p;

> +       }

> +

> +      if (negate_p)

> +       mulop1 = force_gimple_operand_gsi (&gsi,

> +                                          build1 (NEGATE_EXPR,

> +                                                  type, mulop1),

> +                                          true, NULL_TREE, true,

> +                                          GSI_SAME_STMT);

> +

> +      fma_stmt = gimple_build_assign (gimple_assign_lhs (use_stmt),

> +                                     FMA_EXPR, mulop1, op2, addop);

> +

> +      if (dump_file && (dump_flags & TDF_DETAILS))

> +       {

> +         fprintf (dump_file, "Generated FMA ");

> +         print_gimple_stmt (dump_file, fma_stmt, 0, 0);

> +         fprintf (dump_file, "\n");

> +       }

> +

> +      gsi_replace (&gsi, fma_stmt, true);

> +      widen_mul_stats.fmas_inserted++;

> +    }

> +}

> +

> +/* Data necessary to perform the actual transformation from a multiplication

> +   and an addition to an FMA after decision is taken it should be done and to

> +   then delete the multiplication statement from the function IL.  */

> +

> +struct fma_transformation_info

> +{

> +  gimple *mul_stmt;

> +  tree mul_result;

> +  tree op1;

> +  tree op2;

> +};

> +

> +/* Structure containing the current state of FMA deferring, i.e. whether we are

> +   deferring, whether to continue deferring, and all data necessary to come

> +   back and perform all deferred transformations.  */

> +

> +class fma_deferring_state

> +{

> +public:

> +  /* Class constructor.  Pass true as PERFORM_DEFERRING in order to actually

> +     do any deferring.  */

> +

> +  fma_deferring_state (bool perform_deferring)

> +    : m_candidates (), m_mul_result_set (), m_initial_phi (NULL),

> +      m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {}

> +

> +  /* List of FMA candidates for which we the transformation has been determined

> +     possible but we at this point in BB analysis we do not consider them

> +     beneficial.  */

> +  auto_vec<fma_transformation_info, 8> m_candidates;

> +

> +  /* Set of results of multiplication that are part of an already deferred FMA

> +     candidates.  */

> +  hash_set<tree> m_mul_result_set;

> +

> +  /* The PHI that supposedly feeds back result of a FMA to another over loop

> +     boundary.  */

> +  gphi *m_initial_phi;

> +

> +  /* Result of the last produced FMA candidate or NULL if there has not been

> +     one.  */

> +  tree m_last_result;

> +

> +  /* If true, deferring might still be profitable.  If false, transform all

> +     candidates and no longer defer.  */

> +  bool m_deferring_p;

> +};

> +

> +/* Transform all deferred FMA candidates and mark STATE as no longer

> +   deferring.  */

> +

> +static void

> +cancel_fma_deferring (fma_deferring_state *state)

> +{

> +  if (!state->m_deferring_p)

> +    return;

> +

> +  for (unsigned i = 0; i < state->m_candidates.length (); i++)

> +    {

> +      if (dump_file && (dump_flags & TDF_DETAILS))

> +       fprintf (dump_file, "Generating deferred FMA\n");

> +

> +      const fma_transformation_info &fti = state->m_candidates[i];

> +      convert_mult_to_fma_1 (fti.mul_result, fti.op1, fti.op2);

> +

> +      gimple_stmt_iterator gsi = gsi_for_stmt (fti.mul_stmt);

> +      gsi_remove (&gsi, true);

> +      release_defs (fti.mul_stmt);

> +    }

> +  state->m_deferring_p = false;

> +}

> +

> +/* If OP is an SSA name defined by a PHI node, return the PHI statement.

> +   Otherwise return NULL.  */

> +

> +static gphi *

> +result_of_phi (tree op)

> +{

> +  if (TREE_CODE (op) != SSA_NAME)

> +    return NULL;


  return dyn_cast <gphi *> (SSA_NAME_DEF_STMT (op));

?  Maybe inline into the single(?) user.

> +  gimple *opdef = SSA_NAME_DEF_STMT (op);

> +  if (gimple_code (opdef) != GIMPLE_PHI)

> +    return NULL;

> +

> +  return as_a <gphi *> (opdef);

> +}

> +

> +/* After processing statements of a BB and recording STATE, return true if the

> +   initial phi is fed by the last FMA candidate result ore one such result from

> +   previously processed BBs marked in LAST_RESULT_SET.  */

> +

> +static bool

> +last_fma_candidate_feeds_initial_phi (fma_deferring_state *state,

> +                                     hash_set<tree> *last_result_set)

> +{

> +  ssa_op_iter iter;

> +  use_operand_p use;

> +  FOR_EACH_PHI_ARG (use, state->m_initial_phi, iter, SSA_OP_USE)

> +    {

> +      tree t = USE_FROM_PTR (use);

> +      if (t == state->m_last_result

> +         || last_result_set->contains (t))

> +       return true;

> +    }


maybe query the other way around,

   if (single_imm_use (state->m_last_result, ....)
      && def_stmt is PHI?

> +

> +  return false;

> +}

> +

>  /* Combine the multiplication at MUL_STMT with operands MULOP1 and MULOP2

>     with uses in additions and subtractions to form fused multiply-add

> -   operations.  Returns true if successful and MUL_STMT should be removed.  */

> +   operations.  Returns true if successful and MUL_STMT should be removed.

> +

> +   If STATE indicates that we are deferring FMA transformation, that means

> +   that we do not produce FMAs for basic blocks which look like:

> +

> +    <bb 6>

> +    # accumulator_111 = PHI <0.0(5), accumulator_66(6)>

> +    _65 = _14 * _16;

> +    accumulator_66 = _65 + dst_row_111;

> +

> +  or its unrolled version, i.e. with several FMA candidates that feed result

> +  of one into the addend of another.  Instead, we add them to a list in STATE

> +  and if we later discover an FMA candidate that is not part of such a chain,

> +  we go back and perform all deferred past candidates.  */

>

>  static bool

> -convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2)

> +convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,

> +                    fma_deferring_state *state)

>  {

>    tree mul_result = gimple_get_lhs (mul_stmt);

>    tree type = TREE_TYPE (mul_result);

>    gimple *use_stmt, *neguse_stmt;

> -  gassign *fma_stmt;

>    use_operand_p use_p;

>    imm_use_iterator imm_iter;

>

> @@ -2671,6 +2873,11 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2)

>    if (has_zero_uses (mul_result))

>      return false;

>

> +  bool check_defer

> +    = (state->m_deferring_p

> +       && (tree_to_shwi (TYPE_SIZE (type))

> +          <= PARAM_VALUE (PARAM_AVOID_FMA_MAX_BITS)));

> +  bool defer = check_defer;

>    /* Make sure that the multiplication statement becomes dead after

>       the transformation, thus that all uses are transformed to FMAs.

>       This means we assume that an FMA operation has the same cost

> @@ -2768,77 +2975,92 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2)

>             }

>         }

>

> +      tree use_rhs1 = gimple_assign_rhs1 (use_stmt);

> +      tree use_rhs2 = gimple_assign_rhs2 (use_stmt);

>        /* We can't handle a * b + a * b.  */

> -      if (gimple_assign_rhs1 (use_stmt) == gimple_assign_rhs2 (use_stmt))

> +      if (use_rhs1 == use_rhs2)

> +       return false;

> +      /* If deferring, make sure we are not looking at an instruction that

> +        wouldn't have existed if we were not.  */

> +      if (state->m_deferring_p

> +         && (state->m_mul_result_set.contains (use_rhs1)

> +             || state->m_mul_result_set.contains (use_rhs2)))

>         return false;

>

> -      /* While it is possible to validate whether or not the exact form

> -        that we've recognized is available in the backend, the assumption

> -        is that the transformation is never a loss.  For instance, suppose

> -        the target only has the plain FMA pattern available.  Consider

> -        a*b-c -> fma(a,b,-c): we've exchanged MUL+SUB for FMA+NEG, which

> -        is still two operations.  Consider -(a*b)-c -> fma(-a,b,-c): we

> -        still have 3 operations, but in the FMA form the two NEGs are

> -        independent and could be run in parallel.  */

> -    }

> -

> -  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, mul_result)

> -    {

> -      gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt);

> -      enum tree_code use_code;

> -      tree addop, mulop1 = op1, result = mul_result;

> -      bool negate_p = false;

> -

> -      if (is_gimple_debug (use_stmt))

> -       continue;

> -

> -      use_code = gimple_assign_rhs_code (use_stmt);

> -      if (use_code == NEGATE_EXPR)

> +      if (check_defer)

>         {

> -         result = gimple_assign_lhs (use_stmt);

> -         single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt);

> -         gsi_remove (&gsi, true);

> -         release_defs (use_stmt);

> +         tree use_lhs = gimple_assign_lhs (use_stmt);

> +         if (state->m_last_result)

> +           {

> +             if (use_rhs2 == state->m_last_result

> +                 || use_rhs1 == state->m_last_result)

> +               defer = true;

> +             else

> +               defer = false;

> +           }

> +         else

> +           {

> +             gcc_checking_assert (!state->m_initial_phi);

> +             gphi *phi;

> +             if (use_rhs1 == result)

> +               phi = result_of_phi (use_rhs2);

> +             else

> +               {

> +                 gcc_assert (use_rhs2 == result);

> +                 phi = result_of_phi (use_rhs1);

> +               }

>

> -         use_stmt = neguse_stmt;

> -         gsi = gsi_for_stmt (use_stmt);

> -         use_code = gimple_assign_rhs_code (use_stmt);

> -         negate_p = true;

> -       }

> +             if (phi)

> +               {

> +                 state->m_initial_phi = phi;

> +                 defer = true;

> +               }

> +             else

> +               defer = false;

> +           }

>

> -      if (gimple_assign_rhs1 (use_stmt) == result)

> -       {

> -         addop = gimple_assign_rhs2 (use_stmt);

> -         /* a * b - c -> a * b + (-c)  */

> -         if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)

> -           addop = force_gimple_operand_gsi (&gsi,

> -                                             build1 (NEGATE_EXPR,

> -                                                     type, addop),

> -                                             true, NULL_TREE, true,

> -                                             GSI_SAME_STMT);

> +         state->m_last_result = use_lhs;

> +         check_defer = false;

>         }

>        else

> +       defer = false;

> +

> +      /* While it is possible to validate whether or not the exact form that

> +        we've recognized is available in the backend, the assumption is that

> +        if the deferring logic above did not trigger, the transformation is

> +        never a loss.  For instance, suppose the target only has the plain FMA

> +        pattern available.  Consider a*b-c -> fma(a,b,-c): we've exchanged

> +        MUL+SUB for FMA+NEG, which is still two operations.  Consider

> +         -(a*b)-c -> fma(-a,b,-c): we still have 3 operations, but in the FMA

> +        form the two NEGs are independent and could be run in parallel.  */

> +    }

> +

> +  if (defer)

> +    {

> +      fma_transformation_info fti;

> +      fti.mul_stmt = mul_stmt;

> +      fti.mul_result = mul_result;

> +      fti.op1 = op1;

> +      fti.op2 = op2;

> +      state->m_candidates.safe_push (fti);

> +      state->m_mul_result_set.add (mul_result);

> +

> +      if (dump_file && (dump_flags & TDF_DETAILS))

>         {

> -         addop = gimple_assign_rhs1 (use_stmt);

> -         /* a - b * c -> (-b) * c + a */

> -         if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)

> -           negate_p = !negate_p;

> +         fprintf (dump_file, "Deferred generating FMA for multiplication ");

> +         print_gimple_stmt (dump_file, mul_stmt, 0, 0);

> +         fprintf (dump_file, "\n");

>         }

>

> -      if (negate_p)

> -       mulop1 = force_gimple_operand_gsi (&gsi,

> -                                          build1 (NEGATE_EXPR,

> -                                                  type, mulop1),

> -                                          true, NULL_TREE, true,

> -                                          GSI_SAME_STMT);

> -

> -      fma_stmt = gimple_build_assign (gimple_assign_lhs (use_stmt),

> -                                     FMA_EXPR, mulop1, op2, addop);

> -      gsi_replace (&gsi, fma_stmt, true);

> -      widen_mul_stats.fmas_inserted++;

> +      return false;

> +    }

> +  else

> +    {

> +      if (state->m_deferring_p)

> +       cancel_fma_deferring (state);

> +      convert_mult_to_fma_1 (mul_result, op1, op2);

> +      return true;

>      }

> -

> -  return true;

>  }

>

>

> @@ -3268,92 +3490,135 @@ public:

>

>  }; // class pass_optimize_widening_mul

>

> -unsigned int

> -pass_optimize_widening_mul::execute (function *fun)

> +/* Walker class to perform the transformation in reverse dominance order. */

> +

> +class math_opts_dom_walker : public dom_walker

>  {

> -  basic_block bb;

> -  bool cfg_changed = false;

> +public:

> +  /* Constructor, CFG_CHANGED is a pointer to a boolean flag that will be set

> +     if walking modidifes the CFG.  */

>

> -  memset (&widen_mul_stats, 0, sizeof (widen_mul_stats));

> -  calculate_dominance_info (CDI_DOMINATORS);

> -  renumber_gimple_stmt_uids ();

> +  math_opts_dom_walker (bool *cfg_changed_p)

> +    : dom_walker (CDI_DOMINATORS), m_last_result_set (),

> +      m_cfg_changed_p (cfg_changed_p) {}

>

> -  FOR_EACH_BB_FN (bb, fun)

> +  /* The actual actions performed in the walk.  */

> +

> +  virtual void after_dom_children (basic_block);

> +

> +  /* Set of results of chains of multiply and add statement combinations that

> +     were not transformed into FMAs because of active deferring.  */

> +  hash_set<tree> m_last_result_set;

> +

> +  /* Pointer to a flag of the user that needs to be set if CFG has been

> +     modified.  */

> +  bool *m_cfg_changed_p;

> +};

> +

> +void

> +math_opts_dom_walker::after_dom_children (basic_block bb)

> +{

> +  gimple_stmt_iterator gsi;

> +

> +  fma_deferring_state fma_state (PARAM_VALUE (PARAM_AVOID_FMA_MAX_BITS) > 0);

> +

> +  for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi);)

>      {

> -      gimple_stmt_iterator gsi;

> +      gimple *stmt = gsi_stmt (gsi);

> +      enum tree_code code;

>

> -      for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi);)

> -        {

> -         gimple *stmt = gsi_stmt (gsi);

> -         enum tree_code code;

> +      if (is_gimple_assign (stmt))

> +       {

> +         code = gimple_assign_rhs_code (stmt);

> +         switch (code)

> +           {

> +           case MULT_EXPR:

> +             if (!convert_mult_to_widen (stmt, &gsi)

> +                 && !convert_expand_mult_copysign (stmt, &gsi)

> +                 && convert_mult_to_fma (stmt,

> +                                         gimple_assign_rhs1 (stmt),

> +                                         gimple_assign_rhs2 (stmt),

> +                                         &fma_state))

> +               {

> +                 gsi_remove (&gsi, true);

> +                 release_defs (stmt);

> +                 continue;

> +               }

> +             break;

> +

> +           case PLUS_EXPR:

> +           case MINUS_EXPR:

> +             if (!convert_plusminus_to_widen (&gsi, stmt, code))

> +               match_uaddsub_overflow (&gsi, stmt, code);

> +             break;

>

> -         if (is_gimple_assign (stmt))

> +           case TRUNC_MOD_EXPR:

> +             convert_to_divmod (as_a<gassign *> (stmt));

> +             break;

> +

> +           default:;

> +           }

> +       }

> +      else if (is_gimple_call (stmt))

> +       {

> +         tree fndecl = gimple_call_fndecl (stmt);

> +         if (fndecl && gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))

>             {

> -             code = gimple_assign_rhs_code (stmt);

> -             switch (code)

> +             switch (DECL_FUNCTION_CODE (fndecl))

>                 {

> -               case MULT_EXPR:

> -                 if (!convert_mult_to_widen (stmt, &gsi)

> -                     && !convert_expand_mult_copysign (stmt, &gsi)

> +               case BUILT_IN_POWF:

> +               case BUILT_IN_POW:

> +               case BUILT_IN_POWL:

> +                 if (gimple_call_lhs (stmt)

> +                     && TREE_CODE (gimple_call_arg (stmt, 1)) == REAL_CST

> +                     && real_equal

> +                     (&TREE_REAL_CST (gimple_call_arg (stmt, 1)),

> +                      &dconst2)

>                       && convert_mult_to_fma (stmt,

> -                                             gimple_assign_rhs1 (stmt),

> -                                             gimple_assign_rhs2 (stmt)))

> +                                             gimple_call_arg (stmt, 0),

> +                                             gimple_call_arg (stmt, 0),

> +                                             &fma_state))

>                     {

> -                     gsi_remove (&gsi, true);

> +                     unlink_stmt_vdef (stmt);

> +                     if (gsi_remove (&gsi, true)

> +                         && gimple_purge_dead_eh_edges (bb))

> +                       *m_cfg_changed_p = true;

>                       release_defs (stmt);

>                       continue;

>                     }

>                   break;

>

> -               case PLUS_EXPR:

> -               case MINUS_EXPR:

> -                 if (!convert_plusminus_to_widen (&gsi, stmt, code))

> -                   match_uaddsub_overflow (&gsi, stmt, code);

> -                 break;

> -

> -               case TRUNC_MOD_EXPR:

> -                 convert_to_divmod (as_a<gassign *> (stmt));

> -                 break;

> -

>                 default:;

>                 }

>             }

> -         else if (is_gimple_call (stmt)

> -                  && gimple_call_lhs (stmt))

> -           {

> -             tree fndecl = gimple_call_fndecl (stmt);

> -             if (fndecl

> -                 && gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))

> -               {

> -                 switch (DECL_FUNCTION_CODE (fndecl))

> -                   {

> -                     case BUILT_IN_POWF:

> -                     case BUILT_IN_POW:

> -                     case BUILT_IN_POWL:

> -                       if (TREE_CODE (gimple_call_arg (stmt, 1)) == REAL_CST

> -                           && real_equal

> -                                (&TREE_REAL_CST (gimple_call_arg (stmt, 1)),

> -                                 &dconst2)

> -                           && convert_mult_to_fma (stmt,

> -                                                   gimple_call_arg (stmt, 0),

> -                                                   gimple_call_arg (stmt, 0)))

> -                         {

> -                           unlink_stmt_vdef (stmt);

> -                           if (gsi_remove (&gsi, true)

> -                               && gimple_purge_dead_eh_edges (bb))

> -                             cfg_changed = true;

> -                           release_defs (stmt);

> -                           continue;

> -                         }

> -                         break;

> -

> -                     default:;

> -                   }

> -               }

> -           }

> -         gsi_next (&gsi);

> +         else

> +           cancel_fma_deferring (&fma_state);

>         }

> +      gsi_next (&gsi);

>      }

> +  if (fma_state.m_deferring_p

> +      && fma_state.m_initial_phi)

> +    {

> +      gcc_checking_assert (fma_state.m_last_result);

> +      if (!last_fma_candidate_feeds_initial_phi (&fma_state,

> +                                                &m_last_result_set))

> +       cancel_fma_deferring (&fma_state);

> +      else

> +       m_last_result_set.add (fma_state.m_last_result);

> +    }

> +}

> +

> +

> +unsigned int

> +pass_optimize_widening_mul::execute (function *fun)

> +{

> +  bool cfg_changed = false;

> +

> +  memset (&widen_mul_stats, 0, sizeof (widen_mul_stats));

> +  calculate_dominance_info (CDI_DOMINATORS);

> +  renumber_gimple_stmt_uids ();

> +

> +  math_opts_dom_walker (&cfg_changed).walk (ENTRY_BLOCK_PTR_FOR_FN (cfun));

>

>    statistics_counter_event (fun, "widening multiplications inserted",

>                             widen_mul_stats.widen_mults_inserted);

> --

> 2.15.1

>
Martin Jambor Dec. 22, 2017, 12:01 a.m. | #2
Hi,

On Mon, Dec 18 2017, Richard Biener wrote:
> On Fri, Dec 15, 2017 at 3:19 PM, Martin Jambor <mjambor@suse.cz> wrote:

>>

>> Hello,

>>

>> the patch below prevents creation if fused-multiply-and-add instructions

>> in the widening_mul gimple pass on the Zen-based AMD CPUs and as a

>> result fixes regressions of native znver1 tuning when compared to

>> generic tuning in:

>>

>>   - the matrix.c testcase of PR 81616 (straightforward matrix

>>     multiplication) at -O2 and -O3 which is currently 60% (!),

>>

>>   - SPEC 2006 454.calculix at -O2, which is currently over 20%, and

>>

>>   - SPEC 2017 510.parest at -O2 and -Ofast, which is currently also

>>     about 20% in both cases.

>>

>> The basic idea is to detect loops in the following form:

>>

>>     <bb 6>

>>     # accumulator_111 = PHI <0.0(5), accumulator_66(6)>

>>     ...

>>     _65 = _14 * _16;

>>     accumulator_66 = _65 + accumulator_111;

>>

>> and prevents from creating FMA for it.  Because at least in the parest

>> and calculix cases it has to, it also deals with more than one chain of

>> FMA candidates that feed the next one's addend:

>>

>>

>>     <bb 6>

>>     # accumulator_111 = PHI <0.0(5), accumulator_66(6)>

>>     ...

>>     _65 = _14 * _16;

>>     accumulator_55 = _65 + accumulator_111;

>>     _65 = _24 * _36;

>>     accumulator_66 = _65 + accumulator_55;

>>

>> Unfortunately, to really get rid of the calculix regression, the

>> algorithm cannot just look at one BB at a time but also has to work for

>> cases like the following:

>>

>>      1  void mult(void)

>>      2  {

>>      3      int i, j, k, l;

>>      4

>>      5     for(i=0; i<SIZE; ++i)

>>      6     {

>>      7        for(j=0; j<SIZE; ++j)

>>      8        {

>>      9           for(l=0; l<SIZE; l+=10)

>>     10           {

>>     11               c[i][j] += a[i][l] * b[k][l];

>>     12               for(k=1; k<10; ++k)

>>     13               {

>>     14                   c[i][j] += a[i][l+k] * b[k][l+k];

>>     15               }

>>     16

>>     17           }

>>     18        }

>>     19     }

>>     20  }

>>

>> where the FMA on line 14 feeds into the one on line 11 in an

>> encompassing loop.  Therefore I have changed the structure of the pass

>> to work in reverse dominance order and it keeps a hash set of results of

>> rejected FMAs candidates which it checks when looking at PHI nodes of

>> the current BB.  Without this reorganization, calculix was still 8%

>> slower with native tuning than with generic one.

>>

>> When the deferring mechanism realizes that in the current BB, the FMA

>> candidates do not all form a one chain tight-loop like in the examples

>> above, it goes back to all the previously deferred candidates (in the

>> current BB only) and performs the transformation.

>>

>> The main reason is to keep the patch conservative (and also simple), but

>> it also means that the following function is not affected and is still

>> 20% slower when compiled with native march and tuning compared to the

>> generic one:

>>

>>      1  void mult(struct s *p1, struct s *p2)

>>      2  {

>>      3     int i, j, k;

>>      4

>>      5     for(i=0; i<SIZE; ++i)

>>      6     {

>>      7        for(j=0; j<SIZE; ++j)

>>      8        {

>>      9           for(k=0; k<SIZE; ++k)

>>     10           {

>>     11              p1->c[i][j] += p1->a[i][k] * p1->b[k][j];

>>     12              p2->c[i][j] += p2->a[i][k] * p2->b[k][j];

>>     13           }

>>     14        }

>>     15     }

>>     16  }

>

> So here we apply store-motion to p[12]->c[i][j] so we see a

> reduction in SSA?  That is, you are really looking for

> SSA SCCs that involve only FMA candidates?  There's a SCC

> finding code in tree-ssa-sccvn.c, DFS (), but a straight-forward

> use->def walk over the adds should find a SCC just involving the

> adds?  That is, rather than doing all the defering stuff I wonder

> if it is easier to simply gather the SCC if there is any and add

> "disqualifications", like to each other member?  So when seeing

>

>   x += a1 * b1;

>   x += a2 * b2;

>   x += a3 * b3;

> ...


That is certainly an alternative but I am not sure it would result in
simpler code, at least with the identical intended behavior, because I
would have to then walk backwards from the additions to multiplications,
skipping a potential negation, to check that they form a viable FMA.
And remembering disqualifications is likely to be similarly complex like
remembering deferred transformations.

> we'd still generate a FMA for each 2nd statement?  Basically


When I manually unroll the matrix multiplication twice, not creating
every second FMA improves run-time only by 20%, whereas removing all by
60%.  When I tried to unroll the loop more, specifically, eight times,
the regression disappeared even with unpatched compiler, I assume the
loads hid the latency issue.

> do some scheduling of as many mults as needed to conver

> the latency of the first FMA.

>

> To go back to the Zen specific issue - so FMA has a higher overall

> latency than mul + add (?) and (or?)


No, if I read the tables I'm using for this right, the latency is 5c,
whereas add + mul have 7.

> if there's more than one mul being

> accumulated then the 2nd mul can execute in parallel with the first

> add (in case dependences allow).


That is how I understand the issue.

> I wonder if there's an easy way to

> query the DFA for the latency of FMA vs. ADD and/or if the --param

> (or target hook?) should rather tell us sth about this latency difference?


Well, I guess that ideally this would really be an instruction-selection
decision based on the fact whether the addend is likely to be ready or
not.  But I am not sure how to model this at gimple level and I did not
think I had enough time to investigate this for gcc 8, so I basically
re-used (most of) the observations that lead to Venkat's

  https://gcc.gnu.org/ml/gcc-patches/2017-11/msg00437.html

but in order to prevent FMA generation, not to split it.

>

> Zen IIRC can execute two FMA128 in parallel (thus one FMA256 per cycle).

> That makes it an odd observation that avx256 can "fix" the issue.  WRT

> the above suggestion the situation needs to be clarified.


I don't have a good explanation for this, I will try to ask AMD.

>

>> I suppose that the best optimization for the above would be to split the

>> loops, but one could probably construct at least an artificial testcase

>> where the FMAs would keep enough locality that it is not the case.  The

>> mechanism can be easily extended to keep track of not just one chain but

>> a few, preferably as a followup, if people think it makes sense.

>>

>> An interesting observation is that the matrix multiplication does not

>> suffer the penalty when compiled with -O3 -mprefer-vector-width=256.

>> Apparently the 256 vector processing can hide the latency penalty when

>> internally it is split into two halves.  The same goes for 512 bit

>> vectors.  That is why the patch leaves those be - well, there is a param

>> for the threshold which is set to zero for everybody but znver1.  If

>> maintainers of any other architecture suspect that their FMAs might

>> suffer similar latency problem, they can easily try tweaking that

>> parameter and see what happens with the matrix multiplication example.

>>

>> I have bootstrapped and tested the patch on x86_64-linux (as it is and

>> also with the param set to a 256 by default to make it trigger).  I have

>> also measured run-times of all benchmarks in SPEC 2006 FP and SPEC 2017

>> FPrate and the only changes are the big improvements of calculix and

>> parest.

>>

>> After I address any comments and/or suggestions, would it be OK for

>> trunk?

>

> What does the patch do to bwaves whose innermost loop is a sequence

> of FMAs (slightly complicated by reassoc widening the sum)?


It splits the chain of FMAs at -O2 and -O3 (not at -Ofast where
reassociation does that).  It actually speeds up the testcase by over 5%
at -O2 and over 10% at -O3.

>

> So for bwaves we currently have loads feeding the multiplies which might

> cause enough latency to make the FMA issue not matter and/or there

> are intermediate non-FMA adds that might hide the issue.  Forcing

> a full FMA chain is slowing Zen down significantly though (see my report).


Apparently so.

>

> Another issue is are FMAs not discovered anyway by combine later?

> Possibly not on x86 since all patterns use FMA rather than (plus (mult ...)).


I don't think so but I may be easily wrong when it comes to such late passes.

>

> One minor issue below (otherwise I was just looking at the high-level parts).

>


...

>> +static gphi *

>> +result_of_phi (tree op)

>> +{

>> +  if (TREE_CODE (op) != SSA_NAME)

>> +    return NULL;

>

>   return dyn_cast <gphi *> (SSA_NAME_DEF_STMT (op));

>


Fixed on my private branch.

> ?  Maybe inline into the single(?) user.


There are two.  I can do that, but the (TREE_CODE (op) != SSA_NAME) part
would make the code IMHO uglier.

...

>> +/* After processing statements of a BB and recording STATE, return true if the

>> +   initial phi is fed by the last FMA candidate result ore one such result from

>> +   previously processed BBs marked in LAST_RESULT_SET.  */

>> +

>> +static bool

>> +last_fma_candidate_feeds_initial_phi (fma_deferring_state *state,

>> +                                     hash_set<tree> *last_result_set)

>> +{

>> +  ssa_op_iter iter;

>> +  use_operand_p use;

>> +  FOR_EACH_PHI_ARG (use, state->m_initial_phi, iter, SSA_OP_USE)

>> +    {

>> +      tree t = USE_FROM_PTR (use);

>> +      if (t == state->m_last_result

>> +         || last_result_set->contains (t))

>> +       return true;

>> +    }

>

> maybe query the other way around,

>

>    if (single_imm_use (state->m_last_result, ....)

>       && def_stmt is PHI?

>


I am not sure I understand.  There are usually two uses of
state->m_last_result, one in the PHI node and at least one final use of
the computed sum after the loop.  Also, I need to check the arguments
whether they are in the last_result_set.

Thanks for the comments,

Martin
Martin Jambor Jan. 10, 2018, 6:46 p.m. | #3
Hello,

I would really like to ping the FMA transformation prevention patch that
I sent here in December, which, after incorporating a suggestion from
Richi, re-base and re-testing, I re-post below.  I really think that it
should make into gcc 8 in some form, because the performance wins are
really big.

I am still opened to all sorts of comments, of course, especially to
suggestions about the form of the param controlling this behavior (or
how to communicate that we want to do this on Zen in general).  It might
even be a binary value since not forming FMAs does not seem to harm
bigger vectors either (as far as we know, I should add).

For the record, I have not yet received any information from AMD why
FMAs on 256bit vectors do not have this problem, I assume all people
that could give an authoritative answer are now looking into covert
channel mitigation stuff.  But very probably it is just that the
internal split FMAs can be scheduled so that while one is still waiting
for its addend, another can already execute.

Thanks,

Martin


On Fri, Dec 15 2017, Martin Jambor wrote:
> Hello,

>

> the patch below prevents creation if fused-multiply-and-add instructions

> in the widening_mul gimple pass on the Zen-based AMD CPUs and as a

> result fixes regressions of native znver1 tuning when compared to

> generic tuning in:

>

>   - the matrix.c testcase of PR 81616 (straightforward matrix

>     multiplication) at -O2 and -O3 which is currently 60% (!),

>

>   - SPEC 2006 454.calculix at -O2, which is currently over 20%, and

>

>   - SPEC 2017 510.parest at -O2 and -Ofast, which is currently also

>     about 20% in both cases.

>

> The basic idea is to detect loops in the following form:

>

>     <bb 6>

>     # accumulator_111 = PHI <0.0(5), accumulator_66(6)>

>     ...

>     _65 = _14 * _16;

>     accumulator_66 = _65 + accumulator_111;

>

> and prevents from creating FMA for it.  Because at least in the parest

> and calculix cases it has to, it also deals with more than one chain of

> FMA candidates that feed the next one's addend:

>

>

>     <bb 6>

>     # accumulator_111 = PHI <0.0(5), accumulator_66(6)>

>     ...

>     _65 = _14 * _16;

>     accumulator_55 = _65 + accumulator_111;

>     _65 = _24 * _36;

>     accumulator_66 = _65 + accumulator_55;

>

> Unfortunately, to really get rid of the calculix regression, the

> algorithm cannot just look at one BB at a time but also has to work for

> cases like the following:

>

>      1  void mult(void)

>      2  {

>      3      int i, j, k, l;

>      4  

>      5     for(i=0; i<SIZE; ++i)

>      6     {

>      7        for(j=0; j<SIZE; ++j)

>      8        {

>      9           for(l=0; l<SIZE; l+=10)

>     10           {

>     11               c[i][j] += a[i][l] * b[k][l];

>     12               for(k=1; k<10; ++k)

>     13               {

>     14                   c[i][j] += a[i][l+k] * b[k][l+k];

>     15               }

>     16  

>     17           }

>     18        }

>     19     }

>     20  }

>

> where the FMA on line 14 feeds into the one on line 11 in an

> encompassing loop.  Therefore I have changed the structure of the pass

> to work in reverse dominance order and it keeps a hash set of results of

> rejected FMAs candidates which it checks when looking at PHI nodes of

> the current BB.  Without this reorganization, calculix was still 8%

> slower with native tuning than with generic one.

>

> When the deferring mechanism realizes that in the current BB, the FMA

> candidates do not all form a one chain tight-loop like in the examples

> above, it goes back to all the previously deferred candidates (in the

> current BB only) and performs the transformation.

>

> The main reason is to keep the patch conservative (and also simple), but

> it also means that the following function is not affected and is still

> 20% slower when compiled with native march and tuning compared to the

> generic one:

>

>      1  void mult(struct s *p1, struct s *p2)

>      2  {

>      3     int i, j, k;

>      4  

>      5     for(i=0; i<SIZE; ++i)

>      6     {

>      7        for(j=0; j<SIZE; ++j)

>      8        {

>      9           for(k=0; k<SIZE; ++k)

>     10           {

>     11              p1->c[i][j] += p1->a[i][k] * p1->b[k][j];

>     12              p2->c[i][j] += p2->a[i][k] * p2->b[k][j];

>     13           }

>     14        }

>     15     }

>     16  }

>

> I suppose that the best optimization for the above would be to split the

> loops, but one could probably construct at least an artificial testcase

> where the FMAs would keep enough locality that it is not the case.  The

> mechanism can be easily extended to keep track of not just one chain but

> a few, preferably as a followup, if people think it makes sense.

>

> An interesting observation is that the matrix multiplication does not

> suffer the penalty when compiled with -O3 -mprefer-vector-width=256.

> Apparently the 256 vector processing can hide the latency penalty when

> internally it is split into two halves.  The same goes for 512 bit

> vectors.  That is why the patch leaves those be - well, there is a param

> for the threshold which is set to zero for everybody but znver1.  If

> maintainers of any other architecture suspect that their FMAs might

> suffer similar latency problem, they can easily try tweaking that

> parameter and see what happens with the matrix multiplication example.

>

> I have bootstrapped and tested the patch on x86_64-linux (as it is and

> also with the param set to a 256 by default to make it trigger).  I have

> also measured run-times of all benchmarks in SPEC 2006 FP and SPEC 2017

> FPrate and the only changes are the big improvements of calculix and

> parest.

>


2018-01-09  Martin Jambor  <mjambor@suse.cz>

	PR target/81616
	* params.def: New parameter PARAM_AVOID_FMA_MAX_BITS.
	* tree-ssa-math-opts.c: Include domwalk.h.
	(convert_mult_to_fma_1): New function.
	(fma_transformation_info): New type.
	(fma_deferring_state): Likewise.
	(cancel_fma_deferring): New function.
	(result_of_phi): Likewise.
	(last_fma_candidate_feeds_initial_phi): Likewise.
	(convert_mult_to_fma): Added deferring logic, split actual
	transformation to convert_mult_to_fma_1.
	(math_opts_dom_walker): New type.
	(math_opts_dom_walker::after_dom_children): New method, body moved
	here from pass_optimize_widening_mul::execute, added deferring logic
	bits.
	(pass_optimize_widening_mul::execute): Moved most of code to
	math_opts_dom_walker::after_dom_children.
	* config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS): New.
	* config/i386/i386.c (ix86_option_override_internal): Added
	maybe_setting of PARAM_AVOID_FMA_MAX_BITS.
---
 gcc/config/i386/i386.c       |   5 +
 gcc/config/i386/x86-tune.def |   4 +
 gcc/params.def               |   5 +
 gcc/tree-ssa-math-opts.c     | 517 ++++++++++++++++++++++++++++++++-----------
 4 files changed, 403 insertions(+), 128 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 8696f931806..8200a136dd1 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -4900,6 +4900,11 @@ ix86_option_override_internal (bool main_args_p,
 	(cf_protection_level) (opts->x_flag_cf_protection | CF_SET);
     }
 
+  if (ix86_tune_features [X86_TUNE_AVOID_128FMA_CHAINS])
+    maybe_set_param_value (PARAM_AVOID_FMA_MAX_BITS, 128,
+			   opts->x_param_values,
+			   opts_set->x_param_values);
+
   return true;
 }
 
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 9401d51cdc1..0effce759f1 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -399,6 +399,10 @@ DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_pshufb",
 DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",
           m_SILVERMONT | m_INTEL)
 
+/* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
+   smaller FMA chain.  */
+DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1)
+
 /*****************************************************************************/
 /* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */
 /*****************************************************************************/
diff --git a/gcc/params.def b/gcc/params.def
index d9f8d8208a1..a0cd06db339 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1326,6 +1326,11 @@ DEFPARAM(PARAM_UNROLL_JAM_MAX_UNROLL,
 	 "Maximum unroll factor for the unroll-and-jam transformation.",
 	 4, 0, 0)
 
+DEFPARAM(PARAM_AVOID_FMA_MAX_BITS,
+	 "avoid-fma-max-bits",
+	 "Maximum number of bits for which we avoid creating FMAs.",
+	 0, 0, 512)
+
 /*
 
 Local variables:
diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c
index ea880c7b1d8..16d9399af0b 100644
--- a/gcc/tree-ssa-math-opts.c
+++ b/gcc/tree-ssa-math-opts.c
@@ -115,6 +115,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "optabs-libfuncs.h"
 #include "tree-eh.h"
 #include "targhooks.h"
+#include "domwalk.h"
 
 /* This structure represents one basic block that either computes a
    division, or is a common dominator for basic block that compute a
@@ -2639,17 +2640,214 @@ convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple *stmt,
   return true;
 }
 
+/* Given a result MUL_RESULT which is a result of a multiplication of OP1 and
+   OP2 and which we know is used in statements that can be, together with the
+   multiplication, converted to FMAs, perform the transformation.  */
+
+static void
+convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
+{
+  tree type = TREE_TYPE (mul_result);
+  gimple *use_stmt;
+  imm_use_iterator imm_iter;
+  gassign *fma_stmt;
+
+  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, mul_result)
+    {
+      gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt);
+      enum tree_code use_code;
+      tree addop, mulop1 = op1, result = mul_result;
+      bool negate_p = false;
+
+      if (is_gimple_debug (use_stmt))
+	continue;
+
+      use_code = gimple_assign_rhs_code (use_stmt);
+      if (use_code == NEGATE_EXPR)
+	{
+	  result = gimple_assign_lhs (use_stmt);
+	  use_operand_p use_p;
+	  gimple *neguse_stmt;
+	  single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt);
+	  gsi_remove (&gsi, true);
+	  release_defs (use_stmt);
+
+	  use_stmt = neguse_stmt;
+	  gsi = gsi_for_stmt (use_stmt);
+	  use_code = gimple_assign_rhs_code (use_stmt);
+	  negate_p = true;
+	}
+
+      if (gimple_assign_rhs1 (use_stmt) == result)
+	{
+	  addop = gimple_assign_rhs2 (use_stmt);
+	  /* a * b - c -> a * b + (-c)  */
+	  if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)
+	    addop = force_gimple_operand_gsi (&gsi,
+					      build1 (NEGATE_EXPR,
+						      type, addop),
+					      true, NULL_TREE, true,
+					      GSI_SAME_STMT);
+	}
+      else
+	{
+	  addop = gimple_assign_rhs1 (use_stmt);
+	  /* a - b * c -> (-b) * c + a */
+	  if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)
+	    negate_p = !negate_p;
+	}
+
+      if (negate_p)
+	mulop1 = force_gimple_operand_gsi (&gsi,
+					   build1 (NEGATE_EXPR,
+						   type, mulop1),
+					   true, NULL_TREE, true,
+					   GSI_SAME_STMT);
+
+      fma_stmt = gimple_build_assign (gimple_assign_lhs (use_stmt),
+				      FMA_EXPR, mulop1, op2, addop);
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Generated FMA ");
+	  print_gimple_stmt (dump_file, fma_stmt, 0, 0);
+	  fprintf (dump_file, "\n");
+	}
+
+      gsi_replace (&gsi, fma_stmt, true);
+      widen_mul_stats.fmas_inserted++;
+    }
+}
+
+/* Data necessary to perform the actual transformation from a multiplication
+   and an addition to an FMA after decision is taken it should be done and to
+   then delete the multiplication statement from the function IL.  */
+
+struct fma_transformation_info
+{
+  gimple *mul_stmt;
+  tree mul_result;
+  tree op1;
+  tree op2;
+};
+
+/* Structure containing the current state of FMA deferring, i.e. whether we are
+   deferring, whether to continue deferring, and all data necessary to come
+   back and perform all deferred transformations.  */
+
+class fma_deferring_state
+{
+public:
+  /* Class constructor.  Pass true as PERFORM_DEFERRING in order to actually
+     do any deferring.  */
+
+  fma_deferring_state (bool perform_deferring)
+    : m_candidates (), m_mul_result_set (), m_initial_phi (NULL),
+      m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {}
+
+  /* List of FMA candidates for which we the transformation has been determined
+     possible but we at this point in BB analysis we do not consider them
+     beneficial.  */
+  auto_vec<fma_transformation_info, 8> m_candidates;
+
+  /* Set of results of multiplication that are part of an already deferred FMA
+     candidates.  */
+  hash_set<tree> m_mul_result_set;
+
+  /* The PHI that supposedly feeds back result of a FMA to another over loop
+     boundary.  */
+  gphi *m_initial_phi;
+
+  /* Result of the last produced FMA candidate or NULL if there has not been
+     one.  */
+  tree m_last_result;
+
+  /* If true, deferring might still be profitable.  If false, transform all
+     candidates and no longer defer.  */
+  bool m_deferring_p;
+};
+
+/* Transform all deferred FMA candidates and mark STATE as no longer
+   deferring.  */
+
+static void
+cancel_fma_deferring (fma_deferring_state *state)
+{
+  if (!state->m_deferring_p)
+    return;
+
+  for (unsigned i = 0; i < state->m_candidates.length (); i++)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Generating deferred FMA\n");
+
+      const fma_transformation_info &fti = state->m_candidates[i];
+      convert_mult_to_fma_1 (fti.mul_result, fti.op1, fti.op2);
+
+      gimple_stmt_iterator gsi = gsi_for_stmt (fti.mul_stmt);
+      gsi_remove (&gsi, true);
+      release_defs (fti.mul_stmt);
+    }
+  state->m_deferring_p = false;
+}
+
+/* If OP is an SSA name defined by a PHI node, return the PHI statement.
+   Otherwise return NULL.  */
+
+static gphi *
+result_of_phi (tree op)
+{
+  if (TREE_CODE (op) != SSA_NAME)
+    return NULL;
+
+  return dyn_cast <gphi *> (SSA_NAME_DEF_STMT (op));
+}
+
+/* After processing statements of a BB and recording STATE, return true if the
+   initial phi is fed by the last FMA candidate result ore one such result from
+   previously processed BBs marked in LAST_RESULT_SET.  */
+
+static bool
+last_fma_candidate_feeds_initial_phi (fma_deferring_state *state,
+				      hash_set<tree> *last_result_set)
+{
+  ssa_op_iter iter;
+  use_operand_p use;
+  FOR_EACH_PHI_ARG (use, state->m_initial_phi, iter, SSA_OP_USE)
+    {
+      tree t = USE_FROM_PTR (use);
+      if (t == state->m_last_result
+	  || last_result_set->contains (t))
+	return true;
+    }
+
+  return false;
+}
+
 /* Combine the multiplication at MUL_STMT with operands MULOP1 and MULOP2
    with uses in additions and subtractions to form fused multiply-add
-   operations.  Returns true if successful and MUL_STMT should be removed.  */
+   operations.  Returns true if successful and MUL_STMT should be removed.
+
+   If STATE indicates that we are deferring FMA transformation, that means
+   that we do not produce FMAs for basic blocks which look like:
+
+    <bb 6>
+    # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
+    _65 = _14 * _16;
+    accumulator_66 = _65 + accumulator_111;
+
+  or its unrolled version, i.e. with several FMA candidates that feed result
+  of one into the addend of another.  Instead, we add them to a list in STATE
+  and if we later discover an FMA candidate that is not part of such a chain,
+  we go back and perform all deferred past candidates.  */
 
 static bool
-convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2)
+convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
+		     fma_deferring_state *state)
 {
   tree mul_result = gimple_get_lhs (mul_stmt);
   tree type = TREE_TYPE (mul_result);
   gimple *use_stmt, *neguse_stmt;
-  gassign *fma_stmt;
   use_operand_p use_p;
   imm_use_iterator imm_iter;
 
@@ -2673,6 +2871,11 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2)
   if (has_zero_uses (mul_result))
     return false;
 
+  bool check_defer
+    = (state->m_deferring_p
+       && (tree_to_shwi (TYPE_SIZE (type))
+	   <= PARAM_VALUE (PARAM_AVOID_FMA_MAX_BITS)));
+  bool defer = check_defer;
   /* Make sure that the multiplication statement becomes dead after
      the transformation, thus that all uses are transformed to FMAs.
      This means we assume that an FMA operation has the same cost
@@ -2770,77 +2973,92 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2)
 	    }
 	}
 
+      tree use_rhs1 = gimple_assign_rhs1 (use_stmt);
+      tree use_rhs2 = gimple_assign_rhs2 (use_stmt);
       /* We can't handle a * b + a * b.  */
-      if (gimple_assign_rhs1 (use_stmt) == gimple_assign_rhs2 (use_stmt))
+      if (use_rhs1 == use_rhs2)
+	return false;
+      /* If deferring, make sure we are not looking at an instruction that
+	 wouldn't have existed if we were not.  */
+      if (state->m_deferring_p
+	  && (state->m_mul_result_set.contains (use_rhs1)
+	      || state->m_mul_result_set.contains (use_rhs2)))
 	return false;
 
-      /* While it is possible to validate whether or not the exact form
-	 that we've recognized is available in the backend, the assumption
-	 is that the transformation is never a loss.  For instance, suppose
-	 the target only has the plain FMA pattern available.  Consider
-	 a*b-c -> fma(a,b,-c): we've exchanged MUL+SUB for FMA+NEG, which
-	 is still two operations.  Consider -(a*b)-c -> fma(-a,b,-c): we
-	 still have 3 operations, but in the FMA form the two NEGs are
-	 independent and could be run in parallel.  */
-    }
-
-  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, mul_result)
-    {
-      gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt);
-      enum tree_code use_code;
-      tree addop, mulop1 = op1, result = mul_result;
-      bool negate_p = false;
-
-      if (is_gimple_debug (use_stmt))
-	continue;
-
-      use_code = gimple_assign_rhs_code (use_stmt);
-      if (use_code == NEGATE_EXPR)
+      if (check_defer)
 	{
-	  result = gimple_assign_lhs (use_stmt);
-	  single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt);
-	  gsi_remove (&gsi, true);
-	  release_defs (use_stmt);
+	  tree use_lhs = gimple_assign_lhs (use_stmt);
+	  if (state->m_last_result)
+	    {
+	      if (use_rhs2 == state->m_last_result
+		  || use_rhs1 == state->m_last_result)
+		defer = true;
+	      else
+		defer = false;
+	    }
+	  else
+	    {
+	      gcc_checking_assert (!state->m_initial_phi);
+	      gphi *phi;
+	      if (use_rhs1 == result)
+		phi = result_of_phi (use_rhs2);
+	      else
+		{
+		  gcc_assert (use_rhs2 == result);
+		  phi = result_of_phi (use_rhs1);
+		}
 
-	  use_stmt = neguse_stmt;
-	  gsi = gsi_for_stmt (use_stmt);
-	  use_code = gimple_assign_rhs_code (use_stmt);
-	  negate_p = true;
-	}
+	      if (phi)
+		{
+		  state->m_initial_phi = phi;
+		  defer = true;
+		}
+	      else
+		defer = false;
+	    }
 
-      if (gimple_assign_rhs1 (use_stmt) == result)
-	{
-	  addop = gimple_assign_rhs2 (use_stmt);
-	  /* a * b - c -> a * b + (-c)  */
-	  if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)
-	    addop = force_gimple_operand_gsi (&gsi,
-					      build1 (NEGATE_EXPR,
-						      type, addop),
-					      true, NULL_TREE, true,
-					      GSI_SAME_STMT);
+	  state->m_last_result = use_lhs;
+	  check_defer = false;
 	}
       else
+	defer = false;
+
+      /* While it is possible to validate whether or not the exact form that
+	 we've recognized is available in the backend, the assumption is that
+	 if the deferring logic above did not trigger, the transformation is
+	 never a loss.  For instance, suppose the target only has the plain FMA
+	 pattern available.  Consider a*b-c -> fma(a,b,-c): we've exchanged
+	 MUL+SUB for FMA+NEG, which is still two operations.  Consider
+         -(a*b)-c -> fma(-a,b,-c): we still have 3 operations, but in the FMA
+	 form the two NEGs are independent and could be run in parallel.  */
+    }
+
+  if (defer)
+    {
+      fma_transformation_info fti;
+      fti.mul_stmt = mul_stmt;
+      fti.mul_result = mul_result;
+      fti.op1 = op1;
+      fti.op2 = op2;
+      state->m_candidates.safe_push (fti);
+      state->m_mul_result_set.add (mul_result);
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
 	{
-	  addop = gimple_assign_rhs1 (use_stmt);
-	  /* a - b * c -> (-b) * c + a */
-	  if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)
-	    negate_p = !negate_p;
+	  fprintf (dump_file, "Deferred generating FMA for multiplication ");
+	  print_gimple_stmt (dump_file, mul_stmt, 0, 0);
+	  fprintf (dump_file, "\n");
 	}
 
-      if (negate_p)
-	mulop1 = force_gimple_operand_gsi (&gsi,
-					   build1 (NEGATE_EXPR,
-						   type, mulop1),
-					   true, NULL_TREE, true,
-					   GSI_SAME_STMT);
-
-      fma_stmt = gimple_build_assign (gimple_assign_lhs (use_stmt),
-				      FMA_EXPR, mulop1, op2, addop);
-      gsi_replace (&gsi, fma_stmt, true);
-      widen_mul_stats.fmas_inserted++;
+      return false;
+    }
+  else
+    {
+      if (state->m_deferring_p)
+	cancel_fma_deferring (state);
+      convert_mult_to_fma_1 (mul_result, op1, op2);
+      return true;
     }
-
-  return true;
 }
 
 
@@ -3270,92 +3488,135 @@ public:
 
 }; // class pass_optimize_widening_mul
 
-unsigned int
-pass_optimize_widening_mul::execute (function *fun)
+/* Walker class to perform the transformation in reverse dominance order. */
+
+class math_opts_dom_walker : public dom_walker
 {
-  basic_block bb;
-  bool cfg_changed = false;
+public:
+  /* Constructor, CFG_CHANGED is a pointer to a boolean flag that will be set
+     if walking modidifes the CFG.  */
 
-  memset (&widen_mul_stats, 0, sizeof (widen_mul_stats));
-  calculate_dominance_info (CDI_DOMINATORS);
-  renumber_gimple_stmt_uids ();
+  math_opts_dom_walker (bool *cfg_changed_p)
+    : dom_walker (CDI_DOMINATORS), m_last_result_set (),
+      m_cfg_changed_p (cfg_changed_p) {}
 
-  FOR_EACH_BB_FN (bb, fun)
+  /* The actual actions performed in the walk.  */
+
+  virtual void after_dom_children (basic_block);
+
+  /* Set of results of chains of multiply and add statement combinations that
+     were not transformed into FMAs because of active deferring.  */
+  hash_set<tree> m_last_result_set;
+
+  /* Pointer to a flag of the user that needs to be set if CFG has been
+     modified.  */
+  bool *m_cfg_changed_p;
+};
+
+void
+math_opts_dom_walker::after_dom_children (basic_block bb)
+{
+  gimple_stmt_iterator gsi;
+
+  fma_deferring_state fma_state (PARAM_VALUE (PARAM_AVOID_FMA_MAX_BITS) > 0);
+
+  for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi);)
     {
-      gimple_stmt_iterator gsi;
+      gimple *stmt = gsi_stmt (gsi);
+      enum tree_code code;
 
-      for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi);)
-        {
-	  gimple *stmt = gsi_stmt (gsi);
-	  enum tree_code code;
+      if (is_gimple_assign (stmt))
+	{
+	  code = gimple_assign_rhs_code (stmt);
+	  switch (code)
+	    {
+	    case MULT_EXPR:
+	      if (!convert_mult_to_widen (stmt, &gsi)
+		  && !convert_expand_mult_copysign (stmt, &gsi)
+		  && convert_mult_to_fma (stmt,
+					  gimple_assign_rhs1 (stmt),
+					  gimple_assign_rhs2 (stmt),
+					  &fma_state))
+		{
+		  gsi_remove (&gsi, true);
+		  release_defs (stmt);
+		  continue;
+		}
+	      break;
+
+	    case PLUS_EXPR:
+	    case MINUS_EXPR:
+	      if (!convert_plusminus_to_widen (&gsi, stmt, code))
+		match_uaddsub_overflow (&gsi, stmt, code);
+	      break;
 
-	  if (is_gimple_assign (stmt))
+	    case TRUNC_MOD_EXPR:
+	      convert_to_divmod (as_a<gassign *> (stmt));
+	      break;
+
+	    default:;
+	    }
+	}
+      else if (is_gimple_call (stmt))
+	{
+	  tree fndecl = gimple_call_fndecl (stmt);
+	  if (fndecl && gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
 	    {
-	      code = gimple_assign_rhs_code (stmt);
-	      switch (code)
+	      switch (DECL_FUNCTION_CODE (fndecl))
 		{
-		case MULT_EXPR:
-		  if (!convert_mult_to_widen (stmt, &gsi)
-		      && !convert_expand_mult_copysign (stmt, &gsi)
+		case BUILT_IN_POWF:
+		case BUILT_IN_POW:
+		case BUILT_IN_POWL:
+		  if (gimple_call_lhs (stmt)
+		      && TREE_CODE (gimple_call_arg (stmt, 1)) == REAL_CST
+		      && real_equal
+		      (&TREE_REAL_CST (gimple_call_arg (stmt, 1)),
+		       &dconst2)
 		      && convert_mult_to_fma (stmt,
-					      gimple_assign_rhs1 (stmt),
-					      gimple_assign_rhs2 (stmt)))
+					      gimple_call_arg (stmt, 0),
+					      gimple_call_arg (stmt, 0),
+					      &fma_state))
 		    {
-		      gsi_remove (&gsi, true);
+		      unlink_stmt_vdef (stmt);
+		      if (gsi_remove (&gsi, true)
+			  && gimple_purge_dead_eh_edges (bb))
+			*m_cfg_changed_p = true;
 		      release_defs (stmt);
 		      continue;
 		    }
 		  break;
 
-		case PLUS_EXPR:
-		case MINUS_EXPR:
-		  if (!convert_plusminus_to_widen (&gsi, stmt, code))
-		    match_uaddsub_overflow (&gsi, stmt, code);
-		  break;
-
-		case TRUNC_MOD_EXPR:
-		  convert_to_divmod (as_a<gassign *> (stmt));
-		  break;
-
 		default:;
 		}
 	    }
-	  else if (is_gimple_call (stmt)
-		   && gimple_call_lhs (stmt))
-	    {
-	      tree fndecl = gimple_call_fndecl (stmt);
-	      if (fndecl
-		  && gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
-		{
-		  switch (DECL_FUNCTION_CODE (fndecl))
-		    {
-		      case BUILT_IN_POWF:
-		      case BUILT_IN_POW:
-		      case BUILT_IN_POWL:
-			if (TREE_CODE (gimple_call_arg (stmt, 1)) == REAL_CST
-			    && real_equal
-			         (&TREE_REAL_CST (gimple_call_arg (stmt, 1)),
-				  &dconst2)
-			    && convert_mult_to_fma (stmt,
-						    gimple_call_arg (stmt, 0),
-						    gimple_call_arg (stmt, 0)))
-			  {
-			    unlink_stmt_vdef (stmt);
-			    if (gsi_remove (&gsi, true)
-				&& gimple_purge_dead_eh_edges (bb))
-			      cfg_changed = true;
-			    release_defs (stmt);
-			    continue;
-			  }
-			  break;
-
-		      default:;
-		    }
-		}
-	    }
-	  gsi_next (&gsi);
+	  else
+	    cancel_fma_deferring (&fma_state);
 	}
+      gsi_next (&gsi);
     }
+  if (fma_state.m_deferring_p
+      && fma_state.m_initial_phi)
+    {
+      gcc_checking_assert (fma_state.m_last_result);
+      if (!last_fma_candidate_feeds_initial_phi (&fma_state,
+						 &m_last_result_set))
+	cancel_fma_deferring (&fma_state);
+      else
+	m_last_result_set.add (fma_state.m_last_result);
+    }
+}
+
+
+unsigned int
+pass_optimize_widening_mul::execute (function *fun)
+{
+  bool cfg_changed = false;
+
+  memset (&widen_mul_stats, 0, sizeof (widen_mul_stats));
+  calculate_dominance_info (CDI_DOMINATORS);
+  renumber_gimple_stmt_uids ();
+
+  math_opts_dom_walker (&cfg_changed).walk (ENTRY_BLOCK_PTR_FOR_FN (cfun));
 
   statistics_counter_event (fun, "widening multiplications inserted",
 			    widen_mul_stats.widen_mults_inserted);
-- 
2.15.1
Jeff Law Jan. 10, 2018, 7:08 p.m. | #4
On 01/10/2018 11:46 AM, Martin Jambor wrote:
> Hello,

> 

> I would really like to ping the FMA transformation prevention patch that

> I sent here in December, which, after incorporating a suggestion from

> Richi, re-base and re-testing, I re-post below.  I really think that it

> should make into gcc 8 in some form, because the performance wins are

> really big.

> 

> I am still opened to all sorts of comments, of course, especially to

> suggestions about the form of the param controlling this behavior (or

> how to communicate that we want to do this on Zen in general).  It might

> even be a binary value since not forming FMAs does not seem to harm

> bigger vectors either (as far as we know, I should add).

> 

> For the record, I have not yet received any information from AMD why

> FMAs on 256bit vectors do not have this problem, I assume all people

> that could give an authoritative answer are now looking into covert

> channel mitigation stuff.  But very probably it is just that the

> internal split FMAs can be scheduled so that while one is still waiting

> for its addend, another can already execute.

Both are likely true...

Hoping Richi will look at the patch, but it's certainly in my queue of
things to look at if he doesn't get to  it.

jeff
Richard Biener Jan. 12, 2018, 12:14 p.m. | #5
On Wed, 10 Jan 2018, Martin Jambor wrote:

> Hello,

> 

> I would really like to ping the FMA transformation prevention patch that

> I sent here in December, which, after incorporating a suggestion from

> Richi, re-base and re-testing, I re-post below.  I really think that it

> should make into gcc 8 in some form, because the performance wins are

> really big.

> 

> I am still opened to all sorts of comments, of course, especially to

> suggestions about the form of the param controlling this behavior (or

> how to communicate that we want to do this on Zen in general).  It might

> even be a binary value since not forming FMAs does not seem to harm

> bigger vectors either (as far as we know, I should add).

> 

> For the record, I have not yet received any information from AMD why

> FMAs on 256bit vectors do not have this problem, I assume all people

> that could give an authoritative answer are now looking into covert

> channel mitigation stuff.  But very probably it is just that the

> internal split FMAs can be scheduled so that while one is still waiting

> for its addend, another can already execute.


OK, and sorry for the delay.

Thanks,
Richard.

> Thanks,

> 

> Martin

> 

> 

> On Fri, Dec 15 2017, Martin Jambor wrote:

> > Hello,

> >

> > the patch below prevents creation if fused-multiply-and-add instructions

> > in the widening_mul gimple pass on the Zen-based AMD CPUs and as a

> > result fixes regressions of native znver1 tuning when compared to

> > generic tuning in:

> >

> >   - the matrix.c testcase of PR 81616 (straightforward matrix

> >     multiplication) at -O2 and -O3 which is currently 60% (!),

> >

> >   - SPEC 2006 454.calculix at -O2, which is currently over 20%, and

> >

> >   - SPEC 2017 510.parest at -O2 and -Ofast, which is currently also

> >     about 20% in both cases.

> >

> > The basic idea is to detect loops in the following form:

> >

> >     <bb 6>

> >     # accumulator_111 = PHI <0.0(5), accumulator_66(6)>

> >     ...

> >     _65 = _14 * _16;

> >     accumulator_66 = _65 + accumulator_111;

> >

> > and prevents from creating FMA for it.  Because at least in the parest

> > and calculix cases it has to, it also deals with more than one chain of

> > FMA candidates that feed the next one's addend:

> >

> >

> >     <bb 6>

> >     # accumulator_111 = PHI <0.0(5), accumulator_66(6)>

> >     ...

> >     _65 = _14 * _16;

> >     accumulator_55 = _65 + accumulator_111;

> >     _65 = _24 * _36;

> >     accumulator_66 = _65 + accumulator_55;

> >

> > Unfortunately, to really get rid of the calculix regression, the

> > algorithm cannot just look at one BB at a time but also has to work for

> > cases like the following:

> >

> >      1  void mult(void)

> >      2  {

> >      3      int i, j, k, l;

> >      4  

> >      5     for(i=0; i<SIZE; ++i)

> >      6     {

> >      7        for(j=0; j<SIZE; ++j)

> >      8        {

> >      9           for(l=0; l<SIZE; l+=10)

> >     10           {

> >     11               c[i][j] += a[i][l] * b[k][l];

> >     12               for(k=1; k<10; ++k)

> >     13               {

> >     14                   c[i][j] += a[i][l+k] * b[k][l+k];

> >     15               }

> >     16  

> >     17           }

> >     18        }

> >     19     }

> >     20  }

> >

> > where the FMA on line 14 feeds into the one on line 11 in an

> > encompassing loop.  Therefore I have changed the structure of the pass

> > to work in reverse dominance order and it keeps a hash set of results of

> > rejected FMAs candidates which it checks when looking at PHI nodes of

> > the current BB.  Without this reorganization, calculix was still 8%

> > slower with native tuning than with generic one.

> >

> > When the deferring mechanism realizes that in the current BB, the FMA

> > candidates do not all form a one chain tight-loop like in the examples

> > above, it goes back to all the previously deferred candidates (in the

> > current BB only) and performs the transformation.

> >

> > The main reason is to keep the patch conservative (and also simple), but

> > it also means that the following function is not affected and is still

> > 20% slower when compiled with native march and tuning compared to the

> > generic one:

> >

> >      1  void mult(struct s *p1, struct s *p2)

> >      2  {

> >      3     int i, j, k;

> >      4  

> >      5     for(i=0; i<SIZE; ++i)

> >      6     {

> >      7        for(j=0; j<SIZE; ++j)

> >      8        {

> >      9           for(k=0; k<SIZE; ++k)

> >     10           {

> >     11              p1->c[i][j] += p1->a[i][k] * p1->b[k][j];

> >     12              p2->c[i][j] += p2->a[i][k] * p2->b[k][j];

> >     13           }

> >     14        }

> >     15     }

> >     16  }

> >

> > I suppose that the best optimization for the above would be to split the

> > loops, but one could probably construct at least an artificial testcase

> > where the FMAs would keep enough locality that it is not the case.  The

> > mechanism can be easily extended to keep track of not just one chain but

> > a few, preferably as a followup, if people think it makes sense.

> >

> > An interesting observation is that the matrix multiplication does not

> > suffer the penalty when compiled with -O3 -mprefer-vector-width=256.

> > Apparently the 256 vector processing can hide the latency penalty when

> > internally it is split into two halves.  The same goes for 512 bit

> > vectors.  That is why the patch leaves those be - well, there is a param

> > for the threshold which is set to zero for everybody but znver1.  If

> > maintainers of any other architecture suspect that their FMAs might

> > suffer similar latency problem, they can easily try tweaking that

> > parameter and see what happens with the matrix multiplication example.

> >

> > I have bootstrapped and tested the patch on x86_64-linux (as it is and

> > also with the param set to a 256 by default to make it trigger).  I have

> > also measured run-times of all benchmarks in SPEC 2006 FP and SPEC 2017

> > FPrate and the only changes are the big improvements of calculix and

> > parest.

> >

> 

> 2018-01-09  Martin Jambor  <mjambor@suse.cz>

> 

> 	PR target/81616

> 	* params.def: New parameter PARAM_AVOID_FMA_MAX_BITS.

> 	* tree-ssa-math-opts.c: Include domwalk.h.

> 	(convert_mult_to_fma_1): New function.

> 	(fma_transformation_info): New type.

> 	(fma_deferring_state): Likewise.

> 	(cancel_fma_deferring): New function.

> 	(result_of_phi): Likewise.

> 	(last_fma_candidate_feeds_initial_phi): Likewise.

> 	(convert_mult_to_fma): Added deferring logic, split actual

> 	transformation to convert_mult_to_fma_1.

> 	(math_opts_dom_walker): New type.

> 	(math_opts_dom_walker::after_dom_children): New method, body moved

> 	here from pass_optimize_widening_mul::execute, added deferring logic

> 	bits.

> 	(pass_optimize_widening_mul::execute): Moved most of code to

> 	math_opts_dom_walker::after_dom_children.

> 	* config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS): New.

> 	* config/i386/i386.c (ix86_option_override_internal): Added

> 	maybe_setting of PARAM_AVOID_FMA_MAX_BITS.

> ---

>  gcc/config/i386/i386.c       |   5 +

>  gcc/config/i386/x86-tune.def |   4 +

>  gcc/params.def               |   5 +

>  gcc/tree-ssa-math-opts.c     | 517 ++++++++++++++++++++++++++++++++-----------

>  4 files changed, 403 insertions(+), 128 deletions(-)

> 

> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c

> index 8696f931806..8200a136dd1 100644

> --- a/gcc/config/i386/i386.c

> +++ b/gcc/config/i386/i386.c

> @@ -4900,6 +4900,11 @@ ix86_option_override_internal (bool main_args_p,

>  	(cf_protection_level) (opts->x_flag_cf_protection | CF_SET);

>      }

>  

> +  if (ix86_tune_features [X86_TUNE_AVOID_128FMA_CHAINS])

> +    maybe_set_param_value (PARAM_AVOID_FMA_MAX_BITS, 128,

> +			   opts->x_param_values,

> +			   opts_set->x_param_values);

> +

>    return true;

>  }

>  

> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def

> index 9401d51cdc1..0effce759f1 100644

> --- a/gcc/config/i386/x86-tune.def

> +++ b/gcc/config/i386/x86-tune.def

> @@ -399,6 +399,10 @@ DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_pshufb",

>  DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",

>            m_SILVERMONT | m_INTEL)

>  

> +/* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or

> +   smaller FMA chain.  */

> +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1)

> +

>  /*****************************************************************************/

>  /* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */

>  /*****************************************************************************/

> diff --git a/gcc/params.def b/gcc/params.def

> index d9f8d8208a1..a0cd06db339 100644

> --- a/gcc/params.def

> +++ b/gcc/params.def

> @@ -1326,6 +1326,11 @@ DEFPARAM(PARAM_UNROLL_JAM_MAX_UNROLL,

>  	 "Maximum unroll factor for the unroll-and-jam transformation.",

>  	 4, 0, 0)

>  

> +DEFPARAM(PARAM_AVOID_FMA_MAX_BITS,

> +	 "avoid-fma-max-bits",

> +	 "Maximum number of bits for which we avoid creating FMAs.",

> +	 0, 0, 512)

> +

>  /*

>  

>  Local variables:

> diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c

> index ea880c7b1d8..16d9399af0b 100644

> --- a/gcc/tree-ssa-math-opts.c

> +++ b/gcc/tree-ssa-math-opts.c

> @@ -115,6 +115,7 @@ along with GCC; see the file COPYING3.  If not see

>  #include "optabs-libfuncs.h"

>  #include "tree-eh.h"

>  #include "targhooks.h"

> +#include "domwalk.h"

>  

>  /* This structure represents one basic block that either computes a

>     division, or is a common dominator for basic block that compute a

> @@ -2639,17 +2640,214 @@ convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple *stmt,

>    return true;

>  }

>  

> +/* Given a result MUL_RESULT which is a result of a multiplication of OP1 and

> +   OP2 and which we know is used in statements that can be, together with the

> +   multiplication, converted to FMAs, perform the transformation.  */

> +

> +static void

> +convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)

> +{

> +  tree type = TREE_TYPE (mul_result);

> +  gimple *use_stmt;

> +  imm_use_iterator imm_iter;

> +  gassign *fma_stmt;

> +

> +  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, mul_result)

> +    {

> +      gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt);

> +      enum tree_code use_code;

> +      tree addop, mulop1 = op1, result = mul_result;

> +      bool negate_p = false;

> +

> +      if (is_gimple_debug (use_stmt))

> +	continue;

> +

> +      use_code = gimple_assign_rhs_code (use_stmt);

> +      if (use_code == NEGATE_EXPR)

> +	{

> +	  result = gimple_assign_lhs (use_stmt);

> +	  use_operand_p use_p;

> +	  gimple *neguse_stmt;

> +	  single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt);

> +	  gsi_remove (&gsi, true);

> +	  release_defs (use_stmt);

> +

> +	  use_stmt = neguse_stmt;

> +	  gsi = gsi_for_stmt (use_stmt);

> +	  use_code = gimple_assign_rhs_code (use_stmt);

> +	  negate_p = true;

> +	}

> +

> +      if (gimple_assign_rhs1 (use_stmt) == result)

> +	{

> +	  addop = gimple_assign_rhs2 (use_stmt);

> +	  /* a * b - c -> a * b + (-c)  */

> +	  if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)

> +	    addop = force_gimple_operand_gsi (&gsi,

> +					      build1 (NEGATE_EXPR,

> +						      type, addop),

> +					      true, NULL_TREE, true,

> +					      GSI_SAME_STMT);

> +	}

> +      else

> +	{

> +	  addop = gimple_assign_rhs1 (use_stmt);

> +	  /* a - b * c -> (-b) * c + a */

> +	  if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)

> +	    negate_p = !negate_p;

> +	}

> +

> +      if (negate_p)

> +	mulop1 = force_gimple_operand_gsi (&gsi,

> +					   build1 (NEGATE_EXPR,

> +						   type, mulop1),

> +					   true, NULL_TREE, true,

> +					   GSI_SAME_STMT);

> +

> +      fma_stmt = gimple_build_assign (gimple_assign_lhs (use_stmt),

> +				      FMA_EXPR, mulop1, op2, addop);

> +

> +      if (dump_file && (dump_flags & TDF_DETAILS))

> +	{

> +	  fprintf (dump_file, "Generated FMA ");

> +	  print_gimple_stmt (dump_file, fma_stmt, 0, 0);

> +	  fprintf (dump_file, "\n");

> +	}

> +

> +      gsi_replace (&gsi, fma_stmt, true);

> +      widen_mul_stats.fmas_inserted++;

> +    }

> +}

> +

> +/* Data necessary to perform the actual transformation from a multiplication

> +   and an addition to an FMA after decision is taken it should be done and to

> +   then delete the multiplication statement from the function IL.  */

> +

> +struct fma_transformation_info

> +{

> +  gimple *mul_stmt;

> +  tree mul_result;

> +  tree op1;

> +  tree op2;

> +};

> +

> +/* Structure containing the current state of FMA deferring, i.e. whether we are

> +   deferring, whether to continue deferring, and all data necessary to come

> +   back and perform all deferred transformations.  */

> +

> +class fma_deferring_state

> +{

> +public:

> +  /* Class constructor.  Pass true as PERFORM_DEFERRING in order to actually

> +     do any deferring.  */

> +

> +  fma_deferring_state (bool perform_deferring)

> +    : m_candidates (), m_mul_result_set (), m_initial_phi (NULL),

> +      m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {}

> +

> +  /* List of FMA candidates for which we the transformation has been determined

> +     possible but we at this point in BB analysis we do not consider them

> +     beneficial.  */

> +  auto_vec<fma_transformation_info, 8> m_candidates;

> +

> +  /* Set of results of multiplication that are part of an already deferred FMA

> +     candidates.  */

> +  hash_set<tree> m_mul_result_set;

> +

> +  /* The PHI that supposedly feeds back result of a FMA to another over loop

> +     boundary.  */

> +  gphi *m_initial_phi;

> +

> +  /* Result of the last produced FMA candidate or NULL if there has not been

> +     one.  */

> +  tree m_last_result;

> +

> +  /* If true, deferring might still be profitable.  If false, transform all

> +     candidates and no longer defer.  */

> +  bool m_deferring_p;

> +};

> +

> +/* Transform all deferred FMA candidates and mark STATE as no longer

> +   deferring.  */

> +

> +static void

> +cancel_fma_deferring (fma_deferring_state *state)

> +{

> +  if (!state->m_deferring_p)

> +    return;

> +

> +  for (unsigned i = 0; i < state->m_candidates.length (); i++)

> +    {

> +      if (dump_file && (dump_flags & TDF_DETAILS))

> +	fprintf (dump_file, "Generating deferred FMA\n");

> +

> +      const fma_transformation_info &fti = state->m_candidates[i];

> +      convert_mult_to_fma_1 (fti.mul_result, fti.op1, fti.op2);

> +

> +      gimple_stmt_iterator gsi = gsi_for_stmt (fti.mul_stmt);

> +      gsi_remove (&gsi, true);

> +      release_defs (fti.mul_stmt);

> +    }

> +  state->m_deferring_p = false;

> +}

> +

> +/* If OP is an SSA name defined by a PHI node, return the PHI statement.

> +   Otherwise return NULL.  */

> +

> +static gphi *

> +result_of_phi (tree op)

> +{

> +  if (TREE_CODE (op) != SSA_NAME)

> +    return NULL;

> +

> +  return dyn_cast <gphi *> (SSA_NAME_DEF_STMT (op));

> +}

> +

> +/* After processing statements of a BB and recording STATE, return true if the

> +   initial phi is fed by the last FMA candidate result ore one such result from

> +   previously processed BBs marked in LAST_RESULT_SET.  */

> +

> +static bool

> +last_fma_candidate_feeds_initial_phi (fma_deferring_state *state,

> +				      hash_set<tree> *last_result_set)

> +{

> +  ssa_op_iter iter;

> +  use_operand_p use;

> +  FOR_EACH_PHI_ARG (use, state->m_initial_phi, iter, SSA_OP_USE)

> +    {

> +      tree t = USE_FROM_PTR (use);

> +      if (t == state->m_last_result

> +	  || last_result_set->contains (t))

> +	return true;

> +    }

> +

> +  return false;

> +}

> +

>  /* Combine the multiplication at MUL_STMT with operands MULOP1 and MULOP2

>     with uses in additions and subtractions to form fused multiply-add

> -   operations.  Returns true if successful and MUL_STMT should be removed.  */

> +   operations.  Returns true if successful and MUL_STMT should be removed.

> +

> +   If STATE indicates that we are deferring FMA transformation, that means

> +   that we do not produce FMAs for basic blocks which look like:

> +

> +    <bb 6>

> +    # accumulator_111 = PHI <0.0(5), accumulator_66(6)>

> +    _65 = _14 * _16;

> +    accumulator_66 = _65 + accumulator_111;

> +

> +  or its unrolled version, i.e. with several FMA candidates that feed result

> +  of one into the addend of another.  Instead, we add them to a list in STATE

> +  and if we later discover an FMA candidate that is not part of such a chain,

> +  we go back and perform all deferred past candidates.  */

>  

>  static bool

> -convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2)

> +convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,

> +		     fma_deferring_state *state)

>  {

>    tree mul_result = gimple_get_lhs (mul_stmt);

>    tree type = TREE_TYPE (mul_result);

>    gimple *use_stmt, *neguse_stmt;

> -  gassign *fma_stmt;

>    use_operand_p use_p;

>    imm_use_iterator imm_iter;

>  

> @@ -2673,6 +2871,11 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2)

>    if (has_zero_uses (mul_result))

>      return false;

>  

> +  bool check_defer

> +    = (state->m_deferring_p

> +       && (tree_to_shwi (TYPE_SIZE (type))

> +	   <= PARAM_VALUE (PARAM_AVOID_FMA_MAX_BITS)));

> +  bool defer = check_defer;

>    /* Make sure that the multiplication statement becomes dead after

>       the transformation, thus that all uses are transformed to FMAs.

>       This means we assume that an FMA operation has the same cost

> @@ -2770,77 +2973,92 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2)

>  	    }

>  	}

>  

> +      tree use_rhs1 = gimple_assign_rhs1 (use_stmt);

> +      tree use_rhs2 = gimple_assign_rhs2 (use_stmt);

>        /* We can't handle a * b + a * b.  */

> -      if (gimple_assign_rhs1 (use_stmt) == gimple_assign_rhs2 (use_stmt))

> +      if (use_rhs1 == use_rhs2)

> +	return false;

> +      /* If deferring, make sure we are not looking at an instruction that

> +	 wouldn't have existed if we were not.  */

> +      if (state->m_deferring_p

> +	  && (state->m_mul_result_set.contains (use_rhs1)

> +	      || state->m_mul_result_set.contains (use_rhs2)))

>  	return false;

>  

> -      /* While it is possible to validate whether or not the exact form

> -	 that we've recognized is available in the backend, the assumption

> -	 is that the transformation is never a loss.  For instance, suppose

> -	 the target only has the plain FMA pattern available.  Consider

> -	 a*b-c -> fma(a,b,-c): we've exchanged MUL+SUB for FMA+NEG, which

> -	 is still two operations.  Consider -(a*b)-c -> fma(-a,b,-c): we

> -	 still have 3 operations, but in the FMA form the two NEGs are

> -	 independent and could be run in parallel.  */

> -    }

> -

> -  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, mul_result)

> -    {

> -      gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt);

> -      enum tree_code use_code;

> -      tree addop, mulop1 = op1, result = mul_result;

> -      bool negate_p = false;

> -

> -      if (is_gimple_debug (use_stmt))

> -	continue;

> -

> -      use_code = gimple_assign_rhs_code (use_stmt);

> -      if (use_code == NEGATE_EXPR)

> +      if (check_defer)

>  	{

> -	  result = gimple_assign_lhs (use_stmt);

> -	  single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt);

> -	  gsi_remove (&gsi, true);

> -	  release_defs (use_stmt);

> +	  tree use_lhs = gimple_assign_lhs (use_stmt);

> +	  if (state->m_last_result)

> +	    {

> +	      if (use_rhs2 == state->m_last_result

> +		  || use_rhs1 == state->m_last_result)

> +		defer = true;

> +	      else

> +		defer = false;

> +	    }

> +	  else

> +	    {

> +	      gcc_checking_assert (!state->m_initial_phi);

> +	      gphi *phi;

> +	      if (use_rhs1 == result)

> +		phi = result_of_phi (use_rhs2);

> +	      else

> +		{

> +		  gcc_assert (use_rhs2 == result);

> +		  phi = result_of_phi (use_rhs1);

> +		}

>  

> -	  use_stmt = neguse_stmt;

> -	  gsi = gsi_for_stmt (use_stmt);

> -	  use_code = gimple_assign_rhs_code (use_stmt);

> -	  negate_p = true;

> -	}

> +	      if (phi)

> +		{

> +		  state->m_initial_phi = phi;

> +		  defer = true;

> +		}

> +	      else

> +		defer = false;

> +	    }

>  

> -      if (gimple_assign_rhs1 (use_stmt) == result)

> -	{

> -	  addop = gimple_assign_rhs2 (use_stmt);

> -	  /* a * b - c -> a * b + (-c)  */

> -	  if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)

> -	    addop = force_gimple_operand_gsi (&gsi,

> -					      build1 (NEGATE_EXPR,

> -						      type, addop),

> -					      true, NULL_TREE, true,

> -					      GSI_SAME_STMT);

> +	  state->m_last_result = use_lhs;

> +	  check_defer = false;

>  	}

>        else

> +	defer = false;

> +

> +      /* While it is possible to validate whether or not the exact form that

> +	 we've recognized is available in the backend, the assumption is that

> +	 if the deferring logic above did not trigger, the transformation is

> +	 never a loss.  For instance, suppose the target only has the plain FMA

> +	 pattern available.  Consider a*b-c -> fma(a,b,-c): we've exchanged

> +	 MUL+SUB for FMA+NEG, which is still two operations.  Consider

> +         -(a*b)-c -> fma(-a,b,-c): we still have 3 operations, but in the FMA

> +	 form the two NEGs are independent and could be run in parallel.  */

> +    }

> +

> +  if (defer)

> +    {

> +      fma_transformation_info fti;

> +      fti.mul_stmt = mul_stmt;

> +      fti.mul_result = mul_result;

> +      fti.op1 = op1;

> +      fti.op2 = op2;

> +      state->m_candidates.safe_push (fti);

> +      state->m_mul_result_set.add (mul_result);

> +

> +      if (dump_file && (dump_flags & TDF_DETAILS))

>  	{

> -	  addop = gimple_assign_rhs1 (use_stmt);

> -	  /* a - b * c -> (-b) * c + a */

> -	  if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)

> -	    negate_p = !negate_p;

> +	  fprintf (dump_file, "Deferred generating FMA for multiplication ");

> +	  print_gimple_stmt (dump_file, mul_stmt, 0, 0);

> +	  fprintf (dump_file, "\n");

>  	}

>  

> -      if (negate_p)

> -	mulop1 = force_gimple_operand_gsi (&gsi,

> -					   build1 (NEGATE_EXPR,

> -						   type, mulop1),

> -					   true, NULL_TREE, true,

> -					   GSI_SAME_STMT);

> -

> -      fma_stmt = gimple_build_assign (gimple_assign_lhs (use_stmt),

> -				      FMA_EXPR, mulop1, op2, addop);

> -      gsi_replace (&gsi, fma_stmt, true);

> -      widen_mul_stats.fmas_inserted++;

> +      return false;

> +    }

> +  else

> +    {

> +      if (state->m_deferring_p)

> +	cancel_fma_deferring (state);

> +      convert_mult_to_fma_1 (mul_result, op1, op2);

> +      return true;

>      }

> -

> -  return true;

>  }

>  

>  

> @@ -3270,92 +3488,135 @@ public:

>  

>  }; // class pass_optimize_widening_mul

>  

> -unsigned int

> -pass_optimize_widening_mul::execute (function *fun)

> +/* Walker class to perform the transformation in reverse dominance order. */

> +

> +class math_opts_dom_walker : public dom_walker

>  {

> -  basic_block bb;

> -  bool cfg_changed = false;

> +public:

> +  /* Constructor, CFG_CHANGED is a pointer to a boolean flag that will be set

> +     if walking modidifes the CFG.  */

>  

> -  memset (&widen_mul_stats, 0, sizeof (widen_mul_stats));

> -  calculate_dominance_info (CDI_DOMINATORS);

> -  renumber_gimple_stmt_uids ();

> +  math_opts_dom_walker (bool *cfg_changed_p)

> +    : dom_walker (CDI_DOMINATORS), m_last_result_set (),

> +      m_cfg_changed_p (cfg_changed_p) {}

>  

> -  FOR_EACH_BB_FN (bb, fun)

> +  /* The actual actions performed in the walk.  */

> +

> +  virtual void after_dom_children (basic_block);

> +

> +  /* Set of results of chains of multiply and add statement combinations that

> +     were not transformed into FMAs because of active deferring.  */

> +  hash_set<tree> m_last_result_set;

> +

> +  /* Pointer to a flag of the user that needs to be set if CFG has been

> +     modified.  */

> +  bool *m_cfg_changed_p;

> +};

> +

> +void

> +math_opts_dom_walker::after_dom_children (basic_block bb)

> +{

> +  gimple_stmt_iterator gsi;

> +

> +  fma_deferring_state fma_state (PARAM_VALUE (PARAM_AVOID_FMA_MAX_BITS) > 0);

> +

> +  for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi);)

>      {

> -      gimple_stmt_iterator gsi;

> +      gimple *stmt = gsi_stmt (gsi);

> +      enum tree_code code;

>  

> -      for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi);)

> -        {

> -	  gimple *stmt = gsi_stmt (gsi);

> -	  enum tree_code code;

> +      if (is_gimple_assign (stmt))

> +	{

> +	  code = gimple_assign_rhs_code (stmt);

> +	  switch (code)

> +	    {

> +	    case MULT_EXPR:

> +	      if (!convert_mult_to_widen (stmt, &gsi)

> +		  && !convert_expand_mult_copysign (stmt, &gsi)

> +		  && convert_mult_to_fma (stmt,

> +					  gimple_assign_rhs1 (stmt),

> +					  gimple_assign_rhs2 (stmt),

> +					  &fma_state))

> +		{

> +		  gsi_remove (&gsi, true);

> +		  release_defs (stmt);

> +		  continue;

> +		}

> +	      break;

> +

> +	    case PLUS_EXPR:

> +	    case MINUS_EXPR:

> +	      if (!convert_plusminus_to_widen (&gsi, stmt, code))

> +		match_uaddsub_overflow (&gsi, stmt, code);

> +	      break;

>  

> -	  if (is_gimple_assign (stmt))

> +	    case TRUNC_MOD_EXPR:

> +	      convert_to_divmod (as_a<gassign *> (stmt));

> +	      break;

> +

> +	    default:;

> +	    }

> +	}

> +      else if (is_gimple_call (stmt))

> +	{

> +	  tree fndecl = gimple_call_fndecl (stmt);

> +	  if (fndecl && gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))

>  	    {

> -	      code = gimple_assign_rhs_code (stmt);

> -	      switch (code)

> +	      switch (DECL_FUNCTION_CODE (fndecl))

>  		{

> -		case MULT_EXPR:

> -		  if (!convert_mult_to_widen (stmt, &gsi)

> -		      && !convert_expand_mult_copysign (stmt, &gsi)

> +		case BUILT_IN_POWF:

> +		case BUILT_IN_POW:

> +		case BUILT_IN_POWL:

> +		  if (gimple_call_lhs (stmt)

> +		      && TREE_CODE (gimple_call_arg (stmt, 1)) == REAL_CST

> +		      && real_equal

> +		      (&TREE_REAL_CST (gimple_call_arg (stmt, 1)),

> +		       &dconst2)

>  		      && convert_mult_to_fma (stmt,

> -					      gimple_assign_rhs1 (stmt),

> -					      gimple_assign_rhs2 (stmt)))

> +					      gimple_call_arg (stmt, 0),

> +					      gimple_call_arg (stmt, 0),

> +					      &fma_state))

>  		    {

> -		      gsi_remove (&gsi, true);

> +		      unlink_stmt_vdef (stmt);

> +		      if (gsi_remove (&gsi, true)

> +			  && gimple_purge_dead_eh_edges (bb))

> +			*m_cfg_changed_p = true;

>  		      release_defs (stmt);

>  		      continue;

>  		    }

>  		  break;

>  

> -		case PLUS_EXPR:

> -		case MINUS_EXPR:

> -		  if (!convert_plusminus_to_widen (&gsi, stmt, code))

> -		    match_uaddsub_overflow (&gsi, stmt, code);

> -		  break;

> -

> -		case TRUNC_MOD_EXPR:

> -		  convert_to_divmod (as_a<gassign *> (stmt));

> -		  break;

> -

>  		default:;

>  		}

>  	    }

> -	  else if (is_gimple_call (stmt)

> -		   && gimple_call_lhs (stmt))

> -	    {

> -	      tree fndecl = gimple_call_fndecl (stmt);

> -	      if (fndecl

> -		  && gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))

> -		{

> -		  switch (DECL_FUNCTION_CODE (fndecl))

> -		    {

> -		      case BUILT_IN_POWF:

> -		      case BUILT_IN_POW:

> -		      case BUILT_IN_POWL:

> -			if (TREE_CODE (gimple_call_arg (stmt, 1)) == REAL_CST

> -			    && real_equal

> -			         (&TREE_REAL_CST (gimple_call_arg (stmt, 1)),

> -				  &dconst2)

> -			    && convert_mult_to_fma (stmt,

> -						    gimple_call_arg (stmt, 0),

> -						    gimple_call_arg (stmt, 0)))

> -			  {

> -			    unlink_stmt_vdef (stmt);

> -			    if (gsi_remove (&gsi, true)

> -				&& gimple_purge_dead_eh_edges (bb))

> -			      cfg_changed = true;

> -			    release_defs (stmt);

> -			    continue;

> -			  }

> -			  break;

> -

> -		      default:;

> -		    }

> -		}

> -	    }

> -	  gsi_next (&gsi);

> +	  else

> +	    cancel_fma_deferring (&fma_state);

>  	}

> +      gsi_next (&gsi);

>      }

> +  if (fma_state.m_deferring_p

> +      && fma_state.m_initial_phi)

> +    {

> +      gcc_checking_assert (fma_state.m_last_result);

> +      if (!last_fma_candidate_feeds_initial_phi (&fma_state,

> +						 &m_last_result_set))

> +	cancel_fma_deferring (&fma_state);

> +      else

> +	m_last_result_set.add (fma_state.m_last_result);

> +    }

> +}

> +

> +

> +unsigned int

> +pass_optimize_widening_mul::execute (function *fun)

> +{

> +  bool cfg_changed = false;

> +

> +  memset (&widen_mul_stats, 0, sizeof (widen_mul_stats));

> +  calculate_dominance_info (CDI_DOMINATORS);

> +  renumber_gimple_stmt_uids ();

> +

> +  math_opts_dom_walker (&cfg_changed).walk (ENTRY_BLOCK_PTR_FOR_FN (cfun));

>  

>    statistics_counter_event (fun, "widening multiplications inserted",

>  			    widen_mul_stats.widen_mults_inserted);

> 


-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

Patch

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index e323102cef5..224544fe04f 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -4888,6 +4888,11 @@  ix86_option_override_internal (bool main_args_p,
 	(cf_protection_level) (opts->x_flag_cf_protection | CF_SET);
     }
 
+  if (ix86_tune_features [X86_TUNE_AVOID_128FMA_CHAINS])
+    maybe_set_param_value (PARAM_AVOID_FMA_MAX_BITS, 128,
+			   opts->x_param_values,
+			   opts_set->x_param_values);
+
   return true;
 }
 
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 25f28e3cfc1..1b6f5f8816b 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -399,6 +399,10 @@  DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_pshufb",
 DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",
           m_SILVERMONT | m_INTEL)
 
+/* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
+   smaller FMA chain.  */
+DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1)
+
 /*****************************************************************************/
 /* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */
 /*****************************************************************************/
diff --git a/gcc/params.def b/gcc/params.def
index 923ebc8e66c..dd6193b44c2 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1317,6 +1317,11 @@  DEFPARAM(PARAM_UNROLL_JAM_MAX_UNROLL,
 	 "Maximum unroll factor for the unroll-and-jam transformation.",
 	 4, 0, 0)
 
+DEFPARAM(PARAM_AVOID_FMA_MAX_BITS,
+	 "avoid-fma-max-bits",
+	 "Maximum number of bits for which we avoid creating FMAs.",
+	 0, 0, 512)
+
 /*
 
 Local variables:
diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c
index 8db12f5e1cd..90a9f5359d6 100644
--- a/gcc/tree-ssa-math-opts.c
+++ b/gcc/tree-ssa-math-opts.c
@@ -115,6 +115,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "optabs-libfuncs.h"
 #include "tree-eh.h"
 #include "targhooks.h"
+#include "domwalk.h"
 
 /* This structure represents one basic block that either computes a
    division, or is a common dominator for basic block that compute a
@@ -2637,17 +2638,218 @@  convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple *stmt,
   return true;
 }
 
+/* Given a result MUL_RESULT which is a result of a multiplication of OP1 and
+   OP2 and which we know is used in statements that can be, together with the
+   multiplication, converted to FMAs, perform the transformation.  */
+
+static void
+convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
+{
+  tree type = TREE_TYPE (mul_result);
+  gimple *use_stmt;
+  imm_use_iterator imm_iter;
+  gassign *fma_stmt;
+
+  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, mul_result)
+    {
+      gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt);
+      enum tree_code use_code;
+      tree addop, mulop1 = op1, result = mul_result;
+      bool negate_p = false;
+
+      if (is_gimple_debug (use_stmt))
+	continue;
+
+      use_code = gimple_assign_rhs_code (use_stmt);
+      if (use_code == NEGATE_EXPR)
+	{
+	  result = gimple_assign_lhs (use_stmt);
+	  use_operand_p use_p;
+	  gimple *neguse_stmt;
+	  single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt);
+	  gsi_remove (&gsi, true);
+	  release_defs (use_stmt);
+
+	  use_stmt = neguse_stmt;
+	  gsi = gsi_for_stmt (use_stmt);
+	  use_code = gimple_assign_rhs_code (use_stmt);
+	  negate_p = true;
+	}
+
+      if (gimple_assign_rhs1 (use_stmt) == result)
+	{
+	  addop = gimple_assign_rhs2 (use_stmt);
+	  /* a * b - c -> a * b + (-c)  */
+	  if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)
+	    addop = force_gimple_operand_gsi (&gsi,
+					      build1 (NEGATE_EXPR,
+						      type, addop),
+					      true, NULL_TREE, true,
+					      GSI_SAME_STMT);
+	}
+      else
+	{
+	  addop = gimple_assign_rhs1 (use_stmt);
+	  /* a - b * c -> (-b) * c + a */
+	  if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)
+	    negate_p = !negate_p;
+	}
+
+      if (negate_p)
+	mulop1 = force_gimple_operand_gsi (&gsi,
+					   build1 (NEGATE_EXPR,
+						   type, mulop1),
+					   true, NULL_TREE, true,
+					   GSI_SAME_STMT);
+
+      fma_stmt = gimple_build_assign (gimple_assign_lhs (use_stmt),
+				      FMA_EXPR, mulop1, op2, addop);
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Generated FMA ");
+	  print_gimple_stmt (dump_file, fma_stmt, 0, 0);
+	  fprintf (dump_file, "\n");
+	}
+
+      gsi_replace (&gsi, fma_stmt, true);
+      widen_mul_stats.fmas_inserted++;
+    }
+}
+
+/* Data necessary to perform the actual transformation from a multiplication
+   and an addition to an FMA after decision is taken it should be done and to
+   then delete the multiplication statement from the function IL.  */
+
+struct fma_transformation_info
+{
+  gimple *mul_stmt;
+  tree mul_result;
+  tree op1;
+  tree op2;
+};
+
+/* Structure containing the current state of FMA deferring, i.e. whether we are
+   deferring, whether to continue deferring, and all data necessary to come
+   back and perform all deferred transformations.  */
+
+class fma_deferring_state
+{
+public:
+  /* Class constructor.  Pass true as PERFORM_DEFERRING in order to actually
+     do any deferring.  */
+
+  fma_deferring_state (bool perform_deferring)
+    : m_candidates (), m_mul_result_set (), m_initial_phi (NULL),
+      m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {}
+
+  /* List of FMA candidates for which we the transformation has been determined
+     possible but we at this point in BB analysis we do not consider them
+     beneficial.  */
+  auto_vec<fma_transformation_info, 8> m_candidates;
+
+  /* Set of results of multiplication that are part of an already deferred FMA
+     candidates.  */
+  hash_set<tree> m_mul_result_set;
+
+  /* The PHI that supposedly feeds back result of a FMA to another over loop
+     boundary.  */
+  gphi *m_initial_phi;
+
+  /* Result of the last produced FMA candidate or NULL if there has not been
+     one.  */
+  tree m_last_result;
+
+  /* If true, deferring might still be profitable.  If false, transform all
+     candidates and no longer defer.  */
+  bool m_deferring_p;
+};
+
+/* Transform all deferred FMA candidates and mark STATE as no longer
+   deferring.  */
+
+static void
+cancel_fma_deferring (fma_deferring_state *state)
+{
+  if (!state->m_deferring_p)
+    return;
+
+  for (unsigned i = 0; i < state->m_candidates.length (); i++)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Generating deferred FMA\n");
+
+      const fma_transformation_info &fti = state->m_candidates[i];
+      convert_mult_to_fma_1 (fti.mul_result, fti.op1, fti.op2);
+
+      gimple_stmt_iterator gsi = gsi_for_stmt (fti.mul_stmt);
+      gsi_remove (&gsi, true);
+      release_defs (fti.mul_stmt);
+    }
+  state->m_deferring_p = false;
+}
+
+/* If OP is an SSA name defined by a PHI node, return the PHI statement.
+   Otherwise return NULL.  */
+
+static gphi *
+result_of_phi (tree op)
+{
+  if (TREE_CODE (op) != SSA_NAME)
+    return NULL;
+
+  gimple *opdef = SSA_NAME_DEF_STMT (op);
+  if (gimple_code (opdef) != GIMPLE_PHI)
+    return NULL;
+
+  return as_a <gphi *> (opdef);
+}
+
+/* After processing statements of a BB and recording STATE, return true if the
+   initial phi is fed by the last FMA candidate result ore one such result from
+   previously processed BBs marked in LAST_RESULT_SET.  */
+
+static bool
+last_fma_candidate_feeds_initial_phi (fma_deferring_state *state,
+				      hash_set<tree> *last_result_set)
+{
+  ssa_op_iter iter;
+  use_operand_p use;
+  FOR_EACH_PHI_ARG (use, state->m_initial_phi, iter, SSA_OP_USE)
+    {
+      tree t = USE_FROM_PTR (use);
+      if (t == state->m_last_result
+	  || last_result_set->contains (t))
+	return true;
+    }
+
+  return false;
+}
+
 /* Combine the multiplication at MUL_STMT with operands MULOP1 and MULOP2
    with uses in additions and subtractions to form fused multiply-add
-   operations.  Returns true if successful and MUL_STMT should be removed.  */
+   operations.  Returns true if successful and MUL_STMT should be removed.
+
+   If STATE indicates that we are deferring FMA transformation, that means
+   that we do not produce FMAs for basic blocks which look like:
+
+    <bb 6>
+    # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
+    _65 = _14 * _16;
+    accumulator_66 = _65 + dst_row_111;
+
+  or its unrolled version, i.e. with several FMA candidates that feed result
+  of one into the addend of another.  Instead, we add them to a list in STATE
+  and if we later discover an FMA candidate that is not part of such a chain,
+  we go back and perform all deferred past candidates.  */
 
 static bool
-convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2)
+convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
+		     fma_deferring_state *state)
 {
   tree mul_result = gimple_get_lhs (mul_stmt);
   tree type = TREE_TYPE (mul_result);
   gimple *use_stmt, *neguse_stmt;
-  gassign *fma_stmt;
   use_operand_p use_p;
   imm_use_iterator imm_iter;
 
@@ -2671,6 +2873,11 @@  convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2)
   if (has_zero_uses (mul_result))
     return false;
 
+  bool check_defer
+    = (state->m_deferring_p
+       && (tree_to_shwi (TYPE_SIZE (type))
+	   <= PARAM_VALUE (PARAM_AVOID_FMA_MAX_BITS)));
+  bool defer = check_defer;
   /* Make sure that the multiplication statement becomes dead after
      the transformation, thus that all uses are transformed to FMAs.
      This means we assume that an FMA operation has the same cost
@@ -2768,77 +2975,92 @@  convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2)
 	    }
 	}
 
+      tree use_rhs1 = gimple_assign_rhs1 (use_stmt);
+      tree use_rhs2 = gimple_assign_rhs2 (use_stmt);
       /* We can't handle a * b + a * b.  */
-      if (gimple_assign_rhs1 (use_stmt) == gimple_assign_rhs2 (use_stmt))
+      if (use_rhs1 == use_rhs2)
+	return false;
+      /* If deferring, make sure we are not looking at an instruction that
+	 wouldn't have existed if we were not.  */
+      if (state->m_deferring_p
+	  && (state->m_mul_result_set.contains (use_rhs1)
+	      || state->m_mul_result_set.contains (use_rhs2)))
 	return false;
 
-      /* While it is possible to validate whether or not the exact form
-	 that we've recognized is available in the backend, the assumption
-	 is that the transformation is never a loss.  For instance, suppose
-	 the target only has the plain FMA pattern available.  Consider
-	 a*b-c -> fma(a,b,-c): we've exchanged MUL+SUB for FMA+NEG, which
-	 is still two operations.  Consider -(a*b)-c -> fma(-a,b,-c): we
-	 still have 3 operations, but in the FMA form the two NEGs are
-	 independent and could be run in parallel.  */
-    }
-
-  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, mul_result)
-    {
-      gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt);
-      enum tree_code use_code;
-      tree addop, mulop1 = op1, result = mul_result;
-      bool negate_p = false;
-
-      if (is_gimple_debug (use_stmt))
-	continue;
-
-      use_code = gimple_assign_rhs_code (use_stmt);
-      if (use_code == NEGATE_EXPR)
+      if (check_defer)
 	{
-	  result = gimple_assign_lhs (use_stmt);
-	  single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt);
-	  gsi_remove (&gsi, true);
-	  release_defs (use_stmt);
+	  tree use_lhs = gimple_assign_lhs (use_stmt);
+	  if (state->m_last_result)
+	    {
+	      if (use_rhs2 == state->m_last_result
+		  || use_rhs1 == state->m_last_result)
+		defer = true;
+	      else
+		defer = false;
+	    }
+	  else
+	    {
+	      gcc_checking_assert (!state->m_initial_phi);
+	      gphi *phi;
+	      if (use_rhs1 == result)
+		phi = result_of_phi (use_rhs2);
+	      else
+		{
+		  gcc_assert (use_rhs2 == result);
+		  phi = result_of_phi (use_rhs1);
+		}
 
-	  use_stmt = neguse_stmt;
-	  gsi = gsi_for_stmt (use_stmt);
-	  use_code = gimple_assign_rhs_code (use_stmt);
-	  negate_p = true;
-	}
+	      if (phi)
+		{
+		  state->m_initial_phi = phi;
+		  defer = true;
+		}
+	      else
+		defer = false;
+	    }
 
-      if (gimple_assign_rhs1 (use_stmt) == result)
-	{
-	  addop = gimple_assign_rhs2 (use_stmt);
-	  /* a * b - c -> a * b + (-c)  */
-	  if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)
-	    addop = force_gimple_operand_gsi (&gsi,
-					      build1 (NEGATE_EXPR,
-						      type, addop),
-					      true, NULL_TREE, true,
-					      GSI_SAME_STMT);
+	  state->m_last_result = use_lhs;
+	  check_defer = false;
 	}
       else
+	defer = false;
+
+      /* While it is possible to validate whether or not the exact form that
+	 we've recognized is available in the backend, the assumption is that
+	 if the deferring logic above did not trigger, the transformation is
+	 never a loss.  For instance, suppose the target only has the plain FMA
+	 pattern available.  Consider a*b-c -> fma(a,b,-c): we've exchanged
+	 MUL+SUB for FMA+NEG, which is still two operations.  Consider
+         -(a*b)-c -> fma(-a,b,-c): we still have 3 operations, but in the FMA
+	 form the two NEGs are independent and could be run in parallel.  */
+    }
+
+  if (defer)
+    {
+      fma_transformation_info fti;
+      fti.mul_stmt = mul_stmt;
+      fti.mul_result = mul_result;
+      fti.op1 = op1;
+      fti.op2 = op2;
+      state->m_candidates.safe_push (fti);
+      state->m_mul_result_set.add (mul_result);
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
 	{
-	  addop = gimple_assign_rhs1 (use_stmt);
-	  /* a - b * c -> (-b) * c + a */
-	  if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)
-	    negate_p = !negate_p;
+	  fprintf (dump_file, "Deferred generating FMA for multiplication ");
+	  print_gimple_stmt (dump_file, mul_stmt, 0, 0);
+	  fprintf (dump_file, "\n");
 	}
 
-      if (negate_p)
-	mulop1 = force_gimple_operand_gsi (&gsi,
-					   build1 (NEGATE_EXPR,
-						   type, mulop1),
-					   true, NULL_TREE, true,
-					   GSI_SAME_STMT);
-
-      fma_stmt = gimple_build_assign (gimple_assign_lhs (use_stmt),
-				      FMA_EXPR, mulop1, op2, addop);
-      gsi_replace (&gsi, fma_stmt, true);
-      widen_mul_stats.fmas_inserted++;
+      return false;
+    }
+  else
+    {
+      if (state->m_deferring_p)
+	cancel_fma_deferring (state);
+      convert_mult_to_fma_1 (mul_result, op1, op2);
+      return true;
     }
-
-  return true;
 }
 
 
@@ -3268,92 +3490,135 @@  public:
 
 }; // class pass_optimize_widening_mul
 
-unsigned int
-pass_optimize_widening_mul::execute (function *fun)
+/* Walker class to perform the transformation in reverse dominance order. */
+
+class math_opts_dom_walker : public dom_walker
 {
-  basic_block bb;
-  bool cfg_changed = false;
+public:
+  /* Constructor, CFG_CHANGED is a pointer to a boolean flag that will be set
+     if walking modidifes the CFG.  */
 
-  memset (&widen_mul_stats, 0, sizeof (widen_mul_stats));
-  calculate_dominance_info (CDI_DOMINATORS);
-  renumber_gimple_stmt_uids ();
+  math_opts_dom_walker (bool *cfg_changed_p)
+    : dom_walker (CDI_DOMINATORS), m_last_result_set (),
+      m_cfg_changed_p (cfg_changed_p) {}
 
-  FOR_EACH_BB_FN (bb, fun)
+  /* The actual actions performed in the walk.  */
+
+  virtual void after_dom_children (basic_block);
+
+  /* Set of results of chains of multiply and add statement combinations that
+     were not transformed into FMAs because of active deferring.  */
+  hash_set<tree> m_last_result_set;
+
+  /* Pointer to a flag of the user that needs to be set if CFG has been
+     modified.  */
+  bool *m_cfg_changed_p;
+};
+
+void
+math_opts_dom_walker::after_dom_children (basic_block bb)
+{
+  gimple_stmt_iterator gsi;
+
+  fma_deferring_state fma_state (PARAM_VALUE (PARAM_AVOID_FMA_MAX_BITS) > 0);
+
+  for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi);)
     {
-      gimple_stmt_iterator gsi;
+      gimple *stmt = gsi_stmt (gsi);
+      enum tree_code code;
 
-      for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi);)
-        {
-	  gimple *stmt = gsi_stmt (gsi);
-	  enum tree_code code;
+      if (is_gimple_assign (stmt))
+	{
+	  code = gimple_assign_rhs_code (stmt);
+	  switch (code)
+	    {
+	    case MULT_EXPR:
+	      if (!convert_mult_to_widen (stmt, &gsi)
+		  && !convert_expand_mult_copysign (stmt, &gsi)
+		  && convert_mult_to_fma (stmt,
+					  gimple_assign_rhs1 (stmt),
+					  gimple_assign_rhs2 (stmt),
+					  &fma_state))
+		{
+		  gsi_remove (&gsi, true);
+		  release_defs (stmt);
+		  continue;
+		}
+	      break;
+
+	    case PLUS_EXPR:
+	    case MINUS_EXPR:
+	      if (!convert_plusminus_to_widen (&gsi, stmt, code))
+		match_uaddsub_overflow (&gsi, stmt, code);
+	      break;
 
-	  if (is_gimple_assign (stmt))
+	    case TRUNC_MOD_EXPR:
+	      convert_to_divmod (as_a<gassign *> (stmt));
+	      break;
+
+	    default:;
+	    }
+	}
+      else if (is_gimple_call (stmt))
+	{
+	  tree fndecl = gimple_call_fndecl (stmt);
+	  if (fndecl && gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
 	    {
-	      code = gimple_assign_rhs_code (stmt);
-	      switch (code)
+	      switch (DECL_FUNCTION_CODE (fndecl))
 		{
-		case MULT_EXPR:
-		  if (!convert_mult_to_widen (stmt, &gsi)
-		      && !convert_expand_mult_copysign (stmt, &gsi)
+		case BUILT_IN_POWF:
+		case BUILT_IN_POW:
+		case BUILT_IN_POWL:
+		  if (gimple_call_lhs (stmt)
+		      && TREE_CODE (gimple_call_arg (stmt, 1)) == REAL_CST
+		      && real_equal
+		      (&TREE_REAL_CST (gimple_call_arg (stmt, 1)),
+		       &dconst2)
 		      && convert_mult_to_fma (stmt,
-					      gimple_assign_rhs1 (stmt),
-					      gimple_assign_rhs2 (stmt)))
+					      gimple_call_arg (stmt, 0),
+					      gimple_call_arg (stmt, 0),
+					      &fma_state))
 		    {
-		      gsi_remove (&gsi, true);
+		      unlink_stmt_vdef (stmt);
+		      if (gsi_remove (&gsi, true)
+			  && gimple_purge_dead_eh_edges (bb))
+			*m_cfg_changed_p = true;
 		      release_defs (stmt);
 		      continue;
 		    }
 		  break;
 
-		case PLUS_EXPR:
-		case MINUS_EXPR:
-		  if (!convert_plusminus_to_widen (&gsi, stmt, code))
-		    match_uaddsub_overflow (&gsi, stmt, code);
-		  break;
-
-		case TRUNC_MOD_EXPR:
-		  convert_to_divmod (as_a<gassign *> (stmt));
-		  break;
-
 		default:;
 		}
 	    }
-	  else if (is_gimple_call (stmt)
-		   && gimple_call_lhs (stmt))
-	    {
-	      tree fndecl = gimple_call_fndecl (stmt);
-	      if (fndecl
-		  && gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
-		{
-		  switch (DECL_FUNCTION_CODE (fndecl))
-		    {
-		      case BUILT_IN_POWF:
-		      case BUILT_IN_POW:
-		      case BUILT_IN_POWL:
-			if (TREE_CODE (gimple_call_arg (stmt, 1)) == REAL_CST
-			    && real_equal
-			         (&TREE_REAL_CST (gimple_call_arg (stmt, 1)),
-				  &dconst2)
-			    && convert_mult_to_fma (stmt,
-						    gimple_call_arg (stmt, 0),
-						    gimple_call_arg (stmt, 0)))
-			  {
-			    unlink_stmt_vdef (stmt);
-			    if (gsi_remove (&gsi, true)
-				&& gimple_purge_dead_eh_edges (bb))
-			      cfg_changed = true;
-			    release_defs (stmt);
-			    continue;
-			  }
-			  break;
-
-		      default:;
-		    }
-		}
-	    }
-	  gsi_next (&gsi);
+	  else
+	    cancel_fma_deferring (&fma_state);
 	}
+      gsi_next (&gsi);
     }
+  if (fma_state.m_deferring_p
+      && fma_state.m_initial_phi)
+    {
+      gcc_checking_assert (fma_state.m_last_result);
+      if (!last_fma_candidate_feeds_initial_phi (&fma_state,
+						 &m_last_result_set))
+	cancel_fma_deferring (&fma_state);
+      else
+	m_last_result_set.add (fma_state.m_last_result);
+    }
+}
+
+
+unsigned int
+pass_optimize_widening_mul::execute (function *fun)
+{
+  bool cfg_changed = false;
+
+  memset (&widen_mul_stats, 0, sizeof (widen_mul_stats));
+  calculate_dominance_info (CDI_DOMINATORS);
+  renumber_gimple_stmt_uids ();
+
+  math_opts_dom_walker (&cfg_changed).walk (ENTRY_BLOCK_PTR_FOR_FN (cfun));
 
   statistics_counter_event (fun, "widening multiplications inserted",
 			    widen_mul_stats.widen_mults_inserted);