More conservative interchanging small loops with const initialized simple reduction

Message ID DB5PR0801MB2742D02F8DAEE967A46E7669E7300@DB5PR0801MB2742.eurprd08.prod.outlook.com
State New
Series
  • More conservative interchanging small loops with const initialized simple reduction

Commit Message

Bin Cheng Dec. 8, 2017, 11:46 a.m.
Hi,
This simple patch makes interchange even more conservative for small loops with a constant initialized simple reduction.
The reason is that undoing such a reduction introduces a new data reference as well as a cond_expr, which could cost
too much in a small loop.
Test gcc.target/aarch64/pr62178.c is fixed with this patch.  Is it OK if test passes?

Thanks,
bin
2017-12-08  Bin Cheng  <bin.cheng@arm.com>

	* gimple-loop-interchange.cc (struct loop_cand): New field.
	(loop_cand::loop_cand): Init new field in constructor.
	(loop_cand::classify_simple_reduction): Record simple reduction
	initialized with constant value.
	(should_interchange_loops): New parameter.  Skip interchange if loop
	has few data references and constant initialized simple reduction.
	(tree_loop_interchange::interchange): Update call to above function.
	(should_interchange_loop_nest): Ditto.

Comments

Richard Biener Dec. 8, 2017, 12:17 p.m. | #1
On Fri, Dec 8, 2017 at 12:46 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> Hi,
> This simple patch makes interchange even more conservative for small loops with constant initialized simple reduction.
> The reason is undoing such reduction introduces new data reference and cond_expr, which could cost too much in a small
> loop.
> Test gcc.target/aarch64/pr62178.c is fixed with this patch.  Is it OK if test passes?


Shouldn't we do this even for non-constant initialized simple
reduction?  Because for any simple reduction we add two DRs that
are not innermost; for constant initialized reduction we add an
additional cond-expr.  So ...

+  /* Conservatively skip interchange in cases only have few data references
+     and constant initialized simple reduction since it introduces new data
+     reference as well as ?: operation.  */
+  if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
+    return false;
+

Can you, instead of carrying num_const_init_simple_reduc, simply loop
over m_reductions and classify them in this function accordingly?  I
think we want to cost non-constant-init reductions as well.  The ?: can
eventually count for another DR for cost purposes.

It looks like we do count the existing DRs for the reduction?  Is that
why you arrive at the num_const_init_simple_reduc * 2 figure?
(one extra load plus one ?:)
But we don't really know whether the DR was invariant in the outer
loop (well, I suppose we could remember the DR in m_reductions).

Note that the good thing is that the ?: has an invariant condition and
thus vectorization can hoist the mask generation out of the vectorized
loop, which means it boils down to cheap operations.  My gut feeling is
that just looking at the number of memory references isn't a good
indicator of profitability, as the regular stmt workload has a big
impact on the profitability of vectorization.

So no ack nor nack...

Richard.

> Thanks,
> bin
> 2017-12-08  Bin Cheng  <bin.cheng@arm.com>
>
>         * gimple-loop-interchange.cc (struct loop_cand): New field.
>         (loop_cand::loop_cand): Init new field in constructor.
>         (loop_cand::classify_simple_reduction): Record simple reduction
>         initialized with constant value.
>         (should_interchange_loops): New parameter.  Skip interchange if loop
>         has few data references and constant intitialized simple reduction.
>         (tree_loop_interchange::interchange): Update call to above function.
>         (should_interchange_loop_nest): Ditto.
Bin.Cheng Dec. 8, 2017, 12:43 p.m. | #2
On Fri, Dec 8, 2017 at 12:17 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Dec 8, 2017 at 12:46 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>> Hi,
>> This simple patch makes interchange even more conservative for small loops with constant initialized simple reduction.
>> The reason is undoing such reduction introduces new data reference and cond_expr, which could cost too much in a small
>> loop.
>> Test gcc.target/aarch64/pr62178.c is fixed with this patch.  Is it OK if test passes?
>
> Shouldn't we do this even for non-constant initialzied simple
> reduction?  Because for any simple
> reduction we add two DRs that are not innermost, for constant
> initialized we add an additional
> cond-expr.  So ...
>
> +  /* Conservatively skip interchange in cases only have few data references
> +     and constant initialized simple reduction since it introduces new data
> +     reference as well as ?: operation.  */
> +  if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
> +    return false;
> +
>
> can you, instead of carrying num_const_init_simple_reduc simply loop
> over m_reductions
> and classify them in this function accordingly?  I think we want to
> cost non-constant-init
> reductions as well.  The :? can eventually count for another DR for
> cost purposes.

The number of non-constant-init reductions can still be carried in
struct loop_cand?  I am not sure what the advantage of an additional
loop over m_reductions is when it gets the same information.
Perhaps the increase in stmts should be counted like:
  num_old_inv_drs + num_const_init_simple_reduc * 2 - num_new_inv_drs
The question is which number this should be compared against.  (We may
need to shift num_new_inv_drs to the other side to avoid wrapping issues.)

>

> It looks like we do count the existing DRs for the reduction?  Is that

> why you arrive

> at the num_const_init_simple_reduc * 2 figure? (one extra load plus one ?:)

Yes.
> But we don't really know whether the DR was invariant in the outer
> loop (well, I suppose

Hmm, I might misunderstand here.  num_old_inv_drs tracks the number of
invariant references with regard to the inner loop, rather than the
outer loop.  The same goes for num_new_inv_drs, which counts references
that become invariant with regard to the (new) inner loop after loop
interchange.  This invariance information is always known from the data
reference, right?
As for the DRs for a reduction, we know they are invariant because we
set their inner loop stride to zero.

> we could remember the DR in m_reductions).
>
> Note that the good thing is that the ?: has an invariant condition and
> thus vectorization
> can hoist the mask generation out of the vectorized loop which means
> it boils down to
> cheap operations.  My gut feeling is that just looking at the number
> of memory references
> isn't a good indicator of profitability as the regular stmt workload
> has a big impact on
> profitability of vectorization.

It's not specific to vectorization.  The newly generated code also costs
too much in small loops without vectorization.  But yes, the number of
mem_refs may be too inaccurate; maybe we should check against num_stmts.

Thanks,
bin
>
> So no ack nor nack...
>
> Richard.
>
>> Thanks,
>> bin
>> 2017-12-08  Bin Cheng  <bin.cheng@arm.com>
>>
>>         * gimple-loop-interchange.cc (struct loop_cand): New field.
>>         (loop_cand::loop_cand): Init new field in constructor.
>>         (loop_cand::classify_simple_reduction): Record simple reduction
>>         initialized with constant value.
>>         (should_interchange_loops): New parameter.  Skip interchange if loop
>>         has few data references and constant intitialized simple reduction.
>>         (tree_loop_interchange::interchange): Update call to above function.
>>         (should_interchange_loop_nest): Ditto.
Richard Biener Dec. 8, 2017, 2:40 p.m. | #3
On Fri, Dec 8, 2017 at 1:43 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Fri, Dec 8, 2017 at 12:17 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, Dec 8, 2017 at 12:46 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>> Hi,
>>> This simple patch makes interchange even more conservative for small loops with constant initialized simple reduction.
>>> The reason is undoing such reduction introduces new data reference and cond_expr, which could cost too much in a small
>>> loop.
>>> Test gcc.target/aarch64/pr62178.c is fixed with this patch.  Is it OK if test passes?
>>
>> Shouldn't we do this even for non-constant initialzied simple
>> reduction?  Because for any simple
>> reduction we add two DRs that are not innermost, for constant
>> initialized we add an additional
>> cond-expr.  So ...
>>
>> +  /* Conservatively skip interchange in cases only have few data references
>> +     and constant initialized simple reduction since it introduces new data
>> +     reference as well as ?: operation.  */
>> +  if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
>> +    return false;
>> +
>>
>> can you, instead of carrying num_const_init_simple_reduc simply loop
>> over m_reductions
>> and classify them in this function accordingly?  I think we want to
>> cost non-constant-init
>> reductions as well.  The :? can eventually count for another DR for
>> cost purposes.
> Number of non-constant-init reductions can still be carried in struct
> loop_cand?  I am not very sure what's the advantage of an additional
> loop over m_reductions getting the same information.
> Perhaps the increase of stmts should be counted like:
>   num_old_inv_drs + num_const_init_simple_reduc * 2 - num_new_inv_drs
> Question is which number should this be compared against.  (we may
> need to shift num_new_inv_drs to the other side for wrapping issue).
>
>>
>> It looks like we do count the existing DRs for the reduction?  Is that
>> why you arrive
>> at the num_const_init_simple_reduc * 2 figure? (one extra load plus one ?:)
> Yes.
>> But we don't really know whether the DR was invariant in the outer
>> loop (well, I suppose
> Hmm, I might misunderstand here.  num_old_inv_drs tracks the number of
> invariant reference with regarding to inner loop, rather than the
> outer loop.  The same to num_new_inv_drs,
> which means a reference become invariant after loop interchange with
> regarding to (the new) inner loop.  This invariant information is
> always known from data reference, right?
> As for DRs for reduction, we know it's invariant because we set its
> inner loop stride to zero.
>
>> we could remember the DR in m_reductions).
>>
>> Note that the good thing is that the ?: has an invariant condition and
>> thus vectorization
>> can hoist the mask generation out of the vectorized loop which means
>> it boils down to
>> cheap operations.  My gut feeling is that just looking at the number
>> of memory references
>> isn't a good indicator of profitability as the regular stmt workload
>> has a big impact on
>> profitability of vectorization.
> It's not specific to vectorization.  The generated new code also costs
> too much in small loops without vectorization.  But yes, # of mem_refs
> may be too inaccurate, maybe we should check against num_stmts.


Not specific to vectorization, but the interchange may pay off only when
vectorizing a loop.  Would the loop in loop-interchange-5.c still be
interchanged if we remove the multiplication and just keep
c[i][j] = c[i][j] + b[k][j];
?  That is, why is the constant init so special?  Even for non-constant init
we're changing two outer loop DRs to two non-consecutive inner loop DRs.

Richard.

> Thanks,
> bin
>>
>> So no ack nor nack...
>>
>> Richard.
>>
>>> Thanks,
>>> bin
>>> 2017-12-08  Bin Cheng  <bin.cheng@arm.com>
>>>
>>>         * gimple-loop-interchange.cc (struct loop_cand): New field.
>>>         (loop_cand::loop_cand): Init new field in constructor.
>>>         (loop_cand::classify_simple_reduction): Record simple reduction
>>>         initialized with constant value.
>>>         (should_interchange_loops): New parameter.  Skip interchange if loop
>>>         has few data references and constant intitialized simple reduction.
>>>         (tree_loop_interchange::interchange): Update call to above function.
>>>         (should_interchange_loop_nest): Ditto.
Bin.Cheng Dec. 8, 2017, 3:18 p.m. | #4
On Fri, Dec 8, 2017 at 2:40 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Dec 8, 2017 at 1:43 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Fri, Dec 8, 2017 at 12:17 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, Dec 8, 2017 at 12:46 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>> Hi,
>>>> This simple patch makes interchange even more conservative for small loops with constant initialized simple reduction.
>>>> The reason is undoing such reduction introduces new data reference and cond_expr, which could cost too much in a small
>>>> loop.
>>>> Test gcc.target/aarch64/pr62178.c is fixed with this patch.  Is it OK if test passes?
>>>
>>> Shouldn't we do this even for non-constant initialzied simple
>>> reduction?  Because for any simple
>>> reduction we add two DRs that are not innermost, for constant
>>> initialized we add an additional
>>> cond-expr.  So ...
>>>
>>> +  /* Conservatively skip interchange in cases only have few data references
>>> +     and constant initialized simple reduction since it introduces new data
>>> +     reference as well as ?: operation.  */
>>> +  if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
>>> +    return false;
>>> +
>>>
>>> can you, instead of carrying num_const_init_simple_reduc simply loop
>>> over m_reductions
>>> and classify them in this function accordingly?  I think we want to
>>> cost non-constant-init
>>> reductions as well.  The :? can eventually count for another DR for
>>> cost purposes.
>> Number of non-constant-init reductions can still be carried in struct
>> loop_cand?  I am not very sure what's the advantage of an additional
>> loop over m_reductions getting the same information.
>> Perhaps the increase of stmts should be counted like:
>>   num_old_inv_drs + num_const_init_simple_reduc * 2 - num_new_inv_drs
>> Question is which number should this be compared against.  (we may
>> need to shift num_new_inv_drs to the other side for wrapping issue).
>>
>>>
>>> It looks like we do count the existing DRs for the reduction?  Is that
>>> why you arrive
>>> at the num_const_init_simple_reduc * 2 figure? (one extra load plus one ?:)
>> Yes.
>>> But we don't really know whether the DR was invariant in the outer
>>> loop (well, I suppose
>> Hmm, I might misunderstand here.  num_old_inv_drs tracks the number of
>> invariant reference with regarding to inner loop, rather than the
>> outer loop.  The same to num_new_inv_drs,
>> which means a reference become invariant after loop interchange with
>> regarding to (the new) inner loop.  This invariant information is
>> always known from data reference, right?
>> As for DRs for reduction, we know it's invariant because we set its
>> inner loop stride to zero.
>>
>>> we could remember the DR in m_reductions).
>>>
>>> Note that the good thing is that the ?: has an invariant condition and
>>> thus vectorization
>>> can hoist the mask generation out of the vectorized loop which means
>>> it boils down to
>>> cheap operations.  My gut feeling is that just looking at the number
>>> of memory references
>>> isn't a good indicator of profitability as the regular stmt workload
>>> has a big impact on
>>> profitability of vectorization.
>> It's not specific to vectorization.  The generated new code also costs
>> too much in small loops without vectorization.  But yes, # of mem_refs
>> may be too inaccurate, maybe we should check against num_stmts.
>
> Not specific to vectorization but the interchange may pay off only when
> vectorizing a loop.  Would the loop in loop-interchange-5.c be still
> interchanged?  If we remove the multiplication and just keep
> c[i][j] = c[i][j] + b[k][j];

Both loop-interchange-5.c and the modified version are interchanged,
because we check it against the number of all data references (including
num_old_inv_drs):
 if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())

> ?  That is, why is the constant init so special?  Even for non-constant init
> we're changing two outer loop DRs to two non-consecutive inner loop DRs.
No, the two outer loop DRs become consecutive with respect to the inner
loop.  So for a typical matrix mul case, the interchange moves two outer
loop DRs into the inner loop and moves an inner loop DR out to the outer
loop.  Overall it introduces one additional instruction.  In terms of
cache behavior it may even be cheaper, because the two new DRs access the
same memory object.
As for pr62178.c, assembly regressed from:
.L3:
        ldr     w3, [x0], 124
        ldr     w4, [x2, 4]!
        cmp     x0, x5
        madd    w1, w4, w3, w1
        bne     .L3

To:
.L3:
        ldr     w2, [x3, x0, lsl 2]
        cmp     w5, 1
        ldr     w1, [x4, x0, lsl 2]
        csel    w2, w2, wzr, ne
        madd    w1, w6, w1, w2
        str     w1, [x3, x0, lsl 2]
        add     x0, x0, 1
        cmp     x0, 31
        bne     .L3

Without vectorization.  Two additional instructions for the cond_expr, one
additional memory reference for the interchange, and one additional
instruction because of ivopts/addressing mode choice.

Thanks,
bin
>
> Richard.
>
>> Thanks,
>> bin
>>>
>>> So no ack nor nack...
>>>
>>> Richard.
>>>
>>>> Thanks,
>>>> bin
>>>> 2017-12-08  Bin Cheng  <bin.cheng@arm.com>
>>>>
>>>>         * gimple-loop-interchange.cc (struct loop_cand): New field.
>>>>         (loop_cand::loop_cand): Init new field in constructor.
>>>>         (loop_cand::classify_simple_reduction): Record simple reduction
>>>>         initialized with constant value.
>>>>         (should_interchange_loops): New parameter.  Skip interchange if loop
>>>>         has few data references and constant intitialized simple reduction.
>>>>         (tree_loop_interchange::interchange): Update call to above function.
>>>>         (should_interchange_loop_nest): Ditto.
Bin.Cheng Dec. 8, 2017, 4:24 p.m. | #5
On Fri, Dec 8, 2017 at 3:18 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Fri, Dec 8, 2017 at 2:40 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, Dec 8, 2017 at 1:43 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Fri, Dec 8, 2017 at 12:17 PM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Fri, Dec 8, 2017 at 12:46 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>> Hi,
>>>>> This simple patch makes interchange even more conservative for small loops with constant initialized simple reduction.
>>>>> The reason is undoing such reduction introduces new data reference and cond_expr, which could cost too much in a small
>>>>> loop.
>>>>> Test gcc.target/aarch64/pr62178.c is fixed with this patch.  Is it OK if test passes?
>>>>
>>>> Shouldn't we do this even for non-constant initialzied simple
>>>> reduction?  Because for any simple
>>>> reduction we add two DRs that are not innermost, for constant
>>>> initialized we add an additional
>>>> cond-expr.  So ...
>>>>
>>>> +  /* Conservatively skip interchange in cases only have few data references
>>>> +     and constant initialized simple reduction since it introduces new data
>>>> +     reference as well as ?: operation.  */
>>>> +  if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
>>>> +    return false;
>>>> +
>>>>
>>>> can you, instead of carrying num_const_init_simple_reduc simply loop
>>>> over m_reductions
>>>> and classify them in this function accordingly?  I think we want to
>>>> cost non-constant-init
>>>> reductions as well.  The :? can eventually count for another DR for
>>>> cost purposes.
>>> Number of non-constant-init reductions can still be carried in struct
>>> loop_cand?  I am not very sure what's the advantage of an additional
>>> loop over m_reductions getting the same information.
>>> Perhaps the increase of stmts should be counted like:
>>>   num_old_inv_drs + num_const_init_simple_reduc * 2 - num_new_inv_drs
>>> Question is which number should this be compared against.  (we may
>>> need to shift num_new_inv_drs to the other side for wrapping issue).
>>>
>>>>
>>>> It looks like we do count the existing DRs for the reduction?  Is that
>>>> why you arrive
>>>> at the num_const_init_simple_reduc * 2 figure? (one extra load plus one ?:)
>>> Yes.
>>>> But we don't really know whether the DR was invariant in the outer
>>>> loop (well, I suppose
>>> Hmm, I might misunderstand here.  num_old_inv_drs tracks the number of
>>> invariant reference with regarding to inner loop, rather than the
>>> outer loop.  The same to num_new_inv_drs,
>>> which means a reference become invariant after loop interchange with
>>> regarding to (the new) inner loop.  This invariant information is
>>> always known from data reference, right?
>>> As for DRs for reduction, we know it's invariant because we set its
>>> inner loop stride to zero.
>>>
>>>> we could remember the DR in m_reductions).
>>>>
>>>> Note that the good thing is that the ?: has an invariant condition and
>>>> thus vectorization
>>>> can hoist the mask generation out of the vectorized loop which means
>>>> it boils down to
>>>> cheap operations.  My gut feeling is that just looking at the number
>>>> of memory references
>>>> isn't a good indicator of profitability as the regular stmt workload
>>>> has a big impact on
>>>> profitability of vectorization.
>>> It's not specific to vectorization.  The generated new code also costs
>>> too much in small loops without vectorization.  But yes, # of mem_refs
>>> may be too inaccurate, maybe we should check against num_stmts.
>>
>> Not specific to vectorization but the interchange may pay off only when
>> vectorizing a loop.  Would the loop in loop-interchange-5.c be still
>> interchanged?  If we remove the multiplication and just keep
>> c[i][j] = c[i][j] + b[k][j];
> Both loop-interchange-5.c and the modified version are interchange,
> because we check
> it against number of all data references (including num_old_inv_drs):
>  if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
>
>> ?  That is, why is the constant init so special?  Even for non-constant init
>> we're changing two outer loop DRs to two non-consecutive inner loop DRs.
> No, the two outer loop DRs becomes consecutive with respect to inner loop.
> So for a typical matrix mul case, the interchange moves two outer loop
> DRs into inner loops, moves an inner loop DR out to outer loop.
> Overall it introduces an additional instruction.  In terms of cache
> behavior, it's even cheaper?  Because the two new DRs access the same
> memory object.
> As for pr62178.c, assembly regressed from:
> .L3:
>         ldr     w3, [x0], 124
>         ldr     w4, [x2, 4]!
>         cmp     x0, x5
>         madd    w1, w4, w3, w1
>         bne     .L3
>
> To:
> .L3:
>         ldr     w2, [x3, x0, lsl 2]
>         cmp     w5, 1
>         ldr     w1, [x4, x0, lsl 2]
>         csel    w2, w2, wzr, ne
>         madd    w1, w6, w1, w2
>         str     w1, [x3, x0, lsl 2]
>         add     x0, x0, 1
>         cmp     x0, 31
>         bne     .L3
>
> Without vectorization.  Two additional instruction for cond_expr, one
> additional memory reference for interchange, and one additional
> instruction because of ivopt/addressing mode.

For the record, the performance of pr62178.c depends on the size of the
matrix: with a 31*31 matrix the above interchanged version is slower,
but it's faster for a 250*250 matrix.
So L1-cache-size may need to be considered somehow, but that remains a
problem for unknown-niters cases.

Thanks,
bin
>
> Thanks,
> bin
>>
>> Richard.
>>
>>> Thanks,
>>> bin
>>>>
>>>> So no ack nor nack...
>>>>
>>>> Richard.
>>>>
>>>>> Thanks,
>>>>> bin
>>>>> 2017-12-08  Bin Cheng  <bin.cheng@arm.com>
>>>>>
>>>>>         * gimple-loop-interchange.cc (struct loop_cand): New field.
>>>>>         (loop_cand::loop_cand): Init new field in constructor.
>>>>>         (loop_cand::classify_simple_reduction): Record simple reduction
>>>>>         initialized with constant value.
>>>>>         (should_interchange_loops): New parameter.  Skip interchange if loop
>>>>>         has few data references and constant intitialized simple reduction.
>>>>>         (tree_loop_interchange::interchange): Update call to above function.
>>>>>         (should_interchange_loop_nest): Ditto.
Bin.Cheng Dec. 12, 2017, 12:42 p.m. | #6
On Fri, Dec 8, 2017 at 2:40 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Dec 8, 2017 at 1:43 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Fri, Dec 8, 2017 at 12:17 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, Dec 8, 2017 at 12:46 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>> Hi,
>>>> This simple patch makes interchange even more conservative for small loops with constant initialized simple reduction.
>>>> The reason is undoing such reduction introduces new data reference and cond_expr, which could cost too much in a small
>>>> loop.
>>>> Test gcc.target/aarch64/pr62178.c is fixed with this patch.  Is it OK if test passes?
>>>
>>> Shouldn't we do this even for non-constant initialzied simple
>>> reduction?  Because for any simple
>>> reduction we add two DRs that are not innermost, for constant
>>> initialized we add an additional
>>> cond-expr.  So ...
>>>
>>> +  /* Conservatively skip interchange in cases only have few data references
>>> +     and constant initialized simple reduction since it introduces new data
>>> +     reference as well as ?: operation.  */
>>> +  if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
>>> +    return false;
>>> +
>>>
>>> can you, instead of carrying num_const_init_simple_reduc simply loop
>>> over m_reductions
>>> and classify them in this function accordingly?  I think we want to
>>> cost non-constant-init
>>> reductions as well.  The :? can eventually count for another DR for
>>> cost purposes.
>> Number of non-constant-init reductions can still be carried in struct
>> loop_cand?  I am not very sure what's the advantage of an additional
>> loop over m_reductions getting the same information.
>> Perhaps the increase of stmts should be counted like:
>>   num_old_inv_drs + num_const_init_simple_reduc * 2 - num_new_inv_drs
>> Question is which number should this be compared against.  (we may
>> need to shift num_new_inv_drs to the other side for wrapping issue).
>>
>>>
>>> It looks like we do count the existing DRs for the reduction?  Is that
>>> why you arrive
>>> at the num_const_init_simple_reduc * 2 figure? (one extra load plus one ?:)
>> Yes.
>>> But we don't really know whether the DR was invariant in the outer
>>> loop (well, I suppose
>> Hmm, I might misunderstand here.  num_old_inv_drs tracks the number of
>> invariant reference with regarding to inner loop, rather than the
>> outer loop.  The same to num_new_inv_drs,
>> which means a reference become invariant after loop interchange with
>> regarding to (the new) inner loop.  This invariant information is
>> always known from data reference, right?
>> As for DRs for reduction, we know it's invariant because we set its
>> inner loop stride to zero.
>>
>>> we could remember the DR in m_reductions).
>>>
>>> Note that the good thing is that the ?: has an invariant condition and
>>> thus vectorization
>>> can hoist the mask generation out of the vectorized loop which means
>>> it boils down to
>>> cheap operations.  My gut feeling is that just looking at the number
>>> of memory references
>>> isn't a good indicator of profitability as the regular stmt workload
>>> has a big impact on
>>> profitability of vectorization.
>> It's not specific to vectorization.  The generated new code also costs
>> too much in small loops without vectorization.  But yes, # of mem_refs
>> may be too inaccurate, maybe we should check against num_stmts.
>
> Not specific to vectorization but the interchange may pay off only when
> vectorizing a loop.  Would the loop in loop-interchange-5.c be still
> interchanged?  If we remove the multiplication and just keep
> c[i][j] = c[i][j] + b[k][j];
> ?  That is, why is the constant init so special?  Even for non-constant init
> we're changing two outer loop DRs to two non-consecutive inner loop DRs.

Hi Richard,
This is the updated patch taking stmt cost into consideration.

First, the stmt cost (i.e. the number of stmts) of each loop is recorded.
Then the stmt cost of the outer loop is adjusted by decreasing it by the
number of IVs and increasing it by the number of constant initialized
simple reductions.
Lastly, we check the stmt cost between the inner and outer loops and give
up on interchange if the outer loop has too many stmts.

Test gcc.target/aarch64/pr62178.c is fixed with this patch.  Bootstrapped
and tested on x86_64 and AArch64.  Any comments?

Thanks,
bin
2017-12-12  Bin Cheng  <bin.cheng@arm.com>

    * gimple-loop-interchange.cc (STMT_COST_RATIO): New macro.
    (loop_cand::m_num_stmts, loop_cand::m_const_init_reduc): New members.
    (loop_cand::loop_cand): Initialize above members.
    (loop_cand::supported_operations): Delete.
    (loop_cand::can_interchange_p): Inline above function.
    (loop_cand::classify_simple_reduction): Record number of constant
    initialized simple reductions.
    (should_interchange_loops): New parameters.  Check stmt cost of loops
    to be interchanged.
    (tree_loop_interchange::interchange): Prepare stmt cost of outer loop.
    Update call to should_interchange_loops.
    (should_interchange_loop_nest): Update call to
    should_interchange_loops.
diff --git a/gcc/gimple-loop-interchange.cc b/gcc/gimple-loop-interchange.cc
index e80e65c..495bd03 100644
--- a/gcc/gimple-loop-interchange.cc
+++ b/gcc/gimple-loop-interchange.cc
@@ -90,6 +90,10 @@ along with GCC; see the file COPYING3.  If not see
    innermost two loops.  */
 #define INNER_STRIDE_RATIO  ((OUTER_STRIDE_RATIO) + 1)
 
+/* Comparison ratio of stmt cost between inner/outer loops.  Loops won't
+   be interchanged if outer loop has too many stmts.  */
+#define STMT_COST_RATIO     (3)
+
 /* Vector of strides that DR accesses in each level loop of a loop nest.  */
 #define DR_ACCESS_STRIDE(dr) ((vec<tree> *) dr->aux)
 
@@ -181,7 +185,6 @@ struct loop_cand
   bool analyze_carried_vars (loop_cand *);
   bool analyze_lcssa_phis (void);
   bool can_interchange_p (loop_cand *);
-  bool supported_operations (basic_block, loop_cand *, int *);
   void undo_simple_reduction (reduction_p, bitmap);
 
   /* The loop itself.  */
@@ -199,13 +202,17 @@ struct loop_cand
   edge m_exit;
   /* Basic blocks of this loop.  */
   basic_block *m_bbs;
+  /* Number of stmts of this loop.  Inner loops' stmts are not included.  */
+  int m_num_stmts;
+  /* Number of constant initialized simple reduction.  */
+  int m_const_init_reduc;
 };
 
 /* Constructor.  */
 
 loop_cand::loop_cand (struct loop *loop, struct loop *outer)
-  : m_loop (loop), m_outer (outer),
-    m_exit (single_exit (loop)), m_bbs (get_loop_body (loop))
+  : m_loop (loop), m_outer (outer), m_exit (single_exit (loop)),
+    m_bbs (get_loop_body (loop)), m_num_stmts (0), m_const_init_reduc (0)
 {
     m_inductions.create (3);
     m_reductions.create (3);
@@ -282,75 +289,6 @@ loop_cand::find_reduction_by_stmt (gimple *stmt)
   return NULL;
 }
 
-/* Return true if all stmts in BB can be supported by loop interchange,
-   otherwise return false.  ILOOP is not NULL if this loop_cand is the
-   outer loop in loop nest.  Add the number of supported statements to
-   NUM_STMTS.  */
-
-bool
-loop_cand::supported_operations (basic_block bb, loop_cand *iloop,
-				 int *num_stmts)
-{
-  int bb_num_stmts = 0;
-  gphi_iterator psi;
-  gimple_stmt_iterator gsi;
-
-  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
-    {
-      gimple *stmt = gsi_stmt (gsi);
-      if (is_gimple_debug (stmt))
-	continue;
-
-      if (gimple_has_side_effects (stmt))
-	return false;
-
-      bb_num_stmts++;
-      if (gcall *call = dyn_cast <gcall *> (stmt))
-	{
-	  /* In basic block of outer loop, the call should be cheap since
-	     it will be moved to inner loop.  */
-	  if (iloop != NULL
-	      && !gimple_inexpensive_call_p (call))
-	    return false;
-	  continue;
-	}
-
-      if (!iloop || !gimple_vuse (stmt))
-	continue;
-
-      /* Support stmt accessing memory in outer loop only if it is for inner
-	 loop's reduction.  */
-      if (iloop->find_reduction_by_stmt (stmt))
-	continue;
-
-      tree lhs;
-      /* Support loop invariant memory reference if it's only used once by
-	 inner loop.  */
-      /* ???  How's this checking for invariantness?  */
-      if (gimple_assign_single_p (stmt)
-	  && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
-	  && TREE_CODE (lhs) == SSA_NAME
-	  && single_use_in_loop (lhs, iloop->m_loop))
-	continue;
-
-      return false;
-    }
-  *num_stmts += bb_num_stmts;
-
-  /* Allow PHI nodes in any basic block of inner loop, PHI nodes in outer
-     loop's header, or PHI nodes in dest bb of inner loop's exit edge.  */
-  if (!iloop || bb == m_loop->header
-      || bb == iloop->m_exit->dest)
-    return true;
-
-  /* Don't allow any other PHI nodes.  */
-  for (psi = gsi_start_phis (bb); !gsi_end_p (psi); gsi_next (&psi))
-    if (!virtual_operand_p (PHI_RESULT (psi.phi ())))
-      return false;
-
-  return true;
-}
-
 /* Return true if current loop_cand be interchanged.  ILOOP is not NULL if
    current loop_cand is outer loop in loop nest.  */
 
@@ -375,23 +313,72 @@ loop_cand::can_interchange_p (loop_cand *iloop)
   if (m_lcssa_nodes.length () > allowed_reduction_num)
     return false;
 
-  int num_stmts = 0;
-  /* Check basic blocks other than loop header/exit.  */
+  /* Check if basic block has any unsupported operation.  Note basic blocks
+     of inner loops are not checked here.  */
   for (unsigned i = 0; i < m_loop->num_nodes; i++)
     {
       basic_block bb = m_bbs[i];
+      gphi_iterator psi;
+      gimple_stmt_iterator gsi;
 
       /* Skip basic blocks of inner loops.  */
       if (bb->loop_father != m_loop)
-	continue;
+       continue;
 
-      /* Check if basic block has any unsupported operation.  */
-      if (!supported_operations (bb, iloop, &num_stmts))
-	return false;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  if (is_gimple_debug (stmt))
+	    continue;
+
+	  if (gimple_has_side_effects (stmt))
+	    return false;
+
+	  m_num_stmts++;
+	  if (gcall *call = dyn_cast <gcall *> (stmt))
+	    {
+	      /* In basic block of outer loop, the call should be cheap since
+		 it will be moved to inner loop.  */
+	      if (iloop != NULL
+		  && !gimple_inexpensive_call_p (call))
+		return false;
+	      continue;
+	    }
+
+	  if (!iloop || !gimple_vuse (stmt))
+	    continue;
 
+	  /* Support stmt accessing memory in outer loop only if it is for
+	     inner loop's reduction.  */
+	  if (iloop->find_reduction_by_stmt (stmt))
+	    continue;
+
+	  tree lhs;
+	  /* Support loop invariant memory reference if it's only used once by
+	     inner loop.  */
+	  /* ???  How's this checking for invariantness?  */
+	  if (gimple_assign_single_p (stmt)
+	      && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
+	      && TREE_CODE (lhs) == SSA_NAME
+	      && single_use_in_loop (lhs, iloop->m_loop))
+	    continue;
+
+	  return false;
+	}
       /* Check if loop has too many stmts.  */
-      if (num_stmts > MAX_NUM_STMT)
+      if (m_num_stmts > MAX_NUM_STMT)
 	return false;
+
+      /* Allow PHI nodes in any basic block of inner loop, PHI nodes in outer
+	 loop's header, or PHI nodes in dest bb of inner loop's exit edge.  */
+      if (!iloop || bb == m_loop->header
+	  || bb == iloop->m_exit->dest)
+	continue;
+
+      /* Don't allow any other PHI nodes.  */
+      for (psi = gsi_start_phis (bb); !gsi_end_p (psi); gsi_next (&psi))
+	if (!virtual_operand_p (PHI_RESULT (psi.phi ())))
+	  return false;
     }
 
   return true;
@@ -440,7 +427,9 @@ loop_cand::classify_simple_reduction (reduction_p re)
 
       re->init_ref = gimple_assign_rhs1 (producer);
     }
-  else if (!CONSTANT_CLASS_P (re->init))
+  else if (CONSTANT_CLASS_P (re->init))
+    m_const_init_reduc++;
+  else
     return;
 
   /* Check how reduction variable is used.  */
@@ -1446,6 +1435,7 @@ dump_access_strides (vec<data_reference_p> datarefs)
 static bool
 should_interchange_loops (unsigned i_idx, unsigned o_idx,
 			  vec<data_reference_p> datarefs,
+			  unsigned i_stmt_cost, unsigned o_stmt_cost,
 			  bool innermost_loops_p, bool dump_info_p = true)
 {
   unsigned HOST_WIDE_INT ratio;
@@ -1541,11 +1531,21 @@ should_interchange_loops (unsigned i_idx, unsigned o_idx,
 	       num_resolved_not_ok_drs);
       fprintf (dump_file, "Variable strides we cannot decide: %d\n",
 	       num_unresolved_drs);
+      fprintf (dump_file, "Stmt cost of inner loop: %d\n", i_stmt_cost);
+      fprintf (dump_file, "Stmt cost of outer loop: %d\n", o_stmt_cost);
     }
 
   if (num_unresolved_drs != 0 || num_resolved_not_ok_drs != 0)
     return false;
 
+  /* Stmts of outer loop will be moved to inner loop.  If there are too many
+     such stmts, it could make inner loop costly.  Here we compare stmt cost
+     between outer and inner loops.  */
+  if (i_stmt_cost && o_stmt_cost
+      && num_old_inv_drs + o_stmt_cost > num_new_inv_drs
+      && o_stmt_cost * STMT_COST_RATIO > i_stmt_cost)
+    return false;
+
   /* We use different stride comparison ratio for interchanging innermost
      two loops or not.  The idea is to be conservative in interchange for
      the innermost loops.  */
@@ -1599,8 +1599,24 @@ tree_loop_interchange::interchange (vec<data_reference_p> datarefs,
 	  || !oloop.can_interchange_p (&iloop))
 	break;
 
+      /* Outer loop's stmts will be moved to inner loop during interchange.
+	 If there are many of them, it may make inner loop very costly.  We
+	 need to check number of outer loop's stmts in profit cost model of
+	 interchange.  */
+      int stmt_cost = oloop.m_num_stmts;
+      /* Count out the exit checking stmt of outer loop.  */
+      stmt_cost --;
+      /* Count out the IV's increment stmt; IVOPTs takes care of it.  */
+      stmt_cost -= oloop.m_inductions.length ();
+      /* Count in the additional load and cond_expr stmts caused by inner
+	 loop's constant initialized reduction.  */
+      stmt_cost += iloop.m_const_init_reduc * 2;
+      if (stmt_cost < 0)
+	stmt_cost = 0;
+
       /* Check profitability for loop interchange.  */
       if (should_interchange_loops (i_idx, o_idx, datarefs,
+				    iloop.m_num_stmts, (unsigned) stmt_cost,
 				    iloop.m_loop->inner == NULL))
 	{
 	  if (dump_file && (dump_flags & TDF_DETAILS))
@@ -1793,7 +1809,7 @@ should_interchange_loop_nest (struct loop *loop_nest, struct loop *innermost,
   /* Check if any two adjacent loops should be interchanged.  */
   for (struct loop *loop = innermost;
        loop != loop_nest; loop = loop_outer (loop), idx--)
-    if (should_interchange_loops (idx, idx - 1, datarefs,
+    if (should_interchange_loops (idx, idx - 1, datarefs, 0, 0,
 				  loop == innermost, false))
       return true;
Richard Biener Dec. 14, 2017, 1:27 p.m. | #7
On Tue, Dec 12, 2017 at 1:42 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Fri, Dec 8, 2017 at 2:40 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, Dec 8, 2017 at 1:43 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Fri, Dec 8, 2017 at 12:17 PM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Fri, Dec 8, 2017 at 12:46 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>> Hi,
>>>>> This simple patch makes interchange even more conservative for small
>>>>> loops with constant initialized simple reduction.  The reason is that
>>>>> undoing such reduction introduces a new data reference and cond_expr,
>>>>> which could cost too much in a small loop.
>>>>> Test gcc.target/aarch64/pr62178.c is fixed with this patch.  Is it OK
>>>>> if test passes?
>>>>
>>>> Shouldn't we do this even for non-constant initialized simple
>>>> reduction?  Because for any simple reduction we add two DRs that are
>>>> not innermost; for constant initialized we add an additional
>>>> cond-expr.  So ...
>>>>
>>>> +  /* Conservatively skip interchange in cases only have few data references
>>>> +     and constant initialized simple reduction since it introduces new data
>>>> +     reference as well as ?: operation.  */
>>>> +  if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
>>>> +    return false;
>>>> +
>>>>
>>>> can you, instead of carrying num_const_init_simple_reduc, simply loop
>>>> over m_reductions and classify them in this function accordingly?  I
>>>> think we want to cost non-constant-init reductions as well.  The ?:
>>>> can eventually count for another DR for cost purposes.
>>> Number of non-constant-init reductions can still be carried in struct
>>> loop_cand?  I am not very sure what the advantage is of an additional
>>> loop over m_reductions getting the same information.
>>> Perhaps the increase of stmts should be counted like:
>>>   num_old_inv_drs + num_const_init_simple_reduc * 2 - num_new_inv_drs
>>> Question is which number this should be compared against.  (We may
>>> need to shift num_new_inv_drs to the other side for wrapping issues.)
>>>
>>>> It looks like we do count the existing DRs for the reduction?  Is that
>>>> why you arrive at the num_const_init_simple_reduc * 2 figure?  (One
>>>> extra load plus one ?:.)
>>> Yes.
>>>> But we don't really know whether the DR was invariant in the outer
>>>> loop (well, I suppose
>>> Hmm, I might misunderstand here.  num_old_inv_drs tracks the number of
>>> invariant references with regard to the inner loop, rather than the
>>> outer loop.  The same for num_new_inv_drs, which means a reference
>>> becomes invariant after loop interchange with regard to the (new)
>>> inner loop.  This invariant information is always known from the data
>>> reference, right?
>>> As for DRs for reduction, we know it's invariant because we set its
>>> inner loop stride to zero.
>>>
>>>> we could remember the DR in m_reductions).
>>>>
>>>> Note that the good thing is that the ?: has an invariant condition and
>>>> thus vectorization can hoist the mask generation out of the vectorized
>>>> loop, which means it boils down to cheap operations.  My gut feeling
>>>> is that just looking at the number of memory references isn't a good
>>>> indicator of profitability, as the regular stmt workload has a big
>>>> impact on profitability of vectorization.
>>> It's not specific to vectorization.  The generated new code also costs
>>> too much in small loops without vectorization.  But yes, # of mem_refs
>>> may be too inaccurate; maybe we should check against num_stmts.
>>
>> Not specific to vectorization, but the interchange may pay off only when
>> vectorizing a loop.  Would the loop in loop-interchange-5.c still be
>> interchanged?  If we remove the multiplication and just keep
>> c[i][j] = c[i][j] + b[k][j];
>> ?  That is, why is the constant init so special?  Even for non-constant
>> init we're changing two outer loop DRs to two non-consecutive inner
>> loop DRs.
> Hi Richard,
> This is the updated patch taking stmt cost into consideration.
>
> First, the stmt cost (# of stmts) of each loop is recorded.  Then the
> stmt cost of the outer loop is adjusted by decreasing the number of IVs
> and increasing by the number of constant initialized simple reductions.
> Lastly we check stmt cost between inner/outer loops and give up on
> interchange if the outer loop has too many stmts.
>
> Test gcc.target/aarch64/pr62178.c is fixed with this patch.  Bootstrap
> and test on x86_64 and AArch64.  Any comment?


This looks good now!

Ok for trunk.

Thanks,
Richard.

> Thanks,
> bin
> 2017-12-12  Bin Cheng  <bin.cheng@arm.com>
>
>     * gimple-loop-interchange.cc (STMT_COST_RATIO): New macro.
>     (loop_cand::m_num_stmts, loop_cand::m_const_init_reduc): New members.
>     (loop_cand::loop_cand): Initialize above members.
>     (loop_cand::supported_operations): Delete.
>     (loop_cand::can_interchange_p): Inline above function.
>     (loop_cand::classify_simple_reduction): Record number of constant
>     initialized simple reductions.
>     (should_interchange_loops): New parameters.  Check stmt cost of loops
>     to be interchanged.
>     (tree_loop_interchange::interchange): Prepare stmt cost of outer loop.
>     Update call to should_interchange_loops.
>     (should_interchange_loop_nest): Update call to
>     should_interchange_loops.

Patch

diff --git a/gcc/gimple-loop-interchange.cc b/gcc/gimple-loop-interchange.cc
index 6554a42..f45f7dc 100644
--- a/gcc/gimple-loop-interchange.cc
+++ b/gcc/gimple-loop-interchange.cc
@@ -199,13 +199,16 @@  struct loop_cand
   edge m_exit;
   /* Basic blocks of this loop.  */
   basic_block *m_bbs;
+  /* Number of constant initialized simple reductions.  */
+  unsigned m_num_const_init_simple_reduc;
 };
 
 /* Constructor.  */
 
 loop_cand::loop_cand (struct loop *loop, struct loop *outer)
   : m_loop (loop), m_outer (outer),
-    m_exit (single_exit (loop)), m_bbs (get_loop_body (loop))
+    m_exit (single_exit (loop)), m_bbs (get_loop_body (loop)),
+    m_num_const_init_simple_reduc (0)
 {
     m_inductions.create (3);
     m_reductions.create (3);
@@ -440,7 +443,9 @@  loop_cand::classify_simple_reduction (reduction_p re)
 
       re->init_ref = gimple_assign_rhs1 (producer);
     }
-  else if (!CONSTANT_CLASS_P (re->init))
+  else if (CONSTANT_CLASS_P (re->init))
+    m_num_const_init_simple_reduc++;
+  else
     return;
 
   /* Check how reduction variable is used.  */
@@ -1422,6 +1427,7 @@  dump_access_strides (vec<data_reference_p> datarefs)
 static bool
 should_interchange_loops (unsigned i_idx, unsigned o_idx,
 			  vec<data_reference_p> datarefs,
+			  unsigned num_const_init_simple_reduc,
 			  bool innermost_loops_p, bool dump_info_p = true)
 {
   unsigned HOST_WIDE_INT ratio;
@@ -1522,6 +1528,12 @@  should_interchange_loops (unsigned i_idx, unsigned o_idx,
   if (num_unresolved_drs != 0 || num_resolved_not_ok_drs != 0)
     return false;
 
+  /* Conservatively skip interchange in cases that have only a few data
+     references and a constant initialized simple reduction, since undoing
+     the reduction introduces a new data reference as well as a ?:
+     operation.  */
+  if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
+    return false;
+
   /* We use different stride comparison ratio for interchanging innermost
      two loops or not.  The idea is to be conservative in interchange for
      the innermost loops.  */
@@ -1576,6 +1588,7 @@  tree_loop_interchange::interchange (vec<data_reference_p> datarefs,
 
       /* Check profitability for loop interchange.  */
       if (should_interchange_loops (i_idx, o_idx, datarefs,
+				    iloop.m_num_const_init_simple_reduc,
 				    iloop.m_loop->inner == NULL))
 	{
 	  if (dump_file && (dump_flags & TDF_DETAILS))
@@ -1764,7 +1779,7 @@  should_interchange_loop_nest (struct loop *loop_nest, struct loop *innermost,
   /* Check if any two adjacent loops should be interchanged.  */
   for (struct loop *loop = innermost;
        loop != loop_nest; loop = loop_outer (loop), idx--)
-    if (should_interchange_loops (idx, idx - 1, datarefs,
+    if (should_interchange_loops (idx, idx - 1, datarefs, 0,
 				  loop == innermost, false))
       return true;