Restrict LOOP_ALIGN to loop headers only.

Message ID 90fcdb35-bd56-1614-041d-58d274d46074@suse.cz
State New
Headers show
Series
  • Restrict LOOP_ALIGN to loop headers only.
Related show

Commit Message

Martin Liška July 9, 2019, 9:02 a.m.
Hi.

I'm suggesting to restrict LOOP_ALIGN to only loop headers. That are the
basic blocks for which it makes the biggest sense. I quite some binary
size reductions on SPEC2006 and SPEC2017. Speed numbers are also slightly
positive.

Patch can bootstrap on x86_64-linux-gnu and survives regression tests.

Ready to be installed?
Thanks,
Martin

gcc/ChangeLog:

2019-07-09  Martin Liska  <mliska@suse.cz>

	* final.c (compute_alignments): Apply the LOOP_ALIGN only
	to basic blocks that all loop headers.
---
 gcc/final.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Jan Hubicka July 9, 2019, 9:56 a.m. | #1
> Hi.

> 

> I'm suggesting to restrict LOOP_ALIGN to only loop headers. That are the

> basic blocks for which it makes the biggest sense. I quite some binary

> size reductions on SPEC2006 and SPEC2017. Speed numbers are also slightly

> positive.

> 

> Patch can bootstrap on x86_64-linux-gnu and survives regression tests.

> 

> Ready to be installed?

The original idea of distinction between jump alignment and loop
alignment was that they have two basic meanings:
 1) jump alignment is there to avoid jumping just to the end of decode
 window (if the window is aligned) so CPU will get stuck after reaching
 the jump and also to possibly reduce code cache polution by populating
 by code that is not executed
 2) loop alignment aims to fit loop in as few cache windows as possible

Now if you have loop laid in a way that header of loop is not first
basic block, 2) IMO still apply.  I.e.

		jump loop
:loopback
		loop body
:loop
		if cond jump to loopback

So dropping loop alignment for those does not seem to make much sense
from high level.  We may want to have differnt alignment for loops
starting by header and loops starting in the middle, but I still liked
more your patch which did bundles for loops.

modern x86 chips are not very good testing targets on it.  I guess
generic changes to alignment needs to be tested on other chips too.

Honza
> Thanks,

> Martin

> 

> gcc/ChangeLog:

> 

> 2019-07-09  Martin Liska  <mliska@suse.cz>

> 

> 	* final.c (compute_alignments): Apply the LOOP_ALIGN only

> 	to basic blocks that all loop headers.

> ---

>  gcc/final.c | 1 +

>  1 file changed, 1 insertion(+)

> 

> 


> diff --git a/gcc/final.c b/gcc/final.c

> index fefc4874b24..ce2678da988 100644

> --- a/gcc/final.c

> +++ b/gcc/final.c

> @@ -739,6 +739,7 @@ compute_alignments (void)

>        if (has_fallthru

>  	  && !(single_succ_p (bb)

>  	       && single_succ (bb) == EXIT_BLOCK_PTR_FOR_FN (cfun))

> +	  && bb->loop_father->header == bb

>  	  && optimize_bb_for_speed_p (bb)

>  	  && branch_count + fallthru_count > count_threshold

>  	  && (branch_count

>
Martin Liška July 9, 2019, 10:10 a.m. | #2
On 7/9/19 11:56 AM, Jan Hubicka wrote:
>> Hi.

>>

>> I'm suggesting to restrict LOOP_ALIGN to only loop headers. That are the

>> basic blocks for which it makes the biggest sense. I quite some binary

>> size reductions on SPEC2006 and SPEC2017. Speed numbers are also slightly

>> positive.

>>

>> Patch can bootstrap on x86_64-linux-gnu and survives regression tests.

>>

>> Ready to be installed?

> The original idea of distinction between jump alignment and loop

> alignment was that they have two basic meanings:

>  1) jump alignment is there to avoid jumping just to the end of decode

>  window (if the window is aligned) so CPU will get stuck after reaching

>  the jump and also to possibly reduce code cache polution by populating

>  by code that is not executed

>  2) loop alignment aims to fit loop in as few cache windows as possible

> 

> Now if you have loop laid in a way that header of loop is not first

> basic block, 2) IMO still apply.  I.e.

> 

> 		jump loop

> :loopback

> 		loop body

> :loop

> 		if cond jump to loopback

> 

> So dropping loop alignment for those does not seem to make much sense

> from high level.  We may want to have differnt alignment for loops

> starting by header and loops starting in the middle,


That's quite complicated condition, I would not introduce a new alignment.

> but I still liked

> more your patch which did bundles for loops.


The patch caused regression for quite some benchmarks and has it's own
problems (need of a recent GAS, not doing a bundle for bundles that
can't fit in a single bundle window). For that reasons, I decided
to not work on it any longer.

Martin

> 

> modern x86 chips are not very good testing targets on it.  I guess

> generic changes to alignment needs to be tested on other chips too.

> 

> Honza

>> Thanks,

>> Martin

>>

>> gcc/ChangeLog:

>>

>> 2019-07-09  Martin Liska  <mliska@suse.cz>

>>

>> 	* final.c (compute_alignments): Apply the LOOP_ALIGN only

>> 	to basic blocks that all loop headers.

>> ---

>>  gcc/final.c | 1 +

>>  1 file changed, 1 insertion(+)

>>

>>

> 

>> diff --git a/gcc/final.c b/gcc/final.c

>> index fefc4874b24..ce2678da988 100644

>> --- a/gcc/final.c

>> +++ b/gcc/final.c

>> @@ -739,6 +739,7 @@ compute_alignments (void)

>>        if (has_fallthru

>>  	  && !(single_succ_p (bb)

>>  	       && single_succ (bb) == EXIT_BLOCK_PTR_FOR_FN (cfun))

>> +	  && bb->loop_father->header == bb

>>  	  && optimize_bb_for_speed_p (bb)

>>  	  && branch_count + fallthru_count > count_threshold

>>  	  && (branch_count

>>

> 

> 

>
Richard Biener July 9, 2019, 10:22 a.m. | #3
On Tue, Jul 9, 2019 at 11:56 AM Jan Hubicka <hubicka@ucw.cz> wrote:
>

> > Hi.

> >

> > I'm suggesting to restrict LOOP_ALIGN to only loop headers. That are the

> > basic blocks for which it makes the biggest sense. I quite some binary

> > size reductions on SPEC2006 and SPEC2017. Speed numbers are also slightly

> > positive.

> >

> > Patch can bootstrap on x86_64-linux-gnu and survives regression tests.

> >

> > Ready to be installed?

> The original idea of distinction between jump alignment and loop

> alignment was that they have two basic meanings:

>  1) jump alignment is there to avoid jumping just to the end of decode

>  window (if the window is aligned) so CPU will get stuck after reaching

>  the jump and also to possibly reduce code cache polution by populating

>  by code that is not executed

>  2) loop alignment aims to fit loop in as few cache windows as possible

>

> Now if you have loop laid in a way that header of loop is not first

> basic block, 2) IMO still apply.  I.e.

>

>                 jump loop

> :loopback

>                 loop body

> :loop

>                 if cond jump to loopback

>

> So dropping loop alignment for those does not seem to make much sense

> from high level.  We may want to have differnt alignment for loops

> starting by header and loops starting in the middle, but I still liked

> more your patch which did bundles for loops.

>

> modern x86 chips are not very good testing targets on it.  I guess

> generic changes to alignment needs to be tested on other chips too.

>

> Honza

> > Thanks,

> > Martin

> >

> > gcc/ChangeLog:

> >

> > 2019-07-09  Martin Liska  <mliska@suse.cz>

> >

> >       * final.c (compute_alignments): Apply the LOOP_ALIGN only

> >       to basic blocks that all loop headers.

> > ---

> >  gcc/final.c | 1 +

> >  1 file changed, 1 insertion(+)

> >

> >

>

> > diff --git a/gcc/final.c b/gcc/final.c

> > index fefc4874b24..ce2678da988 100644

> > --- a/gcc/final.c

> > +++ b/gcc/final.c

> > @@ -739,6 +739,7 @@ compute_alignments (void)

> >        if (has_fallthru

> >         && !(single_succ_p (bb)

> >              && single_succ (bb) == EXIT_BLOCK_PTR_FOR_FN (cfun))

> > +       && bb->loop_father->header == bb


I agree that the above is the wrong condition - but I'm not sure we
only end up using LOOP_ALIGN for blocks reached by a DFS_BACK
edge.  Note that DFS_BACK would have to be applied considering
the current CFG layout, simply doing mark_dfs_back_edges doesn't
work (we're in CFG layout mode here, no?).  Eventually the code
counting brances effectively already does this though.

The odd thing is that we apply LOOP_ALIGN only to blocks that
have a fallthru incoming edge.  I don't see Honzas example
above having one.

> >         && optimize_bb_for_speed_p (bb)

> >         && branch_count + fallthru_count > count_threshold

> >         && (branch_count

> >

>

>

>
Richard Biener July 9, 2019, 10:23 a.m. | #4
On Tue, Jul 9, 2019 at 12:22 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>

> On Tue, Jul 9, 2019 at 11:56 AM Jan Hubicka <hubicka@ucw.cz> wrote:

> >

> > > Hi.

> > >

> > > I'm suggesting to restrict LOOP_ALIGN to only loop headers. That are the

> > > basic blocks for which it makes the biggest sense. I quite some binary

> > > size reductions on SPEC2006 and SPEC2017. Speed numbers are also slightly

> > > positive.

> > >

> > > Patch can bootstrap on x86_64-linux-gnu and survives regression tests.

> > >

> > > Ready to be installed?

> > The original idea of distinction between jump alignment and loop

> > alignment was that they have two basic meanings:

> >  1) jump alignment is there to avoid jumping just to the end of decode

> >  window (if the window is aligned) so CPU will get stuck after reaching

> >  the jump and also to possibly reduce code cache polution by populating

> >  by code that is not executed

> >  2) loop alignment aims to fit loop in as few cache windows as possible

> >

> > Now if you have loop laid in a way that header of loop is not first

> > basic block, 2) IMO still apply.  I.e.

> >

> >                 jump loop

> > :loopback

> >                 loop body

> > :loop

> >                 if cond jump to loopback

> >

> > So dropping loop alignment for those does not seem to make much sense

> > from high level.  We may want to have differnt alignment for loops

> > starting by header and loops starting in the middle, but I still liked

> > more your patch which did bundles for loops.

> >

> > modern x86 chips are not very good testing targets on it.  I guess

> > generic changes to alignment needs to be tested on other chips too.

> >

> > Honza

> > > Thanks,

> > > Martin

> > >

> > > gcc/ChangeLog:

> > >

> > > 2019-07-09  Martin Liska  <mliska@suse.cz>

> > >

> > >       * final.c (compute_alignments): Apply the LOOP_ALIGN only

> > >       to basic blocks that all loop headers.

> > > ---

> > >  gcc/final.c | 1 +

> > >  1 file changed, 1 insertion(+)

> > >

> > >

> >

> > > diff --git a/gcc/final.c b/gcc/final.c

> > > index fefc4874b24..ce2678da988 100644

> > > --- a/gcc/final.c

> > > +++ b/gcc/final.c

> > > @@ -739,6 +739,7 @@ compute_alignments (void)

> > >        if (has_fallthru

> > >         && !(single_succ_p (bb)

> > >              && single_succ (bb) == EXIT_BLOCK_PTR_FOR_FN (cfun))

> > > +       && bb->loop_father->header == bb

>

> I agree that the above is the wrong condition - but I'm not sure we

> only end up using LOOP_ALIGN for blocks reached by a DFS_BACK

> edge.  Note that DFS_BACK would have to be applied considering

> the current CFG layout, simply doing mark_dfs_back_edges doesn't

> work (we're in CFG layout mode here, no?).


So a "backedge" in this sense would be e->dest->index < e->src->index.
No?

> Eventually the code

> counting brances effectively already does this though.

>

> The odd thing is that we apply LOOP_ALIGN only to blocks that

> have a fallthru incoming edge.  I don't see Honzas example

> above having one.

>

> > >         && optimize_bb_for_speed_p (bb)

> > >         && branch_count + fallthru_count > count_threshold

> > >         && (branch_count

> > >

> >

> >

> >
Richard Biener July 9, 2019, 10:26 a.m. | #5
On Tue, Jul 9, 2019 at 12:23 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>

> On Tue, Jul 9, 2019 at 12:22 PM Richard Biener

> <richard.guenther@gmail.com> wrote:

> >

> > On Tue, Jul 9, 2019 at 11:56 AM Jan Hubicka <hubicka@ucw.cz> wrote:

> > >

> > > > Hi.

> > > >

> > > > I'm suggesting to restrict LOOP_ALIGN to only loop headers. That are the

> > > > basic blocks for which it makes the biggest sense. I quite some binary

> > > > size reductions on SPEC2006 and SPEC2017. Speed numbers are also slightly

> > > > positive.

> > > >

> > > > Patch can bootstrap on x86_64-linux-gnu and survives regression tests.

> > > >

> > > > Ready to be installed?

> > > The original idea of distinction between jump alignment and loop

> > > alignment was that they have two basic meanings:

> > >  1) jump alignment is there to avoid jumping just to the end of decode

> > >  window (if the window is aligned) so CPU will get stuck after reaching

> > >  the jump and also to possibly reduce code cache polution by populating

> > >  by code that is not executed

> > >  2) loop alignment aims to fit loop in as few cache windows as possible

> > >

> > > Now if you have loop laid in a way that header of loop is not first

> > > basic block, 2) IMO still apply.  I.e.

> > >

> > >                 jump loop

> > > :loopback

> > >                 loop body

> > > :loop

> > >                 if cond jump to loopback

> > >

> > > So dropping loop alignment for those does not seem to make much sense

> > > from high level.  We may want to have differnt alignment for loops

> > > starting by header and loops starting in the middle, but I still liked

> > > more your patch which did bundles for loops.

> > >

> > > modern x86 chips are not very good testing targets on it.  I guess

> > > generic changes to alignment needs to be tested on other chips too.

> > >

> > > Honza

> > > > Thanks,

> > > > Martin

> > > >

> > > > gcc/ChangeLog:

> > > >

> > > > 2019-07-09  Martin Liska  <mliska@suse.cz>

> > > >

> > > >       * final.c (compute_alignments): Apply the LOOP_ALIGN only

> > > >       to basic blocks that all loop headers.

> > > > ---

> > > >  gcc/final.c | 1 +

> > > >  1 file changed, 1 insertion(+)

> > > >

> > > >

> > >

> > > > diff --git a/gcc/final.c b/gcc/final.c

> > > > index fefc4874b24..ce2678da988 100644

> > > > --- a/gcc/final.c

> > > > +++ b/gcc/final.c

> > > > @@ -739,6 +739,7 @@ compute_alignments (void)

> > > >        if (has_fallthru

> > > >         && !(single_succ_p (bb)

> > > >              && single_succ (bb) == EXIT_BLOCK_PTR_FOR_FN (cfun))

> > > > +       && bb->loop_father->header == bb

> >

> > I agree that the above is the wrong condition - but I'm not sure we

> > only end up using LOOP_ALIGN for blocks reached by a DFS_BACK

> > edge.  Note that DFS_BACK would have to be applied considering

> > the current CFG layout, simply doing mark_dfs_back_edges doesn't

> > work (we're in CFG layout mode here, no?).

>

> So a "backedge" in this sense would be e->dest->index < e->src->index.

> No?


To me the following would make sense.

Index: gcc/final.c
===================================================================
--- gcc/final.c (revision 273294)
+++ gcc/final.c (working copy)
@@ -669,6 +669,7 @@ compute_alignments (void)
     {
       rtx_insn *label = BB_HEAD (bb);
       bool has_fallthru = 0;
+      bool has_backedge = 0;
       edge e;
       edge_iterator ei;

@@ -693,6 +694,8 @@ compute_alignments (void)
            has_fallthru = 1, fallthru_count += e->count ();
          else
            branch_count += e->count ();
+         if (e->src->index > bb->index)
+           has_backedge = 1;
        }
       if (dump_file)
        {
@@ -736,7 +739,7 @@ compute_alignments (void)
        }
       /* In case block is frequent and reached mostly by non-fallthru edge,
         align it.  It is most likely a first block of loop.  */
-      if (has_fallthru
+      if (has_backedge
          && !(single_succ_p (bb)
               && single_succ (bb) == EXIT_BLOCK_PTR_FOR_FN (cfun))
          && optimize_bb_for_speed_p (bb)
Michael Matz July 9, 2019, 1:10 p.m. | #6
Hi,

On Tue, 9 Jul 2019, Richard Biener wrote:

> > So a "backedge" in this sense would be e->dest->index < e->src->index.

> > No?

> 

> To me the following would make sense.


The basic block index is not a DFS index, so no, that's not a test for 
backedge.


Ciao,
Michael.
Richard Biener July 9, 2019, 4:37 p.m. | #7
On July 9, 2019 3:10:19 PM GMT+02:00, Michael Matz <matz@suse.de> wrote:
>Hi,

>

>On Tue, 9 Jul 2019, Richard Biener wrote:

>

>> > So a "backedge" in this sense would be e->dest->index <

>e->src->index.

>> > No?

>> 

>> To me the following would make sense.

>

>The basic block index is not a DFS index, so no, that's not a test for 

>backedge.


I think in CFG RTL mode the BB index designates the order of the BBs in the object file? So this is a way to identify backwards jumps? 

Richard. 

>

>Ciao,

>Michael.
Michael Matz July 10, 2019, 12:11 p.m. | #8
Hi,

On Tue, 9 Jul 2019, Richard Biener wrote:

> >The basic block index is not a DFS index, so no, that's not a test for 

> >backedge.

> 

> I think in CFG RTL mode the BB index designates the order of the BBs in 

> the object file? So this is a way to identify backwards jumps?


Even if it means a backwards jump (and that's not always the case, the 
insns are emitted by following the NEXT_INSN links, without a CFG, and 
that all happens after machine-dependend reorg, and going out of cfg 
layout might link insn together even from high index BBs to low index BBs 
(e.g. because of fall-through)), that's still not a backedge in the 
general case.  If a heuristic is enough here it might be okay, though.

OTOH, as here a CFG still exists, why not simply rely on a proper DFS 
marking backedges?


Ciao,
Michael.
Richard Biener July 10, 2019, 3:52 p.m. | #9
On July 10, 2019 2:11:17 PM GMT+02:00, Michael Matz <matz@suse.de> wrote:
>Hi,

>

>On Tue, 9 Jul 2019, Richard Biener wrote:

>

>> >The basic block index is not a DFS index, so no, that's not a test

>for 

>> >backedge.

>> 

>> I think in CFG RTL mode the BB index designates the order of the BBs

>in 

>> the object file? So this is a way to identify backwards jumps?

>

>Even if it means a backwards jump (and that's not always the case, the 

>insns are emitted by following the NEXT_INSN links, without a CFG, and 

>that all happens after machine-dependend reorg, and going out of cfg 

>layout might link insn together even from high index BBs to low index

>BBs 

>(e.g. because of fall-through)), that's still not a backedge in the 

>general case.  If a heuristic is enough here it might be okay, though.

>

>OTOH, as here a CFG still exists, why not simply rely on a proper DFS 

>marking backedges?


Because proper backedges is not what we want here, see honzas example. 

So I'm second-guessing why we have different LOOP_ALIGN and when it makes sense to apply. 

Richard. 

>

>Ciao,

>Michael.
Richard Biener July 11, 2019, 9:42 a.m. | #10
On Wed, Jul 10, 2019 at 5:52 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>

> On July 10, 2019 2:11:17 PM GMT+02:00, Michael Matz <matz@suse.de> wrote:

> >Hi,

> >

> >On Tue, 9 Jul 2019, Richard Biener wrote:

> >

> >> >The basic block index is not a DFS index, so no, that's not a test

> >for

> >> >backedge.

> >>

> >> I think in CFG RTL mode the BB index designates the order of the BBs

> >in

> >> the object file? So this is a way to identify backwards jumps?

> >

> >Even if it means a backwards jump (and that's not always the case, the

> >insns are emitted by following the NEXT_INSN links, without a CFG, and

> >that all happens after machine-dependend reorg, and going out of cfg

> >layout might link insn together even from high index BBs to low index

> >BBs

> >(e.g. because of fall-through)), that's still not a backedge in the

> >general case.  If a heuristic is enough here it might be okay, though.

> >

> >OTOH, as here a CFG still exists, why not simply rely on a proper DFS

> >marking backedges?

>

> Because proper backedges is not what we want here, see honzas example.

>

> So I'm second-guessing why we have different LOOP_ALIGN and when it makes sense to apply.


So docs have

@defmac JUMP_ALIGN (@var{label})
The alignment (log base 2) to put in front of @var{label}, which is
a common destination of jumps and has no fallthru incoming edge.

...

@defmac LOOP_ALIGN (@var{label})
The alignment (log base 2) to put in front of @var{label} that heads
a frequently executed basic block (usually the header of a loop).

so I would expect the alignment pass to have

      if ( (branch_count > count_threshold
              || (bb->count > bb->prev_bb->count.apply_scale (10, 1)
                  && (bb->prev_bb->count
                      <= ENTRY_BLOCK_PTR_FOR_FN (cfun)
                           ->count.apply_scale (1, 2)))))
        {
          align_flags alignment = has_fallthru ? JUMP_ALIGN (label) :
LOOP_ALIGN (label);
          if (dump_file)
            fprintf (dump_file, "  jump alignment added.\n");
          max_alignment = align_flags::max (max_alignment, alignment);
        }

instead of the two different conditions?  Or the JUMP_ALIGN case
computing "common destination" instead of "frequently executed"
somehow but I think it makes sense that those are actually the same
here (frequently executed).  It oddly looks at prev_bb and is not
guarded with optimize_bb_for_speed_p at all so this all is
full of heuristics and changing anything here just based on x86
measurements is surely going to cause issues for targets more
sensitive to (mis-)alignment.

Richard.

> Richard.

>

> >

> >Ciao,

> >Michael.

>

Patch

diff --git a/gcc/final.c b/gcc/final.c
index fefc4874b24..ce2678da988 100644
--- a/gcc/final.c
+++ b/gcc/final.c
@@ -739,6 +739,7 @@  compute_alignments (void)
       if (has_fallthru
 	  && !(single_succ_p (bb)
 	       && single_succ (bb) == EXIT_BLOCK_PTR_FOR_FN (cfun))
+	  && bb->loop_father->header == bb
 	  && optimize_bb_for_speed_p (bb)
 	  && branch_count + fallthru_count > count_threshold
 	  && (branch_count