[PR81165] discount killed stmts when sizing blocks for threading

Message ID ortvx2oiat.fsf@lxoliva.fsfla.org
State New
Headers show
Series
  • [PR81165] discount killed stmts when sizing blocks for threading
Related show

Commit Message

Alexandre Oliva Dec. 7, 2017, 12:04 p.m.
We limit the amount of copying for jump threading based on counting
stmts.  This counting is overly pessimistic, because we will very
often delete stmts as a consequence of jump threading: when the final
conditional jump of a block is removed, earlier SSA names computed
exclusively for use in that conditional are killed.  Furthermore, PHI
nodes in blocks with only two predecessors are trivially replaced with
their now-single values after threading.

This patch scans blocks to be copied in the path constructed so far
and estimates the number of stmts that will be removed in the copies,
bumping up the stmt count limit.

Regstrapped on x86_64-linux-gnu and i686-linux-gnu.  Ok to install?


for  gcc/ChangeLog

	* tree-ssa-threadedge.c (uses_in_bb): New.
	(estimate_threading_killed_stmts): New.
	(estimate_threading_killed_stmts): New overload.
	(record_temporary_equivalences_from_stmts_at_dest): Add path
	parameter; adjust caller.  Expand limit when it's hit.

for  gcc/testsuite/ChangeLog

	* gcc.dg/pr81165.c: New.
---
 gcc/testsuite/gcc.dg/pr81165.c |   59 ++++++++++++
 gcc/tree-ssa-threadedge.c      |  189 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 245 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr81165.c


-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer

Comments

Richard Biener Dec. 7, 2017, 12:29 p.m. | #1
On Thu, Dec 7, 2017 at 1:04 PM, Alexandre Oliva <aoliva@redhat.com> wrote:
> We limit the amount of copying for jump threading based on counting

> stmts.  This counting is overly pessimistic, because we will very

> often delete stmts as a consequence of jump threading: when the final

> conditional jump of a block is removed, earlier SSA names computed

> exclusively for use in that conditional are killed.  Furthermore, PHI

> nodes in blocks with only two predecessors are trivially replaced with

> their now-single values after threading.

>

> This patch scans blocks to be copied in the path constructed so far

> and estimates the number of stmts that will be removed in the copies,

> bumping up the stmt count limit.

>

> Regstrapped on x86_64-linux-gnu and i686-linux-gnu.  Ok to install?

>

>

> for  gcc/ChangeLog

>

>         * tree-ssa-threadedge.c (uses_in_bb): New.

>         (estimate_threading_killed_stmts): New.

>         (estimate_threading_killed_stmts): New overload.

>         (record_temporary_equivalences_from_stmts_at_dest): Add path

>         parameter; adjust caller.  Expand limit when it's hit.

>

> for  gcc/testsuite/ChangeLog

>

>         * gcc.dg/pr81165.c: New.

> ---

>  gcc/testsuite/gcc.dg/pr81165.c |   59 ++++++++++++

>  gcc/tree-ssa-threadedge.c      |  189 +++++++++++++++++++++++++++++++++++++++-

>  2 files changed, 245 insertions(+), 3 deletions(-)

>  create mode 100644 gcc/testsuite/gcc.dg/pr81165.c

>

> diff --git a/gcc/testsuite/gcc.dg/pr81165.c b/gcc/testsuite/gcc.dg/pr81165.c

> new file mode 100644

> index 000000000000..8508d893bed6

> --- /dev/null

> +++ b/gcc/testsuite/gcc.dg/pr81165.c

> @@ -0,0 +1,59 @@

> +/* { dg-do compile } */

> +/* { dg-options "-O3 -fdump-tree-optimized" } */

> +/* { dg-final { scan-tree-dump-not " \[/%\] " "optimized" } } */

> +

> +/* Testcase submitted for PR81165, with its main function removed as

> +   it's turned into a compile test.  We want to make sure that all of

> +   the divide/remainder computations are removed by tree optimizers.

> +

> +   We can figure out that we don't need to compute at runtime even the

> +   condition to enter the loop: the initial i==0 would have to be

> +   greater than the sum of two small unsigned values: 1U>>t1 is in the

> +   range 0..1, whereas the char value is bounded by the range 0..127,

> +   being 128 % a positive number (zero would invoke undefined

> +   behavior, so we can assume it doesn't happen).  (We know it's

> +   nonnegative because it's 10 times a number that has no more than

> +   the bits for 16, 8 and 1 set.)

> +

> +   We don't realize that the loop is useless right away: jump

> +   threading helps remove some of the complexity, particularly of the

> +   computation within the loop: t1 is compared with 1, but it can

> +   never be 1.  (We could assume as much, since its being 1 would

> +   divide by zero, but we don't.)

> +

> +   If we don't enter the conditional block, t1 remains at 2; if we do,

> +   it's set to either -1.  If we jump thread at the end of the

> +   conditional block, we can figure out the ranges exclude 1 and the

> +   jump body is completely optimized out.  However, we used to fail to

> +   consider the block for jump threading due to the amount of

> +   computation in it, without realizing most of it would die in

> +   consequence of the threading.

> +

> +   We now take the dying code into account when deciding whether or

> +   not to try jump threading.  That might enable us to optimize the

> +   function into { if (x2 != 0 || (x1 & 1) == 0) abort (); }.  At the

> +   time of this writing, with the patch, we get close, but the test on

> +   x2 only gets as far as ((1 >> x2) == 0).  Without the patch, some

> +   of the loop remains.  */

> +

> +short x0 = 15;

> +

> +void func (){

> +  volatile int x1 = 1U;

> +  volatile char x2 = 0;

> +  char t0 = 0;

> +  unsigned long t1 = 2LU;

> +  int i = 0;

> +

> +  if(1>>x2) {

> +    t0 = -1;

> +    t1 = (1&(short)(x1^8U))-1;

> +  }

> +

> +  while(i > (int)((1U>>t1)+(char)(128%(10*(25LU&(29%x0)))))) {

> +    i += (int)(12L/(1!=(int)t1));

> +  }

> +

> +  if (t0 != -1) __builtin_abort();

> +  if (t1 != 0L) __builtin_abort();

> +}

> diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c

> index 536c4717b725..25ccac2a3ecc 100644

> --- a/gcc/tree-ssa-threadedge.c

> +++ b/gcc/tree-ssa-threadedge.c

> @@ -170,6 +170,173 @@ threadedge_valueize (tree t)

>    return t;

>  }

>

> +/* Return how many uses of T there are within BB, as long as there

> +   aren't any uses outside BB.  If there are any uses outside BB,

> +   return -1 if there's at most one use within BB, or -2 if there is

> +   more than one use within BB.  */

> +

> +static int

> +uses_in_bb (tree t, basic_block bb)

> +{

> +  int uses = 0;

> +  bool outside_bb = false;

> +

> +  imm_use_iterator iter;

> +  use_operand_p use_p;

> +  FOR_EACH_IMM_USE_FAST (use_p, iter, t)

> +    {

> +      if (is_gimple_debug (USE_STMT (use_p)))

> +       continue;

> +

> +      if (gimple_bb (USE_STMT (use_p)) != bb)

> +       outside_bb = true;

> +      else

> +       uses++;

> +

> +      if (outside_bb && uses > 1)

> +       return -2;

> +    }

> +

> +  if (outside_bb)

> +    return -1;

> +

> +  return uses;

> +}

> +

> +/* Starting from the final control flow stmt in BB, assuming it will

> +   be removed, follow uses in to-be-removed stmts back to their defs

> +   and count how many defs are to become dead and be removed as

> +   well.  */

> +

> +static int

> +estimate_threading_killed_stmts (basic_block bb)

> +{

> +  int killed_stmts = 0;

> +  hash_map<tree, int> ssa_remaining_uses;

> +  auto_vec<gimple *, 4> dead_worklist;

> +

> +  /* If the block has only two predecessors, threading will turn phi

> +     dsts into either src, so count them as dead stmts.  */

> +  bool drop_all_phis = EDGE_COUNT (bb->preds) == 2;

> +

> +  if (drop_all_phis)

> +    for (gphi_iterator gsi = gsi_start_phis (bb);

> +        !gsi_end_p (gsi); gsi_next (&gsi))

> +      {

> +       gphi *phi = gsi.phi ();

> +       tree dst = gimple_phi_result (phi);

> +

> +       /* We don't count virtual PHIs as stmts in

> +          record_temporary_equivalences_from_phis.  */

> +       if (virtual_operand_p (dst))

> +         continue;

> +

> +       killed_stmts++;

> +      }

> +

> +  if (gsi_end_p (gsi_last_bb (bb)))

> +    return killed_stmts;

> +

> +  gimple *stmt = gsi_stmt (gsi_last_bb (bb));

> +  if (gimple_code (stmt) != GIMPLE_COND

> +      && gimple_code (stmt) != GIMPLE_GOTO

> +      && gimple_code (stmt) != GIMPLE_SWITCH)

> +    return killed_stmts;

> +

> +  dead_worklist.quick_push (stmt);

> +  while (!dead_worklist.is_empty ())

> +    {

> +      stmt = dead_worklist.pop ();

> +

> +      ssa_op_iter iter;

> +      use_operand_p use_p;

> +      FOR_EACH_SSA_USE_OPERAND (use_p, stmt, iter, SSA_OP_ALL_USES)


SSA_OP_USE avoids walking virtual uses

> +       {

> +         tree t = USE_FROM_PTR (use_p);

> +         if (SSA_NAME_DEF_STMT (t)


all SSA names have a definition statement, also use a temporary here.

> +             && (gimple_code (SSA_NAME_DEF_STMT (t)) == GIMPLE_ASSIGN


use ! gimple_has_side_effects (t), this will then also handle other
kinds of stmts.

I wonder if for the handling of a larger threading path it wouldn't be useful to
keep ssa_remaining_uses live and populated across processing of the threaded
conditionals?  OTOH it would probably require to track uses on the path vs.
not on the path.

I'll let Jeff ack the threading parts, the rest of this piece looks ok to me.

I do wonder if

/* When we thread through a block we have to make copies of the
   statements within the block.  Clearly for large blocks the code
   duplication is bad.

   PARAM_MAX_JUMP_THREAD_DUPLICATION_STMTS specifies the maximum number
   of statements and PHI nodes allowed in a block which is going to
   be duplicated for thread jumping purposes.

   Some simple analysis showed that more than 99% of the jump
   threading opportunities are for blocks with less than 15
   statements.  So we can get the benefits of jump threading
   without excessive code bloat for pathological cases with the
   throttle set at 15 statements.  */
DEFPARAM (PARAM_MAX_JUMP_THREAD_DUPLICATION_STMTS,
          "max-jump-thread-duplication-stmts",
          "Maximum number of statements allowed in a block that needs
to be duplicated when threading jumps.",
          15, 0, 0)

should be adjusted though?  Can you check what effect this patch has
on code size of cc1?

Thanks,
Richard.

> +                 || (gimple_code (SSA_NAME_DEF_STMT (t)) == GIMPLE_PHI

> +                     && !virtual_operand_p (t)

> +                     && !drop_all_phis))

> +             && gimple_bb (SSA_NAME_DEF_STMT (t)) == bb)

> +           {

> +             int *usesp = ssa_remaining_uses.get (t);

> +             int uses;

> +

> +             if (usesp)

> +               uses = *usesp;

> +             else

> +               uses = uses_in_bb (t, bb);

> +

> +             gcc_assert (uses);

> +

> +             /* Don't bother recording the expected use count if we

> +                won't find any further uses within BB.  */

> +             if (!usesp && (uses < -1 || uses > 1))

> +               {

> +                 usesp = &ssa_remaining_uses.get_or_insert (t);

> +                 *usesp = uses;

> +               }

> +

> +             if (uses < 0)

> +               continue;

> +

> +             --uses;

> +             if (usesp)

> +               *usesp = uses;

> +

> +             if (!uses)

> +               {

> +                 killed_stmts++;

> +                 if (usesp)

> +                   ssa_remaining_uses.remove (t);

> +                 if (gimple_code (SSA_NAME_DEF_STMT (t)) != GIMPLE_PHI)

> +                   dead_worklist.safe_push (SSA_NAME_DEF_STMT (t));

> +               }

> +           }

> +       }

> +    }

> +

> +  if (dump_file)

> +    fprintf (dump_file, "threading bb %i kills %i stmts\n",

> +            bb->index, killed_stmts);

> +

> +  return killed_stmts;

> +}

> +

> +/* Estimate killed statements over all blocks in the PATH, or in E.

> +   If PATH is not empty, E must be its last entry.  */

> +

> +static int

> +estimate_threading_killed_stmts (edge e, vec<jump_thread_edge *> *path)

> +{

> +  gcc_assert (path->length () == 0 || path->last ()->e == e);

> +  if (path->length () == 0)

> +    return estimate_threading_killed_stmts (e->dest);

> +

> +  int total = 0;

> +  int i = 0;

> +  jump_thread_edge *je;

> +  FOR_EACH_VEC_ELT (*path, i, je)

> +    switch (je->type)

> +      {

> +      case EDGE_START_JUMP_THREAD:

> +      case EDGE_COPY_SRC_BLOCK:

> +      case EDGE_COPY_SRC_JOINER_BLOCK:

> +       total += estimate_threading_killed_stmts (je->e->dest);

> +       break;

> +

> +      default:

> +       gcc_unreachable ();

> +

> +      case EDGE_FSM_THREAD:

> +      case EDGE_NO_COPY_SRC_BLOCK:

> +       break;

> +      }

> +

> +  return total;

> +}

> +

>  /* Try to simplify each statement in E->dest, ultimately leading to

>     a simplification of the COND_EXPR at the end of E->dest.

>

> @@ -191,7 +358,8 @@ static gimple *

>  record_temporary_equivalences_from_stmts_at_dest (edge e,

>      const_and_copies *const_and_copies,

>      avail_exprs_stack *avail_exprs_stack,

> -    pfn_simplify simplify)

> +    pfn_simplify simplify,

> +    vec<jump_thread_edge *> *path)

>  {

>    gimple *stmt = NULL;

>    gimple_stmt_iterator gsi;

> @@ -233,7 +401,22 @@ record_temporary_equivalences_from_stmts_at_dest (edge e,

>          expansion, then do not thread through this block.  */

>        stmt_count++;

>        if (stmt_count > max_stmt_count)

> -       return NULL;

> +       {

> +         /* If any of the stmts in the PATH's dests are going to be

> +            killed due to threading, grow the max count

> +            accordingly.  */

> +         if (max_stmt_count

> +             == PARAM_VALUE (PARAM_MAX_JUMP_THREAD_DUPLICATION_STMTS))

> +           {

> +             max_stmt_count += estimate_threading_killed_stmts (e, path);

> +             if (dump_file)

> +               fprintf (dump_file, "threading bb %i up to %i stmts\n",

> +                        e->dest->index, max_stmt_count);

> +           }

> +         /* If we're still past the limit, we're done.  */

> +         if (stmt_count > max_stmt_count)

> +           return NULL;

> +       }

>

>        /* If this is not a statement that sets an SSA_NAME to a new

>          value, then do not try to simplify this statement as it will

> @@ -991,7 +1174,7 @@ thread_through_normal_block (edge e,

>    gimple *stmt

>      = record_temporary_equivalences_from_stmts_at_dest (e, const_and_copies,

>                                                         avail_exprs_stack,

> -                                                       simplify);

> +                                                       simplify, path);

>

>    /* There's two reasons STMT might be null, and distinguishing

>       between them is important.

>

> --

> Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/

> You must be the change you wish to see in the world. -- Gandhi

> Be Free! -- http://FSFLA.org/   FSF Latin America board member

> Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer
Jeff Law Dec. 11, 2017, 8 p.m. | #2
On 12/07/2017 05:04 AM, Alexandre Oliva wrote:
> We limit the amount of copying for jump threading based on counting

> stmts.  This counting is overly pessimistic, because we will very

> often delete stmts as a consequence of jump threading: when the final

> conditional jump of a block is removed, earlier SSA names computed

> exclusively for use in that conditional are killed.  Furthermore, PHI

> nodes in blocks with only two predecessors are trivially replaced with

> their now-single values after threading.

> 

> This patch scans blocks to be copied in the path constructed so far

> and estimates the number of stmts that will be removed in the copies,

> bumping up the stmt count limit.

> 

> Regstrapped on x86_64-linux-gnu and i686-linux-gnu.  Ok to install?

> 

> 

> for  gcc/ChangeLog

> 

> 	* tree-ssa-threadedge.c (uses_in_bb): New.

> 	(estimate_threading_killed_stmts): New.

> 	(estimate_threading_killed_stmts): New overload.

> 	(record_temporary_equivalences_from_stmts_at_dest): Add path

> 	parameter; adjust caller.  Expand limit when it's hit.

> 

> for  gcc/testsuite/ChangeLog

> 

> 	* gcc.dg/pr81165.c: New.

> ---

>  gcc/testsuite/gcc.dg/pr81165.c |   59 ++++++++++++

>  gcc/tree-ssa-threadedge.c      |  189 +++++++++++++++++++++++++++++++++++++++-

>  2 files changed, 245 insertions(+), 3 deletions(-)

>  create mode 100644 gcc/testsuite/gcc.dg/pr81165.c

> 

> diff --git a/gcc/testsuite/gcc.dg/pr81165.c b/gcc/testsuite/gcc.dg/pr81165.c

> new file mode 100644

> index 000000000000..8508d893bed6

> --- /dev/null

> +++ b/gcc/testsuite/gcc.dg/pr81165.c

> @@ -0,0 +1,59 @@

> +/* { dg-do compile } */

> +/* { dg-options "-O3 -fdump-tree-optimized" } */

> +/* { dg-final { scan-tree-dump-not " \[/%\] " "optimized" } } */

> +

> +/* Testcase submitted for PR81165, with its main function removed as

> +   it's turned into a compile test.  We want to make sure that all of

> +   the divide/remainder computations are removed by tree optimizers.

> +

> +   We can figure out that we don't need to compute at runtime even the

> +   condition to enter the loop: the initial i==0 would have to be

> +   greater than the sum of two small unsigned values: 1U>>t1 is in the

> +   range 0..1, whereas the char value is bounded by the range 0..127,

> +   being 128 % a positive number (zero would invoke undefined

> +   behavior, so we can assume it doesn't happen).  (We know it's

> +   nonnegative because it's 10 times a number that has no more than

> +   the bits for 16, 8 and 1 set.)

> +

> +   We don't realize that the loop is useless right away: jump

> +   threading helps remove some of the complexity, particularly of the

> +   computation within the loop: t1 is compared with 1, but it can

> +   never be 1.  (We could assume as much, since its being 1 would

> +   divide by zero, but we don't.)

> +

> +   If we don't enter the conditional block, t1 remains at 2; if we do,

> +   it's set to either -1.  If we jump thread at the end of the

> +   conditional block, we can figure out the ranges exclude 1 and the

> +   jump body is completely optimized out.  However, we used to fail to

> +   consider the block for jump threading due to the amount of

> +   computation in it, without realizing most of it would die in

> +   consequence of the threading.

> +

> +   We now take the dying code into account when deciding whether or

> +   not to try jump threading.  That might enable us to optimize the

> +   function into { if (x2 != 0 || (x1 & 1) == 0) abort (); }.  At the

> +   time of this writing, with the patch, we get close, but the test on

> +   x2 only gets as far as ((1 >> x2) == 0).  Without the patch, some

> +   of the loop remains.  */

> +

> +short x0 = 15;

> +

> +void func (){

> +  volatile int x1 = 1U;

> +  volatile char x2 = 0;

> +  char t0 = 0;

> +  unsigned long t1 = 2LU;

> +  int i = 0;

> +  

> +  if(1>>x2) {

> +    t0 = -1;

> +    t1 = (1&(short)(x1^8U))-1;

> +  }

> +

> +  while(i > (int)((1U>>t1)+(char)(128%(10*(25LU&(29%x0)))))) {

> +    i += (int)(12L/(1!=(int)t1));

> +  }

> +

> +  if (t0 != -1) __builtin_abort();

> +  if (t1 != 0L) __builtin_abort();

> +}

> diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c

> index 536c4717b725..25ccac2a3ecc 100644

> --- a/gcc/tree-ssa-threadedge.c

> +++ b/gcc/tree-ssa-threadedge.c


> +

> +/* Starting from the final control flow stmt in BB, assuming it will

> +   be removed, follow uses in to-be-removed stmts back to their defs

> +   and count how many defs are to become dead and be removed as

> +   well.  */

> +

> +static int

> +estimate_threading_killed_stmts (basic_block bb)

> +{

> +  int killed_stmts = 0;

> +  hash_map<tree, int> ssa_remaining_uses;

> +  auto_vec<gimple *, 4> dead_worklist;

> +

> +  /* If the block has only two predecessors, threading will turn phi

> +     dsts into either src, so count them as dead stmts.  */

> +  bool drop_all_phis = EDGE_COUNT (bb->preds) == 2;

> +

> +  if (drop_all_phis)

> +    for (gphi_iterator gsi = gsi_start_phis (bb);

> +	 !gsi_end_p (gsi); gsi_next (&gsi))

> +      {

> +	gphi *phi = gsi.phi ();

> +	tree dst = gimple_phi_result (phi);

> +

> +	/* We don't count virtual PHIs as stmts in

> +	   record_temporary_equivalences_from_phis.  */

> +	if (virtual_operand_p (dst))

> +	  continue;

> +

> +	killed_stmts++;

> +      }

> +

> +  if (gsi_end_p (gsi_last_bb (bb)))

> +    return killed_stmts;

> +

> +  gimple *stmt = gsi_stmt (gsi_last_bb (bb));

> +  if (gimple_code (stmt) != GIMPLE_COND

> +      && gimple_code (stmt) != GIMPLE_GOTO

> +      && gimple_code (stmt) != GIMPLE_SWITCH)

> +    return killed_stmts;

> +

> +  dead_worklist.quick_push (stmt);

> +  while (!dead_worklist.is_empty ())

> +    {

> +      stmt = dead_worklist.pop ();

> +

> +      ssa_op_iter iter;

> +      use_operand_p use_p;

> +      FOR_EACH_SSA_USE_OPERAND (use_p, stmt, iter, SSA_OP_ALL_USES)

> +	{

> +	  tree t = USE_FROM_PTR (use_p);

> +	  if (SSA_NAME_DEF_STMT (t)

> +	      && (gimple_code (SSA_NAME_DEF_STMT (t)) == GIMPLE_ASSIGN

> +		  || (gimple_code (SSA_NAME_DEF_STMT (t)) == GIMPLE_PHI

> +		      && !virtual_operand_p (t)

> +		      && !drop_all_phis))

> +	      && gimple_bb (SSA_NAME_DEF_STMT (t)) == bb)

> +	    {

> +	      int *usesp = ssa_remaining_uses.get (t);

> +	      int uses;

> +

> +	      if (usesp)

> +		uses = *usesp;

> +	      else

> +		uses = uses_in_bb (t, bb);

> +

> +	      gcc_assert (uses);

> +

> +	      /* Don't bother recording the expected use count if we

> +		 won't find any further uses within BB.  */

> +	      if (!usesp && (uses < -1 || uses > 1))

> +		{

> +		  usesp = &ssa_remaining_uses.get_or_insert (t);

> +		  *usesp = uses;

> +		}

> +

> +	      if (uses < 0)

> +		continue;

> +

> +	      --uses;

> +	      if (usesp)

> +		*usesp = uses;

> +

> +	      if (!uses)

> +		{

> +		  killed_stmts++;

> +		  if (usesp)

> +		    ssa_remaining_uses.remove (t);

> +		  if (gimple_code (SSA_NAME_DEF_STMT (t)) != GIMPLE_PHI)

> +		    dead_worklist.safe_push (SSA_NAME_DEF_STMT (t));

> +		}

> +	    }

> +	}

> +    }

> +

> +  if (dump_file)

> +    fprintf (dump_file, "threading bb %i kills %i stmts\n",

> +	     bb->index, killed_stmts);

> +

> +  return killed_stmts;

> +}

> +

> +/* Estimate killed statements over all blocks in the PATH, or in E.

> +   If PATH is not empty, E must be its last entry.  */

> +

> +static int

> +estimate_threading_killed_stmts (edge e, vec<jump_thread_edge *> *path)

> +{

> +  gcc_assert (path->length () == 0 || path->last ()->e == e);

> +  if (path->length () == 0)

> +    return estimate_threading_killed_stmts (e->dest);

> +

> +  int total = 0;

> +  int i = 0;

> +  jump_thread_edge *je;

> +  FOR_EACH_VEC_ELT (*path, i, je)

> +    switch (je->type)

> +      {

> +      case EDGE_START_JUMP_THREAD:

> +      case EDGE_COPY_SRC_BLOCK:

> +      case EDGE_COPY_SRC_JOINER_BLOCK:

> +	total += estimate_threading_killed_stmts (je->e->dest);

> +	break;

> +

> +      default:

> +	gcc_unreachable ();

> +

> +      case EDGE_FSM_THREAD:

> +      case EDGE_NO_COPY_SRC_BLOCK:

> +	break;

> +      }

> +

> +  return total;

> +}

So I'd mark je->e->src rather than je->e->dest.  And you only need this
handling for EDGE_COPY_SRC_BLOCK.  So I don't think you need a switch,
just a simple je->type == EDGE_COPY_SRC_BLOCK.

While we do make a copy for EDGE_COPY_SRC_JOINER_BLOCK, the conditional
at the end of that block is not dead, so we can't really do anything
special with it.

I think with that fix this should be OK.

jeff
Alexandre Oliva Dec. 12, 2017, 5:08 a.m. | #3
On Dec  7, 2017, Richard Biener <richard.guenther@gmail.com> wrote:

>> +      FOR_EACH_SSA_USE_OPERAND (use_p, stmt, iter, SSA_OP_ALL_USES)


> SSA_OP_USE avoids walking virtual uses


.... and we don't want to walk those because we already know writes to
memory are not going to be dead anyway, I suppose.

>> +       {

>> +         tree t = USE_FROM_PTR (use_p);

>> +         if (SSA_NAME_DEF_STMT (t)


> all SSA names have a definition statement


Oooh, it wasn't always like that!  Nice!

> also use a temporary here.


*ack*

>> +             && (gimple_code (SSA_NAME_DEF_STMT (t)) == GIMPLE_ASSIGN


> use ! gimple_has_side_effects (t), this will then also handle other

> kinds of stmts.


Hmm...  Yeah, I guess we can indeed check for and regard as dead even
stmts that the record equivalence code won't deal with.  Thanks, fixed.

> I wonder if for the handling of a larger threading path it wouldn't be useful to

> keep ssa_remaining_uses live and populated across processing of the threaded

> conditionals?  OTOH it would probably require to track uses on the path vs.

> not on the path.


Possibly.  Not sure.  I wondered about that myself.


>    PARAM_MAX_JUMP_THREAD_DUPLICATION_STMTS specifies the maximum number

>    of statements and PHI nodes allowed in a block which is going to

>    be duplicated for thread jumping purposes.


> should be adjusted though?


I guess not, we still abide by that, roughly, we just (as-if) remove
dead stmts and phi nodes from the block before counting them.

> Can you check what effect this patch has on code size of cc1?


   text    data     bss     dec     hex filename
27300976          54440 1424480 28779896        1b72578 unpatched/stage2-gcc/cc1
27308728          54440 1424480 28787648        1b743c0 stage2-gcc/cc1

unpatched/, above, means its stage1 gcc was unpatched, whereas the other
(patched) stage1 gcc was patched with the updated patch I'm about to
post in response to Jeff; both stage2 compilers (x86_64-linux-gnu) were
built out of the same (unpatched) sources, so all the differences are
caused by the stage1 compiler alone.

Here's a list comparing the text sizes of all of the ELF executables in
the trees, sorted by absolute growth (unchanged entries are omitted)

240918 was 241126 delta -208 var -0.086262% ./stage2-gcc/build/genextract
306578 was 306626 delta -48 var -0.0156542% ./stage2-gcc/build/genrecog
90064 was 90112 delta -48 var -0.053267% ./stage1-x86_64-pc-linux-gnu/libgcc/libgcc_s.so.1
22315 was 22283 delta 32 var 0.143607% ./stage2-gcc/gcc-nm
22315 was 22283 delta 32 var 0.143607% ./stage2-gcc/gcc-ranlib
22347 was 22315 delta 32 var 0.143401% ./stage2-gcc/gcc-ar
330904 was 330864 delta 40 var 0.0120896% ./stage2-gcc/build/genautomata
173470 was 173406 delta 64 var 0.0369076% ./stage1-x86_64-pc-linux-gnu/libgomp/.libs/libgomp.so.1.0.0
184937 was 184865 delta 72 var 0.0389473% ./stage1-x86_64-pc-linux-gnu/32/libgomp/.libs/libgomp.so.1.0.0
1883670 was 1883558 delta 112 var 0.00594619% ./stage2-gcc/gnatbind
772045 was 771933 delta 112 var 0.014509% ./stage2-gcc/gcov
620621 was 620493 delta 128 var 0.0206288% ./stage2-gcc/gcov-dump
643293 was 643165 delta 128 var 0.0199016% ./stage2-gcc/gcov-tool
704480 was 704336 delta 144 var 0.0204448% ./stage2-gcc/collect2
503850 was 503690 delta 160 var 0.0317656% ./stage2-gcc/build/genmatch
73391 was 73215 delta 176 var 0.240388% ./stage2-gcc/liblto_plugin.so.0.0.0
73391 was 73215 delta 176 var 0.240388% ./stage2-lto-plugin/.libs/liblto_plugin.so.0.0.0
1051788 was 1051500 delta 288 var 0.0273894% ./stage2-gcc/lto-wrapper   
179513 was 179113 delta 400 var 0.223323% ./stage2-gcc/build/gengtype   
179513 was 179113 delta 400 var 0.223323% ./stage2-gcc/gengtype
1519561 was 1519053 delta 508 var 0.0334419% ./stage1-x86_64-pc-linux-gnu/32/libstdc++-v3/src/.libs/libstdc++.so.6.0.25
1515402 was 1514758 delta 644 var 0.042515% ./stage1-x86_64-pc-linux-gnu/libstdc++-v3/src/.libs/libstdc++.so.6.0.25
105197 was 104429 delta 768 var 0.735428% ./stage1-x86_64-pc-linux-gnu/32/libgcc/32/libgcc_s.so.1
1158594 was 1157690 delta 904 var 0.0780865% ./stage2-gcc/xgcc
1159570 was 1158666 delta 904 var 0.0780208% ./stage2-gcc/cpp
1161034 was 1160130 delta 904 var 0.0779223% ./stage2-gcc/gfortran
1161042 was 1160138 delta 904 var 0.0779218% ./stage2-gcc/xg++
1160914 was 1159898 delta 1016 var 0.0875939% ./stage2-gcc/gccgo
32065817 was 32059313 delta 6504 var 0.0202874% ./stage2-gcc/gnat1
26030665 was 26023281 delta 7384 var 0.0283746% ./stage2-gcc/lto1
27990493 was 27982845 delta 7648 var 0.027331% ./stage2-gcc/f951
27652456 was 27644736 delta 7720 var 0.0279258% ./stage2-gcc/cc1obj
27308728 was 27300976 delta 7752 var 0.0283946% ./stage2-gcc/cc1
28136819 was 28128855 delta 7964 var 0.0283126% ./stage2-gcc/go1
30244024 was 30235592 delta 8432 var 0.0278877% ./stage2-gcc/cc1objplus
29892916 was 29884436 delta 8480 var 0.028376% ./stage2-gcc/cc1plus


Here are top and bottom among the object files:

185713 was 186004 delta -291 var -0.156448% ./stage2-gcc/ada/decl.o
9492 was 9701 delta -209 var -2.15442% ./stage2-gcc/build/genextract.o
176153 was 176329 delta -176 var -0.0998134% ./stage2-gcc/ada/sem_util.o
165141 was 165301 delta -160 var -0.0967931% ./stage2-gcc/expr.o
21157 was 21317 delta -160 var -0.750575% ./stage2-gcc/ada/exp_unst.o
4958 was 5106 delta -148 var -2.89855% ./stage1-x86_64-pc-linux-gnu/libgcc/subtf3.o
4958 was 5106 delta -148 var -2.89855% ./stage1-x86_64-pc-linux-gnu/libgcc/subtf3_s.o
53964 was 54060 delta -96 var -0.17758% ./stage2-gcc/tree-ssa-alias.o
5145 was 5234 delta -89 var -1.70042% ./stage1-x86_64-pc-linux-gnu/libgcc/addtf3.o
5145 was 5234 delta -89 var -1.70042% ./stage1-x86_64-pc-linux-gnu/libgcc/addtf3_s.o
[...]
27643 was 27307 delta 336 var 1.23045% ./stage2-libiberty/pic/regex.o
28951 was 28567 delta 384 var 1.34421% ./stage2-libiberty/regex.o
174597 was 174110 delta 487 var 0.279708% ./stage2-gcc/omp-low.o
8957 was 8235 delta 722 var 8.76746% ./stage1-x86_64-pc-linux-gnu/32/libgcc/addtf3.o
8957 was 8235 delta 722 var 8.76746% ./stage1-x86_64-pc-linux-gnu/32/libgcc/addtf3_s.o
176102 was 175310 delta 792 var 0.451771% ./stage2-gcc/cp/typeck.o
38384 was 37568 delta 816 var 2.17206% ./stage1-x86_64-pc-linux-gnu/libgcc/bid64_div.o
64158 was 63308 delta 850 var 1.34264% ./stage1-x86_64-pc-linux-gnu/32/libgcc/bid64_div.o
73740 was 72852 delta 888 var 1.21891% ./stage1-x86_64-pc-linux-gnu/32/libgcc/bid128_fma.o
799019 was 796467 delta 2552 var 0.320415% ./stage2-gcc/i386.o


-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer
Alexandre Oliva Dec. 12, 2017, 5:17 a.m. | #4
On Dec 11, 2017, Jeff Law <law@redhat.com> wrote:

>> +  gcc_assert (path->length () == 0 || path->last ()->e == e);

>> +  if (path->length () == 0)

>> +    return estimate_threading_killed_stmts (e->dest);

>> +

>> +  int total = 0;

>> +  int i = 0;

>> +  jump_thread_edge *je;

>> +  FOR_EACH_VEC_ELT (*path, i, je)

>> +    switch (je->type)

>> +      {

>> +      case EDGE_START_JUMP_THREAD:

>> +      case EDGE_COPY_SRC_BLOCK:

>> +      case EDGE_COPY_SRC_JOINER_BLOCK:

>> +	total += estimate_threading_killed_stmts (je->e->dest);

>> +	break;

>> +

>> +      default:

>> +	gcc_unreachable ();

>> +

>> +      case EDGE_FSM_THREAD:

>> +      case EDGE_NO_COPY_SRC_BLOCK:

>> +	break;

>> +      }

>> +

>> +  return total;

>> +}



> So I'd mark je->e->src rather than je->e->dest.


Ah, but then we have to mark e->dest regardless of path.  Which does
indeed make it all look nicer.

> And you only need this

> handling for EDGE_COPY_SRC_BLOCK.  So I don't think you need a switch,

> just a simple je->type == EDGE_COPY_SRC_BLOCK.


> While we do make a copy for EDGE_COPY_SRC_JOINER_BLOCK, the conditional

> at the end of that block is not dead, so we can't really do anything

> special with it.


I see, thanks.

I've updated it according to richi's and your feedbacks.  Regstrapped on
{x86_64,i686}-linux-gnu.  Ok to install?


We limit the amount of copying for jump threading based on counting
stmts.  This counting is overly pessimistic, because we will very
often delete stmts as a consequence of jump threading: when the final
conditional jump of a block is removed, earlier SSA names computed
exclusively for use in that conditional are killed.  Furthermore, PHI
nodes in blocks with only two predecessors are trivially replaced with
their now-single values after threading.

This patch scans blocks to be copied in the path constructed so far
and estimates the number of stmts that will be removed in the copies,
bumping up the stmt count limit.

for  gcc/ChangeLog

	PR tree-optimization/81165
	* tree-ssa-threadedge.c (uses_in_bb): New.
	(estimate_threading_killed_stmts): New.
	(estimate_threading_killed_stmts): New overload.
	(record_temporary_equivalences_from_stmts_at_dest): Add path
	parameter; adjust caller.  Expand limit when it's hit.

for  gcc/testsuite/ChangeLog

	PR tree-optimization/81165
	* gcc.dg/pr81165.c: New.
---
 gcc/testsuite/gcc.dg/pr81165.c |   59 +++++++++++++
 gcc/tree-ssa-threadedge.c      |  176 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 232 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr81165.c

diff --git a/gcc/testsuite/gcc.dg/pr81165.c b/gcc/testsuite/gcc.dg/pr81165.c
new file mode 100644
index 000000000000..8508d893bed6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr81165.c
@@ -0,0 +1,59 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized" } */
+/* { dg-final { scan-tree-dump-not " \[/%\] " "optimized" } } */
+
+/* Testcase submitted for PR81165, with its main function removed as
+   it's turned into a compile test.  We want to make sure that all of
+   the divide/remainder computations are removed by tree optimizers.
+
+   We can figure out that we don't need to compute at runtime even the
+   condition to enter the loop: the initial i==0 would have to be
+   greater than the sum of two small unsigned values: 1U>>t1 is in the
+   range 0..1, whereas the char value is bounded by the range 0..127,
+   being 128 % a positive number (zero would invoke undefined
+   behavior, so we can assume it doesn't happen).  (We know it's
+   nonnegative because it's 10 times a number that has no more than
+   the bits for 16, 8 and 1 set.)
+
+   We don't realize that the loop is useless right away: jump
+   threading helps remove some of the complexity, particularly of the
+   computation within the loop: t1 is compared with 1, but it can
+   never be 1.  (We could assume as much, since its being 1 would
+   divide by zero, but we don't.)
+
+   If we don't enter the conditional block, t1 remains at 2; if we do,
+   it's set to either -1.  If we jump thread at the end of the
+   conditional block, we can figure out the ranges exclude 1 and the
+   jump body is completely optimized out.  However, we used to fail to
+   consider the block for jump threading due to the amount of
+   computation in it, without realizing most of it would die in
+   consequence of the threading.
+
+   We now take the dying code into account when deciding whether or
+   not to try jump threading.  That might enable us to optimize the
+   function into { if (x2 != 0 || (x1 & 1) == 0) abort (); }.  At the
+   time of this writing, with the patch, we get close, but the test on
+   x2 only gets as far as ((1 >> x2) == 0).  Without the patch, some
+   of the loop remains.  */
+
+short x0 = 15;
+
+void func (){
+  volatile int x1 = 1U;
+  volatile char x2 = 0;
+  char t0 = 0;
+  unsigned long t1 = 2LU;
+  int i = 0;
+  
+  if(1>>x2) {
+    t0 = -1;
+    t1 = (1&(short)(x1^8U))-1;
+  }
+
+  while(i > (int)((1U>>t1)+(char)(128%(10*(25LU&(29%x0)))))) {
+    i += (int)(12L/(1!=(int)t1));
+  }
+
+  if (t0 != -1) __builtin_abort();
+  if (t1 != 0L) __builtin_abort();
+}
diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c
index 91793bfa59d3..0f5b943aa9a0 100644
--- a/gcc/tree-ssa-threadedge.c
+++ b/gcc/tree-ssa-threadedge.c
@@ -170,6 +170,160 @@ threadedge_valueize (tree t)
   return t;
 }
 
+/* Return how many uses of T there are within BB, as long as there
+   aren't any uses outside BB.  If there are any uses outside BB,
+   return -1 if there's at most one use within BB, or -2 if there is
+   more than one use within BB.  */
+
+static int
+uses_in_bb (tree t, basic_block bb)
+{
+  int uses = 0;
+  bool outside_bb = false;
+
+  imm_use_iterator iter;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_FAST (use_p, iter, t)
+    {
+      if (is_gimple_debug (USE_STMT (use_p)))
+	continue;
+
+      if (gimple_bb (USE_STMT (use_p)) != bb)
+	outside_bb = true;
+      else
+	uses++;
+
+      if (outside_bb && uses > 1)
+	return -2;
+    }
+
+  if (outside_bb)
+    return -1;
+
+  return uses;
+}
+
+/* Starting from the final control flow stmt in BB, assuming it will
+   be removed, follow uses in to-be-removed stmts back to their defs
+   and count how many defs are to become dead and be removed as
+   well.  */
+
+static int
+estimate_threading_killed_stmts (basic_block bb)
+{
+  int killed_stmts = 0;
+  hash_map<tree, int> ssa_remaining_uses;
+  auto_vec<gimple *, 4> dead_worklist;
+
+  /* If the block has only two predecessors, threading will turn phi
+     dsts into either src, so count them as dead stmts.  */
+  bool drop_all_phis = EDGE_COUNT (bb->preds) == 2;
+
+  if (drop_all_phis)
+    for (gphi_iterator gsi = gsi_start_phis (bb);
+	 !gsi_end_p (gsi); gsi_next (&gsi))
+      {
+	gphi *phi = gsi.phi ();
+	tree dst = gimple_phi_result (phi);
+
+	/* We don't count virtual PHIs as stmts in
+	   record_temporary_equivalences_from_phis.  */
+	if (virtual_operand_p (dst))
+	  continue;
+
+	killed_stmts++;
+      }
+
+  if (gsi_end_p (gsi_last_bb (bb)))
+    return killed_stmts;
+
+  gimple *stmt = gsi_stmt (gsi_last_bb (bb));
+  if (gimple_code (stmt) != GIMPLE_COND
+      && gimple_code (stmt) != GIMPLE_GOTO
+      && gimple_code (stmt) != GIMPLE_SWITCH)
+    return killed_stmts;
+
+  dead_worklist.quick_push (stmt);
+  while (!dead_worklist.is_empty ())
+    {
+      stmt = dead_worklist.pop ();
+
+      ssa_op_iter iter;
+      use_operand_p use_p;
+      FOR_EACH_SSA_USE_OPERAND (use_p, stmt, iter, SSA_OP_USE)
+	{
+	  tree t = USE_FROM_PTR (use_p);
+	  gimple *def = SSA_NAME_DEF_STMT (t);
+
+	  if (gimple_bb (def) == bb
+	      && (gimple_code (def) != GIMPLE_PHI
+		  || !drop_all_phis)
+	      && !gimple_has_side_effects (def))
+	    {
+	      int *usesp = ssa_remaining_uses.get (t);
+	      int uses;
+
+	      if (usesp)
+		uses = *usesp;
+	      else
+		uses = uses_in_bb (t, bb);
+
+	      gcc_assert (uses);
+
+	      /* Don't bother recording the expected use count if we
+		 won't find any further uses within BB.  */
+	      if (!usesp && (uses < -1 || uses > 1))
+		{
+		  usesp = &ssa_remaining_uses.get_or_insert (t);
+		  *usesp = uses;
+		}
+
+	      if (uses < 0)
+		continue;
+
+	      --uses;
+	      if (usesp)
+		*usesp = uses;
+
+	      if (!uses)
+		{
+		  killed_stmts++;
+		  if (usesp)
+		    ssa_remaining_uses.remove (t);
+		  if (gimple_code (def) != GIMPLE_PHI)
+		    dead_worklist.safe_push (def);
+		}
+	    }
+	}
+    }
+
+  if (dump_file)
+    fprintf (dump_file, "threading bb %i kills %i stmts\n",
+	     bb->index, killed_stmts);
+
+  return killed_stmts;
+}
+
+/* Estimate, over all src blocks to be copied in PATH and E's dest,
+   the number of stmts that become dead as a consequence of removing
+   their final control flow stmts.  If PATH is not empty, E must be
+   its last entry.  */
+
+static int
+estimate_threading_killed_stmts (edge e, vec<jump_thread_edge *> *path)
+{
+  gcc_assert (path->length () == 0 || path->last ()->e == e);
+  int total = estimate_threading_killed_stmts (e->dest);
+
+  int i;
+  jump_thread_edge *je;
+  FOR_EACH_VEC_ELT_REVERSE (*path, i, je)
+    if (je->type == EDGE_COPY_SRC_BLOCK)
+      total += estimate_threading_killed_stmts (je->e->src);
+
+  return total;
+}
+
 /* Try to simplify each statement in E->dest, ultimately leading to
    a simplification of the COND_EXPR at the end of E->dest.
 
@@ -191,7 +345,8 @@ static gimple *
 record_temporary_equivalences_from_stmts_at_dest (edge e,
     const_and_copies *const_and_copies,
     avail_exprs_stack *avail_exprs_stack,
-    pfn_simplify simplify)
+    pfn_simplify simplify,
+    vec<jump_thread_edge *> *path)
 {
   gimple *stmt = NULL;
   gimple_stmt_iterator gsi;
@@ -233,7 +388,22 @@ record_temporary_equivalences_from_stmts_at_dest (edge e,
 	 expansion, then do not thread through this block.  */
       stmt_count++;
       if (stmt_count > max_stmt_count)
-	return NULL;
+	{
+	  /* If any of the stmts in the PATH's dests are going to be
+	     killed due to threading, grow the max count
+	     accordingly.  */
+	  if (max_stmt_count
+	      == PARAM_VALUE (PARAM_MAX_JUMP_THREAD_DUPLICATION_STMTS))
+	    {
+	      max_stmt_count += estimate_threading_killed_stmts (e, path);
+	      if (dump_file)
+		fprintf (dump_file, "threading bb %i up to %i stmts\n",
+			 e->dest->index, max_stmt_count);
+	    }
+	  /* If we're still past the limit, we're done.  */
+	  if (stmt_count > max_stmt_count)
+	    return NULL;
+	}
 
       /* If this is not a statement that sets an SSA_NAME to a new
 	 value, then do not try to simplify this statement as it will
@@ -1000,7 +1170,7 @@ thread_through_normal_block (edge e,
   gimple *stmt
     = record_temporary_equivalences_from_stmts_at_dest (e, const_and_copies,
 							avail_exprs_stack,
-							simplify);
+							simplify, path);
 
   /* There's two reasons STMT might be null, and distinguishing
      between them is important.


-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer
Richard Biener Dec. 12, 2017, 8:27 a.m. | #5
On Tue, Dec 12, 2017 at 6:17 AM, Alexandre Oliva <aoliva@redhat.com> wrote:
> On Dec 11, 2017, Jeff Law <law@redhat.com> wrote:

>

>>> +  gcc_assert (path->length () == 0 || path->last ()->e == e);

>>> +  if (path->length () == 0)

>>> +    return estimate_threading_killed_stmts (e->dest);

>>> +

>>> +  int total = 0;

>>> +  int i = 0;

>>> +  jump_thread_edge *je;

>>> +  FOR_EACH_VEC_ELT (*path, i, je)

>>> +    switch (je->type)

>>> +      {

>>> +      case EDGE_START_JUMP_THREAD:

>>> +      case EDGE_COPY_SRC_BLOCK:

>>> +      case EDGE_COPY_SRC_JOINER_BLOCK:

>>> +    total += estimate_threading_killed_stmts (je->e->dest);

>>> +    break;

>>> +

>>> +      default:

>>> +    gcc_unreachable ();

>>> +

>>> +      case EDGE_FSM_THREAD:

>>> +      case EDGE_NO_COPY_SRC_BLOCK:

>>> +    break;

>>> +      }

>>> +

>>> +  return total;

>>> +}

>

>

>> So I'd mark je->e->src rather than je->e->dest.

>

> Ah, but then we have to mark e->dest regardless of path.  Which does

> indeed make it all look nicer.

>

>> And you only need this

>> handling for EDGE_COPY_SRC_BLOCK.  So I don't think you need a switch,

>> just a simple je->type == EDGE_COPY_SRC_BLOCK.

>

>> While we do make a copy for EDGE_COPY_SRC_JOINER_BLOCK, the conditional

>> at the end of that block is not dead, so we can't really do anything

>> special with it.

>

> I see, thanks.

>

> I've updated it according to richi's and your feedbacks.  Regstrapped on

> {x86_64,i686}-linux-gnu.  Ok to install?


Thanks for the numbers on size, OK for the parts I reviewed, leaving the
final ack to Jeff.

Richard.

>

> We limit the amount of copying for jump threading based on counting

> stmts.  This counting is overly pessimistic, because we will very

> often delete stmts as a consequence of jump threading: when the final

> conditional jump of a block is removed, earlier SSA names computed

> exclusively for use in that conditional are killed.  Furthermore, PHI

> nodes in blocks with only two predecessors are trivially replaced with

> their now-single values after threading.

>

> This patch scans blocks to be copied in the path constructed so far

> and estimates the number of stmts that will be removed in the copies,

> bumping up the stmt count limit.

>

> for  gcc/ChangeLog

>

>         PR tree-optimization/81165

>         * tree-ssa-threadedge.c (uses_in_bb): New.

>         (estimate_threading_killed_stmts): New.

>         (estimate_threading_killed_stmts): New overload.

>         (record_temporary_equivalences_from_stmts_at_dest): Add path

>         parameter; adjust caller.  Expand limit when it's hit.

>

> for  gcc/testsuite/ChangeLog

>

>         PR tree-optimization/81165

>         * gcc.dg/pr81165.c: New.

> ---

>  gcc/testsuite/gcc.dg/pr81165.c |   59 +++++++++++++

>  gcc/tree-ssa-threadedge.c      |  176 +++++++++++++++++++++++++++++++++++++++-

>  2 files changed, 232 insertions(+), 3 deletions(-)

>  create mode 100644 gcc/testsuite/gcc.dg/pr81165.c

>

> diff --git a/gcc/testsuite/gcc.dg/pr81165.c b/gcc/testsuite/gcc.dg/pr81165.c

> new file mode 100644

> index 000000000000..8508d893bed6

> --- /dev/null

> +++ b/gcc/testsuite/gcc.dg/pr81165.c

> @@ -0,0 +1,59 @@

> +/* { dg-do compile } */

> +/* { dg-options "-O3 -fdump-tree-optimized" } */

> +/* { dg-final { scan-tree-dump-not " \[/%\] " "optimized" } } */

> +

> +/* Testcase submitted for PR81165, with its main function removed as

> +   it's turned into a compile test.  We want to make sure that all of

> +   the divide/remainder computations are removed by tree optimizers.

> +

> +   We can figure out that we don't need to compute at runtime even the

> +   condition to enter the loop: the initial i==0 would have to be

> +   greater than the sum of two small unsigned values: 1U>>t1 is in the

> +   range 0..1, whereas the char value is bounded by the range 0..127,

> +   being 128 % a positive number (zero would invoke undefined

> +   behavior, so we can assume it doesn't happen).  (We know it's

> +   nonnegative because it's 10 times a number that has no more than

> +   the bits for 16, 8 and 1 set.)

> +

> +   We don't realize that the loop is useless right away: jump

> +   threading helps remove some of the complexity, particularly of the

> +   computation within the loop: t1 is compared with 1, but it can

> +   never be 1.  (We could assume as much, since its being 1 would

> +   divide by zero, but we don't.)

> +

> +   If we don't enter the conditional block, t1 remains at 2; if we do,

> +   it's set to either -1.  If we jump thread at the end of the

> +   conditional block, we can figure out the ranges exclude 1 and the

> +   jump body is completely optimized out.  However, we used to fail to

> +   consider the block for jump threading due to the amount of

> +   computation in it, without realizing most of it would die in

> +   consequence of the threading.

> +

> +   We now take the dying code into account when deciding whether or

> +   not to try jump threading.  That might enable us to optimize the

> +   function into { if (x2 != 0 || (x1 & 1) == 0) abort (); }.  At the

> +   time of this writing, with the patch, we get close, but the test on

> +   x2 only gets as far as ((1 >> x2) == 0).  Without the patch, some

> +   of the loop remains.  */

> +

> +short x0 = 15;

> +

> +void func (){

> +  volatile int x1 = 1U;

> +  volatile char x2 = 0;

> +  char t0 = 0;

> +  unsigned long t1 = 2LU;

> +  int i = 0;

> +

> +  if(1>>x2) {

> +    t0 = -1;

> +    t1 = (1&(short)(x1^8U))-1;

> +  }

> +

> +  while(i > (int)((1U>>t1)+(char)(128%(10*(25LU&(29%x0)))))) {

> +    i += (int)(12L/(1!=(int)t1));

> +  }

> +

> +  if (t0 != -1) __builtin_abort();

> +  if (t1 != 0L) __builtin_abort();

> +}

> diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c

> index 91793bfa59d3..0f5b943aa9a0 100644

> --- a/gcc/tree-ssa-threadedge.c

> +++ b/gcc/tree-ssa-threadedge.c

> @@ -170,6 +170,160 @@ threadedge_valueize (tree t)

>    return t;

>  }

>

> +/* Return how many uses of T there are within BB, as long as there

> +   aren't any uses outside BB.  If there are any uses outside BB,

> +   return -1 if there's at most one use within BB, or -2 if there is

> +   more than one use within BB.  */

> +

> +static int

> +uses_in_bb (tree t, basic_block bb)

> +{

> +  int uses = 0;

> +  bool outside_bb = false;

> +

> +  imm_use_iterator iter;

> +  use_operand_p use_p;

> +  FOR_EACH_IMM_USE_FAST (use_p, iter, t)

> +    {

> +      if (is_gimple_debug (USE_STMT (use_p)))

> +       continue;

> +

> +      if (gimple_bb (USE_STMT (use_p)) != bb)

> +       outside_bb = true;

> +      else

> +       uses++;

> +

> +      if (outside_bb && uses > 1)

> +       return -2;

> +    }

> +

> +  if (outside_bb)

> +    return -1;

> +

> +  return uses;

> +}

> +

> +/* Starting from the final control flow stmt in BB, assuming it will

> +   be removed, follow uses in to-be-removed stmts back to their defs

> +   and count how many defs are to become dead and be removed as

> +   well.  */

> +

> +static int

> +estimate_threading_killed_stmts (basic_block bb)

> +{

> +  int killed_stmts = 0;

> +  hash_map<tree, int> ssa_remaining_uses;

> +  auto_vec<gimple *, 4> dead_worklist;

> +

> +  /* If the block has only two predecessors, threading will turn phi

> +     dsts into either src, so count them as dead stmts.  */

> +  bool drop_all_phis = EDGE_COUNT (bb->preds) == 2;

> +

> +  if (drop_all_phis)

> +    for (gphi_iterator gsi = gsi_start_phis (bb);

> +        !gsi_end_p (gsi); gsi_next (&gsi))

> +      {

> +       gphi *phi = gsi.phi ();

> +       tree dst = gimple_phi_result (phi);

> +

> +       /* We don't count virtual PHIs as stmts in

> +          record_temporary_equivalences_from_phis.  */

> +       if (virtual_operand_p (dst))

> +         continue;

> +

> +       killed_stmts++;

> +      }

> +

> +  if (gsi_end_p (gsi_last_bb (bb)))

> +    return killed_stmts;

> +

> +  gimple *stmt = gsi_stmt (gsi_last_bb (bb));

> +  if (gimple_code (stmt) != GIMPLE_COND

> +      && gimple_code (stmt) != GIMPLE_GOTO

> +      && gimple_code (stmt) != GIMPLE_SWITCH)

> +    return killed_stmts;

> +

> +  dead_worklist.quick_push (stmt);

> +  while (!dead_worklist.is_empty ())

> +    {

> +      stmt = dead_worklist.pop ();

> +

> +      ssa_op_iter iter;

> +      use_operand_p use_p;

> +      FOR_EACH_SSA_USE_OPERAND (use_p, stmt, iter, SSA_OP_USE)

> +       {

> +         tree t = USE_FROM_PTR (use_p);

> +         gimple *def = SSA_NAME_DEF_STMT (t);

> +

> +         if (gimple_bb (def) == bb

> +             && (gimple_code (def) != GIMPLE_PHI

> +                 || !drop_all_phis)

> +             && !gimple_has_side_effects (def))

> +           {

> +             int *usesp = ssa_remaining_uses.get (t);

> +             int uses;

> +

> +             if (usesp)

> +               uses = *usesp;

> +             else

> +               uses = uses_in_bb (t, bb);

> +

> +             gcc_assert (uses);

> +

> +             /* Don't bother recording the expected use count if we

> +                won't find any further uses within BB.  */

> +             if (!usesp && (uses < -1 || uses > 1))

> +               {

> +                 usesp = &ssa_remaining_uses.get_or_insert (t);

> +                 *usesp = uses;

> +               }

> +

> +             if (uses < 0)

> +               continue;

> +

> +             --uses;

> +             if (usesp)

> +               *usesp = uses;

> +

> +             if (!uses)

> +               {

> +                 killed_stmts++;

> +                 if (usesp)

> +                   ssa_remaining_uses.remove (t);

> +                 if (gimple_code (def) != GIMPLE_PHI)

> +                   dead_worklist.safe_push (def);

> +               }

> +           }

> +       }

> +    }

> +

> +  if (dump_file)

> +    fprintf (dump_file, "threading bb %i kills %i stmts\n",

> +            bb->index, killed_stmts);

> +

> +  return killed_stmts;

> +}

> +

> +/* Estimate, over all src blocks to be copied in PATH and E's dest,

> +   the number of stmts that become dead as a consequence of removing

> +   their final control flow stmts.  If PATH is not empty, E must be

> +   its last entry.  */

> +

> +static int

> +estimate_threading_killed_stmts (edge e, vec<jump_thread_edge *> *path)

> +{

> +  gcc_assert (path->length () == 0 || path->last ()->e == e);

> +  int total = estimate_threading_killed_stmts (e->dest);

> +

> +  int i;

> +  jump_thread_edge *je;

> +  FOR_EACH_VEC_ELT_REVERSE (*path, i, je)

> +    if (je->type == EDGE_COPY_SRC_BLOCK)

> +      total += estimate_threading_killed_stmts (je->e->src);

> +

> +  return total;

> +}

> +

>  /* Try to simplify each statement in E->dest, ultimately leading to

>     a simplification of the COND_EXPR at the end of E->dest.

>

> @@ -191,7 +345,8 @@ static gimple *

>  record_temporary_equivalences_from_stmts_at_dest (edge e,

>      const_and_copies *const_and_copies,

>      avail_exprs_stack *avail_exprs_stack,

> -    pfn_simplify simplify)

> +    pfn_simplify simplify,

> +    vec<jump_thread_edge *> *path)

>  {

>    gimple *stmt = NULL;

>    gimple_stmt_iterator gsi;

> @@ -233,7 +388,22 @@ record_temporary_equivalences_from_stmts_at_dest (edge e,

>          expansion, then do not thread through this block.  */

>        stmt_count++;

>        if (stmt_count > max_stmt_count)

> -       return NULL;

> +       {

> +         /* If any of the stmts in the PATH's dests are going to be

> +            killed due to threading, grow the max count

> +            accordingly.  */

> +         if (max_stmt_count

> +             == PARAM_VALUE (PARAM_MAX_JUMP_THREAD_DUPLICATION_STMTS))

> +           {

> +             max_stmt_count += estimate_threading_killed_stmts (e, path);

> +             if (dump_file)

> +               fprintf (dump_file, "threading bb %i up to %i stmts\n",

> +                        e->dest->index, max_stmt_count);

> +           }

> +         /* If we're still past the limit, we're done.  */

> +         if (stmt_count > max_stmt_count)

> +           return NULL;

> +       }

>

>        /* If this is not a statement that sets an SSA_NAME to a new

>          value, then do not try to simplify this statement as it will

> @@ -1000,7 +1170,7 @@ thread_through_normal_block (edge e,

>    gimple *stmt

>      = record_temporary_equivalences_from_stmts_at_dest (e, const_and_copies,

>                                                         avail_exprs_stack,

> -                                                       simplify);

> +                                                       simplify, path);

>

>    /* There's two reasons STMT might be null, and distinguishing

>       between them is important.

>

>

> --

> Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/

> You must be the change you wish to see in the world. -- Gandhi

> Be Free! -- http://FSFLA.org/   FSF Latin America board member

> Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer
Jeff Law Dec. 12, 2017, 4:35 p.m. | #6
On 12/11/2017 10:08 PM, Alexandre Oliva wrote:
> 

>> I wonder if for the handling of a larger threading path it wouldn't be useful to

>> keep ssa_remaining_uses live and populated across processing of the threaded

>> conditionals?  OTOH it would probably require to track uses on the path vs.

>> not on the path.

> 

> Possibly.  Not sure.  I wondered about that myself.

So I finally found the other BZ in this space that I was looking for.
36550.

In 36550 we're getting a false positive warning at -Os because we
restrict jump threading to cases where the copied block has no real
statements other than the control statement at the end (which we're
going to eliminate).

As it turns out in the 36550 case the key block has real statements, but
they're all going to be dead after threading.  So naturally Alex's code
is of interest here.

I wonder if we could attach the result of Alex's analysis bits to the
thread path, then query it later.  That would allow us to solve 36550
without re-analyzing the path.  See class tree-ssa-threadupdate.h's
definition of jump_thread_edge -- that's where I want to hang the result
of the analysis.

Alex want to take a looksie and see if you can attach the result to the
path, then query it in tree-ssa-threadupdate.c at this point:

 /* If optimizing for size, only thread through block if we don't have
     to duplicate it or it's an otherwise empty redirection block.  */
  if (optimize_function_for_size_p (cfun))
    {
      EXECUTE_IF_SET_IN_BITMAP (tmp, 0, i, bi)
        {
          bb = BASIC_BLOCK_FOR_FN (cfun, i);
          if (EDGE_COUNT (bb->preds) > 1
              && !redirection_block_p (bb))
            {
              FOR_EACH_EDGE (e, ei, bb->preds)
                {
                  if (e->aux)
                    {
                      vec<jump_thread_edge *> *path = THREAD_PATH (e);
                      delete_jump_thread_path (path);
                      e->aux = NULL;
                    }
                }
            }
          else
            bitmap_set_bit (threaded_blocks, i);


There may be other BZs of a similar nature -- there's certainly others
were we get spurious -Wuninitialized warnings with -Os where threading
got throttled.  But I haven't verified if all the statements in the
copied block were dead on those other BZs.




> 

> 

>>    PARAM_MAX_JUMP_THREAD_DUPLICATION_STMTS specifies the maximum number

>>    of statements and PHI nodes allowed in a block which is going to

>>    be duplicated for thread jumping purposes.

> 

>> should be adjusted though?

> 

> I guess not, we still abide by that, roughly, we just (as-if) remove

> dead stmts and phi nodes from the block before counting them.

Well, consider a block with 16 statements, 15 of which were dead.
Previously we wouldn't thread through that block.  Now we would thread
resulting in a tiny amount of codesize increase as there's a statement
to copy that would survive DCE.

But the effect should be very small and I think your testing results so
that to be true.

Jeff
Jeff Law Dec. 15, 2017, 6:29 p.m. | #7
On 12/11/2017 10:17 PM, Alexandre Oliva wrote:
> On Dec 11, 2017, Jeff Law <law@redhat.com> wrote:

> 

> I've updated it according to richi's and your feedbacks.  Regstrapped on

> {x86_64,i686}-linux-gnu.  Ok to install?

> 

> 

> We limit the amount of copying for jump threading based on counting

> stmts.  This counting is overly pessimistic, because we will very

> often delete stmts as a consequence of jump threading: when the final

> conditional jump of a block is removed, earlier SSA names computed

> exclusively for use in that conditional are killed.  Furthermore, PHI

> nodes in blocks with only two predecessors are trivially replaced with

> their now-single values after threading.

> 

> This patch scans blocks to be copied in the path constructed so far

> and estimates the number of stmts that will be removed in the copies,

> bumping up the stmt count limit.

> 

> for  gcc/ChangeLog

> 

> 	PR tree-optimization/81165

> 	* tree-ssa-threadedge.c (uses_in_bb): New.

> 	(estimate_threading_killed_stmts): New.

> 	(estimate_threading_killed_stmts): New overload.

> 	(record_temporary_equivalences_from_stmts_at_dest): Add path

> 	parameter; adjust caller.  Expand limit when it's hit.

> 

> for  gcc/testsuite/ChangeLog

> 

> 	PR tree-optimization/81165

> 	* gcc.dg/pr81165.c: New.



> diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c

> index 91793bfa59d3..0f5b943aa9a0 100644

> --- a/gcc/tree-ssa-threadedge.c

> +++ b/gcc/tree-ssa-threadedge.c

> @@ -170,6 +170,160 @@ threadedge_valueize (tree t)

>    return t;

>  }

>  

> +/* Return how many uses of T there are within BB, as long as there

> +   aren't any uses outside BB.  If there are any uses outside BB,

> +   return -1 if there's at most one use within BB, or -2 if there is

> +   more than one use within BB.  */

> +

> +static int

> +uses_in_bb (tree t, basic_block bb)

> +{

> +  int uses = 0;

> +  bool outside_bb = false;

> +

> +  imm_use_iterator iter;

> +  use_operand_p use_p;

> +  FOR_EACH_IMM_USE_FAST (use_p, iter, t)

> +    {

> +      if (is_gimple_debug (USE_STMT (use_p)))

> +	continue;

> +

> +      if (gimple_bb (USE_STMT (use_p)) != bb)

> +	outside_bb = true;

> +      else

> +	uses++;

> +

> +      if (outside_bb && uses > 1)

> +	return -2;

> +    }

> +

> +  if (outside_bb)

> +    return -1;

> +

> +  return uses;

> +}

> +

> +/* Starting from the final control flow stmt in BB, assuming it will

> +   be removed, follow uses in to-be-removed stmts back to their defs

> +   and count how many defs are to become dead and be removed as

> +   well.  */

> +

> +static int

> +estimate_threading_killed_stmts (basic_block bb)

> +{

> +  int killed_stmts = 0;

> +  hash_map<tree, int> ssa_remaining_uses;

> +  auto_vec<gimple *, 4> dead_worklist;

> +

> +  /* If the block has only two predecessors, threading will turn phi

> +     dsts into either src, so count them as dead stmts.  */

> +  bool drop_all_phis = EDGE_COUNT (bb->preds) == 2;

> +

> +  if (drop_all_phis)

> +    for (gphi_iterator gsi = gsi_start_phis (bb);

> +	 !gsi_end_p (gsi); gsi_next (&gsi))

> +      {

> +	gphi *phi = gsi.phi ();

> +	tree dst = gimple_phi_result (phi);

> +

> +	/* We don't count virtual PHIs as stmts in

> +	   record_temporary_equivalences_from_phis.  */

> +	if (virtual_operand_p (dst))

> +	  continue;

> +

> +	killed_stmts++;

> +      }

> +

> +  if (gsi_end_p (gsi_last_bb (bb)))

> +    return killed_stmts;

> +

> +  gimple *stmt = gsi_stmt (gsi_last_bb (bb));

> +  if (gimple_code (stmt) != GIMPLE_COND

> +      && gimple_code (stmt) != GIMPLE_GOTO

> +      && gimple_code (stmt) != GIMPLE_SWITCH)

> +    return killed_stmts;

> +

> +  dead_worklist.quick_push (stmt);

> +  while (!dead_worklist.is_empty ())

> +    {

> +      stmt = dead_worklist.pop ();

> +

> +      ssa_op_iter iter;

> +      use_operand_p use_p;

> +      FOR_EACH_SSA_USE_OPERAND (use_p, stmt, iter, SSA_OP_USE)

> +	{

> +	  tree t = USE_FROM_PTR (use_p);

> +	  gimple *def = SSA_NAME_DEF_STMT (t);

> +

> +	  if (gimple_bb (def) == bb

> +	      && (gimple_code (def) != GIMPLE_PHI

> +		  || !drop_all_phis)

> +	      && !gimple_has_side_effects (def))

> +	    {

> +	      int *usesp = ssa_remaining_uses.get (t);

> +	      int uses;

> +

> +	      if (usesp)

> +		uses = *usesp;

> +	      else

> +		uses = uses_in_bb (t, bb);

> +

> +	      gcc_assert (uses);

> +

> +	      /* Don't bother recording the expected use count if we

> +		 won't find any further uses within BB.  */

> +	      if (!usesp && (uses < -1 || uses > 1))

> +		{

> +		  usesp = &ssa_remaining_uses.get_or_insert (t);

> +		  *usesp = uses;

> +		}

> +

> +	      if (uses < 0)

> +		continue;

> +

> +	      --uses;

> +	      if (usesp)

> +		*usesp = uses;

> +

> +	      if (!uses)

> +		{

> +		  killed_stmts++;

> +		  if (usesp)

> +		    ssa_remaining_uses.remove (t);

> +		  if (gimple_code (def) != GIMPLE_PHI)

> +		    dead_worklist.safe_push (def);

> +		}

> +	    }

> +	}

> +    }

> +

> +  if (dump_file)

> +    fprintf (dump_file, "threading bb %i kills %i stmts\n",

> +	     bb->index, killed_stmts);

> +

> +  return killed_stmts;

So it appears the loop counts a statement as killed when its uses are no
longer needed.  AFAICT we fail to count the conditional as killed.
That's trivially fixed.

I also think we want to use this elsewhere.  I think putting it in
tree-ssa-threadupdate.c and exporting from tree-ssa-threadupdate.h
should work.







> +}

> +

> +/* Estimate, over all src blocks to be copied in PATH and E's dest,

> +   the number of stmts that become dead as a consequence of removing

> +   their final control flow stmts.  If PATH is not empty, E must be

> +   its last entry.  */

> +

> +static int

> +estimate_threading_killed_stmts (edge e, vec<jump_thread_edge *> *path)

> +{

> +  gcc_assert (path->length () == 0 || path->last ()->e == e);

> +  int total = estimate_threading_killed_stmts (e->dest);

> +

> +  int i;

> +  jump_thread_edge *je;

> +  FOR_EACH_VEC_ELT_REVERSE (*path, i, je)

> +    if (je->type == EDGE_COPY_SRC_BLOCK)

> +      total += estimate_threading_killed_stmts (je->e->src);

> +

> +  return total;

> +}

I don't think we need this function at all.

At the point where you're calling it, we're at the one and only block
within the jump threading path where we're copying a block where
statements are going to be killed.

So I think you should just remove this function and call the other one
(that works on just a basic block).  THat also means we don't have to
pass the path around.

I've got those minor tweaks in a tree here along with the changes to use
it to fix 36550 as well.  Let me run them through some deeper testing.


jeff
Jeff Law Dec. 15, 2017, 10:14 p.m. | #8
On 12/11/2017 10:17 PM, Alexandre Oliva wrote:
> On Dec 11, 2017, Jeff Law <law@redhat.com> wrote:

> 

>>> +  gcc_assert (path->length () == 0 || path->last ()->e == e);

>>> +  if (path->length () == 0)

>>> +    return estimate_threading_killed_stmts (e->dest);

>>> +

>>> +  int total = 0;

>>> +  int i = 0;

>>> +  jump_thread_edge *je;

>>> +  FOR_EACH_VEC_ELT (*path, i, je)

>>> +    switch (je->type)

>>> +      {

>>> +      case EDGE_START_JUMP_THREAD:

>>> +      case EDGE_COPY_SRC_BLOCK:

>>> +      case EDGE_COPY_SRC_JOINER_BLOCK:

>>> +	total += estimate_threading_killed_stmts (je->e->dest);

>>> +	break;

>>> +

>>> +      default:

>>> +	gcc_unreachable ();

>>> +

>>> +      case EDGE_FSM_THREAD:

>>> +      case EDGE_NO_COPY_SRC_BLOCK:

>>> +	break;

>>> +      }

>>> +

>>> +  return total;

>>> +}

> 

>> So I'd mark je->e->src rather than je->e->dest.

> Ah, but then we have to mark e->dest regardless of path.  Which does

> indeed make it all look nicer.

> 

>> And you only need this

>> handling for EDGE_COPY_SRC_BLOCK.  So I don't think you need a switch,

>> just a simple je->type == EDGE_COPY_SRC_BLOCK.

>> While we do make a copy for EDGE_COPY_SRC_JOINER_BLOCK, the conditional

>> at the end of that block is not dead, so we can't really do anything

>> special with it.

> I see, thanks.

> 

> I've updated it according to richi's and your feedbacks.  Regstrapped on

> {x86_64,i686}-linux-gnu.  Ok to install?

> 

> 

> We limit the amount of copying for jump threading based on counting

> stmts.  This counting is overly pessimistic, because we will very

> often delete stmts as a consequence of jump threading: when the final

> conditional jump of a block is removed, earlier SSA names computed

> exclusively for use in that conditional are killed.  Furthermore, PHI

> nodes in blocks with only two predecessors are trivially replaced with

> their now-single values after threading.

> 

> This patch scans blocks to be copied in the path constructed so far

> and estimates the number of stmts that will be removed in the copies,

> bumping up the stmt count limit.

> 

> for  gcc/ChangeLog

> 

> 	PR tree-optimization/81165

> 	* tree-ssa-threadedge.c (uses_in_bb): New.

> 	(estimate_threading_killed_stmts): New.

> 	(estimate_threading_killed_stmts): New overload.

> 	(record_temporary_equivalences_from_stmts_at_dest): Add path

> 	parameter; adjust caller.  Expand limit when it's hit.

> 

> for  gcc/testsuite/ChangeLog

> 

> 	PR tree-optimization/81165

> 	* gcc.dg/pr81165.c: New.

Attached is what I actually committed for you (so I can wrap 36550
today).  It incorporates my most recent comments.

Namely I fixed the failure to account for the control statement being
dead, moved the new bits into tree-ssa-threadupdate.[ch] and removed the
unnecessary overload and corresponding changes in the caller.

Thanks for taking care of this!  36550 is trivial to deal with on top of
your infrastructure :-)

Jeff
commit a4b4e317d94937c9f858dd0581358cd43957fbbd
Author: Jeff Law <law@torsion.usersys.redhat.com>
Date:   Fri Dec 15 15:42:28 2017 -0500

            PR tree-optimization/81165
            * tree-ssa-threadupdate.c (uses_in_bb): New.
            (estimate_threading_killed_stmts): New.
            * tree-ssa-threadupdate.h (estimate_threading_killed_stmts): Prototype.
            * tree-ssa-threadedge.c
            (record_temporary_equivalences_from_stmts_at_dest): Expand limit
            when its hit.
    
            PR tree-optimization/81165
            * gcc.dg/pr81165.c: New.

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 2d53e24b4c1..e5ab1b1a4c7 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,4 +1,14 @@
-2017-12-12  Jeff Law  <law@redhat.com>
+2017-12-15  Alexandre Oliva <aoliva@redhat.com>
+
+	PR tree-optimization/81165
+	* tree-ssa-threadupdate.c (uses_in_bb): New.
+	(estimate_threading_killed_stmts): New.
+	* tree-ssa-threadupdate.h (estimate_threading_killed_stmts): Prototype.
+	* tree-ssa-threadedge.c 
+	(record_temporary_equivalences_from_stmts_at_dest): Expand limit
+	when its hit.
+
+2017-12-15  Jeff Law  <law@redhat.com>
 
 	PR tree-optimization/83410
 	* tree-ssa-threadupdate.c (thread_block_1): Avoid certain jump
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 7bbf0e9f09e..bc005b8de26 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2017-12-15  Alexandre Oliva  <aoliva@redhat.com>
+
+	PR tree-optimization/81165
+	* gcc.dg/pr81165.c: New.
+
 2017-12-15  Jakub Jelinek  <jakub@redhat.com>
 
 	PR c++/83205
diff --git a/gcc/testsuite/gcc.dg/pr81165.c b/gcc/testsuite/gcc.dg/pr81165.c
new file mode 100644
index 00000000000..8508d893bed
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr81165.c
@@ -0,0 +1,59 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized" } */
+/* { dg-final { scan-tree-dump-not " \[/%\] " "optimized" } } */
+
+/* Testcase submitted for PR81165, with its main function removed as
+   it's turned into a compile test.  We want to make sure that all of
+   the divide/remainder computations are removed by tree optimizers.
+
+   We can figure out that we don't need to compute at runtime even the
+   condition to enter the loop: the initial i==0 would have to be
+   greater than the sum of two small unsigned values: 1U>>t1 is in the
+   range 0..1, whereas the char value is bounded by the range 0..127,
+   being 128 % a positive number (zero would invoke undefined
+   behavior, so we can assume it doesn't happen).  (We know it's
+   nonnegative because it's 10 times a number that has no more than
+   the bits for 16, 8 and 1 set.)
+
+   We don't realize that the loop is useless right away: jump
+   threading helps remove some of the complexity, particularly of the
+   computation within the loop: t1 is compared with 1, but it can
+   never be 1.  (We could assume as much, since its being 1 would
+   divide by zero, but we don't.)
+
+   If we don't enter the conditional block, t1 remains at 2; if we do,
+   it's set to either -1.  If we jump thread at the end of the
+   conditional block, we can figure out the ranges exclude 1 and the
+   jump body is completely optimized out.  However, we used to fail to
+   consider the block for jump threading due to the amount of
+   computation in it, without realizing most of it would die in
+   consequence of the threading.
+
+   We now take the dying code into account when deciding whether or
+   not to try jump threading.  That might enable us to optimize the
+   function into { if (x2 != 0 || (x1 & 1) == 0) abort (); }.  At the
+   time of this writing, with the patch, we get close, but the test on
+   x2 only gets as far as ((1 >> x2) == 0).  Without the patch, some
+   of the loop remains.  */
+
+short x0 = 15;
+
+void func (){
+  volatile int x1 = 1U;
+  volatile char x2 = 0;
+  char t0 = 0;
+  unsigned long t1 = 2LU;
+  int i = 0;
+  
+  if(1>>x2) {
+    t0 = -1;
+    t1 = (1&(short)(x1^8U))-1;
+  }
+
+  while(i > (int)((1U>>t1)+(char)(128%(10*(25LU&(29%x0)))))) {
+    i += (int)(12L/(1!=(int)t1));
+  }
+
+  if (t0 != -1) __builtin_abort();
+  if (t1 != 0L) __builtin_abort();
+}
diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c
index b7781dc7d60..1fafd7b5932 100644
--- a/gcc/tree-ssa-threadedge.c
+++ b/gcc/tree-ssa-threadedge.c
@@ -244,7 +244,22 @@ record_temporary_equivalences_from_stmts_at_dest (edge e,
 	 expansion, then do not thread through this block.  */
       stmt_count++;
       if (stmt_count > max_stmt_count)
-	return NULL;
+	{
+	  /* If any of the stmts in the PATH's dests are going to be
+	     killed due to threading, grow the max count
+	     accordingly.  */
+	  if (max_stmt_count
+	      == PARAM_VALUE (PARAM_MAX_JUMP_THREAD_DUPLICATION_STMTS))
+	    {
+	      max_stmt_count += estimate_threading_killed_stmts (e->dest);
+	      if (dump_file)
+		fprintf (dump_file, "threading bb %i up to %i stmts\n",
+			 e->dest->index, max_stmt_count);
+	    }
+	  /* If we're still past the limit, we're done.  */
+	  if (stmt_count > max_stmt_count)
+	    return NULL;
+	}
 
       /* These are temporary ranges, do nto reflect them back into
 	 the global range data.  */
diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index 63ad8f9c953..b29ffe195c8 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -2431,3 +2431,139 @@ register_jump_thread (vec<jump_thread_edge *> *path)
 
   paths.safe_push (path);
 }
+
+/* Return how many uses of T there are within BB, as long as there
+   aren't any uses outside BB.  If there are any uses outside BB,
+   return -1 if there's at most one use within BB, or -2 if there is
+   more than one use within BB.  */
+
+static int
+uses_in_bb (tree t, basic_block bb)
+{
+  int uses = 0;
+  bool outside_bb = false;
+
+  imm_use_iterator iter;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_FAST (use_p, iter, t)
+    {
+      if (is_gimple_debug (USE_STMT (use_p)))
+	continue;
+
+      if (gimple_bb (USE_STMT (use_p)) != bb)
+	outside_bb = true;
+      else
+	uses++;
+
+      if (outside_bb && uses > 1)
+	return -2;
+    }
+
+  if (outside_bb)
+    return -1;
+
+  return uses;
+}
+
+/* Starting from the final control flow stmt in BB, assuming it will
+   be removed, follow uses in to-be-removed stmts back to their defs
+   and count how many defs are to become dead and be removed as
+   well.  */
+
+unsigned int
+estimate_threading_killed_stmts (basic_block bb)
+{
+  int killed_stmts = 0;
+  hash_map<tree, int> ssa_remaining_uses;
+  auto_vec<gimple *, 4> dead_worklist;
+
+  /* If the block has only two predecessors, threading will turn phi
+     dsts into either src, so count them as dead stmts.  */
+  bool drop_all_phis = EDGE_COUNT (bb->preds) == 2;
+
+  if (drop_all_phis)
+    for (gphi_iterator gsi = gsi_start_phis (bb);
+	 !gsi_end_p (gsi); gsi_next (&gsi))
+      {
+	gphi *phi = gsi.phi ();
+	tree dst = gimple_phi_result (phi);
+
+	/* We don't count virtual PHIs as stmts in
+	   record_temporary_equivalences_from_phis.  */
+	if (virtual_operand_p (dst))
+	  continue;
+
+	killed_stmts++;
+      }
+
+  if (gsi_end_p (gsi_last_bb (bb)))
+    return killed_stmts;
+
+  gimple *stmt = gsi_stmt (gsi_last_bb (bb));
+  if (gimple_code (stmt) != GIMPLE_COND
+      && gimple_code (stmt) != GIMPLE_GOTO
+      && gimple_code (stmt) != GIMPLE_SWITCH)
+    return killed_stmts;
+
+  /* The control statement is always dead.  */
+  killed_stmts++;
+  dead_worklist.quick_push (stmt);
+  while (!dead_worklist.is_empty ())
+    {
+      stmt = dead_worklist.pop ();
+
+      ssa_op_iter iter;
+      use_operand_p use_p;
+      FOR_EACH_SSA_USE_OPERAND (use_p, stmt, iter, SSA_OP_USE)
+	{
+	  tree t = USE_FROM_PTR (use_p);
+	  gimple *def = SSA_NAME_DEF_STMT (t);
+
+	  if (gimple_bb (def) == bb
+	      && (gimple_code (def) != GIMPLE_PHI
+		  || !drop_all_phis)
+	      && !gimple_has_side_effects (def))
+	    {
+	      int *usesp = ssa_remaining_uses.get (t);
+	      int uses;
+
+	      if (usesp)
+		uses = *usesp;
+	      else
+		uses = uses_in_bb (t, bb);
+
+	      gcc_assert (uses);
+
+	      /* Don't bother recording the expected use count if we
+		 won't find any further uses within BB.  */
+	      if (!usesp && (uses < -1 || uses > 1))
+		{
+		  usesp = &ssa_remaining_uses.get_or_insert (t);
+		  *usesp = uses;
+		}
+
+	      if (uses < 0)
+		continue;
+
+	      --uses;
+	      if (usesp)
+		*usesp = uses;
+
+	      if (!uses)
+		{
+		  killed_stmts++;
+		  if (usesp)
+		    ssa_remaining_uses.remove (t);
+		  if (gimple_code (def) != GIMPLE_PHI)
+		    dead_worklist.safe_push (def);
+		}
+	    }
+	}
+    }
+
+  if (dump_file)
+    fprintf (dump_file, "threading bb %i kills %i stmts\n",
+	     bb->index, killed_stmts);
+
+  return killed_stmts;
+}
diff --git a/gcc/tree-ssa-threadupdate.h b/gcc/tree-ssa-threadupdate.h
index 296eff29ff4..8a3b41d82cd 100644
--- a/gcc/tree-ssa-threadupdate.h
+++ b/gcc/tree-ssa-threadupdate.h
@@ -47,6 +47,7 @@ extern void remove_jump_threads_including (edge);
 extern void delete_jump_thread_path (vec <class jump_thread_edge *> *);
 extern void remove_ctrl_stmt_and_useless_edges (basic_block, basic_block);
 extern void free_dom_edge_info (edge);
+extern unsigned int estimate_threading_killed_stmts (basic_block);
 
 enum bb_dom_status
 {

Patch

diff --git a/gcc/testsuite/gcc.dg/pr81165.c b/gcc/testsuite/gcc.dg/pr81165.c
new file mode 100644
index 000000000000..8508d893bed6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr81165.c
@@ -0,0 +1,59 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized" } */
+/* { dg-final { scan-tree-dump-not " \[/%\] " "optimized" } } */
+
+/* Testcase submitted for PR81165, with its main function removed as
+   it's turned into a compile test.  We want to make sure that all of
+   the divide/remainder computations are removed by tree optimizers.
+
+   We can figure out that we don't need to compute at runtime even the
+   condition to enter the loop: the initial i==0 would have to be
+   greater than the sum of two small unsigned values: 1U>>t1 is in the
+   range 0..1, whereas the char value is bounded by the range 0..127,
+   being 128 % a positive number (zero would invoke undefined
+   behavior, so we can assume it doesn't happen).  (We know it's
+   nonnegative because it's 10 times a number that has no more than
+   the bits for 16, 8 and 1 set.)
+
+   We don't realize that the loop is useless right away: jump
+   threading helps remove some of the complexity, particularly of the
+   computation within the loop: t1 is compared with 1, but it can
+   never be 1.  (We could assume as much, since its being 1 would
+   divide by zero, but we don't.)
+
+   If we don't enter the conditional block, t1 remains at 2; if we do,
+   it's set to either -1.  If we jump thread at the end of the
+   conditional block, we can figure out the ranges exclude 1 and the
+   jump body is completely optimized out.  However, we used to fail to
+   consider the block for jump threading due to the amount of
+   computation in it, without realizing most of it would die in
+   consequence of the threading.
+
+   We now take the dying code into account when deciding whether or
+   not to try jump threading.  That might enable us to optimize the
+   function into { if (x2 != 0 || (x1 & 1) == 0) abort (); }.  At the
+   time of this writing, with the patch, we get close, but the test on
+   x2 only gets as far as ((1 >> x2) == 0).  Without the patch, some
+   of the loop remains.  */
+
+short x0 = 15;
+
+void func (){
+  volatile int x1 = 1U;
+  volatile char x2 = 0;
+  char t0 = 0;
+  unsigned long t1 = 2LU;
+  int i = 0;
+  
+  if(1>>x2) {
+    t0 = -1;
+    t1 = (1&(short)(x1^8U))-1;
+  }
+
+  while(i > (int)((1U>>t1)+(char)(128%(10*(25LU&(29%x0)))))) {
+    i += (int)(12L/(1!=(int)t1));
+  }
+
+  if (t0 != -1) __builtin_abort();
+  if (t1 != 0L) __builtin_abort();
+}
diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c
index 536c4717b725..25ccac2a3ecc 100644
--- a/gcc/tree-ssa-threadedge.c
+++ b/gcc/tree-ssa-threadedge.c
@@ -170,6 +170,173 @@  threadedge_valueize (tree t)
   return t;
 }
 
+/* Return how many uses of T there are within BB, as long as there
+   aren't any uses outside BB.  If there are any uses outside BB,
+   return -1 if there's at most one use within BB, or -2 if there is
+   more than one use within BB.  */
+
+static int
+uses_in_bb (tree t, basic_block bb)
+{
+  int uses = 0;
+  bool outside_bb = false;
+
+  imm_use_iterator iter;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_FAST (use_p, iter, t)
+    {
+      if (is_gimple_debug (USE_STMT (use_p)))
+	continue;
+
+      if (gimple_bb (USE_STMT (use_p)) != bb)
+	outside_bb = true;
+      else
+	uses++;
+
+      if (outside_bb && uses > 1)
+	return -2;
+    }
+
+  if (outside_bb)
+    return -1;
+
+  return uses;
+}
+
+/* Starting from the final control flow stmt in BB, assuming it will
+   be removed, follow uses in to-be-removed stmts back to their defs
+   and count how many defs are to become dead and be removed as
+   well.  */
+
+static int
+estimate_threading_killed_stmts (basic_block bb)
+{
+  int killed_stmts = 0;
+  hash_map<tree, int> ssa_remaining_uses;
+  auto_vec<gimple *, 4> dead_worklist;
+
+  /* If the block has only two predecessors, threading will turn phi
+     dsts into either src, so count them as dead stmts.  */
+  bool drop_all_phis = EDGE_COUNT (bb->preds) == 2;
+
+  if (drop_all_phis)
+    for (gphi_iterator gsi = gsi_start_phis (bb);
+	 !gsi_end_p (gsi); gsi_next (&gsi))
+      {
+	gphi *phi = gsi.phi ();
+	tree dst = gimple_phi_result (phi);
+
+	/* We don't count virtual PHIs as stmts in
+	   record_temporary_equivalences_from_phis.  */
+	if (virtual_operand_p (dst))
+	  continue;
+
+	killed_stmts++;
+      }
+
+  if (gsi_end_p (gsi_last_bb (bb)))
+    return killed_stmts;
+
+  gimple *stmt = gsi_stmt (gsi_last_bb (bb));
+  if (gimple_code (stmt) != GIMPLE_COND
+      && gimple_code (stmt) != GIMPLE_GOTO
+      && gimple_code (stmt) != GIMPLE_SWITCH)
+    return killed_stmts;
+
+  dead_worklist.quick_push (stmt);
+  while (!dead_worklist.is_empty ())
+    {
+      stmt = dead_worklist.pop ();
+
+      ssa_op_iter iter;
+      use_operand_p use_p;
+      FOR_EACH_SSA_USE_OPERAND (use_p, stmt, iter, SSA_OP_ALL_USES)
+	{
+	  tree t = USE_FROM_PTR (use_p);
+	  if (SSA_NAME_DEF_STMT (t)
+	      && (gimple_code (SSA_NAME_DEF_STMT (t)) == GIMPLE_ASSIGN
+		  || (gimple_code (SSA_NAME_DEF_STMT (t)) == GIMPLE_PHI
+		      && !virtual_operand_p (t)
+		      && !drop_all_phis))
+	      && gimple_bb (SSA_NAME_DEF_STMT (t)) == bb)
+	    {
+	      int *usesp = ssa_remaining_uses.get (t);
+	      int uses;
+
+	      if (usesp)
+		uses = *usesp;
+	      else
+		uses = uses_in_bb (t, bb);
+
+	      gcc_assert (uses);
+
+	      /* Don't bother recording the expected use count if we
+		 won't find any further uses within BB.  */
+	      if (!usesp && (uses < -1 || uses > 1))
+		{
+		  usesp = &ssa_remaining_uses.get_or_insert (t);
+		  *usesp = uses;
+		}
+
+	      if (uses < 0)
+		continue;
+
+	      --uses;
+	      if (usesp)
+		*usesp = uses;
+
+	      if (!uses)
+		{
+		  killed_stmts++;
+		  if (usesp)
+		    ssa_remaining_uses.remove (t);
+		  if (gimple_code (SSA_NAME_DEF_STMT (t)) != GIMPLE_PHI)
+		    dead_worklist.safe_push (SSA_NAME_DEF_STMT (t));
+		}
+	    }
+	}
+    }
+
+  if (dump_file)
+    fprintf (dump_file, "threading bb %i kills %i stmts\n",
+	     bb->index, killed_stmts);
+
+  return killed_stmts;
+}
+
+/* Estimate killed statements over all blocks in the PATH, or in E.
+   If PATH is not empty, E must be its last entry.  */
+
+static int
+estimate_threading_killed_stmts (edge e, vec<jump_thread_edge *> *path)
+{
+  gcc_assert (path->length () == 0 || path->last ()->e == e);
+  if (path->length () == 0)
+    return estimate_threading_killed_stmts (e->dest);
+
+  int total = 0;
+  int i = 0;
+  jump_thread_edge *je;
+  FOR_EACH_VEC_ELT (*path, i, je)
+    switch (je->type)
+      {
+      case EDGE_START_JUMP_THREAD:
+      case EDGE_COPY_SRC_BLOCK:
+      case EDGE_COPY_SRC_JOINER_BLOCK:
+	total += estimate_threading_killed_stmts (je->e->dest);
+	break;
+
+      default:
+	gcc_unreachable ();
+
+      case EDGE_FSM_THREAD:
+      case EDGE_NO_COPY_SRC_BLOCK:
+	break;
+      }
+
+  return total;
+}
+
 /* Try to simplify each statement in E->dest, ultimately leading to
    a simplification of the COND_EXPR at the end of E->dest.
 
@@ -191,7 +358,8 @@  static gimple *
 record_temporary_equivalences_from_stmts_at_dest (edge e,
     const_and_copies *const_and_copies,
     avail_exprs_stack *avail_exprs_stack,
-    pfn_simplify simplify)
+    pfn_simplify simplify,
+    vec<jump_thread_edge *> *path)
 {
   gimple *stmt = NULL;
   gimple_stmt_iterator gsi;
@@ -233,7 +401,22 @@  record_temporary_equivalences_from_stmts_at_dest (edge e,
 	 expansion, then do not thread through this block.  */
       stmt_count++;
       if (stmt_count > max_stmt_count)
-	return NULL;
+	{
+	  /* If any of the stmts in the PATH's dests are going to be
+	     killed due to threading, grow the max count
+	     accordingly.  */
+	  if (max_stmt_count
+	      == PARAM_VALUE (PARAM_MAX_JUMP_THREAD_DUPLICATION_STMTS))
+	    {
+	      max_stmt_count += estimate_threading_killed_stmts (e, path);
+	      if (dump_file)
+		fprintf (dump_file, "threading bb %i up to %i stmts\n",
+			 e->dest->index, max_stmt_count);
+	    }
+	  /* If we're still past the limit, we're done.  */
+	  if (stmt_count > max_stmt_count)
+	    return NULL;
+	}
 
       /* If this is not a statement that sets an SSA_NAME to a new
 	 value, then do not try to simplify this statement as it will
@@ -991,7 +1174,7 @@  thread_through_normal_block (edge e,
   gimple *stmt
     = record_temporary_equivalences_from_stmts_at_dest (e, const_and_copies,
 							avail_exprs_stack,
-							simplify);
+							simplify, path);
 
   /* There's two reasons STMT might be null, and distinguishing
      between them is important.