Fix overflows in -fprofile-reorder-functions

Message ID 20191208170533.tqjym4jbeljzsnwo@kam.mff.cuni.cz

Commit Message

Jan Hubicka Dec. 8, 2019, 5:05 p.m.
Hi,
this patch fixes three issues with -fprofile-reorder-functions:
1) The first is that tp_first_run is stored as a 32bit integer while it can
   easily overflow (and does so during Firefox profiling).
2) The second problem is that flag_profile_reorder_functions can not be
   tested without a function context.
   The change to expand_all_functions makes it work on mixed units by
   first outputting all functions without -fprofile-reorder-functions (or
   with no profile info) and then outputting the rest in first_run order.
3) The LTO partitioner was mixing up ordering by tp_first_run and by order:
   for no_reorder symbols we definitely want to order by the latter, while
   everything else should be ordered by the former.

I have also merged the duplicated comparators, since they are a bit fragile,
into tp_first_run_node_cmp.

I originally started to look into this because of undefined symbols with
Firefox PGO builds.  These symbols went away after fixing these bugs but I am
not quite sure how. It is possible that there is another problem in
lto_balanced_map, but even after reading the no_reorder code a few times
carefully I did not find it.  The other explanation would be that our new
qsort with a comparator broken by overflow can actually remove some entries
from the array, but that sounds a bit crazy.

Bootstrapped/regtested x86_64-linux. Committed.

	* cgraph.c (cgraph_node::dump): Make tp_first_run 64bit.
	* cgraph.h (cgraph_node): Likewise.
	(tp_first_run_node_cmp): Declare.
	* cgraphunit.c (node_cmp): Rename to ...
	(tp_first_run_node_cmp): ... this; export; watch for 64bit overflows;
	clear tp_first_run for no_reorder and !flag_profile_reorder_functions.
	(expand_all_functions): Collect tp_first_run and normal functions into
	two vectors so the other functions remain sorted.  Do not check for
	flag_profile_reorder_functions; it is a function-local flag.
	* profile.c (compute_value_histograms): Update tp_first_run printing.

	* lto-partition.c (node_cmp): Turn into simple order comparisons.
	(varpool_node_cmp): Remove.
	(add_sorted_nodes): Use node_cmp.
	(lto_balanced_map): Use tp_first_run_node_cmp.

Comments

Jan Hubicka Dec. 8, 2019, 6:29 p.m. | #1
> Hi,
> this patch fixes three issues with -fprofile-reorder-functions:
> 1) The first is that tp_first_run is stored as a 32bit integer while it can
>    easily overflow (and does so during Firefox profiling).

Actually the overflow problem is possible only with mismatched profiles
(which does happen for me). This is because the time is increased only when
a new function is found, so it can not get over 2^32 for programs that do
not have more than 2^32 functions.

So thinking this over, perhaps it is better to keep tp_first_run 32bit
and cap it on profile corruption.  I will look into that tomorrow.

The other two problems are valid though, so I will fix them
incrementally.

Honza
Alexander Monakov Dec. 8, 2019, 7:18 p.m. | #2
On Sun, 8 Dec 2019, Jan Hubicka wrote:

> Other explanation would be that our new qsort with broken comparator due to
> overflow can actually remove some entries in the array, but that sounds a bit
> crazy.

gcc_qsort only reorders elements, making it possible for gcc_qsort_chk (that
runs afterwards) to catch crazy comparators in a sound manner.

> Bootstrapped/regtested x86_64-linux. Committed.

I have a few comments, please see below.

> --- cgraphunit.c	(revision 279076)
> +++ cgraphunit.c	(working copy)
> @@ -2359,19 +2359,33 @@ cgraph_node::expand (void)
>  /* Node comparator that is responsible for the order that corresponds
>     to time when a function was launched for the first time.  */
>  
> -static int
> -node_cmp (const void *pa, const void *pb)
> +int
> +tp_first_run_node_cmp (const void *pa, const void *pb)
>  {
>    const cgraph_node *a = *(const cgraph_node * const *) pa;
>    const cgraph_node *b = *(const cgraph_node * const *) pb;
> +  gcov_type tp_first_run_a = a->tp_first_run;
> +  gcov_type tp_first_run_b = b->tp_first_run;
> +
> +  if (!opt_for_fn (a->decl, flag_profile_reorder_functions)
> +      || a->no_reorder)
> +    tp_first_run_a = 0;
> +  if (!opt_for_fn (b->decl, flag_profile_reorder_functions)
> +      || b->no_reorder)
> +    tp_first_run_b = 0;
> +
> +  if (tp_first_run_a == tp_first_run_b)
> +    return b->order - a->order;
>  
>    /* Functions with time profile must be before these without profile.  */
> -  if (!a->tp_first_run || !b->tp_first_run)
> -    return a->tp_first_run - b->tp_first_run;
> +  if (!tp_first_run_a || !tp_first_run_b)
> +    return tp_first_run_a ? 1 : -1;


The code does the opposite of the comment: when tp_first_run_b is 0, it will
return 1, indicating a > b, causing b to appear in front of a in the sorted
array.

I would recommend to make these variables uint64_t, then you can simply do

  tp_first_run_a--;
  tp_first_run_b--;

making 0 wrap around to UINT64_MAX. Then they will naturally sort after all
other nodes.

> -  return a->tp_first_run != b->tp_first_run
> -	 ? b->tp_first_run - a->tp_first_run
> -	 : b->order - a->order;
> +  /* Watch for overlflow - tp_first_run is 64bit.  */
> +  if (tp_first_run_a > tp_first_run_b)
> +    return -1;
> +  else
> +    return 1;

This also sorts in the reverse order -- please fix the comments if that's really
intended.

> +  /* Output functions in RPO so callers get optimized before callees.  This
> +     makes ipa-ra and other propagators to work.
> +     FIXME: This is far from optimal code layout.  */


I think this should have said "callees get optimized before callers".


Thanks.
Alexander
Andreas Schwab Dec. 8, 2019, 7:45 p.m. | #3
On Dec 08 2019, Jan Hubicka wrote:

> Index: cgraphunit.c
> ===================================================================
> --- cgraphunit.c	(revision 279076)
> +++ cgraphunit.c	(working copy)
> @@ -2359,19 +2359,33 @@ cgraph_node::expand (void)
>  /* Node comparator that is responsible for the order that corresponds
>     to time when a function was launched for the first time.  */
>  
> -static int
> -node_cmp (const void *pa, const void *pb)
> +int
> +tp_first_run_node_cmp (const void *pa, const void *pb)
>  {
>    const cgraph_node *a = *(const cgraph_node * const *) pa;
>    const cgraph_node *b = *(const cgraph_node * const *) pb;
> +  gcov_type tp_first_run_a = a->tp_first_run;
> +  gcov_type tp_first_run_b = b->tp_first_run;
> +
> +  if (!opt_for_fn (a->decl, flag_profile_reorder_functions)
> +      || a->no_reorder)
> +    tp_first_run_a = 0;
> +  if (!opt_for_fn (b->decl, flag_profile_reorder_functions)
> +      || b->no_reorder)
> +    tp_first_run_b = 0;
> +
> +  if (tp_first_run_a == tp_first_run_b)
> +    return b->order - a->order;
>  
>    /* Functions with time profile must be before these without profile.  */
> -  if (!a->tp_first_run || !b->tp_first_run)
> -    return a->tp_first_run - b->tp_first_run;
> +  if (!tp_first_run_a || !tp_first_run_b)
> +    return tp_first_run_a ? 1 : -1;
>  
> -  return a->tp_first_run != b->tp_first_run
> -	 ? b->tp_first_run - a->tp_first_run
> -	 : b->order - a->order;
> +  /* Watch for overlflow - tp_first_run is 64bit.  */
                  overflow

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."
Jan Hubicka Dec. 8, 2019, 9:16 p.m. | #4
> On Sun, 8 Dec 2019, Jan Hubicka wrote:
> 
> > Other explanation would be that our new qsort with broken comparator due to
> > overflow can actually remove some entries in the array, but that sounds a bit
> > crazy.
> 
> gcc_qsort only reorders elements, making it possible for gcc_qsort_chk (that
> runs afterwards) to catch crazy comparators in a sound manner.
> 
> > Bootstrapped/regtested x86_64-linux. Committed.
> 
> I have a few comments, please see below.


Thanks. I will revisit the patch tomorrow - as mentioned in the other
mail I got overzealous about the 64bit support - we can easily sort out
too large numbers as broken profile.
This was at the end of a very long debugging session and I should have
thought it out better.

Honza
Jan Hubicka Dec. 10, 2019, 3:31 p.m. | #5
> On Sun, 8 Dec 2019, Jan Hubicka wrote:
> 
> > Other explanation would be that our new qsort with broken comparator due to
> > overflow can actually remove some entries in the array, but that sounds a bit
> > crazy.
> 
> gcc_qsort only reorders elements, making it possible for gcc_qsort_chk (that
> runs afterwards) to catch crazy comparators in a sound manner.


I understand this problem now (and it is very weird).  It was caused by the
optimize attribute incorrectly overwriting
flag_profile_reorder_functions, which is not supposed to change globally
mid-compilation. So we ended up queuing some symbols for ordered
output and then forgetting about them.
> >    /* Functions with time profile must be before these without profile.  */
> > -  if (!a->tp_first_run || !b->tp_first_run)
> > -    return a->tp_first_run - b->tp_first_run;
> > +  if (!tp_first_run_a || !tp_first_run_b)
> > +    return tp_first_run_a ? 1 : -1;
> 
> The code does the opposite of the comment: when tp_first_run_b is 0, it will
> return 1, indicating a > b, causing b to appear in front of a in the sorted
> array.


You are right - I had noticed that and fixed it in an (apparently
forgotten) last-minute change. Trunk says tp_first_run_b.
> I would recommend to make these variables uint64_t, then you can simply do
> 
>   tp_first_run_a--;
>   tp_first_run_b--;
> 
> making 0 wrap around to UINT64_MAX. Then they will naturally sort after all
> other nodes.


Then we would still have to watch for the overflow before returning? I
actually find the conditional sort more readable than an intentional wrap
around the range, so I kept it in the code, especially because I made the
value 32bit again and without this trick I no longer need to watch for
overflows.
> > +  /* Output functions in RPO so callers get optimized before callees.  This
> > +     makes ipa-ra and other propagators to work.
> > +     FIXME: This is far from optimal code layout.  */
> 
> I think this should have said "callees get optimized before callers".


Indeed.
Here is a patch fixing the issues, which I have bootstrapped & regtested.
I will wait a bit for comments before committing.

Honza

	* cgraph.c (cgraph_node::verify_node): Verify tp_first_run.
	* cgraph.h (cgrpah_node): Turn tp_first_run back to int.
	* cgraphunit.c (tp_first_run_node_cmp): Do not watch for overflows.
	(expand_all_functions): First expand ordered section and then
	unordered.
	* lto-partition.c (lto_balanced_map): Fix printing of tp_first_run.
	* profile.c (compute_value_histograms): Error on out of range
	tp_first_runs.
Index: cgraph.c
===================================================================
--- cgraph.c	(revision 279093)
+++ cgraph.c	(working copy)
@@ -3074,6 +3074,11 @@ cgraph_node::verify_node (void)
       inlined_to->count.debug ();
       error_found = true;
     }
+  if (tp_first_run < 0)
+    {
+      error ("tp_first_run must be positive");
+      error_found = true;
+    }
   if (!definition && !in_other_partition && local)
     {
       error ("local symbols must be defined");
Index: cgraph.h
===================================================================
--- cgraph.h	(revision 279093)
+++ cgraph.h	(working copy)
@@ -1430,8 +1430,6 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cg
 
   /* Expected number of executions: calculated in profile.c.  */
   profile_count count;
-  /* Time profiler: first run of function.  */
-  gcov_type tp_first_run;
   /* How to scale counts at materialization time; used to merge
      LTO units with different number of profile runs.  */
   int count_materialization_scale;
@@ -1439,6 +1437,8 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cg
   unsigned int profile_id;
   /* ID of the translation unit.  */
   int unit_id;
+  /* Time profiler: first run of function.  */
+  int tp_first_run;
 
   /* Set when decl is an abstract function pointed to by the
      ABSTRACT_DECL_ORIGIN of a reachable function.  */
Index: lto/lto-partition.c
===================================================================
--- lto/lto-partition.c	(revision 279093)
+++ lto/lto-partition.c	(working copy)
@@ -514,11 +514,11 @@ lto_balanced_map (int n_lto_partitions,
   if (dump_file)
     {
       for (unsigned i = 0; i < order.length (); i++)
-	fprintf (dump_file, "Balanced map symbol order:%s:%" PRId64 "\n",
-		 order[i]->name (), (int64_t) order[i]->tp_first_run);
+	fprintf (dump_file, "Balanced map symbol order:%s:%u\n",
+		 order[i]->name (), order[i]->tp_first_run);
       for (unsigned i = 0; i < noreorder.length (); i++)
-	fprintf (dump_file, "Balanced map symbol no_reorder:%s:%" PRId64 "\n",
-		 noreorder[i]->name (), (int64_t) noreorder[i]->tp_first_run);
+	fprintf (dump_file, "Balanced map symbol no_reorder:%s:%u\n",
+		 noreorder[i]->name (), noreorder[i]->tp_first_run);
     }
 
   /* Collect all variables that should not be reordered.  */
Index: profile.c
===================================================================
--- profile.c	(revision 279093)
+++ profile.c	(working copy)
@@ -871,11 +871,18 @@ compute_value_histograms (histogram_valu
       if (hist->type == HIST_TYPE_TIME_PROFILE)
         {
 	  node = cgraph_node::get (hist->fun->decl);
-	  node->tp_first_run = hist->hvalue.counters[0];
+	  if (hist->hvalue.counters[0] >= 0
+	      && hist->hvalue.counters[0] < INT_MAX / 2)
+	    node->tp_first_run = hist->hvalue.counters[0];
+	  else
+	    {
+	      if (flag_profile_correction)
+		error ("corrupted profile info: invalid time profile");
+	      node->tp_first_run = 0;
+	    }
 
           if (dump_file)
-            fprintf (dump_file, "Read tp_first_run: %" PRId64 "\n",
-		     (int64_t) node->tp_first_run);
+            fprintf (dump_file, "Read tp_first_run: %d\n", node->tp_first_run);
         }
     }
 
Index: cgraphunit.c
===================================================================
--- cgraphunit.c	(revision 279093)
+++ cgraphunit.c	(working copy)
@@ -2364,8 +2364,8 @@ tp_first_run_node_cmp (const void *pa, c
 {
   const cgraph_node *a = *(const cgraph_node * const *) pa;
   const cgraph_node *b = *(const cgraph_node * const *) pb;
-  gcov_type tp_first_run_a = a->tp_first_run;
-  gcov_type tp_first_run_b = b->tp_first_run;
+  int tp_first_run_a = a->tp_first_run;
+  int tp_first_run_b = b->tp_first_run;
 
   if (!opt_for_fn (a->decl, flag_profile_reorder_functions)
       || a->no_reorder)
@@ -2381,11 +2381,7 @@ tp_first_run_node_cmp (const void *pa, c
   if (!tp_first_run_a || !tp_first_run_b)
     return tp_first_run_b ? 1 : -1;
 
-  /* Watch for overlflow - tp_first_run is 64bit.  */
-  if (tp_first_run_a > tp_first_run_b)
-    return 1;
-  else
-    return -1;
+  return tp_first_run_a - tp_first_run_b;
 }
 
 /* Expand all functions that must be output.
@@ -2425,43 +2421,45 @@ expand_all_functions (void)
           order[new_order_pos++] = order[i];
       }
 
-  /* Output functions in RPO so callers get optimized before callees.  This
-     makes ipa-ra and other propagators to work.
-     FIXME: This is far from optimal code layout.  */
-  for (i = new_order_pos - 1; i >= 0; i--)
+  /* First output functions with time profile in specified order.  */
+  qsort (tp_first_run_order, tp_first_run_order_pos,
+	 sizeof (cgraph_node *), tp_first_run_node_cmp);
+  for (i = 0; i < tp_first_run_order_pos; i++)
     {
-      node = order[i];
+      node = tp_first_run_order[i];
 
       if (node->process)
 	{
 	  expanded_func_count++;
+	  profiled_func_count++;
+
+	  if (symtab->dump_file)
+	    fprintf (symtab->dump_file,
+		     "Time profile order in expand_all_functions:%s:%d\n",
+		     node->asm_name (), node->tp_first_run);
 	  node->process = 0;
 	  node->expand ();
 	}
     }
-  qsort (tp_first_run_order, tp_first_run_order_pos,
-	 sizeof (cgraph_node *), tp_first_run_node_cmp);
-  for (i = 0; i < tp_first_run_order_pos; i++)
+
+  /* Output functions in RPO so callees get optimized before callers.  This
+     makes ipa-ra and other propagators to work.
+     FIXME: This is far from optimal code layout.  */
+  for (i = new_order_pos - 1; i >= 0; i--)
     {
-      node = tp_first_run_order[i];
+      node = order[i];
 
       if (node->process)
 	{
 	  expanded_func_count++;
-	  profiled_func_count++;
-
-	  if (symtab->dump_file)
-	    fprintf (symtab->dump_file,
-		     "Time profile order in expand_all_functions:%s:%" PRId64
-		     "\n", node->asm_name (), (int64_t) node->tp_first_run);
 	  node->process = 0;
 	  node->expand ();
 	}
     }
 
-    if (dump_file)
-      fprintf (dump_file, "Expanded functions with time profile (%s):%u/%u\n",
-               main_input_filename, profiled_func_count, expanded_func_count);
+  if (dump_file)
+    fprintf (dump_file, "Expanded functions with time profile (%s):%u/%u\n",
+	     main_input_filename, profiled_func_count, expanded_func_count);
 
   if (symtab->dump_file && tp_first_run_order_pos)
     fprintf (symtab->dump_file, "Expanded functions with time profile:%u/%u\n",
Alexander Monakov Dec. 10, 2019, 4:11 p.m. | #6
On Tue, 10 Dec 2019, Jan Hubicka wrote:

> > I would recommend to make these variables uint64_t, then you can simply do
> > 
> >   tp_first_run_a--;
> >   tp_first_run_b--;
> > 
> > making 0 wrap around to UINT64_MAX. Then they will naturally sort after all
> > other nodes.
> 
> Then we would still have to watch for the overflow before returning? I
> actually find the conditional sort more readable than an intentional wrap
> around the range, so I kept it in the code especially because I made the
> value 32bit again and without this trick I no longer need to watch for
> overflows.


For int-typed timestamps, you'd need to wrap 0 to INT_MAX, e.g. like this:

  tp_first_run_a = (tp_first_run_a - 1) & INT_MAX;
  tp_first_run_b = (tp_first_run_b - 1) & INT_MAX;

which keeps them in 0..INT_MAX range so 'return tp_first_run_a - tp_first_run_b'
still works.

> > > +  /* Output functions in RPO so callers get optimized before callees.  This
> > > +     makes ipa-ra and other propagators to work.
> > > +     FIXME: This is far from optimal code layout.  */
> > 
> > I think this should have said "callees get optimized before callers".
> 
> Indeed.


So technically this would be just postorder, not RPO :)

> --- cgraph.c	(revision 279093)
> +++ cgraph.c	(working copy)
> @@ -3074,6 +3074,11 @@ cgraph_node::verify_node (void)
>        inlined_to->count.debug ();
>        error_found = true;
>      }
> +  if (tp_first_run < 0)
> +    {
> +      error ("tp_first_run must be positive");
> +      error_found = true;
> +    }
"non-negative"?

Alexander
Jan Hubicka Dec. 10, 2019, 6:02 p.m. | #7
> On Tue, 10 Dec 2019, Jan Hubicka wrote:
> 
> > > I would recommend to make these variables uint64_t, then you can simply do
> > > 
> > >   tp_first_run_a--;
> > >   tp_first_run_b--;
> > > 
> > > making 0 wrap around to UINT64_MAX. Then they will naturally sort after all
> > > other nodes.
> > 
> > Then we would still have to watch for the overflow before returning? I
> > actually find the conditional sort more readable than an intentional wrap
> > around the range, so I kept it in the code especially because I made the
> > value 32bit again and without this trick I no longer need to watch for
> > overflows.
> 
> For int-typed timestamps, you'd need to wrap 0 to INT_MAX, e.g. like this:
> 
>   tp_first_run_a = (tp_first_run_a - 1) & INT_MAX;
>   tp_first_run_b = (tp_first_run_b - 1) & INT_MAX;
> 
> which keeps them in 0..INT_MAX range so 'return tp_first_run_a - tp_first_run_b'
> still works.


OK, updated the code for that :)
> > > > +  /* Output functions in RPO so callers get optimized before callees.  This
> > > > +     makes ipa-ra and other propagators to work.
> > > > +     FIXME: This is far from optimal code layout.  */
> > > 
> > > I think this should have said "callees get optimized before callers".
> > 
> > Indeed.
> 
> So technically this would be just postorder, not RPO :)


Well, it is computed by ipa_reverse_postorder :)
> > --- cgraph.c	(revision 279093)
> > +++ cgraph.c	(working copy)
> > @@ -3074,6 +3074,11 @@ cgraph_node::verify_node (void)
> >        inlined_to->count.debug ();
> >        error_found = true;
> >      }
> > +  if (tp_first_run < 0)
> > +    {
> > +      error ("tp_first_run must be positive");
> > +      error_found = true;
> > +    }
> 
> "non-negative"?

Fixed too.

I have committed the following variant of the patch after re-testing.


	* cgraph.c (cgraph_node::verify_node): Verify tp_first_run.
	* cgraph.h (cgrpah_node): Turn tp_first_run back to int.
	* cgraphunit.c (tp_first_run_node_cmp): Do not watch for overflows.
	(expand_all_functions): First expand ordered section and then
	unordered.
	* lto-partition.c (lto_balanced_map): Fix printing of tp_first_run.
	* profile.c (compute_value_histograms): Error on out of range
	tp_first_runs.
Index: cgraph.c
===================================================================
--- cgraph.c	(revision 279167)
+++ cgraph.c	(working copy)
@@ -3066,6 +3066,11 @@ cgraph_node::verify_node (void)
       inlined_to->count.debug ();
       error_found = true;
     }
+  if (tp_first_run < 0)
+    {
+      error ("tp_first_run must be non-negative");
+      error_found = true;
+    }
   if (!definition && !in_other_partition && local)
     {
       error ("local symbols must be defined");
Index: cgraph.h
===================================================================
--- cgraph.h	(revision 279167)
+++ cgraph.h	(working copy)
@@ -926,9 +926,9 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cg
       clone_of (NULL), call_site_hash (NULL), former_clone_of (NULL),
       simdclone (NULL), simd_clones (NULL), ipa_transforms_to_apply (vNULL),
       inlined_to (NULL), rtl (NULL), clone (), thunk (),
-      count (profile_count::uninitialized ()), tp_first_run (false),
+      count (profile_count::uninitialized ()),
       count_materialization_scale (REG_BR_PROB_BASE), profile_id (0),
-      unit_id (0), used_as_abstract_origin (false),
+      unit_id (0), tp_first_run (0), used_as_abstract_origin (false),
       lowered (false), process (false), frequency (NODE_FREQUENCY_NORMAL),
       only_called_at_startup (false), only_called_at_exit (false),
       tm_clone (false), dispatcher_function (false), calls_comdat_local (false),
@@ -1469,8 +1469,6 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cg
 
   /* Expected number of executions: calculated in profile.c.  */
   profile_count count;
-  /* Time profiler: first run of function.  */
-  gcov_type tp_first_run;
   /* How to scale counts at materialization time; used to merge
      LTO units with different number of profile runs.  */
   int count_materialization_scale;
@@ -1478,6 +1476,8 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cg
   unsigned int profile_id;
   /* ID of the translation unit.  */
   int unit_id;
+  /* Time profiler: first run of function.  */
+  int tp_first_run;
 
   /* Set when decl is an abstract function pointed to by the
      ABSTRACT_DECL_ORIGIN of a reachable function.  */
Index: cgraphunit.c
===================================================================
--- cgraphunit.c	(revision 279167)
+++ cgraphunit.c	(working copy)
@@ -2364,8 +2364,8 @@ tp_first_run_node_cmp (const void *pa, c
 {
   const cgraph_node *a = *(const cgraph_node * const *) pa;
   const cgraph_node *b = *(const cgraph_node * const *) pb;
-  gcov_type tp_first_run_a = a->tp_first_run;
-  gcov_type tp_first_run_b = b->tp_first_run;
+  unsigned int tp_first_run_a = a->tp_first_run;
+  unsigned int tp_first_run_b = b->tp_first_run;
 
   if (!opt_for_fn (a->decl, flag_profile_reorder_functions)
       || a->no_reorder)
@@ -2378,14 +2378,10 @@ tp_first_run_node_cmp (const void *pa, c
     return a->order - b->order;
 
   /* Functions with time profile must be before these without profile.  */
-  if (!tp_first_run_a || !tp_first_run_b)
-    return tp_first_run_b ? 1 : -1;
+  tp_first_run_a = (tp_first_run_a - 1) & INT_MAX;
+  tp_first_run_b = (tp_first_run_b - 1) & INT_MAX;
 
-  /* Watch for overlflow - tp_first_run is 64bit.  */
-  if (tp_first_run_a > tp_first_run_b)
-    return 1;
-  else
-    return -1;
+  return tp_first_run_a - tp_first_run_b;
 }
 
 /* Expand all functions that must be output.
@@ -2425,43 +2421,45 @@ expand_all_functions (void)
           order[new_order_pos++] = order[i];
       }
 
-  /* Output functions in RPO so callers get optimized before callees.  This
-     makes ipa-ra and other propagators to work.
-     FIXME: This is far from optimal code layout.  */
-  for (i = new_order_pos - 1; i >= 0; i--)
+  /* First output functions with time profile in specified order.  */
+  qsort (tp_first_run_order, tp_first_run_order_pos,
+	 sizeof (cgraph_node *), tp_first_run_node_cmp);
+  for (i = 0; i < tp_first_run_order_pos; i++)
     {
-      node = order[i];
+      node = tp_first_run_order[i];
 
       if (node->process)
 	{
 	  expanded_func_count++;
+	  profiled_func_count++;
+
+	  if (symtab->dump_file)
+	    fprintf (symtab->dump_file,
+		     "Time profile order in expand_all_functions:%s:%d\n",
+		     node->asm_name (), node->tp_first_run);
 	  node->process = 0;
 	  node->expand ();
 	}
     }
-  qsort (tp_first_run_order, tp_first_run_order_pos,
-	 sizeof (cgraph_node *), tp_first_run_node_cmp);
-  for (i = 0; i < tp_first_run_order_pos; i++)
+
+  /* Output functions in RPO so callees get optimized before callers.  This
+     makes ipa-ra and other propagators to work.
+     FIXME: This is far from optimal code layout.  */
+  for (i = new_order_pos - 1; i >= 0; i--)
     {
-      node = tp_first_run_order[i];
+      node = order[i];
 
       if (node->process)
 	{
 	  expanded_func_count++;
-	  profiled_func_count++;
-
-	  if (symtab->dump_file)
-	    fprintf (symtab->dump_file,
-		     "Time profile order in expand_all_functions:%s:%" PRId64
-		     "\n", node->asm_name (), (int64_t) node->tp_first_run);
 	  node->process = 0;
 	  node->expand ();
 	}
     }
 
-    if (dump_file)
-      fprintf (dump_file, "Expanded functions with time profile (%s):%u/%u\n",
-               main_input_filename, profiled_func_count, expanded_func_count);
+  if (dump_file)
+    fprintf (dump_file, "Expanded functions with time profile (%s):%u/%u\n",
+	     main_input_filename, profiled_func_count, expanded_func_count);
 
   if (symtab->dump_file && tp_first_run_order_pos)
     fprintf (symtab->dump_file, "Expanded functions with time profile:%u/%u\n",
Index: lto/lto-partition.c
===================================================================
--- lto/lto-partition.c	(revision 279167)
+++ lto/lto-partition.c	(working copy)
@@ -514,11 +514,11 @@ lto_balanced_map (int n_lto_partitions,
   if (dump_file)
     {
       for (unsigned i = 0; i < order.length (); i++)
-	fprintf (dump_file, "Balanced map symbol order:%s:%" PRId64 "\n",
-		 order[i]->name (), (int64_t) order[i]->tp_first_run);
+	fprintf (dump_file, "Balanced map symbol order:%s:%u\n",
+		 order[i]->name (), order[i]->tp_first_run);
       for (unsigned i = 0; i < noreorder.length (); i++)
-	fprintf (dump_file, "Balanced map symbol no_reorder:%s:%" PRId64 "\n",
-		 noreorder[i]->name (), (int64_t) noreorder[i]->tp_first_run);
+	fprintf (dump_file, "Balanced map symbol no_reorder:%s:%u\n",
+		 noreorder[i]->name (), noreorder[i]->tp_first_run);
     }
 
   /* Collect all variables that should not be reordered.  */
Index: profile.c
===================================================================
--- profile.c	(revision 279167)
+++ profile.c	(working copy)
@@ -871,11 +871,18 @@ compute_value_histograms (histogram_valu
       if (hist->type == HIST_TYPE_TIME_PROFILE)
         {
 	  node = cgraph_node::get (hist->fun->decl);
-	  node->tp_first_run = hist->hvalue.counters[0];
+	  if (hist->hvalue.counters[0] >= 0
+	      && hist->hvalue.counters[0] < INT_MAX / 2)
+	    node->tp_first_run = hist->hvalue.counters[0];
+	  else
+	    {
+	      if (flag_profile_correction)
+		error ("corrupted profile info: invalid time profile");
+	      node->tp_first_run = 0;
+	    }
 
           if (dump_file)
-            fprintf (dump_file, "Read tp_first_run: %" PRId64 "\n",
-		     (int64_t) node->tp_first_run);
+            fprintf (dump_file, "Read tp_first_run: %d\n", node->tp_first_run);
         }
     }

Patch

Index: cgraph.c
===================================================================
--- cgraph.c	(revision 279076)
+++ cgraph.c	(working copy)
@@ -1954,7 +1954,7 @@  cgraph_node::dump (FILE *f)
       count.dump (f);
     }
   if (tp_first_run > 0)
-    fprintf (f, " first_run:%i", tp_first_run);
+    fprintf (f, " first_run:%" PRId64, (int64_t) tp_first_run);
   if (origin)
     fprintf (f, " nested in:%s", origin->asm_name ());
   if (gimple_has_body_p (decl))
Index: cgraph.h
===================================================================
--- cgraph.h	(revision 279076)
+++ cgraph.h	(working copy)
@@ -1430,6 +1430,8 @@  struct GTY((tag ("SYMTAB_FUNCTION"))) cg
 
   /* Expected number of executions: calculated in profile.c.  */
   profile_count count;
+  /* Time profiler: first run of function.  */
+  gcov_type tp_first_run;
   /* How to scale counts at materialization time; used to merge
      LTO units with different number of profile runs.  */
   int count_materialization_scale;
@@ -1437,8 +1439,6 @@  struct GTY((tag ("SYMTAB_FUNCTION"))) cg
   unsigned int profile_id;
   /* ID of the translation unit.  */
   int unit_id;
-  /* Time profiler: first run of function.  */
-  int tp_first_run;
 
   /* Set when decl is an abstract function pointed to by the
      ABSTRACT_DECL_ORIGIN of a reachable function.  */
@@ -2463,6 +2463,7 @@  cgraph_inline_failed_type_t cgraph_inlin
 
 /* In cgraphunit.c  */
 void cgraphunit_c_finalize (void);
+int tp_first_run_node_cmp (const void *pa, const void *pb);
 
 /*  Initialize datastructures so DECL is a function in lowered gimple form.
     IN_SSA is true if the gimple is in SSA.  */
Index: cgraphunit.c
===================================================================
--- cgraphunit.c	(revision 279076)
+++ cgraphunit.c	(working copy)
@@ -2359,19 +2359,33 @@  cgraph_node::expand (void)
 /* Node comparator that is responsible for the order that corresponds
    to time when a function was launched for the first time.  */
 
-static int
-node_cmp (const void *pa, const void *pb)
+int
+tp_first_run_node_cmp (const void *pa, const void *pb)
 {
   const cgraph_node *a = *(const cgraph_node * const *) pa;
   const cgraph_node *b = *(const cgraph_node * const *) pb;
+  gcov_type tp_first_run_a = a->tp_first_run;
+  gcov_type tp_first_run_b = b->tp_first_run;
+
+  if (!opt_for_fn (a->decl, flag_profile_reorder_functions)
+      || a->no_reorder)
+    tp_first_run_a = 0;
+  if (!opt_for_fn (b->decl, flag_profile_reorder_functions)
+      || b->no_reorder)
+    tp_first_run_b = 0;
+
+  if (tp_first_run_a == tp_first_run_b)
+    return b->order - a->order;
 
   /* Functions with time profile must be before these without profile.  */
-  if (!a->tp_first_run || !b->tp_first_run)
-    return a->tp_first_run - b->tp_first_run;
+  if (!tp_first_run_a || !tp_first_run_b)
+    return tp_first_run_a ? 1 : -1;
 
-  return a->tp_first_run != b->tp_first_run
-	 ? b->tp_first_run - a->tp_first_run
-	 : b->order - a->order;
+  /* Watch for overlflow - tp_first_run is 64bit.  */
+  if (tp_first_run_a > tp_first_run_b)
+    return -1;
+  else
+    return 1;
 }
 
 /* Expand all functions that must be output.
@@ -2390,8 +2404,10 @@  expand_all_functions (void)
   cgraph_node *node;
   cgraph_node **order = XCNEWVEC (cgraph_node *,
 					 symtab->cgraph_count);
+  cgraph_node **tp_first_run_order = XCNEWVEC (cgraph_node *,
+					 symtab->cgraph_count);
   unsigned int expanded_func_count = 0, profiled_func_count = 0;
-  int order_pos, new_order_pos = 0;
+  int order_pos, tp_first_run_order_pos = 0, new_order_pos = 0;
   int i;
 
   order_pos = ipa_reverse_postorder (order);
@@ -2401,11 +2417,17 @@  expand_all_functions (void)
      optimization.  So we must be sure to not reference them.  */
   for (i = 0; i < order_pos; i++)
     if (order[i]->process)
-      order[new_order_pos++] = order[i];
-
-  if (flag_profile_reorder_functions)
-    qsort (order, new_order_pos, sizeof (cgraph_node *), node_cmp);
-
+      {
+	if (order[i]->tp_first_run
+	    && opt_for_fn (order[i]->decl, flag_profile_reorder_functions))
+	  tp_first_run_order[tp_first_run_order_pos++] = order[i];
+	else
+          order[new_order_pos++] = order[i];
+      }
+
+  /* Output functions in RPO so callers get optimized before callees.  This
+     makes ipa-ra and other propagators work.
+     FIXME: This is far from optimal code layout.  */
   for (i = new_order_pos - 1; i >= 0; i--)
     {
       node = order[i];
@@ -2413,13 +2435,25 @@  expand_all_functions (void)
       if (node->process)
 	{
 	  expanded_func_count++;
-	  if(node->tp_first_run)
-	    profiled_func_count++;
+	  node->process = 0;
+	  node->expand ();
+	}
+    }
+  qsort (tp_first_run_order, tp_first_run_order_pos,
+	 sizeof (cgraph_node *), tp_first_run_node_cmp);
+  for (i = 0; i < tp_first_run_order_pos; i++)
+    {
+      node = tp_first_run_order[i];
+
+      if (node->process)
+	{
+	  expanded_func_count++;
+	  profiled_func_count++;
 
 	  if (symtab->dump_file)
 	    fprintf (symtab->dump_file,
-		     "Time profile order in expand_all_functions:%s:%d\n",
-		     node->asm_name (), node->tp_first_run);
+		     "Time profile order in expand_all_functions:%s:%" PRId64
+		     "\n", node->asm_name (), (int64_t) node->tp_first_run);
 	  node->process = 0;
 	  node->expand ();
 	}
@@ -2429,7 +2463,7 @@  expand_all_functions (void)
       fprintf (dump_file, "Expanded functions with time profile (%s):%u/%u\n",
                main_input_filename, profiled_func_count, expanded_func_count);
 
-  if (symtab->dump_file && flag_profile_reorder_functions)
+  if (symtab->dump_file && tp_first_run_order_pos)
     fprintf (symtab->dump_file, "Expanded functions with time profile:%u/%u\n",
              profiled_func_count, expanded_func_count);
 
Index: lto/lto-partition.c
===================================================================
--- lto/lto-partition.c	(revision 279076)
+++ lto/lto-partition.c	(working copy)
@@ -372,38 +372,9 @@  lto_max_map (void)
     new_partition ("empty");
 }
 
-/* Helper function for qsort; sort nodes by order. noreorder functions must have
-   been removed earlier.  */
-static int
-node_cmp (const void *pa, const void *pb)
-{
-  const struct cgraph_node *a = *(const struct cgraph_node * const *) pa;
-  const struct cgraph_node *b = *(const struct cgraph_node * const *) pb;
-
-  /* Profile reorder flag enables function reordering based on first execution
-     of a function. All functions with profile are placed in ascending
-     order at the beginning.  */
-
-  if (flag_profile_reorder_functions)
-  {
-    /* Functions with time profile are sorted in ascending order.  */
-    if (a->tp_first_run && b->tp_first_run)
-      return a->tp_first_run != b->tp_first_run
-	? a->tp_first_run - b->tp_first_run
-        : a->order - b->order;
-
-    /* Functions with time profile are sorted before the functions
-       that do not have the profile.  */
-    if (a->tp_first_run || b->tp_first_run)
-      return b->tp_first_run - a->tp_first_run;
-  }
-
-  return b->order - a->order;
-}
-
 /* Helper function for qsort; sort nodes by order.  */
 static int
-varpool_node_cmp (const void *pa, const void *pb)
+node_cmp (const void *pa, const void *pb)
 {
   const symtab_node *a = *static_cast<const symtab_node * const *> (pa);
   const symtab_node *b = *static_cast<const symtab_node * const *> (pb);
@@ -418,7 +389,7 @@  add_sorted_nodes (vec<symtab_node *> &ne
   unsigned i;
   symtab_node *node;
 
-  next_nodes.qsort (varpool_node_cmp);
+  next_nodes.qsort (node_cmp);
   FOR_EACH_VEC_ELT (next_nodes, i, node)
     if (!symbol_partitioned_p (node))
       add_symbol_to_partition (partition, node);
@@ -537,17 +508,17 @@  lto_balanced_map (int n_lto_partitions,
      unit tends to import a lot of global trees defined there.  We should
      get better about minimizing the function bounday, but until that
      things works smoother if we order in source order.  */
-  order.qsort (node_cmp);
+  order.qsort (tp_first_run_node_cmp);
   noreorder.qsort (node_cmp);
 
   if (dump_file)
     {
       for (unsigned i = 0; i < order.length (); i++)
-	fprintf (dump_file, "Balanced map symbol order:%s:%u\n",
-		 order[i]->name (), order[i]->tp_first_run);
+	fprintf (dump_file, "Balanced map symbol order:%s:%" PRId64 "\n",
+		 order[i]->name (), (int64_t) order[i]->tp_first_run);
       for (unsigned i = 0; i < noreorder.length (); i++)
-	fprintf (dump_file, "Balanced map symbol no_reorder:%s:%u\n",
-		 noreorder[i]->name (), noreorder[i]->tp_first_run);
+	fprintf (dump_file, "Balanced map symbol no_reorder:%s:%" PRId64 "\n",
+		 noreorder[i]->name (), (int64_t) noreorder[i]->tp_first_run);
     }
 
   /* Collect all variables that should not be reordered.  */
@@ -556,7 +527,7 @@  lto_balanced_map (int n_lto_partitions,
 	&& vnode->no_reorder)
       varpool_order.safe_push (vnode);
   n_varpool_nodes = varpool_order.length ();
-  varpool_order.qsort (varpool_node_cmp);
+  varpool_order.qsort (node_cmp);
 
   /* Compute partition size and create the first partition.  */
   if (param_min_partition_size > max_partition_size)
Index: profile.c
===================================================================
--- profile.c	(revision 279076)
+++ profile.c	(working copy)
@@ -874,7 +874,8 @@  compute_value_histograms (histogram_valu
 	  node->tp_first_run = hist->hvalue.counters[0];
 
           if (dump_file)
-            fprintf (dump_file, "Read tp_first_run: %d\n", node->tp_first_run);
+            fprintf (dump_file, "Read tp_first_run: %" PRId64 "\n",
+		     (int64_t) node->tp_first_run);
         }
     }