[RFA,PR,target/84128] Fix restore of scratch register used in probing loops

Message ID 328b4297-f032-6bf0-9252-1d7ee7b50133@redhat.com
State New
Headers show
Series
  • [RFA,PR,target/84128] Fix restore of scratch register used in probing loops
Related show

Commit Message

Jeff Law Jan. 31, 2018, 5:35 a.m.
Whee, fun, this appears to have been broken for a very long time,
possibly since the introduction of -fstack-check in 2010.  Thankfully it
only affects 32 bit and only in relatively constrained circumstances.

-fstack-check and -fstack-clash both potentially create a loop for stack
probing.  In that case they both need a scratch register to hold the
loop upper bound.

The code to allocate a scratch register first starts with the
caller-saved registers as they're zero cost.  Then it'll use any callee
saved register that is actually saved.  If neither is available (say
because all the caller-saved regs are used for parameter passing and
there are no callee saved register used), then the allocation routine
will push %eax on the stack and the deallocation routine will pop it off
to restore its value.

Of course there is a *stack allocation* between those two points.  So
the pop ends up restoring garbage into the register.

Obviously the restore routine needs to use reg+d addressing to get to
the stack slot and the allocated space needs to be deallocated by the
epilogue.  But sadly there's enough assertions sprinkled around that
prevent that from working as-is.

So what this patch does is continue to use the push to allocate the
register.  And it uses reg+d to restore the register after the probing
loops.  The "trick" is that the space allocated by the push becomes part
of the normal stack frame after we restore the scratch register's value.
 Ie, if we push a 4 byte register, then we reduce the size of the main
allocation request by 4 bytes.  And everything just works.

Clearly this code had not be exercised.  So I hacked up things so that
we always generated a probing loop and always assumed that we needed to
save/restore a scratch register and enabled stack clash protections by
default.  I bootstrapped that and compared testsuite runs against a run
which just had stack clash protection on by default.  That did turn up
an issue, but it was with my testing hacks, not with this patch :-)


I (of course) also did the usual bootstrap and regression tests, using
x86_64 and i686.  Hopefully this is the last iteration on the x86/x86_64
stack clash bits :-)

The one concern I have is do we need to tell the CFI machinery that
%eax's value was restored to its entry value?

OK for the trunk?  Or do we need some magic CFI bit to describe the
restore of %eax?

Thanks,
Jeff
PR target/84128
	* config/i386/i386.c (release_scratch_register_on_entry): Add new
	OFFSET argument.  Use it to restore the scratch register rathern
	than a pop insn.
	(ix86_adjust_stack_and_probe_stack_clash): Un-constify SIZE.
	If we have to save a temporary register, decrement SIZE appropriately.
	Pass SIZE as the offset to find the saved register into
	release_scratch_register_on_entry.
	(ix86_adjust_stack_and_probe): Likewise.
	(ix86_emit_probe_stack_range): Likewise.


	PR target/84128
	* gcc.target/i386/pr84128.c: New test.

Comments

Eric Botcazou Jan. 31, 2018, 8:28 a.m. | #1
> -fstack-check and -fstack-clash both potentially create a loop for stack

> probing.  In that case they both need a scratch register to hold the

> loop upper bound.

> 

> The code to allocate a scratch register first starts with the

> caller-saved registers as they're zero cost.  Then it'll use any callee

> saved register that is actually saved.  If neither is available (say

> because all the caller-saved regs are used for parameter passing and

> there are no callee saved register used), then the allocation routine

> will push %eax on the stack and the deallocation routine will pop it off

> to restore its value.

> 

> Of course there is a *stack allocation* between those two points.  So

> the pop ends up restoring garbage into the register.


There is a stack allocation only on Linux, there isn't any on Solaris, *BSD, 
Windows, LynxOS and other OSes where you can probe below the stack pointer.

> Clearly this code had not be exercised.  So I hacked up things so that

> we always generated a probing loop and always assumed that we needed to

> save/restore a scratch register and enabled stack clash protections by

> default. 


Probably not on Linux indeed.  But, since you modified the non-Linux path, you 
need to avoid repeating the same mistake and test on non-Linux native too.

-- 
Eric Botcazou
Uros Bizjak Jan. 31, 2018, 10:41 a.m. | #2
On Wed, Jan 31, 2018 at 6:35 AM, Jeff Law <law@redhat.com> wrote:
> Whee, fun, this appears to have been broken for a very long time,

> possibly since the introduction of -fstack-check in 2010.  Thankfully it

> only affects 32 bit and only in relatively constrained circumstances.

>

> -fstack-check and -fstack-clash both potentially create a loop for stack

> probing.  In that case they both need a scratch register to hold the

> loop upper bound.

>

> The code to allocate a scratch register first starts with the

> caller-saved registers as they're zero cost.  Then it'll use any callee

> saved register that is actually saved.  If neither is available (say

> because all the caller-saved regs are used for parameter passing and

> there are no callee saved register used), then the allocation routine

> will push %eax on the stack and the deallocation routine will pop it off

> to restore its value.

>

> Of course there is a *stack allocation* between those two points.  So

> the pop ends up restoring garbage into the register.

>

> Obviously the restore routine needs to use reg+d addressing to get to

> the stack slot and the allocated space needs to be deallocated by the

> epilogue.  But sadly there's enough assertions sprinkled around that

> prevent that from working as-is.

>

> So what this patch does is continue to use the push to allocate the

> register.  And it uses reg+d to restore the register after the probing

> loops.  The "trick" is that the space allocated by the push becomes part

> of the normal stack frame after we restore the scratch register's value.

>  Ie, if we push a 4 byte register, then we reduce the size of the main

> allocation request by 4 bytes.  And everything just works.

>

> Clearly this code had not be exercised.  So I hacked up things so that

> we always generated a probing loop and always assumed that we needed to

> save/restore a scratch register and enabled stack clash protections by

> default.  I bootstrapped that and compared testsuite runs against a run

> which just had stack clash protection on by default.  That did turn up

> an issue, but it was with my testing hacks, not with this patch :-)

>

>

> I (of course) also did the usual bootstrap and regression tests, using

> x86_64 and i686.  Hopefully this is the last iteration on the x86/x86_64

> stack clash bits :-)

>

> The one concern I have is do we need to tell the CFI machinery that

> %eax's value was restored to its entry value?


Can you or someone that knows CFI stuff please investigate this a bit?
I'm not expert in this area, and I don't feel comfortable to approve
the patch that has some known loose edges in the area I don't know
that well.

Uros.
Jeff Law Jan. 31, 2018, 3:48 p.m. | #3
On 01/31/2018 01:28 AM, Eric Botcazou wrote:
>> -fstack-check and -fstack-clash both potentially create a loop for stack

>> probing.  In that case they both need a scratch register to hold the

>> loop upper bound.

>>

>> The code to allocate a scratch register first starts with the

>> caller-saved registers as they're zero cost.  Then it'll use any callee

>> saved register that is actually saved.  If neither is available (say

>> because all the caller-saved regs are used for parameter passing and

>> there are no callee saved register used), then the allocation routine

>> will push %eax on the stack and the deallocation routine will pop it off

>> to restore its value.

>>

>> Of course there is a *stack allocation* between those two points.  So

>> the pop ends up restoring garbage into the register.

> 

> There is a stack allocation only on Linux, there isn't any on Solaris, *BSD, 

> Windows, LynxOS and other OSes where you can probe below the stack pointer.

Ack.  Let's hold the patch until I can test *that* path.

jeff
Jeff Law Jan. 31, 2018, 10:55 p.m. | #4
On 01/31/2018 01:28 AM, Eric Botcazou wrote:
>> -fstack-check and -fstack-clash both potentially create a loop for stack

>> probing.  In that case they both need a scratch register to hold the

>> loop upper bound.

>>

>> The code to allocate a scratch register first starts with the

>> caller-saved registers as they're zero cost.  Then it'll use any callee

>> saved register that is actually saved.  If neither is available (say

>> because all the caller-saved regs are used for parameter passing and

>> there are no callee saved register used), then the allocation routine

>> will push %eax on the stack and the deallocation routine will pop it off

>> to restore its value.

>>

>> Of course there is a *stack allocation* between those two points.  So

>> the pop ends up restoring garbage into the register.

> 

> There is a stack allocation only on Linux, there isn't any on Solaris, *BSD, 

> Windows, LynxOS and other OSes where you can probe below the stack pointer.

So I went back and restored the code for targets that don't define
STACK_CHECK_MOVING_SP (everything but linux I believe).

To test it I turned on stack checking by default, bootstrapped that as a
baseline.

Then I turned off STACK_CHECK_MOVING_SP, compared that to the baseline.
Not surprisingly, things with large stacks failed during probing.  I
then hacked up the dejagnu bits to present just an 8k stack which made
all the regressions which had stack usage dg markers turn into
UNRESOLVED.  There were still many tests that had to be checked manually
because they didn't have stack usage dg markup.  This is baseline #2.

I then forced all prologue allocations to go through the loop code,
bootstrapped that and compared it to baseline #2.

Then I turned off STACK_CHECK_MOVING_SP and forced all prologue
allocations to go through the loop code.  Bootstrapped that and compared
the results to the baseline to ensure we weren't regressing.

Note some tests want to verify we don't have push/pops in their
resulting assembly code or count the number of orl instructions.  Those
had to be hand-checked along the way as well.  Par for the course.

I'm confident I've got all 3 paths covered reasonably well now.  I still
need to check into the CFI concern before resubmitting.

jeff

ps.  I also did light testing before restoring the code to handle
!STACK_CHECK_MOVING_SP to verify that it indeed was wrong with my
original patch.
Jeff Law Feb. 1, 2018, 1:19 a.m. | #5
On 01/31/2018 03:41 AM, Uros Bizjak wrote:
>>

>> The one concern I have is do we need to tell the CFI machinery that

>> %eax's value was restored to its entry value?

> 

> Can you or someone that knows CFI stuff please investigate this a bit?

> I'm not expert in this area, and I don't feel comfortable to approve

> the patch that has some known loose edges in the area I don't know

> that well.

Fair enough.  I'm not terribly familiar with the CFI bits either, but I
was able to work this out.

I've looked at gcc-6, gcc-7 and the trunk.   None of them record
anything WRT %eax at the restore point.  If the CFA is the stack pointer
both record that the CFA offset changed, but nothing about %eax.

So this patch doesn't change change the accuracy of the CFI info.  One
might be able to argue that we should record the restores, but that
seems like an independent issue to me.

I'll post an updated patch momentarily.

jeff

Patch

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index fef34a1..93ce79c 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -12567,22 +12567,25 @@  get_scratch_register_on_entry (struct scratch_reg *sr)
     }
 }
 
-/* Release a scratch register obtained from the preceding function.  */
+/* Release a scratch register obtained from the preceding function.
+
+   Note there will always be some kind of stack adjustment between
+   allocation and releasing the scratch register.  So we can't just
+   pop the scratch register off the stack if we were forced to save it
+   (the stack pointer itself has a different value).
+
+   Instead we're passed the offset into the stack where the value will
+   be found and the space becomes part of the local frame that is
+   deallocated by the epilogue.  */
 
 static void
-release_scratch_register_on_entry (struct scratch_reg *sr)
+release_scratch_register_on_entry (struct scratch_reg *sr, HOST_WIDE_INT offset)
 {
   if (sr->saved)
     {
-      struct machine_function *m = cfun->machine;
-      rtx x, insn = emit_insn (gen_pop (sr->reg));
-
-      /* The RTX_FRAME_RELATED_P mechanism doesn't know about pop.  */
-      RTX_FRAME_RELATED_P (insn) = 1;
-      x = gen_rtx_PLUS (Pmode, stack_pointer_rtx, GEN_INT (UNITS_PER_WORD));
-      x = gen_rtx_SET (stack_pointer_rtx, x);
-      add_reg_note (insn, REG_FRAME_RELATED_EXPR, x);
-      m->fs.sp_offset -= UNITS_PER_WORD;
+      rtx x = gen_rtx_PLUS (Pmode, stack_pointer_rtx, GEN_INT (offset));
+      x = gen_rtx_SET (sr->reg, gen_rtx_MEM (word_mode, x));
+      emit_insn (x);
     }
 }
 
@@ -12597,7 +12600,7 @@  release_scratch_register_on_entry (struct scratch_reg *sr)
    pushed on the stack.  */
 
 static void
-ix86_adjust_stack_and_probe_stack_clash (const HOST_WIDE_INT size,
+ix86_adjust_stack_and_probe_stack_clash (HOST_WIDE_INT size,
 					 const bool int_registers_saved)
 {
   struct machine_function *m = cfun->machine;
@@ -12713,6 +12716,12 @@  ix86_adjust_stack_and_probe_stack_clash (const HOST_WIDE_INT size,
       struct scratch_reg sr;
       get_scratch_register_on_entry (&sr);
 
+      /* If we needed to save a register, then account for any space
+	 that was pushed (we are not going to pop the register when
+	 we do the restore).  */
+      if (sr.saved)
+	size -= UNITS_PER_WORD;
+
       /* Step 1: round SIZE down to a multiple of the interval.  */
       HOST_WIDE_INT rounded_size = size & -probe_interval;
 
@@ -12761,7 +12770,9 @@  ix86_adjust_stack_and_probe_stack_clash (const HOST_WIDE_INT size,
 				   m->fs.cfa_reg == stack_pointer_rtx);
       dump_stack_clash_frame_info (PROBE_LOOP, size != rounded_size);
 
-      release_scratch_register_on_entry (&sr);
+      /* This does not deallocate the space reserved for the scratch
+	 register.  That will be deallocated in the epilogue.  */
+      release_scratch_register_on_entry (&sr, size);
     }
 
   /* Make sure nothing is scheduled before we are done.  */
@@ -12774,7 +12785,7 @@  ix86_adjust_stack_and_probe_stack_clash (const HOST_WIDE_INT size,
    pushed on the stack.  */
 
 static void
-ix86_adjust_stack_and_probe (const HOST_WIDE_INT size,
+ix86_adjust_stack_and_probe (HOST_WIDE_INT size,
 			     const bool int_registers_saved)
 {
   /* We skip the probe for the first interval + a small dope of 4 words and
@@ -12847,6 +12858,11 @@  ix86_adjust_stack_and_probe (const HOST_WIDE_INT size,
 
       get_scratch_register_on_entry (&sr);
 
+      /* If we needed to save a register, then account for any space
+	 that was pushed (we are not going to pop the register when
+	 we do the restore).  */
+      if (sr.saved)
+	size -= UNITS_PER_WORD;
 
       /* Step 1: round SIZE to the previous multiple of the interval.  */
 
@@ -12906,7 +12922,9 @@  ix86_adjust_stack_and_probe (const HOST_WIDE_INT size,
 						    (get_probe_interval ()
 						     + dope))));
 
-      release_scratch_register_on_entry (&sr);
+      /* This does not deallocate the space reserved for the scratch
+	 register.  That will be deallocated in the epilogue.  */
+      release_scratch_register_on_entry (&sr, size);
     }
 
   /* Even if the stack pointer isn't the CFA register, we need to correctly
@@ -13015,6 +13033,11 @@  ix86_emit_probe_stack_range (HOST_WIDE_INT first, HOST_WIDE_INT size,
 
       get_scratch_register_on_entry (&sr);
 
+      /* If we needed to save a register, then account for any space
+	 that was pushed (we are not going to pop the register when
+	 we do the restore).  */
+      if (sr.saved)
+	size -= UNITS_PER_WORD;
 
       /* Step 1: round SIZE to the previous multiple of the interval.  */
 
@@ -13055,7 +13078,9 @@  ix86_emit_probe_stack_range (HOST_WIDE_INT first, HOST_WIDE_INT size,
 						       sr.reg),
 					 rounded_size - size));
 
-      release_scratch_register_on_entry (&sr);
+      /* This does not deallocate the space reserved for the scratch
+	 register.  That will be deallocated in the epilogue.  */
+      release_scratch_register_on_entry (&sr, size);
     }
 
   /* Make sure nothing is scheduled before we are done.  */
diff --git a/gcc/testsuite/gcc.target/i386/pr84128.c b/gcc/testsuite/gcc.target/i386/pr84128.c
new file mode 100644
index 0000000..a8323fd6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr84128.c
@@ -0,0 +1,30 @@ 
+/* { dg-do run } */
+/* { dg-options "-O2 -march=i686 -mtune=generic -fstack-clash-protection" } */
+/* { dg-require-effective-target ia32 } */
+
+__attribute__ ((noinline, noclone, weak, regparm (3)))
+int
+f1 (long arg0, int (*pf) (long, void *))
+{
+  unsigned char buf[32768];
+  return pf (arg0, buf);
+}
+
+__attribute__ ((noinline, noclone, weak))
+int
+f2 (long arg0, void *ignored)
+{
+  if (arg0 != 17)
+    __builtin_abort ();
+  return 19;
+}
+
+int
+main (void)
+{
+  if (f1 (17, f2) != 19)
+    __builtin_abort ();
+  return 0;
+}
+
+