MIPS: Fix GCC `noreorder' for undefined R5900 short loops

Message ID 20190721171726.GA47580@sx9
State New
Headers show
Series
  • MIPS: Fix GCC `noreorder' for undefined R5900 short loops
Related show

Commit Message

Fredrik Noring July 21, 2019, 5:17 p.m.
Hi Paul, Matthew,

Paul -- as I'm preparing the R5900 kernel patches there was a USB DMA series
and breakage that needed attention. The fixes ending with ff2437befd8f ("usb:
host: Fix excessive alignment restriction for local memory allocations") are
now merged with Linus' kernel, and recommended for the R5900 series. The
initial R5900 patch submission is getting closer. :)

The present problem is related to GCC and the R5900 short loop bug[1]. It
turns out GCC emits assembly code like

loop:	addiu	$5,$5,1
	addiu	$4,$4,1
	lb	$2,-1($5)
	.set	noreorder
	.set	nomacro
	bne	$2,$3,loop
	sb	$2,-1($4)
	.set	macro
	.set	reorder

that is undefined for the R5900 (this short loop has five instructions),
for simple C code such as

	while ((*s++ = *p++) != '\n')
		;

in the kernel. The noreorder directive prohibits GAS from corrections, and
GAS really ought to give an error for it, I think. In the meantime, I have a
tool that does machine code analysis of ELF objects to catch undefined R5900
short loops, including those made with assembler macros in the kernel.

[ In theory, GAS could actually insert NOPs prior to the noreorder directive,
to make the loop longer that six instructions, but GAS does not have that
kind of capability. Another option that GCC prevents is to place a NOP in
the delay slot. ]

A reasonable fix for GCC is perhaps to update gcc/config/mips/mips.md to not
make explicit use of the branch delay slot, as suggested by the patch below?
Then GCC will emit

loop:	addiu	$5,$5,1
	addiu	$4,$4,1
	lb	$2,-1($5)
	sb	$2,-1($4)
	bne	$2,$3,loop

that GAS will adjust in the ELF object to

   4:	24a50001 	addiu	a1,a1,1
   8:	24840001 	addiu	a0,a0,1
   c:	80a2ffff 	lb	v0,-1(a1)
  10:	a082ffff 	sb	v0,-1(a0)
  14:	1443fffb 	bne	v0,v1,4
  18:	00000000 	nop

where a NOP is placed in the delay slot to avoid the bug. For longer loops,
this relies on GAS making appropriate use of the delay slot. I'm not sure if
GAS is good at that, though?

Fredrik

References:

[1] https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gas/config/tc-mips.c;h=b7b4b6989a12d3091a02de7155fbea3adbf1c9d7;hb=HEAD#l7133

    On the R5900 short loops need to be fixed by inserting a NOP in the
    branch delay slot.
    
    The short loop bug under certain conditions causes loops to execute
    only once or twice.  We must ensure that the assembler never
    generates loops that satisfy all of the following conditions:
    
    - a loop consists of less than or equal to six instructions
      (including the branch delay slot);
    - a loop contains only one conditional branch instruction at the end
      of the loop;
    - a loop does not contain any other branch or jump instructions;
    - a branch delay slot of the loop is not NOP (EE 2.9 or later).
    
    We need to do this because of a hardware bug in the R5900 chip.

Comments

Maciej W. Rozycki July 21, 2019, 9:50 p.m. | #1
Hi Fredrik,

> The present problem is related to GCC and the R5900 short loop bug[1]. It

> turns out GCC emits assembly code like

> 

> loop:	addiu	$5,$5,1

> 	addiu	$4,$4,1

> 	lb	$2,-1($5)

> 	.set	noreorder

> 	.set	nomacro

> 	bne	$2,$3,loop

> 	sb	$2,-1($4)

> 	.set	macro

> 	.set	reorder

> 

> that is undefined for the R5900 (this short loop has five instructions),

> for simple C code such as

> 

> 	while ((*s++ = *p++) != '\n')

> 		;

> 

> in the kernel. The noreorder directive prohibits GAS from corrections, and

> GAS really ought to give an error for it, I think. In the meantime, I have a

> tool that does machine code analysis of ELF objects to catch undefined R5900

> short loops, including those made with assembler macros in the kernel.


 I think that should be a GAS warning really (similarly to macros that 
expand to multiple instructions in a delay slot) as people ought to be 
allowed to do what they wish, and then `-Werror' can be used for code 
quality enforcement (and possibly disabled on a case-by-case basis).

> [ In theory, GAS could actually insert NOPs prior to the noreorder directive,

> to make the loop longer that six instructions, but GAS does not have that

> kind of capability. Another option that GCC prevents is to place a NOP in

> the delay slot. ]


 Well, GAS does have that capability, although of course it is not enabled 
for `noreorder' code.  For generated code I think however that usually it 
will be cheaper performance-wise if a non-trivial delay-slot instruction 
is never produced in the affected cases (i.e. a dummy NOP is always used).

> A reasonable fix for GCC is perhaps to update gcc/config/mips/mips.md to not

> make explicit use of the branch delay slot, as suggested by the patch below?

> Then GCC will emit

> 

> loop:	addiu	$5,$5,1

> 	addiu	$4,$4,1

> 	lb	$2,-1($5)

> 	sb	$2,-1($4)

> 	bne	$2,$3,loop

> 

> that GAS will adjust in the ELF object to

> 

>    4:	24a50001 	addiu	a1,a1,1

>    8:	24840001 	addiu	a0,a0,1

>    c:	80a2ffff 	lb	v0,-1(a1)

>   10:	a082ffff 	sb	v0,-1(a0)

>   14:	1443fffb 	bne	v0,v1,4

>   18:	00000000 	nop

> 

> where a NOP is placed in the delay slot to avoid the bug. For longer loops,

> this relies on GAS making appropriate use of the delay slot. I'm not sure if

> GAS is good at that, though?


 I'm sort-of surprised that GCC has produced `reorder' code here, making 
it rely on GAS for delay slot scheduling.  Have you used an unusual set of 
options that prevents GCC from making `noreorder' code, which as I recall 
is the usual default (not for the MIPS16 mode IIRC, as well as some 
obscure corner cases)?

>     On the R5900 short loops need to be fixed by inserting a NOP in the

>     branch delay slot.

>     

>     The short loop bug under certain conditions causes loops to execute

>     only once or twice.  We must ensure that the assembler never

>     generates loops that satisfy all of the following conditions:

>     

>     - a loop consists of less than or equal to six instructions

>       (including the branch delay slot);

>     - a loop contains only one conditional branch instruction at the end

>       of the loop;

>     - a loop does not contain any other branch or jump instructions;

>     - a branch delay slot of the loop is not NOP (EE 2.9 or later).

>     

>     We need to do this because of a hardware bug in the R5900 chip.

> 

> diff --git a/gcc/config/mips/mips.md b/gcc/config/mips/mips.md

> index e17b1d522f0..acd31a8960c 100644

> --- a/gcc/config/mips/mips.md

> +++ b/gcc/config/mips/mips.md

> @@ -1091,6 +1091,7 @@

>  

>  (define_delay (and (eq_attr "type" "branch")

>  		   (not (match_test "TARGET_MIPS16"))

> +		   (not (match_test "TARGET_FIX_R5900"))

>  		   (eq_attr "branch_likely" "yes"))

>    [(eq_attr "can_delay" "yes")

>     (nil)

> @@ -1100,6 +1101,7 @@

>  ;; not annul on false.

>  (define_delay (and (eq_attr "type" "branch")

>  		   (not (match_test "TARGET_MIPS16"))

> +		   (not (match_test "TARGET_FIX_R5900"))

>  		   (ior (match_test "TARGET_CB_NEVER")

>  			(and (eq_attr "compact_form" "maybe")

>  			     (not (match_test "TARGET_CB_ALWAYS")))

> 


 I think you need to modify the default `can_delay' attribute definition 
instead (in the same way).  An improved future version might determine the 
exact conditions as noted in your proposed commit description, however I'd 
suggest making this simple change first.

 HTH,

  Maciej
Fredrik Noring July 22, 2019, 3:55 p.m. | #2
Hi Maciej,

I'm glad to hear from you again!

>  I think that should be a GAS warning really (similarly to macros that 

> expand to multiple instructions in a delay slot) as people ought to be 

> allowed to do what they wish, and then `-Werror' can be used for code 

> quality enforcement (and possibly disabled on a case-by-case basis).


How can one reasonably prevent the bug when compiling a whole Linux
distribution with thousands of packages if GAS merely warns and proceeds
in many cases?

> > [ In theory, GAS could actually insert NOPs prior to the noreorder directive,

> > to make the loop longer that six instructions, but GAS does not have that

> > kind of capability. Another option that GCC prevents is to place a NOP in

> > the delay slot. ]

> 

>  Well, GAS does have that capability, although of course it is not enabled 

> for `noreorder' code.


Hmm? I'm unable to make sense of that, since such NOPs are unaffected by
noreorder. This is what I meant:

loop:	addiu	$5,$5,1
	addiu	$4,$4,1
	lb	$2,-1($5)
	nop			/* These two NOPs would prevent the */
	nop			/* bug while preserving noreorder. */
	.set	noreorder
	.set	nomacro
	bne	$2,$3,loop
	sb	$2,-1($4)
	.set	macro
	.set	reorder

> For generated code I think however that usually it 

> will be cheaper performance-wise if a non-trivial delay-slot instruction 

> is never produced in the affected cases (i.e. a dummy NOP is always used).


I suppose so too. One observation is that R5900 BogoMIPS went down from
584 to 195 when switching from the generic kernel variant

	.set	noreorder
1:	bnez	a0,1b
	addiu	a0,a0,-1
	.set	reorder

that is undefined for R5900 in arch/mips/lib/delay.c, to a special variant

	beqz	a0,2f
1:	addiu	a0,a0,-1
	bnez	a0,1b
2:

for the R5900 where GAS will place a NOP in the branch delay slot. With
loop unrolling BogoMIPS could be inflated again, of course. In general,
unrolling is a bit better for the R5900 than general MIPS. Perhaps GCC
could be informed about that, possibly via profile-guided optimisation.

> > A reasonable fix for GCC is perhaps to update gcc/config/mips/mips.md to not

> > make explicit use of the branch delay slot, as suggested by the patch below?

> > Then GCC will emit

> > 

> > loop:	addiu	$5,$5,1

> > 	addiu	$4,$4,1

> > 	lb	$2,-1($5)

> > 	sb	$2,-1($4)

> > 	bne	$2,$3,loop

> > 

> > that GAS will adjust in the ELF object to

> > 

> >    4:	24a50001 	addiu	a1,a1,1

> >    8:	24840001 	addiu	a0,a0,1

> >    c:	80a2ffff 	lb	v0,-1(a1)

> >   10:	a082ffff 	sb	v0,-1(a0)

> >   14:	1443fffb 	bne	v0,v1,4

> >   18:	00000000 	nop

> > 

> > where a NOP is placed in the delay slot to avoid the bug. For longer loops,

> > this relies on GAS making appropriate use of the delay slot. I'm not sure if

> > GAS is good at that, though?

> 

>  I'm sort-of surprised that GCC has produced `reorder' code here, making 

> it rely on GAS for delay slot scheduling.  Have you used an unusual set of 

> options that prevents GCC from making `noreorder' code, which as I recall 

> is the usual default (not for the MIPS16 mode IIRC, as well as some 

> obscure corner cases)?


I used the suggested patch, and recompiled the kernel with it, and verified
with the machine code tool that there were no undefined loops. Wouldn't that
be enough?

> > diff --git a/gcc/config/mips/mips.md b/gcc/config/mips/mips.md

> > index e17b1d522f0..acd31a8960c 100644

> > --- a/gcc/config/mips/mips.md

> > +++ b/gcc/config/mips/mips.md

> > @@ -1091,6 +1091,7 @@

> >  

> >  (define_delay (and (eq_attr "type" "branch")

> >  		   (not (match_test "TARGET_MIPS16"))

> > +		   (not (match_test "TARGET_FIX_R5900"))

> >  		   (eq_attr "branch_likely" "yes"))

> >    [(eq_attr "can_delay" "yes")

> >     (nil)

> > @@ -1100,6 +1101,7 @@

> >  ;; not annul on false.

> >  (define_delay (and (eq_attr "type" "branch")

> >  		   (not (match_test "TARGET_MIPS16"))

> > +		   (not (match_test "TARGET_FIX_R5900"))

> >  		   (ior (match_test "TARGET_CB_NEVER")

> >  			(and (eq_attr "compact_form" "maybe")

> >  			     (not (match_test "TARGET_CB_ALWAYS")))

> > 

> 

>  I think you need to modify the default `can_delay' attribute definition 

> instead (in the same way).


I tried to limit the case to branch delays only, which is a rough
approximation. Call and jump delay slots are acceptable. Are you referring
to this piece?

;; Can the instruction be put into a delay slot?
(define_attr "can_delay" "no,yes"
  (if_then_else (and (eq_attr "type" "!branch,call,jump")
		     (eq_attr "hazard" "none")
		     (match_test "get_attr_insn_count (insn) == 1"))
		(const_string "yes")
		(const_string "no")))

> An improved future version might determine the 

> exact conditions as noted in your proposed commit description, however I'd 

> suggest making this simple change first.


Learning the exact conditions, in a hardware sense, would be interesting
indeed, as some short loops actually do appear to work despite being labeled
as undefined. A fairly deep understanding of the R5900 implementation is
essential for such an undertaking. :)

Fredrik
Maciej W. Rozycki July 22, 2019, 9:47 p.m. | #3
Hi Fredrik,

> I'm glad to hear from you again!


 I'm not dead, just distracted.

> >  I think that should be a GAS warning really (similarly to macros that 

> > expand to multiple instructions in a delay slot) as people ought to be 

> > allowed to do what they wish, and then `-Werror' can be used for code 

> > quality enforcement (and possibly disabled on a case-by-case basis).

> 

> How can one reasonably prevent the bug when compiling a whole Linux

> distribution with thousands of packages if GAS merely warns and proceeds

> in many cases?


 I think the tools must not set a policy.  By using `.set noreorder' the 
user told the toolchain he knows better and asked to keep its hands away.

 As I say you can use `-Werror' for code auditing.  And in most cases 
manually scheduled delay slots in handcoded assembly are a sign of a 
problem with the piece of code affected, regardless of the R5900 erratum.

> > > [ In theory, GAS could actually insert NOPs prior to the noreorder directive,

> > > to make the loop longer that six instructions, but GAS does not have that

> > > kind of capability. Another option that GCC prevents is to place a NOP in

> > > the delay slot. ]

> > 

> >  Well, GAS does have that capability, although of course it is not enabled 

> > for `noreorder' code.

> 

> Hmm? I'm unable to make sense of that, since such NOPs are unaffected by

> noreorder. This is what I meant:

> 

> loop:	addiu	$5,$5,1

> 	addiu	$4,$4,1

> 	lb	$2,-1($5)

> 	nop			/* These two NOPs would prevent the */

> 	nop			/* bug while preserving noreorder. */

> 	.set	noreorder

> 	.set	nomacro

> 	bne	$2,$3,loop

> 	sb	$2,-1($4)

> 	.set	macro

> 	.set	reorder


 See `nops_for_insn'.  However again, as I noted, `.set noreorder' tells 
GAS not to analyse the dependencies for code within.  There is no need to 
schedule this delay slot manually, and if this is generated code, then the 
producer (GCC) should have taken care of the hazards, be it architectural 
or errata.  If this is manually written code, then the author asked for 
trouble.

> > For generated code I think however that usually it 

> > will be cheaper performance-wise if a non-trivial delay-slot instruction 

> > is never produced in the affected cases (i.e. a dummy NOP is always used).

> 

> I suppose so too. One observation is that R5900 BogoMIPS went down from

> 584 to 195 when switching from the generic kernel variant

> 

> 	.set	noreorder

> 1:	bnez	a0,1b

> 	addiu	a0,a0,-1

> 	.set	reorder

> 

> that is undefined for R5900 in arch/mips/lib/delay.c, to a special variant

> 

> 	beqz	a0,2f

> 1:	addiu	a0,a0,-1

> 	bnez	a0,1b

> 2:

> 

> for the R5900 where GAS will place a NOP in the branch delay slot. With

> loop unrolling BogoMIPS could be inflated again, of course. In general,

> unrolling is a bit better for the R5900 than general MIPS. Perhaps GCC

> could be informed about that, possibly via profile-guided optimisation.


 Well, BogoMIPS is just an arbitrary number.

 There is a data dependency here, so manual delay slot scheduling was 
unavoidable.  I'd suggest using this as a functional equivalent for the 
R5900:

	addiu	a0,a0,1
1:	addiu	a0,a0,-1
	bnez	a0,1b

and avoiding the irregularity for a0==0.

> > > A reasonable fix for GCC is perhaps to update gcc/config/mips/mips.md to not

> > > make explicit use of the branch delay slot, as suggested by the patch below?

> > > Then GCC will emit

> > > 

> > > loop:	addiu	$5,$5,1

> > > 	addiu	$4,$4,1

> > > 	lb	$2,-1($5)

> > > 	sb	$2,-1($4)

> > > 	bne	$2,$3,loop

> > > 

> > > that GAS will adjust in the ELF object to

> > > 

> > >    4:	24a50001 	addiu	a1,a1,1

> > >    8:	24840001 	addiu	a0,a0,1

> > >    c:	80a2ffff 	lb	v0,-1(a1)

> > >   10:	a082ffff 	sb	v0,-1(a0)

> > >   14:	1443fffb 	bne	v0,v1,4

> > >   18:	00000000 	nop

> > > 

> > > where a NOP is placed in the delay slot to avoid the bug. For longer loops,

> > > this relies on GAS making appropriate use of the delay slot. I'm not sure if

> > > GAS is good at that, though?

> > 

> >  I'm sort-of surprised that GCC has produced `reorder' code here, making 

> > it rely on GAS for delay slot scheduling.  Have you used an unusual set of 

> > options that prevents GCC from making `noreorder' code, which as I recall 

> > is the usual default (not for the MIPS16 mode IIRC, as well as some 

> > obscure corner cases)?

> 

> I used the suggested patch, and recompiled the kernel with it, and verified

> with the machine code tool that there were no undefined loops. Wouldn't that

> be enough?


 It would break if GAS was invoked without `-mfix-r5900' (e.g. by using an 
explicit `-Wa,-mno-fix-r5900' option).

> > > diff --git a/gcc/config/mips/mips.md b/gcc/config/mips/mips.md

> > > index e17b1d522f0..acd31a8960c 100644

> > > --- a/gcc/config/mips/mips.md

> > > +++ b/gcc/config/mips/mips.md

> > > @@ -1091,6 +1091,7 @@

> > >  

> > >  (define_delay (and (eq_attr "type" "branch")

> > >  		   (not (match_test "TARGET_MIPS16"))

> > > +		   (not (match_test "TARGET_FIX_R5900"))

> > >  		   (eq_attr "branch_likely" "yes"))

> > >    [(eq_attr "can_delay" "yes")

> > >     (nil)

> > > @@ -1100,6 +1101,7 @@

> > >  ;; not annul on false.

> > >  (define_delay (and (eq_attr "type" "branch")

> > >  		   (not (match_test "TARGET_MIPS16"))

> > > +		   (not (match_test "TARGET_FIX_R5900"))

> > >  		   (ior (match_test "TARGET_CB_NEVER")

> > >  			(and (eq_attr "compact_form" "maybe")

> > >  			     (not (match_test "TARGET_CB_ALWAYS")))

> > > 

> > 

> >  I think you need to modify the default `can_delay' attribute definition 

> > instead (in the same way).

> 

> I tried to limit the case to branch delays only, which is a rough

> approximation. Call and jump delay slots are acceptable. Are you referring

> to this piece?

> 

> ;; Can the instruction be put into a delay slot?

> (define_attr "can_delay" "no,yes"

>   (if_then_else (and (eq_attr "type" "!branch,call,jump")

> 		     (eq_attr "hazard" "none")

> 		     (match_test "get_attr_insn_count (insn) == 1"))

> 		(const_string "yes")

> 		(const_string "no")))


 Yes.  My suggestion does not preclude limiting the workaround to branches 
only while being precise about what the situation is, i.e. branches still 
require a delay slot instruction (unlike MIPS16 or uMIPS compact branches) 
but no real instruction is suitable.  I'd prefer if GCC actually scheduled 
the required dummy NOP instruction explicitly.

> > An improved future version might determine the 

> > exact conditions as noted in your proposed commit description, however I'd 

> > suggest making this simple change first.

> 

> Learning the exact conditions, in a hardware sense, would be interesting

> indeed, as some short loops actually do appear to work despite being labeled

> as undefined. A fairly deep understanding of the R5900 implementation is

> essential for such an undertaking. :)


 Ack, although in this case I only meant what was stated in the commit 
description, i.e. avoiding forcing a dummy delay slot instruction where 
known for sure not to be needed, e.g. a long loop, a forward branch, etc.

  Maciej
Jeff Law July 22, 2019, 10:15 p.m. | #4
On 7/22/19 3:47 PM, Maciej W. Rozycki wrote:
> Hi Fredrik,

> 

>> I'm glad to hear from you again!

> 

>  I'm not dead, just distracted.

> 

>>>  I think that should be a GAS warning really (similarly to macros that 

>>> expand to multiple instructions in a delay slot) as people ought to be 

>>> allowed to do what they wish, and then `-Werror' can be used for code 

>>> quality enforcement (and possibly disabled on a case-by-case basis).

>>

>> How can one reasonably prevent the bug when compiling a whole Linux

>> distribution with thousands of packages if GAS merely warns and proceeds

>> in many cases?

> 

>  I think the tools must not set a policy.  By using `.set noreorder' the 

> user told the toolchain he knows better and asked to keep its hands away.

Agreed, 100%.



Jeff
Fredrik Noring July 23, 2019, 4:54 p.m. | #5
Hi Maciej,

> > How can one reasonably prevent the bug when compiling a whole Linux

> > distribution with thousands of packages if GAS merely warns and proceeds

> > in many cases?

> 

>  I think the tools must not set a policy.  By using `.set noreorder' the 

> user told the toolchain he knows better and asked to keep its hands away.

> 

>  As I say you can use `-Werror' for code auditing.  And in most cases 

> manually scheduled delay slots in handcoded assembly are a sign of a 

> problem with the piece of code affected, regardless of the R5900 erratum.


Well, it seems practical to use unofficial tools and a patched GAS to fix
this assembly bug, then. It's a grave problem for the R5900 and it needs to
be reliably detected.

> > > > [ In theory, GAS could actually insert NOPs prior to the noreorder directive,

> > > > to make the loop longer that six instructions, but GAS does not have that

> > > > kind of capability. Another option that GCC prevents is to place a NOP in

> > > > the delay slot. ]

> > > 

> > >  Well, GAS does have that capability, although of course it is not enabled 

> > > for `noreorder' code.

> > 

> > Hmm? I'm unable to make sense of that, since such NOPs are unaffected by

> > noreorder. This is what I meant:

> > 

> > loop:	addiu	$5,$5,1

> > 	addiu	$4,$4,1

> > 	lb	$2,-1($5)

> > 	nop			/* These two NOPs would prevent the */

> > 	nop			/* bug while preserving noreorder. */

> > 	.set	noreorder

> > 	.set	nomacro

> > 	bne	$2,$3,loop

> > 	sb	$2,-1($4)

> > 	.set	macro

> > 	.set	reorder

> 

>  See `nops_for_insn'.  However again, as I noted, `.set noreorder' tells 

> GAS not to analyse the dependencies for code within.  There is no need to 

> schedule this delay slot manually, and if this is generated code, then the 

> producer (GCC) should have taken care of the hazards, be it architectural 

> or errata.  If this is manually written code, then the author asked for 

> trouble.


I'm using the principle above to unobtrusively adjust problematic kernel
code, via a short_loop_war macro. Here is one patched instance:

--- a/arch/mips/boot/compressed/head.S
+++ b/arch/mips/boot/compressed/head.S
@@ -13,6 +13,7 @@
  */
 
 #include <asm/asm.h>
+#include <asm/asmmacro.h>
 #include <asm/regdef.h>
 
 	.set noreorder
@@ -29,6 +30,7 @@ start:
 	PTR_LA	a0, _edata
 	PTR_LA	a2, _end
 1:	sw	zero, 0(a0)
+	short_loop_war(1b)
 	bne	a2, a0, 1b
 	 addiu	a0, a0, 4
 
@@ -48,6 +50,7 @@ start:
 	jr	k0
 	 nop
 3:
+	short_loop_war(3b)
 	b	3b
 	 nop
 	END(start)

The short_loop_war macro is designed to be placed just before the branch,
and it inserts NOPs as necessary for the R5900:

#ifdef CONFIG_CPU_R5900
	.macro	short_loop_war loop_target
	.set	instruction_count,2 + (. - \loop_target) / 4
	.ifgt	7 - instruction_count
	.rept	7 - instruction_count
	nop
	.endr
	.endif
	.endm
#else
	.macro	short_loop_war loop_target
	.endm
#endif

>  Well, BogoMIPS is just an arbitrary number.


So presumably the noreorder BogoMIPS variant can be replaced with a
single reorder variant that works with all MIPS implementations?

>  There is a data dependency here, so manual delay slot scheduling was 

> unavoidable.  I'd suggest using this as a functional equivalent for the 

> R5900:

> 

> 	addiu	a0,a0,1

> 1:	addiu	a0,a0,-1

> 	bnez	a0,1b

> 

> and avoiding the irregularity for a0==0.


Thanks!

> > I used the suggested patch, and recompiled the kernel with it, and verified

> > with the machine code tool that there were no undefined loops. Wouldn't that

> > be enough?

> 

>  It would break if GAS was invoked without `-mfix-r5900' (e.g. by using an 

> explicit `-Wa,-mno-fix-r5900' option).


I see, hmm.

> > I tried to limit the case to branch delays only, which is a rough

> > approximation. Call and jump delay slots are acceptable. Are you referring

> > to this piece?

> > 

> > ;; Can the instruction be put into a delay slot?

> > (define_attr "can_delay" "no,yes"

> >   (if_then_else (and (eq_attr "type" "!branch,call,jump")

> > 		     (eq_attr "hazard" "none")

> > 		     (match_test "get_attr_insn_count (insn) == 1"))

> > 		(const_string "yes")

> > 		(const_string "no")))

> 

>  Yes.  My suggestion does not preclude limiting the workaround to branches 

> only while being precise about what the situation is, i.e. branches still 

> require a delay slot instruction (unlike MIPS16 or uMIPS compact branches) 

> but no real instruction is suitable.  I'd prefer if GCC actually scheduled 

> the required dummy NOP instruction explicitly.


OK.

Fredrik
Maciej W. Rozycki July 24, 2019, 11:09 a.m. | #6
Fredrik,

> > > How can one reasonably prevent the bug when compiling a whole Linux

> > > distribution with thousands of packages if GAS merely warns and proceeds

> > > in many cases?

> > 

> >  I think the tools must not set a policy.  By using `.set noreorder' the 

> > user told the toolchain he knows better and asked to keep its hands away.

> > 

> >  As I say you can use `-Werror' for code auditing.  And in most cases 

> > manually scheduled delay slots in handcoded assembly are a sign of a 

> > problem with the piece of code affected, regardless of the R5900 erratum.

> 

> Well, it seems practical to use unofficial tools and a patched GAS to fix

> this assembly bug, then. It's a grave problem for the R5900 and it needs to

> be reliably detected.


 Why?  What is wrong with using `-Werror'?

 Or you could use `-Wa,-fatal-warnings' to limit making warnings fatal to 
the assembly stage only if you are concerned about bad high-level language 
code quality interfering with your goal.

> >  See `nops_for_insn'.  However again, as I noted, `.set noreorder' tells 

> > GAS not to analyse the dependencies for code within.  There is no need to 

> > schedule this delay slot manually, and if this is generated code, then the 

> > producer (GCC) should have taken care of the hazards, be it architectural 

> > or errata.  If this is manually written code, then the author asked for 

> > trouble.

> 

> I'm using the principle above to unobtrusively adjust problematic kernel

> code, via a short_loop_war macro. Here is one patched instance:

> 

> --- a/arch/mips/boot/compressed/head.S

> +++ b/arch/mips/boot/compressed/head.S

> @@ -13,6 +13,7 @@

>   */

>  

>  #include <asm/asm.h>

> +#include <asm/asmmacro.h>

>  #include <asm/regdef.h>

>  

>  	.set noreorder

> @@ -29,6 +30,7 @@ start:

>  	PTR_LA	a0, _edata

>  	PTR_LA	a2, _end

>  1:	sw	zero, 0(a0)

> +	short_loop_war(1b)

>  	bne	a2, a0, 1b

>  	 addiu	a0, a0, 4

>  

> @@ -48,6 +50,7 @@ start:

>  	jr	k0

>  	 nop

>  3:

> +	short_loop_war(3b)

>  	b	3b

>  	 nop


 Is it needed here given that the delay slot instruction is a NOP anyway?  

 Also this source does not need `.set noreorder', except for the loop 
around `1:' where a data dependency is present with the delay slot 
instruction.  And I am surprised that it even assembles as it uses 
`.cprestore' without the required offset argument (not that this pseudo-op 
makes any sense here).  And it's weird overall, e.g. it loads $ra 
explicitly rather than using JALR (or JAL even).  But these are unrelated 
problems.

> >  Well, BogoMIPS is just an arbitrary number.

> 

> So presumably the noreorder BogoMIPS variant can be replaced with a

> single reorder variant that works with all MIPS implementations?


 Sure, BogoMIPS just records how quickly the delay loop runs, and 
therefore how many iterations are required for a given amount of time to 
spend twiddling thumbs.

  Maciej
Fredrik Noring July 24, 2019, 2:39 p.m. | #7
Hi Maciej,

> > > > How can one reasonably prevent the bug when compiling a whole Linux

> > > > distribution with thousands of packages if GAS merely warns and proceeds

> > > > in many cases?

> > > 

> > >  I think the tools must not set a policy.  By using `.set noreorder' the 

> > > user told the toolchain he knows better and asked to keep its hands away.

> > > 

> > >  As I say you can use `-Werror' for code auditing.  And in most cases 

> > > manually scheduled delay slots in handcoded assembly are a sign of a 

> > > problem with the piece of code affected, regardless of the R5900 erratum.

> > 

> > Well, it seems practical to use unofficial tools and a patched GAS to fix

> > this assembly bug, then. It's a grave problem for the R5900 and it needs to

> > be reliably detected.

> 

>  Why?  What is wrong with using `-Werror'?

> 

>  Or you could use `-Wa,-fatal-warnings' to limit making warnings fatal to 

> the assembly stage only if you are concerned about bad high-level language 

> code quality interfering with your goal.


It appears unfeasible to fix thousands of build scripts to work like that.
Although the numbers with actual MIPS code in them likely are very small,
some may source macros and other stuff via various libraries. Some
distributions, such as Gentoo, do have global CFLAGS but such mechanism are
unreliable.

I understand that one may want to take a principled stance with "hands away"
and "author asked for trouble", but fact is that whomever put a given
noreorder directive into a piece of code most likely didn't have R5900
errata in mind, and R5900 users most likely don't want the assembler to
generate code that is known to cause subtle bugs of all imaginable kinds,
including data corruption.

Hence my recommendation. It's not really an argument because I'm not going
to ask that the official GAS changes behaviour on this.

> >  3:

> > +	short_loop_war(3b)

> >  	b	3b

> >  	 nop

> 

>  Is it needed here given that the delay slot instruction is a NOP anyway?  


Ah, no. That is a left-over from the note that "a branch delay slot of the
loop is not NOP (EE 2.9 or later)", as it appears a NOP is not enough for
EE before 2.9.

Perhaps the short_loop_war macro can be improved to give errors when
misused. For example, it could potentially give an error if it is placed
outside of noreorder, but it doesn't seem like GAS can inform the macro
whether it is inside or outside.

>  Also this source does not need `.set noreorder', except for the loop 

> around `1:' where a data dependency is present with the delay slot 

> instruction.


That data dependency can surely be reworked too, so as to make the whole
piece reorderable?

> And I am surprised that it even assembles as it uses 

> `.cprestore' without the required offset argument (not that this pseudo-op 

> makes any sense here).  And it's weird overall, e.g. it loads $ra 

> explicitly rather than using JALR (or JAL even).  But these are unrelated 

> problems.


Good to know, thanks. :)

Fredrik
Maciej W. Rozycki July 24, 2019, 3:31 p.m. | #8
Fredrik,

> >  Why?  What is wrong with using `-Werror'?

> > 

> >  Or you could use `-Wa,-fatal-warnings' to limit making warnings fatal to 

> > the assembly stage only if you are concerned about bad high-level language 

> > code quality interfering with your goal.

> 

> It appears unfeasible to fix thousands of build scripts to work like that.

> Although the numbers with actual MIPS code in them likely are very small,

> some may source macros and other stuff via various libraries. Some

> distributions, such as Gentoo, do have global CFLAGS but such mechanism are

> unreliable.


 I'd expect a sane distribution packager to prepare/patch any disobedient 
software packages to accept global distribution-specific flags, as there 
are often more important flags to pass which all packages need to be build 
with.

 However if they haven't, then put the flags in CC/CXX/etc. as you'd do 
with multilib selection options, or if that fails too, then wrap your 
compiler driver used for a distribution build into a shell script found 
according to $PATH (which may be non-standard) that adds any options 
required implicitly.  I've done that many times when I found myself in 
situation where I had to override some silly hardcoded assumptions, 
including removing or transforming rather than adding options passed.

> I understand that one may want to take a principled stance with "hands away"

> and "author asked for trouble", but fact is that whomever put a given

> noreorder directive into a piece of code most likely didn't have R5900

> errata in mind, and R5900 users most likely don't want the assembler to

> generate code that is known to cause subtle bugs of all imaginable kinds,

> including data corruption.


 Yes, and a warning is I think a reasonable solution.  Alternatively maybe 
you could persuade binutils maintainers to have a `configure' option to 
make `-fatal-warnings' GAS's default for a given build.

> > >  3:

> > > +	short_loop_war(3b)

> > >  	b	3b

> > >  	 nop

> > 

> >  Is it needed here given that the delay slot instruction is a NOP anyway?  

> 

> Ah, no. That is a left-over from the note that "a branch delay slot of the

> loop is not NOP (EE 2.9 or later)", as it appears a NOP is not enough for

> EE before 2.9.


 Ah, OK then.

> >  Also this source does not need `.set noreorder', except for the loop 

> > around `1:' where a data dependency is present with the delay slot 

> > instruction.

> 

> That data dependency can surely be reworked too, so as to make the whole

> piece reorderable?


 The SW and ADDIU instructions can surely be swapped, e.g.:

	PTR_LA	a0, _edata - 4
	PTR_LA	a2, _end
1:	addiu	a0, a0, 4
	sw	zero, 0(a0)
	bne	a2, a0, 1b

You can make it all `reorder' then.  Note that this preserves the final 
out-of-range store, which I gather is deliberate for performance reasons.

  Maciej
Jeff Law July 24, 2019, 3:35 p.m. | #9
On 7/24/19 9:31 AM, Maciej W. Rozycki wrote:
> Fredrik,

> 

>>>  Why?  What is wrong with using `-Werror'?

>>>

>>>  Or you could use `-Wa,-fatal-warnings' to limit making warnings fatal to 

>>> the assembly stage only if you are concerned about bad high-level language 

>>> code quality interfering with your goal.

>>

>> It appears unfeasible to fix thousands of build scripts to work like that.

>> Although the numbers with actual MIPS code in them likely are very small,

>> some may source macros and other stuff via various libraries. Some

>> distributions, such as Gentoo, do have global CFLAGS but such mechanism are

>> unreliable.

> 

>  I'd expect a sane distribution packager to prepare/patch any disobedient 

> software packages to accept global distribution-specific flags, as there 

> are often more important flags to pass which all packages need to be build 

> with.

Getting good flags injection and being able to audit it is painful, but
amazingly helpful (insert plug here for Nick's annobin/annocheck).
Fixing packages to accept the standard methods of flag injection as well
as addressing issues that arise is all standard procedure.


Jeff

Patch

diff --git a/gcc/config/mips/mips.md b/gcc/config/mips/mips.md
index e17b1d522f0..acd31a8960c 100644
--- a/gcc/config/mips/mips.md
+++ b/gcc/config/mips/mips.md
@@ -1091,6 +1091,7 @@ 
 
 (define_delay (and (eq_attr "type" "branch")
 		   (not (match_test "TARGET_MIPS16"))
+		   (not (match_test "TARGET_FIX_R5900"))
 		   (eq_attr "branch_likely" "yes"))
   [(eq_attr "can_delay" "yes")
    (nil)
@@ -1100,6 +1101,7 @@ 
 ;; not annul on false.
 (define_delay (and (eq_attr "type" "branch")
 		   (not (match_test "TARGET_MIPS16"))
+		   (not (match_test "TARGET_FIX_R5900"))
 		   (ior (match_test "TARGET_CB_NEVER")
 			(and (eq_attr "compact_form" "maybe")
 			     (not (match_test "TARGET_CB_ALWAYS")))