[v2,09/10] RISC-V: Provide programmatic implementation of CAS [PR 100266]

Message ID 20210505193651.2075405-10-cmuellner@gcc.gnu.org
State New
Headers show
Series
  • Atomics improvements [PR100265/PR100266]
Related show

Commit Message

Jonathan Wakely via Gcc-patches May 5, 2021, 7:36 p.m.
The existing CAS implementation uses an INSN definition, which provides
the core LR/SC sequence. Additionally to that, there is a follow-up code,
that evaluates the results and calculates the return values.
This has two drawbacks: a) an extension to sub-word CAS implementations
is not possible (even if, then it would be unmaintainable), and b) the
implementation is hard to maintain/improve.
This patch provides a programmatic implementation of CAS, similar
like many other architectures are having one.

The implementation supports both, RV32 and RV64.

Additionally, the implementation does not introduce data dependencies
for computation of the return value. Instead, we set the return value
(success state of the CAS operation) based on structural information.
This approach is also shown in the the RISC-V unpriv spec (as part
of the sample code for a compare-and-swap function using LR/SC).
The cost of this implementation is a single LI instruction on top,
which is actually not required in case of success (it will be
overwritten in the success case later).

The resulting sequence requires 9 instructions in the success case.
The previous implementation required 11 instructions in the succcess
case (including a taken branch) and had a "subw;seqz;beqz" sequence,
with direct dependencies.

Below is the generated code of a 32-bit CAS sequence with the old
implementation and the new implementation (ignore the ANDIs below).

Old:
     f00:       419c                    lw      a5,0(a1)
     f02:       1005272f                lr.w    a4,(a0)
     f06:       00f71563                bne     a4,a5,f10
     f0a:       18c526af                sc.w    a3,a2,(a0)
     f0e:       faf5                    bnez    a3,f02
     f10:       40f707bb                subw    a5,a4,a5
     f14:       0017b513                seqz    a0,a5
     f18:       c391                    beqz    a5,f1c
     f1a:       c198                    sw      a4,0(a1)
     f1c:       8905                    andi    a0,a0,1
     f1e:       8082                    ret

New:
     e28:       4194                    lw      a3,0(a1)
     e2a:       4701                    li      a4,0
     e2c:       1005282f                lr.w    a6,(a0)
     e30:       00d81963                bne     a6,a3,e42
     e34:       18c527af                sc.w    a5,a2,(a0)
     e38:       fbf5                    bnez    a5,e2c
     e3a:       4705                    li      a4,1
     e3c:       00177513                andi    a0,a4,1
     e40:       8082                    ret
     e42:       0105a023                sw      a6,0(a1)
     e46:       00177513                andi    a0,a4,1
     e4a:       8082                    ret

    gcc/
        PR 100266
        * config/riscv/riscv-protos.h (riscv_expand_compare_and_swap): New.
        * config/riscv/riscv.c (riscv_emit_unlikely_jump): New.
        * config/rsicv/riscv.c (riscv_expand_compare_and_swap): New.
        * config/rsicv/sync.md (atomic_cas_value_strong): Removed.
        * config/rsicv/sync.md (atomic_compare_and_swap): Call
	  riscv_expand_compare_and_swap.
---
 gcc/config/riscv/riscv-protos.h |  1 +
 gcc/config/riscv/riscv.c        | 75 +++++++++++++++++++++++++++++++++
 gcc/config/riscv/sync.md        | 35 +--------------
 3 files changed, 77 insertions(+), 34 deletions(-)

-- 
2.31.1

Comments

Jim Wilson May 6, 2021, 12:27 a.m. | #1
On Wed, May 5, 2021 at 12:37 PM Christoph Muellner <cmuellner@gcc.gnu.org>
wrote:

> The existing CAS implementation uses an INSN definition, which provides

> the core LR/SC sequence. Additionally to that, there is a follow-up code,

> that evaluates the results and calculates the return values.

> This has two drawbacks: a) an extension to sub-word CAS implementations

> is not possible (even if, then it would be unmaintainable), and b) the

> implementation is hard to maintain/improve.

> This patch provides a programmatic implementation of CAS, similar

> like many other architectures are having one.

>


A comment that Andrew Waterman made to me today about the safety of this
under various circumstances got me thinking, and I realized that without
the special cas pattern we can get reloads in the middle of the sequence
which would be bad.  Experimenting a bit, I managed to prove it.  This is
using the old version of the patch which I already had handy, but I'm sure
the new version will behave roughly the same way.  Using the testsuite
testcase atomic-compare-exchange-3.c as before, and adding a lot of
-ffixed-X options to simulate high register pressure, with the compiler
command
./xgcc -B./ -O2 -S tmp.c -ffixed-x16 -ffixed-x17 -ffixed-x18 -ffixed-x19
-ffixed-x20 -ffixed-x21 -ffixed-x22 -ffixed-x23 -ffixed-x24 -ffixed-x25
-ffixed-x26 -ffixed-x27 -ffixed-x28 -ffixed-x29 -ffixed-x30 -ffixed-x31
-ffixed-x15 -ffixed-x14 -ffixed-x13 -ffixed-x12 -ffixed-s0 -ffixed-s1
-ffixed-t2 -ffixed-t1 -ffixed-t0
I get for the first lr/sc loop
.L2:
        lui     a1,%hi(v)
        addi    a0,a1,%lo(v)
        lr.w a1, 0(a0)
        ld      a0,8(sp)
        sw      a1,24(sp)
        bne     a1,a0,.L39
        lui     a1,%hi(v)
        addi    a0,a1,%lo(v)
        lw      a1,16(sp)
        sd      ra,24(sp)
        sc.w ra, a1, 0(a0)
        sext.w  a1,ra
        ld      ra,24(sp)
        bne     a1,zero,.L2
and note all of the misc load/store instructions added by reload.  I don't
think this is safe or guaranteed to work.  With the cas pattern, any
reloads are guaranteed to be emitted before and/or after the lr/sc loop.
With the separate patterns, there is no way to ensure that we won't get
accidental reloads in the middle of the lr/sc loop.

I think we need to keep the cas pattern.  We can always put C code inside
the output template of the cas pattern if that is helpful.  It can do any
necessary tests and then return an appropriate string for the instructions
we want.

Jim

Patch

diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 43d7224d6941..eb7e67d3b95a 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -59,6 +59,7 @@  extern void riscv_expand_int_scc (rtx, enum rtx_code, rtx, rtx);
 extern void riscv_expand_float_scc (rtx, enum rtx_code, rtx, rtx);
 extern void riscv_expand_conditional_branch (rtx, enum rtx_code, rtx, rtx);
 extern void riscv_expand_conditional_move (rtx, rtx, rtx, rtx_code, rtx, rtx);
+extern void riscv_expand_compare_and_swap (rtx[]);
 #endif
 extern rtx riscv_legitimize_call_address (rtx);
 extern void riscv_set_return_address (rtx, rtx);
diff --git a/gcc/config/riscv/riscv.c b/gcc/config/riscv/riscv.c
index 5fe65776e608..a7b18d650daa 100644
--- a/gcc/config/riscv/riscv.c
+++ b/gcc/config/riscv/riscv.c
@@ -2496,6 +2496,81 @@  riscv_expand_conditional_move (rtx dest, rtx cons, rtx alt, rtx_code code,
 						      cons, alt)));
 }
 
+/* Mark the previous jump instruction as unlikely.  */
+
+static void
+riscv_emit_unlikely_jump (rtx insn)
+{
+  rtx_insn *jump = emit_jump_insn (insn);
+  add_reg_br_prob_note (jump, profile_probability::very_unlikely ());
+}
+
+/* Expand code to perform a compare-and-swap.  */
+
+extern void
+riscv_expand_compare_and_swap (rtx operands[])
+{
+  rtx bval, oldval, mem, expval, newval, mod_s, mod_f, scratch, cond1, cond2;
+  machine_mode mode;
+  rtx_code_label *begin_label, *end_label;
+
+  bval = operands[0];
+  oldval = operands[1];
+  mem = operands[2];
+  expval = operands[3];
+  newval = operands[4];
+  mod_s = operands[6];
+  mod_f = operands[7];
+  mode = GET_MODE (mem);
+  scratch = gen_reg_rtx (mode);
+  begin_label = gen_label_rtx ();
+  end_label = gen_label_rtx ();
+
+  /* No support for sub-word CAS.  */
+  if (mode == QImode || mode == HImode)
+    gcc_unreachable ();
+
+  /* We use mod_f for LR and mod_s for SC below, but
+     RV does not have any guarantees for LR.rl and SC.aq.  */
+  if (is_mm_acquire (memmodel_base (INTVAL (mod_s)))
+      && is_mm_relaxed (memmodel_base (INTVAL (mod_f))))
+    {
+      mod_f = GEN_INT (MEMMODEL_ACQUIRE);
+      mod_s = GEN_INT (MEMMODEL_RELAXED);
+    }
+
+  /* Since we want to maintain a branch-free good-case, but also want
+     to not have two branches in the bad-case, we set bval to FALSE
+     on top of the sequence.  In the bad case, we simply jump over
+     the assignment of bval to TRUE at the end of the sequence.  */
+
+  emit_insn (gen_rtx_SET (bval, gen_rtx_CONST_INT (SImode, FALSE)));
+
+  /* Make sure the scheduler does not move INSNs beyond here.  */
+  emit_insn (gen_blockage ());
+
+  emit_label (begin_label);
+
+  emit_insn (gen_riscv_load_reserved (mode, oldval, mem, mod_f));
+
+  cond1 = gen_rtx_NE (mode, oldval, expval);
+  riscv_emit_unlikely_jump (gen_cbranch4 (Pmode, cond1, oldval, expval,
+			    end_label));
+
+  emit_insn (gen_riscv_store_conditional (mode, scratch, mem, newval, mod_s));
+
+  /* Make sure the scheduler does not move INSNs beyond here.  */
+  emit_insn (gen_blockage ());
+
+  cond2 = gen_rtx_NE (mode, scratch, const0_rtx);
+  riscv_emit_unlikely_jump (gen_cbranch4 (Pmode, cond2, scratch, const0_rtx,
+			    begin_label));
+
+  emit_insn (gen_rtx_SET (bval, gen_rtx_CONST_INT (SImode, TRUE)));
+
+  emit_label (end_label);
+}
+
 /* Implement TARGET_FUNCTION_ARG_BOUNDARY.  Every parameter gets at
    least PARM_BOUNDARY bits of alignment, but will be given anything up
    to PREFERRED_STACK_BOUNDARY bits if the type requires it.  */
diff --git a/gcc/config/riscv/sync.md b/gcc/config/riscv/sync.md
index 49b860da8ef0..da8dbf698163 100644
--- a/gcc/config/riscv/sync.md
+++ b/gcc/config/riscv/sync.md
@@ -207,20 +207,6 @@ 
   "amoswap.<amo>%A3 %0,%z2,%1"
 )
 
-(define_insn "atomic_cas_value_strong<mode>"
-  [(set (match_operand:GPR 0 "register_operand" "=&r")
-	(match_operand:GPR 1 "memory_operand" "+A"))
-   (set (match_dup 1)
-	(unspec_volatile:GPR [(match_operand:GPR 2 "reg_or_0_operand" "rJ")
-			      (match_operand:GPR 3 "reg_or_0_operand" "rJ")
-			      (match_operand:SI 4 "const_int_operand")  ;; mod_s
-			      (match_operand:SI 5 "const_int_operand")] ;; mod_f
-	 UNSPEC_COMPARE_AND_SWAP))
-   (clobber (match_scratch:GPR 6 "=&r"))]
-  "TARGET_ATOMIC"
-  "1: lr.<amo>%A5 %0,%1; bne %0,%z2,1f; sc.<amo>%A4 %6,%z3,%1; bnez %6,1b; 1:"
-  [(set (attr "length") (const_int 20))])
-
 (define_expand "atomic_compare_and_swap<mode>"
   [(match_operand:SI 0 "register_operand" "")   ;; bool output
    (match_operand:GPR 1 "register_operand" "")  ;; val output
@@ -232,26 +218,7 @@ 
    (match_operand:SI 7 "const_int_operand" "")] ;; mod_f
   "TARGET_ATOMIC"
 {
-  emit_insn (gen_atomic_cas_value_strong<mode> (operands[1], operands[2],
-						operands[3], operands[4],
-						operands[6], operands[7]));
-
-  rtx compare = operands[1];
-  if (operands[3] != const0_rtx)
-    {
-      rtx difference = gen_rtx_MINUS (<MODE>mode, operands[1], operands[3]);
-      compare = gen_reg_rtx (<MODE>mode);
-      emit_insn (gen_rtx_SET (compare, difference));
-    }
-
-  if (word_mode != <MODE>mode)
-    {
-      rtx reg = gen_reg_rtx (word_mode);
-      emit_insn (gen_rtx_SET (reg, gen_rtx_SIGN_EXTEND (word_mode, compare)));
-      compare = reg;
-    }
-
-  emit_insn (gen_rtx_SET (operands[0], gen_rtx_EQ (SImode, compare, const0_rtx)));
+  riscv_expand_compare_and_swap (operands);
   DONE;
 })