[AArch64] Support for LDP/STP of Q-registers

Message ID 5B157981.3010408@foss.arm.com
State New
Headers show
Series
  • [AArch64] Support for LDP/STP of Q-registers
Related show

Commit Message

Kyrill Tkachov June 4, 2018, 5:40 p.m.
Hi all,

This patch adds support for generating LDPs and STPs of Q-registers.
This allows for more compact code generation and makes better use of the ISA.

It's implemented in a straightforward way by allowing 16-byte modes in the
sched-fusion machinery and adding appropriate peepholes in aarch64-ldpstp.md
as well as the patterns themselves in aarch64-simd.md.

I didn't see any non-noise performance effect on SPEC2017 on Cortex-A72 and Cortex-A53.

Bootstrapped and tested on aarch64-none-linux-gnu.

Ok for trunk?

Thanks,
Kyrill

2018-06-04  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * config/aarch64/aarch64.c (aarch64_mode_valid_for_sched_fusion_p):
     Allow 16-byte modes.
     (aarch64_classify_address): Allow 16-byte modes for load_store_pair_p.
     * config/aarch64/aarch64-ldpstp.md: Add peepholes for LDP STP of
     128-bit modes.
     * config/aarch64/aarch64-simd.md (load_pair<VQ:mode><VQ2:mode>):
     New pattern.
     (vec_store_pair<VQ:mode><VQ2:mode>): Likewise.
     * config/aarch64/iterators.md (VQ2): New mode iterator.

2018-06-04  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * gcc.target/aarch64/ldp_stp_q.c: New test.
     * gcc.target/aarch64/stp_vec_128_1.c: Likewise.

Comments

Kyrill Tkachov June 5, 2018, 4:32 p.m. | #1
On 04/06/18 18:40, Kyrill Tkachov wrote:
> Hi all,

>

> This patch adds support for generating LDPs and STPs of Q-registers.

> This allows for more compact code generation and makes better use of the ISA.

>

> It's implemented in a straightforward way by allowing 16-byte modes in the

> sched-fusion machinery and adding appropriate peepholes in aarch64-ldpstp.md

> as well as the patterns themselves in aarch64-simd.md.

>

> I didn't see any non-noise performance effect on SPEC2017 on Cortex-A72 and Cortex-A53.

>


Adding some folks who know more about other CPUs as well.
Are you okay with enabling these instructions in AArch64?

If you could give this a spin on some benchmarks you
care about on your platforms it would be really useful data.

Thanks,
Kyrill

> Bootstrapped and tested on aarch64-none-linux-gnu.

>

> Ok for trunk?

>

> Thanks,

> Kyrill

>

> 2018-06-04  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

>

>     * config/aarch64/aarch64.c (aarch64_mode_valid_for_sched_fusion_p):

>     Allow 16-byte modes.

>     (aarch64_classify_address): Allow 16-byte modes for load_store_pair_p.

>     * config/aarch64/aarch64-ldpstp.md: Add peepholes for LDP STP of

>     128-bit modes.

>     * config/aarch64/aarch64-simd.md (load_pair<VQ:mode><VQ2:mode>):

>     New pattern.

>     (vec_store_pair<VQ:mode><VQ2:mode>): Likewise.

>     * config/aarch64/iterators.md (VQ2): New mode iterator.

>

> 2018-06-04  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

>

>     * gcc.target/aarch64/ldp_stp_q.c: New test.

>     * gcc.target/aarch64/stp_vec_128_1.c: Likewise.
Andrew Pinski June 5, 2018, 4:45 p.m. | #2
On Tue, Jun 5, 2018 at 9:32 AM Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
>

>

> On 04/06/18 18:40, Kyrill Tkachov wrote:

> > Hi all,

> >

> > This patch adds support for generating LDPs and STPs of Q-registers.

> > This allows for more compact code generation and makes better use of the ISA.

> >

> > It's implemented in a straightforward way by allowing 16-byte modes in the

> > sched-fusion machinery and adding appropriate peepholes in aarch64-ldpstp.md

> > as well as the patterns themselves in aarch64-simd.md.

> >

> > I didn't see any non-noise performance effect on SPEC2017 on Cortex-A72 and Cortex-A53.

> >

>

> Adding some folks who know more about other CPUs as well.

> Are you okay with enabling these instructions in AArch64?

>

> If you could give this a spin on some benchmarks you

> care about on your platforms it would be really useful data.


It might be useful to have a aarch64-tuning-flags.def for this; even
if all current cores have it on.
I might do some performance analysis on OcteonTX 81xx and 83xx (aka
thunderxt81 and thunderxt83) but it won't be until end of June as I am
on vacation until then.

Thanks,
Andrew

>

> Thanks,

> Kyrill

>

> > Bootstrapped and tested on aarch64-none-linux-gnu.

> >

> > Ok for trunk?

> >

> > Thanks,

> > Kyrill

> >

> > 2018-06-04  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

> >

> >     * config/aarch64/aarch64.c (aarch64_mode_valid_for_sched_fusion_p):

> >     Allow 16-byte modes.

> >     (aarch64_classify_address): Allow 16-byte modes for load_store_pair_p.

> >     * config/aarch64/aarch64-ldpstp.md: Add peepholes for LDP STP of

> >     128-bit modes.

> >     * config/aarch64/aarch64-simd.md (load_pair<VQ:mode><VQ2:mode>):

> >     New pattern.

> >     (vec_store_pair<VQ:mode><VQ2:mode>): Likewise.

> >     * config/aarch64/iterators.md (VQ2): New mode iterator.

> >

> > 2018-06-04  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

> >

> >     * gcc.target/aarch64/ldp_stp_q.c: New test.

> >     * gcc.target/aarch64/stp_vec_128_1.c: Likewise.

>
Siddhesh Poyarekar June 5, 2018, 5:18 p.m. | #3
On 06/05/2018 10:02 PM, Kyrill Tkachov wrote:
> Adding some folks who know more about other CPUs as well.

> Are you okay with enabling these instructions in AArch64?

> 

> If you could give this a spin on some benchmarks you

> care about on your platforms it would be really useful data.


Sameera had written something similar (at least in terms of the result, 
I don't remember if the approach was the same) and saw the same results 
as you did; non-significant changes to performance for CPU2017 because 
of which she did not submit it upstream.

That said, I think it is OK to have this upstream.  apinski's suggestion 
to add a tuning flag makes sense too; it will make testing and tuning 
easier for us if we need to in future.

Siddhesh
James Greenhalgh June 5, 2018, 5:28 p.m. | #4
On Tue, Jun 05, 2018 at 11:32:06AM -0500, Kyrill Tkachov wrote:
> 

> On 04/06/18 18:40, Kyrill Tkachov wrote:

> > Hi all,

> >

> > This patch adds support for generating LDPs and STPs of Q-registers.

> > This allows for more compact code generation and makes better use of the ISA.

> >

> > It's implemented in a straightforward way by allowing 16-byte modes in the

> > sched-fusion machinery and adding appropriate peepholes in aarch64-ldpstp.md

> > as well as the patterns themselves in aarch64-simd.md.

> >

> > I didn't see any non-noise performance effect on SPEC2017 on Cortex-A72 and Cortex-A53.

> >

> 

> Adding some folks who know more about other CPUs as well.

> Are you okay with enabling these instructions in AArch64?

> 

> If you could give this a spin on some benchmarks you

> care about on your platforms it would be really useful data.


From an architecture perspective, I think this is the right thing for us
to do. Given the feedback from Andrew and Siddhesh I think we should support
this patch, defaulting to on; but behind a tuning flag for those who want
to disable it for their -mcpu tuning.

If you can respin it behind a tuning parameter and give the community
another 48 hours or so to respond, I think we'd have a good patch here.

I'm also adding some more contributors to the AArch64 cores file for their
thoughts on the proposal.

Thanks,
James
Philipp Tomsich June 5, 2018, 5:47 p.m. | #5
> On 5 Jun 2018, at 19:28, James Greenhalgh <james.greenhalgh@arm.com> wrote:

> 

> On Tue, Jun 05, 2018 at 11:32:06AM -0500, Kyrill Tkachov wrote:

>> 

>> On 04/06/18 18:40, Kyrill Tkachov wrote:

>>> Hi all,

>>> 

>>> This patch adds support for generating LDPs and STPs of Q-registers.

>>> This allows for more compact code generation and makes better use of the ISA.

>>> 

>>> It's implemented in a straightforward way by allowing 16-byte modes in the

>>> sched-fusion machinery and adding appropriate peepholes in aarch64-ldpstp.md

>>> as well as the patterns themselves in aarch64-simd.md.

>>> 

>>> I didn't see any non-noise performance effect on SPEC2017 on Cortex-A72 and Cortex-A53.

>>> 

>> 

>> Adding some folks who know more about other CPUs as well.

>> Are you okay with enabling these instructions in AArch64?

>> 

>> If you could give this a spin on some benchmarks you

>> care about on your platforms it would be really useful data.

> 

> From an architecture perspective, I think this is the right thing for us

> to do. Given the feedback from Andrew and Siddhesh I think we should support

> this patch, defaulting to on; but behind a tuning flag for those who want

> to disable it for their -mcpu tuning.


My gut feeling is that for XGene, we’d like to not have this on by default,
as we’d prefer to schedule work between the loads (if there are instructions
available to insert there).  I have asked my team to look at this with some
benchmarks and we’ll report back.

This may be a moot point, though, as there’s probably only a few cases where
we have useful work to do (i.e. don’t depend on the data-results of these loads)
and an opportunity for this fusion.

Thanks,
Philipp.

> If you can respin it behind a tuning parameter and give the community

> another 48 hours or so to respond, I think we'd have a good patch here.

> 

> I'm also adding some more contributors to the AArch64 cores file for their

> thoughts on the proposal.

> 

> Thanks,

> James

> 

>
Kyrill Tkachov June 7, 2018, 10:58 a.m. | #6
On 05/06/18 18:28, James Greenhalgh wrote:
> On Tue, Jun 05, 2018 at 11:32:06AM -0500, Kyrill Tkachov wrote:

>> On 04/06/18 18:40, Kyrill Tkachov wrote:

>>> Hi all,

>>>

>>> This patch adds support for generating LDPs and STPs of Q-registers.

>>> This allows for more compact code generation and makes better use of the ISA.

>>>

>>> It's implemented in a straightforward way by allowing 16-byte modes in the

>>> sched-fusion machinery and adding appropriate peepholes in aarch64-ldpstp.md

>>> as well as the patterns themselves in aarch64-simd.md.

>>>

>>> I didn't see any non-noise performance effect on SPEC2017 on Cortex-A72 and Cortex-A53.

>>>

>> Adding some folks who know more about other CPUs as well.

>> Are you okay with enabling these instructions in AArch64?

>>

>> If you could give this a spin on some benchmarks you

>> care about on your platforms it would be really useful data.

>  From an architecture perspective, I think this is the right thing for us

> to do. Given the feedback from Andrew and Siddhesh I think we should support

> this patch, defaulting to on; but behind a tuning flag for those who want

> to disable it for their -mcpu tuning.

>

> If you can respin it behind a tuning parameter and give the community

> another 48 hours or so to respond, I think we'd have a good patch here.

>

> I'm also adding some more contributors to the AArch64 cores file for their

> thoughts on the proposal.


Ok, here's an updated patch adding a new no_ldp_stp_qregs tuning flag.
I use it to restrict the peepholes in aarch64-ldpstp.md from merging the
operations together into PARALLELs. I also use it to restrict the sched fusion
check that brings such loads and stores together. This is enough to avoid
forming the pairs.

Given Philipp's comments I've enabled this flag for xgene1_tunings so that -mtune=xgene1
will not generate these pairs, but every other tuning will by default.

The tests are updated to explicitly pass -moverride=tune=none so that
they do not depend on a particular default -mcpu option.
This -moverride option disables all the tuning flags, so it disables
the "disabling" of LDP/STP-Q.

I've manually confirmed that for -mcpu=xgene1 the pairs are not formed
and added a test that the new tuning flag does indeed disable the generation.

Bootstrapped and tested on aarch64-none-linux-gnu.
What do folks think of this version?

Thanks,
Kyrill

2018-06-07  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * config/aarch64/aarch64-tuning-flags.def (no_ldp_stp_qregs): New.
     * config/aarch64/aarch64.c (xgene1_tunings): Add
     AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS to tune_flags.
     (aarch64_mode_valid_for_sched_fusion_p):
     Allow 16-byte modes.
     (aarch64_classify_address): Allow 16-byte modes for load_store_pair_p.
     * config/aarch64/aarch64-ldpstp.md: Add peepholes for LDP STP of
     128-bit modes.
     * config/aarch64/aarch64-simd.md (load_pair<VQ:mode><VQ2:mode>):
     New pattern.
     (vec_store_pair<VQ:mode><VQ2:mode>): Likewise.
     * config/aarch64/iterators.md (VQ2): New mode iterator.

2018-06-07  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * gcc.target/aarch64/ldp_stp_q.c: New test.
     * gcc.target/aarch64/stp_vec_128_1.c: Likewise.
     * gcc.target/aarch64/ldp_stp_q_disable.c: Likewise.
diff --git a/gcc/config/aarch64/aarch64-ldpstp.md b/gcc/config/aarch64/aarch64-ldpstp.md
index 7f1031dc80fab31f691c0b03d6a485c1b6fd7e53..650a80dcb2fed1c148a5bff83468fb2b768aa14a 100644
--- a/gcc/config/aarch64/aarch64-ldpstp.md
+++ b/gcc/config/aarch64/aarch64-ldpstp.md
@@ -91,6 +91,37 @@ (define_peephole2
   aarch64_swap_ldrstr_operands (operands, false);
 })
 
+(define_peephole2
+  [(set (match_operand:VQ 0 "register_operand" "")
+	(match_operand:VQ 1 "memory_operand" ""))
+   (set (match_operand:VQ2 2 "register_operand" "")
+	(match_operand:VQ2 3 "memory_operand" ""))]
+  "TARGET_SIMD
+   && aarch64_operands_ok_for_ldpstp (operands, true, <VQ:MODE>mode)
+   && (aarch64_tune_params.extra_tuning_flags
+	& AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS) == 0"
+  [(parallel [(set (match_dup 0) (match_dup 1))
+	      (set (match_dup 2) (match_dup 3))])]
+{
+  aarch64_swap_ldrstr_operands (operands, true);
+})
+
+(define_peephole2
+  [(set (match_operand:VQ 0 "memory_operand" "")
+	(match_operand:VQ 1 "register_operand" ""))
+   (set (match_operand:VQ2 2 "memory_operand" "")
+	(match_operand:VQ2 3 "register_operand" ""))]
+  "TARGET_SIMD
+   && aarch64_operands_ok_for_ldpstp (operands, false, <VQ:MODE>mode)
+   && (aarch64_tune_params.extra_tuning_flags
+	& AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS) == 0"
+  [(parallel [(set (match_dup 0) (match_dup 1))
+	      (set (match_dup 2) (match_dup 3))])]
+{
+  aarch64_swap_ldrstr_operands (operands, false);
+})
+
+
 ;; Handle sign/zero extended consecutive load/store.
 
 (define_peephole2
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index b056a8f200d6c4b673dde346e4ef6c93bb93e046..ac42ac1e0e6d43689924c2cbd2f488d550480eb4 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -205,6 +205,34 @@ (define_insn "vec_store_pair<DREG:mode><DREG2:mode>"
   [(set_attr "type" "neon_stp")]
 )
 
+(define_insn "load_pair<VQ:mode><VQ2:mode>"
+  [(set (match_operand:VQ 0 "register_operand" "=w")
+	(match_operand:VQ 1 "aarch64_mem_pair_operand" "Ump"))
+   (set (match_operand:VQ2 2 "register_operand" "=w")
+	(match_operand:VQ2 3 "memory_operand" "m"))]
+  "TARGET_SIMD
+    && rtx_equal_p (XEXP (operands[3], 0),
+		    plus_constant (Pmode,
+			       XEXP (operands[1], 0),
+			       GET_MODE_SIZE (<VQ:MODE>mode)))"
+  "ldp\\t%q0, %q2, %1"
+  [(set_attr "type" "neon_ldp_q")]
+)
+
+(define_insn "vec_store_pair<VQ:mode><VQ2:mode>"
+  [(set (match_operand:VQ 0 "aarch64_mem_pair_operand" "=Ump")
+	(match_operand:VQ 1 "register_operand" "w"))
+   (set (match_operand:VQ2 2 "memory_operand" "=m")
+	(match_operand:VQ2 3 "register_operand" "w"))]
+  "TARGET_SIMD && rtx_equal_p (XEXP (operands[2], 0),
+		plus_constant (Pmode,
+			       XEXP (operands[0], 0),
+			       GET_MODE_SIZE (<VQ:MODE>mode)))"
+  "stp\\t%q1, %q3, %0"
+  [(set_attr "type" "neon_stp_q")]
+)
+
+
 (define_split
   [(set (match_operand:VQ 0 "register_operand" "")
       (match_operand:VQ 1 "register_operand" ""))]
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index ea9ead234cbc9abcff874dcc1fef4e76c76239ad..fb9700ca42f4660e2d5e19c14866a9d784318dd5 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -41,4 +41,7 @@ AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
    are not considered cheap.  */
 AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
 
+/* Disallow load/store pair instructions on Q-registers.  */
+AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
+
 #undef AARCH64_EXTRA_TUNING_OPTION
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 2974ca4c0964bb7a105ea96afeb1b7f2f8860a25..322bfcbdee9c96a315fff6ac4f8bab6b652a16a6 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -880,7 +880,7 @@ static const struct tune_params xgene1_tunings =
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),	/* tune_flags.  */
   &generic_prefetch_tune
 };
 
@@ -5690,7 +5690,10 @@ aarch64_mode_valid_for_sched_fusion_p (machine_mode mode)
   return mode == SImode || mode == DImode
 	 || mode == SFmode || mode == DFmode
 	 || (aarch64_vector_mode_supported_p (mode)
-	     && known_eq (GET_MODE_SIZE (mode), 8));
+	     && (known_eq (GET_MODE_SIZE (mode), 8)
+		 || (known_eq (GET_MODE_SIZE (mode), 16)
+		    && (aarch64_tune_params.extra_tuning_flags
+			& AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS) == 0)));
 }
 
 /* Return true if REGNO is a virtual pointer register, or an eliminable
@@ -5847,7 +5850,8 @@ aarch64_classify_address (struct aarch64_address_info *info,
 
 	  if (load_store_pair_p)
 	    return ((known_eq (GET_MODE_SIZE (mode), 4)
-		     || known_eq (GET_MODE_SIZE (mode), 8))
+		     || known_eq (GET_MODE_SIZE (mode), 8)
+		     || known_eq (GET_MODE_SIZE (mode), 16))
 		    && aarch64_offset_7bit_signed_scaled_p (mode, offset));
 	  else
 	    return (offset_9bit_signed_unscaled_p (mode, offset)
@@ -5907,7 +5911,8 @@ aarch64_classify_address (struct aarch64_address_info *info,
 
 	  if (load_store_pair_p)
 	    return ((known_eq (GET_MODE_SIZE (mode), 4)
-		     || known_eq (GET_MODE_SIZE (mode), 8))
+		     || known_eq (GET_MODE_SIZE (mode), 8)
+		     || known_eq (GET_MODE_SIZE (mode), 16))
 		    && aarch64_offset_7bit_signed_scaled_p (mode, offset));
 	  else
 	    return offset_9bit_signed_unscaled_p (mode, offset);
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index d8b78ef4c474a58071e97eced1fe95ca4e033910..9146414c335dfe16b40dcb6a4233612ae43e2926 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -84,6 +84,9 @@ (define_mode_iterator VDQ_BHSI [V8QI V16QI V4HI V8HI V2SI V4SI])
 ;; Quad vector modes.
 (define_mode_iterator VQ [V16QI V8HI V4SI V2DI V8HF V4SF V2DF])
 
+;; Copy of the above.
+(define_mode_iterator VQ2 [V16QI V8HI V4SI V2DI V8HF V4SF V2DF])
+
 ;; Quad integer vector modes.
 (define_mode_iterator VQ_I [V16QI V8HI V4SI V2DI])
 
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_q.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_q.c
new file mode 100644
index 0000000000000000000000000000000000000000..b4f258699a30634f14d8a72405bddbbfd158db6e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_q.c
@@ -0,0 +1,26 @@
+/* { dg-options "-O2 -moverride=tune=none" } */
+
+typedef float float32x4_t __attribute__ ((__vector_size__ ((16))));
+
+float32x4_t arr[4][4];
+
+void
+foo (float32x4_t x, float32x4_t y)
+{
+  arr[0][1] = x;
+  arr[1][0] = y;
+  arr[2][0] = x;
+  arr[1][1] = y;
+  arr[0][2] = x;
+  arr[0][3] = y;
+  arr[1][2] = x;
+  arr[2][1] = y;
+  arr[3][0] = x;
+  arr[3][1] = y;
+  arr[2][2] = x;
+  arr[1][3] = y;
+  arr[2][3] = x;
+  arr[3][2] = y;
+}
+
+/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 7 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_q_disable.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_q_disable.c
new file mode 100644
index 0000000000000000000000000000000000000000..38c1870c47c69175e4b641532333bf7108351476
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_q_disable.c
@@ -0,0 +1,26 @@
+/* { dg-options "-O2 -moverride=tune=no_ldp_stp_qregs" } */
+
+typedef float float32x4_t __attribute__ ((__vector_size__ ((16))));
+
+float32x4_t arr[4][4];
+
+void
+foo (float32x4_t x, float32x4_t y)
+{
+  arr[0][1] = x;
+  arr[1][0] = y;
+  arr[2][0] = x;
+  arr[1][1] = y;
+  arr[0][2] = x;
+  arr[0][3] = y;
+  arr[1][2] = x;
+  arr[2][1] = y;
+  arr[3][0] = x;
+  arr[3][1] = y;
+  arr[2][2] = x;
+  arr[1][3] = y;
+  arr[2][3] = x;
+  arr[3][2] = y;
+}
+
+/* { dg-final { scan-assembler-not "stp\tq\[0-9\]+, q\[0-9\]" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stp_vec_128_1.c b/gcc/testsuite/gcc.target/aarch64/stp_vec_128_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..7d8d54ebb8c7af6bcbde3e6ff9d3cb97ab293f1c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stp_vec_128_1.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -moverride=tune=none" } */
+
+
+typedef int int32x4_t __attribute__ ((__vector_size__ ((16))));
+
+void
+bar (int32x4_t *foo)
+{
+  int i = 0;
+  int32x4_t val = { 3, 2, 5, 1 };
+
+  for (i = 0; i < 256; i+=2)
+    {
+      foo[i] = val;
+      foo[i+1] = val;
+    }
+}
+
+/* { dg-final { scan-assembler "stp\tq\[0-9\]+, q\[0-9\]" } } */
James Greenhalgh June 19, 2018, 3:52 p.m. | #7
On Thu, Jun 07, 2018 at 05:58:01AM -0500, Kyrill Tkachov wrote:
> 

> On 05/06/18 18:28, James Greenhalgh wrote:

> > On Tue, Jun 05, 2018 at 11:32:06AM -0500, Kyrill Tkachov wrote:

> >> On 04/06/18 18:40, Kyrill Tkachov wrote:

> >>> Hi all,

> >>>

> >>> This patch adds support for generating LDPs and STPs of Q-registers.

> >>> This allows for more compact code generation and makes better use of the ISA.

> >>>

> >>> It's implemented in a straightforward way by allowing 16-byte modes in the

> >>> sched-fusion machinery and adding appropriate peepholes in aarch64-ldpstp.md

> >>> as well as the patterns themselves in aarch64-simd.md.

> >>>

> >>> I didn't see any non-noise performance effect on SPEC2017 on Cortex-A72 and Cortex-A53.

> >>>

> >> Adding some folks who know more about other CPUs as well.

> >> Are you okay with enabling these instructions in AArch64?

> >>

> >> If you could give this a spin on some benchmarks you

> >> care about on your platforms it would be really useful data.

> >  From an architecture perspective, I think this is the right thing for us

> > to do. Given the feedback from Andrew and Siddhesh I think we should support

> > this patch, defaulting to on; but behind a tuning flag for those who want

> > to disable it for their -mcpu tuning.

> >

> > If you can respin it behind a tuning parameter and give the community

> > another 48 hours or so to respond, I think we'd have a good patch here.

> >

> > I'm also adding some more contributors to the AArch64 cores file for their

> > thoughts on the proposal.

> 

> Ok, here's an updated patch adding a new no_ldp_stp_qregs tuning flag.

> I use it to restrict the peepholes in aarch64-ldpstp.md from merging the

> operations together into PARALLELs. I also use it to restrict the sched fusion

> check that brings such loads and stores together. This is enough to avoid

> forming the pairs.

> 

> Given Philipp's comments I've enabled this flag for xgene1_tunings so that -mtune=xgene1

> will not generate these pairs, but every other tuning will by default.

> 

> The tests are updated to explicitly pass -moverride=tune=none so that

> they do not depend on a particular default -mcpu option.

> This -moverride option disables all the tuning flags, so it disables

> the "disabling" of LDP/STP-Q.

> 

> I've manually confirmed that for -mcpu=xgene1 the pairs are not formed

> and added a test that the new tuning flag does indeed disable the generation.

> 

> Bootstrapped and tested on aarch64-none-linux-gnu.

> What do folks think of this version?


OK for trunk.

We can tune up other cores ready for the GCC 9 release; but no reason not
to have this in now. If, as we get close to GCC 9 the consensus in our
supported cores is to go back to GCC 8 behaviour, we can flick the switch
to no_ldp_stp_qregs in generic, and cores which will benefit from this can
turn it back on in their private tuning structures.

Thanks,
James

> 2018-06-07  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

> 

>      * config/aarch64/aarch64-tuning-flags.def (no_ldp_stp_qregs): New.

>      * config/aarch64/aarch64.c (xgene1_tunings): Add

>      AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS to tune_flags.

>      (aarch64_mode_valid_for_sched_fusion_p):

>      Allow 16-byte modes.

>      (aarch64_classify_address): Allow 16-byte modes for load_store_pair_p.

>      * config/aarch64/aarch64-ldpstp.md: Add peepholes for LDP STP of

>      128-bit modes.

>      * config/aarch64/aarch64-simd.md (load_pair<VQ:mode><VQ2:mode>):

>      New pattern.

>      (vec_store_pair<VQ:mode><VQ2:mode>): Likewise.

>      * config/aarch64/iterators.md (VQ2): New mode iterator.

> 

> 2018-06-07  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

> 

>      * gcc.target/aarch64/ldp_stp_q.c: New test.

>      * gcc.target/aarch64/stp_vec_128_1.c: Likewise.

>      * gcc.target/aarch64/ldp_stp_q_disable.c: Likewise.

Patch

diff --git a/gcc/config/aarch64/aarch64-ldpstp.md b/gcc/config/aarch64/aarch64-ldpstp.md
index 7f1031dc80fab31f691c0b03d6a485c1b6fd7e53..12d89fd2ef5db5e0d3828d75ae244fdc04438f45 100644
--- a/gcc/config/aarch64/aarch64-ldpstp.md
+++ b/gcc/config/aarch64/aarch64-ldpstp.md
@@ -91,6 +91,32 @@  (define_peephole2
   aarch64_swap_ldrstr_operands (operands, false);
 })
 
+(define_peephole2
+  [(set (match_operand:VQ 0 "register_operand" "")
+	(match_operand:VQ 1 "memory_operand" ""))
+   (set (match_operand:VQ2 2 "register_operand" "")
+	(match_operand:VQ2 3 "memory_operand" ""))]
+  "aarch64_operands_ok_for_ldpstp (operands, true, <VQ:MODE>mode)"
+  [(parallel [(set (match_dup 0) (match_dup 1))
+	      (set (match_dup 2) (match_dup 3))])]
+{
+  aarch64_swap_ldrstr_operands (operands, true);
+})
+
+(define_peephole2
+  [(set (match_operand:VQ 0 "memory_operand" "")
+	(match_operand:VQ 1 "register_operand" ""))
+   (set (match_operand:VQ2 2 "memory_operand" "")
+	(match_operand:VQ2 3 "register_operand" ""))]
+  "TARGET_SIMD
+   && aarch64_operands_ok_for_ldpstp (operands, false, <VQ:MODE>mode)"
+  [(parallel [(set (match_dup 0) (match_dup 1))
+	      (set (match_dup 2) (match_dup 3))])]
+{
+  aarch64_swap_ldrstr_operands (operands, false);
+})
+
+
 ;; Handle sign/zero extended consecutive load/store.
 
 (define_peephole2
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index d5803998c60bf9422dbc4481bac1590f4d209a4a..740a3414a8d9c80addbfa611d530d9f56da11100 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -205,6 +205,34 @@  (define_insn "vec_store_pair<DREG:mode><DREG2:mode>"
   [(set_attr "type" "neon_stp")]
 )
 
+(define_insn "load_pair<VQ:mode><VQ2:mode>"
+  [(set (match_operand:VQ 0 "register_operand" "=w")
+	(match_operand:VQ 1 "aarch64_mem_pair_operand" "Ump"))
+   (set (match_operand:VQ2 2 "register_operand" "=w")
+	(match_operand:VQ2 3 "memory_operand" "m"))]
+  "TARGET_SIMD
+    && rtx_equal_p (XEXP (operands[3], 0),
+		    plus_constant (Pmode,
+			       XEXP (operands[1], 0),
+			       GET_MODE_SIZE (<VQ:MODE>mode)))"
+  "ldp\\t%q0, %q2, %1"
+  [(set_attr "type" "neon_ldp_q")]
+)
+
+(define_insn "vec_store_pair<VQ:mode><VQ2:mode>"
+  [(set (match_operand:VQ 0 "aarch64_mem_pair_operand" "=Ump")
+	(match_operand:VQ 1 "register_operand" "w"))
+   (set (match_operand:VQ2 2 "memory_operand" "=m")
+	(match_operand:VQ2 3 "register_operand" "w"))]
+  "TARGET_SIMD && rtx_equal_p (XEXP (operands[2], 0),
+		plus_constant (Pmode,
+			       XEXP (operands[0], 0),
+			       GET_MODE_SIZE (<VQ:MODE>mode)))"
+  "stp\\t%q1, %q3, %0"
+  [(set_attr "type" "neon_stp_q")]
+)
+
+
 (define_split
   [(set (match_operand:VQ 0 "register_operand" "")
       (match_operand:VQ 1 "register_operand" ""))]
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 89fce15e0194365a6c0a85236c3ea6b26d26e89e..77f9f8adef6155e51be7aa5551e71c688128ecfc 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -5681,7 +5681,8 @@  aarch64_mode_valid_for_sched_fusion_p (machine_mode mode)
   return mode == SImode || mode == DImode
 	 || mode == SFmode || mode == DFmode
 	 || (aarch64_vector_mode_supported_p (mode)
-	     && known_eq (GET_MODE_SIZE (mode), 8));
+	     && (known_eq (GET_MODE_SIZE (mode), 8)
+		 || known_eq (GET_MODE_SIZE (mode), 16)));
 }
 
 /* Return true if REGNO is a virtual pointer register, or an eliminable
@@ -5838,7 +5839,8 @@  aarch64_classify_address (struct aarch64_address_info *info,
 
 	  if (load_store_pair_p)
 	    return ((known_eq (GET_MODE_SIZE (mode), 4)
-		     || known_eq (GET_MODE_SIZE (mode), 8))
+		     || known_eq (GET_MODE_SIZE (mode), 8)
+		     || known_eq (GET_MODE_SIZE (mode), 16))
 		    && aarch64_offset_7bit_signed_scaled_p (mode, offset));
 	  else
 	    return (offset_9bit_signed_unscaled_p (mode, offset)
@@ -5898,7 +5900,8 @@  aarch64_classify_address (struct aarch64_address_info *info,
 
 	  if (load_store_pair_p)
 	    return ((known_eq (GET_MODE_SIZE (mode), 4)
-		     || known_eq (GET_MODE_SIZE (mode), 8))
+		     || known_eq (GET_MODE_SIZE (mode), 8)
+		     || known_eq (GET_MODE_SIZE (mode), 16))
 		    && aarch64_offset_7bit_signed_scaled_p (mode, offset));
 	  else
 	    return offset_9bit_signed_unscaled_p (mode, offset);
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index d8b78ef4c474a58071e97eced1fe95ca4e033910..9146414c335dfe16b40dcb6a4233612ae43e2926 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -84,6 +84,9 @@  (define_mode_iterator VDQ_BHSI [V8QI V16QI V4HI V8HI V2SI V4SI])
 ;; Quad vector modes.
 (define_mode_iterator VQ [V16QI V8HI V4SI V2DI V8HF V4SF V2DF])
 
+;; Copy of the above.
+(define_mode_iterator VQ2 [V16QI V8HI V4SI V2DI V8HF V4SF V2DF])
+
 ;; Quad integer vector modes.
 (define_mode_iterator VQ_I [V16QI V8HI V4SI V2DI])
 
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_q.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_q.c
new file mode 100644
index 0000000000000000000000000000000000000000..e5f04b7ad678627e08e791c05c50ae6dcf081088
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_q.c
@@ -0,0 +1,26 @@ 
+/* { dg-options "-O2 -mcpu=generic" } */
+
+typedef float float32x4_t __attribute__ ((__vector_size__ ((16))));
+
+float32x4_t arr[4][4];
+
+void
+foo (float32x4_t x, float32x4_t y)
+{
+  arr[0][1] = x;
+  arr[1][0] = y;
+  arr[2][0] = x;
+  arr[1][1] = y;
+  arr[0][2] = x;
+  arr[0][3] = y;
+  arr[1][2] = x;
+  arr[2][1] = y;
+  arr[3][0] = x;
+  arr[3][1] = y;
+  arr[2][2] = x;
+  arr[1][3] = y;
+  arr[2][3] = x;
+  arr[3][2] = y;
+}
+
+/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 7 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stp_vec_128_1.c b/gcc/testsuite/gcc.target/aarch64/stp_vec_128_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..07d98d2f36637dfa244fb0e217c39db7c2ee3913
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stp_vec_128_1.c
@@ -0,0 +1,20 @@ 
+/* { dg-do compile } */
+/* { dg-options "-Ofast" } */
+
+
+typedef int int32x4_t __attribute__ ((__vector_size__ ((16))));
+
+void
+bar (int32x4_t *foo)
+{
+  int i = 0;
+  int32x4_t val = { 3, 2, 5, 1 };
+
+  for (i = 0; i < 256; i+=2)
+    {
+      foo[i] = val;
+      foo[i+1] = val;
+    }
+}
+
+/* { dg-final { scan-assembler "stp\tq\[0-9\]+, q\[0-9\]" } } */