Don't use permutes for single-element accesses (PR83753)

Message ID 87lgh6bsmr.fsf@linaro.org
State New

Commit Message

Richard Sandiford Jan. 9, 2018, 9:59 p.m.
After cunrolling the inner loop, the remaining loop in the testcase
has a single 32-bit access and a group of 64-bit accesses.  We first
try to vectorise at 128 bits (VF 4), but decide not to for cost reasons.
We then try with 64 bits (VF 2) instead.  This means that the group
of 64-bit accesses uses a single-element vector, which is deliberately
supported as of r251538.  We then try to create "permutes" for these
single-element vectors and fall foul of:

	      for (i = 0; i < 6; i++)
		sel[i] += exact_div (nelt, 2);

in vect_grouped_store_supported, since nelt==1.
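
To make the arithmetic concrete, here is a minimal stand-alone sketch, not GCC code. The helpers `elements_per_vector`, `try_exact_div` and `offset_selector` are hypothetical names introduced only for illustration. Unlike GCC's `exact_div`, `try_exact_div` reports an inexact division by returning false. The sketch shows that a 64-bit vector of 64-bit elements has one element, and that halving that element count is not exact.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: number of elements in a vector of VECTOR_BITS
   bits whose elements are ELEMENT_BITS bits wide.  */
static unsigned
elements_per_vector (unsigned vector_bits, unsigned element_bits)
{
  return vector_bits / element_bits;
}

/* Hypothetical stand-in for an exact-division helper.  Returns false
   when DIVISOR does not divide VALUE exactly, and leaves *QUOTIENT
   untouched in that case.  */
static bool
try_exact_div (unsigned value, unsigned divisor, unsigned *quotient)
{
  if (divisor == 0 || value % divisor != 0)
    return false;
  *quotient = value / divisor;
  return true;
}

/* Mimic the quoted loop: offset each selector index by NELT / 2.
   Returns false without modifying SEL when NELT / 2 is not exact,
   which is what happens for NELT == 1.  */
static bool
offset_selector (unsigned sel[6], unsigned nelt)
{
  unsigned half;
  if (!try_exact_div (nelt, 2, &half))
    return false;
  for (int i = 0; i < 6; i++)
    sel[i] += half;
  return true;
}
```

With 128-bit vectors the 32-bit access gets four elements, so the halving is fine. With 64-bit vectors the 64-bit group gets a single element and the halving step has no exact result.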

Maybe we shouldn't even be trying to vectorise statements in the
single-element case, and instead just copy the scalar statement
for each member of the group.  But until then, this patch treats
non-strided grouped accesses as VMAT_CONTIGUOUS if no permutation
is necessary.
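
The resulting order of checks can be sketched as follows. This is a simplified model rather than the real `get_group_load_store_type`: the enum, the function `choose_access_type` and its boolean parameters are hypothetical, and the final permute fallback is an assumption about what follows the lanes check, since that code is not part of this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified, hypothetical model of the access kinds named in the
   patch, plus an assumed permute fallback.  */
enum access_type
{
  ELEMENTWISE,
  CONTIGUOUS,
  LOAD_STORE_LANES,
  CONTIGUOUS_PERMUTE
};

/* Pick an access type for a non-strided group.  NELT is the number of
   elements in the vector type.  LANES_SUPPORTED and PERMUTE_SUPPORTED
   stand in for the target queries.  */
static enum access_type
choose_access_type (unsigned nelt, bool lanes_supported,
                    bool permute_supported)
{
  enum access_type type = ELEMENTWISE;

  /* First cope with the degenerate single-element vector: no
     permutation is needed, so a plain contiguous access will do.  */
  if (nelt == 1)
    type = CONTIGUOUS;

  /* Otherwise try load/store lanes.  */
  if (type == ELEMENTWISE && lanes_supported)
    type = LOAD_STORE_LANES;

  /* Assumed fallback: permuting accesses, reached only when nothing
     above has already picked a type.  */
  if (type == ELEMENTWISE && permute_supported)
    type = CONTIGUOUS_PERMUTE;

  return type;
}
```

Because the single-element check comes first and the later checks are guarded on the type still being the elementwise default, the permute path is never reached when the vector has one element.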

Tested on aarch64-linux-gnu, x86_64-linux-gnu and powerpc64le-linux-gnu.
OK to install?

Richard


2018-01-09  Richard Sandiford  <richard.sandiford@linaro.org>

gcc/
	PR tree-optimization/83753
	* tree-vect-stmts.c (get_group_load_store_type): Use VMAT_CONTIGUOUS
	for non-strided grouped accesses if the number of elements is 1.

gcc/testsuite/
	PR tree-optimization/83753
	* gcc.dg/torture/pr83753.c: New test.

Comments

Richard Biener Jan. 10, 2018, 1:02 p.m. | #1
Ok.

Richard.

Patch

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	2018-01-09 15:46:34.439449019 +0000
+++ gcc/tree-vect-stmts.c	2018-01-09 18:15:53.481983778 +0000
@@ -1849,10 +1849,16 @@  get_group_load_store_type (gimple *stmt,
 	  && (can_overrun_p || !would_overrun_p)
 	  && compare_step_with_zero (stmt) > 0)
 	{
-	  /* First try using LOAD/STORE_LANES.  */
-	  if (vls_type == VLS_LOAD
-	      ? vect_load_lanes_supported (vectype, group_size)
-	      : vect_store_lanes_supported (vectype, group_size))
+	  /* First cope with the degenerate case of a single-element
+	     vector.  */
+	  if (known_eq (TYPE_VECTOR_SUBPARTS (vectype), 1U))
+	    *memory_access_type = VMAT_CONTIGUOUS;
+
+	  /* Otherwise try using LOAD/STORE_LANES.  */
+	  if (*memory_access_type == VMAT_ELEMENTWISE
+	      && (vls_type == VLS_LOAD
+		  ? vect_load_lanes_supported (vectype, group_size)
+		  : vect_store_lanes_supported (vectype, group_size)))
 	    {
 	      *memory_access_type = VMAT_LOAD_STORE_LANES;
 	      overrun_p = would_overrun_p;
Index: gcc/testsuite/gcc.dg/torture/pr83753.c
===================================================================
--- /dev/null	2018-01-08 18:48:58.045015662 +0000
+++ gcc/testsuite/gcc.dg/torture/pr83753.c	2018-01-09 18:15:53.480983817 +0000
@@ -0,0 +1,19 @@ 
+/* { dg-do compile } */
+/* { dg-options "-mcpu=xgene1" { target aarch64*-*-* } } */
+
+typedef struct {
+  int m1[10];
+  double m2[10][8];
+} blah;
+
+void
+foo (blah *info) {
+  int i, d;
+
+  for (d=0; d<10; d++) {
+    info->m1[d] = 0;
+    info->m2[d][0] = 1;
+    for (i=1; i<8; i++)
+      info->m2[d][i] = 2;
+  }
+}