Fix PR84512

Message ID alpine.LSU.2.20.1802271339480.18265@zhemvz.fhfr.qr
State New
Headers show
Series
  • Fix PR84512
Related show

Commit Message

Richard Biener Feb. 27, 2018, 12:42 p.m.
With Honzas change to make the x86 backend consider the actual operation
for costing vector stmts it becomes apparent that 
vect_compute_single_scalar_iteration_cost uses the old-style target
cost hook which doesn't get enough information to distinguish different
operations.  This means instead of actual scalar multiplication cost
we cost a general scalar-stmt cost for the testcase for the scalar
iteration but cost a vector multiplication for the vectorized body
resulting in an apples-to-oranges comparison in the end.

Fixed as follows.

Bootstrap and regtest running on x86_64-unknown-linux-gnu.

Richard.

2018-02-27  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/84512
	* tree-vect-loop.c (vect_compute_single_scalar_iteration_cost):
	Do not use the estimate returned from record_stmt_cost for
	the scalar iteration cost but sum properly using add_stmt_cost.

	* gcc.dg/tree-ssa/pr84512.c: New testcase.

Comments

Tom de Vries March 16, 2018, 11:42 a.m. | #1
On 02/27/2018 01:42 PM, Richard Biener wrote:
> Index: gcc/testsuite/gcc.dg/tree-ssa/pr84512.c

> ===================================================================

> --- gcc/testsuite/gcc.dg/tree-ssa/pr84512.c	(nonexistent)

> +++ gcc/testsuite/gcc.dg/tree-ssa/pr84512.c	(working copy)

> @@ -0,0 +1,15 @@

> +/* { dg-do compile } */

> +/* { dg-options "-O3 -fdump-tree-optimized" } */

> +

> +int foo()

> +{

> +  int a[10];

> +  for(int i = 0; i < 10; ++i)

> +    a[i] = i*i;

> +  int res = 0;

> +  for(int i = 0; i < 10; ++i)

> +    res += a[i];

> +  return res;

> +}

> +

> +/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */


This fails for nvptx, because it doesn't have the required vector 
operations. To fix the fail, I've added requiring effective target 
vect_int_mult.

Thanks,
- Tom
[testsuite] Require vect_int_mult in pr84512.c

2018-03-16  Tom de Vries  <tom@codesourcery.com>

	* gcc.dg/tree-ssa/pr84512.c: Require effective target vect_int_mult.

---
 gcc/testsuite/gcc.dg/tree-ssa/pr84512.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c b/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
index 288fa5d..41b6c06 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O3 -fdump-tree-optimized" } */
+/* { dg-require-effective-target vect_int_mult } */
 
 int foo()
 {
Richard Biener March 16, 2018, 11:55 a.m. | #2
On Fri, 16 Mar 2018, Tom de Vries wrote:

> On 02/27/2018 01:42 PM, Richard Biener wrote:

> > Index: gcc/testsuite/gcc.dg/tree-ssa/pr84512.c

> > ===================================================================

> > --- gcc/testsuite/gcc.dg/tree-ssa/pr84512.c	(nonexistent)

> > +++ gcc/testsuite/gcc.dg/tree-ssa/pr84512.c	(working copy)

> > @@ -0,0 +1,15 @@

> > +/* { dg-do compile } */

> > +/* { dg-options "-O3 -fdump-tree-optimized" } */

> > +

> > +int foo()

> > +{

> > +  int a[10];

> > +  for(int i = 0; i < 10; ++i)

> > +    a[i] = i*i;

> > +  int res = 0;

> > +  for(int i = 0; i < 10; ++i)

> > +    res += a[i];

> > +  return res;

> > +}

> > +

> > +/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */

> 

> This fails for nvptx, because it doesn't have the required vector operations.

> To fix the fail, I've added requiring effective target vect_int_mult.


On targets that do not vectorize you should see the scalar loops unrolled
instead.  Or do you have only one loop vectorized?  That's precisely
what the PR was about...  which means it isn't fixed for nvptx :/

Richard.

> Thanks,

> - Tom

> 


-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
Tom de Vries March 16, 2018, 2:51 p.m. | #3
On 03/16/2018 12:55 PM, Richard Biener wrote:
> On Fri, 16 Mar 2018, Tom de Vries wrote:

> 

>> On 02/27/2018 01:42 PM, Richard Biener wrote:

>>> Index: gcc/testsuite/gcc.dg/tree-ssa/pr84512.c

>>> ===================================================================

>>> --- gcc/testsuite/gcc.dg/tree-ssa/pr84512.c	(nonexistent)

>>> +++ gcc/testsuite/gcc.dg/tree-ssa/pr84512.c	(working copy)

>>> @@ -0,0 +1,15 @@

>>> +/* { dg-do compile } */

>>> +/* { dg-options "-O3 -fdump-tree-optimized" } */

>>> +

>>> +int foo()

>>> +{

>>> +  int a[10];

>>> +  for(int i = 0; i < 10; ++i)

>>> +    a[i] = i*i;

>>> +  int res = 0;

>>> +  for(int i = 0; i < 10; ++i)

>>> +    res += a[i];

>>> +  return res;

>>> +}

>>> +

>>> +/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */

>>

>> This fails for nvptx, because it doesn't have the required vector operations.

>> To fix the fail, I've added requiring effective target vect_int_mult.

> 

> On targets that do not vectorize you should see the scalar loops unrolled

> instead.  Or do you have only one loop vectorized?


Sort of. Loop vectorization has no effect, and the scalar loops are 
completely unrolled. But then slp vectorization vectorizes the stores.

So at optimized we have:
...
   MEM[(int *)&a] = { 0, 1 };
   MEM[(int *)&a + 8B] = { 4, 9 };
   MEM[(int *)&a + 16B] = { 16, 25 };
   MEM[(int *)&a + 24B] = { 36, 49 };
   MEM[(int *)&a + 32B] = { 64, 81 };
   _6 = a[0];
   _28 = a[1];
   res_29 = _6 + _28;
   _35 = a[2];
   res_36 = res_29 + _35;
   _42 = a[3];
   res_43 = res_36 + _42;
   _49 = a[4];
   res_50 = res_43 + _49;
   _56 = a[5];
   res_57 = res_50 + _56;
   _63 = a[6];
   res_64 = res_57 + _63;
   _70 = a[7];
   res_71 = res_64 + _70;
   _77 = a[8];
   res_78 = res_71 + _77;
   _2 = a[9];
   res_11 = _2 + res_78;
   a ={v} {CLOBBER};
   return res_11;
...

The stores and loads are eliminated by dse1 in the rtl phase, and in the 
end we have:
...
.visible .func (.param.u32 %value_out) foo
{
         .reg.u32 %value;
         .local .align 16 .b8 %frame_ar[48];
         .reg.u64 %frame;
         cvta.local.u64 %frame, %frame_ar;
         mov.u32 %value, 285;
         st.param.u32    [%value_out], %value;
         ret;
}
...

> That's precisely

> what the PR was about...  which means it isn't fixed for nvptx :/


Indeed the assembly is not optimal, and would be optimal if we'd have 
optimal code at optimized.

FWIW, using this patch we generate optimal code at optimized:
...
diff --git a/gcc/passes.def b/gcc/passes.def
index 3ebcfc30349..6b64f600c4a 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -325,6 +325,7 @@ along with GCC; see the file COPYING3.  If not see
        NEXT_PASS (pass_tracer);
        NEXT_PASS (pass_thread_jumps);
        NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+      NEXT_PASS (pass_fre);
        NEXT_PASS (pass_strlen);
        NEXT_PASS (pass_thread_jumps);
        NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);
...

and we get:
...
.visible .func (.param.u32 %value_out) foo
{
         .reg.u32 %value;
         mov.u32 %value, 285;
         st.param.u32    [%value_out], %value;
         ret;
}
...

I could file a missing optimization PR for nvptx, but I'm not sure where 
this should be fixed.

Thanks,
- Tom
Richard Biener March 19, 2018, 9:11 a.m. | #4
On Fri, 16 Mar 2018, Tom de Vries wrote:

> On 03/16/2018 12:55 PM, Richard Biener wrote:

> > On Fri, 16 Mar 2018, Tom de Vries wrote:

> > 

> > > On 02/27/2018 01:42 PM, Richard Biener wrote:

> > > > Index: gcc/testsuite/gcc.dg/tree-ssa/pr84512.c

> > > > ===================================================================

> > > > --- gcc/testsuite/gcc.dg/tree-ssa/pr84512.c	(nonexistent)

> > > > +++ gcc/testsuite/gcc.dg/tree-ssa/pr84512.c	(working copy)

> > > > @@ -0,0 +1,15 @@

> > > > +/* { dg-do compile } */

> > > > +/* { dg-options "-O3 -fdump-tree-optimized" } */

> > > > +

> > > > +int foo()

> > > > +{

> > > > +  int a[10];

> > > > +  for(int i = 0; i < 10; ++i)

> > > > +    a[i] = i*i;

> > > > +  int res = 0;

> > > > +  for(int i = 0; i < 10; ++i)

> > > > +    res += a[i];

> > > > +  return res;

> > > > +}

> > > > +

> > > > +/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */

> > > 

> > > This fails for nvptx, because it doesn't have the required vector

> > > operations.

> > > To fix the fail, I've added requiring effective target vect_int_mult.

> > 

> > On targets that do not vectorize you should see the scalar loops unrolled

> > instead.  Or do you have only one loop vectorized?

> 

> Sort of. Loop vectorization has no effect, and the scalar loops are completely

> unrolled. But then slp vectorization vectorizes the stores.

> 

> So at optimized we have:

> ...

>   MEM[(int *)&a] = { 0, 1 };

>   MEM[(int *)&a + 8B] = { 4, 9 };

>   MEM[(int *)&a + 16B] = { 16, 25 };

>   MEM[(int *)&a + 24B] = { 36, 49 };

>   MEM[(int *)&a + 32B] = { 64, 81 };

>   _6 = a[0];

>   _28 = a[1];

>   res_29 = _6 + _28;

>   _35 = a[2];

>   res_36 = res_29 + _35;

>   _42 = a[3];

>   res_43 = res_36 + _42;

>   _49 = a[4];

>   res_50 = res_43 + _49;

>   _56 = a[5];

>   res_57 = res_50 + _56;

>   _63 = a[6];

>   res_64 = res_57 + _63;

>   _70 = a[7];

>   res_71 = res_64 + _70;

>   _77 = a[8];

>   res_78 = res_71 + _77;

>   _2 = a[9];

>   res_11 = _2 + res_78;

>   a ={v} {CLOBBER};

>   return res_11;

> ...

> 

> The stores and loads are eliminated by dse1 in the rtl phase, and in the end

> we have:

> ...

> .visible .func (.param.u32 %value_out) foo

> {

>         .reg.u32 %value;

>         .local .align 16 .b8 %frame_ar[48];

>         .reg.u64 %frame;

>         cvta.local.u64 %frame, %frame_ar;

>         mov.u32 %value, 285;

>         st.param.u32    [%value_out], %value;

>         ret;

> }

> ...

> 

> > That's precisely

> > what the PR was about...  which means it isn't fixed for nvptx :/

> 

> Indeed the assembly is not optimal, and would be optimal if we'd have optimal

> code at optimized.

> 

> FWIW, using this patch we generate optimal code at optimized:

> ...

> diff --git a/gcc/passes.def b/gcc/passes.def

> index 3ebcfc30349..6b64f600c4a 100644

> --- a/gcc/passes.def

> +++ b/gcc/passes.def

> @@ -325,6 +325,7 @@ along with GCC; see the file COPYING3.  If not see

>        NEXT_PASS (pass_tracer);

>        NEXT_PASS (pass_thread_jumps);

>        NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);

> +      NEXT_PASS (pass_fre);

>        NEXT_PASS (pass_strlen);

>        NEXT_PASS (pass_thread_jumps);

>        NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);

> ...

> 

> and we get:

> ...

> .visible .func (.param.u32 %value_out) foo

> {

>         .reg.u32 %value;

>         mov.u32 %value, 285;

>         st.param.u32    [%value_out], %value;

>         ret;

> }

> ...

> 

> I could file a missing optimization PR for nvptx, but I'm not sure where this

> should be fixed.


Ah, yeah... the usual issue then.

Can you please XFAIL the test on nvptx instead of requiring vect_int_mult?

Thanks,
Richard.
Tom de Vries March 20, 2018, 12:24 p.m. | #5
On 03/19/2018 10:11 AM, Richard Biener wrote:
> On Fri, 16 Mar 2018, Tom de Vries wrote:

> 

>> On 03/16/2018 12:55 PM, Richard Biener wrote:

>>> On Fri, 16 Mar 2018, Tom de Vries wrote:

>>>

>>>> On 02/27/2018 01:42 PM, Richard Biener wrote:

>>>>> Index: gcc/testsuite/gcc.dg/tree-ssa/pr84512.c

>>>>> ===================================================================

>>>>> --- gcc/testsuite/gcc.dg/tree-ssa/pr84512.c	(nonexistent)

>>>>> +++ gcc/testsuite/gcc.dg/tree-ssa/pr84512.c	(working copy)

>>>>> @@ -0,0 +1,15 @@

>>>>> +/* { dg-do compile } */

>>>>> +/* { dg-options "-O3 -fdump-tree-optimized" } */

>>>>> +

>>>>> +int foo()

>>>>> +{

>>>>> +  int a[10];

>>>>> +  for(int i = 0; i < 10; ++i)

>>>>> +    a[i] = i*i;

>>>>> +  int res = 0;

>>>>> +  for(int i = 0; i < 10; ++i)

>>>>> +    res += a[i];

>>>>> +  return res;

>>>>> +}

>>>>> +

>>>>> +/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */

>>>>

>>>> This fails for nvptx, because it doesn't have the required vector

>>>> operations.

>>>> To fix the fail, I've added requiring effective target vect_int_mult.

>>>

>>> On targets that do not vectorize you should see the scalar loops unrolled

>>> instead.  Or do you have only one loop vectorized?

>>

>> Sort of. Loop vectorization has no effect, and the scalar loops are completely

>> unrolled. But then slp vectorization vectorizes the stores.

>>

>> So at optimized we have:

>> ...

>>    MEM[(int *)&a] = { 0, 1 };

>>    MEM[(int *)&a + 8B] = { 4, 9 };

>>    MEM[(int *)&a + 16B] = { 16, 25 };

>>    MEM[(int *)&a + 24B] = { 36, 49 };

>>    MEM[(int *)&a + 32B] = { 64, 81 };

>>    _6 = a[0];

>>    _28 = a[1];

>>    res_29 = _6 + _28;

>>    _35 = a[2];

>>    res_36 = res_29 + _35;

>>    _42 = a[3];

>>    res_43 = res_36 + _42;

>>    _49 = a[4];

>>    res_50 = res_43 + _49;

>>    _56 = a[5];

>>    res_57 = res_50 + _56;

>>    _63 = a[6];

>>    res_64 = res_57 + _63;

>>    _70 = a[7];

>>    res_71 = res_64 + _70;

>>    _77 = a[8];

>>    res_78 = res_71 + _77;

>>    _2 = a[9];

>>    res_11 = _2 + res_78;

>>    a ={v} {CLOBBER};

>>    return res_11;

>> ...

>>

>> The stores and loads are eliminated by dse1 in the rtl phase, and in the end

>> we have:

>> ...

>> .visible .func (.param.u32 %value_out) foo

>> {

>>          .reg.u32 %value;

>>          .local .align 16 .b8 %frame_ar[48];

>>          .reg.u64 %frame;

>>          cvta.local.u64 %frame, %frame_ar;

>>          mov.u32 %value, 285;

>>          st.param.u32    [%value_out], %value;

>>          ret;

>> }

>> ...

>>

>>> That's precisely

>>> what the PR was about...  which means it isn't fixed for nvptx :/

>>

>> Indeed the assembly is not optimal, and would be optimal if we'd have optimal

>> code at optimized.

>>

>> FWIW, using this patch we generate optimal code at optimized:

>> ...

>> diff --git a/gcc/passes.def b/gcc/passes.def

>> index 3ebcfc30349..6b64f600c4a 100644

>> --- a/gcc/passes.def

>> +++ b/gcc/passes.def

>> @@ -325,6 +325,7 @@ along with GCC; see the file COPYING3.  If not see

>>         NEXT_PASS (pass_tracer);

>>         NEXT_PASS (pass_thread_jumps);

>>         NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);

>> +      NEXT_PASS (pass_fre);

>>         NEXT_PASS (pass_strlen);

>>         NEXT_PASS (pass_thread_jumps);

>>         NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);

>> ...

>>

>> and we get:

>> ...

>> .visible .func (.param.u32 %value_out) foo

>> {

>>          .reg.u32 %value;

>>          mov.u32 %value, 285;

>>          st.param.u32    [%value_out], %value;

>>          ret;

>> }

>> ...

>>

>> I could file a missing optimization PR for nvptx, but I'm not sure where this

>> should be fixed.

> 

> Ah, yeah... the usual issue then.

> 

> Can you please XFAIL the test on nvptx instead of requiring vect_int_mult?

> 


Done.

Committed at attached.

Thanks,
- Tom
[testsuite] Add nvptx xfail to pr84512.c

2018-03-19  Tom de Vries  <tom@codesourcery.com>

	* gcc.dg/tree-ssa/pr84512.c: Don't require effective target
	vect_int_mult.  Add nvptx xfail for PR84958.

---
 gcc/testsuite/ChangeLog                 | 5 +++++
 gcc/testsuite/gcc.dg/tree-ssa/pr84512.c | 4 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c b/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
index 41b6c06..9560160 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
@@ -1,6 +1,5 @@
 /* { dg-do compile } */
 /* { dg-options "-O3 -fdump-tree-optimized" } */
-/* { dg-require-effective-target vect_int_mult } */
 
 int foo()
 {
@@ -13,4 +12,5 @@ int foo()
   return res;
 }
 
-/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */
+/* Target nvptx xfail due to PR84958.  */
+/* { dg-final { scan-tree-dump "return 285;" "optimized" { xfail nvptx*-*-* } } } */
Rainer Orth March 20, 2018, 8:14 p.m. | #6
Hi Tom,

> On 03/19/2018 10:11 AM, Richard Biener wrote:

>> On Fri, 16 Mar 2018, Tom de Vries wrote:

>>

>>> On 03/16/2018 12:55 PM, Richard Biener wrote:

>>>> On Fri, 16 Mar 2018, Tom de Vries wrote:

>>>>

>>>>> On 02/27/2018 01:42 PM, Richard Biener wrote:

>>>>>> Index: gcc/testsuite/gcc.dg/tree-ssa/pr84512.c

>>>>>> ===================================================================

>>>>>> --- gcc/testsuite/gcc.dg/tree-ssa/pr84512.c	(nonexistent)

>>>>>> +++ gcc/testsuite/gcc.dg/tree-ssa/pr84512.c	(working copy)

>>>>>> @@ -0,0 +1,15 @@

>>>>>> +/* { dg-do compile } */

>>>>>> +/* { dg-options "-O3 -fdump-tree-optimized" } */

>>>>>> +

>>>>>> +int foo()

>>>>>> +{

>>>>>> +  int a[10];

>>>>>> +  for(int i = 0; i < 10; ++i)

>>>>>> +    a[i] = i*i;

>>>>>> +  int res = 0;

>>>>>> +  for(int i = 0; i < 10; ++i)

>>>>>> +    res += a[i];

>>>>>> +  return res;

>>>>>> +}

>>>>>> +

>>>>>> +/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */

>>>>>

>>>>> This fails for nvptx, because it doesn't have the required vector

>>>>> operations.

>>>>> To fix the fail, I've added requiring effective target vect_int_mult.

>>>>

>>>> On targets that do not vectorize you should see the scalar loops unrolled

>>>> instead.  Or do you have only one loop vectorized?

>>>

>>> Sort of. Loop vectorization has no effect, and the scalar loops are completely

>>> unrolled. But then slp vectorization vectorizes the stores.

>>>

>>> So at optimized we have:

>>> ...

>>>    MEM[(int *)&a] = { 0, 1 };

>>>    MEM[(int *)&a + 8B] = { 4, 9 };

>>>    MEM[(int *)&a + 16B] = { 16, 25 };

>>>    MEM[(int *)&a + 24B] = { 36, 49 };

>>>    MEM[(int *)&a + 32B] = { 64, 81 };

>>>    _6 = a[0];

>>>    _28 = a[1];

>>>    res_29 = _6 + _28;

>>>    _35 = a[2];

>>>    res_36 = res_29 + _35;

>>>    _42 = a[3];

>>>    res_43 = res_36 + _42;

>>>    _49 = a[4];

>>>    res_50 = res_43 + _49;

>>>    _56 = a[5];

>>>    res_57 = res_50 + _56;

>>>    _63 = a[6];

>>>    res_64 = res_57 + _63;

>>>    _70 = a[7];

>>>    res_71 = res_64 + _70;

>>>    _77 = a[8];

>>>    res_78 = res_71 + _77;

>>>    _2 = a[9];

>>>    res_11 = _2 + res_78;

>>>    a ={v} {CLOBBER};

>>>    return res_11;

>>> ...

>>>

>>> The stores and loads are eliminated by dse1 in the rtl phase, and in the end

>>> we have:

>>> ...

>>> .visible .func (.param.u32 %value_out) foo

>>> {

>>>          .reg.u32 %value;

>>>          .local .align 16 .b8 %frame_ar[48];

>>>          .reg.u64 %frame;

>>>          cvta.local.u64 %frame, %frame_ar;

>>>          mov.u32 %value, 285;

>>>          st.param.u32    [%value_out], %value;

>>>          ret;

>>> }

>>> ...

>>>

>>>> That's precisely

>>>> what the PR was about...  which means it isn't fixed for nvptx :/

>>>

>>> Indeed the assembly is not optimal, and would be optimal if we'd have optimal

>>> code at optimized.

>>>

>>> FWIW, using this patch we generate optimal code at optimized:

>>> ...

>>> diff --git a/gcc/passes.def b/gcc/passes.def

>>> index 3ebcfc30349..6b64f600c4a 100644

>>> --- a/gcc/passes.def

>>> +++ b/gcc/passes.def

>>> @@ -325,6 +325,7 @@ along with GCC; see the file COPYING3.  If not see

>>>         NEXT_PASS (pass_tracer);

>>>         NEXT_PASS (pass_thread_jumps);

>>>         NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);

>>> +      NEXT_PASS (pass_fre);

>>>         NEXT_PASS (pass_strlen);

>>>         NEXT_PASS (pass_thread_jumps);

>>>         NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);

>>> ...

>>>

>>> and we get:

>>> ...

>>> .visible .func (.param.u32 %value_out) foo

>>> {

>>>          .reg.u32 %value;

>>>          mov.u32 %value, 285;

>>>          st.param.u32    [%value_out], %value;

>>>          ret;

>>> }

>>> ...

>>>

>>> I could file a missing optimization PR for nvptx, but I'm not sure where this

>>> should be fixed.

>>

>> Ah, yeah... the usual issue then.

>>

>> Can you please XFAIL the test on nvptx instead of requiring vect_int_mult?

>>

>

> Done.

>

> Committed at attached.


this caused the test to FAIL on 64-bit (only) sparc-sun-solaris2.11:

FAIL: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"

where it was UNSUPPORTED before.

The dump has

;; Function foo (foo, funcdef_no=0, decl_uid=1557, cgraph_uid=0, symbol_order=0)

foo ()
{
  int res;
  int a[10];
  int _2;
  int _6;
  int _28;
  int _35;
  int _42;
  int _49;
  int _56;
  int _63;
  int _70;
  int _77;

  <bb 2> [local count: 97603132]:
  MEM[(int *)&a] = { 0, 1 };
  MEM[(int *)&a + 8B] = { 4, 9 };
  MEM[(int *)&a + 16B] = { 16, 25 };
  MEM[(int *)&a + 24B] = { 36, 49 };
  MEM[(int *)&a + 32B] = { 64, 81 };
  _6 = a[0];
  _28 = a[1];
  res_29 = _6 + _28;
  _35 = a[2];
  res_36 = res_29 + _35;
  _42 = a[3];
  res_43 = res_36 + _42;
  _49 = a[4];
  res_50 = res_43 + _49;
  _56 = a[5];
  res_57 = res_50 + _56;
  _63 = a[6];
  res_64 = res_57 + _63;
  _70 = a[7];
  res_71 = res_64 + _70;
  _77 = a[8];
  res_78 = res_71 + _77;
  _2 = a[9];
  res_11 = _2 + res_78;
  a ={v} {CLOBBER};
  return res_11;

}

	Rainer

-- 
-----------------------------------------------------------------------------
Rainer Orth, Center for Biotechnology, Bielefeld University
Richard Biener March 21, 2018, 8:15 a.m. | #7
On Tue, 20 Mar 2018, Rainer Orth wrote:

> Hi Tom,

> 

> > On 03/19/2018 10:11 AM, Richard Biener wrote:

> >> On Fri, 16 Mar 2018, Tom de Vries wrote:

> >>

> >>> On 03/16/2018 12:55 PM, Richard Biener wrote:

> >>>> On Fri, 16 Mar 2018, Tom de Vries wrote:

> >>>>

> >>>>> On 02/27/2018 01:42 PM, Richard Biener wrote:

> >>>>>> Index: gcc/testsuite/gcc.dg/tree-ssa/pr84512.c

> >>>>>> ===================================================================

> >>>>>> --- gcc/testsuite/gcc.dg/tree-ssa/pr84512.c	(nonexistent)

> >>>>>> +++ gcc/testsuite/gcc.dg/tree-ssa/pr84512.c	(working copy)

> >>>>>> @@ -0,0 +1,15 @@

> >>>>>> +/* { dg-do compile } */

> >>>>>> +/* { dg-options "-O3 -fdump-tree-optimized" } */

> >>>>>> +

> >>>>>> +int foo()

> >>>>>> +{

> >>>>>> +  int a[10];

> >>>>>> +  for(int i = 0; i < 10; ++i)

> >>>>>> +    a[i] = i*i;

> >>>>>> +  int res = 0;

> >>>>>> +  for(int i = 0; i < 10; ++i)

> >>>>>> +    res += a[i];

> >>>>>> +  return res;

> >>>>>> +}

> >>>>>> +

> >>>>>> +/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */

> >>>>>

> >>>>> This fails for nvptx, because it doesn't have the required vector

> >>>>> operations.

> >>>>> To fix the fail, I've added requiring effective target vect_int_mult.

> >>>>

> >>>> On targets that do not vectorize you should see the scalar loops unrolled

> >>>> instead.  Or do you have only one loop vectorized?

> >>>

> >>> Sort of. Loop vectorization has no effect, and the scalar loops are completely

> >>> unrolled. But then slp vectorization vectorizes the stores.

> >>>

> >>> So at optimized we have:

> >>> ...

> >>>    MEM[(int *)&a] = { 0, 1 };

> >>>    MEM[(int *)&a + 8B] = { 4, 9 };

> >>>    MEM[(int *)&a + 16B] = { 16, 25 };

> >>>    MEM[(int *)&a + 24B] = { 36, 49 };

> >>>    MEM[(int *)&a + 32B] = { 64, 81 };

> >>>    _6 = a[0];

> >>>    _28 = a[1];

> >>>    res_29 = _6 + _28;

> >>>    _35 = a[2];

> >>>    res_36 = res_29 + _35;

> >>>    _42 = a[3];

> >>>    res_43 = res_36 + _42;

> >>>    _49 = a[4];

> >>>    res_50 = res_43 + _49;

> >>>    _56 = a[5];

> >>>    res_57 = res_50 + _56;

> >>>    _63 = a[6];

> >>>    res_64 = res_57 + _63;

> >>>    _70 = a[7];

> >>>    res_71 = res_64 + _70;

> >>>    _77 = a[8];

> >>>    res_78 = res_71 + _77;

> >>>    _2 = a[9];

> >>>    res_11 = _2 + res_78;

> >>>    a ={v} {CLOBBER};

> >>>    return res_11;

> >>> ...

> >>>

> >>> The stores and loads are eliminated by dse1 in the rtl phase, and in the end

> >>> we have:

> >>> ...

> >>> .visible .func (.param.u32 %value_out) foo

> >>> {

> >>>          .reg.u32 %value;

> >>>          .local .align 16 .b8 %frame_ar[48];

> >>>          .reg.u64 %frame;

> >>>          cvta.local.u64 %frame, %frame_ar;

> >>>          mov.u32 %value, 285;

> >>>          st.param.u32    [%value_out], %value;

> >>>          ret;

> >>> }

> >>> ...

> >>>

> >>>> That's precisely

> >>>> what the PR was about...  which means it isn't fixed for nvptx :/

> >>>

> >>> Indeed the assembly is not optimal, and would be optimal if we'd have optimal

> >>> code at optimized.

> >>>

> >>> FWIW, using this patch we generate optimal code at optimized:

> >>> ...

> >>> diff --git a/gcc/passes.def b/gcc/passes.def

> >>> index 3ebcfc30349..6b64f600c4a 100644

> >>> --- a/gcc/passes.def

> >>> +++ b/gcc/passes.def

> >>> @@ -325,6 +325,7 @@ along with GCC; see the file COPYING3.  If not see

> >>>         NEXT_PASS (pass_tracer);

> >>>         NEXT_PASS (pass_thread_jumps);

> >>>         NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);

> >>> +      NEXT_PASS (pass_fre);

> >>>         NEXT_PASS (pass_strlen);

> >>>         NEXT_PASS (pass_thread_jumps);

> >>>         NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);

> >>> ...

> >>>

> >>> and we get:

> >>> ...

> >>> .visible .func (.param.u32 %value_out) foo

> >>> {

> >>>          .reg.u32 %value;

> >>>          mov.u32 %value, 285;

> >>>          st.param.u32    [%value_out], %value;

> >>>          ret;

> >>> }

> >>> ...

> >>>

> >>> I could file a missing optimization PR for nvptx, but I'm not sure where this

> >>> should be fixed.

> >>

> >> Ah, yeah... the usual issue then.

> >>

> >> Can you please XFAIL the test on nvptx instead of requiring vect_int_mult?

> >>

> >

> > Done.

> >

> > Committed at attached.

> 

> this caused the test to FAIL on 64-bit (only) sparc-sun-solaris2.11:

> 

> FAIL: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"

> 

> where it was UNSUPPORTED before.


So it failed before Toms original patch.  Please add sparc-solaris
to the list of XFAILed targets.

> The dump has

> 

> ;; Function foo (foo, funcdef_no=0, decl_uid=1557, cgraph_uid=0, symbol_order=0)

> 

> foo ()

> {

>   int res;

>   int a[10];

>   int _2;

>   int _6;

>   int _28;

>   int _35;

>   int _42;

>   int _49;

>   int _56;

>   int _63;

>   int _70;

>   int _77;

> 

>   <bb 2> [local count: 97603132]:

>   MEM[(int *)&a] = { 0, 1 };

>   MEM[(int *)&a + 8B] = { 4, 9 };

>   MEM[(int *)&a + 16B] = { 16, 25 };

>   MEM[(int *)&a + 24B] = { 36, 49 };

>   MEM[(int *)&a + 32B] = { 64, 81 };

>   _6 = a[0];

>   _28 = a[1];

>   res_29 = _6 + _28;

>   _35 = a[2];

>   res_36 = res_29 + _35;

>   _42 = a[3];

>   res_43 = res_36 + _42;

>   _49 = a[4];

>   res_50 = res_43 + _49;

>   _56 = a[5];

>   res_57 = res_50 + _56;

>   _63 = a[6];

>   res_64 = res_57 + _63;

>   _70 = a[7];

>   res_71 = res_64 + _70;

>   _77 = a[8];

>   res_78 = res_71 + _77;

>   _2 = a[9];

>   res_11 = _2 + res_78;

>   a ={v} {CLOBBER};

>   return res_11;

> 

> }

> 

> 	Rainer

> 

> 


-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
Eric Botcazou March 21, 2018, 8:59 a.m. | #8
> So it failed before Toms original patch.  Please add sparc-solaris

> to the list of XFAILed targets.


SPARC/Linux is affected too so sparc*-*-* instead.

-- 
Eric Botcazou
Rainer Orth March 21, 2018, 6 p.m. | #9
Hi Eric,

>> So it failed before Toms original patch.  Please add sparc-solaris

>> to the list of XFAILed targets.

>

> SPARC/Linux is affected too so sparc*-*-* instead.


actually, it's sparc*-*-* && lp64 only.  Done like this after testing on
sparc-sun-solaris2.11 and i386-pc-solaris2.11.

	Rainer

-- 
-----------------------------------------------------------------------------
Rainer Orth, Center for Biotechnology, Bielefeld University


2018-03-21  Rainer Orth  <ro@CeBiTec.Uni-Bielefeld.DE>

	* gcc.dg/tree-ssa/pr84512.c: xfail on 64-bit SPARC.
# HG changeset patch
# Parent  50996d41bbbc78ab2cf0002ba6479559089a2337
xfail gcc.dg/tree-ssa/pr84512.c on 64-bit sparc

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c b/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
@@ -13,4 +13,4 @@ int foo()
 }
 
 /* Target nvptx xfail due to PR84958.  */
-/* { dg-final { scan-tree-dump "return 285;" "optimized" { xfail nvptx*-*-* } } } */
+/* { dg-final { scan-tree-dump "return 285;" "optimized" { xfail { nvptx*-*-* || { sparc*-*-* && lp64 } } } } } */

Patch

Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	(revision 258030)
+++ gcc/tree-vect-loop.c	(working copy)
@@ -1384,16 +1384,10 @@  vect_compute_single_scalar_iteration_cos
 {
   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
-  int nbbs = loop->num_nodes, factor, scalar_single_iter_cost = 0;
+  int nbbs = loop->num_nodes, factor;
   int innerloop_iters, i;
 
-  /* Count statements in scalar loop.  Using this as scalar cost for a single
-     iteration for now.
-
-     TODO: Add outer loop support.
-
-     TODO: Consider assigning different costs to different scalar
-     statements.  */
+  /* Gather costs for statements in the scalar loop.  */
 
   /* FORNOW.  */
   innerloop_iters = 1;
@@ -1437,13 +1431,28 @@  vect_compute_single_scalar_iteration_cos
           else
             kind = scalar_stmt;
 
-	  scalar_single_iter_cost
-	    += record_stmt_cost (&LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
-				 factor, kind, stmt_info, 0, vect_prologue);
+	  record_stmt_cost (&LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
+			    factor, kind, stmt_info, 0, vect_prologue);
         }
     }
-  LOOP_VINFO_SINGLE_SCALAR_ITERATION_COST (loop_vinfo)
-    = scalar_single_iter_cost;
+
+  /* Now accumulate cost.  */
+  void *target_cost_data = init_cost (loop);
+  stmt_info_for_cost *si;
+  int j;
+  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
+		    j, si)
+    {
+      struct _stmt_vec_info *stmt_info
+	= si->stmt ? vinfo_for_stmt (si->stmt) : NULL;
+      (void) add_stmt_cost (target_cost_data, si->count,
+			    si->kind, stmt_info, si->misalign,
+			    vect_body);
+    }
+  unsigned dummy, body_cost = 0;
+  finish_cost (target_cost_data, &dummy, &body_cost, &dummy);
+  destroy_cost_data (target_cost_data);
+  LOOP_VINFO_SINGLE_SCALAR_ITERATION_COST (loop_vinfo) = body_cost;
 }
 
 
Index: gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/pr84512.c	(nonexistent)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr84512.c	(working copy)
@@ -0,0 +1,15 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized" } */
+
+int foo()
+{
+  int a[10];
+  for(int i = 0; i < 10; ++i)
+    a[i] = i*i;
+  int res = 0;
+  for(int i = 0; i < 10; ++i)
+    res += a[i];
+  return res;
+}
+
+/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */