[rs6000] avoid using unaligned vsx or lxvd2x/stxvd2x for memcpy/memmove inline expansion

Message ID 0a17416b-57a0-99e7-2e7e-90a63da66fe6@linux.ibm.com
State New

Commit Message

Aaron Sawdey Dec. 19, 2018, 7:53 p.m.
Because of POWER9 DD2.1 issues with certain unaligned vsx instructions
to cache-inhibited memory, here is a patch that keeps memmove (and memcpy)
inline expansion from using unaligned vector accesses or any vector
load/store other than lvx/stvx. More description of the issue is here:

https://patchwork.ozlabs.org/patch/814059/
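
As a minimal illustration (not part of the patch), this is the kind of
fixed-size copy whose inline expansion changes: with no alignment
information the expander no longer emits unaligned vector loads/stores,
and the vector path is only taken when 16-byte alignment is known, in
which case it now uses lvx/stvx instead of lxvd2x/stxvd2x.

#include <string.h>

/* 32-byte copy through plain pointers: alignment unknown, so this is no
   longer expanded with unaligned vector accesses.  */
void
copy32 (void *dst, const void *src)
{
  memcpy (dst, src, 32);
}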

OK for trunk if bootstrap/regtest ok?

Thanks!
   Aaron

2018-12-19  Aaron Sawdey  <acsawdey@linux.ibm.com>

	* config/rs6000/rs6000-string.c (expand_block_move): Don't use
	unaligned vsx and avoid lxvd2x/stxvd2x.
	(gen_lvx_v4si_move): New function.





-- 
Aaron Sawdey, Ph.D.  acsawdey@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain

Comments

Segher Boessenkool Dec. 20, 2018, 9:51 a.m. | #1
On Wed, Dec 19, 2018 at 01:53:05PM -0600, Aaron Sawdey wrote:
> Because of POWER9 dd2.1 issues with certain unaligned vsx instructions
> to cache inhibited memory, here is a patch that keeps memmove (and memcpy)
> inline expansion from doing unaligned vector or using vector load/store
> other than lvx/stvx. More description of the issue is here:
> 
> https://patchwork.ozlabs.org/patch/814059/
> 
> OK for trunk if bootstrap/regtest ok?

Okay, but see below.

> 2018-12-19  Aaron Sawdey  <acsawdey@linux.ibm.com>
> 
> 	* config/rs6000/rs6000-string.c (expand_block_move): Don't use
> 	unaligned vsx and avoid lxvd2x/stxvd2x.
> 	(gen_lvx_v4si_move): New function.

> +static rtx
> +gen_lvx_v4si_move (rtx dest, rtx src)
> +{
> +  rtx rv = NULL;
> +  if (MEM_P (dest))
> +    {
> +      gcc_assert (!MEM_P (src));
> +      gcc_assert (GET_MODE (src) == V4SImode);
> +      rv = gen_altivec_stvx_v4si_internal (dest, src);
> +    }
> +  else if (MEM_P (src))
> +    {
> +      gcc_assert (!MEM_P (dest));
> +      gcc_assert (GET_MODE (dest) == V4SImode);
> +      rv = gen_altivec_lvx_v4si_internal (dest, src);
> +    }
> +  else
> +    gcc_unreachable ();
> +
> +  return rv;
> +}

This is extraordinarily clumsy :-)  Maybe something like:

static rtx
gen_lvx_v4si_move (rtx dest, rtx src)
{
  gcc_assert (!(MEM_P (dest) && MEM_P (src));
  gcc_assert (GET_MODE (dest) == V4SImode && GET_MODE (src) == V4SImode);
  if (MEM_P (dest))
    return gen_altivec_stvx_v4si_internal (dest, src);
  else if (MEM_P (src))
    return gen_altivec_lvx_v4si_internal (dest, src);
  else
    gcc_unreachable ();
}

(Or do you allow VOIDmode for src as well?)  Anyway, at least get rid of
the useless extra variable.

Thanks!


Segher
Aaron Sawdey Dec. 20, 2018, 11:34 p.m. | #2
On 12/20/18 3:51 AM, Segher Boessenkool wrote:
> On Wed, Dec 19, 2018 at 01:53:05PM -0600, Aaron Sawdey wrote:
>> Because of POWER9 dd2.1 issues with certain unaligned vsx instructions
>> to cache inhibited memory, here is a patch that keeps memmove (and memcpy)
>> inline expansion from doing unaligned vector or using vector load/store
>> other than lvx/stvx. More description of the issue is here:
>>
>> https://patchwork.ozlabs.org/patch/814059/
>>
>> OK for trunk if bootstrap/regtest ok?
> 
> Okay, but see below.
> 
[snip]
> 
> This is extraordinarily clumsy :-)  Maybe something like:
> 
> static rtx
> gen_lvx_v4si_move (rtx dest, rtx src)
> {
>   gcc_assert (!(MEM_P (dest) && MEM_P (src));
>   gcc_assert (GET_MODE (dest) == V4SImode && GET_MODE (src) == V4SImode);
>   if (MEM_P (dest))
>     return gen_altivec_stvx_v4si_internal (dest, src);
>   else if (MEM_P (src))
>     return gen_altivec_lvx_v4si_internal (dest, src);
>   else
>     gcc_unreachable ();
> }
> 
> (Or do you allow VOIDmode for src as well?)  Anyway, at least get rid of
> the useless extra variable.

I think this should be better:

static rtx
gen_lvx_v4si_move (rtx dest, rtx src)
{
  gcc_assert ((MEM_P (dest) && !MEM_P (src)) || (MEM_P (src) && !MEM_P(dest)));
  gcc_assert (GET_MODE (dest) == V4SImode && GET_MODE (src) == V4SImode);
  if (MEM_P (dest))
      return gen_altivec_stvx_v4si_internal (dest, src);
  else if (MEM_P (src))
      return gen_altivec_lvx_v4si_internal (dest, src);
  gcc_unreachable ();
}

I'll commit after I re-regstrap.

Thanks!
   Aaron

-- 
Aaron Sawdey, Ph.D.  acsawdey@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
Segher Boessenkool Dec. 20, 2018, 11:44 p.m. | #3
On Thu, Dec 20, 2018 at 05:34:54PM -0600, Aaron Sawdey wrote:
> On 12/20/18 3:51 AM, Segher Boessenkool wrote:
> > On Wed, Dec 19, 2018 at 01:53:05PM -0600, Aaron Sawdey wrote:
> >> Because of POWER9 dd2.1 issues with certain unaligned vsx instructions
> >> to cache inhibited memory, here is a patch that keeps memmove (and memcpy)
> >> inline expansion from doing unaligned vector or using vector load/store
> >> other than lvx/stvx. More description of the issue is here:
> >>
> >> https://patchwork.ozlabs.org/patch/814059/
> >>
> >> OK for trunk if bootstrap/regtest ok?
> > 
> > Okay, but see below.
> > 
> [snip]
> > 
> > This is extraordinarily clumsy :-)  Maybe something like:
> > 
> > static rtx
> > gen_lvx_v4si_move (rtx dest, rtx src)
> > {
> >   gcc_assert (!(MEM_P (dest) && MEM_P (src));
> >   gcc_assert (GET_MODE (dest) == V4SImode && GET_MODE (src) == V4SImode);
> >   if (MEM_P (dest))
> >     return gen_altivec_stvx_v4si_internal (dest, src);
> >   else if (MEM_P (src))
> >     return gen_altivec_lvx_v4si_internal (dest, src);
> >   else
> >     gcc_unreachable ();
> > }
> > 
> > (Or do you allow VOIDmode for src as well?)  Anyway, at least get rid of
> > the useless extra variable.
> 
> I think this should be better:

The gcc_unreachable at the end catches the non-mem to non-mem case.

> static rtx
> gen_lvx_v4si_move (rtx dest, rtx src)
> {
>   gcc_assert ((MEM_P (dest) && !MEM_P (src)) || (MEM_P (src) && !MEM_P(dest)));

But if you prefer this, how about

{
  gcc_assert (MEM_P (dest) ^ MEM_P (src));
  gcc_assert (GET_MODE (dest) == V4SImode && GET_MODE (src) == V4SImode);

  if (MEM_P (dest))
    return gen_altivec_stvx_v4si_internal (dest, src);
  else
    return gen_altivec_lvx_v4si_internal (dest, src);
}

:-)


Segher
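
As an aside, the xor form works because MEM_P expands to a 0-or-1
comparison, so the assert holds exactly when one operand is a MEM and the
other is not. A minimal standalone sketch (not GCC code; the helper
"check" is hypothetical) showing the two assert forms accept the same
cases:

#include <assert.h>

static void
check (int dest_is_mem, int src_is_mem)  /* stand-ins for MEM_P (dest), MEM_P (src) */
{
  /* Segher's short form and Aaron's longer form pass or fail together.  */
  assert (dest_is_mem ^ src_is_mem);
  assert ((dest_is_mem && !src_is_mem) || (src_is_mem && !dest_is_mem));
}

int
main (void)
{
  check (1, 0);  /* store: destination is memory */
  check (0, 1);  /* load: source is memory */
  return 0;
}
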
Aaron Sawdey Dec. 20, 2018, 11:47 p.m. | #4
On 12/20/18 5:44 PM, Segher Boessenkool wrote:
> On Thu, Dec 20, 2018 at 05:34:54PM -0600, Aaron Sawdey wrote:
>> On 12/20/18 3:51 AM, Segher Boessenkool wrote:
>>> On Wed, Dec 19, 2018 at 01:53:05PM -0600, Aaron Sawdey wrote:
>>>> Because of POWER9 dd2.1 issues with certain unaligned vsx instructions
>>>> to cache inhibited memory, here is a patch that keeps memmove (and memcpy)
>>>> inline expansion from doing unaligned vector or using vector load/store
>>>> other than lvx/stvx. More description of the issue is here:
>>>>
>>>> https://patchwork.ozlabs.org/patch/814059/
>>>>
>>>> OK for trunk if bootstrap/regtest ok?
>>>
>>> Okay, but see below.
>>>
>> [snip]
>>>
>>> This is extraordinarily clumsy :-)  Maybe something like:
>>>
>>> static rtx
>>> gen_lvx_v4si_move (rtx dest, rtx src)
>>> {
>>>   gcc_assert (!(MEM_P (dest) && MEM_P (src));
>>>   gcc_assert (GET_MODE (dest) == V4SImode && GET_MODE (src) == V4SImode);
>>>   if (MEM_P (dest))
>>>     return gen_altivec_stvx_v4si_internal (dest, src);
>>>   else if (MEM_P (src))
>>>     return gen_altivec_lvx_v4si_internal (dest, src);
>>>   else
>>>     gcc_unreachable ();
>>> }
>>>
>>> (Or do you allow VOIDmode for src as well?)  Anyway, at least get rid of
>>> the useless extra variable.
>>
>> I think this should be better:
> 
> The gcc_unreachable at the end catches the non-mem to non-mem case.
> 
>> static rtx
>> gen_lvx_v4si_move (rtx dest, rtx src)
>> {
>>   gcc_assert ((MEM_P (dest) && !MEM_P (src)) || (MEM_P (src) && !MEM_P(dest)));
> 
> But if you prefer this, how about
> 
> {
>   gcc_assert (MEM_P (dest) ^ MEM_P (src));
>   gcc_assert (GET_MODE (dest) == V4SImode && GET_MODE (src) == V4SImode);
> 
>   if (MEM_P (dest))
>     return gen_altivec_stvx_v4si_internal (dest, src);
>   else
>     return gen_altivec_lvx_v4si_internal (dest, src);
> }
> 
> :-)
> 
> 
> Segher
> 

I like that even better, thanks!

-- 
Aaron Sawdey, Ph.D.  acsawdey@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
Aaron Sawdey Jan. 14, 2019, 6:49 p.m. | #5
The patch for this was committed to trunk as 267562 (see below). Is this also ok for backport to 8?

Thanks,
   Aaron

On 12/20/18 5:44 PM, Segher Boessenkool wrote:
> On Thu, Dec 20, 2018 at 05:34:54PM -0600, Aaron Sawdey wrote:
>> On 12/20/18 3:51 AM, Segher Boessenkool wrote:
>>> On Wed, Dec 19, 2018 at 01:53:05PM -0600, Aaron Sawdey wrote:
>>>> Because of POWER9 dd2.1 issues with certain unaligned vsx instructions
>>>> to cache inhibited memory, here is a patch that keeps memmove (and memcpy)
>>>> inline expansion from doing unaligned vector or using vector load/store
>>>> other than lvx/stvx. More description of the issue is here:
>>>>
>>>> https://patchwork.ozlabs.org/patch/814059/
>>>>
>>>> OK for trunk if bootstrap/regtest ok?
>>>
>>> Okay, but see below.
>>>
>> [snip]
>>>
>>> This is extraordinarily clumsy :-)  Maybe something like:
>>>
>>> static rtx
>>> gen_lvx_v4si_move (rtx dest, rtx src)
>>> {
>>>   gcc_assert (!(MEM_P (dest) && MEM_P (src));
>>>   gcc_assert (GET_MODE (dest) == V4SImode && GET_MODE (src) == V4SImode);
>>>   if (MEM_P (dest))
>>>     return gen_altivec_stvx_v4si_internal (dest, src);
>>>   else if (MEM_P (src))
>>>     return gen_altivec_lvx_v4si_internal (dest, src);
>>>   else
>>>     gcc_unreachable ();
>>> }
>>>
>>> (Or do you allow VOIDmode for src as well?)  Anyway, at least get rid of
>>> the useless extra variable.
>>
>> I think this should be better:
> 
> The gcc_unreachable at the end catches the non-mem to non-mem case.
> 
>> static rtx
>> gen_lvx_v4si_move (rtx dest, rtx src)
>> {
>>   gcc_assert ((MEM_P (dest) && !MEM_P (src)) || (MEM_P (src) && !MEM_P(dest)));
> 
> But if you prefer this, how about
> 
> {
>   gcc_assert (MEM_P (dest) ^ MEM_P (src));
>   gcc_assert (GET_MODE (dest) == V4SImode && GET_MODE (src) == V4SImode);
> 
>   if (MEM_P (dest))
>     return gen_altivec_stvx_v4si_internal (dest, src);
>   else
>     return gen_altivec_lvx_v4si_internal (dest, src);
> }
> 
> :-)
> 
> 
> Segher
> 

2019-01-03  Aaron Sawdey  <acsawdey@linux.ibm.com>

        * config/rs6000/rs6000-string.c (expand_block_move): Don't use
        unaligned vsx and avoid lxvd2x/stxvd2x.
        (gen_lvx_v4si_move): New function.


Index: gcc/config/rs6000/rs6000-string.c
===================================================================
--- gcc/config/rs6000/rs6000-string.c	(revision 267299)
+++ gcc/config/rs6000/rs6000-string.c	(working copy)
@@ -2669,6 +2669,25 @@
   return true;
 }

+/* Generate loads and stores for a move of v4si mode using lvx/stvx.
+   This uses altivec_{l,st}vx_<mode>_internal which use unspecs to
+   keep combine from changing what instruction gets used.
+
+   DEST is the destination for the data.
+   SRC is the source of the data for the move.  */
+
+static rtx
+gen_lvx_v4si_move (rtx dest, rtx src)
+{
+  gcc_assert (MEM_P (dest) ^ MEM_P (src));
+  gcc_assert (GET_MODE (dest) == V4SImode && GET_MODE (src) == V4SImode);
+
+  if (MEM_P (dest))
+    return gen_altivec_stvx_v4si_internal (dest, src);
+  else
+    return gen_altivec_lvx_v4si_internal (dest, src);
+}
+
 /* Expand a block move operation, and return 1 if successful.  Return 0
    if we should let the compiler generate normal code.

@@ -2721,11 +2740,11 @@

       /* Altivec first, since it will be faster than a string move
 	 when it applies, and usually not significantly larger.  */
-      if (TARGET_ALTIVEC && bytes >= 16 && (TARGET_EFFICIENT_UNALIGNED_VSX || align >= 128))
+      if (TARGET_ALTIVEC && bytes >= 16 && align >= 128)
 	{
 	  move_bytes = 16;
 	  mode = V4SImode;
-	  gen_func.mov = gen_movv4si;
+	  gen_func.mov = gen_lvx_v4si_move;
 	}
       else if (bytes >= 8 && TARGET_POWERPC64
 	       && (align >= 64 || !STRICT_ALIGNMENT))



-- 
Aaron Sawdey, Ph.D.  acsawdey@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
Segher Boessenkool Jan. 16, 2019, 4:27 p.m. | #6
On Mon, Jan 14, 2019 at 12:49:33PM -0600, Aaron Sawdey wrote:
> The patch for this was committed to trunk as 267562 (see below). Is this also ok for backport to 8?


Yes please.  Thanks!


Segher


> On 12/20/18 5:44 PM, Segher Boessenkool wrote:
> > On Thu, Dec 20, 2018 at 05:34:54PM -0600, Aaron Sawdey wrote:
> >> On 12/20/18 3:51 AM, Segher Boessenkool wrote:
> >>> On Wed, Dec 19, 2018 at 01:53:05PM -0600, Aaron Sawdey wrote:
> >>>> Because of POWER9 dd2.1 issues with certain unaligned vsx instructions
> >>>> to cache inhibited memory, here is a patch that keeps memmove (and memcpy)
> >>>> inline expansion from doing unaligned vector or using vector load/store
> >>>> other than lvx/stvx. More description of the issue is here:
> >>>>
> >>>> https://patchwork.ozlabs.org/patch/814059/
> >>>>
> >>>> OK for trunk if bootstrap/regtest ok?
> >>>
> >>> Okay, but see below.
> >>>
> >> [snip]
> >>>
> >>> This is extraordinarily clumsy :-)  Maybe something like:
> >>>
> >>> static rtx
> >>> gen_lvx_v4si_move (rtx dest, rtx src)
> >>> {
> >>>   gcc_assert (!(MEM_P (dest) && MEM_P (src));
> >>>   gcc_assert (GET_MODE (dest) == V4SImode && GET_MODE (src) == V4SImode);
> >>>   if (MEM_P (dest))
> >>>     return gen_altivec_stvx_v4si_internal (dest, src);
> >>>   else if (MEM_P (src))
> >>>     return gen_altivec_lvx_v4si_internal (dest, src);
> >>>   else
> >>>     gcc_unreachable ();
> >>> }
> >>>
> >>> (Or do you allow VOIDmode for src as well?)  Anyway, at least get rid of
> >>> the useless extra variable.
> >>
> >> I think this should be better:
> > 
> > The gcc_unreachable at the end catches the non-mem to non-mem case.
> > 
> >> static rtx
> >> gen_lvx_v4si_move (rtx dest, rtx src)
> >> {
> >>   gcc_assert ((MEM_P (dest) && !MEM_P (src)) || (MEM_P (src) && !MEM_P(dest)));
> > 
> > But if you prefer this, how about
> > 
> > {
> >   gcc_assert (MEM_P (dest) ^ MEM_P (src));
> >   gcc_assert (GET_MODE (dest) == V4SImode && GET_MODE (src) == V4SImode);
> > 
> >   if (MEM_P (dest))
> >     return gen_altivec_stvx_v4si_internal (dest, src);
> >   else
> >     return gen_altivec_lvx_v4si_internal (dest, src);
> > }
> > 
> > :-)
> > 
> > 
> > Segher
> > 
> 
> 2019-01-03  Aaron Sawdey  <acsawdey@linux.ibm.com>
> 
>         * config/rs6000/rs6000-string.c (expand_block_move): Don't use
>         unaligned vsx and avoid lxvd2x/stxvd2x.
>         (gen_lvx_v4si_move): New function.
> 
> 
> Index: gcc/config/rs6000/rs6000-string.c
> ===================================================================
> --- gcc/config/rs6000/rs6000-string.c	(revision 267299)
> +++ gcc/config/rs6000/rs6000-string.c	(working copy)
> @@ -2669,6 +2669,25 @@
>    return true;
>  }
> 
> +/* Generate loads and stores for a move of v4si mode using lvx/stvx.
> +   This uses altivec_{l,st}vx_<mode>_internal which use unspecs to
> +   keep combine from changing what instruction gets used.
> +
> +   DEST is the destination for the data.
> +   SRC is the source of the data for the move.  */
> +
> +static rtx
> +gen_lvx_v4si_move (rtx dest, rtx src)
> +{
> +  gcc_assert (MEM_P (dest) ^ MEM_P (src));
> +  gcc_assert (GET_MODE (dest) == V4SImode && GET_MODE (src) == V4SImode);
> +
> +  if (MEM_P (dest))
> +    return gen_altivec_stvx_v4si_internal (dest, src);
> +  else
> +    return gen_altivec_lvx_v4si_internal (dest, src);
> +}
> +
>  /* Expand a block move operation, and return 1 if successful.  Return 0
>     if we should let the compiler generate normal code.
> 
> @@ -2721,11 +2740,11 @@
> 
>        /* Altivec first, since it will be faster than a string move
>  	 when it applies, and usually not significantly larger.  */
> -      if (TARGET_ALTIVEC && bytes >= 16 && (TARGET_EFFICIENT_UNALIGNED_VSX || align >= 128))
> +      if (TARGET_ALTIVEC && bytes >= 16 && align >= 128)
>  	{
>  	  move_bytes = 16;
>  	  mode = V4SImode;
> -	  gen_func.mov = gen_movv4si;
> +	  gen_func.mov = gen_lvx_v4si_move;
>  	}
>        else if (bytes >= 8 && TARGET_POWERPC64
>  	       && (align >= 64 || !STRICT_ALIGNMENT))
> 
> 
> 
> -- 
> Aaron Sawdey, Ph.D.  acsawdey@linux.vnet.ibm.com
> 050-2/C113  (507) 253-7520 home: 507/263-0782
> IBM Linux Technology Center - PPC Toolchain

Patch

Index: gcc/config/rs6000/rs6000-string.c
===================================================================
--- gcc/config/rs6000/rs6000-string.c	(revision 267055)
+++ gcc/config/rs6000/rs6000-string.c	(working copy)
@@ -2669,6 +2669,35 @@ 
   return true;
 }

+/* Generate loads and stores for a move of v4si mode using lvx/stvx.
+   This uses altivec_{l,st}vx_<mode>_internal which use unspecs to
+   keep combine from changing what instruction gets used.
+
+   DEST is the destination for the data.
+   SRC is the source of the data for the move.  */
+
+static rtx
+gen_lvx_v4si_move (rtx dest, rtx src)
+{
+  rtx rv = NULL;
+  if (MEM_P (dest))
+    {
+      gcc_assert (!MEM_P (src));
+      gcc_assert (GET_MODE (src) == V4SImode);
+      rv = gen_altivec_stvx_v4si_internal (dest, src);
+    }
+  else if (MEM_P (src))
+    {
+      gcc_assert (!MEM_P (dest));
+      gcc_assert (GET_MODE (dest) == V4SImode);
+      rv = gen_altivec_lvx_v4si_internal (dest, src);
+    }
+  else
+    gcc_unreachable ();
+
+  return rv;
+}
+
 /* Expand a block move operation, and return 1 if successful.  Return 0
    if we should let the compiler generate normal code.

@@ -2721,11 +2750,11 @@ 

       /* Altivec first, since it will be faster than a string move
 	 when it applies, and usually not significantly larger.  */
-      if (TARGET_ALTIVEC && bytes >= 16 && (TARGET_EFFICIENT_UNALIGNED_VSX || align >= 128))
+      if (TARGET_ALTIVEC && bytes >= 16 && align >= 128)
 	{
 	  move_bytes = 16;
 	  mode = V4SImode;
-	  gen_func.mov = gen_movv4si;
+	  gen_func.mov = gen_lvx_v4si_move;
 	}
       else if (bytes >= 8 && TARGET_POWERPC64
 	       && (align >= 64 || !STRICT_ALIGNMENT))
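
As a usage sketch (not part of the patch), the function below could be
compiled with e.g. "gcc -O2 -mcpu=power9 -S" before and after this change
to compare the inline expansion: with the alignment known to be 16 bytes
the copy should still be expanded with 16-byte vector moves, now emitted
as lvx/stvx rather than lxvd2x/stxvd2x.

#include <string.h>

/* 16-byte alignment is known from the attribute, so the V4SImode path in
   expand_block_move still applies (align >= 128 in the hunk above).  */
struct aligned32
{
  unsigned char data[32] __attribute__ ((aligned (16)));
};

void
copy_aligned (struct aligned32 *dst, const struct aligned32 *src)
{
  memcpy (dst->data, src->data, 32);
}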