[AArch64] Adjust writeback in non-zero memset

Message ID DB5PR08MB1030B6D34743A65E1BF4FEBC83CB0@DB5PR08MB1030.eurprd08.prod.outlook.com
State New

Commit Message

Wilco Dijkstra Nov. 6, 2018, 2:42 p.m.
This fixes an inefficiency in the non-zero memset.  Delaying the writeback
until the end of the loop is slightly faster on some cores - this shows a
~5% performance gain on Cortex-A53 when doing large non-zero memsets.

Tested against the GLIBC testsuite.

---

Comments

Richard Earnshaw (lists) Nov. 6, 2018, 3:01 p.m. | #1
On 06/11/2018 14:42, Wilco Dijkstra wrote:
> This fixes an inefficiency in the non-zero memset.  Delaying the writeback
> until the end of the loop is slightly faster on some cores - this shows
> ~5% performance gain on Cortex-A53 when doing large non-zero memsets.
>
> Tested against the GLIBC testsuite.

Thanks, pushed.

R.

>
> ---
>
> diff --git a/newlib/libc/machine/aarch64/memset.S b/newlib/libc/machine/aarch64/memset.S
> index 799e7b7874a397138c5c85cfa2adb85f63c94cef..7c8fe583bf88722d73b90ec470c72b509e5be137 100644
> --- a/newlib/libc/machine/aarch64/memset.S
> +++ b/newlib/libc/machine/aarch64/memset.S
> @@ -142,10 +142,10 @@ L(set_long):
>  	b.eq	L(try_zva)
>  L(no_zva):
>  	sub	count, dstend, dst	/* Count is 16 too large.  */
> -	add	dst, dst, 16
> +	sub	dst, dst, 16		/* Dst is biased by -32.  */
>  	sub	count, count, 64 + 16	/* Adjust count and bias for loop.  */
> -1:	stp	q0, q0, [dst], 64
> -	stp	q0, q0, [dst, -32]
> +1:	stp	q0, q0, [dst, 32]
> +	stp	q0, q0, [dst, 64]!
>  L(tail64):
>  	subs	count, count, 64
>  	b.hi	1b
>

Patch

diff --git a/newlib/libc/machine/aarch64/memset.S b/newlib/libc/machine/aarch64/memset.S
index 799e7b7874a397138c5c85cfa2adb85f63c94cef..7c8fe583bf88722d73b90ec470c72b509e5be137 100644
--- a/newlib/libc/machine/aarch64/memset.S
+++ b/newlib/libc/machine/aarch64/memset.S
@@ -142,10 +142,10 @@  L(set_long):
 	b.eq	L(try_zva)
 L(no_zva):
 	sub	count, dstend, dst	/* Count is 16 too large.  */
-	add	dst, dst, 16
+	sub	dst, dst, 16		/* Dst is biased by -32.  */
 	sub	count, count, 64 + 16	/* Adjust count and bias for loop.  */
-1:	stp	q0, q0, [dst], 64
-	stp	q0, q0, [dst, -32]
+1:	stp	q0, q0, [dst, 32]
+	stp	q0, q0, [dst, 64]!
 L(tail64):
 	subs	count, count, 64
 	b.hi	1b
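
The hunk above changes only where the writeback to dst happens, not which
bytes get stored. A minimal C sketch of that claim follows. It models dst,
dstend and count as plain integer offsets and each "stp q0, q0" as a 32-byte
store. The names loop_old, loop_new, loops_agree and BUF_SIZE are hypothetical
and not part of the patch. It assumes dst is 16-byte aligned and that
dstend - dst is at least 80, so that the unsigned b.hi test can be modelled
as a signed "count > 0".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { STP_BYTES = 32, BUF_SIZE = 4096 };

/* Model of "stp q0, q0, [addr]": store 32 bytes of val at offset off. */
static void stp_q0_q0(unsigned char *buf, ptrdiff_t off, unsigned char val)
{
    memset(buf + off, val, STP_BYTES);
}

/* Loop before the patch: post-index writeback on the first store.
   Returns the value left in dst when the loop exits. */
ptrdiff_t loop_old(unsigned char *buf, ptrdiff_t dst, ptrdiff_t dstend,
                   unsigned char val)
{
    ptrdiff_t count = dstend - dst;      /* sub count, dstend, dst */
    dst += 16;                           /* add dst, dst, 16 */
    count -= 64 + 16;                    /* sub count, count, 64 + 16 */
    do {
        stp_q0_q0(buf, dst, val);        /* stp q0, q0, [dst], 64 */
        dst += 64;                       /* ...post-index writeback */
        stp_q0_q0(buf, dst - 32, val);   /* stp q0, q0, [dst, -32] */
        count -= 64;                     /* subs count, count, 64 */
    } while (count > 0);                 /* b.hi 1b */
    return dst;
}

/* Loop after the patch: dst is biased by -32 and the writeback is
   delayed to the second store, which uses pre-index addressing. */
ptrdiff_t loop_new(unsigned char *buf, ptrdiff_t dst, ptrdiff_t dstend,
                   unsigned char val)
{
    ptrdiff_t count = dstend - dst;      /* sub count, dstend, dst */
    dst -= 16;                           /* sub dst, dst, 16 */
    count -= 64 + 16;                    /* sub count, count, 64 + 16 */
    do {
        stp_q0_q0(buf, dst + 32, val);   /* stp q0, q0, [dst, 32] */
        dst += 64;                       /* pre-index: update dst first */
        stp_q0_q0(buf, dst, val);        /* stp q0, q0, [dst, 64]! */
        count -= 64;                     /* subs count, count, 64 */
    } while (count > 0);                 /* b.hi 1b */
    return dst;
}

/* Returns 1 if both loops store exactly the same bytes and the final
   dst values differ by the 32-byte bias, 0 otherwise or on bad input. */
int loops_agree(ptrdiff_t dst, ptrdiff_t dstend)
{
    static unsigned char a[BUF_SIZE], b[BUF_SIZE];
    if (dst < 0 || dst % 16 != 0 || dstend > BUF_SIZE || dstend - dst < 80)
        return 0;
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);
    ptrdiff_t end_old = loop_old(a, dst, dstend, 0xAB);
    ptrdiff_t end_new = loop_new(b, dst, dstend, 0xAB);
    return memcmp(a, b, sizeof a) == 0 && end_old - end_new == 32;
}
```

Calling loops_agree(16, 1000), for example, runs both variants over the same
range and compares the resulting buffers. The model says nothing about
timing; the ~5% figure quoted in the commit message is a measurement on
Cortex-A53 and cannot be reproduced this way.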