Complex division improvements in Libgcc

Message ID AM0PR08MB336270B1CBEBF8122A3D36FD84A40@AM0PR08MB3362.eurprd08.prod.outlook.com
State New
Headers show
Series
  • Complex division improvements in Libgcc
Related show

Commit Message

Elen Kalda Aug. 23, 2019, 11:19 a.m.
Hi all,

Libgcc currently implements the Smith's algorithm
(https://dl.acm.org/citation.cfm?id=368661) for complex division. It is
designed to prevent overflows and underflows and it does that well for around
97.7% cases. However, there has been some interest in improving that result, as
an example Baudin improves the robustness to ~99.5% by introducing an extra
underflow check and also gains in performance by replacing two divisions by two
multiplications and one division (https://arxiv.org/abs/1210.4539).

This patch proposes the implementation of Baudin algorithm to Libgcc (in case
the increase in rounding error is acceptable).

In general, when it comes to deciding what is the best way of doing the
complex divisions, there are three things to consider:
- speed
- accuracy
- robustness (how often overflow/underflow happens, how many division results
are way off from the exact value)

Improving the algorithms by looking at the failures of specific cases is not
simple. There are 4 real numbers to be manipulated to get the final result and
some minimum number of operations that needs to be done. The quite minimal
Smiths algorithm currently does 9 operations, of which 6 are multiplications
or divisions which are susceptible to overflow and underflow. It is not
feasible to check for both, underflow and overflow, in all 6 operations
without significantly impacting the performance. Most of the complex division
algorithms have been created to solve some special difficult cases, however,
because of the abundance of failure opportunities, algorithm that works well
for some type of divisions is likely to fail for other types of divisions.

The simplest way to dodge overflow and underflow without impacting performance
much is to vary the order of operations. When naively expanding the complex
division x/y = (...)/(real(y)*real(y) + imag(y)*imag(y)), the squaring in the
denominator is the greatest source of overflow and underflow. Smiths
improvement was to introduce scaling factor r = real(y)/imag(y), such that the
denominator becomes real(y) + imag(y)*r. Baudin's improvement was to check
explicitly for underflow in r and change the order of operations if necessary.

One way of comparing the algorithms is to generate lots of random complex
numbers and see how the different algorithms perform. That approach is closer
to the "real world" situation where the complex numbers are often random, not
powers of two (which oddly is the assumption many authors make, including
Baudin) and have an exponents in a less "dangerous" ranges. Smiths algorithm 
was compared to two versions of Baudin. The difference is that in one version
(b2div), real and imaginary parts of the result are both divided through by
the same denominator d, in the other one (b1div) the real and imaginary
parts are multiplied through by the reciprocal of that d, effectively
replacing two divisions with one division and two multiplications. Since
division is significantly slower on AArch64, that swap introduces gains in
performance (though with the cost of rounding error, unfortunately).

To compare the performance, I used a test that generates 1800 random complex
double pairs (which fit nicely into the 64 kb cache) and divide each pair 200
000 times, effectively doing 360 000 000 operations.

           | CPU time
------------------
smiths | 7 296 924
b1div  | 6 558 148
b2div  | 8 658 079

On AArch64, b1div is 10.13% faster than Smiths, b2div is 18.65% slower than 
Smiths.

For the accuracy, 1 000 000 pairs of complex doubles were divided and compared
to the exact results (assuming that the division of the same numbers as long
doubles gives the exact value). 

           |  >2ulp | >1ulp  | >0.5ulp | <0.5ulp
---------------------------------------------
smiths | 22 937 | 23 944 | 124 228 | 828 891
b1div  |  4 450 | 59 716 | 350 792 | 585 042
b2div  |  4 036 | 24 488 | 127 144 | 844 332

Same data, but showing the percentage of divisions falling into each category: 

           |  >2ulp | >1ulp  | >0.5ulp | <0.5ulp
---------------------------------------------
smiths | 2.29%  | 2.39%  | 12.42%  | 82.89%
b1div  | 0.45%  | 5.97%  | 35.08%  | 58.50%
b2div  | 0.40%  | 2.45%  | 12.71%  | 84.43%

The rightmost column (<0.5ulp) shows the number of calculations for which the 
ulp error of both, the real and imaginary parts, were inside 0.5 ulp, meaning 
they were correctly rounded. >0.5ulp gives the number of calculations for 
which the largest ulp error of the real and imaginary parts were between 0.5 
and 1.0 ulp, meaning that it was rounded to one of the nearest doubles, but 
not THE nearest double. Anything less accurate is not great...

FMA doesn't create any problems on AArch64. Bootstrapped and tested on
aarch64-none-linux-gnu -- no problems with bootstrap, but some regressions in
gcc/testsuite/gcc.dg/torture/builtin-math-7.c. The regressions are a result of
the loss in accuracy - for a division (3. + 4.i) / 5i = 0.8 - 0.6i, Smiths
gives:
0.800000000000000044408920985006 - 0.599999999999999977795539507497 i
ulp error in real part: 0.4
ulp error in imaginary part: 0.2
b1div gives:
0.800000000000000044408920985006 - 0.600000000000000088817841970013 i
ulp error in real part: 0.4
ulp error in imaginary part: 0.8
That means the imaginary part of the Baudin result is rounded to one of the two
nearest floats but not the one which is the nearest. Similar thing happens to
another division in that test.


Some questions:
- Is the 10.13% improvement in performance with b1div worth the loss in 
accuracy?
- In the case of b1div, most of the results that ceased to be correctly
rounded as a result of replacing the two divisions with multiplications, ended
up in being in 1.0ulp. Maybe that is acceptable?
- The improved algorithm reduces the number of bad fails from 2.3% to 0.5%. Is
that a significant/useful improvement?


Best wishes,
Elen

libgcc/ChangeLog:

2019-07-31  Elen Kalda  <elen.kalda@arm.com>

	* libgcc2.c CONCAT3(__div,MODE,3): Implement the Baudin's algorithm

Comments

Jeff Law Aug. 23, 2019, 9:18 p.m. | #1
On 8/23/19 5:19 AM, Elen Kalda wrote:
> Hi all,

> 

> Libgcc currently implements the Smith's algorithm

> (https://dl.acm.org/citation.cfm?id=368661) for complex division. It is

> designed to prevent overflows and underflows and it does that well for around

> 97.7% cases. However, there has been some interest in improving that result, as

> an example Baudin improves the robustness to ~99.5% by introducing an extra

> underflow check and also gains in performance by replacing two divisions by two

> multiplications and one division (https://arxiv.org/abs/1210.4539).

> 

> This patch proposes the implementation of Baudin algorithm to Libgcc (in case

> the increase in rounding error is acceptable).

> 

> In general, when it comes to deciding what is the best way of doing the

> complex divisions, there are three things to consider:

> - speed

> - accuracy

> - robustness (how often overflow/underflow happens, how many division results

> are way off from the exact value)

> 

> Improving the algorithms by looking at the failures of specific cases is not

> simple. There are 4 real numbers to be manipulated to get the final result and

> some minimum number of operations that needs to be done. The quite minimal

> Smiths algorithm currently does 9 operations, of which 6 are multiplications

> or divisions which are susceptible to overflow and underflow. It is not

> feasible to check for both, underflow and overflow, in all 6 operations

> without significantly impacting the performance. Most of the complex division

> algorithms have been created to solve some special difficult cases, however,

> because of the abundance of failure opportunities, algorithm that works well

> for some type of divisions is likely to fail for other types of divisions.

> 

> The simplest way to dodge overflow and underflow without impacting performance

> much is to vary the order of operations. When naively expanding the complex

> division x/y = (...)/(real(y)*real(y) + imag(y)*imag(y)), the squaring in the

> denominator is the greatest source of overflow and underflow. Smiths

> improvement was to introduce scaling factor r = real(y)/imag(y), such that the

> denominator becomes real(y) + imag(y)*r. Baudin's improvement was to check

> explicitly for underflow in r and change the order of operations if necessary.

> 

> One way of comparing the algorithms is to generate lots of random complex

> numbers and see how the different algorithms perform. That approach is closer

> to the "real world" situation where the complex numbers are often random, not

> powers of two (which oddly is the assumption many authors make, including

> Baudin) and have an exponents in a less "dangerous" ranges. Smiths algorithm 

> was compared to two versions of Baudin. The difference is that in one version

> (b2div), real and imaginary parts of the result are both divided through by

> the same denominator d, in the other one (b1div) the real and imaginary

> parts are multiplied through by the reciprocal of that d, effectively

> replacing two divisions with one division and two multiplications. Since

> division is significantly slower on AArch64, that swap introduces gains in

> performance (though with the cost of rounding error, unfortunately).

> 

> To compare the performance, I used a test that generates 1800 random complex

> double pairs (which fit nicely into the 64 kb cache) and divide each pair 200

> 000 times, effectively doing 360 000 000 operations.

> 

>            | CPU time

> ------------------

> smiths | 7 296 924

> b1div  | 6 558 148

> b2div  | 8 658 079

> 

> On AArch64, b1div is 10.13% faster than Smiths, b2div is 18.65% slower than 

> Smiths.

> 

> For the accuracy, 1 000 000 pairs of complex doubles were divided and compared

> to the exact results (assuming that the division of the same numbers as long

> doubles gives the exact value). 

> 

>            |  >2ulp | >1ulp  | >0.5ulp | <0.5ulp

> ---------------------------------------------

> smiths | 22 937 | 23 944 | 124 228 | 828 891

> b1div  |  4 450 | 59 716 | 350 792 | 585 042

> b2div  |  4 036 | 24 488 | 127 144 | 844 332

> 

> Same data, but showing the percentage of divisions falling into each category: 

> 

>            |  >2ulp | >1ulp  | >0.5ulp | <0.5ulp

> ---------------------------------------------

> smiths | 2.29%  | 2.39%  | 12.42%  | 82.89%

> b1div  | 0.45%  | 5.97%  | 35.08%  | 58.50%

> b2div  | 0.40%  | 2.45%  | 12.71%  | 84.43%

> 

> The rightmost column (<0.5ulp) shows the number of calculations for which the 

> ulp error of both, the real and imaginary parts, were inside 0.5 ulp, meaning 

> they were correctly rounded. >0.5ulp gives the number of calculations for 

> which the largest ulp error of the real and imaginary parts were between 0.5 

> and 1.0 ulp, meaning that it was rounded to one of the nearest doubles, but 

> not THE nearest double. Anything less accurate is not great...

> 

> FMA doesn't create any problems on AArch64. Bootstrapped and tested on

> aarch64-none-linux-gnu -- no problems with bootstrap, but some regressions in

> gcc/testsuite/gcc.dg/torture/builtin-math-7.c. The regressions are a result of

> the loss in accuracy - for a division (3. + 4.i) / 5i = 0.8 - 0.6i, Smiths

> gives:

> 0.800000000000000044408920985006 - 0.599999999999999977795539507497 i

> ulp error in real part: 0.4

> ulp error in imaginary part: 0.2

> b1div gives:

> 0.800000000000000044408920985006 - 0.600000000000000088817841970013 i

> ulp error in real part: 0.4

> ulp error in imaginary part: 0.8

> That means the imaginary part of the Baudin result is rounded to one of the two

> nearest floats but not the one which is the nearest. Similar thing happens to

> another division in that test.

> 

> 

> Some questions:

> - Is the 10.13% improvement in performance with b1div worth the loss in 

> accuracy?

> - In the case of b1div, most of the results that ceased to be correctly

> rounded as a result of replacing the two divisions with multiplications, ended

> up in being in 1.0ulp. Maybe that is acceptable?

> - The improved algorithm reduces the number of bad fails from 2.3% to 0.5%. Is

> that a significant/useful improvement?

> 

> 

> Best wishes,

> Elen

> 

> libgcc/ChangeLog:

> 

> 2019-07-31  Elen Kalda  <elen.kalda@arm.com>

> 

> 	* libgcc2.c CONCAT3(__div,MODE,3): Implement the Baudin's algorithm

I'd really like to hear from Joseph on this.  He's got *far* more
experience in evaluating stuff in this space than I do and is almost
certainly going to be able to give a more informed opinion than myself.

I believe he's digging out from some time off, so it might be a little
while before he can respond.

jeff
>

Patch

diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c
index 763b5fabd512d4efd3ca007d9a96d8778a5de422..9bc0168281dac231aeb4d7b9852a24e61128b77a 100644
--- a/libgcc/libgcc2.c
+++ b/libgcc/libgcc2.c
@@ -2033,58 +2033,75 @@  CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
 #if defined(L_divhc3) || defined(L_divsc3) || defined(L_divdc3) \
     || defined(L_divxc3) || defined(L_divtc3)
 
+
+/* Complex division x/y, where
+	a = real (x)
+	b = imag (x)
+	c = real (y)
+	d = imag (y)
+*/
 CTYPE
 CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
 {
-  MTYPE denom, ratio, x, y;
+  MTYPE e, f, r, t;
   CTYPE res;
 
-  /* ??? We can get better behavior from logarithmic scaling instead of
-     the division.  But that would mean starting to link libgcc against
-     libm.  We could implement something akin to ldexp/frexp as gcc builtins
-     fairly easily...  */
-  if (FABS (c) < FABS (d))
+  inline void improved_internal (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
+  {
+    r = d/c;
+    t = 1.0 / (c + (d * r));
+
+    if (r != 0)
     {
-      ratio = c / d;
-      denom = (c * ratio) + d;
-      x = ((a * ratio) + b) / denom;
-      y = ((b * ratio) - a) / denom;
+      e = (a + (b * r)) * t;
+      f = (b - (a * r)) * t;
     }
-  else
+    /* Changing the order of operations avoids the underflow of r impacting
+    the result. */
+    else
     {
-      ratio = d / c;
-      denom = (d * ratio) + c;
-      x = ((b * ratio) + a) / denom;
-      y = (b - (a * ratio)) / denom;
+      e = (a + (d * (b / c))) * t;
+      f = (b - (d * (a / c))) * t;
     }
+  }
+
+  if (FABS (d) <= FABS (c))
+  {
+    improved_internal (a, b, c, d);
+  }
+  else
+  {
+    improved_internal (b, a, d, c);
+    f = -f;
+  }
 
   /* Recover infinities and zeros that computed as NaN+iNaN; the only cases
      are nonzero/zero, infinite/finite, and finite/infinite.  */
-  if (isnan (x) && isnan (y))
+  if (isnan (e) && isnan (f))
+  {
+    if (c == 0.0 && d == 0.0 && (!isnan (a) || !isnan (b)))
     {
-      if (c == 0.0 && d == 0.0 && (!isnan (a) || !isnan (b)))
-	{
-	  x = COPYSIGN (INFINITY, c) * a;
-	  y = COPYSIGN (INFINITY, c) * b;
-	}
-      else if ((isinf (a) || isinf (b)) && isfinite (c) && isfinite (d))
-	{
-	  a = COPYSIGN (isinf (a) ? 1 : 0, a);
-	  b = COPYSIGN (isinf (b) ? 1 : 0, b);
-	  x = INFINITY * (a * c + b * d);
-	  y = INFINITY * (b * c - a * d);
-	}
-      else if ((isinf (c) || isinf (d)) && isfinite (a) && isfinite (b))
-	{
-	  c = COPYSIGN (isinf (c) ? 1 : 0, c);
-	  d = COPYSIGN (isinf (d) ? 1 : 0, d);
-	  x = 0.0 * (a * c + b * d);
-	  y = 0.0 * (b * c - a * d);
-	}
+      e = COPYSIGN (INFINITY, c) * a;
+      f = COPYSIGN (INFINITY, c) * b;
     }
+    else if ((isinf (a) || isinf (b)) && isfinite (c) && isfinite (d))
+    {
+      a = COPYSIGN (isinf (a) ? 1 : 0, a);
+      b = COPYSIGN (isinf (b) ? 1 : 0, b);
+      e = INFINITY * (a * c + b * d);
+      f = INFINITY * (b * c - a * d);
+    }
+    else if ((isinf (c) || isinf (d)) && isfinite (a) && isfinite (b))
+    {
+      c = COPYSIGN (isinf (c) ? 1 : 0, c);
+      d = COPYSIGN (isinf (d) ? 1 : 0, d);
+      e = 0.0 * (a * c + b * d);
+      f = 0.0 * (b * c - a * d);
+    }
+  }
 
-  __real__ res = x;
-  __imag__ res = y;
+  __real__ res = e;
+  __imag__ res = f;
   return res;
 }
 #endif /* complex divide */