Practical Improvement to Double Precision Complex Divide

Message ID 1591047643-9019-1-git-send-email-patrick.mcgehearty@oracle.com
Commit Message

Kewen.Lin via Gcc-patches June 1, 2020, 9:40 p.m.
The following patch to libgcc/libgcc2.c __divdc3 provides an
opportunity to gain important improvements to the quality of answers
for the default double precision complex divide routine when dealing
with very large or very small exponents.

The current code correctly implements Smith's method (1962) [1]
further modified by c99's requirements for dealing with NaN (not a
number) results. When working with input values where the exponents
are greater than 512 (i.e. 2.0^512) or less than -512 (i.e. 2.0^-512),
results are substantially different from the answers provided by quad
precision more than 1% of the time. Since the allowed exponent range
for double precision numbers is -1074 to +1023, the error rate may be
unacceptable for many applications. The proposed method reduces the
frequency of "substantially different" answers by more than 99% at a
modest cost of performance.

Differences between current gcc methods and the new method will be
described. Then accuracy and performance differences will be discussed.

NOTATION

For all of the following, the notation is:
Input complex values:
  a+bi  (a = 64-bit real part, b = 64-bit imaginary part)
  c+di
Output complex value:
  e+fi = (a+bi)/(c+di)

DESCRIPTIONS of different complex divide methods:

NAIVE COMPUTATION (-fcx-limited-range):
  e = (a*c + b*d)/(c*c + d*d)
  f = (b*c - a*d)/(c*c + d*d)

Note that c*c and d*d will overflow or underflow if either
c or d is outside the range 2^-538 to 2^512.

This method is available in gcc when the switch -fcx-limited-range is
used. That switch is also enabled by -ffast-math. Only users with a
clear understanding of the range of intermediate values generated by
their computation should consider using this switch.

SMITH's METHOD (current libgcc):
  if (fabs(c) < fabs(d)) {
    r = c/d;
    denom = (c*r) + d;
    e = (a*r + b) / denom;
    f = (b*r - a) / denom;
  } else {
    r = d/c;
    denom = c + (d*r);
    e = (a + b*r) / denom;
    f = (b - a*r) / denom;
  }

Smith's method is the current default method available with __divdc3.

NEW METHOD (proposed by patch) to replace the current default method:

The proposed method starts with an algorithm proposed by Baudin&Smith
(2012) in the paper "A Robust Complex Division in Scilab" [2]. The
patch makes additional modifications to that method for further
reductions in the error rate.

  #define RBIG ((DBL_MAX)/2.0)
  #define RMIN (DBL_MIN)
  #define RMIN2 (0x1.0p-512)
  #define RMINSCAL (0x1.0p+510)

  /* prevent overflow when arguments are near max representable */
  if ((FABS (d) >= RBIG) || (FABS (a) >= RBIG) || (FABS (b) >= RBIG)) {
      a = a * 0.5;
      b = b * 0.5;
      c = c * 0.5;
      d = d * 0.5;
  }
  /* minimize overflow/underflow issues when c and d are small */
  else if (FABS (d) < RMIN2) {
      a = a * RMINSCAL;
      b = b * RMINSCAL;
      c = c * RMINSCAL;
      d = d * RMINSCAL;
  }
  r = c/d; denom = (c*r) + d;
  if (FABS (r) > RMIN) {
      e = (a*r + b) / denom;
      f = (b*r - a) / denom;
  } else {
      e = (c * (a/d) + b) / denom;
      f = (c * (b/d) - a) / denom;
  }
[ only presenting the fabs(c) < fabs(d) case here, full code in patch. ]

Before computing the answer, the code checks for near-maximum or
near-minimum inputs and scales all values away from the extremes. If
the complex divide can be computed at all without generating
infinities, these scalings do not affect the accuracy since they are
by a power of 2.  Values over RBIG are relatively rare, but they are
easy to test for, and scaling them is required to avoid unnecessary
overflows.

Testing for RMIN2 reveals when both c and d are less than 2^-512.  By
scaling all values by 2^510, the code avoids many underflows in
intermediate computations that otherwise might occur. If scaling a and
b by 2^510 causes either to overflow, then the computation will overflow
whatever method is used.

Next, r (the ratio of c to d) is checked for being near zero. Baudin
and Smith checked r for exactly zero; checking for magnitudes less
than DBL_MIN covers more cases and improves overall accuracy. If r is
near zero, then when it is used in a multiplication there is a high
chance that the result will underflow to zero, losing significant
accuracy. That underflow can be avoided by doing the computation in a
different order: when r is subnormal, the code replaces a*r (=
a*(c/d)) with ((a/d)*c), which is mathematically the same but avoids
the unnecessary underflow.

TEST Data

Two sets of data are presented to test these methods.  Both sets
contain 10 million pairs of 64-bit complex values.  The exponents and
mantissas are generated using multiple calls to random() and then
combining the results. Only values which give complex divide results
that are representable in 64 bits after being computed in quad
precision are used.

The first data set is labeled "moderate exponents".
The exponent range is limited to -512 to +511.
The second data set is labeled "full exponents".
The exponent range is -1076 to +1024.

ACCURACY Test results:

Note: All results are based on use of fused multiply-add. If
fused multiply-add is not used, the error rate increases slightly
for the 2 ulp and 8 ulp cases.

The complex divide methods are evaluated by determining what
percentage of values exceed different ulp (units in the last place)
levels.  If the "2 ulp" test shows 1%, that means 1% of the
10,000,000 values (100,000) have either a real or imaginary part that
differs from the quad precision result by more than 2 ulp.

Results are reported for differences greater than or equal to 2 ulp,
8 ulp, 16 ulp, 24 ulp, and 52 ulp.  Even when the patch avoids
overflows and underflows, some input values are expected to have
errors due to normal limitations of floating point subtraction. For
example, when b*c and a*d are nearly equal, the subtraction has a
potential loss of accuracy. This patch does not attempt to detect or
minimize this type of error, but neither does it increase it.

In the following charts, lower values are better.


==============================
Errors   Moderate Dataset
gtr eq     current    new
======    ========   ========
 2 ulp    0.01762%   0.01762%
 8 ulp    0.00026%   0.00026%
16 ulp    0.00000%   0.00000%
24 ulp          0%         0%
52 ulp          0%         0%
==============================
Table 1: Errors with Moderate Dataset

Note in Table 1 that both the old and new methods give identical error
rates for data with moderate exponents. Errors exceeding 16 ulp are
exceedingly rare.

==============================
Errors   Full Dataset
gtr eq    current    new
======    ========   ========
 2 ulp      1.70%   0.03874%
 8 ulp      1.61%   0.02403%
16 ulp      1.51%   0.01554%
24 ulp      1.41%   0.00947%
52 ulp      1.13%   0.00001%
==============================
Table 2: Errors with Full Dataset

Table 2 shows significant differences in error rates. With the current
code, 1.13% of test values have results where no bits of the mantissa
match the correct answer. That level of error is extremely rare with
the new code. Consider the results for errors greater than or equal to
24 ulp.  The current code sees those errors at a rate of 1.4%. Most
would agree that such answers are "substantially different" from the
answers calculated using extended precision.  The new code reduces
errors at that level by more than 99%. These improvements are
important for serious computation of complex numbers.

PERFORMANCE Test results

In order for a library change to be practical, it is necessary to
show the slowdown is tolerable. The slowdowns observed are much
less than would be seen by (for example) switching to quad
precision, which on many machines may cause a slowdown of 1000x.

The actual slowdown depends on the machine architecture.  It also
depends on the nature of the input data. If underflow/overflow is rare,
then implementations that have strong branch prediction will only
slowdown by a few cycles. If underflow/overflow is common, then the
branch predictors will be less successful and the cost will be higher.

Results from two machines are presented as examples of the overhead
for the new method. The one labeled x86 is a 5-year-old Intel x86
processor and the one labeled aarch64 is a 3-year-old arm64 processor.

In the following chart, lower values are better. The values are
nanoseconds/complex divide (ns) averaged over the full 10 million
value data sets.

=====================================================
               Moderate set            Full set
                x86    aarch64        x86    aarch64
========     =================     =================
current      19.5 ns   34.9 ns     29.4 ns   32.9 ns
new          20.0 ns   42.7 ns     37.4 ns   53.8 ns
========     =================     =================
%slowdown      3%       22%         27%       64%
=====================================================
Table 3: Performance Comparisons

In Table 3, overhead is modest when the new branches are consistently
false. The overhead grows when the new branches are taken to get
correct answers. Both platforms are somewhat dated, with the x86
having a better branch predictor which reduces the cost of the
additional branches in the new code. Of course, the relative slowdown
may be greater for some architectures, especially those with limited
branch prediction combined with a high cost of misprediction.

This cost is claimed to be tolerable on the grounds that:
(a) most applications will only spend a small fraction of their time
    calculating complex divide;
(b) it is much less than the cost of extended precision;
(c) users are not forced to use it (as described below); and
(d) the cost is worthwhile considering the accuracy improvement shown
    in Table 2.

Those users who find this degree of slowdown unsatisfactory may use
the switch -fcx-fortran-rules which does not use the library routine,
instead inlining Smith's method without the C99 requirement for
dealing with NaN results. The proposed patch for double precision
complex divide does not affect the behavior of -fcx-fortran-rules.

SUMMARY

When input data to complex divide has exponents between -512 and 511,
this patch makes no changes in accuracy and has only a modest effect
on performance.  When input data contains values outside those ranges,
the patch eliminates more than 99% of major errors with a tolerable
cost in performance.

REFERENCES

[1] Robert L. Smith. Algorithm 116: Complex division.  Commun. ACM,
 5(8):435, 1962.

[2] Michael Baudin and Robert L. Smith. "A robust complex division in
Scilab," October 2012, available at http://arxiv.org/abs/1210.4539.
---
 libgcc/ChangeLog |  6 ++++++
 libgcc/libgcc2.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 66 insertions(+), 4 deletions(-)

diff --git a/libgcc/ChangeLog b/libgcc/ChangeLog
index 06e7986..5a666d4 100644
--- a/libgcc/ChangeLog
+++ b/libgcc/ChangeLog
@@ -1,3 +1,9 @@ 
+2020-05-29  Patrick McGehearty <patrick.mcgehearty@oracle.com>
+
+	* libgcc2.c (__divdc3): Enhance accuracy of double precision
+	complex divide for input values with very large and very
+	small exponents.
+
 2020-05-28  Dong JianQiang  <dongjianqiang2@huawei.com>
 
 	PR gcov-profile/95332
diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c
index e0a9fd7..2d32b7c 100644
--- a/libgcc/libgcc2.c
+++ b/libgcc/libgcc2.c
@@ -2036,6 +2036,10 @@  CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
 CTYPE
 CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
 {
+#define RBIG	((DBL_MAX)/2.0)
+#define RMIN	(DBL_MIN)
+#define RMIN2	(0x1.0p-512)
+#define RMINSCAL (0x1.0p+510)
   MTYPE denom, ratio, x, y;
   CTYPE res;
 
@@ -2043,19 +2047,71 @@  CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
      the division.  But that would mean starting to link libgcc against
      libm.  We could implement something akin to ldexp/frexp as gcc builtins
      fairly easily...  */
+  /*
+     Scaling by max(c,d) shifts a division and avoids a multiplication
+     reducing round-off errors by a modest amount (average of a half-bit)
+   */
   if (FABS (c) < FABS (d))
     {
+      /* prevent overflow when arguments are near max representable */
+      if ((FABS (d) >= RBIG) || (FABS (a) >= RBIG) || (FABS (b) >= RBIG) )
+	{
+	  a = a * 0.5;
+	  b = b * 0.5;
+	  c = c * 0.5;
+	  d = d * 0.5;
+	} 
+      /* avoid overflow/underflow issues when c and d are small */
+      else if (FABS (d) < RMIN2)
+	{
+	  a = a * RMINSCAL;
+	  b = b * RMINSCAL;
+	  c = c * RMINSCAL;
+	  d = d * RMINSCAL;
+	}
       ratio = c / d;
       denom = (c * ratio) + d;
-      x = ((a * ratio) + b) / denom;
-      y = ((b * ratio) - a) / denom;
+      if ( FABS (ratio) > RMIN )
+	{
+	  x = ((a * ratio) + b) / denom;
+	  y = ((b * ratio) - a) / denom;
+	}
+      else
+	{
+	  x = ((c * (a / d)) + b) / denom;
+	  y = ((c * (b / d)) - a) / denom;
+	}
     }
   else
     {
+      /* prevent overflow when arguments are near max representable */
+      if ((FABS(c) >= RBIG) || (FABS(a) >= RBIG) || (FABS(b) >= RBIG) )
+	{
+	  a = a * 0.5;
+	  b = b * 0.5;
+	  c = c * 0.5;
+	  d = d * 0.5;
+	} 
+      /* avoid overflow issues when c and d are small */
+      else if (FABS(c) < RMIN2)
+	{
+	  a = a * RMINSCAL;
+	  b = b * RMINSCAL;
+	  c = c * RMINSCAL;
+	  d = d * RMINSCAL;
+	}
       ratio = d / c;
       denom = (d * ratio) + c;
-      x = ((b * ratio) + a) / denom;
-      y = (b - (a * ratio)) / denom;
+      if ( FABS (ratio) > RMIN )
+	{
+	  x = (a + (b * ratio)) / denom;
+	  y = (b - (a * ratio)) / denom;
+	}
+      else
+	{
+	  x = (a + (d * (b / c))) / denom;
+	  y = (b - (d * (a / c))) / denom;
+	}
     }
 
   /* Recover infinities and zeros that computed as NaN+iNaN; the only cases