======================================
 Errors       Moderate Dataset
 (>=)         current        new
 ======       ========   ========
  2 ulp       0.01762%   0.01762%
  8 ulp       0.00026%   0.00026%
 16 ulp       0.00000%   0.00000%
 24 ulp             0%         0%
 52 ulp             0%         0%
======================================
Table 1: Errors with Moderate Dataset
Note in Table 1 that the current and new methods give identical error
rates for data with moderate exponents. Errors exceeding 16 ulp are
exceedingly rare in either case.
======================================
 Errors       Full Dataset
 (>=)         current        new
 ======       ========   ========
  2 ulp       1.70%      0.03874%
  8 ulp       1.61%      0.02403%
 16 ulp       1.51%      0.01554%
 24 ulp       1.41%      0.00947%
 52 ulp       1.13%      0.00001%
======================================
Table 2: Errors with Full Dataset
Table 2 shows significant differences in error rates. With the current
code, 1.13% of test values produce results where no bits of the
mantissa match the correct answer. That level of error is extremely
rare with the new code. Consider the results for errors greater than
or equal to 24 ulp: the current code sees those errors at a rate of
1.41%, and most would agree that such answers are "substantially
different" from the answers calculated using extended precision. The
new code reduces errors at that level by more than 99%. These
improvements matter for serious computation with complex numbers.
PERFORMANCE TEST RESULTS
For a library change to be practical, it must be shown that the
slowdown is tolerable. The slowdowns observed are much less than
would be seen by, for example, switching to quad precision, which on
many machines can cause a slowdown of 1000x. The actual slowdown
depends on the machine architecture and on the nature of the input
data. If underflow/overflow is rare, implementations with strong
branch prediction will slow down by only a few cycles. If
underflow/overflow is common, the branch predictors will be less
successful and the cost will be higher.
Results from two machines are presented as examples of the overhead
of the new method. The one labeled x86 is a 5-year-old Intel x86
processor and the one labeled aarch64 is a 3-year-old arm64 processor.
In the following chart, lower values are better. The values are
nanoseconds per complex divide (ns), averaged over the full
10-million-value data sets.
=========================================================
              Moderate set            Full set
             x86      aarch64      x86      aarch64
==========   ==================   ==================
current      19.5 ns  34.9 ns     29.4 ns  32.9 ns
new          20.0 ns  42.7 ns     37.4 ns  53.8 ns
==========   ==================   ==================
%slowdown       3%      22%         27%      64%
=========================================================
Table 3: Performance Comparisons
In Table 3, overhead is modest when the new branches are consistently
false. The overhead grows when the new branches are taken to get
correct answers. Both platforms are somewhat dated, with the x86
having a better branch predictor which reduces the cost of the
additional branches in the new code. Of course, the relative slowdown
may be greater for some architectures, especially those with limited
branch prediction combined with a high cost of misprediction.
This cost is claimed to be tolerable on the grounds that:
(a) most applications will spend only a small fraction of their time
calculating complex divide;
(b) it is much less than the cost of extended precision;
(c) users are not forced to use it (as described below); and
(d) the cost is worthwhile considering the accuracy improvement shown
in Table 2.
Those users who find this degree of slowdown unsatisfactory may use
the switch -fcx-fortran-rules, which bypasses the library routine and
instead inlines Smith's method without the C99 requirement for
dealing with NaN results. The proposed patch for double-precision
complex divide does not affect the behavior of -fcx-fortran-rules.
SUMMARY
When the input data to complex divide has exponents between -512 and
511, this patch makes no change to accuracy and has only a modest
effect on performance. When the input data contains values outside
that range, the patch eliminates more than 99% of major errors at a
tolerable cost in performance.
REFERENCES
[1] Robert L. Smith. "Algorithm 116: Complex division." Commun. ACM,
5(8):435, 1962.
[2] Michael Baudin and Robert L. Smith. "A robust complex division in
Scilab." October 2012, available at http://arxiv.org/abs/1210.4539.
---
libgcc/ChangeLog | 6 ++++++
libgcc/libgcc2.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 66 insertions(+), 4 deletions(-)
@@ -1,3 +1,9 @@
+2020-05-29 Patrick McGehearty <patrick.mcgehearty@oracle.com>
+
+ * libgcc2.c (__divdc3): Enhance accuracy of double precision
+ complex divide for input values with very large and very
+ small exponents.
+
2020-05-28 Dong JianQiang <dongjianqiang2@huawei.com>
PR gcov-profile/95332
@@ -2036,6 +2036,10 @@ CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
CTYPE
CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
{
+#define RBIG ((DBL_MAX)/2.0)
+#define RMIN (DBL_MIN)
+#define RMIN2 (0x1.0p-512)
+#define RMINSCAL (0x1.0p+510)
MTYPE denom, ratio, x, y;
CTYPE res;
@@ -2043,19 +2047,71 @@ CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
the division. But that would mean starting to link libgcc against
libm. We could implement something akin to ldexp/frexp as gcc builtins
fairly easily... */
+  /*
+     Scaling by max(c,d) shifts a division and avoids a multiplication,
+     reducing round-off errors by a modest amount (average of a half-bit).
+  */
if (FABS (c) < FABS (d))
{
+ /* Prevent overflow when arguments are near max representable.  */
+ if ((FABS (d) >= RBIG) || (FABS (a) >= RBIG) || (FABS (b) >= RBIG))
+ {
+ a = a * 0.5;
+ b = b * 0.5;
+ c = c * 0.5;
+ d = d * 0.5;
+ }
+ /* Avoid overflow/underflow issues when c and d are small.  */
+ else if (FABS (d) < RMIN2)
+ {
+ a = a * RMINSCAL;
+ b = b * RMINSCAL;
+ c = c * RMINSCAL;
+ d = d * RMINSCAL;
+ }
ratio = c / d;
denom = (c * ratio) + d;
- x = ((a * ratio) + b) / denom;
- y = ((b * ratio) - a) / denom;
+ if (FABS (ratio) > RMIN)
+ {
+ x = ((a * ratio) + b) / denom;
+ y = ((b * ratio) - a) / denom;
+ }
+ else
+ {
+ x = ((c * (a / d)) + b) / denom;
+ y = ((c * (b / d)) - a) / denom;
+ }
}
else
{
+ /* Prevent overflow when arguments are near max representable.  */
+ if ((FABS (c) >= RBIG) || (FABS (a) >= RBIG) || (FABS (b) >= RBIG))
+ {
+ a = a * 0.5;
+ b = b * 0.5;
+ c = c * 0.5;
+ d = d * 0.5;
+ }
+ /* Avoid overflow issues when c and d are small.  */
+ else if (FABS (c) < RMIN2)
+ {
+ a = a * RMINSCAL;
+ b = b * RMINSCAL;
+ c = c * RMINSCAL;
+ d = d * RMINSCAL;
+ }
ratio = d / c;
denom = (d * ratio) + c;
- x = ((b * ratio) + a) / denom;
- y = (b - (a * ratio)) / denom;
+ if (FABS (ratio) > RMIN)
+ {
+ x = (a + (b * ratio)) / denom;
+ y = (b - (a * ratio)) / denom;
+ }
+ else
+ {
+ x = (a + (d * (b / c))) / denom;
+ y = (b - (d * (a / c))) / denom;
+ }
}
/* Recover infinities and zeros that computed as NaN+iNaN; the only cases