Message ID  1593621012150581gitsendemailpatrick.mcgehearty@oracle.com 

State  New 
Headers  show 
Series 

Related  show 
Ping On 7/1/2020 11:30 AM, Patrick McGehearty via Gccpatches wrote: > (Version 3) > > (Added in version 3) > Support for half, float, extended, and long double precision has > been added to the prior work for double precision. Since half precision > is computed with float precision as per current libgcc practice, > the enhanced underflow/overflow tests provide no benefit for half > precision and would cost performance. Therefore half precision is > left unchanged. > > The existing constants for each precision: > float: FLT_MAX, FLT_MIN; > double: DBL_MAX, DBL_MIN; > extended and/or long double: LDBL_MAX, LDBL_MIN > are used for avoiding the more common overflow/underflow cases. > > Additional investigation showed that testing for when both parts of > the denominator had exponents roughly small enough to allow shifting > any subnormal values to normal values, all input values could be > scaled up without risking unnecessary overflow and gaining a clear > improvement in accuracy. The test and scaling values used all fit > within the allowed exponent range for each precision required by the C > standard. The remaining number of troubling results in version 3 is > measurably smaller than in versions 1 and 2. > > The timing and precision tables below have been revised appropriately > to match the algorithms used in this version for double precision > and additional tables added to include results for other precisions. > > In prior versions, I omitted mention of the bug report that started me > on this project: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59714 > complex division is surprising on targets with FMA (was: on aarch64) > With the proposed method, whether using FMA or not, dividing > 1.0+3.0i by 1.0+3.0i correctly returns 1.0+0.0i. > > I also have added a reference to Beebe's "The Mathematical Function > Computation Handbook" [4] which was my starting point for research > into better complex divide methods. > > (Added for Version 2) > In my initial research, I missed Elen Kalda's proposed patch > https://gcc.gnu.org/legacyml/gccpatches/201908/msg01629.html [3] > Thanks to Joseph Myers for providing me with the pointer. > This version includes performance and accuracy comparisions > between Elen's proposed patch and my latest patch version for > double precision. > > (from earlier Versions) > > The following patch to libgcc/libgcc2.c __divdc3 provides an > opportunity to gain important improvements to the quality of answers > for the default complex divide routine (half, float, double, extended, > long double precisions) when dealing with very large or very small exponents. > > The current code correctly implements Smith's method (1962) [1] > further modified by c99's requirements for dealing with NaN (not a > number) results. When working with input values where the exponents > are greater than *_MAX_EXP/2 or less than (*_MAX_EXP)/2, results are > substantially different from the answers provided by quad precision > more than 1% of the time. This error rate may be unacceptable for many > applications that cannot a priori restrict their computations to the > safe range. The proposed method reduces the frequency of > "substantially different" answers by more than 99% for double > precision at a modest cost of performance. > > Differences between current gcc methods and the new method will be > described. Then accuracy and performance differences will be discussed. > > NOTATION > > For all of the following, the notation is: > Input complex values: > a+bi (a= real part, b= imaginary part) > c+di > Output complex value: > e+fi = (a+bi)/(c+di) > > For the result tables: > current = current method (SMITH) > b1div = method proposed by Elen Kalda > b2div = alternate method considered by Elen Kalda > new1 = new method using 1 divide and 2 multiplies > new = new method proposed by this patch > > DESCRIPTIONS of different complex divide methods: > > NAIVE COMPUTATION (fcxlimitedrange): > e = (a*c + b*d)/(c*c + d*d) > f = (b*c  a*d)/(c*c + d*d) > > Note that c*c and d*d will overflow or underflow if either > c or d is outside the range 2^538 to 2^512. > > This method is available in gcc when the switch fcxlimitedrange is > used. That switch is also enabled by ffastmath. Only one who has a > clear understanding of the maximum range of intermediate values > generated by a computation should consider using this switch. > > SMITH's METHOD (current libgcc): > if(fabs(c)<fabs(d) { > r = c/d; > denom = (c*r) + d; > e = (a*r + b) / denom; > f = (b*r  a) / denom; > } else { > r = d/c; > denom = c + (d*r); > e = (a + b*r) / denom; > f = (b  a*r) / denom; > } > > Smith's method is the current default method available with __divdc3. > > Elen Kalda's METHOD > > This method applies the most significant part of the algorithm > proposed by Baudin&Smith (2012) in the paper "A Robust Complex > Division in Scilab" [2]. Elen's method also replaces two divides > by one divide and two multiplies due to the high cost of divide > on aarch64. In the comparison sections, this method will be > labeled b1div. A variation discussed in that patch which > does not replace the two divides will be labeled b2div. > > inline void improved_internal (MTYPE a, MTYPE b, MTYPE c, MTYPE d) > { > r = d/c; > t = 1.0 / (c + (d * r)); > if (r != 0) { > x = (a + (b * r)) * t; > y = (b  (a * r)) * t; > } else { > /* Changing the order of operations avoids the underflow of r impacting > the result. */ > x = (a + (d * (b / c))) * t; > y = (b  (d * (a / c))) * t; > } > } > > if (FABS (d) < FABS (c)) { > improved_internal (a, b, c, d); > } else { > improved_internal (b, a, d, c); > y = y; > } > > NEW METHOD (proposed by patch) to replace the current default method: > > The proposed method starts with an algorithm proposed by Baudin&Smith > (2012) in the paper "A Robust Complex Division in Scilab" [2]. The > patch makes additional modifications to that method for further > reductions in the error rate. The following code shows the #define > values for double precision. See the patch for #define values used > for other precisions. > > #define RBIG ((DBL_MAX)/2.0) > #define RMIN (DBL_MIN) > #define RMIN2 (0x1.0p53) > #define RMINSCAL (0x1.0p+51) > > /* prevent overflow when arguments are near max representable */ > if ((FABS (d) > RBIG)  (FABS (a) > RBIG)  (FABS (b) > RBIG) ) { > a = a * 0.5; > b = b * 0.5; > c = c * 0.5; > d = d * 0.5; > } > /* minimize overflow/underflow issues when c and d are small */ > else if (FABS (d) < RMIN2) { > a = a * RMINSCAL; > b = b * RMINSCAL; > c = c * RMINSCAL; > d = d * RMINSCAL; > } > r = c/d; denom = (c*r) + d; > if( r > RMIN ) { > e = (a*r + b) / denom ; > f = (b*r  a) / denom > } else { > e = (c * (a/d) + b) / denom; > f = (c * (b/d)  a) / denom; > } > [ only presenting the fabs(c) < fabs(d) case here, full code in patch. ] > > Before any computation of the answer, the code checks for any input > values near maximum to allow down scaling to avoid overflow. These > scalings almost never harm the accuracy since they are by 2. Values that > are over RBIG are relatively rare but it is easy to test for them and > allow aviodance of overflows. > > Testing for RMIN2 reveals when both c and d are less than 2^53 (for > double precision, see patch for other values). By scaling all values > by 2^51, the code avoids many underflows in intermediate computations > that otherwise might occur. If scaling a and b by 2^51 causes either > to overflow, then the computation will overflow whatever method is > used. > > Next, r (the ratio of c to d) is checked for being near zero. Baudin > and Smith checked r for zero. Checking for values less than DBL_MIN > (subnormal) covers roughly 10 times as many cases and improves overall > accuracy. If r is too small, then when it is used in a multiplication, > there is a high chance that the result will underflow to zero, losing > significant accuracy. That underflow is avoided by reordering the > computation. When r is subnormal, the code replaces a*r (= a*(c/d)) > with ((a/d)*c) which is mathematically the same but avoids the > unnecessary underflow. > > TEST Data > > Two sets of data are presented to test these methods. Both sets > contain 10 million pairs of complex values. The exponents and > mantissas are generated using multiple calls to random() and then > combining the results. Only values which give results to complex > divide that are representable in the appropriate precision after > being computed in quad precision are used. > > The first data set is labeled "moderate exponents". > are greater than *_MAX_EXP/2 or less than (*_MAX_EXP)/2, results are > The exponent range is limited to DBL_MAX_EXP/2 to DBL_MAX_EXP/2 > or FLT_MAX_EXP or LDBL_MAX_EXP for the appropriate precisions. > The second data set is labeled "full exponents". > The exponent range for these cases is the full exponent range > for the precision. > > ACCURACY Test results: > > Note: The following accuracy tests are based on IEEE754 arithmetic. > > Note: All results are based on use of fused multiplyadd. If > fused multiplyadd is not used, the error rate increases slightly > for the 1 and 2 ulp cases for both current and new complex divide. > Differences at 8 ulp are barely measurable. > > The complex divide methods are evaluated by determining what > percentage of values exceed different ulp (units in last place) > levels. If a "2 ulp" test results show 1%, that would mean that 1% of > 10,000,000 values (100,000) have either a real or imaginary part that > had a greater than or equal to a 2 bit difference from the quad > precision result. > > Results are reported for differences greater than or equal to 1 ulp, 2 > ulp, 8 ulp, 16 ulp, 24 ulp, and 52 ulp for double precision. Even > when the patch avoids overflows and underflows, some input values are > expected to have errors due to the potential for catastrophic roundoff > from floating point subtraction. For example, when b*c and a*d are > nearly equal, the result of subtraction may lose several places of > accuracy. This patch does not attempt to detect or minimize this type > of error, but neither does it increase them. > > I only show the results for Elen Kalda's method (with both 1 and > 2 divides) and the new method for only 1 divide in the double > precision table. > > In the following charts, lower values are better. > > current  current complex divide in libgcc > b1div  Elen Kalda's method from Baudin & Smith with one divide > b2div  Elen Kalda's method from Baudin & Smith with two divides > new1  This patch except with 1 divide and 2 multiplies > new  This patch which uses 2 divides > > =============================================================== > Errors Moderate Dataset > gtr eq current b1div b2div new1 new > ====== ======== ======== ======== ======== ======== > 1 ulp 0.24707% 0.92986% 0.24707% 0.92986% 0.24707% > 2 ulp 0.01762% 0.01770% 0.01762% 0.01770% 0.01762% > 8 ulp 0.00026% 0.00026% 0.00026% 0.00026% 0.00026% > 16 ulp 0.00000% 0.00000% 0.00000% 0.00000% 0.00000% > 24 ulp 0% 0% 0% 0% 0% > 52 ulp 0% 0% 0% 0% 0% > =============================================================== > Table 1: Errors with Moderate Dataset > > Note in Table 1 that both the old and new methods give identical error > rates for data with moderate exponents. Errors exceeding 16 ulp are > exceedingly rare. There are substantial increases in the 1 ulp error > rates for the 1 divide/2 multiply methods as compared to the 2 divides > methods. These differences are minimal for 2 ulp and larger error > measurements. I chose to use the more accurate method of two divides > to avoid loss of accuracy relative to current behavior where current > behavior is not at risk of under/overflow. > > =============================================================== > Errors Full Dataset > gtr eq current b1div b2div new1 new > ====== ======== ======== ======== ======== ======== > 1 ulp 2.05% 1.23842% 0.67130% 0.71774% 0.17100% > 2 ulp 1.88% 0.51615% 0.50354% 0.02052% 0.01319% > 8 ulp 1.79% 0.43227% 0.42531% 0.00358% 0.00350% > 16 ulp 1.69% 0.34503% 0.33542% 0.00212% 0.00212% > 24 ulp 1.59% 0.26367% 0.25189% 0.00138% 0.00138% > 52 ulp 1.28% 0.01914% 0.00378% 0.00001% 0.00001% > =============================================================== > Table 2: Errors with Full Dataset (double) > > Table 2 shows significant differences in error rates, especially > between the current method and the new method. With the current code, > 1.13% of test values have results where no bits of the mantissa match > the correct answer. That level of error is extremely rare with the new > code. Consider the results for errors greater than or equal to 24 ulp. > The current code sees those errors at a rate of 1.6%. Most would agree > that such answers are "substantially different" from the answers > calculated using extended precision. The b1div and b2div still shows > errors about about onesixth of that rate. The new code reduces errors > at that level by more than a factor of 1000. These improvements are > important for serious computation of complex numbers. > > Correctness for float > ============================= > Errors Moderate Dataset > gtr eq current new > ====== ======== ======== > 1 ulp 28.68070% 28.68070% > 2 ulp 0.64386% 0.64386% > 8 ulp 0.00403% 0.00403% > 16 ulp 0.00001% 0.00001% > 24 ulp 0% 0% > ============================= > Table 3: Errors with Moderate Dataset (float) > > We see again in Table 3 as in Table 1, there is no change in accuracy > between the current and new methods when the input data does not > contain very large or very small exponents with potential for > under/overflow. > > ============================= > Errors Full Dataset > gtr eq current new > ====== ======== ======== > 1 ulp 19.97% 17.85193% > 2 ulp 3.20% 0.43153% > 8 ulp 2.16% 0.04139% > 16 ulp 1.40% 0.00818% (99.4% better) > 24 ulp 0.98% 0.00016% > ============================= > Table 4: Errors with Full Dataset (float) > > In Table 4, we see as in Table 2, when extreme exponents are encountered, > the new method provides a substantial reduction in unacceptable answers. > The improvements are not quite as much as for double precision due > to the inherent limitations of 32bit floats. > > There is no change in accuracy for halfprecision since the code is > unchanged. libgcc computes halfprecision functions in float precision > allowing the existing methods to avoid overflow/underflow issues > for the allowed range of exponents for halfprecision. > > Extended precision (using x87 80bit format on x86) and Long double > (using IEEE754 128bit on x86 and aarch64) both have 15bit exponents > as compared to 11bit exponents in double precision. We note that the > C standard also allows Long Double to be implemented in the equivalent > range of Double. The RMIN2 and RMINSCAL constants are selected to work > within the Double range as well as with extended and 128bit ranges. > We will limit our performance and accurancy discussions to the 80bit > and 128bit formats as seen on x86 here. > > The extended and long double precision investigations were more > limited. Aarch64 does not supported extended precision but does > support the software implementation of 128bit long double > precision. For x86, long double defaults to the 80bit precision but > using the mlongdouble128 flag switches to using the software > implementation of 128bit precision. Both 80bit and 128bit > precisions have the same exponent range, with the 128bit precision > has extended mantissas. Since this change is only aimed at avoiding > underflow/overflow for extreme exponents, I studied the extended > precision results on x86 for 100,000 values. The limited exponent > dataset showed no differences. For the dataset with full exponent > range, the current and new values showed major differences (greater > than 32 ulp) in 567 cases out of 100,000 (0.56%). In every one of > these cases, the ratio of c/d or d/c (as appropriate) was zero or > subnormal, indicating the advantage of the new method and its > continued correctness where needed. > > PERFORMANCE Test results > > In order for a library change to be practical, it is necessary to show > the slowdown is tolerable. The slowdowns observed are much less than > would be seen by (for example) switching from hardware double precison > to a software quad precision, which on the tested machines causes a > slowdown of around 100x). > > The actual slowdown depends on the machine architecture. It also > depends on the nature of the input data. If underflow/overflow is > rare, then implementations that have strong branch prediction will > only slowdown by a few cycles. If underflow/overflow is common, then > the branch predictors will be less successful and the cost will be > higher. > > Results from two machines are presented as examples of the overhead > for the new method. The one labeled x86 is a 5 year old Intel x86 > processor and the one labeled aarch64 is a 3 year old arm64 processor. > > In the following chart, the times are averaged over a one million > value data set. All values are scaled to set the time of current > method to be 1.0. Lower values are better. A value of less than 1.0 would > be faster than the current method and a value greater than 1.0 would > be slower than the current method. > > ================================================ > Moderate set full set > x86 aarch64 x86 aarch64 > ======== =============== =============== > float 1.15 1.15 1.96 1.96 > double 1.07 1.40 1.31 1.78 > long double 1.08 1.23 1.08 1.24 > ================================================ > Table 5: Performance Comparisons (ratio new/current) > > The above tables omit the timing for the 1 divide and 2 multiply > comparison with the 2 divide approach. I consider that optimization > vs accuracy issue separate from the rest of this proposal. > Roughly speaking the 1 divide and 2 multiply approach provides > a performance gain of 10 to 15% at a cost of 1 or 2 ulp about 0.5% > of the time for double precison. The loss of accurancy is somewhat > greater for float than for double precision. > > For the proposed change, the moderate dataset shows less overhead for > the newer methods than the full dataset. That's because the moderate > dataset does not ever take the new branches which protect from > under/overflow. The better the branch predictor, the lower the cost > for these untaken branches. Both platforms are somewhat dated, with > the x86 having a better branch predictor which reduces the cost of the > additional branches in the new code. Of course, the relative slowdown > may be greater for some architectures, especially those with limited > branch prediction combined with a high cost of misprediction. > > This cost is claimed to be tolerable on the grounds that: > (a) most applications will only spend a small fraction of their time > calculating complex divide. > (b) it is much less than the cost of extended precision > (c) users are not forced to use it (as described below) > (d) the cost is worthwhile considering the accuracy improvement shown > in Table 2 and 4. > > Those users who find this degree of slowdown unsatisfactory may use > the switch fcxfortranrules which does not use the library routine, > instead inlining Smith's method without the C99 requirement for > dealing with NaN results. The proposed patch for libgcc complex divide > does not affect the code generated by fcxfortranrules. > > As a side note, double precision is not slower than float for the > aarch64 and x64 platforms tested. A future change might investigate > the performance of promoting the float values of complex divide to > double precision and using the current Smith's method. Using double > precision for float values will avoid the round off effects and using > the larger mantissa will give better answers in the case of > catastrophic subtraction. Better accuracy with no loss of performance > seems a win all around, but would require performance testing on a > wide range of platforms to confirm the "no loss of performance" > observed on the two implementions tested so far. > > SUMMARY > > When input data to complex divide has exponents whose absolute value > is half than half of *_MAX_EXP, this patch makes no changes in > accuracy and has only a modest effect on performance. When input data > contains values outside those ranges, the patch eliminates more than > 99.5% of major errors with a tolerable cost in performance. > > In comparison to Elen Kalda's method, this patch introduces more > performance overhead but reduces overflow/underflow more than 100 > times as often. > > There is an intermediate design choice which would catch 13x fewer > errors while introducing roughly 20% lower overhead. That could be > analyzed in depth if the overhead of the current proposal is > unacceptable. > > REFERENCES > > [1] Robert L. Smith. Algorithm 116: Complex division. Commun. ACM, > 5(8):435, 1962. > > [2] Michael Baudin and Robert L. Smith. "A robust complex division in > Scilab," October 2012, available at http://arxiv.org/abs/1210.4539. > > [3] Elen Kalda: Complex division improvements in libgcc > https://gcc.gnu.org/legacyml/gccpatches/201908/msg01629.html > > [4] Nelson H.F. Beebe, "The MathematicalFunction Computation Handbook. > Springer International Publishing AG, 2017. >  > libgcc/ChangeLog  6 ++++ > libgcc/libgcc2.c  105 ++++++++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 107 insertions(+), 4 deletions() > > diff git a/libgcc/ChangeLog b/libgcc/ChangeLog > index d6367e2..d1db004 100644 >  a/libgcc/ChangeLog > +++ b/libgcc/ChangeLog > @@ 1,3 +1,9 @@ > +20200624 Patrick McGehearty <patrick.mcgehearty@oracle.com> > + > + * libgcc2.c (__divdc3): Enhance accuracy of complex divide by > + avoiding underflow/overflow when input values have very large > + or very small exponents. > + > 20200625 Martin Liska <mliska@suse.cz> > > * libgcovdriver.c (merge_summary): Remove function as its name > diff git a/libgcc/libgcc2.c b/libgcc/libgcc2.c > index e0a9fd7..feffb13 100644 >  a/libgcc/libgcc2.c > +++ b/libgcc/libgcc2.c > @@ 2036,13 +2036,35 @@ CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d) > CTYPE > CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d) > { > +#if defined(L_divsc3) > +#define RBIG ((FLT_MAX)/2.0) > +#define RMIN (FLT_MIN) > +#define RMIN2 (0x1.0p21) > +#define RMINSCAL (0x1.0p+19) > +#endif > + > +#if defined(L_divdc3) > +#define RBIG ((DBL_MAX)/2.0) > +#define RMIN (DBL_MIN) > +#define RMIN2 (0x1.0p53) > +#define RMINSCAL (0x1.0p+51) > +#endif > + > +#if (defined(L_divxc3)  defined(L_divtc3)) > +#define RBIG ((LDBL_MAX)/2.0) > +#define RMIN (LDBL_MIN) > +#define RMIN2 (0x1.0p53) > +#define RMINSCAL (0x1.0p+51) > +#endif > + > MTYPE denom, ratio, x, y; > CTYPE res; > >  /* ??? We can get better behavior from logarithmic scaling instead of >  the division. But that would mean starting to link libgcc against >  libm. We could implement something akin to ldexp/frexp as gcc builtins >  fairly easily... */ > +#if defined(L_divhc3) > + /* half precision is handled with float precision. > + no extra measures are needed to avoid overflow/underflow */ > + > + /* Scale by max(c,d) to reduce chances of denominator overflowing */ > if (FABS (c) < FABS (d)) > { > ratio = c / d; > @@ 2057,6 +2079,81 @@ CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d) > x = ((b * ratio) + a) / denom; > y = (b  (a * ratio)) / denom; > } > +#else > + /* float, double, extended, long double have significant potential > + underflow/overflow errors than can be greatly reduced with > + a limited number of adjustments */ > + > + /* Scale by max(c,d) to reduce chances of denominator overflowing */ > + if (FABS (c) < FABS (d)) > + { > + /* prevent underflow when denominator is near max representable */ > + if (FABS (d) >= RBIG) > + { > + a = a * 0.5; > + b = b * 0.5; > + c = c * 0.5; > + d = d * 0.5; > + } > + /* avoid overflow/underflow issues when c and d are small */ > + /* scaling up helps avoid some underflows */ > + /* No new overflow possible since c&d < RMIN2 */ > + if (FABS (d) < RMIN2) > + { > + a = a * RMINSCAL; > + b = b * RMINSCAL; > + c = c * RMINSCAL; > + d = d * RMINSCAL; > + } > + ratio = c / d; > + denom = (c * ratio) + d; > + /* choose alternate order of computation if ratio is subnormal */ > + if (FABS (ratio) > RMIN) > + { > + x = ((a * ratio) + b) / denom; > + y = ((b * ratio)  a) / denom; > + } > + else > + { > + x = ((c * (a / d)) + b) / denom; > + y = ((c * (b / d))  a) / denom; > + } > + } > + else > + { > + /* prevent underflow when denominator is near max representable */ > + if (FABS(c) >= RBIG) > + { > + a = a * 0.5; > + b = b * 0.5; > + c = c * 0.5; > + d = d * 0.5; > + } > + /* avoid overflow/underflow issues when both c and d are small */ > + /* scaling up helps avoid some underflows */ > + /* No new overflow possible since both c&d are less than RMIN2 */ > + if (FABS(c) < RMIN2) > + { > + a = a * RMINSCAL; > + b = b * RMINSCAL; > + c = c * RMINSCAL; > + d = d * RMINSCAL; > + } > + ratio = d / c; > + denom = (d * ratio) + c; > + /* choose alternate order of computation if ratio is subnormal */ > + if (FABS(ratio) > RMIN) > + { > + x = ((b * ratio) + a) / denom; > + y = (b  (a * ratio)) / denom; > + } > + else > + { > + x = (a + (d * (b / c))) / denom; > + y = (b  (d * (a / c))) / denom; > + } > + } > +#endif > > /* Recover infinities and zeros that computed as NaN+iNaN; the only cases > are nonzero/zero, infinite/finite, and finite/infinite. */
2nd ping. Any estimate on when a reviewer might get to this improvement in the accuracy of Complex Divide? I'm happy to supply more info on what testing I've done and details about design decisions. I'd prefer to do that sooner than later as who knows when corporate priority decisions might prevent me from having time for rapid response.  Patrick McGehearty (patrick.mcgehearty@oracle.com) On 7/21/2020 12:19 PM, Patrick McGehearty via Gccpatches wrote: > Ping > > > > On 7/1/2020 11:30 AM, Patrick McGehearty via Gccpatches wrote: >> (Version 3) >> >> (Added in version 3) >> Support for half, float, extended, and long double precision has >> been added to the prior work for double precision. Since half precision >> is computed with float precision as per current libgcc practice, >> the enhanced underflow/overflow tests provide no benefit for half >> precision and would cost performance. Therefore half precision is >> left unchanged. >> >> The existing constants for each precision: >> float: FLT_MAX, FLT_MIN; >> double: DBL_MAX, DBL_MIN; >> extended and/or long double: LDBL_MAX, LDBL_MIN >> are used for avoiding the more common overflow/underflow cases. >> >> Additional investigation showed that testing for when both parts of >> the denominator had exponents roughly small enough to allow shifting >> any subnormal values to normal values, all input values could be >> scaled up without risking unnecessary overflow and gaining a clear >> improvement in accuracy. The test and scaling values used all fit >> within the allowed exponent range for each precision required by the C >> standard. The remaining number of troubling results in version 3 is >> measurably smaller than in versions 1 and 2. >> >> The timing and precision tables below have been revised appropriately >> to match the algorithms used in this version for double precision >> and additional tables added to include results for other precisions. >> >> In prior versions, I omitted mention of the bug report that started me >> on this project: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59714 >> complex division is surprising on targets with FMA (was: on aarch64) >> With the proposed method, whether using FMA or not, dividing >> 1.0+3.0i by 1.0+3.0i correctly returns 1.0+0.0i. >> >> I also have added a reference to Beebe's "The Mathematical Function >> Computation Handbook" [4] which was my starting point for research >> into better complex divide methods. >> >> (Added for Version 2) >> In my initial research, I missed Elen Kalda's proposed patch >> https://gcc.gnu.org/legacyml/gccpatches/201908/msg01629.html [3] >> Thanks to Joseph Myers for providing me with the pointer. >> This version includes performance and accuracy comparisions >> between Elen's proposed patch and my latest patch version for >> double precision. >> >> (from earlier Versions) >> >> The following patch to libgcc/libgcc2.c __divdc3 provides an >> opportunity to gain important improvements to the quality of answers >> for the default complex divide routine (half, float, double, extended, >> long double precisions) when dealing with very large or very small >> exponents. >> >> The current code correctly implements Smith's method (1962) [1] >> further modified by c99's requirements for dealing with NaN (not a >> number) results. When working with input values where the exponents >> are greater than *_MAX_EXP/2 or less than (*_MAX_EXP)/2, results are >> substantially different from the answers provided by quad precision >> more than 1% of the time. This error rate may be unacceptable for many >> applications that cannot a priori restrict their computations to the >> safe range. The proposed method reduces the frequency of >> "substantially different" answers by more than 99% for double >> precision at a modest cost of performance. >> >> Differences between current gcc methods and the new method will be >> described. Then accuracy and performance differences will be discussed. >> >> NOTATION >> >> For all of the following, the notation is: >> Input complex values: >> a+bi (a= real part, b= imaginary part) >> c+di >> Output complex value: >> e+fi = (a+bi)/(c+di) >> >> For the result tables: >> current = current method (SMITH) >> b1div = method proposed by Elen Kalda >> b2div = alternate method considered by Elen Kalda >> new1 = new method using 1 divide and 2 multiplies >> new = new method proposed by this patch >> >> DESCRIPTIONS of different complex divide methods: >> >> NAIVE COMPUTATION (fcxlimitedrange): >> e = (a*c + b*d)/(c*c + d*d) >> f = (b*c  a*d)/(c*c + d*d) >> >> Note that c*c and d*d will overflow or underflow if either >> c or d is outside the range 2^538 to 2^512. >> >> This method is available in gcc when the switch fcxlimitedrange is >> used. That switch is also enabled by ffastmath. Only one who has a >> clear understanding of the maximum range of intermediate values >> generated by a computation should consider using this switch. >> >> SMITH's METHOD (current libgcc): >> if(fabs(c)<fabs(d) { >> r = c/d; >> denom = (c*r) + d; >> e = (a*r + b) / denom; >> f = (b*r  a) / denom; >> } else { >> r = d/c; >> denom = c + (d*r); >> e = (a + b*r) / denom; >> f = (b  a*r) / denom; >> } >> >> Smith's method is the current default method available with __divdc3. >> >> Elen Kalda's METHOD >> >> This method applies the most significant part of the algorithm >> proposed by Baudin&Smith (2012) in the paper "A Robust Complex >> Division in Scilab" [2]. Elen's method also replaces two divides >> by one divide and two multiplies due to the high cost of divide >> on aarch64. In the comparison sections, this method will be >> labeled b1div. A variation discussed in that patch which >> does not replace the two divides will be labeled b2div. >> >> inline void improved_internal (MTYPE a, MTYPE b, MTYPE c, MTYPE d) >> { >> r = d/c; >> t = 1.0 / (c + (d * r)); >> if (r != 0) { >> x = (a + (b * r)) * t; >> y = (b  (a * r)) * t; >> } else { >> /* Changing the order of operations avoids the underflow of r >> impacting >> the result. */ >> x = (a + (d * (b / c))) * t; >> y = (b  (d * (a / c))) * t; >> } >> } >> >> if (FABS (d) < FABS (c)) { >> improved_internal (a, b, c, d); >> } else { >> improved_internal (b, a, d, c); >> y = y; >> } >> >> NEW METHOD (proposed by patch) to replace the current default method: >> >> The proposed method starts with an algorithm proposed by Baudin&Smith >> (2012) in the paper "A Robust Complex Division in Scilab" [2]. The >> patch makes additional modifications to that method for further >> reductions in the error rate. The following code shows the #define >> values for double precision. See the patch for #define values used >> for other precisions. >> >> #define RBIG ((DBL_MAX)/2.0) >> #define RMIN (DBL_MIN) >> #define RMIN2 (0x1.0p53) >> #define RMINSCAL (0x1.0p+51) >> >> /* prevent overflow when arguments are near max representable */ >> if ((FABS (d) > RBIG)  (FABS (a) > RBIG)  (FABS (b) > RBIG) ) { >> a = a * 0.5; >> b = b * 0.5; >> c = c * 0.5; >> d = d * 0.5; >> } >> /* minimize overflow/underflow issues when c and d are small */ >> else if (FABS (d) < RMIN2) { >> a = a * RMINSCAL; >> b = b * RMINSCAL; >> c = c * RMINSCAL; >> d = d * RMINSCAL; >> } >> r = c/d; denom = (c*r) + d; >> if( r > RMIN ) { >> e = (a*r + b) / denom ; >> f = (b*r  a) / denom >> } else { >> e = (c * (a/d) + b) / denom; >> f = (c * (b/d)  a) / denom; >> } >> [ only presenting the fabs(c) < fabs(d) case here, full code in patch. ] >> >> Before any computation of the answer, the code checks for any input >> values near maximum to allow down scaling to avoid overflow. These >> scalings almost never harm the accuracy since they are by 2. Values that >> are over RBIG are relatively rare but it is easy to test for them and >> allow aviodance of overflows. >> >> Testing for RMIN2 reveals when both c and d are less than 2^53 (for >> double precision, see patch for other values). By scaling all values >> by 2^51, the code avoids many underflows in intermediate computations >> that otherwise might occur. If scaling a and b by 2^51 causes either >> to overflow, then the computation will overflow whatever method is >> used. >> >> Next, r (the ratio of c to d) is checked for being near zero. Baudin >> and Smith checked r for zero. Checking for values less than DBL_MIN >> (subnormal) covers roughly 10 times as many cases and improves overall >> accuracy. If r is too small, then when it is used in a multiplication, >> there is a high chance that the result will underflow to zero, losing >> significant accuracy. That underflow is avoided by reordering the >> computation. When r is subnormal, the code replaces a*r (= a*(c/d)) >> with ((a/d)*c) which is mathematically the same but avoids the >> unnecessary underflow. >> >> TEST Data >> >> Two sets of data are presented to test these methods. Both sets >> contain 10 million pairs of complex values. The exponents and >> mantissas are generated using multiple calls to random() and then >> combining the results. Only values which give results to complex >> divide that are representable in the appropriate precision after >> being computed in quad precision are used. >> >> The first data set is labeled "moderate exponents". >> are greater than *_MAX_EXP/2 or less than (*_MAX_EXP)/2, results are >> The exponent range is limited to DBL_MAX_EXP/2 to DBL_MAX_EXP/2 >> or FLT_MAX_EXP or LDBL_MAX_EXP for the appropriate precisions. >> The second data set is labeled "full exponents". >> The exponent range for these cases is the full exponent range >> for the precision. >> >> ACCURACY Test results: >> >> Note: The following accuracy tests are based on IEEE754 arithmetic. >> >> Note: All results are based on use of fused multiplyadd. If >> fused multiplyadd is not used, the error rate increases slightly >> for the 1 and 2 ulp cases for both current and new complex divide. >> Differences at 8 ulp are barely measurable. >> >> The complex divide methods are evaluated by determining what >> percentage of values exceed different ulp (units in last place) >> levels. If a "2 ulp" test results show 1%, that would mean that 1% of >> 10,000,000 values (100,000) have either a real or imaginary part that >> had a greater than or equal to a 2 bit difference from the quad >> precision result. >> >> Results are reported for differences greater than or equal to 1 ulp, 2 >> ulp, 8 ulp, 16 ulp, 24 ulp, and 52 ulp for double precision. Even >> when the patch avoids overflows and underflows, some input values are >> expected to have errors due to the potential for catastrophic roundoff >> from floating point subtraction. For example, when b*c and a*d are >> nearly equal, the result of subtraction may lose several places of >> accuracy. This patch does not attempt to detect or minimize this type >> of error, but neither does it increase them. >> >> I only show the results for Elen Kalda's method (with both 1 and >> 2 divides) and the new method for only 1 divide in the double >> precision table. >> >> In the following charts, lower values are better. >> >> current  current complex divide in libgcc >> b1div  Elen Kalda's method from Baudin & Smith with one divide >> b2div  Elen Kalda's method from Baudin & Smith with two divides >> new1  This patch except with 1 divide and 2 multiplies >> new  This patch which uses 2 divides >> >> =============================================================== >> Errors Moderate Dataset >> gtr eq current b1div b2div new1 new >> ====== ======== ======== ======== ======== ======== >> 1 ulp 0.24707% 0.92986% 0.24707% 0.92986% 0.24707% >> 2 ulp 0.01762% 0.01770% 0.01762% 0.01770% 0.01762% >> 8 ulp 0.00026% 0.00026% 0.00026% 0.00026% 0.00026% >> 16 ulp 0.00000% 0.00000% 0.00000% 0.00000% 0.00000% >> 24 ulp 0% 0% 0% 0% 0% >> 52 ulp 0% 0% 0% 0% 0% >> =============================================================== >> Table 1: Errors with Moderate Dataset >> >> Note in Table 1 that both the old and new methods give identical error >> rates for data with moderate exponents. Errors exceeding 16 ulp are >> exceedingly rare. There are substantial increases in the 1 ulp error >> rates for the 1 divide/2 multiply methods as compared to the 2 divides >> methods. These differences are minimal for 2 ulp and larger error >> measurements. I chose to use the more accurate method of two divides >> to avoid loss of accuracy relative to current behavior where current >> behavior is not at risk of under/overflow. >> >> =============================================================== >> Errors Full Dataset >> gtr eq current b1div b2div new1 new >> ====== ======== ======== ======== ======== ======== >> 1 ulp 2.05% 1.23842% 0.67130% 0.71774% 0.17100% >> 2 ulp 1.88% 0.51615% 0.50354% 0.02052% 0.01319% >> 8 ulp 1.79% 0.43227% 0.42531% 0.00358% 0.00350% >> 16 ulp 1.69% 0.34503% 0.33542% 0.00212% 0.00212% >> 24 ulp 1.59% 0.26367% 0.25189% 0.00138% 0.00138% >> 52 ulp 1.28% 0.01914% 0.00378% 0.00001% 0.00001% >> =============================================================== >> Table 2: Errors with Full Dataset (double) >> >> Table 2 shows significant differences in error rates, especially >> between the current method and the new method. With the current code, >> 1.13% of test values have results where no bits of the mantissa match >> the correct answer. That level of error is extremely rare with the new >> code. Consider the results for errors greater than or equal to 24 ulp. >> The current code sees those errors at a rate of 1.6%. Most would agree >> that such answers are "substantially different" from the answers >> calculated using extended precision. The b1div and b2div still shows >> errors about about onesixth of that rate. The new code reduces errors >> at that level by more than a factor of 1000. These improvements are >> important for serious computation of complex numbers. >> >> Correctness for float >> ============================= >> Errors Moderate Dataset >> gtr eq current new >> ====== ======== ======== >> 1 ulp 28.68070% 28.68070% >> 2 ulp 0.64386% 0.64386% >> 8 ulp 0.00403% 0.00403% >> 16 ulp 0.00001% 0.00001% >> 24 ulp 0% 0% >> ============================= >> Table 3: Errors with Moderate Dataset (float) >> >> We see again in Table 3 as in Table 1, there is no change in accuracy >> between the current and new methods when the input data does not >> contain very large or very small exponents with potential for >> under/overflow. >> >> ============================= >> Errors Full Dataset >> gtr eq current new >> ====== ======== ======== >> 1 ulp 19.97% 17.85193% >> 2 ulp 3.20% 0.43153% >> 8 ulp 2.16% 0.04139% >> 16 ulp 1.40% 0.00818% (99.4% better) >> 24 ulp 0.98% 0.00016% >> ============================= >> Table 4: Errors with Full Dataset (float) >> >> In Table 4, we see as in Table 2, when extreme exponents are >> encountered, >> the new method provides a substantial reduction in unacceptable answers. >> The improvements are not quite as much as for double precision due >> to the inherent limitations of 32bit floats. >> >> There is no change in accuracy for halfprecision since the code is >> unchanged. libgcc computes halfprecision functions in float precision >> allowing the existing methods to avoid overflow/underflow issues >> for the allowed range of exponents for halfprecision. >> >> Extended precision (using x87 80bit format on x86) and Long double >> (using IEEE754 128bit on x86 and aarch64) both have 15bit exponents >> as compared to 11bit exponents in double precision. We note that the >> C standard also allows Long Double to be implemented in the equivalent >> range of Double. The RMIN2 and RMINSCAL constants are selected to work >> within the Double range as well as with extended and 128bit ranges. >> We will limit our performance and accurancy discussions to the 80bit >> and 128bit formats as seen on x86 here. >> >> The extended and long double precision investigations were more >> limited. Aarch64 does not supported extended precision but does >> support the software implementation of 128bit long double >> precision. For x86, long double defaults to the 80bit precision but >> using the mlongdouble128 flag switches to using the software >> implementation of 128bit precision. Both 80bit and 128bit >> precisions have the same exponent range, with the 128bit precision >> has extended mantissas. Since this change is only aimed at avoiding >> underflow/overflow for extreme exponents, I studied the extended >> precision results on x86 for 100,000 values. The limited exponent >> dataset showed no differences. For the dataset with full exponent >> range, the current and new values showed major differences (greater >> than 32 ulp) in 567 cases out of 100,000 (0.56%). In every one of >> these cases, the ratio of c/d or d/c (as appropriate) was zero or >> subnormal, indicating the advantage of the new method and its >> continued correctness where needed. >> >> PERFORMANCE Test results >> >> In order for a library change to be practical, it is necessary to show >> the slowdown is tolerable. The slowdowns observed are much less than >> would be seen by (for example) switching from hardware double precison >> to a software quad precision, which on the tested machines causes a >> slowdown of around 100x). >> >> The actual slowdown depends on the machine architecture. It also >> depends on the nature of the input data. If underflow/overflow is >> rare, then implementations that have strong branch prediction will >> only slowdown by a few cycles. If underflow/overflow is common, then >> the branch predictors will be less successful and the cost will be >> higher. >> >> Results from two machines are presented as examples of the overhead >> for the new method. The one labeled x86 is a 5 year old Intel x86 >> processor and the one labeled aarch64 is a 3 year old arm64 processor. >> >> In the following chart, the times are averaged over a one million >> value data set. All values are scaled to set the time of current >> method to be 1.0. Lower values are better. A value of less than 1.0 >> would >> be faster than the current method and a value greater than 1.0 would >> be slower than the current method. >> >> ================================================ >> Moderate set full set >> x86 aarch64 x86 aarch64 >> ======== =============== =============== >> float 1.15 1.15 1.96 1.96 >> double 1.07 1.40 1.31 1.78 >> long double 1.08 1.23 1.08 1.24 >> ================================================ >> Table 5: Performance Comparisons (ratio new/current) >> >> The above tables omit the timing for the 1 divide and 2 multiply >> comparison with the 2 divide approach. I consider that optimization >> vs accuracy issue separate from the rest of this proposal. >> Roughly speaking the 1 divide and 2 multiply approach provides >> a performance gain of 10 to 15% at a cost of 1 or 2 ulp about 0.5% >> of the time for double precison. The loss of accurancy is somewhat >> greater for float than for double precision. >> >> For the proposed change, the moderate dataset shows less overhead for >> the newer methods than the full dataset. That's because the moderate >> dataset does not ever take the new branches which protect from >> under/overflow. The better the branch predictor, the lower the cost >> for these untaken branches. Both platforms are somewhat dated, with >> the x86 having a better branch predictor which reduces the cost of the >> additional branches in the new code. Of course, the relative slowdown >> may be greater for some architectures, especially those with limited >> branch prediction combined with a high cost of misprediction. >> >> This cost is claimed to be tolerable on the grounds that: >> (a) most applications will only spend a small fraction of their time >> calculating complex divide. >> (b) it is much less than the cost of extended precision >> (c) users are not forced to use it (as described below) >> (d) the cost is worthwhile considering the accuracy improvement shown >> in Table 2 and 4. >> >> Those users who find this degree of slowdown unsatisfactory may use >> the switch fcxfortranrules which does not use the library routine, >> instead inlining Smith's method without the C99 requirement for >> dealing with NaN results. The proposed patch for libgcc complex divide >> does not affect the code generated by fcxfortranrules. >> >> As a side note, double precision is not slower than float for the >> aarch64 and x64 platforms tested. A future change might investigate >> the performance of promoting the float values of complex divide to >> double precision and using the current Smith's method. Using double >> precision for float values will avoid the round off effects and using >> the larger mantissa will give better answers in the case of >> catastrophic subtraction. Better accuracy with no loss of performance >> seems a win all around, but would require performance testing on a >> wide range of platforms to confirm the "no loss of performance" >> observed on the two implementions tested so far. >> >> SUMMARY >> >> When input data to complex divide has exponents whose absolute value >> is half than half of *_MAX_EXP, this patch makes no changes in >> accuracy and has only a modest effect on performance. When input data >> contains values outside those ranges, the patch eliminates more than >> 99.5% of major errors with a tolerable cost in performance. >> >> In comparison to Elen Kalda's method, this patch introduces more >> performance overhead but reduces overflow/underflow more than 100 >> times as often. >> >> There is an intermediate design choice which would catch 13x fewer >> errors while introducing roughly 20% lower overhead. That could be >> analyzed in depth if the overhead of the current proposal is >> unacceptable. >> >> REFERENCES >> >> [1] Robert L. Smith. Algorithm 116: Complex division. Commun. ACM, >> 5(8):435, 1962. >> >> [2] Michael Baudin and Robert L. Smith. "A robust complex division in >> Scilab," October 2012, available at http://arxiv.org/abs/1210.4539. >> >> [3] Elen Kalda: Complex division improvements in libgcc >> https://gcc.gnu.org/legacyml/gccpatches/201908/msg01629.html >> >> [4] Nelson H.F. Beebe, "The MathematicalFunction Computation Handbook. >> Springer International Publishing AG, 2017. >>  >> libgcc/ChangeLog  6 ++++ >> libgcc/libgcc2.c  105 >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >> 2 files changed, 107 insertions(+), 4 deletions() >> >> diff git a/libgcc/ChangeLog b/libgcc/ChangeLog >> index d6367e2..d1db004 100644 >>  a/libgcc/ChangeLog >> +++ b/libgcc/ChangeLog >> @@ 1,3 +1,9 @@ >> +20200624 Patrick McGehearty <patrick.mcgehearty@oracle.com> >> + >> + * libgcc2.c (__divdc3): Enhance accuracy of complex divide by >> + avoiding underflow/overflow when input values have very large >> + or very small exponents. >> + >> 20200625 Martin Liska <mliska@suse.cz> >> * libgcovdriver.c (merge_summary): Remove function as its name >> diff git a/libgcc/libgcc2.c b/libgcc/libgcc2.c >> index e0a9fd7..feffb13 100644 >>  a/libgcc/libgcc2.c >> +++ b/libgcc/libgcc2.c >> @@ 2036,13 +2036,35 @@ CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, >> MTYPE c, MTYPE d) >> CTYPE >> CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d) >> { >> +#if defined(L_divsc3) >> +#define RBIG ((FLT_MAX)/2.0) >> +#define RMIN (FLT_MIN) >> +#define RMIN2 (0x1.0p21) >> +#define RMINSCAL (0x1.0p+19) >> +#endif >> + >> +#if defined(L_divdc3) >> +#define RBIG ((DBL_MAX)/2.0) >> +#define RMIN (DBL_MIN) >> +#define RMIN2 (0x1.0p53) >> +#define RMINSCAL (0x1.0p+51) >> +#endif >> + >> +#if (defined(L_divxc3)  defined(L_divtc3)) >> +#define RBIG ((LDBL_MAX)/2.0) >> +#define RMIN (LDBL_MIN) >> +#define RMIN2 (0x1.0p53) >> +#define RMINSCAL (0x1.0p+51) >> +#endif >> + >> MTYPE denom, ratio, x, y; >> CTYPE res; >>  /* ??? We can get better behavior from logarithmic scaling >> instead of >>  the division. But that would mean starting to link libgcc against >>  libm. We could implement something akin to ldexp/frexp as gcc >> builtins >>  fairly easily... */ >> +#if defined(L_divhc3) >> + /* half precision is handled with float precision. >> + no extra measures are needed to avoid overflow/underflow */ >> + >> + /* Scale by max(c,d) to reduce chances of denominator overflowing */ >> if (FABS (c) < FABS (d)) >> { >> ratio = c / d; >> @@ 2057,6 +2079,81 @@ CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE >> c, MTYPE d) >> x = ((b * ratio) + a) / denom; >> y = (b  (a * ratio)) / denom; >> } >> +#else >> + /* float, double, extended, long double have significant potential >> + underflow/overflow errors than can be greatly reduced with >> + a limited number of adjustments */ >> + >> + /* Scale by max(c,d) to reduce chances of denominator overflowing */ >> + if (FABS (c) < FABS (d)) >> + { >> + /* prevent underflow when denominator is near max >> representable */ >> + if (FABS (d) >= RBIG) >> + { >> + a = a * 0.5; >> + b = b * 0.5; >> + c = c * 0.5; >> + d = d * 0.5; >> + } >> + /* avoid overflow/underflow issues when c and d are small */ >> + /* scaling up helps avoid some underflows */ >> + /* No new overflow possible since c&d < RMIN2 */ >> + if (FABS (d) < RMIN2) >> + { >> + a = a * RMINSCAL; >> + b = b * RMINSCAL; >> + c = c * RMINSCAL; >> + d = d * RMINSCAL; >> + } >> + ratio = c / d; >> + denom = (c * ratio) + d; >> + /* choose alternate order of computation if ratio is subnormal */ >> + if (FABS (ratio) > RMIN) >> + { >> + x = ((a * ratio) + b) / denom; >> + y = ((b * ratio)  a) / denom; >> + } >> + else >> + { >> + x = ((c * (a / d)) + b) / denom; >> + y = ((c * (b / d))  a) / denom; >> + } >> + } >> + else >> + { >> + /* prevent underflow when denominator is near max >> representable */ >> + if (FABS(c) >= RBIG) >> + { >> + a = a * 0.5; >> + b = b * 0.5; >> + c = c * 0.5; >> + d = d * 0.5; >> + } >> + /* avoid overflow/underflow issues when both c and d are small */ >> + /* scaling up helps avoid some underflows */ >> + /* No new overflow possible since both c&d are less than RMIN2 */ >> + if (FABS(c) < RMIN2) >> + { >> + a = a * RMINSCAL; >> + b = b * RMINSCAL; >> + c = c * RMINSCAL; >> + d = d * RMINSCAL; >> + } >> + ratio = d / c; >> + denom = (d * ratio) + c; >> + /* choose alternate order of computation if ratio is subnormal */ >> + if (FABS(ratio) > RMIN) >> + { >> + x = ((b * ratio) + a) / denom; >> + y = (b  (a * ratio)) / denom; >> + } >> + else >> + { >> + x = (a + (d * (b / c))) / denom; >> + y = (b  (d * (a / c))) / denom; >> + } >> + } >> +#endif >> /* Recover infinities and zeros that computed as NaN+iNaN; the >> only cases >> are nonzero/zero, infinite/finite, and finite/infinite. */ >
=============================================================== Errors Moderate Dataset gtr eq current b1div b2div new1 new ====== ======== ======== ======== ======== ======== 1 ulp 0.24707% 0.92986% 0.24707% 0.92986% 0.24707% 2 ulp 0.01762% 0.01770% 0.01762% 0.01770% 0.01762% 8 ulp 0.00026% 0.00026% 0.00026% 0.00026% 0.00026% 16 ulp 0.00000% 0.00000% 0.00000% 0.00000% 0.00000% 24 ulp 0% 0% 0% 0% 0% 52 ulp 0% 0% 0% 0% 0% =============================================================== Table 1: Errors with Moderate Dataset Note in Table 1 that both the old and new methods give identical error rates for data with moderate exponents. Errors exceeding 16 ulp are exceedingly rare. There are substantial increases in the 1 ulp error rates for the 1 divide/2 multiply methods as compared to the 2 divides methods. These differences are minimal for 2 ulp and larger error measurements. I chose to use the more accurate method of two divides to avoid loss of accuracy relative to current behavior where current behavior is not at risk of under/overflow. =============================================================== Errors Full Dataset gtr eq current b1div b2div new1 new ====== ======== ======== ======== ======== ======== 1 ulp 2.05% 1.23842% 0.67130% 0.71774% 0.17100% 2 ulp 1.88% 0.51615% 0.50354% 0.02052% 0.01319% 8 ulp 1.79% 0.43227% 0.42531% 0.00358% 0.00350% 16 ulp 1.69% 0.34503% 0.33542% 0.00212% 0.00212% 24 ulp 1.59% 0.26367% 0.25189% 0.00138% 0.00138% 52 ulp 1.28% 0.01914% 0.00378% 0.00001% 0.00001% =============================================================== Table 2: Errors with Full Dataset (double) Table 2 shows significant differences in error rates, especially between the current method and the new method. With the current code, 1.13% of test values have results where no bits of the mantissa match the correct answer. That level of error is extremely rare with the new code. Consider the results for errors greater than or equal to 24 ulp. The current code sees those errors at a rate of 1.6%. Most would agree that such answers are "substantially different" from the answers calculated using extended precision. The b1div and b2div still shows errors about about onesixth of that rate. The new code reduces errors at that level by more than a factor of 1000. These improvements are important for serious computation of complex numbers. Correctness for float ============================= Errors Moderate Dataset gtr eq current new ====== ======== ======== 1 ulp 28.68070% 28.68070% 2 ulp 0.64386% 0.64386% 8 ulp 0.00403% 0.00403% 16 ulp 0.00001% 0.00001% 24 ulp 0% 0% ============================= Table 3: Errors with Moderate Dataset (float) We see again in Table 3 as in Table 1, there is no change in accuracy between the current and new methods when the input data does not contain very large or very small exponents with potential for under/overflow. ============================= Errors Full Dataset gtr eq current new ====== ======== ======== 1 ulp 19.97% 17.85193% 2 ulp 3.20% 0.43153% 8 ulp 2.16% 0.04139% 16 ulp 1.40% 0.00818% (99.4% better) 24 ulp 0.98% 0.00016% ============================= Table 4: Errors with Full Dataset (float) In Table 4, we see as in Table 2, when extreme exponents are encountered, the new method provides a substantial reduction in unacceptable answers. The improvements are not quite as much as for double precision due to the inherent limitations of 32bit floats. There is no change in accuracy for halfprecision since the code is unchanged. libgcc computes halfprecision functions in float precision allowing the existing methods to avoid overflow/underflow issues for the allowed range of exponents for halfprecision. Extended precision (using x87 80bit format on x86) and Long double (using IEEE754 128bit on x86 and aarch64) both have 15bit exponents as compared to 11bit exponents in double precision. We note that the C standard also allows Long Double to be implemented in the equivalent range of Double. The RMIN2 and RMINSCAL constants are selected to work within the Double range as well as with extended and 128bit ranges. We will limit our performance and accurancy discussions to the 80bit and 128bit formats as seen on x86 here. The extended and long double precision investigations were more limited. Aarch64 does not supported extended precision but does support the software implementation of 128bit long double precision. For x86, long double defaults to the 80bit precision but using the mlongdouble128 flag switches to using the software implementation of 128bit precision. Both 80bit and 128bit precisions have the same exponent range, with the 128bit precision has extended mantissas. Since this change is only aimed at avoiding underflow/overflow for extreme exponents, I studied the extended precision results on x86 for 100,000 values. The limited exponent dataset showed no differences. For the dataset with full exponent range, the current and new values showed major differences (greater than 32 ulp) in 567 cases out of 100,000 (0.56%). In every one of these cases, the ratio of c/d or d/c (as appropriate) was zero or subnormal, indicating the advantage of the new method and its continued correctness where needed. PERFORMANCE Test results In order for a library change to be practical, it is necessary to show the slowdown is tolerable. The slowdowns observed are much less than would be seen by (for example) switching from hardware double precison to a software quad precision, which on the tested machines causes a slowdown of around 100x). The actual slowdown depends on the machine architecture. It also depends on the nature of the input data. If underflow/overflow is rare, then implementations that have strong branch prediction will only slowdown by a few cycles. If underflow/overflow is common, then the branch predictors will be less successful and the cost will be higher. Results from two machines are presented as examples of the overhead for the new method. The one labeled x86 is a 5 year old Intel x86 processor and the one labeled aarch64 is a 3 year old arm64 processor. In the following chart, the times are averaged over a one million value data set. All values are scaled to set the time of current method to be 1.0. Lower values are better. A value of less than 1.0 would be faster than the current method and a value greater than 1.0 would be slower than the current method. ================================================ Moderate set full set x86 aarch64 x86 aarch64 ======== =============== =============== float 1.15 1.15 1.96 1.96 double 1.07 1.40 1.31 1.78 long double 1.08 1.23 1.08 1.24 ================================================ Table 5: Performance Comparisons (ratio new/current) The above tables omit the timing for the 1 divide and 2 multiply comparison with the 2 divide approach. I consider that optimization vs accuracy issue separate from the rest of this proposal. Roughly speaking the 1 divide and 2 multiply approach provides a performance gain of 10 to 15% at a cost of 1 or 2 ulp about 0.5% of the time for double precison. The loss of accurancy is somewhat greater for float than for double precision. For the proposed change, the moderate dataset shows less overhead for the newer methods than the full dataset. That's because the moderate dataset does not ever take the new branches which protect from under/overflow. The better the branch predictor, the lower the cost for these untaken branches. Both platforms are somewhat dated, with the x86 having a better branch predictor which reduces the cost of the additional branches in the new code. Of course, the relative slowdown may be greater for some architectures, especially those with limited branch prediction combined with a high cost of misprediction. This cost is claimed to be tolerable on the grounds that: (a) most applications will only spend a small fraction of their time calculating complex divide. (b) it is much less than the cost of extended precision (c) users are not forced to use it (as described below) (d) the cost is worthwhile considering the accuracy improvement shown in Table 2 and 4. Those users who find this degree of slowdown unsatisfactory may use the switch fcxfortranrules which does not use the library routine, instead inlining Smith's method without the C99 requirement for dealing with NaN results. The proposed patch for libgcc complex divide does not affect the code generated by fcxfortranrules. As a side note, double precision is not slower than float for the aarch64 and x64 platforms tested. A future change might investigate the performance of promoting the float values of complex divide to double precision and using the current Smith's method. Using double precision for float values will avoid the round off effects and using the larger mantissa will give better answers in the case of catastrophic subtraction. Better accuracy with no loss of performance seems a win all around, but would require performance testing on a wide range of platforms to confirm the "no loss of performance" observed on the two implementions tested so far. SUMMARY When input data to complex divide has exponents whose absolute value is half than half of *_MAX_EXP, this patch makes no changes in accuracy and has only a modest effect on performance. When input data contains values outside those ranges, the patch eliminates more than 99.5% of major errors with a tolerable cost in performance. In comparison to Elen Kalda's method, this patch introduces more performance overhead but reduces overflow/underflow more than 100 times as often. There is an intermediate design choice which would catch 13x fewer errors while introducing roughly 20% lower overhead. That could be analyzed in depth if the overhead of the current proposal is unacceptable. REFERENCES [1] Robert L. Smith. Algorithm 116: Complex division. Commun. ACM, 5(8):435, 1962. [2] Michael Baudin and Robert L. Smith. "A robust complex division in Scilab," October 2012, available at http://arxiv.org/abs/1210.4539. [3] Elen Kalda: Complex division improvements in libgcc https://gcc.gnu.org/legacyml/gccpatches/201908/msg01629.html [4] Nelson H.F. Beebe, "The MathematicalFunction Computation Handbook. Springer International Publishing AG, 2017.  libgcc/ChangeLog  6 ++++ libgcc/libgcc2.c  105 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 107 insertions(+), 4 deletions() diff git a/libgcc/ChangeLog b/libgcc/ChangeLog index d6367e2..d1db004 100644  a/libgcc/ChangeLog +++ b/libgcc/ChangeLog @@ 1,3 +1,9 @@ +20200624 Patrick McGehearty <patrick.mcgehearty@oracle.com> + + * libgcc2.c (__divdc3): Enhance accuracy of complex divide by + avoiding underflow/overflow when input values have very large + or very small exponents. + 20200625 Martin Liska <mliska@suse.cz> * libgcovdriver.c (merge_summary): Remove function as its name diff git a/libgcc/libgcc2.c b/libgcc/libgcc2.c index e0a9fd7..feffb13 100644  a/libgcc/libgcc2.c +++ b/libgcc/libgcc2.c @@ 2036,13 +2036,35 @@ CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d) CTYPE CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d) { +#if defined(L_divsc3) +#define RBIG ((FLT_MAX)/2.0) +#define RMIN (FLT_MIN) +#define RMIN2 (0x1.0p21) +#define RMINSCAL (0x1.0p+19) +#endif + +#if defined(L_divdc3) +#define RBIG ((DBL_MAX)/2.0) +#define RMIN (DBL_MIN) +#define RMIN2 (0x1.0p53) +#define RMINSCAL (0x1.0p+51) +#endif + +#if (defined(L_divxc3)  defined(L_divtc3)) +#define RBIG ((LDBL_MAX)/2.0) +#define RMIN (LDBL_MIN) +#define RMIN2 (0x1.0p53) +#define RMINSCAL (0x1.0p+51) +#endif + MTYPE denom, ratio, x, y; CTYPE res;  /* ??? We can get better behavior from logarithmic scaling instead of  the division. But that would mean starting to link libgcc against  libm. We could implement something akin to ldexp/frexp as gcc builtins  fairly easily... */ +#if defined(L_divhc3) + /* half precision is handled with float precision. + no extra measures are needed to avoid overflow/underflow */ + + /* Scale by max(c,d) to reduce chances of denominator overflowing */ if (FABS (c) < FABS (d)) { ratio = c / d; @@ 2057,6 +2079,81 @@ CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d) x = ((b * ratio) + a) / denom; y = (b  (a * ratio)) / denom; } +#else + /* float, double, extended, long double have significant potential + underflow/overflow errors than can be greatly reduced with + a limited number of adjustments */ + + /* Scale by max(c,d) to reduce chances of denominator overflowing */ + if (FABS (c) < FABS (d)) + { + /* prevent underflow when denominator is near max representable */ + if (FABS (d) >= RBIG) + { + a = a * 0.5; + b = b * 0.5; + c = c * 0.5; + d = d * 0.5; + } + /* avoid overflow/underflow issues when c and d are small */ + /* scaling up helps avoid some underflows */ + /* No new overflow possible since c&d < RMIN2 */ + if (FABS (d) < RMIN2) + { + a = a * RMINSCAL; + b = b * RMINSCAL; + c = c * RMINSCAL; + d = d * RMINSCAL; + } + ratio = c / d; + denom = (c * ratio) + d; + /* choose alternate order of computation if ratio is subnormal */ + if (FABS (ratio) > RMIN) + { + x = ((a * ratio) + b) / denom; + y = ((b * ratio)  a) / denom; + } + else + { + x = ((c * (a / d)) + b) / denom; + y = ((c * (b / d))  a) / denom; + } + } + else + { + /* prevent underflow when denominator is near max representable */ + if (FABS(c) >= RBIG) + { + a = a * 0.5; + b = b * 0.5; + c = c * 0.5; + d = d * 0.5; + } + /* avoid overflow/underflow issues when both c and d are small */ + /* scaling up helps avoid some underflows */ + /* No new overflow possible since both c&d are less than RMIN2 */ + if (FABS(c) < RMIN2) + { + a = a * RMINSCAL; + b = b * RMINSCAL; + c = c * RMINSCAL; + d = d * RMINSCAL; + } + ratio = d / c; + denom = (d * ratio) + c; + /* choose alternate order of computation if ratio is subnormal */ + if (FABS(ratio) > RMIN) + { + x = ((b * ratio) + a) / denom; + y = (b  (a * ratio)) / denom; + } + else + { + x = (a + (d * (b / c))) / denom; + y = (b  (d * (a / c))) / denom; + } + } +#endif /* Recover infinities and zeros that computed as NaN+iNaN; the only cases are nonzero/zero, infinite/finite, and finite/infinite. */