[V3] Practical Improvement to libgcc Complex Divide

Message ID 1593621012-15058-1-git-send-email-patrick.mcgehearty@oracle.com
State New
Headers show
Series
  • [V3] Practical Improvement to libgcc Complex Divide
Related show

Commit Message

Jakub Jelinek via Gcc-patches July 1, 2020, 4:30 p.m.
(Version 3)

(Added in version 3)
Support for half, float, extended, and long double precision has
been added to the prior work for double precision. Since half precision
is computed with float precision as per current libgcc practice,
the enhanced underflow/overflow tests provide no benefit for half
precision and would cost performance. Therefore half precision is
left unchanged.

The existing constants for each precision:
float: FLT_MAX, FLT_MIN;
double: DBL_MAX, DBL_MIN;
extended and/or long double: LDBL_MAX, LDBL_MIN
are used for avoiding the more common overflow/underflow cases.

Additional investigation showed that testing for when both parts of
the denominator had exponents roughly small enough to allow shifting
any subnormal values to normal values, all input values could be
scaled up without risking unnecessary overflow and gaining a clear
improvement in accuracy. The test and scaling values used all fit
within the allowed exponent range for each precision required by the C
standard. The remaining number of troubling results in version 3 is
measurably smaller than in versions 1 and 2.

The timing and precision tables below have been revised appropriately
to match the algorithms used in this version for double precision
and additional tables added to include results for other precisions.

In prior versions, I omitted mention of the bug report that started me
on this project: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59714
complex division is surprising on targets with FMA (was: on aarch64)
With the proposed method, whether using FMA or not, dividing
1.0+3.0i by 1.0+3.0i correctly returns 1.0+0.0i.

I also have added a reference to Beebe's "The Mathematical Function
Computation Handbook" [4] which was my starting point for research
into better complex divide methods.

(Added for Version 2)
In my initial research, I missed Elen Kalda's proposed patch
https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg01629.html [3]
Thanks to Joseph Myers for providing me with the pointer.
This version includes performance and accuracy comparisions
between Elen's proposed patch and my latest patch version for
double precision.

(from earlier Versions)

The following patch to libgcc/libgcc2.c __divdc3 provides an
opportunity to gain important improvements to the quality of answers
for the default complex divide routine (half, float, double, extended,
long double precisions) when dealing with very large or very small exponents.

The current code correctly implements Smith's method (1962) [1]
further modified by c99's requirements for dealing with NaN (not a
number) results. When working with input values where the exponents
are greater than *_MAX_EXP/2 or less than -(*_MAX_EXP)/2, results are
substantially different from the answers provided by quad precision
more than 1% of the time. This error rate may be unacceptable for many
applications that cannot a priori restrict their computations to the
safe range. The proposed method reduces the frequency of
"substantially different" answers by more than 99% for double
precision at a modest cost of performance.

Differences between current gcc methods and the new method will be
described. Then accuracy and performance differences will be discussed.

NOTATION

For all of the following, the notation is:
Input complex values:
  a+bi  (a= real part, b= imaginary part)
  c+di
Output complex value:
  e+fi = (a+bi)/(c+di)

For the result tables:
current = current method (SMITH)
b1div = method proposed by Elen Kalda
b2div = alternate method considered by Elen Kalda
new1 = new method using 1 divide and 2 multiplies
new = new method proposed by this patch

DESCRIPTIONS of different complex divide methods:

NAIVE COMPUTATION (-fcx-limited-range):
  e = (a*c + b*d)/(c*c + d*d)
  f = (b*c - a*d)/(c*c + d*d)

Note that c*c and d*d will overflow or underflow if either
c or d is outside the range 2^-538 to 2^512.

This method is available in gcc when the switch -fcx-limited-range is
used. That switch is also enabled by -ffast-math. Only one who has a
clear understanding of the maximum range of intermediate values
generated by a computation should consider using this switch.

SMITH's METHOD (current libgcc):
  if(fabs(c)<fabs(d) {
    r = c/d;
    denom = (c*r) + d;
    e = (a*r + b) / denom;
    f = (b*r - a) / denom;
  } else {
    r = d/c;
    denom = c + (d*r);
    e = (a + b*r) / denom;
    f = (b - a*r) / denom;
  }

Smith's method is the current default method available with __divdc3.

Elen Kalda's METHOD

This method applies the most significant part of the algorithm
proposed by Baudin&Smith (2012) in the paper "A Robust Complex
Division in Scilab" [2]. Elen's method also replaces two divides
by one divide and two multiplies due to the high cost of divide
on aarch64. In the comparison sections, this method will be
labeled b1div. A variation discussed in that patch which
does not replace the two divides will be labeled b2div.

  inline void improved_internal (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
  {
    r = d/c;
    t = 1.0 / (c + (d * r));
    if (r != 0) {
        x = (a + (b * r)) * t;
        y = (b - (a * r)) * t;
    }  else {
    /* Changing the order of operations avoids the underflow of r impacting
     the result. */
        x = (a + (d * (b / c))) * t;
        y = (b - (d * (a / c))) * t;
    }
  }

  if (FABS (d) < FABS (c)) {
      improved_internal (a, b, c, d);
  } else {
      improved_internal (b, a, d, c);
      y = -y;
  }

NEW METHOD (proposed by patch) to replace the current default method:

The proposed method starts with an algorithm proposed by Baudin&Smith
(2012) in the paper "A Robust Complex Division in Scilab" [2]. The
patch makes additional modifications to that method for further
reductions in the error rate. The following code shows the #define
values for double precision. See the patch for #define values used
for other precisions.

  #define RBIG ((DBL_MAX)/2.0)
  #define RMIN (DBL_MIN)
  #define RMIN2 (0x1.0p-53)
  #define RMINSCAL (0x1.0p+51)

  /* prevent overflow when arguments are near max representable */
  if ((FABS (d) > RBIG) || (FABS (a) > RBIG) || (FABS (b) > RBIG) ) {
      a = a * 0.5;
      b = b * 0.5;
      c = c * 0.5;
      d = d * 0.5;
  }
  /* minimize overflow/underflow issues when c and d are small */
  else if (FABS (d) < RMIN2) {
      a = a * RMINSCAL;
      b = b * RMINSCAL;
      c = c * RMINSCAL;
      d = d * RMINSCAL;
  }
  r = c/d; denom = (c*r) + d;
  if( r > RMIN ) {
      e = (a*r + b) / denom   ;
      f = (b*r - a) / denom
  } else {
      e = (c * (a/d) + b) / denom;
      f = (c * (b/d) - a) / denom;
  }
[ only presenting the fabs(c) < fabs(d) case here, full code in patch. ]

Before any computation of the answer, the code checks for any input
values near maximum to allow down scaling to avoid overflow.  These
scalings almost never harm the accuracy since they are by 2. Values that
are over RBIG are relatively rare but it is easy to test for them and
allow aviodance of overflows.

Testing for RMIN2 reveals when both c and d are less than 2^-53 (for
double precision, see patch for other values).  By scaling all values
by 2^51, the code avoids many underflows in intermediate computations
that otherwise might occur. If scaling a and b by 2^51 causes either
to overflow, then the computation will overflow whatever method is
used.

Next, r (the ratio of c to d) is checked for being near zero. Baudin
and Smith checked r for zero. Checking for values less than DBL_MIN
(subnormal) covers roughly 10 times as many cases and improves overall
accuracy. If r is too small, then when it is used in a multiplication,
there is a high chance that the result will underflow to zero, losing
significant accuracy. That underflow is avoided by reordering the
computation.  When r is subnormal, the code replaces a*r (= a*(c/d))
with ((a/d)*c) which is mathematically the same but avoids the
unnecessary underflow.

TEST Data

Two sets of data are presented to test these methods. Both sets
contain 10 million pairs of complex values.  The exponents and
mantissas are generated using multiple calls to random() and then
combining the results. Only values which give results to complex
divide that are representable in the appropriate precision after
being computed in quad precision are used.

The first data set is labeled "moderate exponents".
are greater than *_MAX_EXP/2 or less than -(*_MAX_EXP)/2, results are
The exponent range is limited to -DBL_MAX_EXP/2 to DBL_MAX_EXP/2
or FLT_MAX_EXP or LDBL_MAX_EXP for the appropriate precisions.
The second data set is labeled "full exponents".
The exponent range for these cases is the full exponent range
for the precision.

ACCURACY Test results:

Note: The following accuracy tests are based on IEEE-754 arithmetic.

Note: All results are based on use of fused multiply-add. If
fused multiply-add is not used, the error rate increases slightly
for the 1 and 2 ulp cases for both current and new complex divide.
Differences at 8 ulp are barely measurable.

The complex divide methods are evaluated by determining what
percentage of values exceed different ulp (units in last place)
levels.  If a "2 ulp" test results show 1%, that would mean that 1% of
10,000,000 values (100,000) have either a real or imaginary part that
had a greater than or equal to a 2 bit difference from the quad
precision result.

Results are reported for differences greater than or equal to 1 ulp, 2
ulp, 8 ulp, 16 ulp, 24 ulp, and 52 ulp for double precision.  Even
when the patch avoids overflows and underflows, some input values are
expected to have errors due to the potential for catastrophic roundoff
from floating point subtraction. For example, when b*c and a*d are
nearly equal, the result of subtraction may lose several places of
accuracy. This patch does not attempt to detect or minimize this type
of error, but neither does it increase them.

I only show the results for Elen Kalda's method (with both 1 and
2 divides) and the new method for only 1 divide in the double
precision table.

In the following charts, lower values are better.

current - current complex divide in libgcc
b1div - Elen Kalda's method from Baudin & Smith with one divide
b2div - Elen Kalda's method from Baudin & Smith with two divides
new1  - This patch except with 1 divide and 2 multiplies
new   - This patch which uses 2 divides

-- 
1.8.3.1

Comments

Jakub Jelinek via Gcc-patches July 21, 2020, 5:19 p.m. | #1
Ping



On 7/1/2020 11:30 AM, Patrick McGehearty via Gcc-patches wrote:
> (Version 3)

>

> (Added in version 3)

> Support for half, float, extended, and long double precision has

> been added to the prior work for double precision. Since half precision

> is computed with float precision as per current libgcc practice,

> the enhanced underflow/overflow tests provide no benefit for half

> precision and would cost performance. Therefore half precision is

> left unchanged.

>

> The existing constants for each precision:

> float: FLT_MAX, FLT_MIN;

> double: DBL_MAX, DBL_MIN;

> extended and/or long double: LDBL_MAX, LDBL_MIN

> are used for avoiding the more common overflow/underflow cases.

>

> Additional investigation showed that testing for when both parts of

> the denominator had exponents roughly small enough to allow shifting

> any subnormal values to normal values, all input values could be

> scaled up without risking unnecessary overflow and gaining a clear

> improvement in accuracy. The test and scaling values used all fit

> within the allowed exponent range for each precision required by the C

> standard. The remaining number of troubling results in version 3 is

> measurably smaller than in versions 1 and 2.

>

> The timing and precision tables below have been revised appropriately

> to match the algorithms used in this version for double precision

> and additional tables added to include results for other precisions.

>

> In prior versions, I omitted mention of the bug report that started me

> on this project: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59714

> complex division is surprising on targets with FMA (was: on aarch64)

> With the proposed method, whether using FMA or not, dividing

> 1.0+3.0i by 1.0+3.0i correctly returns 1.0+0.0i.

>

> I also have added a reference to Beebe's "The Mathematical Function

> Computation Handbook" [4] which was my starting point for research

> into better complex divide methods.

>

> (Added for Version 2)

> In my initial research, I missed Elen Kalda's proposed patch

> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg01629.html [3]

> Thanks to Joseph Myers for providing me with the pointer.

> This version includes performance and accuracy comparisions

> between Elen's proposed patch and my latest patch version for

> double precision.

>

> (from earlier Versions)

>

> The following patch to libgcc/libgcc2.c __divdc3 provides an

> opportunity to gain important improvements to the quality of answers

> for the default complex divide routine (half, float, double, extended,

> long double precisions) when dealing with very large or very small exponents.

>

> The current code correctly implements Smith's method (1962) [1]

> further modified by c99's requirements for dealing with NaN (not a

> number) results. When working with input values where the exponents

> are greater than *_MAX_EXP/2 or less than -(*_MAX_EXP)/2, results are

> substantially different from the answers provided by quad precision

> more than 1% of the time. This error rate may be unacceptable for many

> applications that cannot a priori restrict their computations to the

> safe range. The proposed method reduces the frequency of

> "substantially different" answers by more than 99% for double

> precision at a modest cost of performance.

>

> Differences between current gcc methods and the new method will be

> described. Then accuracy and performance differences will be discussed.

>

> NOTATION

>

> For all of the following, the notation is:

> Input complex values:

>    a+bi  (a= real part, b= imaginary part)

>    c+di

> Output complex value:

>    e+fi = (a+bi)/(c+di)

>

> For the result tables:

> current = current method (SMITH)

> b1div = method proposed by Elen Kalda

> b2div = alternate method considered by Elen Kalda

> new1 = new method using 1 divide and 2 multiplies

> new = new method proposed by this patch

>

> DESCRIPTIONS of different complex divide methods:

>

> NAIVE COMPUTATION (-fcx-limited-range):

>    e = (a*c + b*d)/(c*c + d*d)

>    f = (b*c - a*d)/(c*c + d*d)

>

> Note that c*c and d*d will overflow or underflow if either

> c or d is outside the range 2^-538 to 2^512.

>

> This method is available in gcc when the switch -fcx-limited-range is

> used. That switch is also enabled by -ffast-math. Only one who has a

> clear understanding of the maximum range of intermediate values

> generated by a computation should consider using this switch.

>

> SMITH's METHOD (current libgcc):

>    if(fabs(c)<fabs(d) {

>      r = c/d;

>      denom = (c*r) + d;

>      e = (a*r + b) / denom;

>      f = (b*r - a) / denom;

>    } else {

>      r = d/c;

>      denom = c + (d*r);

>      e = (a + b*r) / denom;

>      f = (b - a*r) / denom;

>    }

>

> Smith's method is the current default method available with __divdc3.

>

> Elen Kalda's METHOD

>

> This method applies the most significant part of the algorithm

> proposed by Baudin&Smith (2012) in the paper "A Robust Complex

> Division in Scilab" [2]. Elen's method also replaces two divides

> by one divide and two multiplies due to the high cost of divide

> on aarch64. In the comparison sections, this method will be

> labeled b1div. A variation discussed in that patch which

> does not replace the two divides will be labeled b2div.

>

>    inline void improved_internal (MTYPE a, MTYPE b, MTYPE c, MTYPE d)

>    {

>      r = d/c;

>      t = 1.0 / (c + (d * r));

>      if (r != 0) {

>          x = (a + (b * r)) * t;

>          y = (b - (a * r)) * t;

>      }  else {

>      /* Changing the order of operations avoids the underflow of r impacting

>       the result. */

>          x = (a + (d * (b / c))) * t;

>          y = (b - (d * (a / c))) * t;

>      }

>    }

>

>    if (FABS (d) < FABS (c)) {

>        improved_internal (a, b, c, d);

>    } else {

>        improved_internal (b, a, d, c);

>        y = -y;

>    }

>

> NEW METHOD (proposed by patch) to replace the current default method:

>

> The proposed method starts with an algorithm proposed by Baudin&Smith

> (2012) in the paper "A Robust Complex Division in Scilab" [2]. The

> patch makes additional modifications to that method for further

> reductions in the error rate. The following code shows the #define

> values for double precision. See the patch for #define values used

> for other precisions.

>

>    #define RBIG ((DBL_MAX)/2.0)

>    #define RMIN (DBL_MIN)

>    #define RMIN2 (0x1.0p-53)

>    #define RMINSCAL (0x1.0p+51)

>

>    /* prevent overflow when arguments are near max representable */

>    if ((FABS (d) > RBIG) || (FABS (a) > RBIG) || (FABS (b) > RBIG) ) {

>        a = a * 0.5;

>        b = b * 0.5;

>        c = c * 0.5;

>        d = d * 0.5;

>    }

>    /* minimize overflow/underflow issues when c and d are small */

>    else if (FABS (d) < RMIN2) {

>        a = a * RMINSCAL;

>        b = b * RMINSCAL;

>        c = c * RMINSCAL;

>        d = d * RMINSCAL;

>    }

>    r = c/d; denom = (c*r) + d;

>    if( r > RMIN ) {

>        e = (a*r + b) / denom   ;

>        f = (b*r - a) / denom

>    } else {

>        e = (c * (a/d) + b) / denom;

>        f = (c * (b/d) - a) / denom;

>    }

> [ only presenting the fabs(c) < fabs(d) case here, full code in patch. ]

>

> Before any computation of the answer, the code checks for any input

> values near maximum to allow down scaling to avoid overflow.  These

> scalings almost never harm the accuracy since they are by 2. Values that

> are over RBIG are relatively rare but it is easy to test for them and

> allow aviodance of overflows.

>

> Testing for RMIN2 reveals when both c and d are less than 2^-53 (for

> double precision, see patch for other values).  By scaling all values

> by 2^51, the code avoids many underflows in intermediate computations

> that otherwise might occur. If scaling a and b by 2^51 causes either

> to overflow, then the computation will overflow whatever method is

> used.

>

> Next, r (the ratio of c to d) is checked for being near zero. Baudin

> and Smith checked r for zero. Checking for values less than DBL_MIN

> (subnormal) covers roughly 10 times as many cases and improves overall

> accuracy. If r is too small, then when it is used in a multiplication,

> there is a high chance that the result will underflow to zero, losing

> significant accuracy. That underflow is avoided by reordering the

> computation.  When r is subnormal, the code replaces a*r (= a*(c/d))

> with ((a/d)*c) which is mathematically the same but avoids the

> unnecessary underflow.

>

> TEST Data

>

> Two sets of data are presented to test these methods. Both sets

> contain 10 million pairs of complex values.  The exponents and

> mantissas are generated using multiple calls to random() and then

> combining the results. Only values which give results to complex

> divide that are representable in the appropriate precision after

> being computed in quad precision are used.

>

> The first data set is labeled "moderate exponents".

> are greater than *_MAX_EXP/2 or less than -(*_MAX_EXP)/2, results are

> The exponent range is limited to -DBL_MAX_EXP/2 to DBL_MAX_EXP/2

> or FLT_MAX_EXP or LDBL_MAX_EXP for the appropriate precisions.

> The second data set is labeled "full exponents".

> The exponent range for these cases is the full exponent range

> for the precision.

>

> ACCURACY Test results:

>

> Note: The following accuracy tests are based on IEEE-754 arithmetic.

>

> Note: All results are based on use of fused multiply-add. If

> fused multiply-add is not used, the error rate increases slightly

> for the 1 and 2 ulp cases for both current and new complex divide.

> Differences at 8 ulp are barely measurable.

>

> The complex divide methods are evaluated by determining what

> percentage of values exceed different ulp (units in last place)

> levels.  If a "2 ulp" test results show 1%, that would mean that 1% of

> 10,000,000 values (100,000) have either a real or imaginary part that

> had a greater than or equal to a 2 bit difference from the quad

> precision result.

>

> Results are reported for differences greater than or equal to 1 ulp, 2

> ulp, 8 ulp, 16 ulp, 24 ulp, and 52 ulp for double precision.  Even

> when the patch avoids overflows and underflows, some input values are

> expected to have errors due to the potential for catastrophic roundoff

> from floating point subtraction. For example, when b*c and a*d are

> nearly equal, the result of subtraction may lose several places of

> accuracy. This patch does not attempt to detect or minimize this type

> of error, but neither does it increase them.

>

> I only show the results for Elen Kalda's method (with both 1 and

> 2 divides) and the new method for only 1 divide in the double

> precision table.

>

> In the following charts, lower values are better.

>

> current - current complex divide in libgcc

> b1div - Elen Kalda's method from Baudin & Smith with one divide

> b2div - Elen Kalda's method from Baudin & Smith with two divides

> new1  - This patch except with 1 divide and 2 multiplies

> new   - This patch which uses 2 divides

>

> ===============================================================

> Errors   Moderate Dataset

> gtr eq     current    b1div      b2div        new1       new

> ======    ========   ========   ========   ========   ========

>   1 ulp    0.24707%   0.92986%   0.24707%   0.92986%   0.24707%

>   2 ulp    0.01762%   0.01770%   0.01762%   0.01770%   0.01762%

>   8 ulp    0.00026%   0.00026%   0.00026%   0.00026%   0.00026%

> 16 ulp    0.00000%   0.00000%   0.00000%   0.00000%   0.00000%

> 24 ulp          0%         0%         0%         0%         0%

> 52 ulp          0%         0%         0%         0%         0%

> ===============================================================

> Table 1: Errors with Moderate Dataset

>

> Note in Table 1 that both the old and new methods give identical error

> rates for data with moderate exponents. Errors exceeding 16 ulp are

> exceedingly rare. There are substantial increases in the 1 ulp error

> rates for the 1 divide/2 multiply methods as compared to the 2 divides

> methods. These differences are minimal for 2 ulp and larger error

> measurements. I chose to use the more accurate method of two divides

> to avoid loss of accuracy relative to current behavior where current

> behavior is not at risk of under/overflow.

>

> ===============================================================

> Errors   Full Dataset

> gtr eq     current    b1div      b2div        new1       new

> ======    ========   ========   ========   ========   ========

>   1 ulp      2.05%   1.23842%    0.67130%   0.71774%   0.17100%

>   2 ulp      1.88%   0.51615%    0.50354%   0.02052%   0.01319%

>   8 ulp      1.79%   0.43227%    0.42531%   0.00358%   0.00350%

> 16 ulp      1.69%   0.34503%    0.33542%   0.00212%   0.00212%

> 24 ulp      1.59%   0.26367%    0.25189%   0.00138%   0.00138%

> 52 ulp      1.28%   0.01914%    0.00378%   0.00001%   0.00001%

> ===============================================================

> Table 2: Errors with Full Dataset (double)

>

> Table 2 shows significant differences in error rates, especially

> between the current method and the new method. With the current code,

> 1.13% of test values have results where no bits of the mantissa match

> the correct answer. That level of error is extremely rare with the new

> code. Consider the results for errors greater than or equal to 24 ulp.

> The current code sees those errors at a rate of 1.6%. Most would agree

> that such answers are "substantially different" from the answers

> calculated using extended precision. The b1div and b2div still shows

> errors about about one-sixth of that rate. The new code reduces errors

> at that level by more than a factor of 1000. These improvements are

> important for serious computation of complex numbers.

>

> Correctness for float

> =============================

> Errors   Moderate Dataset

> gtr eq     current     new

> ======    ========   ========

>   1 ulp   28.68070%  28.68070%

>   2 ulp    0.64386%   0.64386%

>   8 ulp    0.00403%   0.00403%

> 16 ulp    0.00001%   0.00001%

> 24 ulp          0%         0%

> =============================

> Table 3: Errors with Moderate Dataset (float)

>

> We see again in Table 3 as in Table 1, there is no change in accuracy

> between the current and new methods when the input data does not

> contain very large or very small exponents with potential for

> under/overflow.

>

> =============================

> Errors   Full Dataset

> gtr eq     current     new

> ======    ========   ========

>   1 ulp     19.97%  17.85193%

>   2 ulp      3.20%   0.43153%

>   8 ulp      2.16%   0.04139%

> 16 ulp      1.40%   0.00818%  (99.4% better)

> 24 ulp      0.98%   0.00016%

> =============================

> Table 4: Errors with Full Dataset (float)

>

> In Table 4, we see as in Table 2, when extreme exponents are encountered,

> the new method provides a substantial reduction in unacceptable answers.

> The improvements are not quite as much as for double precision due

> to the inherent limitations of 32-bit floats.

>

> There is no change in accuracy for half-precision since the code is

> unchanged. libgcc computes half-precision functions in float precision

> allowing the existing methods to avoid overflow/underflow issues

> for the allowed range of exponents for half-precision.

>

> Extended precision (using x87 80-bit format on x86) and Long double

> (using IEEE-754 128-bit on x86 and aarch64) both have 15-bit exponents

> as compared to 11-bit exponents in double precision. We note that the

> C standard also allows Long Double to be implemented in the equivalent

> range of Double. The RMIN2 and RMINSCAL constants are selected to work

> within the Double range as well as with extended and 128-bit ranges.

> We will limit our performance and accurancy discussions to the 80-bit

> and 128-bit formats as seen on x86 here.

>

> The extended and long double precision investigations were more

> limited.  Aarch64 does not supported extended precision but does

> support the software implementation of 128-bit long double

> precision. For x86, long double defaults to the 80-bit precision but

> using the -mlong-double-128 flag switches to using the software

> implementation of 128-bit precision. Both 80-bit and 128-bit

> precisions have the same exponent range, with the 128-bit precision

> has extended mantissas. Since this change is only aimed at avoiding

> underflow/overflow for extreme exponents, I studied the extended

> precision results on x86 for 100,000 values. The limited exponent

> dataset showed no differences.  For the dataset with full exponent

> range, the current and new values showed major differences (greater

> than 32 ulp) in 567 cases out of 100,000 (0.56%). In every one of

> these cases, the ratio of c/d or d/c (as appropriate) was zero or

> subnormal, indicating the advantage of the new method and its

> continued correctness where needed.

>

> PERFORMANCE Test results

>

> In order for a library change to be practical, it is necessary to show

> the slowdown is tolerable. The slowdowns observed are much less than

> would be seen by (for example) switching from hardware double precison

> to a software quad precision, which on the tested machines causes a

> slowdown of around 100x).

>

> The actual slowdown depends on the machine architecture. It also

> depends on the nature of the input data. If underflow/overflow is

> rare, then implementations that have strong branch prediction will

> only slowdown by a few cycles. If underflow/overflow is common, then

> the branch predictors will be less successful and the cost will be

> higher.

>

> Results from two machines are presented as examples of the overhead

> for the new method. The one labeled x86 is a 5 year old Intel x86

> processor and the one labeled aarch64 is a 3 year old arm64 processor.

>

> In the following chart, the times are averaged over a one million

> value data set. All values are scaled to set the time of current

> method to be 1.0. Lower values are better. A value of less than 1.0 would

> be faster than the current method and a value greater than 1.0 would

> be slower than the current method.

>

> ================================================

>                 Moderate set          full set

>                 x86  aarch64        x86  aarch64

> ========     ===============     ===============

> float         1.15    1.15        1.96    1.96

> double        1.07    1.40        1.31    1.78

> long double   1.08    1.23        1.08    1.24

> ================================================

> Table 5: Performance Comparisons (ratio new/current)

>

> The above tables omit the timing for the 1 divide and 2 multiply

> comparison with the 2 divide approach. I consider that optimization

> vs accuracy issue separate from the rest of this proposal.

> Roughly speaking the 1 divide and 2 multiply approach provides

> a performance gain of 10 to 15% at a cost of 1 or 2 ulp about 0.5%

> of the time for double precison. The loss of accurancy is somewhat

> greater for float than for double precision.

>

> For the proposed change, the moderate dataset shows less overhead for

> the newer methods than the full dataset. That's because the moderate

> dataset does not ever take the new branches which protect from

> under/overflow. The better the branch predictor, the lower the cost

> for these untaken branches. Both platforms are somewhat dated, with

> the x86 having a better branch predictor which reduces the cost of the

> additional branches in the new code. Of course, the relative slowdown

> may be greater for some architectures, especially those with limited

> branch prediction combined with a high cost of misprediction.

>

> This cost is claimed to be tolerable on the grounds that:

> (a) most applications will only spend a small fraction of their time

>      calculating complex divide.

> (b) it is much less than the cost of extended precision

> (c) users are not forced to use it (as described below)

> (d) the cost is worthwhile considering the accuracy improvement shown

>      in Table 2 and 4.

>

> Those users who find this degree of slowdown unsatisfactory may use

> the switch -fcx-fortran-rules which does not use the library routine,

> instead inlining Smith's method without the C99 requirement for

> dealing with NaN results. The proposed patch for libgcc complex divide

> does not affect the code generated by -fcx-fortran-rules.

>

> As a side note, double precision is not slower than float for the

> aarch64 and x64 platforms tested. A future change might investigate

> the performance of promoting the float values of complex divide to

> double precision and using the current Smith's method. Using double

> precision for float values will avoid the round off effects and using

> the larger mantissa will give better answers in the case of

> catastrophic subtraction. Better accuracy with no loss of performance

> seems a win all around, but would require performance testing on a

> wide range of platforms to confirm the "no loss of performance"

> observed on the two implementions tested so far.

>

> SUMMARY

>

> When input data to complex divide has exponents whose absolute value

> is half than half of *_MAX_EXP, this patch makes no changes in

> accuracy and has only a modest effect on performance.  When input data

> contains values outside those ranges, the patch eliminates more than

> 99.5% of major errors with a tolerable cost in performance.

>

> In comparison to Elen Kalda's method, this patch introduces more

> performance overhead but reduces overflow/underflow more than 100

> times as often.

>

> There is an intermediate design choice which would catch 13x fewer

> errors while introducing roughly 20% lower overhead. That could be

> analyzed in depth if the overhead of the current proposal is

> unacceptable.

>

> REFERENCES

>

> [1] Robert L. Smith. Algorithm 116: Complex division.  Commun. ACM,

>   5(8):435, 1962.

>

> [2] Michael Baudin and Robert L. Smith. "A robust complex division in

> Scilab," October 2012, available at http://arxiv.org/abs/1210.4539.

>

> [3] Elen Kalda: Complex division improvements in libgcc

> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg01629.html

>

> [4] Nelson H.F. Beebe, "The Mathematical-Function Computation Handbook.

> Springer International Publishing AG, 2017.

> ---

>   libgcc/ChangeLog |   6 ++++

>   libgcc/libgcc2.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++---

>   2 files changed, 107 insertions(+), 4 deletions(-)

>

> diff --git a/libgcc/ChangeLog b/libgcc/ChangeLog

> index d6367e2..d1db004 100644

> --- a/libgcc/ChangeLog

> +++ b/libgcc/ChangeLog

> @@ -1,3 +1,9 @@

> +2020-06-24  Patrick McGehearty <patrick.mcgehearty@oracle.com>

> +

> +	* libgcc2.c (__divdc3): Enhance accuracy of complex divide by

> +	avoiding underflow/overflow when input values have very large

> +	or very small exponents.

> +

>   2020-06-25  Martin Liska  <mliska@suse.cz>

>   

>   	* libgcov-driver.c (merge_summary): Remove function as its name

> diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c

> index e0a9fd7..feffb13 100644

> --- a/libgcc/libgcc2.c

> +++ b/libgcc/libgcc2.c

> @@ -2036,13 +2036,35 @@ CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)

>   CTYPE

>   CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)

>   {

> +#if defined(L_divsc3)

> +#define RBIG	((FLT_MAX)/2.0)

> +#define RMIN	(FLT_MIN)

> +#define RMIN2	(0x1.0p-21)

> +#define RMINSCAL (0x1.0p+19)

> +#endif

> +

> +#if defined(L_divdc3)

> +#define RBIG	((DBL_MAX)/2.0)

> +#define RMIN	(DBL_MIN)

> +#define RMIN2	(0x1.0p-53)

> +#define RMINSCAL (0x1.0p+51)

> +#endif

> +

> +#if (defined(L_divxc3) || defined(L_divtc3))

> +#define RBIG	((LDBL_MAX)/2.0)

> +#define RMIN	(LDBL_MIN)

> +#define RMIN2	(0x1.0p-53)

> +#define RMINSCAL (0x1.0p+51)

> +#endif

> +

>     MTYPE denom, ratio, x, y;

>     CTYPE res;

>   

> -  /* ??? We can get better behavior from logarithmic scaling instead of

> -     the division.  But that would mean starting to link libgcc against

> -     libm.  We could implement something akin to ldexp/frexp as gcc builtins

> -     fairly easily...  */

> +#if defined(L_divhc3)

> +  /* half precision is handled with float precision.

> +     no extra measures are needed to avoid overflow/underflow */

> +

> +  /* Scale by max(c,d) to reduce chances of denominator overflowing */

>     if (FABS (c) < FABS (d))

>       {

>         ratio = c / d;

> @@ -2057,6 +2079,81 @@ CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)

>         x = ((b * ratio) + a) / denom;

>         y = (b - (a * ratio)) / denom;

>       }

> +#else

> +  /* float, double, extended, long double have significant potential

> +    underflow/overflow errors than can be greatly reduced with

> +    a limited number of adjustments  */

> +

> +  /* Scale by max(c,d) to reduce chances of denominator overflowing */

> +  if (FABS (c) < FABS (d))

> +    {

> +      /* prevent underflow when denominator is near max representable */

> +      if (FABS (d) >= RBIG)

> +	{

> +	  a = a * 0.5;

> +	  b = b * 0.5;

> +	  c = c * 0.5;

> +	  d = d * 0.5;

> +	}

> +      /* avoid overflow/underflow issues when c and d are small */

> +      /* scaling up helps avoid some underflows */

> +      /* No new overflow possible since c&d < RMIN2 */

> +      if (FABS (d) < RMIN2)

> +	{

> +	  a = a * RMINSCAL;

> +	  b = b * RMINSCAL;

> +	  c = c * RMINSCAL;

> +	  d = d * RMINSCAL;

> +	}

> +      ratio = c / d;

> +      denom = (c * ratio) + d;

> +      /* choose alternate order of computation if ratio is subnormal */

> +      if (FABS (ratio) > RMIN)

> +	{

> +	  x = ((a * ratio) + b) / denom;

> +	  y = ((b * ratio) - a) / denom;

> +	}

> +      else

> +	{

> +	  x = ((c * (a / d)) + b) / denom;

> +	  y = ((c * (b / d)) - a) / denom;

> +	}

> +    }

> +  else

> +    {

> +      /* prevent underflow when denominator is near max representable */

> +      if (FABS(c) >= RBIG)

> +	{

> +	  a = a * 0.5;

> +	  b = b * 0.5;

> +	  c = c * 0.5;

> +	  d = d * 0.5;

> +	}

> +      /* avoid overflow/underflow issues when both c and d are small */

> +      /* scaling up helps avoid some underflows */

> +      /* No new overflow possible since both c&d are less than RMIN2 */

> +      if (FABS(c) < RMIN2)

> +	{

> +	  a = a * RMINSCAL;

> +	  b = b * RMINSCAL;

> +	  c = c * RMINSCAL;

> +	  d = d * RMINSCAL;

> +	}

> +      ratio = d / c;

> +      denom = (d * ratio) + c;

> +      /* choose alternate order of computation if ratio is subnormal */

> +      if (FABS(ratio) > RMIN)

> +	{

> +	  x = ((b * ratio) + a) / denom;

> +	  y = (b - (a * ratio)) / denom;

> +	}

> +      else

> +	{

> +	  x = (a + (d * (b / c))) / denom;

> +	  y = (b - (d * (a / c))) / denom;

> +	}

> +    }

> +#endif

>   

>     /* Recover infinities and zeros that computed as NaN+iNaN; the only cases

>        are nonzero/zero, infinite/finite, and finite/infinite.  */
Jakub Jelinek via Gcc-patches Aug. 11, 2020, 9:06 p.m. | #2
2nd ping.

Any estimate on when a reviewer might get to this improvement
in the accuracy of Complex Divide?

I'm happy to supply more info on what testing I've done
and details about design decisions. I'd prefer to do that
sooner than later as who knows when corporate priority decisions
might prevent me from having time for rapid response.

- Patrick McGehearty (patrick.mcgehearty@oracle.com)


On 7/21/2020 12:19 PM, Patrick McGehearty via Gcc-patches wrote:
> Ping

>

>

>

> On 7/1/2020 11:30 AM, Patrick McGehearty via Gcc-patches wrote:

>> (Version 3)

>>

>> (Added in version 3)

>> Support for half, float, extended, and long double precision has

>> been added to the prior work for double precision. Since half precision

>> is computed with float precision as per current libgcc practice,

>> the enhanced underflow/overflow tests provide no benefit for half

>> precision and would cost performance. Therefore half precision is

>> left unchanged.

>>

>> The existing constants for each precision:

>> float: FLT_MAX, FLT_MIN;

>> double: DBL_MAX, DBL_MIN;

>> extended and/or long double: LDBL_MAX, LDBL_MIN

>> are used for avoiding the more common overflow/underflow cases.

>>

>> Additional investigation showed that testing for when both parts of

>> the denominator had exponents roughly small enough to allow shifting

>> any subnormal values to normal values, all input values could be

>> scaled up without risking unnecessary overflow and gaining a clear

>> improvement in accuracy. The test and scaling values used all fit

>> within the allowed exponent range for each precision required by the C

>> standard. The remaining number of troubling results in version 3 is

>> measurably smaller than in versions 1 and 2.

>>

>> The timing and precision tables below have been revised appropriately

>> to match the algorithms used in this version for double precision

>> and additional tables added to include results for other precisions.

>>

>> In prior versions, I omitted mention of the bug report that started me

>> on this project: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59714

>> complex division is surprising on targets with FMA (was: on aarch64)

>> With the proposed method, whether using FMA or not, dividing

>> 1.0+3.0i by 1.0+3.0i correctly returns 1.0+0.0i.

>>

>> I also have added a reference to Beebe's "The Mathematical Function

>> Computation Handbook" [4] which was my starting point for research

>> into better complex divide methods.

>>

>> (Added for Version 2)

>> In my initial research, I missed Elen Kalda's proposed patch

>> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg01629.html [3]

>> Thanks to Joseph Myers for providing me with the pointer.

>> This version includes performance and accuracy comparisions

>> between Elen's proposed patch and my latest patch version for

>> double precision.

>>

>> (from earlier Versions)

>>

>> The following patch to libgcc/libgcc2.c __divdc3 provides an

>> opportunity to gain important improvements to the quality of answers

>> for the default complex divide routine (half, float, double, extended,

>> long double precisions) when dealing with very large or very small 

>> exponents.

>>

>> The current code correctly implements Smith's method (1962) [1]

>> further modified by c99's requirements for dealing with NaN (not a

>> number) results. When working with input values where the exponents

>> are greater than *_MAX_EXP/2 or less than -(*_MAX_EXP)/2, results are

>> substantially different from the answers provided by quad precision

>> more than 1% of the time. This error rate may be unacceptable for many

>> applications that cannot a priori restrict their computations to the

>> safe range. The proposed method reduces the frequency of

>> "substantially different" answers by more than 99% for double

>> precision at a modest cost of performance.

>>

>> Differences between current gcc methods and the new method will be

>> described. Then accuracy and performance differences will be discussed.

>>

>> NOTATION

>>

>> For all of the following, the notation is:

>> Input complex values:

>>    a+bi  (a= real part, b= imaginary part)

>>    c+di

>> Output complex value:

>>    e+fi = (a+bi)/(c+di)

>>

>> For the result tables:

>> current = current method (SMITH)

>> b1div = method proposed by Elen Kalda

>> b2div = alternate method considered by Elen Kalda

>> new1 = new method using 1 divide and 2 multiplies

>> new = new method proposed by this patch

>>

>> DESCRIPTIONS of different complex divide methods:

>>

>> NAIVE COMPUTATION (-fcx-limited-range):

>>    e = (a*c + b*d)/(c*c + d*d)

>>    f = (b*c - a*d)/(c*c + d*d)

>>

>> Note that c*c and d*d will overflow or underflow if either

>> c or d is outside the range 2^-538 to 2^512.

>>

>> This method is available in gcc when the switch -fcx-limited-range is

>> used. That switch is also enabled by -ffast-math. Only one who has a

>> clear understanding of the maximum range of intermediate values

>> generated by a computation should consider using this switch.

>>

>> SMITH's METHOD (current libgcc):

>>    if(fabs(c)<fabs(d) {

>>      r = c/d;

>>      denom = (c*r) + d;

>>      e = (a*r + b) / denom;

>>      f = (b*r - a) / denom;

>>    } else {

>>      r = d/c;

>>      denom = c + (d*r);

>>      e = (a + b*r) / denom;

>>      f = (b - a*r) / denom;

>>    }

>>

>> Smith's method is the current default method available with __divdc3.

>>

>> Elen Kalda's METHOD

>>

>> This method applies the most significant part of the algorithm

>> proposed by Baudin&Smith (2012) in the paper "A Robust Complex

>> Division in Scilab" [2]. Elen's method also replaces two divides

>> by one divide and two multiplies due to the high cost of divide

>> on aarch64. In the comparison sections, this method will be

>> labeled b1div. A variation discussed in that patch which

>> does not replace the two divides will be labeled b2div.

>>

>>    inline void improved_internal (MTYPE a, MTYPE b, MTYPE c, MTYPE d)

>>    {

>>      r = d/c;

>>      t = 1.0 / (c + (d * r));

>>      if (r != 0) {

>>          x = (a + (b * r)) * t;

>>          y = (b - (a * r)) * t;

>>      }  else {

>>      /* Changing the order of operations avoids the underflow of r 

>> impacting

>>       the result. */

>>          x = (a + (d * (b / c))) * t;

>>          y = (b - (d * (a / c))) * t;

>>      }

>>    }

>>

>>    if (FABS (d) < FABS (c)) {

>>        improved_internal (a, b, c, d);

>>    } else {

>>        improved_internal (b, a, d, c);

>>        y = -y;

>>    }

>>

>> NEW METHOD (proposed by patch) to replace the current default method:

>>

>> The proposed method starts with an algorithm proposed by Baudin&Smith

>> (2012) in the paper "A Robust Complex Division in Scilab" [2]. The

>> patch makes additional modifications to that method for further

>> reductions in the error rate. The following code shows the #define

>> values for double precision. See the patch for #define values used

>> for other precisions.

>>

>>    #define RBIG ((DBL_MAX)/2.0)

>>    #define RMIN (DBL_MIN)

>>    #define RMIN2 (0x1.0p-53)

>>    #define RMINSCAL (0x1.0p+51)

>>

>>    /* prevent overflow when arguments are near max representable */

>>    if ((FABS (d) > RBIG) || (FABS (a) > RBIG) || (FABS (b) > RBIG) ) {

>>        a = a * 0.5;

>>        b = b * 0.5;

>>        c = c * 0.5;

>>        d = d * 0.5;

>>    }

>>    /* minimize overflow/underflow issues when c and d are small */

>>    else if (FABS (d) < RMIN2) {

>>        a = a * RMINSCAL;

>>        b = b * RMINSCAL;

>>        c = c * RMINSCAL;

>>        d = d * RMINSCAL;

>>    }

>>    r = c/d; denom = (c*r) + d;

>>    if( r > RMIN ) {

>>        e = (a*r + b) / denom   ;

>>        f = (b*r - a) / denom

>>    } else {

>>        e = (c * (a/d) + b) / denom;

>>        f = (c * (b/d) - a) / denom;

>>    }

>> [ only presenting the fabs(c) < fabs(d) case here, full code in patch. ]

>>

>> Before any computation of the answer, the code checks for any input

>> values near maximum to allow down scaling to avoid overflow. These

>> scalings almost never harm the accuracy since they are by 2. Values that

>> are over RBIG are relatively rare but it is easy to test for them and

>> allow aviodance of overflows.

>>

>> Testing for RMIN2 reveals when both c and d are less than 2^-53 (for

>> double precision, see patch for other values).  By scaling all values

>> by 2^51, the code avoids many underflows in intermediate computations

>> that otherwise might occur. If scaling a and b by 2^51 causes either

>> to overflow, then the computation will overflow whatever method is

>> used.

>>

>> Next, r (the ratio of c to d) is checked for being near zero. Baudin

>> and Smith checked r for zero. Checking for values less than DBL_MIN

>> (subnormal) covers roughly 10 times as many cases and improves overall

>> accuracy. If r is too small, then when it is used in a multiplication,

>> there is a high chance that the result will underflow to zero, losing

>> significant accuracy. That underflow is avoided by reordering the

>> computation.  When r is subnormal, the code replaces a*r (= a*(c/d))

>> with ((a/d)*c) which is mathematically the same but avoids the

>> unnecessary underflow.

>>

>> TEST Data

>>

>> Two sets of data are presented to test these methods. Both sets

>> contain 10 million pairs of complex values.  The exponents and

>> mantissas are generated using multiple calls to random() and then

>> combining the results. Only values which give results to complex

>> divide that are representable in the appropriate precision after

>> being computed in quad precision are used.

>>

>> The first data set is labeled "moderate exponents".

>> are greater than *_MAX_EXP/2 or less than -(*_MAX_EXP)/2, results are

>> The exponent range is limited to -DBL_MAX_EXP/2 to DBL_MAX_EXP/2

>> or FLT_MAX_EXP or LDBL_MAX_EXP for the appropriate precisions.

>> The second data set is labeled "full exponents".

>> The exponent range for these cases is the full exponent range

>> for the precision.

>>

>> ACCURACY Test results:

>>

>> Note: The following accuracy tests are based on IEEE-754 arithmetic.

>>

>> Note: All results are based on use of fused multiply-add. If

>> fused multiply-add is not used, the error rate increases slightly

>> for the 1 and 2 ulp cases for both current and new complex divide.

>> Differences at 8 ulp are barely measurable.

>>

>> The complex divide methods are evaluated by determining what

>> percentage of values exceed different ulp (units in last place)

>> levels.  If a "2 ulp" test results show 1%, that would mean that 1% of

>> 10,000,000 values (100,000) have either a real or imaginary part that

>> had a greater than or equal to a 2 bit difference from the quad

>> precision result.

>>

>> Results are reported for differences greater than or equal to 1 ulp, 2

>> ulp, 8 ulp, 16 ulp, 24 ulp, and 52 ulp for double precision. Even

>> when the patch avoids overflows and underflows, some input values are

>> expected to have errors due to the potential for catastrophic roundoff

>> from floating point subtraction. For example, when b*c and a*d are

>> nearly equal, the result of subtraction may lose several places of

>> accuracy. This patch does not attempt to detect or minimize this type

>> of error, but neither does it increase them.

>>

>> I only show the results for Elen Kalda's method (with both 1 and

>> 2 divides) and the new method for only 1 divide in the double

>> precision table.

>>

>> In the following charts, lower values are better.

>>

>> current - current complex divide in libgcc

>> b1div - Elen Kalda's method from Baudin & Smith with one divide

>> b2div - Elen Kalda's method from Baudin & Smith with two divides

>> new1  - This patch except with 1 divide and 2 multiplies

>> new   - This patch which uses 2 divides

>>

>> ===============================================================

>> Errors   Moderate Dataset

>> gtr eq     current    b1div      b2div        new1       new

>> ======    ========   ========   ========   ========   ========

>>   1 ulp    0.24707%   0.92986%   0.24707%   0.92986%   0.24707%

>>   2 ulp    0.01762%   0.01770%   0.01762%   0.01770%   0.01762%

>>   8 ulp    0.00026%   0.00026%   0.00026%   0.00026%   0.00026%

>> 16 ulp    0.00000%   0.00000%   0.00000%   0.00000%   0.00000%

>> 24 ulp          0%         0%         0%         0%         0%

>> 52 ulp          0%         0%         0%         0%         0%

>> ===============================================================

>> Table 1: Errors with Moderate Dataset

>>

>> Note in Table 1 that both the old and new methods give identical error

>> rates for data with moderate exponents. Errors exceeding 16 ulp are

>> exceedingly rare. There are substantial increases in the 1 ulp error

>> rates for the 1 divide/2 multiply methods as compared to the 2 divides

>> methods. These differences are minimal for 2 ulp and larger error

>> measurements. I chose to use the more accurate method of two divides

>> to avoid loss of accuracy relative to current behavior where current

>> behavior is not at risk of under/overflow.

>>

>> ===============================================================

>> Errors   Full Dataset

>> gtr eq     current    b1div      b2div        new1       new

>> ======    ========   ========   ========   ========   ========

>>   1 ulp      2.05%   1.23842%    0.67130%   0.71774%   0.17100%

>>   2 ulp      1.88%   0.51615%    0.50354%   0.02052%   0.01319%

>>   8 ulp      1.79%   0.43227%    0.42531%   0.00358%   0.00350%

>> 16 ulp      1.69%   0.34503%    0.33542%   0.00212%   0.00212%

>> 24 ulp      1.59%   0.26367%    0.25189%   0.00138%   0.00138%

>> 52 ulp      1.28%   0.01914%    0.00378%   0.00001%   0.00001%

>> ===============================================================

>> Table 2: Errors with Full Dataset (double)

>>

>> Table 2 shows significant differences in error rates, especially

>> between the current method and the new method. With the current code,

>> 1.13% of test values have results where no bits of the mantissa match

>> the correct answer. That level of error is extremely rare with the new

>> code. Consider the results for errors greater than or equal to 24 ulp.

>> The current code sees those errors at a rate of 1.6%. Most would agree

>> that such answers are "substantially different" from the answers

>> calculated using extended precision. The b1div and b2div still shows

>> errors about about one-sixth of that rate. The new code reduces errors

>> at that level by more than a factor of 1000. These improvements are

>> important for serious computation of complex numbers.

>>

>> Correctness for float

>> =============================

>> Errors   Moderate Dataset

>> gtr eq     current     new

>> ======    ========   ========

>>   1 ulp   28.68070%  28.68070%

>>   2 ulp    0.64386%   0.64386%

>>   8 ulp    0.00403%   0.00403%

>> 16 ulp    0.00001%   0.00001%

>> 24 ulp          0%         0%

>> =============================

>> Table 3: Errors with Moderate Dataset (float)

>>

>> We see again in Table 3 as in Table 1, there is no change in accuracy

>> between the current and new methods when the input data does not

>> contain very large or very small exponents with potential for

>> under/overflow.

>>

>> =============================

>> Errors   Full Dataset

>> gtr eq     current     new

>> ======    ========   ========

>>   1 ulp     19.97%  17.85193%

>>   2 ulp      3.20%   0.43153%

>>   8 ulp      2.16%   0.04139%

>> 16 ulp      1.40%   0.00818%  (99.4% better)

>> 24 ulp      0.98%   0.00016%

>> =============================

>> Table 4: Errors with Full Dataset (float)

>>

>> In Table 4, we see as in Table 2, when extreme exponents are 

>> encountered,

>> the new method provides a substantial reduction in unacceptable answers.

>> The improvements are not quite as much as for double precision due

>> to the inherent limitations of 32-bit floats.

>>

>> There is no change in accuracy for half-precision since the code is

>> unchanged. libgcc computes half-precision functions in float precision

>> allowing the existing methods to avoid overflow/underflow issues

>> for the allowed range of exponents for half-precision.

>>

>> Extended precision (using x87 80-bit format on x86) and Long double

>> (using IEEE-754 128-bit on x86 and aarch64) both have 15-bit exponents

>> as compared to 11-bit exponents in double precision. We note that the

>> C standard also allows Long Double to be implemented in the equivalent

>> range of Double. The RMIN2 and RMINSCAL constants are selected to work

>> within the Double range as well as with extended and 128-bit ranges.

>> We will limit our performance and accurancy discussions to the 80-bit

>> and 128-bit formats as seen on x86 here.

>>

>> The extended and long double precision investigations were more

>> limited.  Aarch64 does not supported extended precision but does

>> support the software implementation of 128-bit long double

>> precision. For x86, long double defaults to the 80-bit precision but

>> using the -mlong-double-128 flag switches to using the software

>> implementation of 128-bit precision. Both 80-bit and 128-bit

>> precisions have the same exponent range, with the 128-bit precision

>> has extended mantissas. Since this change is only aimed at avoiding

>> underflow/overflow for extreme exponents, I studied the extended

>> precision results on x86 for 100,000 values. The limited exponent

>> dataset showed no differences.  For the dataset with full exponent

>> range, the current and new values showed major differences (greater

>> than 32 ulp) in 567 cases out of 100,000 (0.56%). In every one of

>> these cases, the ratio of c/d or d/c (as appropriate) was zero or

>> subnormal, indicating the advantage of the new method and its

>> continued correctness where needed.

>>

>> PERFORMANCE Test results

>>

>> In order for a library change to be practical, it is necessary to show

>> the slowdown is tolerable. The slowdowns observed are much less than

>> would be seen by (for example) switching from hardware double precison

>> to a software quad precision, which on the tested machines causes a

>> slowdown of around 100x).

>>

>> The actual slowdown depends on the machine architecture. It also

>> depends on the nature of the input data. If underflow/overflow is

>> rare, then implementations that have strong branch prediction will

>> only slowdown by a few cycles. If underflow/overflow is common, then

>> the branch predictors will be less successful and the cost will be

>> higher.

>>

>> Results from two machines are presented as examples of the overhead

>> for the new method. The one labeled x86 is a 5 year old Intel x86

>> processor and the one labeled aarch64 is a 3 year old arm64 processor.

>>

>> In the following chart, the times are averaged over a one million

>> value data set. All values are scaled to set the time of current

>> method to be 1.0. Lower values are better. A value of less than 1.0 

>> would

>> be faster than the current method and a value greater than 1.0 would

>> be slower than the current method.

>>

>> ================================================

>>                 Moderate set          full set

>>                 x86  aarch64        x86  aarch64

>> ========     ===============     ===============

>> float         1.15    1.15        1.96    1.96

>> double        1.07    1.40        1.31    1.78

>> long double   1.08    1.23        1.08    1.24

>> ================================================

>> Table 5: Performance Comparisons (ratio new/current)

>>

>> The above tables omit the timing for the 1 divide and 2 multiply

>> comparison with the 2 divide approach. I consider that optimization

>> vs accuracy issue separate from the rest of this proposal.

>> Roughly speaking the 1 divide and 2 multiply approach provides

>> a performance gain of 10 to 15% at a cost of 1 or 2 ulp about 0.5%

>> of the time for double precison. The loss of accurancy is somewhat

>> greater for float than for double precision.

>>

>> For the proposed change, the moderate dataset shows less overhead for

>> the newer methods than the full dataset. That's because the moderate

>> dataset does not ever take the new branches which protect from

>> under/overflow. The better the branch predictor, the lower the cost

>> for these untaken branches. Both platforms are somewhat dated, with

>> the x86 having a better branch predictor which reduces the cost of the

>> additional branches in the new code. Of course, the relative slowdown

>> may be greater for some architectures, especially those with limited

>> branch prediction combined with a high cost of misprediction.

>>

>> This cost is claimed to be tolerable on the grounds that:

>> (a) most applications will only spend a small fraction of their time

>>      calculating complex divide.

>> (b) it is much less than the cost of extended precision

>> (c) users are not forced to use it (as described below)

>> (d) the cost is worthwhile considering the accuracy improvement shown

>>      in Table 2 and 4.

>>

>> Those users who find this degree of slowdown unsatisfactory may use

>> the switch -fcx-fortran-rules which does not use the library routine,

>> instead inlining Smith's method without the C99 requirement for

>> dealing with NaN results. The proposed patch for libgcc complex divide

>> does not affect the code generated by -fcx-fortran-rules.

>>

>> As a side note, double precision is not slower than float for the

>> aarch64 and x64 platforms tested. A future change might investigate

>> the performance of promoting the float values of complex divide to

>> double precision and using the current Smith's method. Using double

>> precision for float values will avoid the round off effects and using

>> the larger mantissa will give better answers in the case of

>> catastrophic subtraction. Better accuracy with no loss of performance

>> seems a win all around, but would require performance testing on a

>> wide range of platforms to confirm the "no loss of performance"

>> observed on the two implementions tested so far.

>>

>> SUMMARY

>>

>> When input data to complex divide has exponents whose absolute value

>> is half than half of *_MAX_EXP, this patch makes no changes in

>> accuracy and has only a modest effect on performance.  When input data

>> contains values outside those ranges, the patch eliminates more than

>> 99.5% of major errors with a tolerable cost in performance.

>>

>> In comparison to Elen Kalda's method, this patch introduces more

>> performance overhead but reduces overflow/underflow more than 100

>> times as often.

>>

>> There is an intermediate design choice which would catch 13x fewer

>> errors while introducing roughly 20% lower overhead. That could be

>> analyzed in depth if the overhead of the current proposal is

>> unacceptable.

>>

>> REFERENCES

>>

>> [1] Robert L. Smith. Algorithm 116: Complex division.  Commun. ACM,

>>   5(8):435, 1962.

>>

>> [2] Michael Baudin and Robert L. Smith. "A robust complex division in

>> Scilab," October 2012, available at http://arxiv.org/abs/1210.4539.

>>

>> [3] Elen Kalda: Complex division improvements in libgcc

>> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg01629.html

>>

>> [4] Nelson H.F. Beebe, "The Mathematical-Function Computation Handbook.

>> Springer International Publishing AG, 2017.

>> ---

>>   libgcc/ChangeLog |   6 ++++

>>   libgcc/libgcc2.c | 105 

>> ++++++++++++++++++++++++++++++++++++++++++++++++++++---

>>   2 files changed, 107 insertions(+), 4 deletions(-)

>>

>> diff --git a/libgcc/ChangeLog b/libgcc/ChangeLog

>> index d6367e2..d1db004 100644

>> --- a/libgcc/ChangeLog

>> +++ b/libgcc/ChangeLog

>> @@ -1,3 +1,9 @@

>> +2020-06-24  Patrick McGehearty <patrick.mcgehearty@oracle.com>

>> +

>> +    * libgcc2.c (__divdc3): Enhance accuracy of complex divide by

>> +    avoiding underflow/overflow when input values have very large

>> +    or very small exponents.

>> +

>>   2020-06-25  Martin Liska  <mliska@suse.cz>

>>         * libgcov-driver.c (merge_summary): Remove function as its name

>> diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c

>> index e0a9fd7..feffb13 100644

>> --- a/libgcc/libgcc2.c

>> +++ b/libgcc/libgcc2.c

>> @@ -2036,13 +2036,35 @@ CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, 

>> MTYPE c, MTYPE d)

>>   CTYPE

>>   CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)

>>   {

>> +#if defined(L_divsc3)

>> +#define RBIG    ((FLT_MAX)/2.0)

>> +#define RMIN    (FLT_MIN)

>> +#define RMIN2    (0x1.0p-21)

>> +#define RMINSCAL (0x1.0p+19)

>> +#endif

>> +

>> +#if defined(L_divdc3)

>> +#define RBIG    ((DBL_MAX)/2.0)

>> +#define RMIN    (DBL_MIN)

>> +#define RMIN2    (0x1.0p-53)

>> +#define RMINSCAL (0x1.0p+51)

>> +#endif

>> +

>> +#if (defined(L_divxc3) || defined(L_divtc3))

>> +#define RBIG    ((LDBL_MAX)/2.0)

>> +#define RMIN    (LDBL_MIN)

>> +#define RMIN2    (0x1.0p-53)

>> +#define RMINSCAL (0x1.0p+51)

>> +#endif

>> +

>>     MTYPE denom, ratio, x, y;

>>     CTYPE res;

>>   -  /* ??? We can get better behavior from logarithmic scaling 

>> instead of

>> -     the division.  But that would mean starting to link libgcc against

>> -     libm.  We could implement something akin to ldexp/frexp as gcc 

>> builtins

>> -     fairly easily...  */

>> +#if defined(L_divhc3)

>> +  /* half precision is handled with float precision.

>> +     no extra measures are needed to avoid overflow/underflow */

>> +

>> +  /* Scale by max(c,d) to reduce chances of denominator overflowing */

>>     if (FABS (c) < FABS (d))

>>       {

>>         ratio = c / d;

>> @@ -2057,6 +2079,81 @@ CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE 

>> c, MTYPE d)

>>         x = ((b * ratio) + a) / denom;

>>         y = (b - (a * ratio)) / denom;

>>       }

>> +#else

>> +  /* float, double, extended, long double have significant potential

>> +    underflow/overflow errors than can be greatly reduced with

>> +    a limited number of adjustments  */

>> +

>> +  /* Scale by max(c,d) to reduce chances of denominator overflowing */

>> +  if (FABS (c) < FABS (d))

>> +    {

>> +      /* prevent underflow when denominator is near max 

>> representable */

>> +      if (FABS (d) >= RBIG)

>> +    {

>> +      a = a * 0.5;

>> +      b = b * 0.5;

>> +      c = c * 0.5;

>> +      d = d * 0.5;

>> +    }

>> +      /* avoid overflow/underflow issues when c and d are small */

>> +      /* scaling up helps avoid some underflows */

>> +      /* No new overflow possible since c&d < RMIN2 */

>> +      if (FABS (d) < RMIN2)

>> +    {

>> +      a = a * RMINSCAL;

>> +      b = b * RMINSCAL;

>> +      c = c * RMINSCAL;

>> +      d = d * RMINSCAL;

>> +    }

>> +      ratio = c / d;

>> +      denom = (c * ratio) + d;

>> +      /* choose alternate order of computation if ratio is subnormal */

>> +      if (FABS (ratio) > RMIN)

>> +    {

>> +      x = ((a * ratio) + b) / denom;

>> +      y = ((b * ratio) - a) / denom;

>> +    }

>> +      else

>> +    {

>> +      x = ((c * (a / d)) + b) / denom;

>> +      y = ((c * (b / d)) - a) / denom;

>> +    }

>> +    }

>> +  else

>> +    {

>> +      /* prevent underflow when denominator is near max 

>> representable */

>> +      if (FABS(c) >= RBIG)

>> +    {

>> +      a = a * 0.5;

>> +      b = b * 0.5;

>> +      c = c * 0.5;

>> +      d = d * 0.5;

>> +    }

>> +      /* avoid overflow/underflow issues when both c and d are small */

>> +      /* scaling up helps avoid some underflows */

>> +      /* No new overflow possible since both c&d are less than RMIN2 */

>> +      if (FABS(c) < RMIN2)

>> +    {

>> +      a = a * RMINSCAL;

>> +      b = b * RMINSCAL;

>> +      c = c * RMINSCAL;

>> +      d = d * RMINSCAL;

>> +    }

>> +      ratio = d / c;

>> +      denom = (d * ratio) + c;

>> +      /* choose alternate order of computation if ratio is subnormal */

>> +      if (FABS(ratio) > RMIN)

>> +    {

>> +      x = ((b * ratio) + a) / denom;

>> +      y = (b - (a * ratio)) / denom;

>> +    }

>> +      else

>> +    {

>> +      x = (a + (d * (b / c))) / denom;

>> +      y = (b - (d * (a / c))) / denom;

>> +    }

>> +    }

>> +#endif

>>       /* Recover infinities and zeros that computed as NaN+iNaN; the 

>> only cases

>>        are nonzero/zero, infinite/finite, and finite/infinite. */

>

Patch

===============================================================
Errors   Moderate Dataset
gtr eq     current    b1div      b2div        new1       new
======    ========   ========   ========   ========   ========
 1 ulp    0.24707%   0.92986%   0.24707%   0.92986%   0.24707%
 2 ulp    0.01762%   0.01770%   0.01762%   0.01770%   0.01762%
 8 ulp    0.00026%   0.00026%   0.00026%   0.00026%   0.00026%
16 ulp    0.00000%   0.00000%   0.00000%   0.00000%   0.00000%
24 ulp          0%         0%         0%         0%         0%
52 ulp          0%         0%         0%         0%         0%
===============================================================
Table 1: Errors with Moderate Dataset

Note in Table 1 that both the old and new methods give identical error
rates for data with moderate exponents. Errors exceeding 16 ulp are
exceedingly rare. There are substantial increases in the 1 ulp error
rates for the 1 divide/2 multiply methods as compared to the 2 divides
methods. These differences are minimal for 2 ulp and larger error
measurements. I chose to use the more accurate method of two divides
to avoid loss of accuracy relative to current behavior where current
behavior is not at risk of under/overflow.

===============================================================
Errors   Full Dataset
gtr eq     current    b1div      b2div        new1       new
======    ========   ========   ========   ========   ========
 1 ulp      2.05%   1.23842%    0.67130%   0.71774%   0.17100%
 2 ulp      1.88%   0.51615%    0.50354%   0.02052%   0.01319%
 8 ulp      1.79%   0.43227%    0.42531%   0.00358%   0.00350%
16 ulp      1.69%   0.34503%    0.33542%   0.00212%   0.00212%
24 ulp      1.59%   0.26367%    0.25189%   0.00138%   0.00138%
52 ulp      1.28%   0.01914%    0.00378%   0.00001%   0.00001%
===============================================================
Table 2: Errors with Full Dataset (double)

Table 2 shows significant differences in error rates, especially
between the current method and the new method. With the current code,
1.13% of test values have results where no bits of the mantissa match
the correct answer. That level of error is extremely rare with the new
code. Consider the results for errors greater than or equal to 24 ulp.
The current code sees those errors at a rate of 1.6%. Most would agree
that such answers are "substantially different" from the answers
calculated using extended precision. The b1div and b2div still shows
errors about about one-sixth of that rate. The new code reduces errors
at that level by more than a factor of 1000. These improvements are
important for serious computation of complex numbers.

Correctness for float
=============================
Errors   Moderate Dataset
gtr eq     current     new
======    ========   ========
 1 ulp   28.68070%  28.68070%
 2 ulp    0.64386%   0.64386%
 8 ulp    0.00403%   0.00403%
16 ulp    0.00001%   0.00001%
24 ulp          0%         0%
=============================
Table 3: Errors with Moderate Dataset (float)

We see again in Table 3 as in Table 1, there is no change in accuracy
between the current and new methods when the input data does not
contain very large or very small exponents with potential for
under/overflow.

=============================
Errors   Full Dataset
gtr eq     current     new
======    ========   ========
 1 ulp     19.97%  17.85193%
 2 ulp      3.20%   0.43153%
 8 ulp      2.16%   0.04139%
16 ulp      1.40%   0.00818%  (99.4% better)
24 ulp      0.98%   0.00016%
=============================
Table 4: Errors with Full Dataset (float)

In Table 4, we see as in Table 2, when extreme exponents are encountered,
the new method provides a substantial reduction in unacceptable answers.
The improvements are not quite as much as for double precision due
to the inherent limitations of 32-bit floats.

There is no change in accuracy for half-precision since the code is
unchanged. libgcc computes half-precision functions in float precision
allowing the existing methods to avoid overflow/underflow issues
for the allowed range of exponents for half-precision.

Extended precision (using x87 80-bit format on x86) and Long double
(using IEEE-754 128-bit on x86 and aarch64) both have 15-bit exponents
as compared to 11-bit exponents in double precision. We note that the
C standard also allows Long Double to be implemented in the equivalent
range of Double. The RMIN2 and RMINSCAL constants are selected to work
within the Double range as well as with extended and 128-bit ranges.
We will limit our performance and accurancy discussions to the 80-bit
and 128-bit formats as seen on x86 here.

The extended and long double precision investigations were more
limited.  Aarch64 does not supported extended precision but does
support the software implementation of 128-bit long double
precision. For x86, long double defaults to the 80-bit precision but
using the -mlong-double-128 flag switches to using the software
implementation of 128-bit precision. Both 80-bit and 128-bit
precisions have the same exponent range, with the 128-bit precision
has extended mantissas. Since this change is only aimed at avoiding
underflow/overflow for extreme exponents, I studied the extended
precision results on x86 for 100,000 values. The limited exponent
dataset showed no differences.  For the dataset with full exponent
range, the current and new values showed major differences (greater
than 32 ulp) in 567 cases out of 100,000 (0.56%). In every one of
these cases, the ratio of c/d or d/c (as appropriate) was zero or
subnormal, indicating the advantage of the new method and its
continued correctness where needed.

PERFORMANCE Test results

In order for a library change to be practical, it is necessary to show
the slowdown is tolerable. The slowdowns observed are much less than
would be seen by (for example) switching from hardware double precison
to a software quad precision, which on the tested machines causes a
slowdown of around 100x).

The actual slowdown depends on the machine architecture. It also
depends on the nature of the input data. If underflow/overflow is
rare, then implementations that have strong branch prediction will
only slowdown by a few cycles. If underflow/overflow is common, then
the branch predictors will be less successful and the cost will be
higher.

Results from two machines are presented as examples of the overhead
for the new method. The one labeled x86 is a 5 year old Intel x86
processor and the one labeled aarch64 is a 3 year old arm64 processor.

In the following chart, the times are averaged over a one million
value data set. All values are scaled to set the time of current
method to be 1.0. Lower values are better. A value of less than 1.0 would
be faster than the current method and a value greater than 1.0 would
be slower than the current method.

================================================
               Moderate set          full set
               x86  aarch64        x86  aarch64
========     ===============     ===============
float         1.15    1.15        1.96    1.96
double        1.07    1.40        1.31    1.78
long double   1.08    1.23        1.08    1.24
================================================
Table 5: Performance Comparisons (ratio new/current)

The above tables omit the timing for the 1 divide and 2 multiply
comparison with the 2 divide approach. I consider that optimization
vs accuracy issue separate from the rest of this proposal.
Roughly speaking the 1 divide and 2 multiply approach provides
a performance gain of 10 to 15% at a cost of 1 or 2 ulp about 0.5%
of the time for double precison. The loss of accurancy is somewhat
greater for float than for double precision.

For the proposed change, the moderate dataset shows less overhead for
the newer methods than the full dataset. That's because the moderate
dataset does not ever take the new branches which protect from
under/overflow. The better the branch predictor, the lower the cost
for these untaken branches. Both platforms are somewhat dated, with
the x86 having a better branch predictor which reduces the cost of the
additional branches in the new code. Of course, the relative slowdown
may be greater for some architectures, especially those with limited
branch prediction combined with a high cost of misprediction.

This cost is claimed to be tolerable on the grounds that:
(a) most applications will only spend a small fraction of their time
    calculating complex divide.
(b) it is much less than the cost of extended precision
(c) users are not forced to use it (as described below)
(d) the cost is worthwhile considering the accuracy improvement shown
    in Table 2 and 4.

Those users who find this degree of slowdown unsatisfactory may use
the switch -fcx-fortran-rules which does not use the library routine,
instead inlining Smith's method without the C99 requirement for
dealing with NaN results. The proposed patch for libgcc complex divide
does not affect the code generated by -fcx-fortran-rules.

As a side note, double precision is not slower than float for the
aarch64 and x64 platforms tested. A future change might investigate
the performance of promoting the float values of complex divide to
double precision and using the current Smith's method. Using double
precision for float values will avoid the round off effects and using
the larger mantissa will give better answers in the case of
catastrophic subtraction. Better accuracy with no loss of performance
seems a win all around, but would require performance testing on a
wide range of platforms to confirm the "no loss of performance"
observed on the two implementions tested so far.

SUMMARY

When input data to complex divide has exponents whose absolute value
is half than half of *_MAX_EXP, this patch makes no changes in
accuracy and has only a modest effect on performance.  When input data
contains values outside those ranges, the patch eliminates more than
99.5% of major errors with a tolerable cost in performance.

In comparison to Elen Kalda's method, this patch introduces more
performance overhead but reduces overflow/underflow more than 100
times as often.

There is an intermediate design choice which would catch 13x fewer
errors while introducing roughly 20% lower overhead. That could be
analyzed in depth if the overhead of the current proposal is
unacceptable.

REFERENCES

[1] Robert L. Smith. Algorithm 116: Complex division.  Commun. ACM,
 5(8):435, 1962.

[2] Michael Baudin and Robert L. Smith. "A robust complex division in
Scilab," October 2012, available at http://arxiv.org/abs/1210.4539.

[3] Elen Kalda: Complex division improvements in libgcc
https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg01629.html

[4] Nelson H.F. Beebe, "The Mathematical-Function Computation Handbook.
Springer International Publishing AG, 2017.
---
 libgcc/ChangeLog |   6 ++++
 libgcc/libgcc2.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 107 insertions(+), 4 deletions(-)

diff --git a/libgcc/ChangeLog b/libgcc/ChangeLog
index d6367e2..d1db004 100644
--- a/libgcc/ChangeLog
+++ b/libgcc/ChangeLog
@@ -1,3 +1,9 @@ 
+2020-06-24  Patrick McGehearty <patrick.mcgehearty@oracle.com>
+
+	* libgcc2.c (__divdc3): Enhance accuracy of complex divide by
+	avoiding underflow/overflow when input values have very large
+	or very small exponents.
+
 2020-06-25  Martin Liska  <mliska@suse.cz>
 
 	* libgcov-driver.c (merge_summary): Remove function as its name
diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c
index e0a9fd7..feffb13 100644
--- a/libgcc/libgcc2.c
+++ b/libgcc/libgcc2.c
@@ -2036,13 +2036,35 @@  CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
 CTYPE
 CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
 {
+#if defined(L_divsc3)
+#define RBIG	((FLT_MAX)/2.0)
+#define RMIN	(FLT_MIN)
+#define RMIN2	(0x1.0p-21)
+#define RMINSCAL (0x1.0p+19)
+#endif
+
+#if defined(L_divdc3)
+#define RBIG	((DBL_MAX)/2.0)
+#define RMIN	(DBL_MIN)
+#define RMIN2	(0x1.0p-53)
+#define RMINSCAL (0x1.0p+51)
+#endif
+
+#if (defined(L_divxc3) || defined(L_divtc3))
+#define RBIG	((LDBL_MAX)/2.0)
+#define RMIN	(LDBL_MIN)
+#define RMIN2	(0x1.0p-53)
+#define RMINSCAL (0x1.0p+51)
+#endif
+
   MTYPE denom, ratio, x, y;
   CTYPE res;
 
-  /* ??? We can get better behavior from logarithmic scaling instead of
-     the division.  But that would mean starting to link libgcc against
-     libm.  We could implement something akin to ldexp/frexp as gcc builtins
-     fairly easily...  */
+#if defined(L_divhc3)
+  /* half precision is handled with float precision.
+     no extra measures are needed to avoid overflow/underflow */
+
+  /* Scale by max(c,d) to reduce chances of denominator overflowing */
   if (FABS (c) < FABS (d))
     {
       ratio = c / d;
@@ -2057,6 +2079,81 @@  CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
       x = ((b * ratio) + a) / denom;
       y = (b - (a * ratio)) / denom;
     }
+#else
+  /* float, double, extended, long double have significant potential
+    underflow/overflow errors than can be greatly reduced with
+    a limited number of adjustments  */
+
+  /* Scale by max(c,d) to reduce chances of denominator overflowing */
+  if (FABS (c) < FABS (d))
+    {
+      /* prevent underflow when denominator is near max representable */
+      if (FABS (d) >= RBIG)
+	{
+	  a = a * 0.5;
+	  b = b * 0.5;
+	  c = c * 0.5;
+	  d = d * 0.5;
+	}
+      /* avoid overflow/underflow issues when c and d are small */
+      /* scaling up helps avoid some underflows */
+      /* No new overflow possible since c&d < RMIN2 */
+      if (FABS (d) < RMIN2)
+	{
+	  a = a * RMINSCAL;
+	  b = b * RMINSCAL;
+	  c = c * RMINSCAL;
+	  d = d * RMINSCAL;
+	}
+      ratio = c / d;
+      denom = (c * ratio) + d;
+      /* choose alternate order of computation if ratio is subnormal */
+      if (FABS (ratio) > RMIN)
+	{
+	  x = ((a * ratio) + b) / denom;
+	  y = ((b * ratio) - a) / denom;
+	}
+      else
+	{
+	  x = ((c * (a / d)) + b) / denom;
+	  y = ((c * (b / d)) - a) / denom;
+	}
+    }
+  else
+    {
+      /* prevent underflow when denominator is near max representable */
+      if (FABS(c) >= RBIG)
+	{
+	  a = a * 0.5;
+	  b = b * 0.5;
+	  c = c * 0.5;
+	  d = d * 0.5;
+	}
+      /* avoid overflow/underflow issues when both c and d are small */
+      /* scaling up helps avoid some underflows */
+      /* No new overflow possible since both c&d are less than RMIN2 */
+      if (FABS(c) < RMIN2)
+	{
+	  a = a * RMINSCAL;
+	  b = b * RMINSCAL;
+	  c = c * RMINSCAL;
+	  d = d * RMINSCAL;
+	}
+      ratio = d / c;
+      denom = (d * ratio) + c;
+      /* choose alternate order of computation if ratio is subnormal */
+      if (FABS(ratio) > RMIN)
+	{
+	  x = ((b * ratio) + a) / denom;
+	  y = (b - (a * ratio)) / denom;
+	}
+      else
+	{
+	  x = (a + (d * (b / c))) / denom;
+	  y = (b - (d * (a / c))) / denom;
+	}
+    }
+#endif
 
   /* Recover infinities and zeros that computed as NaN+iNaN; the only cases
      are nonzero/zero, infinite/finite, and finite/infinite.  */