# About Floating-Point Arithmetic

*Go Up to Using Floating-Point Routines*

Working with floating-point numbers requires understanding of the internal representation of data. Programmers must be aware of the finite precision issues:

- For integral values (integer types), you must consider the possibility of overflow.
- For floating-point values (single, double or extended precision), you must consider the possibility of accuracy loss.

## Contents

## Finite Precision Implications

Floating-point accuracy loss can be the result of multiple causes:

- If you assign a floating-point literal (for example: 0.1) to a floating-point variable, the floating-point variable might not have sufficient bits to hold the desired value without introducing some error. The floating-point literal might require a very large number or even an infinite number of bits for infinite precision representation.

uses System.SysUtils; var X: Single; Y: Double; begin X := 0.1; Y := 0.1; Writeln('X =', X); Writeln('Y =', Y); Writeln(Format('Exponent X: %x', [X.Exp])); Writeln(Format('Mantissa Y: %x', [Y.Mantissa])); ReadLn; end.

- Console output:
`X = 1.00000001490116E-0001`

`Y = 1.00000000000000E-0001`

`Exponent Y: 3FB`

`Mantissa Y: 1999999999999A`

- From the code above, you can see the error in the ninth digit of the Single precision representation. The Double precision representation has the error too. To demonstrate this error let us use the raw Exponent and Mantissa of the Double precision number. The hexadecimal
`$3FB`

**Exponent**has the`1019`

decimal representation. Since the internal representation of Double numbers has the bias equal to 1023, then`Exponent = 1019 - 1023 = -4`

. The hexadecimal`$1999999999999A`

**Mantissa**has the`11001100110011001100110011001100110011001100110011010`

binary representation. Therefore, the binary representation of Y is`1.1001100110011001100110011001100110011001100110011010*2`

or^{-4}`0.00011001100110011001100110011001100110011001100110011010*2`

. Notice that this number is the Double precision approximation. The exact^{-4}`0.1`

number is represented as the infinite recurrent fraction`0.0(0011)`

.

- It is worthwhile to note, however, that the maximum error that can be produced in this way is 0.5 ulps.

- If you perform floating-point operations, then each step (operation) can introduce its specific error. This happens because, in the case of some operations, the computed result cannot be stored with complete precision. For example, if you multiply two numbers, S1 bits with S2 bits (this is true for integral types and for floating-point types), then the result requires S1 + S2 bits for complete precision.

- The "amount" of error introduced by an operation depends on the processor model and operation type. Additive operations introduce a relatively low error. Multiplication introduces a relatively high error.

It is important to understand that the floating-point accuracy loss (error) is propagated through calculations and it is the role of the programmer to design an algorithm that is, however, correct.

A floating-point variable can be regarded as an integer variable with a power of two scale. If you "force" the floating-point variable to an extreme value, the scale will automatically be adjusted. That is why you might have the impression that a floating-point variable cannot overflow. And it is indeed true, but on the other hand, there are other threats: a floating-point variable can accumulate a significant error and/or become **denormalized**.

## Using Larger Data Types

The easiest way to resolve the problem of integer overflow or floating-point accuracy drop (finite precision effects, in general) is to use data types from the same class (integral or floating-point), but with increased capacity. That is, if a ShortInt overflows, then you can easily switch to a LongInt, FixedInt or Int64. Similarly, if a single precision float does not provide enough accuracy, then you can switch to a double precision float. But there are two things to consider:

- The data type with more storage capacity can still be insufficient.
- The data type with more storage capacity requires more memory, and possibly more CPU cycles in operations.

## Control Settings

On the 32-bit platform, the x87 FPU control word (CW) has two bits allocated for specifying the rounding mode. See Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 1: Basic Architecture > 8.1.5.3 Rounding Control Field. For 64-bit programs, the SSE control register, MSXCSR, specifies the rounding mode. You can change them with the help of **System.Math.SetRoundMode**.

Some RTL functions that operate with floating-point variables might be affected by the FPU rounding mode. The exact nature of changes in the results of RTL routines based on the FPU control word depends on the algorithms being implemented. Rounding will have effects on every operation that needs rounding to fit the result into the target type, e.g. floating-point multiplication will almost always involve rounding. If a function consists of lots of floating-point multiplication, it will be strongly affected by the rounding mode.

The rounding mode is sometimes used to implement interval arithmetic: roughly speaking, doing the same algorithm with a round-mode of up, then repeating it with a round mode of down, and then seeing the difference between the two results. This gives an idea of the potential error introduced by rounding and imprecision.

## Use Cases

### Financial Calculations

IEEE floating-point might be inappropriate for financial calculations. This is because the precision requirements are usually very strict. You should consider using integral types (primitive integers or Currency) or BCD types.

The Data.FmtBcd unit provides support for BCD operations. The BCD format has the following important feature: each decimal digit (radix 10 digit) is coded with 4 bits of memory (a nibble).

The following code shows how to use a TBcd value as a variant, for convenience:

**Delphi:**

```
var
X: Variant;
begin
X := VarFMTBcdCreate('0.1', 16 { Precision }, 2 { Scale });
Writeln(VarToStr(X));
// ...
```

**C++:**

```
#include <Variants.hpp>
#include <FMTBcd.hpp>
int _tmain(int argc, _TCHAR* argv[]) {
Variant x = VarFMTBcdCreate("0.1", 16 /* Precision */, 2 /* Scale */);
printf("%ls", VarToStr(x).c_str());
// ...
```

Console output:

`0.1`

You can see in the code above that with the help of a BCD variable the conversion from text to numeric format is perfect.

You can use for financial calculations the Currency type. The Currency type is in essence an integer scaled by 10000 (this value allows exact division by 10). You can store four decimal digits in a Currency variable, anything that goes beyond this limit is rounded.

**Delphi:**

```
var
X, Y: Currency;
begin
X := 0.1;
Y := 0.00001;
Writeln(X);
Writeln(Y);
// ...
```

**C++:**

```
#include <System.hpp>
int _tmain(int argc, _TCHAR* argv[]) {
Currency x, y;
x = 0.1;
y = 0.00001;
printf("%2.14E\n", (double)x);
printf("%2.14E\n", (double)y);
// ...
```

Console output:

`1.00000000000000E-0001`

`0.00000000000000E+0000`

You can see the C++ implementation of **Currency** in `$BDS\include\rtl\syscurr.h`

.

The **Currency** interval is [-922337203685477.5807; 922337203685477.5807].

### Physical (Scientific) Calculations/Simulations

Scientific calculations require, in general, considerable computations and increasing the floating-point precision might be not desirable. Extended precision operations are less supported (see Delphi for x64).

If you use SSE, then you must keep in mind that a SSE register can hold two double precision variables or four single precision variables. Thus you can do in parallel more single precision than double precision operations.

A very interesting and useful approach is the following: use single precision floating point, but periodically reduce the accumulated error. Many applications can tolerate a small accuracy loss; it is just important to somehow cancel the deviations. An example of such implementation is the 3D spatial rotation using quaternions. See Physically Based Modeling > Rigid Body Dynamics (ONLINE SIGGRAPH 2001 COURSE NOTES).

### Digital Signal Processing

All DSP variables are, in general, "contaminated" with error. You should consider the optimal trade-off between data precision and computational effort.

## General Conclusions

### "What you see is not what you get"

Floating-point numbers written in the source code with decimal digits and floating-point numbers displayed on screen **probably** differ from what resides in memory. Do not assume that what you see on the console represents exactly what is in memory. Decimal to binary conversion (and back) cannot be done perfectly in every case.

Use integral, BCD, or Currency variables to avoid the IEEE floating-point representation error.

### Understand the Data Flow

In Delphi, intermediate results of Single precision floating-point expressions are always implicitly stored as Extended on x86.

By default, all x64 arithmetic operations and expressions involving only Single precision floating-point values retain high precision by storing intermediate results as Double precision values. As the result, these operations are slower than with explicit double precision operands (the compiled code converts Single values to Double on each operation). If the speed of execution is the primary concern, you can mark the code in question with the {$EXCESSPRECISION OFF} directive to disable the use of intermediate double-precision values. Otherwise the default directive ({$EXCESSPRECISION ON}) is recommended to improve the precision of the resulting value. The {$EXCESSPRECISION OFF} directive has effect only for x64 target CPU.

In C++, a floating-point literal can represent either a single precision or a double precision float: it depends on the presence of the `f`

suffix. If `f`

is appended to a floating-point literal in C++, then the compiler chooses single precision. To understand the precision of intermediary values, consult the ISO published standards.

### Floating-Point Operations Might Not Be Associative

Because of the error produced by every operator, the order of executing the calculations is significant.

See CERT C Secure Coding Standards, Recommendation FLP01-C.

## Floating-Point Exceptions

Floating-point operations can lead to several incorrect situations like floating-point overflow, division by zero, denormalized value, generating NaNs, and executing other invalid floating-point operations. Typically, such situations lead to raising floating-point exceptions. The System.Math unit provides:

- The set of possible floating-point exceptions.
- The methods to retrieve the currently active floating-point exceptions on the current platform.
- The methods to retrieve the currently masked exceptions from the current floating-point hardware and to set which exceptions to be masked by the current floating-point hardware.

Different floating-point hardware provide different means to control floating-point exceptions:

- On Intel 32-bit processors this is the FPU control word.
- On Intel 64-bit processors this is the SSE control word.
- We do not support floating-point exceptions on ARM architecture. Therefore, we always mask all floating-point exceptions on ARM architecture.

## See Also

- Floating-Point Number Control Routines
- Floating-Point Rounding Routines
- Floating-Point Comparison Routines
- Internal Representation of Real Types
- Internal Representation Of Numerical Types
- Floating-Point Exceptions
- System.Round
- System.Math.SetRoundMode
- Floating point precision control ({$EXCESSPRECISION} Delphi compiler directive for x64)
- Extended type compatibility ({$EXTENDEDCOMPATIBILITY} Delphi)
- SetRoundMode (Delphi) code example
- How Futile are Mindless Assessments of Roundoff in Floating-Point Computation? (by William Kahan)
- What Every Computer Scientist Should Know About Floating-Point Arithmetic (by David Goldberg)
- Numerical stability (on Wikipedia.org)
- CERT C Coding Standards, Floating Point (FLP)
- Floating-point to fixed-point conversion for DSP (Google Project Hosting)
- Floating-Point Rounding (Google Project Hosting)