About Floating-Point Arithmetic
Working with floating-point numbers requires an understanding of the internal representation of data. Programmers must be aware of finite precision issues:
 For integral values (integer types), you must consider the possibility of overflow.
 For floating-point values (single, double or extended precision), you must consider the possibility of accuracy loss.
Finite Precision Implications
Floating-point accuracy loss can be the result of multiple causes:
 If you assign a floating-point literal (for example: 0.1) to a floating-point variable, the floating-point variable might not have sufficient bits to hold the desired value without introducing some error. The floating-point literal might require a very large number or even an infinite number of bits for infinite precision representation.
uses
  System.SysUtils;
var
  X: Single;
  Y: Double;
begin
  X := 0.1;
  Y := 0.1;
  Writeln('X =', X);
  Writeln('Y =', Y);
  // Raw (biased) exponent and mantissa of the Double value
  Writeln(Format('Exponent Y: %x', [Y.Exp]));
  Writeln(Format('Mantissa Y: %x', [Y.Mantissa]));
  ReadLn;
end.
 Console output:
X = 1.00000001490116E-0001
Y = 1.00000000000000E-0001
Exponent Y: 3FB
Mantissa Y: 1999999999999A

 From the code above, you can see the error in the ninth digit of the Single precision representation. The Double precision representation contains an error too. To demonstrate it, use the raw Exponent and Mantissa of the Double precision number. The hexadecimal Exponent $3FB is 1019 in decimal. Since the internal representation of Double numbers uses a bias of 1023, Exponent = 1019 - 1023 = -4. The hexadecimal Mantissa $1999999999999A has the binary representation 11001100110011001100110011001100110011001100110011010. Therefore, the binary representation of Y is 1.1001100110011001100110011001100110011001100110011010*2^{-4}, or 0.00011001100110011001100110011001100110011001100110011010 once the exponent is applied. Notice that this number is the Double precision approximation. The exact value 0.1 is represented as the infinite recurring fraction 0.0(0011).
 It is worthwhile to note, however, that the maximum error that can be produced in this way is 0.5 ulps (units in the last place).
 If you perform floating-point operations, then each step (operation) can introduce its own error. This happens because, for some operations, the computed result cannot be stored with complete precision. For example, if you multiply two numbers, one of S1 bits and one of S2 bits (this is true for integral types and for floating-point types), then the result requires S1 + S2 bits for complete precision.
 The "amount" of error introduced by an operation depends on the processor model and operation type. Additive operations introduce a relatively low error. Multiplication introduces a relatively high error.
It is important to understand that floating-point accuracy loss (error) is propagated through calculations, and it is the role of the programmer to design an algorithm that nevertheless produces correct results.
A floating-point variable can be regarded as an integer variable with a power of two scale. If you "force" the floating-point variable to an extreme value, the scale is adjusted automatically. That is why you might have the impression that a floating-point variable cannot overflow. Overflow is indeed far less likely than with integer types (although it remains possible, see Floating-Point Exceptions), but there are other threats: a floating-point variable can accumulate a significant error and/or become denormalized.
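As an illustration of designing around propagated error, the following C++ sketch compares a naive single precision accumulation with Kahan (compensated) summation. Kahan summation is a technique introduced here for illustration, not something the RTL provides; naiveSum and kahanSum are hypothetical names. The sketch assumes the code is compiled without aggressive floating-point optimizations (such as a "fast math" option), which could remove the compensation step.

```cpp
#include <cstddef>

// Naive accumulation: every addition rounds, and the rounding errors pile up.
float naiveSum(float term, std::size_t count) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        sum += term;
    }
    return sum;
}

// Kahan (compensated) summation: tracks the low-order bits lost by each
// addition and feeds them back into the next one.
float kahanSum(float term, std::size_t count) {
    float sum = 0.0f;
    float compensation = 0.0f;  // running estimate of the lost low-order part
    for (std::size_t i = 0; i < count; ++i) {
        float corrected = term - compensation;
        float next = sum + corrected;            // low-order bits of corrected are lost here
        compensation = (next - sum) - corrected; // recover what was lost
        sum = next;
    }
    return sum;
}
```

Adding 0.1f ten million times with naiveSum drifts far away from 1000000, while kahanSum stays within a fraction of a unit of it, using the same Single precision storage.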
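The "integer with a power of two scale" view can be made concrete with the standard C++ functions std::frexp, std::ldexp and std::fpclassify. This is a minimal sketch, assuming an IEEE 754 double with a 53-bit significand and a finite input; ScaledInteger, toScaledInteger and isDenormalized are hypothetical names used only for illustration.

```cpp
#include <cmath>

// value == significand * 2^scale, exactly, for any finite double.
struct ScaledInteger {
    long long significand;  // integer holding up to 53 significant bits
    int scale;              // power-of-two exponent applied to the integer
};

ScaledInteger toScaledInteger(double value) {
    int exponent = 0;
    // frexp returns a fraction with 0.5 <= |fraction| < 1 (or 0 for zero)
    // such that value == fraction * 2^exponent.
    const double fraction = std::frexp(value, &exponent);
    ScaledInteger result;
    // Shifting the fraction left by 53 bits yields an exact integer.
    result.significand = static_cast<long long>(std::ldexp(fraction, 53));
    result.scale = exponent - 53;
    return result;
}

// True when the value is so small that it lost its normalized form.
bool isDenormalized(double value) {
    return std::fpclassify(value) == FP_SUBNORMAL;
}
```

For 0.1, the significand is $1999999999999A and the scale is -56, which is the same bit pattern as the mantissa shown earlier, expressed as a whole number instead of as 1.xxx*2^{-4}.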
Using Larger Data Types
The easiest way to resolve the problem of integer overflow or floating-point accuracy loss (finite precision effects, in general) is to use data types from the same class (integral or floating-point), but with increased capacity. That is, if a ShortInt overflows, then you can easily switch to a LongInt, FixedInt or Int64. Similarly, if a single precision float does not provide enough accuracy, then you can switch to a double precision float. But there are two things to consider:
 The data type with more storage capacity can still be insufficient.
 The data type with more storage capacity requires more memory, and possibly more CPU cycles in operations.
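Both points can be seen in a small C++ sketch, assuming IEEE 754 float and double. The hypothetical helper incrementIsVisible reports whether adding 1 to a value actually changes it in the chosen type.

```cpp
// Returns true if value + 1 is distinguishable from value in type Real.
template <typename Real>
bool incrementIsVisible(Real value) {
    // volatile forces the sum to be stored in type Real, so no extra
    // precision from a wider register can hide the rounding.
    volatile Real next = value + static_cast<Real>(1);
    return next != value;
}
```

With a float, the increment is lost at 16777216 (2^24). Switching to double makes it visible again, but the same loss reappears at 9007199254740992 (2^53): the larger type moves the limit, it does not remove it.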
Control Settings
On the 32-bit platform, the x87 FPU control word (CW) has two bits allocated for specifying the rounding mode. See Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 1: Basic Architecture > 8.1.5.3 Rounding Control Field. For 64-bit programs, the SSE control register, MXCSR, specifies the rounding mode. You can change both with the help of System.Math.SetRoundMode.
Some RTL functions that operate with floating-point variables might be affected by the FPU rounding mode. The exact nature of changes in the results of RTL routines based on the FPU control word depends on the algorithms being implemented. Rounding affects every operation that needs rounding to fit the result into the target type, e.g. floating-point multiplication will almost always involve rounding. If a function consists of lots of floating-point multiplications, it will be strongly affected by the rounding mode.
The rounding mode is sometimes used to implement interval arithmetic: roughly speaking, running the same algorithm with a rounding mode of up, then repeating it with a rounding mode of down, and then looking at the difference between the two results. This gives an idea of the potential error introduced by rounding and imprecision.
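The interval idea can be sketched in standard C++ with the <cfenv> functions std::fegetround and std::fesetround, which play a role comparable to SetRoundMode. This is only a sketch under assumptions: Interval and divideWithBounds are hypothetical names, the platform is assumed to define FE_DOWNWARD and FE_UPWARD, and the volatile variables are there to discourage the compiler from folding or moving the divisions across the mode changes.

```cpp
#include <cfenv>

struct Interval {
    double lower;  // result computed while rounding toward -infinity
    double upper;  // result computed while rounding toward +infinity
};

Interval divideWithBounds(double numerator, double denominator) {
    volatile double n = numerator;
    volatile double d = denominator;
    const int savedMode = std::fegetround();

    std::fesetround(FE_DOWNWARD);
    volatile double lower = n / d;

    std::fesetround(FE_UPWARD);
    volatile double upper = n / d;

    std::fesetround(savedMode);  // always restore the caller's mode
    return Interval{lower, upper};
}
```

For 1.0 / 3.0 the two bounds differ, which reveals that the quotient had to be rounded. For 1.0 / 4.0 the result is exact and both bounds coincide.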
Use Cases
Financial Calculations
IEEE floating-point might be inappropriate for financial calculations. This is because the precision requirements are usually very strict. You should consider using integral types (primitive integers or Currency) or BCD types.
The Data.FmtBcd unit provides support for BCD operations. The BCD format has the following important feature: each decimal digit (radix 10 digit) is coded with 4 bits of memory (a nibble).
The following code shows how to use a TBcd value as a variant, for convenience:
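Delphi:
uses
  System.Variants, Data.FmtBcd;
var
  X: Variant;
begin
  // The BCD variant is created directly from text, digit by digit
  X := VarFMTBcdCreate('0.1', 16 { Precision }, 2 { Scale });
  Writeln(VarToStr(X));
  // ...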
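C++:
#include <stdio.h>
#include <tchar.h>
#include <Variants.hpp>
#include <FMTBcd.hpp>

int _tmain(int argc, _TCHAR* argv[]) {
    // The BCD variant is created directly from text, digit by digit
    Variant x = VarFMTBcdCreate("0.1", 16 /* Precision */, 2 /* Scale */);
    printf("%ls", VarToStr(x).c_str());
    // ...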
Console output:

0.1
You can see in the code above that with the help of a BCD variable the conversion from text to numeric format is perfect.
You can also use the Currency type for financial calculations. The Currency type is in essence an integer scaled by 10000 (this value allows exact division by 10). A Currency variable stores four digits after the decimal point; anything that goes beyond this limit is rounded.
Delphi:
var
  X, Y: Currency;
begin
  X := 0.1;
  Y := 0.00001;
  Writeln(X);
  Writeln(Y);
  // ...
C++:
#include <stdio.h>
#include <tchar.h>
#include <System.hpp>

int _tmain(int argc, _TCHAR* argv[]) {
    Currency x, y;
    x = 0.1;
    y = 0.00001;  // beyond four decimal digits
    printf("%2.14E\n", (double)x);
    printf("%2.14E\n", (double)y);
    // ...
Console output:

1.00000000000000E-0001
0.00000000000000E+0000
You can see the C++ implementation of Currency in $BDS\include\rtl\syscurr.h.
The Currency interval is [-922337203685477.5808; 922337203685477.5807].
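The "integer scaled by 10000" idea can be sketched in standard C++. The Fixed4 type below is a hypothetical stand-in written for illustration only; it is not the implementation found in syscurr.h, and its use of std::llround (round half away from zero) is an assumption about how out-of-range digits are handled.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical Currency-like value: a 64-bit count of ten-thousandths.
struct Fixed4 {
    std::int64_t raw;  // stored value multiplied by 10000
};

// Converts a double, rounding away everything past four decimal digits.
Fixed4 fromDouble(double value) {
    return Fixed4{static_cast<std::int64_t>(std::llround(value * 10000.0))};
}

// Addition is plain integer addition, so it introduces no rounding error.
Fixed4 add(Fixed4 a, Fixed4 b) {
    return Fixed4{a.raw + b.raw};
}

double toDouble(Fixed4 value) {
    return static_cast<double>(value.raw) / 10000.0;
}
```

Here fromDouble(0.1) stores the integer 1000 and fromDouble(0.00001) stores 0, which mirrors the console output above. The sum of the fixed-point values for 0.1 and 0.2 equals the fixed-point value for 0.3 exactly, whereas 0.1 + 0.2 compared with 0.3 in double precision does not.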
Physical (Scientific) Calculations/Simulations
Scientific calculations require, in general, considerable computation, so increasing the floating-point precision might not be desirable. Extended precision operations are less widely supported (see Delphi for x64).
If you use SSE, then you must keep in mind that an SSE register can hold two double precision variables or four single precision variables. Thus you can perform more single precision than double precision operations in parallel.
A very interesting and useful approach is the following: use single precision floating point, but periodically reduce the accumulated error. Many applications can tolerate a small accuracy loss; it is just important to somehow cancel the deviations. An example of such an implementation is 3D spatial rotation using quaternions. See Physically Based Modeling > Rigid Body Dynamics (ONLINE SIGGRAPH 2001 COURSE NOTES).
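A minimal C++ sketch of this approach for quaternion rotation follows. It is not taken from the cited course notes: the Quaternion struct and all function names are hypothetical, and the renormalization interval is an arbitrary parameter. A rotation quaternion should have length 1; repeated single precision multiplications let the length drift, and dividing by the length every few steps cancels that deviation.

```cpp
#include <cmath>

struct Quaternion {
    float w, x, y, z;
};

// Hamilton product a * b.
Quaternion multiply(const Quaternion& a, const Quaternion& b) {
    return Quaternion{
        a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z,
        a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y,
        a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x,
        a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w};
}

float norm(const Quaternion& q) {
    return std::sqrt(q.w * q.w + q.x * q.x + q.y * q.y + q.z * q.z);
}

// Rescales q back to unit length, discarding the accumulated drift.
Quaternion normalized(const Quaternion& q) {
    const float n = norm(q);
    return Quaternion{q.w / n, q.x / n, q.y / n, q.z / n};
}

// Unit quaternion for a rotation of `radians` about the Z axis.
Quaternion rotationAboutZ(float radians) {
    return Quaternion{std::cos(radians / 2), 0.0f, 0.0f, std::sin(radians / 2)};
}

// Applies `step` repeatedly, renormalizing every `renormalizeEvery` steps
// (a value of 0 disables renormalization).
Quaternion rotateRepeatedly(Quaternion orientation, const Quaternion& step,
                            int steps, int renormalizeEvery) {
    for (int i = 1; i <= steps; ++i) {
        orientation = multiply(step, orientation);
        if (renormalizeEvery > 0 && i % renormalizeEvery == 0) {
            orientation = normalized(orientation);
        }
    }
    return orientation;
}
```

For example, rotateRepeatedly(rotationAboutZ(0.0f), rotationAboutZ(0.01f), 100000, 100) keeps the orientation at unit length to within single precision rounding, without resorting to double precision.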
Digital Signal Processing
All DSP variables are, in general, "contaminated" with error. You should consider the optimal tradeoff between data precision and computational effort.
General Conclusions
"What you see is not what you get"
Floating-point numbers written in the source code with decimal digits and floating-point numbers displayed on screen probably differ from what resides in memory. Do not assume that what you see on the console represents exactly what is in memory. Decimal to binary conversion (and back) cannot be done perfectly in every case.
Use integral, BCD, or Currency variables to avoid the IEEE floating-point representation error.
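The gap between the displayed and the stored value is easy to observe by printing the same double with different numbers of significant digits. The sketch below uses the standard std::snprintf function and assumes an IEEE 754 double; formatSignificant is a hypothetical helper name.

```cpp
#include <cstdio>
#include <string>

// Formats value with the requested number of significant decimal digits.
std::string formatSignificant(double value, int digits) {
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "%.*g", digits, value);
    return std::string(buffer);
}
```

With 6 significant digits, the double nearest to 0.1 is displayed as 0.1. With 17 significant digits, the same variable is displayed as 0.10000000000000001, which is closer to what actually resides in memory.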
Understand the Data Flow
In Delphi, intermediate results of Single precision floating-point expressions are always implicitly stored as Extended on x86.
By default, all x64 arithmetic operations and expressions involving only Single precision floating-point values retain high precision by storing intermediate results as Double precision values. As a result, these operations are slower than with explicit double precision operands (the compiled code converts Single values to Double on each operation). If the speed of execution is the primary concern, you can mark the code in question with the {$EXCESSPRECISION OFF} directive to disable the use of intermediate double-precision values. Otherwise the default directive ({$EXCESSPRECISION ON}) is recommended to improve the precision of the resulting value. The {$EXCESSPRECISION OFF} directive has an effect only for the x64 target CPU.
In C++, a floating-point literal can represent either a single precision or a double precision float, depending on the presence of the f suffix. If f is appended to a floating-point literal in C++, then the compiler chooses single precision. To understand the precision of intermediary values, consult the ISO published standards.
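The effect of the f suffix can be checked at compile time and at run time. This is a small sketch in standard C++, assuming IEEE 754 float and double; singleLiteralError is a hypothetical name.

```cpp
#include <type_traits>

// The suffix decides the type of the literal itself.
static_assert(std::is_same<decltype(0.1f), float>::value,
              "a literal with the f suffix is a float");
static_assert(std::is_same<decltype(0.1), double>::value,
              "a literal without a suffix is a double");

// Difference between the single and the double precision approximation of 0.1.
double singleLiteralError() {
    return static_cast<double>(0.1f) - 0.1;
}
```

The returned difference is roughly 1.49E-9, which matches the error in the ninth digit of the Single value printed in the first example.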
Floating-Point Operations Might Not Be Associative
Because of the error produced by every operator, the order of executing the calculations is significant.
See CERT C Secure Coding Standards, Recommendation FLP01-C.
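A short C++ sketch shows how the grouping of additions changes the result. It assumes IEEE 754 double arithmetic with the default rounding mode and a compiler that does not reorder floating-point operations; the function names are hypothetical.

```cpp
// Evaluates (a + b) + c.
double sumLeftToRight(double a, double b, double c) {
    return (a + b) + c;
}

// Evaluates a + (b + c).
double sumRightToLeft(double a, double b, double c) {
    return a + (b + c);
}
```

With a = 1E16, b = -1E16 and c = 1, the first grouping returns 1, because the two large values cancel exactly before 1 is added. The second grouping returns 0, because -1E16 + 1 cannot be represented and is rounded back to -1E16.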
Floating-Point Exceptions
Floating-point operations can lead to several exceptional conditions, such as floating-point overflow, division by zero, denormalized values, generation of NaNs, and other invalid floating-point operations. Typically, such conditions raise floating-point exceptions. The System.Math unit provides:
 The set of possible floating-point exceptions.
 The methods to retrieve the currently active floating-point exceptions on the current platform.
 The methods to retrieve the currently masked exceptions from the current floating-point hardware and to set which exceptions are masked by the current floating-point hardware.
Different floating-point hardware provides different means to control floating-point exceptions:
 On Intel 32-bit processors this is the FPU control word.
 On Intel 64-bit processors this is the SSE control word.
 Floating-point exceptions are not supported on the ARM architecture. Therefore, all floating-point exceptions are always masked on the ARM architecture.
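When exceptions are masked, the hardware still records which conditions occurred in status flags. The following sketch inspects those flags through the standard C++ <cfenv> interface rather than through the System.Math routines; it assumes a platform where floating-point exceptions are masked by default (so the division below sets a flag instead of trapping) and where the FE_* macros used are defined. flagsRaisedByDivision is a hypothetical name.

```cpp
#include <cfenv>

// Returns the subset of flagsOfInterest that the division raised.
int flagsRaisedByDivision(double numerator, double denominator,
                          int flagsOfInterest) {
    // volatile keeps the compiler from evaluating the division at compile time.
    volatile double n = numerator;
    volatile double d = denominator;

    std::feclearexcept(FE_ALL_EXCEPT);  // start from a clean status word
    volatile double quotient = n / d;
    (void)quotient;                     // only the side effect on the flags matters

    return std::fetestexcept(flagsOfInterest);
}
```

Dividing 1.0 by 0.0 reports FE_DIVBYZERO, dividing 0.0 by 0.0 reports FE_INVALID, and dividing 1.0 by 4.0 reports nothing because the result is exact.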
See Also
 Floating-Point Number Control Routines
 Floating-Point Rounding Routines
 Floating-Point Comparison Routines
 Internal Representation of Real Types
 Internal Representation Of Numerical Types
 Floating-Point Exceptions
 System.Round
 System.Math.SetRoundMode
 Floating point precision control ({$EXCESSPRECISION} Delphi compiler directive for x64)
 Extended type compatibility ({$EXTENDEDCOMPATIBILITY} Delphi)
 SetRoundMode (Delphi) code example
 How Futile are Mindless Assessments of Roundoff in Floating-Point Computation? (by William Kahan)
 What Every Computer Scientist Should Know About Floating-Point Arithmetic (by David Goldberg)
 Numerical stability (on Wikipedia.org)
 CERT C Coding Standards, Floating Point (FLP)
 Floating-point to fixed-point conversion for DSP (Google Project Hosting)
 Floating-Point Rounding (Google Project Hosting)