Round-off error

In computing, a roundoff error, also called rounding error, is the difference between the result produced by a given algorithm using exact arithmetic and the result produced by the same algorithm using finite-precision, rounded arithmetic. Rounding errors are due to inexactness in the representation of real numbers and the arithmetic operations done with them. This is a form of quantization error. When using approximation equations or algorithms, especially when using finitely many digits to represent real numbers (which in theory have infinitely many digits), one of the goals of numerical analysis is to estimate computation errors. Computation errors, also called numerical errors, include both truncation errors and roundoff errors.

When a sequence of calculations with an input involving any roundoff error is made, errors may accumulate, sometimes dominating the calculation. In ill-conditioned problems, significant error may accumulate.

In short, there are two major facets of roundoff errors involved in numerical calculations:

The ability of computers to represent both magnitude and precision of numbers is inherently limited.
Certain numerical manipulations are highly sensitive to roundoff errors. This can result from both mathematical considerations and the way in which computers perform arithmetic operations.
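As an illustration of accumulation, the following minimal Python sketch adds the binary64 approximation of 0.1 ten times. The variable names are illustrative only, and the explicit loop is used so that each addition is rounded separately. `math.fsum` from the standard library is shown for comparison because it tracks the lost low-order bits.

```python
import math

# Ten copies of the binary64 approximation of 0.1 (0.1 is not exactly representable in base 2).
values = [0.1] * 10

# Plain left-to-right accumulation: every addition rounds to the nearest binary64 value.
naive_total = 0.0
for v in values:
    naive_total += v

# math.fsum keeps track of the rounding errors of the partial sums.
compensated_total = math.fsum(values)

print(naive_total == 1.0)        # the accumulated total is not exactly 1.0
print(compensated_total == 1.0)
```

The per-step errors are tiny, but they do not cancel, so the accumulated total differs from the mathematically expected value.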

Representation error

The error introduced by attempting to represent a number using a finite string of digits is a form of roundoff error called representation error. Here are some examples of representation error in decimal representations:

{| class="wikitable" style="margin:1em auto"
! Notation
! Representation
! Approximation
! Error
|-
| 1/7 || 0.142 857 142 857... || 0.142 857 || 0.000 000 142 857...
|-
| ln 2 || 0.693 147 180 559 945 309 41... || 0.693 147 || 0.000 000 180 559 945 309 41...
|-
| log<sub>10</sub> 2 || 0.301 029 995 663 981 195 21... || 0.3010 || 0.000 029 995 663 981 195 21...
|-
| <math>\sqrt[3]{2}</math> || 1.259 921 049 894 873 164 76... || 1.25992 || 0.000 001 049 894 873 164 76...
|-
| <math>\sqrt{2}</math> || 1.414 213 562 373 095 048 80... || 1.41421 || 0.000 003 562 373 095 048 80...
|-
| e || 2.718 281 828 459 045 235 36... || 2.718 281 828 459 045 || 0.000 000 000 000 000 235 36...
|-
| π || 3.141 592 653 589 793 238 46... || 3.141 592 653 589 793 || 0.000 000 000 000 000 238 46...
|}

Increasing the number of digits allowed in a representation reduces the magnitude of possible roundoff errors, but any representation limited to finitely many digits will still cause some degree of roundoff error for uncountably many real numbers. Additional digits used for intermediary steps of a calculation are known as guard digits.
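The effect of adding digits can be checked with exact rational arithmetic. The sketch below is only an illustration: the helper name `representation_error` is hypothetical, and it rounds to the nearest decimal value with the given number of fractional digits, which is one possible convention.

```python
from fractions import Fraction

def representation_error(value, digits):
    """Exact error of rounding `value` to `digits` decimal places (round to nearest)."""
    scale = 10 ** digits
    approximation = Fraction(round(value * scale), scale)
    return abs(value - approximation)

one_seventh = Fraction(1, 7)
for digits in (3, 6, 9):
    # The error shrinks as more digits are kept, but it never reaches zero for 1/7.
    print(digits, representation_error(one_seventh, digits))
```

Each additional digit reduces the error by roughly a factor of ten, yet 1/7 has no finite decimal expansion, so some error always remains.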

Rounding multiple times can cause error to accumulate. For example, if 9.945309 is rounded to two decimal places (9.95), then rounded again to one decimal place (10.0), the total error is 0.054691. Rounding 9.945309 to one decimal place (9.9) in a single step introduces less error (0.045309). This can occur, for example, when software performs arithmetic in x86 80-bit floating-point and then rounds the result to IEEE 754 binary64 floating-point.
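The double-rounding example above can be reproduced with Python's `decimal` module. This is a minimal sketch that assumes round-half-up at each step; the variable names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

x = Decimal("9.945309")

# Round to two decimal places, then round that result to one decimal place.
two_places = x.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
two_step = two_places.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

# Round the original value directly to one decimal place.
one_step = x.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

print(two_step, abs(two_step - x))
print(one_step, abs(one_step - x))
```

The two-step path lands on a different value than the single rounding, and its error is larger.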

Floating-point number system

Compared with the fixed-point number system, the floating-point number system is more efficient in representing real numbers, so it is widely used in modern computers. While the real numbers <math>\mathbb{R}</math> are infinite and continuous, a floating-point number system <math>F</math> is finite and discrete. Thus, representation error, which leads to roundoff error, occurs under the floating-point number system.

Notation of floating-point number system

A floating-point number system <math>F</math> is characterized by <math>4</math> integers:

<math> \beta </math>: base or radix
<math>p</math>: precision
<math> [L, U] </math>: exponent range, where <math>L</math> is the lower bound and <math>U</math> is the upper bound

Any <math>x \in F</math> has the following form:

<math display="block"> x = \pm (\underbrace{d_{0}.d_{1}d_{2}\ldots d_{p-1}}_\text{significand})_{\beta} \times \beta ^{\overbrace{E}^\text{exponent}} = \pm \left(d_{0}\times \beta ^{E}+d_{1}\times \beta ^{E-1}+\ldots+ d_{p-1}\times \beta ^{E-(p-1)}\right)</math>

where <math>d_{i}</math> is an integer such that <math>0 \leq d_{i} \leq \beta-1</math> for <math>i = 0, 1, \ldots, p-1</math>, and <math>E</math> is an integer such that <math>L \leq E \leq U</math>.
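The positional form above can be evaluated directly. The following sketch uses a hypothetical helper, `float_value`, that takes a sign, the digit list <math>d_0, \ldots, d_{p-1}</math>, the exponent <math>E</math>, and the base <math>\beta</math>, and returns the exact value as a rational number. It does not check the exponent range <math>[L, U]</math>.

```python
from fractions import Fraction

def float_value(sign, digits, exponent, base):
    """Exact value of sign * (d0.d1...d_{p-1})_base * base**exponent."""
    if sign not in (1, -1):
        raise ValueError("sign must be +1 or -1")
    if any(not 0 <= d <= base - 1 for d in digits):
        raise ValueError("each digit must satisfy 0 <= d <= base - 1")
    # Digit d_i contributes d_i * base**(exponent - i).
    total = sum(Fraction(d) * Fraction(base) ** (exponent - i)
                for i, d in enumerate(digits))
    return sign * total

print(float_value(1, [6, 2], 1, 10))      # 6.2 x 10^1 in base 10
print(float_value(-1, [1, 0, 1], -1, 2))  # -(1.01)_2 x 2^-1
```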

Normalized floating-point number system

A floating-point number system is normalized if the leading digit <math>d_{0}</math> is always nonzero unless the number is zero.

Note that the addition of two floating-point numbers can produce roundoff error when their sum is an order of magnitude greater than the larger of the two, because the sum then needs one more leading digit than the precision allows.

For example, consider a normalized floating-point number system with base <math>10</math> and precision <math>2</math>. Then <math>fl(62)=6.2 \times 10^{1}</math> and <math>fl(41) = 4.1 \times 10^{1}</math>. Note that <math>62+41=103</math> but <math>fl(103)=1.0 \times 10^{2}</math>. There is a roundoff error of <math>103-fl(103)=3</math>.

This kind of error can occur alongside an absorption error in a single operation.
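The base-10, precision-2 addition example can be reproduced with Python's `decimal` module, which lets the working precision be set explicitly. This is a sketch under the assumption that a `decimal.Context` with `prec=2` is an adequate stand-in for the toy system; exponent limits are left at their defaults.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# A toy decimal floating-point system with a two-digit significand.
toy = Context(prec=2, rounding=ROUND_HALF_EVEN)

a = toy.create_decimal(62)   # 6.2 x 10^1, exactly representable
b = toy.create_decimal(41)   # 4.1 x 10^1, exactly representable

rounded_sum = toy.add(a, b)            # result is rounded to two significant digits
exact_sum = Decimal(62) + Decimal(41)  # default context has ample precision for this sum
error = exact_sum - rounded_sum

print(rounded_sum, exact_sum, error)
```

Both operands are represented exactly, so the whole error comes from rounding the three-digit sum back to two digits.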

Multiplication

In general, the product of two p-digit significands contains up to 2p digits, so the result might not fit in the significand and must be rounded. The way an expression is arranged also determines how much roundoff error reaches the result. For example, the computation of <math>f(x) = \sqrt{1 + x} - 1</math> using the "obvious" method is unstable near <math>x = 0</math> due to the large error introduced in subtracting two similar quantities, whereas the equivalent expression <math>\textstyle{f(x) = \frac{x}{\sqrt{1+x} + 1}}</math> is stable.
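The difference between the two forms of <math>f(x)</math> is easy to observe in binary64 arithmetic. The sketch below is illustrative; the function names `f_naive` and `f_stable` are not from any library.

```python
import math

def f_naive(x):
    """sqrt(1 + x) - 1, computed directly; subtracts two nearly equal numbers near x = 0."""
    return math.sqrt(1.0 + x) - 1.0

def f_stable(x):
    """Algebraically equivalent form x / (sqrt(1 + x) + 1); no subtraction of similar quantities."""
    return x / (math.sqrt(1.0 + x) + 1.0)

x = 1e-20
# 1.0 + x rounds to exactly 1.0 in binary64, so the naive form loses all information about x.
print(f_naive(x))
# The stable form stays close to the true value, which is approximately x / 2.
print(f_stable(x))
```

For such a small <math>x</math>, the naive form returns zero because <math>1 + x</math> is rounded to 1 before the square root is taken, while the rearranged form retains the leading behaviour <math>x/2</math>.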