Encyclopedia > Floating point

Floating-point is a numeral-interpretation system in which a string of digits (or bits) represents a real number. A system of arithmetic is defined that allows these representations to be manipulated with results that are similar to the arithmetic operations over real numbers. The representation uses an explicit designation of where the radix point (decimal point, or, more commonly in computers, binary point) is to be placed relative to that string. The designated location of the radix point is permitted to be far to the left or right of the digit string, allowing for the representation of very small or very large numbers. Floating-point could be thought of as a computer realization of scientific notation.

The large dynamic range of floating-point numbers frees the programmer from explicitly encoding number representations for their particular application. For this reason computation with floating-point numbers plays a very important role in an enormous variety of applications in science, engineering, and industry, particularly in meteorology, simulation, and mechanical design. The ability to perform floating point operations is an important measure of performance for computers intended for such applications. It is measured in "MegaFLOPS" (million FLoating-point Operations Per Second), or Gigaflops, etc. World-class supercomputer installations are generally rated in Teraflops.

There are several mechanisms by which strings of digits can represent numbers:

• The most common way of interpreting the value of a string of digits, so trivial that people rarely think about it, is as an integer—the radix point is implicitly at the right end of the string.
• In common mathematical notation, the digit string can be of any length, and the location of the radix point is indicated by placing an explicit "point" character (dot or comma) there.
• In "Fixed-point" systems, some specific convention is made about where the radix point is located in the string. For example, the convention could be made that the string consists of 8 digits, with the point in the middle, so that "00012345" has a value of 1.2345.
• In scientific notation, a radix character is used, but since most scientific calculations involve very large or small numbers, standard form is used. The radix point comes immediately after the first digit, and further mathematical information is supplied to describe the actual scale of the number. In other words, numbers are multiplied by powers of 10 until they are between 1 and 10. For example, the revolution period of Jupiter's moon Io is 152853.5047 s. This could be represented in standard form as 1.528535047 × 10^5 s.
• Floating-point notation generally refers to a system similar to scientific notation, but without use of a radix character. The location of the radix point is specified solely by separate "exponent" information. It can be thought of as being equivalent to scientific notation with the requirement that the radix point be effectively in a "standard" place. That place is often chosen as just after the leftmost digit. This article will follow that convention. Under that convention, the orbital period of Io is 1528535047 with an exponent of 5. That is, the "standard" place for the radix point is just after the first digit: 1.528535047, and the exponent designation indicates the radix point is actually 5 digits to the right of that, that is, the number is 10^5 times bigger than that.

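In a decimal floating-point system with three digits of precision, the multiplication that would ordinarily be written as

.12 × .12 = .0144

would be expressed as

(1.20 × 10^-1) × (1.20 × 10^-1) = (1.44 × 10^-2)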

In a fixed-point system with the decimal point at the left, it would be

.120 × .120 = .014

A digit of the result was lost because of the inability of the digits and decimal point to 'float' relative to each other within the digit string.
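The comparison above can be reproduced with Python's `decimal` module. This is only a sketch: the three-digit floating-point system is modelled with a `Context` of precision 3, and the fixed-point system is an assumed ".ddd" format modelled by quantizing to three digits after the point.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

a = Decimal("0.120")
b = Decimal("0.120")

# Floating-point: keep 3 significant digits, wherever the point falls.
float_ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)
floating = float_ctx.multiply(a, b)

# Fixed-point (assumed ".ddd" format): keep exactly 3 digits after the point.
exact = a * b  # exact here, since the default context carries 28 digits
fixed = exact.quantize(Decimal("0.001"), rounding=ROUND_HALF_EVEN)

print("floating:", floating)
print("fixed:   ", fixed)
```

The floating-point product keeps all three significant digits (0.0144), while the fixed-point product keeps only two of them (0.014).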

### Range of Floating Point Numbers

The range of floating-point numbers depends on the number of bits used to represent the significand (mantissa) and the exponent. In the common double-precision (64-bit) format, which has a 52-bit stored significand, an 11-bit exponent and 1 sign bit, normalized numbers have magnitudes ranging from approximately 10^-308 to 10^308.
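These limits can be inspected directly. The sketch below assumes a Python implementation whose `float` is an IEEE 754 double, which is the case on essentially all current platforms but is not guaranteed by the language.

```python
import sys

info = sys.float_info

largest = info.max            # largest finite double
smallest_normal = info.min    # smallest positive normalized double
precision_bits = info.mant_dig  # significand bits, including the implicit one

print("largest finite value:     ", largest)
print("smallest normalized value:", smallest_normal)
print("significand precision:    ", precision_bits, "bits")
```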

### Nomenclature

In floating-point representation, the string of digits is called the significand, or sometimes the mantissa. The representation of the significand is defined by a choice of base or radix, and the number of digits stored in that base. Throughout this article the base will be denoted by b, and the number of digits (or the precision) by p. Historically, different bases have been used for floating-point, but almost all modern computer hardware uses base 2, or binary. Some examples in this article will be in base 10, the familiar decimal notation.

The representation also includes a number called the exponent. This records the position, or offset, of the window of digits into the number. This can also be referred to as the characteristic, or scale. The window always stores the most significant digits in the number, the first non-zero digits in decimal or bits in binary. The exponent is the power of the base by which the significand is multiplied.

Floating-point notation per se is generally used only in computers, because it holds no advantage over scientific notation for human reading. There are many ways to represent floating-point numbers in computers; floating-point is a generic term for number representations in computing that are used to implement the above system of arithmetic. A number representation (called a numeral system in mathematics) specifies some way of storing a number as a string of bits. The arithmetic is defined as a set of actions on bit-strings that simulate normal arithmetic operations. When the numbers being represented are the rationals, one immediate issue is that there are infinitely many rational numbers but only a finite number of bits inside a real computer. The numbers that can be represented must therefore approximate the entire set. When we restrict ourselves to finite binary expansions (because these are easiest to operate upon in a digital computer), the subset of the rationals that can be represented exactly is restricted to those whose denominators are powers of 2. Any rational whose denominator, in lowest terms, has a prime factor other than 2 has an infinite binary expansion. If we consider this expansion as an infinite string of bits, there are several methods for approximating this string in memory:

• When we store a fixed-size window at a constant position in the bit-string, the representation is called fixed-point. The hardware to manipulate these representations is less costly than floating-point hardware and is commonly used to perform integer operations. In this case the radix point is always beneath the window.
• When we store a fixed-size window that is allowed to slide up and down the bit-string, with the radix point location not necessarily under the window, the representation is called floating-point.

However, the most common ways of representing floating-point numbers in computers are the formats standardized as the IEEE 754 standard, commonly called "IEEE floating-point". These formats can be manipulated efficiently by nearly all modern floating-point computer hardware, and this article will focus on them. The standard actually provides for many closely-related formats, differing in only a few details. Two of these formats are ubiquitous in computer hardware and languages:

• Single precision, called "float" in the C language family, and "real" or "real*4" in Fortran. It occupies 32 bits (4 bytes) and has a significand precision of 24 bits. This gives it an accuracy of about 7 decimal digits.
• Double precision, called "double" in the C language family, and "double precision" or "real*8" in Fortran. It occupies 64 bits (8 bytes) and has a significand precision of 53 bits. This gives it an accuracy of about 16 decimal digits.
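The difference in accuracy between the two formats can be seen by rounding a double to single precision and back. In this sketch the helper `to_single` is a hypothetical name introduced for illustration; it relies on the standard `struct` module's `"f"` format, which packs a value as a 32-bit IEEE float.

```python
import struct

def to_single(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

third = 1.0 / 3.0                 # nearest double to 1/3
third_single = to_single(third)   # nearest single to that double
difference = abs(third_single - third)

print(f"double: {third:.20f}")
print(f"single: {third_single:.20f}")
print(f"difference: {difference:.3e}")
```

The two printed values agree in roughly their first seven significant digits and diverge after that, which matches the precision figures quoted above.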


### Alternative computer representations for non-integral numbers

While the standard IEEE formats are by far the most common because they are efficiently handled in most computer processors, there are a few alternate representations that are sometimes used:

• Fixed-point representation uses integer hardware operations with a specific convention about the location of the binary point, for example, 6 bits from the left. This has to be done in the context of a program that makes whatever convention is required. It is usually used in special-purpose applications on embedded processors that can only do integer arithmetic.
• Where extreme precision is desired, floating-point arithmetic can be emulated in software with extremely large significand fields. The significands might grow and shrink as the program runs. This is called Arbitrary-precision arithmetic or "bignum" arithmetic.
• Some numbers (e.g. 1/3) can't be represented exactly in binary floating-point no matter what the precision. Software packages that perform rational arithmetic represent numbers as fractions with integral numerator and denominator, and can therefore represent any rational number exactly. Such packages generally need to use bignum arithmetic for the individual integers.
• Some software packages (e.g. Maxima and Maple) can perform symbolic arithmetic, handling irrational numbers like π or √3 in a completely "formal" way, without dealing with the bits of the significand. Such programs can evaluate expressions like "sin(3π)" exactly, because they "know" the underlying mathematics.
• A floating-point number can also be represented by storing its logarithm in fixed-point. Like IEEE floating point, this solution has finer absolute precision for smaller numbers, as well as a wide range. It is rarely used due to the high cost of addition and subtraction.
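As an illustration of the rational-arithmetic alternative, Python's standard `fractions` module stores an arbitrary-size integer numerator and denominator. The sketch below is not tied to any particular package mentioned above; it only shows that 1/3 is exact as a fraction, and that the double nearest to 0.1 is a different rational number from 1/10.

```python
from fractions import Fraction

third = Fraction(1, 3)
sum_of_thirds = third + third + third   # exactly 1, no rounding

tenth_exact = Fraction(1, 10)
tenth_as_float = Fraction(0.1)          # exact value of the double nearest 0.1

print(sum_of_thirds)
print(tenth_as_float)
print(tenth_as_float == tenth_exact)
```

The denominator of `tenth_as_float` is a power of 2, as expected for any binary floating-point value.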


### Normalization

The requirement that the leftmost digit of the significand be nonzero is called normalization. With this requirement, the radix point no longer needs to be expressed explicitly; the exponent provides that information. In decimal floating-point notation with a precision of 10, the revolution period of Io is simply an exponent e=5 and a significand s=1528535047. The implied decimal point is after the first digit of s (after the '1' and before the first '5').

When a (nonzero) floating-point number is normalized, its leftmost digit is nonzero. The value of the significand obeys 1 ≤ s < b. (Zero needs special treatment; this will be described below.)

### Value

The mathematical value of a floating-point number, using the convention given above, is s.ssssssss...sss × b^e.

Equivalently, it is:

$\frac{s}{b^{p-1}} \times b^e$ (where s means the integer value of the entire significand)

In binary radix, the significand is a string of bits (1's and 0's) of length p, of which the leftmost bit is 1. The real number π, represented in binary as an infinite series of bits is

11.0010010000111111011010101000100010000101101000110000100011010011... but is
11.0010010000111111011011 when approximated by rounding to a precision of 24 bits.

In binary floating-point, this is e=1 ; s=110010010000111111011011. It has a decimal value of

3.1415927410125732421875, whereas the true value of π is
3.1415926535897932384626433832795...
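The value formula can be checked numerically for this example. The sketch below takes the 24-bit significand and exponent quoted above, evaluates s / b^(p-1) × b^e, and prints the exact decimal expansion of the result; the variable names are chosen for illustration only.

```python
from decimal import Decimal

b = 2                                    # base
p = 24                                   # precision in bits
e = 1                                    # exponent
s = int("110010010000111111011011", 2)   # significand read as an integer

# Both operations are exact in double precision because s has only 24 bits.
value = s / b ** (p - 1) * b ** e

exact_decimal = str(Decimal(value))      # Decimal(float) converts without rounding
print(exact_decimal)
```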

The problem is that floating-point numbers with a limited number of digits can represent only a subset of the real numbers, so any real number outside of that subset (e.g. 1/3, or an irrational number such as π), cannot be represented exactly. Even numbers with extremely short decimal representations can suffer from this problem. The decimal number 0.1 is not representable in binary floating-point of any finite precision. The exact binary representation would have a "1100" sequence continuing endlessly:

e=-4; s=1100110011001100110011001100110011..., but when rounded to 24 bits it becomes
e=-4; s=110011001100110011001101 which is actually 0.100000001490116119384765625 in decimal.
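The stored value of 0.1 can be displayed exactly, since `decimal.Decimal` converts a binary float without rounding. This sketch assumes Python's `float` is an IEEE double and uses `struct` to obtain the single-precision rounding.

```python
from decimal import Decimal
import struct

# 0.1 rounded to single precision (24-bit significand), widened back to a double.
single_tenth = struct.unpack("<f", struct.pack("<f", 0.1))[0]
single_exact = str(Decimal(single_tenth))

# 0.1 as stored in double precision (53-bit significand).
double_exact = Decimal(0.1)

print("single:", single_exact)
print("double:", double_exact)
print("0.1 + 0.2 == 0.3 ?", 0.1 + 0.2 == 0.3)
```

Neither format stores exactly one tenth, which is why the final comparison prints `False`.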

### Conversion and rounding

When a number is represented in some other format (such as a string of digits), it must be converted before it can be used in floating-point format. If the number can be represented exactly in the floating-point format, the conversion is exact. If there is no exact representation, the conversion requires a choice of which floating-point number to use. Several different rounding schemes have been used for this decision. Originally truncation was the typical approach. Since the introduction of IEEE 754, the default method is round-to-nearest-even, sometimes called banker's rounding. This method chooses the nearest representable value or, in the case of a tie, the one whose significand is even. The result of rounding π to 24-bit binary floating-point differs from the true value by about 0.03 parts per million, and matches the decimal representation of π in the first 7 digits. The difference is the discretization error and is limited by the machine epsilon.
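The tie-breaking rule can be observed when converting integers just above 2^53 to double precision, where only every second integer is representable. This is a small sketch relying on CPython's integer-to-float conversion, which rounds to nearest with ties to even.

```python
big = 2 ** 53   # above this, consecutive doubles are 2 apart

# big + 1 lies exactly halfway between big and big + 2.
# big has an even significand, so the tie goes down.
tie_down = float(big + 1)

# big + 3 lies exactly halfway between big + 2 and big + 4.
# big + 4 has the even significand, so the tie goes up.
tie_up = float(big + 3)

print(int(tie_down) - big)   # offset from 2**53 after rounding
print(int(tie_up) - big)
```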

A rounding mode is also used when the result of an operation on floating-point numbers has more significant digits than there are places in the significand. In this case the rounding mode is applied to the intermediate value as if it were being converted into a floating-point number. Other common rounding modes always round the number in a certain direction (e.g. towards zero). These alternative modes are useful when the amount of error being introduced must be known. Applications that require a known error include multi-precision floating-point and interval arithmetic.
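Directed rounding can be sketched with Python's `decimal` module, which lets each `Context` carry its own rounding mode. The example uses seven decimal digits as an illustrative precision and brackets 1/3 between a rounded-down and a rounded-up result, which is the basic step of interval arithmetic.

```python
from decimal import (Context, Decimal,
                     ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_EVEN)

one, three = Decimal(1), Decimal(3)

# Same operation, same precision, three different rounding modes.
lower = Context(prec=7, rounding=ROUND_FLOOR).divide(one, three)
upper = Context(prec=7, rounding=ROUND_CEILING).divide(one, three)
nearest = Context(prec=7, rounding=ROUND_HALF_EVEN).divide(one, three)

print("lower bound:", lower)
print("upper bound:", upper)
print("nearest:    ", nearest)
```

The true quotient is guaranteed to lie between `lower` and `upper`, so the width of that interval bounds the rounding error.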

### Mantissa

The word mantissa is often used as a synonym for significand. Purists may not consider this usage to be correct, since the mantissa is traditionally defined as the fractional part of a logarithm, while the characteristic is the integer part. This terminology comes from the way logarithm tables were used before computers became commonplace. Log tables were actually tables of mantissas. Therefore, a mantissa is the logarithm of the significand.

## History

The floating point system of numbers was used by the Kerala School of mathematics in 14th century India to investigate and reason about the convergence of series.

Once electronic digital computers became a reality, the need to process data in this way was quickly recognized. The first commercial computer to be able to do this in hardware appears to be the Z4 in 1950, followed by the IBM 704 in 1954. For some time after that, floating-point hardware was an optional feature, and computers that had it were said to be "scientific computers", or to have "scientific computing" capability. All modern general-purpose computers have this ability. The PDP-11/44 was an extension of the 11/34 that included the cache memory and floating point units as a standard feature.

The UNIVAC 1100/2200 series, introduced in 1962, supported two floating point formats. Single precision used 36 bits, organised into a 1-bit sign, 8-bit exponent, and a 27-bit mantissa. Double precision used 72 bits organised as a 1-bit sign, 11-bit exponent, and a 60-bit mantissa. The IBM 7094, introduced the same year, also supported single and double precision, with slightly different formats.

Prior to the IEEE-754 standard, computers used many different forms of floating point. These differed in the word-sizes, the format of the representations, and the rounding behaviour of operations. These differing systems implemented different parts of the arithmetic in hardware and software, with varying accuracy.

The IEEE-754 standard was created in the early 1980s, after word sizes of 32 bits (or 16 or 64) had been generally settled upon. Among its innovations are these:

• A precisely specified encoding of the bits, so that all compliant computers would interpret bit patterns the same way. This made it possible to transfer floating-point numbers from one computer to another.
• A precisely specified behavior of the arithmetic operations. This meant that a given program, with given data, would always produce the same result on any compliant computer. This helped reduce the almost mystical reputation that floating-point computation had for seemingly nondeterministic behavior.
• The ability of exceptional conditions (overflow, divide by zero, etc.) to propagate through a computation in a benign manner and be handled by the software in a controlled way.

## Floating point arithmetic operations

The usual rule for performing floating point arithmetic is that the exact mathematical value is calculated,[1] and the result is then rounded to the nearest representable value in the specified precision. This is in fact the behavior mandated for IEEE-compliant computer hardware, under normal rounding behavior and in the absence of exceptional conditions.

For ease of presentation and understanding, decimal radix with 7 digit precision will be used in the examples. The fundamental principles are the same in any radix or precision.

A simple method to add floating point numbers is to first represent them with the same exponent. In the example below, the second number is shifted right by three digits. We proceed with the usual addition method:

```
  e=5;  s=1.234567     (123456.7)
+ e=2;  s=1.017654     (101.7654)

  e=5;  s=1.234567
+ e=5;  s=0.001017654  (after shifting)
--------------------
  e=5;  s=1.235584654  (true sum: 123558.4654)
```

This is the true result, the exact sum of the operands. It will be rounded to seven digits and then normalized if necessary. The final result is

` e=5; s=1.235585 (final sum: 123558.5) `

Note that the low 3 digits of the second operand (654) are essentially lost. This is round-off error. In extreme cases, the sum of two non-zero numbers may be equal to one of them:

```
  e=5;  s=1.234567
+ e=-3; s=9.876543

  e=5;  s=1.234567
+ e=5;  s=0.00000009876543 (after shifting)
----------------------
  e=5;  s=1.23456709876543 (true sum)
  e=5;  s=1.234567         (after rounding/normalization)
```
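Both additions can be replayed with Python's `decimal` module, using a `Context` with seven digits of precision as a stand-in for the decimal format used in these examples. This is a sketch of the arithmetic only, not of how hardware performs it.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# Seven significant decimal digits, round to nearest (ties to even).
ctx = Context(prec=7, rounding=ROUND_HALF_EVEN)

big = Decimal("123456.7")            # e=5;  s=1.234567

# First example: the low digits of the smaller operand are rounded away.
total = ctx.add(big, Decimal("101.7654"))

# Second example: the smaller operand is absorbed completely.
absorbed = ctx.add(big, Decimal("0.009876543"))

print(total)
print(absorbed)
print(absorbed == big)
```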

Another problem of loss of significance occurs when two close numbers are subtracted. e=5; s=1.234571 and e=5; s=1.234567 are representations of the rationals 123457.1467 and 123456.659.

```
  e=5;  s=1.234571
- e=5;  s=1.234567
----------------
  e=5;  s=0.000004
  e=-1; s=4.000000 (after rounding/normalization)
```

The best representation of this difference is e=-1; s=4.877000, so the computed result e=-1; s=4.000000 is in error by about 18% of the exact difference. In extreme cases, the final result may be zero even though an exact calculation may be several million. This cancellation illustrates the danger in assuming that all of the digits of a computed result are meaningful.
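The cancellation can be reproduced by first rounding both operands to seven digits and then subtracting. The sketch below again uses a seven-digit `decimal` context as an assumed model of the format, and compares the result with the difference of the unrounded values.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=7, rounding=ROUND_HALF_EVEN)

x_exact = Decimal("123457.1467")
y_exact = Decimal("123456.659")

# Storing each operand in the 7-digit format rounds it.
x = ctx.plus(x_exact)    # 123457.1
y = ctx.plus(y_exact)    # 123456.7

computed = ctx.subtract(x, y)
exact = x_exact - y_exact            # default context (28 digits) is exact here
relative_error = abs(computed - exact) / exact

print("computed:", computed)
print("exact:   ", exact)
print("relative error:", relative_error)
```

The subtraction itself is exact; the error comes entirely from the rounding of the operands, which the subtraction then exposes.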

Dealing with the consequences of these errors is a topic in numerical analysis.

### Multiplication

To multiply, the significands are multiplied while the exponents are added, and the result is rounded and normalized.

```
  e=3;  s=4.734612
× e=5;  s=5.417242
-----------------------
  e=8;  s=25.648538980104 (true product)
  e=8;  s=25.64854        (after rounding)
  e=9;  s=2.564854        (after normalization)
```
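The same product can be checked with a seven-digit `decimal` context, used here as an assumed model of the decimal format in the example. The `adjusted()` method returns the exponent of the leading digit, which corresponds to e after normalization.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=7, rounding=ROUND_HALF_EVEN)

x = Decimal("4734.612")      # e=3; s=4.734612
y = Decimal("541724.2")      # e=5; s=5.417242

true_product = x * y                 # exact under the default 28-digit context
product = ctx.multiply(x, y)         # rounded to 7 significant digits

print("true product:", true_product)
print("rounded:     ", product)
print("exponent e:  ", product.adjusted())
```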

Division is done similarly, but that is more complicated.

There are no cancellation or absorption problems with multiplication or division, though small errors may accumulate as operations are performed repeatedly. In practice, the way these operations are carried out in digital logic can be quite complex. (see Booth's multiplication algorithm and digital division)[2]

## Computer representation

Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and the significand (mantissa), from left to right. For the common (IEEE standard) formats they are apportioned as follows:

```
          sign   exponent   (exponent bias)   significand   total
single      1        8          (127)             23          32
double      1       11          (1023)            52          64
```

While the exponent can be positive or negative, it is stored as an unsigned number that has a fixed "bias" added to it. A value of zero, or all 1's, in this field is reserved for special treatment. Therefore the legal exponent range for normalized numbers is [-126, 127] for single precision or [-1022, 1023] for double.

When a number is normalized, its leftmost significand bit is known to be 1. In the IEEE single and double precision formats that bit is not actually stored in the computer datum. It is called the "hidden" or "implicit" bit. Because of this, single precision format actually has 24 bits of significand precision, while double precision format has 53.

For example, it was shown above that π, rounded to 24 bits of precision, has:

• sign = 0 ; e=1 ; s=110010010000111111011011 (including the hidden bit)

The sum of the exponent bias (127) and the exponent (1) is 128, so this is represented in single precision format as

• 0 10000000 10010010000111111011011 (excluding the hidden bit) = 40490FDB in hexadecimal
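The packing can be verified by extracting the three fields from the single-precision bit pattern of π. This sketch uses the standard `struct` module to obtain the 32-bit pattern; the field names are chosen for illustration.

```python
import math
import struct

# Round pi to single precision and reinterpret its 4 bytes as an unsigned int.
bits = struct.unpack(">I", struct.pack(">f", math.pi))[0]

sign = bits >> 31                       # 1 bit
exponent_field = (bits >> 23) & 0xFF    # 8 bits, biased by 127
fraction_field = bits & 0x7FFFFF        # 23 stored significand bits

exponent = exponent_field - 127
significand = fraction_field | (1 << 23)   # restore the hidden leading 1

print(f"hex:         {bits:08X}")
print(f"sign:        {sign}")
print(f"exponent:    {exponent} (stored as {exponent_field:08b})")
print(f"significand: {significand:024b}")
```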


## Dealing with exceptional cases

Floating-point computation in a computer can run into two kinds of problems:

• An operation can be mathematically illegal, such as division by zero, or calculating the square root of -1 or the inverse sine of 2.
• An operation can be legal in principle, but the result can be impossible to represent in the specified format, because the exponent is too large or too small to encode in the exponent field. Such an event is called an overflow (exponent too large) or underflow (exponent too small).

Prior to the IEEE standard, such things usually caused the program to terminate, or caused some kind of trap that the programmer might be able to catch. How this worked was system-dependent, meaning that floating-point programs were not portable. Modern IEEE-compliant systems have a uniform way of handling these situations. An important part of the mechanism involves error values that result from a failing computation, and that can propagate silently through subsequent computation until they are detected at a point of the programmer's choosing.

The two error values are "infinity" (often denoted "INF"), and "NaN" ("not a number"), which covers all other errors.

"Infinity" does not necessarily mean that the result is actually infinite. It simply means "too large to represent".

Both of these are encoded with the exponent field set to all 1's. (Recall that exponent fields of all 0's or all 1's are reserved for special meanings.) The significand field is set to something that can distinguish them—typically zero for INF and nonzero for NaN. The sign bit is meaningful for INF, that is, floating-point hardware distinguishes between +∞ and −∞.

When a nonzero number is divided by zero (the divisor must be exactly zero), a "zerodivide" event occurs, and the result is set to infinity of the appropriate sign. In other cases in which the result's exponent is too large to represent, an "overflow" event occurs, also producing infinity of the appropriate sign.

Division of an extremely large number by an extremely small number can overflow and produce infinity. This is different from a zerodivide, though both produce a result of infinity, and the distinction is usually unimportant in practice.

Floating-point hardware is generally designed to handle operands of infinity in a reasonable way, such as

• (+INF) + (+7) = (+INF)
• (+INF) × (-2) = (-INF)
• But: (+INF) × 0 = NaN—there is no meaningful thing to do
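These rules can be tried out in any language that exposes IEEE arithmetic. The sketch below uses Python floats, which are IEEE doubles on common platforms. One caveat that is a property of Python rather than of IEEE 754: Python raises `ZeroDivisionError` for `1.0 / 0.0` instead of returning infinity, so the example starts from `math.inf` and from an overflowing multiplication.

```python
import math

inf = math.inf

print(inf + 7)               # inf
print(inf * -2)              # -inf
print(math.isnan(inf * 0))   # True: no meaningful result exists
print(1e308 * 10)            # the exponent is too large, so the result overflows to inf
```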

When the result of an operation has an exponent too small to represent properly, an "underflow" event occurs. The hardware responds to this by changing to a format in which the significand is not normalized, and there is no "hidden" bit; that is, all significand bits are represented. The exponent field is set to the reserved value of zero. The significand is set to whatever it has to be in order to be consistent with the exponent. Such a number is said to be "denormalized" (a "denorm" for short), or, in more modern terminology, "subnormal". Denorms are perfectly legal operands to arithmetic operations.

If no significant bits are able to appear in the significand field, the number is zero. Note that, in this case, the exponent field and significand field are all zeros—floating-point zero is represented by all zeros.
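As an illustration of gradual underflow, the sketch below uses double precision, where the smallest normal number is 2^−1022 and the smallest subnormal is 2^−1074. The helper `exponent_field` is a hypothetical name introduced here for illustration.

```python
import struct
import sys

smallest_normal = sys.float_info.min    # 2**-1022
subnormal = smallest_normal / 4         # 2**-1024: representable only as a subnormal
smallest_subnormal = 5e-324             # 2**-1074


def exponent_field(x):
    """Return the 11-bit exponent field of a double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return (bits >> 52) & 0x7FF


print(exponent_field(smallest_normal))    # 1: the smallest non-reserved exponent field
print(exponent_field(subnormal))          # 0: the reserved value marks a subnormal
print(smallest_subnormal / 2)             # no significant bits survive, so the result is zero
print(struct.pack(">d", 0.0) == bytes(8)) # floating-point zero is all zero bits
```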

Other errors, such as division of zero by zero, or taking the square root of -1, cause an "operand error" event, and produce a NaN result. NaNs propagate aggressively through arithmetic operations—any NaN operand to any operation causes an operand error and produces a NaN result.
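NaN propagation is easy to observe. In this sketch the NaN is produced as ∞ − ∞, because Python itself (not IEEE 754) raises exceptions for `0.0 / 0.0` and `math.sqrt(-1)` rather than returning NaN.

```python
import math

nan = math.inf - math.inf    # an invalid operation, so the result is NaN

print(math.isnan(nan))          # True
print(math.isnan(nan + 1.0))    # True: the NaN operand propagates
print(math.isnan(nan * 0.0))    # True
print(nan == nan)               # False: a NaN compares unequal even to itself
```

Because a NaN never compares equal to anything, it should be detected with a dedicated predicate such as `math.isnan` rather than with `==`.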

There are five special "events" that may occur, though some of them are quite benign:

• An overflow occurs as described previously, producing an infinity.
• An underflow occurs as described previously, producing a denorm or zero.
• A zerodivide occurs as described previously, producing an infinity of the appropriate sign.
• An "operand error" occurs as described previously, producing a NaN.
• An "inexact" event occurs whenever the rounding of a result changed that result from the true mathematical value. This occurs almost all the time, and is usually ignored. It is looked at only in the most exacting applications.

Computer hardware is typically able to raise exceptions when these events occur. How this is done is system-dependent. Usually these exceptions are all masked (disabled), relying only on the propagation of error values. Sometimes overflow, zerodivide, and operand error are enabled.
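Python does not expose the hardware flags for binary floating point, so the following sketch uses the standard `decimal` module as an analogue: it implements decimal rather than binary arithmetic, but its contexts carry flags and maskable traps for the same five events. The small exponent range (`Emin`, `Emax`) is an arbitrary choice made here so that overflow and underflow are easy to provoke.

```python
from decimal import Context, Decimal
from decimal import DivisionByZero, Inexact, InvalidOperation, Overflow, Underflow

# traps=[] masks every exception, so only error values and flags are produced.
ctx = Context(prec=7, Emin=-99, Emax=99, traps=[])

results = {
    "overflow": ctx.multiply(Decimal("1E60"), Decimal("1E60")),
    "underflow": ctx.multiply(Decimal("1E-60"), Decimal("1E-60")),
    "zerodivide": ctx.divide(Decimal(1), Decimal(0)),
    "operand error": ctx.divide(Decimal(0), Decimal(0)),
    "inexact": ctx.divide(Decimal(1), Decimal(3)),
}

# Collect the names of the flags that were set along the way.
signals = (Overflow, Underflow, DivisionByZero, InvalidOperation, Inexact)
raised = [s.__name__ for s in signals if ctx.flags[s]]

for event, value in results.items():
    print(event, value)
print(raised)
```

Passing a list of signals as `traps` enables the corresponding exceptions, which mirrors the case where overflow, zerodivide, and operand error are unmasked.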

## Implementation in actual computers

The IEEE has standardized the computer representation for binary floating-point numbers in IEEE 754. This standard is followed by almost all modern machines. Notable exceptions include IBM Mainframes, which support IBM's own format (in addition to IEEE 754 data types), and Cray vector machines, where the T90 series had an IEEE version, but the SV1 still uses Cray floating-point format.

The standard allows for many different precision levels, of which the 32 bit ("single") and 64 bit ("double") are by far the most common, since they are supported in common programming languages. Computer hardware (for example, the Intel Pentium series and the Motorola 68000 series) often provides an 80 bit extended precision format, with 15 exponent bits and 64 significand bits, with no hidden bit. There is controversy about the failure of most programming languages to make these extended precision formats available to programmers (although C and related programming languages usually provide these formats via the long double type on such hardware). System vendors may also provide additional extended formats (e.g. 128 bits) emulated in software.

A project for revising the IEEE 754 standard has been under way since 2000. See IEEE 754r. A late phase of the review was completed on 10 March, 2007 but final ratification of the new standard is still awaiting a decision later in 2007.

## Behavior of computer arithmetic

The standard behavior of computer hardware is to round the ideal (infinitely precise) result of an arithmetic operation to the nearest representable value, and give that representation as the result. In practice, there are other options. IEEE-754-compliant hardware allows one to set the rounding mode to any of the following:

• round to nearest (the default; by far the most common mode)
• round up (toward +∞; negative results round toward zero)
• round down (toward −∞; negative results round away from zero)
• round toward zero (sometimes called "chop" mode; it is similar to the common behavior of float-to-integer conversions, which convert −3.9 to −3)
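The effect of the four modes can be sketched with Python's `decimal` module, which offers analogous rounding modes for decimal arithmetic (Python provides no portable way to change the binary rounding mode of the hardware). The mapping of mode names below and the 4-digit precision are assumptions made for illustration; the helper `third` is hypothetical.

```python
from decimal import ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR, ROUND_HALF_EVEN
from decimal import Context, Decimal

MODES = {
    "round to nearest": ROUND_HALF_EVEN,
    "round up": ROUND_CEILING,
    "round down": ROUND_FLOOR,
    "round toward zero": ROUND_DOWN,
}


def third(sign, mode):
    """Return sign/3 rounded to 4 significant digits under the given rounding mode."""
    return Context(prec=4, rounding=mode).divide(Decimal(sign), Decimal(3))


for name, mode in MODES.items():
    print(f"{name:18} {third(1, mode)} {third(-1, mode)}")

print(int(-3.9))   # float-to-integer conversion truncates toward zero
```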

In the default rounding mode the IEEE 754 standard mandates the round-to-nearest behavior described above for all fundamental algebraic operations, including square root. ("Library" functions such as cosine and log are not mandated.) This means that IEEE-compliant hardware's behavior is completely determined in all 32 or 64 bits.

The mandated behavior for dealing with overflow and underflow is that the appropriate result is computed, taking the rounding mode into consideration, as though the exponent range were infinitely large. If that resulting exponent can't be packed into its field correctly, the overflow/underflow action described above is taken.

The arithmetical difference between two consecutive representable floating point numbers which have the same exponent is called an "ULP", for Unit in the Last Place. For example, the difference between the numbers represented by 45670123 and 45670124 hexadecimal is one ULP. For numbers with an exponent of 0, an ULP is exactly 2^−23 (about 10^−7) in single precision, and exactly 2^−52 (about 10^−16) in double precision. The mandated behavior of IEEE-compliant hardware is that the result be within one-half of an ULP of the infinitely precise result.
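The hexadecimal example can be checked with a short sketch. It assumes the two patterns are single-precision encodings; their exponent field is 138, that is, an exponent of 11, so one ULP is 2^(11−23) = 2^−12. The helper name `float32_from_bits` is hypothetical.

```python
import struct
import sys


def float32_from_bits(bits):
    """Interpret a 32-bit integer as an IEEE single-precision value."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value


a = float32_from_bits(0x45670123)
b = float32_from_bits(0x45670124)

print(b - a)                     # one single-precision ULP at exponent 11
print(sys.float_info.epsilon)    # one double-precision ULP at exponent 0
```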

## Accuracy problems

The facts that floating-point numbers cannot faithfully mimic the real numbers, and that floating-point operations cannot faithfully mimic true arithmetic operations, lead to many surprising situations.

For example, the non-representability of 0.1 and 0.01 means that the result of attempting to square 0.1 is neither 0.01 nor the representable number closest to it. In 24-bit (single precision) representation, 0.1 (decimal) was given previously as e=-4; s=110011001100110011001101, which is

.100000001490116119384765625 exactly.

Squaring this number gives

.010000000298023226097399174250313080847263336181640625 exactly.

Squaring it with single-precision floating-point hardware (with rounding) gives

.010000000707805156707763671875 exactly.

But the representable number closest to 0.01 is

.009999999776482582092285156250 exactly.
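The figures above can be reproduced with a sketch that emulates single precision in Python by rounding doubles through `struct`. The helper `to_single` is a hypothetical name. The product of two 24-bit significands fits exactly in a double, so rounding it once to single precision gives the same result as single-precision hardware would.

```python
import struct
from decimal import Decimal


def to_single(x):
    """Round a Python float (a double) to the nearest IEEE single-precision value."""
    (y,) = struct.unpack(">f", struct.pack(">f", x))
    return y


tenth = to_single(0.1)
squared = to_single(tenth * tenth)   # exact product, then a single rounding
nearest = to_single(0.01)

# Decimal(float) converts exactly, so every stored digit is shown.
print(Decimal(tenth))
print(Decimal(squared))
print(Decimal(nearest))
```

The computed square and the single-precision value closest to 0.01 are adjacent representable numbers, one ULP (2^−30) apart.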

Also, the non-representability of π (and π/2) means that an attempted computation of tan(π/2) will not yield a result of infinity, nor will it even overflow. It is simply not possible for standard floating-point hardware to attempt to compute tan(π/2), because π/2 cannot be represented exactly. This computation in C:

```c
// Enough digits to be sure we get the correct approximation.
double pi = 3.1415926535897932384626433832795;
double z = tan(pi/2.0);
```

will give a result of 16331239353195370.0. In single precision (using the tanf function), the result will be -22877332.0.

By the same token, an attempted computation of sin(π) will not yield zero. The result will be (approximately) 0.1225 × 10^−15 in double precision, or −0.8742 × 10^−7 in single precision.[3]
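The same double-precision experiment can be sketched in Python, whose `math` module normally calls the platform's C library. Exact digits may differ slightly between math libraries, so the comments describe the results only qualitatively.

```python
import math

print(math.tan(math.pi / 2))   # very large but finite: no infinity and no overflow
print(math.sin(math.pi))       # tiny but nonzero
print(math.cos(math.pi))       # -1.0
```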

In fact, while floating-point addition and multiplication are both commutative (a+b = b+a and a×b = b×a), they are not necessarily associative: (a + b) + c need not equal a + (b + c). Using 7-digit decimal arithmetic:

```
1234.567 + 45.67844 = 1280.245
1280.245 + 0.0004   = 1280.245
but
45.67844 + 0.0004   = 45.67884
45.67884 + 1234.567 = 1280.246
```

They are also not necessarily distributive: (a + b)×c need not equal a×c + b×c:

```
1234.567 × 3.333333 = 4115.223
1.234567 × 3.333333 = 4.115223
4115.223 + 4.115223 = 4119.338
but
1234.567 + 1.234567 = 1235.802
1235.802 × 3.333333 = 4119.340
```
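Both 7-digit decimal examples can be replayed with Python's `decimal` module, using a context limited to seven significant digits. This is a sketch assuming the default round-half-even rounding; the variable names are chosen here for illustration.

```python
from decimal import Context, Decimal

ctx = Context(prec=7)   # every operation rounds to 7 significant digits
D = Decimal

# Associativity: (a + b) + c versus a + (b + c)
left = ctx.add(ctx.add(D("1234.567"), D("45.67844")), D("0.0004"))
right = ctx.add(D("1234.567"), ctx.add(D("45.67844"), D("0.0004")))
print(left, right)

# Distributivity: a*c + b*c versus (a + b)*c
c = D("3.333333")
expanded = ctx.add(ctx.multiply(D("1234.567"), c), ctx.multiply(D("1.234567"), c))
factored = ctx.multiply(ctx.add(D("1234.567"), D("1.234567")), c)
print(expanded, factored)
```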

In addition to loss of significance, inability to represent numbers such as π and 0.1 exactly, and other slight inaccuracies, the following phenomena may occur:

• Cancellation: subtraction of nearly equal operands may cause extreme loss of accuracy. This is perhaps the most common and serious accuracy problem.
• Conversions to integer are unforgiving: converting (63.0/9.0) to integer yields 7, but converting (0.63/0.09) may yield 6. This is because conversions generally truncate rather than round.
• Limited exponent range: results might overflow yielding infinity, or underflow yielding a denormal value or zero. If a denormal number results, precision will be lost.
• Testing for safe division is problematical: Checking that the divisor is not zero does not guarantee that a division will not overflow and yield infinity.
• Equality is problematical! Two computational sequences that are mathematically equal may well produce different floating-point values. Programmers often perform comparisons within some tolerance (often a decimal constant, itself not accurately represented), but that doesn't necessarily make the problem go away.
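Several of these phenomena show up in a few lines of double-precision arithmetic. The specific operands below are illustrative choices, not taken from the text above.

```python
import math

# Cancellation: adding 1 and then subtracting it again destroys most digits of x.
x = 1e-8
recovered = (1.0 + x) - 1.0
relative_error = abs(recovered - x) / x
print(relative_error)   # many orders of magnitude above the ~1e-16 of a single operation

# Conversion to integer truncates: the product is slightly below 8.
truncated = int((0.1 + 0.7) * 10)
print(truncated)        # 7

# A nonzero divisor does not guarantee a finite quotient.
divisor = 1e-10
quotient = 1e308 / divisor
print(divisor != 0.0, quotient)

# Mathematically equal expressions can differ in floating point.
same = (0.1 + 0.2 == 0.3)
print(same)             # False
```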


## Minimizing the effect of accuracy problems

Because of the problems noted above, naive use of floating point arithmetic can lead to many problems. A good understanding of numerical analysis is essential to the creation of robust floating point software. The subject is actually quite complicated, and the reader is referred to the references at the bottom of this article.

In addition to careful design of programs, careful handling by the compiler is essential. Certain "optimizations" that compilers might make (for example, reordering operations) can work against the goals of well-behaved software. There is some controversy about the failings of compilers and language designs in this area. See the external references at the bottom of this article.

Floating point arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of Io or the mass of the proton), and at its worst when it is expected to model the interactions of quantities expressed as decimal strings that are expected to be exact. An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation. See: http://www2.hursley.ibm.com/decimal/. The "decimal" data type of the C# programming language, and the IEEE 854 standard, are designed to avoid the problems of binary floating point, and make the arithmetic always behave as expected when numbers are printed in decimal.
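The difference matters even for trivial sums. As a sketch, Python's `decimal` module (a decimal arithmetic implementation, used here in place of the C# type mentioned above) adds ten amounts of 0.10 exactly, while binary doubles do not.

```python
from decimal import Decimal

binary_total = sum([0.10] * 10)
decimal_total = sum([Decimal("0.10")] * 10)

print(binary_total)    # slightly less than 1
print(decimal_total)   # 1.00
```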

Double precision floating point arithmetic is more accurate than just about any physical measurement one could make. For example, it could indicate the distance from the Earth to the Moon with an accuracy of about 50 nanometers. So, if one were designing an integrated circuit chip with 100 nanometer features, that stretched from the Earth to the Moon, double precision arithmetic would be fairly close to being good enough.

What makes floating point arithmetic troublesome is that people write mathematical algorithms that perform operations an enormous number of times, and so small errors grow. A few examples are matrix inversion, eigenvector computation, and differential equation solving. These algorithms must be very carefully designed if they are to work well.

People often carry expectations from their mathematics training into the field of floating point computation. For example, it is known that $(x+y)(x-y) = x^2-y^2$, and that $\sin^2\theta+\cos^2\theta = 1$, and that eigenvectors are degenerate if the eigenvalues are equal. These facts can't be counted on when the quantities involved are the result of floating point computation.

While a treatment of the techniques for writing high-quality floating-point software is far beyond the scope of this article, here are a few simple tricks:

The use of the equality test (if (x==y) ...) is usually not a good idea when it is based on expectations from pure mathematics. Such tests are sometimes replaced with "fuzzy" tests (if (abs(x-y) < epsilon) ...), where epsilon is sufficiently small and tailored to the application (such as 1.0E-13). The wisdom of doing this varies greatly. It is often better to organize the code in such a way that such tests are unnecessary.
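A fuzzy comparison might be sketched as follows, using Python's standard `math.isclose`. The tolerance 1e-13 is taken from the example above; whether a relative or an absolute tolerance is appropriate depends on the application, which is one reason the wisdom of such tests varies.

```python
import math

x = 0.1 + 0.2
y = 0.3

print(x == y)                                    # False
print(math.isclose(x, y, rel_tol=1e-13))         # True

# A relative tolerance is of no help when comparing against zero.
print(math.isclose(1e-20, 0.0, rel_tol=1e-13))   # False
print(math.isclose(1e-20, 0.0, abs_tol=1e-13))   # True
```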

An awareness of when loss of significance can occur is useful. For example, if one is adding a very large number of numbers, the individual addends are very small compared with the sum. This can lead to loss of significance. Suppose, for example, that one needs to add many numbers, all approximately equal to 3. After 1000 of them have been added, the running sum is about 3000. A typical addition would then be something like

```
  3253.671
+    3.141276
  --------
  3256.812
```

The low 3 digits of the addends are effectively lost. The Kahan summation algorithm may be used to reduce the errors.
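A minimal sketch of Kahan (compensated) summation follows, compared with a naive running sum and with Python's `math.fsum`, which returns a correctly rounded sum. The test data, one large value followed by ten addends too small to change it individually, is an illustrative choice.

```python
import math


def naive_sum(values):
    total = 0.0
    for x in values:
        total += x
    return total


def kahan_sum(values):
    total = 0.0
    compensation = 0.0                    # low-order bits lost by earlier additions
    for x in values:
        y = x - compensation              # re-inject what was lost so far
        t = total + y                     # low-order bits of y may be dropped here
        compensation = (t - total) - y    # recover the part that was just dropped
        total = t
    return total


values = [1.0] + [1e-16] * 10

print(naive_sum(values))    # every small addend is rounded away
print(kahan_sum(values))    # the small addends accumulate
print(math.fsum(values))    # reference: correctly rounded sum
```

Note that an optimizing compiler that reorders floating-point operations could simplify the compensation term to zero in a compiled language; Python evaluates the expressions as written.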

Another thing that can be done is to rearrange the computation in a way that is mathematically equivalent but less prone to error. As an example, Archimedes approximated π by calculating the perimeters of polygons inscribing and circumscribing a circle, starting with hexagons, and successively doubling the number of sides. The recurrence formula for the circumscribed polygon is:

$t_0 = \frac{1}{\sqrt{3}}$
$t_{i+1} = \frac{\sqrt{t_i^2+1}-1}{t_i} \qquad \mathrm{second\ form:} \qquad t_{i+1} = \frac{t_i}{\sqrt{t_i^2+1}+1}$
$\pi \sim 6 \times 2^i \times t_i, \qquad \mathrm{converging\ as\ } i \rightarrow \infty$

Here is a computation using IEEE "double" (53 bits of significand precision) arithmetic:

```
 i   6 × 2^i × t_i, first form   6 × 2^i × t_i, second form
 0   3.4641016151377543863       3.4641016151377543863
 1   3.2153903091734710173       3.2153903091734723496
 2   3.1596599420974940120       3.1596599420975006733
 3   3.1460862151314012979       3.1460862151314352708
 4   3.1427145996453136334       3.1427145996453689225
 5   3.1418730499801259536       3.1418730499798241950
 6   3.1416627470548084133       3.1416627470568494473
 7   3.1416101765997805905       3.1416101766046906629
 8   3.1415970343230776862       3.1415970343215275928
 9   3.1415937488171150615       3.1415937487713536668
10   3.1415929278733740748       3.1415929273850979885
11   3.1415927256228504127       3.1415927220386148377
12   3.1415926717412858693       3.1415926707019992125
13   3.1415926189011456060       3.1415926578678454728
14   3.1415926717412858693       3.1415926546593073709
15   3.1415919358822321783       3.1415926538571730119
16   3.1415926717412858693       3.1415926536566394222
17   3.1415810075796233302       3.1415926536065061913
18   3.1415926717412858693       3.1415926535939728836
19   3.1414061547378810956       3.1415926535908393901
20   3.1405434924008406305       3.1415926535900560168
21   3.1400068646912273617       3.1415926535898608396
22   3.1349453756585929919       3.1415926535898122118
23   3.1400068646912273617       3.1415926535897995552
24   3.2245152435345525443       3.1415926535897968907
25                               3.1415926535897962246
26                               3.1415926535897962246
27                               3.1415926535897962246
28                               3.1415926535897962246

The true value is 3.1415926535897932385...
```

While the two forms of the recurrence formula are clearly equivalent, the first subtracts 1 from a number extremely close to 1, leading to huge cancellation errors. Note that, as the recurrence is applied repeatedly, the accuracy improves at first, but then it deteriorates. It never gets better than about 8 digits, even though 53-bit arithmetic should be capable of about 16 digits of precision. When the second form of the recurrence is used, the value converges to 15 digits of precision.
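The experiment can be repeated with a short sketch in Python, whose floats are IEEE doubles. The function name `archimedes` and its `form` parameter are hypothetical; the guard against a zero `t` is an added precaution, since the first form can eventually cancel to exactly zero.

```python
import math


def archimedes(form, steps):
    """Return the estimates 6 * 2**i * t_i for i = 0..steps using recurrence form 1 or 2."""
    t = 1.0 / math.sqrt(3.0)
    estimates = []
    for i in range(steps + 1):
        estimates.append(6.0 * 2.0 ** i * t)
        if t == 0.0:                  # nothing left to divide by
            break
        root = math.sqrt(t * t + 1.0)
        if form == 1:
            t = (root - 1.0) / t      # subtracts nearly equal numbers: cancellation
        else:
            t = t / (root + 1.0)      # algebraically equivalent, no cancellation
    return estimates


first = archimedes(1, 24)
second = archimedes(2, 28)

for i, value in enumerate(second):
    shown = f"{first[i]:.16f}" if i < len(first) else ""
    print(f"{i:2d}  {shown:20}  {value:.16f}")
```

Printed digits beyond the 17th significant digit carry no information in double precision, which is why this sketch prints fewer digits than the table above.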

## A few nice properties

One can sometimes take advantage of a few nice properties:

• Any integer less than or equal to 2^24 can be exactly represented in the single precision format, and any integer less than or equal to 2^53 can be exactly represented in the double precision format. Furthermore, any reasonable power of 2 times such a number can be represented. This property is sometimes used in purely integer applications, to get 53-bit integers on platforms that have double precision floats but only 32-bit integers.
• The bit representations of IEEE floating point numbers are monotonic, as long as exceptional values are avoided and the signs are handled properly. IEEE floating point numbers are equal if and only if their integer bit representations are equal. Comparisons for larger or smaller can be done with integer comparisons on the bit patterns, as long as the signs match. However, the actual floating point comparisons provided by hardware typically have much more sophistication in dealing with exceptional values.
• To a rough approximation, the bit representation of an IEEE floating point number is proportional to its base 2 logarithm, with an average error of about 3%. (This is because the exponent field is in the more significant part of the datum.) This can be exploited in some applications, such as volume ramping in digital sound processing.
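These three properties can be checked with a short sketch on double-precision values. The helpers `bits64` and `approx_log2` are hypothetical names; `approx_log2` assumes a positive, normal argument.

```python
import math
import struct


def bits64(x):
    """Return the 64-bit pattern of a double as an unsigned integer."""
    (b,) = struct.unpack(">Q", struct.pack(">d", x))
    return b


def approx_log2(x):
    """Estimate log2(x) for positive normal x directly from the bit pattern."""
    return bits64(x) / 2 ** 52 - 1023


# Integers up to 2**53 are exact; 2**53 + 1 is not representable.
print(float(2 ** 53) == 2 ** 53)
print(float(2 ** 53 + 1) == float(2 ** 53))

# For positive values, ordering of bit patterns matches ordering of values.
print(bits64(1.5) < bits64(2.0) < bits64(2.5))

# The bit pattern tracks the base 2 logarithm.
print(approx_log2(10.0), math.log2(10.0))
```

The estimate is exact at powers of two and is a piecewise-linear interpolation of the logarithm in between.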

## Notes and references

1. ^ Computer hardware doesn't necessarily compute the exact value; it simply has to produce the equivalent rounded result as though it had computed the infinitely precise result.
2. ^ The enormous complexity of modern division algorithms once led to a famous error. An early version of the Intel Pentium chip was shipped with a division instruction that, on rare occasions, gave slightly incorrect results. Many computers had been shipped before the error was discovered. Until the defective computers were replaced, patched versions of compilers were developed that could avoid the failing cases. See Pentium FDIV bug.
3. ^ But an attempted computation of cos(π) yields -1 exactly. Since the derivative is nearly zero near π, the effect of the inaccuracy in the argument is far smaller than the spacing of the floating point numbers around -1, and the rounded result is exact.
