Sunday, January 4, 2009

Floating Point Reference

"Fractions are your friends." This was my high school algebra teacher's standard response when students complained about the sometimes tedious math that fractions could require. Many years later, I'm finding that floating point math, while not tedious, is still rather more involved than it first appears; here's my attempt to summarize the issues that may come up.

Floating point data types

In C on an x86 system:
  • A float is 32 bits.
  • A double is 64 bits.
  • A long double is compiler-dependent. A long double in gcc contains 80 bits of data, although it's padded in storage to maintain alignment: 96 bits (12 bytes) on 32-bit x86 and 128 bits (16 bytes) on x86-64. Visual C++ treats a long double as 64 bits, just like a double (and has been criticized for doing so; see here and here). I had been under the impression that long doubles were nonstandard or were a recent addition to the standard, but they've been part of standard C since C89, well before C99.
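To check what a particular compiler actually does, a small sketch like the following can help. The function name print_float_type_info is my own invention for illustration; the macros come from the standard float.h and limits.h headers. It assumes a binary platform (FLT_RADIX == 2), where the *_MANT_DIG values can be read as bits.

```c
#include <float.h>
#include <limits.h>
#include <stdio.h>

/* Print the storage size and significand precision of each type.
   The *_MANT_DIG macros count base-FLT_RADIX digits, which are bits
   on binary (FLT_RADIX == 2) platforms such as x86. */
void print_float_type_info(void)
{
    printf("float:       %zu bits stored, %d significand bits\n",
           sizeof(float) * CHAR_BIT, FLT_MANT_DIG);
    printf("double:      %zu bits stored, %d significand bits\n",
           sizeof(double) * CHAR_BIT, DBL_MANT_DIG);
    /* The long double line is the one that varies between compilers. */
    printf("long double: %zu bits stored, %d significand bits\n",
           sizeof(long double) * CHAR_BIT, LDBL_MANT_DIG);
}
```

Call print_float_type_info() from main. The stored size includes any alignment padding, so the significand width is the better indicator of how much precision a long double really carries.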

Selecting a floating point data type

See this portion of another post.

Floating point storage

This is covered on many sites; the most concise and thorough description I've found is in chapter 2 of Sun's Numerical Computation Guide.
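As a small illustration of the layout, here is a sketch that splits a float into its three stored fields. It assumes that float is an IEEE 754 single (1 sign bit, 8 exponent bits, 23 fraction bits), which holds on x86; the float_fields type and decompose_float function are hypothetical names of my own.

```c
#include <stdint.h>
#include <string.h>

/* The three stored fields of an IEEE 754 single precision value. */
typedef struct {
    uint32_t sign;      /* 1 bit: 0 = positive, 1 = negative */
    uint32_t exponent;  /* 8 bits, stored with a bias of 127 */
    uint32_t fraction;  /* 23 bits; the leading 1 is implicit for normal values */
} float_fields;

/* Split a float into its fields. memcpy is used to copy the bits
   because it avoids the aliasing problems of pointer casts. */
float_fields decompose_float(float value)
{
    uint32_t bits;
    float_fields f;

    memcpy(&bits, &value, sizeof bits);
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}
```

For example, 1.0f decomposes to sign 0, exponent 127 (a true exponent of 0 plus the bias) and fraction 0.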

NaNs and Infinity

See my previous post.

Comparing floating point numbers

A simple equality comparison (such as a == b) will often fail, even for values which you would expect to be equal, due to rounding errors (especially rounding introduced by converting between binary and decimal). (Do any compilers or static code analysis tools emit warnings if you try to do a naive equality comparison?)
  • The simplest solution is to check if two floating point numbers are within a small developer-chosen epsilon value of each other; this is the approach used by the CUnit and CppUnit testing frameworks, but it has the disadvantage of requiring you to choose an epsilon value yourself and make sure that it's appropriate for the data being compared.
  • "Comparing Floating Point Numbers," by Bruce Dawson, discusses this absolute epsilon approach as well as more sophisticated approaches such as comparing using a relative epsilon and comparing based on the number of possible floating point values that could occur between the two operands.
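  • The simplest good approach, provided by CERT, is to compare using compiler-provided absolute epsilons (such as FLT_EPSILON and DBL_EPSILON from float.h). (As I understand it, CERT's approach is equivalent to Dawson's approach with a max ulps of 1 only for values with a magnitude between 1 and 2, since these epsilons are defined as the gap between 1.0 and the next representable value; for much larger or much smaller values, the two approaches diverge.)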
Greater than / less than comparisons generally require no special handling, although at the assembly level, a comparison may return a result of "unordered" if NaNs are involved, and one of the compilers I tested (CodeGear C++Builder) fails to account for this and so may return incorrect results when comparing NaNs.
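To make the "unordered" case explicit in C rather than relying on how a compiler translates < and >, the C99 isunordered macro from math.h can be used. The cmp_result enum and compare_doubles function below are hypothetical names for a sketch of the idea.

```c
#include <math.h>

typedef enum { CMP_LESS, CMP_EQUAL, CMP_GREATER, CMP_UNORDERED } cmp_result;

/* Three-way comparison with a fourth outcome for NaN operands. */
cmp_result compare_doubles(double a, double b)
{
    /* isunordered is true if either operand is a NaN. */
    if (isunordered(a, b))
        return CMP_UNORDERED;
    if (a < b)
        return CMP_LESS;
    if (a > b)
        return CMP_GREATER;
    return CMP_EQUAL;
}
```

On a conforming compiler, all of NAN < 1.0, NAN > 1.0 and NAN == 1.0 evaluate to false, so code that assumes "not less than" implies "greater than or equal" goes wrong as soon as a NaN appears.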

Handling floating point exceptions

See this post.

Low-level floating point calculations

If you need to do floating point work at the assembly level, use the Intel Software Developer's Manuals as a reference. Volume 1 has some background on the FPU; volume 2A contains most floating point instructions (the instruction reference is alphabetical, and the FPU instructions start with F).

Example code

Rather than simple math, the following code covers manipulating floating point numbers' bit representations, handling NaNs and infinities, and so on.

For further reading

In no particular order...

EDIT: (3/15/2009) Added sections on "Selecting a floating point data type" and "Handling floating point exceptions."
