Set-theoretic explanation of IEEE 754

The IEEE Standard for Floating-Point Arithmetic is confusing for beginners. Here I try to give an alternative explanation. It’s not my goal to make it easy. It simply isn’t easy. But this might help you understand some aspects of floating point arithmetic.

Why sets?

A set is a collection of elements. A range of numbers is a kind of set: a contiguous subset of some other ordered set. A range can be infinite in size. If it contains rational (or even real) numbers, there are always more elements between any two different elements. But it’s still a set. The range from 0 to 100 contains 5.3 and 6. Between them you have numbers such as 5.67 and 60/11.

In computing, values often get rounded to the nearest representable number because the number format has nothing to represent the value precisely. So all values of some range get represented by the one value that actually exists in the number format.

In this article I write about these ranges. And I use this simple notation:

[ beginning (inclusive), end (inclusive) ]
] beginning (exclusive), end (inclusive) ]
[ beginning (inclusive), end (exclusive) [
] beginning (exclusive), end (exclusive) [

] 0, 4 ] : { x : x > 0, x ≤ 4 }
[ 0, 0 ] : { 0 }

So the brackets just show if the given value is still in the range, or not. Everything between is in it.

Example with Integers

Integers have no fractional component. They are whole. When we use them and the result of a calculation has a fractional component, we need to round.
Usually all numbers in the range [x-0.5,x+0.5[ get rounded to x.
So 1.1 becomes 1 and 4 is just 4. 11.5 gets rounded to 12.

When you know that there was some rounding, it means that 12 isn’t just 12. It could be 11.502 or 12.497832. There are infinitely many numbers that get represented by 12, if we round the values like this. 12 represents the range [11.5,12.5[.
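The rounding rule above can be tried out with a few lines of Java. This is only a sketch; the class and method names are made up for this example. It relies on Math.round, which rounds ties upwards and therefore matches the range [x-0.5,x+0.5[.

```java
class IntegerRounding {
    /** Maps every value in [x - 0.5, x + 0.5[ to the integer x (ties go up). */
    static long roundToInteger(double value) {
        return Math.round(value);
    }

    public static void main(String[] args) {
        // All samples from the text, plus 12.5 as the first value outside the range of 12.
        double[] samples = {1.1, 4.0, 11.5, 11.502, 12.497832, 12.5};
        for (double sample : samples) {
            System.out.println(sample + " -> " + roundToInteger(sample));
        }
    }
}
```

11.5, 11.502 and 12.497832 all end up as 12, while 12.5 already belongs to the range of 13.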

This is easy, right? We do it all the time. Measurements are usually rounded because we know that at some point a tiny fraction doesn’t matter. When we add up such rounded values we need to deal with the possibility that they were all rounded up or down and the added imprecision is starting to be problematic. Then we write 12 ±4%, or something like that. But we don’t actually do that when programming with floating point numbers.

What actually makes it confusing

Imprecision isn’t that hard to grasp. But there are some things about IEEE 754 that are very confusing. I will explain some of them here.
You can say that the following are special cases, but that’s why they are so confusing when you have to deal with them.

Imprecision depends on the value

Numbers around zero are rather precise. But when you get further away from zero (positive and negative) you get less precision. That’s not such a strange concept. You should already be familiar with this:

Distance Earth to Sun: 149.6E9 metres
Radius of Cobalt atom: 1.52E−10 metres

Take each of those values and append the digit “1” right before the “E” (149.61E9 and 1.521E−10). In absolute terms these two additions are very different in size. It depends on the order of magnitude.

This means each value represents a larger range when it’s farther away from 0. There’s no absolute zero. Even 0 represents a range. The closest normalized values in double precision are ±2^−1022, and the subnormal values go down to ±2^−1074 (Double.MIN_VALUE). So 0 represents a rather small range. But it’s not just the range between those two. See next item.

At some point (2^53, to be precise) the gap between two neighbouring values becomes larger than 1. So incrementing such a value by 1 can do nothing.

See Math.ulp for the unit in the last place. It lets you see how imprecise a value might be. The ulp for one million as a float is 0.0625; as a double it is about 1.16E-10. There are other interesting methods, such as Math.nextAfter.
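To see how the imprecision grows with the value, you can print a few ulps. The following is a minimal sketch; the class name and the helper incrementIsLost are invented for this example, while Math.ulp and Math.nextAfter are the standard methods.

```java
class UlpDemo {
    /** True if adding 1 to the value does not change it. */
    static boolean incrementIsLost(double value) {
        return value + 1 == value;
    }

    public static void main(String[] args) {
        System.out.println(Math.ulp(1.0));            // 2.220446049250313E-16
        System.out.println(Math.ulp(1_000_000.0));    // double: tiny, around 1.16E-10
        System.out.println(Math.ulp(1_000_000.0f));   // float: 0.0625

        double twoTo53 = (double) (1L << 53);         // 2^53, exactly representable
        System.out.println(Math.ulp(twoTo53));        // 2.0
        System.out.println(incrementIsLost(twoTo53)); // true
        System.out.println(incrementIsLost(1.0));     // false

        // The next double above 1.0
        System.out.println(Math.nextAfter(1.0, Double.POSITIVE_INFINITY)); // 1.0000000000000002
    }
}
```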

There are two zeroes

It’s true. Simply because a single bit tells you if a value is positive or negative. So you can get -0. This is confusing, because -0 and +0 are equal! So they are the same. This article is about rounding as a pragmatic solution. But mathematically speaking +0 and -0 are actually the same. Just like 0.999… and 1 are the same.

Let’s say x is the largest value still getting rounded to the negative number closest to zero, which is -2^−1074 for double precision. So if a value is of the range ]x,0] you get -0 (as a String this is actually “-0.0”). And for the others near zero you get +0 (as a String it’s just “0.0”). However, those will all be treated as zero.

So when you understand why there are two zeroes you understand that the range of values that get rounded to -0.0 is not the same range of values that are equal to -0.0.
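Here is a small sketch of how the two zeroes behave in Java. The class name and the helper isNegativeZero are made up for this example. Note that == treats them as equal, while the sign is still visible in the String, in a division, and in Double.equals.

```java
class SignedZero {
    /** True only for -0.0: it compares equal to zero but has the sign bit set. */
    static boolean isNegativeZero(double value) {
        return value == 0.0 && Double.doubleToRawLongBits(value) != 0L;
    }

    public static void main(String[] args) {
        double negativeZero = -0.0;
        System.out.println(negativeZero == 0.0);        // true
        System.out.println(negativeZero);               // -0.0
        System.out.println(1.0 / negativeZero);         // -Infinity
        System.out.println(Double.valueOf(negativeZero).equals(0.0)); // false

        // A tiny negative result that is too small to be represented becomes -0.0
        double underflow = -1e-200 * 1e-200;
        System.out.println(isNegativeZero(underflow));  // true
    }
}
```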

Not a Number

When you think of these numbers as sets it’s very easy to explain Not a Number (NaN):
This is the empty set.
Or this range: ]x,x[ where x can be anything, as it isn’t included.

Whenever some calculation has no number then NaN is used. 0.0/0.0 can’t be rounded to anything. So you get the empty set instead.
But there’s one thing you must keep in mind: NaN is not equal to NaN!

Double.NaN == Double.NaN; // = false

There’s always a catch when you try to understand floating point arithmetic.
And there’s actually another thing that is a bit confusing. NaN is a number. At least the type of it is still double or float. In Java all boxed numbers extend java.lang.Number. So it’s a value of type “number”, which represents anything that is not a number.
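Because NaN is not equal to itself, you need Double.isNaN to detect it. A short sketch (the class and method names are invented for this example):

```java
class NaNDemo {
    /** The reliable way to ask whether a calculation produced "no number". */
    static boolean hasNoResult(double value) {
        return Double.isNaN(value);
    }

    public static void main(String[] args) {
        double nan = 0.0 / 0.0;
        System.out.println(nan == nan);              // false
        System.out.println(nan != nan);              // true
        System.out.println(hasNoResult(nan));        // true
        System.out.println(hasNoResult(Math.sqrt(-1.0))); // true

        // Still a value of a numeric type
        Number boxed = nan;
        System.out.println(boxed instanceof Double); // true
    }
}
```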

Infinity

IEEE 754 has two versions of infinity. One is positive and one is negative.
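This is not just a marker for values too large for your data type, although overflow does produce it. The largest finite number for double precision is a bit under 1.8E308, exactly (2−2^−52)·2^1023. A result only slightly above it is simply rounded down to it. With the default rounding, a result of at least (2−2^−53)·2^1023 becomes Infinity instead; Double.MAX_VALUE * 2 is an example. So Infinity represents [(2−2^−53)·2^1023,∞[ as well as results that are infinite in a more abstract sense.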

If you ask yourself how many numbers are represented by 12, the answer is: infinity
And you only get this result for questions that are somewhat abstract. Like this one: How many natural numbers are there?

Examples for (positive) infinity are 1.0 / 0.0 and Math.pow(0, -1).

Infinity does equal itself. What’s special is that many operations will just return the same infinity (positive or negative). Divided by itself you get NaN (see above). Some of these results might be unexpected. That’s why we have a standardisation by IEEE. If you want ∞/∞ to give something other than NaN, you need to do the programming yourself. If NaN is fine for you, you don’t need to do anything.
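A few of these results, as a minimal sketch (the class and method names are invented for this example):

```java
class InfinityDemo {
    /** True if the operation swallowed the finite operand and returned the same infinity. */
    static boolean staysInfinite(double infinity, double finite) {
        return infinity + finite == infinity && infinity * 2 == infinity;
    }

    public static void main(String[] args) {
        double infinity = 1.0 / 0.0;
        System.out.println(infinity);                             // Infinity
        System.out.println(Math.pow(0, -1));                      // Infinity
        System.out.println(-1.0 / 0.0);                           // -Infinity
        System.out.println(infinity == Double.POSITIVE_INFINITY); // true
        System.out.println(staysInfinite(infinity, 1e300));       // true
        System.out.println(infinity / infinity);                  // NaN
        System.out.println(infinity - infinity);                  // NaN
    }
}
```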

Representation

Binary

There are only so many possible values when you use 4 or even 8 bytes. Some bit patterns don’t stand for a number (NaN and the infinities). Every other bit pattern, whether normalized or subnormal, stands for a certain exact value. The rules for rounding define which values get rounded to such a value. Use any online floating point tool to see how any configuration represents some exact value. This converter is great because it shows the string representation and the exact value:
IEEE-754 Floating Point Converter
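If you prefer to look at this in code, Java can show both the bit pattern and the exact value. This is just a sketch with made-up class and method names; it uses Double.doubleToLongBits for the bits and the BigDecimal(double) constructor, which converts the binary value exactly.

```java
import java.math.BigDecimal;

class BitsDemo {
    /** The 64 bits of a double, split into sign, exponent (11 bits) and mantissa (52 bits). */
    static String bits(double value) {
        String raw = Long.toBinaryString(Double.doubleToLongBits(value));
        String padded = String.format("%64s", raw).replace(' ', '0');
        return padded.substring(0, 1) + " " + padded.substring(1, 12) + " " + padded.substring(12);
    }

    /** The exact decimal expansion of the binary value, without any rounding. */
    static BigDecimal exactValue(double value) {
        return new BigDecimal(value);
    }

    public static void main(String[] args) {
        System.out.println(bits(1.0));
        System.out.println(bits(0.1));
        System.out.println(exactValue(0.5)); // 0.5
        System.out.println(exactValue(0.1)); // slightly more than 0.1, with 55 decimal places
    }
}
```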

String

So every “double” is actually a range. But which string represents this set of values? We are used to the decimal system, so we use that and not binary. A theoretical toString implementation that could process even real numbers (such as π) would give you a lot of infinitely long strings. We can’t use those. Instead we use a short one, with just enough digits to tell the value apart from its neighbours. However, “0”, “1”, “2”, etc. are already used by the integers. So those get “.0” after the integer representation. The sign is rendered as “-” for negative numbers (including “-0.0” and “-Infinity”). So you get 4 or more characters for negative integers. For very large and very small numbers E is used (in Java from 1.0E7 upwards and below 0.001): “1.0E123”
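Some examples of these strings, as produced by Double.toString. The class and the helper roundTrips are made up for this sketch; the helper checks that parsing the string gives back exactly the same double, which is the point of choosing such a string.

```java
class ToStringDemo {
    /** True if the string form identifies the double unambiguously. */
    static boolean roundTrips(double value) {
        return Double.parseDouble(Double.toString(value)) == value;
    }

    public static void main(String[] args) {
        System.out.println(Double.toString(12.0));      // 12.0
        System.out.println(Double.toString(-3.0));      // -3.0
        System.out.println(Double.toString(0.1));       // 0.1
        System.out.println(Double.toString(0.1 + 0.2)); // 0.30000000000000004
        System.out.println(Double.toString(1e7));       // 1.0E7
        System.out.println(Double.toString(0.0001));    // 1.0E-4
        System.out.println(Double.toString(1e123));     // 1.0E123
        System.out.println(Double.toString(Double.NEGATIVE_INFINITY)); // -Infinity
        System.out.println(roundTrips(0.1 + 0.2));      // true
    }
}
```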

Everyday Example

This is a surprise to many:

0.1+0.2 == 0.3; // = false

How could that be?

You have the addition of two values. But those each represent some range. Imagine you cut pieces of wood. One is 0.1 meters and one is 0.2 meters. You glue them together. Then you cut a piece that is 0.3 meters. Will they be the exact same length? Probably not. (For the stochastics geeks out there: the probability is 0. There’s no way they are of equal length.)

Let’s look at it as ranges. 0.1 represents a range. But “0.1” is only the string used to represent that range. It is not the actual value.

1/2 is just 0.5, plain and simple. But one tenth isn’t like that. It’s represented by “0.1” as a String, but the binary value is a bit more. You can’t add a set to another (there’s union, but that’s not the same). Addition is one value plus another value. So one value has to be picked from each range. The computer simply uses the exact value of the binary representation as it is.

You’d have to add the smallest possible value of the first range to the smallest of the other range. And then do the same with the two largest values. Then you get a new range. You could do that. Then you could check if 0.3 is part of that range. That’s not at all what 0.1+0.2 == 0.3 does and you get false.
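The range addition described above could look roughly like this. It is only a sketch: the class and method names are invented, and it assumes as a simplification that each double stands for the range from half an ulp below to half an ulp above its exact value. BigDecimal is used so that the bounds themselves are not rounded.

```java
import java.math.BigDecimal;

class RangeAddition {
    private static final BigDecimal TWO = BigDecimal.valueOf(2);

    /** Half the distance to the neighbouring double (exact, as ulp is a power of two). */
    static BigDecimal halfUlp(double value) {
        return new BigDecimal(Math.ulp(value)).divide(TWO);
    }

    static BigDecimal lower(double value) {
        return new BigDecimal(value).subtract(halfUlp(value));
    }

    static BigDecimal upper(double value) {
        return new BigDecimal(value).add(halfUlp(value));
    }

    /** Adds the two ranges bound by bound and checks whether target lies in the result. */
    static boolean sumRangeContains(double a, double b, BigDecimal target) {
        BigDecimal low = lower(a).add(lower(b));
        BigDecimal high = upper(a).add(upper(b));
        return low.compareTo(target) <= 0 && target.compareTo(high) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(0.1 + 0.2 == 0.3);                                  // false
        System.out.println(sumRangeContains(0.1, 0.2, new BigDecimal("0.3"))); // true
    }
}
```

The plain comparison asks about single values, the second check asks about ranges. Only the second one says yes.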

Both 0.1 and 0.2 are a bit larger in binary and this imprecision gets added to the result. So you get something larger than 0.3.
What you really ask is this:
Is some value near X plus some value near Y exactly the same as some value near Z?
Of course not! There’s a chance that it is (after rounding), but that’s not even likely. We are still rather close to zero and we already have this problem with imprecision.
It’s also because 0.1 is really 1/(2·5) (using only prime numbers). So we have 5, which is not based on 2. But your computer uses a binary representation of all numbers. It starts at 1 and divides the values by two until it’s close. Then it adds such fractions until it’s as close as possible. So 0.125 is too much for “0.1”. But 0.0625 is too little. Add 0.03125 and you get closer. However, you never actually get “0.1”. Even if you were using 128 bit instead of 64 bit you would not get there.
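The adding of fractions described above can be sketched in a few lines. This is not how the hardware does the conversion; it’s a simple greedy loop with invented names that adds 1/2, 1/4, 1/8 and so on whenever the sum stays at or below the target, and then shows what is still missing.

```java
import java.math.BigDecimal;

class BinaryExpansion {
    private static final BigDecimal TWO = BigDecimal.valueOf(2);

    /** The first bitCount binary digits after the point of target (0 <= target < 1). */
    static String fractionBits(BigDecimal target, int bitCount) {
        StringBuilder bits = new StringBuilder();
        BigDecimal sum = BigDecimal.ZERO;
        BigDecimal power = BigDecimal.ONE;
        for (int i = 0; i < bitCount; i++) {
            power = power.divide(TWO); // 1/2, 1/4, 1/8, ... (always exact)
            if (sum.add(power).compareTo(target) <= 0) {
                sum = sum.add(power);
                bits.append('1');
            } else {
                bits.append('0');
            }
        }
        return bits.toString();
    }

    /** What is still missing after the given bits have been added up. */
    static BigDecimal remainder(BigDecimal target, String bits) {
        BigDecimal sum = BigDecimal.ZERO;
        BigDecimal power = BigDecimal.ONE;
        for (char bit : bits.toCharArray()) {
            power = power.divide(TWO);
            if (bit == '1') {
                sum = sum.add(power);
            }
        }
        return target.subtract(sum);
    }

    public static void main(String[] args) {
        BigDecimal oneTenth = new BigDecimal("0.1");
        String bits = fractionBits(oneTenth, 12);
        System.out.println(bits);                        // 000110011001
        System.out.println(remainder(oneTenth, bits));   // still not zero
        System.out.println(fractionBits(new BigDecimal("0.5"), 4)); // 1000
    }
}
```

The pattern 0011 repeats forever, so the remainder gets smaller with every bit but never reaches zero.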

So what now?

Those imprecisions are often not a problem. You can use floats in a game where the exact values do not really matter. Your physics won’t ever be 100% exact, and performance is more important. Moving an object at 0.1 by 0.2 doesn’t have to place it exactly at 0.3. It’s good enough if it’s close to 0.3.
However, to compare double or float numbers you have to use something like this:

Math.abs(a-b) < delta 

With a small, but not too small, delta you can check if a and b are close together. This is good enough in many cases.
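Since the imprecision depends on the value, a fixed delta that works near 1 can be far too strict for large numbers. One common way around this is to accept either a small absolute or a small relative difference. The following is a sketch under that assumption; the names and the chosen tolerances are made up and need to be adapted to your use case.

```java
class ApproxEquals {
    /**
     * True if a and b differ by at most absoluteDelta,
     * or by at most relativeDelta times the larger magnitude.
     */
    static boolean nearlyEqual(double a, double b, double absoluteDelta, double relativeDelta) {
        if (a == b) {
            return true; // identical values, equal infinities, +0.0 and -0.0
        }
        if (Double.isInfinite(a) || Double.isInfinite(b)) {
            return false; // an infinity is only close to itself
        }
        double difference = Math.abs(a - b); // NaN makes both comparisons below false
        if (difference <= absoluteDelta) {
            return true;
        }
        double largest = Math.max(Math.abs(a), Math.abs(b));
        return difference <= largest * relativeDelta;
    }

    public static void main(String[] args) {
        System.out.println(nearlyEqual(0.1 + 0.2, 0.3, 1e-12, 1e-9));   // true
        System.out.println(nearlyEqual(1e20 + 1e5, 1e20, 1e-12, 1e-9)); // true
        System.out.println(nearlyEqual(1.0, 1.1, 1e-12, 1e-9));         // false
    }
}
```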

If you really need precision you can use arbitrary-precision arithmetic. In Java you have BigDecimal for that.
And you can use a maths library for interval arithmetic, which keeps track of your discrepancies. Then you know the lower and upper bounds of your results.
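With BigDecimal the everyday example works as expected, as long as you create the values from Strings. A small sketch (class and method names are invented):

```java
import java.math.BigDecimal;

class DecimalSum {
    /** Adds two decimal strings exactly and compares the sum with the expected value. */
    static boolean sumEquals(String a, String b, String expected) {
        BigDecimal sum = new BigDecimal(a).add(new BigDecimal(b));
        // compareTo ignores trailing zeroes, unlike equals ("0.30" vs "0.3")
        return sum.compareTo(new BigDecimal(expected)) == 0;
    }

    public static void main(String[] args) {
        System.out.println(sumEquals("0.1", "0.2", "0.3")); // true

        // Creating the values from doubles carries the binary imprecision along
        BigDecimal fromDoubles = new BigDecimal(0.1).add(new BigDecimal(0.2));
        System.out.println(fromDoubles.compareTo(new BigDecimal("0.3")) == 0); // false
    }
}
```

Keep in mind that this only helps for values with a finite decimal expansion. Something like 1/3 still has to be rounded at some point.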
