Friends, in this year of the Ghostbusters revival, I would like to talk about a paranormal phenomenon: floating point value migration. The problem statement here is simple:

- We have a large collection of numbers in one system
- We copy those numbers to another system
- Wanting to be sure the copy was successful, we compare the two sets of numbers using a test
- Naively, we would assume the numbers in the two systems should be “exactly” the same
- Huge numbers of the values are different and fail the test

What’s going on? There’s a 95.6999999999% chance that floating point numbers are what is going on, and if you aren’t afraid of floating point numbers, well, a little fear might be good.

## Review: Floating Point Drift

All numbers stored digitally are, by definition, rounded to the nearest single value that the digital representation can store. Sometimes that rounded value is precisely the original value, and sometimes it's a tiny fraction different from the original. The number 8 can be represented exactly and conveniently in digital form, while the number 1/3 might have to be stored as 0.333333, which is almost 1/3, but not precisely 1/3. We do this as a matter of course all day long, and most of the time it works without much effort, because the rounding increment is small relative to the number we are trying to store.
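The 1/3 example is easy to see for yourself. This sketch uses Python (the article's later examples use PowerShell, but the behavior is the same everywhere), with `Decimal()` to reveal the exact value a float actually holds:

```python
from decimal import Decimal
from fractions import Fraction

# 8 lands exactly on a tick of the binary number line
print(Decimal(8.0))        # 8

# 1/3 does not; Decimal() reveals the exact value the float actually holds
print(Decimal(1 / 3))      # 0.3333333333333333148296162562...

# the stored value is a perfectly good binary fraction -- just not 1/3
print(float(Fraction(1, 3)) == 1 / 3)   # True: same rounded tick
```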

Consider this illustration. The red arrow represents exactly a real number we intend to store, and the black number lines represent the resolution or coarseness of two possible kinds of digital representations, A and B, of that real number:

When the number is stored, it will have to be adjusted to one or the other black tick on the number lines, because it’s not possible to store the real, exact value. The density of those tick marks depends on the digital representation (data type) of the number.

Importantly, if that number’s representation is changed to another data type, say from A to B, then it may shift slightly, because it has to land on a value at the resolution of B. If we move it to a third representation, it might shift again. How the result relates to the original is no longer something you can test with equality. This might all seem academic until you have had to process data this way, watch it change in flight, and explain that baffling change to users or managers. It has real consequences if, for example, you are moving data from a database to a file and back, representing it in an application but storing it in a file, or comparing floating point numbers in a Hadoop system to those in a relational database.
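A minimal Python sketch of that shift, squeezing a 64-bit value through a 32-bit representation and back:

```python
import struct

x = 0.1                                            # a 64-bit double ("B" resolution)
x32 = struct.unpack('f', struct.pack('f', x))[0]   # round-tripped through 32-bit ("A")

print(x == x32)     # False: the value moved when the representation changed
print(repr(x32))    # 0.10000000149011612
```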

Informally, let’s call this phenomenon “drift.” Drift happens in computations with digital values, in storing them, and potentially in transporting them from system to system.

### Decimal or Float

For context, let’s compare three representations: Decimal, Floating Point and String. Imagine these as variations of those two number lines A and B.

A **decimal** is typically a data type modeled in the computer in such a way that a fixed precision of decimal numbers can be stored and retrieved with exact precision. That is, if I declare a decimal number to store 5 digits, with three decimal places, I can store 12.345 and when I retrieve it I will get precisely 12.345 back again. This seems like it would be the obvious best thing to do for most cases, but decimal has tradeoffs:

- It’s expensive to do this, because the natural representation in computers for numbers is base 2 and not base 10. This means it takes a lot more memory and storage to devise a way to precisely store decimal numbers having a lot of digits.
- Computation of this representation is dramatically slower than computation of binary numbers.
- The range (minimum to maximum value) of a decimal is very limited compared to other representations, as we’ll see in a minute.
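Python’s standard `decimal` module illustrates the exact-storage behavior described above (the DECIMAL(5,3)-style framing here is just an illustration):

```python
from decimal import Decimal

# a DECIMAL(5,3)-style value: five significant digits, three decimal places
d = Decimal('12.345')
print(d == Decimal('12.345'))   # True: exact storage and retrieval

# the same literal as a binary float is only approximately 12.345
print(Decimal(12.345))          # the exact binary value actually stored
```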

A **float** is typically a base 2 / binary representation of a number consisting of a fixed-width series of binary digits (the mantissa) together with an exponent and a sign. A standard, IEEE 754, details how to store a floating point number as bits, which makes working with floats more consistent across languages and operating systems.

The thing that probably trips most people up when using floats is that as humans we look at numbers and reason about numbers all in base 10, and it’s easy to forget that the computer is actually, constantly storing and processing them in base 2. It’s *never* working in base 10 on floats. It just shows us a base 10 approximation on our displays. It is as if the computer is working with one of those number lines in our diagram *but only shows the values to us using the other one*, because we humans can read 1.234 but not 00111111100111011111001110110110. (For the record, that is almost, but not exactly, 1.234.) We have a “feel” for numbers like 0.3333333 and understand that the decimal representation there is imprecise, but we have no similar frame of reference for base 2.
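You can peek at those bits yourself. This Python sketch packs 1.234 into IEEE 754 single precision and prints the raw bit pattern:

```python
import struct

# pack 1.234 as a 32-bit IEEE 754 float (big-endian) and show the raw bits
bits = ''.join(f'{b:08b}' for b in struct.pack('>f', 1.234))
print(bits)   # the same bit string quoted in the paragraph above
```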

Microsoft has some good advice about choosing between these types for SQL Server, which in my experience not very many people seem to heed:

“Approximate numeric data types **do not store the exact values specified for many numbers**; they store an extremely close approximation of the value. For many applications, the tiny difference between the specified value and the stored approximation is not noticeable. At times, though, the difference becomes noticeable. Because of the approximate nature of the float and real data types, do not use these data types when exact numeric behavior is required, such as in financial applications, in operations involving rounding, or in equality checks. Instead, use the integer, decimal, money, or smallmoney data types.”

https://technet.microsoft.com/en-us/library/ms187912(v=sql.105).aspx

Having a value portion and an exponent portion, floating point numbers look a lot like scientific notation: they can store a huge range of values, but the precision of those values goes down as the magnitude goes up, because every value gets the same fixed number of significant digits. For a lot of applications, that tradeoff is a good thing, and it’s what enables the large range of possible values.
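Python’s `math.ulp` (3.9+) exposes that shrinking precision directly: it returns the gap between a value and the next representable double.

```python
import math

# the spacing between adjacent representable doubles grows with magnitude
for x in (1.0, 1e6, 1e18):
    print(x, math.ulp(x))

# at 1e18 the gap is over a hundred, so adding 1 changes nothing
print(1e18 + 1.0 == 1e18)   # True
```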

So why use a float?

- They are compact
- They are fast for computation
- They can store a much larger range of values than implementations of decimal

Floating point numbers are great for the right applications. They work well in computer graphics for games and applications like CAD and 3D modeling, and for sensor data. They work poorly for financial calculations.

A **string** representation of a number is much less well defined, because there are so many formats. In general it’s a series of text characters (Unicode or ASCII, etc.) limited to 1234567890,.E+- and probably a few more. Strings are very much in play in a migration if the data originates from, passes through, or ends in files. The relationship between a string representation of a number in a file and a float is basically undefined.
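A quick Python sketch of how a trip through a string can lose information, and how a round-trip format avoids it:

```python
x = 1.0 / 3.0

s = f'{x:.6f}'               # fixed six decimals, like many file formats
print(s)                     # 0.333333
print(float(s) == x)         # False: digits were lost on the way out

# repr() is Python's round-trip format (the analogue of .NET's "R")
print(float(repr(x)) == x)   # True
```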

## I Don’t Believe You

Here are real examples that might blow your mind a little. These use PowerShell, but trust me, this happens everywhere.

```powershell
PS C:\> [float]$mynum = 0.1
$mynum
0.1
PS C:\> # OK but what is it really?
$mynum.ToString()
0.1
PS C:\> # No, I mean REALLY?
$mynum.ToString( "E18" )
1.000000010000000000E-001
PS C:\> # Wha? OK, let's look at the real, real representation
$mybytes = [BitConverter]::GetBytes( $mynum )
[array]::Reverse( $mybytes )
$mybits = $mybytes | % { [Convert]::ToString( $_, 2 ).PadLeft( 8, '0' ) }
$mybits -join " "
00111101 11001100 11001100 11001101
PS C:\> # Here's how round trips to other types can get sideways
$mynum.ToString( "E18" )
[double]$myothernum = [Double]::Parse( $mynum.ToString( "E18" ) )
$myothernum
1.000000010000000000E-001
0.100000001
PS C:\> # But this is all 0.1, right?
$mynum -eq $myothernum
False
PS C:\> # Here's a different example
[float]$example = 1.55555555
$example
1.555556
PS C:\> $example.ToString()
1.555556
PS C:\> # What's that *really* though?
$example.ToString( "E20" )
1.55555558000000000000E+000
PS C:\> # 8?! Can rounding help me?
[Math]::Round( $example, 10 )
1.555555582
PS C:\> [Math]::Round( $example, 9 )
1.555555582
PS C:\> [Math]::Round( $example, 8 )
1.55555558
PS C:\> [Math]::Round( $example, 7 )
1.5555556
PS C:\> # What just happened?
[Math]::Round( $example, 6 )
1.555556
PS C:\> # There is a "round trip" string format in .NET
$example.ToString()
$example.ToString( "R" )
1.555556
1.55555558
PS C:\> # But it's not the default
[float]$example2 = $example.ToString()
$example2
1.555556
PS C:\> # So
$example -eq $example2
False
PS C:\> $example.ToString( "R" )
$example2.ToString( "R" )
1.55555558
1.555556
```

If you want to play with other examples there’s this fantastic page where you can input decimal numbers and see what floating point does with them.

## OK, OK. So What Can We Do About It?

If you face this problem with a data migration, I suggest these precautions:

### Testing for equality is folly

The first step is to admit there’s a problem. If you are moving floating point data between systems, understand that they probably will not be the same afterward unless the process uses fierce rigor in preserving the floating point representation through serialization, conversion and so on. Plan for them to be different, and plan time to grapple with that.

### Examine the path for a value from one system to the other

Understanding where the variations come from is easier if you trace the exact path a value takes, and where the transformations from one numeric representation to another happen. For example, if you are going from one SQL Server to another through SSIS, and you can be totally rigorous about keeping floating point values in the same representation, things may turn out closer. If you are serializing values to a file and moving them to Hadoop, then investigate how much the numbers change.

### Rounding probably won’t work

Two things are tempting here: checking that the difference between values is small enough (`abs(a - b) < something`) and rounding the values before comparing. Both are worth trying, but know that there will still be cases where neither works.

The first usually fails because the “drift” gets progressively larger as numbers increase in size. If you know the numbers are expected to be in a certain range, then perhaps you can settle on some value for the limit of the difference, but that cannot be generalized for all values that can be stored in a float or double.

The second often fails because the result of rounding a float is still quite volatile for many real cases, and doesn’t end with equality. Consider using rounding on values like 0.045555 and 0.04444 – because 5 is a boundary for rounding up or down, the difference between rounded versions of those numbers, 0.05 and 0.04, might be much *greater* than the difference between the unrounded values, because the larger one increases in value and the smaller one decreases.
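The 0.045555 / 0.04444 example is easy to reproduce (shown here in Python; exact rounding rules vary by platform, which is itself part of the problem):

```python
a, b = 0.045555, 0.04444

print(abs(a - b))                        # about 0.0011 before rounding
print(round(a, 2), round(b, 2))          # 0.05 0.04
print(abs(round(a, 2) - round(b, 2)))    # about 0.01: rounding grew the gap
```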

### Look at the canonical test to compare floating point values

There is a standard methodology used in applications to compare floating point values with a variable relative difference “epsilon.” Although it has variations, this basic technique is in use for graphics, and so on, and it works.

https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/

Except then understand that this might not work either, because it is engineered to compare two values on the same system that have the same representation, not two values where one has been run through a migration process and the “drift” to which it was subjected is undetermined. Was it converted to a string and back? How? It’s very likely that migrated values have drifted past what a test like this considers equal, but are still within the range of what your application would accept as the same (e.g. 1.5555556 vs 1.5555555).
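For reference, a minimal Python sketch of that relative-epsilon comparison (the tolerances here are illustrative, not recommendations):

```python
def nearly_equal(a: float, b: float,
                 rel_eps: float = 1e-9, abs_eps: float = 1e-12) -> bool:
    """Relative comparison with an absolute floor near zero (a sketch)."""
    if a == b:                  # exact matches, including infinities
        return True
    diff = abs(a - b)
    # an absolute epsilon guards comparisons near zero, where relative tests fail
    if diff <= abs_eps:
        return True
    # a relative epsilon scales with the magnitude of the inputs
    return diff <= rel_eps * max(abs(a), abs(b))

print(nearly_equal(1.5555556, 1.5555555, rel_eps=1e-6))   # True
print(nearly_equal(0.1, 0.1 + 1e-3))                      # False
```

Python’s standard library ships essentially this check as `math.isclose(a, b, rel_tol=..., abs_tol=...)`.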

### Examine the edge cases in your data

Many of the problems with the comparison are around specific, sensitive points in the numbers, which you can locate and test around. Some of these are zero, very small numbers near zero, repeating decimals, and numbers right on significant boundaries like 1.299999999 to 1.300000000. Plan on an iterative process where you test and examine a subset of the data with different comparison methods and look at what values are failing. Try to find patterns in those and establish what difference is acceptable or not.

### Develop a two or three stage test for IsCloseEnough()

Ultimately comparing values across the systems may take a function that does a series of different comparisons to test whether the values are close enough to some business-determined limit. These could include:

- Compare only a subset of significant digits and the exponent to validate that the numbers are close enough. For example, the numbers 1.2345612 E 005 and 1.2345634 E 005 could be compared by checking that the first 6 digits of the value match, ignoring the last two, and compare the exponents. This could fail for some cases like 1.9999999 E 005 and 2.0000000 E 005, which are close without matching significant digits
- If the first test fails, fall into a second test that evaluates the difference between the numbers using the sliding “epsilon” technique, to see if they are close enough even though the first test failed.
- etc.
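A hypothetical two-stage IsCloseEnough along those lines might look like this in Python (the name, digit count and tolerance are illustrative and must be tuned to your data):

```python
def is_close_enough(a: float, b: float,
                    sig_digits: int = 6, rel_eps: float = 1e-6) -> bool:
    """Two-stage comparison sketch: significant digits, then relative epsilon."""
    # stage 1: same exponent and same first N significant digits
    # (scientific-notation formatting compares mantissa and exponent together)
    if f'{a:.{sig_digits - 1}E}' == f'{b:.{sig_digits - 1}E}':
        return True
    # stage 2: sliding relative epsilon for boundary cases stage 1 misses
    if a == b:
        return True
    return abs(a - b) <= rel_eps * max(abs(a), abs(b))

print(is_close_enough(1.2345612e5, 1.2345634e5))   # True, via stage 1
print(is_close_enough(0.99999949, 1.0000001))      # True, via stage 2
print(is_close_enough(1.234, 1.235))               # False
```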

If you have a significant quantity of data to check, ensure there is some method to validate subsets of the data, and that it reports the failing values and the reason they failed, so that the test method can be revised to home in on the best-fitting algorithm.

Further and much more rigorous reading:

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

## Comments

## Mike Decuir said:

Instead of comparing the absolute difference between a and b, you could instead compare the fractional difference, represented as abs((a-b)/a).

You could then represent 6 significant figures of accuracy as abs((a-b)/a) <= 0.000001.

Going a bit further, you could wrap everything in log base 10 and ensure that the order of magnitude of error is lower than -6.
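Mike’s suggestion can be sketched in Python (the values here are made up for illustration):

```python
import math

a, b = 123456.7, 123456.8

# fractional (relative) difference, as suggested
rel = abs((a - b) / a)
print(rel <= 1e-6)           # True: agreement to roughly six significant figures

# same idea via the order of magnitude of the error
print(math.log10(rel) < -6)  # True
```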