Wednesday, January 30, 2019

Floating point madness

In the course of squaring away sub-microsecond precision in aniso8601, the discussion came up that fractional components in general are rounded instead of truncated at maximum representable precision which can lead to the same kinds of problems pointed out in issue 10. At the time I pushed back because I figured it would lead to 'floating point madness' trying to get the correct behavior in all cases. Recently I was convinced to capitulate as I ran into a series of issues in my own use. Issue 21 covers this in some detail, but for those that want a complete breakdown, read on.




First, let's look at the problem in more detail. Basically, a user would never expect a value to be rounded up in the following case:

>>> aniso8601.parse_duration('P1999.99999999999997Y')
datetime.timedelta(days=730000)

What's wrong with that? Remember, by default, aniso8601 assumes a year is 365 days, 730000 / 365 gives 2000 years. No matter how you look at it, 1999.99999999999997 != 2000. These issues persist all the way up and down the range:

>>> aniso8601.parse_time('14:43.999999997')
datetime.time(14, 44)
>>> aniso8601.parse_time('14.99999999997')
datetime.time(15, 0)
>>> aniso8601.parse_duration('P1.999999999997D')
datetime.timedelta(days=2)
>>> aniso8601.parse_duration('P1.9999999999997W')
datetime.timedelta(days=14)

Okay, those are all incorrect as well, but at least it parses, is it really that big of a deal?

>>> aniso8601.parse_time('23.99999999997')
datetime.time(0, 0)

That one is pretty unforgivable. So what do we do? Well, if you need correct behavior now, the arbitrary precision attotimebuilder does not suffer from these issues, and the 64 bit types produced by numpytimebuilder require far smaller fractions to display the bad behavior.

But what about using the default Python types? We can't do like we did with microseconds and simply truncate to microsecond precision; microsecond precision of things like years, months, weeks, days, hours, and minutes does not come out to an even fraction. For example, a microsecond is 0.000000016667 of a minute.

Another option is to convert everything to seconds then truncate the seconds to microseconds. The remaining seconds would have to be decomposed to integer days, seconds, and microseconds. Definitely an option, and one I may revisit again in the future. Using this method runs to risk of increased rounding due to successive floating point operations which could be avoided by only converting the fractional component to seconds, instead of all components.

The solution that worked in the widest range of discovered 'problem cases' was successive decomposition. We start with the largest component, years, and decompose it into an integer year component, and convert any remaining fractional component into a floating point days component. Repeat for months and weeks. The fractional hours component is converted to minutes, and fractional minutes components to seconds. Finally, we truncate the seconds to microsecond precision.

One wrinkle in all of this is truncation of floats in Python. This solution from Stack Overflow is close, but misses a key detail: the format call in Python may round. That means the given code will round before any truncation can happen. This can be avoided by using the Python repr function which exists to give us a string representation that can be used to reconstruct the object using eval. For example:

>>> '{}'.format(float(.9999999999999999))
'1.0'
>>> repr(float(.9999999999999999))
'0.9999999999999999'

The modified truncate function can be seen here.

An additional tangentially related behavior was changed as well. Previously, when calculating an interval from a date and duration, the result would be promoted to a datetime if the duration had an hour, minute, or second component. This would lead to some inconsistencies with fractional duration components throwing away resolution, for example:

>>> aniso8601.parse_interval('2007-03-01/P1.5D')
(datetime.date(2007, 3, 1), datetime.date(2007, 3, 2))

Now, date components of intervals will be promoted to datetimes whenever the calculated timedelta component has second or microsecond resolution, as well as whenever the duration portion of the interval has a time component. The above example with the new behavior:

>>> aniso8601.parse_interval('2007-03-01/P1.5D')
(datetime.date(2007, 3, 1), datetime.datetime(2007, 3, 2, 12, 0))

All of these changes will be rolling out as part of aniso8601 5.0.0 in the next few months. The relativetimebuilder has been updated with these new behaviors as well, those changes will roll out with 0.2.0. Updated datetime promotion when building intervals is included with attotimebuilder 0.2.0 and numpytimebuilder 0.2.0 as well. All of the updated builders will launch at the same time as aniso8601 5.0.0.

No comments:

Post a Comment