Wednesday, June 12, 2019

It's all about the integers



As previously discussed in Floating point madness, the handling of fractional year, month, day, week, hour, and minute components of ISO 8601 dates and times is fraught with all kinds of peril. At the time, I chose a solution that was pretty immediately obvious, could be implemented quickly, solved a real problem with degenerate cases, and seemed intuitive. Even at the time, I mentioned it would only work for most cases, not all. Unfortunately, issue 24 highlights that working for most cases isn't good enough. Is there a solution that can work for all cases?




Before looking for a fix, let's explore the cause of issue 24. The problem comes from the fact floating point representations are not exact, and thus neither are the result of operations on floating point numbers. As noted in the bug report, the issue comes from splitting a string representing a floating point number ("11.858714") into integer and floating point parts ("11", ".858714"), and later readding them. We can demonstrate by simply plugging the numbers into the interpreter:

>>> float('11') + float('.858714')
11.858713999999999

So in the end, the microseconds would get truncated to 858713, instead of the 858714 we'd expect. A quick and dirty fix would be to simply not split and cast fractional seconds, as the fractional component of seconds can be easily converted directly to integer microseconds with no pesky floating point numbers required. In fact, I implemented exactly that fix. However, that leaves all kinds of floating point dragons lurking in the handling of fractional components larger than seconds. Surely we can slay them once and for all.

What if we instead handled the fractional component using only integers as suggested in the bug report? Can we even do that? The answer is yes we can! And it turns out you don't have to be nearly as clever as I thought.

Remember for a moment that decimal numbers can be represented as an exponent and a coefficient. For example, 12.34 can instead be written in standard form as 1.234 * 10^1 with 1.234 being the coefficient, and 1 being the exponent, or 1234 * 10^-2 which may be more helpful to us as as it can be represented entirely as integers. Also remember that we cannot handle sub-microsecond precision, so our end goal is for our fractional components to become microseconds.

So, let's say you have 12.34 years, and you want to convert the .34 years to microseconds, using only integers. .34 can be represented as 34 * 10^-2, but 10^-2 is a decimal, so instead you could do 34 / 100. From there, you would simply multiply by the number of microseconds in a year. Simple right? (34 / 100) * MICROSECONDS_PER_YEAR:

>>> MICROSECONDS_PER_YEAR = int(1e6) * 60 * 60 * 24 * 365
>>> (34 / 100) * MICROSECONDS_PER_YEAR
0

Oh no! .34 years is surely not 0 microseconds! Maybe not, but 34 / 100 is less than 1, so it's 0 in the land of integers. Luckily multiplication is commutative! So instead, we can do 34 * MICROSECONDS_PER_YEAR * 10^-2, but remember, we want all integer math. 10^-2 again becomes 1 / 100, so instead we do (34 * MICROSECONDS_PER_YEAR) * (1 / 100), which can be written as (34 * MICROSECONDS_PER_YEAR) / 100:


>>> (34 * MICROSECONDS_PER_YEAR) / 100
10722240000000

There we go! Let's generalize it some, we need the value we want to convert which comes to us from the ISO 8601 string, call it 'X_str' (which we also need cast to an integer, call that 'X_int'). We need a conversion factor to convert the value to microseconds, let's just call that 'conversion'. Remember that we only have year, month, week, day, hour, minute, and second, date and time components, so the value of conversion will always be greater than 0 and will always be an integer (you can verify this by calculating the 'MICROSECONDS_PER_YEAR', 'MICROSECONDS_PER_WEEK', 'MICROSECONDS_PER_DAY', 'MICROSECONDS_PER_HOUR', 'MICROSECONDS_PER_MINUTE', and 'MICROSECONDS_PER_SECOND' constants yourself). The result we're looking for is 'X_as_microseconds', so the generalization can be written:

X_as_microseconds = (X_int * conversion) / (10 ^ len(X_str))

The only thing new here is the 10 ^ len(X_str), which we use to derive our exponent. From the example above, our X_str would be "34", which has a length 2, 10 ^ 2 = 100, 34 / 100 = .34, same as our worked example.

An additional benefit of this method is since we converted directly to microseconds there is no sub-microsecond precision we have to worry about truncating, all truncation was handled for us simply by virtue of working with integers!

One last little wrinkle is that the Python datetime and timedelta constructors only take values in certain ranges, so we need to redistribute our microseconds to years, months, weeks, days, hours, minutes, seconds, and remaining microseconds. This is done by repeated application of modulo (or in practice, divmod) starting at the largest component (say, years), and ending at the smallest (seconds), and taking the final remainder as the number of microseconds.

All in all, a fun application of high school algebra. Thanks so much to Gary Donovan for the bug report, and the idea! It really wasn't as hard as I made it out to be... These changes were released as aniso8601 7.0.0 (and also RelativeTimeBuilder 2.0.0).

No comments:

Post a Comment