Thursday, October 25, 2018

aniso8601 and sub-microsecond precision

After more false starts than I can count, aniso8601 is finally ready to handle sub-microsecond precision. Anyone following issue #10 (now approaching three years old) will know that a solution wasn't immediately obvious. However, after fairly constant requests along the lines of "I want to parse an ISO 8601 timestamp and receive an X" and "why do you only support parsing to X and Y, I need the parse result to be a Z", I came to the realization that what people really wanted was a parser that just parses, plus the ability to build those parse results into whatever fits their use case best. Enter the 'builder' kwarg.

TLDR: New optional 'builder' keyword argument for all parse methods, 'relative' keyword argument going away. Default behavior is the same as always.



All of the top-level parsing methods ('parse_datetime', 'parse_time', 'parse_duration', 'parse_interval', and 'parse_repeating_interval') now take an optional 'builder' keyword argument. Builders are classes responsible for taking parse results and building them into whatever datetime representation the user wants. By default, a 'PythonTimeBuilder' is used, which returns the native Python types you would expect ('datetime', 'time', 'timedelta'), but a 'RelativeTimeBuilder' and a 'TupleBuilder' are provided as well. The 'RelativeTimeBuilder' enables calendar-level accuracy the same way the now-deprecated 'relative' keyword argument did in the past. The 'TupleBuilder' simply returns the parse result as a tuple of strings representing each component of the parse. It exists because these tuples are used as the internal representation of ISO 8601 parse results, but it may be of some use to other users as well.
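To make the idea concrete, here is a minimal, self-contained sketch of the parser/builder split. It is not aniso8601's actual code: 'toy_parse_date', 'ToyPythonBuilder', 'ToyTupleBuilder', and the 'build_date' method are hypothetical names made up for illustration, and the toy parser only understands the YYYY-MM-DD form.

```python
import datetime


class ToyTupleBuilder(object):
    """Hypothetical builder: returns the parsed components as strings."""

    @classmethod
    def build_date(cls, YYYY, MM, DD):
        return (YYYY, MM, DD)


class ToyPythonBuilder(object):
    """Hypothetical builder: returns a native datetime.date."""

    @classmethod
    def build_date(cls, YYYY, MM, DD):
        return datetime.date(int(YYYY), int(MM), int(DD))


def toy_parse_date(isodatestr, builder=ToyPythonBuilder):
    # Toy parser (an assumption for this sketch): only handles YYYY-MM-DD.
    YYYY, MM, DD = isodatestr.split('-')
    # The parser only splits the string; the builder decides the output type.
    return builder.build_date(YYYY, MM, DD)


# Default behavior: native Python type.
default_result = toy_parse_date('2018-10-25')

# Same parse, different output, selected purely by the builder argument.
tuple_result = toy_parse_date('2018-10-25', builder=ToyTupleBuilder)
```

The point of the split is that supporting a new output type means writing a new builder class, while the parsing code stays untouched.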

Additionally, I have written two builders that solve the sub-microsecond precision issue using different datetime implementations. The first, 'NumPyTimeBuilder', builds NumPy 'datetime64' and 'timedelta64' objects from ISO 8601 parse results. The second, 'AttoTimeBuilder', builds 'attotime', 'attodatetime', and 'attotimedelta' objects. These are part of my (new, unfinished) attotime library, which stores the nanosecond components of times as Python 'Decimal' objects, allowing for arbitrary, end user configurable precision. While not perfect, I think these projects show the utility of the builder system.
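As a rough illustration of why a 'Decimal' nanosecond component helps, the sketch below splits a seconds string into the part a native 'time' can hold and the remainder it cannot. This is only a sketch under my own assumptions, not how attotime is implemented; 'split_seconds' is a hypothetical helper.

```python
import datetime
from decimal import Decimal


def split_seconds(secondsstr):
    # Hypothetical helper: split an ISO 8601 seconds component such as
    # '56.1234567891' into whole seconds, whole microseconds, and the
    # leftover nanoseconds.
    seconds = Decimal(secondsstr)
    wholeseconds = int(seconds)
    microseconds_exact = (seconds - wholeseconds) * 1000000
    microseconds = int(microseconds_exact)
    # Whatever does not fit in a whole microsecond is kept as a Decimal
    # count of nanoseconds, so no digits are dropped.
    nanoseconds = (microseconds_exact - microseconds) * 1000
    return wholeseconds, microseconds, nanoseconds


wholeseconds, microseconds, nanoseconds = split_seconds('56.1234567891')

# The native type can only hold the first six fractional digits.
native = datetime.time(12, 34, wholeseconds, microseconds)

# The remaining digits survive exactly in the Decimal remainder.
print(native.isoformat(), nanoseconds)
```

Because the remainder is a 'Decimal' rather than a float or a fixed-width integer, the number of fractional digits retained is limited only by the decimal context's precision.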

The final justification for user-specified builders over built-in parse options is how infeasible it becomes to directly support every desired output format. The 'relative' option for parsing to dateutil 'relativedelta' objects was fine, but it did grow both the main codebase and the test surface. The project would only have grown further if 'numpy' and 'atto' keywords were added as well. In the past I have maintained various "bespoke" versions of aniso8601 for parsing to one-off formats. Maintaining these was a pain: they all started from different versions of aniso8601, backporting parser fixes was a nightmare, and so was keeping a myriad of private forks in sync with the main project. By splitting support for alternative formats off into separate projects, aniso8601 is free to remain a relatively simple, complete, tested ISO 8601 parser.

This design does carry with it some baggage. Unfortunately, code for range tests tends to be duplicated across builders. Additionally, builders will have to be mindful of tracking aniso8601 versions. At this point in time, there really isn't a particularly good API reference for implementing builders other than the 'builders.py' file itself. Better documentation should come with time, but for now this code really needs to get out the door so I don't need to port any new parsing fixes across significantly dissimilar codebases.

These changes are rolling out as aniso8601 4.0.0. Changes to the builder API (or any other API) will only happen in MAJOR versions (aniso8601 uses semver). The 'relative' keyword is deprecated and will go away in 5.0.0, with the 'RelativeTimeBuilder' becoming a separate project. aniso8601 will remove 'RelativeTimeBuilder' as a dependency in 6.0.0. I look forward to fixing any inevitable breakage you find. Until then, happy parsing!

Extra credit: Yeah, but is it still fast?

So, speed hasn't ever really been the goal of aniso8601, but I get fairly regular emails discussing it. While complete coverage of ISO 8601 remains the focus, I do care enough about performance to test it again, and things are even better now:

Results, aniso8601 4.0.0, Fedora 28, Python 2.7.15, i7 8550u @ 4.00 GHz:

strptime: 9.818 us.
aniso8601: 12.74 us.
pyiso8601: 18.72 us.

Results, aniso8601 3.0.2, Fedora 28, Python 2.7.15, i7 8550u @ 4.00 GHz:

strptime: 8.065 us.
aniso8601: 18.02 us.
pyiso8601: 18.04 us.
 
I haven't profiled it in any detail, so I don't really know why it's faster, but I'll take it!

For posterity, the test code:


#! /usr/bin/env python2

strptimesetup = """\
import datetime
import random

toparse = datetime.datetime.fromtimestamp(random.randrange(2**16)).isoformat()
"""

anisosetup = """\
import aniso8601
import datetime
import random

toparse = datetime.datetime.fromtimestamp(random.randrange(2**16)).isoformat()
"""

pyisosetup = """\
import iso8601
import datetime
import random

toparse = datetime.datetime.fromtimestamp(random.randrange(2**16)).isoformat()
"""

if __name__ == '__main__':
    import timeit

    iterations = 100000

    strptimeresult = timeit.timeit('datetime.datetime.strptime(toparse, \'%Y-%m-%dT%H:%M:%S\')', setup=strptimesetup, number=iterations)
    anisoresult = timeit.timeit('aniso8601.parse_datetime(toparse)', setup=anisosetup, number=iterations)
    pyisoresult = timeit.timeit('iso8601.parse_date(toparse)', setup=pyisosetup, number=iterations)

    print 'strptime: {:.4} us.'.format(strptimeresult / iterations * 1e6)
    print 'aniso8601: {:.4} us.'.format(anisoresult / iterations * 1e6)
    print 'pyiso8601: {:.4} us.'.format(pyisoresult / iterations * 1e6)
