Monday, August 18, 2014

The great date parser race

Recently I've had a couple of conversations that made me wonder about Python parsers for ISO 8601, a topic I have a personal interest in.

The first conversation was about how a library providing some fundamental functionality turned out to be a bottleneck in a much larger program.

The second was about how slow regular expressions seem to be on ARM systems.

Parsing dates is a pretty fundamental function, my date parser runs on ARM, and some of its competition (pyiso8601) uses regular expressions, so let's race!




The test is simple: parse 100,000 random datetimes in the ISO 8601 format <date>T<time> (e.g. 2007-04-05T14:30:18). The competitors are aniso8601, pyiso8601, and Python's own strptime with the format string "%Y-%m-%dT%H:%M:%S".
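
For reference, here is a minimal sketch of what a single parse looks like for the strptime competitor, using the example string above. It needs only the standard library; the names ISO_FORMAT, example, and parsed are mine, purely for illustration.

```python
import datetime

# Format string from the post; it matches <date>T<time> with no timezone.
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"

example = "2007-04-05T14:30:18"
parsed = datetime.datetime.strptime(example, ISO_FORMAT)

# Round-trip back to text to confirm nothing was lost.
print(parsed.isoformat())
```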

Tests were done with the latest versions available on PyPI (0.83 for aniso8601, 0.1.10 for pyiso8601), and whatever version of Python 2 I had installed.

The code:

#! /usr/bin/env python

strptimesetup = """\
import datetime
import random

toparse = datetime.datetime.fromtimestamp(random.randrange(2**16)).isoformat()
"""

anisosetup = """\
import aniso8601
import datetime
import random

toparse = datetime.datetime.fromtimestamp(random.randrange(2**16)).isoformat()
"""

pyisosetup = """\
import iso8601
import datetime
import random

toparse = datetime.datetime.fromtimestamp(random.randrange(2**16)).isoformat()
"""

if __name__ == '__main__':
    import timeit

    # timeit.repeat re-runs the setup before every repetition, so with
    # number=1 each timing parses a freshly generated random datetime.
    strptimeresult = timeit.repeat("datetime.datetime.strptime(toparse, '%Y-%m-%dT%H:%M:%S')", setup=strptimesetup, repeat=100000, number=1)
    anisoresult = timeit.repeat('aniso8601.parse_datetime(toparse)', setup=anisosetup, repeat=100000, number=1)
    pyisoresult = timeit.repeat('iso8601.parse_date(toparse)', setup=pyisosetup, repeat=100000, number=1)

    # Average time per parse, converted from seconds to milliseconds.
    print('strptime: {:.4} ms.'.format(sum(strptimeresult) / len(strptimeresult) * 1000))
    print('aniso8601: {:.4} ms.'.format(sum(anisoresult) / len(anisoresult) * 1000))
    print('pyiso8601: {:.4} ms.'.format(sum(pyisoresult) / len(pyisoresult) * 1000))

Nothing too fancy here. For each parser, we generate a random date and parse it, 100,000 times each (repeat=100000 in the code). Then we print the average time per parse, converted to milliseconds. Note that the timestamps we generate are kept within the range of 16-bit integers to avoid issues on ARM.
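
To make the mechanics concrete, here is a scaled-down sketch of the same measurement using only strptime, so it runs without installing anything. The repeat count of 1000 and the names SETUP, STATEMENT, timings, and mean_ms are my own choices for illustration, not part of the script above.

```python
import timeit

# Setup code: timeit.repeat runs this once per repetition, so every
# timing sees a freshly generated random datetime string.
SETUP = """\
import datetime
import random

toparse = datetime.datetime.fromtimestamp(random.randrange(2**16)).isoformat()
"""

STATEMENT = "datetime.datetime.strptime(toparse, '%Y-%m-%dT%H:%M:%S')"

# 1000 repetitions of a single parse each (far fewer than the real run).
timings = timeit.repeat(STATEMENT, setup=SETUP, repeat=1000, number=1)

# timeit reports seconds; average and convert to milliseconds.
mean_ms = sum(timings) / len(timings) * 1000
print("strptime: {:.4} ms.".format(mean_ms))
```

With number=1 each entry in timings is the cost of exactly one parse, which is why a plain average of the list gives the per-parse time.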

Results, Fedora 20, Python 2.7.5, Xeon E3-1230 v3 @ 3.30 GHz:

strptime: 0.0120228981972 ms.
aniso8601: 0.0233640527725 ms.
pyiso8601: 0.0345859408379 ms.


Other than all competitors being slower than I figured, the results are pretty unspectacular. None of these options should be slowing down anyone's program. Most importantly, my baby isn't slowest!

Results, Guruplug Server Plus, Debian 7.4 (Wheezy), Python 2.7.3, Feroceon 88FR131 @ 1.2 GHz (ARM9E):

strptime: 0.363 ms.
aniso8601: 0.6164 ms.
pyiso8601: 1.002 ms.


Similar, except everything takes roughly 26 to 30 times as long.
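
The slowdown can be checked directly from the two sets of results. This sketch just divides the ARM timings by the Xeon timings; the dictionary names are mine.

```python
# Average parse times in milliseconds, copied from the results above.
xeon_ms = {
    "strptime": 0.0120228981972,
    "aniso8601": 0.0233640527725,
    "pyiso8601": 0.0345859408379,
}
arm_ms = {
    "strptime": 0.363,
    "aniso8601": 0.6164,
    "pyiso8601": 1.002,
}

# Slowdown factor of the ARM board relative to the Xeon, per parser.
slowdown = {name: arm_ms[name] / xeon_ms[name] for name in xeon_ms}

for name in ("strptime", "aniso8601", "pyiso8601"):
    print("{}: {:.1f}x".format(name, slowdown[name]))
```

That works out to about 30x for strptime, 26x for aniso8601, and 29x for pyiso8601, so the ranking is the same on both machines.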

So, what did we learn? All three options are plenty fast. My horse in the race, aniso8601, is faster than pyiso8601 (and has significantly more complete coverage of the standard, which is why I wrote it), but slower than just parsing with strptime in Python (which is only an option for some ISO 8601 strings). Also, this use of regular expressions shows no weakness on ARM.
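
To illustrate why strptime with a fixed format string only covers some ISO 8601 strings, here is a small sketch. The helper try_strptime and the sample strings are my own; the samples are valid ISO 8601 forms (a UTC designator, a week date, an ordinal date) that the format string from the race does not match.

```python
import datetime

# The fixed format string used for strptime in the race.
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"


def try_strptime(text):
    """Return the parsed datetime, or None if the fixed format does not match."""
    try:
        return datetime.datetime.strptime(text, ISO_FORMAT)
    except ValueError:
        return None


samples = [
    "2007-04-05T14:30:18",   # the form used in the race
    "2007-04-05T14:30:18Z",  # trailing UTC designator
    "2007-W14-4T14:30:18",   # week date
    "2007-095T14:30:18",     # ordinal date
]

for text in samples:
    print("{}: {}".format(text, try_strptime(text)))
```

Only the first sample parses; the rest come back as None, and handling them with strptime would mean trying a different format string for each variant.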
