
importing dateparser is too slow #1051

Open
anarcat opened this issue Mar 24, 2022 · 2 comments

Comments


anarcat commented Mar 24, 2022

Hi!

First, thanks for this awesome project; it's really useful and powerful, and I'm grateful not to have to write this stuff myself. :)

I'm opening this issue because there seems to be an inherent performance cost paid whenever the dateparser library is merely imported:

anarcat@curie:undertime(main)$ multitime -n 10 -s 0 -q python3 -c "import dateparser"
===> multitime results
1: -q python3 -c "import dateparser"
            Mean        Std.Dev.    Min         Median      Max
real        0.328       0.008       0.319       0.326       0.350       
user        0.313       0.009       0.299       0.315       0.331       
sys         0.013       0.008       0.000       0.014       0.028       

Compare with similar libraries:

anarcat@curie:undertime(main)$ multitime -n 10 -s 0 -q python3 -c "import parsedatetime"
===> multitime results
1: -q python3 -c "import parsedatetime"
            Mean        Std.Dev.    Min         Median      Max
real        0.072       0.008       0.069       0.070       0.096       
user        0.065       0.011       0.050       0.062       0.095       
sys         0.008       0.005       0.000       0.008       0.019       
anarcat@curie:undertime(main)$ multitime -n 10 -s 0 -q python3 -c "import arrow"
===> multitime results
1: -q python3 -c "import arrow"
            Mean        Std.Dev.    Min         Median      Max
real        0.064       0.006       0.061       0.062       0.081       
user        0.055       0.006       0.042       0.054       0.064       
sys         0.009       0.006       0.000       0.010       0.019       

A quick profile seems to show it spends an inordinate amount of time compiling regular expressions:

anarcat@curie:undertime(main)$ python3 -m cProfile -s cumulative <(echo "import dateparser") | head -50
         627961 function calls (598845 primitive calls) in 0.558 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        9    0.000    0.000    0.593    0.066 __init__.py:1(<module>)
     60/1    0.000    0.000    0.558    0.558 {built-in method builtins.exec}
        1    0.000    0.000    0.558    0.558 63:1(<module>)
     73/1    0.000    0.000    0.558    0.558 <frozen importlib._bootstrap>:1002(_find_and_load)
     73/1    0.000    0.000    0.558    0.558 <frozen importlib._bootstrap>:967(_find_and_load_unlocked)
     70/1    0.000    0.000    0.557    0.557 <frozen importlib._bootstrap>:659(_load_unlocked)
     58/1    0.000    0.000    0.557    0.557 <frozen importlib._bootstrap_external>:784(exec_module)
     93/1    0.000    0.000    0.557    0.557 <frozen importlib._bootstrap>:220(_call_with_frames_removed)
        1    0.000    0.000    0.556    0.556 date.py:1(<module>)
        1    0.000    0.000    0.519    0.519 date_parser.py:1(<module>)
        1    0.000    0.000    0.500    0.500 timezone_parser.py:1(<module>)
     1901    0.034    0.000    0.485    0.000 regex.py:451(_compile)
      795    0.004    0.000    0.477    0.001 regex.py:349(compile)
      770    0.002    0.000    0.357    0.000 timezone_parser.py:56(build_tz_offsets)
      769    0.006    0.000    0.345    0.000 timezone_parser.py:58(get_offset)
 2255/755    0.008    0.000    0.159    0.000 _regex_core.py:382(_parse_pattern)

Basically, it seems we're spending a lot of time compiling regular expressions. Individually, each compile doesn't matter much (percall ≈ 1 ms), but we seem to be doing hundreds of them. I think it might be related to the timezone_parser.py file (build_tz_offsets?), but I stopped digging there.

The exact source is a little beside the point: shouldn't merely importing the module be cheap, performance-wise? I know a default parser gets loaded, but that's not what's eating the time here; it's a bunch of globals in timezone_parser.py. It seems to me those could be lazily loaded, at least?
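To sketch what I mean by lazy loading (this is an illustration, not a patch against dateparser's actual code; the function name is made up): wrap each expensive re.compile behind a cached accessor, so nothing compiles at import time and each pattern is built at most once, on first use.

```python
import functools
import re

@functools.lru_cache(maxsize=None)
def get_tz_pattern(tz_name: str):
    # The expensive re.compile() runs only on the first call for a given
    # name; lru_cache memoizes the compiled pattern for later callers.
    return re.compile(rf"\b{re.escape(tz_name)}\b", re.IGNORECASE)

# Import time pays nothing; the first lookup pays the compile cost once.
match = get_tz_pattern("UTC").search("meeting at 15:00 UTC")
```

Repeat calls return the same compiled object, so the hot path costs one dict lookup instead of a fresh compile.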


anarcat commented Mar 24, 2022

Oh, and in case you're wondering why this matters to me: I wrote a tool called undertime that shows you different times in different zones, as a one-shot command-line tool. Most of its runtime is spent building regexes it never uses. :)

I'm now lazily loading dateparser itself, but the user can definitely "feel" it when that corner case is hit.
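The deferred-import workaround looks roughly like this (a sketch; a cheap stdlib module stands in for a heavy dependency like dateparser, and the function name is invented):

```python
def parse_value(text: str):
    # Deferred import: the heavy module is loaded only on the code path
    # that actually needs it, keeping startup (and e.g. --help) fast.
    import decimal  # stand-in here for a heavy dependency like dateparser
    return decimal.Decimal(text)
```

This keeps the common path fast, but the first call on the slow path still pays the full import cost, which is exactly what the user "feels".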

@mlissner

We use a lot of data objects in our libraries that usually load from JSON, and moving them from load-on-import to lazy-load has been helpful. It's not too hard, and it's been reliable for us.
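A minimal version of that pattern (a sketch; the inline JSON literal stands in for a real data file, and the function name is made up):

```python
import functools
import json

@functools.lru_cache(maxsize=1)
def tz_offsets():
    # Parsed on the first call rather than at import time; the result is
    # memoized, so every later caller gets the same dict for free.
    return json.loads('{"UTC": 0, "EST": -18000}')  # stand-in for a data file
```

Module import stays cheap, and code that never touches the data never pays for parsing it.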
