PERF: Very slow printing of Series with DatetimeIndex #19764

jorisvandenbossche · 2018-02-19T10:39:12Z

Consider this series with a datetime index:

s = pd.Series(np.random.randn(int(1e6)), 
              index=pd.date_range("2012-01-01", freq='s', periods=int(1e6)))

Showing this in the console or notebook (doing just s) is very slow (noticable delay of 1 to 2 seconds in console, around 8 seconds in the notebook or JupyterLab ). But, on the other hand, in a plain Python console the display is instantly.

Further, on the other hand, if the series has no DatetimeIndex, the display appears instantly. Also, if it is a frame (doing s.to_frame()) the display is instantly. And when doing print(s) explicitly, the display is actually identical and also instantly. When it is another datetime-like index, eg s.to_period(), the display is instant.

So it seems we are doing some work under the hood that is not needed. And it is somehow triggered by doing this in an IPython context, and specifically for DatetimeIndex.

This already seems to be present on some older versions of pandas as well.

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2018-02-19T10:43:57Z

cc @takluyver what is the mechanism how in general an object is displayed? As the __repr__ or __str__ itself is fast, but displaying in an IPython console is not.

jorisvandenbossche · 2018-02-19T12:47:01Z

Since seems to be present in IPython >= 6.1 (I don't see it in an environment with IPython 6.0). And in an ipython console, the overhead comes from MimeBundleFormatter.__call__ calling get_real_method, which in the end does a getattr on the series to check if a _repr_mimebundle_ exists.

So in the end it is due to getattr taking a huge amount of time on a Series with a DatetimeIndex. With the example from above:

In [29]: %timeit getattr(s, '_bla', None)
1.39 s ± 31.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [31]: s2 = s.reset_index(drop=True)

In [32]: %timeit getattr(s2, '_bla', None)
3.54 µs ± 15.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [33]: df = s.to_frame()

In [34]: %timeit getattr(df, '_bla', None)
3.64 µs ± 67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And from the same profiling, it seems it is doing expensive get_loc calls on the index a few times, which stem from a 'contains' call in generic.py __getattr__ (name in s.index), and timings for this:

In [41]: %timeit '_bla' in s.index
517 ms ± 69.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [42]: %timeit '_bla' in s2.index
1.4 µs ± 79.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [43]: %timeit '_bla' in s3.index
2.91 µs ± 149 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

which in turn steps from a very slow get_loc of the DatetimeIndex engine. I suppose it is doing (for this case) way too much work because it tries to interpret the string as a datetime string.

chris-b1 · 2018-02-19T17:25:04Z

I didn't know we did dot-attribute indexing for a Series - seems odd - something we want to support / actually gets used?

In [10]: s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [11]: s.c
Out[11]: 3

chris-b1 · 2018-02-19T17:26:37Z

I guess it is fairly prominent in the indexing docs, just missed it.

https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access

jorisvandenbossche · 2018-02-19T17:37:31Z

Yeah, the dataframe equivalent is more known and used, I personally don't think many people do that for series as well.

For the original issues, I think we should use in __getattr__ somehow a 'smarter' contains check, because I don't think we want to support any string-like to be used as attribute for DatetimeIndex (I am thinking if there is actually one that would be a valid attribute name, as it cannot start with a number).

chris-b1 · 2018-02-19T17:41:58Z

There also seems to be fundamental performance problem with DatetimeIndex.get_loc - maybe same as #17754

In [26]: dti = s.index

In [27]: dti.get_loc('2012-01-01')
Out[27]: slice(0, 86400, None)

In [28]: %timeit dti.get_loc('2012-01-01')
1.24 s ± 25.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [29]: %timeit dti.get_loc(pd.Timestamp('2012-01-01'))
11 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

jorisvandenbossche · 2018-02-20T10:44:43Z

Explored a bit what DatetimeIndex.get_loc and DatetimeEngine.get_loc are doing, and a lot is going on:

Something like dti.get_loc('blabla') ultimately calls DatetimeEngine.get_loc 2 times, and dti.get_loc('2012-01-01 00:00:00') as well. While we can know in advance that DatetimeEngine.get_loc will fail as this method does not work for strings (it needs a datetime-like)
A single call to DatetimeEngine.get_loc('string'), although it will never work, takes a lot of time. The ultimate reason for this long time is that it is trying a values.searchsorted(key) where values are the int64 data and key is the string. The numpy searchsorted will then cast under the hood the int values to string or object (whather the common dtype between values and key), which is slow. The value that then comes out of it, in case of a string, is outside the range (just added to the right of the values) and we check this and raise a KeyError. While in case of a string we just never should have tried.
When I would have used int(1e6) - 1 instead of int(1e6) values, we are below the size threshold, and it takes a different code path. In that case the situation is better (we still do unneeded tries of get_loc, but they are not that slow in that case)

TomAugspurger · 2018-04-26T19:05:12Z

Correctness question, is this the right output?

In [5]: pd.date_range('2017', periods=12)
Out[5]:
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
               '2017-01-09', '2017-01-10', '2017-01-11', '2017-01-12'],
              dtype='datetime64[ns]', freq='D')

In [6]: pd.date_range('2017', periods=12).get_loc("2017")
Out[6]: slice(0, 12, None)

The docstring says in int is returned. The closes analog I can think of is IntervalIndex w/ overlapping intervals.

In [12]: pd.IntervalIndex.from_tuples([(0, 1), (0, 2), (0, 3)]).get_loc(0.5)
Out[12]: array([0, 1, 2])

TomAugspurger · 2018-04-26T19:15:46Z

Second question, should DatetimeIndex.get_loc(int) ever work? We test that it does.

In [11]: s.index.get_loc(s.index[0].value)
Out[11]: 0

That's taking the underlying integer representation. cc @jreback

jorisvandenbossche · 2018-04-26T19:17:42Z

Not really sure, should dive into the code again.

But, to fix the actual regression, I think the easier path will be to avoid any get_loc call that is coming from __getattr__ altogether, as I mentioned in #19764 (comment) (not that get_loc shouldn't be fixed)

TomAugspurger · 2018-04-26T19:19:41Z

Indeed, I got curious though :) Perhaps I should keep performance questions in #17754.

I'll investigate the getattr stuff now (for 0.23).

jorisvandenbossche · 2018-04-26T19:21:59Z

So currently, Series/DataFrame.__getattr__ does a if name in self._info_axis, which then calls Index.__contains__, which does the engine lookup.

So a quick dirty "hack" would be to add a _contains_getattr method to the base Index class which is equal to __contains__, and then DatetimeLikeIndex can override this to just return False (as a valid attribute name can never be contained in the index).

Is that too dirty of a hack?

Closes pandas-dev#19764

TomAugspurger · 2018-04-26T19:51:43Z

See #20834.

jorisvandenbossche added Performance Memory or execution speed performance Output-Formatting __repr__ of pandas objects, to_string labels Feb 19, 2018

jorisvandenbossche added this to the 0.23.0 milestone Feb 19, 2018

jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Mar 29, 2018

jreback modified the milestones: 0.23.0, Next Major Release Apr 14, 2018

jorisvandenbossche modified the milestones: Next Major Release, 0.23.0 Apr 24, 2018

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Apr 26, 2018

PERF: Faster Series.__getattribute__

ad6a202

Closes pandas-dev#19764

TomAugspurger mentioned this issue Apr 26, 2018

PERF: Faster Series.__getattribute__ #20834

Merged

jorisvandenbossche closed this as completed in #20834 May 1, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: Very slow printing of Series with DatetimeIndex #19764

PERF: Very slow printing of Series with DatetimeIndex #19764

jorisvandenbossche commented Feb 19, 2018

jorisvandenbossche commented Feb 19, 2018

jorisvandenbossche commented Feb 19, 2018

chris-b1 commented Feb 19, 2018

chris-b1 commented Feb 19, 2018

jorisvandenbossche commented Feb 19, 2018

chris-b1 commented Feb 19, 2018

jorisvandenbossche commented Feb 20, 2018

TomAugspurger commented Apr 26, 2018

TomAugspurger commented Apr 26, 2018

jorisvandenbossche commented Apr 26, 2018

TomAugspurger commented Apr 26, 2018 •

edited

Loading

jorisvandenbossche commented Apr 26, 2018

TomAugspurger commented Apr 26, 2018

PERF: Very slow printing of Series with DatetimeIndex #19764

PERF: Very slow printing of Series with DatetimeIndex #19764

Comments

jorisvandenbossche commented Feb 19, 2018

jorisvandenbossche commented Feb 19, 2018

jorisvandenbossche commented Feb 19, 2018

chris-b1 commented Feb 19, 2018

chris-b1 commented Feb 19, 2018

jorisvandenbossche commented Feb 19, 2018

chris-b1 commented Feb 19, 2018

jorisvandenbossche commented Feb 20, 2018

TomAugspurger commented Apr 26, 2018

TomAugspurger commented Apr 26, 2018

jorisvandenbossche commented Apr 26, 2018

TomAugspurger commented Apr 26, 2018 • edited Loading

jorisvandenbossche commented Apr 26, 2018

TomAugspurger commented Apr 26, 2018

TomAugspurger commented Apr 26, 2018 •

edited

Loading