PERF: Very slow printing of Series with DatetimeIndex #19764
Comments
cc @takluyver: what is the general mechanism by which an object is displayed? As the |
Since seems to be present in IPython >= 6.1 (I don't see it in an environment with IPython 6.0). And in an ipython console, the overhead comes from So in the end it is due to
And from the same profiling, it seems it is doing expensive
which in turn stems from a very slow |
I didn't know we did dot-attribute indexing for a Series. It seems odd. Is it something we want to support, and does it actually get used?
In [10]: s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
In [11]: s.c
Out[11]: 3 |
I guess it is fairly prominent in the indexing docs, just missed it. https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access |
Yeah, the DataFrame equivalent is better known and more widely used; I personally don't think many people do that for Series. For the original issue, I think we should use in |
There also seems to be a fundamental performance problem with
In [26]: dti = s.index
In [27]: dti.get_loc('2012-01-01')
Out[27]: slice(0, 86400, None)
In [28]: %timeit dti.get_loc('2012-01-01')
1.24 s ± 25.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [29]: %timeit dti.get_loc(pd.Timestamp('2012-01-01'))
11 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) |
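The two call styles timed above can be compared with a small sketch. The index built here is an assumption (two days at one-second resolution starting on 2012-01-01), chosen only because it is consistent with the slice(0, 86400, None) result shown; the original series is not reproduced in this thread.

```python
import pandas as pd

# Assumed shape: two days at one-second resolution starting 2012-01-01.
dti = pd.date_range("2012-01-01", periods=2 * 86400, freq="s")

# A date string at day resolution matches every second of that day,
# so the result is a slice rather than a single position.
by_string = dti.get_loc("2012-01-01")

# A Timestamp is matched exactly and resolves to a single position.
by_timestamp = dti.get_loc(pd.Timestamp("2012-01-01"))

print(by_string)
print(by_timestamp)
```

The string path has to parse the label and determine its resolution before searching, while the Timestamp path is a direct lookup. Absolute timings will vary with the pandas version.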
Explored a bit what
|
Correctness question: is this the right output?
In [5]: pd.date_range('2017', periods=12)
Out[5]:
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
'2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
'2017-01-09', '2017-01-10', '2017-01-11', '2017-01-12'],
dtype='datetime64[ns]', freq='D')
In [6]: pd.date_range('2017', periods=12).get_loc("2017")
Out[6]: slice(0, 12, None)
The docstring says an int is returned. The closest analog I can think of is
In [12]: pd.IntervalIndex.from_tuples([(0, 1), (0, 2), (0, 3)]).get_loc(0.5)
Out[12]: array([0, 1, 2]) |
Second question, should
In [11]: s.index.get_loc(s.index[0].value)
Out[11]: 0
That's taking the underlying integer representation. cc @jreback |
Not really sure, should dive into the code again. But, to fix the actual regression, I think the easier path will be to avoid any |
Indeed, I got curious though :) Perhaps I should keep performance questions in #17754. I'll investigate the getattr stuff now (for 0.23). |
So currently, So a quick dirty "hack" would be to add a Is that too dirty of a hack? |
See #20834. |
Consider this series with a datetime index:
Showing this in the console or notebook (doing just s) is very slow (a noticeable delay of 1 to 2 seconds in the console, around 8 seconds in the notebook or JupyterLab). In a plain Python console, on the other hand, the display is instant. If the series has no DatetimeIndex, the display is also instant. The same holds for a frame (doing s.to_frame()). When doing print(s) explicitly, the output is identical and also instant. With another datetime-like index, e.g. s.to_period(), the display is instant as well.
So it seems we are doing some work under the hood that is not needed. It is somehow triggered by doing this in an IPython context, and specifically for DatetimeIndex.
This already seems to be present on some older versions of pandas as well.
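The series definition is not shown in this copy of the report. As a rough sketch for trying the variants described above, the following builds a series of an assumed shape (one-second frequency over two days, integer values); the size and values are guesses, not the reporter's data.

```python
import pandas as pd

# Assumed shape; the reporter's actual series is not shown.
index = pd.date_range("2012-01-01", periods=2 * 86400, freq="s")
s = pd.Series(range(len(index)), index=index)

# Variants mentioned in the report as displaying instantly.
as_frame = s.to_frame()    # DataFrame with the same DatetimeIndex
as_period = s.to_period()  # same values on a PeriodIndex

# print(s) uses str(s); a bare `s` at a prompt starts from repr(s).
same_text = repr(s) == str(s)
print(same_text)
```

Evaluating a bare s, as_frame, and as_period at an IPython prompt, and comparing with print(s), is one way to check whether the delay described here is present in a given environment.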