Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Fixed regression in Series(index=idx) constructor #20865

Merged
merged 4 commits into from
Apr 30, 2018

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Apr 29, 2018

From #18496

Special cases empty series construction, since the reindex is not necessary.

xref #18532 (comment)

Setup

data = None
idx = date_range(start=datetime(2015, 10, 26),
                              end=datetime(2016, 1, 1),
                              freq='50s')
dict_data = dict(zip(idx, range(len(idx))))
data = None if data is None else dict_data

benchmark

%timeit out = Series(data, index=idx)

HEAD:

113 µs ± 6.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

0.21.0

96.5 µs ± 6.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Master

320 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

From pandas-dev#18496

Special cases empty series construction, since the reindex is not necessary.
@TomAugspurger TomAugspurger added this to the 0.23.0 milestone Apr 29, 2018
@TomAugspurger
Copy link
Contributor Author

cc @toobaz

@TomAugspurger TomAugspurger added the Performance Memory or execution speed performance label Apr 29, 2018
@codecov
Copy link

codecov bot commented Apr 29, 2018

Codecov Report

Merging #20865 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20865      +/-   ##
==========================================
+ Coverage   91.77%   91.78%   +<.01%     
==========================================
  Files         153      153              
  Lines       49313    49327      +14     
==========================================
+ Hits        45259    45275      +16     
+ Misses       4054     4052       -2
Flag Coverage Δ
#multiple 90.17% <100%> (ø) ⬆️
#single 41.92% <100%> (+0.03%) ⬆️
Impacted Files Coverage Δ
pandas/core/series.py 94.03% <100%> (+0.03%) ⬆️
pandas/io/formats/latex.py 100% <0%> (ø) ⬆️
pandas/core/arrays/categorical.py 95.61% <0%> (+0.04%) ⬆️
pandas/util/testing.py 84.59% <0%> (+0.2%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 563a6ad...d8b1312. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add an asv & a whatsnew (needed?)

@@ -207,10 +211,20 @@ def __init__(self, data=None, index=None, dtype=None, name=None,
else:
data = data.reindex(index, copy=copy)
data = data._data
elif isinstance(data, dict):
elif isinstance(data, dict) and (len(data) or index is None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than doing this here, why not inside _init_dict itself? (you can simply return early if these conditions ar met). you can also return the type, copy parameter as well. ideally like to locate all of a single 'type' code in the same place (dict here)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - in particular, _init_dict already checks whether data is empty

@TomAugspurger
Copy link
Contributor Author

Good call on the init_dict. For some reason I thought it had to be in the main Series.init, since _init_dict returns a BlockManager & index, but this seems to work.

In [4]: %timeit out = Series(data, index=idx)
140 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Shouldn't need a whatsnew since #18496 was in 0.23. I think that was the original slowdown.

And this was caught by the series_methods.SeriesConstructor.time_constructor(None) asv, so we're good there (that's the benchmark I've been running).

@TomAugspurger TomAugspurger mentioned this pull request Apr 30, 2018
71 tasks
@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Apr 30, 2018

Does d8b1312 change your opinion on putting this stuff in _init_dict? It's getting borderline too messy now.

Series._init_dict sorts the index (keys) (well on non-PY36). This would break

Series(index=['b', 'a'])

Since the user input isn't actually dict-like, it's just going down _init_dict.

@toobaz
Copy link
Member

toobaz commented Apr 30, 2018

I might be wrong (can't test now, sorry), but a simpler solution could maybe be to replace

if data is None:

with

if data in [None, {}]:
    data = []

?

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Apr 30, 2018 via email

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Apr 30, 2018 via email

@toobaz
Copy link
Member

toobaz commented Apr 30, 2018

Ah, no, doesn't quite work since the list isn't broadcast to the length of index.

Right!

I thought at least replacing data = {} with data = na_value_for_dtype(dtype) would alleviate the problem for the data=None case, but I'm getting some infinite recursion when collecting tests which I'm unable to explain.

@TomAugspurger
Copy link
Contributor Author

Merging in 2 hours if there are no further comments. I think we're just choosing between the two least-bad outcomes here.

@jreback
Copy link
Contributor

jreback commented Apr 30, 2018

let me look again

@jreback
Copy link
Contributor

jreback commented Apr 30, 2018

lgtm. thanks (I believe you said this was after 0.22 so no whatsnew needed)

@TomAugspurger
Copy link
Contributor Author

Right, slowdown was only on master.

@TomAugspurger TomAugspurger merged commit c8fcfcb into pandas-dev:master Apr 30, 2018
@TomAugspurger TomAugspurger deleted the empty-series-ctor-perf branch April 30, 2018 19:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants