Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: copy cached attributes on index shallow_copy #32568

Merged
merged 2 commits into from
Mar 11, 2020

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Mar 9, 2020

The performance of the example in #28584 is:

>>> idx = pd.Index(np.arange(100_000))
>>> %timeit idx.get_loc(99_999)
1.17 µs ± 80.6 ns per loop  # master and this PR
>>> %timeit idx._shallow_copy().get_loc(99_999)
3.57 ms ± 286 ns per loop  # master
3.67 µs ± 286 ns per loop  # this PR

The issue is still on the extension indexes, e.g. CategoricalIndex._shallow_copy. I'd like to take them afterwards.

@@ -499,10 +501,12 @@ def _shallow_copy(self, values=None, name: Label = no_default):
"""
name = self.name if name is no_default else name

if values is None:
values = self.values
cache = self._cache if values is None else {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you keep the if values is None: check; i like it for inspection of coverage results

@jbrockmendel
Copy link
Member

jbrockmendel commented Mar 9, 2020

LGTM pending green

@jbrockmendel
Copy link
Member

Should engine be shared? If so, maybe access it to ensure it is in _cache before coping _cache?

@topper-123
Copy link
Contributor Author

topper-123 commented Mar 10, 2020

If the _cache has been populated before the call to _shallow_copy, we always copy. It it hasn't, it's a question of evaluating trade-offs:

  1. If we access _engine before copying, we risk calculating unneccesarily
  2. If we do not access it before copying, we risk having to do the same calculation several times.

My hunch is that it's mostly not that important, because either the cache has already been populated, or the original index will not be used again (e.g. in a pipe, often).

But in some given cases we'd prefer either 1 or 2, depending on context, but we'd have to choose. I prefer 2., because in that case _shallo_copy will never halt to create new _engine attributes, i.e. performance will be more predictable.

@jbrockmendel
Copy link
Member

Makes sense, thanks for explaining your thought process.

@topper-123 topper-123 added Performance Memory or execution speed performance Index Related to the Index class or subclasses labels Mar 10, 2020
@topper-123 topper-123 added this to the 1.1 milestone Mar 10, 2020
@jreback jreback merged commit e8e02c0 into pandas-dev:master Mar 11, 2020
@jreback
Copy link
Contributor

jreback commented Mar 11, 2020

thanks @topper-123

if you think we need additional asvs's happy to have them (e.g. is_monotonic, is_unique), MI indexing, and .get_loc would all benefit; we might have enough, just checking in about this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Index Related to the Index class or subclasses Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: Index._shallow_copy doesn't copy ._engine
3 participants