speed up the repr for big MultiIndex objects #4846
Conversation
xarray/core/formatting.py (outdated diff)
```python
if col_width < len(coord):
    # truncate: take col_width // 4 values from each end of the index
    # so we never materialize the full (possibly huge) level
    n_values = col_width // 4
    indices = list(range(0, n_values)) + list(range(-n_values, 0))
    subset = coord[indices]
else:
    subset = coord
```
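For readers following along, here is a minimal runnable sketch of the snippet above; `truncated_subset` is a hypothetical wrapper, not part of xarray's API:

```python
import pandas as pd

def truncated_subset(coord, col_width):
    # Same head-and-tail trick as above: when the index is longer than
    # the display column, keep only col_width // 4 entries from each end.
    if col_width < len(coord):
        n_values = col_width // 4
        indices = list(range(0, n_values)) + list(range(-n_values, 0))
        return coord[indices]
    return coord

idx = pd.MultiIndex.from_product([range(1_000), range(1_000)])
# Materializes 2 * (10 // 4) = 4 labels instead of 1_000_000.
print(truncated_subset(idx, 10))
```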
this could probably use some optimization: how big does the MultiIndex have to be so that indexing + `get_level_variable` is faster than just `get_level_variable`?
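A rough way to probe that crossover point is a quick timing sketch; this uses pandas' `get_level_values` as a stand-in for the internal `get_level_variable` helper, so the numbers are only indicative:

```python
import timeit
import pandas as pd

# First and last four positions, as in the truncation snippet above.
indices = list(range(4)) + list(range(-4, 0))

for n in (10, 100, 1_000, 10_000, 1_000_000):
    idx = pd.MultiIndex.from_product([range(max(n // 10, 1)), range(10)])
    direct = timeit.timeit(lambda: idx.get_level_values(0), number=100)
    truncated = timeit.timeit(
        lambda: idx[indices].get_level_values(0), number=100
    )
    print(f"n={len(idx):>9,}: direct={direct:.4f}s truncated={truncated:.4f}s")
```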
Yeah, though it's so fast already relative to how often repr is called...
Could also defer to pandas, which seems to do this (though with a different orientation).
More than fine to leave as a TODO imo
I assumed that for everything below 100 elements `get_level_variable` is faster than indexing first, which also means I don't have to worry about the case where the index does not have enough elements to be truncated.

> Could also defer to pandas

how would I do that?
> how would I do that?

I had meant that pandas has a fast repr for MultiIndex, so could we use theirs somehow, despite the different orientation? On reflection, our code is simple; I agree with your impulse.
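For reference, pandas' MultiIndex repr truncates on its own (governed by `pd.options.display.max_seq_items`), so it stays fast even for very large indexes; a quick check with illustrative sizes:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([range(10_000), range(1_000)])
# pandas prints only the first and last few tuples of the 10 million
# entries, so repr() returns quickly regardless of index size.
print(repr(idx))
```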
This is great, thanks! We could add an ASV benchmark if you like, but it's also fine to leave for another day / PR.
done, I think?
Super!
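Such an airspeed-velocity (ASV) benchmark might look roughly like this sketch; the class name and setup sizes are illustrative, not necessarily the exact benchmark added in this PR:

```python
# Sketch of an airspeed-velocity benchmark for repr performance.
import pandas as pd
import xarray as xr

class ReprMultiIndex:
    def setup(self):
        # 1,000,000-element MultiIndex-backed DataArray.
        index = pd.MultiIndex.from_product(
            [range(1_000), range(1_000)], names=("level_0", "level_1")
        )
        series = pd.Series(range(1_000_000), index=index)
        self.da = xr.DataArray(series)

    def time_repr(self):
        repr(self.da)
```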
Co-authored-by: Maximilian Roos <[email protected]>
Thanks @keewis and @max-sixty
I'm not able to check whether this actually works for bigger arrays, but with `xr.DataArray(pd.Series(range(25_000_000), index=idx))` I get a significant speed-up of about 180x for `repr`.
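Since `idx` itself is not shown in the thread, here is a hedged reconstruction for anyone wanting to reproduce the timing; the MultiIndex construction is an assumption, only the 25-million-row size comes from the comment above:

```python
import timeit
import pandas as pd
import xarray as xr

# Hypothetical reconstruction: the original `idx` is not shown in the
# thread, only that the series has 25 million rows.
idx = pd.MultiIndex.from_product([range(25_000), range(1_000)])
da = xr.DataArray(pd.Series(range(25_000_000), index=idx))

# With truncation, repr touches only a handful of index entries
# instead of all 25 million.
print(timeit.timeit(lambda: repr(da), number=1))
```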
- Passes `pre-commit run --all-files`
- User visible changes documented in `whats-new.rst`
- New functions/methods listed in `api.rst`