More flexible index variables #8124

Draft: wants to merge 10 commits into main
@benbovy benbovy commented Aug 30, 2023

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

The goal of this PR is to provide a more general solution for indexed coordinate variables, i.e., to support arbitrary dimensions and/or duck arrays for those variables while preventing them from being updated in a way that would invalidate their index.

This would solve problems like the one mentioned here: #1650 (comment)

@shoyer I've tried to implement what you have suggested in #4979 (comment). It would be nice indeed if eventually we could get rid of IndexVariable. It won't be easy to deprecate it until we finish the index refactor (i.e., all methods listed in #6293), though. Also, I didn't find an easy way to refactor that class as it has been designed too closely around a 1-d variable backed by a pandas.Index.

So the approach implemented in this PR is to keep using IndexVariable for PandasIndex until we can deprecate / remove it later, and for the other cases use Variable with data wrapped in a custom IndexedCoordinateArray object.
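To make the wrapper idea concrete, here is a minimal sketch assuming NumPy-backed data. The class name `ReadOnlyCoordinateArray` and its error message are hypothetical; this is not the `IndexedCoordinateArray` code from this PR, only an illustration of blocking in-place writes while still exposing the array.

```python
import numpy as np


class ReadOnlyCoordinateArray:
    """Hypothetical stand-in for a wrapper around indexed coordinate data."""

    def __init__(self, array):
        self._array = array

    @property
    def shape(self):
        return self._array.shape

    @property
    def dtype(self):
        return self._array.dtype

    @property
    def ndim(self):
        return self._array.ndim

    def __array__(self, dtype=None, copy=None):
        # Lets np.asarray(wrapper) return the wrapped values.
        return np.asarray(self._array, dtype=dtype)

    def __getitem__(self, key):
        # Indexing returns another wrapper, so the result stays protected.
        return type(self)(self._array[key])

    def __setitem__(self, key, value):
        # In-place assignment would desynchronize the data from its index.
        raise TypeError(
            "cannot modify the values of an indexed coordinate in-place, "
            "as this would invalidate its index"
        )


coord = ReadOnlyCoordinateArray(np.arange(6).reshape(2, 3))
values = np.asarray(coord)  # reading is still allowed
```

Reads go through `__array__` and `__getitem__`, while any `coord[...] = value` raises a `TypeError`.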

The latter solution (wrapper) doesn't always work nicely, though. For example, several methods of Variable expect that self._data directly returns a duck array (e.g., a dask array or a chunked duck array). A wrapped duck array will result in unexpected behavior there. We could probably add some checks / indirection or extend the wrapper API... But I wonder if there wouldn't be a more elegant approach?
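The problem can be reproduced with toy classes. In this sketch, `FakeChunkedArray`, `OpaqueWrapper`, `ForwardingWrapper` and `is_chunked` are all hypothetical names and none of them is xarray or dask API. A check written against the raw array stops working once the array is wrapped, and attribute forwarding is one possible indirection.

```python
import numpy as np


class FakeChunkedArray:
    """Toy stand-in for a chunked duck array (e.g., a dask array)."""

    def __init__(self, data, chunks):
        self.data = np.asarray(data)
        self.chunks = chunks

    @property
    def shape(self):
        return self.data.shape


class OpaqueWrapper:
    """Wrapper that hides every attribute of the wrapped array."""

    def __init__(self, array):
        self._array = array


class ForwardingWrapper(OpaqueWrapper):
    """Wrapper that forwards unknown attributes to the wrapped array."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name == "_array":
            # Avoid infinite recursion if _array is not set yet.
            raise AttributeError(name)
        return getattr(self._array, name)


def is_chunked(data):
    # Hypothetical check, similar in spirit to code that inspects the data directly.
    return hasattr(data, "chunks")


raw = FakeChunkedArray(np.arange(4), chunks=((2, 2),))
print(is_chunked(raw))                     # raw array
print(is_chunked(OpaqueWrapper(raw)))      # wrapper hides .chunks
print(is_chunked(ForwardingWrapper(raw)))  # wrapper forwards .chunks
```

Forwarding keeps attribute-based checks working, but `isinstance` checks against the wrapped type would still fail, which is why extra indirection in the calling code may be needed as well.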

More generally, which operations should we allow / forbid / skip for an indexed coordinate variable?

  • Set array items in-place? Do not allow.
  • Replace data? Do not allow.
  • (Re)Chunk?
  • Load lazy data?
  • ... ?

(Note: we could add Index.chunk() and Index.load() methods to allow an Xarray index to implement custom logic for the latter two cases, e.g., converting a DaskIndex to a PandasIndex during load; see #8128.)
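A rough sketch of how such hooks could look, using toy classes only. `BaseIndex`, `LazyIndex` and `EagerIndex` are hypothetical names, and the `load()` / `chunk()` signatures are assumptions based on the note above, not existing Xarray API.

```python
class BaseIndex:
    """Hypothetical index base class with default no-op hooks."""

    def load(self):
        # By default, loading leaves the index unchanged.
        return self

    def chunk(self, chunks):
        # By default, (re)chunking leaves the index unchanged.
        return self


class EagerIndex(BaseIndex):
    """Toy index holding in-memory values."""

    def __init__(self, values):
        self.values = list(values)


class LazyIndex(BaseIndex):
    """Toy index whose values are only computed on load."""

    def __init__(self, make_values):
        self._make_values = make_values

    def load(self):
        # Custom logic: loading converts the lazy index to an eager one.
        return EagerIndex(self._make_values())


lazy = LazyIndex(lambda: range(3))
loaded = lazy.load()  # an EagerIndex holding the computed values
```

With this design the container only calls `index.load()` or `index.chunk(...)` and lets each index type decide whether to convert itself, stay as-is, or rebuild its coordinate data.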

cc @andersy005 (some changes made here may conflict with what you are refactoring in #8075).

Used to wrap indexed coordinate data. It raises an explicit error message when trying to modify the array values in-place.

Used to create the variables from the index object and then wrap coordinate data to prevent updating values in-place.

Except for PandasIndex objects: not needed for now since they create IndexVariable objects.