Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid duplicate Zarr array read #8472

Merged
merged 5 commits into from
Dec 1, 2023
Merged

Conversation

dcherian
Copy link
Contributor

We already get the underlying Zarr array in

return FrozenDict(
(k, self.open_store_variable(k, v)) for k, v in self.zarr_group.arrays()
)

and then pass it to open_store_variable. Just pass that array down to ZarrArrayWrapper instead of reading it from the datastore again.

@github-actions github-actions bot added topic-backends topic-zarr Related to zarr storage library io labels Nov 21, 2023
xarray/backends/zarr.py Outdated Show resolved Hide resolved
Copy link
Contributor

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the logical next step after #8297.

@@ -63,13 +63,13 @@ def encode_zarr_attr_value(value):
class ZarrArrayWrapper(BackendArray):
__slots__ = ("datastore", "dtype", "shape", "variable_name", "_array")

def __init__(self, variable_name, datastore):
def __init__(self, variable_name, zarr_array, datastore):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def __init__(self, variable_name, zarr_array, datastore):
def __init__(self, zarr_array):

Fairly sure this should work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test this properly. I've not had good results doing this trick on a different datastore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the heads up! what issues did you see?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not completely sure how the zarr backend is implemented, but in my case after xr.open_dataset() the datastore is closed and not needed anymore( until .compute()).

But in this case you're sending a reference from that store to the backendarray which will be lost once the store is closed.

I've gone the opposite way, in get_variables only the required keys are sent. Then only once the backendarray __getitem__ is triggered the arrays are loaded.

I noticed this quite quickly once I played around with the dataset and loaded some variables and if you haven't noticed it yet maybe the zarr backend does some interesting tricks to avoid the closed references.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yerah not sure. We do have distributed tests for reading zarr so presumably it's all OK

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoop whoop!

@dcherian dcherian added the plan to merge Final call for comments label Nov 29, 2023
@andersy005 andersy005 enabled auto-merge (squash) December 1, 2023 02:24
@andersy005 andersy005 merged commit 1715ed3 into pydata:main Dec 1, 2023
26 checks passed
@dcherian dcherian deleted the optimize-zarr-read branch December 1, 2023 02:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
io plan to merge Final call for comments topic-backends topic-zarr Related to zarr storage library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants