Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zarr: Optimize region="auto" detection #8997

Merged
merged 4 commits into from
May 7, 2024

Conversation

dcherian
Copy link
Contributor

@dcherian dcherian commented May 3, 2024

  1. This moves the region detection code into ZarrStore so we only open the store once.
  2. Instead of opening the store as a dataset, construct a pd.Index directly to "auto"-infer the region.

The diff is large mostly because a bunch of code moved from backends/api.py to backends/zarr.py

for k, v in self.get_variables().items()
if k in existing_variable_names
},
{k: self.open_store_variable(name=k) for k in existing_variable_names},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just open the needed variables instead of opening all of them.

Comment on lines +807 to +812
variable = conventions.decode_cf_variable(
dim, self.open_store_variable(dim).compute()
)
assert variable.dims == (dim,)
index = pd.Index(variable.data)
idxs = index.get_indexer(ds[dim].data)
Copy link
Contributor Author

@dcherian dcherian May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 812-817: This is the main logic change.

region[dim] = slice(idxs[0], idxs[-1] + 1)
return region

def _validate_and_autodetect_region(self, ds, region) -> dict[str, slice]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

directly copied over

@@ -1815,6 +1710,16 @@ def to_zarr(
write_empty=write_empty_chunks,
)

if region is not None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved down so we only open the Zarr store once.

@@ -890,7 +890,7 @@ where the data should be written (in index space, not label space), e.g.,

# For convenience, we'll slice a single dataset, but in the real use-case
# we would create them separately possibly even from separate processes.
ds = xr.Dataset({"foo": ("x", np.arange(30))})
ds = xr.Dataset({"foo": ("x", np.arange(30))}, coords={"x": np.arange(30)})
# Any of the following region specifications are valid
ds.isel(x=slice(0, 10)).to_zarr(path, region="auto")
ds.isel(x=slice(10, 20)).to_zarr(path, region={"x": "auto"})
Copy link
Contributor Author

@dcherian dcherian May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This last line does not do what it looks like it's doing if there are no indexes!

@dcherian dcherian changed the title Zarr: Optimize region detection Zarr: Optimize region="auto" detection May 3, 2024
@max-sixty
Copy link
Collaborator

Looks great!

@dcherian dcherian added the plan to merge Final call for comments label May 7, 2024
@dcherian
Copy link
Contributor Author

dcherian commented May 7, 2024

Thanks for taking a look, Max!

@dcherian dcherian merged commit dcf2ac4 into pydata:main May 7, 2024
32 of 34 checks passed
andersy005 pushed a commit that referenced this pull request May 10, 2024
* Zarr: Optimize region detection

* Fix for unindexed dimensions.

* Better example

* small cleanup
andersy005 added a commit that referenced this pull request May 10, 2024
* main:
  Avoid auto creation of indexes in concat (#8872)
  Fix benchmark CI (#9013)
  Avoid extra read from disk when creating Pandas Index. (#8893)
  Add a benchmark to monitor performance for large dataset indexing (#9012)
  Zarr: Optimize `region="auto"` detection (#8997)
  Trigger CI only if code files are modified. (#9006)
  Fix for ruff 0.4.3 (#9007)
  Port negative frequency fix for `pandas.date_range` to `cftime_range` (#8999)
  Bump codecov/codecov-action from 4.3.0 to 4.3.1 in the actions group (#9004)
  Speed up localize (#8536)
  Simplify fast path (#9001)
  Add argument check_dims to assert_allclose to allow transposed inputs (#5733) (#8991)
  Fix syntax error in test related to cupy (#9000)
andersy005 added a commit to hmaarrfk/xarray that referenced this pull request May 10, 2024
* backend-indexing:
  Trigger CI only if code files are modified. (pydata#9006)
  Enable explicit use of key tuples (instead of *Indexer objects) in indexing adapters and explicitly indexed arrays (pydata#8870)
  add `.oindex` and `.vindex` to `BackendArray` (pydata#8885)
  temporary enable CI triggers on feature branch
  Avoid auto creation of indexes in concat (pydata#8872)
  Fix benchmark CI (pydata#9013)
  Avoid extra read from disk when creating Pandas Index. (pydata#8893)
  Add a benchmark to monitor performance for large dataset indexing (pydata#9012)
  Zarr: Optimize `region="auto"` detection (pydata#8997)
  Trigger CI only if code files are modified. (pydata#9006)
  Fix for ruff 0.4.3 (pydata#9007)
  Port negative frequency fix for `pandas.date_range` to `cftime_range` (pydata#8999)
  Bump codecov/codecov-action from 4.3.0 to 4.3.1 in the actions group (pydata#9004)
  Speed up localize (pydata#8536)
  Simplify fast path (pydata#9001)
  Add argument check_dims to assert_allclose to allow transposed inputs (pydata#5733) (pydata#8991)
  Fix syntax error in test related to cupy (pydata#9000)
@dcherian dcherian deleted the cleanup-zarr-region branch July 11, 2024 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants