How easy can it be to create a zarr array #2083

Open
d-v-b opened this issue Aug 13, 2024 · 1 comment
Labels
enhancement (New features or improvements), V3 (Affects the v3 branch)
Comments

@d-v-b
Contributor

d-v-b commented Aug 13, 2024

Users come to Zarr with a variety of array-like objects: numpy arrays, dask arrays, xarray DataArrays, zarr v2 arrays, zarr v3 arrays, etc. Imagine a Venn diagram of the attributes and methods of these objects: shape, __getitem__, and dtype would be in the shared middle, while chunks, chunksize, attrs, dims, codecs, and filters would sit in the non-overlapping periphery. How can we conveniently model an arbitrary array-like object as a Zarr array? In particular, how can we ensure that a complete Zarr array can be created from an existing array-like object (which might already be a Zarr array) with a single function call?
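One way to picture the "shared middle" of that Venn diagram is a structural type. The sketch below is only an illustration: `ArrayLike`, `OPTIONAL_ATTRS`, `describe`, and `FakeChunkedArray` are hypothetical names, not part of zarr-python or any other library, and the stand-in class exists only so the example runs without third-party packages.

```python
from typing import Any, Protocol, Tuple, runtime_checkable


@runtime_checkable
class ArrayLike(Protocol):
    """Hypothetical protocol for the members every array-like shares."""

    @property
    def shape(self) -> Tuple[int, ...]: ...

    @property
    def dtype(self) -> Any: ...

    def __getitem__(self, key: Any) -> Any: ...


# Attributes from the periphery: only some libraries provide each of these.
OPTIONAL_ATTRS = ("chunks", "chunksize", "attrs", "dims", "codecs", "filters")


def describe(obj: Any) -> dict:
    """Return the optional attributes that an array-like object exposes."""
    if not isinstance(obj, ArrayLike):
        raise TypeError(f"{type(obj).__name__} is not array-like")
    return {name: getattr(obj, name) for name in OPTIONAL_ATTRS if hasattr(obj, name)}


class FakeChunkedArray:
    """Stand-in for a chunked array; not a real library class."""

    def __init__(self, shape, dtype, chunks):
        self.shape = shape
        self.dtype = dtype
        self.chunks = chunks

    def __getitem__(self, key):
        return 0.0


arr = FakeChunkedArray(shape=(10,), dtype="float64", chunks=(2,))
found = describe(arr)  # only "chunks" is present on this stand-in
```

The runtime check only tests that the members exist, not their types, which is roughly the level of duck typing the question above is asking about.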

If we agree on that objective, then here is a rough outline of what that function could look like:

  • We should have a top-level from_array function that creates a Zarr array from an existing array-like object.
# imports assumed by the examples below (zarr.from_array is the proposed API)
import dask.array as da
import numpy as np
import xarray
import zarr
from numcodecs import GZip

# numpy
np_arr = np.zeros(10)
zarr.from_array(np_arr) # memorystore-backed zarr v3 array with shape (10,) and dtype float64, and default parameters for everything else
zarr.from_array(np_arr, zarr_format=2, compressor=GZip(), attributes={'foo': 10}) # same as above, but v2, with gzip, and attributes

# dask
da_arr = da.zeros((10,), chunks=(1,))
zarr.from_array(da_arr) # inherits the `chunks` attribute from the array
zarr.from_array(da_arr, chunking_bikeshed=(2,)) # overrides the chunks attribute, kwarg name tbd 🙃

# xarray
xr_arr = xarray.DataArray(np.zeros(10), attrs={'foo': 10}, dims=('dim_0',))
zarr.from_array(xr_arr) # zarr v3 array with dimension names inherited from xr_arr.dims and attrs from xr_arr.attrs

# zarr
zarr.from_array(zarr.zeros(10)) # makes a copy of the array
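The "inherit unless overridden" behaviour in the examples above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: `resolve_array_params` is a hypothetical helper, the `chunks` keyword stands in for the still-unnamed chunking argument, and `SimpleNamespace` objects stand in for real dask and xarray arrays so the block needs only the standard library.

```python
from types import SimpleNamespace
from typing import Any, Dict, Optional, Tuple


def resolve_array_params(
    source: Any,
    *,
    chunks: Optional[Tuple[int, ...]] = None,
    attributes: Optional[Dict[str, Any]] = None,
    zarr_format: int = 3,
) -> Dict[str, Any]:
    """Hypothetical helper: collect array-creation parameters from `source`."""
    shape = tuple(source.shape)

    if chunks is None:
        # Prefer `chunksize` (a flat tuple on dask arrays), then `chunks`.
        inherited = getattr(source, "chunksize", None) or getattr(source, "chunks", None)
        if inherited and all(isinstance(c, int) for c in inherited):
            chunks = tuple(inherited)
        else:
            chunks = shape  # fall back to a single chunk covering the array

    if attributes is None:
        attributes = dict(getattr(source, "attrs", None) or {})

    dims = getattr(source, "dims", None)
    return {
        "shape": shape,
        "dtype": str(source.dtype),
        "chunks": chunks,
        "attributes": attributes,
        "dimension_names": tuple(dims) if dims is not None else None,
        "zarr_format": zarr_format,
    }


# Stand-ins for a dask array and an xarray DataArray (not the real classes).
dask_like = SimpleNamespace(shape=(10,), dtype="float64", chunks=((1,) * 10,), chunksize=(1,))
xarray_like = SimpleNamespace(shape=(10,), dtype="float64", attrs={"foo": 10}, dims=("dim_0",))

inherited_params = resolve_array_params(dask_like)              # chunks come from the source
overridden_params = resolve_array_params(dask_like, chunks=(2,))  # explicit kwarg wins
named_params = resolve_array_params(xarray_like)                # dims and attrs carried over
```

Explicit keyword arguments take precedence, and anything the source does not expose falls back to a default, which keeps the single-call form usable for plain numpy-style inputs.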

Some open questions:

  • Should we copy data? Over in pydantic-zarr I implemented a from_array function that only creates array metadata, because users might not want to eagerly move 10 TB of data at array definition time. Perhaps this could be controlled with a keyword argument.
  • Should we support creating v2 arrays through this API, or use a v2.from_array function for that? I'm fine either way.
  • How much work is required to implicitly model the different array-like libraries well enough for the above functionality to be useful?
  • There is a similar question about zarr groups, but the set of "zarr-group-like objects" is a bit narrower than the set of array-likes.
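The first question, whether data is copied eagerly, could be handled with a keyword argument along these lines. This is a sketch only: `from_array_sketch`, the `write_data` keyword, and `CountingArray` are hypothetical names chosen for illustration, and the function handles only one-dimensional sources.

```python
from typing import Any, Dict


class CountingArray:
    """Stand-in 1-D array that counts how often its values are read."""

    def __init__(self, values):
        self._values = list(values)
        self.shape = (len(self._values),)
        self.dtype = "float64"
        self.reads = 0

    def __getitem__(self, index):
        self.reads += 1
        return self._values[index]


def from_array_sketch(source: Any, *, write_data: bool = True) -> Dict[str, Any]:
    """Hypothetical: always build metadata, copy values only when asked to."""
    result: Dict[str, Any] = {
        "metadata": {"shape": tuple(source.shape), "dtype": str(source.dtype)},
        "data": None,
    }
    if write_data:
        # Eager copy; for a very large source this is the expensive step.
        result["data"] = [source[i] for i in range(source.shape[0])]
    return result


src = CountingArray([0.0] * 10)
metadata_only = from_array_sketch(src, write_data=False)
reads_after_metadata_only = src.reads  # the source was never indexed
full_copy = from_array_sketch(src)
reads_after_full_copy = src.reads      # one read per element
```

With `write_data=False` the source is never indexed, so defining an array from a 10 TB source would cost only a metadata write.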

Thoughts?
@jni
Contributor

jni commented Oct 17, 2024

Big supporter here @d-v-b. Actually the main reason I use zarr-python more heavily than tensorstore is having to specify a spec in tensorstore 😂 🦥. Convenience functions with sane defaults are very important!
