"Inlining" data when writing references to disk #62

Open
TomNicholas opened this issue Mar 28, 2024 · 7 comments
Labels
Kerchunk Relating to the kerchunk library / specification itself zarr-specs Requires adoption of a new ZEP

Comments

@TomNicholas
Member

TomNicholas commented Mar 28, 2024

Sometimes we might prefer to write actual data values into the on-disk kerchunk references file (or zarr store), because that is more efficient than storing byte ranges that point to very small amounts of data. For example:

you wouldn't want to read ~90k time steps from 90k files to construct a 90K long time coordinate

(Originally posted by @dcherian in #18 (comment))

Kerchunk calls this "inlining".

To implement this we need to actually read those data values into memory in the first place. Once #18 is solved we would already be doing that for 1D coordinate indexes (which are the data we most likely want to inline anyway), so the choice of whether or not to "inline" those values could be deferred until the .to_kerchunk write step. (But you wouldn't be able to inline if you never created the indexes.)
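
To make the distinction concrete, here is a minimal sketch of a chunk stored by reference versus stored inline. It assumes the kerchunk convention, as I understand it, that a chunk key maps either to a `[url, offset, length]` triple or to a string holding the data itself, with a `base64:` prefix when the bytes are not plain ASCII. The bucket name, file name, offsets and the helpers `encode_inline` / `decode_inline` are hypothetical and only for illustration.

```python
import base64
import struct


def encode_inline(data: bytes) -> str:
    """Turn raw chunk bytes into an inlined reference value (assumed convention)."""
    try:
        # Plain ASCII payloads can be stored as-is.
        return data.decode("ascii")
    except UnicodeDecodeError:
        # Anything else is base64-encoded and marked with a prefix.
        return "base64:" + base64.b64encode(data).decode("ascii")


def decode_inline(value: str) -> bytes:
    """Recover the raw chunk bytes from an inlined reference value."""
    if value.startswith("base64:"):
        return base64.b64decode(value[len("base64:"):])
    return value.encode("ascii")


# One float64 time value, as it might sit in a single tiny chunk.
raw = struct.pack("<d", 1.5)

# Stored by reference: the reader must fetch 8 bytes from a remote file.
by_reference = {"time/0": ["s3://hypothetical-bucket/file_0000.nc", 4096, 8]}

# Stored inline: the 8 bytes live in the references file itself.
inlined = {"time/0": encode_inline(raw)}
```

With 90k such chunks, the inlined form trades 90k remote reads for a somewhat larger references file.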

@TomNicholas
Member Author

To support this in zarr, would it be enough to write a mixed store, i.e. one where some arrays are backed by manifest.json files and some by normal zarr chunks, with the latter effectively being the "inlined" data? cc @jhamman

@TomNicholas TomNicholas added zarr-specs Requires adoption of a new ZEP Kerchunk Relating to the kerchunk library / specification itself labels Mar 28, 2024
@jhamman
Member

jhamman commented Mar 28, 2024

I imagine there are actually two sources of inlined chunks.

  1. The entire array comprises one (or possibly more) small chunks.
  2. Some chunks of an array are very small. These could be chunks along the edge of a chunk grid or chunks that were very effectively compressed (e.g. constant value).

For case (1), duplicating the entire array seems like a decent idea -- perhaps with a new set of encoding options. You can imagine a case where you want to merge your 90k tiny chunks into a single chunk.
For case (2), you may want to do something else. Kerchunk is happy to inline individual chunks based on a length threshold. These could be easily stored in the same manifest along with traditional references.
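
As a rough sketch of case (2), the function below walks a dictionary of chunk references and inlines every chunk whose byte length is under a threshold, leaving the rest as ordinary references. This is not kerchunk's implementation; `inline_small_chunks`, the `read_bytes` callback and the in-memory "file" are hypothetical stand-ins, and the `base64:` prefix follows the kerchunk convention as I understand it.

```python
import base64


def inline_small_chunks(refs, threshold, read_bytes):
    """Return a copy of refs with chunks shorter than threshold bytes inlined.

    refs maps chunk keys to [url, offset, length] triples;
    read_bytes(url, offset, length) must return the referenced bytes.
    """
    out = {}
    for key, ref in refs.items():
        is_byte_range = isinstance(ref, list) and len(ref) == 3
        if is_byte_range and ref[2] < threshold:
            url, offset, length = ref
            data = read_bytes(url, offset, length)
            # Always base64-encode here to keep the sketch simple.
            out[key] = "base64:" + base64.b64encode(data).decode("ascii")
        else:
            out[key] = ref
    return out


# Hypothetical in-memory stand-in for a remote file (2048 bytes).
files = {"a.nc": bytes(range(256)) * 8}


def read_bytes(url, offset, length):
    return files[url][offset:offset + length]


refs = {
    "time/0": ["a.nc", 0, 8],       # tiny chunk: worth inlining
    "temp/0.0": ["a.nc", 8, 2000],  # large chunk: keep as a reference
}
result = inline_small_chunks(refs, threshold=100, read_bytes=read_bytes)
```

Because inlined and referenced entries share one dictionary, both kinds of chunk can sit in the same manifest, which is the point made above.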

@TomNicholas
Member Author

Case (2) would require adding inlining to the manifest spec, right? (xref https://github.com/zarr-developers/zarr-specs) Whereas case (1) is just a mixture of "normal" zarr arrays and arrays where every chunk is represented in the manifest but nothing is inlined into the manifest itself.

@TomNicholas
Member Author

TomNicholas commented Apr 1, 2024

What's the kerchunk equivalent of case (1)? Is it written into the kerchunk references the same as case (2) is?

@jhamman
Member

jhamman commented Apr 2, 2024

In the kerchunk story, (1) and (2) are equivalent.

The special thing about (1) is that you don't have to inline the data in the case of a zarr manifest. You could, for specific variables, choose to rewrite the array, perhaps with new chunking/encoding/etc. Why would you want to do this?

Imagine a situation where you concatenate a dataset with 10k time variables, each of size (1,). This is very easy to inline with an effective chunk size of (1,), but you may want to just rewrite the concatenated time variable as a real zarr array with shape (10000,) and a chunk size of (1000,).
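
The rewrite described here can be sketched in plain Python, without any zarr API: 10,000 one-value chunks are joined and re-split into 10 chunks of 1,000 values. The sketch assumes uncompressed, fixed-width int64 values in a 1D array; real chunks would first have to be decoded with the array's codecs. `merge_chunks` is a hypothetical helper.

```python
import struct


def merge_chunks(tiny_chunks, values_per_chunk, itemsize=8):
    """Join 1D uncompressed chunks and re-split them into larger chunks."""
    buffer = b"".join(tiny_chunks)
    step = values_per_chunk * itemsize
    return [buffer[i:i + step] for i in range(0, len(buffer), step)]


# 10,000 time steps, each stored as its own chunk of shape (1,).
tiny = [struct.pack("<q", t) for t in range(10_000)]

# Rewritten as one array of shape (10000,) with a chunk size of (1000,).
merged = merge_chunks(tiny, values_per_chunk=1000)

print(len(tiny), "chunks ->", len(merged), "chunks")
```

Reading the full coordinate then takes 10 chunk reads instead of 10,000.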

@TomNicholas
Member Author

So in the kerchunk story, (1) is kind of a poor man's Zarr array? Because you've written actual data into the .json file for every chunk in the array. So in a more zarr-native way of doing things, we might as well just write a real zarr array for that variable instead.

FYI #69 implements the opening of such a mixed dataset, but I don't have a way of saving it to disk yet (either through kerchunk or zarr with manifest.jsons).

@TomNicholas
Member Author

Now that #45 has been merged, the Zarr version of (1) can be implemented. It would be analogous to #73, but would instead use the zarr-python v3 library (or ideally just part of xarray's normal to_zarr interface) to write only the "loadable variables" into the store on disk.
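
A possible shape for that write step, sketched with stand-in classes rather than the real xarray or zarr-python APIs: variables already loaded into memory would be written as native zarr chunks, and everything else would be written as a manifest. `LoadedVariable`, `VirtualVariable` and `plan_mixed_store` are hypothetical names, as are the manifest entry fields.

```python
from dataclasses import dataclass


@dataclass
class VirtualVariable:
    """Hypothetical stand-in for an array backed only by chunk references."""
    manifest: dict


@dataclass
class LoadedVariable:
    """Hypothetical stand-in for an array whose values are in memory."""
    values: list


def plan_mixed_store(variables):
    """Decide, per variable, how it would be written into a mixed store."""
    plan = {}
    for name, var in variables.items():
        if isinstance(var, LoadedVariable):
            plan[name] = "native chunks"  # the zarr-native form of "inlined"
        else:
            plan[name] = "manifest"       # byte-range references only
    return plan


variables = {
    "time": LoadedVariable(values=list(range(5))),
    "air": VirtualVariable(
        manifest={"0.0": {"path": "file_0000.nc", "offset": 4096, "length": 2000}}
    ),
}
plan = plan_mixed_store(variables)
```

Only the entries marked "native chunks" would need to pass through a real array writer; the rest never require reading the underlying data.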


2 participants