"Inlining" data when writing references to disk #62
Comments
To support this in zarr, would it be enough to write a mixed store, i.e. one where some arrays are backed by
I imagine there are actually two sources of inlined chunks.
For case (1), duplicating the entire array seems like a decent idea -- perhaps with a new set of encoding options. You can imagine a case where you want to merge your 90k tiny chunks into a single chunk.
Case (2) would require adding inlining into the manifest spec, right? (xref https://github.com/zarr-developers/zarr-specs) Whereas case (1) is just a mixture of "normal" zarr arrays and arrays where every chunk is represented in the manifest but nothing is inlined into the manifest itself.
What's the kerchunk equivalent of case (1)? Is it written into the kerchunk references the same way as case (2) is?
In the kerchunk story, (1) and (2) are equivalent. The special thing about (1) is that you don't have to inline the data in the case of a zarr manifest. You could, for specific variables, choose to rewrite the array, perhaps with new chunking/encoding/etc. Why would you want to do this? Imagine a situation where you concatenate a dataset with 10k time variables of size
So in the kerchunk story, (1) is kind of a poor man's Zarr array? Because you've written in actual data to the
FYI #69 implements the opening of such a mixed dataset, but I don't have a way of saving it to disk yet (either through kerchunk or zarr with manifest.jsons).
This has been implemented, for both kerchunk and icechunk formats.
Sometimes we might prefer to write actual data values out into the on-disk kerchunk references file (/ zarr store), because it's more efficient than storing byte ranges to point to very small amounts of data. e.g.
(Originally posted by @dcherian in #18 (comment))
Kerchunk calls this "inlining".
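To make the idea concrete, here is a minimal sketch of what inlining does to a kerchunk-style references dict. It assumes the version 1 reference layout, where a value is either a `[url, offset, length]` byte range or a string holding the data itself (with a `base64:` prefix for binary content). The function `inline_small_refs`, the `read_range` callback, the `FAKE_FILES` stand-in for storage, and the 100-byte threshold are all hypothetical names chosen for illustration, not part of kerchunk or this project.

```python
import base64

def inline_small_refs(refs, read_range, threshold=100):
    """Return a copy of refs where small byte-range entries hold the data itself."""
    out = {}
    for key, ref in refs.items():
        # Byte-range entries have the form [url, offset, length].
        if isinstance(ref, list) and len(ref) == 3 and ref[2] < threshold:
            url, offset, length = ref
            raw = read_range(url, offset, length)
            # Binary data is stored as a base64 string with a "base64:" prefix.
            out[key] = "base64:" + base64.b64encode(raw).decode("ascii")
        else:
            # Metadata strings and large chunks are passed through unchanged.
            out[key] = ref
    return out

# Hypothetical in-memory "file" standing in for remote storage.
FAKE_FILES = {"s3://bucket/file.nc": bytes(range(256)) * 8}

def read_range(url, offset, length):
    return FAKE_FILES[url][offset:offset + length]

refs = {
    ".zgroup": '{"zarr_format": 2}',
    "time/0": ["s3://bucket/file.nc", 16, 8],         # tiny coordinate chunk
    "air/0.0.0": ["s3://bucket/file.nc", 512, 1024],  # larger data chunk
}
inlined = inline_small_refs(refs, read_range, threshold=100)
```

After this runs, `time/0` holds its 8 bytes directly, while `air/0.0.0` still points at a byte range in the original file.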
To implement this we need to actually read those data values into memory in the first place. Once #18 is solved we would already be doing that for 1D coordinate indexes (which are the data we most likely want to inline anyway), so the choice of whether or not to "inline" those values could be deferred until the .to_kerchunk write step. (But you wouldn't be able to inline if you never created the indexes.)
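A rough sketch of deferring that choice to the write step might look like the following. It is only an illustration under stated assumptions: the `Chunk` container and `to_refs` function are hypothetical, not this project's API, and the output again assumes the kerchunk version 1 layout of `[url, offset, length]` entries versus `base64:`-prefixed inline strings.

```python
import base64
from dataclasses import dataclass
from typing import Optional

@dataclass
class Chunk:
    # Hypothetical container: a byte-range pointer plus, optionally, loaded bytes.
    url: str
    offset: int
    length: int
    loaded: Optional[bytes] = None  # set only if the values were read into memory

def to_refs(chunks, inline=True):
    """Build a references dict, choosing inline vs. pointer at write time."""
    refs = {}
    for key, chunk in chunks.items():
        if inline and chunk.loaded is not None:
            # Values are already in memory (e.g. loaded for an index), so inline them.
            refs[key] = "base64:" + base64.b64encode(chunk.loaded).decode("ascii")
        else:
            # Never loaded, or inlining not requested: keep the byte-range pointer.
            refs[key] = [chunk.url, chunk.offset, chunk.length]
    return refs

chunks = {
    "time/0": Chunk("file.nc", 16, 8, loaded=b"\x00" * 8),
    "air/0.0.0": Chunk("file.nc", 512, 1024),
}
refs_inlined = to_refs(chunks, inline=True)
refs_pointers = to_refs(chunks, inline=False)
```

Only `time/0` can be inlined here, because it is the only chunk whose values were ever loaded; `air/0.0.0` stays a pointer regardless of the `inline` flag.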