"Inlining" data when writing references to disk #62

TomNicholas · 2024-03-28T15:47:01Z

Sometimes we might prefer to write actual data values out into the on-disk kerchunk references file (/ zarr store), because it's more efficient than storing byte ranges to point to very small amounts of data. e.g.

you wouldn't want to read ~90k time steps from 90k files to construct a 90K long time coordinate

(Originally posted by @dcherian in #18 (comment))

Kerchunk calls this "inlining".

To implement this we need to actually read those data values into memory in the first place. Once #18 is solved we would already be doing that for 1D coordinate indexes (which are the data we most likely want to inline anyway), so the choice of whether or not to "inline" those values could be deferred until the .to_kerchunk write step. (But you wouldn't be able to inline if you never created the indexes.)
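For readers unfamiliar with the term, here is a minimal sketch of what inlining looks like in a kerchunk-style references dict. It assumes the version 1 reference layout, where each chunk key maps either to a `[url, offset, length]` list or to a string holding the data itself (prefixed with `base64:` for binary data). The bucket paths, offsets, lengths, and the `inline_chunk` helper are made up for illustration.

```python
import base64
import json
import struct

# Hypothetical example: three files, each holding a single int64 time step.
time_values = [0, 1, 2]

# Virtual references: [url, offset, length] pointing into the original files.
# The paths, offsets and lengths here are invented for illustration.
virtual_refs = {
    f"time/{i}": [f"s3://bucket/file_{i}.nc", 1024, 8]
    for i in range(len(time_values))
}


def inline_chunk(value: int) -> str:
    """Encode one int64 value as an inlined reference string (hypothetical helper)."""
    raw = struct.pack("<q", value)  # little-endian int64, 8 bytes
    return "base64:" + base64.b64encode(raw).decode("ascii")


# Inlined references: the chunk bytes live directly in the references file,
# so reading the time coordinate needs no requests to the original files.
inlined_refs = {f"time/{i}": inline_chunk(v) for i, v in enumerate(time_values)}

print(json.dumps({"version": 1, "refs": inlined_refs}, indent=2))
```

The inlined form trades a slightly larger references file for not having to touch any of the original files when reading the coordinate.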

TomNicholas · 2024-03-28T15:48:25Z

To support this in zarr would it be enough to write a mixed store, i.e. one where some arrays are backed by manifest.json files and some by normal zarr chunks? Where the latter is effectively the "inlined" data? cc @jhamman

jhamman · 2024-03-28T20:54:17Z

I imagine there are actually two sources of inlined chunks.

1. The entire array consists of one (or possibly more) small chunks.
2. Some chunks of an array are very small. These could be chunks along the edge of a chunk grid or chunks that were compressed very effectively (e.g. a constant value).

For case (1), duplicating the entire array seems like a decent idea -- perhaps with a new set of encoding options. You can imagine a case where you want to merge your 90k tiny chunks into a single chunk.
For case (2), you may want to do something else. Kerchunk is happy to inline individual chunks based on a length threshold. These could be easily stored in the same manifest along with traditional references.
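
As a rough sketch of what threshold-based inlining for case (2) could look like, the snippet below walks a references dict and replaces any byte-range reference shorter than a threshold with the data itself. This is not kerchunk's implementation: `inline_small_chunks`, the `read_bytes` callback, and the in-memory `files` dict are all hypothetical names used for illustration.

```python
import base64
from typing import Callable, Union

Ref = Union[str, list]  # inlined string, or [url, offset, length]


def inline_small_chunks(
    refs: dict[str, Ref],
    read_bytes: Callable[[str, int, int], bytes],
    threshold: int,
) -> dict[str, Ref]:
    """Return a copy of refs with chunks shorter than `threshold` bytes inlined."""
    out: dict[str, Ref] = {}
    for key, ref in refs.items():
        if isinstance(ref, list) and len(ref) == 3 and ref[2] < threshold:
            url, offset, length = ref
            data = read_bytes(url, offset, length)
            out[key] = "base64:" + base64.b64encode(data).decode("ascii")
        else:
            # Large chunks (and already-inlined entries) are left untouched.
            out[key] = ref
    return out


# Fake in-memory "file" standing in for a remote object.
files = {"a.nc": bytes(range(200))}


def read_bytes(url: str, offset: int, length: int) -> bytes:
    return files[url][offset:offset + length]


refs = {
    "temp/0.0": ["a.nc", 0, 150],   # large chunk, stays a byte-range reference
    "temp/0.1": ["a.nc", 150, 10],  # small edge chunk, gets inlined
}
result = inline_small_chunks(refs, read_bytes, threshold=100)
```

Both kinds of entry end up side by side in the same mapping, which is the sense in which inlined chunks can share a manifest with traditional references.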

TomNicholas · 2024-03-28T21:01:26Z

Case (2) would require adding inlining into the manifest spec right? (xref https://github.com/zarr-developers/zarr-specs) Whereas case (1) is just a mixture of "normal" zarr arrays and arrays where every chunk is represented in the manifest but nothing is inlined into the manifest itself.

TomNicholas · 2024-04-01T16:34:26Z

What's the kerchunk equivalent of case (1)? Is it written into the kerchunk references the same as case (2) is?

jhamman · 2024-04-02T15:42:58Z

In the kerchunk story, (1) and (2) are equivalent.

The special thing about (1) is that you don't have to inline the data in the case of a zarr manifest. You could, for specific variables, choose to rewrite the array, perhaps with new chunking/encoding/etc. Why would you want to do this?

Imagine a situation where you concatenate a dataset with 10k time variables, each of size (1,). This is very easy to inline with an effective chunk size of (1,), but you may want to just rewrite the concatenated time variable as a real zarr array with shape (10000,) and a chunk size of (1000,).
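
To put numbers on that, here is a small back-of-the-envelope sketch comparing how many chunks each layout produces for a 10,000-element time variable. The `n_chunks` helper is hypothetical and only does the ceiling division along each dimension.

```python
import math


def n_chunks(shape: tuple[int, ...], chunks: tuple[int, ...]) -> int:
    """Number of chunks needed to cover `shape` with chunks of size `chunks`."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


shape = (10_000,)

# Inlining each original (1,)-sized variable keeps one entry per time step.
inlined = n_chunks(shape, (1,))

# Rewriting as a real zarr array lets you pick a larger chunk size.
rewritten = n_chunks(shape, (1000,))

print(f"inlined entries: {inlined}, rewritten chunks: {rewritten}")
```

So the rewritten array needs 10 chunks where the inlined version carries 10,000 separate entries.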

TomNicholas · 2024-04-02T16:07:48Z

So in the kerchunk story, (1) is kind of a poor man's Zarr array? Because you've written actual data into the .json file for every chunk in the array. So in a more zarr-native way of doing things, we might as well just write a real zarr array for that variable instead.

FYI #69 implements the opening of such a mixed dataset, but I don't have a way of saving it to disk yet (either through kerchunk or zarr with manifest.jsons).

TomNicholas · 2024-05-01T21:07:26Z

Now that #45 has been merged, the Zarr version of (1) can be implemented. It would be analogous to #73, but would instead use the zarr-python v3 library (or ideally just some part of xarray's normal to_zarr interface) to write only the "loadable variables" into the store on disk.

TomNicholas · 2024-12-18T20:28:03Z

This has been implemented, for both kerchunk and icechunk formats.

TomNicholas added the labels "zarr-specs" (Requires adoption of a new ZEP) and "Kerchunk" (Relating to the kerchunk library / specification itself) on Mar 28, 2024

TomNicholas mentioned this issue Mar 28, 2024

Trying to write combined virtual dataset (for MUR SST) results in TypeError: Can only serialize wrapped arrays... #60

Closed

2 tasks

This was referenced Apr 1, 2024

How to handle encoding #68

Open

Load selected variables instead of making them virtual #69

Merged

This was referenced Apr 5, 2024

Test fsspec roundtrip #42

Merged

Inline loaded variables into kerchunk references #73

Merged

TomNicholas mentioned this issue Jun 8, 2024

Aspirational use case: [C]Worthy mCDR OAE Atlas dataset #132

Open

21 tasks

ghidalgo3 mentioned this issue Aug 2, 2024

Handle scalar dataset variables #205

Merged

5 tasks

ayushnag mentioned this issue Aug 19, 2024

Use xarray's encode_cf / decode_cf functions to handle CF conventions #157

Open

TomNicholas mentioned this issue Aug 20, 2024

Xarray backend which loads data by default #221

Open

maxrjones mentioned this issue Dec 4, 2024

Virtual chunks design document earth-mover/icechunk#436

Merged

TomNicholas closed this as completed Dec 18, 2024
