Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Xarray Cannot Read ConsolidateMetadata Reference Output #675

Open
ranchodeluxe opened this issue Jan 25, 2024 · 7 comments
Open

Xarray Cannot Read ConsolidateMetadata Reference Output #675

ranchodeluxe opened this issue Jan 25, 2024 · 7 comments

Comments

@ranchodeluxe
Copy link
Contributor

ranchodeluxe commented Jan 25, 2024

Versions:

zarr: '2.16.1'
xarray.: '2023.12.0'
pangeo-forge-recipes: @main

Problem:

Given a WriteCombinedReference beam pipeline like this:

    with pipeline as p:
        (
            p
            | beam.Create(pattern.items())
            | OpenWithKerchunk(file_type=pattern.file_type)
            | WriteCombinedReference()
            | ConsolidateMetadata()
        )

we can only read the reference output with zarr.open_consolidated but not xr.open_dataset(..., consolidated=True)

(Pdb) xr.open_dataset(store, engine='zarr', consolidated=True)
*** KeyError: '.zmetadata'

Offending line on deserialization: https://github.com/zarr-developers/zarr-python/blob/main/zarr/storage.py#L2944

indeed looking at the raw reference.json it doesn't contain a .zmetadata key even though zarr.consolidate_metadata should be writing one. Totaly guess, but might have to do with the version branching here: https://github.com/zarr-developers/zarr-python/blob/main/zarr/convenience.py#L1250-L1263

@abarciauskas-bgse
Copy link
Contributor

I also don't know if this is helpful but I'm a bit confused as well how zarr.consolidate_metadata is intended to or ever was working. I was also looking at convenience.py and noticed that consolidate_metadata returns open_consolidated and open_consolidated uses, by default an r+ mode which has a docstring:

'r+' means
read/write (must exist) although only writes to data are allowed,
changes to metadata including creation of new arrays or group
are not allowed.

Which stuck out to me because this method is intended to add a metadata group, I think.

Overall, I'm not seeing where the .zmetadata key would be written to disk, either as its own file or into the reference json file. But I'm probably missing something?

@maxrjones
Copy link
Contributor

@martindurant suggested in fsspec/kerchunk#240 (comment) adding consolidation in MultiZarrtoZarr, if I understand correctly. In general, I haven't seen examples of Zarr Python being used to modify kerchunk reference files, so I'm wondering if that functionality would need to be provided by kerchunk?

@rabernat
Copy link
Contributor

rabernat commented Jan 26, 2024

I also don't know if this is helpful but I'm a bit confused as well how zarr.consolidate_metadata is intended to or ever was working.

You can have a look at how Xarray does it.
https://github.com/pydata/xarray/blob/ca4f12133e9643c197facd17b54d5040a1bda002/xarray/backends/zarr.py#L660-L661

Calling zarr.consolidate_metadata(store) mutates the store. It lists the metadata documents, builds the consolidated metadata dictionary, and writes it as json to the .zmetadata in the store.

For some thoughts on what belongs in Zarr vs. Kerchunk, I shared my perspective here: fsspec/kerchunk#377 (comment)

@maxrjones
Copy link
Contributor

For some thoughts on what belongs in Zarr vs. Kerchunk, I shared my perspective here: fsspec/kerchunk#377 (comment)

It was helpful to read your perspective on where features belong, thanks for sharing. I interpreted "So when it comes time to refactor those prototypes into more stable interfaces" as when Zarr V3 is finalized and the base implementation is available in Zarr Python. Does the concept of a chunk manifest have the potential to apply to the V2 implementation in Zarr Python separately from kerchunk, to guide where the modifying references and metadata would happen if this feature were needed in the very near future?

@rabernat
Copy link
Contributor

V2 is not extensible. The lack of extension points was one of the main reasons we needed to create V3.

The lack of extensibility of V2 has driven people to implement all kinds of creative hacks at the Store level. Kerchunk reference filesystem is a perfect example of that. But the limitation of the Store API is that it's just a key / value interface. Stores don't "know" anything about Arrays, Groups, etc. This fact has led to Kerchunk essentially having to re-implement a bunch of logic that exists elsewhere in the stack, e.g. for concatenating arrays.

It should certainly be possible to generate V3 metadata for existing V2 datasets without rewriting any chunks. That's the best path for migrating existing data. I don't think V2 will every be able to retroactively gain something like a chunk manifest.

@martindurant
Copy link
Contributor

Before going too far, why would you want to consolidate metadata? In the case of kerchunk, all of the metadata is already separated, and "consolidate" would just duplicate information to no benefit.

Having said that, I suppose the only thing missing from the pipeline is to write the references back. As @rabernat said, they have been changed in memory but not saved. Here is an example of amending kerchunk references - it works fine for chunks or metadata!

when Zarr V3 is finalized

A lot of development has been waiting on this.

Kerchunk essentially having to re-implement a bunch of logic that exists elsewhere in the stack

Well, persisting xarray's internal representation would have been better, but not tractable (by me!). That's assuming xarray-compatible datasets, which does not cover all zarrable datasets.

@rabernat
Copy link
Contributor

A lot of development has been waiting on this.

Agreed. The slow and uneven rollout of V3 has really been a drag on our whole community. 😞 Lots of lessons to be learned from this.

Fortunately there is a light at the end of the tunnel, thanks to the work @jhamman has been doing to finally get V3 implemented in Zarr Python.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants