Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consolidated metadata storage for Zarr v3 #136

Open
grlee77 opened this issue Mar 11, 2022 · 12 comments · May be fixed by #309
Open

consolidated metadata storage for Zarr v3 #136

grlee77 opened this issue Mar 11, 2022 · 12 comments · May be fixed by #309
Labels
protocol-extension Protocol extension related issue

Comments

@grlee77
Copy link
Contributor

grlee77 commented Mar 11, 2022

In the current draft of the v3 spec in zarr-developers/zarr-python#898, consolidated metadata has been implemented similarly to v2. Currently this metadata gets stored in the group meta/root/consolidated, but this is not part of the spec. Likely we need to specify an extension for how consolidated metadata should be handled.

@ianliu
Copy link

ianliu commented Sep 6, 2022

Is the V3 considering metadata compression? I'm working with a dataset where the consolidated metadata (v2) has 1.2GB in size, and opening it from S3 takes a very long time to open.
After gzipping the metadata, it reduces to 20MB, so it would be nice to have compressed metadata.

@alimanfoo
Copy link
Member

Hi @ianliu, thanks this is very useful to know. Adding a capability for compression into a v3 extension for consolidated metadata sounds like a good idea to me.

@rabernat
Copy link
Contributor

rabernat commented Sep 7, 2022

@ianliu I'm curious what makes the metadata so big in your case? Do you have lots of arrays? Or very large custom attrs?

V3 allows for defining of custom metadata encoding, so a compressed format could be used.

Also, a major question for V3 is whether user metadata is a separate document or not (see #149 (comment)). If you have 1GB of custom attrs in a single array, that would be a strong motivation not to put user metadata together with the core array metadata.

@ianliu
Copy link

ianliu commented Sep 7, 2022

@rabernat in my particular case, I have many small arrays. It could be argued that there is a better data representation that would consolidate multiple arrays into a single array, thus reducing metadata, but the way our program is setup makes this a little bit trickier (mainly because our program produces results in parallel).

@joshmoore
Copy link
Member

💯 for potential consolidated metadata features but

but this is not part of the spec. Likely we need to specify an extension for how consolidated metadata should be handled.

would this suggest stripping the current consolidated metadata support in V3 until there's an extension?

@TomAugspurger
Copy link

I've drafted a version of this at https://github.com/TomAugspurger/zeps/blob/consolidated-metadata/draft/ZEP0006.md. A few questions before turning that into a PR:

Is this appropriate for a ZEP? Or would it go in the zarr-specs repo?


One major design decision is whether this consolidated metadata should be embedded in the zarr.json, or whether it should go in a file next to the root zarr.json (e.g. consolidated_metadata.json). Doing it in the zarr.json would make collecting all the metadata possible in a single HTTP request. With separate zarr.json and consolidated_metadata.json you would need two. The downside of putting it in zarr.json is potentially making any read of that file slow, even if you don't need the consolidated metadata.

I'm also a bit curious about how much async zarr-python reduces the pain of hierarchies with a large number of nodes. With dozens of HTTP requests in flight at the same time, maybe it's not so bad? If #284 (listing the child nodes in the root zarr.json) were implemented, then you'd be able to make all but the first HTTP request concurrently.

@jhamman
Copy link
Member

jhamman commented Aug 21, 2024

I like the draft ZEP @TomAugspurger. Thanks for putting it together.

As far as I understand, the current spec does not provide a way to add arbitrary extensions to a metadata document (this could be changed though). Provided we can address this challenge, we likely will want to support both an inline and external consolidated metadata object.

// inline
"consolidated_metadata": {
    "kind": "inline":
    "metadata": {
        "air": {
          "shape": [2920, 25, 53],
          "fill_value": 0,
          ...
          "node_type": "array",
      },
      "lat": {
          "shape": [25],
          "fill_value": NaN,
          ...
          "node_type": "array",
      }
  }
}
// external
"consolidated_metadata": {
    "kind": "external":
    "metadata": "consolidated_metadata.json"
}

An alternative direction we could take is to define a Consolidated Metadata Store which could enable this functionality without needing to touch the root metadata document.

@TomAugspurger
Copy link

the current spec does not provide a way to add arbitrary extensions to a metadata document

In the sense that that's just not spelled out anywhere as a thing that can be done? Or that adding additional keys might break things?

we likely will want to support both an inline and external consolidated metadata object

I like the idea of future-proofing things by including a "kind" enum to indicate where to find the consolidated metadata, maybe we pick only one for the spec? This is minor, but I worry about putting too much complexity on the implementors.

@jhamman
Copy link
Member

jhamman commented Aug 22, 2024

In the sense that that's just not spelled out anywhere as a thing that can be done? Or that adding additional keys might break things?

I don't think we have defined the expected behavior for what will happen if an implementation encounters and unexpected key. So yes, I think thinks could break.

edit: there is this section in the spec:

The array metadata object must not contain any other names. Those are reserved for future versions of this specification. An implementation must fail to open Zarr hierarchies, groups or arrays with unknown metadata fields, with the exception of objects with a "must_understand": false key-value pair.

Now, I don't know how/where you would put "must_understand" in this case. See also #270

maybe we pick only one for the spec?

+1 one, pick one for now. We can always expand this later.

@TomAugspurger
Copy link

I'm not sure I see the motivation for disallowing extra keys in the metadata objects. That seems to make it hard to evolve the spec in forwards-compatible ways.

Anyway, maybe in this case it's fine to use because we're making a "future version of the spec"? And I guess we'll set must_understand: false to indicate that older versions shouldn't error, they should just ignore the extra key and fall back to the non-consolidated metadata.

@joshmoore
Copy link
Member

Is this appropriate for a ZEP? Or would it go in the zarr-specs repo?

In my mind, this is a great ZEP.

from PR (TODO: is there a better way to indicate that a dataset uses an extension?)

That should definitely be some form of metadata in zarr.json. I'd very much be for using this as a way to kick the tires of the extension process. i.e., if you/we can't do it, then that needs to be fixed at the spec level.

jhamman commented 2 days ago
As far as I understand, the current spec does not provide a way to add arbitrary extensions to a metadata document (this could be changed though)

I think it's implicitly possible but not (well-)documented. So 👍 for doing what we need to do to make it possible.

TomAugspurger commented 20 hours ago
An alternative direction we could take is to define a Consolidated Metadata Store which could enable this functionality without needing to touch the root metadata document.

I'd urge caution here. This led to many issues in v2. Minimally it should be a "wrapping store" so that they can be composed, because otherwise they lead to an explosion of classes that are needed. (e.g. "NestedRemoteDirectoryStore") Additionally, I'd suggestthe metadata MUST be in the zarr.json regardless of the implementation.

Now, I don't know how/where you would put "must_understand" in this case. See also #270

👍 for having must_understand set. FWIW, I've also been considering writing a ZEP to say that "all metadata will be in a RO-Crate file" for OME-Zarr. This might could go hand in hand. My thinking was there should be a flag in zarr.json to say "all metadata loading should follow this other protocol", i.e. a loader or something that could be pip installed.

@TomAugspurger TomAugspurger linked a pull request Aug 23, 2024 that will close this issue
@TomAugspurger
Copy link

Whoops, I just pushed this as a PR at #309, rather than a ZEP, sorry.

That should definitely be some form of metadata in zarr.json. I'd very much be for using this as a way to kick the tires of the extension process. i.e., if you/we can't do it, then that needs to be fixed at the spec level.

I'll open a separate issue to discuss this. In my experience, STAC's method of each object containing a stac_extensions list with a list of extensions has worked well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
protocol-extension Protocol extension related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants