NCZarr - Netcdf Support for Zarr #41
Naming issue: if NCZarr-specific metadata is stored as attributes, the keys need a non-conflicting naming convention. So, we might have "__NCZ_dimensions" instead of .zdimensions. |
Thanks for opening this @DennisHeimbigner. Encountered issue zarr-developers/zarr-python#280 again recently, so figured it might interest you given some of this discussion about how to manage specs, though each issue has its own place I think. If we do go down the attribute road, I agree that having some non-conflicting name convention is important. The other option might be to widen the spec of things like |
Thanks @DennisHeimbigner. Just to add that, on the question of whether to pack everything into attributes (.zattrs) or whether to store metadata separately under other store-level keys (.zdims, .ztypdefs, etc.), I think both are reasonable and personally I have no objection to either. I lean slightly towards using attributes (.zattrs) because it plays nicely with some existing API features. E.g., the NCZ metadata can be accessed directly via the attributes API. And, e.g., the NCZ metadata would get included if using consolidated metadata, which is an experimental approach to optimising cloud access, available in the next release of Zarr Python. But neither of these are blockers to the alternative approach, because it is straightforward to read and decode JSON objects directly from a store, and it would also be straightforward to modify the consolidated metadata code to include other objects. |
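To make the attribute-based option concrete, here is a minimal sketch using the zarr-python API; the "__NCZ_dimensions" key is only the naming proposal from this thread, and the dimension sizes are made up:

```python
import zarr

# Sketch only: "__NCZ_dimensions" is the naming proposal discussed above,
# not an agreed NCZarr convention, and the dimension sizes are invented.
store = zarr.DirectoryStore("example.zarr")
root = zarr.group(store=store, overwrite=True)
root.attrs["__NCZ_dimensions"] = {"time": 8760, "lat": 180, "lon": 360}

# Reading goes through the ordinary attributes API, so the NCZ metadata would
# also be swept up automatically by consolidated metadata.
reopened = zarr.open_group(store, mode="r")
print(reopened.attrs["__NCZ_dimensions"])
```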
We have learned from the existing netcdf-4 that datasets exist with very large (~14gb) metadata. |
Here is a proposed consolidated metadata structure for NCZarr. |
We have learned from the existing netcdf-4 that datasets exist with very large (~14gb) metadata.

Wow, that's big. I think anything near that size will be very sub-optimal in zarr, because of metadata being stored as uncompressed JSON documents. I wonder if in cases like that, it might be necessary to examine what is being stored as metadata, and if any largish arrays are included then consider storing them as arrays rather than as attributes.

I was looking at the Amazon S3 query capabilities and they are extremely limited. So the idea of consolidated metadata seems like a very good idea. This reference -- https://zarr.readthedocs.io/en/latest/tutorial.html#consolidating-metadata -- does not provide any details of the form of the proposed consolidated metadata.

Apologies the format is not documented as yet. There's an example here:
zarr-developers/zarr-python#268 (comment)
|
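For orientation, a minimal sketch of the consolidated-metadata workflow in zarr-python (the array name and attribute are made up); consolidation copies every .zgroup/.zarray/.zattrs document into a single .zmetadata key so a reader can fetch the hierarchy's metadata in one request:

```python
import zarr

store = zarr.DirectoryStore("example.zarr")
root = zarr.group(store=store, overwrite=True)
root.create_dataset("temperature", shape=(100, 100), chunks=(10, 10), dtype="f4")
root.attrs["title"] = "made-up example"

# Gather all metadata documents into the single ".zmetadata" object.
zarr.consolidate_metadata(store)

# Readers can then open the hierarchy from that one object.
root2 = zarr.open_consolidated(store, mode="r")
print(root2["temperature"].shape)
```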
That was a typo. The correct size is 14 MB. |
Ah, OK! Although 14MB is still pretty big, it's probably not unmanageable. |
Depends on what manageable means, I suppose. We have situations where projects are trying to load a small part of the metadata from thousands of files, each of which has that amount of metadata. Needless to say, this is currently very slow. We are trying various kinds of optimizations around lazy loading of metadata, but the limiting factor will be HDF5. A similar situation is eventually going to occur here, so thinking about various optimizations is important.
That's helpful to know.

FWIW the consolidated metadata feature currently in zarr python was developed for the xarray use case, where the need (as I understand it) is to load *all* metadata up front. So that feature combines the content from all .zarray, .zgroup and .zattrs objects from the entire group and dataset hierarchy into a single object, which can then be read from object storage in a single HTTP request.

If you have use cases where you have a large amount of metadata but only need to read parts of it at a time, that obviously might not be optimal. However, 14MB is not an unreasonable amount to load from object storage, and would probably be fine to do interactively (IIRC bandwidth to object storage from compute nodes within the same cloud is usually ~100MB/s).

I'm sure there are other approaches that could be taken to support partial/lazy loading of metadata. Happy to discuss at any point.
|
Are you able to provide data on where most of the time is being spent, @DennisHeimbigner? |
Issue: Attribute Typing. This is most important with respect to the _FillValue attribute for a variable. Additionally, if the variable is of a structured type, there is currently no standardized JSON representation for such values. Sadly, this means that NCZarr must add yet another attribute that specifies the type of the attribute. |
Regarding the fill value specifically, the standard metadata for a zarr array includes a fill_value field. Regarding attributes in general, we haven't tried to standardise any method to encode values that do not have a natural JSON representation. Currently it is left to the application developer to decide their own method for encoding and decoding values as JSON, e.g., I believe xarray has some logic for encoding values in zarr attributes. There has also been some discussion of this at #354 and #156. Ultimately it would be good to standardise some conventions (or at least define some best practices) for representing various common value types in JSON, such as typed arrays. I'm more than happy for the community to lead on that. |
This reference -- https://zarr.readthedocs.io/en/stable/spec/v2.html#fill-value-encoding -- |
I.e., use base 64 encoding. |
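A minimal sketch of that idea, assuming the intent is simply to base64-encode the raw bytes of the fill value (the structured dtype and fill value here are made up):

```python
import base64
import numpy as np

# Made-up structured dtype and fill value, standing in for a netCDF compound type.
dt = np.dtype([("temp", "<f4"), ("count", "<i4")])
fill = np.array((9.96921e36, -1), dtype=dt)

# Encode the raw bytes as base64 so the value fits inside JSON metadata...
encoded = base64.standard_b64encode(fill.tobytes()).decode("ascii")

# ...and decode back to the typed value on read.
decoded = np.frombuffer(base64.standard_b64decode(encoded), dtype=dt)[0]
print(encoded, decoded)
```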
So it would be nice if we had a defined language-independent algorithm |
That would be good. I believe numpy mimics C structs, further info here. Looking again at the numpy docs, there is support for an
That's a nice idea, would fit with the design principle that metadata are human-readable/editable.
The zarr spec does currently defer to numpy as much as possible, assuming that much of the hard thinking around things like types has been done there already. If there are clarifications that we could make to the v2 spec that would help people develop compatible implementations in other languages then I'd welcome suggestions. Thinking further ahead to the next iteration on the spec, it would obviously be good to be as platform-agnostic as possible, however it would also be good to build on existing work rather than do any reinvention. The work on ndtypes may be relevant/helpful there. |
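For illustration of the numpy/C-struct point above, a small sketch (field names made up) of how numpy describes a struct-like dtype in a form that serialises naturally to JSON and round-trips back:

```python
import numpy as np

# A structured dtype mirroring the C struct { float temp; int32_t count; }.
dt = np.dtype([("temp", "<f4"), ("count", "<i4")])

# .descr gives a nested description that maps directly onto JSON lists, and
# np.dtype() accepts the same description back; the zarr v2 "dtype" field can
# carry this list form for structured types.
descr = dt.descr            # [('temp', '<f4'), ('count', '<i4')]
assert np.dtype(descr) == dt
```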
Surfacing here notes on the NetCDF NCZarr implementation, thanks @DennisHeimbigner for sharing. |
Also relevant here, documentation of xarray zarr encoding conventions, thanks @rabernat. |
@DennisHeimbigner: It looks like Unidata's Netcdf C library can now read data with the xarray zarr encoding conventions, right? @rabernat, should I raise an issue for xarray to also support the Unidata NcZarr conventions? |
The ability to read the xarray conventions is in the main branch, and will be in the upcoming 4.8.1 release. I am shaving the yak to get our automated regression and integration test infrastructure back up and running, but we hope to have 4.8.1 out shortly. |
I see this as very difficult. The reason is that the ncZarr conventions use files outside of the zarr hierarchy. We would probably need to implement a compatibility layer as a third-party package, similar to h5netcdf. p.s. but yes, please open an xarray issue to keep track of it. |
One thing I'll note on Xarray's convention for Zarr is that we will likely change things in the near future to always write and expect "consolidated metadata" (see pydata/xarray#5251). This is almost completely backwards compatible, but if NcZarr writes these consolidated metadata fields in Xarray compat mode we could load these Zarr stores much more quickly in Xarray. Consolidated metadata would probably be a nice feature for NcZarr, too, because it reduces the number of files that need to be queried for metadata down to only one. I think there was a similar intent behind the |
NCZarr gets a similar improvement by doing lazy reads of metadata objects. That is one problem with _ARRAY_DIMENSIONS -- it requires us to read all attributes even if they are otherwise unneeded. NCZarr avoids this by keeping the dimension names separate. |
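For context, a sketch of what a single array's .zattrs looks like under the xarray convention mentioned above (the attribute values are made up):

```python
# Per-array .zattrs under the xarray "_ARRAY_DIMENSIONS" convention.
# Reading the dimension names requires fetching this whole attributes object,
# which is the coupling NCZarr avoids by keeping dimension info in its own objects.
zattrs = {
    "_ARRAY_DIMENSIONS": ["time", "lat", "lon"],
    "units": "K",
    "long_name": "air temperature",
}
```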
In Xarray, we have to read nearly all the metadata eagerly to instantiate the Dataset.
This is correct, you don't need to write consolidated metadata. But if you do, Xarray will be able to read the data much faster. As for whether netCDF users would notice a difference with consolidated metadata, I guess it would depend on their use-cases. Lazy metadata reads are great, but for sure it is faster to download a single small file than to download multiple files in a way that cannot be fully parallelized, even if they add up to the same total size. |
true, but we have use cases where the client code is walking a large set of netcdf files and reading a few pieces of information out of each of them and where the total metadata is large (14 megabytes). This can occur when one has a large collection of netcdf files covering some time period and each netcdf file is a time slice (or slices). |
For what it's worth, I could see making some movement (June-ish?) on #112 (comment) to permit the additional files. But either way, certainly ome/ngff#46 (review) (related issue) would suggest hammering out a plan for this difference before another package introduces a convention.
Having gone through pydata/xarray#5251 I'm slightly less worried about this than when I first read it (assuming it meant it would only support consolidated metadata), but having just spent close to 2 months trying to get
@DennisHeimbigner, just a quick comment that I too always use consolidated metadata when writing Zarr. Here's a recent example with coastal ocean model output we are publishing, where consolidated metadata is an order of magnitude faster to open: |
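A minimal sketch of that comparison (the bucket URL is a placeholder, not the dataset referred to above):

```python
import fsspec
import xarray as xr

# Placeholder URL for some publicly readable Zarr store.
store = fsspec.get_mapper("s3://example-bucket/ocean-model.zarr", anon=True)

# One request for .zmetadata versus one request per .zgroup/.zarray/.zattrs:
ds_consolidated = xr.open_zarr(store, consolidated=True)
ds_unconsolidated = xr.open_zarr(store, consolidated=False)
```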
Note that the issue for me is: for what use-cases is lazy metadata download better than consolidated metadata. The latter is better in the cases where you know that you need to access almost all of the meta-data or where the total size of the metadata is below some (currently unknown) size threshold. My speculation is that the access patterns vary all over the place and are highly domain dependent. I infer that Rich's use case is one where all the metadata is going to be accessed. In any case, once .zmetadata is well-defined (see Josh's previous comment) I will be adding it to nczarr. However, we will probably give the user the choice to use it or not if lazy download makes more sense for their use-case. On the other side, it seems to me that zarr-python might profitably explore lazy download of the metadata. |
I agree, I think any reasonable implementation should *ignore* unrecognized keys or files. Hopefully this will be codified in zarr v3.

On Sun, Aug 1, 2021 at 11:30 AM Dennis Heimbigner wrote:

There was a long conversation about this on the Zarr spec github. I pointed out that the new "dimension_separator" key violated this constraint. The consensus seemed to be that extra keys would be allowed, but must be ignored if they are not recognized by the implementation. I have not checked to see if that change has made it into the spec yet.
|
ok, thanks for the clarification |
For reference: zarr-developers/zarr-python#715 (comment). That didn't make it into a zarr-specs issue (neither v2 nor v3) as far as I can tell. Anyone up for shepherding that? |
See the related conversation in pydata/xarray#6374 ("Should the [xarray-]zarr backend support NCZarr conventions?") |
@DennisHeimbigner does NCZarr support any filters now? |
Yes, although there are some complications because the code uses HDF5 filters to perform |
@DennisHeimbigner Do you have documentation about how to enable and use filters through NCZarr? We have a new compressor which is not bound to any codec yet. Do you have a suggestion on how to enable it in your NCZarr? |
@DennisHeimbigner @halehawk Maybe I should jump in now ;) I have a lossy compressor product (SPERR: https://github.com/shaomeng/SPERR) for which I'm looking at paths to integrate into the Zarr format. I haven't spent too much time on it, but my understanding is that I need to make it a Zarr filter. Our immediate application of it, an ASD run of MURam, has decided to use NCZarr to output Zarr files, so the question arises whether Zarr filters are supported by NCZarr. I guess the most direct question to @DennisHeimbigner as the NCZarr developer is: what approach do you recommend to integrate a lossy compressor into an NCZarr output? |
If the compressor is (or easily could be) written in python, then see the NumCodecs web page. |
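A rough sketch of what that NumCodecs path could look like; the codec name and the sperr_compress/sperr_decompress calls are placeholders, not SPERR's real bindings:

```python
import numcodecs
from numcodecs.abc import Codec


class SperrCodec(Codec):
    """Sketch of a numcodecs wrapper; the SPERR calls below are placeholders."""

    codec_id = "sperr"  # hypothetical identifier, not a registered codec name

    def __init__(self, tolerance=1e-6):
        self.tolerance = tolerance

    def encode(self, buf):
        return sperr_compress(buf, self.tolerance)   # placeholder binding

    def decode(self, buf, out=None):
        return sperr_decompress(buf, out=out)        # placeholder binding


# Registration makes {"id": "sperr", ...} in .zarray resolvable by zarr-python.
numcodecs.register_codec(SperrCodec)
```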
That's super helpful, thanks for the pointer! One more question: can the compression-enabled NCZarr output be read by zarr tools in the python ecosystem?
On Fri, Apr 8, 2022, 3:56 PM Dennis Heimbigner wrote:

If the compressor is (or easily could be) written in python, then see the NumCodecs web page. If the compressor is in C or C++, and you decide to use netcdf-c NCZarr, then you need to build an HDF5 compressor wrapper plus the corresponding codecs API. I have attached the relevant documentation. If this compressor is similar to some existing compressor such as bzip2 or zstandard, then you can copy and modify the corresponding wrapper in the netcdf-c/plugins directory -- H5Zzstd.c, for example.

filters.md <https://github.com/zarr-developers/zarr-specs/files/8455495/filters.md>
|
Yes, IF the filters are available in NumCodecs. |
Does it mean the compressor would better be integrated into numcodecs if it is to be used by both nczarr and zarr/xarray?
Haiying
|
Sorry, I wasn't clear. |
Hi @DennisHeimbigner, there is some confusion our team has and we would love to know if you can comment on it. The confusion is: do we even need to make an HDF5 filter for the SPERR compressor? Because NCZarr supports NumCodecs filters, isn't it the case that once we make a NumCodecs filter for SPERR, both NCZarr and Python-Zarr can read and write SPERR-compressed zarr files? More generally, are there any advantages/disadvantages to producing an HDF5 filter for SPERR, if all we want is SPERR-compressed Zarr files? |
There are two pieces here and I am sorry I was unclear. The first piece is the declaration of the compressor for a variable in the Zarr metadata. This is specified in the "compressor" key for the .zarray metadata object for the variable. The format for this is defined by NumCodecs and generally has the form {"id": "<compressor name>", "parameter1": <value>, ... "parametern": <value>}. So for zstd, we might have this: {"id": "zstd", "level": 5}. The second part is the actual code that implements the compressor. NCZarr supports the first part so that it can read/write legal Zarr metadata. BUT, NCZarr requires its filter code to be written in C (or C++). More specifically, it does not support the Python compressor code implementations. Sorry for the confusion. |
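For illustration of the first piece, a minimal zarr-python sketch showing how that compressor declaration ends up in an array's .zarray object (path and parameters made up):

```python
import numcodecs
import zarr

# Creating an array with a zstd compressor writes the NumCodecs-style
# declaration into the array's .zarray metadata object.
z = zarr.open(
    "example.zarr", mode="w",
    shape=(1000,), chunks=(100,), dtype="f4",
    compressor=numcodecs.Zstd(level=5),
)
print(z.compressor.get_config())   # -> {'id': 'zstd', 'level': 5}
```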
@DennisHeimbigner I still need you to clarify something here. I looked at your H5Zzstd.c, which is an HDF5 plugin for zstd and supports numcodecs zstd read/write. Then I got this idea: Samuel's new compressor need not get a formal HDF5 filter ID, but should add a similar H5Zsperr.c to your netcdf-c if he does not need to write to HDF5/netcdf files now. But it needs to be in numcodecs to get Zarr/numcodecs support.

On Tue, Apr 12, 2022 at 12:20 PM Dennis Heimbigner wrote:
There are two pieces here and I am sorry I was unclear. The first piece is the declaration of the compressor for a variable in the Zarr metadata. This is specified in the "compressor" key for the .zarray metadata object for the variable. The format for this is defined by NumCodecs and generally has the form {"id": "<compressor name>", "parameter1": <value>, ... "parametern": <value>}. So for zstd, we might have this: {"id": "zstd", "level": 5}. The second part is the actual code that implements the compressor. NCZarr supports the first part so that it can read/write legal Zarr metadata. BUT, NCZarr requires its filter code to be written in C (or C++). More specifically, it does not support the Python compressor code implementations. Sorry for the confusion.
|
Just to clarify, did you mean that NCZarr requires its filter code to be in C AND also exposed to NCZarr as HDF5 filters? E.g., NumCodecs filters won't work. Sorry for the back and forth in this github thread. I think this is my last try and if there's still confusion, I'll try to set up a meeting and resolve it more directly :) |
I don't know the details of how codecs are defined for NCZarr, but in general you will need to provide a separate implementation of a codec for each zarr implementation that you want to support it. Zarr-python provides a mechanism by which codecs can be registered --- numcodecs defines many codecs, and zarr-python pulls in numcodecs as a dependency, but it is actually possible to define a codec for zarr-python outside of the numcodecs package --- see for example the imagecodecs Python package. |
That is correct. The HDF group reserves ids 32768 to 65535 for unregistered use. |
First a big 💯 for the discussion, since this is exactly what we want to see happening for cross-implementation support of codecs. @shaomeng & @halehawk, don't hesitate to keep asking. I do wonder, @DennisHeimbigner, if we don't want to establish the channel you'd like for more nczarr questions. If so, I'd say we update the start and end of this thread with that URL and close this issue. Others may want to express an opinion, but if it's useful, we can have a no-code location like github.com/zarr-developers/nczarr for people to find a README pointing to the netcdf-c implementation's resources. cc: @WardF |
Sorry for the late comment on this; I would agree that maybe a 'Github Discussions' post would be a better place for this, instead of the issue we are working within. We can create that over at the |
21-050r1_Zarr_Community_Standard.pdf |
Pertinent text from @briannapagan's link above...
|
That information is out-of-date in a couple of ways.
|
Thanks for calling that out, @DennisHeimbigner. This came out of a conversation over here: zarr-developers/geozarr-spec#22. There are very few people who have a deep enough understanding of the moving parts here to answer all the questions. It's good to hear that we basically have interoperability. Two questions:
|
I am moving the conversation about NCZarr to its own issue. See issue https://github.com/zarr-developers/zarr/issues/317 for the initial part of this discussion.