
NCZarr - Netcdf Support for Zarr #41

Open
DennisHeimbigner opened this issue Jan 12, 2019 · 82 comments
@DennisHeimbigner

I am moving the conversation about NCZarr to its own issue. See issue https://github.com/zarr-developers/zarr/issues/317 for the initial part of this discussion.

@DennisHeimbigner
Author

Naming issue:
I have about convinced myself that, rather than creating KVP-level objects
like .zdimensions, I should just use the existing Zarr attribute mechanism.
In order to do this, it is necessary to set up some naming conventions for such
attributes. Basically, we need to identify that an attribute is special (and probably
hidden) and to which extension(s) it applies.
For NCZarr, let me propose this:

  1. All such attributes start with two underscores.
  2. Next is a 2-4 character tag specific to the extension: "NCZ" for NCZarr.
  3. Another underscore.
  4. The rest of the attribute name.

So, we might have "__NCZ_dimensions" instead of .zdimensions.
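For illustration, a variable's .zattrs following this convention might then look like the example below (the layout of the dimensions value is purely hypothetical, just to show the naming scheme):

  {
    "units": "degrees_north",
    "__NCZ_dimensions": ["lat", "lon"]
  }

Ordinary attributes are untouched; only double-underscore names are reserved for extensions.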

@jakirkham
Member

Thanks for opening this @DennisHeimbigner.

Encountered issue zarr-developers/zarr-python#280 again recently, so figured that might interest you given some of this discussion about how to manage specs. Though each issue has its own place, I think.

If we do go down the attribute road, I agree that having some non-conflicting naming convention is important. The other option might be to widen the spec of things like .zarray to allow specs subclassing Zarr's spec to add additional relevant content here, as others have mentioned. A third option, similar to what you have done, would be to add something like .zsubspec, which users can fill as needed. We might need certain keys in there like subspec name, subspec version, etc., but otherwise leave it to users to fill these out as needed.
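A hypothetical .zsubspec along those lines (all key names invented for illustration) could be as small as:

  {
    "subspec_name": "nczarr",
    "subspec_version": "1.0.0",
    "content": { }
  }

with everything under "content" left for the subspec to define.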

@alimanfoo
Member

Thanks @DennisHeimbigner.

Just to add that, on the question of whether to pack everything into attributes (.zattrs) or whether to store metadata separately under other store-level keys (.zdims, .ztypdefs, etc.), I think both are reasonable and personally I have no objection to either.

I lean slightly towards using attributes (.zattrs) because it plays nicely with some existing API features. E.g., the NCZ metadata can be accessed directly via the attributes API. And, e.g., the NCZ metadata would get included if using consolidated metadata, which is an experimental approach to optimising cloud access, available in the next release of Zarr Python. But neither of these are blockers to the alternative approach, because it is straightforward to read and decode JSON objects directly from a store, and it would also be straightforward to modify the consolidated metadata code to include other objects.

@DennisHeimbigner
Author

DennisHeimbigner commented Jan 14, 2019

We have learned from the existing netcdf-4 that datasets exist with
very large (~14 MB) metadata.
I was looking at the Amazon S3 query capabilities and they are extremely limited.
So the idea of consolidated metadata seems like a very good idea.
This reference:
https://zarr.readthedocs.io/en/latest/tutorial.html#consolidating-metadata
does not provide any details of the form of the proposed consolidated metadata.
Note that there may not be any point in storing all of the metadata, especially
if lazy reading of metadata is being used (as it is in the netcdf-4 over hdf5
implementation).
Rather I think that what is needed is just a skeleton so that query is never needed:
we would consolidate the names and kinds (group, variable, dimension, etc.)
and leave out e.g. attributes and variable types and shapes.
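A minimal sketch of such a skeleton, assuming we record only names and kinds (all identifiers hypothetical):

  {
    "groups": ["/", "/forecast"],
    "variables": ["/t", "/p", "/forecast/t"],
    "dimensions": ["/time", "/lat", "/lon"]
  }

Attributes, types, and shapes would then be fetched lazily, per object, only when needed.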

@DennisHeimbigner
Author

Here is a proposed consolidated metadata structure for NCZarr.
It would be overkill for standard Zarr, which is simpler.
Sorry if it is a bit opaque since it is a partial Antlr grammar.
nczmetadata.txt

@alimanfoo
Member

alimanfoo commented Jan 14, 2019 via email

@DennisHeimbigner
Author

That was a typo. The correct size is 14 MB.

@alimanfoo
Member

That was a typo. The correct size is 14 MB.

Ah, OK! Although 14 MB is still pretty big, it's probably not unmanageable.

@DennisHeimbigner
Author

Depends on what manageable means, I suppose. We have situations where
projects are trying to load a small part of the metadata from thousands of files,
each of which has that amount of metadata. Needless to say, this is currently
very slow. We are trying various kinds of optimizations around lazy loading
of metadata, but the limiting factor will be HDF5. A similar situation
is eventually going to occur here, so thinking about various optimizations
is important.

@alimanfoo
Member

alimanfoo commented Jan 15, 2019 via email

@jakirkham
Member

Are you able to provide data on where most of the time is being spent, @DennisHeimbigner?

@DennisHeimbigner
Author

Issue: Attribute Typing
I forgot to address one important difference between the netcdf-4 model and Zarr:
attribute typing. In netcdf-4, attributes have a defined type. In Zarr, attributes are
technically untyped, although in some cases it is possible to infer a type from the value
of the attribute.

This is most important with respect to the _FillValue attribute for a variable.
There is an implied constraint (in netcdf-4 anyway) that the type of the attribute
must be the same as the type of the corresponding variable. There is no way to
guarantee this for Zarr except by doing inference.

Additionally, if the variable is of a structured type, there is currently no standardized
way to define the fill value for such a type nor is there a way to use structured types
with other, non-fillvalue, attributes.

Sadly, this means that NCZarr must add yet another attribute that specifies
the types of other attributes associated with a group or variable.
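As an illustration (the attribute name and type notation here are hypothetical), a group's or variable's .zattrs could carry one reserved attribute mapping the other attributes to their netcdf-4 types:

  {
    "_FillValue": -999,
    "valid_range": [0, 100],
    "__NCZ_attr_types": {"_FillValue": "int32", "valid_range": "int32"}
  }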

@alimanfoo
Member

Hi @DennisHeimbigner,

Regarding the fill value specifically, the standard metadata for a zarr array includes a fill_value key. There are also rules about how to encode fill values to deal with values that do not have a natural representation in JSON, including fill values for arrays with a structured dtype. If possible, I would suggest using this feature of standard array metadata rather than adding a separate _FillValue attribute. If not, please do let us know what's missing; that would be an important piece of information to carry forward when considering spec changes.

Regarding attributes in general, we haven't tried to standardise any method to encode values that do not have a natural JSON representation. Currently it is left to the application developer to decide their own method for encoding and decoding values as JSON, e.g., I believe xarray has some logic for encoding values in zarr attributes. There has also been some discussion of this at #354 and #156.

Ultimately it would be good to standardise some conventions (or at least define some best practices) for representing various common value types in JSON, such as typed arrays. I'm more than happy for the community to lead on that.

@DennisHeimbigner
Author

This reference -- https://zarr.readthedocs.io/en/stable/spec/v2.html#fill-value-encoding --
does not appear to address fill values for structured types. Did you get the reference wrong?

@alimanfoo
Member

If an array has a fixed length byte string data type (e.g., "|S12"), or a structured data type, and if the fill value is not null, then the fill value MUST be encoded as an ASCII string using the standard Base64 alphabet.

I.e., use base 64 encoding.
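As a concrete example (array name and values hypothetical), a .zarray for a structured dtype with one int32 and one float64 field has a 12-byte item size, so an all-zeros fill value is the Base64 encoding of 12 zero bytes:

  {
    "zarr_format": 2,
    "shape": [100],
    "chunks": [10],
    "dtype": [["a", "<i4"], ["b", "<f8"]],
    "fill_value": "AAAAAAAAAAAAAAAA",
    "compressor": null,
    "filters": null,
    "order": "C"
  }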

@DennisHeimbigner
Author

So it would be nice if we had a defined language-independent algorithm
that defines how to construct the fill value for all possible struct types
(including recursion for nested structs). This should be pretty straightforward.
Also, why force a string (base64) encoding? Why not make the fill value
be just another JSON structure?
It worries me how Python-specific much of the spec around types is.

@alimanfoo
Member

So it would be nice if we had a defined language-independent algorithm
that defines how to construct the fill value for all possible struct types
(including recursion for nested structs). This should be pretty straightforward.

That would be good. I believe numpy mimics C structs, further info here.

Looking again at the numpy docs, there is support for an align keyword when constructing a structured dtype, which changes the itemsize and memory layout. This hasn't been accounted for in the zarr spec; I suspect that things are currently broken if someone specifies align=True (the default is False).

Also, why force a string (base64) encoding? Why not make the fill value
be just another JSON structure?

That's a nice idea, would fit with the design principle that metadata are human-readable/editable.

It worries me how Python-specific much of the spec around types is.

The zarr spec does currently defer to numpy as much as possible, assuming that much of the hard thinking around things like types has been done there already.

If there are clarifications that we could make to the v2 spec that would help people develop compatible implementations in other languages then I'd welcome suggestions.

Thinking further ahead to the next iteration on the spec, it would obviously be good to be as platform-agnostic as possible, however it would also be good to build on existing work rather than do any reinvention. The work on ndtypes may be relevant/helpful there.

@alimanfoo alimanfoo transferred this issue from zarr-developers/zarr-python Jul 3, 2019
@alimanfoo
Member

Surfacing here notes on the NetCDF NCZarr implementation, thanks @DennisHeimbigner for sharing.

@alimanfoo
Member

Also relevant here, documentation of xarray zarr encoding conventions, thanks @rabernat.

@rsignell-usgs

rsignell-usgs commented May 5, 2021

@DennisHeimbigner: It looks like Unidata's Netcdf C library can now read data with the xarray zarr encoding conventions, right?

@rabernat, should I raise an issue for xarray to also support the Unidata NcZarr conventions?

@WardF

WardF commented May 5, 2021

The ability to read xarray is in the main branch, and will be in the upcoming 4.8.1 release. I am shaving the yak to get our automated regression and integration test infrastructure back up and running but we hope to have 4.8.1 out shortly.

@rabernat
Contributor

rabernat commented May 5, 2021

@rabernat, should I raise an issue for xarray to also support the Unidata NcZarr conventions?

I see this as very difficult. The reason is that the ncZarr conventions use files outside of the zarr hierarchy. We would probably need to implement a compatibility layer as a third-party package, similar to h5netcdf.

p.s. but yes, please open an xarray issue to keep track of it.

@shoyer

shoyer commented May 5, 2021

One thing I'll note on Xarray's convention for Zarr is that we will likely change things in the near future to always write and expect "consolidated metadata" (see pydata/xarray#5251). This is almost completely backwards compatible, but if NcZarr writes these consolidated metadata fields in Xarray compat mode, we could load these Zarr stores much more quickly in Xarray.

Consolidated metadata would probably be a nice feature for NcZarr, too, because it reduces the number of files that need to be queried for metadata down to only one. I think there was a similar intent behind the .nczgroup JSON field. Consolidated metadata is sort of a super-charged version of that.

@DennisHeimbigner
Author

NCZarr gets a similar improvement by doing lazy reads of metadata objects. That is one problem with _ARRAY_DIMENSIONS -- it requires us to read all attributes even if they are otherwise unneeded. NCZarr avoids this by keeping the dimension names separate.
As for consolidated metadata, I assume you are NOT saying that any pure zarr container that does not contain the consolidated metadata will be unreadable by Xarray.

@shoyer

shoyer commented May 6, 2021

NCZarr gets a similar improvement by doing lazy reads of metadata objects. That is one problem with _ARRAY_DIMENSIONS -- it requires us to read all attributes even if they are otherwise unneeded. NCZarr avoids this by keeping the dimension names separate.

In Xarray, we have to read nearly all the metadata eagerly to instantiate xarray.Dataset objects.

As for consolidated metadata, I assume you are NOT saying that any pure zarr container that does not contain the consolidated metadata will be unreadable by Xarray.

This is correct, you don't need to write consolidated metadata. But if you do, Xarray will be able to read the data much faster.

As for whether netCDF users would notice a difference with consolidated metadata, I guess it would depend on their use-cases. Lazy metadata reads are great, but for sure it is faster to download a single small file than to download multiple files in a way that cannot be fully parallelized, even if they add up to the same total size.

@DennisHeimbigner
Author

faster to download a single small file than to download multiple files

true, but we have use cases where the client code is walking a large set of netcdf files, reading a few pieces of information out of each of them, and where the total metadata is large (14 megabytes). This can occur when one has a large collection of netcdf files covering some time period and each netcdf file is a time slice (or slices).
Perhaps Rich Signell would like to comment with his experience.

@joshmoore
Member

joshmoore commented May 6, 2021

#41 (comment) I see this as very difficult. The reason is that the ncZarr conventions use files outside of the zarr hierarchy. We would probably need to implement a compatibility layer as a third-party package, similar to h5netcdf.

For what it's worth, I could see making some movement (June-ish?) on #112 (comment) to permit the additional files. But either way, certainly ome/ngff#46 (review) (related issue) would suggest hammering out a plan for this difference before another package introduces a convention.

#41 (comment) One thing I'll note on Xarray's convention for the Zarr is that we will likely change things in the near future to always write and expect "consolidated metadata" (see pydata/xarray#5251). This is almost completely backwards compatible, but if NcZarr writes these consolidated metadata fields in Xarray compat mode we could load these Zarr stores much quicker in Xarray.

Having gone through pydata/xarray#5251 I'm slightly less worried about this than when I first read it (I had assumed it meant that only consolidated metadata would be supported), but having just spent close to 2 months trying to get dimension_separator "standardized", I'd like to raise a flag that consolidated metadata is a similar gray area. It'd be nice to get it nailed down.

@rsignell-usgs

rsignell-usgs commented May 6, 2021

@DennisHeimbigner, just a quick comment that I too always use consolidated metadata when writing Zarr. A recent example with coastal ocean model output we are publishing showed consolidated metadata to be an order of magnitude faster to open.

@DennisHeimbigner
Author

Note that the issue for me is: for what use-cases is lazy metadata download better than consolidated metadata? The latter is better in cases where you know that you need to access almost all of the metadata, or where the total size of the metadata is below some (currently unknown) size threshold. My speculation is that the access patterns vary all over the place and are highly domain-dependent. I infer that Rich's use case is one where all the metadata is going to be accessed.

In any case, once .zmetadata is well-defined (see Josh's previous comment) I will be adding it to nczarr. However, we will probably give the user the choice to use it or not if lazy download makes more sense for their use-case.

On the other side, it seems to me that zarr-python might profitably explore lazy download of the metadata.
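For reference, the consolidated object Zarr Python writes under the .zmetadata key simply inlines every metadata document in the hierarchy, roughly as follows (group and array names hypothetical):

  {
    "zarr_consolidated_format": 1,
    "metadata": {
      ".zgroup": {"zarr_format": 2},
      ".zattrs": {},
      "t/.zarray": { ... },
      "t/.zattrs": { ... }
    }
  }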

@shoyer

shoyer commented Aug 1, 2021 via email

@rouault
Contributor

rouault commented Aug 1, 2021

The consensus seemed to be that extra keys would be allowed, but must be ignored if they are not recognized by the implementation.

ok, thanks for the clarification

@joshmoore
Member

For reference: zarr-developers/zarr-python#715 (comment)

That didn't make it into a zarr-specs issue (neither v2 nor v3) as far as I can tell. Anyone up for shepherding that?

@joshmoore
Member

joshmoore commented Mar 24, 2022

See the related conversation in pydata/xarray#6374 ("Should the [xarray-]zarr backend support NCZarr conventions?")

@halehawk

halehawk commented Apr 8, 2022

@DennisHeimbigner does NCZarr support any filter now?

@DennisHeimbigner
Author

Yes, although there are some complications: the code uses HDF5 filters to perform
the actual filtering, but it needs extra code to convert the Zarr codec JSON format
to the HDF5 unsigned integer parameters.
What specific filter(s) do you need?

@halehawk

halehawk commented Apr 8, 2022

@DennisHeimbigner Do you have documentation about how to enable and use filters through NCZarr? Also, we have a new codec which does not have any binding yet. Do you have a suggestion on how to enable it in your NCZarr?

@shaomeng

shaomeng commented Apr 8, 2022

@DennisHeimbigner @halehawk Maybe I should jump in now ;)

I have a lossy compressor product (SPERR: https://github.com/shaomeng/SPERR) that I'm looking at paths to integrate into the Zarr format. I haven't spent too much time on it, but my understanding is that I need to make it a Zarr filter. Our immediate application of it, an ASD run of MURam, has decided to use NCZarr to output Zarr files, so the question arose whether Zarr filters are supported by NCZarr.

I guess the most direct question to @DennisHeimbigner as the NCZarr developer is, what approach do you recommend to integrate a lossy compressor to an NCZarr output?

@DennisHeimbigner
Author

If the compressor is (or easily could be) written in Python, then see the NumCodecs web page.
If the compressor is in C or C++, and you decide to use netcdf-c NCZarr, then you need
to build an HDF5 compressor wrapper plus the corresponding codecs API.
I have attached the relevant documentation. If this compressor is similar to some existing
compressor such as bzip2 or zstandard, then you can copy and modify the corresponding
wrapper in the netcdf-c/plugins directory -- H5Zzstd.c, for example.
filters.md
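For orientation, a minimal sketch of the HDF5 side of such a wrapper, assuming a placeholder filter name and an id from HDF5's unregistered range (the codec-JSON conversion entry points are described in the attached filters.md):

  #include <stdlib.h>
  #include "H5PLextern.h"

  #define H5Z_FILTER_SPERR 32768  /* placeholder id in the unregistered range */

  /* Filter callback: HDF5 sets H5Z_FLAG_REVERSE when reading (decompress). */
  static size_t
  H5Z_filter_sperr(unsigned flags, size_t cd_nelmts, const unsigned cd_values[],
                   size_t nbytes, size_t *buf_size, void **buf)
  {
      if (flags & H5Z_FLAG_REVERSE) {
          /* decompress the nbytes in *buf, replace *buf, return the new size */
      } else {
          /* compress the nbytes in *buf, replace *buf, return the new size */
      }
      return 0; /* 0 signals failure to HDF5; a real filter returns the output size */
  }

  const H5Z_class2_t H5Z_SPERR[1] = {{
      H5Z_CLASS_T_VERS,               /* H5Z_class_t version */
      (H5Z_filter_t)H5Z_FILTER_SPERR, /* filter id */
      1, 1,                           /* encoder and decoder present */
      "sperr",                        /* filter name */
      NULL, NULL,                     /* optional can_apply / set_local callbacks */
      (H5Z_func_t)H5Z_filter_sperr,   /* the filter function above */
  }};

  /* Entry points HDF5 uses to discover a dynamically loaded filter plugin. */
  H5PL_type_t H5PLget_plugin_type(void) { return H5PL_TYPE_FILTER; }
  const void *H5PLget_plugin_info(void) { return H5Z_SPERR; }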

@shaomeng

shaomeng commented Apr 8, 2022 via email

@DennisHeimbigner
Author

Yes, IF the filters are available in NumCodecs.

@halehawk

halehawk commented Apr 9, 2022 via email

@DennisHeimbigner
Author

Sorry, I wasn't clear.
Suppose you use nczarr to write a Zarr file where some of its arrays apply a filter.
Then you can obviously read that file with nczarr.
However, suppose you write the array with nczarr and then want
others to read it using python-zarr. In that case, you will need to create
a NumCodecs-compatible version of your filter written in Python so that
python-zarr users can read the data written by nczarr.

@shaomeng

Hi @DennisHeimbigner, there is some confusion on our team that we would love your comment on.

The confusion is: do we even need to make an HDF5 filter for the SPERR compressor? Because NCZarr supports NumCodecs filters, isn't it the case that once we make a NumCodecs filter for SPERR, both NCZarr and Python-Zarr can read and write SPERR-compressed Zarr files? More generally, are there any advantages/disadvantages to producing an HDF5 filter for SPERR, if all we want is SPERR-compressed Zarr files?

@DennisHeimbigner
Author

There are two pieces here, and I am sorry I was unclear.
The first piece is the declaration of the compressor for a variable
in the Zarr metadata. This is specified in the "compressor" key of
the .zarray metadata object for the variable. The format for this
is defined by NumCodecs and generally has the form

{"id": "<compressor name>", "parameter1": <value>, ... "parametern": <value>}

So for zstd, we might have this: {"id": "zstd", "level": 5}
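In context, that declaration sits alongside the other required keys of the .zarray object, e.g. (shape, chunking, and codec parameters hypothetical):

  {
    "zarr_format": 2,
    "shape": [1000, 1000],
    "chunks": [100, 100],
    "dtype": "<f4",
    "compressor": {"id": "zstd", "level": 5},
    "fill_value": 0.0,
    "filters": null,
    "order": "C"
  }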

The second part is the actual code that implements the compressor.

NCZarr supports the first part so that it can read/write legal Zarr metadata.
BUT, NCZarr requires its filter code to be written in C (or C++).
More specifically, it does not support the Python compressor code implementations.
Sorry for the confusion.

@halehawk

halehawk commented Apr 12, 2022 via email

@shaomeng

NCZarr requires its filter code to be written in C (or C++).

Just to clarify, did you mean that NCZarr requires its filter code to be in C AND also exposed to NCZarr as HDF5 filters? E.g., NumCodecs filters won't work.

Sorry for the back and forth in this github thread. I think this is my last try and if there's still confusion, I'll try to set up a meeting and resolve it more directly :)

@jbms
Contributor

jbms commented Apr 12, 2022

I don't know the details of how codecs are defined for NCZarr, but in general you will need to provide a separate implementation of a codec for each zarr implementation in which you want it supported.

Zarr-python provides a mechanism by which codecs can be registered --- numcodecs defines many codecs, and zarr-python pulls in numcodecs as a dependency, but it is actually possible to define a codec for zarr-python outside of the numcodecs package --- see for example the imagecodecs Python package.

@DennisHeimbigner
Author

I still need you to clarify something here. So I looked at your H5Zzstd.c, which is an HDF5 plugin for zstd and supports numcodec zstd read/write. Then I got this idea: Samuel's new compressor need not get a formal HDF5 filter ID, but should add a similar H5Zsperr.c?

That is correct. The HDF Group reserves ids 32768 to 65535 for unregistered use.
So Samuel can pick a number in that range for his filter; later, if desired, a formal
HDF5 filter id can be assigned.

@joshmoore
Member

First a big 💯 for the discussion, since this is exactly what we want to see happening for cross-implementation support of codecs. @shaomeng & @halehawk, don't hesitate to keep asking.

I do wonder, @DennisHeimbigner, if we don't want to establish the channel you'd like for more nczarr questions. If so, I'd say we update the start and end of this thread with that URL and close this issue.

Others may want to express an opinion, but if it's useful, we can have a no-code location like github.com/zarr-developers/nczarr for people to find a README pointing to the netcdf-c implementation's resources.

cc: @WardF

@WardF

WardF commented Jun 1, 2022

Sorry for the late comment on this; I would agree that maybe a 'Github Discussions' post would be a better place for this, instead of the issue we are working within. We can create that over at the netcdf-c repository, or we could create one here in the appropriate zarr-* repositories. There are arguments to be made for either, so I am happy to go with what makes the most sense for the broader group :).

@briannapagan

21-050r1_Zarr_Community_Standard.pdf
Adding this here for reference in the convo.

@dblodgett-usgs

Pertinent text from @briannapagan's link above...

Beginning with NetCDF-C version 4.8.0, Unidata introduced experimental Zarr support
into the NetCDF-C library. This was accomplished via creating a new specification -
NCZarr - which is “similar to, but not identical with the Zarr Version 2 Specification.”
Specifically, NCZarr adds two additional metadata files (“.nczarray" and ".nczattr”),
which are not part of the Zarr V2 Spec. Since NCZarr stores are not fully compatible and
interoperable with Zarr V2, this community standard excludes NCZarr. Work is ongoing
to reconcile NCZarr and the architectural reasons that motivated its development with the
forthcoming Zarr V3 Specification.
Fortunately, the NetCDF-C library also supports reading / writing of data using the
simpler Named Dimension convention described in 4.1.

@DennisHeimbigner
Author

That information is out-of-date in a couple of ways.

  1. the metadata files (“.nczarray" and ".nczattr”) are no longer used; they were replaced with special dictionary entries.
  2. I believe the spec was changed to specify that unrecognized elements (objects and dictionary entries) should be ignored by any implementation that does not recognize them.
  3. With point 2 in effect, nczarr-created files can be read by pure zarr implementations and nczarr can read pure zarr files.

@dblodgett-usgs

Thanks for calling that out, @DennisHeimbigner. This came out of a conversation over at zarr-developers/geozarr-spec#22.

There are very few people who have a deep enough understanding of the moving parts here to answer all the questions. It's good to hear that we basically have interoperability.

Two questions:

  1. Do you feel like we even need to worry about the distinction right now?
  2. Is there a current document we should be using to learn about the nuances between "pure zarr" and "nczarr"?
