How to handle encoding #68
See also #73 (comment), which implies that kerchunk / fsspec doesn't actually do any decoding of times at all.
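A minimal sketch of the distinction being made here, assuming xarray is installed (the variable and attribute values are invented for illustration): a CF time variable on disk is just integers plus `units`/`calendar` attributes, and a reader that skips CF decoding hands those integers back unchanged, whereas xarray's `decode_cf` turns them into datetimes.

```python
import numpy as np
import xarray as xr

# Raw on-disk representation of a CF time variable: plain integers plus
# "units"/"calendar" attrs. A reader that performs no decoding (as the
# comment above suggests kerchunk/fsspec does) returns these unchanged.
raw = xr.Dataset(
    {"time": ("time", np.array([0, 1, 2], dtype="int64"),
              {"units": "days since 2000-01-01", "calendar": "standard"})}
)

# xarray's CF decoding step converts the integers into datetime64 values.
decoded = xr.decode_cf(raw)
print(raw["time"].values)      # still the raw integers [0 1 2]
print(decoded["time"].values)  # datetime64 values starting at 2000-01-01
```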
I think a lot of encoding challenges can be solved by pushing encoding down the stack, into Zarr and out of Xarray. There are two obvious advantages to doing this at the Zarr level:
Concatenating two zarr arrays with different codecs is not yet supported - it will currently raise an error, see #5 (comment). It could be supported in future, but would require the virtual concatenation ZEP in zarr in order to serialize the resultant references (see #5). It would be nice to factor this encoding issue out so that it is handled entirely at the level of virtual (/ zarr) array concatenation.
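A conceptual sketch of why that concatenation has to raise, under the stated assumption that each zarr array's metadata records exactly one codec pipeline. The dicts and the function here are invented for illustration and are not the zarr or VirtualiZarr API:

```python
# Conceptual sketch only -- these dicts and this function are NOT the zarr
# or VirtualiZarr API. A zarr array's metadata records a single codec
# pipeline, so chunk references from two arrays can only be merged into one
# array when their codecs (and dtype, etc.) agree.
meta_a = {"dtype": "int16", "codecs": ["zlib"]}
meta_b = {"dtype": "int16", "codecs": ["blosc"]}

def concat_array_metadata(a: dict, b: dict) -> dict:
    """Merge metadata for two concatenated arrays, or fail on mismatch."""
    if a["codecs"] != b["codecs"]:
        raise ValueError(
            f"cannot concatenate arrays with different codecs: "
            f"{a['codecs']} vs {b['codecs']}"
        )
    return dict(a)

try:
    concat_array_metadata(meta_a, meta_b)
except ValueError as err:
    print(err)  # mixed codecs: no single valid metadata document exists
```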
@TomNicholas Thanks for these links, this really helps with understanding the encoding handling and propagation landscape in xarray. I like @rabernat's rationale for pushing encoding considerations further down the stack into Zarr. I'm very new to understanding what the path forward for xarray will be, but given the discussions around reducing the surface area of encoding, it seems like we might want to focus on pushing encoding down for maximum flexibility in the longer term. Though most of this is totally reliant on upstream Zarr work, I do think there are some practical decisions we can make now that might simplify things in the future. Thinking forward to #5, should we consider including encoding information at the […] level? This might be premature, but it was something I was considering as we think about how to handle encoding in the prototype HDF reader.
If we put encoding in the […]. The suggestions above to move encoding into the Zarr layer would have xarray encoding => zarr codecs, and therefore one set of encoding per zarr array.
The proposal in zarr-developers/zarr-specs#288 is to handle mixed codecs by still having one codec per zarr array, but introducing a higher-level array type ([…]).
So in this roadmap the future-proof thing to do is to avoid hacking mixed encodings into one […]. Apologies if I'm explaining what you already know here @sharkinsspatial.
Got it. I was totally misunderstanding the level of the […].
To go back to the original issue, let me summarize where we are at with encoding.

Opening encoded variables as […] does not decode them. The solution to this is meant to be to load these encoded time variables and serialize them as bytes (see #42 (comment)). Whilst #69 implemented the ability to load selected variables as numpy arrays, and #73 implemented the ability to serialize them as bytes into the kerchunk references on-disk, this inlining does not yet work on variables that have any encoding (in the sense that it doesn't reproduce the same inlined bytes that calling kerchunk's […] would).

That means we currently: […]

cc @dcherian
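The roundtrip property that inlining needs to satisfy can be sketched as follows, assuming xarray is installed (`encode_cf_variable` lives in `xarray.conventions`; the units string is invented): decoding stashes the original units in `.encoding`, and re-encoding with that stashed information should reproduce the original raw values exactly.

```python
import numpy as np
import xarray as xr

# Sketch of the roundtrip inlining must get right: decode a CF time
# variable (xarray stashes units/calendar/dtype in .encoding on decode),
# then re-encode it and check the raw integers come back unchanged.
raw = np.array([0, 30, 60], dtype="int64")
ds = xr.decode_cf(
    xr.Dataset({"time": ("time", raw, {"units": "minutes since 2020-01-01"})})
)
print(ds["time"].encoding["units"])  # the original units, preserved on decode

# Re-encode using the stashed encoding; the result should match `raw`.
reencoded = xr.conventions.encode_cf_variable(ds["time"].variable)
print(reencoded.values)
```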
@TomNicholas I'm at a bit of a decision point on decoding via […]. Do we have an idea which direction we'd like to go here? I also noticed @dcherian's timely comment #122 (comment) around treating […]. I'll defer to you all here given that I'm really new to the wonderfully painful world of encoding :], but I'll probably press on with the pure […].
Thanks for the update @sharkinsspatial! I don't know a huge amount about the best way to do this either.
@rabernat is there actually an xarray issue tracking this suggestion? If there is, I can't find it. And what would this mean in practice? Adding a bunch of codecs to […]?
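For the "what would this mean in practice" question, one hedged sketch: a CF-style scale/offset packing expressed as a standalone codec with the `encode`/`decode` pair that numcodecs-style codecs use. The class name and interface below are illustrative, not an existing xarray or numcodecs API (though numcodecs does ship a real `FixedScaleOffset` codec for this purpose):

```python
import numpy as np

# Illustrative only: a CF scale/offset transform written as a codec-like
# class with an encode/decode pair. Not an existing xarray/numcodecs class.
class ScaleOffsetCodec:
    def __init__(self, scale_factor: float, add_offset: float, dtype: str = "int16"):
        self.scale_factor = scale_factor
        self.add_offset = add_offset
        self.dtype = np.dtype(dtype)

    def encode(self, values: np.ndarray) -> np.ndarray:
        # physical values -> packed integers
        return np.round(
            (values - self.add_offset) / self.scale_factor
        ).astype(self.dtype)

    def decode(self, packed: np.ndarray) -> np.ndarray:
        # packed integers -> physical values
        return packed.astype("float64") * self.scale_factor + self.add_offset

codec = ScaleOffsetCodec(scale_factor=0.5, add_offset=10.0)
packed = codec.encode(np.array([10.0, 10.5, 12.0]))
print(packed)                 # [0 1 4]
print(codec.decode(packed))   # [10.  10.5 12. ]
```

If encodings like this lived in the codec pipeline, the question of "who decodes" would be answered at the zarr level rather than inside xarray's backend machinery.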
I think it depends whether your priority is to just get something working as an alternative to kerchunk's hdf5 reader, or to work towards a longer-term but more separated and modular solution. Any changes to xarray tend to happen more slowly but allow for bigger impact.
Part of this story is to delegate handling of CF conventions to xarray - see #157.
@sharkinsspatial suggested an interesting idea to aim for today: can we open data from a netCDF file using a zarr reader and roundtrip it? This would require the chunk manifest on-disk to point to the netCDF file, but also the necessary codecs to live somewhere that they can be utilized by the zarr reader. Perhaps that implies lifting them out of xarray and into a separate package?
I just raised zarr-developers/zarr-specs#303 to suggest this. |
Encoding is the part of the virtualization problem that I have the least clear idea of how to handle using this library.
The discussion in #42 shows that currently, to roundtrip an xarray dataset to disk as kerchunk and then open it with fsspec requires ignoring the encoding of the time variable. (In general, time variables are going to be the most common offenders for encoding mistakes, as different files in the same dataset can easily end up with different encoding options for times.)
Part of the problem is that encoding is kind of a second-class citizen in xarray's data model. It happens magically behind the scenes in the backend file opener system (see pydata/xarray#8548 for a great overview), and is then attached at the `Variable` level as `var.encoding`, but there aren't general rules for propagating it through operations (see pydata/xarray#1614 and pydata/xarray#6323). `xr.concat` will just keep the encoding of the first object passed, which for us could lead to an incorrect result.

There is also some ambiguity around whether certain types of encoding should be handled by xarray or by zarr. For example, `scale` and `offset` exist in the CF conventions and so are normally handled by xarray, but could also be handled by a Zarr codec. If all encoding were represented at the zarr level then virtual array concatenation (#5) could potentially solve this.

One practical workaround for a lot of cases would be to load the encoded variables into memory (i.e. "inline" them) - see #62. Then xarray has effectively decoded them, you do the concatenation in-memory, and re-encode them when you save back out to the new zarr store. For dimension coordinates you often might want to do this anyway.
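The `xr.concat` behaviour described above can be demonstrated in a few lines, assuming xarray is installed (the units strings are invented): both inputs carry their own time encoding, but per xarray's current behaviour only the first one is kept.

```python
import numpy as np
import xarray as xr

# Demonstration of the concat pitfall: two datasets whose raw time values
# look identical but mean different things because their (stashed) units
# differ. xr.concat keeps the encoding of the first object passed,
# silently discarding the second dataset's units.
a = xr.Dataset({"time": ("time", np.array([0, 1], dtype="int64"))})
a["time"].encoding = {"units": "days since 2000-01-01"}

b = xr.Dataset({"time": ("time", np.array([0, 1], dtype="int64"))})
b["time"].encoding = {"units": "days since 2001-01-01"}

combined = xr.concat([a, b], dim="time")
print(combined["time"].encoding.get("units"))  # b's units are gone
```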