NetCDF files with many empty variables require large `drop_variables` declaration #194
By "non-array" you mean scalar, right? This is definitely an edge case I haven't thought through in great detail. I think the main question is whether one can have valid Zarr arrays with a chunk shape of `()`. If not, then we might want to automatically load these variables instead, i.e. add them to `loadable_variables`.
I think so? I'm not sure how to translate between the NetCDF concept and a Zarr array; I think ultimately Zarr only deals in Arrays or Groups. "Scalar" is a good word for it, though.
You can in Zarr v2. This is allowed:

```python
import zarr

store = zarr.open("example.zarr", mode="w", shape=(), chunks=(), path="a")
store.attrs["important attribute"] = 42
```

Which results in this store:
But this is not allowed in Zarr v3. Maybe it will be fixed? zarr-developers/zarr-python#1977

Setting these as `loadable_variables` works. Should VirtualiZarr automatically load scalar variables? I think it's a nice feature for users, and it would make it unnecessary to add a long list of variables to the `drop_variables` argument.
Oh interesting, thanks for finding that @ghidalgo3.

Are these scalar though? By "scalar" I meant something with a single element, but these have zero elements!

I think ideally we would not automatically choose to do that, and instead 0-size Zarr arrays would be supported upstream, but I can see why it would be nice for this edge case to just automatically work.

This suggestion seems okay though? It's in line with the behaviour of v2, and with what v3 is supposed to do.
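The auto-loading idea discussed above could be sketched roughly like this. This is a minimal illustration, not VirtualiZarr's actual API: the function name `split_loadable` and its signature are hypothetical.

```python
# Hypothetical sketch (not VirtualiZarr's real implementation): before
# building the virtual dataset, reclassify any variable whose shape is ()
# as loadable so its attributes survive, instead of requiring the user to
# list it in drop_variables.
def split_loadable(var_shapes: dict, loadable_variables: list) -> tuple:
    loadable = set(loadable_variables)
    for name, shape in var_shapes.items():
        if shape == ():  # scalar variable: no chunks to reference virtually
            loadable.add(name)
    virtual = sorted(n for n in var_shapes if n not in loadable)
    return virtual, sorted(loadable)
```

Under this sketch, `split_loadable({"Rad": (5424, 5424), "goes_imager_projection": ()}, [])` would classify the projection variable as loadable without the user asking for it.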
Hi @TomNicholas! Just adding onto @ghidalgo3's GOES example, where an empty "scalar" contains critical attributes for projections defined here. The variable of interest is `goes_imager_projection`:

```python
import fsspec
import xarray as xr

fs = fsspec.filesystem("https")
ds = xr.open_dataset(
    fs.open('https://goeseuwest.blob.core.windows.net/noaa-goes16/ABI-L2-MCMIPF/2019/244/12/OR_ABI-L2-MCMIPF-M6_G16_s20192441200194_e20192441209513_c20192441210005.nc')
)
ds['goes_imager_projection']
```

The projection information is stored in the attributes of this dummy variable. However, opening without including this in `loadable_variables` drops it.

As above, setting it as a loadable variable works fine:

```python
import virtualizarr as vz

vds_loadable = vz.open_virtual_dataset(
    'https://goeseuwest.blob.core.windows.net/noaa-goes16/ABI-L2-MCMIPF/2019/244/12/OR_ABI-L2-MCMIPF-M6_G16_s20192441200194_e20192441209513_c20192441210005.nc',
    reader_options={},
    indexes={},
    loadable_variables=['goes_imager_projection'],
)
vds_loadable['goes_imager_projection']
```

If these aren't loaded, then downstream projection operations have to go back to the original NetCDF files to retrieve the projection information that was dropped. This was a befuddling issue I ran into when trying to work with the kerchunk stores written from the virtual dataset, so I wanted to add it as a concrete example where automatically loading such scalars (or at least warning that they're being ignored) might be preferable behavior.
Thanks for your input @t-mcneely! So this is another vote in favour of automatically loading "scalar" variables?
That'd be my personal preference, but failing that, I think printing a warning or similar when variables are automatically dropped would be helpful behavior.
Me too, and I'd be happy to implement this in a PR unless there is a good reason not to load scalar values automatically.
I guess the only reason not to is that once zarr-developers/zarr-python#1977 is fixed (as it should be, else it's a breaking change between v2 and v3), we will no longer need to load scalar values automatically, so we would be adding something to VirtualiZarr that we know we should remove later. I think that's okay though: if you're happy to add a PR @ghidalgo3, we can think about what scenarios should cause a warning (or deprecation warning).
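The warning behaviour discussed above could look roughly like this. The function name, signature, and message are illustrative assumptions, not VirtualiZarr's implementation.

```python
# Hypothetical sketch: drop shape-() variables that weren't explicitly
# requested via loadable_variables, emitting a warning for each so the
# user knows attribute metadata is being discarded.
import warnings

def drop_scalar_vars(var_shapes: dict, loadable: set) -> dict:
    kept = {}
    for name, shape in var_shapes.items():
        if shape == () and name not in loadable:
            warnings.warn(
                f"Variable {name!r} has shape () and was dropped; "
                f"pass it via loadable_variables to keep its attributes."
            )
            continue
        kept[name] = shape
    return kept
```

A deprecation path could later swap `warnings.warn` for `DeprecationWarning` once upstream Zarr v3 supports these arrays natively.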
Also unrelated but FYI: #202
Wish I could have attended, but I had a conflict :(. I'll try to join the next one. I will tackle loading these scalars this week, because I think many of the NetCDF datasets on https://planetarycomputer.microsoft.com/ are going to have variables like this.
I'm not sure if this is a real issue or working as intended, but I wanted to get clarification on it. The GOES dataset's NetCDF files contain many non-array variables, such that the call to `virtualizarr.open_virtual_dataset` needs a large number of `drop_variables` to work. All of the dropped variables have in common a `chunk_shape` of `()`, because they're not arrays.

If `validate_chunk_shape` skipped validating empty dicts, it would be possible to open this dataset without having to pass such a `drop_variables` list, but more importantly we could preserve variable metadata when we write a Zarr store for this file and others. Is there a reason VirtualiZarr shouldn't read empty variables like this?

Repro
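The relaxation proposed in the issue could be sketched as follows. The real `validate_chunk_shape` in VirtualiZarr may have a different signature and stricter checks; this is only an assumed shape of the change, returning early for scalar variables instead of rejecting an empty chunk shape.

```python
# Hypothetical sketch of the proposed change: an empty chunk shape paired
# with an empty array shape describes a scalar variable and is accepted,
# while mismatched ranks or non-positive chunk sizes are still rejected.
def validate_chunk_shape(chunk_shape: tuple, array_shape: tuple) -> None:
    if chunk_shape == () and array_shape == ():
        return  # scalar variable: nothing to validate
    if len(chunk_shape) != len(array_shape):
        raise ValueError("chunk shape and array shape must have the same rank")
    for c in chunk_shape:
        if c <= 0:
            raise ValueError(f"invalid chunk size {c}")
```

With this, opening the GOES files above would no longer require listing every scalar variable in `drop_variables`.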