Document special attributes and mappings during translation #13
In an effort to promote interoperability, would it be possible to use Kerchunk's method for indicating scalar datasets, which it apparently inherited from netCDF4?
This is something we'll need to consider. I'm hesitant to have _ARRAY_DIMENSIONS with phony_dim_x on every dataset... the _SCALAR=True method seems more straightforward. But I do understand that there are benefits to interoperability. It would be helpful to think of a scenario where we'd need that in order to use some tool. I'm hesitant to go down the path where we end up with many attributes supporting all the various projects (kerchunk, lindi, hdmf-zarr)... instead of having logic in the various tools to be able to handle the various cases.
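For concreteness, here is how the two conventions mentioned in this thread would look as stored array attributes. This is only a sketch based on the attribute names discussed above; neither is endorsed here:

```python
# Kerchunk / netCDF4 style: a scalar (zero-dimensional) array is indicated
# by the absence of dimension names, i.e., an empty _ARRAY_DIMENSIONS list.
kerchunk_style_attrs = {"_ARRAY_DIMENSIONS": []}

# The alternative discussed above: an explicit scalar flag.
scalar_flag_attrs = {"_SCALAR": True}
```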
Because information about this is scattered throughout issues and docs, I wanted to summarize:

- Xarray is a popular Python library for working with labelled multi-dimensional arrays. It reads and writes netCDF4 files by default (these are specially organized HDF5 files).
- Xarray requires dimension names to work, and it can read/write them from/to netCDF4 and HDF5 files (it uses HDF5 dimension scales).
- Scalar datasets in Xarray and netCDF4 are indicated by the lack of dimension names. All netCDF4 datasets have dimension names for non-scalar data and lack dimension names for scalar data, so Xarray and netCDF4 are compatible.
- Not all HDF5 datasets have dimension names, however. When Xarray loads an HDF5 dataset without dimension names, it generates phony dimension names for them in memory and on write.
- Xarray can also read and write Zarr files, but Zarr does not support storing dimension names, so to write Xarray-compatible Zarr files, Xarray defined a special Zarr array attribute: _ARRAY_DIMENSIONS.
- Kerchunk, in order to generate Xarray-compatible Zarr files, uses the same convention - it creates the _ARRAY_DIMENSIONS attribute.

So adding the _ARRAY_DIMENSIONS attribute here would make the resulting Zarr files readable by Xarray.
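For illustration, a minimal sketch of the convention in action, assuming the zarr-python v2 API; the store path, array name, and dimension labels are placeholders:

```python
import zarr
import xarray as xr

# Write a Zarr array and attach the special attribute Xarray looks for:
# one dimension name per axis of the array.
root = zarr.open_group("example.zarr", mode="w")
arr = root.create_dataset("temperature", shape=(10, 20), dtype="f4")
arr.attrs["_ARRAY_DIMENSIONS"] = ["time", "space"]

# Xarray picks the names up when reading the store.
ds = xr.open_zarr("example.zarr", consolidated=False)
print(ds["temperature"].dims)  # -> ('time', 'space')
```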
Thanks for the summary, @rly! I was not aware of a lot of that. I do think that Xarray support would be quite valuable, but this may not be the best way to do it. Many of these dataset dimensions really should have names as indicated by the NWB schema.
Just to add: The NetCDF group has their own Zarr implementation called NCZarr. It has its own conventions, and I think Xarray supports reading both their convention and its own _ARRAY_DIMENSIONS convention. IMO, this demonstrates the complexity of having too many different conventions and the danger of adding another. https://xkcd.com/927/ For simplicity, I'm still inclined to follow neither convention until Xarray (or netCDF) is within scope, but perhaps that is naive.
Just to add: Technically, Xarray itself doesn't do this, but both the default I/O engine, netcdf4, and the alternate I/O engine for HDF5 files, h5netcdf, do. I don't know why Xarray doesn't generate phony dimension names when reading Zarr arrays without dimension names. That would make things easier...
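For reference, the phony-name behavior can be requested explicitly through the h5netcdf engine's phony_dims option; a sketch, with a placeholder file name:

```python
import xarray as xr

# h5netcdf generates phony dimension names for HDF5 datasets that lack
# dimension scales; the phony_dims option ("sort" or "access") controls
# how the names are assigned.
ds = xr.open_dataset("data.h5", engine="h5netcdf", phony_dims="sort")
```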
Just to add onto this... custom Zarr stores are easy to make... and so one can create adapters that attach the various needed attributes for different contexts. For example, you could have a simple adapter that adds the _ARRAY_DIMENSIONS attribute everywhere it is needed. So you'd have something like the sketch below.
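The snippet that originally followed this comment is not preserved above; below is a minimal sketch of such an adapter, assuming the zarr-python v2 store API (a MutableMapping with keys like `foo/.zarray` and `foo/.zattrs`). The class name, store type, and file name are placeholders:

```python
import json
from collections.abc import MutableMapping

import xarray as xr
import zarr


class ArrayDimensionsAdapter(MutableMapping):
    """Read-through wrapper over a Zarr v2 store that injects an
    _ARRAY_DIMENSIONS attribute with phony dimension names into any
    array that lacks one. The underlying store is never modified."""

    def __init__(self, store):
        self._store = store

    def __getitem__(self, key):
        if key.endswith(".zattrs"):
            zarray_key = key[: -len(".zattrs")] + ".zarray"
            if zarray_key in self._store:  # these attrs belong to an array
                try:
                    attrs = json.loads(self._store[key])
                except KeyError:
                    attrs = {}  # the array has no stored attributes at all
                if "_ARRAY_DIMENSIONS" not in attrs:
                    shape = json.loads(self._store[zarray_key])["shape"]
                    attrs["_ARRAY_DIMENSIONS"] = [
                        f"phony_dim_{i}" for i in range(len(shape))
                    ]
                return json.dumps(attrs).encode()
        return self._store[key]

    # Writes, deletes, and listing pass straight through to the wrapped store.
    def __setitem__(self, key, value):
        self._store[key] = value

    def __delitem__(self, key):
        del self._store[key]

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)


# Usage: wrap an existing store and hand it to xarray.
store = ArrayDimensionsAdapter(zarr.DirectoryStore("data.zarr"))
ds = xr.open_zarr(store, consolidated=False)
```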
All with no loss of efficiency.
This has been documented to an extent here.
It looks like the Allen Institute for Neural Dynamics would like to use xarray with NWB Zarr files: hdmf-dev/hdmf-zarr#176
Good to know. As I suggested above, I would propose an adapter that adds the phony_dim _ARRAY_DIMENSIONS attributes, rather than having them in the .zarr.json.
For the case where array dimensions are unknown, I agree that having a way to emulate them rather than storing invalid information is probably preferable. However, in the case of NWB, we can often know the dimensions from the schema, so it would be nice to have those reflected.