Propose resolution _groups_ for xarray support #114
In discussing with the xarray community, the one change to the NGFF specification that needs to occur to prevent errors being raised when opening a multiscale is for each resolution _array_ to live in a separate _group_. This has already been tested by thewtex in https://github.com/spatial-image/spatial-image-multiscale and the current spec is permissive enough to allow it. The proposal here would enforce the subdirectories moving forward.

The conflict in xarray stems from the fact that each of our subresolutions has the same dimension names ("x", "y", etc.) but different sizes. This is not allowed in the xarray (nor NetCDF) model. An added benefit of this change is that other arrays with the same resolution levels and the same dimensions (e.g. labels!) could be stored together:

```
├── resolution-N/.zgroup
│   ├── image/.zarray
│   └── label/.zarray
```
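The dimension-name conflict can be illustrated without xarray itself. What follows is a minimal sketch, assuming a hypothetical helper `check_group` that only mimics the rule that each dimension name maps to exactly one size within a group; it is not xarray or netCDF code, and the array names and shapes are made up for illustration.

```python
def check_group(arrays):
    """Return the dimension sizes shared by all arrays in one group.

    `arrays` maps an array name to a (dims, shape) pair. Raises ValueError
    if one dimension name appears with two different sizes, which is the
    situation the netCDF/xarray data model rejects.
    """
    sizes = {}
    for name, (dims, shape) in arrays.items():
        for dim, size in zip(dims, shape):
            if sizes.setdefault(dim, size) != size:
                raise ValueError(
                    f"dimension {dim!r} has sizes {sizes[dim]} and {size} "
                    f"(array {name!r})"
                )
    return sizes


# Flat layout: every resolution level is an array in the same group.
flat = {
    "0": (("y", "x"), (512, 512)),
    "1": (("y", "x"), (256, 256)),
}

# Proposed layout: one group per resolution level (hypothetical names).
nested = {
    "0": {"image": (("y", "x"), (512, 512)), "label": (("y", "x"), (512, 512))},
    "1": {"image": (("y", "x"), (256, 256)), "label": (("y", "x"), (256, 256))},
}

try:
    check_group(flat)
    flat_ok = True
except ValueError:
    flat_ok = False

nested_sizes = {level: check_group(group) for level, group in nested.items()}
```

In this sketch the flat layout is rejected, while each per-resolution group is internally consistent and can also hold a label array of the same shape.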
What is the advantage of this? A downside is that it couples the downsampling process for raw data to the downsampling process for labels (or any other image in the collection). Imagine if I want raw data downsampled by 2x2x2, but labels downsampled by 4x4x4, then the proposed layout becomes tricky to parse. I think it's conceptually cleaner to group by data type (raw, labels, etc) than grouping by resolution. |
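To make the coupling concern concrete, here is a small sketch that assumes a hypothetical 64x64x64 base volume and plain integer downsampling; the helper name `level_shape` is illustrative and not part of any spec.

```python
def level_shape(base_shape, factor, level):
    """Shape of a pyramid level when each axis shrinks by `factor` per level."""
    return tuple(max(1, size // factor**level) for size in base_shape)


base = (64, 64, 64)
raw_levels = [level_shape(base, 2, k) for k in range(3)]    # 2x2x2 per level
label_levels = [level_shape(base, 4, k) for k in range(3)]  # 4x4x4 per level

# Level indices at which raw and labels share a sampling grid.
shared = [k for k in range(3) if raw_levels[k] == label_levels[k]]
```

Under these assumptions only level 0 lines up by index; label level 1 matches raw level 2, so a per-resolution group could not simply be keyed by level number.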
Intensity images, labels, and masks are often sampled on the same voxel grid. It is common to use them together, and it is helpful to identify and use this association. This pattern led to the development of the xarray Dataset, which this layout enables. There is no constraint that every intensity image must have a label image, or that every label image must have an intensity image, at the downsampled resolutions.
├── 0 # Each multiscale level is stored as a separate Zarr array,
│ ... # which is a folder containing chunk files which compose the array.
├── n # The name of the array is arbitrary with the ordering defined by
├── 0 # Each multiscale level is stored as a separate Zarr group
-> separate Zarr group, possibly nested ?
Happy to add.
I'm all for using |
If you prefer that each label image is self-contained, then combine them into xarray.Dataset instances based on the needs of a specific application, then you can do that. I hear you. But, that should not block people who want to store a label image together with its intensity image. |
Ah, apologies. Late to the conversation. Thanks, both. I'm interpreting @d-v-b's last 👍 to mean that he is not proposing a |
│ └─ y # provide the "chunk coordinate" (t, c, z, y, x), where the maximum coordinate
│ └─ x # will be `dimension_size / chunk_size`.
│ └── image # Within the group, there will typically be a single array named "image".
│ │ # Other arrays may be added in future versions.
So, Dataset path would be `0/image`, `1/image`, etc.?
Exactly.
The sentence `Each dictionary in "datasets" MUST contain the field "path", whose value contains the path to the array for this resolution relative to the current zarr group.` should mention that the path MUST be to an array within a group, typically named "image". And the sample with `"path": "0",` etc. should be updated to `"path": "0/image"` etc.
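As a rough illustration of the suggested metadata change, the sketch below builds only the "datasets" entries with nested paths. The helper `multiscale_datasets` is hypothetical, the array name "image" follows the discussion above, and other fields a real "multiscales" entry needs (axes, version, coordinate transformations) are deliberately left out.

```python
import json


def multiscale_datasets(n_levels, array_name="image"):
    """Build "datasets" entries whose paths point at an array inside each level group."""
    return [{"path": f"{level}/{array_name}"} for level in range(n_levels)]


# Only the "datasets" field is shown; this is not a complete multiscales entry.
metadata = {"multiscales": [{"datasets": multiscale_datasets(3)}]}
print(json.dumps(metadata, indent=2))
```

A reader that resolves `path` relative to the multiscale group would then open `0/image`, `1/image`, and `2/image` instead of `0`, `1`, and `2`.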
I'm still not sure what to think here... the 👍 was to signify that in some situations someone might want to pack scale levels together. But for a specific format, like OME-NGFF, I don't see the appeal of a) packing scale levels together, and b) supporting multiple ways of representing the same thing. On the contrary, I think the format should specify just one way to organize images, unless there's a really powerful (e.g., "representation X is impossible on storage backend Y") argument for polymorphism here. And if we are specifying just one way to organize images, I would strongly advocate an organization scheme that keeps separate multiscale images in separate folders / prefixes. This facilitates an access pattern where multiscale images are read / written independently, which I think is pretty common. Grouping by scale levels, on the other hand, facilitates an access pattern where the same scale for all images is read / written at once, which I think is pretty uncommon (and, not very scalable).
Guess in my mind this isn't really a new way of storing things, since it's already possible at the moment. The metadata of the
(At the risk of contradicting myself) this I can see even from the metadata level. I would like to make the necessary changes to the spec so that there would only be one image (i.e. multiscale) in a zgroup. I see @d-v-b's dilemma if I try to combine those two thoughts, since the only way would be to reference some common space outside of the current group by a "../" style reference. |
IIRC, the purpose of including the path metadata in |
In practice, storing related images (different features, different modalities, label images, masks) that are sampled on the same sampling grid is extremely common. This is what motivated xarray Datasets, and it does not make sense to bother with Datasets without this organization. The Dataset organization enables simple and direct identification of volumes where pixels correspond. It has proven to be very scalable in the geospatial community.
Yes, this storage layout is surely convenient where sampling grids match, but I can only see this working if you constrain sampling grids to match for all images. Here's a realistic example of images I work with:
None of them are on the same sampling grid. How would you store them in the scheme you are proposing? |
Volumes that are not on the same sampling grid would not carry the useful indication that they are sampled on a common grid; they would be stored in different groups, just like they are now. There is no additional constraint that prevents them from being stored like they are now.
I think this statement is too strong, @thewtex. I fall very much on @d-v-b's side of things here, I think it absolutely makes sense to group together datasets with different pixel spacings or even grid orientations. Now, whether the engineering constraints on the xarray side are steadfast, or can be remedied upstream, I don't know, but my personal instinct would be to push back on that constraint a bit, rather than harden the spec on the ome-ngff side. The only place where I find this layout compelling is for multiple channels, which is probably where the geosciences applications come from?
We are in agreement! It absolutely does make sense to group together datasets with different spacing or grid orientations. However, that is not a reason to push back on this documentation update in this PR. Datasets with different spacings and grid orientations can and should be able to be grouped together. That can be done independently of this documentation update, and this change does not put a constraint on the creation of this type of dataset association. The proposal actually prevents a potential over-constraint that the pixel arrays live in the same group. This is unlikely to require any changes to existing code because accessing a group or a nested group is done the same way. Currently, the NGFF standard does not explicitly say that the pixel arrays are in the same group or in separate groups. This would explicitly say that they could be in different groups. In practice, this means that NGFF can be compatible with Xarray and NetCDF, and I think we can all agree that it is in the interest of both standards to make them compatible, if possible. The nested group in Xarray/NetCDF is motivated by a reasonable approach to a need that makes sense (store the image pixel coordinates alongside the pixel data). And the use case is similar in geosciences as in bioimaging, medical imaging, microscopy: work with multi-dimensional images as numpy arrays with the same shape: multiple frequencies, multiple sensors, derived feature images, label images. Even if someone does not use this functionality, I do not think we should be unnecessarily over-constraining NGFF in a way that makes it incompatible with Xarray/NetCDF.
Ah, thanks @thewtex, I should have looked at the spec rather than the short summary and the ensuing discussion. As I see it, the essence of this PR is to put different resolution levels in different groups rather than different arrays within one group. (?) If one wants the groups to be singletons, that's entirely fine. (?) Also, groups are hierarchical, (?) meaning it's totally fine to have groups of groups, ie groups of multiscale data. (?) Given all this, I'm ok with this PR. 😅 |
@jni yes, that's it, sorry if the explanation was not clear. There can be additional associations of data through grouping. As we continue to make progress on the spec, we can add associations to meet needs. |
See discussion in Unidata/netcdf-c#2474 which suggests as part of this effort (and ASAP e.g. v0.5 if not retroactively for the previous versions) _ARRAY_DIMENSIONS should be moved into the individual xarray-compatible groups or stripped entirely. |
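For readers unfamiliar with `_ARRAY_DIMENSIONS`, the following standard-library sketch writes only Zarr v2 metadata files for two resolution groups, with the attribute stored in each array's `.zattrs`, as xarray's Zarr v2 encoding does. It is one possible reading of the suggestion, not the layout agreed in the linked discussion; the helper `write_level`, the shapes, and the dtype are assumptions, and no chunk data is written.

```python
import json
import tempfile
from pathlib import Path


def write_level(root, level, shape, dims, array_name="image"):
    """Write metadata for one resolution group holding a single array."""
    group = root / str(level)
    array = group / array_name
    array.mkdir(parents=True)
    (group / ".zgroup").write_text(json.dumps({"zarr_format": 2}))
    (array / ".zarray").write_text(json.dumps({
        "zarr_format": 2,
        "shape": list(shape),
        "chunks": list(shape),   # one chunk per array, for simplicity
        "dtype": "<u2",
        "compressor": None,
        "fill_value": 0,
        "filters": None,
        "order": "C",
    }))
    # Dimension names live next to the array they describe.
    (array / ".zattrs").write_text(json.dumps({"_ARRAY_DIMENSIONS": list(dims)}))
    return array


root = Path(tempfile.mkdtemp())
(root / ".zgroup").write_text(json.dumps({"zarr_format": 2}))
for level, shape in enumerate([(512, 512), (256, 256)]):
    write_level(root, level, shape, ("y", "x"))

written = sorted(p.relative_to(root).as_posix() for p in root.rglob(".z*"))
```

Because every group then contains arrays of a single shape, the same dimension names can be reused in each group without conflicting sizes.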
How about making a patch release (0.4.1) for this? |
So far, I see this proposal includes breaking changes in terms of the data layout so I don't think a patch release is an amenable option in its current form. Semi-related, is the proposal to exclusively support the new layout, i.e. have OME-NGFF 0.x fully compatible with the netcdf/xarray model? Or would we have a period of transition where both layouts would be supported? One way or another, this decision will have implications on implementations, both readers and writers.
Thanks for the clarification @sbesson; I went through the whole discussion in more detail now and here are my thoughts:
This could be done as a patch release (and is what I was referring to), but it does not help w.r.t. compatibility with xarray.
Indeed, this is a breaking change and should not be done as a patch release (and for sure not retroactively, this would invalidate the v0.4 data that is out there!). For the changes here: I guess we have two options:
I am in favor of option 1 since I do believe xarray support is important and this is the only feasible way to get there. (Although it will need a bit of refactoring in readers and writers...)
I think having a transition period would make things complicated. If we decide to go with this change, we should stick to the versioning and require 0.5 to have the new format. |
This pull request has been mentioned on Image.sc Forum. There might be relevant details there: https://forum.image.sc/t/ome-ngff-community-call-transforms-and-tables/71792/5 |
Hey all, thought I should leave some comments here after the spec call this week. tldr
This PR is more about
Hey @ivirshup, Thanks for sharing your thoughts and code.
This PR is about compatibility with netCDF. xarray and xr.open_zarr compatibility come for free. xarray Datasets are based on netCDF groups. And the OME multiscale images can be mapped to the proposed higher order xarray.DataTree in a natural way. An OME

Yes, OME-NGFF and netCDF are different standards that do not overlap 100%. However, we should strive for compatibility when possible. We will not have 100% NGFF functionality. That does not mean that the functionality that results is not valuable. Few, if any, single pieces of software implement 100% of the functionality of even the current relatively minimal OME-NGFF standard: high-content screening, axes, bioformats2raw.layout, coordinateTransformations, multiscales, omero, labels, image-label, plate, well. That does not mean the current ecosystem of software striving for OME-NGFF support does not have value. The value of standards means that we do not need to control all the related software. Indeed, this is an extremely important quality because it allows the ecosystem to flourish. And everyone benefits as a result. Beyond xarray, a sampling of other software tools supporting the NetCDF standard:
Wouldn't it be cool if we could open OME-NGFF images, even in a basic way, without having to fork and hack and control each one? And wouldn't it be cool if the community of researchers using these tools could use software tools from the OME-NGFF community, even in just a basic way? I strongly think we should avoid unnecessarily resisting compatibility with other standards, software tools and research communities.
I disagree with this. The title, branch name, and description are pretty specific to xarray – as is most of the discussion. The referenced issue is titled: "Compatibility with xarray".
I agree compatibility with existing tools is useful, however:
My understanding is that netcdf uses arrays being stored in the same groups to indicate that they should be used together (e.g. in an xarray.Dataset). I think this limits how useful having a netcdf compatible tool read directly from an ome-zarr store can be.
I agree, but I think a lot of the potential here is actually realized from building on a standard like zarr, rather than going to a standard on top of zarr.

Alternative vision for netcdf compat

I think it would be quite easy to create a view of an ome-zarr store that was compatible with netcdf usage. This could be done with references (e.g. symlinks) and metadata transformations. Not so different from:

> [I] prefer that each label image is self-contained, then combine them into xarray.Dataset instances based on the needs of a specific application

But with the added benefit that the format itself can keep the "one way to store an image", while having broader compatibility with netcdf.

Alternative alternative vision for netcdf compat

Another vision would be to go full on netCDF, and layer all of OME-NGFF on top of it. I would assume this conversation has happened before here.
The changes proposed in this PR (storing scale levels in separate groups) open up two possibilities:
As I noted earlier in this issue, I don't love this idea. I stand by the principle that we should have just 1 way of organizing multiscale images, and it should be a way that isolates different multiscale images from each other.
I actually quite like this idea, but it would be a radical departure from the ongoing conversations about transformation metadata (where the assumption is that transformations, and thus coordinates, are defined in JSON metadata). It seems premature to change the spec to support coordinate arrays before there's a concrete proposal to actually use coordinate arrays for OME-NGFF. I should note that we only get xarray compatibility "for free" if we use exactly their zarr encoding, and the real blocker for that is the absence of coordinate arrays in OME-NGFF. I would love free compatibility with xarray but this PR doesn't actually bring us closer to it unless we have xarray-compatible coordinates. @thewtex is there anything I'm missing here? Specifically, is there any way to get xarray compatibility without using coordinate arrays in OME-NGFF?
I would strongly agree with this point
I believe the |
There are many mentions of xarray. It is worth inspecting why,
Regarding 1), by placing the image pixel array and metadata in a common group, we gain compatibility with the netCDF groups. And xarray is based on the netCDF data model:
and extensions to Xarray's data model, labeled arrays without coordinates, and hierarchical data, the xarray DataTree, intentionally are compatible with the netCDF data model,
Note that nodes of the tree are xarray.Dataset's and not xarray.DataArray's. Deviation from the netCDF data model means deviation from the standard model used by the geospatial research community. This model has been around for decades, is used by other software, and does its job well. It is a standard with a community of existing data and software that support it. There are other Python libraries supporting netCDF, and other software built on the netCDF C library and the netCDF Java library.

Regarding 2), maybe some folks are only interested in possible use of xarray as another Python library. Speaking for myself at least, I am interested in broader compatibility between open data and open source software, xarray and beyond, between the geospatial research community and the scientific software research communities. I would like the ability to use the same software developed for climate research in cancer research and vice versa. Many of the algorithms developed do not care whether the pixels come from clouds or cells. This means compatibility between the OME-NGFF and netCDF data models. We get a lot of value by being able to load pixel data and dimension names. This is why the labelled array without coordinates is mentioned and is considered. Many times, just a pixel data array goes a long way. By placing the image in a group, we gain compatibility of the pixel data array between OME-NGFF and netCDF.
Yes, this alternative is worth considering, but it adds unnecessary complexity for just accessing pixel data. And implementing it across all the software that supports the netCDF data model is not scalable or sustainable.
We do not want to shoehorn all of OME-NGFF into netCDF, but that is not what is proposed.
Many tools are based on the Unidata netCDF C library, which is getting zarr support, as previously mentioned, and that will trickle down.
This approach was taken many times in the TIFF ecosystem -- parties came along, built on the TIFF format, and created their own data models that did not share common tags. Sure, transformations could be implemented. But this causes unnecessary pain when common information is desired. Wikipedia's characterization of TIFF:
We should seek compatibility when possible and appropriate. Loading pixel arrays is important.
@d-v-b I agree with you. We should not bring in xarray's support for netCDF data coordinates into OME-NGFF unless it is appropriate. And this proposal does not add coordinates to OME-NGFF.
Note that xarray can save and load Dataset's with and without coordinate arrays. They are not required.
Yes!
Coordinate arrays are required if you want xarray to know about the coordinates of your data. Loading OME-NGFF data into xarray without the coordinates (specified implicitly via
I believe that only option 2 is needed, because the hypothetical xarray OME backend would be free to create In fact, I think we should aim for a situation where either OME-NGFF images are read correctly by xarray (e.g., with coordinates), or not at all, with no middle ground. Without including either a) xarray-compatible coordinate arrays, or b) the assurance of an OME-NGFF backend for xarray, this PR enables lossy deserialization of OME-NGFF images by xarray which could lead to massive confusion -- for example, after this PR, someone might load an OME-NGFF collection into xarray, generate some coordinates (because the OME-NGFF transformations were ignored by xarray), and then use |
This proposal supports use of Xarray, including coordinates, correctly. It is not necessary to only support loading Xarray through overly complex transformations that only work in specific implementations and intentionally diverge from conventions of the geospatial community. Also, coordinate arrays are not required by all use cases in Xarray. The fact is, you can write and read xarray Datasets without coordinates. This is what motivates a labeled array without coordinates in a simplified version of the Xarray package proposed in the medical imaging community (nibabel). We should look to support this use case.
This is not correct. There is not going to be massive confusion if
Dataset.to_zarr is not going to automatically generate a valid OME-NGFF with or without this proposal. |
As a frequent xarray user, I'm a little skeptical of this claim. Yes, technically you can have dimensions without coordinates, but in my experience, coordinates are the key feature of xarray, and I simply wouldn't use the library if I didn't want coordinates. And I find it very hard to imagine doing anything useful with multiscale images in xarray without coordinates, because there will be no way to relate different scale levels to one another. So, speaking for myself, if xarray could natively load OME-NGFF multiscale images but not generate coordinates, this would not be terribly useful unless coordinates were handled correctly.
Can you explain how this proposal handles coordinates correctly? As I understand it, there are only two ways to get xarray-compatible coordinates for OME-NGFF (explicit coordinate arrays or an OME-NGFF backend for xarray) |
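To show what "relating scale levels" means numerically, here is a minimal sketch that derives per-axis coordinates from a scale and a translation, in the spirit of the OME-NGFF transformation metadata (coordinate = index * scale + translation). The helper `axis_coordinates` and the numbers are hypothetical, and the 0.5 offset assumes level 1 was produced by averaging pairs of level-0 samples.

```python
def axis_coordinates(size, scale, translation=0.0):
    """Physical coordinate of each sample along one axis."""
    return [translation + index * scale for index in range(size)]


# Full resolution: 8 samples spaced 1.0 apart, starting at 0.0.
level0 = axis_coordinates(size=8, scale=1.0)

# Downsampled by 2: 4 samples spaced 2.0 apart, each centred between
# the two level-0 samples it was averaged from.
level1 = axis_coordinates(size=4, scale=2.0, translation=0.5)
```

Without such coordinates, index 1 at level 1 and index 1 at level 0 look identical, even though they sit at different physical positions.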
the spirit of this discussion continues in #174 |
@d-v-b + others on this thread. Would you be able to join one of the GeoZarr Steering Working Group meetings to discuss this a bit further? Info on scheduling/how to join is here: https://hackmd.io/@briannapagan/geozarr-spec-swg
@briannapagan I'd love to attend but the time zone lines up very poorly for me (I'm in Melbourne Australia). Having said that for the one in mid-April maybe I'll stay up. 😃 (We lose DST this weekend so it becomes 1am-2am, slightly closer to a civilised time. 😅) |
Wait is it EST or EDT? 🤔 #TimeZonesAreHard |
@jni it's EDT now (thanks for the catch on the doc, and yes #TimeZonesAreHard haha!). Trying to keep it as async as possible - but also happy to chat one on one in the future at a time better for you. You can also track convos: https://github.com/zarr-developers/geozarr-spec
cc: @thewtex @aurghs @malmans2
see: #48