Collections Specification #31
Comments
This issue has been mentioned on Image.sc Forum. There might be relevant details there: https://forum.image.sc/t/next-call-on-next-gen-bioimaging-data-tools-feb-23/48386/9 |
@DragaDoncila Thank you very much for the detailed post! It makes a lot of sense and I am looking forward to whatever you come up with! Incidentally, we (cc @constantinpape) were also working on this topic during the past few days. I am also pinging @d-v-b. I would like to add one idea and would be curious to hear opinions: currently I think that I would prefer storing images within a zarr container without any hierarchy, i.e. just as a flat list. The main reason is simplicity for the reader and writer libraries (the current HCS specification does not follow this). Anything that imposes a hierarchy would be handled by the collections specification, which I think could be seen as metadata that specifies how to display and lay out several images together. |
This means you would like to store images as 2d arrays and volumes as 3d arrays, correct? |
I was not gonna enter the 3D vs 5D discussion here, but just wanted to say that I feel that structuring the zarr like this: https://ngff.openmicroscopy.org/latest/#hcs-layout feels overly complex to me. |
Ok, so your point is to rather have a flat hierarchy of images in the zarr container:
and then define the potential hierarchies in the collections metadata (just a mock-up): {
"well1": ["image1", "image2"],
"well2": ["image3", "image4"]
} |
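A minimal sketch of how a reader could resolve a mock-up like the one above, assuming the collections metadata is a plain mapping from a group name to image names and that every image sits at the top level of the container. The function name `resolve_collection` and the container name `plate.zarr` are hypothetical and not part of any spec.

```python
from pathlib import PurePosixPath

# Mock-up collections metadata, copied from the comment above.
collections = {
    "well1": ["image1", "image2"],
    "well2": ["image3", "image4"],
}


def resolve_collection(container, collections, name):
    """Return the path of every image in one collection of a flat container."""
    root = PurePosixPath(container)
    return [str(root / image) for image in collections[name]]


print(resolve_collection("plate.zarr", collections, "well1"))
```

Because the images are stored flat, the reader never has to descend into nested groups; the grouping lives only in the mapping.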
Yes, exactly. The way I see it conceptually is that a multi-well plate is a specific layout of a bag of images and, as such, should be covered by our collections specification, which I would currently see as metadata that exists independent of the way we store the raw image data. What do you think? |
One feature that the current HCS layout gives us is a URL to a specific Well. So I can open a specific Well like: https://hms-dbmi.github.io/vizarr/v0.1?source=https://s3.embassy.ebi.ac.uk/idr/zarr/v0.1/plates/2551.zarr/A/1 I guess you could try to use a URL |
I see that this is cool, but I am afraid that (i) these hierarchies make it harder to parse an ome.zarr and (ii) it is not flexible; for example, I guess I cannot produce a single URL to show me all the images that were subjected to the same biological treatment (which may be several wells). |
I completely agree that the metadata and storage should be independent, because I think this also provides the opportunity to support a wider range of custom metadata. For example this:
could easily be this (for some geographical feature learning model):
I guess that's what I was thinking of when I said
I like the idea of a flat set of images with the hierarchy determined entirely by the metadata. That certainly seems the easiest way to support an arbitrary level of hierarchy without ending up with a very complex storage structure. |
Is this a MAY or a MUST? And what happens when/if someone does make use of the folder structure available in Zarr/N5/HDF5? |
Re @tischi "flexibility and biological treatment" - I'm wondering whether there must be a single 'hierarchy' in the container, or whether we can have multiple. E.g.
And:
or
Those are all different ways of grouping the images.
|
My current idea would be to have no hierarchy on the data storage level, but provide the possibility to specify different "views" on the data on the metadata level. Something along the lines:
Does that make sense to you? |
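One way to picture such "views" is sketched below. It assumes a hypothetical layout in which each view maps group names to image names, so the same flat set of images can be grouped by well and by treatment at once. The key names and helper functions are illustrative only.

```python
# Hypothetical views metadata: two independent groupings of the same four images.
views = {
    "by_well": {"A1": ["image1", "image2"], "A2": ["image3", "image4"]},
    "by_treatment": {"control": ["image1", "image3"], "drug": ["image2", "image4"]},
}


def images_in(views, view, group):
    """Return the images that one group of one view refers to."""
    return list(views[view][group])


def groups_of(views, image):
    """List every (view, group) pair that references an image."""
    return sorted(
        (view, group)
        for view, groups in views.items()
        for group, members in groups.items()
        if image in members
    )


print(images_in(views, "by_treatment", "control"))
print(groups_of(views, "image1"))
```

Adding or replacing a view only touches this metadata; the stored image data is unchanged.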
Personally, I'd be for a MUST, i.e. not support hierarchies and then ignore anything stored at deeper levels. |
I think I am not such a big fan of the MUST here. There are some use cases where hierarchies make a lot of sense to keep the data ordered. As a simple example:
is a more natural (and easier to navigate) way of storing this than
|
OK, fair enough :) I guess the question is whether, in practice, one would navigate the data via the "views" or via the folder structure. If one changes one's mind at some point about the folder structure, this could be quite expensive in terms of reordering all the data (at least that's how I understood object stores to work), while it would be very cheap to just replace the views, wouldn't it? |
Sure, reordering the folder structure is not such a good idea but also not necessary because we can have multiple views for the same data. But having a hierarchical folder structure does not change anything about the views except that there will be some |
Yes, that is true. I guess it'd be fine with a MAY, but should we then maybe "strongly encourage" that there is a default_view specified that one could go to in order to efficiently find out what's in the dataset, without having to go through the whole "folder structure"? (I am also thinking about our experience that things like |
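A rough sketch of the "default view first" idea, assuming hypothetical `views` and `default_view` keys in the top-level attributes. The `walk_store` callback stands in for whatever expensive traversal of the folder structure a reader would otherwise need.

```python
def list_images(attrs, walk_store):
    """Prefer the default view; fall back to walking the store only if needed."""
    views = attrs.get("views", {})
    default = attrs.get("default_view")
    if default in views:
        seen = []
        for members in views[default].values():
            for image in members:
                if image not in seen:  # keep order, drop repeats
                    seen.append(image)
        return seen
    return walk_store()


attrs = {
    "default_view": "by_well",
    "views": {"by_well": {"A1": ["image1", "image2"], "A2": ["image2", "image3"]}},
}
print(list_images(attrs, walk_store=lambda: []))
```

With a default view present, the walk callback is never invoked, which is the point of encouraging writers to provide one.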
I'm also a little hazy on object stores, but my impression is that all the 'paths' within a bucket are really just 'keys'. So I imagine they could be changed without moving the data on disk. However, since you won't always be working with object stores, allowing E.g. this could be valid:
Any reason not to allow this? |
@will-moore I think that's fine and a very good point. On a file system there are some benefits to this and on an object store there are no disadvantages. |
OK, so it looks like there's enough consensus here to start on something a bit more concrete.
Option 1
A
Each
Option 2
An alternative is to use the "path" as the ID/key of each image. Any reason not to do this? (For labels metadata we decided not to use an ID as key because the ID was a number, which is not a valid key in JSON.) This protects against having two identical 'path' values, which would be possible above.
Other optional metadata
Within the
and groupings. I guess we could use the path as the identifier of each image.
Which could mean that the "images" list/dict above is not needed (if we don't have any other metadata, and every image is in a group)? BUT it simplifies the spec to say that So, is everyone happy with Option 1 or Option 2? Or would like to suggest improvements to whichever is their favourite? |
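To illustrate the difference between the two options, here is a sketch that converts a list of image entries, each carrying a `path`, into a mapping keyed by path. The entry fields other than `path` are made up for the example; the only point is that a mapping cannot hold the same path twice, so duplicates surface as an error during conversion.

```python
def index_by_path(images):
    """Turn a list of {"path": ..., ...} entries into a dict keyed by path."""
    indexed = {}
    for entry in images:
        path = entry["path"]
        if path in indexed:
            raise ValueError(f"duplicate image path: {path}")
        # Everything except the path becomes the value for that key.
        indexed[path] = {k: v for k, v in entry.items() if k != "path"}
    return indexed


images = [
    {"path": "image1", "name": "first"},
    {"path": "image2", "name": "second"},
]
print(index_by_path(images))
```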
No. To move objects around in object storage is always a copy/delete operation.
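As a toy illustration of why this is expensive, the sketch below uses a plain dict as a stand-in for a bucket (it does not call any real object-store API). "Renaming" a prefix means touching every object under it: each one is copied to its new key and the old key is deleted.

```python
def rename_prefix(store, old, new):
    """Move every object under `old` to `new` by copy, then delete."""
    moved = 0
    # Snapshot the matching keys first so the dict is not mutated while iterating.
    for key in [k for k in store if k.startswith(old)]:
        store[new + key[len(old):]] = store[key]  # copy the object
        del store[key]                             # delete the original
        moved += 1
    return moved


bucket = {"A/1/0/0.0": b"chunk", "A/1/0/0.1": b"chunk", "B/1/0/0.0": b"chunk"}
print(rename_prefix(bucket, "A/1/", "well_A1/"), sorted(bucket))
```

The cost grows with the number of chunks under the prefix, whereas editing collection metadata touches a single small JSON document.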
It sounds like we're struggling with the semantics of having one "hard-coded" hierarchy beside the additional collections. In the ome-zarr-py implementation (and we could work to formalize this), there's a generator pattern. You start at the group you're given and then ask for what it "points to" and then process that. You will always start from a single group, so perhaps we're saying that you will only use the metadata of the given group for the objects that are generated. |
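The generator pattern described here can be sketched roughly as follows. This is not the ome-zarr-py API; it is a self-contained illustration in which a dict maps group paths to their attributes, the real NGFF key `multiscales` marks an image, and a hypothetical `collection` key lists child groups.

```python
def walk(groups, start):
    """Yield every image path reachable from `start`, depth first."""
    attrs = groups[start]
    if "multiscales" in attrs:
        # This group is an image: emit it and stop descending.
        yield start
        return
    for child in attrs.get("collection", []):
        yield from walk(groups, f"{start}/{child}")


groups = {
    "plate": {"collection": ["A", "B"]},
    "plate/A": {"collection": ["1"]},
    "plate/A/1": {"multiscales": []},
    "plate/B": {"multiscales": []},
}
print(list(walk(groups, "plate")))
print(list(walk(groups, "plate/A")))
```

Starting from `plate/A` only consults the metadata at and below that group, which matches the idea that the given group's metadata decides what is generated.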
@joshmoore I agree. Maybe, for simplicity, we could restrict this issue to discussion of the additional collections, and make an extra issue for the "hard-coded" hierarchy?
@will-moore Do I get it right that currently one such .zattrs file would contain only one collection? Meaning that to specify multiple additional collections we would need several
Option 2 looks more concise, so maybe slight preference for that one. In terms of the layout, instead of specifying row and column, I think specifying a translation in physical coordinates may also be an option.
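The relation between the two layout styles can be sketched as a simple conversion, assuming every tile has the same shape and pixel size. The function name, the `gap` parameter and the (y, x) ordering are assumptions made for this example only.

```python
def grid_translation(row, column, tile_shape, pixel_size, gap=0.0):
    """Return the (y, x) offset of a grid tile in physical units."""
    height, width = tile_shape  # tile size in pixels
    step_y = height * pixel_size + gap
    step_x = width * pixel_size + gap
    return (row * step_y, column * step_x)


# Tile at row 1, column 2, for 100 x 200 pixel tiles at 0.5 units per pixel.
print(grid_translation(1, 2, (100, 200), 0.5))
```

A translation is the more general form: any row/column grid can be written as translations, but not every arrangement of translations fits a grid.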
|
I also prefer option2. And as @tischi brought up I think it's important to think about how to map different collections for the same data (or subsets of it), either in the same .zattrs or distributed into different ones in some defined pattern. |
I was thinking the multiple So the simplest way to support multiple
|
I think if we want to easily support images being opened both as part of their collection and on their own then it would make sense to have each image as its own well-formed ome-zarr, including a .zattrs file? It would mean either duplication of some metadata, or a top level .zattrs which only contains the necessary information for traversing the collection i.e. the snippet @will-moore posted just above |
Yes, in the examples I've posted, there would be a full OME-Zarr in each of the |
Just to clarify one issue that will become more important with Zarr V3, each of those paths contains an |
Taking the current HCS spec as an example, it is true that this doesn't allow an image to be in more than one collection (plate). The collection's metadata is one (or more) levels up from the Image or Well.
I'm not sure I understand this. The E.g. viewing https://hms-dbmi.github.io/vizarr/v0.1?source=https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.1/plates/2551.zarr that path to plate contains the plate metadata at https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.1/plates/2551.zarr/.zattrs Also in the case of HCS, the hierarchies are quite well defined, so we can even start at a Well (and we know it's a well because the .zattrs has In the case of a more generic Collection, as discussed above, it's true that for a given path/to/image/ we wouldn't know which parent directory defines the collection container and holds the collection metadata. |
I can't find any info on JSON-schema validating references within JSON instances. The only use of references or ids is within a schema, to refer to other schemas. e.g. https://json-schema.org/understanding-json-schema/structuring.html#ref |
I was referring to the proposals above. A path to a Zarr subgroup with images of a single collection does not have the NGFF Collection metadata, so when reading it, it looks like a plain (non-NGFF) Zarr group containing NGFF images. One could check whether the parent directory is a Zarr group and has NGFF Collections metadata, then iterate over all collections to find the one containing these images. This works if I introduce restrictions on my Zarr hierarchy. But for the general case you would have to walk up to the file system root and check every parent because image paths can have several nesting levels. In short, "a Zarr collection of images" cannot be expressed as a path/URL (but as path to collections + collection name). |
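The parent search described above could look roughly like this on a local file system. The sketch assumes a hypothetical `collections` key in `.zattrs`; it walks from the image directory towards the file system root and returns the first parent whose attributes contain that key.

```python
import json
from pathlib import Path


def find_collection_root(image_path, key="collections"):
    """Walk up from an image directory to the nearest group whose .zattrs has `key`."""
    for parent in Path(image_path).resolve().parents:
        zattrs = parent / ".zattrs"
        if zattrs.is_file() and key in json.loads(zattrs.read_text()):
            return parent
    # Reached the file system root without finding collection metadata.
    return None
```

Each level costs one metadata read, and on a remote store each read is a separate request, which is why a path to the collection plus a collection name is cheaper than a bare image path.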
At ome/omero-cli-zarr#88 there is code for exporting a Dataset of Images, according to one of the Collection specs discussed above. And there is an example Dataset hosted at https://minio-dev.openmicroscopy.org/idr/v0.3/datasets/idr0043/13901.zarr/.zattrs That URL defines a collection of Images, listed in the .zattrs where we could also include the name and other metadata (this example doesn't). The .zattrs is simple, and we have since decided to use a different structure, but it doesn't look like a plain (non-NGFF) Zarr group because it contains the "collection" dictionary:
That collection can be viewed in a vizarr PR deployed at Apologies if I'm getting my Zarr terminology wrong, but is that not an example of a path to a Zarr group (or subgroup?) with (limited) Collection metadata and no searching of parent directories needed to find the images? |
This issue has been mentioned on Image.sc Forum. There might be relevant details there: https://forum.image.sc/t/next-call-on-next-gen-bioimaging-data-tools-2022-01-27/60885/11 |
This issue has been mentioned on Image.sc Forum. There might be relevant details there: |
Copying from image.sc:
With regards to this proposal, I was wondering if there should be different types of collections (those with shared coordinate spaces and ones with separate spaces)? |
This issue has been mentioned on Image.sc Forum. There might be relevant details there: https://forum.image.sc/t/intermission-ome-ngff-0-4-1-bioformats2raw-0-5-0-et-al/72214/1 |
This issue has been mentioned on Image.sc Forum. There might be relevant details there: https://forum.image.sc/t/cli-programmatic-browsing-of-ome-zarr-hierarchies-on-idr/75907/7 |
@normanrz @joshmoore @will-moore
|
Yes!
I think it fits nicely in the OME-Zarr world. For our purposes, there is no requirement to include other image formats. |
For me, having a collection spec for a bunch of TIFF files also would be very handy. |
I'd also be very interested to see where the collection work is going! We're just interested in OME-Zarr images and use the HCS spec heavily. But would be interesting to hear how this can generalize to other collections and how metadata about the contents of the collection is represented :) |
I think we got bogged down previously because we tried to do too much, so let's see if we can list the requirements that we all agree on before we try to find solutions. Here's a possible list, and I'm only selecting 2 items to start. Maybe others could copy this list and select items or add their own items?
|
What it could look like:
|
How do I copy the list? :-) If someone tells me I am happy to reformat what I write here: For my current work I (and I think @jluethi may be interested) would need one more thing: Some way to specify the shape and pixel type metadata for a collection of images, such that I don't need to open all of them during initialisation (use case is similar to HCS where we have hundreds of images with identical metadata that we want to open in a "grid view"); essentially I want to be able to express that "here is a collection of images and they all have the same shape (dimensions) and pixel type, namely ...
|
In the use case where we assume all items of the collection (e.g. all wells) have the same metadata, I don't see a huge difference about whether I load the metadata from the collection level or from a single (randomly chosen? alphabetically first?) image. A simple flag may be sufficient then. |
We had a need for collections and did a most minimal implementation of this proposal. We were using it for grouping images associated with each well (collection A1 → {modality1, modality2…}). Now with the plate spec not being marked transitional, we are contemplating moving to that (storing each modality as a separate plate), because it is more standards-compliant than our own extension, we practically don't encounter plate layouts that are not row/column based, and we don't need assumptions that each collection actually contains all expected images. Things to consider:
@jluethi, couldn't you define (sub)collections based on same metadata? Unless the partitioning into subcollections is different for every metadata property. |
@tischi here's the markdown for the list to copy/paste
|
@normanrz Do you need to specify rendering settings for each image (rather than a rendering for the whole collection)? Is this just to save time, to avoid loading the settings from each I like @jluethi's idea of flags to say e.g.
This is very minimal and once we know that, we can load shape, dtype, rendering settings from the first image. But it might be too restrictive if e.g. you want different settings for the plate than for individual images. Making these "image properties follow the NGFF image spec" as @aeisenbarth suggested makes sense when possible (e.g. name and rendering settings that are in the NGFF multiscales metadata), but might be tricky for others that are in the Zarr .zarray data (dtype, shape). |
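A sketch of the flag idea, assuming a hypothetical boolean `uniform` on the collection and a hypothetical `images` list. When the flag is set, the reader opens only the first image and reuses its shape and dtype for all the others. The `read_array_meta` callback stands in for reading an image's `.zarray`.

```python
def collection_layout(attrs, read_array_meta):
    """Map each image path to its array metadata, reading as little as possible."""
    images = attrs["images"]
    if attrs.get("uniform", False):
        # One read is enough: every image is declared to share shape and dtype.
        first = read_array_meta(images[0])
        return {path: first for path in images}
    return {path: read_array_meta(path) for path in images}


def fake_reader(path):
    """Stand-in for a real .zarray read; returns fixed metadata."""
    return {"shape": [2048, 2048], "dtype": "<u2"}


attrs = {"uniform": True, "images": ["image1", "image2", "image3"]}
print(collection_layout(attrs, fake_reader))
```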
Use-case wise I realised that for me it would also be interesting to know whether
I wonder now, if this collection.json is mainly meant to be read and written by a computer, whether it would be easier and more generic to simply have the option to add to each image all sorts of metadata from the .zarray files, such as dtype, shape, multiscales. Since this is in a single file, it would still be very fast for a computer to figure out whether "all are 2D" or "all have the same shape" and so on. I think the same argument can be made for rendering settings. What about the following: We first figure out all Then, as a second step, we discuss whether it is worth having a |
This came up in the community call: Think of a collection as the definition of the view state in a visualization tool. Now, the render settings feel better placed in the view than with the individual images because you may want to have the same image in different views with different settings. For example, you may want to choose different colors for an image in different views.
I want to mention that the idea of consolidated metadata has come up at the Zarr level quite a few times. This would mean that the Storing information like
or
seems like redundant aggregations that could easily be derived by the visualization or processing software (given consolidated metadata). |
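To show how such aggregations could be derived, the sketch below takes a dict shaped like Zarr v2 consolidated metadata (a `metadata` mapping whose keys end in `.zarray`, `.zgroup` or `.zattrs`) and computes the flags on the fly. The exact flag names are assumptions for illustration.

```python
def summarize(consolidated):
    """Derive collection-wide flags from consolidated array metadata."""
    arrays = [
        meta
        for key, meta in consolidated["metadata"].items()
        if key.endswith(".zarray")
    ]
    shapes = {tuple(meta["shape"]) for meta in arrays}
    dtypes = {meta["dtype"] for meta in arrays}
    return {
        "same_shape": len(shapes) == 1,
        "same_dtype": len(dtypes) == 1,
        "all_2d": all(len(shape) == 2 for shape in shapes),
    }


consolidated = {
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "image1/0/.zarray": {"shape": [512, 512], "dtype": "<u2"},
        "image2/0/.zarray": {"shape": [512, 512], "dtype": "<u2"},
    },
}
print(summarize(consolidated))
```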
This issue has been mentioned on Image.sc Forum. There might be relevant details there: https://forum.image.sc/t/faim-hcs-functions-to-work-with-hcs-data/78868/20 |
This issue has been mentioned on Image.sc Forum. There might be relevant details there: https://forum.image.sc/t/save-a-single-labels-dataset-into-an-ome-zarr/93505/39 |
What is an image collection?
A collection of images is a semantic grouping of two or more associated ome-ngff images and/or image-labels.
This definition could include
What workflows should it support?
The specification should support implementations being able to traverse the image collection and, where relevant, map the associated metadata to the physical coordinate space for loading these images.
Ideally, the specification should provide sufficient information at each level of a hierarchical grouping to allow for the loading of both the entire collection, and the loading of an arbitrary level of the hierarchy. This can be important when wanting to share/view partial datasets or update only small parts of the entire collection.
Where labels or other related data are provided (e.g. meshes, points…), the specification should support associating any member of the image collection with its labels, regardless of the level in the hierarchy.
The OME-NGFF spec is close to supporting this functionality with the HCS specification which allows the positioning of wells into rows and plates. The main drawbacks of this specification are
What should it be called?
Ideally, the names used in the base specification would be general enough to support a broad variety of use cases and tailored use cases could be demonstrated using examples in the documentation.
Reference specifications
BDV XML Files
SVG
TrakEM2
Napari Plugin for image-label collections
mobie grid view of many sources
Related
Image.SC discussion on collections
Live notes from latest community call
HCS Specification
What next?
I think we should first decide on whether we want to support arbitrary levels in the hierarchy and whether we want a general spec which we can “inherit” from for more detailed specs, or whether we want one spec to rule them all.
My vote is that we define the most generic collection (a “bag” of images) which works with arbitrary levels of grouping (it’s collections all the way down), and then work to add to it for more complex collections. I will be working on this over the coming week and will post here once I have something working, but of course would love to hear what everyone’s thoughts are on the best way forward.