Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add group path to series index mapping in root attributes #157

Merged
merged 3 commits into from
Sep 23, 2022

Conversation

melissalinkert
Copy link
Member

Fixes #126.

The root level attributes (if they are written) will now contain a groups dictionary that maps each group with multiscales metadata to the corresponding Bio-Formats series index (== Image index in METADATA.ome.xml).

As noted in the comments, the paths are dictionary keys and the integer indexes are the values. If the indexes were keys, they would be stored as strings, so this seems easier for a reader. Example tiny plate:

$ bin/bioformats2raw "test&plateRows=3&plateCols=2.fake" test-plate
$ head -n 10 test-plate/.zattrs
{
  "groups" : {
    "C/1/0" : 4,
    "C/2/0" : 5,
    "B/1/0" : 2,
    "A/1/0" : 0,
    "B/2/0" : 3,
    "A/2/0" : 1
  },
  "bioformats2raw.layout" : 3,

Happy to hear better ideas overall (especially for the attribute name), this is just a place to start.

/cc @kkoz @erindiel @DavidStirling @joshmoore @dgault @sbesson @chris-allan (and feel free to add anyone I missed)

Fixes glencoesoftware#126.

The root level attributes (if they are written) will now contain
a "groups" dictionary that maps each group with multiscales metadata
to the corresponding Bio-Formats series index (== Image index in METADATA.ome.xml).
@DavidStirling
Copy link
Member

This is absolutely amazing!

@sbesson
Copy link
Member

sbesson commented Jun 6, 2022

Thanks @melissalinkert for starting this discussion and coming up with a few proposal. Capturing a few immediate thoughts:

  • thinking from a client perspective, I can definitely see the use case for mapping the path to a Zarr multiscales group to an index within a list of images without requiring to parse large XML files. Generally, the approach proposed here feels quite similar to the rowIndex/columnIndex attribute introduced in the plate.wells specification and providing and explicit mapping via the Zarr attrbutes.
  • from my side, the naming question largely depends on whether this is expected to be specified upstream or whether it should be still considered as internal for now. At least, in the context of the OME-NGFF specification, I can certainly see groups being too generic a name especially current effort might lead towards the specification of extra Zarr groups which would not be captured by this new proposed key. images would be a more specific name and would directly maps into the OME vocabulary. It is also the name used in the well specification and this could likely raise ambiguity on why the same key is defined at different levels
  • from a specification and reader perspective, what should be the expectation and/or the precedence order in case the XML and the groups key do not match?
  • do you foresee adding more metadata in the future and needing a list of JSON objects rather than a simple dictionary e.g.
 "groups" : [
   {
      "index" 4,
      "path": "C/1/0"
   },
   {
      "index": 5,
      "path": "C/2/0"
    },
    ...
  ],
  • finally in terms of scalability, the biggest unknown from my side is the impact of registering every field of view in the top-level metadata in the case of real-world HCS datasets with typically 1-10K images. The size of the JSON had already been a concern during the original work on HCS data (HCS Well metadata ome/ngff#10) and effectively led to the introduction of the well specification to distribute the metadata at different levels of the hierarchy

Based on the above, an alternative approach would be to extend the existing well specification and store the image index as an extra index attribute for each object of the images list. While this approach would bring maximal compatibility with the existing specification, it comes with 2 primary caveats: 1- this would only apply to HCS datasets at the moment, 2- finding the path to a Zarr group given an image index would require to parse plate.wells then iterate over the wells until the relevant image index found.

@melissalinkert
Copy link
Member Author

Our use cases aren't specific to HCS, so I'd prefer to have a more generic solution than just extending the well specification.

I don't think it makes sense to add more metadata under this new attribute, so a simple dictionary should be enough. If we need more image or well-level metadata, then those specs should be updated accordingly.

In terms of mismatches, I think what is in .zattrs should take precedence over the natural ordering of the paths. Thinking of a case like:

"groups" : {
    "0" : 1,
    "1" : 2,
    "2" : 0
  },

...that likely means a corresponding update to raw2ometiff. If this were written up as part of the spec, it probably means adding a bunch of requirements to the effect of the path MUST exist, the index MUST be greater than 0 and less than the group count, each path and index MUST be specified exactly once, etc. That doesn't address what happens if the mapping is fundamentally wrong (with dimension mismatches), but we arguably have that problem already if the OME-XML is edited incorrectly.

One thought for handling the JSON size issue is to move this out of root .zattrs and into a new OME/.zattrs. That reinforces that this metadata is tied to METADATA.ome.xml, and means that anyone who doesn't need it doesn't have to read it. It's still ~150KB for a plate with 10000 images, so if that's too big we'll need to think about other options.

@chris-allan
Copy link
Member

Given the current hard mapping between Bio-Formats series and the path to the group which contains the multiscales metadata it may make more sense to embrace this rather than attempt to make something generic that isn't. Something like (for plate data):

"series": [
  "C/1/0",
  "C/2/0"
]

and for non-plate data:

"series": [
  "0",
  "1"
]

This does what it's supposed to do and no more.

@sbesson
Copy link
Member

sbesson commented Jun 7, 2022

If extensibility is unlikely to be a requirement, given that the series are always indexed from 0 to n, a simple list as proposed in #157 (comment) would certainly do the job.

One thought for handling the JSON size issue is to move this out of root .zattrs and into a new OME/.zattrs. That reinforces that this metadata is tied to METADATA.ome.xml, and means that anyone who doesn't need it doesn't have to read it. It's still ~150KB for a plate with 10000 images, so if that's too big we'll need to think about other options.

Interesting. At least from my perspective, this would feel like a good compromise allowing to store this metadata without inflating the top-level .zattrs.
Assuming we are still in brainstorming mode, the other option I could think of would be to store such information natively as an object array. I assume the tradeoff here would be the added complexity for accessing the indexing array as opposing to just decoding a JSON vs the extra features associated with using the native Zarr array storage (chunking, compression).

@melissalinkert
Copy link
Member Author

beb41fa and 508c9bb together should address #157 (comment) and move the new attribute to OME/.zattrs. It took me a bit to convince myself that this still works with --series, but it does because --series will remove any unconverted Images before writing METADATA.ome.xml.

@sbesson
Copy link
Member

sbesson commented Jun 10, 2022

Using the latest commit, I successfully converted a public IDR plate with 384 wells, 30 fields of view i.e. ~10K images. I uploaded the output to a temporary public bucket https://uk1s3.embassy.ebi.ac.uk/bf2raw_157/7361.zarr in case the Glencoe team already has some tools where to test the series mapping proposal.

@melissalinkert
Copy link
Member Author

👍 thanks @sbesson!

@joshmoore
Copy link
Contributor

Very sorry for my long silence on this! 😷

My highest level comment is whether or not there's interest in trying to move this discussion to ome/ngff as a new spec. To me this feels very much like the first steps towards the non-transitional (permanent?) solution for ome/ngff#104

melissalinkert commented 14 days ago
One thought for handling the JSON size issue is to move this out of root .zattrs and into a new OME/.zattrs. That reinforces that this metadata is tied to METADATA.ome.xml, and means that anyone who doesn't need it doesn't have to read it. It's still ~150KB for a plate with 10000 images, so if that's too big we'll need to think about other options.

It might be worth a round of discussions on it, but it seems like the feeling is generally "everything can't be in one JSON file". One thing I might add though is that at the top level we might want to "register" this location. (See example below)

chris-allan commented 13 days ago
Given the current hard mapping between Bio-Formats series and the path to the group which contains the multiscales metadata it may make more sense to embrace this rather than attempt to make something generic that isn't.

In my first quick attempts at this, I had also sided on the use of an array. I was almost sure I had posted this somewhere, but not being able to find it I'll repeat myself here:

{
    "@type": "GenericContainer",
    "metadataSources": [
        {
            "@type": "OMEXMLMetadata",
            "file": {
                "path": "/OME/METADATA.ome.xml"
            },
            "seriesMapping": [
                "s0", "s1", "s2"
            ]
        }
    ]
}

melissalinkert added a commit to melissalinkert/ngff that referenced this pull request Jul 5, 2022
@melissalinkert
Copy link
Member Author

https://github.com/melissalinkert/ngff/commits/group-list (on top of ome/ngff#112) is a start at describing what this PR does now.

melissalinkert added a commit to melissalinkert/raw2ometiff-1 that referenced this pull request Jul 6, 2022
joshmoore pushed a commit to joshmoore/ngff that referenced this pull request Jul 18, 2022
@joshmoore
Copy link
Contributor

joshmoore commented Sep 15, 2022

Used this current branch to build 2 samples which have been uploaded to uk1s3 for testing. (See ome/ome-ngff-validator#13 for URLs)

Edit: current mainline version of ZarrReader opens both fine with:

BF_CP=$ZARR_JAR showinf -noflat /opt/ome-zarr-py/bf2raw/plate-rows-2.zarr/.zattrs

cc: @dgault

Copy link
Member

@sbesson sbesson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am planning to give ome/ngff#112 another round of review today. My understanding is that the next step there will be to port this addition to the 0.4 specification document and update https://ngff.openmicroscopy.org/0.4/ to clarify the expectations and conventions of this layout for data consumers

My vote is to get this change merged and move towards the release bioformats2raw 0.5.0. Once the upstream OME-NGFF specification is updated, the README can also be amended to include a reference to the relevant section.

@chris-allan chris-allan merged commit 83388a9 into glencoesoftware:master Sep 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Enumerate series in zarr metadata?
5 participants