Add group path to series index mapping in root attributes #157

melissalinkert · 2022-06-04T00:30:11Z

Fixes #126.

The root level attributes (if they are written) will now contain a groups dictionary that maps each group with multiscales metadata to the corresponding Bio-Formats series index (== Image index in METADATA.ome.xml).

As noted in the comments, the paths are dictionary keys and the integer indexes are the values. If the indexes were keys, they would be stored as strings, so this seems easier for a reader. Example tiny plate:

$ bin/bioformats2raw "test&plateRows=3&plateCols=2.fake" test-plate
$ head -n 10 test-plate/.zattrs
{
  "groups" : {
    "C/1/0" : 4,
    "C/2/0" : 5,
    "B/1/0" : 2,
    "A/1/0" : 0,
    "B/2/0" : 3,
    "A/2/0" : 1
  },
  "bioformats2raw.layout" : 3,

Happy to hear better ideas overall (especially for the attribute name), this is just a place to start.

/cc @kkoz @erindiel @DavidStirling @joshmoore @dgault @sbesson @chris-allan (and feel free to add anyone I missed)

Fixes glencoesoftware#126. The root level attributes (if they are written) will now contain a "groups" dictionary that maps each group with multiscales metadata to the corresponding Bio-Formats series index (== Image index in METADATA.ome.xml).

DavidStirling · 2022-06-06T11:46:36Z

This is absolutely amazing!

sbesson · 2022-06-06T14:10:03Z

Thanks @melissalinkert for starting this discussion and coming up with a few proposal. Capturing a few immediate thoughts:

thinking from a client perspective, I can definitely see the use case for mapping the path to a Zarr multiscales group to an index within a list of images without requiring to parse large XML files. Generally, the approach proposed here feels quite similar to the rowIndex/columnIndex attribute introduced in the plate.wells specification and providing and explicit mapping via the Zarr attrbutes.
from my side, the naming question largely depends on whether this is expected to be specified upstream or whether it should be still considered as internal for now. At least, in the context of the OME-NGFF specification, I can certainly see groups being too generic a name especially current effort might lead towards the specification of extra Zarr groups which would not be captured by this new proposed key. images would be a more specific name and would directly maps into the OME vocabulary. It is also the name used in the well specification and this could likely raise ambiguity on why the same key is defined at different levels
from a specification and reader perspective, what should be the expectation and/or the precedence order in case the XML and the groups key do not match?
do you foresee adding more metadata in the future and needing a list of JSON objects rather than a simple dictionary e.g.

 "groups" : [
   {
      "index" 4,
      "path": "C/1/0"
   },
   {
      "index": 5,
      "path": "C/2/0"
    },
    ...
  ],

finally in terms of scalability, the biggest unknown from my side is the impact of registering every field of view in the top-level metadata in the case of real-world HCS datasets with typically 1-10K images. The size of the JSON had already been a concern during the original work on HCS data (HCS Well metadata ome/ngff#10) and effectively led to the introduction of the well specification to distribute the metadata at different levels of the hierarchy

Based on the above, an alternative approach would be to extend the existing well specification and store the image index as an extra index attribute for each object of the images list. While this approach would bring maximal compatibility with the existing specification, it comes with 2 primary caveats: 1- this would only apply to HCS datasets at the moment, 2- finding the path to a Zarr group given an image index would require to parse plate.wells then iterate over the wells until the relevant image index found.

melissalinkert · 2022-06-06T16:39:53Z

Our use cases aren't specific to HCS, so I'd prefer to have a more generic solution than just extending the well specification.

I don't think it makes sense to add more metadata under this new attribute, so a simple dictionary should be enough. If we need more image or well-level metadata, then those specs should be updated accordingly.

In terms of mismatches, I think what is in .zattrs should take precedence over the natural ordering of the paths. Thinking of a case like:

"groups" : {
    "0" : 1,
    "1" : 2,
    "2" : 0
  },

...that likely means a corresponding update to raw2ometiff. If this were written up as part of the spec, it probably means adding a bunch of requirements to the effect of the path MUST exist, the index MUST be greater than 0 and less than the group count, each path and index MUST be specified exactly once, etc. That doesn't address what happens if the mapping is fundamentally wrong (with dimension mismatches), but we arguably have that problem already if the OME-XML is edited incorrectly.

One thought for handling the JSON size issue is to move this out of root .zattrs and into a new OME/.zattrs. That reinforces that this metadata is tied to METADATA.ome.xml, and means that anyone who doesn't need it doesn't have to read it. It's still ~150KB for a plate with 10000 images, so if that's too big we'll need to think about other options.

chris-allan · 2022-06-07T09:34:01Z

Given the current hard mapping between Bio-Formats series and the path to the group which contains the multiscales metadata it may make more sense to embrace this rather than attempt to make something generic that isn't. Something like (for plate data):

"series": [
  "C/1/0",
  "C/2/0"
]

and for non-plate data:

"series": [
  "0",
  "1"
]

This does what it's supposed to do and no more.

sbesson · 2022-06-07T11:47:58Z

If extensibility is unlikely to be a requirement, given that the series are always indexed from 0 to n, a simple list as proposed in #157 (comment) would certainly do the job.

One thought for handling the JSON size issue is to move this out of root .zattrs and into a new OME/.zattrs. That reinforces that this metadata is tied to METADATA.ome.xml, and means that anyone who doesn't need it doesn't have to read it. It's still ~150KB for a plate with 10000 images, so if that's too big we'll need to think about other options.

Interesting. At least from my perspective, this would feel like a good compromise allowing to store this metadata without inflating the top-level .zattrs.
Assuming we are still in brainstorming mode, the other option I could think of would be to store such information natively as an object array. I assume the tradeoff here would be the added complexity for accessing the indexing array as opposing to just decoding a JSON vs the extra features associated with using the native Zarr array storage (chunking, compression).

melissalinkert · 2022-06-07T17:59:17Z

beb41fa and 508c9bb together should address #157 (comment) and move the new attribute to OME/.zattrs. It took me a bit to convince myself that this still works with --series, but it does because --series will remove any unconverted Images before writing METADATA.ome.xml.

sbesson · 2022-06-10T16:51:04Z

Using the latest commit, I successfully converted a public IDR plate with 384 wells, 30 fields of view i.e. ~10K images. I uploaded the output to a temporary public bucket https://uk1s3.embassy.ebi.ac.uk/bf2raw_157/7361.zarr in case the Glencoe team already has some tools where to test the series mapping proposal.

melissalinkert · 2022-06-10T19:11:48Z

👍 thanks @sbesson!

joshmoore · 2022-06-20T12:30:57Z

Very sorry for my long silence on this! 😷

My highest level comment is whether or not there's interest in trying to move this discussion to ome/ngff as a new spec. To me this feels very much like the first steps towards the non-transitional (permanent?) solution for ome/ngff#104

melissalinkert commented 14 days ago
One thought for handling the JSON size issue is to move this out of root .zattrs and into a new OME/.zattrs. That reinforces that this metadata is tied to METADATA.ome.xml, and means that anyone who doesn't need it doesn't have to read it. It's still ~150KB for a plate with 10000 images, so if that's too big we'll need to think about other options.

It might be worth a round of discussions on it, but it seems like the feeling is generally "everything can't be in one JSON file". One thing I might add though is that at the top level we might want to "register" this location. (See example below)

chris-allan commented 13 days ago
Given the current hard mapping between Bio-Formats series and the path to the group which contains the multiscales metadata it may make more sense to embrace this rather than attempt to make something generic that isn't.

In my first quick attempts at this, I had also sided on the use of an array. I was almost sure I had posted this somewhere, but not being able to find it I'll repeat myself here:

{
    "@type": "GenericContainer",
    "metadataSources": [
        {
            "@type": "OMEXMLMetadata",
            "file": {
                "path": "/OME/METADATA.ome.xml"
            },
            "seriesMapping": [
                "s0", "s1", "s2"
            ]
        }
    ]
}

Documents the current state of glencoesoftware/bioformats2raw#157

melissalinkert · 2022-07-05T21:26:54Z

https://github.com/melissalinkert/ngff/commits/group-list (on top of ome/ngff#112) is a start at describing what this PR does now.

See glencoesoftware/bioformats2raw#157

Documents the current state of glencoesoftware/bioformats2raw#157

joshmoore · 2022-09-15T10:23:03Z

Used this current branch to build 2 samples which have been uploaded to uk1s3 for testing. (See ome/ome-ngff-validator#13 for URLs)

Edit: current mainline version of ZarrReader opens both fine with:

BF_CP=$ZARR_JAR showinf -noflat /opt/ome-zarr-py/bf2raw/plate-rows-2.zarr/.zattrs

cc: @dgault

sbesson

I am planning to give ome/ngff#112 another round of review today. My understanding is that the next step there will be to port this addition to the 0.4 specification document and update https://ngff.openmicroscopy.org/0.4/ to clarify the expectations and conventions of this layout for data consumers

My vote is to get this change merged and move towards the release bioformats2raw 0.5.0. Once the upstream OME-NGFF specification is updated, the README can also be amended to include a reference to the relevant section.

melissalinkert added 2 commits June 7, 2022 12:28

Move group map to OME/.zattrs and update attribute name

beb41fa

Switch series attribute from map to list

508c9bb

melissalinkert added a commit to melissalinkert/ngff that referenced this pull request Jul 5, 2022

Add "series" attribute in "OME" group under bioformats2raw.layout

056b4e9

Documents the current state of glencoesoftware/bioformats2raw#157

melissalinkert added a commit to melissalinkert/raw2ometiff-1 that referenced this pull request Jul 6, 2022

Allow OME Zarr group

b976226

See glencoesoftware/bioformats2raw#157

melissalinkert mentioned this pull request Jul 6, 2022

Allow OME Zarr group glencoesoftware/raw2ometiff#81

Merged

joshmoore pushed a commit to joshmoore/ngff that referenced this pull request Jul 18, 2022

Add "series" attribute in "OME" group under bioformats2raw.layout

57acc23

Documents the current state of glencoesoftware/bioformats2raw#157

joshmoore mentioned this pull request Sep 15, 2022

bioformats2raw.layout ome/ngff#112

Merged

sbesson approved these changes Sep 16, 2022

View reviewed changes

chris-allan merged commit 83388a9 into glencoesoftware:master Sep 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add group path to series index mapping in root attributes #157

Add group path to series index mapping in root attributes #157

melissalinkert commented Jun 4, 2022

DavidStirling commented Jun 6, 2022

sbesson commented Jun 6, 2022

melissalinkert commented Jun 6, 2022

chris-allan commented Jun 7, 2022

sbesson commented Jun 7, 2022

melissalinkert commented Jun 7, 2022

sbesson commented Jun 10, 2022

melissalinkert commented Jun 10, 2022

joshmoore commented Jun 20, 2022

melissalinkert commented Jul 5, 2022

joshmoore commented Sep 15, 2022 •

edited

Loading

sbesson left a comment

Add group path to series index mapping in root attributes #157

Add group path to series index mapping in root attributes #157

Conversation

melissalinkert commented Jun 4, 2022

DavidStirling commented Jun 6, 2022

sbesson commented Jun 6, 2022

melissalinkert commented Jun 6, 2022

chris-allan commented Jun 7, 2022

sbesson commented Jun 7, 2022

melissalinkert commented Jun 7, 2022

sbesson commented Jun 10, 2022

melissalinkert commented Jun 10, 2022

joshmoore commented Jun 20, 2022

melissalinkert commented Jul 5, 2022

joshmoore commented Sep 15, 2022 • edited Loading

sbesson left a comment

Choose a reason for hiding this comment

joshmoore commented Sep 15, 2022 •

edited

Loading