HCS group layout #9

sbesson · 2020-10-07T12:00:59Z

Following the support for multiscales and masks, the focus is now shifting to representing HCS data in the NGFF spec. An initial prototype of a plate layout had already been implemented in the context of the OME Community Meeting 2020 - https://github.com/ome/omero-guide-cellprofiler/blob/3a441e5594b80e8e95e5e473baa8da140db03656/notebooks/idr0002_zarr.ipynb.

Overall it feels like the HCS specification should primarily revolve around:

the specification of extra group(s) above the multiscales modelling the HCS concept
the specification of the metadata conventions associated with each group

The effective dimensions currently supported by the OME model and found in the various HCS datasets produced by the community are: Plate, Plate Acquisition (also called Plate Run), Well and Well Sample (also called Field of View). The first question is how flat or deep the Zarr folder hierarchy should be to represent these concepts. The two layouts below are put forward for discussion.

All names, layout and content are still up for discussion at this stage.

Option 1: single group

This is the closest to the implementation mentioned above where a series of multiscale images aka Zarr groups (potentially with labels) are collected within a plate Zarr group. Each multiscale image represents a field of view within a well within a plate acquisition with its metadata specified in a dedicated well sample specification.

└── plate.zarr                # Plate
    ├── .zgroup
    ├── .zattrs               # Implements "plate"
    ├── 0                     # First field of view 
    │   ├── .zgroup
    │   ├── .zattrs           # Implements "multiscale", "omero", "well sample" 
    │   ├── 0
    │   │   ...               # Resolution levels
    │   ├── n
    │   └── labels
    ├── ...                   # Fields of view
    └── n

Pros:

this keeps the addition fairly minimal and does not create a large nested specialized structure
outside the HCS use case, this simple layout representation could potentially be generalized or at least concepts could be re-used for representing multi-position acquisitions i.e. a group of images related via some spatial context

Cons:

some of the classical HCS look-ups ("find all fields of view within a well") involve traversing all elements and iterating over the attributes
many HCS datasets will easily exceed 10K images per plate acquisition these days: the typical number of wells ranges from 96 to 1536, and some acquisition systems will image ~300 fields of view per well. From the classical experience of HCS file formats (which can easily create 10K-100K binary files under a single folder), we know that a large number of folders can lead to performance issues on file systems.
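
To make the first drawback concrete, here is a minimal sketch of the "find all fields of view within a well" look-up under Option 1. The attribute keys (`well_sample`, `row`, `column`, `index`) are hypothetical, since no metadata names have been agreed at this stage, and the plate is modelled as a plain dictionary standing in for the per-image .zattrs files.

```python
# Hypothetical stand-in for the .zattrs of each image group in a flat plate.
# The keys "well_sample", "row", "column" and "index" are illustrative only.
flat_plate = {
    "0": {"well_sample": {"row": "A", "column": 1, "index": 0}},
    "1": {"well_sample": {"row": "A", "column": 1, "index": 1}},
    "2": {"well_sample": {"row": "A", "column": 2, "index": 0}},
    "3": {"well_sample": {"row": "B", "column": 1, "index": 0}},
}


def fields_in_well(plate, row, column):
    """Return the group names of all fields of view in one well.

    Every image's attributes must be inspected, so the cost grows with
    the total number of images in the plate, not with the well size.
    """
    matches = []
    for group_name, attrs in plate.items():
        sample = attrs["well_sample"]
        if sample["row"] == row and sample["column"] == column:
            matches.append(group_name)
    return matches


print(fields_in_well(flat_plate, "A", 1))  # ['0', '1']
```

On a real store each dictionary access above would be a read of one .zattrs file, which is where the traversal cost would come from.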

Option 2: plate/acquisition/well/well sample

In this proposal, three groups are inserted above the image group: plate, plate acquisition and well. Each multiscale image represents a field of view within a well within a plate acquisition. The full HCS metadata is distributed across the plate acquisition, well and well sample specifications.

└── plate.zarr                    # Plate
    ├── .zgroup
    ├── .zattrs                   # Implements "plate"
    │
    ├── 0                         # First plate acquisition
    │   │
    │   ├── .zgroup
    │   ├── .zattrs               # Implements "plate acquisition"
    │   │
    │   ├── 0                     # First well
    │   │   ├── .zgroup
    │   │   ├── .zattrs           # Implements "well"
    │   │   ├── 0                 # First field of view
    │   │   │   ├── .zgroup
    │   │   │   ├── .zattrs       # Implements "multiscale", "omero", "well sample"
    │   │   │   ├── 0
    │   │   │   │   ...             # Resolution levels
    │   │   │   ├── n
    │   │   │   └── labels
    │   │   ├── ...               # Fields of view
    │   │   └── n
    │   ├── ...                   # Wells
    │   └── m
    ├── ...                       # Plate acquisitions
    └── l

Pros:

for most HCS datasets, this should limit the maximal number of sub-groups to the order of 1K
the structure is more amenable to the classical ways to store and introspect HCS elements

Cons:

in many cases, there is only one acquisition per plate and the second group will be a singleton
this requires four new specifications as opposed to two to describe the various levels. The added complexity might be a barrier to adoption
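
As a rough illustration of the sub-group counts behind these trade-offs, the sketch below compares the largest number of sub-groups any single group would hold under Option 1 and Option 2. It uses the upper-end figures quoted earlier in this issue (1536 wells, ~300 fields of view per well) and assumes a single plate acquisition; the function names are made up for the example.

```python
def max_children_flat(acquisitions, wells, fields_per_well):
    # Option 1: every field of view is a direct child of the plate group.
    return acquisitions * wells * fields_per_well


def max_children_nested(acquisitions, wells, fields_per_well):
    # Option 2: each group only holds the groups of the level directly below,
    # so the largest group is bounded by the largest single dimension.
    return max(acquisitions, wells, fields_per_well)


# Upper-end figures quoted in this issue, with a single plate acquisition.
print(max_children_flat(1, 1536, 300))    # 460800
print(max_children_nested(1, 1536, 300))  # 1536
```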

Option 3: plate/acquisition/row/column/well sample

See https://github.com/ome/omero-ms-zarr/issues/73#issuecomment-706770955

Group names

In both examples above, 0, 1,...n are used as the generic group names. Using more explicit, informative names reflecting the acquisition e.g. A1, A2, ... or A1 Field 1, A1 Field 2... is definitely a possibility. Given the number of variants found in the ecosystem, I would avoid trying to enforce these names and/or rely on them. Instead the corresponding metadata (typically row, column, index) should be unambiguously specified within the .zattrs of the relevant group(s).

manics · 2020-10-08T10:35:54Z

Where does the metadata for intermediate levels (plate-acquisition, well) go? For option 2 it can go into .zattrs at the appropriate level of the hierarchy. For option 1 would there be a single metadata file at the plate level with details of everything apart from the image .zattrs, or would plate-acquisition and well metadata be duplicated in all of the image .zattrs?

sbesson · 2020-10-08T14:57:17Z

Following today's discussion and for completeness, pasting here a first list of candidates for specifying HCS metadata as defined in the OME schema:

Plate
  Name
  Description
  Columns
  ColumnNamingConvention
  Rows
  RowsNamingConvention
  ExternalIdentifier
  FieldIndex
  WellOriginX
  WellOriginXUnit 
  WellOriginY
  WellOriginYUnit 
PlateAcquisition
  Name
  Description
  StartTime
  EndTime
  MaximumFieldCount
Well
  Color
  Column
  ExternalDescription
  ExternalIdentifier
  Row
  Type
WellSample
  Index
  PositionX
  PositionXUnit
  PositionY
  PositionYUnit
  Timepoint

An advantage of the second layout where each concept (PlateAcquisition, Well,...) is represented as a group is also that the metadata fields can be stored into the .zattrs of each group.

For the first layout, my assumption is that minimally:

the well sample attributes should go into the low-level .zattrs
the plate attributes should go into the top-level .zattrs
For all the other attributes, I think all representations are on the table. One possibility would be to only store indexes at the image level i.e. plate acquisition index, well index (or well row + well column), field index and put the rest of the metadata at the plate level. The primary advantage of this structure is that parsing the top-level metadata would suffice to retrieve the dimensionality of the plate. The downside is that this could result in potentially very large JSON.
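
To get a feel for how large such plate-level JSON could become, here is a small sketch that builds a plate-level index of every image and measures its serialized size. The key names (`plate`, `images`, `row`, `column`, `field`) are assumptions made for the example, not a proposed schema, and real metadata would carry more fields per entry.

```python
import json


def plate_level_metadata(n_rows, n_columns, n_fields):
    """Build a hypothetical plate-level index of every image in a plate."""
    images = []
    for row in range(n_rows):
        for column in range(n_columns):
            for field in range(n_fields):
                images.append({"row": row, "column": column, "field": field})
    return {"plate": {"rows": n_rows, "columns": n_columns, "images": images}}


# 96-, 384- and 1536-well grids with a handful of fields per well.
for n_rows, n_columns, n_fields in [(8, 12, 4), (16, 24, 9), (32, 48, 9)]:
    metadata = plate_level_metadata(n_rows, n_columns, n_fields)
    size_kib = len(json.dumps(metadata)) / 1024
    count = len(metadata["plate"]["images"])
    print(f"{count} images -> {size_kib:.0f} KiB of JSON")
```

The size grows linearly with the number of images, so the figures printed here would need to be scaled up for acquisitions with hundreds of fields per well.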

In all cases, as discussed today, testing the layout + metadata in the context of large HCS datasets of typically several 10K images maybe with latency involved will be necessary to ensure the proposed extension remains performant for the typical queries/manipulations.

melissalinkert · 2020-10-09T16:09:27Z

Reading through this with @chris-allan, option 2 seems like the way to go. It might be worth splitting the well index into two levels in the hierarchy (row and column), especially for 1536 well plates. With option 2, we would lose some of the flexibility that PlateAcquisition currently has, where a WellSample can potentially be linked to more than one PlateAcquisition, but I think that's OK since it's in keeping with the spirit of what PlateAcquisition is meant to represent.

Allowing but not parsing or enforcing arbitrary group names sounds fine, since the actual indexes would be described in the .zattrs.

sbesson · 2020-10-11T21:29:34Z

Thanks for the input @melissalinkert and @chris-allan. Option 2 above was indeed representing the wells as a single group with some 2D indexes (typically row/column) in the .zattrs. As mentioned above, the alternative is to split this into 2 groups. Assuming a grouping by row first, this would mean the following layout:

Option 3: plate/acquisition/row/column/well sample

└── plate.zarr                    # Plate
    ├── .zgroup
    ├── .zattrs                   # Implements "plate"
    │
    ├── 2020-10-10                # First plate acquisition
    │   │
    │   ├── .zgroup
    │   ├── .zattrs               # Implements "plate acquisition"
    │   │
    │   ├── A                     # First row
    │   │   ├── .zgroup
    │   │   ├── .zattrs           # Implements "row"
    │   │   │ 
    │   │   ├── 1                 # First column
    │   │   │   ├── .zgroup
    │   │   │   ├── .zattrs       # Implements "column"
    │   │   │   │
    │   │   │   ├── Field_1       # First field of view
    │   │   │   │   │
    │   │   │   │   ├── .zgroup
    │   │   │   │   ├── .zattrs   # Implements "multiscale", "omero", "well sample"
    │   │   │   │   ├── 0
    │   │   │   │   │   ...       # Resolution levels
    │   │   │   │   ├── n
    │   │   │   │   └── labels
    │   │   │   ├── ...           # Fields of view
    │   │   │   └── Field_m
    │   │   ├── ...               # Columns
    │   │   └── 12
    │   ├── ...                   # Rows
    │   └── H
    ├── ...                       # Plate acquisitions
    └── l

manics · 2020-10-12T12:38:30Z

The option 2 hierarchy makes sense, especially from a metadata perspective (note I'm assuming instead of 0 the directory name would be the well position e.g. A1). With option 3 is there any metadata that applies at the row level? From a filesystem perspective I don't think the additional level is necessary, although you can have 1000s of wells that should be OK, whereas flattening all levels as in Option 1 could lead to 100,000s of directories. This means we can make the decision on the implementation benefits of option 2 vs option 3.

joshmoore · 2020-10-14T07:48:09Z

Re-mentioning a requirement that occurred to me in the bioformats2raw context here: currently the multiscale group name is the series number in a single "fileset" group, roughly equivalent to option 1 here. That allows mapping any of the metadata in OME-XML based on the series index. If we push images down below wells with new naming, we will need a new heuristic, or need to encode the series number in the ngff metadata, or encode all of the OME-XML metadata in the ngff metadata. Just a thought.

It does make me wonder though if there isn't an option N which does some of this:

a plate group somewhere with metadata that is roughly a grid layout in JSON:

plate:
  row:  # array or dictionary
    group: # array or dictionary
       image: "path/to/image/group"

the images can then be anywhere and can contain a small metadata block saying that they are part of a plate:

  plate: "path/to/plate/group"
  row: "a" # or index?
  column: 1

We could/would then still use one of the layouts above as the preferred/default but it would provide some flexibility if need be. (The library, I think, would default to consuming the metadata as the definitive source, but the layout would make it more user friendly.)
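
A minimal sketch of how a library might treat the metadata as the definitive source under this idea: the plate group carries a grid pointing at image groups, each image carries a back-reference, and a helper checks that the two agree. The paths, the exact nesting of the grid and the helper name are all hypothetical; groups are modelled as plain dictionaries rather than real Zarr attributes.

```python
plate_path = "plates/demo"  # hypothetical location of the plate group

# Plate-level grid: row -> column -> path of the image group (JSON keys are strings).
plate_attrs = {
    "plate": {
        "a": {"1": {"image": "images/0"}, "2": {"image": "images/1"}},
        "b": {"1": {"image": "elsewhere/7"}},
    }
}

# Small metadata block stored with each image, wherever the image lives.
image_attrs = {
    "images/0": {"plate": plate_path, "row": "a", "column": 1},
    "images/1": {"plate": plate_path, "row": "a", "column": 2},
    "elsewhere/7": {"plate": plate_path, "row": "b", "column": 1},
}


def inconsistent_images(plate_path, plate_attrs, image_attrs):
    """List image paths whose back-reference disagrees with the plate grid."""
    problems = []
    for row, columns in plate_attrs["plate"].items():
        for column, entry in columns.items():
            attrs = image_attrs.get(entry["image"], {})
            expected = {"plate": plate_path, "row": row, "column": int(column)}
            if attrs != expected:
                problems.append(entry["image"])
    return problems


print(inconsistent_images(plate_path, plate_attrs, image_attrs))  # []
```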

jburel · 2020-10-14T08:22:01Z

I am now wondering if the "images anywhere" example above could be a mechanism to organise images (not in a plate) in the way that the user wants to see them. Such a feature request has been mentioned many times over the years.

sbesson · 2020-10-16T12:43:11Z

From today's meeting, the decision was to start implementing the third option (Issue description edited to link to the relevant comment) and update https://github.com/ome/omero-cli-zarr/ to create a Zarr representation of the first illumination corrected plate of the idr0033 submission (https://idr.openmicroscopy.org/webclient/?show=plate-5966).

will-moore · 2020-10-27T12:21:23Z

Updated Option 3

Shows the output that is currently being exported from OMERO by omero-cli-zarr.

└── plate.zarr                    # Plate
    ├── .zgroup
    ├── .zattrs                   # Implements "plate": {"rows":2, "columns":3,
    │                                                    "row_names": ["A", "B"], "column_names": ["1", "2", "3"],
    │                                                    "plateAcquisitions": [{"path": "2020-10-10"}],
    │                                                    "images": [{"path": "2020-10-10/A/1/Field_1"}...]
    │
    ├── 2020-10-10                # First plate acquisition
    │   │
    │   ├── .zgroup
    │   ├── .zattrs               # Implements "plate acquisition" - NB: not yet implemented
    │   │
    │   ├── A                     # First row
    │   │   ├── .zgroup
    │   │   ├── .zattrs           # Implements "row" - NB: not yet implemented
    │   │   │ 
    │   │   ├── 1                 # First column
    │   │   │   ├── .zgroup
    │   │   │   ├── .zattrs       # Implements "column" - NB: not yet implemented
    │   │   │   │
    │   │   │   ├── Field_1       # First field of view
    │   │   │   │   │
    │   │   │   │   ├── .zgroup
    │   │   │   │   ├── .zattrs   # Implements "multiscales", "omero", "well sample"
    │   │   │   │   ├── 0
    │   │   │   │   │   ...       # Resolution levels
    │   │   │   │   ├── n
    │   │   │   │   └── labels
    │   │   │   ├── ...           # Fields of view
    │   │   │   └── Field_m
    │   │   ├── ...               # Columns
    │   │   └── 12
    │   ├── ...                   # Rows
    │   └── H
    ├── ...                       # Plate acquisitions
    └── l
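
For readers who want to experiment with this layout, the following sketch writes the same group skeleton to a temporary directory using only the Python standard library. It is not the omero-cli-zarr implementation: the helper names are invented, only the plate-level keys shown in the tree above are written, the intermediate and image groups get empty .zattrs, and no pixel data or "multiscales" metadata is produced. The `{"zarr_format": 2}` content of .zgroup is what Zarr v2 expects for a group.

```python
import json
import tempfile
from pathlib import Path


def write_group(path, attrs=None):
    """Create a Zarr v2 group directory with its .zgroup and .zattrs files."""
    path.mkdir(parents=True, exist_ok=True)
    (path / ".zgroup").write_text(json.dumps({"zarr_format": 2}))
    (path / ".zattrs").write_text(json.dumps(attrs or {}))


def write_plate(root, acquisition, row_names, column_names, n_fields):
    """Write the acquisition/row/column/field skeleton and return the plate attrs."""
    image_paths = []
    for row in row_names:
        for column in column_names:
            for field in range(1, n_fields + 1):
                image_paths.append(f"{acquisition}/{row}/{column}/Field_{field}")
    plate_attrs = {
        "plate": {
            "rows": len(row_names),
            "columns": len(column_names),
            "row_names": row_names,
            "column_names": column_names,
            "plateAcquisitions": [{"path": acquisition}],
            "images": [{"path": p} for p in image_paths],
        }
    }
    write_group(root, plate_attrs)
    for image_path in image_paths:
        # Create each intermediate group (acquisition, row, column) once,
        # then the image group itself.
        parts = image_path.split("/")
        for depth in range(1, len(parts) + 1):
            group = root.joinpath(*parts[:depth])
            if not (group / ".zgroup").exists():
                write_group(group)
    return plate_attrs


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "plate.zarr"
    attrs = write_plate(root, "2020-10-10", ["A", "B"], ["1", "2", "3"], 2)
    print(len(attrs["plate"]["images"]))  # 12
    print((root / "2020-10-10" / "A" / "1" / "Field_1" / ".zgroup").exists())  # True
```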

will-moore · 2020-10-28T11:25:36Z

I guess a couple of questions about the plate metadata.
Seems strange to be listing images at the plate level. For example, if I wanted to show a Well in vizarr, I'd want the Well metadata (see Implements "column" above, which I guess should be Implements "well") to list the Fields in that Well, and not have to load the Plate metadata and parse all the "images" in the Plate to find the Fields in that Well?

sbesson · 2020-10-28T13:12:00Z

Couple of additional thoughts/issues while working on a first draft of the spec:

I think the column/well comment in https://github.com/ome/omero-ms-zarr/issues/73#issuecomment-717870790 echoes https://github.com/ome/omero-ms-zarr/issues/73#issuecomment-707094613. Splitting the well concept into a 2-levels hierarchy (row/column) as per option 3 definitely implies the row specification will contain at most very minimal metadata.
there is no top-level representation of the (maximum) number of fields of view. From a UI perspective, I can certainly see an argument for pushing the number of fields of view down to the well level. In that case, should rows/columns etc not be moved down to the plate acquisition level? See also the next point re multi-acquisition layout.
additional thought will need to go into supporting multiple plate acquisitions, especially in the case of multiple plate acquisitions with different grid layouts where the rows/columns/row_names/column_names concept would need changes. Possible options are nested dictionaries/lists, moving these keys under the plateAcquisitions key, or deciding on a different meaning e.g. the superset of possible column/row names across acquisitions
while plateAcquisitions and images are lists of dictionaries with a path key, rows and columns are integers which is slightly confusing. If we implement row and column as intermediate specifications, we might need to rename these to avoid confusion.
plateAcquisitions could probably more simply be renamed as acquisitions or runs

Also trying to think of these keys in terms of MUST vs SHOULD vs MAY as per RFC 2119. Happy to skip this for a first version, but a naive assumption would be to have anything that can be recomputed from other keys at a SHOULD level rather than a MUST (e.g. column size, names).
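
As one possible reading of that assumption, the sketch below treats row_names and column_names as the required keys and rows and columns as optional keys that can be recomputed from them. Which keys end up at which level is still open, so this split is an arbitrary choice made for the example, and the function name is invented.

```python
def plate_dimensions(plate):
    """Return (rows, columns), recomputing them from the names when absent.

    Assumes "row_names" and "column_names" are always present, and that
    "rows" and "columns" are optional but must agree with them if given.
    """
    rows = plate.get("rows", len(plate["row_names"]))
    columns = plate.get("columns", len(plate["column_names"]))
    if rows != len(plate["row_names"]) or columns != len(plate["column_names"]):
        raise ValueError("rows/columns disagree with row_names/column_names")
    return rows, columns


print(plate_dimensions({"row_names": ["A", "B"], "column_names": ["1", "2", "3"]}))  # (2, 3)
```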

jkh1 · 2020-10-29T10:07:19Z

Maybe a bit late but have you looked at the cellH5 format?
The paper is here (check the supplementary material for details, in particular Sup Fig 2 for the layout of the file).

manics · 2020-10-29T10:14:40Z

Here's the figure @jkh1 mentioned (there's no direct online link available):

sbesson · 2020-10-29T13:39:05Z

Maybe a bit late but have you looked at the cellH5 format?

Definitely not too late and thanks for bringing other known hierarchical representations into the discussion.

As a preamble, CellH5 includes several concepts, including features or objects, which are defined as part of the mid-term goal as discussed during today's community call but are outside the scope of this issue/extension.

Focusing primarily on the HCS specification, I tried to summarize my understanding of the mappings between the hierarchical structures defined in the current proposal (mentioned at the 2020-10-29 call), in the CellH5 format as well as in the 2016-06 OME schema for reference:

OME-ZARR 2020-10-29    OME 2016-06         CellH5
-                      -                   sample
plate                  Plate               plate
acquisition            PlateAcquisition    -
column + row           Well                experiment
field of view          WellSample          position

From my side, what this means is that there is no substantial conceptual gap under the plate concept. For instance, it should be doable to translate HCS data stored in CellH5 into HCS OME-Zarr with minimal changes to the layout. The main top-level group which is not accounted for is the CellH5 sample specification, which would currently be represented as a separate Zarr entity.

sbesson transferred this issue from ome/omero-ms-zarr Nov 25, 2020

sbesson mentioned this issue Nov 25, 2020

HCS specification #5

Merged

joshmoore added rfc Status: request for comments collection Concern: collection of images labels Nov 25, 2020

joshmoore closed this as completed in #5 Nov 27, 2020
