HCS group layout #9

Closed
sbesson opened this issue Oct 7, 2020 · 14 comments · Fixed by #5
Labels
collection Concern: collection of images rfc Status: request for comments

Comments

@sbesson
Member

sbesson commented Oct 7, 2020

Following the support for multiscales and masks, the focus is now shifting to representing HCS data in the NGFF spec. An initial prototype of a plate layout was already implemented in the context of the OME Community Meeting 2020 - https://github.com/ome/omero-guide-cellprofiler/blob/3a441e5594b80e8e95e5e473baa8da140db03656/notebooks/idr0002_zarr.ipynb.

Overall it feels like the HCS specification should primarily revolve around:

  • the specification of extra group(s) above the multiscales modelling the HCS concept
  • the specification of the metadata conventions associated with each group

The effective dimensions currently supported by the OME model and by the various HCS datasets produced by the community are: Plate, Plate Acquisition (also called Plate Run), Well, Well Sample (also called Field of View). The first question is how flat vs deep the Zarr folder hierarchy should be to represent these concepts. The two layouts below are put forward for discussion.

All names, layout and content are still up for discussion at this stage.

Option 1: single group

This is the closest to the implementation mentioned above, where a series of multiscale images aka Zarr groups (potentially with labels) is collected within a plate Zarr group. Each multiscale image represents a field of view within a well within a plate acquisition, with its metadata specified in a dedicated well sample specification.

└── plate.zarr                # Plate
    ├── .zgroup
    ├── .zattrs               # Implements "plate"
    ├── 0                     # First field of view 
    │   ├── .zgroup
    │   ├── .zattrs           # Implements "multiscale", "omero", "well sample" 
    │   ├── 0
    │   │   ...               # Resolution levels
    │   ├── n
    │   └── labels
    ├── ...                   # Fields of view
    └── n                    

Pros:

  • this keeps the addition fairly minimal and does not create a large nested specialized structure
  • outside the HCS use case, this simple layout representation could potentially be generalized or at least concepts could be re-used for representing multi-position acquisitions i.e. a group of images related via some spatial context

Cons:

  • some of the classical HCS look-ups ("find all fields of view within a well") involve traversing all elements and iterating over the attributes
  • many HCS datasets will easily exceed 10K images per plate acquisition these days: the typical number of wells ranges from 96 up to 1536 and some acquisition systems will image ~300 fields of view per well. From the classical experience of HCS file formats (which can easily create 10K-100K binary files under a single folder), we know that a large number of folders can lead to performance issues on file systems.
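
To illustrate the first drawback, here is a minimal sketch of the "find all fields of view within a well" look-up in the flat layout. Plain dictionaries stand in for the .zattrs of each image group, and the key names (well_sample, row, column, index) are hypothetical placeholders rather than agreed names.

```python
# Hypothetical stand-in for Option 1: one entry per image group directly
# under plate.zarr, keyed by group name, holding that group's .zattrs.
# The "well_sample" key and its fields are assumed names for illustration.
flat_plate = {
    "0": {"well_sample": {"row": "A", "column": 1, "index": 0}},
    "1": {"well_sample": {"row": "A", "column": 1, "index": 1}},
    "2": {"well_sample": {"row": "A", "column": 2, "index": 0}},
    "3": {"well_sample": {"row": "B", "column": 1, "index": 0}},
}


def fields_in_well(plate, row, column):
    """Return the group names of all fields of view in one well.

    Every image group has to be visited because the flat layout carries
    no structural hint about which well an image belongs to.
    """
    matches = []
    for name, attrs in plate.items():
        sample = attrs["well_sample"]
        if sample["row"] == row and sample["column"] == column:
            matches.append(name)
    return matches


print(fields_in_well(flat_plate, "A", 1))
```

With real storage, each iteration would mean reading one .zattrs file, so the cost of the look-up grows with the total number of images on the plate rather than with the number of fields in the well.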

Option 2: plate/acquisition/well/well sample

In this proposal, three groups are inserted above the image group: plate, plate acquisition and well. Each multiscale image represents a field of view within a well within a plate acquisition. The full HCS metadata is distributed across the plate acquisition, well and well sample specifications.

└── plate.zarr                    # Plate
    ├── .zgroup
    ├── .zattrs                   # Implements "plate"
    │
    ├── 0                         # First plate acquisition
    │   │
    │   ├── .zgroup
    │   ├── .zattrs               # Implements "plate acquisition"
    │   │
    │   ├── 0                     # First well
    │   │   ├── .zgroup
    │   │   ├── .zattrs           # Implements "well"
    │   │   ├── 0                 # First field of view
    │   │   │   ├── .zgroup
    │   │   │   ├── .zattrs       # Implements "multiscale", "omero", "well sample"
    │   │   │   ├── 0
    │   │   │   │   ...             # Resolution levels
    │   │   │   ├── n
    │   │   │   └── labels
    │   │   ├── ...               # Fields of view
    │   │   └── n
    │   ├── ...                   # Wells
    │   └── m
    ├── ...                       # Plate acquisitions
    └── l

Pros:

  • for most HCS datasets, this should limit the maximum number of sub-groups within any group to the order of 1K
  • the structure is more amenable to the classical ways to store and introspect HCS elements

Cons:

  • in many cases, there is only one acquisition per plate and the second group will be a singleton
  • this requires four new specifications as opposed to two to describe the various levels. The added complexity might be a barrier to adoption
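
As a rough back-of-the-envelope comparison of the two layouts, the sketch below counts the largest number of sub-groups any single group would hold, using the worst-case figures quoted earlier (1536 wells, ~300 fields of view per well) and assuming a single plate acquisition. Resolution levels and labels inside each image group are not counted.

```python
def max_children(children_per_level):
    """Largest number of sub-groups held by any single group in a layout."""
    return max(children_per_level.values())


# Worst case quoted above: a 1536-well plate (32 rows x 48 columns) with
# ~300 fields of view per well. One plate acquisition is an assumption.
rows, columns, fields, acquisitions = 32, 48, 300, 1
wells = rows * columns

# Option 1: every image group sits directly under the plate group.
option_1 = {"plate": acquisitions * wells * fields}

# Option 2: plate -> acquisition -> well -> field of view.
option_2 = {"plate": acquisitions, "acquisition": wells, "well": fields}

print("Option 1:", max_children(option_1))
print("Option 2:", max_children(option_2))
```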

Option 3: plate/acquisition/row/column/well sample

See https://github.com/ome/omero-ms-zarr/issues/73#issuecomment-706770955

Group names

In both examples above, 0, 1,...n are used as the generic group names. Using more explicit, informative names reflecting the acquisition e.g. A1, A2, ... or A1 Field 1, A1 Field 2... is definitely a possibility. Given the number of variants found in the ecosystem, I would avoid trying to enforce these names and/or rely on them. Instead the corresponding metadata (typically row, column, index) should be unambiguously specified within the .zattrs of the relevant group(s).
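
A minimal sketch of what "metadata, not names" could look like for a reader: the well position is taken from the group attributes only, whatever the group is called on disk. The "well" key and its "row"/"column" fields are hypothetical names, not part of any agreed specification.

```python
import json

# Hypothetical .zattrs content for a well group; the "well" key and the
# "row"/"column" fields are assumed names used only for illustration.
well_zattrs = json.loads('{"well": {"row": 0, "column": 11}}')


def well_position(attrs):
    """Read the well position from metadata, ignoring the group name."""
    well = attrs["well"]
    return well["row"], well["column"]


# The same well could be stored under any of these group names; the
# position is identical because it comes from the metadata only.
for group_name in ("11", "A12", "well_A_12"):
    print(group_name, well_position(well_zattrs))
```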

@manics
Member

manics commented Oct 8, 2020

Where does the metadata for intermediate levels (plate-acquisition, well) go? For option 2 it can go into .zattrs at the appropriate level of the hierarchy. For option 1 would there be a single metadata file at the plate level with details of everything apart from the image .zattrs, or would plate-acquisition and well metadata be duplicated in all of the image .zattrs?

@sbesson
Member Author

sbesson commented Oct 8, 2020

Following today's discussion and for completeness, pasting here a first list of candidates for specifying HCS metadata as defined in the OME schema:

Plate
  Name
  Description
  Columns
  ColumnNamingConvention
  Rows
  RowsNamingConvention
  ExternalIdentifier
  FieldIndex
  WellOriginX
  WellOriginXUnit 
  WellOriginY
  WellOriginYUnit 
PlateAcquisition
  Name
  Description
  StartTime
  EndTime
  MaximumFieldCount
Well
  Color
  Column
  ExternalDescription
  ExternalIdentifier
  Row
  Type
WellSample
  Index
  PositionX
  PositionXUnit
  PositionY
  PositionYUnit
  Timepoint

Another advantage of the second layout, where each concept (PlateAcquisition, Well,...) is represented as a group, is that the metadata fields can be stored in the .zattrs of each group.

For the first layout, my assumption is that minimally:

  • the well sample attributes should go into the low-level .zattrs
  • the plate attributes should go into the top-level .zattrs

For all the other attributes, I think all representations are on the plates. One possibility would be to only store indexes at the image level i.e. plate acquisition index, well index (or well row + well column), field index and put the rest of the metadata at the plate level. The primary advantage of this structure is that parsing the top-level metadata would suffice to retrieve the dimensionality of the plate. The downside is that this could result in potentially very large JSON.

In all cases, as discussed today, testing the layout + metadata against large HCS datasets, typically tens of thousands of images and possibly with latency involved, will be necessary to ensure the proposed extension remains performant for the typical queries/manipulations.
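
To get a feel for the "very large JSON" concern with the first layout, the sketch below builds a hypothetical plate-level index holding only indexes per image and measures its serialized size. All key names are assumptions, and the plate dimensions (a 384-well plate with 9 fields per well and one acquisition) are chosen purely for illustration.

```python
import json
from itertools import product


def plate_index(acquisitions, rows, columns, fields):
    """Build one small index record per image (hypothetical key names)."""
    coords = product(
        range(acquisitions), range(rows), range(columns), range(fields)
    )
    return [
        {"path": str(i), "acquisition": a, "row": r, "column": c, "field": f}
        for i, (a, r, c, f) in enumerate(coords)
    ]


# Illustrative plate: 16 rows x 24 columns, 9 fields per well, 1 acquisition.
index = plate_index(1, 16, 24, 9)
size_bytes = len(json.dumps({"plate": {"images": index}}).encode("utf-8"))
print(len(index), "images,", size_bytes, "bytes of JSON")
```

Scaling the arguments up to the larger plates mentioned in this issue gives a quick way to check whether a single top-level document remains practical to fetch and parse.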

@melissalinkert
Member

Reading through this with @chris-allan, option 2 seems like the way to go. It might be worth splitting the well index into two levels in the hierarchy (row and column), especially for 1536 well plates. With option 2, we would lose some of the flexibility that PlateAcquisition currently has, where a WellSample can potentially be linked to more than one PlateAcquisition, but I think that's OK since it's in keeping with the spirit of what PlateAcquisition is meant to represent.

Allowing but not parsing or enforcing arbitrary group names sounds fine, since the actual indexes would be described in the .zattrs.

@sbesson
Member Author

sbesson commented Oct 11, 2020

Thanks for the input @melissalinkert and @chris-allan. Option 2 above was indeed representing the wells as a single group with 2D indexes (typically row/column) in the .zattrs. As mentioned above, the alternative is to split this into 2 groups. Assuming a grouping by row first, this would mean the following layout:

Option 3: plate/acquisition/row/column/well sample

└── plate.zarr                    # Plate
    ├── .zgroup
    ├── .zattrs                   # Implements "plate"
    │
    ├── 2020-10-10                # First plate acquisition
    │   │
    │   ├── .zgroup
    │   ├── .zattrs               # Implements "plate acquisition"
    │   │
    │   ├── A                     # First row
    │   │   ├── .zgroup
    │   │   ├── .zattrs           # Implements "row"
    │   │   │ 
    │   │   ├── 1                 # First column
    │   │   │   ├── .zgroup
    │   │   │   ├── .zattrs       # Implements "column"
    │   │   │   │
    │   │   │   ├── Field_1       # First field of view
    │   │   │   │   │
    │   │   │   │   ├── .zgroup
    │   │   │   │   ├── .zattrs   # Implements "multiscale", "omero", "well sample"
    │   │   │   │   ├── 0
    │   │   │   │   │   ...       # Resolution levels
    │   │   │   │   ├── n
    │   │   │   │   └── labels
    │   │   │   ├── ...           # Fields of view
    │   │   │   └── Field_m
    │   │   ├── ...               # Columns
    │   │   └── 12
    │   ├── ...                   # Rows
    │   └── H
    ├── ...                       # Plate acquisitions
    └── l

@manics
Member

manics commented Oct 12, 2020

The option 2 hierarchy makes sense, especially from a metadata perspective (note I'm assuming that instead of 0 the directory name would be the well position e.g. A1). With option 3, is there any metadata that applies at the row level? From a filesystem perspective I don't think the additional level is necessary: although you can have 1000s of wells, that should be OK, whereas flattening all levels as in Option 1 could lead to 100,000s of directories. This means we can make the decision based on the implementation benefits of option 2 vs option 3.

@joshmoore
Member

Re-mentioning a requirement that occurred to me in the bioformats2raw context here: currently the multiscale group name is the series number in a single "fileset" group, roughly equivalent to option 1 here. That allows mapping any of the metadata in OME-XML based on the series index. If we push images down below wells with new naming, we will need a new heuristic, or need to encode the series number in the ngff metadata, or encode all of the OME-XML metadata in the ngff metadata. Just a thought.

It does make me wonder though if there isn't an option N which does some of this:

  • a plate group somewhere with metadata that is roughly a grid layout in JSON:
      plate:
        row:  # array or dictionary
          group: # array or dictionary
            image: "path/to/image/group"
  • the images can then be anywhere and can contain a small metadata block saying that they are part of a plate:
      plate: "path/to/plate/group"
      row: "a" # or index?
      column: 1

We could/would then still use one of the layouts above as the preferred/default but it would provide some flexibility if need be. (The library, I think, would default to consuming the metadata as the definitive source, but the layout would make it more user friendly.)
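
A minimal sketch of how a library might treat the metadata as the definitive source under this "option N": the plate grid points at image groups stored anywhere, each image points back at its plate, and a small check reports disagreements. The nesting (row, then column) is one reading of the outline above, and every path and key name here is made up for illustration.

```python
# Hypothetical plate-level metadata: a grid mapping row -> column -> image
# group path. JSON object keys are strings, hence "1" rather than 1.
plate_attrs = {
    "plate": {
        "a": {"1": {"image": "images/img_001"}, "2": {"image": "images/img_002"}},
        "b": {"1": {"image": "elsewhere/img_003"}},
    }
}

# Hypothetical image-level metadata blocks pointing back at the plate.
image_attrs = {
    "images/img_001": {"plate": "plate.zarr", "row": "a", "column": 1},
    "images/img_002": {"plate": "plate.zarr", "row": "a", "column": 2},
    "elsewhere/img_003": {"plate": "plate.zarr", "row": "b", "column": 1},
}


def inconsistent_images(plate_attrs, image_attrs):
    """Return image paths whose back-reference disagrees with the grid."""
    bad = []
    for row, columns in plate_attrs["plate"].items():
        for column, cell in columns.items():
            path = cell["image"]
            attrs = image_attrs.get(path, {})
            if (attrs.get("row"), str(attrs.get("column"))) != (row, column):
                bad.append(path)
    return bad


print(inconsistent_images(plate_attrs, image_attrs))
```

Because nothing in the check depends on where the image groups live, the same approach would cover the preferred on-disk layout and any looser arrangement.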

@jburel
Member

jburel commented Oct 14, 2020

I am now wondering if the "images anywhere" example above could be a mechanism to organise images (not in a plate) in the way a user wants to see them. Such a feature request has been mentioned many times over the years.

@sbesson
Member Author

sbesson commented Oct 16, 2020

From today's meeting, the decision was to start implementing the third option (issue description edited to link to the relevant comment) and to update https://github.com/ome/omero-cli-zarr/ to create a Zarr representation of the first illumination-corrected plate of the idr0033 submission (https://idr.openmicroscopy.org/webclient/?show=plate-5966).

@will-moore
Member

Updated Option 3

This shows the output that is currently being exported from OMERO by omero-cli-zarr.

└── plate.zarr                    # Plate
    ├── .zgroup
    ├── .zattrs                   # Implements "plate": {"rows":2, "columns":3,
    │                                                    "row_names": ["A", "B"], "column_names": ["1", "2", "3"],
    │                                                    "plateAcquisitions": [{"path": "2020-10-10"}],
    │                                                    "images": [{"path": "2020-10-10/A/1/Field_1"}...]
    │
    ├── 2020-10-10                # First plate acquisition
    │   │
    │   ├── .zgroup
    │   ├── .zattrs               # Implements "plate acquisition" - NB: not yet implemented
    │   │
    │   ├── A                     # First row
    │   │   ├── .zgroup
    │   │   ├── .zattrs           # Implements "row" - NB: not yet implemented
    │   │   │ 
    │   │   ├── 1                 # First column
    │   │   │   ├── .zgroup
    │   │   │   ├── .zattrs       # Implements "column" - NB: not yet implemented
    │   │   │   │
    │   │   │   ├── Field_1       # First field of view
    │   │   │   │   │
    │   │   │   │   ├── .zgroup
    │   │   │   │   ├── .zattrs   # Implements "multiscales", "omero", "well sample"
    │   │   │   │   ├── 0
    │   │   │   │   │   ...       # Resolution levels
    │   │   │   │   ├── n
    │   │   │   │   └── labels
    │   │   │   ├── ...           # Fields of view
    │   │   │   └── Field_m
    │   │   ├── ...               # Columns
    │   │   └── 12
    │   ├── ...                   # Rows
    │   └── H
    ├── ...                       # Plate acquisitions
    └── l
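
The following standard-library sketch writes an empty skeleton of this layout to a temporary directory, with the plate-level keys shown in the tree (rows, columns, row_names, column_names, plateAcquisitions, images). It is not the omero-cli-zarr implementation: the helper names are hypothetical, no pixel data or "multiscales" metadata is written, and the only Zarr-specific detail is the {"zarr_format": 2} content of .zgroup used by Zarr v2 groups.

```python
import json
import tempfile
from pathlib import Path


def write_group(path, attrs=None):
    """Create a Zarr v2 group directory with .zgroup and .zattrs files."""
    path.mkdir(parents=True, exist_ok=True)
    (path / ".zgroup").write_text(json.dumps({"zarr_format": 2}))
    (path / ".zattrs").write_text(json.dumps(attrs or {}))


def write_plate(root, acquisition, row_names, column_names, n_fields):
    """Write the group skeleton and return the plate-level metadata."""
    image_paths = [
        f"{acquisition}/{row}/{column}/Field_{field}"
        for row in row_names
        for column in column_names
        for field in range(1, n_fields + 1)
    ]
    plate = {
        "rows": len(row_names),
        "columns": len(column_names),
        "row_names": row_names,
        "column_names": column_names,
        "plateAcquisitions": [{"path": acquisition}],
        "images": [{"path": p} for p in image_paths],
    }
    write_group(root, {"plate": plate})
    for p in image_paths:
        # Create every intermediate group (acquisition, row, column) as
        # well as the field group itself, all with empty attributes.
        parts = Path(p).parts
        for depth in range(1, len(parts) + 1):
            write_group(root.joinpath(*parts[:depth]))
    return plate


root = Path(tempfile.mkdtemp()) / "plate.zarr"
plate = write_plate(root, "2020-10-10", ["A", "B"], ["1", "2", "3"], 2)
print(len(plate["images"]), "image groups written under", root)
```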

@will-moore
Member

I guess a couple of questions about the plate metadata.
Seems strange to be listing images at the plate level. For example, if I wanted to show a Well in vizarr, I'd want the Well metadata (see Implements "column" above - I guess that should be Implements "well") to list the Fields in that Well, and not have to load Plate metadata and parse all the "images" in the Plate to find the Fields in that Well?
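
The difference between the two approaches can be sketched as follows. The plate-level "images" list uses the shape from the export above; the well-level "images" list is a hypothetical alternative, not something currently exported.

```python
# Plate-level metadata in the shape of the export above.
plate = {
    "images": [
        {"path": "2020-10-10/A/1/Field_1"},
        {"path": "2020-10-10/A/1/Field_2"},
        {"path": "2020-10-10/A/12/Field_1"},
        {"path": "2020-10-10/B/1/Field_1"},
    ]
}


def fields_from_plate(plate, acquisition, row, column):
    """Scan every image on the plate and keep those under one well."""
    # The trailing slash keeps column "1" from also matching "12".
    prefix = f"{acquisition}/{row}/{column}/"
    return [
        img["path"] for img in plate["images"] if img["path"].startswith(prefix)
    ]


# Hypothetical well-level metadata listing only that well's fields,
# with paths relative to the well group.
well = {"images": [{"path": "Field_1"}, {"path": "Field_2"}]}


def fields_from_well(well):
    """Read the fields directly, with no plate-wide scan."""
    return [img["path"] for img in well["images"]]


print(fields_from_plate(plate, "2020-10-10", "A", "1"))
print(fields_from_well(well))
```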

@sbesson
Member Author

sbesson commented Oct 28, 2020

Couple of additional thoughts/issues while working on a first draft of the spec:

  • I think the column/well comment in https://github.com/ome/omero-ms-zarr/issues/73#issuecomment-717870790 echoes https://github.com/ome/omero-ms-zarr/issues/73#issuecomment-707094613. Splitting the well concept into a 2-level hierarchy (row/column) as per option 3 definitely implies the row specification will contain at most very minimal metadata.
  • there is no top-level representation of the (maximum) number of fields of view. From a UI perspective, I can certainly see an argument for pushing the number of fields of view down to the well level concept. In that case, should rows/columns etc not be moved down to the plate acquisition level? See also the next point re multi-acquisition layout.
  • additional thought will need to go into supporting multiple plate acquisitions, especially in the case of multiple plate acquisitions with different grid layouts where the rows/columns/row_names/column_names concept would need changes. Possible options are nested dictionaries/lists, moving these keys under the plateAcquisitions key, or deciding on a different meaning e.g. the superset of possible column/row names across acquisitions
  • while plateAcquisitions and images are lists of dictionaries with a path key, rows and columns are integers which is slightly confusing. If we implement row and column as intermediate specifications, we might need to rename these to avoid confusion.
  • plateAcquisitions could probably more simply be renamed as acquisitions or runs

Also trying to think of these keys in terms of MUST vs SHOULD vs MAY as per RFC 2119. Happy to skip this for a first version but a naive assumption would be to have anything that can be recomputed from other keys at a SHOULD level rather than a MUST (e.g. column size, names).
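
One possible reading of that naive assumption, sketched as a validator: the names are treated as required and the counts as recomputable, so a missing count is a warning while a contradictory count is an error. Which keys end up at MUST or SHOULD level is entirely an assumption here; only the key names come from the export above.

```python
def check_plate(plate):
    """Return (errors, warnings) for a plate metadata dictionary."""
    errors, warnings = [], []

    # Assumed MUST-level keys (hypothetical choice for this sketch).
    for key in ("row_names", "column_names"):
        if key not in plate:
            errors.append(f"missing required key: {key}")

    # Assumed SHOULD-level keys: each count is recomputable from the names.
    pairs = (("rows", "row_names"), ("columns", "column_names"))
    for count_key, names_key in pairs:
        if names_key not in plate:
            continue
        expected = len(plate[names_key])
        if count_key not in plate:
            warnings.append(f"{count_key} not set; derived value is {expected}")
        elif plate[count_key] != expected:
            errors.append(
                f"{count_key}={plate[count_key]} contradicts {names_key} "
                f"(length {expected})"
            )
    return errors, warnings


example = {"row_names": ["A", "B"], "column_names": ["1", "2", "3"], "rows": 2}
print(check_plate(example))
```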

@jkh1

jkh1 commented Oct 29, 2020

Maybe a bit late but have you looked at the cellH5 format?
The paper is here (check the supplementary material for details, in particular Sup Fig 2 for the layout of the file).

@manics
Member

manics commented Oct 29, 2020

Here's the figure @jkh1 mentioned (there's no direct online link available):
[Figure: CellH5 file layout (Supplementary Figure 2 of the CellH5 paper)]

@sbesson
Member Author

sbesson commented Oct 29, 2020

> Maybe a bit late but have you looked at the cellH5 format?

Definitely not too late and thanks for bringing other known hierarchical representations into the discussion.

As a preamble, CellH5 includes several concepts, including features or objects, which are part of the mid-term goals as discussed during today's community call but outside the scope of this issue/extension.

Focusing primarily on the HCS specification, I tried to summarize my understanding of the mappings between the hierarchical structures defined in the current proposal (mentioned at the 2020-10-29 call), in the CellH5 format as well as in the 2016-06 OME schema for reference:

OME-ZARR 2020-10-29 | OME 2016-06      | CellH5
--------------------+------------------+-----------
                    |                  | sample
plate               | Plate            | plate
acquisition         | PlateAcquisition |
column + row        | Well             | experiment
field of view       | WellSample       | position

From my side, what this means is that there is no substantial conceptual gap under the plate concept. For instance, it should be doable to translate HCS data stored in CellH5 into HCS OME-Zarr with minimal changes to the layout. The main top-level group which is not accounted for is the CellH5 sample specification, which would currently be represented as a separate Zarr entity.

@sbesson sbesson transferred this issue from ome/omero-ms-zarr Nov 25, 2020
@joshmoore joshmoore added rfc Status: request for comments collection Concern: collection of images labels Nov 25, 2020