consolidate_metadata for tables #12

will-moore · 2022-11-17T13:59:31Z

See discussion at https://github.com/ome/ngff/pull/64/files#r1025073730

In order to allow full access to all the tables metadata when the ability to ls subdirectories is not available (e.g. over https / s3), consolidate_metadata() can be used to add metadata to the image.zarr/tables group.

This would also provide a work-around for the listing of tables within that top-level tables group. See https://github.com/ome/ngff/pull/64/files#r1022858025

cc @ivirshup @joshmoore

joshmoore · 2022-11-17T14:03:56Z

consolidate_metadata isn't in the v2 spec and isn't supported by all implementations. I would not suggest we go that route unless absolutely necessary.

ivirshup · 2022-11-17T15:01:22Z

My main thinking here is that I don't want to be on the hook for essentially maintaining a copy of information already encoded by the Groups.

@joshmoore

consolidate_metadata isn't in the v2 spec

Did this come up as a possible backport to v2? It seems quite important to zarr's cloud friendliness.

and isn't supported by all implementations

What's the support like among implementations that are currently used for OME-NGFF?

IIRC netcdf/ xarray's zarr format doesn't require something like this, right?

I'm wondering if it just didn't come up, or if they just implemented consolidated metadata.

will-moore · 2022-11-17T15:32:38Z

OK, so I've removed the consolidate_metadata for now, and replaced it with a solution to the 2nd issue above: listing of tables.

This now writes image.zarr/tables/.zattrs:

{
    "tables": [
        "regions_table"
    ]
}

which is a small change that allows me to at least find the tables, and is consistent with the approach taken with labels.

joshmoore · 2022-11-17T20:57:57Z

I don't want to be on the hook for essentially maintaining a copy of information already encoded by the Groups.

I agree it's not ideal but the other alternatives I don't see as improvements.

Did this come up as a possible backport to v2? It seems quite important to zarr's cloud friendliness.

There's zarr-developers/zarr-specs#113 but it's a pretty major undertaking. (I refer to my oft cited metric of 6 months to introduce dimension_separator). It would also take away from the v3 effort.

What's the support like among implementations that are currently used for OME-NGFF?

Python only as far as I know.

IIRC netcdf/ xarray's zarr format doesn't require something like this, right? I'm wondering if it just didn't come up, or if they just implemented consolidated metadata.

xarray relies heavily on consolidated_metadata. I don't think netcdf does, but they have other extensions in "nczarr" that may be similar.

ivirshup · 2022-11-18T15:20:39Z

@will-moore, responding to your comment: ome/ngff#64 (comment)

Your suggestion

E.g. add this to write_table_regions() from the ome-ngff-tables-prototype:

    # write group names into .zattrs
    table_path = os.path.join(group.store.dir_path(), group.path, table_group_name)
    for sub_dir in ["layers", "obsm", "obsp", "raw", "uns"]:
        sub_path = os.path.join(table_path, sub_dir)
        if not os.path.exists(sub_path):
            continue
        children = [f.name for f in os.scandir(sub_path) if f.is_dir()]
        sub_group = table_group[sub_dir]
        sub_group.attrs[sub_dir] = children
E.g. obsm/.zattrs now looks like this:

{
    "encoding-type": "dict",
    "encoding-version": "0.1.0",
    "obsm": [
        "X_scanorama",
        "X_umap",
        "spatial"
    ]
}

You could instead do:

group: zarr.Group

keys = []
group.visit(lambda x: keys.append(x))
group.attrs["keys"] = keys

But this is basically just a worse version of consolidated metadata. I think it would make more sense to just add consolidated metadata to the spec than having to design an alternative. Especially since there aren't that many cases in the anndata specification where you know whether the child nodes are Groups or Arrays.

@joshmoore

There's zarr-developers/zarr-specs#113 but it's a pretty major undertaking.

My read here is that interest has dropped in favor of instead solving it in v3?

I don't think netcdf does, but they have other extensions in "nczarr" that may be similar.

It looks like they have an alternative: https://docs.unidata.ucar.edu/nug/current/nczarr_head.html#autotoc_md22

They store an bunch of extra jsons with information like a list of variables. These can be turned off with mode=nczarr,zarr, inference rules for which are defined right below.

I agree it's not ideal but the other alternatives I don't see as improvements.

@joshmoore @will-moore

I think "relying on the hierarchy provided by zarr" would be an improvement. It's just that for some implementations, with some stores, the hierarchy doesn't work.

To me, this reads as an issue with zarr. Perhaps one that needs a temporary work around, but ultimately one which should be fixed in zarr.

From this perspective, I would propose minimizing what needs to be done – and making it easy to revert and rely on the zarr hierarchy instead.

So maybe just saying tables should store the a consolidated metadata json or list of keys at their root in OME-NGFF? Then removing this once the spec moves up to v3 or v2 gets consolidated metadata, which ever happens first.

side notes

I would be very interested in knowing how commonly "no ls" problem occurs.
I do not think "s3 ls is slow" is something that should impact decisions here. consolidated_metadata is a much better solution to s3 being slow – and the current implementation is to make a request at each level of the hierarchy.

joshmoore · 2022-11-20T15:51:47Z

My read here is that interest has dropped in favor of instead solving it in v3?

From the POV of the other implementors, I think this would count as a significant change (and related implementation burden).

To me, this reads as an issue with zarr. Perhaps one that needs a temporary work around, but ultimately one which should be fixed in zarr.

I think it's fair to say that it's something that needs to be clarified at the Zarr Spec level, but if there remains a possibility that there's a fileset which is not listable, then the design of NGFF specifications will be forced to deal with that.

ivirshup · 2022-11-22T17:18:35Z

From the POV of the other implementors, I think this would count as a significant change (and related implementation burden).

Are the implementors up for this being solved in v3?

As a broader question, how common are unlistable stores and why are people hosting data in them?

joshmoore · 2022-12-01T14:15:18Z

Are the implementors up for this being solved in v3?

I think there will definitely be work on this at the spec level in v3. I don't want to put words in the implementors' mouths though :)

As a broader question, how common are unlistable stores and why are people hosting data in them?

AWS S3 buckets and other object storage can be configured to not allowing listing. (We even recently ran into this for an AWS Open Data bucket)
HTTP servers don't have a standardized way to provide file listings. (You can attempt to parse the HTML that a url/ returns to find all links, but YMMV.)

will-moore · 2022-12-15T23:47:24Z

Data generated with save_load_squidpy_segment.py script, with the tweaks from the current state of this PR is deployed at https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/regions_table and can be viewed with:
https://deploy-preview-20--ome-ngff-validator.netlify.app/?source=https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/regions_table

This includes adding child group names to parent group .zattrs under "keys": [].
E.g.
/test_segment.zarr/tables/regions_table/obsm/.zattrs:

{
    "encoding-type": "dict",
    "encoding-version": "0.1.0",
    "keys": [
        "X_scanorama",
        "X_umap",
        "spatial"
    ]
}

Also, an extra step required for viewing data in the browser via zarr.js is to convert int64 arrays to int32 dtype by running an additional script included above:

# generate the data...
$ python save_load_squidpy_segment.py

# fix dtypes for `obs` data
$ python fix_dtype.py test_segment.zarr/tables/regions_table/obs

src/ngff_tables_prototype/writer.py

kevinyamauchi

Awesome, thanks @will-moore !

Also, happy new year!

consolidate_metadata for tables

4653df7

joshmoore mentioned this pull request Nov 17, 2022

Table spec proposal ome/ngff#64

Closed

will-moore force-pushed the consolidate_table_metadata branch from a08c54d to 866242d Compare November 17, 2022 15:28

Write tables_group_name to tables/.zattrs

d2ce7de

will-moore force-pushed the consolidate_table_metadata branch from 866242d to d2ce7de Compare November 17, 2022 15:36

will-moore added 2 commits December 15, 2022 18:01

Add sub_group names table group.attrs['keys']

284406d

Add fix_dtype.py for web data prep for zarr.js

c13705f

will-moore mentioned this pull request Dec 15, 2022

Tables ome/ome-ngff-validator#20

Closed

kevinyamauchi reviewed Dec 21, 2022

View reviewed changes

src/ngff_tables_prototype/writer.py Outdated Show resolved Hide resolved

Don't overwrite existing 'tables' in zattrs

5017081

kevinyamauchi approved these changes Jan 6, 2023

View reviewed changes

kevinyamauchi merged commit 0b7e59c into kevinyamauchi:main Jan 6, 2023

will-moore mentioned this pull request Mar 16, 2023

Dask and Zarr not loading obsp and obsm from remote s3 scverse/anndata#951

Open

berombau mentioned this pull request May 22, 2023

Reading from remote .zarr stores scverse/spatialdata#275

Closed

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

consolidate_metadata for tables #12

consolidate_metadata for tables #12

will-moore commented Nov 17, 2022

joshmoore commented Nov 17, 2022

ivirshup commented Nov 17, 2022

will-moore commented Nov 17, 2022

joshmoore commented Nov 17, 2022

ivirshup commented Nov 18, 2022

joshmoore commented Nov 20, 2022

ivirshup commented Nov 22, 2022

joshmoore commented Dec 1, 2022

will-moore commented Dec 15, 2022

kevinyamauchi left a comment

consolidate_metadata for tables #12

consolidate_metadata for tables #12

Conversation

will-moore commented Nov 17, 2022

joshmoore commented Nov 17, 2022

ivirshup commented Nov 17, 2022

will-moore commented Nov 17, 2022

joshmoore commented Nov 17, 2022

ivirshup commented Nov 18, 2022

side notes

joshmoore commented Nov 20, 2022

ivirshup commented Nov 22, 2022

joshmoore commented Dec 1, 2022

will-moore commented Dec 15, 2022

kevinyamauchi left a comment

Choose a reason for hiding this comment