Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consolidate_metadata for tables #12

Merged

Conversation

will-moore
Copy link
Contributor

See discussion at https://github.com/ome/ngff/pull/64/files#r1025073730

In order to allow full access to all the tables metadata when the ability to ls subdirectories is not available (e.g. over https / s3), consolidate_metadata() can be used to add metadata to the image.zarr/tables group.

This would also provide a work-around for the listing of tables within that top-level tables group. See https://github.com/ome/ngff/pull/64/files#r1022858025

cc @ivirshup @joshmoore

@joshmoore
Copy link

consolidate_metadata isn't in the v2 spec and isn't supported by all implementations. I would not suggest we go that route unless absolutely necessary.

@ivirshup
Copy link
Collaborator

My main thinking here is that I don't want to be on the hook for essentially maintaining a copy of information already encoded by the Groups.


@joshmoore

consolidate_metadata isn't in the v2 spec

Did this come up as a possible backport to v2? It seems quite important to zarr's cloud friendliness.

and isn't supported by all implementations

What's the support like among implementations that are currently used for OME-NGFF?


IIRC netcdf/ xarray's zarr format doesn't require something like this, right?

I'm wondering if it just didn't come up, or if they just implemented consolidated metadata.

@will-moore will-moore force-pushed the consolidate_table_metadata branch from a08c54d to 866242d Compare November 17, 2022 15:28
@will-moore
Copy link
Contributor Author

OK, so I've removed the consolidate_metadata for now, and replaced it with a solution to the 2nd issue above: listing of tables.

This now writes image.zarr/tables/.zattrs:

{
    "tables": [
        "regions_table"
    ]
}

which is a small change that allows me to at least find the tables, and is consistent with the approach taken with labels.

@will-moore will-moore force-pushed the consolidate_table_metadata branch from 866242d to d2ce7de Compare November 17, 2022 15:36
@joshmoore
Copy link

I don't want to be on the hook for essentially maintaining a copy of information already encoded by the Groups.

I agree it's not ideal but the other alternatives I don't see as improvements.

Did this come up as a possible backport to v2? It seems quite important to zarr's cloud friendliness.

There's zarr-developers/zarr-specs#113 but it's a pretty major undertaking. (I refer to my oft cited metric of 6 months to introduce dimension_separator). It would also take away from the v3 effort.

What's the support like among implementations that are currently used for OME-NGFF?

Python only as far as I know.

IIRC netcdf/ xarray's zarr format doesn't require something like this, right? I'm wondering if it just didn't come up, or if they just implemented consolidated metadata.

xarray relies heavily on consolidated_metadata. I don't think netcdf does, but they have other extensions in "nczarr" that may be similar.

@ivirshup
Copy link
Collaborator

@will-moore, responding to your comment: ome/ngff#64 (comment)

Your suggestion

E.g. add this to write_table_regions() from the ome-ngff-tables-prototype:

    # write group names into .zattrs
    table_path = os.path.join(group.store.dir_path(), group.path, table_group_name)
    for sub_dir in ["layers", "obsm", "obsp", "raw", "uns"]:
        sub_path = os.path.join(table_path, sub_dir)
        if not os.path.exists(sub_path):
            continue
        children = [f.name for f in os.scandir(sub_path) if f.is_dir()]
        sub_group = table_group[sub_dir]
        sub_group.attrs[sub_dir] = children
E.g. obsm/.zattrs now looks like this:

{
    "encoding-type": "dict",
    "encoding-version": "0.1.0",
    "obsm": [
        "X_scanorama",
        "X_umap",
        "spatial"
    ]
}

You could instead do:

group: zarr.Group

keys = []
group.visit(lambda x: keys.append(x))
group.attrs["keys"] = keys

But this is basically just a worse version of consolidated metadata. I think it would make more sense to just add consolidated metadata to the spec than having to design an alternative. Especially since there aren't that many cases in the anndata specification where you know whether the child nodes are Groups or Arrays.

@joshmoore

There's zarr-developers/zarr-specs#113 but it's a pretty major undertaking.

My read here is that interest has dropped in favor of instead solving it in v3?

I don't think netcdf does, but they have other extensions in "nczarr" that may be similar.

It looks like they have an alternative: https://docs.unidata.ucar.edu/nug/current/nczarr_head.html#autotoc_md22

They store an bunch of extra jsons with information like a list of variables. These can be turned off with mode=nczarr,zarr, inference rules for which are defined right below.

I agree it's not ideal but the other alternatives I don't see as improvements.

@joshmoore @will-moore

I think "relying on the hierarchy provided by zarr" would be an improvement. It's just that for some implementations, with some stores, the hierarchy doesn't work.

To me, this reads as an issue with zarr. Perhaps one that needs a temporary work around, but ultimately one which should be fixed in zarr.

From this perspective, I would propose minimizing what needs to be done – and making it easy to revert and rely on the zarr hierarchy instead.

So maybe just saying tables should store the a consolidated metadata json or list of keys at their root in OME-NGFF? Then removing this once the spec moves up to v3 or v2 gets consolidated metadata, which ever happens first.

side notes

  • I would be very interested in knowing how commonly "no ls" problem occurs.
  • I do not think "s3 ls is slow" is something that should impact decisions here. consolidated_metadata is a much better solution to s3 being slow – and the current implementation is to make a request at each level of the hierarchy.

@joshmoore
Copy link

My read here is that interest has dropped in favor of instead solving it in v3?

From the POV of the other implementors, I think this would count as a significant change (and related implementation burden).

To me, this reads as an issue with zarr. Perhaps one that needs a temporary work around, but ultimately one which should be fixed in zarr.

I think it's fair to say that it's something that needs to be clarified at the Zarr Spec level, but if there remains a possibility that there's a fileset which is not listable, then the design of NGFF specifications will be forced to deal with that.

@ivirshup
Copy link
Collaborator

From the POV of the other implementors, I think this would count as a significant change (and related implementation burden).

Are the implementors up for this being solved in v3?


As a broader question, how common are unlistable stores and why are people hosting data in them?

@joshmoore
Copy link

Are the implementors up for this being solved in v3?

I think there will definitely be work on this at the spec level in v3. I don't want to put words in the implementors' mouths though :)

As a broader question, how common are unlistable stores and why are people hosting data in them?

  • AWS S3 buckets and other object storage can be configured to not allowing listing. (We even recently ran into this for an AWS Open Data bucket)
  • HTTP servers don't have a standardized way to provide file listings. (You can attempt to parse the HTML that a url/ returns to find all links, but YMMV.)

@will-moore
Copy link
Contributor Author

Data generated with save_load_squidpy_segment.py script, with the tweaks from the current state of this PR is deployed at https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/regions_table and can be viewed with:
https://deploy-preview-20--ome-ngff-validator.netlify.app/?source=https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/regions_table

This includes adding child group names to parent group .zattrs under "keys": [].
E.g.
/test_segment.zarr/tables/regions_table/obsm/.zattrs:

{
    "encoding-type": "dict",
    "encoding-version": "0.1.0",
    "keys": [
        "X_scanorama",
        "X_umap",
        "spatial"
    ]
}

Also, an extra step required for viewing data in the browser via zarr.js is to convert int64 arrays to int32 dtype by running an additional script included above:

# generate the data...
$ python save_load_squidpy_segment.py

# fix dtypes for `obs` data
$ python fix_dtype.py test_segment.zarr/tables/regions_table/obs

Copy link
Owner

@kevinyamauchi kevinyamauchi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks @will-moore !

Also, happy new year!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants