Zarr access concerns - esp. for embargoed or from browser #1745
So I misunderstood something about kerchunk. It seems it doesn't create an index of a zarr. Rather, it creates a zarr-like index of an HDF5 file. So I don't think kerchunk is the solution... but something simpler: an index file of the recursive subdirectory structure of the zarr directory.
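Such an index could be as small as a flat JSON list of keys under the zarr root. A hypothetical sketch (the `build_index` helper and its output format are my own invention, not an existing DANDI tool):

```python
import json
import os

def build_index(zarr_dir: str) -> str:
    # Hypothetical flat index: every file path under the zarr root,
    # relative to it, so an HTTP client can enumerate keys without
    # needing an S3 directory listing.
    keys = []
    for dirpath, _, filenames in os.walk(zarr_dir):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), zarr_dir)
            keys.append(rel.replace(os.sep, "/"))
    return json.dumps(sorted(keys))
```

A client would fetch this one file, then issue plain GETs for each listed key.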
I think I found the solution: https://zarr.readthedocs.io/en/stable/api/convenience.html#zarr.convenience.consolidate_metadata I think if we could generate this .zmetadata on all the zarrs, this would solve the issues.
we do recommend generating consolidated metadata for the microscopy data. however, even without consolidated metadata one should be able to traverse a zarr tree as long as .zattrs and .zgroup exist. we should ensure that zarr is validated as a container when uploaded.
I don't believe you can traverse the tree just from .zattrs and .zgroup because they don't contain the information about the subgroups or subarrays. For example, here's a .zgroup/.zattrs that doesn't contain that information: https://dandi-api-staging-dandisets.s3.amazonaws.com/zarr/fe45a10f-3aa4-4549-84b0-8389955beb0c/.zgroup
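For illustration, the entire contents of a typical zarr v2 `.zgroup` is just the format version; a quick sketch showing there is nothing in it to traverse from:

```python
import json

# The whole contents of a typical zarr v2 .zgroup file: just the
# format version. No key here names subgroups or subarrays, so a
# plain-HTTP client cannot discover children from this file alone.
zgroup = json.loads('{"zarr_format": 2}')
print(list(zgroup.keys()))
```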
@magland - sorry i misread the original post. you were able to read things with s3. indeed for http we used to have our own api endpoint. i can't seem to find this any more, so checking with @AlmightyYakob. |
also @magland, we currently don't support embargoed zarr files. there is some refactoring that's going to happen with embargo. post that we may enable zarrbargo as we call it. |
Makes sense @satra. I just want to put in my request to have consolidated .zmetadata in the root folder of the zarr archives so that they will be able to work with neurosift and other browser-based tools. |
At least for NWB Zarr files, I've requested we just always call that automatically on file creation (hdmf-dev/hdmf-zarr#139). I can add an inspector check for it as well if you'd like |
That would be great! I noticed that it's not in the example you provided a while ago:
I have some concerns about access to DANDI Zarr assets from the browser or for embargoed dandisets. I think it's likely that this could be solved by creating a kerchunk index for each Zarr asset, but I'm not sure about it. If that's the case I'd like to suggest that it be a high priority to integrate kerchunk in the dandi upload process.
Edit: Rather than kerchunk I proposed a different solution, see later comments.
I'll explain based on my current understanding (which may be limited) of remote access to Zarr directories.
@alejoe91 showed me this nice example of reading from a Zarr archive in a public AIND bucket
And it also works well when reading this DANDI Zarr example prepared by @CodyCBakerPhD
However, if I try to give zarr the HTTP URL, it can only see the top-level attributes in the zarr tree.
(side note: if I use the DANDI API URL it doesn't work at all: https://api-staging.dandiarchive.org/api/assets/a617e96e-72cd-4bb8-ab20-b3d6bdc8ecd1/download/)
This highlights the fact that you cannot use normal HTTP fetch requests to read the tree structure of a Zarr directory in an S3 bucket, because there's no way to get a directory listing (unless the admin enables public listing, which is strongly discouraged). Instead you need to use the S3 API, which requires AWS credentials (unless the bucket is public).
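Concretely, the listing a zarr client needs is S3's `ListObjectsV2` operation; a minimal sketch of the request URL it must construct (bucket and prefix are from the staging example above; public buckets answer this unauthenticated, private ones additionally require SigV4 signing):

```python
from urllib.parse import urlencode

# ListObjectsV2 request a client must issue to discover keys under a
# prefix; a plain GET on the "directory" URL returns nothing, because
# S3 has no real directories.
bucket_url = "https://dandi-api-staging-dandisets.s3.amazonaws.com"
prefix = "zarr/fe45a10f-3aa4-4549-84b0-8389955beb0c/"
query = urlencode({"list-type": 2, "prefix": prefix, "delimiter": "/"})
listing_url = f"{bucket_url}/?{query}"
print(listing_url)
```

This is exactly the call that browser-based tools cannot make against an embargoed (private) bucket without credentials.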
So this creates two problems:
As I mentioned, a possible solution is to use kerchunk and create a JSON index for every Zarr asset on DANDI. IDK if this will satisfy all the requirements, but it would be great to start trying soon at this early stage.
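For context on the kerchunk route: a kerchunk reference file maps each zarr key to either inline content or a `[url, offset, length]` triple into the source file. A minimal sketch of that JSON shape (the path, offset, and length below are made up for illustration):

```python
import json

# Minimal kerchunk-style (version 1) reference set: metadata keys are
# stored inline, chunk keys point at byte ranges in the source file.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "data/0": ["s3://example-bucket/file.h5", 512, 4096],
    },
}
print(json.dumps(refs, indent=2))
```

A client resolves each key by either returning the inline string or issuing a ranged GET for the listed byte span, which is why a single index file can replace directory listings.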
Another related concern is that use of advanced compression in Zarr assets might make it impossible to read directly from a browser.
See also flatironinstitute/neurosift#70