
Zarr support #70

Open
bendichter opened this issue Jul 18, 2023 · 13 comments

Comments

@bendichter
Contributor

This is more of a longer-term thing, as we are still building up the ecosystem for NWB Zarr on DANDI. I found this library for reading Zarr files in TypeScript: https://github.com/gzuidhof/zarr.js/. @CodyCBakerPhD is working on putting NWB Zarr test files on the DANDI staging server here, though I don't know if these test files are stable enough to build on yet. Once those are up and registered with DANDI correctly, it might be interesting to develop a generic API that can read both HDF5 and Zarr backends into neurosift. There is reason to believe Zarr could be much faster for reading, as it is explicitly designed to be efficient in the cloud.

@magland
Collaborator

magland commented Jul 18, 2023

Sounds good.

@magland
Collaborator

magland commented Jul 18, 2023

A couple of comments about the pros and cons of the Zarr format.

As you say, I think there are many advantages, including simpler and more efficient data access from the browser.

A couple of disadvantages:

- Each nwb.zarr asset can contain hundreds or even thousands of files, depending on the chunking settings. This could present a challenge for maintaining the database (cleanup of an asset becomes non-trivial).
- nwb.zarr assets will no longer have an easily computable content hash or an ETag (see the sketch after this list).
- Downloading the asset becomes non-trivial.
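
To illustrate why the content hash becomes non-trivial, here is a minimal sketch of a tree-style digest over a local .zarr directory; this is not how DANDI actually computes its Zarr checksums, and `zarr_tree_digest` and the example path are hypothetical:

```python
import hashlib
from pathlib import Path

def zarr_tree_digest(root: Path) -> str:
    """Hypothetical: aggregate per-chunk-file digests into one deterministic hash."""
    h = hashlib.sha256()
    for p in sorted(root.rglob("*")):  # deterministic order over every chunk file
        if p.is_file():
            h.update(p.relative_to(root).as_posix().encode())  # key names matter
            h.update(hashlib.sha256(p.read_bytes()).digest())
    return h.hexdigest()

print(zarr_tree_digest(Path("example.nwb.zarr")))  # hypothetical local asset
```

Unlike a single-file HDF5 asset, the digest has to walk every chunk file, so it changes whenever any chunk is rewritten and cannot be read off a single ETag.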

@bendichter
Contributor Author

bendichter commented Aug 21, 2023

Update: The DANDI team expressed interest in seeing this feature in our meeting today. We have a few dummy Zarr assets here (only the draft mode of the dandiset currently has the properly named NWB Zarr files), which do not have real neural data but should be valid Zarr and valid NWB.

@magland
Collaborator

magland commented Aug 21, 2023

> Update: The DANDI team expressed interest in seeing this feature in our meeting today. We have a few dummy Zarr assets here (only the draft mode of the dandiset currently has the properly named NWB Zarr files), which do not have real neural data but should be valid Zarr and valid NWB.

I'll bump this up to the top of the priority list.

@magland
Collaborator

magland commented Aug 22, 2023

@bendichter @CodyCBakerPhD

In order to access the Zarr structure in the S3 bucket, you need the ability to list the contents of the directories. I verified that the top-level .zgroup file is accessible (for one of the examples) here:

https://dandi-api-staging-dandisets.s3.amazonaws.com/zarr/dd868c2b-0a3b-44cb-afdb-eb62f07a701b/.zgroup

However, there is no way for me to know which subgroups exist for this dandiset, because there is no way to query the contents of the top-level directory using simple HTTP requests. While it is possible to expose/enable directory listing for S3 buckets, this is not considered good practice for security reasons.

I believe it is possible to do this using the AWS S3 SDK, but that would require authentication.

Any idea how we can overcome this limitation? The only thing that comes to mind is to have a manifest file that accompanies the dandiset.
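
To make the limitation concrete, here is a minimal sketch using the .zgroup URL above: a known key is fetchable with a plain GET, but discovering subgroups means guessing keys one by one (the `acquisition` group name below is a hypothetical guess, not a known group in this asset):

```python
import json
import urllib.error
import urllib.request

base = "https://dandi-api-staging-dandisets.s3.amazonaws.com/zarr/dd868c2b-0a3b-44cb-afdb-eb62f07a701b/"

# A known key works with a plain GET:
with urllib.request.urlopen(base + ".zgroup") as resp:
    print(json.load(resp))  # e.g. {"zarr_format": 2}

# But there is no HTTP request that *lists* keys; probing is the only option:
try:
    urllib.request.urlopen(base + "acquisition/.zgroup")  # hypothetical subgroup
except urllib.error.HTTPError as e:
    print(e.code)  # S3 answers 403/404 when the key is absent or listing is blocked
```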

@CodyCBakerPhD
Contributor

CodyCBakerPhD commented Aug 22, 2023

I believe the recommendation is to use Kerchunk to pre-compute the 'file' structure as a .json, which can then be used as the reference for which byte ranges to request from which assets.

Or set up a Flask server somewhere with a few simple REST endpoints, leveraging existing Python tools to determine such mappings on the fly, assuming the TS implementation of Zarr isn't working for you?
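
For what it's worth, here is a minimal sketch of how a pre-computed kerchunk reference could be consumed via fsspec's "reference" filesystem (the `reference.json` filename is hypothetical, and anonymous S3 access is assumed):

```python
import fsspec
import zarr

# "reference.json" is a hypothetical pre-computed kerchunk mapping for one asset
fs = fsspec.filesystem(
    "reference",
    fo="reference.json",           # byte-range reference file
    remote_protocol="s3",          # where the actual chunk bytes live
    remote_options={"anon": True}, # assumes a public bucket
)
root = zarr.open(fs.get_mapper(""), mode="r")  # browse the tree without listing S3
```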

@magland
Collaborator

magland commented Nov 9, 2023

@bendichter @CodyCBakerPhD

I wanted to follow up on this, especially as you are moving more toward supporting Zarr. The above limitation means that at present there is no way to load Zarr datasets in Neurosift, Dendro, Python scripts, or really in any method that requires reading the data remotely. So I wanted to ask whether you have plans to generate kerchunk .json files along with every Zarr asset. If so, do you have any examples yet that I could begin to develop against?

@CodyCBakerPhD
Contributor

@magland There are definitely ways to read the Zarr datasets in Python scripts (did you mean streaming from DANDI specifically?).

As a first learning experience, did you try downloading the NWB Zarr files and following the HDMF-Zarr instructions on reading them?

Also ask Alessio, since I'm quite sure the Allen Institute uses Zarr in this capacity on a regular basis.

@magland
Collaborator

magland commented Nov 9, 2023

> There are definitely ways to read the Zarr datasets in Python scripts (did you mean streaming from DANDI specifically?).

@CodyCBakerPhD
I misspoke when I said "Python scripts, or really in any method that requires reading the data remotely". I meant doing this without having the AWS credentials for the bucket. (Yes, I am interested in streaming from DANDI.)

> As a first learning experience, did you try downloading the NWB Zarr files and following the HDMF-Zarr instructions on reading them?

I haven't tried that. Do you have an example NWB Zarr file on DANDI that I can download? I can start with that, but my ultimate question is whether this can be done without downloading.

> Also ask Alessio, since I'm quite sure the Allen Institute uses Zarr in this capacity on a regular basis.

@alejoe91 When you read Zarr from a remote bucket, do you have the AWS credentials and use something like boto3? I am looking for an HTTP-only way to do it (without AWS credentials).

@alejoe91

alejoe91 commented Nov 9, 2023

@magland we have a bunch of NWB Zarr files in an open bucket. Would that work?

@magland
Collaborator

magland commented Nov 9, 2023

> @magland we have a bunch of NWB Zarr files in an open bucket. Would that work?

That would help, thanks. Do you read lazily from those NWB Zarr files?

@alejoe91

OK, you can use this public bucket: https://registry.opendata.aws/allen-nd-open-data/

A sample NWB Zarr file in it is:
s3://aind-open-data/ecephys_625749_2022-08-03_15-15-06_nwb_2023-05-16_16-34-55/ecephys_625749_2022-08-03_15-15-06_nwb/ecephys_625749_2022-08-03_15-15-06_experiment1_recording1.nwb.zarr/

(There are plenty; you can run `aws s3 ls --no-sign-request s3://aind-open-data/ | grep nwb` to list them all.)

You can read that file directly with zarr (you need s3fs installed):

```python
import zarr

remote_zarr_location = "s3://aind-open-data/ecephys_625749_2022-08-03_15-15-06_nwb_2023-05-16_16-34-55/ecephys_625749_2022-08-03_15-15-06_nwb/ecephys_625749_2022-08-03_15-15-06_experiment1_recording1.nwb.zarr/"

zarr_root = zarr.open(remote_zarr_location, mode="r")
```
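
If zarr complains about missing credentials despite the bucket being public, anonymous access can be requested explicitly through s3fs; a minimal sketch, assuming the bucket allows unsigned reads (this addresses @magland's credentials question above and is not part of the original snippet):

```python
import s3fs
import zarr

# Anonymous (credential-free) access to the public bucket
fs = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map(
    root="aind-open-data/ecephys_625749_2022-08-03_15-15-06_nwb_2023-05-16_16-34-55/ecephys_625749_2022-08-03_15-15-06_nwb/ecephys_625749_2022-08-03_15-15-06_experiment1_recording1.nwb.zarr",
    s3=fs,
)
zarr_root = zarr.open(store, mode="r")
```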

Ideally and eventually, you will be able to use pynwb (just `pip install hdmf-zarr` and upgrade pynwb):

```python
from hdmf_zarr import NWBZarrIO

with NWBZarrIO(remote_zarr_location, "r") as io:
    nwbfile = io.read()
```

This currently gives me an error, because the backend assumes the file is local and "resolves" the path. But that's another issue (see hdmf-dev/hdmf-zarr#134).

I guess that the direct-zarr approach is still good for a start.

@alejoe91

@magland pushed a fix on the hdmf-zarr side that resolves it:
hdmf-dev/hdmf-zarr#138
