Zarr support #70
Sounds good.
A couple of comments about the pros/cons of the Zarr format. As you say, I think there are many advantages, including simpler and more efficient data access from the browser. A couple of disadvantages:
- Each nwb.zarr asset can contain hundreds or even thousands of files, depending on the chunking settings. This could present a challenge for maintaining the database (cleanup of an asset becomes non-trivial).
- nwb.zarr assets will no longer have an easily computable content hash or an ETag, so download and integrity checking of the asset become non-trivial (see the sketch below).
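To illustrate the hashing point: a single-file asset gets one ETag for free, whereas a Zarr store needs some tree-style digest computed over all of its member files. A minimal sketch of such a digest (illustrative only, not any archive's actual checksum scheme):

```python
import hashlib
from pathlib import Path

def zarr_store_checksum(store_dir: str) -> str:
    """Hash every file in a local Zarr store, then hash the sorted
    (path, digest) pairs. This just shows why a single ETag no
    longer suffices for a many-file asset."""
    root = Path(store_dir)
    entries = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        entries.append(f"{path.relative_to(root)}:{digest}")
    return hashlib.md5("\n".join(entries).encode()).hexdigest()

# print(zarr_store_checksum("example.nwb.zarr"))  # hypothetical local store
```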
Update: The DANDI team expressed interest in seeing this feature in our meeting today. We have a few dummy Zarr assets here (only the draft mode of the dandiset currently has the properly named NWB Zarr files) which do not have real neural data but should be valid Zarr and valid NWB.
I'll bump this up to the top of the priority list.
In order to access the Zarr structure in the S3 bucket, you need the ability to list the contents of the directories. I verified that the top-level .zgroup file is accessible (for one of the examples) here. However, there is no way for me to know which subgroups exist for this dandiset, because there is no way to query the contents of the top-level directory using simple HTTP requests. While it is possible to expose/enable directory listing for S3 buckets, this is not considered good practice for security reasons. I believe it is possible to do this using the AWS S3 SDK, but that would require authentication. Any idea how we can overcome this limitation? The only thing that comes to mind is to have a manifest file that accompanies the dandiset.
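To make the asymmetry concrete: fetching a known key over plain HTTP works fine; it is only key *discovery* that is blocked. A minimal sketch, where the asset URL layout is hypothetical:

```python
import requests

def read_zgroup(base_url: str) -> dict:
    """Fetch the top-level .zgroup of a Zarr store over plain HTTP.
    Works for any *known* key on a publicly readable bucket."""
    return requests.get(f"{base_url.rstrip('/')}/.zgroup").json()

# Hypothetical asset URL; the real key layout depends on DANDI's bucket.
# read_zgroup("https://dandiarchive.s3.amazonaws.com/zarr/<asset-id>")
# There is no plain-HTTP way to discover sibling keys (subgroups, chunk
# files) without a bucket listing or a pre-computed manifest.
```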
I believe the recommendation is to use Kerchunk to pre-compute the 'file' structure as a JSON reference file. Or set up a Flask server somewhere with a few simple REST endpoints that leverage existing Python tools to determine such mappings on the fly, assuming the TS implementation of Zarr isn't working for you?
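For reference, this is roughly how a pre-computed Kerchunk-style JSON manifest would be consumed from Python; the manifest URL below is hypothetical and assumes such a file has already been generated for the asset:

```python
import fsspec
import zarr

# Open the store through fsspec's "reference" filesystem, which maps each
# Zarr key via the manifest instead of listing the bucket.
fs = fsspec.filesystem(
    "reference",
    fo="https://example.org/assets/sub-001.nwb.zarr.json",  # hypothetical
    remote_protocol="https",
)
root = zarr.open(fs.get_mapper(""), mode="r")
print(root.tree())  # browse groups/arrays without any bucket listing
```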
I wanted to follow up on this, especially as you are moving more toward supporting Zarr. The above limitation means that at present there is no way to load Zarr datasets in Neurosift, Dendro, Python scripts, or really by any method that requires reading the data remotely. So I wanted to ask whether you have plans to generate Kerchunk .json files along with every Zarr asset. If so, do you have any examples yet that I could begin to develop against?
@magland There are definitely ways to read the Zarr datasets in Python scripts (did you mean streaming from DANDI specifically?). As a first learning experience, did you try downloading the NWB Zarr files and following the HDMF-Zarr instructions on reading them? Also ask Alessio, since I'm quite sure Allen uses Zarr in this capacity on a regular basis.
@CodyCBakerPhD
I haven't tried that. Do you have an example NWB Zarr file on DANDI that I can download? I can start with that, but my ultimate question is whether this can be done without downloading.
@alejoe91 When you read Zarr from a remote bucket, do you have AWS credentials and use something like boto3? I am looking for an HTTP-only way to do it (without AWS credentials).
@magland we have a bunch of NWB Zarr files in an open bucket. Would that work?
That would help, thanks. Do you read lazily from those NWB-Zarrs?
Ok, you can use this public bucket: https://registry.opendata.aws/allen-nd-open-data/ A sample NWB-Zarr file is available in it (there are plenty; you can list the bucket to find them). You can read that file directly with zarr.
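A minimal sketch of the direct read, assuming anonymous access via s3fs and a hypothetical key in the bucket:

```python
import s3fs
import zarr

# Anonymous (credential-free) access to the public bucket.
fs = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map(root="allen-nd-open-data/path/to/file.nwb.zarr", s3=fs)  # hypothetical key
root = zarr.open(store, mode="r")
print(root.tree())
```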
Ideally and eventually, you will be able to use hdmf-zarr's NWBZarrIO to read it as an NWB file.
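A hedged sketch of that, assuming NWBZarrIO can take a remote store path directly (the URL is hypothetical):

```python
from hdmf_zarr.nwb import NWBZarrIO

# Hypothetical remote store path.
path = "https://allen-nd-open-data.s3.amazonaws.com/path/to/file.nwb.zarr"
with NWBZarrIO(path, mode="r") as io:
    nwbfile = io.read()  # lazy read; datasets load on access
    print(nwbfile)
```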
This currently gives me an error, because the backend assumes the file is local and "resolves" the path. But that's another issue (see hdmf-dev/hdmf-zarr#134). I guess the direct-Zarr approach is still good for a start.
@magland a fix for that was pushed (see hdmf-dev/hdmf-zarr#134).
This is more of a longer-term thing, as we are still building up the ecosystem for NWB Zarr on DANDI. I found this library for reading Zarr files in TypeScript: https://github.com/gzuidhof/zarr.js/ @CodyCBakerPhD is working on putting NWB Zarr test files on the DANDI staging server here, though I don't know if these test files are stable enough to build off of yet. Once those are up and registered with DANDI correctly, it might be interesting to develop a generic API that can read both HDF5 and Zarr backends into Neurosift. There is reason to believe Zarr could be much faster for reading, as it is explicitly designed to be optimized for the cloud.