Zarr support #70
Sounds good.
A couple of comments about the pros/cons of the Zarr format. As you say, I think there are many advantages, including simpler and more efficient data access from the browser. A couple of disadvantages:
- Each nwb.zarr asset can contain hundreds or even thousands of files, depending on the chunking settings. This could present a challenge for maintaining the database (cleanup of an asset becomes non-trivial).
- nwb.zarr assets will no longer have an easily computable content hash or an ETag, so download and integrity checking of the asset become non-trivial (see the sketch below).
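To illustrate the hashing point: a single-file asset gets one ETag for free, whereas a Zarr store needs some tree-style digest computed over all of its member files. A minimal sketch of such a digest (illustrative only, not any archive's actual checksum scheme):

```python
import hashlib
from pathlib import Path

def zarr_store_checksum(store_dir: str) -> str:
    """Hash every file in a local Zarr store, then hash the sorted
    (path, digest) pairs. This just shows why a single ETag no
    longer suffices for a many-file asset."""
    root = Path(store_dir)
    entries = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        entries.append(f"{path.relative_to(root)}:{digest}")
    return hashlib.md5("\n".join(entries).encode()).hexdigest()

# print(zarr_store_checksum("example.nwb.zarr"))  # hypothetical local store
```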
Update: The DANDI team expressed interest in seeing this feature in our meeting today. We have a few dummy Zarr assets here (only the draft mode of the dandiset currently has the properly named NWB Zarr files) which do not have real neural data but should be valid Zarr and valid NWB.
I'll bump this up to the top of the priority list.
In order to access the Zarr structure in the S3 bucket, you need the ability to list the contents of the directories. I verified that the top-level .zgroup file is accessible (for one of the examples) here. However, there is no way for me to know which subgroups exist for this dandiset, because there is no way to query the contents of the top-level directory using simple HTTP requests. While it is possible to expose/enable directory listing for S3 buckets, this is not considered good practice for security reasons. I believe it is possible to do this using the AWS S3 SDK, but that would require authentication. Any idea how we can overcome this limitation? The only thing that comes to mind is to have a manifest file that accompanies the dandiset.
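To make the asymmetry concrete: fetching a known key over plain HTTP works fine; it is only key *discovery* that is blocked. A minimal sketch, where the asset URL layout is hypothetical:

```python
import requests

def read_zgroup(base_url: str) -> dict:
    """Fetch the top-level .zgroup of a Zarr store over plain HTTP.
    Works for any *known* key on a publicly readable bucket."""
    return requests.get(f"{base_url.rstrip('/')}/.zgroup").json()

# Hypothetical asset URL; the real key layout depends on DANDI's bucket.
# read_zgroup("https://dandiarchive.s3.amazonaws.com/zarr/<asset-id>")
# There is no plain-HTTP way to discover sibling keys (subgroups, chunk
# files) without a bucket listing or a pre-computed manifest.
```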
I believe the recommendation is to use Kerchunk to pre-compute the 'file' structure as a JSON reference file. Or set up a Flask server somewhere with a few simple REST endpoints that leverage existing Python tools to determine such mappings on the fly, assuming the TS implementation of Zarr isn't working for you?
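For reference, this is roughly how a pre-computed Kerchunk-style JSON manifest would be consumed from Python; the manifest URL below is hypothetical and assumes such a file has already been generated for the asset:

```python
import fsspec
import zarr

# Open the store through fsspec's "reference" filesystem, which maps each
# Zarr key via the manifest instead of listing the bucket.
fs = fsspec.filesystem(
    "reference",
    fo="https://example.org/assets/sub-001.nwb.zarr.json",  # hypothetical
    remote_protocol="https",
)
root = zarr.open(fs.get_mapper(""), mode="r")
print(root.tree())  # browse groups/arrays without any bucket listing
```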
I wanted to follow up on this, especially as you are moving more toward supporting Zarr. The above limitation means that at present there is no way to load Zarr datasets in Neurosift, Dendro, Python scripts, or really by any method that requires reading the data remotely. So I wanted to ask whether you have plans to generate Kerchunk .json files along with every Zarr asset. If so, do you have any examples yet that I could begin to develop against?
@magland There are definitely ways to read the Zarr datasets in Python scripts (did you mean streaming from DANDI specifically?). As a first learning experience, did you try downloading the NWB Zarr files and following the HDMF-Zarr instructions on reading them? Also ask Alessio, since I'm quite sure Allen uses Zarr in this capacity on a regular basis.
@CodyCBakerPhD
I haven't tried that. Do you have an example NWB Zarr file on DANDI that I can download? I can start with that, but my ultimate question is whether this can be done without downloading.
@alejoe91 When you read Zarr from a remote bucket, do you have AWS credentials and use something like boto3? I am looking for an HTTP-only way to do it (without AWS credentials).
@magland we have a bunch of NWB Zarr files in an open bucket. Would that work?
That would help, thanks. Do you read lazily from those NWB-Zarrs?
Ok, you can use this public bucket: https://registry.opendata.aws/allen-nd-open-data/ A sample NWB-Zarr file is available in it (there are plenty; you can list the bucket to find them). You can read that file directly with zarr.
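A minimal sketch of the direct read, assuming anonymous access via s3fs and a hypothetical key in the bucket:

```python
import s3fs
import zarr

# Anonymous (credential-free) access to the public bucket.
fs = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map(root="allen-nd-open-data/path/to/file.nwb.zarr", s3=fs)  # hypothetical key
root = zarr.open(store, mode="r")
print(root.tree())
```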
Ideally and eventually, you will be able to use hdmf-zarr's NWBZarrIO to read it as an NWB file.
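A hedged sketch of that, assuming NWBZarrIO can take a remote store path directly (the URL is hypothetical):

```python
from hdmf_zarr.nwb import NWBZarrIO

# Hypothetical remote store path.
path = "https://allen-nd-open-data.s3.amazonaws.com/path/to/file.nwb.zarr"
with NWBZarrIO(path, mode="r") as io:
    nwbfile = io.read()  # lazy read; datasets load on access
    print(nwbfile)
```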
This currently gives me an error, because the backend assumes the file is local and "resolves" the path. But that's another issue (see hdmf-dev/hdmf-zarr#134). I guess the direct-Zarr approach is still good for a start.
@magland a fix for that was pushed (see hdmf-dev/hdmf-zarr#134).
This is more of a longer-term thing, as we are still building up the ecosystem for NWB Zarr on DANDI. I found this library for reading Zarr files in TypeScript: https://github.com/gzuidhof/zarr.js/ @CodyCBakerPhD is working on putting NWB Zarr test files on the DANDI staging server here, though I don't know if these test files are stable enough to build off of yet. Once those are up and registered with DANDI correctly, it might be interesting to develop a generic API that can read both HDF5 and Zarr backends into Neurosift. There is reason to believe Zarr could be much faster for reading, as it is explicitly designed to be optimized for the cloud.