ZEP0002 Review #254
CC: @zarr-developers/implementation-council @zarr-developers/steering-council
Just a quick sanity check: this spec depends heavily on storage backends supporting Range requests, and suffix ranges in particular (for fetching the shard index footer). Over on apache/arrow-rs#4611 it's been suggested that common stores like S3 don't support suffix ranges. I don't have a store/account I could use to test; could anyone give it a go? If not, the fallback is obviously a HEAD followed by a non-suffix GET Range request in serial, which isn't ideal, but possibly not that big an issue when accounting for the lack of multipart range support. From apache/arrow-rs#4612 it looks like multipart ranges aren't supported by any cloud provider (I have yet to interact with a server implementation which does, tbh, and you need to jump through a lot of hoops to interpret the response, since according to the HTTP spec the server can basically send you whatever it wants); that particular library gets around it by making a bunch of different requests in parallel. Presumably that isn't a blocker in and of itself, although making hundreds or thousands of requests to pull down a region of a single shard is hardly ideal, especially as any given backend could just decide to send you the entire shard, or indeed just about any subset of it, with each request!
All major cloud providers (including S3, GCS, Azure) and static-file HTTP server applications support requesting suffixes, e.g. `curl -H 'Range: bytes=-524292' https://static.webknossos.org/data/zarr_v3/l4_sample/color/1/c/0/3/3/1 | wc -c`
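In case it's useful, here is a rough sketch of the suffix-first strategy discussed above, written against a hypothetical store interface (not any particular library's API): try a single suffix-range request, and fall back to a HEAD (size lookup) plus an absolute-range GET when the backend rejects suffix ranges.

```python
def read_suffix(store, key, nbytes):
    """Fetch the last `nbytes` of an object (e.g. a shard index footer).

    `store` is a hypothetical adapter with:
      get_suffix(key, n) -> last n bytes, raising NotImplementedError
                            if the backend rejects 'Range: bytes=-n'
      size(key)          -> total object size (a HEAD request)
      get(key, start, n) -> n bytes from absolute offset `start`
    """
    try:
        # One round trip: 'Range: bytes=-{nbytes}'
        return store.get_suffix(key, nbytes)
    except NotImplementedError:
        # Two serial round trips: HEAD for the size, then an
        # absolute-range GET ('Range: bytes={start}-{end}').
        size = store.size(key)
        return store.get(key, size - nbytes, nbytes)


class InMemoryStore:
    """Toy backend without suffix-range support, for illustration only."""

    def __init__(self, blobs):
        self.blobs = blobs

    def get_suffix(self, key, n):
        raise NotImplementedError("backend rejects suffix ranges")

    def size(self, key):
        return len(self.blobs[key])

    def get(self, key, start, n):
        return self.blobs[key][start:start + n]
```

For example, `read_suffix(InMemoryStore({"shard": b"0123456789"}), "shard", 4)` returns `b"6789"` after falling back to the two-request path.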
Multipart ranges are not as widely supported; I know that S3 doesn't support them. Issuing single range requests per inner chunk is equivalent to using Zarr without sharding, so I would argue sharding doesn't make the situation worse. On the contrary, implementations can choose to coalesce the byte ranges of multiple chunks into single requests to reduce the number of requests. This works especially well if the chunks are laid out in an order that matches the common access pattern (e.g. Z-ordering). Implementations can also download entire shards. In our standard configuration that means reducing the number of requests by a factor of ~32,000.
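A minimal sketch of the coalescing idea (a hypothetical helper, not taken from any Zarr implementation): merge sorted byte ranges whenever the gap between them is small enough that reading the unwanted bytes in between is cheaper than paying the latency of another request.

```python
def coalesce_ranges(ranges, max_gap):
    """Merge (start, stop) byte ranges (stop exclusive) whose gap is at
    most `max_gap` bytes, returning fewer, larger ranges.

    Over-reading a small gap between two chunks is often cheaper than
    issuing an extra request, so callers tune `max_gap` to their
    backend's latency/bandwidth trade-off.
    """
    merged = []
    for start, stop in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged
```

For example, `coalesce_ranges([(0, 100), (110, 200), (10000, 10050)], max_gap=64)` merges the first two ranges (10-byte gap) but leaves the distant third one alone, yielding `[(0, 200), (10000, 10050)]`.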
Yes, so long as the existence of sharding doesn't change users' heuristics on (sub)chunk layout. I suppose it comes down to whether you see sharding as a way to coalesce your small chunks into single files, or as a way to access inner regions of your large chunks!
The "worst case" I was thinking about was when you need about half the chunk: clients may want some internal strategy trading off making thousands of small requests to download just what you need against downloading considerably more than you need in one request and then slicing the result.
Yes, and perhaps the Zarr implementation would allow the user to specify the threshold, e.g. "I'm reading data from a RAID6 array of HDDs, which has very high bandwidth but also terrifyingly high latencies for random reads, so it's faster to read sequentially, even if I throw away 90% of the chunks after reading them from disk".
Hi everyone,
Sharding is specified as a new codec. That means that Zarr without sharding is not affected. Libraries that don't support it can still use the normal chunking. Of course, libraries that don't support sharding will not be able to open/read arrays that have been created with sharding. |
Thanks for the clarification @normanrz!
I vote YES! Currently migrating zarrita.js (future for zarr.js) to ZEP0001 and hope to have the capacity to support ZEP0002. |
I vote yes for tensorstore and neuroglancer! |
I vote YES for Zarr.jl , it is definitely something I would need and use, but can not promise an implementation time line. |
I've implemented the sharding codec in manzt/zarrita.js v0.3.2. It is compatible with the latest from scalableminds/zarrita. |
tensorstore now supports zarr v3 with sharding. |
I vote yes for zarr-js. We recently added experimental support for Zarr v3 including support for sharding codec |
Neuroglancer also now supports zarr v3 with sharding. |
Is there by any chance an interesting public v3 sharded image dataset, that could be used as an example? |
We have a public EM dataset at There is no multiscale metadata for v3, yet. So this is what the hierarchy looks like:
The arrays use chunk shape [1, 32, 32, 32] and shard shape [1, 1024, 1024, 1024]. The EM (color) data is 195G and the segmentation is 10G in total. EM data by Motta et al., Science 2019, segmentation by scalable minds.
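For concreteness, the ~32,000x request-reduction factor mentioned earlier in the thread follows directly from these shapes:

```python
chunk_shape = (1, 32, 32, 32)
shard_shape = (1, 1024, 1024, 1024)

# Number of inner chunks per shard = product of per-axis ratios.
chunks_per_shard = 1
for c, s in zip(chunk_shape, shard_shape):
    assert s % c == 0, "shard shape must be a multiple of the chunk shape"
    chunks_per_shard *= s // c

print(chunks_per_shard)  # 32768, i.e. the ~32,000x factor from above
```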
Thanks. I was able to access that dataset in Neuroglancer. Note though that there appears to be a misconfiguration with your server for e.g. https://static.webknossos.org/data/zarr_v3/l4dense_motta_et_al_demo/color/1/zarr.json (which seems to involve cloudfront and s3). Specifically, the issue is that cloudfront is caching the responses without regard to the

As far as OME, I know you are working on an update to the OME-zarr spec for zarr v3. Neuroglancer also supports the existing OME-zarr metadata for zarr v3, exactly as it is supported for zarr v2.
Note: If you have a strong objection to Neuroglancer supporting the existing OME-zarr metadata with zarr v3, I am open to changing that. |
Thanks! I fixed the header caching and added OME v0.4 metadata to color and segmentation. |
Thanks, I'm able to view it in neuroglancer now. Regarding the OME metadata, I think you need to add a half-voxel translation to each scale to account for the fact that OME assumes the origin is in the center of a voxel, in order to properly align the scales. |
I vote
Thanks to everybody, who already voted! Everybody else, I am looking forward to your votes. Please note that the deadline for voting is in less than a week. Thanks! cc @zarr-developers/implementation-council @zarr-developers/steering-council |
I vote in favor. |
I vote in favor to not further block this extension. I find some aspects of the current extension proposal unfortunate and hope for an improved sharding extension in the future. Here are my opinions:
|
Since it is a configuration option, it would presumably have to be the same for all shards, which means that anything other than "start" or "end" is unlikely to be useful, and "end" couldn't be indicated by a fixed byte offset. |
Could |
Our institute uses cephfs for its large-scale cluster-accessible storage, which gets grumpy if you have too many files. Strictly I think that file counts only impact performance when there are too many in one directory (which is much more likely with V2-style dot-separated chunk indices), but in practice cephfs' administration tools impose quotas on total number of files. (LMB example here - not sure how modern this setup is or how common similar types of storage are elsewhere) |
I like the idea of having an
Our original motivation: Some file systems that hold multiple petabytes are configured to have larger block sizes (e.g. 2MB). For the chunk sizes that we need for interactive visualization (e.g |
I added a proof-of-concept for the |
|
Since the codec configuration, including the |
Another consideration is that both start and end are relatively easy to support when writing, but an arbitrary byte offset would be tricky: you may have to add padding bytes.
Yes, I understand. I'm imagining a scenario where there is an existing HDF5 file that is symlinked and used as a shard in multiple arrays. That said, a few of Stephan's scenarios include the index somewhere in the middle of the file, so having the index at an arbitrary offset would help address those. Implementations may need to calculate the length of the chunk index anyway, so is checking whether the location is the negative length really that different? I think the main burden is having to validate another parameter. My proposal is that we accept anything that is a valid value for an HTTP Range request.
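For reference, here is the arithmetic behind "the length of the chunk index": a sketch assuming one (offset, nbytes) pair of little-endian uint64s per inner chunk, as in ZEP0002's index layout. The 4-byte trailing checksum term is an assumption, inferred from the suffix length used in the curl example earlier in the thread; whether a checksum applies depends on the configured index codecs.

```python
def index_nbytes(chunks_per_shard, checksum_nbytes=4):
    """Byte length of an end-located shard index: one (offset, nbytes)
    pair of uint64s (16 bytes) per inner chunk, plus an assumed 4-byte
    trailing checksum (e.g. a crc32c)."""
    return 16 * chunks_per_shard + checksum_nbytes

# With 32x32x32 inner chunks per shard, this matches the suffix length
# requested in the curl example earlier in the thread:
suffix_header = f"bytes=-{index_nbytes(32 * 32 * 32)}"
print(suffix_header)  # bytes=-524292
```

So for an end-located index, the "negative length" check amounts to computing this one number and issuing a single suffix-range request for it.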
I'm confused. Wouldn't the writer be the one setting the indexLocation? |
You might want to write to an existing array created by a different implementation. Which of Stephan's points relates to having an index in the middle? |
@mkitti I agree that doing this as a codec enables other shard codecs in the future, which is great. Sorry if I wasn't clear and caused confusion: none of my scenarios demand an index in the middle, and I agree with everybody else that this sounds complicated and does not generalize across shards. Start and end as options sounds great.

Thanks for the updates on the many-files argument. It would be great if specific cases where this is relevant could be listed. For our use of AWS S3 and the institute-managed file system, we try to reach 1MB blocks and consult with the providers and administrators to make sure this is ok. The streaming-speed argument holds and is most important for the streaming case (which can be lessened by parallel asynchronous access); random access of single chunks would not be accelerated by shards. I haven't found strict rules about the number of files/keys in the S3 or GC docs, but I probably haven't looked carefully enough. Generally, I believe it would be great to have concrete real-world examples where this is useful. @clbarnes' hostile-admin example may be a good one.
There seems to be a general preference from the others for the index to be at the end, so if we ever needed to add chunks to the end of the file, then perhaps having the index in the middle is at least a temporary scenario before a rewrite. My abstractions generally do not depend on the index being anywhere in particular. Kerchunk exports chunk information to an external file. I also have a program to move HDF5's chunk index to an arbitrary location in the file.

My understanding is that the Betzig lab, or perhaps more specifically the Advanced Bioimaging Center at UC Berkeley, also encountered some issues with file limits on Lustre-based clusters. The practical limits probably should be directory based, but from a file system perspective. The other complaint I have heard from storage admins is a relatively high amount of metadata IOPS hitting the file systems, basically
Searching around, I found another anecdote of someone running out of inodes when using Zarr: The solution there was to use ZipStore. Incidentally, zip files also have a central directory at the end of a file. I'm also reminded here of @rabernat 's benchmarks of the initial implementation in zarr-python. I would be interested in hearing how the other implementations of sharding have avoided the pitfalls that Ryan encountered. |
Hi everyone, |
Just to amplify @normanrz's example, since that was exactly what motivated my and @clbarnes's original interest in sharding. The block allocation and inode limits of our various network filesystems meant that 2-4MB files were the best common denominator, but this made chunks far larger than optimal for remote visualization. I also see this as somewhat of an amelioration for getting in-memory chunk sizes comparable across multiple datasets for computational purposes, when compressed sizes may differ greatly for efficient storage (e.g., raw microscopy vs. segmentation), or more generally, some accommodation for memory-awareness/striding. I vote YES on behalf of sci-rs/zarr.
Happy 🎃, everyone. I vote YES for the ZSC. I also assume that there will be a follow-on PR to introduce the index location. The exact process of those adjustments is still a bit up in the air. I’d propose as with #263 that we will ping the ZIC for votes or vetoes there. And in general, as with ZEP1, please keep any further clarifications and questions coming as implementations are written. But I think we’ll all be quite enthused to have another ZEP signed off on. Thanks, all!
Though this is more a discussion for elsewhere, I personally don’t see any issue with having support for that combination, but I’d highly suggest we not expose the community to that mix until Norman’s NGFF spec is decided on (i.e. let’s not write or publish them) |
Sorry for the confusion, I should have said, "as a member of the ZSC". I've updated the description but also pinged all of the remaining voters. |
I vote YES (and updated OP), thanks all! |
Hi all, I vote YES as a member of the ZSC. Congratulations on the degree of consensus achieved and on the progress towards multiple implementations. I note there are still technical dimensions to be explored but am confident this is a good step forward. |
I vote YES Appreciate all the hard work everyone has done here. Am sure implementers and users will appreciate all the thought and effort that has gone into this implementation. This is a major achievement! 👏 Would make two suggestions we can discuss separately. Have filed new issues to discuss those items independently: Again thanks for all of your hard work and congratulations! 🎉 |
Hi, all. Sorry about missing the deadline on this. I was in favor last time I looked at it, but let me review the latest version this evening before officially voting. |
I vote YES as a ZIC member, but cannot personally commit immediate effort toward a codec-based sharding implementation in I also agree with above comments that Congratulations to the ZEP0002 authors! I found the proposal to be well written, and the codec-based version was easier to follow than an earlier draft I had read.
This concludes the voting process. ZEP2 has been accepted by ZIC and ZSC 🎉. Thanks everybody for reviewing the specification, providing feedback and participating in the voting process! I am looking forward to seeing the sharding codec being implemented in the various Zarr implementations. As already mentioned by @joshmoore, there may be smaller changes (e.g.
In the spirit of looking forward to an exciting new year, I'm going to close this issue. If anyone has concerns about that, please let me know. |
Hi everyone,
I hope you’re doing well.
Thank you all for taking the time to review the ZEP0001 and V3 specification. The V3 specification is approved and accepted by the ZSC, ZIC and the Zarr community.
The initial discussion on sharding dates back to 11/2021; please see zarr-developers/zarr-python#877. There have been major developments since the proposal of sharding, some of them are:

- 08/2022 → Submission of sharding for a ZEP, i.e. ZEP0002; see ZEP 2 - Sharding storage transformer zeps#13
- 08/2022 → Prototype implementation of sharding as storage transformer in Zarr-Python; see Sharding storage transformer for v3 zarr-python#1111
- 03/2023 → Pivoting to implement sharding as a codec rather than storage transformer; see sharding as a codec rather than array storage transformer #220 and meeting notes
- 03/2023 → Updated prototype implementation of sharding as a codec in Zarrita; see here
- 07/2023 → Added index_codecs to the sharding codec

Now, we want to put forth the ZEP0002 - Sharding Codec for voting.
We have created this issue to track the approvals from the ZSC, ZIC and the broader Zarr community.
Specific technical feedback on sharding should be made via narrowly scoped issues on the zarr-specs repository that link to this issue.
Now, according to the section 'How does a ZEP become accepted' in ZEP0000, a ZEP must satisfy three conditions for approval:
As an implementation council member, you have three options for your vote:
We request that you, the ZIC, and the ZSC review ZEP0002 and let us know your thoughts. We've listed steps to read and understand sharding completely. They are as follows:
We understand that the whole process takes time, so we've decided to have a relaxed timeline for ZEP0002 voting. We'd appreciate your vote by 31 October 2023, 23:59:59 AoE.

Example implementations
Please let us know if there are any questions. Thank you for your time.
Voting status: