Partial read #667
Conversation
Hello @andrewfulton9! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2021-01-20 18:14:08 UTC
Codecov Report
@@ Coverage Diff @@
##           master     #667   +/-  ##
======================================
  Coverage   99.94%   99.94%
======================================
  Files          28       28
  Lines       10089    10262   +173
======================================
+ Hits        10083    10256   +173
  Misses          6        6
I added some release notes (which are conflicting), removed some dead code (see 7452482), and tried to find some test code to increase coverage.
Fails with
I'll try to dive into this. I'm wondering if it's not
I'm going to rebase because of conflicts; I may end up squashing commits for simplicity of rebasing. Current commit is/was 3ae237f
Force-pushed from 3ae237f to 7c472e4
zarr/indexing.py
Outdated
)
# any selection can not be out of the range of the chunk
self.selection_shape = np.empty(self.arr_shape)[self.selection].shape
The size of self.selection_shape is determined by self.selection, which later on is mutated with self.selection.pop(); that might be a source of bugs, though it might be OK as self.selection_shape seems to be unused....
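The hazard flagged in this comment can be shown with a minimal sketch (illustrative values, not the PR's actual code): the shape is computed eagerly, so a later pop() leaves the two out of sync.

```python
import numpy as np

# selection_shape is computed once from the current selection...
arr_shape = (4, 5)
selection = [slice(0, 2), slice(0, 3)]

# np.empty(...) materializes an array just to read the selected shape
selection_shape = np.empty(arr_shape)[tuple(selection)].shape  # (2, 3)

# ...so a later mutation silently desynchronizes the two
selection.pop()  # selection is now [slice(0, 2)]
# selection_shape still reports (2, 3), no longer matching selection
```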
This raises another error
I've also found some instance variables that can be simplified into local variables; I'll push that.
This second failure seems to be due to me having an older numcodecs that does not have decode_partial for blosc, but still assuming it only partially decoded.
So both failures are due to me having an old version of numcodecs. I've added a check that completely skips the partial-read logic if the compressor does not have a decode_partial method.
I've also restored the
zarr/indexing.py
Outdated
last_dim_slice = None if self.selection[-1].step > 1 else self.selection.pop()
for i, sl in enumerate(self.selection):
    dim_chunk_loc_slices = []
    for i, x in enumerate(slice_to_range(sl, arr_shape[i])):
You use i in both loops, and the argument of the second depends on i; that seems weird.
The inner i seems unused as well.
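A minimal illustration (not the PR code) of why the reused loop variable is fragile: the inner `for i, x ...` clobbers the outer dimension index `i`, so anything after the inner loop that reads `i` sees the wrong value.

```python
# Two "dimensions" with selections of different lengths
last_inner = []
for i, sl in enumerate([slice(0, 2), slice(0, 3)]):
    for i, x in enumerate(range(sl.start, sl.stop)):  # shadows outer i
        pass
    # records the last inner index, not the dimension index
    last_inner.append(i)

# last_inner == [1, 2]: the outer i was overwritten by the inner loop
```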
I have a couple more cleanups in https://github.com/Carreau/zarr-python/tree/partial_read (d8464d4), but won't push to this branch now, as I want to see what the coverage status is.
We ran out of time with Travis CI. I'm working on migrating to GH Actions; tests won't complete without this. Apologies for the delay.
cc @rabernat @jhamman @rsignell-usgs (as I think we discussed this last year at SciPy 2019, so it may be of interest)
Rebased on master, now that master has all the CI migrated to GH Actions.
Force-pushed from b063054 to 3a6a0d8
I've turned some of your conditional logic into asserts in d098528, so that there is no uncovered branch; let me know if that's wrong. I've also turned some instance variables into local ones when not used outside the function, and sprinkled some docs and a release note (made partial read opt-in, and added the option to open_array). I've left some ? in the docstrings. Before I pushed the last commits, all tests seemed to be passing (Windows was still pending), though coverage was missing in a few spots. You know the code better than I do, so it will likely be easier for you to cover the missing lines.
@Carreau, thanks for the review and the contributions you added. I'll look over the coverage and try to get that back up to 100%, and fix the docstrings you added the ? to.
Do you think you can rebase? There seem to be merge commits that make the changes hard to follow.
Force-pushed from acc22c5 to 953ee5a
Note: this was discussed (with notes) during the community call yesterday. General call for feedback so this can get merged, to move forward the BaseStore work followed by the V3 updates.
zarr/indexing.py
Outdated
def int_to_slice(dim_selection):
    return slice(dim_selection, dim_selection + 1, 1)
Maybe we can just inline this, since it is simple enough and used in only a couple of places.
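The helper quoted in the diff above, plus the suggested inlining: an integer index is equivalent to a one-element slice with step 1, so the expression can be written directly at the call site.

```python
def int_to_slice(dim_selection):
    """Convert an integer index into the equivalent one-element slice."""
    return slice(dim_selection, dim_selection + 1, 1)


# Inlined form at a hypothetical call site: same slice, no helper needed
dim_selection = 3
inlined = slice(dim_selection, dim_selection + 1, 1)
```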
Thanks Andrew! 😄 Had a couple of suggestions above.
Co-authored-by: jakirkham <[email protected]>
Just pushed up those changes. Thanks @jakirkham!
Thanks, with @jakirkham's feedback and the last two commits I think we can move forward! I'm approving as well and merging.
Thanks all! 😀
Great to see this! I think it's important enough that we may want to put a blog post somewhere. What do the final benchmarks look like? Am I right in thinking that Blosc is still the default compressor and will continue to be?
We do have a blog on the webpage repo, so writing something up there makes sense. More generally, we may want to do this to summarize the work done for the CZI grant. (cc @davidbrochart as well)
Thanks for pinging me @jakirkham, I would be happy to summarize the work done on the C++ side in this blog. We could also have a meeting and present to each other what we have done; what do you think?
Pangeo is organizing a new webinar series of short tech talks. It would be great to have a presentation on the latest developments from Zarr. Presentations will be recorded, publicized, and assigned a DOI. If anyone wants to sign up to present, just fill out this short form: https://forms.gle/zuU8XcQHHKS6DBvv8
Yeah, @davidbrochart, I think that makes sense. Do you want to go ahead and set up a poll (for the meeting time)? I think we should try to schedule this in mid-February if that works.
Actually, I submitted a proposal for a presentation on xtensor-zarr in the Pangeo webinar series. If that is accepted, that could be a better option, since people could see the recording. What do you think?
I think both are useful. The blog is also likely not too different from other write-ups we are doing, so hopefully it doesn't take too much effort to produce. Plus it gives us an opportunity to share this work with a broader audience and publicly recognize CZI for their support, which I'm sure they will appreciate 🙂
I can do a blog on the partial read work as well.
Very late question here: has partial read been implemented for uncompressed data? It should be even easier.
No.
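A sketch of why partial reads would be "even easier" for uncompressed chunks: without compression, the byte range of any element is a direct offset computation, so only the needed range has to be fetched. Names below are illustrative, not zarr APIs; an in-memory buffer stands in for a remote object store.

```python
import io

import numpy as np

# An uncompressed chunk of 20 little-endian int32 values (80 bytes)
chunk = np.arange(20, dtype="<i4")
store = io.BytesIO(chunk.tobytes())  # stands in for a remote object store


def read_items(buf, start, nitems, itemsize=4, dtype="<i4"):
    """Read nitems elements starting at index `start` via a byte range."""
    buf.seek(start * itemsize)            # byte offset = index * itemsize
    data = buf.read(nitems * itemsize)    # fetch only the needed range
    return np.frombuffer(data, dtype=dtype)


part = read_items(store, 5, 3)  # elements 5..7 without reading the whole chunk
```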
Would it be possible to rename this "Partial read of blosc-compressed chunks" or similar?
We could do that, though this would likely still come up in the commit history (and potentially other search paths). That said, as there is a whole new sharding codec planned (please see ZEP 2), this is probably best handled via that pathway, which is planned as part of the v3 work ( #1583 ). Separately, there has been discussion of moving from Blosc 1 to 2 ( zarr-developers/numcodecs#413 ), which would likely entail other changes as well. So it is probably best to direct attention to more recent issues/PRs at this stage.
This PR adds the capability to partially read and decompress chunks in arrays that are initialized with Blosc compression and an fsspec-based backend. It's related to zarr-specs #59.
I did some initial benchmarking showing processing time for indexing with various array and chunk shapes, filled with random numbers, stored on S3:
array_shape=(1000, 1000), chunk_shape=(100, 500)
array_shape=(10000, 10000), chunk_shape=(1000, 1000)
array_shape=(10000, 10000), chunk_shape=(2000, 5000)
array_shape=(10000, 10000), chunk_shape=(5000, 5000)
TODO: