
Partial read #667

Merged
3 commits merged on Jan 25, 2021

Conversation

andrewfulton9
Contributor

andrewfulton9 commented Dec 1, 2020

This PR adds the capability to partially read and decompress chunks in arrays that are initialized with Blosc compression and an fsspec-based backend. It's related to zarr-specs #59.
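To illustrate why Blosc-compressed chunks lend themselves to partial reads: each compressed chunk starts with a 16-byte header recording the uncompressed size, the internal block size, and the compressed size, so a reader can work out which internal blocks cover a selection and fetch only those byte ranges. Below is a minimal sketch of parsing that header; `parse_blosc_header` is an illustrative helper (not part of this PR), and the header bytes are hand-built, not real compressed data.

```python
import struct

def parse_blosc_header(buf: bytes) -> dict:
    """Parse the 16-byte Blosc1 chunk header (illustrative helper)."""
    version, versionlz, flags, typesize = struct.unpack_from("<4B", buf, 0)
    nbytes, blocksize, cbytes = struct.unpack_from("<3I", buf, 4)
    return {"version": version, "typesize": typesize,
            "nbytes": nbytes, "blocksize": blocksize, "cbytes": cbytes}

# Hand-built header: 1 MB uncompressed, 64 KiB internal blocks, 250 KB compressed.
header = struct.pack("<4B3I", 2, 1, 0, 4, 1_000_000, 65_536, 250_000)
info = parse_blosc_header(header)
assert info["nbytes"] == 1_000_000 and info["blocksize"] == 65_536

# Number of internal blocks a partial reader would need to index (ceil division):
nblocks = -(-info["nbytes"] // info["blocksize"])
assert nblocks == 16
```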

I did some initial benchmarking showing processing time for indexing arrays with various array and chunk shapes, filled with random numbers and stored on S3:

array_shape=(1000, 1000), chunk_shape=(100, 500)

[benchmark plot]

array_shape=(10000, 10000), chunk_shape=(1000, 1000)

[benchmark plot]

array_shape=(10000, 10000), chunk_shape=(2000, 5000)

[benchmark plot]

array_shape=(10000, 10000), chunk_shape=(5000, 5000)

[benchmark plot]

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • AppVeyor and Travis CI passes
  • Test coverage is 100% (Coveralls passes)

@pep8speaks

pep8speaks commented Dec 1, 2020

Hello @andrewfulton9! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 476:30: E741 ambiguous variable name 'l'
Line 936:39: E203 whitespace before ':'
Line 941:65: E203 whitespace before ':'

Line 2430:9: E265 block comment should start with '# '

Line 578:21: E203 whitespace before ':'

Comment last updated at 2021-01-20 18:14:08 UTC

@codecov

codecov bot commented Dec 2, 2020

Codecov Report

Merging #667 (45e8c25) into master (a8ee708) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff            @@
##           master     #667    +/-   ##
========================================
  Coverage   99.94%   99.94%            
========================================
  Files          28       28            
  Lines       10089    10262   +173     
========================================
+ Hits        10083    10256   +173     
  Misses          6        6            
Impacted Files Coverage Δ
zarr/core.py 100.00% <100.00%> (ø)
zarr/creation.py 100.00% <100.00%> (ø)
zarr/errors.py 100.00% <100.00%> (ø)
zarr/indexing.py 100.00% <100.00%> (ø)
zarr/tests/test_core.py 100.00% <100.00%> (ø)
zarr/tests/test_indexing.py 100.00% <100.00%> (ø)
zarr/tests/test_storage.py 100.00% <100.00%> (ø)
zarr/util.py 100.00% <100.00%> (ø)

@Carreau
Contributor

Carreau commented Dec 11, 2020

I added some release notes (which are conflicting), removed some dead code (see 7452482), and tried to find some test code to increase coverage.

z = self.create_array(shape=(500, 500, 500), chunks=(50, 50, 50), dtype='<i4')
z[:, :, :] = 0
z[0, :, 0].any()  # fails

Fails with

zarr/core.py:578: in __getitem__
    return self.get_basic_selection(selection, fields=fields)
zarr/core.py:703: in get_basic_selection
    return self._get_basic_selection_nd(selection=selection, out=out,
zarr/core.py:746: in _get_basic_selection_nd
    return self._get_selection(indexer=indexer, out=out, fields=fields)
zarr/core.py:1041: in _get_selection
    self._chunk_getitems(lchunk_coords, lchunk_selection, out, lout_selection,
zarr/core.py:1731: in _chunk_getitems
    self._process_chunk(out, cdatas[ckey], chunk_select, drop_axes,
zarr/core.py:1638: in _process_chunk
    chunk_partial = self._decode_chunk(
zarr/core.py:1851: in _decode_chunk
    chunk = self._compressor.decode(cdata)
numcodecs/blosc.pyx:495: in numcodecs.blosc.Blosc.decode
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   RuntimeError: error during blosc decompression: -1

numcodecs/blosc.pyx:399: RuntimeError

I'll try to dive into this.

I'm wondering if the problem is `start_points_buffer = self.fs.read_block(self.key_path, 16, int(self.nblocks*4))`, where `int` rounds down instead of up. I'm running to a meeting; I'll look at that later.
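To make the suspected rounding issue concrete: if `nblocks` is computed as a float ratio of sizes, truncating with `int` drops the trailing partial block, while rounding up keeps it. A small sketch with hypothetical sizes (the variable names and values here are illustrative, not the PR's code):

```python
import math

# Hypothetical sizes for illustration only.
chunk_nbytes = 130_000      # compressed-chunk payload
block_nbytes = 2 ** 16      # blosc internal block size

truncated = int(chunk_nbytes / block_nbytes)      # 1: drops the trailing partial block
nblocks = math.ceil(chunk_nbytes / block_nbytes)  # 2: includes it

assert truncated == 1
assert nblocks == 2

# The start-points table is nblocks uint32 offsets (4 bytes each),
# read from just after blosc's 16-byte header:
assert nblocks * 4 == 8
```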

@Carreau
Contributor

Carreau commented Dec 15, 2020

I'm going to rebase because of conflicts; I may end up squashing commits to simplify the rebase. The current commit is/was 3ae237f.

zarr/indexing.py Outdated
)

# any selection can not be out of the range of the chunk
self.selection_shape = np.empty(self.arr_shape)[self.selection].shape

self.selection_shape is determined by self.selection, which later on is mutated with self.selection.pop(); that might be a source of bugs, though it may be OK since self.selection_shape seems to be unused....

@Carreau
Contributor

Carreau commented Dec 15, 2020

This raises another error

import atexit
import shutil
from tempfile import mkdtemp

from numcodecs import Blosc

from zarr.core import Array
from zarr.storage import FSStore, init_array

path = mkdtemp()
atexit.register(shutil.rmtree, path)
store = FSStore(path)
init_array(store, compressor=Blosc(), shape=(50, 50, 50), chunks=(5, 5, 5), dtype="<i4")
z = Array(
    store,
    cache_metadata=True,
    cache_attrs=True,
    partial_decompress=True,
)

z[:, :, :] = 0
assert z[:, 0, :].any()
Traceback (most recent call last):
  File "foo.py", line 31, in <module>
    assert z[:, 0, :].any()
  File "/Users/bussonniermatthias/dev/zarr-python/zarr/core.py", line 604, in __getitem__
    return self.get_basic_selection(selection, fields=fields)
  File "/Users/bussonniermatthias/dev/zarr-python/zarr/core.py", line 729, in get_basic_selection
    return self._get_basic_selection_nd(selection=selection, out=out,
  File "/Users/bussonniermatthias/dev/zarr-python/zarr/core.py", line 772, in _get_basic_selection_nd
    return self._get_selection(indexer=indexer, out=out, fields=fields)
  File "/Users/bussonniermatthias/dev/zarr-python/zarr/core.py", line 1067, in _get_selection
    self._chunk_getitems(lchunk_coords, lchunk_selection, out, lout_selection,
  File "/Users/bussonniermatthias/dev/zarr-python/zarr/core.py", line 1781, in _chunk_getitems
    self._process_chunk(
  File "/Users/bussonniermatthias/dev/zarr-python/zarr/core.py", line 1677, in _process_chunk
    chunk_partial = self._decode_chunk(
  File "/Users/bussonniermatthias/dev/zarr-python/zarr/core.py", line 1933, in _decode_chunk
    chunk = chunk.reshape(expected_shape or self._chunks, order=self._order)
ValueError: cannot reshape array of size 125 into shape (1,1,5)

I've also found some instance variables that can be simplified into local variables; I'll push that.

@Carreau
Contributor

Carreau commented Dec 15, 2020

This second failure seems to be due to me having an older numcodecs that does not have partial decode for Blosc, but the code still assumed the chunk was only partially decoded.

@Carreau
Contributor

Carreau commented Dec 15, 2020

So both failures are due to me having an old version of numcodecs. I've added a check that completely skips the partial-read logic if the compressor does not have a decode_partial method.

@Carreau
Contributor

Carreau commented Dec 15, 2020

I've also restored the .coveragerc file that was nuked in one of your merges with master, so that may help with Coveralls. I'm going to wait and see where CI brings us.

zarr/indexing.py Outdated
last_dim_slice = None if self.selection[-1].step > 1 else self.selection.pop()
for i, sl in enumerate(self.selection):
    dim_chunk_loc_slices = []
    for i, x in enumerate(slice_to_range(sl, arr_shape[i])):

You use i in both loops, and the argument of the second depends on i; that seems weird.


The inner i seems unused as well.
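A small sketch of the hazard these review comments flag: rebinding `i` in the inner loop means any later use of `arr_shape[i]` inside the inner body would index by the wrong dimension. Distinct names, plus dropping the unused inner `enumerate`, avoid it. The `selection`, `arr_shape`, and `slice_to_range` names mirror the reviewed snippet, but this implementation is a hypothetical stand-in:

```python
def slice_to_range(sl, length):
    # Stand-in for the helper in zarr/indexing.py: the concrete indices a slice covers.
    return range(*sl.indices(length))

arr_shape = (6, 4)
selection = [slice(0, 6, 2), slice(1, 4, 1)]

# Fixed shape of the loops: a distinct outer index, and no enumerate on the
# inner loop since its index was unused.
visited = []
for dim, sl in enumerate(selection):
    for x in slice_to_range(sl, arr_shape[dim]):
        visited.append((dim, x))

assert visited == [(0, 0), (0, 2), (0, 4), (1, 1), (1, 2), (1, 3)]
```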

@Carreau
Contributor

Carreau commented Dec 15, 2020

I have a couple more cleanups in https://github.com/Carreau/zarr-python/tree/partial_read (d8464d4), but I won't push to this branch now, as I want to see what the coverage status is.

@Carreau
Contributor

Carreau commented Dec 16, 2020

We ran out of time with Travis CI. I'm working on migrating to GitHub Actions; tests won't complete without this. Apologies for the delay.

@jakirkham
Member

jakirkham commented Dec 16, 2020

cc @rabernat @jhamman @rsignell-usgs (as I think we discussed this last year at SciPy 2019 so may be of interest)

@Carreau
Contributor

Carreau commented Jan 4, 2021

Rebased on master, now that master has all the CI migrated to GitHub Actions.
Added some extra cleanup; we'll see what the total coverage is and iterate on this.

Carreau force-pushed the partial_read branch 2 times, most recently from b063054 to 3a6a0d8 on January 4, 2021, 21:16
docs/release.rst (outdated; resolved)
@Carreau
Contributor

Carreau commented Jan 4, 2021

@andrewfulton9

I've turned some of your conditional logic into asserts in d098528, so that there is no uncovered branch. Let me know if that's wrong. I've also used np.prod(..., dtype=int) to avoid polluting intermediate results with floating-point numbers: when the inner iterable was empty, it was returning 1.0 instead of 1.
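To make the np.prod pitfall concrete (a minimal illustration, not the PR's code): with an empty iterable, NumPy returns the multiplicative identity as a float64 unless a dtype is given, and that float then leaks into any offset or block-count arithmetic built on top of it.

```python
import numpy as np

# With an empty iterable, np.prod returns the identity 1.0 as a float64 ...
empty_prod = np.prod(())
assert isinstance(empty_prod, np.floating)
assert empty_prod == 1.0

# ... while dtype=int keeps the result integral, so downstream byte/offset
# arithmetic stays exact and usable for indexing.
int_prod = np.prod((), dtype=int)
assert np.issubdtype(type(int_prod), np.integer)
assert int_prod * 4 == 4
```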

I've turned some instance variables into local ones when they were not used outside the function, and sprinkled in some docs and a release note (made partial read opt-in, and added the option to open_array).

I've left some ? in the docstring of the partial reader, as I'm not always sure whether the ints are numbers of bytes or numbers of elements. Not sure if it matters much.

Before I pushed the last commits, all tests seemed to be passing (Windows was still pending), though coverage was missing in a few spots. You know the code better than I do, so it will likely be easier for you to cover the missing lines.

zarr/indexing.py (outdated; resolved)
zarr/util.py (resolved)
zarr/util.py (resolved)
@andrewfulton9
Contributor Author

@Carreau, thanks for the review and the contributions you added. I'll look over the coverage, try to get it back up to 100%, and fix the docstrings you added the ? to.

zarr/tests/test_core.py (outdated; resolved)
@Carreau
Contributor

Carreau commented Jan 6, 2021

Do you think you can rebase? There seem to be merge commits that make the changes hard to follow.
Other than that, +1 to get this in.

@joshmoore
Member

Note: discussion (plus notes) during the community call yesterday. General call for feedback so this can get merged, to move forward the BaseStore work followed by the v3 updates.

zarr/core.py (outdated; resolved)
zarr/indexing.py Outdated
Comment on lines 832 to 833
def int_to_slice(dim_selection):
return slice(dim_selection, dim_selection + 1, 1)

Maybe we can just inline this, since it is simple enough and used in only a couple of places.

@jakirkham
Member

Thanks Andrew! 😄 Had a couple suggestions above

@andrewfulton9
Contributor Author

Just pushed up those changes. Thanks @jakirkham !

@Carreau
Contributor

Carreau commented Jan 25, 2021

Thanks! With @jakirkham's feedback and the last two commits, I think we can move forward. I'm approving as well and merging.

Carreau merged commit 33ec64a into zarr-developers:master on Jan 25, 2021
@jakirkham
Member

Thanks all! 😀

@martindurant
Member

Great to see this! I think it's important enough that we may want to put a blog post somewhere. What do the final benchmarks look like? Am I right in thinking that Blosc is still the default compressor and will continue to be?

@jakirkham
Member

We do have a blog on the webpage repo. So writing something up there makes sense.

More generally we may want to do this to summarize the work done for the CZI grant. (cc @davidbrochart as well)

@davidbrochart

Thanks for pinging me @jakirkham, I would be happy to summarize the work done on the C++ side in this blog. We could also have a meeting and present to each other what we have done, what do you think?

@rabernat
Contributor

Pangeo is organizing a new webinar series of short tech talks. It would be great to have a presentation on the latest developments from Zarr. Presentations will be recorded, publicized and assigned a DOI.

If anyone wants to sign up to present, just fill out this short form: https://forms.gle/zuU8XcQHHKS6DBvv8

@jakirkham
Member

jakirkham commented Jan 29, 2021

Thanks for pinging me @jakirkham, I would be happy to summarize the work done on the C++ side in this blog. We could also have a meeting and present to each other what we have done, what do you think?

Yeah, @davidbrochart, I think that makes sense. Do you want to go ahead and set up a poll (for the meeting time)? I think we should try to schedule this in mid-February if that works.

@davidbrochart

Actually I submitted a proposal for a presentation on xtensor-zarr in the Pangeo webinar series. If that is accepted, that could be a better option, since people could see the recording. What do you think?

@jakirkham
Member

I think both are useful. The blog is also likely not too different from other write ups we are doing. So hopefully doesn’t take too much effort to produce. Plus it gives us an opportunity to share this work with a broader audience and publicly recognize CZI for their support, which I’m sure they will appreciate 🙂

@andrewfulton9
Contributor Author

I can do a blog on the partial read work as well.

@martindurant
Member

Very late question here: has partial read been implemented for uncompressed data? It should be even easier.

@rabernat
Contributor

rabernat commented Sep 2, 2021

No

@naught101

Would it be possible to rename this "Partial read of blosc-compressed chunks" or similar?

@jakirkham
Member

We could do that, though this would likely still show up in the commit history (and potentially other search paths).

That said, as there is a whole new sharding codec planned (please see ZEP 2), this is probably best handled via that pathway, which is planned as part of the v3 work ( #1583 )

Separately there has been discussion of moving from Blosc 1 to 2 ( zarr-developers/numcodecs#413 ), which would likely entail other changes as well

So it is probably best to direct attention to more recent issues/PRs at this stage
