
Add blosc getitem #235

Merged 22 commits on Jul 8, 2020

Conversation

@andrewfulton9 (Contributor)

This PR demonstrates a way to do partial decompression of a Blosc-compressed array buffer, as discussed in This Issue. The biggest limitation of the decompress_partial method is that it cannot properly decompress parts of a multidimensional array.
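
For illustration only (not code from this PR), a minimal sketch of how the partial decode might be used from Python, assuming the experimental decode_partial(buf, start, nitems) form this thread converges on, where start and nitems are counted in items rather than bytes:

import numpy as np
from numcodecs import Blosc

codec = Blosc(cname='lz4', clevel=5, shuffle=Blosc.SHUFFLE)
arr = np.arange(1_000_000, dtype='i8')
enc = codec.encode(arr)

# decode only items 1000..2000 instead of the whole buffer
part = codec.decode_partial(enc, 1000, 1000)
dec = np.frombuffer(part, dtype=arr.dtype)
assert np.array_equal(dec, arr[1000:2000])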

TODO:

  • Unit tests and/or doctests in docstrings
  • tox -e py38 passes locally
  • Docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • tox -e docs passes locally
  • AppVeyor and Travis CI passes
  • Test coverage to 100% (Coveralls passes)

@jakirkham (Member)

cc @alimanfoo @rabernat @jhamman

@jakirkham (Member)

The biggest limitation of the decompress_partial method is the inability to properly decompress parts of a multidimensional array.

Wouldn't this amount to multiple partial getitems?

@Carreau (Contributor) commented May 22, 2020

Wouldn't this amount to multiple partial getitems?

On arbitrary slices, yes. On the first/last dimension ([..., a:b] / [a:b, ...]) it is likely a single one, depending on whether the array is C- or F-ordered; that is my rough reading of how this could work and still be relatively efficient.

We have to be careful with heuristics though; at some point it is likely faster to read the full array than to do multiple partial decodes.
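
To illustrate the point about memory order (a sketch, not code from the PR): for a C-ordered array, a slice along the first axis maps to one contiguous run of items, whereas a slice along a later axis does not.

import numpy as np

shape = (1000, 100)
a, b = 10, 20                        # want rows a:b, i.e. arr[a:b, ...]
items_per_row = int(np.prod(shape[1:]))
start = a * items_per_row            # first item of row a
nitems = (b - a) * items_per_row     # one contiguous range covers rows a:b
# By contrast, arr[:, a:b] touches a short run in every row, so it would
# need one partial getitem per row, or a heuristic fallback to a full decode.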

@jakirkham (Member)

Good point. Have you benchmarked this Andrew? It would be good if we could develop some intuition about when and how much this helps 🙂

@andrewfulton9 (Contributor, Author)

@jakirkham, I'll put together a little notebook with some benchmarks for this

@andrewfulton9 (Contributor, Author)

Ok, I've done some benchmarking now. Some of the work I've done can be seen here.

To summarize, though, the results seem to be highly dependent on the array/buffer size. The larger the array, the more efficient the partial decode method is, often outperforming the full decode regardless of the number of items decompressed. As the arrays get smaller, other factors (such as the number of items, block size, compressor, clevel, etc.) seem to contribute more to the variability of the results, though this could be because local processes on my computer had more of an effect since the decompression times were so low anyway. I ran each test iteration 100 times and compared the mean time of partial decompression against full decompression of the buffer.
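
As a rough sketch of the kind of comparison described above (assuming the decode_partial method sketched earlier; array sizes and parameters are illustrative only):

import timeit

import numpy as np
from numcodecs import Blosc

codec = Blosc(cname='lz4', clevel=5)
arr = np.arange(10_000_000, dtype='i8')
enc = codec.encode(arr)

# mean over 100 runs of a full decode vs. a partial decode of 1000 items
full = timeit.timeit(lambda: codec.decode(enc), number=100) / 100
part = timeit.timeit(lambda: codec.decode_partial(enc, 0, 1000), number=100) / 100
print("mean full decode: %.6fs, mean partial decode: %.6fs" % (full, part))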

@Carreau (Contributor) commented May 26, 2020

For multidimensional slices I'm wondering if https://github.com/Quansight/ndindex would be helpful.

Carreau and others added 2 commits on May 26, 2020:

  • The advantage of using warnings instead of globals is performance and the ability to also still get the warning printed by default.
  • Example of unstable-allowing context manager.
@pep8speaks commented Jun 3, 2020

Hello @andrewfulton9! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-07-08 19:48:09 UTC

@alimanfoo (Member)

Hi all, just to say that this PR looks in good shape to me, and I'm happy to see this go in as an experimental feature if it would be useful for exploratory work.

Also just to cross-reference, previously for zarr we've followed a convention where a feature is considered part of the "experimental API" if it is documented as an "experimental feature". This usually just means adding a note in the docstring to say it's an experimental feature, and also a similar statement if the feature is used in the tutorial or other docs. The contributor guide describes what this means for API compatibility and versioning.

For example, the consolidate_metadata() function is marked as an experimental feature in the docstring.

I'd suggest following a similar approach here and adding notes to the docstrings of the new function and method to state that they are experimental features.

Re the allow_unstable() context manager, I don't mind either way if we use it, happy to follow advice. If we do use that I'd make a soft suggestion to rename it to allow_experimental() just to be consistent with terminology around the experimental API and features.

@andrewfulton9 (Contributor, Author)

Great, I'll fix this up with your comments in mind and let you know when it's ready. Should be by the end of the day.

@andrewfulton9 (Contributor, Author)

I believe this is ready to merge at this point, unless anyone has any other feedback

@alimanfoo (Member) left a comment

Thanks a lot for updating. A couple of questions here regarding handling of type size.

Also could you remove from the PR the regenerated Cython C sources for the other codecs (compat_ext.c, lz4.c, vlen.c, zstd.c)? Those are just noise changes created by regenerating sources on a different system.

(Review threads on numcodecs/blosc.pyx, resolved.)

# determine typesize if not given
if typesize == 0:
    typesize = source[3]
Member:

Just wondering, here you extract the typesize from the blosc header. What happens if the user overrides and gives a typesize that doesn't match what's present in the header?
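
For reference, a small sketch (not part of the PR) showing where that value comes from: the blosc format stores the typesize at byte offset 3 of the header, which is what source[3] reads back, so an override that disagrees with this byte is exactly the case being asked about.

import numpy as np
from numcodecs import Blosc

codec = Blosc(cname='lz4', clevel=5)
arr = np.arange(100, dtype='f8')        # itemsize 8

enc = codec.encode(arr)
print(bytes(enc)[3])                    # 8 -- typesize recorded in the header

enc2 = codec.encode(arr.tobytes())
print(bytes(enc2)[3])                   # 1 -- plain bytes expose itemsize 1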

# infers encoding size from typesize if not given. Could be wrong if
# array is converted before encoding.
if encoding_size == 0:
    encoding_size = typesize
Member:

Again just wondering what happens if user overrides here.

@@ -393,6 +394,99 @@ def decompress(source, dest=None):
return dest


def decompress_partial(source, start, nitems, typesize=0, encoding_size=0, dest=None):
Member:

I'm slightly concerned about exposing typesize and encoding_size in the API here. Would/should the user ever override these? Or should these both always be set to whatever typesize is given in the blosc header of the source buffer?

Contributor Author:

These are exposed because if the array is encoded to bytes before the buffer is compressed with blosc, then blosc will record 1 as the typesize in the header, which may not reflect the original itemsize.

For example, the typesize and encoding size need to be given to pass this test:

# test encoding of bytes
buf = arr.tobytes(order='A')
enc = codec.encode(buf)
dec = codec.decode_partial(enc, start, nitems, ITEMSIZE, 1)
compare_arrays(compare_arr, dec, precision=precision)

I'm open to other ideas for this. Ideally a user wouldn't be using these kwargs unless they knew what they were doing

Member:

Thanks for clarification. I think the use case for exposing typesize in the API is clear.

Regarding encoding_size, is that needed in the API? Shouldn't the encoding_size always be the value of the itemsize in the header? Or have I misunderstood? In the tests you are setting it to 1, but that is what would be in the blosc header.

Member:

I don't think this should be a concern. Zarr tries to maintain the dtype of the object being written in part to make sure that Blosc gets the correct itemsize. We do this by using a utility function ensure_ndarray. So I think we are safe to drop it.
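
A short sketch of the behaviour being relied on here (assuming the numcodecs.compat.ensure_ndarray helper mentioned above; not code from the PR):

import numpy as np
from numcodecs.compat import ensure_ndarray

arr = np.arange(10, dtype='f8')
print(ensure_ndarray(arr).dtype)            # float64 -- itemsize 8 reaches Blosc
print(ensure_ndarray(arr.tobytes()).dtype)  # uint8   -- plain bytes carry itemsize 1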

Contributor Author:

@alimanfoo, I made an edit so the encoding size is always determined from the source buffer header, so only the typesize is exposed now.

@jakirkham, if a return buffer isn't given, the typesize is still needed to create it. With regular decompression, the return buffer can just be created the same size as the source buffer and then ensure_ndarray can size it appropriately, but with partial decompression the return buffer is sized as nitems * typesize. Maybe an option would be to make the destination buffer the same size as the source buffer, only fill it up to the size of the decompressed data, and then just take slice(0, nitems)?

Contributor Author:

@jakirkham and @alimanfoo, any more thoughts about this? If the encoding is controlled by zarr, then I think it makes sense to leave out the typesize option. If we do that, I'll also remove the tests with pre-encoded arrays.

Member:

SGTM. Let's see what Alistair says :)

Member:

Hi folks, sorry for slow reply.

FWIW I think you can assume that the typesize has been correctly passed through to blosc during write, and therefore that it can always be read from the header.

All of the tests with different non-numpy array-like things were originally added to ensure that the decode and encode methods would accept any object exposing a (new-style) buffer interface. Of course in this situation it causes a problem because the conversion e.g. from numpy to bytes causes a loss of information about the typesize. So FWIW I think you could either (a) drop the tests involving things which don't propagate the original typesize, or (b) adjust those tests to compare against the expectation given the new typesize after conversion from numpy to something else.

Member:

To elaborate a little, numcodecs is intended to be used by zarr but also other libraries. The tests ensure that numcodecs can handle as input any object which exposes the new-style buffer interface.

However, numcodecs can also assume that whatever typesize (itemsize) is exposed via that buffer interface is the right one.

I.e., if a user has performed some conversion of their data upstream of numcodecs which has caused some loss or change to the typesize (itemsize) that numcodecs receives, that is the user's problem.
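
A concrete example of the typesize travelling (or not) through the buffer interface (illustration only, not code from the PR):

import numpy as np

arr = np.arange(10, dtype='f8')
print(memoryview(arr).itemsize)            # 8 -- itemsize is exposed by the array
print(memoryview(arr.tobytes()).itemsize)  # 1 -- converting to bytes loses it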

Contributor Author:

Encoding size is now removed, and I've updated the tests to handle it appropriately.

@@ -393,6 +394,100 @@ def decompress(source, dest=None):
return dest


def decompress_partial(source, start, nitems, typesize=0, dest=None):
Member:

Just for clarity, we can drop typesize and rely on the header. The dtype will already be passed through to Blosc correctly from Zarr.

@alimanfoo (Member) left a comment

Thanks @andrewfulton9, this is looking good to me, just a couple of docstring nits.

(Review threads on numcodecs/blosc.pyx and docs/release.rst, resolved.)
@alimanfoo merged commit a4be370 into zarr-developers:master on Jul 8, 2020
@alimanfoo (Member)

Thanks @andrewfulton9, glad to get this in 👍

@jakirkham (Member)

FYI the logic to use this in Zarr is being worked on in PR ( zarr-developers/zarr-python#584 ).
