
Add blosc getitem #235

Merged 22 commits on Jul 8, 2020

Conversation

@andrewfulton9 (Contributor)

This PR demonstrates a way to do partial decompression of a Blosc-compressed array buffer, as discussed in This Issue. The biggest limitation of the decompress_partial method is that it cannot properly decompress parts of a multidimensional array.
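
For illustration only (not code from this PR), a minimal sketch of how the partial decode might be used from Python, assuming the experimental decode_partial(buf, start, nitems) form this thread converges on, where start and nitems are counted in items rather than bytes:

import numpy as np
from numcodecs import Blosc

codec = Blosc(cname='lz4', clevel=5, shuffle=Blosc.SHUFFLE)
arr = np.arange(1_000_000, dtype='i8')
enc = codec.encode(arr)

# decode only items 1000..2000 instead of the whole buffer
part = codec.decode_partial(enc, 1000, 1000)
dec = np.frombuffer(part, dtype=arr.dtype)
assert np.array_equal(dec, arr[1000:2000])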

TODO:

  • Unit tests and/or doctests in docstrings
  • tox -e py38 passes locally
  • Docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • tox -e docs passes locally
  • AppVeyor and Travis CI passes
  • Test coverage to 100% (Coveralls passes)

@jakirkham (Member)

cc @alimanfoo @rabernat @jhamman

@jakirkham (Member)

The biggest limitation of the decompress_partial method is the inability to properly decompress parts of a multidimensional array.

Wouldn't this amount to multiple partial getitems?

@Carreau (Contributor) commented May 22, 2020

Wouldn't this amount to multiple partial getitems?

On arbitrary slices, yes. On the first/last dimension ([..., a:b] / [a:b, ...]) it is likely a single one, depending on whether the array is C- or F-ordered; that is my rough reading of how this could work and still be relatively efficient.

We have to be careful with heuristics though; at some point it is likely faster to read the full array than to do multiple partial decodes.
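
To illustrate the point about memory order (a sketch, not code from the PR): for a C-ordered array, a slice along the first axis maps to one contiguous run of items, whereas a slice along a later axis does not.

import numpy as np

shape = (1000, 100)
a, b = 10, 20                        # want rows a:b, i.e. arr[a:b, ...]
items_per_row = int(np.prod(shape[1:]))
start = a * items_per_row            # first item of row a
nitems = (b - a) * items_per_row     # one contiguous range covers rows a:b
# By contrast, arr[:, a:b] touches a short run in every row, so it would
# need one partial getitem per row, or a heuristic fallback to a full decode.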

@jakirkham (Member)

Good point. Have you benchmarked this Andrew? It would be good if we could develop some intuition about when and how much this helps 🙂

@andrewfulton9 (Contributor, Author)

@jakirkham, I'll put together a little notebook with some benchmarks for this

@andrewfulton9 (Contributor, Author)

Ok, I've done some benchmarking now. Some of the work I've done can be seen here.

To summarize, though, the results seem to be highly dependent on the array/buffer size. The larger the array, the more efficient the partial decode method is, often outperforming the full decode regardless of the number of items decompressed. As the arrays get smaller, other factors (such as the number of items, block size, compressor, clevel, etc.) seem to contribute more to the variability of the results, though this could be because local processes on my computer had more of an effect since the decompression times were so low anyway. I ran each test iteration 100 times and compared the mean time of partial decompression against full decompression of the buffer.
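
As a rough sketch of the kind of comparison described above (assuming the decode_partial method sketched earlier; array sizes and parameters are illustrative only):

import timeit

import numpy as np
from numcodecs import Blosc

codec = Blosc(cname='lz4', clevel=5)
arr = np.arange(10_000_000, dtype='i8')
enc = codec.encode(arr)

# mean over 100 runs of a full decode vs. a partial decode of 1000 items
full = timeit.timeit(lambda: codec.decode(enc), number=100) / 100
part = timeit.timeit(lambda: codec.decode_partial(enc, 0, 1000), number=100) / 100
print("mean full decode: %.6fs, mean partial decode: %.6fs" % (full, part))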

@Carreau (Contributor) commented May 26, 2020

For multidimensional slices I'm wondering if https://github.com/Quansight/ndindex would be helpful.

Carreau and others added 2 commits on May 26, 2020:

  • The advantage of using warnings instead of globals is performance and the ability to also still get the warning printed by default.
  • Example of unstable-allowing context manager.
@pep8speaks commented Jun 3, 2020

Hello @andrewfulton9! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-07-08 19:48:09 UTC

@alimanfoo (Member)

Hi all, just to say that this PR looks in good shape to me, and I'm happy to see this go in as an experimental feature if it would be useful for exploratory work.

Also just to cross-reference, previously for zarr we've followed a convention where a feature is considered part of the "experimental API" if it is documented as an "experimental feature". This usually just means adding a note in the docstring to say it's an experimental feature, and also a similar statement if the feature is used in the tutorial or other docs. The contributor guide describes what this means for API compatibility and versioning.

For example, the consolidate_metadata() function is marked as an experimental feature in the docstring.

I'd suggest following a similar approach here and adding notes to the docstrings of the new function and method to state that they are experimental features.

Re the allow_unstable() context manager, I don't mind either way if we use it, happy to follow advice. If we do use that I'd make a soft suggestion to rename it to allow_experimental() just to be consistent with terminology around the experimental API and features.

@andrewfulton9 (Contributor, Author)

Great, I'll fix this up with your comments in mind and let you know when it's ready. Should be by the end of the day.

@andrewfulton9 (Contributor, Author)

I believe this is ready to merge at this point, unless anyone has any other feedback

@alimanfoo (Member) left a comment

Thanks a lot for updating. A couple of questions here regarding handling of type size.

Also could you remove from the PR the regenerated Cython C sources for the other codecs (compat_ext.c, lz4.c, vlen.c, zstd.c)? Those are just noise changes created by regenerating sources on a different system.

(Review threads on numcodecs/blosc.pyx, resolved.)

# determine typesize if not given
if typesize == 0:
    typesize = source[3]
Member:

Just wondering, here you extract the typesize from the blosc header. What happens if the user overrides and gives a typesize that doesn't match what's present in the header?
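
For reference, a small sketch (not part of the PR) showing where that value comes from: the blosc format stores the typesize at byte offset 3 of the header, which is what source[3] reads back, so an override that disagrees with this byte is exactly the case being asked about.

import numpy as np
from numcodecs import Blosc

codec = Blosc(cname='lz4', clevel=5)
arr = np.arange(100, dtype='f8')        # itemsize 8

enc = codec.encode(arr)
print(bytes(enc)[3])                    # 8 -- typesize recorded in the header

enc2 = codec.encode(arr.tobytes())
print(bytes(enc2)[3])                   # 1 -- plain bytes expose itemsize 1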

# infers encoding size from typesize if not given. Could be wrong if
# array is converted before encoding.
if encoding_size == 0:
    encoding_size = typesize
Member:

Again just wondering what happens if user overrides here.

@@ -393,6 +394,99 @@ def decompress(source, dest=None):
return dest


def decompress_partial(source, start, nitems, typesize=0, encoding_size=0, dest=None):
Member:

I'm slightly concerned about exposing typesize and encoding_size in the API here. Would/should the user ever override these? Or should these both always be set to whatever typesize is given in the blosc header of the source buffer?

Contributor Author:

These are exposed because if the array is encoded to bytes before the buffer is compressed with blosc, then blosc will record 1 as the typesize in the header, which may not reflect the original itemsize.

For example, the typesize and encoding size need to be given to pass this test:

# test encoding of bytes
buf = arr.tobytes(order='A')
enc = codec.encode(buf)
dec = codec.decode_partial(enc, start, nitems, ITEMSIZE, 1)
compare_arrays(compare_arr, dec, precision=precision)

I'm open to other ideas for this. Ideally a user wouldn't be using these kwargs unless they knew what they were doing

Member:

Thanks for clarification. I think the use case for exposing typesize in the API is clear.

Regarding encoding_size, is that needed in the API? Shouldn't the encoding_size always be the value of the itemsize in the header? Or have I misunderstood? In the tests you are setting it to 1, but that is what would be in the blosc header.

Member:

I don't think this should be a concern. Zarr tries to maintain the dtype of the object being written in part to make sure that Blosc gets the correct itemsize. We do this by using a utility function ensure_ndarray. So I think we are safe to drop it.
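
A short sketch of the behaviour being relied on here (assuming the numcodecs.compat.ensure_ndarray helper mentioned above; not code from the PR):

import numpy as np
from numcodecs.compat import ensure_ndarray

arr = np.arange(10, dtype='f8')
print(ensure_ndarray(arr).dtype)            # float64 -- itemsize 8 reaches Blosc
print(ensure_ndarray(arr.tobytes()).dtype)  # uint8   -- plain bytes carry itemsize 1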

Contributor Author:

@alimanfoo, I made an edit so the encoding size is always determined from the source buffer header, so only the typesize is exposed now.

@jakirkham, if a return buffer isn't given, the typesize is still needed to create it. With regular decompression, the return buffer can just be created the same size as the source buffer and then ensure_ndarray can size it appropriately, but with partial decompression the return buffer is sized as nitems * typesize. Maybe an option would be to make the destination buffer the same size as the source buffer, only fill it up to the size of the decompressed data, and then just take slice(0, nitems)?

Contributor Author:

@jakirkham and @alimanfoo, any more thoughts about this? If the encoding is controlled by zarr, then I think it makes sense to leave out the typesize option. If we do that, I'll also remove the tests with pre-encoded arrays.

Member:

SGTM. Let's see what Alistair says :)

Member:

Hi folks, sorry for slow reply.

FWIW I think you can assume that the typesize has been correctly passed through to blosc during write, and therefore that it can always be read from the header.

All of the tests with different non-numpy array-like things were originally added to ensure that the decode and encode methods would accept any object exposing a (new-style) buffer interface. Of course in this situation it causes a problem because the conversion e.g. from numpy to bytes causes a loss of information about the typesize. So FWIW I think you could either (a) drop the tests involving things which don't propagate the original typesize, or (b) adjust those tests to compare against the expectation given the new typesize after conversion from numpy to something else.

Member:

To elaborate a little, numcodecs is intended to be used by zarr but also other libraries. The tests ensure that numcodecs can handle as input any object which exposes the new-style buffer interface.

However, numcodecs can also assume that whatever typesize (itemsize) is exposed via that buffer interface is the right one.

I.e., if a user has performed some conversion of their data upstream of numcodecs which has caused some loss or change to the typesize (itemsize) that numcodecs receives, that is the user's problem.
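
A concrete example of the typesize travelling (or not) through the buffer interface (illustration only, not code from the PR):

import numpy as np

arr = np.arange(10, dtype='f8')
print(memoryview(arr).itemsize)            # 8 -- itemsize is exposed by the array
print(memoryview(arr.tobytes()).itemsize)  # 1 -- converting to bytes loses it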

Contributor Author:

Encoding size is now removed, and I've updated the tests to handle it appropriately.

@@ -393,6 +394,100 @@ def decompress(source, dest=None):
return dest


def decompress_partial(source, start, nitems, typesize=0, dest=None):
Member:

Just for clarity, we can drop typesize and rely on the header. The dtype will already be passed through to Blosc correctly from Zarr.

@alimanfoo (Member) left a comment

Thanks @andrewfulton9, this is looking good to me, just a couple of docstring nits.

(Review threads on numcodecs/blosc.pyx and docs/release.rst, resolved.)
@alimanfoo merged commit a4be370 into zarr-developers:master on Jul 8, 2020
@alimanfoo (Member)

Thanks @andrewfulton9, glad to get this in 👍

@jakirkham (Member)

FYI the logic to use this in Zarr is being worked on in PR ( zarr-developers/zarr-python#584 ).
