Add blosc getitem #235
Conversation
Wouldn't this amount to multiple partial getitems?
On arbitrary slices yes; on the first/last dimension it can reduce to a single partial read. We have to be careful with heuristics though: at some point it is likely faster to read the full array than to do multiple partial decodes.
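To make the contrast concrete, here is a small sketch (shapes and numbers are illustrative, not code from this PR) of how slices map onto (start, nitems) ranges for a C-ordered chunk:

```python
shape = (100, 50)   # hypothetical 2-D chunk stored in C order
row_len = shape[1]

# chunk[10:20, :] -> a single contiguous run of items
first_dim = (10 * row_len, 10 * row_len)   # one (start, nitems) pair

# chunk[:, 10:20] -> one short run per row, i.e. 100 separate partial decodes
last_dim = [(r * row_len + 10, 10) for r in range(shape[0])]

print(first_dim, len(last_dim))
```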
Good point. Have you benchmarked this, Andrew? It would be good if we could develop some intuition about when and how much this helps 🙂
@jakirkham, I'll put together a little notebook with some benchmarks for this.
Ok, I've done some benchmarking now. Some of the work I've done can be seen here. To summarize, results seem to be highly dependent on the array/buffer size. It seems that the larger the array, the more efficient the partial decode method is, often outperforming the decode method regardless of the number of items decompressed. As the arrays get smaller, other factors (such as number of items, block size, compressor, clevel, etc.) seem to contribute more to the variability of the results, though this could be because local processes on my computer had more of an effect since the decompression times were so low anyway. I did run each test iteration 100 times, and compared the mean time of partial decompression and full decompression of a buffer.
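For reference, the kind of comparison described above could be reproduced with something along these lines (a rough sketch only; the codec settings, array size, and the decode_partial signature are assumptions based on this PR, not the notebook's actual benchmark code):

```python
import timeit

import numpy as np
from numcodecs import Blosc

codec = Blosc(cname='lz4', clevel=5)
arr = np.arange(1_000_000, dtype='i8')
enc = codec.encode(arr)

# mean over 100 runs: full decode vs. partial decode of 10,000 items
full = timeit.timeit(lambda: codec.decode(enc), number=100) / 100
partial = timeit.timeit(lambda: codec.decode_partial(enc, 1_000, 10_000), number=100) / 100
print(f"full: {full:.6f} s, partial: {partial:.6f} s")
```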
For multidimensional slices I'm wondering if https://github.com/Quansight/ndindex would be helpful.
The advantage of using warnings instead of globals is performance and the ability to still get the warning printed by default.
Example of an unstable-allowing context manager.
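A minimal sketch of what such a context manager could look like (the warning category name and behaviour here are assumptions, not code from numcodecs):

```python
import warnings
from contextlib import contextmanager


class ExperimentalFeatureWarning(UserWarning):
    """Hypothetical warning category for experimental features."""


@contextmanager
def allow_unstable():
    # silence experimental-feature warnings inside the block only;
    # outside the block they are still printed by default
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', ExperimentalFeatureWarning)
        yield
```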
Hello @andrewfulton9! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: there are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 (Comment last updated at 2020-07-08 19:48:09 UTC)
Hi all, just to say that this PR looks in good shape to me and I'm happy to see this go in as an experimental feature if it would be useful for exploratory work. Also, just to cross-reference: previously for zarr we've followed a convention where a feature is considered part of the "experimental API" if it is documented as an "experimental feature". This usually just means adding a note in the docstring to say it's an experimental feature, and also a similar statement if the feature is used in the tutorial or other docs. The contributor guide describes what this means for API compatibility and versioning. For example, the consolidate_metadata() function is marked as an experimental feature in the docstring. I'd suggest following a similar approach here and adding notes to the docstrings of the new function and method to state that they are experimental features.
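For example, such a docstring note might look something like this (a sketch only; the exact wording and signature are illustrative, following the convention described above):

```python
def decompress_partial(source, start, nitems, dest=None):
    """Decompress part of a blosc-compressed buffer.

    Notes
    -----
    This is an experimental feature; the API may change in any future
    release of numcodecs.
    """
```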
Great, I'll fix this up with your comments in mind and let you know when it's ready. Should be by the end of the day.
I believe this is ready to merge at this point, unless anyone has any other feedback |
Thanks a lot for updating. A couple of questions here regarding handling of type size.
Also could you remove from the PR the regenerated Cython C sources for the other codecs (compat_ext.c, lz4.c, vlen.c, zstd.c)? Those are just noise changes created by regenerating sources on a different system.
numcodecs/blosc.pyx (outdated)
```python
# determine typesize if not given; byte 3 of the blosc header stores the typesize
if typesize == 0:
    typesize = source[3]
```
Just wondering: here you extract the `typesize` from the blosc header. What happens if the user overrides this and gives a typesize that doesn't match what's present in the header?
numcodecs/blosc.pyx (outdated)
```python
# infer encoding size from typesize if not given; could be wrong if the
# array was converted before encoding
if encoding_size == 0:
    encoding_size = typesize
```
Again, just wondering what happens if the user overrides here.
numcodecs/blosc.pyx (outdated)
```
@@ -393,6 +394,99 @@ def decompress(source, dest=None):
    return dest


def decompress_partial(source, start, nitems, typesize=0, encoding_size=0, dest=None):
```
I'm slightly concerned about exposing `typesize` and `encoding_size` in the API here. Would/should the user ever override these? Or should these both always be set to whatever typesize is given in the blosc header of the source buffer?
These are exposed because if the array is converted to bytes before the buffer is encoded with blosc, then blosc will record a typesize of 1 in the header, which may not match the original array's itemsize.
For example, the typesize and encoding size need to be given to pass this test:
```python
# test encoding of bytes
buf = arr.tobytes(order='A')
enc = codec.encode(buf)
dec = codec.decode_partial(enc, start, nitems, ITEMSIZE, 1)
compare_arrays(compare_arr, dec, precision=precision)
```
I'm open to other ideas for this. Ideally a user wouldn't be using these kwargs unless they knew what they were doing.
Thanks for the clarification. I think the use case for exposing `typesize` in the API is clear. Regarding `encoding_size`, is that needed in the API? Shouldn't the `encoding_size` always be the value of the itemsize in the header? Or have I misunderstood? In the tests you are setting it to 1, but that is what would be in the blosc header.
I don't think this should be a concern. Zarr tries to maintain the `dtype` of the object being written, in part to make sure that Blosc gets the correct `itemsize`. We do this by using a utility function, `ensure_ndarray`. So I think we are safe to drop it.
@alimanfoo, I made an edit so the encoding size is always determined from the source buffer header, so only the typesize is exposed now.
@jakirkham, if a return buffer isn't given, the typesize is still needed to create it. With regular decompression, the return buffer can just be created the same size as the source buffer and then `ensure_ndarray` can size it appropriately, but with partial decompression the return buffer is sized by `nitems * typesize`. Maybe an option would be to make the destination buffer the same size as the source buffer, only fill it up to the size of the decompressed data, and then just take `slice(0, nitems)`?
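Roughly, the sizing being discussed looks like this (a sketch only, not the PR's code; it assumes the typesize is read from byte offset 3 of the blosc header, as in the diff above):

```python
import numpy as np

def partial_dest(source_bytes, nitems):
    # typesize is stored at byte offset 3 of the blosc header,
    # so a partial decode's destination can be sized as nitems * typesize
    typesize = source_bytes[3]
    return np.empty(nitems * typesize, dtype='u1')
```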
@jakirkham and @alimanfoo, any more thoughts about this? If the encoding is controlled by zarr, then I think it makes sense to leave out the `typesize` option. If we do that, I'll also remove the tests with pre-encoded arrays as well.
SGTM. Let's see what Alistair says :)
Hi folks, sorry for the slow reply.
FWIW I think you can assume that the typesize has been correctly passed through to blosc during write, and therefore that it can always be read from the header.
All of the tests with different non-numpy array-like things were originally added to ensure that the decode and encode methods would accept any object exposing a (new-style) buffer interface. Of course in this situation it causes a problem because the conversion e.g. from numpy to bytes causes a loss of information about the typesize. So FWIW I think you could either (a) drop the tests involving things which don't propagate the original typesize, or (b) adjust those tests to compare against the expectation given the new typesize after conversion from numpy to something else.
To elaborate a little, numcodecs is intended to be used by zarr but also other libraries. The tests ensure that numcodecs can handle as input any object which exposes the new-style buffer interface.
However, numcodecs can also assume that whatever typesize (itemsize) is exposed via that buffer interface is the right one.
I.e., if a user has performed some conversion of their data upstream of numcodecs which has caused some loss or change to the typesize (itemsize) that numcodecs receives, that is the user's problem.
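Concretely (a small illustration, not test code from this PR):

```python
import numpy as np

arr = np.arange(10, dtype='i8')
print(memoryview(arr).itemsize)            # 8: the original typesize
print(memoryview(arr.tobytes()).itemsize)  # 1: the typesize numcodecs sees after conversion
```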
The encoding size is now removed, and I updated the tests to handle it appropriately.
numcodecs/blosc.pyx (outdated)
```
@@ -393,6 +394,100 @@ def decompress(source, dest=None):
    return dest


def decompress_partial(source, start, nitems, typesize=0, dest=None):
```
Just for clarity, we can drop `typesize` and rely on the header. The `dtype` will already be passed through to Blosc correctly from Zarr.
Thanks @andrewfulton9, this is looking good to me, just a couple of docstring nits.
Thanks @andrewfulton9, glad to get this in 👍
FYI the logic to use this in Zarr is being worked on in PR (zarr-developers/zarr-python#584).
This PR demonstrates a way to do partial decompression of a blosc-compressed array buffer as discussed in This Issue. The biggest limitation of the `decompress_partial` method is the inability to properly decompress parts of a multidimensional array. A usage sketch follows the TODO list below.

TODO:
- `tox -e py38` passes locally
- `tox -e docs` passes locally
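A minimal usage sketch of the feature added here (assuming the final decode_partial(buf, start, nitems) signature settled on in the review above):

```python
import numpy as np
from numcodecs import Blosc

codec = Blosc(cname='zstd', clevel=1, shuffle=Blosc.SHUFFLE)
arr = np.arange(1_000_000, dtype='i8')
enc = codec.encode(arr)

# decompress only items [100, 110) without decoding the whole buffer
part = np.frombuffer(codec.decode_partial(enc, 100, 10), dtype=arr.dtype)
print(part)  # [100 101 ... 109]
```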