(feat): minimum working codec pipeline #19
Conversation
- Add internal `get_chunk_representation`, `retrieve_chunk_bytes`, and `store_chunk_bytes`
- Separate `retrieve_chunk` and `retrieve_chunk_subset`
- Separate `store_chunk` and `store_chunk_subset`
- Add assertions to simple.py
- Add config options
- `CodecPipelineImpl` interior mutability
Two quick comments; I would like to hear from @flying-sheep. I am updating https://github.com/ilan-gold/zarrs-python/pull/20/files because I realized that keeping zarr-python's tests in sync with the release we use is probably more trouble than it's worth. So I copied over their codec tests only, which looked relatively comprehensive. I think getting those working is a good call.
Also, in terms of an MVP, I was thinking of dropping in-memory support. At least for now, in the testing PR, I am doing that to keep things simple.
src/lib.rs (Outdated)
fn selection_to_array_subset(selection: &PyTuple, shape: &[u64]) -> PyResult<ArraySubset> {
I would tend towards doing this in Python, but I would be curious to see what @flying-sheep would say, so let's put a pin in this for now.
For example, I would tend towards initially converting everything to either an ndarray of ints, slices, or ints on the Python side to minimize the Rust/PyO3 business. I think this is a bit more verbose than doing the same thing in Python directly.
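To make that concrete, here is a minimal, hypothetical sketch of such Python-side normalisation (the helper name and exact rules are assumptions for illustration, not code from this PR): every axis is reduced to an int, a fully concrete slice, or an ndarray of integer indices before anything crosses into Rust.

```python
import numpy as np

def normalise_selection(selection, shape):
    """Reduce a user selection to plain ints, concrete slices, and int ndarrays."""
    if not isinstance(selection, tuple):
        selection = (selection,)

    # Expand a single Ellipsis into full slices for the omitted dimensions
    # (identity check avoids triggering ndarray.__eq__ on array entries).
    ellipsis_pos = [i for i, s in enumerate(selection) if s is Ellipsis]
    if ellipsis_pos:
        i = ellipsis_pos[0]
        fill = (slice(None),) * (len(shape) - (len(selection) - 1))
        selection = selection[:i] + fill + selection[i + 1:]

    # Pad any trailing, unspecified dimensions with full slices.
    selection = selection + (slice(None),) * (len(shape) - len(selection))

    normalised = []
    for sel, dim in zip(selection, shape):
        if isinstance(sel, slice):
            normalised.append(slice(*sel.indices(dim)))   # concrete start/stop/step
        elif np.isscalar(sel) and np.issubdtype(type(sel), np.integer):
            normalised.append(int(sel))
        else:
            arr = np.asarray(sel)
            if arr.dtype == bool:
                arr = np.flatnonzero(arr)                 # boolean mask -> int indices
            normalised.append(arr.astype(np.intp, copy=False))
    return tuple(normalised)
```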
Yeah, sounds reasonable. It would probably be necessary to separate the unstrided slice/ellipsis selections from the others, though. The latter probably needs a new specialised retrieve_chunks method that lets numpy do the indexing on the output.
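A rough sketch of that split (hypothetical: the method names and return values on the pipeline object are assumptions drawn from this thread, and the chunk-offset bookkeeping is omitted): contiguous, unstrided selections go straight to the Rust side, while everything else decodes the covering chunks and lets numpy apply the indexing to the output.

```python
import numpy as np

def read_selection(pipeline, selection):
    """Hypothetical dispatch between a Rust fast path and numpy-side indexing."""
    unstrided = isinstance(selection, tuple) and all(
        isinstance(s, slice) and s.step in (None, 1) for s in selection
    )
    if unstrided:
        # Assumed fast path: the Rust pipeline handles plain slice selections itself.
        return pipeline.retrieve_chunk_subset(selection)
    # Assumed specialised method: decode the chunks covering the selection,
    # then let numpy apply the fancy/strided indexing (offset handling omitted).
    decoded = pipeline.retrieve_chunks(selection)
    return np.asarray(decoded)[selection]
```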
I've moved some of this logic out in 33037bf
)
.collect::<PyResult<Vec<_>>>()?;

py.allow_threads(move || {
On my laptop, performance seems quite similar without this py.allow_threads. Could you see what you get on the benchmark without it? It would line up more with my mental map from https://pyo3.rs/main/parallelism if it actually wasn't necessary, i.e., rayon just "works" from within pyo3.
I still don't fully get why we need asyncio.to_thread either, TBH, but that one does seem necessary.
Here are benchmarks with the allow_threads commented out: LDeakin/zarr_benchmarks@zarrs_python...zarrs_python_wo_allow_threads
It is definitely needed.
Interesting. I don't think I have the time now to create a repro for this, but I'd be curious what the pyo3 people have to say. I don't really get why both of these are needed; I would think that we should be able to get the same performance without using either, but I must be missing something in how I conceive of things.
Concurrent calls to retrieve_chunks would spend more time waiting for the GIL without the allow_threads. benchmark_read_chunks.svg shows that if multiple chunks are requested concurrently, the performance is relatively flat without allow_threads.
The asyncio.to_thread lets read immediately return a future that can be awaited, without even kicking off any real work. Otherwise, read would block until the data has been retrieved.
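A minimal sketch of that asyncio.to_thread point (stand-in names, not the PR's exact code): the blocking call is pushed onto a worker thread, so `read` hands back an awaitable immediately instead of blocking the event loop.

```python
import asyncio
import time

def decode_chunks_blocking(chunk_keys):
    """Stand-in for the blocking call into the Rust codec pipeline."""
    time.sleep(0.1)  # pretend this is CPU/IO-bound decoding
    return [f"decoded:{k}" for k in chunk_keys]

async def read(chunk_keys):
    # Returns control to the event loop right away; the blocking work runs
    # on a thread from asyncio's default executor.
    return await asyncio.to_thread(decode_chunks_blocking, chunk_keys)

async def main():
    # Several reads can be in flight concurrently; the event loop is never blocked.
    results = await asyncio.gather(read(["c/0/0"]), read(["c/0/1"]))
    print(results)

if __name__ == "__main__":
    asyncio.run(main())
```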
I get conceptually why you need to release the GIL, I suppose, but what I don't get is why we have to do it here when the PyO3 docs specifically do not.
Here's their "rust_parallel" example:
https://github.com/PyO3/pyo3/blob/main/examples/word-count/src/lib.rs#L5-L11
And this is timed from Python (since it is a pyfunction):
https://github.com/PyO3/pyo3/blob/9f955e4ebf5f38d1f9de4837951588ef6c850622/examples/word-count/tests/test_word_count.py#L35-L37
This doesn't require allow_threads (which, I agree, doesn't make a crazy amount of sense to me), but apparently it's fastest. Maybe it has something to do with the asyncio.to_thread we're using, but I would think that also has no impact, because the underlying function passed to to_thread is not actually asynchronous. Also, to_thread releases the GIL, at least for I/O tasks, no? So I would think that maybe that propagates to Rust?
> This doesn't require allow_threads (which I agree, to me, doesn't make a crazy amount of sense) but apparently it's fastest

The rust_parallel example uses all cores via the Rayon thread pool, whereas the others are all sequential on the Rust side, just for the purposes of the example.
In practice, Rust methods are being called concurrently, like in run_rust_sequential_twice, by zarr-python, which has its own thread pool. The Rust thread pool is independent.
> The rust_parallel example uses all cores via the Rayon thread pool.

But aren't we doing that as well? I guess that's my confusion. Why does theirs work without py.allow_threads and ours does not?
They only call the parallel search function from one Python thread in the benchmarks, whereas our array methods can get called by multiple Python threads concurrently, like in the test_word_count_rust_sequential_twice_with_threads benchmark.
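A small sketch of the access pattern in question (the workload is a pure-Python stand-in, so it only illustrates the shape of the calls, not the speedup): zarr-python's thread pool invokes the pipeline from several Python threads at once, which is exactly the situation where holding the GIL inside the extension serialises everything and py.allow_threads pays off.

```python
from concurrent.futures import ThreadPoolExecutor

def retrieve_chunk(key):
    """Stand-in for a call into the Rust codec pipeline.

    A GIL-holding extension call here would force the worker threads to take
    turns; one that releases the GIL (py.allow_threads) lets them overlap.
    """
    return key, sum(range(100_000))  # placeholder work

keys = [f"c/{i}" for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(retrieve_chunk, keys))
print(len(results))
```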
Got it! Makes sense!
* (fix): ci for running tests
* (fix): no need to extract tests
* (Fix): remove duplicate name
* (fix): use a submodule...
* (chore): remove memory store + port zarr codec tests
* (chore): remove `dlpark`
* (fix): getattr for itemsize
* (chore): remove runtime F ordering
* (chore): skip vlen
* (feat): parse int

Co-authored-by: Lachlan Deakin <[email protected]>
@@ -100,6 +100,8 @@ async def read(
out = out.as_ndarray_like()  # FIXME: Error if array is not in host memory
if not out.dtype.isnative:
Can out be changed in such situations? Or does that just create a new array ignored by the caller?
That's a good point. as_ndarray_like returns the underlying data. So, I think you could change it as long as you do a 0-copy numpy (or similar) operation.
NDBuffer is just a wrapper class interface around array libraries. I am not sure what the design decision behind using it over dlpack (https://github.com/dmlc/dlpack, which I had here before) was, but undoing it is probably not going to happen now.
> 0-copy numpy (or similar) operation.

And it would have to be in-place on some level, as well.
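A hedged sketch of the kind of in-place, 0-copy handling being discussed (stand-in names, not this PR's code): let the decoder fill a native-endian view of the same memory, then byteswap in place so the caller's original array object, with its non-native dtype, reads the intended values.

```python
import numpy as np

def decode_into(buf):
    """Stand-in for the Rust pipeline writing native-endian decoded values."""
    buf[:] = np.arange(buf.size, dtype=buf.dtype).reshape(buf.shape)

# Pretend this is the caller-provided output buffer with a non-native dtype.
out = np.empty(4, dtype=np.dtype("int32").newbyteorder(">"))

if not out.dtype.isnative:
    native_view = out.view(out.dtype.newbyteorder("="))  # 0-copy reinterpretation
    decode_into(native_view)                              # decoder sees a native buffer
    out.byteswap(inplace=True)                            # fix the bytes for out's dtype
else:
    decode_into(out)

print(out)  # [0 1 2 3], visible through the caller's original array object
```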
@LDeakin I think the issue here, looking at the zarr code, would not be the endianness of the out array (which, both from how I think of it and from the fact that no tests are failing, makes some sense to me). Rather, I'd be worried about the incoming data from disk. But I think your underlying implementation would handle that, since it parses the codec metadata, no? i.e., you have a codec (or equivalent) in zarrs to swap bytes if need be, as per the metadata.
So I think this is a reasonable check, but I am not worried. I don't see any handling of this in the zarr-python codebase.
Yep, there is a codec that takes care of swapping bytes from disk if needed, but zarrs will always decode to native endian.
Looks like zarr-python does not have this entirely sorted anyway: zarr-developers/zarr-python#2324
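For reference, that swapping is driven by the array metadata; paraphrasing the zarr v3 spec (not this repo's code), the `bytes` codec entry records the stored endianness, and the decoder swaps as needed while producing a native-endian in-memory array.

```python
# Example zarr v3 codec metadata: data stored big-endian on disk, decoded to
# the machine's native byte order in memory.
codecs = [{"name": "bytes", "configuration": {"endian": "big"}}]
```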
Shall we rebase merge this on
@LDeakin Yup, sounds good to me.