(feat): minimum working codec pipeline #19

Merged: 15 commits from ld/codec_pipeline_mvp into ld/codec_pipeline, Nov 5, 2024

Conversation

@LDeakin (Collaborator) commented Oct 23, 2024

- Add internal `get_chunk_representation`, `retrieve_chunk_bytes`, and `store_chunk_bytes`
- Separate `retrieve_chunk` and `retrieve_chunk_subset`
- Separate `store_chunk` and `store_chunk_subset`
- Add assertions to simple.py

@ilan-gold (Owner) left a comment

Two quick comments; I'd also like to hear from @flying-sheep. I am updating https://github.com/ilan-gold/zarrs-python/pull/20/files because I realized that keeping zarr-python's tests in sync with the release we use is probably more trouble than it's worth. So I copied over only their codec tests, which looked relatively comprehensive. I think getting those working is a good call.

Also, in terms of the MVP, I was thinking of dropping in-memory support; at least for now, I am doing that in the testing PR to keep things simple.

src/lib.rs (Outdated):
)
}

fn selection_to_array_subset(selection: &PyTuple, shape: &[u64]) -> PyResult<ArraySubset> {
@ilan-gold (Owner)

I would tend towards doing this in Python, but I'd be curious to see what @flying-sheep would say, so let's put a pin in this for now.

For example, I would tend towards initially converting everything to either an ndarray of ints, slices, or ints on the Python side, to minimize the Rust/pyo3 business. I think the Rust route is a bit more verbose than doing this stuff in Python directly.

@LDeakin (Collaborator, Author) commented Oct 30, 2024

Yea, sounds reasonable. It would probably be necessary to separate the unstrided slices/ellipsis selections from the others though. The latter probably needs a new specialised retrieve_chunks method that lets numpy do the indexing on the output.
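
For illustration, here is a rough Python sketch (the helper name and logic are hypothetical, not from this PR) of the split being discussed: "basic" selections made of ints, unstrided slices, and Ellipsis can be mapped directly onto a contiguous array subset on the Rust side, while anything else would fall back to numpy indexing on the decoded output.

```python
import numpy as np

def is_basic_selection(selection: tuple) -> bool:
    """True if every component can be handled as an unstrided array subset."""
    for dim_sel in selection:
        if dim_sel is Ellipsis or isinstance(dim_sel, int):
            continue
        if isinstance(dim_sel, slice) and dim_sel.step in (None, 1):
            continue
        return False  # integer arrays, boolean masks, strided slices, ...
    return True

print(is_basic_selection((slice(0, 64), Ellipsis)))             # True
print(is_basic_selection((slice(0, 64, 2), np.array([3, 1]))))  # False
```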

@LDeakin (Collaborator, Author)

I've moved some of this logic out in 33037bf

)
.collect::<PyResult<Vec<_>>>()?;

py.allow_threads(move || {
@ilan-gold (Owner) commented Oct 30, 2024

On my laptop, performance seems quite similar without this py.allow_threads. Could you see what you get on the benchmark without it? It would line up more with my mental map from https://pyo3.rs/main/parallelism if this actually wasn't necessary, i.e., if rayon just "works" from within pyo3.

I still don't fully get why we need asyncio.to_thread either TBH but that one does seem necessary.

@LDeakin (Collaborator, Author)

Here are benchmarks with the allow_threads commented out:
LDeakin/zarr_benchmarks@zarrs_python...zarrs_python_wo_allow_threads

It is definitely needed

@ilan-gold (Owner)

Interesting. I don't think I have the time now to create a repro for this, but I'd be curious what the pyo3 people have to say. I don't really get why both of these are needed; I would think we should be able to get the same performance without using either, but I must be missing something in how I conceive of things.

@LDeakin (Collaborator, Author)

Concurrent calls to retrieve_chunks would spend more time waiting for the GIL without the allow_threads. benchmark_read_chunks.svg shows that if multiple chunks are requested concurrently, the performance is relatively flat without allow_threads.

The asyncio.to_thread lets read immediately return a future that can be awaited, without even kicking off any real work. Otherwise, read would block until the data has been retrieved.
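
A minimal, self-contained illustration of that second point (this is not the project's actual read path; blocking_retrieve is a made-up stand-in for the synchronous call into Rust):

```python
import asyncio
import time

def blocking_retrieve(chunk: int) -> bytes:
    time.sleep(0.1)  # stand-in for the synchronous retrieval in the extension
    return b"chunk %d" % chunk

async def read(chunk: int) -> bytes:
    # Awaiting to_thread hands the blocking call to a worker thread, so the
    # event loop stays free to schedule other reads instead of blocking here.
    return await asyncio.to_thread(blocking_retrieve, chunk)

async def main() -> None:
    print(await asyncio.gather(*(read(i) for i in range(4))))

asyncio.run(main())
```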

@ilan-gold (Owner) commented Nov 5, 2024

I get conceptually why you need to release the GIL, I suppose, but what I don't get is why we have to do it here when the PyO3 docs specifically do not.

Here's their "rust_parallel" example:
https://github.com/PyO3/pyo3/blob/main/examples/word-count/src/lib.rs#L5-L11
And this is timed from python (since it is a pyfunction):
https://github.com/PyO3/pyo3/blob/9f955e4ebf5f38d1f9de4837951588ef6c850622/examples/word-count/tests/test_word_count.py#L35-L37

This doesn't require allow_threads (which, I agree, doesn't make a crazy amount of sense to me) but apparently it's fastest. Maybe it has something to do with the asyncio.to_thread we're using, but I would think that also has no impact, because the underlying function passed to to_thread is not actually asynchronous... Also, to_thread releases the GIL, at least for I/O tasks, no? So I would think that maybe that propagates to Rust?

@LDeakin (Collaborator, Author)

> This doesn't require allow_threads (which I agree, to me, doesn't make a crazy amount of sense) but apparently it's fastest

The rust_parallel example uses all cores via the Rayon thread pool, whereas the others are all sequential on the Rust side, just for the purposes of the example.

In practice, the Rust methods are being called concurrently (as in run_rust_sequential_twice) by zarr-python, which has its own thread pool. The Rust thread pool is independent of that.
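
To make the difference concrete, here is a sketch (purely illustrative; process_chunk is a made-up stand-in for an extension call such as retrieve_chunks) of the calling pattern that zarr-python's thread pool produces:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk_id: int) -> int:
    # Stand-in for a call into the extension. If the extension holds the GIL
    # for the full duration of each call, these concurrently submitted calls
    # effectively run one after another; releasing the GIL (py.allow_threads)
    # lets the Rust-side work from several Python threads overlap.
    return chunk_id * 2

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(process_chunk, range(32)))
print(results[:4])  # [0, 2, 4, 6]
```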

@ilan-gold (Owner)

> The rust_parallel example uses all cores via the Rayon thread pool.

But aren't we doing that as well? I guess that's my confusion. Why does theirs work without py.allow_threads and ours does not?

@LDeakin (Collaborator, Author)

They only call the parallel search function from one Python thread in the benchmarks, whereas our array methods can get called by multiple Python threads concurrently, like in the test_word_count_rust_sequential_twice_with_threads benchmark.

@ilan-gold (Owner)

Got it! Makes sense!

ilan-gold and others added 2 commits October 31, 2024 07:27
* (fix): ci for running tests

* (fix): no need to extract tests

* (Fix): remove duplicate name

* (fix): use a submodule...

* (chore): remove memory store + port zarr codec tests

* (chore): remove `dlpark`

* (fix): getattr for itemsize

* (chore): remove runtime F ordering

* (chore): skip vlen

* (feat): parse int

---------

Co-authored-by: Lachlan Deakin <[email protected]>
@@ -100,6 +100,8 @@ async def read(
# )

out = out.as_ndarray_like() # FIXME: Error if array is not in host memory
if not out.dtype.isnative:
@LDeakin (Collaborator, Author)

Can out be changed in such situations? Or does that just create a new array that is ignored by the caller?

@ilan-gold (Owner)

That's a good point. as_ndarray_like returns the underlying data, so I think you could change it as long as you do a 0-copy numpy (or similar) operation.

@ilan-gold (Owner)

NDBuffer is just a wrapper class/interface around array libraries. I am not sure what the design decision was behind using it over dlpack (https://github.com/dmlc/dlpack, which I had here before), but undoing it is probably not going to happen now.

@ilan-gold (Owner)

> 0-copy numpy (or similar) operation.

And it would have to be in-place on some level, as well
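
A small numpy sketch of that trade-off (the array is made up; assume a non-native-endian buffer on a little-endian machine):

```python
import numpy as np

out = np.arange(4, dtype=np.dtype("float64").newbyteorder(">"))  # non-native

# byteswap(inplace=True) mutates the caller's buffer in place, but the matching
# dtype change comes from .view(), which returns a *new* ndarray object. A
# caller still holding `out` would now see swapped bytes under the old dtype;
# that is the "has to be in-place on some level" concern.
native = out.byteswap(inplace=True).view(out.dtype.newbyteorder())
print(native.dtype.isnative, out.dtype.isnative)  # True False
```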

@ilan-gold (Owner) commented Nov 5, 2024

@LDeakin I think the issue here, looking at the zarr code, would not be the endianness of the out array (which, both from how I think of it and from the fact that no tests are failing, makes some sense to me). Rather, I'd be worried about the incoming data from disk. But I think your underlying implementation would handle that, since it parses the codec metadata, no? I.e., you have a codec (or equivalent) in zarrs to swap bytes if need be, as per the metadata.

So I think this is a reasonable check, but I am not worried; I don't see any handling of this in the zarr-python codebase.

@LDeakin (Collaborator, Author)

Yep, there is a codec that takes care of swapping bytes from disk if needed, but zarrs will always decode to native endian.
Looks like zarr-python does not have this entirely sorted anyway: zarr-developers/zarr-python#2324
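
For context, a Zarr v3 array declares the stored byte order via the `bytes` codec in its metadata; an illustrative fragment (not taken from this repo) looks roughly like:

```python
# The "bytes" codec records the on-disk endianness; decoders (zarrs included)
# swap to native byte order as needed when reading.
codecs = [
    {"name": "bytes", "configuration": {"endian": "big"}},
]
```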

@ilan-gold changed the title from "(fix): minimum working codec pipeline" to "(feat): minimum working codec pipeline" on Oct 31, 2024
@LDeakin (Collaborator, Author) commented Nov 4, 2024

Shall we rebase-merge this onto ld/codec_pipeline and then review an actual PR on main?

@ilan-gold (Owner)

@LDeakin Yup, sounds good to me.

@LDeakin merged commit bd71529 into ld/codec_pipeline on Nov 5, 2024 (1 check passed).
@LDeakin deleted the ld/codec_pipeline_mvp branch on November 5, 2024 at 18:49.