-
-
Notifications
You must be signed in to change notification settings - Fork 720
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Memoryview serialisation #3743
Memoryview serialisation #3743
Conversation
+1 |
distributed/protocol/serialize.py
Outdated
@@ -568,6 +568,11 @@ def _deserialize_bytes(header, frames): | |||
return b"".join(frames) | |||
|
|||
|
|||
@dask_deserialize.register(memoryview) | |||
def _serialize_memoryview(header, frames): | |||
return memoryview(b"".join(frames)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth special casing length-1 frames
, since then you can just return frames[0]
and avoid the copy? Maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good other than the one question.
Thanks Martin! So one thing I'm curious about is there's a fair bit of logic in NumPy serialization, which we seem to be skipping here. For example it is possible to have Python objects much like one can in NumPy arrays. So it might make sense to pickle in that case. Similarly we may want logic to flatten N-D data, cast larger types, and split large frames. Though we don't need to worry about |
I am surprised to find that I think the same goes for complex types and multiple dimensions - those should be already using numpy. I daresay that in every case here, we have a bytes-like thing (from arrow, whatever) that we don't want to accidentally copy. In short, I am not keen to change the numpy side of things, even if it just amounts to moving code around a little, to support memoryviews that I'd be surprised if they got used. Prepard to be proven wrong... |
I think this goes back to the fact that the Python Buffer Protocol allows things like arrays of pointers. This was to make PIL developers happy with the PEP IIRC. So Was mostly thinking we could get more out of the same code. Though you may be right this seldom comes up in practice. Yeah I guess we wait and see if there's an issue. 🙂 |
Since pandas and numpy already have their own serialisation, someone would really have to be trying to break Dask :) |
Fixes #3640 (for discussion)
Does not create an intermediate bytes object for serialization, passes through the memoryview. That, of course, still gets sent as bytes on the wire, so for deserialization, we just wrap it.