Memoryview serialisation #3743

Merged
merged 3 commits into from
Apr 27, 2020

Conversation

martindurant
Member

Fixes #3640 (for discussion)

Does not create an intermediate bytes object for serialization; the memoryview is passed through as-is. It still gets sent as bytes on the wire, of course, so for deserialization we just wrap the received bytes in a memoryview.
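A minimal sketch of the approach described above, using only the standard library. The function names `serialize_memoryview` and `deserialize_memoryview` are hypothetical stand-ins, not the functions registered in Dask, and the `bytes(f)` step is only an assumed stand-in for whatever the transport does to each frame.

```python
def serialize_memoryview(obj):
    # Hypothetical helper: the memoryview itself becomes the only frame,
    # so no intermediate bytes object is created on the sending side.
    header = {}
    frames = [obj]
    return header, frames


def deserialize_memoryview(header, frames):
    # Frames arrive as bytes-like objects; join them and wrap the result.
    return memoryview(b"".join(frames))


payload = bytearray(b"hello world")
view = memoryview(payload)

header, frames = serialize_memoryview(view)

# Assumed stand-in for the wire: each frame ends up as plain bytes.
wire_frames = [bytes(f) for f in frames]

result = deserialize_memoryview(header, wire_frames)
```

The receiver gets a memoryview back rather than bytes, so the type round-trips even though the transport only ever sees bytes.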

@mrocklin
Member

+1

@@ -568,6 +568,11 @@ def _deserialize_bytes(header, frames):
    return b"".join(frames)


@dask_deserialize.register(memoryview)
def _serialize_memoryview(header, frames):
    return memoryview(b"".join(frames))
Member

Is it worth special casing length-1 frames, since then you can just return frames[0] and avoid the copy? Maybe?
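The suggestion could look roughly like the sketch below. It is a standalone illustration, not the code that was merged: the function name is hypothetical, and wrapping the single frame in `memoryview(...)` instead of returning `frames[0]` directly is a choice made here so that the return type is always a memoryview. Wrapping a bytes-like object in a memoryview does not copy its data.

```python
def deserialize_memoryview(header, frames):
    # Hypothetical sketch of the length-1 special case.
    if len(frames) == 1:
        # A memoryview over a bytes-like object shares its buffer,
        # so the join (and its copy) is skipped entirely.
        return memoryview(frames[0])
    # Several frames have to be concatenated, which copies once.
    return memoryview(b"".join(frames))


single = bytearray(b"abc")
shared = deserialize_memoryview({}, [single])

joined = deserialize_memoryview({}, [b"ab", b"cd"])
```

Because `shared` is a view onto `single`, a later write to `single` is visible through `shared`, which is a simple way to confirm that no copy took place.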

Member Author

Good point

Member

@TomAugspurger TomAugspurger left a comment

Looks good other than the one question.

@martindurant martindurant merged commit 3de9973 into dask:master Apr 27, 2020
@martindurant martindurant deleted the memview branch April 27, 2020 17:00
@jakirkham
Member

Thanks Martin!

One thing I'm curious about: there's a fair bit of logic in NumPy serialization that we seem to be skipping here. For example, a memoryview can hold Python objects, much as a NumPy array can, so it might make sense to pickle in that case. Similarly, we may want logic to flatten N-D data, cast larger types, and split large frames. We don't need to worry about datetime or timedelta objects, though, as they are not supported types in memoryviews. Given this, what do you think about repurposing this logic for memoryviews and allowing NumPy serialization to just piggyback on memoryview serialization?
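As a rough illustration of what flattening, casting, and splitting could mean for a plain memoryview, here is a standard-library sketch. The helpers `to_flat_bytes` and `split_frames` are hypothetical names, this is one possible reading of the steps listed above, and none of it is taken from Dask's NumPy serialization code.

```python
from array import array


def to_flat_bytes(mv):
    # Hypothetical helper: reinterpret a C-contiguous memoryview of any
    # native item type and shape as a flat 1-D view of unsigned bytes.
    # memoryview.cast does not copy the underlying buffer.
    if not mv.c_contiguous:
        raise ValueError("only C-contiguous buffers can be cast without a copy")
    return mv.cast("B")


def split_frames(mv, max_size):
    # Hypothetical helper: slice a flat byte view into frames of at most
    # max_size bytes. Slicing a memoryview yields views, not copies.
    return [mv[i:i + max_size] for i in range(0, mv.nbytes, max_size)]


doubles = array("d", [1.0, 2.0, 3.0])
flat = to_flat_bytes(memoryview(doubles))   # 3 items of 8 bytes each
chunks = split_frames(flat, 10)

# A 2-D byte view is flattened back to 1-D in the same way.
grid = memoryview(bytes(6)).cast("B", shape=[2, 3])
flat_grid = to_flat_bytes(grid)
```

All of these steps produce views over the original buffer, so they fit the goal of not copying data before it reaches the wire.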

@martindurant
Member Author

it is possible to have Python objects

I am surprised to find that memoryview.format == "O" is indeed allowed. Does this actually ever happen? I suppose this just failed previously, or it never ever happened, but I am not sure.

I think the same goes for complex types and multiple dimensions - those should be already using numpy. I daresay that in every case here, we have a bytes-like thing (from arrow, whatever) that we don't want to accidentally copy.

In short, I am not keen to change the numpy side of things, even if it just amounts to moving code around a little, to support kinds of memoryview that I would be surprised to see used. Prepared to be proven wrong...

@jakirkham
Member

I think this goes back to the fact that the Python Buffer Protocol allows things like arrays of pointers. This was to make PIL developers happy with the PEP IIRC. So memoryview.format == "O" is one such case as one would have PyObject* in the array. In practice, this might come up when serializing a column of strings? 🤷‍♂️
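For a concrete case of an array of `PyObject*` exposed through the buffer protocol, a `ctypes` array of `py_object` can be used without NumPy. The sketch below is only an illustration: the helper `holds_python_objects` is a hypothetical name, and the format string is checked loosely because `ctypes` prefixes it with a byte-order character.

```python
import ctypes


def holds_python_objects(mv):
    # Hypothetical check: the struct-style format code "O" marks items
    # that are pointers to Python objects rather than plain data.
    return "O" in mv.format


# An array of two PyObject* slots, exported via the buffer protocol.
objs = (ctypes.py_object * 2)("a", "b")
obj_view = memoryview(objs)

byte_view = memoryview(b"ab")

needs_pickle = holds_python_objects(obj_view)
plain_bytes = not holds_python_objects(byte_view)
```

The raw bytes behind `obj_view` are pointers that mean nothing outside the current process, which is why a serializer would have to fall back to something like pickle for such a buffer rather than send its frames as-is.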

Was mostly thinking we could get more out of the same code. Though you may be right this seldom comes up in practice.

Yeah I guess we wait and see if there's an issue. 🙂

@martindurant
Member Author

In practice, this might come up when serializing a column of strings?

Since pandas and numpy already have their own serialisation, someone would really have to be trying to break Dask :)

Successfully merging this pull request may close these issues.

Efficient serialization of memoryviews