Memoryview serialisation #3743

martindurant · 2020-04-27T15:02:07Z

Fixes #3640 (for discussion)

Serialization no longer creates an intermediate bytes object; the memoryview is passed through as-is. The data still gets sent as bytes on the wire, of course, so for deserialization we just wrap the received bytes in a memoryview.
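As a rough illustration of the pass-through idea described above (not the code in this PR), a serializer could hand the memoryview itself to the transport as the only frame. The function name `serialize_memoryview` and the empty header are assumptions made for this sketch.

```python
def serialize_memoryview(obj: memoryview):
    """Hypothetical stand-in for a registered memoryview serializer."""
    header = {}     # assumption: plain bytes-like data needs no metadata
    frames = [obj]  # pass the view through; no bytes(obj) copy is made
    return header, frames


# The frame still refers to the original buffer, so a later write to the
# buffer is visible through it. That is only possible if nothing was copied.
buf = bytearray(b"hello world")
header, frames = serialize_memoryview(memoryview(buf))
buf[0] = ord("J")
first_byte_seen_by_frame = frames[0][0]
```

Calling `bytes(obj)` inside the serializer would instead snapshot the data, which is the extra copy this change avoids.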

mrocklin · 2020-04-27T15:18:35Z

+1

TomAugspurger · 2020-04-27T15:20:00Z

distributed/protocol/serialize.py

@@ -568,6 +568,11 @@ def _deserialize_bytes(header, frames):
    return b"".join(frames)


+@dask_deserialize.register(memoryview)
+def _serialize_memoryview(header, frames):
+    return memoryview(b"".join(frames))


Is it worth special casing length-1 frames, since then you can just return frames[0] and avoid the copy? Maybe?
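A minimal sketch of the special case suggested above, assuming frames arrive as a list of bytes-like objects. The name `deserialize_memoryview` is hypothetical and this is not necessarily the code that was merged.

```python
def deserialize_memoryview(header, frames):
    """Hypothetical deserializer that skips the join for a single frame."""
    if len(frames) == 1:
        out = frames[0]         # wrap the frame directly
    else:
        out = b"".join(frames)  # several frames must be concatenated
    return memoryview(out)


single_frame = bytearray(b"abc")
wrapped = deserialize_memoryview({}, [single_frame])
joined = deserialize_memoryview({}, [b"ab", b"cd"])
```

In the single-frame case the returned view's `.obj` attribute is the frame itself, so no new buffer is allocated.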

TomAugspurger

Looks good other than the one question.

jakirkham · 2020-04-27T17:28:30Z

Thanks Martin!

One thing I'm curious about: there's a fair bit of logic in NumPy serialization that we seem to be skipping here. For example, it is possible to have Python objects, much like one can in NumPy arrays, so it might make sense to pickle in that case. Similarly, we may want logic to flatten N-D data, cast larger types, and split large frames, though we don't need to worry about datetime or timedelta objects, as they are not supported types in memoryviews. Given this, what do you think about repurposing this logic for memoryviews and allowing NumPy serialization to just piggyback on memoryview serialization?
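To make the flattening and frame-splitting part concrete, here is a small sketch using only the standard library. The helper `to_byte_frames` and its `max_frame_size` parameter are hypothetical names, and the sketch assumes a C-contiguous buffer, which `memoryview.cast` requires.

```python
from array import array


def to_byte_frames(mv: memoryview, max_frame_size: int):
    """Hypothetical helper: view the data as flat bytes, then slice it."""
    flat = mv.cast("B")  # reinterpret as 1-D unsigned bytes without copying
    return [
        flat[start:start + max_frame_size]  # slicing a memoryview is zero-copy
        for start in range(0, flat.nbytes, max_frame_size)
    ]


data = array("d", [1.0, 2.0, 3.0])  # 3 doubles, 24 bytes
frames = to_byte_frames(memoryview(data), max_frame_size=10)
```

Each frame is itself a memoryview onto the original buffer, so the split adds no copies of the payload.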

martindurant · 2020-04-27T17:45:25Z

it is possible to have Python objects

I am surprised to find that memoryview.format == "O" is indeed allowed. Does this actually ever happen? I suppose this just failed previously, or it never ever happened, but I am not sure.

I think the same goes for complex types and multiple dimensions - those should be already using numpy. I daresay that in every case here, we have a bytes-like thing (from arrow, whatever) that we don't want to accidentally copy.

In short, I am not keen to change the numpy side of things, even if it just amounts to moving code around a little, to support memoryviews that I'd be surprised to see used. Prepared to be proven wrong...

jakirkham · 2020-04-30T02:13:56Z

I think this goes back to the fact that the Python Buffer Protocol allows things like arrays of pointers. This was to make PIL developers happy with the PEP IIRC. So memoryview.format == "O" is one such case as one would have PyObject* in the array. In practice, this might come up when serializing a column of strings? 🤷‍♂️

Was mostly thinking we could get more out of the same code. Though you may be right this seldom comes up in practice.

Yeah I guess we wait and see if there's an issue. 🙂
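If the `format == "O"` case discussed above ever did need handling, one option would be a guard that detects such buffers before treating them as plain bytes. The sketch below is only an illustration: `holds_object_pointers` is a hypothetical helper, and a `ctypes.py_object` array is used purely as a standard-library way to obtain a buffer of `PyObject*` (a NumPy object array would be the more likely source in practice).

```python
import ctypes


def holds_object_pointers(mv: memoryview) -> bool:
    """Hypothetical check for buffers whose items are PyObject* pointers."""
    # Format strings may carry a byte-order prefix (for example "<O"),
    # so strip the prefix characters before comparing.
    return mv.format.lstrip("@=<>!") == "O"


object_buffer = (ctypes.py_object * 2)("a", "b")  # array of PyObject*
is_object_view = holds_object_pointers(memoryview(object_buffer))
is_plain_view = holds_object_pointers(memoryview(b"abc"))
```

Sending the raw bytes of such a buffer would transmit pointer values rather than the objects themselves, which is why a fallback such as pickling was floated for this case.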

martindurant · 2020-04-30T12:37:01Z

In practice, this might come up when serializing a column of strings?

Since pandas and numpy already have their own serialisation, someone would really have to be trying to break Dask :)

Martin Durant added 2 commits April 27, 2020 10:59

Serialise memview without copy

32c63b0

black

fa1b094

martindurant force-pushed the memview branch from ac782cc to fa1b094 Compare April 27, 2020 15:04

TomAugspurger reviewed Apr 27, 2020


Special-case deser memview with one frame

a74a6b8

martindurant merged commit 3de9973 into dask:master Apr 27, 2020

martindurant deleted the memview branch April 27, 2020 17:00

jakirkham mentioned this pull request Jul 1, 2020

Evaluate further serialization performance improvements rapidsai/dask-cuda#106

Closed
