Add serialization methods for List and StructDtype #8441
Codecov Report
@@ Coverage Diff @@
## branch-21.08 #8441 +/- ##
===============================================
Coverage ? 83.00%
===============================================
Files ? 109
Lines ? 18215
Branches ? 0
===============================================
Hits ? 15119
Misses ? 3096
Partials ? 0
This looks good to me Charles! Are you going to add some tests for these?
Overall looks good to merge.
Yeah, I'll add some tests to `test_pickling.py`. Thanks for the review, @marlenezw 😃 One question:
python/cudf/cudf/core/dtypes.py
@@ -12,6 +12,7 @@

import cudf
from cudf._typing import Dtype
from cudf.core.buffer import Buffer


class _BaseDtype(ExtensionDtype):
Should `_BaseDtype` extend `Serializable` now?
With the changes you've just added, it should now. I'm actually not 100% sure what we should and shouldn't do with `_BaseDtype`. Ashwin is probably the right person to ask about this, though IMO this should be fine 😄
Would agree. Our dtypes should probably all be `Serializable`.
It looks like serialization of dtype is already tested implicitly through the column serialization tests in

-class _BaseDtype(ExtensionDtype):
+class _BaseDtype(ExtensionDtype, Serializable):
Something seems weird about this. We're making all of our extension dtypes serializable, but I believe we end up needing to override `serialize` and `deserialize` for all of them (`ListDtype`, `StructDtype`, `CategoricalDtype`). To me that suggests either the parent class needs to be generalized to be able to do at least some of the common work between these child classes, or that this inheritance relationship just isn't quite right.

I am weakly -1 on doing this as part of this PR. Maybe it makes more sense to add the `serialize`/`deserialize` methods in this PR and then refactor the common code out, either into `Serializable` or into something that goes in between `Serializable` and `_BaseDtype`, in a separate PR.
Originally, `Serializable` was an abstract base class, which forced all derived classes to implement `serialize` and `deserialize`. For performance reasons, we disabled that and made it a regular class. Now, derived classes must implement `serialize` and `deserialize`, but that is "only" by convention.
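
To make the difference concrete, here is a minimal sketch of the two styles. The class names `AbstractSerializable`, `ConventionSerializable`, and the `Incomplete*` subclasses are hypothetical stand-ins and are not cuDF's actual code; the sketch only shows when a missing `serialize` is detected in each case.

```python
from abc import ABC, abstractmethod


class AbstractSerializable(ABC):
    """Hypothetical ABC flavour: the interface is enforced at instantiation."""

    @abstractmethod
    def serialize(self):
        ...

    @classmethod
    @abstractmethod
    def deserialize(cls, header, frames):
        ...


class ConventionSerializable:
    """Hypothetical regular-class flavour: the interface is a convention."""

    def serialize(self):
        raise NotImplementedError("subclasses are expected to override this")

    @classmethod
    def deserialize(cls, header, frames):
        raise NotImplementedError("subclasses are expected to override this")


class IncompleteAbstract(AbstractSerializable):
    pass  # forgot to implement serialize/deserialize


class IncompleteConvention(ConventionSerializable):
    pass  # forgot to implement serialize/deserialize


# With the ABC, the mistake surfaces as soon as an instance is created.
try:
    IncompleteAbstract()
    abstract_error = None
except TypeError as exc:
    abstract_error = exc

# With the regular class, instantiation succeeds and the mistake only
# surfaces when serialize() is actually called.
obj = IncompleteConvention()
try:
    obj.serialize()
    convention_error = None
except NotImplementedError as exc:
    convention_error = exc
```

In the second style nothing stops an incomplete subclass from being constructed, which is the "only by convention" trade-off described above.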
That being said, there's still very much value in inheriting from `Serializable`, as we get the methods `host_serialize`, `device_serialize`, `host_deserialize`, `device_deserialize` "for free" by the inheritance.
I think I agree with @brandon-b-miller's objection to making this change, but not the reasons.

`Serializable` declares an interface, but leaves it up to subclasses to implement it. Whether or not certain subclasses (e.g. all dtypes) can share parts (or all) of that implementation isn't really relevant to whether or not the inheritance pattern makes sense. All that inheriting from `Serializable` does is indicate that if subclasses implement `serialize` and `deserialize`, it will be possible to do `pickle.dumps(obj)`.

All of the `*_(?:de)?serialize` methods just exist to provide hooks into `Serializable.__reduce_ex__`, the method that actually enables serialization. My issue with using `Serializable` for dtype objects is that these hooks are all predicated on the assumption that a subclass of `Serializable` can be decomposed into some `header` of metadata and a collection of `frames`, which isn't the case for dtypes. If you look at the contents of the methods implemented by `Serializable`, they're encoding a bunch of metadata that IMO isn't really appropriate for a dtype, but rather for typed memory buffers (e.g. the length of the array or whether it's stored in device memory).

That being the case, I think that it would be simpler and more appropriate to directly implement the pickling protocol (ideally via `__getstate__` and `__setstate__`, but if not then via `__reduce*` methods) rather than trying to leverage `Serializable`. To @brandon-b-miller's point, if some of that logic can be shared between dtypes it would also be great to do that by implementing it at the level of `_BaseDtype`.
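
For reference, a minimal sketch of what implementing the pickling protocol directly could look like for a metadata-only dtype. `ToyStructDtype` and its `fields` attribute are hypothetical and only stand in for a real dtype; this is not how cuDF implements `StructDtype`.

```python
import pickle


class ToyStructDtype:
    """Hypothetical metadata-only dtype used purely for illustration."""

    def __init__(self, fields):
        # Mapping of field name -> type name, e.g. {"a": "int32"}.
        self.fields = dict(fields)

    def __eq__(self, other):
        return isinstance(other, ToyStructDtype) and self.fields == other.fields

    def __getstate__(self):
        # Everything needed to rebuild the dtype is plain metadata,
        # so the state is just a small picklable dict.
        return {"fields": list(self.fields.items())}

    def __setstate__(self, state):
        # Called by pickle on a freshly created instance; __init__ is skipped.
        self.fields = dict(state["fields"])


dtype = ToyStructDtype({"a": "int32", "b": "str"})
restored = pickle.loads(pickle.dumps(dtype))
```

The round trip needs no header or frames at all, which is the simplification being argued for here; the trade-off is that it does not by itself plug into the `*_serialize` hooks.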
I do see some merit to @brandon-b-miller's point of making subclasses of `Serializable` that generalize some of the common work that's happening in the serialization function, though I haven't really inspected those functions outside of the dtypes to see if there's a lot of intersection there. Were you thinking something like `SerializableDtype`, `SerializableFrame`, etc.?

To @vyasr's point, I feel like implementing the pickling protocol for the dtypes themselves could result in redundant code, since it would essentially entail copying `Serializable.__reduce_ex__` in `_BaseDtype`. Is there a downside to having host/device deserialization implemented for dtypes other than the fact that those functions aren't really appropriate?

I also feel like that scenario gives more motivation for making subclasses of `Serializable`, as we could have subclasses that include/exclude the functions we consider inappropriate for their derived classes (such as the host/device serialization).
`Serializable` is less about making objects picklable and more about serializing objects according to the Dask serialization protocol. The `*serialize` methods are absolutely required here in order for dtype objects to be able to be sent efficiently across the wire.
It's true, though, that most dtypes really are composed only of metadata. The exception is `CategoricalDtype`, which, for compatibility with pandas, also encapsulates a column of categories (residing on the device).
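
A rough sketch of how the header/frames split plays out for a dtype that does carry data. `ToyCategoricalDtype` is hypothetical, and a host-side `array` stands in for the device column of categories; the real cuDF classes and buffers differ.

```python
from array import array


class ToyCategoricalDtype:
    """Hypothetical dtype holding both metadata and a buffer of categories."""

    def __init__(self, categories, ordered=False):
        # A host array of 64-bit ints stands in for a device column here.
        self.categories = array("q", categories)
        self.ordered = ordered

    def serialize(self):
        # Small metadata goes in the header; bulk data goes in frames
        # so that it can be shipped without being copied into the header.
        header = {"ordered": self.ordered, "typecode": self.categories.typecode}
        frames = [memoryview(self.categories).cast("B")]
        return header, frames

    @classmethod
    def deserialize(cls, header, frames):
        categories = array(header["typecode"])
        categories.frombytes(bytes(frames[0]))
        return cls(categories.tolist(), ordered=header["ordered"])


dtype = ToyCategoricalDtype([10, 20, 30], ordered=True)
header, frames = dtype.serialize()
restored = ToyCategoricalDtype.deserialize(header, frames)
```

A metadata-only dtype would return an empty `frames` list from the same interface, so one protocol can cover both cases.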
Having read a little more I'm comfortable 👍 -ing here.
Got it, I see now that we're registering `Serializable`'s methods to `dask.distributed` in `cudf/comm/serialize.py`. It does seem like we could simplify the specifics of the serialization protocol for dtypes since they are (almost) entirely metadata and not data, but I think moving forward with this approach is fine for now.
},
],
)
def test_serialize_categorical_columns(data):
Could this be moved to `test_categorical.py`?
Sounds good - is it okay if I go ahead and move the other column serialization tests to their corresponding files?
One nit otherwise lgtm. Nice work,
Rerun tests
@gpucibot merge
Adds `serialize`/`deserialize` methods for `List` and `StructDtype`s, which I intend to use as part of #8153 when these dtypes are included in the `PackedColumns` object there.