Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add serialization methods for List and StructDtype #8441

Merged
merged 10 commits into from
Jun 21, 2021

Conversation

charlesbluca
Copy link
Member

Adds serialize/deserialize methods for List and StructDtypes, which I intend to use as part of #8153 when these dtypes are included in the PackedColumns object there.

@charlesbluca charlesbluca requested a review from a team as a code owner June 4, 2021 15:08
@github-actions github-actions bot added the Python Affects Python cuDF API. label Jun 4, 2021
@charlesbluca charlesbluca changed the title Add serialization methods for list and struct dtypes Add serialization methods for List and StructDtype Jun 4, 2021
@codecov
Copy link

codecov bot commented Jun 4, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@c086731). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.08    #8441   +/-   ##
===============================================
  Coverage                ?   83.00%           
===============================================
  Files                   ?      109           
  Lines                   ?    18215           
  Branches                ?        0           
===============================================
  Hits                    ?    15119           
  Misses                  ?     3096           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c086731...bf948a1. Read the comment docs.

@marlenezw marlenezw added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Jun 7, 2021
Copy link
Contributor

@marlenezw marlenezw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me Charles! Are you going to add some tests for these?
Overall looks good to merge.

Copy link
Member Author

@charlesbluca charlesbluca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll add some tests to test_pickling.py - thanks for the review @marlenezw 😃 one question:

@@ -12,6 +12,7 @@

import cudf
from cudf._typing import Dtype
from cudf.core.buffer import Buffer


class _BaseDtype(ExtensionDtype):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should _BaseDtype extend Serializable now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the changes, you've just added it should now. I'm actually not 100% sure what we should and shouldn't do with _BaseDtype. Ashwin is probably the right person to ask about this, though IMO this should be fine 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would agree. Our dtypes should probably all be Serializable.

@charlesbluca
Copy link
Member Author

It looks like serialization of dtype is already tested implicitly through the column serialization tests in test_serialize.py; the only exception is StructColumn, which does not have a working serialize() method, so I added a test for the dtype itself. If we ever add serialization for struct columns, this method can then be replaced.



class _BaseDtype(ExtensionDtype):
class _BaseDtype(ExtensionDtype, Serializable):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something seems weird about this. We're making all of our extension dtypes serializable, but I believe we end up needing to override serialize and deserialize for all of them (ListDtype, StructDtype, CategoricalDtype). To me that suggests either the parent class needs to be generalized to be able to do at least some of the common work between these child classes, or that this inheritance relationship just isn't quite right.

I am weakly -1 on doing this as part of this PR. I maybe it makes more sense to add the serialize/deserialize methods in this PR and then refactor the common code out either into Serializeable or something that goes in between Serializeable and _BaseDtype in a separate PR.

Copy link
Contributor

@shwina shwina Jun 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally, Serializable was an abstract base class, which forced all derived classes to implement serialize and deserialize. For performance reasons, we disabled that and made it a regular class. Now, derived classes must implement serialize and deserialize, but that is "only" by convention.

That being said, there's still very much value in inheriting from Serializable, as we get the methods host_serialize, device_serialize, host_deserialize, device_deserialize "for free" by the inheritance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I agree with @brandon-b-miller's objection to making this change, but not the reasons.

Serializable declares an interface, but leaves it up to subclasses to implement it. Whether or not certain subclasses (e.g. all dtypes) can share parts (or all) of that implementation isn't really relevant to whether or not the inheritance pattern makes sense. All that inheriting from Serializable does is indicate that if subclasses implement serialize and deserialize, it will be possible to do pickle.dumps(obj).

All of the *_(?:de)?serialize methods just exist to provide hooks into Serializable.__reduce_ex__, the method that actually enables serialization. My issue with using Serializable for dtype objects is that these hooks are all predicated on the assumption that a subclass of Serializable can be decomposed into some header of metadata a collection of frames, which isn't the case for dtypes. If you look at the contents of the methods implemented by Serializable, they're encoding a bunch of metadata that IMO isn't really appropriate for a dtype, but rather for typed memory buffers (e.g. the length of the array or whether it's stored in device memory).

That being the case, I think that it would be simpler and more appropriate to directly implement the pickling protocol (ideally via __getstate__ and __setstate__, but if not then via __reduce* methods) rather than trying to leverage Serializable. To @brandon-b-miller's point, if some of that logic can be shared between dtypes it would also be great to do that by implementing it at the level of _BaseDtype.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do see some merit to @brandon-b-miller's point of making subclasses of Serializable that generalize some of the common work that's happening in the serialization function, though I haven't really inspected those functions outside of the dtypes to see if there's a lot of intersection there - were you thinking something like SerializableDtype, SerializableFrame, etc...?

To @vyasr's point, I feel like implementing the pickling protocol for the dtypes themselves could result in redundant code, since it would essentially entail copying Serializable.__reduce_ex__ in _BaseDtype. Is there a downside to having host/device deserialization implemented for dtypes other than the fact that those functions aren't really appropriate?

Also feel like that scenario gives more motivation for making subclasses of Serializable, as we could have subclasses that include/exclude the functions we consider inappropriate for their derived classes (such as the host/device serialization).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Serializable is less about making objects picklable and more about serializing objects according to the Dask serialization protocol. The *serialize methods are absolutely required here in order for dtype objects to be able to be sent efficiently across the wire.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true though that most dtypes really are composed only of metadata. The exception being CategoricalDtype, which for compatibility with Pandas, encapsulates also a column of categories (residing on the device).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having read a little more I'm comfortable 👍 -ing here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I see now that we're registering Serializable's methods to dask.distributed in cudf/comm/serialize.py. It does seem like we could simplify the specifics of the serialization protocol for dtypes since they are (almost) entirely metadata and not data, but for now I think moving forward with this approach is fine for now.

},
],
)
def test_serialize_categorical_columns(data):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this be moved to test_categorical.py

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good - is it okay if I go ahead and move the other column serialization tests to their corresponding files?

Copy link
Contributor

@brandon-b-miller brandon-b-miller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit otherwise lgtm. Nice work,

@shwina
Copy link
Contributor

shwina commented Jun 19, 2021

Rerun tests

@charlesbluca
Copy link
Member Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit f71c6fe into rapidsai:branch-21.08 Jun 21, 2021
@charlesbluca charlesbluca deleted the add-dtype-serialization branch August 3, 2021 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants