Add serialization methods for `List` and `StructDtype` #8441

charlesbluca · 2021-06-04T15:08:14Z

Adds serialize/deserialize methods for List and StructDtypes, which I intend to use as part of #8153 when these dtypes are included in the PackedColumns object there.

codecov · 2021-06-04T17:37:56Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@c086731). Click here to learn what that means.
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.08    #8441   +/-   ##
===============================================
  Coverage                ?   83.00%           
===============================================
  Files                   ?      109           
  Lines                   ?    18215           
  Branches                ?        0           
===============================================
  Hits                    ?    15119           
  Misses                  ?     3096           
  Partials                ?        0

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c086731...bf948a1. Read the comment docs.

marlenezw

This looks good to me Charles! Are you going to add some tests for these?
Overall looks good to merge.

charlesbluca

Yeah, I'll add some tests to test_pickling.py - thanks for the review @marlenezw 😃 one question:

charlesbluca · 2021-06-08T16:17:32Z

python/cudf/cudf/core/dtypes.py

@@ -12,6 +12,7 @@

 import cudf
 from cudf._typing import Dtype
+from cudf.core.buffer import Buffer


 class _BaseDtype(ExtensionDtype):


Should _BaseDtype extend Serializable now?

With the changes, you've just added it should now. I'm actually not 100% sure what we should and shouldn't do with _BaseDtype. Ashwin is probably the right person to ask about this, though IMO this should be fine 😄

Would agree. Our dtypes should probably all be Serializable.

charlesbluca · 2021-06-09T17:49:47Z

It looks like serialization of dtype is already tested implicitly through the column serialization tests in test_serialize.py; the only exception is StructColumn, which does not have a working serialize() method, so I added a test for the dtype itself. If we ever add serialization for struct columns, this method can then be replaced.

brandon-b-miller · 2021-06-14T19:58:53Z

python/cudf/cudf/core/dtypes.py



-class _BaseDtype(ExtensionDtype):
+class _BaseDtype(ExtensionDtype, Serializable):


Something seems weird about this. We're making all of our extension dtypes serializable, but I believe we end up needing to override serialize and deserialize for all of them (ListDtype, StructDtype, CategoricalDtype). To me that suggests either the parent class needs to be generalized to be able to do at least some of the common work between these child classes, or that this inheritance relationship just isn't quite right.

I am weakly -1 on doing this as part of this PR. I maybe it makes more sense to add the serialize/deserialize methods in this PR and then refactor the common code out either into Serializeable or something that goes in between Serializeable and _BaseDtype in a separate PR.

Originally, Serializable was an abstract base class, which forced all derived classes to implement serialize and deserialize. For performance reasons, we disabled that and made it a regular class. Now, derived classes must implement serialize and deserialize, but that is "only" by convention.

That being said, there's still very much value in inheriting from Serializable, as we get the methods host_serialize, device_serialize, host_deserialize, device_deserialize "for free" by the inheritance.

I think I agree with @brandon-b-miller's objection to making this change, but not the reasons.

Serializable declares an interface, but leaves it up to subclasses to implement it. Whether or not certain subclasses (e.g. all dtypes) can share parts (or all) of that implementation isn't really relevant to whether or not the inheritance pattern makes sense. All that inheriting from Serializable does is indicate that if subclasses implement serialize and deserialize, it will be possible to do pickle.dumps(obj).

All of the *_(?:de)?serialize methods just exist to provide hooks into Serializable.__reduce_ex__, the method that actually enables serialization. My issue with using Serializable for dtype objects is that these hooks are all predicated on the assumption that a subclass of Serializable can be decomposed into some header of metadata a collection of frames, which isn't the case for dtypes. If you look at the contents of the methods implemented by Serializable, they're encoding a bunch of metadata that IMO isn't really appropriate for a dtype, but rather for typed memory buffers (e.g. the length of the array or whether it's stored in device memory).

That being the case, I think that it would be simpler and more appropriate to directly implement the pickling protocol (ideally via __getstate__ and __setstate__, but if not then via __reduce* methods) rather than trying to leverage Serializable. To @brandon-b-miller's point, if some of that logic can be shared between dtypes it would also be great to do that by implementing it at the level of _BaseDtype.

I do see some merit to @brandon-b-miller's point of making subclasses of Serializable that generalize some of the common work that's happening in the serialization function, though I haven't really inspected those functions outside of the dtypes to see if there's a lot of intersection there - were you thinking something like SerializableDtype, SerializableFrame, etc...?

To @vyasr's point, I feel like implementing the pickling protocol for the dtypes themselves could result in redundant code, since it would essentially entail copying Serializable.__reduce_ex__ in _BaseDtype. Is there a downside to having host/device deserialization implemented for dtypes other than the fact that those functions aren't really appropriate?

Also feel like that scenario gives more motivation for making subclasses of Serializable, as we could have subclasses that include/exclude the functions we consider inappropriate for their derived classes (such as the host/device serialization).

Serializable is less about making objects picklable and more about serializing objects according to the Dask serialization protocol. The *serialize methods are absolutely required here in order for dtype objects to be able to be sent efficiently across the wire.

It's true though that most dtypes really are composed only of metadata. The exception being CategoricalDtype, which for compatibility with Pandas, encapsulates also a column of categories (residing on the device).

Having read a little more I'm comfortable 👍 -ing here.

Got it, I see now that we're registering Serializable's methods to dask.distributed in cudf/comm/serialize.py. It does seem like we could simplify the specifics of the serialization protocol for dtypes since they are (almost) entirely metadata and not data, but for now I think moving forward with this approach is fine for now.

python/cudf/cudf/tests/test_serialize.py

…erialization

brandon-b-miller · 2021-06-18T21:11:48Z

python/cudf/cudf/tests/test_serialize.py

+        },
+    ],
+)
+def test_serialize_categorical_columns(data):


could this be moved to test_categorical.py

Sounds good - is it okay if I go ahead and move the other column serialization tests to their corresponding files?

brandon-b-miller

One nit otherwise lgtm. Nice work,

shwina · 2021-06-19T11:12:59Z

Rerun tests

charlesbluca · 2021-06-21T13:47:13Z

@gpucibot merge

Add serialization methods for list and struct dtypes

27ac11d

charlesbluca requested a review from a team as a code owner June 4, 2021 15:08

charlesbluca requested review from marlenezw and skirui-source June 4, 2021 15:08

github-actions bot added the Python Affects Python cuDF API. label Jun 4, 2021

charlesbluca changed the title ~~Add serialization methods for list and struct dtypes~~ Add serialization methods for List and StructDtype Jun 4, 2021

marlenezw added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Jun 7, 2021

marlenezw approved these changes Jun 8, 2021

View reviewed changes

charlesbluca commented Jun 8, 2021

View reviewed changes

charlesbluca added 5 commits June 9, 2021 11:40

Refactor dtype serialization to always use type-serialized

99192fd

Use list dtype serialization for list column serialization

52c9054

Add tests for categorical columns, struct dtypes

b899fb2

Add handling for pyarrow -> cudf decimal type

07174e3

Have dtypes extend Serializable

af342ff

brandon-b-miller reviewed Jun 14, 2021

View reviewed changes

python/cudf/cudf/tests/test_serialize.py Outdated Show resolved Hide resolved

charlesbluca added 3 commits June 14, 2021 18:07

Move struct dtype test

7cb557a

Merge remote-tracking branch 'upstream/branch-21.08' into add-dtype-s…

b42cb79

…erialization

Merge remote-tracking branch 'upstream/branch-21.08' into add-dtype-s…

74c6661

…erialization

This was referenced Jun 16, 2021

Expose pack/unpack API to Python #8153

Merged

Expose a Decimal32Dtype in cuDF Python #8438

Merged

brandon-b-miller reviewed Jun 18, 2021

View reviewed changes

brandon-b-miller approved these changes Jun 18, 2021

View reviewed changes

Relocate serialization functions to correpsonding column tests

bf948a1

rapids-bot bot merged commit f71c6fe into rapidsai:branch-21.08 Jun 21, 2021

charlesbluca deleted the add-dtype-serialization branch August 3, 2021 17:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add serialization methods for `List` and `StructDtype` #8441

Add serialization methods for `List` and `StructDtype` #8441

charlesbluca commented Jun 4, 2021

codecov bot commented Jun 4, 2021 •

edited

Loading

marlenezw left a comment

charlesbluca left a comment

charlesbluca Jun 8, 2021

marlenezw Jun 10, 2021

shwina Jun 10, 2021

charlesbluca commented Jun 9, 2021

brandon-b-miller Jun 14, 2021

shwina Jun 14, 2021 •

edited

Loading

vyasr Jun 14, 2021

charlesbluca Jun 16, 2021

shwina Jun 16, 2021

shwina Jun 16, 2021

brandon-b-miller Jun 16, 2021

vyasr Jun 16, 2021

brandon-b-miller Jun 18, 2021

charlesbluca Jun 18, 2021

brandon-b-miller left a comment

shwina commented Jun 19, 2021

charlesbluca commented Jun 21, 2021



		class _BaseDtype(ExtensionDtype):
		class _BaseDtype(ExtensionDtype, Serializable):

Add serialization methods for List and StructDtype #8441

Add serialization methods for List and StructDtype #8441

Conversation

charlesbluca commented Jun 4, 2021

codecov bot commented Jun 4, 2021 • edited Loading

Codecov Report

marlenezw left a comment

Choose a reason for hiding this comment

charlesbluca left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

charlesbluca commented Jun 9, 2021

Choose a reason for hiding this comment

shwina Jun 14, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

brandon-b-miller left a comment

Choose a reason for hiding this comment

shwina commented Jun 19, 2021

charlesbluca commented Jun 21, 2021

Add serialization methods for `List` and `StructDtype` #8441

Add serialization methods for `List` and `StructDtype` #8441

codecov bot commented Jun 4, 2021 •

edited

Loading

shwina Jun 14, 2021 •

edited

Loading