Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issues with _CPackedColumns.serialize() handling of host and device data #8759

Merged
merged 6 commits into from
Jul 19, 2021

Conversation

charlesbluca
Copy link
Member

@charlesbluca charlesbluca commented Jul 16, 2021

A few changes in here to resolve some test failures using pack/unpack with DataFrame serialization:

  • We now use a Buffer in frames to represent the device data, so that Dask can correctly perform the necessary DtoD transfers when moving a packed columns object between devices
  • With Fix a crash in pack() when being handed tables with no columns. #8697 merged, the results of packing an empty DataFrame will set the pointer to the host data to NULL; since Cython cannot make a memoryview from NULL, we now check for this condition before making the host data array
  • The serialized type is now included in the serialized header

@charlesbluca charlesbluca added bug Something isn't working 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. non-breaking Non-breaking change labels Jul 16, 2021
@charlesbluca charlesbluca requested a review from a team as a code owner July 16, 2021 18:14
Copy link
Contributor

@isVoid isVoid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small Q

@@ -822,12 +823,10 @@ cdef class _CPackedColumns:

def serialize(self):
header = {}
frames = []
frames = [Buffer(self.gpu_data_ptr, self.gpu_data_size, self)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to directly call Buffer.serialize() (and subsequently Buffer.deserialize() below) ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to leave this as a Buffer so we are able to inherit the additional methods of Serializable; see this assertation in Serializable.device_serialize():

assert all(
(type(f) in [cudf.core.buffer.Buffer, memoryview]) for f in frames
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if this is a dumb question, could you elaborate: why Buffer.deserialize() fails to inherit the additional methods?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah my mistake I was thinking of just passing the output of Buffer.deserialize() into frames, which wouldn't work because of the assertion above - you are correct, we should be serializing the Buffer here, and add the resulting header to the PackedColumns header.

@codecov
Copy link

codecov bot commented Jul 16, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@b011299). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 53c45ca differs from pull request most recent head ee4af28. Consider uploading reports for the commit ee4af28 to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.08    #8759   +/-   ##
===============================================
  Coverage                ?   10.52%           
===============================================
  Files                   ?      116           
  Lines                   ?    18532           
  Branches                ?        0           
===============================================
  Hits                    ?     1951           
  Misses                  ?    16581           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b011299...ee4af28. Read the comment docs.

@charlesbluca charlesbluca changed the title Store GPU data in Buffer for PackedColumns serialization Fix issues with _CPackedColumns.serialize() handling of host and device data Jul 19, 2021
@charlesbluca
Copy link
Member Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit cfd3d5d into rapidsai:branch-21.08 Jul 19, 2021
@charlesbluca charlesbluca deleted the fix-pack-serialization branch August 3, 2021 17:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants