Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): Implement collection serialization protocol #1942

Merged
merged 3 commits into from
Nov 18, 2024

Conversation

penguin-wwy
Copy link
Contributor

@penguin-wwy penguin-wwy commented Nov 13, 2024

What does this PR do?

Implement a new format for collection serialization in pyfury.

Related issues

Does this PR introduce any user-facing change?

  • Does this PR introduce any public API change?
  • Does this PR introduce any binary protocol compatibility change?

Benchmark

fury_tuple: Mean +- std dev: [base] 259 us +- 6 us -> [collection] 256 us +- 5 us: 1.01x faster
fury_large_tuple: Mean +- std dev: [base] 92.7 ms +- 5.5 ms -> [collection] 63.7 ms +- 4.8 ms: 1.46x faster
fury_list: Mean +- std dev: [base] 277 us +- 6 us -> [collection] 267 us +- 3 us: 1.04x faster
fury_large_list: Mean +- std dev: [base] 92.8 ms +- 5.3 ms -> [collection] 66.5 ms +- 3.0 ms: 1.40x faster

Geometric mean: 1.21x faster

@@ -1644,29 +1656,143 @@ cdef class CollectionSerializer(Serializer):
cpdef int16_t get_xtype_id(self):
return -FuryType.LIST.value

cpdef int8_t write_header(self, Buffer buffer, value):
cdef int8_t collect_flag = COLLECTION_DEFAULT_FLAG
elem_type = type(next(iter(value)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This create iterator object, maybe we could just init it as None here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to extract the type of the first element and avoid checking elem_type is None in the loop. Since the collection might be a set, an iterator retrieves it.

@@ -1644,29 +1656,143 @@ cdef class CollectionSerializer(Serializer):
cpdef int16_t get_xtype_id(self):
return -FuryType.LIST.value

cpdef int8_t write_header(self, Buffer buffer, value):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently golang didn't implement this protocol yet, could we enable it only for pure python serialization?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

write_header will only be called in Python, and the Xlang format has not been modified at the moment

@chaokunyang
Copy link
Collaborator

Hi @penguin-wwy , thanks for your hard work. This new protocol should be faster, you did't implement the new protocol fully. I left some comments, could you take a look at it?

@penguin-wwy
Copy link
Contributor Author

Benchmark data has been updated.

cdef MapRefResolver ref_resolver = self.ref_resolver
cdef ClassResolver class_resolver = self.class_resolver
if (collect_flag & COLLECTION_NOT_SAME_TYPE) == 0:
if elem_type is str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you still write null flag and type here, this should be skipped for same type and not-null elements. could you take org.apache.fury.serializer.collection.AbstractCollectionSerializer#writeSameTypeElements as a reference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Python, the type of None is NoneType, so if all elements are of the same type, either none of them are None or all of them are None.

>>> type(None)
<class 'NoneType'>
>>> type(object()) is type(None)
False

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In such cases, maybe we could write elements like this:

elem_type_ptr = xxx
for elem in list_value:
    if elem_type_ptr == str_type_ptr:
         buffer.write_string(elem)
    elif elem_type_ptr == int_type_ptr:
         buffer.write_varint64(elem)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, write the type flag only once for primitive types

Copy link
Collaborator

@chaokunyang chaokunyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks very much

@chaokunyang chaokunyang merged commit fb2172b into apache:main Nov 18, 2024
38 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants