-
Notifications
You must be signed in to change notification settings - Fork 258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(python): Implement collection serialization protocol #1942
Conversation
@@ -1644,29 +1656,143 @@ cdef class CollectionSerializer(Serializer): | |||
cpdef int16_t get_xtype_id(self): | |||
return -FuryType.LIST.value | |||
|
|||
cpdef int8_t write_header(self, Buffer buffer, value): | |||
cdef int8_t collect_flag = COLLECTION_DEFAULT_FLAG | |||
elem_type = type(next(iter(value))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This create iterator object, maybe we could just init it as None here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to extract the type of the first element and avoid checking elem_type is None
in the loop. Since the collection might be a set, an iterator retrieves it.
python/pyfury/_serialization.pyx
Outdated
@@ -1644,29 +1656,143 @@ cdef class CollectionSerializer(Serializer): | |||
cpdef int16_t get_xtype_id(self): | |||
return -FuryType.LIST.value | |||
|
|||
cpdef int8_t write_header(self, Buffer buffer, value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently golang didn't implement this protocol yet, could we enable it only for pure python serialization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
write_header
will only be called in Python
, and the Xlang
format has not been modified at the moment
Hi @penguin-wwy , thanks for your hard work. This new protocol should be faster, you did't implement the new protocol fully. I left some comments, could you take a look at it? |
Benchmark data has been updated. |
cdef MapRefResolver ref_resolver = self.ref_resolver | ||
cdef ClassResolver class_resolver = self.class_resolver | ||
if (collect_flag & COLLECTION_NOT_SAME_TYPE) == 0: | ||
if elem_type is str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you still write null flag and type here, this should be skipped for same type and not-null elements. could you take org.apache.fury.serializer.collection.AbstractCollectionSerializer#writeSameTypeElements
as a reference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Python, the type of None
is NoneType
, so if all elements are of the same type, either none of them are None
or all of them are None
.
>>> type(None)
<class 'NoneType'>
>>> type(object()) is type(None)
False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In such cases, maybe we could write elements like this:
elem_type_ptr = xxx
for elem in list_value:
if elem_type_ptr == str_type_ptr:
buffer.write_string(elem)
elif elem_type_ptr == int_type_ptr:
buffer.write_varint64(elem)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, write the type flag only once for primitive types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks very much
What does this PR do?
Implement a new format for collection serialization in pyfury.
Related issues
Does this PR introduce any user-facing change?
Benchmark