Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Protocol] Multi-language row format support #314

Open
chaokunyang opened this issue May 21, 2023 · 0 comments
Open

[Protocol] Multi-language row format support #314

chaokunyang opened this issue May 21, 2023 · 0 comments
Labels
enhancement New feature or request protocol

Comments

@chaokunyang
Copy link
Collaborator

chaokunyang commented May 21, 2023

Is your feature request related to a problem? Please describe.

Row format is a comon requirement for big data processing system. Althrough columar format such as arrow has many advantages, the streamign system still needs to use row format due to the batch latency introduced in columar batch.

On ther other hand, operations such as hash join are more performant using row foramt in stead of column format.

The systems such as spark/flink/doris/etc.. all implemented the row format in a similar way. The repeated reimplementation is not an easy work, and if the system has multi-language requirements, the row format will need to be reimplemented multiple times. This make the work more tricky. If those systems want to cooperate with each other, the users will have to write code to convert to each other, which makes the stuff more tricky.

Binary protocol is hard to implement right, the offset management in row format is complex, not to mention implement it in multiple lanuages.

The columar format has been standardized by apache arrow. But there hasn't been any projects targeting to standardizing row format in a independent. Maybe it's an opportunity for fury to fill up this gap.

Arrow:

Fury:

Row format can be used for core data structure for high performance distributed big data processing system. It can be used as protocol for cross-language serialization.
The object graph in #180 is a dynamic serialization protocol, which write meta into stream too, the row format is a strong-typed protocol, which the meta are not serialized into the stream, the cost may be small if we don't padding fields whose size is less than 8 bytes.

Describe the solution you'd like

Spark tungsten is mature row format, its design and implementation can be take as a goog reference.

The fury row format can differs at:

  • Use arrow schema to describe meta.
    • reduce meta definition and implementation
    • Out-of-box meta cross-language serialization
    • Better arrow interoperability
  • String support ascii/utf16/utf8 encoding for fast string encoding and zero-copy.
  • Decimal use arrow decimal format.
  • Variable-size field can be inline in fixed-size region if small enough.
  • Allow skip padding in the future. spark use 8 byte for /bool/byte/short/int/float, which is space wasting
  • Support adding fields without breaking compatibility
    • Backward compatibility is necessary for online service.
  • The implementation support java/C++/python/golang/javascript/rust/etc..
@chaokunyang chaokunyang added enhancement New feature or request protocol labels May 21, 2023
This was referenced May 21, 2023
@chaokunyang chaokunyang pinned this issue May 21, 2023
@rainsonGain rainsonGain unpinned this issue May 30, 2023
@chaokunyang chaokunyang pinned this issue Jun 26, 2023
@chaokunyang chaokunyang unpinned this issue Oct 20, 2023
@chaokunyang chaokunyang pinned this issue Oct 20, 2023
@chaokunyang chaokunyang unpinned this issue Nov 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request protocol
Projects
None yet
Development

No branches or pull requests

1 participant