[Protocol] Multi-language row format support #314

chaokunyang · 2023-05-21T01:35:01Z

Is your feature request related to a problem? Please describe.

Row format is a comon requirement for big data processing system. Althrough columar format such as arrow has many advantages, the streamign system still needs to use row format due to the batch latency introduced in columar batch.

On ther other hand, operations such as hash join are more performant using row foramt in stead of column format.

The systems such as spark/flink/doris/etc.. all implemented the row format in a similar way. The repeated reimplementation is not an easy work, and if the system has multi-language requirements, the row format will need to be reimplemented multiple times. This make the work more tricky. If those systems want to cooperate with each other, the users will have to write code to convert to each other, which makes the stuff more tricky.

Binary protocol is hard to implement right, the offset management in row format is complex, not to mention implement it in multiple lanuages.

The columar format has been standardized by apache arrow. But there hasn't been any projects targeting to standardizing row format in a independent. Maybe it's an opportunity for fury to fill up this gap.

Arrow:

Fury:

Row format can be used for core data structure for high performance distributed big data processing system. It can be used as protocol for cross-language serialization.
The object graph in #180 is a dynamic serialization protocol, which write meta into stream too, the row format is a strong-typed protocol, which the meta are not serialized into the stream, the cost may be small if we don't padding fields whose size is less than 8 bytes.

Describe the solution you'd like

Spark tungsten is mature row format, its design and implementation can be take as a goog reference.

The fury row format can differs at:

Use arrow schema to describe meta.
- reduce meta definition and implementation
- Out-of-box meta cross-language serialization
- Better arrow interoperability
String support ascii/utf16/utf8 encoding for fast string encoding and zero-copy.
Decimal use arrow decimal format.
Variable-size field can be inline in fixed-size region if small enough.
Allow skip padding in the future. spark use 8 byte for /bool/byte/short/int/float, which is space wasting
Support adding fields without breaking compatibility
- Backward compatibility is necessary for online service.
The implementation support java/C++/python/golang/javascript/rust/etc..

chaokunyang added enhancement New feature or request protocol labels May 21, 2023

This was referenced May 21, 2023

[Java] Row format interface #315

Closed

[Java] Arrow type visitor #317

Closed

chaokunyang pinned this issue May 21, 2023

chaokunyang mentioned this issue May 29, 2023

[Protocol] Union support for row format #421

Open

rainsonGain unpinned this issue May 30, 2023

chaokunyang mentioned this issue Jun 5, 2023

[C++] Binary row format for c++ #440

Closed

This was referenced Jun 16, 2023

[Python] Python row format binding #464

Closed

[Python] Python encoder for row format #465

Closed

chaokunyang pinned this issue Jun 26, 2023

chaokunyang unpinned this issue Oct 20, 2023

chaokunyang pinned this issue Oct 20, 2023

chaokunyang unpinned this issue Nov 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Protocol] Multi-language row format support #314

[Protocol] Multi-language row format support #314

chaokunyang commented May 21, 2023 •

edited

Loading

[Protocol] Multi-language row format support #314

[Protocol] Multi-language row format support #314

Comments

chaokunyang commented May 21, 2023 • edited Loading

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

chaokunyang commented May 21, 2023 •

edited

Loading