Is your feature request related to a problem? Please describe.
Row format is a common requirement for big data processing systems. Although columnar formats such as Arrow have many advantages, streaming systems still need a row format because of the latency that columnar batching introduces.
On the other hand, operations such as hash join perform better on a row format than on a columnar format.
Systems such as Spark, Flink, Doris, and others all implement a row format in a similar way. Reimplementing it repeatedly is not easy, and a system with multi-language requirements has to reimplement the row format once per language, which makes the work harder. If those systems want to interoperate, users have to write code to convert between their formats, which complicates things further.
A binary protocol is hard to implement correctly, and offset management in a row format is complex, let alone implementing it in multiple languages.
The columnar format has been standardized by Apache Arrow, but no project has yet aimed to standardize the row format independently. This may be an opportunity for Fury to fill the gap.
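To make the offset bookkeeping concrete, here is a minimal sketch of a row layout loosely modeled on Spark's UnsafeRow: a null bitmap, one 8-byte slot per field, and a trailing variable-size region whose entries are addressed by an (offset, size) pair packed into the slot. The function names `encode_row` and `read_field` are hypothetical, and this is not Fury's actual wire format.

```python
import struct

WORD = 8  # every fixed-size slot is 8 bytes in this sketch


def encode_row(values):
    """Encode a list of int / bytes / None values into one row buffer."""
    n = len(values)
    bitmap_size = ((n + 63) // 64) * WORD      # null bitmap, rounded up to whole words
    buf = bytearray(bitmap_size + n * WORD)    # bitmap plus one slot per field
    for i, value in enumerate(values):
        slot = bitmap_size + i * WORD
        if value is None:
            buf[i // 8] |= 1 << (i % 8)        # mark field i as null
        elif isinstance(value, int):
            struct.pack_into("<q", buf, slot, value)
        else:
            offset = len(buf)                  # variable data is appended at the end
            struct.pack_into("<Q", buf, slot, (offset << 32) | len(value))
            buf += value
            buf += bytes(-len(value) % WORD)   # keep the buffer 8-byte aligned
    return bytes(buf)


def read_field(row, num_fields, i, is_var):
    """Read field i; the caller supplies the schema (field count and kind)."""
    bitmap_size = ((num_fields + 63) // 64) * WORD
    if row[i // 8] & (1 << (i % 8)):
        return None
    slot = bitmap_size + i * WORD
    if not is_var:
        return struct.unpack_from("<q", row, slot)[0]
    packed = struct.unpack_from("<Q", row, slot)[0]
    offset, size = packed >> 32, packed & 0xFFFFFFFF
    return row[offset:offset + size]
```

Even in this small sketch, the writer and every reader must agree on the bitmap width, the slot width, the bit packing of offset and size, and the alignment padding, and each of those has to be repeated in every language binding.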
Arrow:
Fury:
A row format can serve as the core data structure of a high-performance distributed big data processing system. It can also be used as a protocol for cross-language serialization.
The object graph in #180 is a dynamic serialization protocol, which also writes meta into the stream. The row format is a strongly typed protocol in which meta is not serialized into the stream, so the cost may be small if we don't pad fields whose size is less than 8 bytes.
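As a rough illustration of the padding cost, the following sketch compares the fixed-region size of a row when every field takes an 8-byte slot against a tightly packed layout. The `WIDTHS` table and both helper functions are illustrative assumptions; the packed figure ignores alignment and the null bitmap.

```python
# Natural byte widths of some primitive types (assumed for this sketch).
WIDTHS = {"bool": 1, "byte": 1, "short": 2, "int": 4, "float": 4, "long": 8, "double": 8}


def padded_size(schema, slot=8):
    """Fixed-region size when every field occupies one 8-byte slot."""
    return slot * len(schema)


def packed_size(schema):
    """Fixed-region size when each field uses only its natural width."""
    return sum(WIDTHS[t] for t in schema)


schema = ["bool", "byte", "short", "int", "float", "long"]
print(padded_size(schema), packed_size(schema))  # prints "48 20"
```

For this hypothetical six-field schema, the packed layout needs 20 bytes where the padded one needs 48.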
Describe the solution you'd like
Spark Tungsten is a mature row format; its design and implementation can be taken as a good reference.
The Fury row format can differ in the following ways:
Use the Arrow schema to describe meta.
Reduces meta definition and implementation work
Out-of-the-box cross-language meta serialization
Better Arrow interoperability
Strings support ASCII/UTF-16/UTF-8 encodings for fast string encoding and zero-copy.
Decimals use the Arrow decimal format.
A variable-size field can be inlined in the fixed-size region if it is small enough.
Allow skipping padding in the future. Spark uses 8 bytes for bool/byte/short/int/float, which wastes space.
Support adding fields without breaking compatibility.
Backward compatibility is necessary for online services.
The implementation should support Java/C++/Python/Golang/JavaScript/Rust and other languages.
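One way the idea of inlining small variable-size fields could work is sketched below. The encoding is purely hypothetical: values of up to 7 bytes live directly in the 8-byte slot, with the last byte holding a flag bit plus the length, while longer values go to a separate variable-size region and the slot stores a packed (offset, size) pair. The sketch assumes offsets stay below 2^31 so the flag bit is never set by an offset.

```python
import struct

INLINE_FLAG = 0x80   # hypothetical marker bit in the last byte of a slot
MAX_INLINE = 7       # 7 payload bytes + 1 flag/length byte fill one 8-byte slot


def encode_var_slot(value, var_region):
    """Return the 8-byte slot for `value`, appending to var_region if needed."""
    if len(value) <= MAX_INLINE:
        # Payload first, zero padded, then the flag/length byte.
        return value.ljust(MAX_INLINE, b"\0") + bytes([INLINE_FLAG | len(value)])
    offset = len(var_region)
    var_region.extend(value)
    # Assumes offset < 2**31, which keeps the top bit of byte 7 clear.
    return struct.pack("<Q", (offset << 32) | len(value))


def decode_var_slot(slot, var_region):
    """Recover the value from a slot produced by encode_var_slot."""
    if slot[7] & INLINE_FLAG:
        return bytes(slot[:slot[7] & 0x7F])
    packed = struct.unpack("<Q", slot)[0]
    offset, size = packed >> 32, packed & 0xFFFFFFFF
    return bytes(var_region[offset:offset + size])
```

Short strings then cost no extra indirection and no space in the variable-size region, at the price of one more branch on every read.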