Proper way to add custom logical types based on arrow2. #326
Comments
Thanks for reaching out! The Arrow spec supports this; it is called extension types. I haven't had the time to implement them yet, but it is very well within the arrow spec, so we are good to go on that front 👍
Otherwise, a general approach would be to generalize the types:
and then:
The fields' metadata is always transmitted over IPC, FFI and Parquet, so as long as the consumer knows what to do with it, you can transmit the logical type.
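A minimal sketch of the metadata approach described above, assuming simplified stand-in types rather than arrow2's own: `PhysicalType`, `Field`, `with_logical_type`, `logical_type` and the key `datafuse:logical_type` are all hypothetical names chosen for illustration. The idea is only that the logical type travels as a string entry in the field's key-value metadata while the physical type stays untouched.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a physical type (not arrow2's `DataType`).
#[derive(Debug, Clone, PartialEq)]
pub enum PhysicalType {
    UInt16,
    Int32,
}

/// Hypothetical field: a name, a physical type and string key-value metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub physical: PhysicalType,
    pub metadata: BTreeMap<String, String>,
}

impl Field {
    pub fn new(name: &str, physical: PhysicalType) -> Self {
        Field { name: name.to_string(), physical, metadata: BTreeMap::new() }
    }
}

/// Assumed metadata key; producer and consumer only need to agree on it.
pub const LOGICAL_TYPE_KEY: &str = "datafuse:logical_type";

/// Producer side: tag the field with a logical type name.
pub fn with_logical_type(mut field: Field, logical: &str) -> Field {
    field.metadata.insert(LOGICAL_TYPE_KEY.to_string(), logical.to_string());
    field
}

/// Consumer side: recover the logical type name, if one was attached.
pub fn logical_type(field: &Field) -> Option<&str> {
    field.metadata.get(LOGICAL_TYPE_KEY).map(String::as_str)
}
```

A consumer that does not know the key simply sees a `UInt16` field, so the data stays readable by any Arrow implementation.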
So the DataType will be refactored into this struct to support
(the otherwise is if you want to support it outside arrow2). For arrow2, I think that we need to extend It requires internal changes on
OK, I'd like to support it inside arrow2, because supporting it outside arrow2 would mean duplicating a lot of code. For example: arrow2/src/io/ipc/write/serialize.rs Lines 411 to 433 in 3ce4f1c
If you have any plans to do that, I am willing to help.
Awesome! My thinking is to add
This is a bit of work, but would offer a great UX to derive all kinds of extensions, since consumers of the library would not have to worry about the details of how arrow stores extensions in the Fields: they just need to pass the extension type. We would need to change some of the arrays also, since e.g. At IPC boundaries we would map
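To make the IPC-boundary mapping concrete, here is a rough sketch under stated assumptions. The `DataType` enum, its `Extension` variant, and the helpers `to_storage`, `extension_metadata` and `from_storage` are hypothetical and are not taken from arrow2 or from the proposal in this thread. The only part that comes from the Arrow specification is the pair of field metadata keys, `ARROW:extension:name` and `ARROW:extension:metadata`.

```rust
use std::collections::BTreeMap;

/// Hypothetical, heavily reduced data type enum (not arrow2's definition).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    UInt16,
    Int32,
    /// Assumed shape: extension name, storage type, optional serialized metadata.
    Extension(String, Box<DataType>, Option<String>),
}

impl DataType {
    /// Strip (possibly nested) extensions down to the storage type.
    pub fn to_storage(&self) -> &DataType {
        match self {
            DataType::Extension(_, inner, _) => inner.to_storage(),
            other => other,
        }
    }
}

/// Writing side: turn an extension into the field metadata defined by the Arrow spec.
pub fn extension_metadata(dt: &DataType) -> BTreeMap<String, String> {
    let mut md = BTreeMap::new();
    if let DataType::Extension(name, _, extra) = dt {
        md.insert("ARROW:extension:name".to_string(), name.clone());
        if let Some(extra) = extra {
            md.insert("ARROW:extension:metadata".to_string(), extra.clone());
        }
    }
    md
}

/// Reading side: rebuild the extension from the storage type plus field metadata.
pub fn from_storage(storage: DataType, md: &BTreeMap<String, String>) -> DataType {
    match md.get("ARROW:extension:name") {
        Some(name) => DataType::Extension(
            name.clone(),
            Box::new(storage),
            md.get("ARROW:extension:metadata").cloned(),
        ),
        None => storage,
    }
}
```

With a shape like this, a library user would only construct the `Extension` value; the translation to and from field metadata happens once, at the serialization boundary.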
I'll create a PR to investigate the changes.
Hello, Datafuse's data memory runtime system is based on arrow2, and we really appreciate arrow2 for being such a great library.
Its physical types meet our requirements, but we found it hard to add custom logical types based on arrow2.
The README lists arrow2's primary goals, the first of which is:
MUST NOT implement any logical type other than the ones defined on the arrow specification
I think that is reasonable, but from the side of a database system it may not be enough. If we want a logical DataType named Date16, whose physical type is UInt16 and which represents days in the range 1970 to 2149, this requires more work.
In Datafuse:
We see two main ways to implement custom logical types. @jorgecarleitao, would you give us some advice on this?
1. Fork arrow2, add the logical data type to the DataType enum directly, and make everything else (such as io/ipc) work with it. This would be similar to https://github.com/tensorbase/tensorbase/blob/53366bb9bc17271096ca31c7861f36a39305c9a7/crates/arrow/src/datatypes/datatype.rs#L97 , but it breaks the baseline code of arrow2, which makes it hard to contribute back upstream.
2. Build on the upstream arrow2 code: since arrow2's physical types are enough for us, we only need to introduce logical types. But then we have to introduce our own io/ipc, io/parquet and flight frameworks to work with our custom types.
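For reference, the Date16 example from the issue can be sketched as a thin wrapper over a u16 day count. This is an illustration only, not Datafuse or arrow2 code: the `Date16` newtype and its `to_ymd` method are hypothetical, and the conversion uses a standard days-to-civil-date calculation on the proleptic Gregorian calendar. It shows why a UInt16 covers 1970 to 2149: day 0 is 1970-01-01 and day 65535 is 2149-06-06.

```rust
/// Hypothetical logical type: days since 1970-01-01, stored as a u16.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Date16(pub u16);

impl Date16 {
    /// Convert to a proleptic Gregorian (year, month, day).
    pub fn to_ymd(self) -> (u32, u32, u32) {
        // Shift the epoch from 1970-01-01 to 0000-03-01, so that the leap day
        // falls at the end of the (March-based) year.
        let z = self.0 as u32 + 719_468;
        let era = z / 146_097; // index of the 400-year cycle
        let doe = z - era * 146_097; // day of era, in [0, 146096]
        let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // year of era, in [0, 399]
        let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of March-based year
        let mp = (5 * doy + 2) / 153; // month index, 0 = March
        let day = doy - (153 * mp + 2) / 5 + 1;
        let month = if mp < 10 { mp + 3 } else { mp - 9 };
        // January and February belong to the next calendar year.
        let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
        (year, month, day)
    }
}
```

The physical representation stays a plain u16, so any of the approaches discussed in this thread only has to carry the name "Date16" alongside it.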