Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Proper way to add custom logic types base on arrow2. #326

Closed
sundy-li opened this issue Aug 24, 2021 · 8 comments
Closed

Proper way to add custom logic types base on arrow2. #326

sundy-li opened this issue Aug 24, 2021 · 8 comments
Labels
question Further information is requested

Comments

@sundy-li
Copy link
Collaborator

sundy-li commented Aug 24, 2021

Hello, Datafuse data memory runtime system is based on arrow2 and we really appreciate arrow2 to be such a great library.
Its physical type can meet our requirements, but we found it's hard to add custom logic types base on arrow2.

There are Arrow2's primary goals from README doc, such as the first one:
MUST NOT implement any logical type other than the ones defined on the arrow specification

I think it's reasonable, but from the side of a database system, this may be not enough. If we want to have a logic DataType named Date16,
whose physical type is UInt16, it represents the days range from 1970 - 2149, this may cause more work on it.

In Datafuse:

  1. There is no mandatory requirement to follow the logic types as the arrow specification.
  2. Did not have a plan to go through other query engines base on arrow.
  3. Currently based on arrow2's array/compute/io and flight services.

We have two main ways to implement custom logic types. @jorgecarleitao Would you give us some advices about it?

  1. Fork another version of arrow2, add the logic datatype into enum DataType directly, and make others(such as: io/ipc) work well with it. This may be something
    similar to https://github.com/tensorbase/tensorbase/blob/53366bb9bc17271096ca31c7861f36a39305c9a7/crates/arrow/src/datatypes/datatype.rs#L97 , but it breaks the baseline codes of arrow2,
    which makes it hard to contribute back upstream.

  2. Based on the upstream codes of arrow2, since arrow2's physical type is enough for us, we just need to introduce logic types.
    But we have to introduce our own io/ipc and io/parquet and flight frameworks to work with our custom types.

@jorgecarleitao
Copy link
Owner

Thanks for reaching out!

The Arrow spec supports this; it is called extension types. I haven't had the time to implement them yet, but it is very well within the arrow spec, so we are good to go on that front 👍

@jorgecarleitao
Copy link
Owner

Otherwise, a general approach would be to generalize the types:

enum DataType {
    ArrowType(Arrow::DataType),
    CustomType1,
    CustomType2,
}

and then:

  • implement Into<Datatype> for Arrow::DataType that is lossy (for ease of use) that falls into arrow's physical in-memory datatype, used within arrow2 compute and stuff
  • Into<Arrow::Field> for (&str, DataType, bool) that includes a .with_metadata(...) storing custom metadata (e.g. "type": "date16")
  • From<Arrow::Field> for DataType that reads the metadata in the field denoting the custom type to decide which type it should be converted to (arrow or custom)

The fields' metadata is always transmitted in IPC, FFI and parquet, so as long as the consumer knows the semantics about what to do with it, you can transit the logical type.

@sundy-li
Copy link
Collaborator Author

sundy-li commented Aug 24, 2021

enum DataType {
    ArrowType(Arrow::DataType),
    CustomType1,
    CustomType2,
}

So the DataType will be refactored into this struct to support extension types ? Seems python's arrow can register new logic types into the DataType, but rust's enum is Immutable

@jorgecarleitao
Copy link
Owner

(the otherwise is if you want to support it outside arrow2).

For arrow2, I think that we need to extend DataType with a new variant like DataType::Extension(String, Box<DataType>, Option<Metadata>) so that everyone has access to the extension information from array.data_type().

It requires internal changes on arrow2 to make sure that this datatype remains as useful as its internal type (the second argument DataType::Extension). I need to research a bit the best way of achieving this without breaking too many public APIs.

@sundy-li
Copy link
Collaborator Author

Ok, I'd like to support it inside arrow2, cause there will be lots of duplicated code to have if support outside arrow2.

Like:

match array.data_type() {
DataType::Null => (),
DataType::Boolean => write_boolean(array, buffers, arrow_data, offset, is_little_endian),
DataType::Int8 => {
write_primitive::<i8>(array, buffers, arrow_data, offset, is_little_endian)
}
DataType::Int16 => {
write_primitive::<i16>(array, buffers, arrow_data, offset, is_little_endian)
}
DataType::Int32
| DataType::Date32
| DataType::Time32(_)
| DataType::Interval(IntervalUnit::YearMonth) => {
write_primitive::<i32>(array, buffers, arrow_data, offset, is_little_endian)
}
DataType::Int64
| DataType::Date64
| DataType::Time64(_)
| DataType::Timestamp(_, _)
| DataType::Duration(_) => {
write_primitive::<i64>(array, buffers, arrow_data, offset, is_little_endian)
}
DataType::Decimal(_, _) => {

If you have any plans to do that I am willing to help with that

@jorgecarleitao
Copy link
Owner

jorgecarleitao commented Aug 24, 2021

Awesome!

My thinking is to add DataType::Extension(String, Box<DataType>, Option<BTreeMap<String,String>>) and change the matches to be as follows:

match array.data_type() {
    DataType::UInt16 | DataType::Extension(_, UInt16, _) => {/* code for u16 */}
}

This is a bit of work, but would offer a great UX to derive all kinds of extensions, since consumers of the library would not have to worry about the details of how arrow stores extensions in the Fields: they just need to pass the extension type.

We would need to change some of the arrays also, since e.g. BooleanArray's datatype could now be an extension instead of DataType::Boolean.

At IPC boundaries we would map Field -> DataType::Extension based on the fields' metadata.

@sundy-li
Copy link
Collaborator Author

I'll create a pr to investigate the changes.

@jorgecarleitao jorgecarleitao added the question Further information is requested label Aug 31, 2021
@jorgecarleitao
Copy link
Owner

Closed by #359 ; guide and example now covers how to do this. Thanks a lot @sundy-li for bringing this up. It certainty helped at prioritizing it :)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants