Tracking Issue: migrate Arrow/Parquet to the official implementation. #555
Comments
What about |
Only a few places use BTW, we also use methods in greptimedb/src/script/src/python/vector.rs Lines 115 to 122 in 6d762aa
They might have different semantic |
Yes, so many API changes. For this particular case, I think we can set Gladly (or maybe sadly 🥲) it looks like we don't have this kind of boundary-case test. I can almost foresee that we will review the change with great care (I hope so 🤪) but still miss some critical change that then takes us a day to trace down. |
We may also need to refactor |
I'm working on the new vectors based on the official |
I reproduced this bug in another repo and found that it is highly related to the usage of the with_match_primitive_type_id macro. AddressSanitizer also reports that the stack overflowed when running the tests.
The |
arrow only provides a generic version of the arithmetic operations, but arrow2 supports passing a scalar dynamically.
arrow:
pub fn add_scalar_dyn<T>(array: &dyn Array, scalar: T::Native) -> Result<ArrayRef>
where
    T: ArrowNumericType,
    T::Native: ArrowNativeTypeOp,
{}
arrow2:
pub fn add_scalar(lhs: &dyn Array, rhs: &dyn Scalar) -> Box<dyn Array> {}
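One possible way to bridge the two styles is a thin dynamic wrapper that matches on the runtime type and forwards to the generic kernel. The sketch below uses only the standard library. `DynArray`, `DynScalar`, `add_scalar_generic` and `add_scalar_dynamic` are hypothetical stand-ins made up for illustration, not types or functions from arrow, arrow2, or GreptimeDB.

```rust
// Hypothetical stand-ins for illustration; these are NOT arrow or arrow2 types.
#[derive(Debug, Clone, PartialEq)]
enum DynArray {
    Int64(Vec<i64>),
    Float64(Vec<f64>),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum DynScalar {
    Int64(i64),
    Float64(f64),
}

// Generic kernel: the element type is fixed at compile time,
// similar in spirit to a kernel that is generic over T.
fn add_scalar_generic<T>(values: &[T], scalar: T) -> Vec<T>
where
    T: Copy + std::ops::Add<Output = T>,
{
    values.iter().map(|v| *v + scalar).collect()
}

// Dynamic wrapper: the caller passes a runtime-typed scalar and the
// wrapper picks the matching generic instantiation.
fn add_scalar_dynamic(array: &DynArray, scalar: DynScalar) -> Result<DynArray, String> {
    match (array, scalar) {
        (DynArray::Int64(v), DynScalar::Int64(s)) => {
            Ok(DynArray::Int64(add_scalar_generic(v, s)))
        }
        (DynArray::Float64(v), DynScalar::Float64(s)) => {
            Ok(DynArray::Float64(add_scalar_generic(v, s)))
        }
        // Mismatched types are reported instead of being cast silently.
        (a, s) => Err(format!("type mismatch: array {:?} vs scalar {:?}", a, s)),
    }
}
```

The wrapper needs one match arm per supported type, so the cost of the generic-only API shows up as dispatch code on our side.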
Unfortunately, we don't support |
What type of enhancement is this?
Tech debt reduction
What does the enhancement do?
As discussed in #388, we decided to migrate the arrow/parquet implementation to the official version. This issue tracks the progress of this work.
Implementation challenges
Major API changes
ExecutionPlan::execute changes to sync https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#tymethod.execute
ExecutionPlan TaskContext https://docs.rs/datafusion/latest/datafusion/execution/context/struct.TaskContext.html (RuntimeEnv, with more functionality)
TableProvider is separated into TableProvider (physical stage) and TableSource (logical stage) https://docs.rs/datafusion-expr/latest/datafusion_expr/trait.TableSource.html
Array trait validity
arrow-format https://github.com/DataEngineeringLabs/arrow-format also needs to migrate to Apache Arrow
ParquetWriter (read/write sst) and statistics (predicate).
Works before migration
ExecutionPlan::execute to sync method @waynexia feat: lazy evaluated record batch stream #573
Specify output_ordering for our ExecutionPlan impl
TableSource @v0y4g3r
(TaskContext)
add proxy method like get_validity for ArrayRef
datatypes::arrow instead of arrow directly @evenyag refactor: Use re-exported arrow mod from datatypes crate #571
MutableBitmap @evenyag refactor: replace some usage of MutableBitmap by BitVec #610
Migrating
Working on branch
replace-arrow2
Target versions:
14.0.0
26.0.0
26.0.0
Migrating datatypes
Branch datatypes2 is working on migrating our vectors to arrow
TODO
VectorOp::cast() feat: impl insert data from query #1025
VectorOp::take()
create_current_timestamp_vector()
arrow_array_get()
is_xxx to properties
Helper::static_cast(), avoid checking whether the vector is const in user codes
query/tests to query/src #971
script/python/builtins/mod.rs move to script/python/builtins.rs, also check other mods refactor: rename some mod.rs to <MOD_NAME>.rs #784
RecordBatch/Vector/Schema
push_value_ref() method in MutableVector trait #978
Migrating functions
TODO
test_clip_fn_* may fail with SIGSEGV fix: pre-cast to avoid tremendous match arms #734
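As a rough illustration of what "pre-cast to avoid tremendous match arms" can mean in general: if a function takes three dynamically typed operands and dispatches on every type combination, N supported types lead to N³ arms, whereas casting each operand to one common type first needs only N arms in the cast plus a single kernel body. The sketch below is standard-library Rust. `NumVector`, `cast_to_f64` and `clip` are hypothetical names for illustration and are not the code from #734.

```rust
// Hypothetical stand-in for a dynamically typed numeric vector;
// this is NOT GreptimeDB's actual vector type.
#[derive(Debug, Clone, PartialEq)]
enum NumVector {
    Int32(Vec<i32>),
    Int64(Vec<i64>),
    UInt32(Vec<u32>),
    Float32(Vec<f32>),
    Float64(Vec<f64>),
}

impl NumVector {
    // One arm per supported type: N arms in total.
    fn cast_to_f64(&self) -> Vec<f64> {
        match self {
            NumVector::Int32(v) => v.iter().map(|x| f64::from(*x)).collect(),
            // i64 -> f64 may lose precision for very large values.
            NumVector::Int64(v) => v.iter().map(|x| *x as f64).collect(),
            NumVector::UInt32(v) => v.iter().map(|x| f64::from(*x)).collect(),
            NumVector::Float32(v) => v.iter().map(|x| f64::from(*x)).collect(),
            NumVector::Float64(v) => v.clone(),
        }
    }
}

// Element-wise clip of `values` into [min, max].
// All operands are pre-cast to f64, so the kernel has a single body
// instead of one arm per (values, min, max) type combination.
// Assumption: the three vectors have the same length; extra elements are ignored.
fn clip(values: &NumVector, min: &NumVector, max: &NumVector) -> Vec<f64> {
    let v = values.cast_to_f64();
    let lo = min.cast_to_f64();
    let hi = max.cast_to_f64();
    v.iter()
        .zip(lo.iter())
        .zip(hi.iter())
        .map(|((&x, &lo), &hi)| x.max(lo).min(hi))
        .collect()
}
```

The trade-off in this sketch is that the result is always f64, so the caller has to cast back if the original type must be preserved.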