-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor array_union
and array_intersect
functions to one general function
#8516
Conversation
f529f83
to
2aa3635
Compare
query ? | ||
select array_intersect([1, 1, 2, 2, 3, 3], null); | ||
---- | ||
[1, 2, 3] | ||
|
||
query ? | ||
select array_intersect(null, [1, 1, 2, 2, 3, 3]); | ||
---- | ||
[1, 2, 3] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found previous tests forgot to test that the array should be distinct when one of the arrays is NULL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should return empty array or null in this case. DuckDB return empty array, Clickhouse return null. I
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previous implement is incorrect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can return empty array. array_intersect(array, null) -> array (empty)
The idea is turn null to empty array.
array_union(array, null) -> array
array_except(array, null) -> array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to return empty array for array_intersect, but met this panic.
[/Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/physical-expr/src/array_expressions.rs:1641] new_empty_array(array1.data_type()).data_type() = List(
Field {
name: "item",
data_type: Int64,
nullable: true,
dict_id: 0,
dict_is_ordered: false,
metadata: {},
},
)
thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-array-49.0.0/src/array/list_array.rs:289:19
stack backtrace:
0: rust_begin_unwind
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:593:5
1: core::panicking::panic_fmt
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/panicking.rs:67:14
2: core::panicking::panic_bounds_check
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/panicking.rs:162:5
3: arrow_array::array::list_array::GenericListArray<OffsetSize>::value
at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-array-49.0.0/src/array/list_array.rs:289:19
4: datafusion_common::scalar::ScalarValue::try_from_array
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/common/src/scalar.rs:2147:36
5: datafusion_physical_expr::functions::make_scalar_function_with_hints::{{closure}}::{{closure}}
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/physical-expr/src/functions.rs:248:48
6: core::result::Result<T,E>::and_then
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/result.rs:1319:22
7: datafusion_physical_expr::functions::make_scalar_function_with_hints::{{closure}}
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/physical-expr/src/functions.rs:248:26
8: datafusion_physical_expr::functions::create_physical_fun::{{closure}}
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/physical-expr/src/functions.rs:414:13
9: <datafusion_physical_expr::scalar_function::ScalarFunctionExpr as datafusion_physical_expr::physical_expr::PhysicalExpr>::evaluate
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/physical-expr/src/scalar_function.rs:161:9
10: datafusion_optimizer::simplify_expressions::expr_simplifier::ConstEvaluator::evaluate_to_scalar
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:390:23
11: <datafusion_optimizer::simplify_expressions::expr_simplifier::ConstEvaluator as datafusion_common::tree_node::TreeNodeRewriter>::mutate
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:284:44
12: datafusion_common::tree_node::TreeNode::rewrite
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/common/src/tree_node.rs:205:13
13: datafusion_optimizer::simplify_expressions::expr_simplifier::ExprSimplifier<S>::simplify
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:142:9
14: datafusion_optimizer::simplify_expressions::simplify_exprs::SimplifyExpressions::optimize_internal::{{closure}}
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs:102:29
15: core::iter::adapters::map::map_try_fold::{{closure}}
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/adapters/map.rs:91:28
16: core::iter::traits::iterator::Iterator::try_fold
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/traits/iterator.rs:2303:21
17: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/adapters/map.rs:117:9
18: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::try_fold
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/adapters/mod.rs:195:9
19: <I as alloc::vec::in_place_collect::SpecInPlaceCollect<T,I>>::collect_in_place
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/alloc/src/vec/in_place_collect.rs:258:13
20: alloc::vec::in_place_collect::<impl alloc::vec::spec_from_iter::SpecFromIter<T,I> for alloc::vec::Vec<T>>::from_iter
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/alloc/src/vec/in_place_collect.rs:182:28
21: <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/alloc/src/vec/mod.rs:2696:9
22: core::iter::traits::iterator::Iterator::collect
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/traits/iterator.rs:1895:9
23: <core::result::Result<V,E> as core::iter::traits::collect::FromIterator<core::result::Result<A,E>>>::from_iter::{{closure}}
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/result.rs:1932:51
24: core::iter::adapters::try_process
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/adapters/mod.rs:164:17
25: <core::result::Result<V,E> as core::iter::traits::collect::FromIterator<core::result::Result<A,E>>>::from_iter
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/result.rs:1932:9
26: core::iter::traits::iterator::Iterator::collect
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/traits/iterator.rs:1895:9
27: datafusion_optimizer::simplify_expressions::simplify_exprs::SimplifyExpressions::optimize_internal
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs:96:20
28: <datafusion_optimizer::simplify_expressions::simplify_exprs::SimplifyExpressions as datafusion_optimizer::optimizer::OptimizerRule>::try_optimize
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs:57:17
29: datafusion_optimizer::optimizer::Optimizer::optimize_recursively
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/optimizer.rs:419:18
30: datafusion_optimizer::optimizer::Optimizer::optimize
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/optimizer.rs:294:21
31: datafusion::execution::context::SessionState::optimize
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/core/src/execution/context/mod.rs:1791:13
32: datafusion::execution::context::SessionState::create_physical_plan::{{closure}}
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/core/src/execution/context/mod.rs:1806:28
33: datafusion::dataframe::DataFrame::create_physical_plan::{{closure}}
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/core/src/dataframe/mod.rs:154:61
34: datafusion::dataframe::DataFrame::collect::{{closure}}
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/core/src/dataframe/mod.rs:758:48
35: datafusion_cli::exec::exec_and_print::{{closure}}
at ./src/exec.rs:231:36
36: datafusion_cli::exec::exec_from_repl::{{closure}}
at ./src/exec.rs:168:65
37: datafusion_cli::main::{{closure}}
at ./src/main.rs:221:14
38: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:63
39: tokio::runtime::coop::with_budget
at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:107:5
40: tokio::runtime::coop::budget
at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:73:5
41: tokio::runtime::park::CachedParkThread::block_on
at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:31
42: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/blocking.rs:66:9
43: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:87:13
44: tokio::runtime::context::runtime::enter_runtime
at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/runtime.rs:65:16
45: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:86:9
46: tokio::runtime::runtime::Runtime::block_on
at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/runtime.rs:350:45
47: datafusion_cli::main
at ./src/main.rs:215:5
48: core::ops::function::FnOnce::call_once
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
I am not sure it is an issue 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can try make_array(&[])
for empty array. Also, you need to fix the return_type
in datafusion/expr/src/built_in_function.rs
too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should return empty array or null in this case. DuckDB return empty array, Clickhouse return null. I
I check DuckDB results, which are NULL also. @jayzhan211
D select typeof( list_intersect(null, null));
┌────────────────────────────────────┐
│ typeof(list_intersect(NULL, NULL)) │
│ varchar │
├────────────────────────────────────┤
│ NULL │
└────────────────────────────────────┘
D select typeof( list_intersect(null, []));
┌─────────────────────────────────────────────────┐
│ typeof(list_intersect(NULL, main.list_value())) │
│ varchar │
├─────────────────────────────────────────────────┤
│ NULL │
└─────────────────────────────────────────────────┘
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should return empty array or null in this case. DuckDB return empty array, Clickhouse return null. I
I check DuckDB results, which are NULL also. @jayzhan211
D select typeof( list_intersect(null, null)); ┌────────────────────────────────────┐ │ typeof(list_intersect(NULL, NULL)) │ │ varchar │ ├────────────────────────────────────┤ │ NULL │ └────────────────────────────────────┘ D select typeof( list_intersect(null, [])); ┌─────────────────────────────────────────────────┐ │ typeof(list_intersect(NULL, main.list_value())) │ │ varchar │ ├─────────────────────────────────────────────────┤ │ NULL │ └─────────────────────────────────────────────────┘
the first element should be list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the first element is null, you should return null instead
@jayzhan211 do you have some time to review this PR? |
query ? | ||
select array_union(null, [1, 1, 2, 2, 3, 3]); | ||
---- | ||
[1, 2, 3] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Union is correct
let converter = RowConverter::new(vec![SortField::new(dt.clone())])?; | ||
for (first_arr, second_arr) in l.iter().zip(r.iter()) { | ||
if let (Some(first_arr), Some(second_arr)) = (first_arr, second_arr) { | ||
let l_values = converter.convert_columns(&[first_arr])?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems you iterate each row and convert_columns based on it. You call convert_columns
N times, N is num of rows. But, previous implementation we just do one convert_columns
and iterate the each row.
It seems this change is more costly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the previous implementation is to convert_columns
for all values at once, the current is to convert_columns
for each element each time. The difference in cost is negligible. @jayzhan211
// Null type | ||
(DataType::Null, DataType::List(field)) | ||
| (DataType::List(field), DataType::Null) => { | ||
let array = match array1.data_type() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better just split null, list and list, null into two to avoid another type checking inside it.
6fee213
to
fe1083b
Compare
} | ||
|
||
let values = converter.convert_rows(rows)?; | ||
let field = Arc::new(Field::new("item", dt, true)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why should we create a field here instead of passing the cloned one? if you got the error related to field, likely there is error in return_type or elsewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
77d9480
to
c34e6c0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Weijun-H and @jayzhan211
… function (apache#8516) * Refactor array_union and array_intersect functions * fix cli * fix ci * add tests for null * modify the return type * update tests * fix clippy * fix clippy * add tests for largelist * fix clippy * Add field parameter to generic_set_lists() function * Add large array drop statements * fix clippy
Which issue does this PR close?
Closes #.
Rationale for this change
array_intersect
LargeList
array_union
andarray_intersect
logicWhat changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?