Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor array_union and array_intersect functions to one general function #8516

Merged
merged 13 commits into from
Dec 27, 2023

Conversation

Weijun-H
Copy link
Member

@Weijun-H Weijun-H commented Dec 12, 2023

Which issue does this PR close?

Closes #.

Rationale for this change

  • implement the output type for array_intersect
  • support LargeList
  • integrate array_union and array_intersect logic

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the physical-expr Physical Expressions label Dec 12, 2023
@Weijun-H Weijun-H marked this pull request as draft December 12, 2023 16:41
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 12, 2023
@Weijun-H Weijun-H marked this pull request as ready for review December 12, 2023 21:45
Comment on lines 3046 to 3054
query ?
select array_intersect([1, 1, 2, 2, 3, 3], null);
----
[1, 2, 3]

query ?
select array_intersect(null, [1, 1, 2, 2, 3, 3]);
----
[1, 2, 3]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found previous tests forgot to test that the array should be distinct when one of the arrays is NULL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should return empty array or null in this case. DuckDB return empty array, Clickhouse return null. I

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous implement is incorrect

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can return empty array. array_intersect(array, null) -> array (empty)
The idea is turn null to empty array.
array_union(array, null) -> array
array_except(array, null) -> array

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to return empty array for array_intersect, but met this panic.

[/Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/physical-expr/src/array_expressions.rs:1641] new_empty_array(array1.data_type()).data_type() = List(
    Field {
        name: "item",
        data_type: Int64,
        nullable: true,
        dict_id: 0,
        dict_is_ordered: false,
        metadata: {},
    },
)
thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-array-49.0.0/src/array/list_array.rs:289:19
stack backtrace:
   0: rust_begin_unwind
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:593:5
   1: core::panicking::panic_fmt
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/panicking.rs:67:14
   2: core::panicking::panic_bounds_check
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/panicking.rs:162:5
   3: arrow_array::array::list_array::GenericListArray<OffsetSize>::value
             at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-array-49.0.0/src/array/list_array.rs:289:19
   4: datafusion_common::scalar::ScalarValue::try_from_array
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/common/src/scalar.rs:2147:36
   5: datafusion_physical_expr::functions::make_scalar_function_with_hints::{{closure}}::{{closure}}
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/physical-expr/src/functions.rs:248:48
   6: core::result::Result<T,E>::and_then
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/result.rs:1319:22
   7: datafusion_physical_expr::functions::make_scalar_function_with_hints::{{closure}}
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/physical-expr/src/functions.rs:248:26
   8: datafusion_physical_expr::functions::create_physical_fun::{{closure}}
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/physical-expr/src/functions.rs:414:13
   9: <datafusion_physical_expr::scalar_function::ScalarFunctionExpr as datafusion_physical_expr::physical_expr::PhysicalExpr>::evaluate
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/physical-expr/src/scalar_function.rs:161:9
  10: datafusion_optimizer::simplify_expressions::expr_simplifier::ConstEvaluator::evaluate_to_scalar
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:390:23
  11: <datafusion_optimizer::simplify_expressions::expr_simplifier::ConstEvaluator as datafusion_common::tree_node::TreeNodeRewriter>::mutate
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:284:44
  12: datafusion_common::tree_node::TreeNode::rewrite
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/common/src/tree_node.rs:205:13
  13: datafusion_optimizer::simplify_expressions::expr_simplifier::ExprSimplifier<S>::simplify
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:142:9
  14: datafusion_optimizer::simplify_expressions::simplify_exprs::SimplifyExpressions::optimize_internal::{{closure}}
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs:102:29
  15: core::iter::adapters::map::map_try_fold::{{closure}}
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/adapters/map.rs:91:28
  16: core::iter::traits::iterator::Iterator::try_fold
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/traits/iterator.rs:2303:21
  17: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/adapters/map.rs:117:9
  18: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::try_fold
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/adapters/mod.rs:195:9
  19: <I as alloc::vec::in_place_collect::SpecInPlaceCollect<T,I>>::collect_in_place
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/alloc/src/vec/in_place_collect.rs:258:13
  20: alloc::vec::in_place_collect::<impl alloc::vec::spec_from_iter::SpecFromIter<T,I> for alloc::vec::Vec<T>>::from_iter
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/alloc/src/vec/in_place_collect.rs:182:28
  21: <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/alloc/src/vec/mod.rs:2696:9
  22: core::iter::traits::iterator::Iterator::collect
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/traits/iterator.rs:1895:9
  23: <core::result::Result<V,E> as core::iter::traits::collect::FromIterator<core::result::Result<A,E>>>::from_iter::{{closure}}
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/result.rs:1932:51
  24: core::iter::adapters::try_process
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/adapters/mod.rs:164:17
  25: <core::result::Result<V,E> as core::iter::traits::collect::FromIterator<core::result::Result<A,E>>>::from_iter
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/result.rs:1932:9
  26: core::iter::traits::iterator::Iterator::collect
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/traits/iterator.rs:1895:9
  27: datafusion_optimizer::simplify_expressions::simplify_exprs::SimplifyExpressions::optimize_internal
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs:96:20
  28: <datafusion_optimizer::simplify_expressions::simplify_exprs::SimplifyExpressions as datafusion_optimizer::optimizer::OptimizerRule>::try_optimize
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs:57:17
  29: datafusion_optimizer::optimizer::Optimizer::optimize_recursively
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/optimizer.rs:419:18
  30: datafusion_optimizer::optimizer::Optimizer::optimize
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/optimizer.rs:294:21
  31: datafusion::execution::context::SessionState::optimize
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/core/src/execution/context/mod.rs:1791:13
  32: datafusion::execution::context::SessionState::create_physical_plan::{{closure}}
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/core/src/execution/context/mod.rs:1806:28
  33: datafusion::dataframe::DataFrame::create_physical_plan::{{closure}}
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/core/src/dataframe/mod.rs:154:61
  34: datafusion::dataframe::DataFrame::collect::{{closure}}
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/core/src/dataframe/mod.rs:758:48
  35: datafusion_cli::exec::exec_and_print::{{closure}}
             at ./src/exec.rs:231:36
  36: datafusion_cli::exec::exec_from_repl::{{closure}}
             at ./src/exec.rs:168:65
  37: datafusion_cli::main::{{closure}}
             at ./src/main.rs:221:14
  38: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:63
  39: tokio::runtime::coop::with_budget
             at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:107:5
  40: tokio::runtime::coop::budget
             at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:73:5
  41: tokio::runtime::park::CachedParkThread::block_on
             at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:31
  42: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
             at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/blocking.rs:66:9
  43: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
             at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:87:13
  44: tokio::runtime::context::runtime::enter_runtime
             at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/runtime.rs:65:16
  45: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:86:9
  46: tokio::runtime::runtime::Runtime::block_on
             at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/runtime.rs:350:45
  47: datafusion_cli::main
             at ./src/main.rs:215:5
  48: core::ops::function::FnOnce::call_once
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

I am not sure it is an issue 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can try make_array(&[]) for empty array. Also, you need to fix the return_type in datafusion/expr/src/built_in_function.rs too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should return empty array or null in this case. DuckDB return empty array, Clickhouse return null. I

I check DuckDB results, which are NULL also. @jayzhan211

D select typeof( list_intersect(null, null));
┌────────────────────────────────────┐
│ typeof(list_intersect(NULL, NULL)) │
│              varchar               │
├────────────────────────────────────┤
│ NULL                               │
└────────────────────────────────────┘
D select typeof( list_intersect(null, []));
┌─────────────────────────────────────────────────┐
│ typeof(list_intersect(NULL, main.list_value())) │
│                     varchar                     │
├─────────────────────────────────────────────────┤
│ NULL                                            │
└─────────────────────────────────────────────────┘

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screenshot 2023-12-18 at 07 41 05

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should return empty array or null in this case. DuckDB return empty array, Clickhouse return null. I

I check DuckDB results, which are NULL also. @jayzhan211

D select typeof( list_intersect(null, null));
┌────────────────────────────────────┐
│ typeof(list_intersect(NULL, NULL)) │
│              varchar               │
├────────────────────────────────────┤
│ NULL                               │
└────────────────────────────────────┘
D select typeof( list_intersect(null, []));
┌─────────────────────────────────────────────────┐
│ typeof(list_intersect(NULL, main.list_value())) │
│                     varchar                     │
├─────────────────────────────────────────────────┤
│ NULL                                            │
└─────────────────────────────────────────────────┘

the first element should be list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the first element is null, you should return null instead

@alamb
Copy link
Contributor

alamb commented Dec 15, 2023

@jayzhan211 do you have some time to review this PR?

query ?
select array_union(null, [1, 1, 2, 2, 3, 3]);
----
[1, 2, 3]
Copy link
Contributor

@jayzhan211 jayzhan211 Dec 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Union is correct

let converter = RowConverter::new(vec![SortField::new(dt.clone())])?;
for (first_arr, second_arr) in l.iter().zip(r.iter()) {
if let (Some(first_arr), Some(second_arr)) = (first_arr, second_arr) {
let l_values = converter.convert_columns(&[first_arr])?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems you iterate each row and convert_columns based on it. You call convert_columns N times, N is num of rows. But, previous implementation we just do one convert_columns and iterate the each row.

It seems this change is more costly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the previous implementation is to convert_columns for all values at once, the current is to convert_columns for each element each time. The difference in cost is negligible. @jayzhan211

// Null type
(DataType::Null, DataType::List(field))
| (DataType::List(field), DataType::Null) => {
let array = match array1.data_type() {
Copy link
Contributor

@jayzhan211 jayzhan211 Dec 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better just split null, list and list, null into two to avoid another type checking inside it.

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Dec 18, 2023
}

let values = converter.convert_rows(rows)?;
let field = Arc::new(Field::new("item", dt, true));
Copy link
Contributor

@jayzhan211 jayzhan211 Dec 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why should we create a field here instead of passing the cloned one? if you got the error related to field, likely there is error in return_type or elsewhere

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch 👍

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Weijun-H and @jayzhan211

@alamb alamb merged commit 58b0a2b into apache:main Dec 27, 2023
22 checks passed
@Weijun-H Weijun-H mentioned this pull request Dec 29, 2023
19 tasks
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Jan 4, 2024
… function (apache#8516)

* Refactor array_union and array_intersect functions

* fix cli

* fix ci

* add tests for null

* modify the return type

* update tests

* fix clippy

* fix clippy

* add tests for largelist

* fix clippy

* Add field parameter to generic_set_lists() function

* Add large array drop statements

* fix clippy
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants