-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement array_union
#7897
Implement array_union
#7897
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @edmondop please check some comments
Co-authored-by: comphead <[email protected]>
Co-authored-by: comphead <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
fn union_generic_lists<OffsetSize: OffsetSizeTrait>( | ||
l: &GenericListArray<OffsetSize>, | ||
r: &GenericListArray<OffsetSize>, | ||
) -> Result<GenericListArray<OffsetSize>, DataFusionError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer Result<GenericListArray<OffsetSize>>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed with 648ae5d
(_, DataType::Null) => Ok(array1.clone()), | ||
(DataType::List(_), DataType::List(_)) => { | ||
check_datatypes("array_union", &[&array1, &array2])?; | ||
let list1 = array1.as_list::<i32>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer datafusion::common::cast as_list_array and as_large_list_array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
datafusion::common::cast and arrow::array::cast are doing the similar thing. The difference is that common::cast return datafusion error, while arrow::array::cast return Arror error. To me, return datafusion error in datafusion project make much more senses for me. Also, mixing these two casting is quite messy, we can have arrow casting inside common::cast and call common::cast for other crate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just found that arrow::array::cast does not return Result<>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
never mind, I don't have strong opinion on which to use yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we directly use as_list_array
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the offset i32 branch you mean @Weijun-H ? I like the fact that the code has the same structure for LargeList and List
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but I think as_list_array
can avoid panic because it returns the Return
type.
(_, DataType::Null) => Ok(array1.clone()), | ||
(DataType::List(_), DataType::List(_)) => { | ||
check_datatypes("array_union", &[&array1, &array2])?; | ||
let list1 = array1.as_list::<i32>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
datafusion::common::cast and arrow::array::cast are doing the similar thing. The difference is that common::cast return datafusion error, while arrow::array::cast return Arror error. To me, return datafusion error in datafusion project make much more senses for me. Also, mixing these two casting is quite messy, we can have arrow casting inside common::cast and call common::cast for other crate.
9237fb4
to
4c1bcef
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @edmondop for keep improving the PR
Please also test
- different datatypes, fixed len(int, float, etc) / variable(strings, inner arrays)
- nulls
- empty arrays
we can put tests for every type and figure out if this needs to be fixed within this PR or we can create a followup PR
Will do! Are the other cases for empty arrays that aren't there @comphead ? |
Now I can see
would be very nice to see
|
Adding these two tests break the sqltests, but it doesn't seem to be caused by my code. I checked in the backtrace and there is nothing that refers
The stacktrace is the following
|
array union should have the same data type which including the dimension of array. Union with 2d and 1d is not allowed |
|
||
# array_union scalar function #6 | ||
query ? | ||
select array_union([], []); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checked in spark
scala> spark.sql("select array_union(array(), array())").show(false)
+-----------------------------+
|array_union(array(), array())|
+-----------------------------+
|[] |
+-----------------------------+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am even surprised my code work here tbh, do I need to add an additional branch to pattern matching where both array have null data type and return an empty array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am even surprised my code work here tbh, do I need to add an additional branch to pattern matching where both array have null data type and return an empty array?
I prefer return empty array in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am even surprised my code work here tbh, do I need to add an additional branch to pattern matching where both array have null data type and return an empty array?
I don't think you should check the input types (they should still be lists, they'll just have an offset buffer of [0, 1]
Maybe you just need to handle the case specially in array_union
|
||
# array_union scalar function #10 | ||
query ? | ||
select array_union(null, null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scala> spark.sql("select array_union(array(null), array(null))").show(false)
+-------------------------------------+
|array_union(array(NULL), array(NULL))|
+-------------------------------------+
|[NULL] |
+-------------------------------------+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more interesting, because I am not sure null has offsets in the ListArray, I suppose my code just get into the union function, but it exits immediately with empty results, because the offsets are empty...How would you go about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scala> spark.sql("select array_union(array(null), array(null))").show(false) +-------------------------------------+ |array_union(array(NULL), array(NULL))| +-------------------------------------+ |[NULL] | +-------------------------------------+
array(null) is not the same as null.
array(null) is like array_union scalar function #8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to add an extra branch in my match case like so:
(DataType::Null, DataType::Null) => {
let data = ArrayData::new_empty(&DataType::Null);
Ok(Arc::new(FixedSizeListArray::from(data)))
}
but that also produce a NULL response rather than an empty arary, do you have any guidance @comphead ? It's unclear to me how to get an []
and also a [NULL]
[hello, datafusion] | ||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets remove empty lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @edmondop -- this is a really nice (first 🤯 ) contribution and a great example of a team effort.
I suggest we merge this PR as is, and file a ticket to fix the null handling as a follow on PR. My rationale is that this is a new function, so it isn't introducing a regression and we can iteratively improve it over time
let r_values = converter.convert_columns(&[r_values])?; | ||
|
||
// Might be worth adding an upstream OffsetBufferBuilder | ||
let mut offsets = Vec::<OffsetSize>::with_capacity(l.len() + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really neat implementation @edmondop.
|
||
# array_union scalar function #6 | ||
query ? | ||
select array_union([], []); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am even surprised my code work here tbh, do I need to add an additional branch to pattern matching where both array have null data type and return an empty array?
I don't think you should check the input types (they should still be lists, they'll just have an offset buffer of [0, 1]
Maybe you just need to handle the case specially in array_union
Thanks again everyone who contributed to this PR |
Which issue does this PR close?
Closes #6981
Rationale for this change
Support additional array_union SQL keyword
Are these changes tested?
Additional slt tests have been added, obviously they do not pass until duplication issue is solved :)