Support array_distinct function. #8268

Merged
merged 10 commits into from
Dec 8, 2023

Conversation

my-vegetable-has-exploded
Contributor

Which issue does this PR close?

Closes #7289

Rationale for this change

Use list.iter().sorted().dedup() to remove duplicates from each list in the ListArray.
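
As a rough illustration of the sort-then-dedup idea on a single list, here is a std-only sketch. It does not use itertools or Arrow; `distinct_sorted` is a hypothetical helper name, and `Option<i64>` is only an assumed stand-in for a nullable list element.

```rust
/// Return the distinct elements of one list, sorted ascending.
/// `None` models a null element; `Option<T>` orders `None` before `Some(_)`.
fn distinct_sorted(list: &[Option<i64>]) -> Vec<Option<i64>> {
    let mut values = list.to_vec();
    values.sort(); // brings equal elements next to each other
    values.dedup(); // removes consecutive duplicates
    values
}
```

Sorting first matters because `dedup` only removes adjacent repeats; a side effect is that the original element order is not preserved.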

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels Nov 19, 2023
@my-vegetable-has-exploded my-vegetable-has-exploded changed the title Minor: Implement array_distinct function. Minor: Support array_distinct function. Nov 19, 2023
Contributor

@2010YOUY01 2010YOUY01 left a comment

Thanks for the new function support, I did some tests and have a few suggestions:

❯ select array_distinct([]);
Optimizer rule 'simplify_expressions' failed
caused by
Internal error: could not cast value to arrow_array::array::list_array::GenericListArray<i32>.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

I think this empty array case should be handled inside the implementation (and also covered in sqllogictest).
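
To make the empty and null cases concrete, here is a minimal std-only sketch of per-row distinct over an offsets-based list layout. It does not reproduce the cast error above; `ListColumn` and `distinct_per_row` are hypothetical names, and the layout is only an assumed simplification of a list array.

```rust
/// Minimal stand-in for a list column: `offsets[i]..offsets[i + 1]` is the
/// slice of `values` that belongs to row `i`; `valid[i] == false` marks a null row.
struct ListColumn {
    offsets: Vec<usize>,
    values: Vec<Option<i32>>,
    valid: Vec<bool>,
}

/// Deduplicate every row independently and rebuild offsets and values.
fn distinct_per_row(input: &ListColumn) -> ListColumn {
    // One offset per row plus the leading zero, so reserve the capacity up front.
    let mut offsets = Vec::with_capacity(input.offsets.len());
    offsets.push(0);
    let mut values = Vec::new();
    for row in 0..input.valid.len() {
        if input.valid[row] {
            let start = input.offsets[row];
            let end = input.offsets[row + 1];
            let mut items = input.values[start..end].to_vec();
            items.sort();
            items.dedup();
            values.extend(items);
        }
        // Null rows and empty rows contribute no values, so the offset repeats.
        offsets.push(values.len());
    }
    ListColumn { offsets, values, valid: input.valid.clone() }
}
```

With this shape, an empty row and a null row both fall out of the loop naturally: each produces a zero-length range in the output offsets.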

datafusion/expr/src/built_in_function.rs (resolved)
@alamb
Contributor

alamb commented Nov 21, 2023

I merged #8269 so we can probably pick up the change for this PR

@my-vegetable-has-exploded
Contributor Author

PTAL, @alamb @jayzhan211 @2010YOUY01 , thanks.

Member

@Weijun-H Weijun-H left a comment

Thanks 👍

datafusion/physical-expr/src/array_expressions.rs (outdated, resolved)
datafusion/sqllogictest/test_files/array.slt (resolved)
datafusion/physical-expr/src/array_expressions.rs (outdated, resolved)
let converter = RowConverter::new(vec![SortField::new(dt.clone())])?;
// distinct for each list in ListArray
for arr in array.iter().flatten() {
    let values = converter.convert_columns(&[arr])?;
Contributor

Why not compute the distinct values in a columnar way? arr has only one column, and using the row format requires extra encoding and decoding.

Contributor

@jayzhan211 jayzhan211 Nov 29, 2023

Why not compute the distinct values in a columnar way? arr has only one column, and using the row format requires extra encoding and decoding.

It would be great to deduplicate the array without the row converter, but I don't think we can do that without first downcasting to the exact array type. Is there a recommended way?

Contributor

I also don't have an idea other than downcasting arr; I was just wondering whether it is worth downcasting to the exact array type.

Contributor

I also don't have an idea other than downcasting arr; I was just wondering whether it is worth downcasting to the exact array type.

Downcasting to the exact array type can result in faster code in many cases, as the Rust compiler can generate specialized implementations for each type. However, there are a lot of DataTypes, including nested ones like Dict, List, Struct, etc., so making specialized implementations often requires a lot of work.

The row converter handles all the types internally.

What we have typically done in the past with DataFusion is to use non-type-specific code like RowConverter for the general case, and then, if we find that a particular use case needs faster performance, we make special implementations. For example, we do so for grouping by single primitive columns (GROUP BY int32).
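
To illustrate the trade-off, here is a std-only sketch contrasting a byte-encoded "row" path with a type-specialized path for i32. This is not arrow's RowConverter; `encode_i32`, `decode_i32`, `distinct_via_rows` and `distinct_native` are hypothetical names, and the encoding is just one assumed order-preserving scheme for a single primitive type.

```rust
/// Order-preserving byte encoding for an i32: flip the sign bit, then store
/// big-endian, so byte-wise comparison agrees with numeric comparison.
fn encode_i32(v: i32) -> [u8; 4] {
    ((v as u32) ^ 0x8000_0000).to_be_bytes()
}

/// Inverse of `encode_i32`.
fn decode_i32(bytes: [u8; 4]) -> i32 {
    (u32::from_be_bytes(bytes) ^ 0x8000_0000) as i32
}

/// General path: encode every value to bytes, sort and dedup the byte rows,
/// then decode the survivors. The same code could serve any encodable type.
fn distinct_via_rows(values: &[i32]) -> Vec<i32> {
    let mut rows: Vec<[u8; 4]> = values.iter().map(|v| encode_i32(*v)).collect();
    rows.sort();
    rows.dedup();
    rows.into_iter().map(decode_i32).collect()
}

/// Specialized path: work on the native values, with no encode/decode step.
fn distinct_native(values: &[i32]) -> Vec<i32> {
    let mut out = values.to_vec();
    out.sort();
    out.dedup();
    out
}
```

Both paths return the same result; the first pays for encoding and decoding but needs no per-type code, while the second must be written once per type.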

Contributor

@jayzhan211 jayzhan211 left a comment

LGTM

datafusion/physical-expr/src/array_expressions.rs (outdated, resolved)
@my-vegetable-has-exploded my-vegetable-has-exploded changed the title Minor: Support array_distinct function. Support array_distinct function. Nov 30, 2023
@my-vegetable-has-exploded
Contributor Author

Since there have been no more comments for a few days, I think maybe this PR can go ahead?
cc @alamb, thanks.

@alamb
Contributor

alamb commented Dec 6, 2023

Thanks @my-vegetable-has-exploded -- I'll take a look hopefully today or maybe tomorrow

Comment on lines +2163 to +2164
let array = as_list_array(&args[0])?;
general_array_distinct(array, field)
Member

nit: put let array = as_list_array(&args[0])?; in general_array_distinct

Contributor Author

Sorry, I don't get your point. LargeList differs from List, so I think it may be better to handle it before the generic function?

Member

I mean change the function signature:

general_array_distinct<OffsetSize: OffsetSizeTrait>(
    array: &ArrayRef,
    field: &FieldRef,
)

Then cast array in the function
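
As a rough sketch of why a generic offset parameter helps, the std-only example below runs one function body over both 32-bit and 64-bit offsets. `Offset` is a hypothetical stand-in for an offset-size trait, and `general_distinct` here operates on plain slices rather than on an ArrayRef, so the downcast step itself is not shown.

```rust
/// Hypothetical stand-in for an offset-size trait: a List-like layout would
/// use 32-bit offsets, a LargeList-like layout 64-bit offsets.
trait Offset: Copy {
    fn to_usize(self) -> usize;
    fn from_usize(n: usize) -> Self;
}

impl Offset for i32 {
    fn to_usize(self) -> usize { self as usize }
    fn from_usize(n: usize) -> Self { n as i32 }
}

impl Offset for i64 {
    fn to_usize(self) -> usize { self as usize }
    fn from_usize(n: usize) -> Self { n as i64 }
}

/// Deduplicate each row of an offsets-based list layout, generic over the
/// offset width so the same body serves both layouts.
fn general_distinct<O: Offset>(offsets: &[O], values: &[i32]) -> (Vec<O>, Vec<i32>) {
    let mut new_offsets = Vec::with_capacity(offsets.len());
    new_offsets.push(O::from_usize(0));
    let mut new_values = Vec::new();
    // Each adjacent pair of offsets delimits one row.
    for pair in offsets.windows(2) {
        let mut items = values[pair[0].to_usize()..pair[1].to_usize()].to_vec();
        items.sort();
        items.dedup();
        new_values.extend(items);
        new_offsets.push(O::from_usize(new_values.len()));
    }
    (new_offsets, new_values)
}
```

A caller would pick the instantiation once, for example `general_distinct::<i32>` for the 32-bit layout and `general_distinct::<i64>` for the 64-bit one, which is roughly the dispatch pattern discussed in this thread.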

Contributor

I think you are referring to something like general_array_has_dispatch

Contributor

I think we can merge this PR as is and then add support for LargeList (using the OffsetSize trait) as a follow on PR

Contributor

@alamb alamb left a comment

Thank you for this contribution @my-vegetable-has-exploded, and thank you @Weijun-H and @jayzhan211 for the help getting this PR ready.

I think it looks very nice and is a good example of collaboration 🦾

@alamb
Contributor

alamb commented Dec 8, 2023

I took the liberty of merging up from main to make sure there are no logical conflicts. I intend to merge the PR when the tests pass.

@alamb alamb merged commit cd02c40 into apache:main Dec 8, 2023
23 checks passed
@my-vegetable-has-exploded
Contributor Author

Thanks all.

@my-vegetable-has-exploded my-vegetable-has-exploded deleted the array-distinct branch December 9, 2023 11:55
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Dec 15, 2023
* implement distinct func

implement slt & proto

fix null & empty list

* add comment for slt

Co-authored-by: Alex Huang <[email protected]>

* fix largelist

* add largelist for slt

* Use collect for rows & init capcity for offsets.

* fixup: remove useless match

* fix fmt

* fix fmt

---------

Co-authored-by: Alex Huang <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Development

Successfully merging this pull request may close these issues.

Implement array_distinct function
6 participants