Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: impl the basic string_agg function #8148

Merged
merged 16 commits into from
Nov 18, 2023
Merged

Conversation

haohuaijin
Copy link
Contributor

Which issue does this PR close?

part of #7910

Rationale for this change

add a new aggregate function string_agg

What changes are included in this PR?

Implement the basic string_agg function, which currently does not support distinct and order by clauses.
Support for distinct and order by will be added in future pr.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels Nov 13, 2023
a

query TTTT
SELECT STRING_AGG('a',','), STRING_AGG('a', NULL), STRING_AGG(NULL, ','), STRING_AGG(NULL, NULL)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In postgresql, select string_agg('a', NULL) return a
but in duckdb, select string_agg('a', NULL) return NULL
I implemented it according to postgresql.

@@ -2987,3 +2987,66 @@ NULL NULL 1 NULL 3 6 0 0 0
NULL NULL 1 NULL 5 15 0 0 0
3 0 2 1 5.5 16.5 0.5 4.5 1.5
3 0 3 1 6 18 2 18 6

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ported these tests over from duckdb.

@alamb
Copy link
Contributor

alamb commented Nov 14, 2023

cc @universalmind303 -- do you have time to help review this PR?

@2010YOUY01
Copy link
Contributor

Thank you! The implementation looks good to me, one minor suggestion is to add some empty string cases to sqllogictest like select string_agg('', '|'), string_agg('a', '');

Also adopting test cases from mature systems like Postgres/Duckdb is a really good idea 👍🏼 , we should encourage contributors to do so when adding new functions

@haohuaijin
Copy link
Contributor Author

select string_agg('', '|'), string_agg('a', '');

Thanks for this example @2010YOUY01 , I found a bug in handling empty strings and nulls, which has now been fixed and added additional tests.

@haohuaijin
Copy link
Contributor Author

the ci break maybe related to #8169

@ozankabak
Copy link
Contributor

@haohuaijin, indeed seems like it. We will fix promptly

@haohuaijin
Copy link
Contributor Author

@ozankabak I've submitted a PR to try to fix it #8186

@ozankabak
Copy link
Contributor

Thank you for the hotfix, for some reason we didn't realize the tweak needed to accommodate #8034

@ozankabak
Copy link
Contributor

I will approve the hotfix once CI passes

@haohuaijin
Copy link
Contributor Author

PTAL @alamb

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @haohuaijin this looks good to me -- I have a few small suggestions, but overall I think it looks good.

One thing I thought of was is I think this function will be non deterministic when aggregating without an ORDER BY -- specifically since the aggregate can see the inputs in any order when doing a multi-phase repartitioned group by.

I don't think we need to fix this now, but we should probably file a ticket to fix it

Thanks again

datafusion/physical-expr/src/aggregate/string_agg.rs Outdated Show resolved Hide resolved

impl Accumulator for StringAggAccumulator {
fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
let string_array: Vec<_> = as_generic_string_array::<i64>(&values[0])?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using i64 means this will always take LargeStringArray, right? Shouldn't this use i32 for StringArray and i64 for LargeStringArray? However, the tests seem to cover this so this one looks good to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do some cast like in sum i32 -> i64, so when we input a StringArray, cast it to a LargeStringArray.

Comment on lines +3046 to +3048
SELECT 'text1'::varchar(1000) as my_column union all
SELECT 'text1'::varchar(1000) as my_column union all
SELECT 'text1'::varchar(1000) as my_column
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
SELECT 'text1'::varchar(1000) as my_column union all
SELECT 'text1'::varchar(1000) as my_column union all
SELECT 'text1'::varchar(1000) as my_column
SELECT 'text1'::varchar(1000) as my_column union all
SELECT 'text2'::varchar(1000) as my_column union all
SELECT 'text3'::varchar(1000) as my_column

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with out order by inside string_agg, the result is non deterministic, so I keep text1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense -- thank you

@haohuaijin
Copy link
Contributor Author

we should probably file a ticket to fix it

track in #8260

@alamb
Copy link
Contributor

alamb commented Nov 18, 2023

Thanks @haohuaijin @universalmind303 and @2010YOUY01

@alamb alamb merged commit 76ced31 into apache:main Nov 18, 2023
22 checks passed
@haohuaijin haohuaijin deleted the str_agg branch November 18, 2023 11:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants