-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sum(distinct)
support
#2405
sum(distinct)
support
#2405
Conversation
cc @andygrove @alamb @yjshen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM -- thanks @WinkerDu
"+--------------+-----------------------+", | ||
"| AVG(test.c1) | SUM(DISTINCT test.c2) |", | ||
"+--------------+-----------------------+", | ||
"| 1.75 | 3 |", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -297,6 +297,18 @@ pub(crate) fn sum(lhs: &ScalarValue, rhs: &ScalarValue) -> Result<ScalarValue> { | |||
(ScalarValue::Int64(lhs), ScalarValue::Int8(rhs)) => { | |||
typed_sum!(lhs, rhs, Int64, i64) | |||
} | |||
(ScalarValue::Int64(lhs), ScalarValue::UInt64(rhs)) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a fine change in this PR -- though it is strange to me that we have to be doing these casts in sum.rs as it duplicates some non trivial amount of the logic in coercion -- maybe it would be possible to make this code cleaner / consolidate more of the coercion logic.
Again, no changes needed for this PR but I figured I would point it out while reading this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alamb
I'd look into this coercion logic in some follow up pr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coercion u64 to i64 seems irrational to me. Why do we need this kind of coercion in sum distinct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @WinkerDu is just following the same pre-existing pattern with this PR
I agree that the pre-existing pattern doesn't make sense to me (it should have already been done by the time the distinct aggregate code is being executed)
|
||
fn update(&mut self, values: &[ScalarValue]) -> Result<()> { | ||
values.iter().for_each(|v| { | ||
// If the value is NULL, it is not included in the final sum. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
generic_test_sum_distinct!( | ||
array, | ||
DataType::Int32, | ||
ScalarValue::from(6i64), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
generic_test_sum_distinct!( | ||
array, | ||
DataType::UInt32, | ||
ScalarValue::from(4i64), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
cc @andygrove @tustvold @Dandandan |
Thanks @WinkerDu for the code and @yjshen @Ted-Jiang and @xudong963 for the reviews. It will be nice to clean up some of the coercion logic -- @WinkerDu let me know if I should file a ticket to track that work |
Which issue does this PR close?
Closes #2404 .
Rationale for this change
For now, DataFusion hasn't support
sum(distinct)
yet. Though optimizerSingleDistinctToGroupBy
(#1315) supports single distinct usage, there are more SQL scenes have not been covered.For example, runs the following unit test,
then error raises like
What changes are included in this PR?
Introduces
expressions::DistinctSum
into DFHashSet
to record unique numeric list, updateHashSet
when new item input.ScalarValue::List
, which Built fromHashSet
.DistinctSum
by computing sum from numeric stored inHashSet
Are there any user-facing changes?
No.