
Implement exact median, add AggregateState #3009

Merged: 23 commits into apache:master, Aug 5, 2022

Conversation

@andygrove (Member) commented Aug 1, 2022

Which issue does this PR close?

Closes #2925

Rationale for this change

Needed for h2o benchmarks.

What changes are included in this PR?

  • Add the ability for accumulators to return either arrays or scalar values
  • Implement new median aggregate
  • Remove a few calls to unwrap

Are there any user-facing changes?

Yes, if implementing UDAFs.
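
Concretely, the shape of that change for UDAF authors is roughly the following (a sketch based on the discussion in this PR, not the exact merged code; the trait name below is abbreviated for illustration):

use arrow::array::ArrayRef;
use datafusion_common::{Result, ScalarValue};

// Sketch: a state entry can now be a whole array, not only a scalar.
#[derive(Debug)]
pub enum AggregateState {
    Scalar(ScalarValue),
    Array(ArrayRef),
}

// Accumulator::state previously returned Result<Vec<ScalarValue>>;
// after this PR it returns the richer state type instead.
pub trait AccumulatorStateSketch {
    fn state(&self) -> Result<Vec<AggregateState>>;
}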

@andygrove added the "api change" label (Changes the API exposed to users of the crate) Aug 1, 2022
@github-actions bot added the "core" (Core DataFusion crate), "logical-expr" (Logical plan and expressions), "physical-expr" (Physical Expressions), and "sql" (SQL Planner) labels Aug 1, 2022
@codecov-commenter commented Aug 1, 2022

Codecov Report

Merging #3009 (ca9d6a9) into master (c7fa789) will decrease coverage by 0.00%.
The diff coverage is 79.79%.

@@            Coverage Diff             @@
##           master    #3009      +/-   ##
==========================================
- Coverage   85.81%   85.81%   -0.01%     
==========================================
  Files         282      286       +4     
  Lines       51531    51790     +259     
==========================================
+ Hits        44219    44441     +222     
- Misses       7312     7349      +37     
Impacted Files Coverage Δ
.../physical-expr/src/aggregate/array_agg_distinct.rs 80.18% <0.00%> (ø)
datafusion/physical-expr/src/aggregate/mod.rs 25.00% <ø> (ø)
datafusion/physical-expr/src/expressions/mod.rs 100.00% <ø> (ø)
datafusion/proto/src/from_proto.rs 35.53% <0.00%> (-0.05%) ⬇️
datafusion/proto/src/lib.rs 93.47% <0.00%> (ø)
datafusion/proto/src/to_proto.rs 53.03% <0.00%> (-0.19%) ⬇️
datafusion/physical-expr/src/aggregate/median.rs 63.85% <63.85%> (ø)
datafusion/physical-expr/src/aggregate/build_in.rs 89.92% <66.66%> (-0.26%) ⬇️
datafusion/expr/src/accumulator.rs 77.77% <77.77%> (ø)
...tafusion/core/src/physical_plan/aggregates/hash.rs 92.95% <100.00%> (ø)
... and 29 more


@andygrove changed the title from "WIP: Implement exact median" to "Implement exact median" Aug 3, 2022
@andygrove marked this pull request as ready for review August 3, 2022 02:00
@andygrove requested a review from alamb August 3, 2022 02:27
@alamb (Contributor) commented Aug 3, 2022

I plan to review this carefully later today

@alamb changed the title from "Implement exact median" to "Implement exact median, add AggregateState" Aug 3, 2022
alamb previously approved these changes Aug 3, 2022

@alamb (Contributor) left a comment:

Thanks @andygrove -- I think this is great!

I think the way that other aggregates (like distinct) have handled the need to hold multiple values is to encode them in ScalarValue::List -- I suspect this approach will be much higher performance (and I can imagine adding other extension variants here, like Box<dyn Any> or something, to allow people to encode their aggregate state using whatever custom type they wanted 🤔).
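
The extension variant being imagined might look something like this (purely hypothetical; it is not part of this PR):

use arrow::array::ArrayRef;
use datafusion_common::ScalarValue;

// Hypothetical extension of AggregateState, not implemented in this PR:
// a Custom variant would let a UDAF carry arbitrary state between
// accumulators, at the cost of needing its own serialization story.
pub enum AggregateState {
    Scalar(ScalarValue),
    Array(ArrayRef),
    Custom(Box<dyn std::any::Any + Send + Sync>),
}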

I am still somewhat worried about the split in Grouping that we have: a Row and a Column one -- e.g. the Row accumulator does not support median. However, I think this is tracked in #2723 and I don't think anything new is needed for this PR

https://github.com/apache/arrow-datafusion/blob/0ff59de810f344b197b2e9491a0a9aefca52d88f/datafusion/physical-expr/src/aggregate/row_accumulator.rs

But may

@@ -44,3 +44,27 @@ pub trait Accumulator: Send + Sync + Debug {
/// returns its value based on its current state.
fn evaluate(&self) -> Result<ScalarValue>;
}

#[derive(Debug)]
pub enum AggregateState {
Contributor:
This is a very elegant idea. Can you please add docstrings to AggregateState explaining what is going on?

I think it would be worth updating the docstrings in the accumulator trait with some discussion / examples of how to use the Array state.
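
A docstring-annotated version of the enum might read something like this (a sketch only, assuming the Scalar/Array variants shown in the diff above):

use arrow::array::ArrayRef;
use datafusion_common::ScalarValue;

/// Intermediate state returned by `Accumulator::state`.
///
/// Most aggregates keep a small, fixed-size state (a running sum, a count)
/// and return it as `Scalar`. Aggregates such as MEDIAN, which must retain
/// every input value until `evaluate` is called, can return their buffered
/// values as an `Array` instead of one `ScalarValue` per row.
#[derive(Debug)]
pub enum AggregateState {
    /// A single scalar value.
    Scalar(ScalarValue),
    /// An array of values, e.g. all inputs buffered so far.
    Array(ArrayRef),
}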

datafusion/expr/src/accumulator.rs (outdated; resolved)
datafusion/physical-expr/src/aggregate/count_distinct.rs (outdated; resolved)
use std::sync::Arc;

/// MEDIAN aggregate expression. This uses a lot of memory because all values need to be
/// stored in memory before a result can be computed. If an approximation is sufficient
Contributor:

I wonder if it is worth (perhaps as a follow-on PR) putting a cap on the number of values DataFusion will try to buffer to compute the median, and throwing a runtime error if that number is exceeded 🤔 That way we could avoid OOM kills
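
Such a cap could be little more than a size check in update_batch, for example (a sketch; the limit constant and error message are made up for illustration):

use datafusion_common::{DataFusionError, Result};

// Hypothetical cap on the number of values MEDIAN will buffer before
// giving up with a runtime error instead of risking an OOM kill.
const MAX_BUFFERED_ROWS: usize = 10_000_000;

fn check_buffer_limit(buffered_rows: usize, incoming_rows: usize) -> Result<()> {
    if buffered_rows + incoming_rows > MAX_BUFFERED_ROWS {
        return Err(DataFusionError::Execution(format!(
            "MEDIAN would buffer more than {} values; consider APPROX_MEDIAN",
            MAX_BUFFERED_ROWS
        )));
    }
    Ok(())
}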

@alamb dismissed their stale review August 3, 2022 20:30

I clicked "submit" too early

@alamb (Contributor) left a comment:

I think it looks good to me (this time I actually finished the review!)

I left some suggestions on how the implementation might be made better, but I think any of them could be done as follow-ons.

The only thing I think we should really do prior to merging this PR is to add additional coverage in the sql_integration test.

#[derive(Debug)]
struct MedianAccumulator {
data_type: DataType,
all_values: Vec<ArrayRef>,
Contributor:
I wonder if you would be better served here by using an ArrayBuilder (though I realize they are strongly typed, so it might be more awkward -- though it is likely faster)
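
A builder-based state might look roughly like this, specialized to Int64 for illustration (a sketch against a recent arrow-rs builder API; older arrow versions take a capacity argument and return Result from append_value):

use std::sync::Arc;
use arrow::array::{ArrayRef, Int64Array, Int64Builder};

// Sketch of the suggestion: buffer values into a typed builder instead of
// keeping one small ArrayRef per input batch.
struct TypedMedianState {
    builder: Int64Builder,
}

impl TypedMedianState {
    fn update(&mut self, values: &Int64Array) {
        // iter() yields Option<i64>; flatten() skips the nulls.
        for v in values.iter().flatten() {
            self.builder.append_value(v);
        }
    }

    fn finish(&mut self) -> ArrayRef {
        Arc::new(self.builder.finish())
    }
}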

.map(|v| AggregateState::Array(v.clone()))
.collect();
if vec.is_empty() {
match self.data_type {
Contributor:
Is it correct to produce a single [0] element array? Wouldn't that mean that the 0 is now included in the median calculation even though it was not in the original data?

Member Author:
These arrays have length 0. I just pushed a refactor to clean this up and make it more obvious.


fn evaluate(&self) -> Result<ScalarValue> {
match self.all_values[0].data_type() {
DataType::Int8 => median!(self, arrow::datatypes::Int8Type, Int8, 2),
Contributor:
Instead of using a macro here, I wonder if you could use the concat and take kernels

https://docs.rs/arrow/19.0.0/arrow/compute/kernels/concat/index.html
https://docs.rs/arrow/19.0.0/arrow/compute/kernels/take/index.html

Something like (untested):

let sorted = sort(concat(&self.all_values));
let len = sorted.len();
let mid = len / 2;
if len % 2 == 0 {
  let indexes: UInt64Array = [mid-1, mid].into_iter().collect();
  // 🤔  Not sure how to do an average:
  let values = average(take(sorted, indexes)) 
  ScalarValue::try_from_array(values, 0)
} else {
  ScalarValue::try_from_array(sorted, mid)
} 

But the need for an average stymies that - though I guess we could implement an average kernel in datafusion and then put it back into arrow
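
One way around the missing average kernel is to pull the two middle values out and average them in plain Rust after concatenating, roughly like this (an untested sketch specialized to Int64; the PR itself uses a macro to cover all supported numeric types):

use arrow::array::{Array, ArrayRef, Int64Array};
use arrow::compute::concat;
use datafusion_common::{DataFusionError, Result, ScalarValue};

fn median_i64(all_values: &[ArrayRef]) -> Result<ScalarValue> {
    // Concatenate the buffered batches into one array.
    let refs: Vec<&dyn Array> = all_values.iter().map(|a| a.as_ref()).collect();
    let combined = concat(&refs)?;
    let combined = combined
        .as_any()
        .downcast_ref::<Int64Array>()
        .ok_or_else(|| DataFusionError::Internal("expected Int64Array".to_string()))?;

    // Collect the non-null values and sort them in plain Rust.
    let mut values: Vec<i64> = combined.iter().flatten().collect();
    values.sort_unstable();

    let len = values.len();
    if len == 0 {
        return Ok(ScalarValue::Int64(None));
    }
    let mid = len / 2;
    let median = if len % 2 == 0 {
        // Integer average of the two middle values.
        (values[mid - 1] + values[mid]) / 2
    } else {
        values[mid]
    };
    Ok(ScalarValue::Int64(Some(median)))
}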

}

/// Combine all non-null values from provided arrays into a single array
fn combine_arrays<T: ArrowPrimitiveType>(arrays: &[ArrayRef]) -> Result<ArrayRef> {
Contributor:
You might be able to do this with concat and take as well

Untested

let final_array = concat(arrays);
let indexes = final_array.iter().enumerate().filter_map(|(i, v)| v.map(|_| i)).collect();
take(final_array, indexes)
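
A runnable variant of that idea, using a null mask and the filter kernel after concatenating (a sketch, not the code merged in this PR):

use arrow::array::{Array, ArrayRef};
use arrow::compute::{concat, filter, is_not_null};
use datafusion_common::Result;

/// Combine all non-null values from the provided arrays into a single array.
fn combine_non_null(arrays: &[ArrayRef]) -> Result<ArrayRef> {
    let refs: Vec<&dyn Array> = arrays.iter().map(|a| a.as_ref()).collect();
    let combined = concat(&refs)?;
    // Build a boolean mask that is true for non-null slots, then filter on it.
    let mask = is_not_null(combined.as_ref())?;
    Ok(filter(combined.as_ref(), &mask)?)
}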

// specific language governing permissions and limitations
// under the License.

//! Utilities used in aggregates
Contributor:
👍

@@ -221,7 +221,7 @@ async fn csv_query_stddev_6() -> Result<()> {
}

#[tokio::test]
async fn csv_query_median_1() -> Result<()> {
Contributor:
If possible, I would recommend adding a basic SQL test for median for all the data types that are supported -- not just on aggregate_test_100, but a dedicated test setup with known data (maybe the integers 10, 9, 8, ..., 0).

Member Author:
Added in ef1effd

@andygrove (Member Author) commented:
Thanks for the review @alamb. I will work on addressing the feedback over the next couple of days.

@yjshen (Member) commented Aug 4, 2022

The main reason row_aggregate was separated from aggregate is the intention to do in-place updates for row-based states. I assume we could store pointers in RowLayout::WordAligned for varlena states during accumulation, and finalize and inline them into the row state when spilling later. UDAFs with a Box<dyn Any> state would require more, perhaps an extra serde mechanism.

For the median state store, a Map<value, value_occurrence_count> is likely to be more space-efficient, though it may require more computation than the current approach and likely would not work with arrow compute kernels.
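
To make that concrete, a count-based state for an integer type could look roughly like this (a sketch only; float keys would need extra care since f64 is not Ord, as discussed below):

use std::collections::BTreeMap;

/// Sketch of a count-based median state: space grows with the number of
/// distinct values rather than the number of rows.
#[derive(Debug, Default)]
struct CountedMedianState {
    counts: BTreeMap<i64, u64>,
    total: u64,
}

impl CountedMedianState {
    fn insert(&mut self, v: i64) {
        *self.counts.entry(v).or_insert(0) += 1;
        self.total += 1;
    }

    /// Walk the sorted counts until the middle position is reached.
    /// For simplicity this returns the lower of the two middle values
    /// when the total count is even.
    fn median(&self) -> Option<i64> {
        if self.total == 0 {
            return None;
        }
        let target = (self.total - 1) / 2;
        let mut seen = 0u64;
        for (value, count) in &self.counts {
            seen += count;
            if seen > target {
                return Some(*value);
            }
        }
        None
    }
}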

@andygrove (Member Author) commented:

For the median state store, a Map<value, value_occurrence_count> is likely to be more space-efficient, though it may require more computation than the current approach and likely would not work with arrow compute kernels.

Clever idea. I'm not sure that would work for floating-point types?

@andygrove (Member Author) commented:

I suspect this approach will be much higher performance

@alamb It isn't clear to me which approach you are referring to here. I assume you are saying that the approach in this PR of using Array rather than ScalarValue::List is likely more performant?

@alamb (Contributor) commented Aug 4, 2022

@alamb It isn't clear to me which approach you are referring to here. I assume you are saying that the approach in this PR of using Array rather than ScalarValue::List is likely more performant?

I guess I was saying that making lots of small Arrays and concatenating them together may not be all that much more performant than a ScalarValue::List, but I haven't measured it.

Clever idea. I'm not sure that would work for floating-point types?

It would probably work but likely take up more space 😆

In general, @yjshen's approach would be good for low cardinality (relatively few distinct values) and not as good for high cardinality (relatively many distinct values) -- floating-point data often happens to be quite high cardinality.

@andygrove (Member Author) commented:

@alamb I think this is ready for another look. I added tests and filed a couple of follow-on issues:

}

#[tokio::test]
async fn median_u8() -> Result<()> {
Contributor:
nice

"approx_median",
DataType::Float64,
Arc::new(Float64Array::from(vec![1.1, f64::NAN, f64::NAN, f64::NAN])),
"NaN", // probably not the desired behavior? - see https://github.com/apache/arrow-datafusion/issues/3039
Contributor:
testing for the win!

@@ -127,7 +127,13 @@ where
l.as_ref().parse::<f64>().unwrap(),
r.as_str().parse::<f64>().unwrap(),
);
assert!((l - r).abs() <= 2.0 * f64::EPSILON);
if l.is_nan() || r.is_nan() {
Contributor:
👍
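
The NaN handling in the test helper presumably ends up with a shape like this (a sketch of the check, not the exact merged code):

// Treat two NaNs as equal for result-comparison purposes; otherwise fall
// back to the approximate equality check used before this change.
fn assert_floats_match(l: f64, r: f64) {
    if l.is_nan() || r.is_nan() {
        assert!(l.is_nan() && r.is_nan(), "{} != {}", l, r);
    } else {
        assert!((l - r).abs() <= 2.0 * f64::EPSILON, "{} != {}", l, r);
    }
}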

@andygrove merged commit 245def0 into apache:master Aug 5, 2022
@andygrove deleted the median branch August 5, 2022 19:56
@ursabot commented Aug 5, 2022

Benchmark runs are scheduled for baseline = 581934d and contender = 245def0. 245def0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
