Push limit into aggregation for DISTINCT ... LIMIT queries #8038
Conversation
Results from criterion_benchmark_limited_distinct:
baseline: custom-measurement-time/distinct_group_by_u64_narrow_limit_10
new: custom-measurement-time/distinct_group_by_u64_narrow_limit_10

Results from criterion_benchmark_limited_distinct_sampled:
baseline:
Benchmarking distinct query with 100 partitions and 100000 samples per partition with limit 10: Warming up for 3.0000 s
Benchmarking distinct query with 10 partitions and 1000000 samples per partition with limit 10: Warming up for 3.0000 s
Benchmarking distinct query with 1 partitions and 10000000 samples per partition with limit 10: Warming up for 3.0000 s
new:
Benchmarking distinct query with 100 partitions and 100000 samples per partition with limit 10: Warming up for 3.0000 s
Benchmarking distinct query with 10 partitions and 1000000 samples per partition with limit 10: Warming up for 3.0000 s
Benchmarking distinct query with 1 partitions and 10000000 samples per partition with limit 10: Warming up for 3.0000 s
This is looking very neat @msirek
/// If the number of `group_values` in a single batch exceeds this value,
/// the `GroupedHashAggregateStream` operation immediately switches to
/// output mode and emits all groups.
group_values_soft_limit: Option<usize>,
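For intuition, here is a minimal sketch of the per-batch check this field implies; the function and parameter names are hypothetical, not the PR's actual code:

```rust
fn soft_limit_reached(group_count: usize, soft_limit: Option<usize>) -> bool {
    // None means no limit hint was pushed into this aggregation,
    // so the stream never switches to output mode early.
    matches!(soft_limit, Some(limit) if group_count >= limit)
}

fn main() {
    assert!(soft_limit_reached(10, Some(10))); // at the limit: emit groups now
    assert!(!soft_limit_reached(9, Some(10))); // under the limit: keep reading
    assert!(!soft_limit_reached(9, None));     // no hint: normal aggregation
}
```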
👍
Thanks. I believe this is ready for a look. Please let me know if you think I should break it down into smaller PRs.
I can't see any good way to break this down (other than maybe breaking out the benchmarks), so this is fine
Thank you @msirek -- I hope to review it more carefully later today or tomorrow
Wow -- thank you for this contribution @msirek -- it is a very nice read. I found it well documented and well tested. 🦾
I have a suggestion on how to simplify this code in msirek#1, but I think what is in this PR is also correct and thus can be merged.
BTW I would love to know anything you can share about your use case and what you are doing with DataFusion.
cc @avantgardnerio @thinkharderdev and @Dandandan as this is somewhat related to the high-cardinality tracing_id use case
@@ -156,3 +161,83 @@ pub fn create_record_batches(
    })
    .collect::<Vec<_>>()
}

/// Create time series data with `partition_cnt` partitions and `sample_cnt` rows per partition
Thank you for adding the comments here
@@ -79,6 +80,8 @@ impl PhysicalOptimizer {
    // repartitioning and local sorting steps to meet distribution and ordering requirements.
    // Therefore, it should run before EnforceDistribution and EnforceSorting.
    Arc::new(JoinSelection::new()),
    // The LimitedDistinctAggregation rule should be applied before the EnforceDistribution rule
I think adding the rationale for this limitation would be helpful. Your PR description I think explains it pretty well:
Suggested change:

// The LimitedDistinctAggregation rule should be applied before the EnforceDistribution rule
// As that rule may inject other operations in between the different AggregateExecs.
// Applying the rule early means only directly-connected AggregateExecs must be examined.
> I think adding the rationale for this limitation would be helpful. Your PR description I think explains it pretty well:
Made the suggested change.
@@ -86,7 +86,7 @@ impl PhysicalOptimizerRule for CombinePartialFinalAggregate {
        } else {
            AggregateMode::SinglePartitioned
        };
-       AggregateExec::try_new(
+       let combined_agg = AggregateExec::try_new(
Another way to express the same logic with less indenting is:
AggregateExec::try_new(
    mode,
    input_agg_exec.group_by().clone(),
    input_agg_exec.aggr_expr().to_vec(),
    input_agg_exec.filter_expr().to_vec(),
    input_agg_exec.order_by_expr().to_vec(),
    input_agg_exec.input().clone(),
    input_agg_exec.input_schema().clone(),
)
.map(|combined_agg| combined_agg.with_limit(agg_exec.limit()))
.ok()
.map(Arc::new)
> Another way to express the same logic with less indenting is: […]
Applied this change from your example PR, thanks!
let sort = SortExec::new(sort.expr().to_vec(), child)
    .with_fetch(sort.fetch())
    .with_preserve_partitioning(sort.preserve_partitioning());
Some(Arc::new(sort))
    }
}

fn transform_down_mut<F>(
👍 for moving into the trait
@@ -70,6 +70,45 @@ async fn group_by_date_trunc() -> Result<()> {
    Ok(())
}

#[tokio::test]
async fn distinct_group_by_limit() -> Result<()> {
Does this test add additional coverage compared to the tests in `datafusion/sqllogictest/test_files/aggregate.slt`?
Not really, except that it uses `mode=Single`. Removed the test.
if self.order_by_expr().iter().any(|e| e.is_some()) {
    return false;
}
// ensure there is no output ordering; can this rule be relaxed?
it is probably subsumed by the check on the required input ordering, because the group operator doesn't introduce any new orderings
> it is probably subsumed by the check on the required input ordering

OK. Even if there is an ORDER BY in a nested expression (e.g. a derived table), there's no requirement that the rows are presented in that order unless there's a top-level ORDER BY.
Is it OK to keep this check in until we have a known test case where there is an output ordering but no required input ordering? Just want to avoid incorrect results in the event there is some edge case not considered.
yes I think keeping the check is quite prudent
// If spill files exist, stream-merge them.
extract_ok!(self.update_merged_stream());
self.exec_state = ExecutionState::ReadingInput;
if let Poll::Ready(Some(Err(e))) =
I was confused by this at first, as it looks like it discards any batch produced by `set_input_done_and_produce_output`? Like if `set_input_done_and_produce_output` returns `Poll::Ready(Some(batch))` it just gets dropped 🤔
However, then I re-reviewed the code and `set_input_done_and_produce_output` never returns `Poll::Ready(Some(batch))`.
I have a thought about how to simplify this code which I will put up as another PR for your consideration. I don't think this would prevent this PR from merging.
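To make the pattern concrete, here is a tiny self-contained sketch under the assumption discussed above; `finish_input` is a hypothetical stand-in for `set_input_done_and_produce_output`, which in this code path can only surface an error:

```rust
use std::task::Poll;

type Batch = Vec<u64>;

// Hypothetical stand-in: in the PR's code path this helper never yields a
// successful batch, only an error (or nothing at all).
fn finish_input() -> Poll<Option<Result<Batch, String>>> {
    Poll::Ready(Some(Err("spill merge failed".to_string())))
}

fn main() {
    // Only the error arm is matched; an Ok(batch) would be dropped silently,
    // which is safe here only because finish_input never produces one.
    if let Poll::Ready(Some(Err(e))) = finish_input() {
        eprintln!("propagating error: {e}");
    }
}
```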
----------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
------------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1, c2, c3], has_header=true

# TODO(msirek): Extend checking in LimitedDistinctAggregation equal groupings to ignore the order of columns
BTW the `equivalence` module (recently worked on by @ozankabak and @mustafasrepo) has logic to perform this type of analysis: https://github.com/apache/arrow-datafusion/blob/15d8c9bf48a56ae9de34d18becab13fd1942dc4a/datafusion/physical-expr/src/equivalence.rs
> BTW the `equivalence` module … has logic to perform this type of analysis

Thanks, I will take a look.
It would probably take a bit more work and testing to support this since `PhysicalGroupBy` isn't exactly the same as an equivalence class. I've opened issue #8101 for this.
/// When set to true, the optimizer will push a limit operation into
/// grouped aggregations which have no aggregate expressions, as a soft limit,
/// emitting groups once the limit is reached, before all rows in the group are read.
pub enable_distinct_aggregation_soft_limit: bool, default = true
💯 for a disable flag
nice!
Thanks! I've included those commits in this PR.
I'm not really a DataFusion user. I just have a personal interest in query optimization and DataFusion looks pretty neat.
Thanks for the reviews!
🤔 this branch has some conflicts that need to be fixed
Perhaps #8004
Thanks again @msirek -- looks great ❤️
Which issue does this PR close?
Closes #7781.
Rationale for this change
Evaluation of queries of the form
SELECT DISTINCT column_list FROM table LIMIT n;
may read more rows than necessary when performing a grouped hash aggregation. If a given batch of input rows contains more group values than the LIMIT value, switching the aggregation to output mode early allows the limit to be reached more quickly and minimizes the number of rows which must be processed by the aggregation or read from the input stream.
What changes are included in this PR?
Push limit into AggregateExec for DISTINCT with GROUP BY
This commit adds the physical plan rewrite rule `LimitedDistinctAggregation`, but does not wire it up for use by the optimizer. The rule matches a `LocalLimitExec` or `GlobalLimitExec` operation as the parent of an `AggregateExec` which has a group-by, but no aggregate expressions, order-by or ordering requirements, or filtering, and pushes the limit into the `AggregateExec` as a limit hint.
As the aggregation may be applied in a series of `AggregateExec` operations, the limit is also pushed down a chain of direct `AggregateExec` descendants having identical grouping columns.
The rule must be applied before distribution requirements are enforced, as that rule may inject other operations in between the different `AggregateExec`s. Applying the rule early means only directly-connected `AggregateExec`s need to be examined.
The key point of this rule is that it is only legal for cases where not all rows in the group need to be processed to ensure correctness.
Unit tests for LimitedDistinctAggregation are included.
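For illustration of the rewrite described above, it schematically turns a plan of the first shape below into the second, where `lim` denotes the new limit hint (a sketch, not actual EXPLAIN output):

```
-- before
GlobalLimitExec: skip=0, fetch=5
  AggregateExec: mode=Final, gby=[c1], aggr=[]
    AggregateExec: mode=Partial, gby=[c1], aggr=[]
      ...

-- after: the hint is pushed into both directly-connected AggregateExecs
GlobalLimitExec: skip=0, fetch=5
  AggregateExec: mode=Final, gby=[c1], aggr=[], lim=[5]
    AggregateExec: mode=Partial, gby=[c1], aggr=[], lim=[5]
      ...
```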
Soft limit for GroupedHashAggregateStream with no aggregate expressions
This commit wires up the LimitedDistinctAggregation rule in the physical plan optimizer and updates the GroupedHashAggregateStream with an optional soft limit on the number of `group_values` in a batch. If the number of `group_values` in a single batch exceeds the limit, the operation immediately signals the input is done, switches to output mode, and emits all groups.
This commit includes sqllogictests for DISTINCT queries with a LIMIT.
The CombinePartialFinalAggregate rule is also updated to convey the
limit on the final aggregation to the combined aggregation.
Add datafusion.optimizer.enable_distinct_aggregation_soft_limit setting
This commit adds the datafusion.optimizer.enable_distinct_aggregation_soft_limit
configuration setting, which defaults to true. When true, the
LimitedDistinctAggregation physical plan rewrite rule is enabled, which
pushes a LIMIT into a grouped aggregation with no aggregate expressions,
as a soft limit, to emit all grouped values seen so far once the limit is reached.
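As a usage sketch, the rule can be switched off per session; this assumes the standard `SessionConfig::set_bool` API and is illustrative rather than taken from the PR:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Opt out of the soft-limit pushdown via the new configuration key.
    let config = SessionConfig::new().set_bool(
        "datafusion.optimizer.enable_distinct_aggregation_soft_limit",
        false,
    );
    // DISTINCT ... LIMIT queries planned in this context will not receive
    // the limit hint in their AggregateExecs.
    let _ctx = SessionContext::new_with_config(config);
    Ok(())
}
```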
Fix result checking in topk_aggregate benchmark
This commit fixes the logic which validates the rows returned by the benchmark query.
The test was expecting hexadecimal digits in lowercase, but results are uppercase.
Make the topk_aggregate benchmark's make_data function public
This commit moves the make_data function, which generates either random or ascending
time series data, to the data_utils module, so it could be shared by other benchmarks.
Add benchmark for DISTINCT queries
This commit adds a benchmark for queries using DISTINCT or GROUP BY with a LIMIT clause and no aggregate expressions. It is intended to test the performance of the `LimitedDistinctAggregation` rewrite rule and the new limit hint in `GroupedHashAggregateStream`.
Are these changes tested?
Yes. Unit tests for `LimitedDistinctAggregation`, sqllogictests for DISTINCT queries with a LIMIT, and updated `CombinePartialFinalAggregate` tests are included.
Are there any user-facing changes?
No
Notes
This is opened as a draft PR.
PRs for the individual commits can be opened separately if this is too large to review in one PR.