Optimize the performance queries with a single distinct aggregate #1315

ic4y · 2021-11-16T11:03:49Z

Which issue does this PR close?

resolves #1282
related to #1312

Rationale for this change

Improve the performance of single_distinct_agg

The optimization rewrites a single distinct aggregation as follows:

- Aggregation
    GROUP BY (k)
    F1(DISTINCT s0, s1, ...),
    F2(DISTINCT s0, s1, ...),
  - X

into

- Aggregation
    GROUP BY (k)
    F1(x)
    F2(x)
  - Aggregation
      GROUP BY (k, s0, s1, ...)
    - X
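
As a rough illustration of why the rewrite preserves results, here is a minimal Rust sketch that evaluates COUNT(DISTINCT s) ... GROUP BY k both directly and in the two-phase form. The function names and the `(k, s)` tuple row layout are assumptions made for this example and are not part of DataFusion; NULL handling is not modeled.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Direct evaluation of COUNT(DISTINCT s) ... GROUP BY k:
/// every group keeps its own set of the values seen so far.
fn count_distinct_direct(rows: &[(u32, i64)]) -> BTreeMap<u32, usize> {
    let mut seen: BTreeMap<u32, BTreeSet<i64>> = BTreeMap::new();
    for &(k, s) in rows {
        seen.entry(k).or_default().insert(s);
    }
    seen.into_iter().map(|(k, set)| (k, set.len())).collect()
}

/// Rewritten evaluation:
///   inner: GROUP BY (k, s) with no aggregates (deduplicates the pairs)
///   outer: plain COUNT(s) ... GROUP BY k over the deduplicated pairs
fn count_distinct_two_phase(rows: &[(u32, i64)]) -> BTreeMap<u32, usize> {
    // Inner aggregation: one output row per distinct (k, s) pair.
    let inner: BTreeSet<(u32, i64)> = rows.iter().copied().collect();
    // Outer aggregation: a non-distinct count per k.
    let mut counts: BTreeMap<u32, usize> = BTreeMap::new();
    for (k, _s) in inner {
        *counts.entry(k).or_insert(0) += 1;
    }
    counts
}
```

Both functions return the same map for any input, which is the property the optimizer rule relies on.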

I tested DataFusion on a test data set of 60 million rows before and after enabling the optimizer rule. With the rule enabled, the query ran about twice as fast: execution time dropped from about 12 seconds to about 6 seconds.
The test results and the logical plans before and after optimization are as follows:

sql : select count(distinct LO_EXTENDEDPRICE) from lineorder_flat;

------------------original---------------------
Display: Projection: #COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE) [COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
  Aggregate: groupBy=[[]], aggr=[[COUNT(DISTINCT #lineorder_flat.LO_EXTENDEDPRICE)]] [COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
    TableScan: lineorder_flat projection=Some([9]) [LO_EXTENDEDPRICE:Int64]
+-------------------------------------------------+
| COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE) |
+-------------------------------------------------+
| 1040570                                         |
+-------------------------------------------------+
usage millis: 12033

----------------after optimization-------------
Display: Projection: #COUNT(lineorder_flat.LO_EXTENDEDPRICE) [COUNT(lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
  Aggregate: groupBy=[[]], aggr=[[COUNT(#lineorder_flat.LO_EXTENDEDPRICE)]] [COUNT(lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
    Aggregate: groupBy=[[#lineorder_flat.LO_EXTENDEDPRICE]], aggr=[[]] [LO_EXTENDEDPRICE:Int64]
      TableScan: lineorder_flat projection=Some([9]) [LO_EXTENDEDPRICE:Int64]
+----------------------------------------+
| COUNT(lineorder_flat.LO_EXTENDEDPRICE) |
+----------------------------------------+
| 1040570                                |
+----------------------------------------+
usage millis: 5817

What changes are included in this PR?

Are there any user-facing changes?

nothing

Dandandan · 2021-11-16T16:51:31Z

datafusion/src/optimizer/single_distinct_to_groupby.rs

+            schema: _,
+            group_expr,
+        } => {
+            match is_single_agg(plan) {


This could use if/else instead of matching on a bool.

Dandandan · 2021-11-16T16:56:53Z

Great idea and results! @ic4y

xudong963 · 2021-11-16T17:22:47Z

Thanks for your contribution @ic4y, I'll take a look tomorrow.

xudong963 · 2021-11-17T03:47:15Z

datafusion/src/optimizer/single_distinct_to_groupby.rs

+        LogicalPlan::Aggregate {
+            input,
+            aggr_expr,
+            schema: _,
+            group_expr,
+        } => {


Suggested change

-        LogicalPlan::Aggregate {
-            input,
-            aggr_expr,
-            schema: _,
-            group_expr,
-        } => {
+        LogicalPlan::Aggregate {
+            input,
+            aggr_expr,
+            group_expr,
+            ..
+        } => {
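
For readers less familiar with the `..` rest pattern suggested here, the following is a small self-contained sketch. The `Plan` enum is a hypothetical stand-in invented for this example and is not DataFusion's `LogicalPlan`.

```rust
// Hypothetical, simplified plan node used only for illustration.
#[allow(dead_code)]
enum Plan {
    Aggregate {
        input: String,
        group_expr: Vec<String>,
        aggr_expr: Vec<String>,
        schema: Vec<String>,
    },
    Scan { table: String },
}

/// Binds only the fields that are used; `..` skips `schema`
/// and any field that might be added to the variant later.
fn describe(plan: &Plan) -> String {
    match plan {
        Plan::Aggregate { input, group_expr, aggr_expr, .. } => format!(
            "aggregate over {} ({} group exprs, {} aggregates)",
            input,
            group_expr.len(),
            aggr_expr.len()
        ),
        Plan::Scan { table } => format!("scan of {}", table),
    }
}
```

Compared with writing `schema: _`, the rest pattern keeps the match arm unchanged when unrelated fields are added to the variant.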

xudong963 · 2021-11-17T03:49:37Z

datafusion/src/optimizer/single_distinct_to_groupby.rs

+fn is_single_agg(plan: &LogicalPlan) -> bool {
+    match plan {
+        LogicalPlan::Aggregate {
+            input: _,


Ditto; you can also check the other places.

ic4y · 2021-11-17

Thanks, I fixed it.

xudong963 · 2021-11-17T05:25:34Z

datafusion/src/optimizer/single_distinct_to_groupby.rs

+        } => {
+            match is_single_agg(plan) {
+                true => {
+                    let mut all_group_args: Vec<Expr> = Vec::new();


Suggested change

-                    let mut all_group_args: Vec<Expr> = Vec::new();
+                    let mut all_group_args = Vec::with_capacity(group_expr.len());

xudong963 · 2021-11-17T05:38:32Z

datafusion/src/optimizer/single_distinct_to_groupby.rs

+                    // remove distinct and collection args
+                    let mut new_aggr_expr = aggr_expr
+                        .iter()
+                        .map(|aggfunc| match aggfunc {


aggfunc is still an Expr, so a name containing expr rather than func would be better.

I have a question: are all exprs in aggr_expr guaranteed to be Expr::AggregateFunction?

ic4y · 2021-11-17

Yes, because is_single_distinct_agg() checks for that.

xudong963 · 2021-11-17T05:43:41Z

datafusion/src/optimizer/single_distinct_to_groupby.rs

+                        aggr_expr: Vec::new(),
+                        schema: grouped_schema,
+                    };
+                    let mut expres = group_expr.clone();


nit: expres ?

xudong963 · 2021-11-17T05:46:23Z

datafusion/src/optimizer/single_distinct_to_groupby.rs

+                    let expr = plan.expressions();
+                    // apply the optimization to all inputs of the plan
+                    let inputs = plan.inputs();
+
+                    let new_inputs = inputs
+                        .iter()
+                        .map(|plan| optimize(plan, execution_props))
+                        .collect::<Result<Vec<_>>>()?;
+
+                    utils::from_plan(plan, &expr, &new_inputs)


This part duplicates lines 113~119. Can we eliminate the duplicate code?

ic4y · 2021-11-17

Thanks, I fixed it.

xudong963 · 2021-11-17T05:47:49Z

BTW, some tests fail because of the distinct -> group by rewrite.

ic4y · 2021-11-17T06:17:43Z

@xudong963 Thank you for reviewing! I will prioritize fixing the failing tests.

alamb · 2021-11-17T15:27:20Z

Thanks @ic4y -- this looks really cool. I plan to review this PR carefully shortly

alamb

Thank you @ic4y ❤️ I went through this carefully, and I think it could be merged. This is a very nice first contribution. I left a few stylistic comments, and it looks like there are some test failures (some expected output needs to be updated).

I think this transformation is also valid for multiple distinct aggregates if they share the same argument. For example

SELECT F1(DISTINCT s), F2(DISTINCT s), k
...  
GROUP BY k

Rewritten to

SELECT F1(s), F2(s)
FROM (
  SELECT s, k ... GROUP BY s, k
) 
GROUP BY k
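
A minimal sketch of this shared-argument case, assuming rows are `(k, s)` tuples and taking COUNT and MAX as F1 and F2: a single inner GROUP BY (k, s) feeds both outer aggregates. The function name and types are hypothetical and chosen only for illustration; NULL handling is not modeled.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns k -> (COUNT(DISTINCT s), MAX(DISTINCT s)), computed from one
/// shared inner grouping on (k, s) followed by plain COUNT and MAX per k.
fn count_and_max_distinct(rows: &[(u32, i64)]) -> BTreeMap<u32, (usize, i64)> {
    // Inner aggregation: SELECT s, k ... GROUP BY s, k
    let inner: BTreeSet<(u32, i64)> = rows.iter().copied().collect();
    // Outer aggregation: SELECT COUNT(s), MAX(s) ... GROUP BY k
    let mut out: BTreeMap<u32, (usize, i64)> = BTreeMap::new();
    for (k, s) in inner {
        out.entry(k)
            .and_modify(|(count, max)| {
                *count += 1;
                *max = (*max).max(s);
            })
            .or_insert((1, s));
    }
    out
}
```

Because both aggregates read the same deduplicated column, one inner grouping is enough; aggregates over different arguments would each need their own deduplication, which is why the rule requires a shared argument.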

alamb · 2021-11-17T16:50:26Z

datafusion/src/optimizer/single_distinct_to_groupby.rs

+///   </pre>
+///   <p>


These <pre> and <p> tags seem out of place.

I think you could use something like

/// ```text
/// - Aggregation
///     GROUP BY (k)
///     F1(s)
///   - Aggregation
///       GROUP BY (k, s)
///     - X
/// ```

If you wanted to use monospaced fonts to illustrate the transformation

alamb · 2021-11-17T16:54:58Z

datafusion/src/optimizer/single_distinct_to_groupby.rs

+use std::sync::Arc;
+
+/// single distinct to group by optimizer rule
+///   - Aggregation


Here is another way to display this transformation

SELECT F1(DISTINCT s) ... GROUP BY k

Rewritten to

SELECT F1(s) FROM ( SELECT s, k ... GROUP BY s, k ) GROUP BY k

alamb · 2021-11-17T17:00:25Z

datafusion/src/optimizer/single_distinct_to_groupby.rs

+                let mut all_group_args = Vec::with_capacity(group_expr.len());
+                all_group_args.append(&mut group_expr.clone());


Suggested change

-                let mut all_group_args = Vec::with_capacity(group_expr.len());
-                all_group_args.append(&mut group_expr.clone());
+                let mut all_group_args = group_expr.clone();

I don't think this really matters, but I figured I would point out a slightly shorter way to do the same thing.

alamb · 2021-11-17T17:06:40Z

datafusion/src/optimizer/single_distinct_to_groupby.rs

+                    .iter()
+                    .map(|agg_expr| match agg_expr {
+                        Expr::AggregateFunction { fun, args, .. } => {
+                            all_group_args.append(&mut args.clone());


Likewise here you can do something like this (and avoid a mut):

Suggested change

-                            all_group_args.append(&mut args.clone());
+                            all_group_args.extend(args.iter().cloned());
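
To make the equivalence concrete, here is a small sketch comparing the two forms. It uses `String` in place of `Expr` so that it compiles on its own; the helper names are hypothetical.

```rust
/// Form from the original diff: clone the whole Vec into a temporary,
/// then move its elements out with `append`.
fn collect_with_append(group_expr: &[String], args: &[String]) -> Vec<String> {
    let mut all = group_expr.to_vec();
    all.append(&mut args.to_vec());
    all
}

/// Suggested form: clone each element straight into the target Vec,
/// without building a temporary Vec or taking a `&mut` to it.
fn collect_with_extend(group_expr: &[String], args: &[String]) -> Vec<String> {
    let mut all = group_expr.to_vec();
    all.extend(args.iter().cloned());
    all
}
```

Both helpers produce the same vector; the `extend` form simply skips the intermediate allocation.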

alamb · 2021-11-17T17:07:11Z

datafusion/src/optimizer/single_distinct_to_groupby.rs

+                    input: Arc::new(grouped_agg.unwrap()),
+                    aggr_expr: new_aggr_expr,
+                    schema: final_agg_schema.clone(),
+                    group_expr: group_expr.clone(),


Suggested change

-                    group_expr: group_expr.clone(),
+                    group_expr,

alamb · 2021-11-17T17:08:58Z

datafusion/src/optimizer/single_distinct_to_groupby.rs

+                    group_expr: group_expr.clone(),
+                };
+
+                let mut alias_expr: Vec<Expr> = Vec::new();


It might help here to explain the rationale for adding this alias (so the aggregates are displayed in the same way even after the rewrite). It is important but may not be obvious to other readers

alamb · 2021-11-17T17:14:12Z

datafusion/src/optimizer/single_distinct_to_groupby.rs

+    }
+
+    #[test]
+    fn distinct_and_common() -> Result<()> {


ic4y · 2021-11-18T08:14:10Z

@alamb Thank you for your help.
I made these updates and added support for multiple distinct aggregates that share the same argument.

alamb

Looks great @ic4y -- I started CI and when it passes I'll merge this one in. 🎉

alamb · 2021-11-18T22:17:35Z

datafusion/src/optimizer/single_distinct_to_groupby.rs

+                })
+                .count()
+                == aggr_expr.len()
+                && fields_set.len() == 1
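
The hunk above is only a fragment, so the following is a hedged sketch of what such an applicability check could look like, not the PR's actual code. The `Expr` enum is a simplified, hypothetical stand-in for DataFusion's type, and using the `Debug` rendering of each argument as the set key is an assumption made for this example.

```rust
use std::collections::HashSet;

// Hypothetical, simplified expression type used only for illustration.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    AggregateFunction { fun: String, args: Vec<Expr>, distinct: bool },
}

/// True when every aggregate is DISTINCT and all of them
/// aggregate over the same single argument.
fn is_single_distinct_agg(aggr_expr: &[Expr]) -> bool {
    let mut fields_set = HashSet::new();
    let distinct_count = aggr_expr
        .iter()
        .filter(|expr| match expr {
            Expr::AggregateFunction { args, distinct, .. } => {
                if *distinct {
                    // Record each argument so differing arguments can be detected.
                    for arg in args {
                        fields_set.insert(format!("{:?}", arg));
                    }
                }
                *distinct
            }
            _ => false,
        })
        .count();
    distinct_count == aggr_expr.len() && fields_set.len() == 1
}
```

Under these assumptions, COUNT(DISTINCT b) together with MAX(DISTINCT b) qualifies, while mixing a distinct and a non-distinct aggregate, or distinct aggregates over different columns, does not.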


alamb · 2021-11-18T22:18:59Z

datafusion/src/optimizer/single_distinct_to_groupby.rs

+            )?
+            .build()?;
+        // Should work
+        let expected = "Projection: #test.a AS a, #COUNT(test.b) AS COUNT(DISTINCT test.b), #MAX(test.b) AS MAX(DISTINCT test.b) [a:UInt32, COUNT(DISTINCT test.b):UInt64;N, MAX(DISTINCT test.b):UInt32;N]\


Dandandan · 2021-11-19T10:14:29Z

Thanks for this great contribution 🎉

houqp · 2021-11-22T02:55:43Z

Very cool optimization, thanks @ic4y !

add single_distinct_to_group_by optimizer rule

af56862

github-actions bot added the datafusion Changes in the datafusion crate label Nov 16, 2021

Dandandan reviewed Nov 16, 2021

Dandandan closed this Nov 16, 2021

Dandandan reopened this Nov 16, 2021

xudong963 reviewed Nov 17, 2021

xudong963 mentioned this pull request Nov 17, 2021

Optimizing queries with multiple aggregations where one Is aggregating on DISTINCT #1320

Open

xudong963 reviewed Nov 17, 2021

alamb changed the title ~~add single_distinct_to_group_by optimizer rule~~ Optimize the performance queries with a single distinct aggregate Nov 17, 2021

liuli and others added 2 commits November 17, 2021 23:42

add single_distinct_to_group_by optimizer rule

c9f0ac0

Merge branch 'master' into single_distinct

dd5c5a3

alamb approved these changes Nov 17, 2021

Fix a test method and support multiple aggregateFunction

7f4cf11

alamb approved these changes Nov 18, 2021

Dandandan approved these changes Nov 19, 2021

Dandandan merged commit a60cdb0 into apache:master Nov 19, 2021

houqp added the performance Make DataFusion faster label Nov 20, 2021

jiangzhx mentioned this pull request Mar 21, 2022

[Optimizer] Eliminate the distinct #2045

Closed

jackwener mentioned this pull request Mar 25, 2022

Eliminate max/min distinct #2084

Closed

This was referenced May 2, 2022

sum(distinct) support #2404

Closed

sum(distinct) support #2405

Merged

Support complete distinct usage for aggregate expressions #2406

Open
