[GLUTEN-4668][CH] Merge two phase hash-based aggregate into one aggregate in the spark plan when there is no shuffle #4669

zzcclp · 2024-02-07T06:55:37Z

What changes were proposed in this pull request?

Merge the two-phase hash-based aggregate into a single aggregate in the Spark plan when there is no shuffle between the two phases.
Example:

 HashAggregate(t1.i, SUM, final)
                |                  =>    HashAggregate(t1.i, SUM, complete)
 HashAggregate(t1.i, SUM, partial)

For example, with TPCH Q18 on bucket tables:

Before this PR, there are two HashAggregateTransformer operators in one whole stage.

After this PR, there is only one HashAggregateTransformer in that whole stage, which saves the time previously spent in the second HashAggregateTransformer.

Currently, this feature is only supported for the CH backend.
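
The rewrite described above can be sketched on a toy plan model. This is a simplified illustration under stated assumptions, not the actual rule: `Plan`, `Scan`, `Shuffle`, `HashAgg`, and `merge` are hypothetical stand-ins for Spark's physical plan nodes, and the check that the two aggregates belong together is reduced to comparing grouping keys and the aggregate function name.

```scala
object MergeAggSketch {
  // Hypothetical, simplified stand-ins for Spark's aggregate modes and plan nodes.
  sealed trait Mode
  case object Partial extends Mode
  case object Final extends Mode
  case object Complete extends Mode

  sealed trait Plan
  case class Scan(table: String) extends Plan
  case class Shuffle(child: Plan) extends Plan
  case class HashAgg(keys: Seq[String], func: String, mode: Mode, child: Plan) extends Plan

  // Collapse Final-over-Partial into one Complete aggregate, but only when the
  // partial aggregate is the direct child, i.e. there is no shuffle in between.
  def merge(plan: Plan): Plan = plan match {
    case HashAgg(keys, func, Final, HashAgg(pKeys, pFunc, Partial, grandChild))
        if keys == pKeys && func == pFunc =>
      HashAgg(keys, func, Complete, merge(grandChild))
    case agg: HashAgg => agg.copy(child = merge(agg.child))
    case Shuffle(child) => Shuffle(merge(child))
    case scan: Scan => scan
  }
}
```

In this sketch a plan with a `Shuffle` between the two aggregates is returned unchanged, which mirrors the "no shuffle" precondition in the description.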

Close #4668.

(Fixes: #4668)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

github-actions · 2024-02-07T06:55:57Z

#4668

github-actions · 2024-02-07T06:56:13Z

Run Gluten Clickhouse CI

zzcclp · 2024-02-07T06:56:32Z

@liujiayi771 @lgbo-ustc please help to review, thanks.

github-actions · 2024-02-07T12:21:19Z

Run Gluten Clickhouse CI

liujiayi771 · 2024-02-07T12:28:59Z

gluten-core/src/main/scala/io/glutenproject/extension/MergeTwoPhasesHashAggregate.scala

+    }
+  }
+
+  override def apply(plan: SparkPlan): SparkPlan = {


It is better to use PhysicalPlanSelector.maybe to check whether Gluten is enabled.

Perhaps you should check whether the SparkPlan has been tagged as TRANSFORM_UNSUPPORTED. There are some rules earlier that will tag the SparkPlan, for example, FallbackOnANSIMode will tag all SparkPlans as TRANSFORM_UNSUPPORTED when ANSI mode is enabled. In this case, we can also avoid merging the aggregations.
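
A minimal sketch of the guard suggested here, assuming a toy model rather than Gluten's real tagging API: `AggNode`, its `transformUnsupported` flag, and `canMerge` are hypothetical names introduced only for illustration.

```scala
object FallbackTagSketch {
  // Hypothetical stand-in for a plan node that an earlier rule may have tagged
  // as not offloadable; the real Gluten tag mechanism is not reproduced here.
  case class AggNode(name: String, transformUnsupported: Boolean)

  // Merge only when Gluten is enabled and neither aggregate was tagged by an
  // earlier fallback rule (for example, one that reacts to ANSI mode).
  def canMerge(glutenEnabled: Boolean, finalAgg: AggNode, partialAgg: AggNode): Boolean =
    glutenEnabled && !finalAgg.transformUnsupported && !partialAgg.transformUnsupported
}
```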

github-actions · 2024-02-07T12:29:35Z

Run Gluten Clickhouse CI

liujiayi771 · 2024-02-07T12:37:31Z

gluten-core/src/main/scala/io/glutenproject/extension/MergeTwoPhasesHashAggregate.scala

+        if (isPartialAgg(child, hashAgg)) {
+          // convert to complete mode aggregate expressions
+          val completeAggregateExpressions = aggregateExpressions.map(_.copy(mode = Complete))
+          HashAggregateExec(


Can we use the following instead?

hashAgg.copy(aggregateExpressions = completeAggregateExpressions, child = child.child)
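
To illustrate why `copy` is attractive here, the sketch below compares rebuilding a node field by field with changing only the fields that differ. `Agg` is a hypothetical, trimmed-down case class; the real `HashAggregateExec` has more fields, which is exactly where `copy` saves repetition.

```scala
object CopySketch {
  // Hypothetical, trimmed-down aggregate node used only for illustration.
  case class Agg(
      groupingKeys: Seq[String],
      aggregateExpressions: Seq[String],
      resultExpressions: Seq[String],
      child: Option[Agg])

  // Rebuild the node field by field, as an explicit constructor call would.
  def viaConstructor(agg: Agg, exprs: Seq[String], newChild: Option[Agg]): Agg =
    Agg(agg.groupingKeys, exprs, agg.resultExpressions, newChild)

  // Change only the two fields that differ; copy carries the rest over.
  def viaCopy(agg: Agg, exprs: Seq[String], newChild: Option[Agg]): Agg =
    agg.copy(aggregateExpressions = exprs, child = newChild)
}
```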

github-actions · 2024-02-07T13:03:17Z

Run Gluten Clickhouse CI

github-actions · 2024-02-08T06:15:54Z

Run Gluten Clickhouse CI

github-actions · 2024-02-08T06:23:18Z

Run Gluten Clickhouse CI

github-actions · 2024-02-08T06:38:32Z

Run Gluten Clickhouse CI

github-actions · 2024-02-08T13:08:53Z

Run Gluten Clickhouse CI

zzcclp · 2024-02-18T02:36:59Z

@liujiayi771 @lgbo-ustc @ulysses-you @PHILO-HE @rui-mo please help to review again, thanks.

github-actions · 2024-02-18T06:10:20Z

Run Gluten Clickhouse CI

ulysses-you · 2024-02-19T02:34:00Z

gluten-core/src/main/scala/io/glutenproject/extension/MergeTwoPhasesHashAggregate.scala

+              resultExpressions,
+              child: HashAggregateExec)
+            if !isStreaming && isTransformable(hashAgg) && isTransformable(child) =>
+          if (isPartialAgg(child, hashAgg)) {


Why not put this if into the previous line?

ulysses-you · 2024-02-19T02:35:46Z

gluten-core/src/main/scala/io/glutenproject/extension/MergeTwoPhasesHashAggregate.scala

+          } else {
+            objectHashAgg
+          }
+        case plan: SparkPlan => plan


Can we also handle SortAggregate? It is possible that there is neither a shuffle nor a sort between two SortAggregate nodes.

Added SortAggregate.

ulysses-you · 2024-02-19T02:40:47Z

...ds-clickhouse/src/main/scala/io/glutenproject/execution/CHHashAggregateExecTransformer.scala

          // to handle outputs according to the AggregateMode
          for (attr <- child.output) {
            typeList.add(ConverterUtils.getTypeNode(attr.dataType, attr.nullable))
            nameList.add(ConverterUtils.genColumnNameWithExprId(attr))
            nameList.addAll(ConverterUtils.collectStructFieldNames(attr.dataType))
          }
          (child.output, output)
-        } else if (!modes.contains(Partial)) {
+        } else if (modes.forall(_ == Partial)) {


I'm not sure how the CH backend transforms aggregates, but the code seems to behave differently from before: Partial can appear together with PartialMerge when there is a distinct aggregate.
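
The two conditions in the diff can be compared directly on a few mode lists. This is only an illustration of the predicates themselves, using a hypothetical `Mode` type in place of Spark's `AggregateMode`; it does not say which branch the CH backend should take in each case.

```scala
object ModePredicateSketch {
  // Hypothetical stand-ins for Spark's aggregate modes.
  sealed trait Mode
  case object Partial extends Mode
  case object PartialMerge extends Mode
  case object Final extends Mode

  // Condition before the change: no expression is in Partial mode.
  def oldCondition(modes: Seq[Mode]): Boolean = !modes.contains(Partial)

  // Condition after the change: every expression is in Partial mode.
  def newCondition(modes: Seq[Mode]): Boolean = modes.forall(_ == Partial)
}
```

The predicates are not negations of each other: for the mixed list `Seq(Partial, PartialMerge)` raised in this comment, both evaluate to false, so that case falls through to whatever branch follows.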

rui-mo

Could you add some description of the motivation for this PR? Is partial + final without a shuffle in between a common case? Thanks.

github-actions · 2024-02-20T03:50:11Z

Run Gluten Clickhouse CI

zzcclp · 2024-02-20T03:55:04Z

Could you add some description of the motivation for this PR? Is partial + final without a shuffle in between a common case? Thanks.

Updated, please review again, thanks. @ulysses-you @rui-mo

ulysses-you · 2024-02-20T05:43:18Z

gluten-core/src/main/scala/io/glutenproject/extension/MergeTwoPhasesHashAggregate.scala

+          agg.resultExpressions,
+          agg.child
+        )
+      transformer.doValidate().isValid


Here we have not yet pulled out the pre/post projections, so this validation seems very likely to fail. Do we need this check? I think it should be fine even if we fall back to vanilla Spark after merging the aggregates, since there is already a rule, ReplaceHashWithSortAgg, that does a similar thing.

I am not sure whether merging sort aggregates also works for vanilla Spark. Do you have any idea about this? The hash-based aggregates may still work even after a fallback.

github-actions · 2024-02-20T07:01:34Z

Run Gluten Clickhouse CI

ulysses-you · 2024-02-20T07:41:32Z

LGTM if the tests pass. cc @rui-mo in case you have other comments.

…gate in the spark plan when there is no shuffle

Examples:

 HashAggregate(t1.i, SUM, final)
                |                  =>    HashAggregate(t1.i, SUM, complete)
 HashAggregate(t1.i, SUM, partial)

now this feature only support for CH backend.

Close apache#4668.

Co-authored-by: lgbo

github-actions · 2024-02-20T13:51:23Z

Run Gluten Clickhouse CI

lgbo-ustc · 2024-02-21T01:05:01Z

LGTM

GlutenPerfBot · 2024-02-21T01:58:14Z

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

query	log/native_4669_time.csv	log/native_master_02_20_2024_c3614f866_time.csv	difference	percentage
q1	30.14	34.24	4.108	113.63%
q2	25.53	24.42	-1.113	95.64%
q3	38.43	38.78	0.350	100.91%
q4	39.99	38.03	-1.960	95.10%
q5	83.15	70.83	-12.318	85.19%
q6	36.19	7.23	-28.954	19.99%
q7	118.82	82.68	-36.140	69.58%
q8	100.31	85.38	-14.929	85.12%
q9	132.51	126.15	-6.357	95.20%
q10	50.64	43.25	-7.387	85.41%
q11	23.49	20.79	-2.704	88.49%
q12	23.02	26.48	3.458	115.02%
q13	47.66	44.88	-2.782	94.16%
q14	13.86	18.88	5.022	136.24%
q15	33.06	29.18	-3.884	88.25%
q16	15.84	15.42	-0.425	97.32%
q17	106.42	102.55	-3.871	96.36%
q18	154.83	149.13	-5.702	96.32%
q19	20.37	12.57	-7.793	61.74%
q20	34.01	26.34	-7.670	77.45%
q21	226.46	225.83	-0.630	99.72%
q22	13.84	13.68	-0.158	98.86%
total	1368.56	1236.72	-131.840	90.37%

zzcclp requested review from zhztheplayer, ulysses-you and PHILO-HE February 7, 2024 06:55

liujiayi771 reviewed Feb 7, 2024

zzcclp force-pushed the merge_hashagg_if_no_shuffle branch from 7aa0f55 to 35c2555 Compare February 8, 2024 06:15

zzcclp force-pushed the merge_hashagg_if_no_shuffle branch from 46d6a6d to 17c9586 Compare February 8, 2024 13:08

zzcclp force-pushed the merge_hashagg_if_no_shuffle branch from 17c9586 to f5e1689 Compare February 18, 2024 06:09

ulysses-you reviewed Feb 19, 2024

zzcclp requested a review from rui-mo February 19, 2024 03:18

rui-mo reviewed Feb 19, 2024

zzcclp force-pushed the merge_hashagg_if_no_shuffle branch from f5e1689 to 51e5270 Compare February 20, 2024 03:49

ulysses-you reviewed Feb 20, 2024

zzcclp force-pushed the merge_hashagg_if_no_shuffle branch from 51e5270 to 580338f Compare February 20, 2024 07:01

rui-mo previously approved these changes Feb 20, 2024

zzcclp dismissed rui-mo’s stale review via 66bf015 February 20, 2024 13:50

zzcclp force-pushed the merge_hashagg_if_no_shuffle branch from 580338f to 66bf015 Compare February 20, 2024 13:50

ulysses-you approved these changes Feb 21, 2024

ulysses-you merged commit 716b412 into apache:main Feb 21, 2024
19 checks passed

lgbo-ustc mentioned this pull request Aug 16, 2024

[GLUTEN-6878][CH] Avoid name collisions in naming aggregate result #6886

Merged
