[GLUTEN-4668][CH] Merge two phase hash-based aggregate into one aggregate in the spark plan when there is no shuffle #4669

Merged
merged 1 commit into from
Feb 21, 2024

@zzcclp zzcclp commented Feb 7, 2024

What changes were proposed in this pull request?

Merge a two-phase hash-based aggregate into a single aggregate in the Spark plan when there is no shuffle between the two phases.
Example:

 HashAggregate(t1.i, SUM, final)
                |                  =>    HashAggregate(t1.i, SUM, complete)
 HashAggregate(t1.i, SUM, partial)

For example, with TPCH Q18 on bucketed tables, before this PR:
[image: tmp2]
there are two HashAggregateTransformer nodes in one whole stage.

After this PR:
[image: tmp1]
there is only one HashAggregateTransformer in the whole stage, which saves the time previously spent in the second HashAggregateTransformer.

Currently this feature is supported only on the CH backend.
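
The rewrite described above can be sketched on a toy plan model. This is a simplified illustration, not the rule from this PR: the Plan, Scan, Shuffle and HashAgg types and the MergeAggSketch object are hypothetical stand-ins and are not Spark or Gluten classes.

```scala
// Toy model of a physical plan. These are NOT Spark's real classes.
object MergeAggSketch {
  sealed trait Mode
  case object Partial extends Mode
  case object Final extends Mode
  case object Complete extends Mode

  sealed trait Plan
  case class Scan(table: String) extends Plan
  case class Shuffle(child: Plan) extends Plan
  case class HashAgg(keys: Seq[String], fn: String, mode: Mode, child: Plan) extends Plan

  // Collapse a Final aggregate sitting directly on top of a matching Partial
  // aggregate (no Shuffle in between) into one Complete aggregate.
  def merge(plan: Plan): Plan = plan match {
    case HashAgg(keys, fn, Final, HashAgg(childKeys, childFn, Partial, grandChild))
        if keys == childKeys && fn == childFn =>
      HashAgg(keys, fn, Complete, merge(grandChild))
    // Any other aggregate is kept and only its subtree is rewritten.
    case HashAgg(keys, fn, mode, child) => HashAgg(keys, fn, mode, merge(child))
    case Shuffle(child) => Shuffle(merge(child))
    case scan: Scan => scan
  }
}
```

In this sketch a plan with a Shuffle between the two aggregates is returned unchanged, because the Final node's direct child is then the Shuffle and the first case does not match.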

Close #4668.

(Fixes: #4668)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

github-actions bot commented Feb 7, 2024

#4668

github-actions bot commented Feb 7, 2024

Run Gluten Clickhouse CI

zzcclp commented Feb 7, 2024

@liujiayi771 @lgbo-ustc please help to review, thanks.

}
}

override def apply(plan: SparkPlan): SparkPlan = {
Contributor

It is better to use PhysicalPlanSelector.maybe to check whether Gluten is enabled.

Contributor

Perhaps you should check whether the SparkPlan has been tagged as TRANSFORM_UNSUPPORTED. There are some rules earlier that will tag the SparkPlan, for example, FallbackOnANSIMode will tag all SparkPlans as TRANSFORM_UNSUPPORTED when ANSI mode is enabled. In this case, we can also avoid merging the aggregations.
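
A minimal sketch of such a guard, assuming a toy Node type that carries tags as plain strings. Gluten's real tagging mechanism is not modeled here; Node, shouldMerge and TagGuardSketch are hypothetical names used only for illustration.

```scala
object TagGuardSketch {
  // Hypothetical stand-in for a tagged plan node; not Gluten's tagging API.
  case class Node(name: String, tags: Set[String] = Set.empty)

  val TransformUnsupported = "TRANSFORM_UNSUPPORTED"

  // A node is transformable when no earlier rule marked it as unsupported.
  def isTransformable(node: Node): Boolean =
    !node.tags.contains(TransformUnsupported)

  // Merge only when both aggregates are still eligible for offloading.
  def shouldMerge(finalAgg: Node, partialAgg: Node): Boolean =
    isTransformable(finalAgg) && isTransformable(partialAgg)
}
```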

Contributor Author

done

if (isPartialAgg(child, hashAgg)) {
  // convert to complete mode aggregate expressions
  val completeAggregateExpressions = aggregateExpressions.map(_.copy(mode = Complete))
  HashAggregateExec(
Contributor

Can we use

hashAgg.copy(
  aggregateExpressions = completeAggregateExpressions,
  child = child.child
)
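
The suggestion relies on the copy method that Scala generates for case classes, which changes only the named fields and carries the rest over. A small sketch on a hypothetical Agg case class (its field names are illustrative and are not those of HashAggregateExec):

```scala
object CopySketch {
  // Hypothetical stand-in for an aggregate node; field names are illustrative only.
  case class Agg(
      groupingKeys: Seq[String],
      modes: Seq[String],
      resultColumns: Seq[String],
      child: String)

  // Rebuilding by hand repeats every field, including the unchanged ones.
  def rebuiltByHand(agg: Agg, newModes: Seq[String], newChild: String): Agg =
    Agg(agg.groupingKeys, newModes, agg.resultColumns, newChild)

  // copy names only the fields that change; all others are carried over.
  def rebuiltWithCopy(agg: Agg, newModes: Seq[String], newChild: String): Agg =
    agg.copy(modes = newModes, child = newChild)
}
```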

Contributor Author

done

@zzcclp zzcclp force-pushed the merge_hashagg_if_no_shuffle branch from 7aa0f55 to 35c2555 Compare February 8, 2024 06:15

@zzcclp zzcclp force-pushed the merge_hashagg_if_no_shuffle branch from 46d6a6d to 17c9586 Compare February 8, 2024 13:08

zzcclp commented Feb 18, 2024

@liujiayi771 @lgbo-ustc @ulysses-you @PHILO-HE @rui-mo please help to review again, thanks.

resultExpressions,
child: HashAggregateExec)
if !isStreaming && isTransformable(hashAgg) && isTransformable(child) =>
if (isPartialAgg(child, hashAgg)) {
Contributor

Why not put this if condition into the previous line?

Contributor Author

done

} else {
objectHashAgg
}
case plan: SparkPlan => plan
Contributor

Can we also handle SortAggregate? It is possible that there is no shuffle or sort between two SortAggregate nodes.

Contributor Author

added SortAggregate

// to handle outputs according to the AggregateMode
for (attr <- child.output) {
  typeList.add(ConverterUtils.getTypeNode(attr.dataType, attr.nullable))
  nameList.add(ConverterUtils.genColumnNameWithExprId(attr))
  nameList.addAll(ConverterUtils.collectStructFieldNames(attr.dataType))
}
(child.output, output)
} else if (!modes.contains(Partial)) {
} else if (modes.forall(_ == Partial)) {
Contributor

I'm not sure how the CH backend transforms aggregates, but the code seems different from before: Partial can appear together with PartialMerge when there is a distinct aggregate.
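
The two conditions on modes shown in the snippet are not logical opposites, which matters for mixed mode lists. A minimal sketch that evaluates both, using a toy Mode type as a stand-in for Spark's AggregateMode (noPartial, allPartial and ModeCheckSketch are hypothetical names):

```scala
object ModeCheckSketch {
  // Toy stand-in for Spark's AggregateMode; not the real type.
  sealed trait Mode
  case object Partial extends Mode
  case object PartialMerge extends Mode
  case object Final extends Mode

  // Mirrors the condition `!modes.contains(Partial)`.
  def noPartial(modes: Seq[Mode]): Boolean = !modes.contains(Partial)

  // Mirrors the condition `modes.forall(_ == Partial)`.
  def allPartial(modes: Seq[Mode]): Boolean = modes.forall(_ == Partial)
}
```

In this sketch a list mixing Partial and PartialMerge satisfies neither condition, while an all-Partial list satisfies only the second.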

Contributor Author

reverted

@zzcclp zzcclp requested a review from rui-mo February 19, 2024 03:18
Contributor

@rui-mo rui-mo left a comment

Could you add some description of the motivation for this PR? Is partial + final without a shuffle in between a common case? Thanks.

zzcclp commented Feb 20, 2024

Could you add some description of the motivation for this PR? Is partial + final without a shuffle in between a common case? Thanks.

Updated, please review again, thanks. @ulysses-you @rui-mo

agg.resultExpressions,
agg.child
)
transformer.doValidate().isValid
Contributor

Here we have not yet pulled out the pre/post projections, so this validation seems very likely to fail. Do we need this check? I think it should be fine even if we fall back to vanilla Spark after merging the aggregates, since there is already a rule, ReplaceHashWithSortAgg, that does a similar thing.

Contributor Author

I am not sure that merging sort aggregates also works in vanilla Spark. Do you have any idea about this? The hash-based aggregates may work even after a fallback.

Contributor Author

removed

@ulysses-you

lgtm if test pass, cc @rui-mo if you have other comments

rui-mo
rui-mo previously approved these changes Feb 20, 2024
…gate in the spark plan when there is no shuffle

Examples:

 HashAggregate(t1.i, SUM, final)
                |                  =>    HashAggregate(t1.i, SUM, complete)
 HashAggregate(t1.i, SUM, partial)

now this feature only support for CH backend.

Close apache#4668.

Co-authored-by: lgbo <[email protected]>

@lgbo-ustc

LGTM

@ulysses-you ulysses-you merged commit 716b412 into apache:main Feb 21, 2024
19 checks passed
@GlutenPerfBot

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

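| query | log/native_4669_time.csv | log/native_master_02_20_2024_c3614f866_time.csv | difference | percentage |
|-------|--------------------------|--------------------------------------------------|------------|------------|
| q1 | 30.14 | 34.24 | 4.108 | 113.63% |
| q2 | 25.53 | 24.42 | -1.113 | 95.64% |
| q3 | 38.43 | 38.78 | 0.350 | 100.91% |
| q4 | 39.99 | 38.03 | -1.960 | 95.10% |
| q5 | 83.15 | 70.83 | -12.318 | 85.19% |
| q6 | 36.19 | 7.23 | -28.954 | 19.99% |
| q7 | 118.82 | 82.68 | -36.140 | 69.58% |
| q8 | 100.31 | 85.38 | -14.929 | 85.12% |
| q9 | 132.51 | 126.15 | -6.357 | 95.20% |
| q10 | 50.64 | 43.25 | -7.387 | 85.41% |
| q11 | 23.49 | 20.79 | -2.704 | 88.49% |
| q12 | 23.02 | 26.48 | 3.458 | 115.02% |
| q13 | 47.66 | 44.88 | -2.782 | 94.16% |
| q14 | 13.86 | 18.88 | 5.022 | 136.24% |
| q15 | 33.06 | 29.18 | -3.884 | 88.25% |
| q16 | 15.84 | 15.42 | -0.425 | 97.32% |
| q17 | 106.42 | 102.55 | -3.871 | 96.36% |
| q18 | 154.83 | 149.13 | -5.702 | 96.32% |
| q19 | 20.37 | 12.57 | -7.793 | 61.74% |
| q20 | 34.01 | 26.34 | -7.670 | 77.45% |
| q21 | 226.46 | 225.83 | -0.630 | 99.72% |
| q22 | 13.84 | 13.68 | -0.158 | 98.86% |
| total | 1368.56 | 1236.72 | -131.840 | 90.37% |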
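
The report does not state how the last two columns are derived. Judging from the rows, difference appears to be the master time minus the PR time, and percentage the master time divided by the PR time. A small sketch under that assumption (PerfReportSketch and its function names are hypothetical):

```scala
object PerfReportSketch {
  // Assumed from the report's rows: difference = master time - PR time.
  def difference(prTime: Double, masterTime: Double): Double =
    masterTime - prTime

  // Assumed from the report's rows: percentage = master time / PR time * 100.
  def percentage(prTime: Double, masterTime: Double): Double =
    masterTime / prTime * 100.0
}
```

Applied to the total row (1368.56 for the PR run, 1236.72 for master), these formulas reproduce the reported -131.840 and 90.37% after rounding. Under this reading, a percentage below 100% means the master run was faster than the PR run.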
