[GLUTEN-4421][VL] Disable flushable aggregate when input is already partitioned by grouping keys #4443
Conversation
- name: TPC-H SF1.0 && TPC-DS SF1.0 Parquet local spark3.3 Q38 flush
  run: |
    $PATH_TO_GLUTEN_TE/$OS_IMAGE_NAME/gha/gha-checkout/exec.sh 'cd /opt/gluten/tools/gluten-it \
    && GLUTEN_IT_JVM_ARGS=-Xmx5G sbin/gluten-it.sh queries-compare \
      --local --preset=velox --benchmark-type=ds --error-on-memleak --off-heap-size=10g -s=1.0 --threads=16 --iterations=1 --queries=q38 \
      --disable-bhj \
      --extra-conf=spark.gluten.sql.columnar.backend.velox.maxPartialAggregationMemoryRatio=0.1 \
      --extra-conf=spark.gluten.sql.columnar.backend.velox.maxExtendedPartialAggregationMemoryRatio=0.2 \
      --extra-conf=spark.gluten.sql.columnar.backend.velox.abandonPartialAggregationMinPct=100 \
      --extra-conf=spark.gluten.sql.columnar.backend.velox.abandonPartialAggregationMinRows=0'
Later we may develop a new way to arrange these gluten-it CI jobs. The yaml file size is exploding.
I'm trying to understand this issue, so please correct me if I'm wrong. What we want to fix in this PR is: do not convert the regular aggregate (2) to a flushable aggregate, because the flushable aggregate could emit the same group more than once, which makes the partial count accumulator (3) larger than expected.
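For illustration, here is a minimal plain-Scala sketch (no Spark or Gluten involved; the data and the flush point are made up) of how a mid-stream flush can emit the same group twice and inflate the downstream partial count:

object FlushOverCountSketch {
  def main(args: Array[String]): Unit = {
    // Distinct values to be counted; think of this as the input to agg (2).
    val input = Seq(1, 2, 2, 3)

    // Non-flushable distinct aggregate: every key is emitted exactly once,
    // so the partial count (3) above it equals the true distinct count.
    val emittedOnce = input.distinct
    println(s"count without flush = ${emittedOnce.size}") // 3 (correct)

    // Flushable aggregate: a flush in the middle of the stream clears its state,
    // so a key seen both before and after the flush is emitted twice.
    val flushPoint = 2
    val emittedWithFlush =
      input.take(flushPoint).distinct ++ input.drop(flushPoint).distinct
    println(s"count with flush = ${emittedWithFlush.size}") // 4 (over-counted)
  }
}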
It's more or less similar to the issue this PR is trying to solve, except that Q38 generates a plan like the following:
In Spark 3.2 the agg (1) can be flushable, since there was another distinct aggregation generated on the reducer side:
That's why the issue only shows up starting from Spark 3.3. There might be some new optimizations in vanilla Spark.
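For reference, a Q38-like shape (a global count over an intersection of distinct rows; not the exact TPC-DS text, and the table names and data are made up) can be reproduced in a local spark-shell to inspect how the distinct aggregation gets planned:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("q38-shape").getOrCreate()
import spark.implicits._

Seq((1, "a"), (2, "b"), (2, "b")).toDF("k", "v").createOrReplaceTempView("t1")
Seq((1, "a"), (3, "c")).toDF("k", "v").createOrReplaceTempView("t2")

// COUNT(*) over an INTERSECT of DISTINCT rows, similar in shape to TPC-DS Q38.
spark.sql(
  """SELECT COUNT(*) FROM (
    |  SELECT DISTINCT k, v FROM t1
    |  INTERSECT
    |  SELECT DISTINCT k, v FROM t2
    |) hot
    |""".stripMargin).explain()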
BTW, I forgot to mention that the current code should already be able to handle this case even without this patch (we convert an agg to a flushable agg only when it is the one closest to the shuffle). But thanks for providing the example, which is valuable anyway.
The optimization since Spark 3.3 comes from apache/spark#35779. So should we avoid converting a regular agg to a flushable agg if it is a group-by-only aggregate and its adjacent parent is a partial aggregate?
Thanks for the information. I think we could indeed forbid flushing in the case that apache/spark#35779 optimizes, although I am wondering whether the patch could provide a more general fix. When an aggregate can be considered to emit distinct data and thus propagate "distinct attributes", that distinct aggregation must be a "final" distinct aggregation, which means it has to process data that is already partitioned by the distinct keys. Based on this assumption, the patch should be a correct fix (correct me if I am wrong, of course). Additionally, the fix can be considered "general" since it is not limited to distinct aggregation. For example, a partial sum agg could produce meaningful data for a specific grouping set when it handles input that is already partitioned by the grouping keys. This may not be a great example since I doubt the Catalyst planner ever creates such a plan, but the principle here is to be more careful about using flushable aggregation, since vanilla Spark doesn't have this kind of optimization as of now.
This PR is a kind of conservative fix for the issue: it fixes the correctness problem but may miss optimizing some other cases. For example, if the agg is on top of a shuffled join with the same keys, the partial agg would not be converted to flushable.
I'm fine with fixing it first since it's a data correctness issue, and doing further optimization in a follow-up PR.
I understand your point. And I think that kind of plan cannot be optimized with flushable agg even without this patch. So let's keep enhancing the rule to cover more cases like that in future development iterations.
Thank you for the fix.
Thanks for reviewing!
If the child output is already partitioned by the aggregation keys (i.e., this function returns true), we should avoid the optimization that converts to flushable aggregation.
For example, if the input is hash-partitioned by keys (a, b) and the aggregate node requests "group by a, b, c", then the aggregate should NOT flush, since each grouping combination (a, b, c) will occur in only a single partition across the whole cluster. Spark's planner may rely on this information to perform optimizations such as doing "partial_count(a, b, c)" directly on the output data.
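As an illustration only (this is not the PR's actual code, and the helper name is made up), the described check can be sketched on top of Spark's physical partitioning API: a child hash-partitioned by a subset of the grouping keys satisfies a clustered distribution over all of them, so each grouping combination is confined to one partition.

import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.catalyst.plans.physical.{ClusteredDistribution, HashPartitioning, Partitioning}
import org.apache.spark.sql.types.IntegerType

// Hypothetical helper: true when the child's output partitioning already clusters
// rows by the grouping keys, i.e. each grouping combination lives in a single
// partition, so converting the aggregate to a flushable one should be avoided.
def outputAlreadyPartitionedByGroupingKeys(
    childPartitioning: Partitioning,
    groupingKeys: Seq[AttributeReference]): Boolean =
  groupingKeys.nonEmpty && childPartitioning.satisfies(ClusteredDistribution(groupingKeys))

val a = AttributeReference("a", IntegerType)()
val b = AttributeReference("b", IntegerType)()
val c = AttributeReference("c", IntegerType)()

// Child is hash-partitioned by (a, b); the aggregate groups by (a, b, c):
// rows sharing the same (a, b, c) are co-located, so do not flush.
println(outputAlreadyPartitionedByGroupingKeys(HashPartitioning(Seq(a, b), 200), Seq(a, b, c))) // true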
This fixes #4421