Improve avg/sum Aggregator performance for Decimal #5866
Conversation
Great work, I can take a look tomorrow and rerun the benchmarks on my machine.
@mingmwang query 18 is also affected by this (although the difference should be smaller than for query 17).
I tried testing the changes in this PR and ran into some errors when running query 1 using the code in https://github.com/sql-benchmarks/sqlbench-runners/tree/main/datafusion. I don't see these errors when running against the latest in the main branch.
Here are the results on my desktop (24 cores) for queries 17 and 18 (sf=10), comparing this PR against the main branch, and also when running in memory (SF=1), main vs. this PR.
I ran the test with sf=1 and partition=1. I will do more tests with sf=10 tomorrow.
@Dandandan, how did you generate the flame graph? On my side, there are many unknowns. Could you show me how to avoid them and share your commands for the flame graph?
I cannot reproduce the issue using DataFusion's own benchmark data (sf=10), but I am able to reproduce it using Spark-generated benchmark data. I guess Spark's TPC-H data schema is different from DataFusion's.
Hi @Dandandan, with the latest code the cast is almost entirely avoided: its share of the profile is reduced from 16.86% to 0.46% (see the attached flame graphs). With the single-partition benchmark, the latency for q17 is reduced from 11s to 9s.
I added an overflow check when converting the internal sum type to the result type; there was a bug, and it is fixed now.
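Converting from the wider internal sum type back to the narrower result type can yield values that no longer fit the result precision, which is the kind of bug such a check catches. A minimal sketch of the idea, assuming i128-backed Decimal128 values (the function name and signature are illustrative, not the PR's actual code):

```rust
/// Sketch only (not the PR's actual API): validate that a raw i128 decimal
/// value fits a target precision. For Decimal128 the precision is at most
/// 38, so 10^precision stays within i128 range.
fn validate_decimal_precision(value: i128, precision: u8) -> Result<i128, String> {
    // Largest magnitude representable with `precision` digits.
    let max = 10_i128.pow(precision as u32) - 1;
    if value > max || value < -max {
        Err(format!(
            "value {value} cannot be stored as DECIMAL({precision}, _) without overflow"
        ))
    } else {
        Ok(value)
    }
}
```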
Could you please rerun the benchmark with partition=1 or partition=2?
Maybe decimals vs floats? Official TPC-H uses decimals.
Both are decimals; just the precision is a little different. Anyway, it was my bug.
Hi @mingmwang, for generating the flame graphs I currently use the following command:
@@ -31,3 +33,49 @@ pub fn get_accum_scalar_values_as_arrays(
        .map(|s| s.to_array_of_size(1))
        .collect::<Vec<_>>())
}

pub fn calculate_result_decimal_for_avg(
I think it is a little different from an explicit arithmetic expression in the SQL. The reason is that this final result decimal conversion is implicit: the logical plans/optimizers are not aware of it. This is similar to the implicit cast that I pulled up in this PR; that cast is also implicit, and the logical plans/optimizers are not aware of it either. Those implicit casts and conversions are only related to the internal state of specific Aggregation Accumulators; they are not part of the SQL plan/expression tree.
But in general, we could have more general decimal precision-change and overflow-check logic, and the aggregators could reuse it.
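To illustrate the kind of reusable precision-change and overflow-check logic being suggested, here is a minimal sketch. The function name `calculate_result_decimal_for_avg` comes from the diff above, but the signature and body are assumptions, not the PR's actual implementation:

```rust
/// Sketch with an assumed signature: compute avg's final Decimal128 value
/// from the accumulator's internal state. `sum` is the raw i128 at
/// `sum_scale`; the result is produced at (`target_precision`,
/// `target_scale`). Assumes `count > 0` and `target_scale >= sum_scale`
/// (true under Spark's rule, where the result scale is `scale + 4`).
fn calculate_result_decimal_for_avg(
    sum: i128,
    count: i128,
    sum_scale: i8,
    target_precision: u8,
    target_scale: i8,
) -> Result<i128, String> {
    // Rescale the sum to the target scale before dividing, so integer
    // division by `count` already yields the right number of fractional digits.
    let scale_diff = (target_scale - sum_scale) as u32;
    let scaled_sum = sum
        .checked_mul(10_i128.pow(scale_diff))
        .ok_or_else(|| "decimal overflow while rescaling sum".to_string())?;
    let avg = scaled_sum / count;
    // Same precision bound as in the earlier sketch: the converted value
    // must still fit the target precision.
    let max = 10_i128.pow(target_precision as u32) - 1;
    if avg > max || avg < -max {
        return Err("decimal overflow in avg result".to_string());
    }
    Ok(avg)
}
```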
@Dandandan @andygrove @yahoNanJing
LGTM.
Hi @Dandandan, @andygrove, @alamb, do you still have any concerns about this PR? On my side, it achieves around a 20% performance improvement for q17 with a single partition. It would be better to merge this PR first and then continue refining the other bottlenecks :)
* improve avg/sum Aggregator performance
* check type before cast
* fix Arithmetic Overflow bug
* fix clippy
Which issue does this PR close?
Closes #5858.
Closes #5859.
Rationale for this change
Improve avg(decimal) and sum(decimal) performance by avoiding the type cast in the inner loop, improving TPC-H q17 performance.
What changes are included in this PR?

* Pull up the implicit cast from `sum_batch()` in the `AvgAccumulator` and `SumAccumulator` to `aggregate_expressions()`, so the cast is done in the outer loop (see the sketch after this list).
* In `AvgAccumulator`, distinguish the sum data type from the return data type; for Decimal, they are different. In Spark SQL, the sum type of avg is `DECIMAL(min(38, precision+10), scale)` and the return type is `DECIMAL(min(38, precision+4), min(38, scale+4))`; for example, an input of `DECIMAL(15, 2)` is summed as `DECIMAL(25, 2)` and returned as `DECIMAL(19, 6)`.
* Add an overflow check for Decimal when converting from the internal sum type to the return type.
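To illustrate the shape of the cast pull-up from the first bullet, here is a schematic sketch using the arrow `cast` kernel; `prepare_input_for_accumulator` is an illustrative name, not the PR's actual function:

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

/// Schematic: cast the whole input column to the accumulator's internal
/// sum type once per batch, instead of converting scalar values row by
/// row inside `sum_batch()`.
fn prepare_input_for_accumulator(
    values: &ArrayRef,
    sum_type: &DataType,
) -> Result<ArrayRef, ArrowError> {
    if values.data_type() == sum_type {
        // Types already match: skip the cast entirely.
        Ok(Arc::clone(values))
    } else {
        // One vectorized cast per batch amortizes the conversion cost.
        cast(values.as_ref(), sum_type)
    }
}
```

Casting once per batch replaces a per-row scalar conversion with a single vectorized kernel call, which is where the inner-loop savings come from.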
Are these changes tested?

I tested this on my local Mac; for TPC-H q17 there is at least a 10% improvement.
Before this PR:
Query 17 iteration 0 took 3249.9 ms and returned 1 rows
Query 17 iteration 1 took 3430.5 ms and returned 1 rows
Query 17 iteration 2 took 3413.1 ms and returned 1 rows
Query 17 avg time: 3364.49 ms
After this PR:
Query 17 iteration 0 took 3019.4 ms and returned 1 rows
Query 17 iteration 1 took 2979.6 ms and returned 1 rows
Query 17 iteration 2 took 2963.0 ms and returned 1 rows
Query 17 avg time: 2987.34 ms
I need someone else who can run the benchmark and verify this on a spare machine.
Are there any user-facing changes?