
Improve avg/sum Aggregator performance for Decimal #5866

Merged · 6 commits into apache:main · Apr 11, 2023

Conversation

mingmwang
Contributor

@mingmwang mingmwang commented Apr 4, 2023

Which issue does this PR close?

Closes #5858.
Closes #5859.

Rationale for this change

Improve avg(decimal) and sum(decimal) performance by avoiding the type cast in the inner loop; this improves TPC-H q17 performance.

What changes are included in this PR?

  1. Pull up the implicit cast from sum_batch() in the AvgAccumulator and SumAccumulator to aggregate_expressions(), so the cast is done in the outer loop.

  2. In AvgAccumulator, distinguish the sum data type from the return data type; for Decimal they are different.
    In Spark SQL, the sum type of avg is DECIMAL(min(38, precision+10), scale), and the return type is DECIMAL(min(38, precision+4), min(38, scale+4)). Add an overflow check for Decimal when converting from the internal sum type to the return type.
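The Spark SQL type rules and overflow check described in item 2 can be sketched as follows. This is a minimal illustration of the rules quoted above, not DataFusion's actual code; the helper names are invented for this example.

```rust
// Illustrative sketch of the Spark-style decimal type rules for avg.
// These helpers are hypothetical; DataFusion's real implementation differs.

const MAX_PRECISION: u8 = 38;

/// Internal sum type while accumulating: DECIMAL(min(38, p + 10), s).
fn avg_sum_type(precision: u8, scale: u8) -> (u8, u8) {
    ((precision + 10).min(MAX_PRECISION), scale)
}

/// Return type of avg: DECIMAL(min(38, p + 4), min(38, s + 4)).
fn avg_return_type(precision: u8, scale: u8) -> (u8, u8) {
    ((precision + 4).min(MAX_PRECISION), (scale + 4).min(MAX_PRECISION))
}

/// Overflow check when converting the internal sum to the return type:
/// an i128 value fits DECIMAL(p, _) only if it has at most p digits.
/// (Valid for p <= 38, since 10^38 still fits in an i128.)
fn fits_in_precision(value: i128, precision: u8) -> bool {
    let bound = 10_i128.pow(precision as u32);
    value > -bound && value < bound
}

fn main() {
    // A TPC-H-style input such as DECIMAL(15, 2):
    assert_eq!(avg_sum_type(15, 2), (25, 2));
    assert_eq!(avg_return_type(15, 2), (19, 6));
    assert!(fits_in_precision(99, 2));
    assert!(!fits_in_precision(100, 2)); // 100 needs 3 digits; overflows DECIMAL(2, _)
}
```

The point of the check is that when the scaled sum no longer fits the target precision, the conversion should report an overflow rather than silently wrap.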

Are these changes tested?

I tested this on my local Mac; for TPC-H q17 there is at least a 10% improvement.

Before this PR:

Query 17 iteration 0 took 3249.9 ms and returned 1 rows
Query 17 iteration 1 took 3430.5 ms and returned 1 rows
Query 17 iteration 2 took 3413.1 ms and returned 1 rows
Query 17 avg time: 3364.49 ms

After this PR:
Query 17 iteration 0 took 3019.4 ms and returned 1 rows
Query 17 iteration 1 took 2979.6 ms and returned 1 rows
Query 17 iteration 2 took 2963.0 ms and returned 1 rows
Query 17 avg time: 2987.34 ms

I need someone else to run the benchmark and verify this on a spare machine.

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions labels Apr 4, 2023
@mingmwang
Contributor Author

@liukun4515 @yahoNanJing

@Dandandan
Contributor

Great work, I can take a look tomorrow and rerun the benchmarks on my machine.

@Dandandan
Contributor

@mingmwang query 18 is also affected by this (although the difference should be smaller than for query 17)

@andygrove
Member

I tried testing the changes in this PR and ran into some errors when running query 1 using the code in https://github.com/sql-benchmarks/sqlbench-runners/tree/main/datafusion

thread 'tokio-runtime-worker' panicked at 'Unexpected accumulator state in hash aggregate: Internal("Arithmetic Overflow in AvgAccumulator")', /home/andy/.cargo/git/checkouts/arrow-datafusion-bfd9a8de51c58474/4e6eac5/datafusion/core/src/physical_plan/aggregates/row_hash.rs:642:81
thread 'tokio-runtime-worker' panicked at 'Unexpected accumulator state in hash aggregate: Internal("Arithmetic Overflow in AvgAccumulator")', /home/andy/.cargo/git/checkouts/arrow-datafusion-bfd9a8de51c58474/4e6eac5/datafusion/core/src/physical_plan/aggregates/row_hash.rs:642:81
thread 'tokio-runtime-worker' panicked at 'Unexpected accumulator state in hash aggregate: Internal("Arithmetic Overflow in AvgAccumulator")', /home/andy/.cargo/git/checkouts/arrow-datafusion-bfd9a8de51c58474/4e6eac5/datafusion/core/src/physical_plan/aggregates/row_hash.rs:642:81

I don't see these errors when running against the latest in the main branch.

@andygrove
Member

andygrove commented Apr 4, 2023

Here are the results on my desktop (24 core) for queries 17 and 18 (sf=10)

This PR

Executing query 17 from /home/andy/git/sql-benchmarks/sqlbench-h/queries/sf=10//q17.sql
Query 17 executed in: 25.076557605s
Query 17 executed in: 14.468542149s
Query 17 executed in: 13.238961128s
Executing query 18 from /home/andy/git/sql-benchmarks/sqlbench-h/queries/sf=10//q18.sql
Query 18 executed in: 8.197590942s
Query 18 executed in: 7.883707736s
Query 18 executed in: 7.936216688s

Main branch

Executing query 17 from /home/andy/git/sql-benchmarks/sqlbench-h/queries/sf=10//q17.sql
Query 17 executed in: 20.10441009s
Query 17 executed in: 14.731355792s
Query 17 executed in: 13.869329179s
Executing query 18 from /home/andy/git/sql-benchmarks/sqlbench-h/queries/sf=10//q18.sql
Query 18 executed in: 8.900509355s
Query 18 executed in: 8.695394752s
Query 18 executed in: 8.541570444s
Executing query 19 from /home/andy/git

@Dandandan
Contributor

When running in memory (SF=1)

cargo run --release --bin tpch benchmark datafusion --path ./data/ --format parquet --partitions 16 -q 17 --iterations 10 -m

main

Query 17 iteration 0 took 725.6 ms and returned 1 rows
Query 17 iteration 1 took 675.3 ms and returned 1 rows
Query 17 iteration 2 took 635.1 ms and returned 1 rows
Query 17 iteration 3 took 702.1 ms and returned 1 rows
Query 17 iteration 4 took 624.7 ms and returned 1 rows
Query 17 iteration 5 took 692.2 ms and returned 1 rows
Query 17 iteration 6 took 681.0 ms and returned 1 rows
Query 17 iteration 7 took 668.9 ms and returned 1 rows
Query 17 iteration 8 took 675.6 ms and returned 1 rows
Query 17 iteration 9 took 660.6 ms and returned 1 rows
Query 17 avg time: 674.12 ms
Query 18 iteration 0 took 377.6 ms and returned 57 rows
Query 18 iteration 1 took 361.1 ms and returned 57 rows
Query 18 iteration 2 took 353.7 ms and returned 57 rows
Query 18 iteration 3 took 362.2 ms and returned 57 rows
Query 18 iteration 4 took 359.8 ms and returned 57 rows
Query 18 iteration 5 took 356.8 ms and returned 57 rows
Query 18 iteration 6 took 350.8 ms and returned 57 rows
Query 18 iteration 7 took 360.6 ms and returned 57 rows
Query 18 iteration 8 took 351.6 ms and returned 57 rows
Query 18 iteration 9 took 352.2 ms and returned 57 rows
Query 18 avg time: 358.65 ms

PR

Query 17 iteration 0 took 640.5 ms and returned 1 rows
Query 17 iteration 1 took 649.0 ms and returned 1 rows
Query 17 iteration 2 took 597.3 ms and returned 1 rows
Query 17 iteration 3 took 639.3 ms and returned 1 rows
Query 17 iteration 4 took 645.2 ms and returned 1 rows
Query 17 iteration 5 took 653.8 ms and returned 1 rows
Query 17 iteration 6 took 628.6 ms and returned 1 rows
Query 17 iteration 7 took 643.7 ms and returned 1 rows
Query 17 iteration 8 took 600.0 ms and returned 1 rows
Query 17 iteration 9 took 591.8 ms and returned 1 rows
Query 17 avg time: 628.91 ms
Query 18 iteration 0 took 355.8 ms and returned 57 rows
Query 18 iteration 1 took 333.5 ms and returned 57 rows
Query 18 iteration 2 took 317.5 ms and returned 57 rows
Query 18 iteration 3 took 327.6 ms and returned 57 rows
Query 18 iteration 4 took 318.2 ms and returned 57 rows
Query 18 iteration 5 took 324.2 ms and returned 57 rows
Query 18 iteration 6 took 315.0 ms and returned 57 rows
Query 18 iteration 7 took 323.1 ms and returned 57 rows
Query 18 iteration 8 took 314.8 ms and returned 57 rows
Query 18 iteration 9 took 322.0 ms and returned 57 rows
Query 18 avg time: 325.17 ms

@Dandandan
Contributor

Although the PR shows an improvement, ~15% of the CPU samples still end up in cast.
flamegraph

@yahoNanJing yahoNanJing removed logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate labels Apr 5, 2023
@mingmwang
Contributor Author

I ran the test with sf=1 and partitions=1. I will do more tests with sf=10 tomorrow.

@yahoNanJing
Contributor

yahoNanJing commented Apr 6, 2023

@Dandandan, how did you generate the flame graph? On my side there are many [unknown] frames. Could you teach me how to avoid them and share your commands for the flame graph?
perf-0

cargo run --release --bin tpch -- benchmark datafusion --iterations 10 --path ./data-parquet --format parquet --partitions 1 --query 17

sudo perf record -F 99 -g -p 917304

sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
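A common cause of unresolved frames in a perf flame graph of a release build is missing debug info. A sketch of one way to address it (assuming a standard Cargo setup; the specific perf options here are suggestions, not something prescribed in this thread):

```shell
# Keep debug symbols in the release build so perf can resolve frames;
# equivalent to setting `debug = true` under [profile.release] in Cargo.toml.
export CARGO_PROFILE_RELEASE_DEBUG=true
cargo build --release --bin tpch

# Recording with DWARF-based call graphs often resolves frames that the
# default frame-pointer unwinding reports as [unknown]:
sudo perf record -F 99 --call-graph dwarf -p <pid>
```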

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions labels Apr 6, 2023
@mingmwang
Contributor Author

I tried testing the changes in this PR and ran into some errors when running query 1 using the code in https://github.com/sql-benchmarks/sqlbench-runners/tree/main/datafusion

thread 'tokio-runtime-worker' panicked at 'Unexpected accumulator state in hash aggregate: Internal("Arithmetic Overflow in AvgAccumulator")', /home/andy/.cargo/git/checkouts/arrow-datafusion-bfd9a8de51c58474/4e6eac5/datafusion/core/src/physical_plan/aggregates/row_hash.rs:642:81
thread 'tokio-runtime-worker' panicked at 'Unexpected accumulator state in hash aggregate: Internal("Arithmetic Overflow in AvgAccumulator")', /home/andy/.cargo/git/checkouts/arrow-datafusion-bfd9a8de51c58474/4e6eac5/datafusion/core/src/physical_plan/aggregates/row_hash.rs:642:81
thread 'tokio-runtime-worker' panicked at 'Unexpected accumulator state in hash aggregate: Internal("Arithmetic Overflow in AvgAccumulator")', /home/andy/.cargo/git/checkouts/arrow-datafusion-bfd9a8de51c58474/4e6eac5/datafusion/core/src/physical_plan/aggregates/row_hash.rs:642:81

I don't see these errors when running against the latest in the main branch.

I cannot reproduce the issue using DataFusion's own benchmark data (sf=10), but I am able to reproduce it using Spark-generated benchmark data. I guess Spark's TPC-H data schema is different from DataFusion's.

@yahoNanJing
Contributor

yahoNanJing commented Apr 6, 2023

Hi @Dandandan, with the latest code the cast is almost entirely avoided: the ratio is reduced from 16.86% to 0.46%. The related flame graphs are as follows:
perf-5866
perf-main

With single partition benchmark, the latency is reduced from 11s to 9s for q17.

@mingmwang
Contributor Author

I tried testing the changes in this PR and ran into some errors when running query 1 using the code in https://github.com/sql-benchmarks/sqlbench-runners/tree/main/datafusion

thread 'tokio-runtime-worker' panicked at 'Unexpected accumulator state in hash aggregate: Internal("Arithmetic Overflow in AvgAccumulator")', /home/andy/.cargo/git/checkouts/arrow-datafusion-bfd9a8de51c58474/4e6eac5/datafusion/core/src/physical_plan/aggregates/row_hash.rs:642:81
thread 'tokio-runtime-worker' panicked at 'Unexpected accumulator state in hash aggregate: Internal("Arithmetic Overflow in AvgAccumulator")', /home/andy/.cargo/git/checkouts/arrow-datafusion-bfd9a8de51c58474/4e6eac5/datafusion/core/src/physical_plan/aggregates/row_hash.rs:642:81
thread 'tokio-runtime-worker' panicked at 'Unexpected accumulator state in hash aggregate: Internal("Arithmetic Overflow in AvgAccumulator")', /home/andy/.cargo/git/checkouts/arrow-datafusion-bfd9a8de51c58474/4e6eac5/datafusion/core/src/physical_plan/aggregates/row_hash.rs:642:81

I don't see these errors when running against the latest in the main branch.

I added an overflow check when converting from the internal sum type to the result type; there was a bug, and it is fixed now.

@mingmwang
Contributor Author

Here are the results on my desktop (24 core) for queries 17 and 18 (sf=10)

This PR

Executing query 17 from /home/andy/git/sql-benchmarks/sqlbench-h/queries/sf=10//q17.sql
Query 17 executed in: 25.076557605s
Query 17 executed in: 14.468542149s
Query 17 executed in: 13.238961128s
Executing query 18 from /home/andy/git/sql-benchmarks/sqlbench-h/queries/sf=10//q18.sql
Query 18 executed in: 8.197590942s
Query 18 executed in: 7.883707736s
Query 18 executed in: 7.936216688s

Main branch

Executing query 17 from /home/andy/git/sql-benchmarks/sqlbench-h/queries/sf=10//q17.sql
Query 17 executed in: 20.10441009s
Query 17 executed in: 14.731355792s
Query 17 executed in: 13.869329179s
Executing query 18 from /home/andy/git/sql-benchmarks/sqlbench-h/queries/sf=10//q18.sql
Query 18 executed in: 8.900509355s
Query 18 executed in: 8.695394752s
Query 18 executed in: 8.541570444s
Executing query 19 from /home/andy/git

Could you please rerun the benchmark with partitions = 1 or partitions = 2?

@andygrove
Member

I tried testing the changes in this PR and ran into some errors when running query 1 using the code in https://github.com/sql-benchmarks/sqlbench-runners/tree/main/datafusion

thread 'tokio-runtime-worker' panicked at 'Unexpected accumulator state in hash aggregate: Internal("Arithmetic Overflow in AvgAccumulator")', /home/andy/.cargo/git/checkouts/arrow-datafusion-bfd9a8de51c58474/4e6eac5/datafusion/core/src/physical_plan/aggregates/row_hash.rs:642:81
thread 'tokio-runtime-worker' panicked at 'Unexpected accumulator state in hash aggregate: Internal("Arithmetic Overflow in AvgAccumulator")', /home/andy/.cargo/git/checkouts/arrow-datafusion-bfd9a8de51c58474/4e6eac5/datafusion/core/src/physical_plan/aggregates/row_hash.rs:642:81
thread 'tokio-runtime-worker' panicked at 'Unexpected accumulator state in hash aggregate: Internal("Arithmetic Overflow in AvgAccumulator")', /home/andy/.cargo/git/checkouts/arrow-datafusion-bfd9a8de51c58474/4e6eac5/datafusion/core/src/physical_plan/aggregates/row_hash.rs:642:81

I don't see these errors when running against the latest in the main branch.

I cannot reproduce the issue using DataFusion's own benchmark data (sf=10), but I am able to reproduce it using Spark-generated benchmark data. I guess Spark's TPC-H data schema is different from DataFusion's.

Maybe decimals vs floats? Official TPC-H uses decimals.

@mingmwang
Contributor Author

I don't see these errors when running against the latest in the main branch.

I cannot reproduce the issue using DataFusion's own benchmark data (sf=10), but I am able to reproduce it using Spark-generated benchmark data. I guess Spark's TPC-H data schema is different from DataFusion's.

Maybe decimals vs floats? Official TPC-H uses decimals.

Both are decimals; just the precision is a little different. Anyway, it is my bug.

@Dandandan
Contributor

Hi @mingmwang, for generating the flamegraphs I currently use cargo flamegraph from https://github.com/flamegraph-rs/flamegraph .

I use the following command:

CARGO_PROFILE_RELEASE_DEBUG=true cargo flamegraph --freq 5000 --bin tpch -- benchmark datafusion --path ./data/ --format parquet --partitions 16 -q 17 -d --iterations 10

@@ -31,3 +33,49 @@ pub fn get_accum_scalar_values_as_arrays(
.map(|s| s.to_array_of_size(1))
.collect::<Vec<_>>())
}

pub fn calculate_result_decimal_for_avg(
Contributor


This seems somewhat related to #5675 from @viirya, which calculates the output size of a decimal multiplication. I wonder if there is some more general function / approach than special-casing in the aggregator

Contributor Author

@mingmwang mingmwang Apr 7, 2023


I think it is a bit different from an explicit arithmetic expression in the SQL. The final result decimal conversion here is implicit, so the logical plans/optimizers are not aware of it. This is similar to the implicit cast that I pulled up in this PR: that cast is also implicit, and the logical plans/optimizers are not aware of it either. These implicit casts and conversions relate only to the internal state of a specific aggregation Accumulator; they are not part of the SQL plan/expression tree.
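The cast pull-up itself can be pictured with a tiny sketch (illustrative types and function names, not DataFusion's API): casting the whole batch once in the outer loop leaves the accumulator's hot loop as a plain sum over already-cast values.

```rust
// Hypothetical sketch of moving an implicit cast out of the inner loop.
// `Value` and `cast_to_sum_type` are invented for illustration.

enum Value {
    Int(i64),
    Dec(i128),
}

// The implicit cast to the accumulator's internal sum type.
fn cast_to_sum_type(v: &Value) -> i128 {
    match v {
        Value::Int(i) => *i as i128,
        Value::Dec(d) => *d,
    }
}

// Before: the cast runs inside the accumulator's sum loop, once per row.
fn sum_batch_with_inner_cast(batch: &[Value]) -> i128 {
    batch.iter().map(cast_to_sum_type).sum()
}

// After: the outer loop casts the whole batch up front...
fn cast_batch(batch: &[Value]) -> Vec<i128> {
    batch.iter().map(cast_to_sum_type).collect()
}

// ...so the hot loop is a plain sum over already-cast values.
fn sum_batch_pre_cast(batch: &[i128]) -> i128 {
    batch.iter().sum()
}

fn main() {
    let batch = vec![Value::Int(2), Value::Dec(40)];
    let pre = cast_batch(&batch);
    // Both strategies produce the same sum; only where the cast runs differs.
    assert_eq!(sum_batch_with_inner_cast(&batch), sum_batch_pre_cast(&pre));
    assert_eq!(sum_batch_pre_cast(&pre), 42);
}
```

In the real PR the pre-cast happens once per batch in aggregate_expressions(), where vectorized Arrow cast kernels can do the work, rather than per call inside sum_batch().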

@alamb alamb changed the title Improve avg/sum Aggregator performance Improve avg/sum Aggregator performance for Decimal Apr 6, 2023
@mingmwang
Copy link
Contributor Author

But in general, we could have more general decimal precision-change and overflow-check logic for Decimal, which the Aggregators could then reuse.

@mingmwang
Contributor Author

@Dandandan @andygrove @yahoNanJing
Could you please help review and approve this PR?

@yahoNanJing
Contributor

LGTM.

@yahoNanJing
Contributor

Hi @Dandandan, @andygrove, @alamb, do you still have any concerns about this PR? On my side, this PR achieves around a 20% performance improvement for q17 with a single partition. It would be better to merge this PR first and then continue refining the other bottlenecks :)

@yahoNanJing yahoNanJing merged commit c97048d into apache:main Apr 11, 2023
korowa pushed a commit to korowa/arrow-datafusion that referenced this pull request Apr 13, 2023
* improve avg/sum Aggregator performance

* check type before cast

* fix Arithmetic Overflow bug

* fix clippy