Start setting up tpch planning benchmarks #8665

matthewmturner · 2023-12-27T16:30:59Z

Which issue does this PR close?

Part of #8638

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

andygrove · 2023-12-27T17:55:31Z

datafusion/core/benches/sql_planner.rs

+        Field::new("l_discount", DataType::Float64, false),
+        Field::new("l_tax", DataType::Float64, false),


It may not be important for this benchmark, but financial amounts should be decimal rather than float. We have these schema definitions already in https://github.com/apache/arrow-datafusion/blob/main/benchmarks/src/tpch/mod.rs#L45 that should match the TPC-H specification.

I must have messed something up when I was cleaning up and creating the PR; I had actually copied from that exact location. Thanks for catching, will fix.
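To illustrate why the float-versus-decimal distinction matters for money-like columns such as l_discount and l_tax, here is a minimal std-only sketch. The `Decimal` struct and both helper functions are hypothetical names introduced for illustration. They only mimic the general idea of a fixed-point value (an integer plus a scale) and are not the Arrow `DataType` API; the scale of 2 is likewise an assumption.

```rust
/// Hypothetical fixed-point amount: an unscaled integer plus a decimal scale.
/// This only mirrors the general idea of a decimal type; it is not Arrow's API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Decimal {
    unscaled: i128,
    scale: u32,
}

impl Decimal {
    /// Adds two amounts that share the same scale.
    fn add(self, other: Decimal) -> Decimal {
        assert_eq!(self.scale, other.scale, "scales must match");
        Decimal {
            unscaled: self.unscaled + other.unscaled,
            scale: self.scale,
        }
    }
}

/// Returns true when summing 0.10 and 0.20 as f64 yields exactly 0.30.
fn float_sum_is_exact() -> bool {
    0.10_f64 + 0.20_f64 == 0.30_f64
}

/// Returns true when the same sum is exact in fixed-point with scale 2.
fn decimal_sum_is_exact() -> bool {
    let a = Decimal { unscaled: 10, scale: 2 }; // represents 0.10
    let b = Decimal { unscaled: 20, scale: 2 }; // represents 0.20
    a.add(b) == Decimal { unscaled: 30, scale: 2 } // represents 0.30
}
```

The binary float sum picks up a rounding error, while the scaled-integer sum stays exact, which is the usual argument for decimal types in financial columns.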

viirya · 2023-12-27T19:27:57Z

datafusion/core/benches/sql_planner.rs

+    let q1_sql = std::fs::read_to_string("../../benchmarks/queries/q1.sql").unwrap();
+    c.bench_function("physical_plan_tpch_q1", |b| {
+        b.iter(|| physical_plan(&ctx, &q1_sql))
+    });
+
+    let q12_sql = std::fs::read_to_string("../../benchmarks/queries/q12.sql").unwrap();
+    c.bench_function("physical_plan_tpch_q12", |b| {
+        b.iter(|| physical_plan(&ctx, &q12_sql))
+    });


If this is going to add all tpch queries here, I'm wondering if we should create a separate benchmark for tpch?

My understanding was that the intent of #8638 was to have all planning benchmarks in a single place, rather than to isolate benchmarks per source (tpch / clickbench / etc). That being said, I can see value in that if we wanted to take our plan benchmarks a step further and perhaps compare to other engines.

I don't have a strong opinion either way; happy to go with the consensus here.

What do you think about having a benchmark like this that planned them all in a single go. Something like

    let q1_sql = std::fs::read_to_string("../../benchmarks/queries/q1.sql").unwrap();
    let q12_sql = std::fs::read_to_string("../../benchmarks/queries/q12.sql").unwrap();
    c.bench_function("physical_plan_tpch", |b| {
        // Plan every query inside a single timed iteration
        b.iter(|| {
            physical_plan(&ctx, &q1_sql);
            physical_plan(&ctx, &q12_sql);
        })
    });

And we can add per-query benchmarks if needed / desired?

I am personally torn between per-query benchmarks, which would provide more detail but require more aggregation to summarize, and this single number, which is less specific.

Indeed, there are several ways we can go about this. If there isn't an immediate need for per-query benchmarks then I think aggregating could work, and then we can add per-query plans over time or as needed.

Also, and probably not for this PR, it looks like criterion has some features that would allow you to establish a baseline benchmark (https://bheisler.github.io/criterion.rs/book/user_guide/command_line_options.html#baselines). I'm wondering if that could be useful for integrating into CI to ensure no regressions.

There is also the option of using benchmark_group, which lets you group function benchmarks in the final output.
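The trade-off discussed in this thread (one aggregate number versus one number per query) can be sketched without criterion at all. The following is a minimal std-only illustration, not the benchmark from this PR: `plan` is a hypothetical stand-in that just counts tokens, and `time_per_query` and `time_all` are made-up helper names. A real benchmark would rely on criterion for warm-up and statistics rather than raw `Instant` timing.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for planning a query: it only counts tokens.
fn plan(sql: &str) -> usize {
    sql.split_whitespace().count()
}

/// Per-query timing: one measurement per named query (more detail).
fn time_per_query(queries: &[(&str, &str)], iters: u32) -> Vec<(String, Duration)> {
    queries
        .iter()
        .map(|(name, sql)| {
            let start = Instant::now();
            for _ in 0..iters {
                // black_box keeps the optimizer from discarding the work
                black_box(plan(black_box(*sql)));
            }
            (name.to_string(), start.elapsed())
        })
        .collect()
}

/// Aggregate timing: a single measurement covering every query.
fn time_all(queries: &[(&str, &str)], iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        for (_, sql) in queries {
            black_box(plan(black_box(*sql)));
        }
    }
    start.elapsed()
}
```

The per-query variant returns one entry per query and needs summarizing by the reader, while the aggregate variant yields a single duration that is easier to track but cannot show which query regressed.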

alamb

Thank you @matthewmturner -- this is looking great to me. I think we can address the feedback and then iterate in subsequent PRs

alamb · 2023-12-28T20:37:00Z

I believe CI will be fixed by merging up from main -- the clippy issue was fixed in #8662

matthewmturner · 2023-12-29T05:31:33Z

@alamb I made updates and went with your approach. With this in place it should be pretty easy to iterate and add more targeted benchmarks as needed. Let me know if you think anything else is needed.

alamb

Thanks @matthewmturner

alamb · 2023-12-30T14:11:47Z

datafusion/core/benches/sql_planner.rs

+    let q12_sql = std::fs::read_to_string("../../benchmarks/queries/q12.sql").unwrap();
+    let q13_sql = std::fs::read_to_string("../../benchmarks/queries/q13.sql").unwrap();
+    let q14_sql = std::fs::read_to_string("../../benchmarks/queries/q14.sql").unwrap();
+    // let q15_sql = std::fs::read_to_string("../../benchmarks/queries/q15.sql").unwrap();


It might be good in a follow-on PR to note why this query is commented out.
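One possible way to make a skipped query explicit, rather than leaving a commented-out line, is to build the list of query paths from a range with a skip list. This is only a sketch: `tpch_query_paths` is a hypothetical helper, and it assumes the query files are named q1.sql through q22.sql under the directory used in the diff above. The reason q15 is skipped is not stated in this thread, so the comment at the call site would still need to record it.

```rust
/// Builds paths for TPC-H queries 1 through 22, omitting any listed in `skip`.
/// Hypothetical helper; assumes files are named q<N>.sql.
fn tpch_query_paths(skip: &[u32]) -> Vec<String> {
    (1..=22u32)
        .filter(|n| !skip.contains(n))
        .map(|n| format!("../../benchmarks/queries/q{}.sql", n))
        .collect()
}
```

Calling `tpch_query_paths(&[15])` yields 21 paths with q15.sql left out, and the skip list gives a single obvious place to document why.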

alamb · 2023-12-30T14:12:23Z

I think we should get this in and iterate.

alamb · 2023-12-30T14:45:07Z

Just for fun I ran this benchmark under a profiler and I see lots of time spent in DFSchema related items:

This is going to be great

github-actions bot added the core Core DataFusion crate label Dec 27, 2023

andygrove reviewed Dec 27, 2023

viirya reviewed Dec 27, 2023

alamb approved these changes Dec 28, 2023

alamb mentioned this pull request Dec 28, 2023

DataFusion weekly project plan (Andrew Lamb) - Dec 25, 2023 #8655

Closed

7 tasks

matthewmturner added 2 commits December 28, 2023 20:17

Start setting up tpch planning benchmarks

7626eee

Add remaining tpch queries

707a018

matthewmturner force-pushed the feat/benchmark-plans branch from 0f0e478 to 707a018 December 29, 2023 01:20

matthewmturner added 2 commits December 28, 2023 20:42

Fix bench function

ee67c4d

Clippy

5d432b9

alamb approved these changes Dec 30, 2023

alamb merged commit 545275b into apache:main Dec 30, 2023
22 checks passed

alamb mentioned this pull request Dec 30, 2023

Make a faster way to check column existence in optimizer (not is_err()) #5309

Closed

matthewgapp mentioned this pull request Jan 11, 2024

matt/feat/recursive ctes/config flag matthewgapp/arrow-datafusion#3

Closed

appletreeisyellow mentioned this pull request Jan 11, 2024

chore: temporary branch for IOx update (12-25-2023 to 12-31-2023) appletreeisyellow/datafusion#4

Closed
