Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Start setting up tpch planning benchmarks #8665

Merged
merged 4 commits into from
Dec 30, 2023

Conversation

matthewmturner
Copy link
Contributor

@matthewmturner matthewmturner commented Dec 27, 2023

Which issue does this PR close?

Part of #8638

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Dec 27, 2023
Comment on lines 71 to 72
Field::new("l_discount", DataType::Float64, false),
Field::new("l_tax", DataType::Float64, false),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may not be important for this benchmark, but financial amounts should be decimal rather than float. We have these schema definitions already in https://github.com/apache/arrow-datafusion/blob/main/benchmarks/src/tpch/mod.rs#L45 that should match the TPC-H specification.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must have messed something up when i was cleaning up and creating PR, i had actually copied from that exact location. Thx for catching, will fix.

Comment on lines 167 to 273
let q1_sql = std::fs::read_to_string("../../benchmarks/queries/q1.sql").unwrap();
c.bench_function("physical_plan_tpch_q1", |b| {
b.iter(|| physical_plan(&ctx, &q1_sql))
});

let q12_sql = std::fs::read_to_string("../../benchmarks/queries/q12.sql").unwrap();
c.bench_function("physical_plan_tpch_q12", |b| {
b.iter(|| physical_plan(&ctx, &q12_sql))
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is going to add all tpch queries here, I'm wondering if we should create a separate benchmark for tpch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was my understanding the intent from #8638 was to have all planning benchmarks in a single place and less to be able to isolate benchmark per source (tpch / clickbench / etc). that being said, i can see value in that if we wanted to take our plan benchmarks a step further and perhaps compare to other engines.

i dont have a strong opinion either way - happy to go with the consensus here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about having a benchmark like this that planned them all in a single go. Something like

    let q1_sql = std::fs::read_to_string("../../benchmarks/queries/q1.sql").unwrap();
    let q12_sql = std::fs::read_to_string("../../benchmarks/queries/q12.sql").unwrap();

    c.bench_function("physical_plan_tpch", |b| {
        b.iter(|| physical_plan(&ctx, &q1_sql))
        b.iter(|| physical_plan(&ctx, &q12_sql))
    });

And we can add per-query benchmarks if needed / desired?

I am personally torn between per-query benchmarks which would provide more detail, but require more aggregation to summarize and this single number, with less specificity

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, there are several ways we can go about this. If there isnt an immediate need for per query benchmarks then i think aggregating could work and then we can add per query plans over time or as needed.

Also, and probably not for this PR, it looks like criterion has some features that would allow you to establish a baseline benchmark (https://bheisler.github.io/criterion.rs/book/user_guide/command_line_options.html#baselines). I'm wondering if that could be useful for integrating into CI to ensure no regressions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also the option of using benchmark_group which allows you to group function benchmarks in final output.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @matthewmturner -- this is looking great to me. I think we can address the feedback and then iterate in subsequent PRs

Comment on lines 167 to 273
let q1_sql = std::fs::read_to_string("../../benchmarks/queries/q1.sql").unwrap();
c.bench_function("physical_plan_tpch_q1", |b| {
b.iter(|| physical_plan(&ctx, &q1_sql))
});

let q12_sql = std::fs::read_to_string("../../benchmarks/queries/q12.sql").unwrap();
c.bench_function("physical_plan_tpch_q12", |b| {
b.iter(|| physical_plan(&ctx, &q12_sql))
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about having a benchmark like this that planned them all in a single go. Something like

    let q1_sql = std::fs::read_to_string("../../benchmarks/queries/q1.sql").unwrap();
    let q12_sql = std::fs::read_to_string("../../benchmarks/queries/q12.sql").unwrap();

    c.bench_function("physical_plan_tpch", |b| {
        b.iter(|| physical_plan(&ctx, &q1_sql))
        b.iter(|| physical_plan(&ctx, &q12_sql))
    });

And we can add per-query benchmarks if needed / desired?

I am personally torn between per-query benchmarks which would provide more detail, but require more aggregation to summarize and this single number, with less specificity

@alamb
Copy link
Contributor

alamb commented Dec 28, 2023

I believe CI will be fixed by merging up from main -- the clippy issue was fixed in #8662

@matthewmturner
Copy link
Contributor Author

@alamb i made updates and went with your approach. with this in place it should be pretty easy to iterate and add more targeted benchmarks as needed. let me know if you think anything else needed.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let q12_sql = std::fs::read_to_string("../../benchmarks/queries/q12.sql").unwrap();
let q13_sql = std::fs::read_to_string("../../benchmarks/queries/q13.sql").unwrap();
let q14_sql = std::fs::read_to_string("../../benchmarks/queries/q14.sql").unwrap();
// let q15_sql = std::fs::read_to_string("../../benchmarks/queries/q15.sql").unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be good in a follow on PR to note why this query is commented out.

@alamb
Copy link
Contributor

alamb commented Dec 30, 2023

I think we should get this in and iterate.

@alamb alamb merged commit 545275b into apache:main Dec 30, 2023
22 checks passed
@alamb
Copy link
Contributor

alamb commented Dec 30, 2023

Just for fun I ran this benchmark under a profiler and I see lots of time spent in DFSchema related items:

Screenshot 2023-12-30 at 9 43 54 AM

This is going to be great

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants