Fix join order for TPCH Q17 & Q18 by improving FilterExec statistics #8126

Merged
merged 11 commits into from
Nov 12, 2023


@andygrove (Member) commented Nov 10, 2023

Which issue does this PR close?

Closes #7949
Closes #7950

Rationale for this change

Improve benchmark results.

What changes are included in this PR?

Q17 performs a join where the left input is a ParquetExec reading lineitem and the right input is FilterExec wrapping a ParquetExec that reads part.

Both ParquetExecs provide num_rows, but the FilterExec around the part input discards all statistics from the underlying ParquetExec. Without those statistics, the existing optimizations for choosing the build side of the join cannot determine which input is smaller.
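The effect of a missing row count on build-side selection can be sketched as follows. This is a minimal illustration, not DataFusion's join selection code: the `Precision` enum here is a local stand-in for the real type, and `BuildSide` and `choose_build_side` are hypothetical names introduced only for this example.

```rust
/// Local stand-in for a statistics value that may be exact, estimated,
/// or missing. Hypothetical; not the real DataFusion type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    /// Returns the row count if one is available, exact or estimated.
    pub fn get_value(self) -> Option<usize> {
        match self {
            Precision::Exact(n) | Precision::Inexact(n) => Some(n),
            Precision::Absent => None,
        }
    }
}

/// Which join input should be used as the build side (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BuildSide {
    Left,
    Right,
}

/// Picks the smaller input as the build side. Returns `None` when either
/// row count is absent, because the sizes cannot be compared.
pub fn choose_build_side(left: Precision, right: Precision) -> Option<BuildSide> {
    let (l, r) = (left.get_value()?, right.get_value()?);
    Some(if r < l { BuildSide::Right } else { BuildSide::Left })
}
```

With an illustrative left input of 60,000,000 rows, `choose_build_side(Precision::Exact(60_000_000), Precision::Absent)` yields `None`, while any estimate on the right, even an inexact one, allows a decision.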

This PR changes the behavior of FilterExec::statistics in the case where we cannot determine accurate statistics. Instead of returning num_rows as Precision::Absent, we now assume that the filter selects 20% of rows from its input. There is a follow-up issue #8133 to make this configurable.
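A rough sketch of the fallback described above, under stated assumptions: `Precision` is again a local stand-in rather than the DataFusion type, and `DEFAULT_SELECTIVITY` and `estimate_filter_rows` are hypothetical names. Only the 20% figure and the Inexact result come from the PR.

```rust
/// Local stand-in for a statistics value. Hypothetical; not the real type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

/// Fraction of input rows the filter is assumed to keep (20%, per the PR).
pub const DEFAULT_SELECTIVITY: f32 = 0.2;

/// Estimates the filter's output row count when nothing better is known.
/// The result is always `Inexact`, since it is a guess, and stays
/// `Absent` when the input row count is itself unknown.
pub fn estimate_filter_rows(input_rows: Precision) -> Precision {
    match input_rows {
        Precision::Exact(n) | Precision::Inexact(n) => {
            Precision::Inexact((DEFAULT_SELECTIVITY * n as f32) as usize)
        }
        Precision::Absent => Precision::Absent,
    }
}
```

Because the estimate is marked Inexact, a consumer can still tell it apart from a measured row count.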

Benchmark Results: TPCH @ SF10

I see an overall improvement from 128 seconds to 104 seconds locally, so around 18% faster.

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ improve-filter-stats ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  5685.47ms │            5513.51ms │     no change │
│ QQuery 2     │  1464.46ms │             740.12ms │ +1.98x faster │
│ QQuery 3     │  4642.75ms │            2840.77ms │ +1.63x faster │
│ QQuery 4     │  2443.85ms │            2520.49ms │     no change │
│ QQuery 5     │  6045.88ms │            5907.45ms │     no change │
│ QQuery 6     │  1347.40ms │            1368.79ms │     no change │
│ QQuery 7     │ 11895.06ms │            8319.67ms │ +1.43x faster │
│ QQuery 8     │  6517.29ms │            6113.85ms │ +1.07x faster │
│ QQuery 9     │ 10649.04ms │            9081.60ms │ +1.17x faster │
│ QQuery 10    │  5829.50ms │            5666.43ms │     no change │
│ QQuery 11    │  1526.80ms │             813.19ms │ +1.88x faster │
│ QQuery 12    │  2432.82ms │            2441.65ms │     no change │
│ QQuery 13    │  5136.62ms │            4757.18ms │ +1.08x faster │
│ QQuery 14    │  1939.14ms │            1949.23ms │     no change │
│ QQuery 15    │  1552.10ms │            1628.54ms │     no change │
│ QQuery 16    │  1465.63ms │             967.13ms │ +1.52x faster │
│ QQuery 17    │ 10933.48ms │            6389.48ms │ +1.71x faster │
│ QQuery 18    │ 20957.45ms │           17243.47ms │ +1.22x faster │
│ QQuery 19    │  3415.85ms │            3120.01ms │ +1.09x faster │
│ QQuery 20    │  4453.76ms │            4291.37ms │     no change │
│ QQuery 21    │ 16522.46ms │           11447.19ms │ +1.44x faster │
│ QQuery 22    │  1585.07ms │            1703.55ms │  1.07x slower │
└──────────────┴────────────┴──────────────────────┴───────────────┘
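For readers checking the Change column, the labels appear to follow from the ratio of the two timings. The sketch below reproduces that arithmetic; the 5% noise band is an assumption inferred from the rows above (ratios below 1.05 are shown as "no change"), and `describe_change` is a hypothetical helper, not the benchmark script itself.

```rust
/// Assumed noise band: ratios within 5% are reported as "no change".
const NOISE_THRESHOLD: f64 = 0.05;

/// Formats the relative change between a baseline and a comparison timing,
/// both in milliseconds.
pub fn describe_change(baseline_ms: f64, comparison_ms: f64) -> String {
    if comparison_ms < baseline_ms {
        // Comparison branch is faster: ratio is baseline / comparison.
        let ratio = baseline_ms / comparison_ms;
        if ratio < 1.0 + NOISE_THRESHOLD {
            "no change".to_string()
        } else {
            format!("+{:.2}x faster", ratio)
        }
    } else {
        // Comparison branch is slower or equal: ratio is comparison / baseline.
        let ratio = comparison_ms / baseline_ms;
        if ratio < 1.0 + NOISE_THRESHOLD {
            "no change".to_string()
        } else {
            format!("{:.2}x slower", ratio)
        }
    }
}
```

For example, Q17 gives 10933.48 / 6389.48 ≈ 1.71, and Q22 gives 1703.55 / 1585.07 ≈ 1.07 in the slower direction.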

Are these changes tested?

Existing tests.

Are there any user-facing changes?

No

@andygrove andygrove self-assigned this Nov 10, 2023
@andygrove andygrove added the performance Make DataFusion faster label Nov 10, 2023
@andygrove andygrove changed the title WIP: Fix join order for TPCH Q17 Fix join order for TPCH Q17 Nov 10, 2023
@andygrove andygrove marked this pull request as ready for review November 10, 2023 18:49
@andygrove
Member Author

@berkaysynnada Could you take a look and make sure this looks sensible?

@andygrove andygrove added the enhancement New feature or request label Nov 10, 2023
- return Ok(Statistics::new_unknown(&schema));
+ // assume worst case, that the filter is highly selective and
+ // returns all the rows from its input
+ return Ok(input_stats.clone().into_inexact());
Contributor

I wonder if we can make a slightly different assumption that is a better metric, e.g. each filter returning 50% or 20% of input rows?

Member Author

We could add a configuration option to control the default selectivity. I'll take a look.

Contributor

This is an interesting idea. Since statistics will be Inexact, it should never result in an incorrect output, but may improve average-case complexity.

Member Author

It looks like making this configurable will be a larger change. I filed #8133 and linked to it from the comment here.

Member Author

The talk "Join Order Optimization with (almost) no Statistics" focuses on full join reordering rather than just choosing the build side of a join, but it covers selectivity estimates and is very relevant to this discussion. They found that a selectivity of 0.2 worked well with TPC-H.

Member Author

I pushed a change to use 0.2 as the default, and now Q18 has an improved join order as well. I updated the results in the PR description.

Contributor

Default selectivities / cost estimates work ok for TPCH queries where the data is relatively uniformly distributed.

However, in my experience they tend to cause problems when the data is skewed or has correlations between the columns.

Hopefully we'll be able to keep the number of hard coded constants / assumptions low in DataFusion (so there are fewer things for the optimizer to get wrong :) )

@github-actions github-actions bot added the core Core DataFusion crate label Nov 11, 2023
@andygrove andygrove changed the title Fix join order for TPCH Q17 Fix join order for TPCH Q17 & q18 Nov 11, 2023
@andygrove andygrove changed the title Fix join order for TPCH Q17 & q18 Fix join order for TPCH Q17 & Q18 by improving FilterExec statistics Nov 11, 2023
@github-actions github-actions bot removed the core Core DataFusion crate label Nov 11, 2023
let selectivity = 0.2_f32;
let mut stats = input_stats.clone().into_inexact();
if let Precision::Inexact(n) = stats.num_rows {
    stats.num_rows = Precision::Inexact((selectivity * n as f32) as usize);
Contributor

I think we can/should do the same for the total_byte_size value
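One way this suggestion could look, as a sketch only: the same selectivity factor is applied to both fields. `Stats` is a hypothetical two-field stand-in for the real statistics struct, and `scale_inexact` and `apply_default_selectivity` are names invented for this example.

```rust
/// Local stand-in for a statistics value. Hypothetical; not the real type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    /// Scales the value by `factor` and demotes it to `Inexact`.
    /// An absent value stays absent.
    pub fn scale_inexact(self, factor: f32) -> Precision {
        match self {
            Precision::Exact(n) | Precision::Inexact(n) => {
                Precision::Inexact((factor * n as f32) as usize)
            }
            Precision::Absent => Precision::Absent,
        }
    }
}

/// Hypothetical subset of plan statistics: only the two fields discussed.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Stats {
    pub num_rows: Precision,
    pub total_byte_size: Precision,
}

/// Applies the assumed selectivity to the row count and the byte size alike.
pub fn apply_default_selectivity(input: Stats, selectivity: f32) -> Stats {
    Stats {
        num_rows: input.num_rows.scale_inexact(selectivity),
        total_byte_size: input.total_byte_size.scale_inexact(selectivity),
    }
}
```

Scaling both fields keeps the implied average row width unchanged, which seems a reasonable default when nothing is known about which rows the filter keeps.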

@Dandandan Dandandan merged commit 6fe00ce into apache:main Nov 12, 2023
22 checks passed
@Dandandan
Contributor

Thanks @andygrove impressive results with a small change!

@ozankabak
Contributor

Once the type Precision is extended to support range estimates in addition to point estimates, we will be able to do much more. This small change demonstrates the potential of a good statistics infrastructure really well. Thanks @andygrove

@andygrove andygrove deleted the improve-filter-stats branch November 12, 2023 16:09
@alamb
Contributor

alamb commented Nov 13, 2023

cc @NGA-TRAN

@alamb
Contributor

alamb commented Nov 13, 2023

I am surprised no test needed changing after this PR. How will we ensure future statistics changes don't mess up the TPCH plans 🤔

@ozankabak
Contributor

BTW when statistics support range estimates, we will not need to hardcode any assumption like 20% within how the operator reports the statistics. In this specific example, the filter would say it could be anything between 0 and the incoming number of rows, which is accurate. The logic that consumes these stats is now free to make any heuristic assumptions it wants to make on top of this information.
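The separation described here can be sketched with a hypothetical range type. Nothing below exists in DataFusion as written: `RowRange`, `filter_output_range`, and `point_estimate` are illustrative names, and the selectivity heuristic on the consumer side is an assumption.

```rust
/// Hypothetical range estimate for a row count: the true value is
/// somewhere in `min..=max`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RowRange {
    pub min: usize,
    pub max: usize,
}

/// What a filter can state without guessing: it may drop every row or
/// keep every row, so the output is between 0 and the input maximum.
pub fn filter_output_range(input: RowRange) -> RowRange {
    RowRange { min: 0, max: input.max }
}

/// Consumer-side heuristic: collapse a range to a single number by
/// assuming a selectivity within the range. The operator itself never
/// needs to know this constant.
pub fn point_estimate(range: RowRange, selectivity: f64) -> usize {
    let span = range.max.saturating_sub(range.min) as f64;
    range.min + (span * selectivity.clamp(0.0, 1.0)) as usize
}
```

The filter then reports only what it knows for certain, and a consumer such as build-side selection is free to apply 0.2 or any other assumption on top.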

@Dandandan
Contributor

Dandandan commented Nov 13, 2023

> I am surprised no test needed changing after this PR. How will we ensure future statistics changes don't mess up the TPCH plans 🤔

Was surprised as well and checked why, it is because the tests use CSV rather than Parquet.

@alamb
Contributor

alamb commented Nov 13, 2023

> > I am surprised no test needed changing after this PR. How will we ensure future statistics changes don't mess up the TPCH plans 🤔
>
> Was surprised as well and checked why, it is because the tests use CSV rather than Parquet.

Maybe we can create a 'statistics only' test that uses the statistics from the Parquet files but doesn't need the actual data, and verify the plan that way.

Labels
enhancement New feature or request performance Make DataFusion faster