Fix join order for TPCH Q17 & Q18 by improving FilterExec statistics #8126
Conversation
@berkaysynnada Could you take a look and make sure this looks sensible?
```diff
-    return Ok(Statistics::new_unknown(&schema));
+    // assume worst case, that the filter is highly selective and
+    // returns all the rows from its input
+    return Ok(input_stats.clone().into_inexact());
```
I wonder if we can make a slightly different assumption that is a better metric, e.g. each filter returning 50% or 20% of input rows?
We could add a configuration option to control the default selectivity. I'll take a look.
This is an interesting idea. Since statistics will be `Inexact`, it should never result in an incorrect output, but may improve average-case complexity.
It looks like making this configurable will be a larger change. I filed #8133 and linked to it from the comment here.
The talk "Join Order Optimization with (almost) no Statistics" is focused on full join reordering rather than just choosing the build side of a join, but it discusses selectivity estimates and is very relevant to this discussion. They found that a selectivity of 0.2 worked well with TPC-H.
I pushed a change to use 0.2 as the default, and now Q18 has an improved join order as well. I updated the results in the PR description.
Default selectivities / cost estimates work reasonably well for TPC-H queries, where the data is relatively uniformly distributed.
However, in my experience they tend to cause problems in general when the data is skewed or has correlations between columns.
Hopefully we'll be able to keep the number of hard-coded constants / assumptions low in DataFusion (so there are fewer things for the optimizer to get wrong :) )
Co-authored-by: Daniël Heres <[email protected]>
```rust
let selectivity = 0.2_f32;
let mut stats = input_stats.clone().into_inexact();
if let Precision::Inexact(n) = stats.num_rows {
    stats.num_rows = Precision::Inexact((selectivity * n as f32) as usize);
}
```
I think we can/should do the same for the `total_byte_size` value.
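To illustrate that suggestion, here is a minimal sketch of applying the same selectivity factor to both `num_rows` and `total_byte_size`. The `Precision` and `Statistics` types below are simplified stand-ins that only mirror the shape of DataFusion's types; the helper names (`scale`, `filter_statistics`) are hypothetical and not part of the crate's API.

```rust
// Simplified stand-ins for DataFusion's `Precision` / `Statistics` types;
// illustrative only, not the crate's actual definitions.
#[derive(Clone, Debug, PartialEq)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    // Demote an exact value to inexact, like `into_inexact` in the PR.
    fn into_inexact(self) -> Precision {
        match self {
            Precision::Exact(n) => Precision::Inexact(n),
            other => other,
        }
    }

    // Scale an inexact estimate by a selectivity factor (hypothetical helper).
    fn scale(self, selectivity: f32) -> Precision {
        match self {
            Precision::Inexact(n) => {
                Precision::Inexact((selectivity * n as f32) as usize)
            }
            other => other,
        }
    }
}

struct Statistics {
    num_rows: Precision,
    total_byte_size: Precision,
}

// Apply the default 20% selectivity to both the row count and the byte
// size, following the review suggestion above.
fn filter_statistics(input: Statistics, selectivity: f32) -> Statistics {
    Statistics {
        num_rows: input.num_rows.into_inexact().scale(selectivity),
        total_byte_size: input.total_byte_size.into_inexact().scale(selectivity),
    }
}
```

Scaling `total_byte_size` by the same factor assumes average row width is unchanged by the filter, which is exact for fixed-width rows and a reasonable heuristic otherwise.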
Thanks @andygrove, impressive results with a small change!
Once the type
cc @NGA-TRAN
I am surprised no test needed changing after this PR. How will we ensure future statistics changes don't mess up the TPCH plans 🤔
BTW, when statistics support range estimates, we will not need to hardcode any assumption like 20% in how the operator reports its statistics. In this specific example, the filter would say it could be anything between 0 and the incoming number of rows, which is accurate. The logic that consumes these stats is then free to make any heuristic assumptions it wants on top of this information.
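A minimal sketch of that range-based idea follows. The `RowRange` type and both function names are hypothetical, illustrating the division of labor described above (the operator reports an honest interval; the consumer applies its own point heuristic); this is not DataFusion's actual API.

```rust
// Hypothetical interval estimate for an operator's output row count.
#[derive(Clone, Copy, Debug, PartialEq)]
struct RowRange {
    min: usize,
    max: usize,
}

// A filter with no column statistics can only report "somewhere between
// zero rows and all of the input rows" -- accurate by construction.
fn filter_row_range(input_rows: usize) -> RowRange {
    RowRange { min: 0, max: input_rows }
}

// A consumer (e.g. build-side selection for a join) collapses the
// interval to a point estimate with its own heuristic, here 20% of the
// upper bound, clamped to stay inside the reported range.
fn point_estimate(range: RowRange, selectivity: f64) -> usize {
    let est = (range.max as f64 * selectivity) as usize;
    est.clamp(range.min, range.max)
}
```

The key property is that the operator's report stays correct regardless of the data, while the 20% guess lives entirely in the consumer and can be tuned or replaced without touching the operator.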
I was surprised as well and checked why: it is because the tests use CSV rather than Parquet.
Maybe we can create some sort of "statistics only" test that uses the statistics from the Parquet files, but doesn't need the actual data, and verify the plan that way.
Which issue does this PR close?
Closes #7949
Closes #7950
Rationale for this change
Improve benchmark results.
What changes are included in this PR?
Q17 performs a join where the left input is a `ParquetExec` reading `lineitem` and the right input is a `FilterExec` wrapping a `ParquetExec` that reads `part`.

Both `ParquetExec`s provide `num_rows`, but the `FilterExec` around the `part` input discards all statistics from the underlying `ParquetExec`. This means that the existing optimizations for choosing the build side of the join cannot determine which input is smaller, due to the missing statistics.

This PR changes the behavior of `FilterExec::statistics` in the case where we cannot determine accurate statistics. Instead of returning `num_rows` as `Precision::Absent`, we now assume that the filter selects 20% of rows from its input. There is a follow-up issue #8133 to make this configurable.

Benchmark Results: TPCH @ SF10
I see an overall improvement from 128 seconds to 104 seconds locally, so around 18% faster.
Are these changes tested?
Existing tests.
Are there any user-facing changes?
No