Join cardinality computation for cost-based nested join optimizations #3787
Conversation
+ max(ij_cardinality, right_num_rows)
- ij_cardinality
}
_ => unreachable!(),
I would prefer not to panic here (even if currently it is unreachable, we could add join types later).
This always needs to be in sync with the outer match (which actually covers all the existing enumerations), so technically it should always be covered (if I am not missing something).
I think keeping the model as simple as possible is likely the most robust approach and leads to the least surprising plans. This PR seems to have a reasonably straightforward model, so 👍

In general, there are many known limitations to a cost-based optimizer (because cost models are always estimates and thus can "get it wrong" sometimes, and it is typically hard to predict / debug when this happens, because there are correlations or skew in the data, for example). I would personally love to see DataFusion head more towards the "dynamically optimize at runtime" approach for joins (e.g. have the join operators themselves scan the inputs to see if they can determine which is smaller, etc). That being said, the reason CBO is so prevalent is that it is relatively simple to implement and well understood, so I don't have any objections to pursuing a more sophisticated cost model for DataFusion.

Thank you for this @isidentical
Definitely agree with all the points @alamb! The same is true for most systems (e.g. compilers/interpreters); once you have the ability to determine a path with real data (JIT compilers), it is always going to be superior / more reliable than profile-guided static analysis. But as you have already mentioned, doing adaptive (dynamic) optimizations is something that needs to be planned more carefully (in terms of the overall design of the execution process) and is a lot more complex than simpler cost-based analyzers.

I'd be happy to also prepare a more general document after this PR about the remaining cost estimations and how we can have a more reliable computation for other foundational operations (filter selectivity, aggregates, etc.). But that doesn't necessarily need to block the work on the adaptiveness part (e.g. one of the things that I'd love to see [and even maybe work on, if I can find some time] is #816, which is a mix of both a CBO and an adaptive optimization which we can (and should) actually do at runtime).
I will try and find some time tomorrow to review this in detail. It is hard to find the time to review all the PRs!
Thanks so much for all the reviews today ❤️ Let me know if there is anything that I can do to shed more light on (in terms of implementation); I'd be happy to assist in the review process as much as I can.
return None;
}

let col_selectivity = max(left_stat.distinct_count, right_stat.distinct_count);
I think we might need a fallback for the number of distinct values (as it's often expensive to have distinct counts).
If the min and max are of numeric origin, we should be able to use min(num_left_rows, scalar_range(left_stats.min, left_stats.max)) if there is no distinct_count (I remember seeing this pattern, though not sure where it was). WDYT @Dandandan?
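As a rough illustration, here is a minimal sketch of what such a fallback could look like, assuming integer-like columns where every value in [min, max] may occur; the helper name and signature are hypothetical, not DataFusion's actual statistics API:

```rust
use std::cmp::min;

// Hypothetical helper (not DataFusion's API): estimate a column's distinct
// count when the statistics don't carry one, by falling back to the size of
// the numeric [min, max] range, capped by the number of rows.
fn estimate_distinct_count(
    num_rows: usize,
    distinct_count: Option<usize>,
    min_value: Option<i64>,
    max_value: Option<i64>,
) -> Option<usize> {
    distinct_count.or_else(|| match (min_value, max_value) {
        (Some(lo), Some(hi)) if hi >= lo => {
            // The distinct count can never exceed the number of rows, nor the
            // number of representable values in the range.
            let range = hi.saturating_sub(lo) as usize + 1;
            Some(min(num_rows, range))
        }
        _ => None,
    })
}
```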
That also makes sense to me. I also remember seeing something like that.
Thank you @isidentical -- as you mention, this cost model is relatively simple, but I think it is a good base to build on, and its testing is top notch 👍
Thank you again
// Seems like there are a few implementations of this algorithm that implement
// exponential decay for the selectivity (like Hive's Optiq Optimizer). Needs
// further exploration.
join_selectivity = col_selectivity;
If the join columns aren't correlated then a better estimate might be
join_selectivity = col_selectivity * join_selectivity;
If I understand it correctly, this is already what Hive's Optiq optimizer does (I've put a little comment to explain it). I was going to implement it as well, but there was an observation from the Spark side saying this can lead to a significant underestimation in certain cases:
* Generally, inner join with multiple join keys can also be estimated based on the above
* formula:
* T(A IJ B) = T(A) * T(B) / (max(V(A.k1), V(B.k1)) * max(V(A.k2), V(B.k2)) * ... * max(V(A.kn), V(B.kn)))
* However, the denominator can become very large and excessively reduce the result, so we use a
* conservative strategy to take only the largest max(V(A.ki), V(B.ki)) as the denominator.
Should we put it off as a follow-up ticket and try to find a couple of cases to see how it behaves in practice (or I can also make the change here)?
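For illustration, here is a small sketch (a hypothetical helper, not the code in this PR) contrasting the multiplicative denominator from the quoted formula with the conservative strategy of dividing only by the largest max(V(A.ki), V(B.ki)):

```rust
// Hypothetical sketch of the two denominator strategies discussed above.
// `max_distinct_per_key[i]` holds max(V(A.ki), V(B.ki)) for the i-th join key.
fn inner_join_cardinality(
    left_rows: usize,
    right_rows: usize,
    max_distinct_per_key: &[usize],
    multiplicative: bool,
) -> Option<usize> {
    let denominator = if multiplicative {
        // T(A) * T(B) / (max(V(A.k1), V(B.k1)) * ... * max(V(A.kn), V(B.kn)))
        max_distinct_per_key.iter().product::<usize>()
    } else {
        // Conservative variant: divide only by the largest max(V(A.ki), V(B.ki)).
        *max_distinct_per_key.iter().max()?
    };
    if denominator == 0 {
        return None;
    }
    Some(left_rows * right_rows / denominator)
}
```

With several join keys the multiplicative denominator grows very quickly, which is the underestimation risk the quoted Spark comment warns about.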
I think empirical testing is a good idea -- all of these options are estimates and thus will be wrong in some cases so I think it is just a question of what tradeoffs we want to make
// Since we don't have any information about the selectivity,
// we can only assume that the join will produce the cartesian
// product.
_ => left_num_rows * right_num_rows,
For what it is worth, in practice joins where the cost matters are almost never cartesian products (because if they were, for large inputs the query would never finish, and if the inputs are small it doesn't really matter where in the join tree they go).
We could also give up here (instead of trying to provide a conservative estimate) and see if we can come up with a more reasonable solution. The only case where this happens is when we don't know the uniqueness of some join columns. @Dandandan and I have been discussing an algorithm to actually use min/max values (and the total number of rows) to see if we can somehow infer distinct_count without having it in the statistics. That would hopefully make this case redundant.
Filed #3813 for this.
I think it might be better to give up in that case?
There is also this presentation about optimizing the order of joins without statistics available (which also seems to do fine). We could also see if we can reuse some of these ideas:
Makes sense! I've pushed 9e4467e, which should turn this off now. We can work on #3813 to also improve the case where the distinct count is not known but min/max is known (which seems common), and then also start looking into the solution from DuckDB. (Thanks a lot for the link btw, it seems super interesting!)
Some(selectivity) if selectivity > 0 => {
    (left_num_rows * right_num_rows) / selectivity
}
When I went over some boundary conditions, I think this model is fairly simplistic but reasonable:
- 1:1 join (where there is exactly one or zero matches for each input on the non-probe side) with a single predicate column -- the model will assume the join does not filter any rows
1:1 Join (where there is exactly one or zero matches for each input on the non probe side) with a single predicate column -- model will assume the join does not filter any rows
Exactly. I think the very first next step for this PR would actually be to start on the selectivity aspect for the join filters. We don't currently compute it, but there are a few good examples of how we can use a simplistic method on the propagated min/max values to determine the selectivity (when we can; obviously that might not be the case for some filters).
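For context, here is a minimal sketch (an assumed helper, not existing DataFusion code) of the kind of simplistic min/max-based filter selectivity being alluded to, e.g. for a col <= value predicate under a uniform-distribution assumption:

```rust
// Hypothetical sketch: estimate the selectivity of `col <= value` from the
// column's [min, max] statistics, assuming uniformly distributed values.
fn lte_selectivity(min_value: f64, max_value: f64, value: f64) -> f64 {
    if value < min_value {
        // No row can satisfy the predicate.
        0.0
    } else if value >= max_value || max_value == min_value {
        // Every row satisfies the predicate (or the range is a single point).
        1.0
    } else {
        // Fraction of the value range that satisfies the predicate.
        (value - min_value) / (max_value - min_value)
    }
}
```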
I left some comment / style suggestions but nothing required -- @isidentical please let me know if you would like to merge this PR as is or make changes.
@alamb thanks so much for all the reviews, I addressed all the stylistic points. Regarding the logical part (exponential
Thanks again @isidentical -- I think this is a great step forward
Benchmark runs are scheduled for baseline = b654fde and contender = ac20bfd. ac20bfd is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
// With the assumption that the smaller input's domain is generally represented in the bigger
// input's domain, we can estimate the inner join's cardinality by taking the cartesian product
// of the two inputs and normalizing it by the selectivity factor.
One question: why use the selectivity factor (aka max_distinct) to normalize? @isidentical
Since this is an inner join, only the intersection on the join columns is going to be present in the result. In the conditions above we have already verified that the ranges intersect, so the number of distinct values (with a uniform distribution) is the most reliable way to actually find out the number of matching rows.

E.g. a.col1 has [1, 10000] (all distinct) and b.col1 has [1, 1000] (all distinct); for an inner join on a.col1=b.col1 we can estimate the result cardinality as 10000 x 1000 / max(10000, 1000) = 1000. The same applies for distinct counts lower than the actual ranges (e.g. if a.col1 had 1000 distinct values instead of 10000, then there might be 10000 rows, hence the normalization above ^).
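As a quick check of the arithmetic, the example can be plugged into the formula directly (illustrative snippet only, not code from the PR):

```rust
// Illustrative only: T(A IJ B) = T(A) * T(B) / max(V(A.col1), V(B.col1)).
fn main() {
    let (left_rows, right_rows) = (10_000u64, 1_000u64);

    // Both columns fully distinct: the estimate collapses to the smaller side.
    assert_eq!(left_rows * right_rows / u64::max(10_000, 1_000), 1_000);

    // If a.col1 only had 1_000 distinct values, the estimate grows to 10_000 rows.
    assert_eq!(left_rows * right_rows / u64::max(1_000, 1_000), 10_000);
}
```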
Got it, thanks a lot! @isidentical
I think the stats estimation should be applied to LogicalPlans instead of PhysicalPlans
Could you explain why? The estimation moved to physical plans at some point, since at the logical plan stage there are often no statistics available.
By the way, should we try to keep the discussion in a single place to make it easier for everyone to follow? (We already have an open thread in the design doc, so maybe we should either move it to a new issue to discuss further (on top of the existing discussions in #962) or keep it in the document.)
Sure, let's discuss the topic in the Google doc.
Which issue does this PR close?
This PR is an initial step towards #128, although it does not close it (more info below).
Rationale for this change
DataFusion's statistics can be helpful when making cost-based optimizations (e.g. the probe ordering for hash join uses the statistics to determine which side is 'heavier'). However, there are a few operations that don't produce any statistics, since the result is hard to know beforehand and we don't want to await the full results before making the optimizations (the physical optimizer pass, for example, is not adaptive, but rather runs once when planning the query with the stats we can derive statically). There has been extensive academic and practical work on this estimation, with both very complex algorithms and simpler ones. This PR is just an introduction towards implementing the basic principles of join cardinality computation from Spark's Catalyst optimizer (which AFAIK is heavily inspired by Hive's Optiq CBO), which is very simple and heavily battle-tested (initially shared by @Dandandan & @alamb on #128).
What changes are included in this PR?
The Catalyst optimizer is very extensive in terms of cost calculation, so this PR is just the initial step towards having something similar. It implements cardinality estimation (without filter selectivity, which would probably be the next follow-up) for inner joins and for the other join types whose cardinality we can derive from the inner join cardinality. More description of the algorithms is available in the code, as well as in the blog post the Spark/Databricks team has shared.
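To illustrate the "derive other join types from the inner join estimate" part (and the max(ij_cardinality, right_num_rows) line visible in the diff excerpt above), here is a hedged sketch; the enum and function are illustrative stand-ins, not the PR's exact code:

```rust
use std::cmp::max;

// Illustrative stand-ins (not the PR's exact types): once the inner join
// cardinality is estimated, outer joins can be bounded by the preserved side(s).
enum JoinKind {
    Inner,
    Left,
    Right,
    Full,
}

fn estimate_join_cardinality(
    kind: JoinKind,
    ij_cardinality: usize,
    left_num_rows: usize,
    right_num_rows: usize,
) -> usize {
    match kind {
        JoinKind::Inner => ij_cardinality,
        // A left outer join emits at least one row per left input row.
        JoinKind::Left => max(ij_cardinality, left_num_rows),
        // A right outer join emits at least one row per right input row.
        JoinKind::Right => max(ij_cardinality, right_num_rows),
        // A full outer join preserves both sides, so at least the larger input.
        JoinKind::Full => max(ij_cardinality, max(left_num_rows, right_num_rows)),
    }
}
```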
Are there any user-facing changes?
Yes, this brings more statistics and more potential for optimization.