
Join cardinality computation for cost-based nested join optimizations #3787

Merged
merged 3 commits into apache:master on Oct 13, 2022

Conversation

isidentical
Contributor

Which issue does this PR close?

This PR is an initial step towards #128, although it does not close it (more info below).

Rationale for this change

DataFusion's statistics can be helpful when making cost-based optimizations (e.g. the probe ordering for hash joins uses statistics to determine which side is 'heavier'). However, there are a few operations that don't produce any statistics, since the result is hard to know beforehand and we don't want to wait for the full results before optimizing (the physical optimizer pass, for example, is not adaptive; it runs once when planning the query, with whatever statistics can be derived statically). There has been extensive academic and practical work on this kind of estimation, ranging from very complex algorithms to much simpler ones. This PR is an introduction to the basic principles of join cardinality computation from Spark's Catalyst optimizer (which AFAIK is heavily inspired by Hive's Optiq CBO); it is very simple and heavily battle-tested (initially shared by @Dandandan & @alamb on #128).

What changes are included in this PR?

The Catalyst optimizer is very extensive in terms of cost calculation, so this PR is just an initial step towards something similar. It implements cardinality estimation (without filter selectivity, which would probably be the next follow-up) for inner joins, as well as for the other join types whose cardinality we can derive from the inner join cardinality. More details on the algorithms are available in the code and in the blog post the Spark/Databricks team has shared; a rough sketch of the core estimate is below.
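
A minimal sketch of the Catalyst-style single-key estimate, with illustrative names rather than the PR's actual API:

use std::cmp::max;

// Estimate the inner join cardinality on one equi-join key as
// T(A) * T(B) / max(V(A.k), V(B.k)), where V(..) is the distinct count.
fn estimate_inner_join_cardinality(
    left_num_rows: usize,
    right_num_rows: usize,
    left_distinct: usize,
    right_distinct: usize,
) -> Option<usize> {
    let selectivity = max(left_distinct, right_distinct);
    if selectivity == 0 {
        // No usable distinct-count statistics: give up rather than guess.
        return None;
    }
    Some((left_num_rows * right_num_rows) / selectivity)
}

fn main() {
    // 10_000 distinct keys on one side, 1_000 on the other: the estimate
    // is bounded by the smaller side's domain.
    assert_eq!(
        estimate_inner_join_cardinality(10_000, 1_000, 10_000, 1_000),
        Some(1_000)
    );
}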

Are there any user-facing changes?

Yes, this brings more statistics and more potential for optimization.

@github-actions github-actions bot added the core Core DataFusion crate label Oct 10, 2022
@isidentical isidentical marked this pull request as ready for review October 10, 2022 23:08
@andygrove andygrove self-requested a review October 11, 2022 03:20
+ max(ij_cardinality, right_num_rows)
- ij_cardinality
}
_ => unreachable!(),

Contributor

I would prefer not to panic here (even if currently it is unreachable, we could add join types later).

Contributor Author

This always needs to be in sync with the outer match (which actually covers all the existing enumerations), so technically it should always be covered (if I am not missing something).
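
For reference, a hedged sketch (simplified, hypothetical names; not the PR's actual code) of how such a match can derive the other join types' estimates from the inner join cardinality while producing "no estimate" instead of panicking:

// Hypothetical, reduced JoinType; DataFusion's real enum has more variants.
enum JoinType {
    Inner,
    Left,
    Right,
    Full,
}

// Outer joins preserve every row of their preserved side(s), so the inner
// join estimate acts as a lower bound for them.
fn join_cardinality(
    join_type: &JoinType,
    ij_cardinality: usize,
    left_num_rows: usize,
    right_num_rows: usize,
) -> Option<usize> {
    match join_type {
        JoinType::Inner => Some(ij_cardinality),
        JoinType::Left => Some(ij_cardinality.max(left_num_rows)),
        JoinType::Right => Some(ij_cardinality.max(right_num_rows)),
        JoinType::Full => Some(
            ij_cardinality.max(left_num_rows) + ij_cardinality.max(right_num_rows)
                - ij_cardinality,
        ),
        // A variant added later would then yield "no estimate" rather than
        // a panic. (Dead code while the arms above stay exhaustive, which
        // is exactly the situation discussed in this thread.)
        #[allow(unreachable_patterns)]
        _ => None,
    }
}

fn main() {
    // A left join returns at least every left row.
    assert_eq!(join_cardinality(&JoinType::Left, 500, 1_000, 2_000), Some(1_000));
    assert_eq!(join_cardinality(&JoinType::Full, 500, 1_000, 2_000), Some(2_500));
}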

@alamb
Contributor

alamb commented Oct 11, 2022

I think keeping the model as simple as possible is likely the most robust approach and will lead to the least surprising plans. This PR seems to have a reasonably straightforward model, so 👍

In general, there are many known limitations to a cost-based optimizer: cost models are always estimates and thus can "get it wrong" sometimes, and it is typically hard to predict / debug when this happens (because there are correlations or skew in the data, for example).

I would personally love to see DataFusion head more towards the "dynamically optimize at runtime" approach for joins (like have the join operators themselves scan the inputs to see if they can determine which is smaller, etc).

That being said, the reason that CBO is so prevalent is that it is relatively simple to implement and well understood, so I don't have any objections to pursuing a more sophisticated cost model for DataFusion

Thank you for this @isidentical

@isidentical
Contributor Author

Definitely agree with all the points @alamb! The same is true for most systems (e.g. compilers/interpreters); once you have the ability to choose a path with real data (as JIT compilers do), it is always going to be superior to / more reliable than profile-guided static analysis. But as you mentioned, doing adaptive (dynamic) optimizations is something that needs to be planned more carefully (in terms of the overall design of the execution process), and it is a lot more complex than simpler cost-based analyzers.

I'd be happy to prepare a more general document after this PR about the remaining cost estimations and how we can compute them more reliably for other foundational operations (filter selectivity, aggregates, etc.). But that doesn't necessarily need to block the work on the adaptiveness part (e.g. one of the things I'd love to see [and maybe even work on, if I can find some time] is #816, which is a mix of both CBO and adaptive optimization that we can (and should) actually do at runtime).

@alamb
Contributor

alamb commented Oct 11, 2022

I will try and find some time tomorrow to review this in detail. It is hard to find the time to review all the PRs!

@isidentical
Contributor Author

I will try and find some time tomorrow to review this in detail. It is hard to find the time to review all the PRs!

Thanks so much for all the reviews today ❤️ Let me know if there is anything I can shed more light on (in terms of the implementation); I'd be happy to assist in the review process as much as I can.

return None;
}

let col_selectivity = max(left_stat.distinct_count, right_stat.distinct_count);

Contributor

I think we might need a fallback for the number of distinct values (as it's often expensive to have distinct counts).

Contributor Author

If the min and max are of numeric origin, we should be able to use min(num_left_rows, scalar_range(left_stats.min, left_stats.max)) if there is no distinct_count (I remember seeing this pattern, though not sure where it was). WDYT @Dandandan?
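
A sketch of that fallback for an integer column (hypothetical helper, not an existing DataFusion function):

// When the distinct count is missing, bound it by the width of the closed
// interval [min, max]; it can also never exceed the row count.
fn estimate_distinct_count(
    num_rows: usize,
    distinct_count: Option<usize>,
    min: Option<i64>,
    max: Option<i64>,
) -> Option<usize> {
    distinct_count.or_else(|| match (min, max) {
        (Some(lo), Some(hi)) if hi >= lo => {
            let range = hi.wrapping_sub(lo) as u64; // exact because hi >= lo
            Some(num_rows.min(range.saturating_add(1) as usize))
        }
        _ => None,
    })
}

fn main() {
    // 1_000 rows whose values lie in [1, 100]: at most 100 distinct values.
    assert_eq!(
        estimate_distinct_count(1_000, None, Some(1), Some(100)),
        Some(100)
    );
}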

Contributor

That also makes sense to me. I also believe I've seen something like that.

Contributor

@alamb alamb left a comment

Thank you @isidentical -- as you mention, this model is relatively simple, but I think it is a good base on which to build, and its testing is top notch 👍

Thank you again

// Seems like there are a few implementations of this algorithm that implement
// exponential decay for the selectivity (like Hive's Optiq Optimizer). Needs
// further exploration.
join_selectivity = col_selectivity;

Contributor

If the join columns aren't correlated, then a better estimate might be:

join_selectivity = col_selectivity * join_selectivity;

Contributor Author

If I understand it correctly, this is already what Hive's Optiq optimizer does (I've put a little comment to explain it). I was going to implement it as well, but there was an observation from the Spark side that this can lead to significant underestimation in certain cases:

   * Generally, inner join with multiple join keys can also be estimated based on the above
   * formula:
   * T(A IJ B) = T(A) * T(B) / (max(V(A.k1), V(B.k1)) * max(V(A.k2), V(B.k2)) * ... * max(V(A.kn), V(B.kn)))
   * However, the denominator can become very large and excessively reduce the result, so we use a
   * conservative strategy to take only the largest max(V(A.ki), V(B.ki)) as the denominator.

Should we defer it to a follow-up ticket and try to find a couple of cases to see how it behaves in practice (or I can also make the change here)?
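
To make the trade-off concrete, a small illustrative sketch (not the PR's code) contrasting the two strategies, where each entry is max(V(A.ki), V(B.ki)) for one join key pair:

// Combine per-key selectivity factors for a multi-key equi-join.
fn combined_selectivity(per_key: &[usize], multiply_all: bool) -> Option<usize> {
    if multiply_all {
        // Hive/Optiq style: multiply every factor. This can underestimate
        // badly when the join keys are correlated.
        per_key.iter().copied().reduce(|a, b| a * b)
    } else {
        // Catalyst's conservative strategy: take only the largest factor.
        per_key.iter().copied().max()
    }
}

fn main() {
    let factors = [1_000, 500];
    assert_eq!(combined_selectivity(&factors, true), Some(500_000));
    assert_eq!(combined_selectivity(&factors, false), Some(1_000));
}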

Contributor

I think empirical testing is a good idea -- all of these options are estimates and thus will be wrong in some cases, so I think it is just a question of what tradeoffs we want to make.

// Since we don't have any information about the selectivity,
// we can only assume that the join will produce the cartesian
// product.
_ => left_num_rows * right_num_rows,

Contributor

For what it is worth, in practice the joins where the cost matters are almost never cartesian products (because if the inputs were large the query would never finish, and if the inputs are small it doesn't really matter where in the join tree they go).

Contributor Author

We could also give up here (instead of trying to provide a conservative estimate) and see if we can come up with a more reasonable solution. The only case where this happens is when we don't know the uniqueness of some join columns. @Dandandan and I have been discussing an algorithm that uses the min/max values (and the total number of rows) to see if we can infer distinct_count without having it in the statistics. That would hopefully make this case redundant.

Filed #3813 for this.

Contributor

I think it might be better to give up in that case?

There is also this presentation about optimizing the order of joins without statistics available (which also seems to do fine). We could also see if we can reuse some of these ideas:

https://www.youtube.com/watch?v=aNRoR0Z3SzU

Contributor Author

Makes sense! I've pushed 9e4467e, which should turn this off now. We can work on #3813 to improve the case where the distinct count is not known but min/max is (which seems common), and then also start looking into the solution from DuckDB. (Thanks a lot for the link btw, it seems super interesting!)

Comment on lines 439 to 441
Some(selectivity) if selectivity > 0 => {
(left_num_rows * right_num_rows) / selectivity
}

Contributor

When I went over some boundary conditions, I found this model fairly simplistic but reasonable:

  • 1:1 join (where there is exactly one or zero matches for each input row on the non-probe side) with a single predicate column -- the model will assume the join does not filter any rows

Contributor Author

1:1 join (where there is exactly one or zero matches for each input row on the non-probe side) with a single predicate column -- the model will assume the join does not filter any rows

Exactly. I think the very next step after this PR would actually be starting on the selectivity aspect for join filters. We don't currently compute it, but there are a few good examples of how a simplistic method based on the propagated min/max values can determine the selectivity (when possible; obviously that might not be the case for some filters).
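
As a hypothetical illustration of what that could look like for a numeric col <= value predicate on a uniformly distributed column (nothing DataFusion computes today):

// Estimate the fraction of rows satisfying `col <= value` from the
// column's propagated [min, max] range, assuming a uniform distribution.
fn selectivity_of_le(min: f64, max: f64, value: f64) -> Option<f64> {
    if !(max > min) {
        return None; // empty or degenerate range: no reliable estimate
    }
    Some(((value - min) / (max - min)).clamp(0.0, 1.0))
}

fn main() {
    // Half of a uniform [0, 100] column satisfies `col <= 50`.
    assert_eq!(selectivity_of_le(0.0, 100.0, 50.0), Some(0.5));
}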

@alamb
Contributor

alamb commented Oct 12, 2022

I left some comment / style suggestions but nothing required -- @isidentical please let me know if you would like to merge this PR as is or make changes.

@isidentical
Contributor Author

@alamb thanks so much for all the reviews; I addressed all the stylistic points. Regarding the logical part (exponential join_selectivity, filter selectivity, cartesian product in the worst-case scenario), I've tried to explain the current reasoning and the potential for future work, but I can also start including some of it in this PR (depending on what is easier for you folks to review).

@alamb alamb merged commit ac20bfd into apache:master Oct 13, 2022
@alamb
Contributor

alamb commented Oct 13, 2022

Thanks again @isidentical -- I think this is a great step forward

@ursabot

ursabot commented Oct 13, 2022

Benchmark runs are scheduled for baseline = b654fde and contender = ac20bfd. ac20bfd is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java


// With the assumption that the smaller input's domain is generally represented in the bigger
// input's domain, we can estimate the inner join's cardinality by taking the cartesian product
// of the two inputs and normalizing it by the selectivity factor.

Member

@xudong963 xudong963 Oct 24, 2022

One question: why use the selectivity factor (aka max_distinct) to normalize? @isidentical

Contributor Author

Since this is an inner join, only the intersection on the join columns is going to be present in the result. In the conditions above we have already verified that the ranges intersect, so the number of distinct values (assuming a uniform distribution) is the most reliable way to estimate the number of matching rows.

E.g. if a.col1 has [1, 10000] (all distinct) and b.col1 has [1, 1000] (all distinct), then for an inner join on a.col1 = b.col1 we can estimate the result cardinality as 10000 x 1000 / max(10000, 1000) = 1000. The same applies for distinct counts lower than the actual ranges (e.g. if a.col1 had 1000 distinct values instead of 10000, each b.col1 value could match up to 10 rows, giving 10000 x 1000 / max(1000, 1000) = 10000 rows -- hence the normalization above ^).
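
Both scenarios, spelled out as plain arithmetic:

use std::cmp::max;

fn main() {
    // All values distinct on both sides: bounded by the smaller domain.
    assert_eq!(10_000 * 1_000 / max(10_000, 1_000), 1_000);
    // a.col1 with only 1_000 distinct values: each b.col1 value can match
    // up to 10 rows of a, so the estimate grows to 10_000.
    assert_eq!(10_000 * 1_000 / max(1_000, 1_000), 10_000);
}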

Member

Got it, thanks a lot! @isidentical

@mingmwang
Contributor

I think the stats estimation should be applied to LogicalPlans instead of PhysicalPlans

@Dandandan
Contributor

I think the stats estimation should be applied to LogicalPlans instead of PhysicalPlans

Could you explain why? The estimation moved to physical plans at some point, as at the logical-plan stage there are often no statistics available.

@isidentical
Contributor Author

By the way, should we try to keep the discussion in a single place to make it easier for everyone to follow? (We already have an open thread in the design doc, so maybe we should either move it to a new issue to discuss further (on top of the existing discussions in #962) or keep it in the document.)

@mingmwang
Contributor

Sure, let's discuss the topic in the google doc.
