
Join cardinality computation for cost-based nested join optimizations #3787

Merged
merged 3 commits into apache:master on Oct 13, 2022

Conversation

isidentical
Contributor

Which issue does this PR close?

This PR is an initial step towards #128, although it does not close it (more info below).

Rationale for this change

DataFusion's statistics can be helpful when making cost-based optimizations (e.g. the probe ordering for hash joins uses statistics to determine which side is 'heavier'). However, there are a few operations that don't produce any statistics, since the result is hard to know beforehand and we don't want to wait for the full results before optimizing (the physical optimizer pass, for example, is not adaptive; it runs once when planning the query, with whatever statistics can be derived statically). There has been extensive academic and practical work on this kind of estimation, ranging from very complex algorithms to much simpler ones. This PR is an introduction to the basic principles of join cardinality computation from Spark's Catalyst optimizer (which AFAIK is heavily inspired by Hive's Optiq CBO); it is very simple and heavily battle-tested (initially shared by @Dandandan & @alamb on #128).

What changes are included in this PR?

The Catalyst optimizer is very extensive in terms of cost calculation, so this PR is just an initial step towards something similar. It implements cardinality estimation (without filter selectivity, which would probably be the next follow-up) for inner joins, as well as for the other join types whose cardinality we can derive from the inner join cardinality. More details on the algorithms are available in the code and in the blog post the Spark/Databricks team has shared; a rough sketch of the core estimate is below.
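
A minimal sketch of the Catalyst-style single-key estimate, with illustrative names rather than the PR's actual API:

use std::cmp::max;

// Estimate the inner join cardinality on one equi-join key as
// T(A) * T(B) / max(V(A.k), V(B.k)), where V(..) is the distinct count.
fn estimate_inner_join_cardinality(
    left_num_rows: usize,
    right_num_rows: usize,
    left_distinct: usize,
    right_distinct: usize,
) -> Option<usize> {
    let selectivity = max(left_distinct, right_distinct);
    if selectivity == 0 {
        // No usable distinct-count statistics: give up rather than guess.
        return None;
    }
    Some((left_num_rows * right_num_rows) / selectivity)
}

fn main() {
    // 10_000 distinct keys on one side, 1_000 on the other: the estimate
    // is bounded by the smaller side's domain.
    assert_eq!(
        estimate_inner_join_cardinality(10_000, 1_000, 10_000, 1_000),
        Some(1_000)
    );
}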

Are there any user-facing changes?

Yes, this brings more statistics and more potential for optimization.

@github-actions github-actions bot added the core Core DataFusion crate label Oct 10, 2022
@isidentical isidentical marked this pull request as ready for review October 10, 2022 23:08
@andygrove andygrove self-requested a review October 11, 2022 03:20
+ max(ij_cardinality, right_num_rows)
- ij_cardinality
}
_ => unreachable!(),

Contributor

I would prefer not to panic here (even if currently it is unreachable, we could add join types later).

Contributor Author

This always needs to be in sync with the outer match (which actually covers all the existing enumerations), so technically it should always be covered (if I am not missing something).
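
For reference, a hedged sketch (simplified, hypothetical names; not the PR's actual code) of how such a match can derive the other join types' estimates from the inner join cardinality while producing "no estimate" instead of panicking:

// Hypothetical, reduced JoinType; DataFusion's real enum has more variants.
enum JoinType {
    Inner,
    Left,
    Right,
    Full,
}

// Outer joins preserve every row of their preserved side(s), so the inner
// join estimate acts as a lower bound for them.
fn join_cardinality(
    join_type: &JoinType,
    ij_cardinality: usize,
    left_num_rows: usize,
    right_num_rows: usize,
) -> Option<usize> {
    match join_type {
        JoinType::Inner => Some(ij_cardinality),
        JoinType::Left => Some(ij_cardinality.max(left_num_rows)),
        JoinType::Right => Some(ij_cardinality.max(right_num_rows)),
        JoinType::Full => Some(
            ij_cardinality.max(left_num_rows) + ij_cardinality.max(right_num_rows)
                - ij_cardinality,
        ),
        // A variant added later would then yield "no estimate" rather than
        // a panic. (Dead code while the arms above stay exhaustive, which
        // is exactly the situation discussed in this thread.)
        #[allow(unreachable_patterns)]
        _ => None,
    }
}

fn main() {
    // A left join returns at least every left row.
    assert_eq!(join_cardinality(&JoinType::Left, 500, 1_000, 2_000), Some(1_000));
    assert_eq!(join_cardinality(&JoinType::Full, 500, 1_000, 2_000), Some(2_500));
}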

@alamb
Contributor

alamb commented Oct 11, 2022

I think keeping the model as simple as possible is likely the most robust approach and will lead to the least surprising plans. This PR seems to have a reasonably straightforward model, so 👍

In general, there are many known limitations to a cost-based optimizer: cost models are always estimates and thus can "get it wrong" sometimes, and it is typically hard to predict / debug when this happens (because there are correlations or skew in the data, for example).

I would personally love to see DataFusion head more towards the "dynamically optimize at runtime" approach for joins (like have the join operators themselves scan the inputs to see if they can determine which is smaller, etc).

That being said, the reason that CBO is so prevalent is that it is relatively simple to implement and well understood, so I don't have any objections to pursuing a more sophisticated cost model for DataFusion

Thank you for this @isidentical

@isidentical
Contributor Author

Definitely agree with all the points @alamb! The same is true for most systems (e.g. compilers/interpreters); once you have the ability to choose a path with real data (as JIT compilers do), it is always going to be superior to / more reliable than profile-guided static analysis. But as you mentioned, doing adaptive (dynamic) optimizations is something that needs to be planned more carefully (in terms of the overall design of the execution process), and it is a lot more complex than simpler cost-based analyzers.

I'd be happy to prepare a more general document after this PR about the remaining cost estimations and how we can compute them more reliably for other foundational operations (filter selectivity, aggregates, etc.). But that doesn't necessarily need to block the work on the adaptiveness part (e.g. one of the things I'd love to see [and maybe even work on, if I can find some time] is #816, which is a mix of both CBO and adaptive optimization that we can (and should) actually do at runtime).

@alamb
Contributor

alamb commented Oct 11, 2022

I will try and find some time tomorrow to review this in detail. It is hard to find the time to review all the PRs!

@isidentical
Contributor Author

I will try and find some time tomorrow to review this in detail. It is hard to find the time to review all the PRs!

Thanks so much for all the reviews today ❤️ Let me know if there is anything I can shed more light on (in terms of the implementation); I'd be happy to assist in the review process as much as I can.

return None;
}

let col_selectivity = max(left_stat.distinct_count, right_stat.distinct_count);

Contributor

I think we might need a fallback for the number of distinct values (as it's often expensive to have distinct counts).

Contributor Author

If the min and max are of numeric origin, we should be able to use min(num_left_rows, scalar_range(left_stats.min, left_stats.max)) if there is no distinct_count (I remember seeing this pattern, though not sure where it was). WDYT @Dandandan?
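
A sketch of that fallback for an integer column (hypothetical helper, not an existing DataFusion function):

// When the distinct count is missing, bound it by the width of the closed
// interval [min, max]; it can also never exceed the row count.
fn estimate_distinct_count(
    num_rows: usize,
    distinct_count: Option<usize>,
    min: Option<i64>,
    max: Option<i64>,
) -> Option<usize> {
    distinct_count.or_else(|| match (min, max) {
        (Some(lo), Some(hi)) if hi >= lo => {
            let range = hi.wrapping_sub(lo) as u64; // exact because hi >= lo
            Some(num_rows.min(range.saturating_add(1) as usize))
        }
        _ => None,
    })
}

fn main() {
    // 1_000 rows whose values lie in [1, 100]: at most 100 distinct values.
    assert_eq!(
        estimate_distinct_count(1_000, None, Some(1), Some(100)),
        Some(100)
    );
}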

Contributor

That also makes sense to me. I also believe I've seen something like that.

Contributor

@alamb alamb left a comment

Thank you @isidentical -- as you mention, this model is relatively simple, but I think it is a good base on which to build, and its testing is top notch 👍

Thank you again

// Seems like there are a few implementations of this algorithm that implement
// exponential decay for the selectivity (like Hive's Optiq Optimizer). Needs
// further exploration.
join_selectivity = col_selectivity;

Contributor

If the join columns aren't correlated, then a better estimate might be:

join_selectivity = col_selectivity * join_selectivity;

Contributor Author

If I understand it correctly, this is already what Hive's Optiq optimizer does (I've put a little comment to explain it). I was going to implement it as well, but there was an observation from the Spark side that this can lead to significant underestimation in certain cases:

   * Generally, inner join with multiple join keys can also be estimated based on the above
   * formula:
   * T(A IJ B) = T(A) * T(B) / (max(V(A.k1), V(B.k1)) * max(V(A.k2), V(B.k2)) * ... * max(V(A.kn), V(B.kn)))
   * However, the denominator can become very large and excessively reduce the result, so we use a
   * conservative strategy to take only the largest max(V(A.ki), V(B.ki)) as the denominator.

Should we defer it to a follow-up ticket and try to find a couple of cases to see how it behaves in practice (or I can also make the change here)?
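
To make the trade-off concrete, a small illustrative sketch (not the PR's code) contrasting the two strategies, where each entry is max(V(A.ki), V(B.ki)) for one join key pair:

// Combine per-key selectivity factors for a multi-key equi-join.
fn combined_selectivity(per_key: &[usize], multiply_all: bool) -> Option<usize> {
    if multiply_all {
        // Hive/Optiq style: multiply every factor. This can underestimate
        // badly when the join keys are correlated.
        per_key.iter().copied().reduce(|a, b| a * b)
    } else {
        // Catalyst's conservative strategy: take only the largest factor.
        per_key.iter().copied().max()
    }
}

fn main() {
    let factors = [1_000, 500];
    assert_eq!(combined_selectivity(&factors, true), Some(500_000));
    assert_eq!(combined_selectivity(&factors, false), Some(1_000));
}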

Contributor

I think empirical testing is a good idea -- all of these options are estimates and thus will be wrong in some cases, so I think it is just a question of what tradeoffs we want to make.

// Since we don't have any information about the selectivity,
// we can only assume that the join will produce the cartesian
// product.
_ => left_num_rows * right_num_rows,

Contributor

For what it is worth, in practice the joins where the cost matters are almost never cartesian products (because if the inputs were large the query would never finish, and if the inputs are small it doesn't really matter where in the join tree they go).

Contributor Author

We could also give up here (instead of trying to provide a conservative estimate) and see if we can come up with a more reasonable solution. The only case where this happens is when we don't know the uniqueness of some join columns. @Dandandan and I have been discussing an algorithm that uses the min/max values (and the total number of rows) to see if we can infer distinct_count without having it in the statistics. That would hopefully make this case redundant.

Filed #3813 for this.

Contributor

I think it might be better to give up in that case?

There is also this presentation about optimizing the order of joins without statistics available (which also seems to do fine). We could also see if we can reuse some of these ideas:

https://www.youtube.com/watch?v=aNRoR0Z3SzU

Contributor Author

Makes sense! I've pushed 9e4467e, which should turn this off now. We can work on #3813 to improve the case where the distinct count is not known but min/max is (which seems common), and then also start looking into the solution from DuckDB. (Thanks a lot for the link btw, it seems super interesting!)

Comment on lines 439 to 441
Some(selectivity) if selectivity > 0 => {
(left_num_rows * right_num_rows) / selectivity
}

Contributor

When I went over some boundary conditions, I found this model fairly simplistic but reasonable:

  • 1:1 join (where there is exactly one or zero matches for each input row on the non-probe side) with a single predicate column -- the model will assume the join does not filter any rows

Contributor Author

1:1 join (where there is exactly one or zero matches for each input row on the non-probe side) with a single predicate column -- the model will assume the join does not filter any rows

Exactly. I think the very next step after this PR would actually be starting on the selectivity aspect for join filters. We don't currently compute it, but there are a few good examples of how a simplistic method based on the propagated min/max values can determine the selectivity (when possible; obviously that might not be the case for some filters).
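
As a hypothetical illustration of what that could look like for a numeric col <= value predicate on a uniformly distributed column (nothing DataFusion computes today):

// Estimate the fraction of rows satisfying `col <= value` from the
// column's propagated [min, max] range, assuming a uniform distribution.
fn selectivity_of_le(min: f64, max: f64, value: f64) -> Option<f64> {
    if !(max > min) {
        return None; // empty or degenerate range: no reliable estimate
    }
    Some(((value - min) / (max - min)).clamp(0.0, 1.0))
}

fn main() {
    // Half of a uniform [0, 100] column satisfies `col <= 50`.
    assert_eq!(selectivity_of_le(0.0, 100.0, 50.0), Some(0.5));
}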

@alamb
Contributor

alamb commented Oct 12, 2022

I left some comment / style suggestions but nothing required -- @isidentical please let me know if you would like to merge this PR as is or make changes.

@isidentical
Contributor Author

@alamb thanks so much for all the reviews; I addressed all the stylistic points. Regarding the logical part (exponential join_selectivity, filter selectivity, cartesian product in the worst-case scenario), I've tried to explain the current reasoning and the potential for future work, but I can also start including some of it in this PR (depending on what is easier for you folks to review).

@alamb alamb merged commit ac20bfd into apache:master Oct 13, 2022
@alamb
Contributor

alamb commented Oct 13, 2022

Thanks again @isidentical -- I think this is a great step forward

@ursabot

ursabot commented Oct 13, 2022

Benchmark runs are scheduled for baseline = b654fde and contender = ac20bfd. ac20bfd is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java


// With the assumption that the smaller input's domain is generally represented in the bigger
// input's domain, we can estimate the inner join's cardinality by taking the cartesian product
// of the two inputs and normalizing it by the selectivity factor.

Member

@xudong963 xudong963 Oct 24, 2022

One question: why use the selectivity factor (aka max_distinct) to normalize? @isidentical

Contributor Author

Since this is an inner join, only the intersection on the join columns is going to be present in the result. In the conditions above we have already verified that the ranges intersect, so the number of distinct values (assuming a uniform distribution) is the most reliable way to estimate the number of matching rows.

E.g. if a.col1 has [1, 10000] (all distinct) and b.col1 has [1, 1000] (all distinct), then for an inner join on a.col1 = b.col1 we can estimate the result cardinality as 10000 x 1000 / max(10000, 1000) = 1000. The same applies for distinct counts lower than the actual ranges (e.g. if a.col1 had 1000 distinct values instead of 10000, each b.col1 value could match up to 10 rows, giving 10000 x 1000 / max(1000, 1000) = 10000 rows -- hence the normalization above ^).
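
Both scenarios, spelled out as plain arithmetic:

use std::cmp::max;

fn main() {
    // All values distinct on both sides: bounded by the smaller domain.
    assert_eq!(10_000 * 1_000 / max(10_000, 1_000), 1_000);
    // a.col1 with only 1_000 distinct values: each b.col1 value can match
    // up to 10 rows of a, so the estimate grows to 10_000.
    assert_eq!(10_000 * 1_000 / max(1_000, 1_000), 10_000);
}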

Member

Got it, thanks a lot! @isidentical

@mingmwang
Contributor

I think the stats estimation should be applied to LogicalPlans instead of PhysicalPlans

@Dandandan
Contributor

I think the stats estimation should be applied to LogicalPlans instead of PhysicalPlans

Could you explain why? The estimation moved to physical plans at some point, as at the logical-plan stage there are often no statistics available.

@isidentical
Contributor Author

By the way, should we try to keep the discussion in a single place to make it easier for everyone to follow? (We already have an open thread in the design doc, so maybe we should either move it to a new issue to discuss further (on top of the existing discussions in #962) or keep it in the document.)

@mingmwang
Contributor

Sure, let's discuss the topic in the google doc.
