Optimize nested joins #128

Open
alamb opened this issue Apr 26, 2021 · 1 comment
Labels
datafusion Changes in the datafusion crate

Comments

@alamb
Contributor

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-10964

Once https://github.com/apache/arrow/pull/8961 is merged, we will have an optimization for a JOIN that operates on two tables.

The next step is to extend this optimization to nested joins, which is not trivial. See the discussion in https://github.com/apache/arrow/pull/8961 for context.

@alamb alamb added the datafusion Changes in the datafusion crate label Apr 26, 2021
@alamb
Contributor Author

alamb commented Apr 26, 2021

Comment from Daniël Heres (Dandandan) @ 2020-12-22T13:07:20.151+0000:

Found some nice material from Spark on this:
https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html

Basically, the idea is to use column-level statistics such as:
* min/max
* number of distinct values
* null count

to estimate, for example, the selectivity of a filter.
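
To make the idea concrete, here is a minimal sketch of how such statistics could feed a selectivity estimate. Everything in it is an assumption for illustration: `ColumnStats` and its methods are hypothetical names (not DataFusion types), values are treated as `f64`, and the estimates assume values are uniformly distributed between min and max and evenly spread over the distinct values.

```rust
/// Hypothetical column-level statistics (illustrative only, not a DataFusion type).
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    min: f64,
    max: f64,
    distinct_count: u64,
    null_count: u64,
    row_count: u64,
}

impl ColumnStats {
    /// Fraction of rows whose value is not null.
    fn non_null_fraction(&self) -> f64 {
        if self.row_count == 0 {
            return 0.0;
        }
        self.row_count.saturating_sub(self.null_count) as f64 / self.row_count as f64
    }

    /// Estimated selectivity of `col = value`.
    /// Assumption: non-null rows are spread evenly over the distinct values.
    fn eq_selectivity(&self, value: f64) -> f64 {
        if self.distinct_count == 0 || value < self.min || value > self.max {
            return 0.0;
        }
        self.non_null_fraction() / self.distinct_count as f64
    }

    /// Estimated selectivity of `col < value`.
    /// Assumption: values are uniformly distributed between min and max.
    fn lt_selectivity(&self, value: f64) -> f64 {
        if value <= self.min {
            return 0.0;
        }
        if value > self.max {
            return self.non_null_fraction();
        }
        // Here min < value <= max, so (max - min) is strictly positive.
        let fraction = (value - self.min) / (self.max - self.min);
        fraction * self.non_null_fraction()
    }
}
```

Multiplying a selectivity by the table's row count then gives an estimated output row count for the filter. Real data is rarely uniform, so this is only a rough first approximation.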

Also there is a formula for (inner) join cardinality:

`num(A IJ B) = num(A) * num(B) / max(distinct(A.k), distinct(B.k))`
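
As a rough sketch, the formula above could be written as a small helper. The function name and parameters are hypothetical (not an existing DataFusion API); `rows_a` and `rows_b` stand for `num(A)` and `num(B)`, and `distinct_a` and `distinct_b` for the distinct counts of the join key `k` on each side.

```rust
/// Estimated row count of an inner join of A and B on key `k`:
/// num(A) * num(B) / max(distinct(A.k), distinct(B.k)).
/// Hypothetical helper for illustration only.
fn inner_join_cardinality(rows_a: u64, rows_b: u64, distinct_a: u64, distinct_b: u64) -> u64 {
    let max_distinct = distinct_a.max(distinct_b);
    if max_distinct == 0 {
        // No distinct key values on either side, so no rows can match.
        return 0;
    }
    // Multiply in u128 so the product of two large row counts cannot overflow.
    let estimate = (rows_a as u128 * rows_b as u128) / max_distinct as u128;
    // Clamp to u64::MAX if the estimate does not fit.
    u64::try_from(estimate).unwrap_or(u64::MAX)
}
```

For example, `inner_join_cardinality(1000, 100, 50, 100)` evaluates 1000 * 100 / 100 and returns 1000. For nested joins, an optimizer could apply such an estimate to each candidate pair to compare join orders, although that also requires an assumption about the distinct counts of an intermediate join result, which this sketch does not cover.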
