Optimize nested joins #128

alamb · 2021-04-26T13:24:30Z

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-10964

Once [https://github.com/apache/arrow/pull/8961] is merged, we have an optimization for a JOIN that operates on two tables.

The next step is to extend this optimization to work with nested joins, and this is not trivial. See discussion in [https://github.com/apache/arrow/pull/8961] for context.

alamb · 2021-04-26T13:24:32Z

Comment from Daniël Heres(Dandandan) @ 2020-12-22T13:07:20.151+0000:

Found some nice material from Spark on this:
[https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html]

Basically, the idea is to use column-level statistics such as:
* min/max
* number of distinct values
* null count

to estimate, for example, the selectivity of a filter.
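As a rough illustration of how such statistics could feed a selectivity estimate, here is a minimal sketch for a `column < literal` filter. It assumes values are uniformly distributed between min and max, which is a simplification. `ColumnStats` and `less_than_selectivity` are hypothetical names made up for this example, not DataFusion or Spark APIs.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    """Hypothetical per-column statistics (not a real DataFusion type)."""
    row_count: int
    min_value: float
    max_value: float
    distinct_count: int
    null_count: int


def less_than_selectivity(stats: ColumnStats, literal: float) -> float:
    """Estimate the fraction of rows that pass `column < literal`.

    Assumes a uniform distribution of values in [min_value, max_value].
    NULL rows never pass the filter, so the result is scaled by the
    non-null fraction.
    """
    if stats.row_count == 0:
        return 0.0
    non_null_fraction = (stats.row_count - stats.null_count) / stats.row_count
    if literal <= stats.min_value:
        return 0.0  # nothing is below the minimum
    if literal > stats.max_value:
        return non_null_fraction  # every non-null row passes
    # Here min_value < literal <= max_value, so the span is positive.
    span = stats.max_value - stats.min_value
    return non_null_fraction * (literal - stats.min_value) / span


stats = ColumnStats(row_count=1000, min_value=0, max_value=100,
                    distinct_count=50, null_count=100)
print(less_than_selectivity(stats, 25))
```

Multiplying the returned fraction by the row count gives an estimated output size for the filter, which a planner could then pass on to later operators such as joins.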

There is also a formula for (inner) join cardinality, where IJ denotes an inner join on key k:

{{num(A IJ B) = num(A)*num(B)/max(distinct(A.k),distinct(B.k))}}
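The formula above can be sketched directly in code. This is only an illustration of the arithmetic; `estimate_inner_join_rows` and its parameter names are hypothetical, and the guard for a zero distinct count is an assumption added here to avoid dividing by zero.

```python
def estimate_inner_join_rows(left_rows: int, right_rows: int,
                             left_distinct: int, right_distinct: int) -> int:
    """Estimate num(A IJ B) = num(A) * num(B) / max(distinct(A.k), distinct(B.k)).

    left_rows / right_rows are the input row counts; left_distinct /
    right_distinct are the distinct counts of the join key on each side.
    """
    max_distinct = max(left_distinct, right_distinct)
    if max_distinct == 0:
        # Assumption: no distinct key values means no rows can match.
        return 0
    return (left_rows * right_rows) // max_distinct


# A has 1000 rows with 100 distinct keys, B has 200 rows with 50 distinct keys.
print(estimate_inner_join_rows(1000, 200, 100, 50))  # 1000 * 200 / 100 = 2000
```

For nested joins, an estimate like this could be computed for each candidate join pair and used to compare alternative join orders, though how the output statistics propagate to the next join is left open here.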

alamb added the datafusion (Changes in the datafusion crate) label Apr 26, 2021

isidentical mentioned this issue Oct 10, 2022

Join cardinality computation for cost-based nested join optimizations #3787

Merged
