You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Once [https://github.com/apache/arrow/pull/8961] is merged, we have an optimization for a JOIN that operates on two tables.
The next step is to extend this optimization to work with nested joins, and this is not trivial. See discussion in [https://github.com/apache/arrow/pull/8961] for context.
The text was updated successfully, but these errors were encountered:
Comment from Daniël Heres(Dandandan) @ 2020-12-22T13:07:20.151+0000:
Found some nice material from Spark on this:
[https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html]
basically the idea to use column level statistics such as:
* min/max
* nr of distinct values
* null count
to come up with e.g. selectivity of a filter.
Also there is a formula for (inner) join cardinality:
{{num(A IJ B) = num(A)*num(B)/max(distinct(A.k),distinct(B.k))}}
Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-10964
Once [https://github.com/apache/arrow/pull/8961] is merged, we have an optimization for a JOIN that operates on two tables.
The next step is to extend this optimization to work with nested joins, and this is not trivial. See discussion in [https://github.com/apache/arrow/pull/8961] for context.
The text was updated successfully, but these errors were encountered: