improve Filter pushdown to Join #5770

mingmwang · 2023-03-29T05:00:42Z

Which issue does this PR close?

Closes #.

Rationale for this change

Improve the performance of some TPC-H queries and simplify the generated logical and physical plans.

What changes are included in this PR?

Convert filters to join filters for Inner Join
Avoid duplicated filters
Fixed unstable physical HashJoin plan
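
The first two items can be illustrated with a minimal sketch. It is not the code in this PR and does not use DataFusion's types: `Side`, `Predicate`, `Pushdown`, `pred`, and `push_down_inner_join` are hypothetical names for a toy model. The assumption is that a filter above an Inner Join has already been split into conjuncts, and that each conjunct knows which join inputs its columns come from.

```rust
use std::collections::BTreeSet;

/// Which join input a column belongs to (toy model, not DataFusion's types).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Side {
    Left,
    Right,
}

/// A hypothetical conjunct: its text plus the join sides its columns reference.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Predicate {
    text: String,
    sides: BTreeSet<Side>,
}

/// Where each conjunct ends up after pushdown.
#[derive(Debug, Default, PartialEq, Eq)]
struct Pushdown {
    left: Vec<String>,
    right: Vec<String>,
    join_filter: Vec<String>,
}

/// Convenience constructor used in the usage example.
fn pred(text: &str, sides: &[Side]) -> Predicate {
    Predicate {
        text: text.to_string(),
        sides: sides.iter().copied().collect(),
    }
}

/// Distribute the conjuncts of a filter sitting above an Inner Join.
/// Single-sided conjuncts go to that input; everything else becomes a join filter.
/// Conjuncts with identical text are kept only once.
fn push_down_inner_join(conjuncts: &[Predicate]) -> Pushdown {
    let mut out = Pushdown::default();
    let mut seen = BTreeSet::new();
    for p in conjuncts {
        // Skip a conjunct whose text was already handled (textual dedup only).
        if !seen.insert(p.text.clone()) {
            continue;
        }
        let has_left = p.sides.contains(&Side::Left);
        let has_right = p.sides.contains(&Side::Right);
        match (has_left, has_right) {
            (true, false) => out.left.push(p.text.clone()),
            (false, true) => out.right.push(p.text.clone()),
            // References both inputs (or no columns at all): keep it on the join.
            _ => out.join_filter.push(p.text.clone()),
        }
    }
    out
}
```

Calling `push_down_inner_join` with a two-sided conjunct such as `pred("l.l_quantity < sq.value", &[Side::Left, Side::Right])` places it in `join_filter`, so it is evaluated while joining rather than in a separate Filter above the join. The deduplication here is purely textual, so it would not recognize `a = b` and `b = a` as the same predicate; a real rule would need a normalized comparison.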

tpch-q7, tpch-q17, tpch-q19, tpch-q20 are impacted by this PR.

Are these changes tested?

Are there any user-facing changes?

mingmwang · 2023-03-29T08:33:07Z

@yahoNanJing
Please help review this PR.

alamb

I reviewed the plan changes and the code carefully -- nice work @mingmwang

alamb · 2023-03-29T17:06:00Z

benchmarks/expected-plans/q17.txt

+|               |   Aggregate: groupBy=[[]], aggr=[[SUM(lineitem.l_extendedprice)]]                                                                                                                                                                                                                                                                                                                                                                                                      |
+|               |     Projection: lineitem.l_extendedprice                                                                                                                                                                                                                                                                                                                                                                                                                               |
+|               |       Inner Join: part.p_partkey = __scalar_sq_1.l_partkey Filter: CAST(lineitem.l_quantity AS Decimal128(30, 15)) < CAST(__scalar_sq_1.__value AS Decimal128(30, 15))                                                                                                                                                                                                                                                                                                 |


This is a better plan because the redundant `Filter: part.p_partkey = lineitem.l_partkey AND lineitem.l_partkey = part.p_partkey`, which is already done by the earlier joins, is removed, right?

It also pushes the filter

Filter: CAST(lineitem.l_quantity AS Decimal128(30, 15)) < CAST(__scalar_sq_1.__value AS Decimal128(30, 15)) AND __scalar_sq_1.l_partkey = lineitem.l_partkey

into the Join, which seems like a win to me (it avoids generating output).

> This is a better plan because the redundant `Filter: part.p_partkey = lineitem.l_partkey AND lineitem.l_partkey = part.p_partkey`, which is already done by the earlier joins, is removed, right?

Yes, the duplicated filters are removed. The original plan included duplicate filters because the push_down_filter rule infers additional filters and tries to push them down. If they cannot be pushed down, the inferred filters are added back to the Filter, which is unnecessary. The rule needs to distinguish the inferred filters from the original filters.
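
One way to picture the distinction between inferred and original filters is the sketch below. It is an illustration of the idea only, not the PR's implementation: `Origin`, `Tagged`, and `split_after_pushdown` are hypothetical names, and the `can_push` closure stands in for whatever check the optimizer rule actually performs.

```rust
/// Whether a predicate came from the query or was derived by the optimizer
/// (hypothetical tagging, not a DataFusion type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Origin {
    Original,
    Inferred,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Tagged {
    text: String,
    origin: Origin,
}

/// Returns `(pushed, kept)`: predicates pushed below the Filter, and
/// predicates that must stay in the Filter.
fn split_after_pushdown(
    preds: Vec<Tagged>,
    can_push: impl Fn(&str) -> bool,
) -> (Vec<String>, Vec<String>) {
    let mut pushed = Vec::new();
    let mut kept = Vec::new();
    for p in preds {
        if can_push(&p.text) {
            pushed.push(p.text);
        } else if p.origin == Origin::Original {
            // Only predicates written in the query have to survive in the Filter.
            kept.push(p.text);
        }
        // An inferred predicate that cannot be pushed is dropped here:
        // it is implied by the originals, so adding it back would duplicate work.
    }
    (pushed, kept)
}
```

With this tagging, an inferred predicate such as `lineitem.l_partkey = part.p_partkey` that cannot be pushed down simply disappears, while the original `part.p_partkey = lineitem.l_partkey` stays, so the remaining Filter no longer contains both.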

Nice, thanks for the explanation.

alamb · 2023-03-29T17:09:30Z

datafusion/core/src/physical_plan/planner.rs

                            let cols = expr.to_columns()?;

-                            // Collect left & right field indices
+                            // Collect left & right field indices, the field indices are sorted in ascending order


If the sort order is important for later stages, can you add a note about the rationale, so that the comment explains why the sorting matters in addition to noting that the output is sorted?

Sure, will do.
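
A small sketch of why sorted indices can matter. It assumes (this is not confirmed in the thread) that the instability came from iterating over an unordered set of columns, and it uses a hypothetical helper `sorted_field_indices` over a plain slice of field names rather than DataFusion's schema types.

```rust
use std::collections::{BTreeSet, HashSet};

/// Map a set of referenced column names to their positions in `schema`,
/// returned in ascending order. Names not present in the schema are skipped.
fn sorted_field_indices(columns: &HashSet<String>, schema: &[&str]) -> Vec<usize> {
    // A HashSet iterates in an unspecified order, so the indices are collected
    // into a BTreeSet, which always yields them in ascending order.
    let indices: BTreeSet<usize> = columns
        .iter()
        .filter_map(|c| schema.iter().position(|f| *f == c.as_str()))
        .collect();
    indices.into_iter().collect()
}
```

Because the result no longer depends on hash iteration order, two planning runs over the same expression produce the same index list, which in turn keeps the printed plan stable.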

andygrove · 2023-03-30T19:03:32Z

@mingmwang Could you share any performance numbers for the improvements for the affected queries?

mingmwang · 2023-03-31T12:02:27Z

> @mingmwang Could you share any performance numbers for the improvements for the affected queries?

Sure, will do. Unfortunately, the performance improvement is small. For q17, the major bottleneck is still the aggregation.

jackwener

Looks great to me. Thanks @mingmwang.

jackwener · 2023-03-31T16:20:48Z

This PR reminds me that I have also noticed some possible optimizations.

We could create an EPIC listing optimizer tasks to collect those optimizations, like #5546.

For example, predicate move-around ......

improve filter pushdown to join

a3056f6

github-actions bot added the core (Core DataFusion crate) and optimizer (Optimizer rules) labels Mar 29, 2023

alamb approved these changes Mar 29, 2023

Dandandan approved these changes Mar 31, 2023

jackwener approved these changes Mar 31, 2023

Dandandan merged commit 5bc0051 into apache:main Apr 1, 2023

Dandandan mentioned this pull request Oct 18, 2024

Support non-equijoin predicate for EliminateCrossJoin #4866 #4877

Closed
