-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add optimizer pass to reduce left
/right
/full
joins to inner
join if possible
#2750
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2750 +/- ##
==========================================
+ Coverage 85.11% 85.20% +0.08%
==========================================
Files 273 274 +1
Lines 48240 48543 +303
==========================================
+ Hits 41060 41361 +301
- Misses 7180 7182 +2
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read the code and test and they look good to me. Thanks @AssHero
Comment some style things and a question.
datafusion/sql/src/planner.rs
Outdated
/// Recursively traversese expr, if expr returns false when | ||
/// any inputs are null, treats columns of both sides as nonnullable columns. | ||
/// | ||
/// For and/or expr, extracts from all sub exprs and merges the columns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess here is
/// For and/or expr, extracts from all sub exprs and merges the columns. | |
/// For and expr, extracts from all sub exprs and merges the columns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess here is
For And and Or exprs, we need to extract columns from all sub exprs.
datafusion/sql/src/planner.rs
Outdated
fn reduce_outer_join( | ||
plan: &LogicalPlan, | ||
nonnullable_cols: &Vec<Column>, | ||
) -> Result<Option<LogicalPlan>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about change this function to modify plan in place like
fn reduce_outer_join(plan: mut LogicalPlan, ...) -> Result<LogicalPlan>;
This might simplify the match
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about change this function to modify plan in place like
fn reduce_outer_join(plan: mut LogicalPlan, ...) -> Result<LogicalPlan>;This might simplify the match
inplace update is better, and I'll check it later.
datafusion/sql/src/planner.rs
Outdated
@@ -784,6 +784,9 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> { | |||
|
|||
let filter_expr = self.sql_to_rex(predicate_expr, &join_schema, ctes)?; | |||
|
|||
// reduce outer joins | |||
let plans = reduce_outer_joins(plans, &filter_expr)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that whether this should be done here, or somewhere else like in optimize phase. Other parts are great ❤️ cc @alamb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that whether this should be done here, or somewhere else like in optimize phase. Other parts are great ❤️ cc @alamb
Here we have joins and quals from where and these make reduce possible. Please give me your suggestions, thanks!
datafusion/sql/src/planner.rs
Outdated
extract_nonnullable_columns(right, columns, false) | ||
} | ||
Operator::And => { | ||
if !top_level { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will something like Not(And(a > 10, b < 10))
hits this check and gets ignored?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will something like
Not(And(a > 10, b < 10))
hits this check and gets ignored?
YES. If not from top_level, we can handle And expr as Or expr, this is better.
@AssHero Since this is a significant feature could you file an issue for it so that it gets included in the change logs. |
datafusion/core/tests/sql/joins.rs
Outdated
let results = execute_to_batches(&ctx, sql).await; | ||
assert_batches_sorted_eq!(expected, &results); | ||
|
||
// could not reduce, use left join |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we split the tests out into separate test methods?
datafusion/sql/src/planner.rs
Outdated
@@ -784,6 +784,9 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> { | |||
|
|||
let filter_expr = self.sql_to_rex(predicate_expr, &join_schema, ctes)?; | |||
|
|||
// reduce outer joins | |||
let plans = reduce_outer_joins(plans, &filter_expr)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it should be implemented as an optimization rule rather than in the SQL query planner and this would mean that we would get the same optimizations regardless of whether the query was created from SQL or from the DataFrame API.
yes, I'll file an issue. |
Split the test cases into single test method. Implementing this as optimization rule is in progress. |
Thank you -- @AssHero I will put this on my list to review carefully tomorrow. This optimization in general is quite tricky so I want to make sure I give it my full attention |
Implementing reduce outer join as optimization rule. More improving is in progress. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @AssHero -- this is looking cool. I haven't had a chance to follow it all yet but I have some suggestions.
👍
/// For IS NOT NULL/NOT expr, always returns false for NULL input. | ||
/// extracts columns from these exprs. | ||
/// For all other exprs, fall through | ||
fn extract_nonnullable_columns( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about using Expr::nullable
instead? That seems very similar / the same as what you are doing with this function (trying to determine if an expression can be nullable / non nullable)?
If Expr::nullable
doesn't work correctly I think we should update that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The extract_nonnullable_columns gets all columns of these exprs which returns false if inputs are null, and we use these columns to see if they appear in null rows of output for outer join. If so, this means these null rows does not meet the conditions, we can filter null rows, reduce outer to inner. So I think the purpose of extract_nonnullable_columns is not the same as Expr::nullable.
@@ -1229,6 +1230,7 @@ impl SessionState { | |||
if config.config_options.get_bool(OPT_FILTER_NULL_JOIN_KEYS) { | |||
rules.push(Arc::new(FilterNullJoinKeys::default())); | |||
} | |||
rules.push(Arc::new(ReduceOuterJoin::new())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this pass should probably be applied after FilterPushdown, so that as many predicates as possible are available
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After FilterPushdown, the filters like (c1 < 100) are distributed to tables. At this time, we need to check from the bottom to up, and this need more infos to reduce outer join.
For query: select * from (select * from tt1 left join tt2 on c1 = c3) a right join tt3 on a.c2 = tt3.c5 where a.c4 < 100;
after FilterPushdown, the logical plan is
| Projection: #a.c1, #a.c2, #a.c3, #a.c4, #tt3.c5, #tt3.c6
| -Inner Join: #a.c2 = #tt3.c5
| ---Projection: #a.c1, #a.c2, #a.c3, #a.c4, alias=a
| ----Projection: #tt1.c1, #tt1.c2, #tt2.c3, #tt2.c4, alias=a
| -----Inner Join: #tt1.c1 = #tt2.c3
| ------TableScan: tt1 projection=Some([c1, c2])
| ------Filter: #tt2.c4 < Int64(100)
| -------TableScan: tt2 projection=Some([c3, c4])
| ---TableScan: tt3 projection=Some([c5, c6])
we need to walk to the tablescan, and get the filter, then back to the parent plan to reduce the outer joins, I think this is more complex and complicated to implement.
Please let me know if I miss something?
{ | ||
let mut left_nonnullable = false; | ||
let mut right_nonnullable = false; | ||
for col in nonnullable_cols.iter_mut() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why iter_mut
and not iter
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me fix this! Thanks!
let left_plan = reduce_outer_join( | ||
_optimizer, | ||
&join.left, | ||
nonnullable_cols, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it is a problem to use the same nonnullable_cols
here as the call to reduce_outer_join
could add new columns if a fitler is encountered somewhere down the left side (e.g. in a subquery). Maybe a clone()
should be passed here (or the length remembered and truncated prior to calling reduce_outer_join
on the right side)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recognize this problem after pushing the code, use clone() here.
null_equals_null: join.null_equals_null, | ||
})) | ||
} | ||
LogicalPlan::Projection(Projection { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Sorts can also rewrite column names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for such query
select * from (select * from tt1 left join tt2 on c1 = c3) a right join tt3 on a.c2 = tt3.c5 where a.c4 < 100;
The a.c4 can be used to reduce left join to inner join, but we need to know that a.c4 is corresponding with tt2.c4.
I need to think about the case with sort.
assert_batches_sorted_eq!(expected, &results); | ||
|
||
Ok(()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some negative tests might be good -- like that
let sql = "select * from t1 left join t2 on t1.t1_id = t2.t2_id where t2.t2_id IS NULL";
isn't rewritten to an inner join
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add more test cases in test part of this rule, including this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you again @AssHero -- I think the test coverage in this PR is quite 👍 and we can iterate it on master as we go forward.
left
/right
/full
joins to inner
join
left
/right
/full
joins to inner
joinleft
/right
/full
joins to inner
join if possible
Which issue does this PR close?
close #2757
Rationale for this change
try to reduce left/right/full join to inner join
for query: select ... from a left join b on ... where b.xx = 100;
if b.xx is null, and b.xx = 100 returns false, filterd those null rows.
Therefore, there is no need to produce null rows for output, we can use
inner join instead of left join.
Generally, right join/full join can also be reduced to inner join according to these rules.
What changes are included in this PR?
add reduce_outer_plan to try to reduce left/right/full join to inner join in src/planner.rs