Basic support for IN and NOT IN Subqueries by rewriting them to SEMI / ANTI Join #2421

Merged
merged 5 commits into from
May 4, 2022

Contributor

@korowa korowa commented May 2, 2022

Which issue does this PR close?

Partially #488.

Rationale for this change

A naive implementation of an optimizer rule that replaces InSubquery with a join, which allows queries with IN (subquery) to execute when the WHERE condition has a suitable form.

What changes are included in this PR?

The SubqueryFilterToJoin rule replaces the Filter input in the logical plan with a Semi/Anti Join when InSubquery is part of a logical conjunction. This precondition allows the IN predicate to be pushed down ahead of the other predicates in the Filter.

Cases where IN (subquery) cannot be pushed into the Filter's input, because its result is required to evaluate the predicate, are handled by returning a NotImplemented error for now.
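
As a rough illustration of the idea, here is a minimal sketch that uses toy types rather than DataFusion's real `Expr` and `LogicalPlan`. Every name in it (`Expr`, `Plan`, `rewrite_filter`, `contains_subquery`) is hypothetical and only mirrors the description above: top-level conjuncts of the form `col [NOT] IN (subquery)` become Semi or Anti joins under the filter, and anything else that involves a subquery is reported as unsupported. The sketch signals the unsupported case with `None` and leaves the choice between an error and keeping the original plan to the caller.

```rust
// Toy model only: these are NOT DataFusion's types, all names are assumptions.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
    // `expr [NOT] IN (SELECT column FROM table)`
    InSubquery { expr: Box<Expr>, table: String, column: String, negated: bool },
    // Any other predicate, kept as text for brevity.
    Other(String),
}

#[derive(Debug, Clone, PartialEq)]
enum JoinType { Semi, Anti }

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(String),
    Filter { predicate: Expr, input: Box<Plan> },
    Join { left: Box<Plan>, right: Box<Plan>, on: (String, String), join_type: JoinType },
}

/// Flattens `A AND B AND C` into `[A, B, C]`.
fn split_conjunction<'a>(expr: &'a Expr, out: &mut Vec<&'a Expr>) {
    match expr {
        Expr::And(l, r) => {
            split_conjunction(l, out);
            split_conjunction(r, out);
        }
        other => out.push(other),
    }
}

/// True if an IN (subquery) appears anywhere inside `expr`.
fn contains_subquery(expr: &Expr) -> bool {
    match expr {
        Expr::InSubquery { .. } => true,
        Expr::And(l, r) | Expr::Or(l, r) => contains_subquery(l) || contains_subquery(r),
        _ => false,
    }
}

/// Rewrites `Filter(predicate, input)`; returns `None` when the predicate
/// has a shape this sketch does not support.
fn rewrite_filter(predicate: &Expr, input: &Plan) -> Option<Plan> {
    let mut conjuncts = Vec::new();
    split_conjunction(predicate, &mut conjuncts);

    let mut new_input = input.clone();
    let mut remaining: Vec<Expr> = Vec::new();
    for conjunct in conjuncts {
        match conjunct {
            Expr::InSubquery { expr, table, column, negated } => {
                // Only a plain column on the left-hand side is supported.
                let left_key = match &**expr {
                    Expr::Column(name) => name.clone(),
                    _ => return None,
                };
                let join_type = if *negated { JoinType::Anti } else { JoinType::Semi };
                new_input = Plan::Join {
                    left: Box::new(new_input),
                    right: Box::new(Plan::Scan(table.clone())),
                    on: (left_key, format!("{table}.{column}")),
                    join_type,
                };
            }
            // A subquery nested under OR (or similar) cannot be pushed down.
            other if contains_subquery(other) => return None,
            other => remaining.push(other.clone()),
        }
    }

    // Keep the remaining predicates in a Filter above the new joins.
    let rest = remaining
        .into_iter()
        .reduce(|a, b| Expr::And(Box::new(a), Box::new(b)));
    Some(match rest {
        Some(predicate) => Plan::Filter { predicate, input: Box::new(new_input) },
        None => new_input,
    })
}
```

With these toy types, `test.c IN (SELECT c FROM sq) AND test.a > 1` turns into a filter on `test.a > 1` above a Semi join, while `test.c IN (SELECT c FROM sq) OR test.a > 1` yields `None`.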

Are there any user-facing changes?

Queries with IN (subquery) predicates now execute for the filter combinations described above.

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label May 2, 2022
@andygrove andygrove requested a review from Dandandan May 2, 2022 20:33
Member

@andygrove andygrove left a comment

Thanks @korowa this is looking good. This actually adds support for benchmark queries 16 and 18, which previously were not supported.

If you want to, you could enable these queries in the benchmark test at benchmarks/src/bin/tpch.rs by adding these tests. This could also be done as a separate PR.

#[tokio::test]
async fn run_q16() -> Result<()> {
    run_query(16).await
}

#[tokio::test]
async fn run_q18() -> Result<()> {
    run_query(18).await
}

@andygrove andygrove requested a review from alamb May 2, 2022 21:13
})?;

if !subqueries_in_regular.is_empty() {
return Err(DataFusionError::NotImplemented(
Member

Can we just revert to the original query here rather than fail?

Contributor Author

All fixed. My idea was to give the user some explanation of why the query is "incorrect" by throwing errors.

Now I see that if DF is able to produce this kind of logical plan, then it's valid (at least for some purposes maybe), even if we don't have a physical implementation for some of its parts yet.

let right_key = right_schema.field(0).qualified_column();
let left_key = match *expr.clone() {
Expr::Column(col) => col,
_ => return Err(DataFusionError::NotImplemented(
Member

Same comment here. Can we just abort the optimization attempt rather than fail?

Contributor Author

Also fixed

Ok(())
}
_ => Err(DataFusionError::Plan(
"Unknown expression while rewriting subquery to joins"
Member

Same comment here.

Contributor Author

Also fixed

@korowa korowa force-pushed the naive_in_subquery branch from 29ad278 to 4f15cd1 Compare May 3, 2022 05:55
)),
};

let join_type = match negated {
Contributor

can we use if/else here?

Contributor Author

Sure, done

};

let schema = build_join_schema(
new_input.schema(),
Contributor

Better to use optimized_input.schema() and avoid creating a mut

Contributor Author

Switched to try_fold, no mutable variables required now

null_equals_null: false,
});

Ok(())
Contributor

and return result here to use below

Contributor Author

Now folding
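
For readers unfamiliar with the pattern, here is a small sketch of how `Iterator::try_fold` from the Rust standard library can thread a plan through a series of fallible rewrites without a mutable accumulator. The `Plan` type and the `add_join` and `join_all` helpers are hypothetical stand-ins, not the code from this PR.

```rust
// Toy plan type, assumed for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(String),
    SemiJoin { left: Box<Plan>, right: Box<Plan> },
}

/// Hypothetical fallible step: wraps `input` in a semi join against `table`.
fn add_join(input: Plan, table: &str) -> Result<Plan, String> {
    if table.is_empty() {
        return Err("empty subquery table name".to_string());
    }
    Ok(Plan::SemiJoin {
        left: Box::new(input),
        right: Box::new(Plan::Scan(table.to_string())),
    })
}

/// Applies `add_join` once per table. `try_fold` passes the accumulated plan
/// from step to step and stops at the first `Err`, so no `let mut` is needed.
fn join_all(input: Plan, tables: &[&str]) -> Result<Plan, String> {
    tables.iter().try_fold(input, |acc, table| add_join(acc, table))
}
```

Folding over `["sq1", "sq2"]` nests the joins, with the join against `sq1` ending up as the left input of the join against `sq2`.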

@korowa korowa force-pushed the naive_in_subquery branch from 70b833d to 4b13c2f Compare May 3, 2022 14:32
@alamb alamb changed the title from "naive InSubquery implementation" to "Basic support for IN and NOT IN Subqueries by rewriting them to SEMI / ANTI" May 4, 2022
Contributor

@alamb alamb left a comment

Thank you @korowa. Really nice work. 🏅

This is very cool -- both good code and well tested.

@@ -1074,6 +1074,16 @@ mod tests {
run_query(14).await
}

#[tokio::test]
Contributor

🎉

@@ -305,7 +267,7 @@ fn optimize(plan: &LogicalPlan, mut state: State) -> Result<LogicalPlan> {
LogicalPlan::Analyze { .. } => push_down(&state, plan),
LogicalPlan::Filter(Filter { input, predicate }) => {
let mut predicates = vec![];
- split_members(predicate, &mut predicates);
+ utils::split_conjunction(predicate, &mut predicates);
Contributor

👍

@@ -556,6 +556,41 @@ pub fn rewrite_expression(expr: &Expr, expressions: &[Expr]) -> Result<Expr> {
}
}

/// converts "A AND B AND C" => [A, B, C]
Contributor

👍 for moving these functions into utils

)),
};

let join_type = if *negated {
Contributor

👍

.build()?;

let expected = "Projection: #test.b [b:UInt32]\
\n Anti Join: #test.c = #test.c [a:UInt32, b:UInt32, c:UInt32]\
Contributor

I can't tell from this plan if the predicate is correct (because #test is used as the relation name in both the inner and outer query).

It might make these tests more readable if the relation name in the subquery was something different (like test_sq) so that this join predicate appears as #test.c = #test_sq.c

Contributor Author

Done - table names in the tests are now different, so it should be much easier to read

.build()?;

let expected = "Projection: #test.b [b:UInt32]\
\n Semi Join: #test.b = #test.a [a:UInt32, b:UInt32, c:UInt32]\
Contributor

👍

assert_optimized_plan_eq(&plan, expected);
Ok(())
}
}
Contributor

I think it would also be helpful to add coverage of the negative cases (i.e. cases that can't be rewritten, like x IN (select ...) OR y = 5).

Contributor Author

@korowa korowa May 4, 2022

Added cases for an unsupported filter expression and for a filter's input being rewritten while the filter itself remains untouched (this checks that falling back to the original query doesn't affect the results of its recursive calls)
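
The fallback behavior described in this thread can be sketched as follows. This is a toy model under loud assumptions: predicates are plain strings, a predicate of the form `"IN <table>"` stands for a supported IN (subquery), and `Plan` and `optimize` are hypothetical names rather than the PR's code. The point is only the control flow: the input is always optimized first, and an unsupported filter is rebuilt on top of the optimized input instead of producing an error.

```rust
// Toy plan type with string predicates, assumed for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(String),
    Filter { predicate: String, input: Box<Plan> },
    SemiJoin { left: Box<Plan>, right: Box<Plan> },
}

/// Recursively optimizes `plan`, never failing on unsupported filters.
fn optimize(plan: &Plan) -> Plan {
    match plan {
        Plan::Filter { predicate, input } => {
            // Optimize the input first so nested filters still get rewritten.
            let optimized_input = optimize(input);
            // Hypothetical convention: "IN <table>" is the supported shape.
            match predicate.strip_prefix("IN ") {
                Some(table) => Plan::SemiJoin {
                    left: Box::new(optimized_input),
                    right: Box::new(Plan::Scan(table.to_string())),
                },
                // Fallback: keep the filter as written, but on the optimized input.
                None => Plan::Filter {
                    predicate: predicate.clone(),
                    input: Box::new(optimized_input),
                },
            }
        }
        Plan::SemiJoin { left, right } => Plan::SemiJoin {
            left: Box::new(optimize(left)),
            right: Box::new(optimize(right)),
        },
        Plan::Scan(_) => plan.clone(),
    }
}
```

In this model, an outer filter with an unsupported predicate stays in place while an inner `"IN sq"` filter beneath it is still converted to a semi join.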

@korowa korowa force-pushed the naive_in_subquery branch from f6cae49 to e6065b4 Compare May 4, 2022 06:23
@korowa korowa force-pushed the naive_in_subquery branch from e6065b4 to 271695a Compare May 4, 2022 06:26
Contributor

@alamb alamb left a comment

This is really nice work @korowa -- thank you so much.

.project(vec![col("test.b")])?
.build()?;

let expected = "Projection: #test.b [b:UInt32]\
\n Semi Join: #test.c = #test.c [a:UInt32, b:UInt32, c:UInt32]\
\n Semi Join: #test.c = #sq.c [a:UInt32, b:UInt32, c:UInt32]\
Contributor

much easier to read -- thank you

}

/// Test for filter input modification in case filter not supported
/// Outer filter expression not modified while inner converted to join
Contributor

👍

@alamb alamb merged commit e8ba45c into apache:master May 4, 2022
Contributor Author

korowa commented May 4, 2022

Thanks for the reviews! I hope I'll follow up with PR(s?) for currently unsupported cases soon.

This was referenced May 4, 2022
@korowa korowa deleted the naive_in_subquery branch May 5, 2022 09:53
@alamb alamb changed the title from "Basic support for IN and NOT IN Subqueries by rewriting them to SEMI / ANTI" to "Basic support for IN and NOT IN Subqueries by rewriting them to SEMI / ANTI Join" May 9, 2022