-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dataframe join_on method #5210
Dataframe join_on method #5210
Conversation
let filter = if let Some(expr) = filter { | ||
// ambiguous check | ||
ensure_any_column_reference_is_unambiguous( | ||
&expr, | ||
&[self.schema(), right.schema()], | ||
)?; | ||
|
||
// normalize all columns in expression | ||
let using_columns = expr.to_columns()?; | ||
let filter = normalize_col_with_schemas( | ||
expr, | ||
&[self.schema(), right.schema()], | ||
&[using_columns], | ||
)?; | ||
Some(filter) | ||
} else { | ||
None | ||
}; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
related to #4196
fix bug where you could do dataframe join with ambiguous column for the filter expr
instead of having the check done in both DataFrame join api and SQL planner join mod, unify by having check done inside the logical plan builder
this is technically an unrelated fix to the actual issue, so i can extract into separate issue if needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is fine to include in this PR as long as it also has a test (for ambiguity check using the DataFrame API)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Jefffrey -- the code looks great and I have just a few small comments on tests
/// # Ok(()) | ||
/// # } | ||
/// ``` | ||
pub fn join_on( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 LGTM
datafusion/core/src/dataframe.rs
Outdated
JoinType::Inner, | ||
[ | ||
col("a.c1").not_eq(col("b.c1")), | ||
col("a.c2").not_eq(col("b.c2")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible here to also add an equality predicate to demonstrate they are automatically recognized as equi preds?
Perhaps something like
col("a.c2").not_eq(col("b.c2")), | |
col("a.c2").eq(col("b.c2")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done as you suggested. it seems they still are considered as part of the filter, though this seems to track with the explicit SQL version too:
edit: nvm there's the extract_equijoin_predicate
logical optimization which extracts it into an equijoin predicate indeed
let filter = if let Some(expr) = filter { | ||
// ambiguous check | ||
ensure_any_column_reference_is_unambiguous( | ||
&expr, | ||
&[self.schema(), right.schema()], | ||
)?; | ||
|
||
// normalize all columns in expression | ||
let using_columns = expr.to_columns()?; | ||
let filter = normalize_col_with_schemas( | ||
expr, | ||
&[self.schema(), right.schema()], | ||
&[using_columns], | ||
)?; | ||
Some(filter) | ||
} else { | ||
None | ||
}; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is fine to include in this PR as long as it also has a test (for ambiguity check using the DataFrame API)
I want to take a look this PR tomorrow. @alamb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM -- thanks @Jefffrey -- let's wait for @liukun4515 to review before merging
@@ -68,6 +68,7 @@ execution. The plan is evaluated (executed) when an action method is invoked, su | |||
| filter | Filter a DataFrame to only include rows that match the specified filter expression. | | |||
| intersect | Calculate the intersection of two DataFrames. The two DataFrames must have exactly the same schema | | |||
| join | Join this DataFrame with another DataFrame using the specified columns as join keys. | | |||
| join_on | Join this DataFrame with another DataFrame using arbitrary expressions. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Benchmark runs are scheduled for baseline = dee9fd7 and contender = 1b03a7a. 1b03a7a is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #1254
Rationale for this change
What changes are included in this PR?
New method for DataFrame
join_on
allowing user to pass in arbitraryExpr
's which are AND'ed together to form theON
condition.Also fix to DataFrame join to enforce ambiguity check, like how was done by SQL planner
Are these changes tested?
New unit test
Are there any user-facing changes?
New method in DataFrame, doc updated