Dataframe join_on method #5210

Merged
merged 3 commits into from
Feb 9, 2023

Conversation

Contributor

@Jefffrey Jefffrey commented Feb 7, 2023

Which issue does this PR close?

Closes #1254

Rationale for this change

What changes are included in this PR?

Adds a new DataFrame method, join_on, which lets the user pass in arbitrary Exprs that are ANDed together to form the ON condition.

Also fixes DataFrame join to enforce the ambiguous-column check, as the SQL planner already does.
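
As a rough illustration of how a list of expressions can be ANDed into a single ON condition, here is a minimal, self-contained sketch. It does not use DataFusion: the `Expr` enum and the `col`, `equal`, `not_equal`, and `conjunction` helpers are hypothetical stand-ins invented for this example, and the real implementation may differ.

```rust
// Toy model only: this `Expr` is a hypothetical stand-in, not DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Eq(Box<Expr>, Box<Expr>),
    NotEq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

// Hypothetical helper: build a column reference from a name such as "a.c1".
fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

// Hypothetical helper: build `left = right`.
fn equal(left: Expr, right: Expr) -> Expr {
    Expr::Eq(Box::new(left), Box::new(right))
}

// Hypothetical helper: build `left != right`.
fn not_equal(left: Expr, right: Expr) -> Expr {
    Expr::NotEq(Box::new(left), Box::new(right))
}

// Fold the predicates left to right with AND.
// Returns None when no predicates are given (no ON condition at all).
fn conjunction(exprs: impl IntoIterator<Item = Expr>) -> Option<Expr> {
    exprs
        .into_iter()
        .reduce(|acc, e| Expr::And(Box::new(acc), Box::new(e)))
}
```

Under these assumptions, `conjunction(vec![not_equal(col("a.c1"), col("b.c1")), equal(col("a.c2"), col("b.c2"))])` yields a single `And` node wrapping both predicates, and an empty list yields `None`.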

Are these changes tested?

New unit test

Are there any user-facing changes?

New method in DataFrame, doc updated

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner labels Feb 7, 2023
Comment on lines +508 to +526
let filter = if let Some(expr) = filter {
    // ambiguous check
    ensure_any_column_reference_is_unambiguous(
        &expr,
        &[self.schema(), right.schema()],
    )?;

    // normalize all columns in expression
    let using_columns = expr.to_columns()?;
    let filter = normalize_col_with_schemas(
        expr,
        &[self.schema(), right.schema()],
        &[using_columns],
    )?;
    Some(filter)
} else {
    None
};

Contributor Author

related to #4196

Fixes a bug where a DataFrame join could use an ambiguous column in the filter expr.

Instead of doing the check in both the DataFrame join API and the SQL planner's join module, this unifies it by doing the check inside the logical plan builder.

This fix is technically unrelated to the actual issue, so I can extract it into a separate issue if needed.
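
To make the idea of an ambiguity check concrete, here is a minimal sketch under stated assumptions. The `Field` and `ColumnRef` types and the `ensure_unambiguous` function are hypothetical and written only for this illustration; the actual rules applied by ensure_any_column_reference_is_unambiguous may differ. The assumed rule is simply that a reference is ambiguous when it matches more than one field across the left and right schemas.

```rust
// Hypothetical field: a column name plus the relation (table alias) that owns it.
struct Field {
    relation: &'static str,
    name: &'static str,
}

// Hypothetical column reference; `relation` is None when a bare name was written.
struct ColumnRef {
    relation: Option<&'static str>,
    name: &'static str,
}

// Assumed rule for this sketch: a reference is ambiguous if it matches
// more than one field across all of the given schemas.
fn ensure_unambiguous(column: &ColumnRef, schemas: &[&[Field]]) -> Result<(), String> {
    let matches = schemas
        .iter()
        .flat_map(|schema| schema.iter())
        .filter(|f| f.name == column.name)
        // A qualified reference only matches fields from the same relation.
        .filter(|f| column.relation.map_or(true, |r| r == f.relation))
        .count();

    if matches > 1 {
        Err(format!("reference '{}' is ambiguous", column.name))
    } else {
        Ok(())
    }
}
```

With a left schema of `a.c1, a.c2` and a right schema of `b.c1, b.c3`, the bare name `c1` is rejected in this sketch, while `a.c1` and `c3` are accepted.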

Contributor

I think it is fine to include in this PR as long as it also has a test (for ambiguity check using the DataFrame API)

Contributor Author

test added

Contributor

@alamb alamb left a comment

Thanks @Jefffrey -- the code looks great and I have just a few small comments on tests

/// # Ok(())
/// # }
/// ```
pub fn join_on(
Contributor

👍 LGTM

JoinType::Inner,
[
col("a.c1").not_eq(col("b.c1")),
col("a.c2").not_eq(col("b.c2")),
Contributor

Would it be possible to also add an equality predicate here, to demonstrate that they are automatically recognized as equi-join predicates?

Perhaps something like

Suggested change
- col("a.c2").not_eq(col("b.c2")),
+ col("a.c2").eq(col("b.c2")),

Contributor Author

@Jefffrey Jefffrey Feb 8, 2023

Done as you suggested. It seems they are still treated as part of the filter, though this matches the explicit SQL version too:

https://github.com/apache/arrow-datafusion/blob/f0c67193a3d18ff1d94f9dd55bfb1715e5473bf1/datafusion/sql/tests/integration_test.rs#L1661-L1672

Edit: never mind, the extract_equijoin_predicate logical optimization does indeed extract it into an equijoin predicate.
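
For readers unfamiliar with this kind of rewrite, the following is a simplified sketch of splitting a join filter into equi-join key pairs and a residual filter. It is not the extract_equijoin_predicate rule itself: the `Expr` enum, `split_conjunction`, and `extract_equijoin_keys` are hypothetical names for this example, and the sketch assumes any column = column comparison qualifies, without checking that the two sides come from opposite join inputs.

```rust
// Toy model only: this `Expr` is a hypothetical stand-in, not DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Eq(Box<Expr>, Box<Expr>),
    NotEq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

// Flatten nested ANDs into a list of individual conjuncts, left to right.
fn split_conjunction(expr: Expr, out: &mut Vec<Expr>) {
    match expr {
        Expr::And(left, right) => {
            split_conjunction(*left, out);
            split_conjunction(*right, out);
        }
        other => out.push(other),
    }
}

// Separate `column = column` conjuncts (join keys) from everything else
// (the residual filter). Simplification: no check of which input each
// column belongs to.
fn extract_equijoin_keys(filter: Expr) -> (Vec<(String, String)>, Vec<Expr>) {
    let mut conjuncts = Vec::new();
    split_conjunction(filter, &mut conjuncts);

    let mut keys = Vec::new();
    let mut residual = Vec::new();
    for conjunct in conjuncts {
        match conjunct {
            Expr::Eq(left, right) => match (*left, *right) {
                (Expr::Column(l), Expr::Column(r)) => keys.push((l, r)),
                // Equality over non-column operands stays in the filter.
                (l, r) => residual.push(Expr::Eq(Box::new(l), Box::new(r))),
            },
            other => residual.push(other),
        }
    }
    (keys, residual)
}
```

In this sketch, a filter of `a.c1 != b.c1 AND a.c2 = b.c2` yields the key pair `(a.c2, b.c2)` and leaves `a.c1 != b.c1` as the residual filter.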

@liukun4515
Contributor

I want to take a look at this PR tomorrow. @alamb

@liukun4515 liukun4515 self-requested a review February 8, 2023 13:11
Contributor

@alamb alamb left a comment

LGTM -- thanks @Jefffrey -- let's wait for @liukun4515 to review before merging

@@ -68,6 +68,7 @@ execution. The plan is evaluated (executed) when an action method is invoked, su
| filter | Filter a DataFrame to only include rows that match the specified filter expression. |
| intersect | Calculate the intersection of two DataFrames. The two DataFrames must have exactly the same schema |
| join | Join this DataFrame with another DataFrame using the specified columns as join keys. |
| join_on | Join this DataFrame with another DataFrame using arbitrary expressions. |
Contributor

❤️

Contributor

@liukun4515 liukun4515 left a comment

LGTM

@liukun4515 liukun4515 merged commit 1b03a7a into apache:master Feb 9, 2023
@ursabot

ursabot commented Feb 9, 2023

Benchmark runs are scheduled for baseline = dee9fd7 and contender = 1b03a7a. 1b03a7a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@Jefffrey Jefffrey deleted the dataframe_join_on branch February 9, 2023 11:23
Labels
core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support non-equi join (e.g. ON clause) in Dataframe API
4 participants