Dataframe join_on method #5210

Jefffrey · 2023-02-07T11:08:00Z

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

New DataFrame method join_on, which lets the user pass in arbitrary Exprs that are ANDed together to form the ON condition.
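To illustrate what "ANDed together" means here, below is a minimal std-only sketch. It does not use DataFusion: the Expr enum, col, and and_all are hypothetical stand-ins invented for this illustration, and the real join_on may build its condition differently.

```rust
/// Toy expression type: a hypothetical stand-in for DataFusion's Expr.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Eq(Box<Expr>, Box<Expr>),
    NotEq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Convenience constructor for a column reference (hypothetical helper).
fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

/// Fold a list of predicates, left to right, into one AND chain.
/// Returns None when the list is empty, i.e. there is no ON condition.
fn and_all(exprs: impl IntoIterator<Item = Expr>) -> Option<Expr> {
    exprs
        .into_iter()
        .reduce(|acc, e| Expr::And(Box::new(acc), Box::new(e)))
}
```

With two predicates p1 and p2, and_all(vec![p1, p2]) yields And(p1, p2); a single predicate is returned unchanged.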

Also fixes DataFrame join to enforce the ambiguity check, as the SQL planner already does.

Are these changes tested?

New unit test

Are there any user-facing changes?

New method in DataFrame, doc updated

Jefffrey · 2023-02-07T11:09:58Z

datafusion/expr/src/logical_plan/builder.rs

+        let filter = if let Some(expr) = filter {
+            // ambiguous check
+            ensure_any_column_reference_is_unambiguous(
+                &expr,
+                &[self.schema(), right.schema()],
+            )?;
+
+            // normalize all columns in expression
+            let using_columns = expr.to_columns()?;
+            let filter = normalize_col_with_schemas(
+                expr,
+                &[self.schema(), right.schema()],
+                &[using_columns],
+            )?;
+            Some(filter)
+        } else {
+            None
+        };
+


related to #4196

Fixes a bug where a DataFrame join accepted an ambiguous column in the filter expr.

Instead of performing the check in both the DataFrame join API and the SQL planner join mod, this unifies it by doing the check inside the logical plan builder.

This is technically unrelated to the actual issue, so I can extract it into a separate issue if needed.
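For readers unfamiliar with the ambiguity check, here is a simplified std-only model of the idea: an unqualified column name that exists in more than one input schema cannot be resolved. The Schema struct, check_unambiguous, and the error text are hypothetical and only approximate what ensure_any_column_reference_is_unambiguous does.

```rust
/// Hypothetical schema model: a table qualifier plus its column names.
struct Schema {
    qualifier: &'static str,
    columns: Vec<&'static str>,
}

/// Returns an error when an unqualified column name is found in more
/// than one of the given schemas.
fn check_unambiguous(column: &str, schemas: &[Schema]) -> Result<(), String> {
    // A qualified reference such as "a.c1" names its table explicitly,
    // so this simplified model treats it as unambiguous.
    if column.contains('.') {
        return Ok(());
    }
    // Collect every "qualifier.column" that the bare name could refer to.
    let candidates: Vec<String> = schemas
        .iter()
        .filter(|s| s.columns.iter().any(|c| *c == column))
        .map(|s| format!("{}.{}", s.qualifier, column))
        .collect();
    if candidates.len() > 1 {
        Err(format!(
            "reference '{}' is ambiguous, could be {}",
            column,
            candidates.join(", ")
        ))
    } else {
        Ok(())
    }
}
```

In this model, a bare c1 is rejected when both sides of the join have a c1 column, while a.c1 or a column present on only one side passes.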

alamb

Thanks @Jefffrey -- the code looks great and I have just a few small comments on tests

alamb · 2023-02-07T21:23:53Z

datafusion/core/src/dataframe.rs

+    /// # Ok(())
+    /// # }
+    /// ```
+    pub fn join_on(


alamb · 2023-02-07T21:25:06Z

datafusion/core/src/dataframe.rs

+            JoinType::Inner,
+            [
+                col("a.c1").not_eq(col("b.c1")),
+                col("a.c2").not_eq(col("b.c2")),


Would it be possible to also add an equality predicate here, to demonstrate that such predicates are automatically recognized as equijoin predicates?

Perhaps something like

Suggested change

-                col("a.c2").not_eq(col("b.c2")),
+                col("a.c2").eq(col("b.c2")),

Done as you suggested. It seems they are still considered part of the filter, though this tracks with the explicit SQL version too:

https://github.com/apache/arrow-datafusion/blob/f0c67193a3d18ff1d94f9dd55bfb1715e5473bf1/datafusion/sql/tests/integration_test.rs#L1661-L1672

Edit: never mind, the extract_equijoin_predicate logical optimization does indeed extract it into an equijoin predicate.
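As a rough picture of what such an extraction step does, the std-only sketch below walks an AND tree and separates column = column comparisons (candidate equijoin keys) from everything else (the residual filter). The Expr enum and split_equijoin are hypothetical names for this illustration, and the sketch skips details a real optimizer rule would need, such as confirming that the two columns come from opposite sides of the join.

```rust
/// Toy expression type: a hypothetical stand-in for DataFusion's Expr.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Eq(Box<Expr>, Box<Expr>),
    NotEq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Split a conjunction into equijoin key pairs and residual predicates.
fn split_equijoin(
    expr: Expr,
    keys: &mut Vec<(String, String)>,
    residual: &mut Vec<Expr>,
) {
    match expr {
        // Recurse into both sides of an AND.
        Expr::And(l, r) => {
            split_equijoin(*l, keys, residual);
            split_equijoin(*r, keys, residual);
        }
        // column = column becomes a join key; any other equality stays a filter.
        Expr::Eq(l, r) => match (*l, *r) {
            (Expr::Column(a), Expr::Column(b)) => keys.push((a, b)),
            (l, r) => residual.push(Expr::Eq(Box::new(l), Box::new(r))),
        },
        // Everything else (for example NotEq) remains in the filter.
        other => residual.push(other),
    }
}
```

Applied to the predicates from the test above, a.c2 = b.c2 would end up as a key pair while a.c1 != b.c1 stays in the residual filter.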

alamb · 2023-02-07T21:26:04Z

datafusion/expr/src/logical_plan/builder.rs

+        let filter = if let Some(expr) = filter {
+            // ambiguous check
+            ensure_any_column_reference_is_unambiguous(
+                &expr,
+                &[self.schema(), right.schema()],
+            )?;
+
+            // normalize all columns in expression
+            let using_columns = expr.to_columns()?;
+            let filter = normalize_col_with_schemas(
+                expr,
+                &[self.schema(), right.schema()],
+                &[using_columns],
+            )?;
+            Some(filter)
+        } else {
+            None
+        };
+


I think it is fine to include in this PR as long as it also has a test (for ambiguity check using the DataFrame API)

liukun4515 · 2023-02-08T13:10:49Z

I want to take a look at this PR tomorrow. @alamb

alamb

LGTM -- thanks @Jefffrey -- let's wait for @liukun4515 to review before merging

alamb · 2023-02-08T18:29:38Z

docs/source/user-guide/dataframe.md

@@ -68,6 +68,7 @@ execution. The plan is evaluated (executed) when an action method is invoked, su
 | filter              | Filter a DataFrame to only include rows that match the specified filter expression.                                                        |
 | intersect           | Calculate the intersection of two DataFrames. The two DataFrames must have exactly the same schema                                         |
 | join                | Join this DataFrame with another DataFrame using the specified columns as join keys.                                                       |
+| join_on             | Join this DataFrame with another DataFrame using arbitrary expressions.                                                                    |


liukun4515

LGTM

ursabot · 2023-02-09T11:02:04Z

Benchmark runs are scheduled for baseline = dee9fd7 and contender = 1b03a7a. 1b03a7a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Dataframe join_on method

6172917

github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner labels Feb 7, 2023

Jefffrey commented Feb 7, 2023

Fix formatting

4043a53

alamb reviewed Feb 7, 2023

Add tests

52e78c6

liukun4515 self-requested a review February 8, 2023 13:11

alamb approved these changes Feb 8, 2023

liukun4515 approved these changes Feb 9, 2023

liukun4515 merged commit 1b03a7a into apache:master Feb 9, 2023

Jefffrey deleted the dataframe_join_on branch February 9, 2023 11:23
