Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow arbitrary column expressions in JOIN ON clause #3935

Closed
OneRaynyDay opened this issue Jul 8, 2022 · 5 comments
Closed

Allow arbitrary column expressions in JOIN ON clause #3935

OneRaynyDay opened this issue Jul 8, 2022 · 5 comments

Comments

@OneRaynyDay
Copy link
Contributor

Arbitrary column expressions

Typically in SQL one can express JOINs on arbitrary expressions. In polar's case, the JOIN must be an equality:

lf1.join(lf2, how="left", left_on=lf1.col("x"), right_on=lf2.col("y"))

is implicitly equivalent to:

SELECT * FROM lf1 LEFT JOIN lf2 ON lf1.x = lf2.y

There are two things (with varying degrees of difficulty to implement) which polars cannot currently do:

Allow arbitrary expressions on the LHS/RHS of the equality

SELECT * FROM lf1 LEFT JOIN lf2 ON lf1.x = lf2.y / 2

If we have lazy frames with the following content:

tbl1.fetch(n_rows=2)

x | y
-- | --
i64 | str
1 | "a"
2 | "b"

and:

tbl2.fetch(n_rows=2)

w | z
-- | --
i64 | str
2 | "c"
4 | "d"

and we try to join:

tbl1.join(tbl2, how="left", left_on=pl.col("x"), right_on=(pl.col("w") // 2)).fetch(n_rows=2)

x | y | z
-- | -- | --
i64 | str | str
1 | "a" | null
2 | "b" | "c"

it raises a PanicException: could not determine join column names. If I were to alias the second one the result is even worse (this should be considered a bug):

tbl1.join(tbl2, how="left", left_on=pl.col("x"), right_on=(pl.col("w") // 2).alias("w")).fetch(n_rows=2)

x | y | z
-- | -- | --
i64 | str | str
1 | "a" | null
2 | "b" | "c"

Here it seems like polars interpreted that expression as the field from the table! Uh oh. We expect z == ["c", "d"] not [null, "c"] along the column.

Allow arbitrary predicate functions for on

JOINs can get complicated:

SELECT * FROM lf1 LEFT JOIN lf2 ON lf1.x <= lf2.y AND lf2.x >= lf2.z

I don't think these non-equality predicate expressions can currently be expressed by pl.LazyFrame.join. AFAICT this can't be captured by join_asof either. I would like the ability to say something like:

tbl1.join(tbl2, how="left", on=(pl.col("x").bind(tbl1) <= pl.col("y").bind(tbl2)) & (pl.col("x").bind(tbl1) >= pl.col("z").bind(tbl2)))

to mean the same thing as the above SQL expression. Let me know what you think.

PS: Sorry for the barrage of tickets! I think polars is awesome and I'm currently scoping it out to see whether it fits my use case :)

@universalmind303
Copy link
Collaborator

closed via #4029

@Bonnevie
Copy link

Bonnevie commented Nov 1, 2023

should this really be closed @universalmind303 ? The second case with "Allow arbitrary predicate functions for on" is a fairly standard use-case - imagine that we need to link a set of intervals and a set of timestamped events, then we need a double comparison like the one described. I think a cross-join is needed to accomplish the same in polars now?

@bfeif
Copy link
Contributor

bfeif commented May 29, 2024

I agree, @Bonnevie . Not sure why this issue was closed, just enabling joins by pl.Expr objects doesn't really fix the problem

@cmdlineluser
Copy link
Contributor

There are also other issues for non-equi joins / intervals / overlaps:

@solomonpapathoti
Copy link

Hi, so can I do this?

tbl1.join(tbl2, how="inner", left_on=[pl.col("art_no"), pl.col("var_tu_key") // 1000], right_on=[(pl.col("art_no"), pl.col("var_no")])

for

SELECT * 
FROM lf1 AS il
INNER JOIN lf2 AS pdim
ON il.art_no = pdim.art_no
    AND DIV(il.var_tu_key, 1000) = pdim.var_no

Is there any feature added now?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants