Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-equi join tracking issue #10068

Closed
magarick opened this issue Jul 25, 2023 · 18 comments
Closed

Non-equi join tracking issue #10068

magarick opened this issue Jul 25, 2023 · 18 comments
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@magarick
Copy link
Contributor

Problem description

I know there have been a few issues raised about this before. I'd like to consolidate planning here since I'm going to start working on this but it's large and will have to proceed in parts. Probably a good idea to get agreement on what the API should look like too.
Further, there are sub-types of non-equi join that are conceptually distinct enough that they should probably have their own functions. And it's likely faster to implement them separately than trying to do it all one way. If conceptually distinct join types should get their own functions, this makes it easier to implement incrementally.
Obviously, teaming up on this would be great too.

Subtypes noted so far

  1. Match values in the left table to intervals that contain them in the right. This is kind of like an asof join, especially since in my experience you often want the intervals in the right table to be disjoint, though missing ranges are usually ok. In that case I've called this an annotation or tag join before. When intervals can overlap, you might want to take the first or last overlapping one for a value, but defining first and last requires care.
  2. Left and right both have intervals which match when they overlap. Similar to 1 since a match is when either endpoint of an interval falls within an interval on the other side.
  3. Keys match if their difference is within some range, or interval overlaps with tolerance (like foverlaps)

So, does it make sense to have different functions for conceptually different joins. The underlying algorithm may end up being the same, since DuckDB seems to claim they can do it all super fast with one type of join (https://duckdb.org/2022/05/27/iejoin.html, https://vldb.org/pvldb/vol8/p2074-khayyat.pdf)

@magarick magarick added the enhancement New feature or an improvement of an existing feature label Jul 25, 2023
@cmdlineluser
Copy link
Contributor

Incase it's useful information:

While DuckDB does have several non-equi-joins, the planner currently assumes that all equality predicates are more selective than inequalities and just generates a hash join

Note also that the IEJoin algorithm requires two inequalities and the query only has one.

Single inequalities could be handled by the PieceWiseMergeJoin operator, but PWMJ does not currently handle simple equalities (the logic would just have to be extended to handle NULLs correctly).

https://stackoverflow.com/a/76483604

Originated from: #9376

@avimallu
Copy link
Contributor

avimallu commented Jul 25, 2023

Thanks for taking this up!

While a technical discussion on the API is essential (which is way beyond my abilities), I want to put my thoughts on how the average user will specify join conditions. In SQL, we can (rather elegantly) specify join conditions like (pseudocode):

DATETRUNC('month', A.date) + DAYS(10) >= B.date
D.index - C.index + 1 >= 4
DATEDIFF('day', E.date, B.date) <= (D.index)

with or without creating special columns that host these calculations, we can't easily specify something like in Python, and consequently Polars. The most natural extension I can think of is using Polars expressions with some enchancements for column specifications in the join_inequi call. Perhaps a left.<col_name> and a right.<col_name> syntax to specify a list of expressions as the join condition? I'm thinking:

df_left.join(df_right, join_expr = [pl.col("left.D") - pl.col("right.C") > 4])

Any thoughts? Maybe the left. and right. qualifiers are necessary only when the column names are identical and need disambiguation.

@magarick
Copy link
Contributor Author

While DuckDB does have several non-equi-joins, the planner currently assumes that all equality predicates are more selective than inequalities and just generates a hash join

This is exactly why I'd like different functions for different concepts even if everything could be expressed through a generic non-equi join. It's more explicit and clear than trying to parse out what's happening from a generic expression that handles lots of cases.

Note also that the IEJoin algorithm requires two inequalities and the query only has one.

Single inequality might be doable with a modified asof join. Again, conceptual differences matter.

@magarick
Copy link
Contributor Author

Thanks for taking this up!

While a technical discussion on the API is essential (which is way beyond my abilities), I want to put my thoughts on how the average user will specify join conditions. In SQL, we can (rather elegantly) specify join conditions like (pseudocode):

DATETRUNC('month', A.date) + DAYS(10) >= B.date
D.index - C.index + 1 >= 4
DATEDIFF('day', E.date, B.date) <= (D.index)

with or without creating special columns that host these calculations, we can't easily specify something like in Python, and consequently Polars. The most natural extension I can think of is using Polars expressions with some enchancements for column specifications in the join_inequi call. Perhaps a left.<col_name> and a right.<col_name> syntax to specify a list of expressions as the join condition? I'm thinking:

df_left.join(df_right, join_expr = [pl.col("left.D") - pl.col("right.C") > 4])

Any thoughts? Maybe the left. and right. qualifiers are necessary only when the column names are identical and need disambiguation.

This is why I want to start with simple common cases. Something like you're describing, I think, will require a lot more effort and probably more knowledge than I have of the internals currently. I know data.table has x. and i. for referring to columns in another table inside of a join, and I guess it works. But even with my standards it feels magic and awkward and confusing. A way to refer to columns in the right join table could be nice, but it also sounds like a lot of work for a limited use case. Thought the other alternatives I can think of are incredibly awkward or require "stringly typed" functions.

@magarick
Copy link
Contributor Author

I think the natural starting point is a "range join" as it's similar to an asof with some additional complexity, it's very common to want, and it's a clear semantic category.

Within this, there are two operations that could potentially be considered distinct enough:

Point on the left, interval on the right:

  1. In addition to any equality terms we want $L \in [R_1, R_2)$ where the intervals could also be open, closed, or right-closed.
  2. In an asof join, there's a clear "nearest" point, but now no longer. However in many cases you have a preference for which interval matches to $L$ if they overlap. If you don't want all matches, you might want the row where $R_1$ is the largest value $\geq L$ or $R_2$ is the smallest $\leq L$. In this case, it's an asof join with variable look{ahead,behind}.
  3. Sometimes you only want to join if each point on the left uniquely matches an interval on the right. Often by making sure the intervals are disjoint per group. I've done this kind of tagging a lot withdata.table. For example, you could have customer arrival times on the left and number of employees available on the right. But I don't know if this should be directly checked by the join. It does bring up the question of utilities for handling intervals (combining, splitting to disjoint, checking for overlaps, etc.)
  4. In my experience, the right table is usually much smaller than the left for this case. Maybe that makes a difference.

Interval on both sides:

  1. This would be like data.table's foverlaps. That one also allows you to specify a tolerance for near matches which I think is useful in genomics? They only allow closed intervals. Not sure exactly why.
  2. It could be implemented by checking the endpoints for each interval on one side with the interval of the other and declaring a match if either succeeds. This is how data.table does it and I can't think of anything better off the top of my head.
  3. However, this type of join also adds the complexity of different types of overlap. You might want to exclude cases where one interval completely contains another, for instance. I don't know if this is something that needs to be done in the first iteration.

@avimallu
Copy link
Contributor

  1. This would be like data.table's foverlaps. That one also allows you to specify a tolerance for near matches which I think is useful in genomics? They only allow closed intervals. Not sure exactly why.

The author, Arun, mentioned he wrote this code in 4 hours for a conference he was attending. Don't remember where though; and it looks like it wasn't worked on much after that. That's probably why the tolerance was never implemented, and the function isn't very configurable.

  1. It could be implemented by checking the endpoints for each interval on one side with the interval of the other and declaring a match if either succeeds. This is how data.table does it and I can't think of anything better off the top of my head.

cmdlineuser came up with a very fast implementation in py-polars based on existing functions. You can check it out: #9467
It was able to closely match a DuckDB solution that used interval joins.

@magarick
Copy link
Contributor Author

cmdlineuser came up with a very fast implementation in py-polars based on existing functions. You can check it out: #9467 It was able to closely match a DuckDB solution that used interval joins.

That doesn't look like it even needs a join. But it does need some interval manipulation tools, which I think there's also demand for. Could be worth filing a separate issue.

@Hoeze
Copy link

Hoeze commented Sep 27, 2023

Cross-linking #6856.
Interval joins are the last missing piece to replace pyranges in all of my code.

@kszlim
Copy link
Contributor

kszlim commented Apr 3, 2024

Just a 👍 as this seems like a very useful feature that would help my use cases a fair bit.

@Nicolas-SB
Copy link

+1

1 similar comment
@nikita-balyschew-db
Copy link

+1

@jshinonome
Copy link

+1

This allows to perform wj of kdb, last feature to replace kdb.

@ritchie46 ritchie46 removed the P-high Priority: high label Jun 16, 2024
@adamreeve
Copy link
Contributor

Hi @ritchie46, the Polars 1.0 release announcement mentioned non-equi join support under "Other short term plans". Are you able to provide any more details on those plans here? I was thinking about implementing the first join subtype mentioned in this issue description (match a value from one table with an interval in the other), but won't do anything if someone is already planning on working on this.

@ritchie46
Copy link
Member

Hey @adamreeve, sorry for the late reply. Missed this. Yes, I want to implement this join type https://vldb.org/pvldb/vol8/p2074-khayyat.pdf

This includes multiple range joins. I am not entirely sure about the interface yet, but the backend can already be started. Can you maybe ping me on discord?

@iliya-malecki
Copy link
Contributor

Hi @ritchie46 does this mean i can hope that in the future any expression that returns a boolean can be a valid join condition? e.g. pl.col('a').floordiv(42).is_between('x', 'y')?

@adamreeve
Copy link
Contributor

adamreeve commented Aug 9, 2024

I've started working on this and have a working implementation of the IEJoin algorithm: adamreeve/polars@main...iejoin

The Khayyat et al. paper doesn't account for duplicate values so I've also used some ideas from the DuckDB article.

I've just hacked this in as a DataFrame method initially to allow easy testing, but will start looking into integrating it as a proper join type.

@adamreeve
Copy link
Contributor

I've opened a draft PR that adds the IEJoin type but I think the API needs some discussion before this will be ready: #18365

@ritchie46
Copy link
Member

Added in #18365

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature
Projects
Archived in project
Development

No branches or pull requests