Non-equi join tracking issue #10068

magarick · 2023-07-25T07:55:21Z

Problem description

I know there have been a few issues raised about this before. I'd like to consolidate planning here since I'm going to start working on this but it's large and will have to proceed in parts. Probably a good idea to get agreement on what the API should look like too.
Further, some sub-types of non-equi join are conceptually distinct enough that they should probably have their own functions. Implementing them separately is likely faster than trying to handle everything one way, and it lets the work proceed incrementally.
Obviously, teaming up on this would be great too.

Subtypes noted so far

1. Match values in the left table to intervals that contain them in the right. This is kind of like an asof join, especially since in my experience you often want the intervals in the right table to be disjoint, though missing ranges are usually ok. In that case I've called this an annotation or tag join before. When intervals can overlap, you might want to take the first or last overlapping one for a value, but defining first and last requires care.
2. Left and right both have intervals which match when they overlap. Similar to 1 since a match is when either endpoint of an interval falls within an interval on the other side.
3. Keys match if their difference is within some range, or interval overlaps with tolerance (like foverlaps).
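
As a rough illustration of the "difference within some range" case, here is a minimal pure-Python sketch. The function name `tolerance_join` and its `lo`/`hi` parameters are made up for this example and are not part of Polars or data.table; it only assumes numeric keys.

```python
from bisect import bisect_left, bisect_right


def tolerance_join(left_keys, right_keys, lo, hi):
    """Return (left, right) pairs where lo <= right - left <= hi.

    Hypothetical helper for illustration only.
    """
    right_sorted = sorted(right_keys)
    pairs = []
    for l in left_keys:
        # All right keys in [l + lo, l + hi] form one contiguous slice
        # of the sorted right side.
        start = bisect_left(right_sorted, l + lo)
        stop = bisect_right(right_sorted, l + hi)
        pairs.extend((l, r) for r in right_sorted[start:stop])
    return pairs


pairs = tolerance_join([10, 20], [8, 11, 19, 25], lo=-2, hi=1)
print(pairs)  # [(10, 8), (10, 11), (20, 19)]
```

Sorting one side once and binary-searching per left key keeps the work at roughly O((n + m) log m) plus the size of the output.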

So, does it make sense to have different functions for conceptually different joins? The underlying algorithm may end up being the same, since DuckDB seems to claim they can do it all super fast with one type of join (https://duckdb.org/2022/05/27/iejoin.html, https://vldb.org/pvldb/vol8/p2074-khayyat.pdf).


cmdlineluser · 2023-07-25T10:46:21Z

In case it's useful information:

> While DuckDB does have several non-equi-joins, the planner currently assumes that all equality predicates are more selective than inequalities and just generates a hash join
>
> Note also that the IEJoin algorithm requires two inequalities and the query only has one.
>
> Single inequalities could be handled by the PieceWiseMergeJoin operator, but PWMJ does not currently handle simple equalities (the logic would just have to be extended to handle NULLs correctly).

https://stackoverflow.com/a/76483604

Originated from: #9376

avimallu · 2023-07-25T14:25:53Z

Thanks for taking this up!

While a technical discussion on the API is essential (and way beyond my abilities), I want to share my thoughts on how the average user will specify join conditions. In SQL, we can (rather elegantly) specify join conditions like (pseudocode):

DATETRUNC('month', A.date) + DAYS(10) >= B.date
D.index - C.index + 1 >= 4
DATEDIFF('day', E.date, B.date) <= (D.index)

With or without creating special columns that host these calculations, we can't easily specify something like this in Python, and consequently in Polars. The most natural extension I can think of is using Polars expressions with some enhancements for column specifications in the join_inequi call. Perhaps a left.<col_name> and a right.<col_name> syntax to specify a list of expressions as the join condition? I'm thinking:

df_left.join(df_right, join_expr = [pl.col("left.D") - pl.col("right.C") > 4])

Any thoughts? Maybe the left. and right. qualifiers are necessary only when the column names are identical and need disambiguation.
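
To make the semantics of such a join condition concrete, here is a small pure-Python sketch. It is not Polars code: `join_on_predicate` is a hypothetical helper, rows are plain dicts, and the `left.`/`right.` prefixes on the output columns simply mirror the naming suggested above.

```python
def join_on_predicate(left_rows, right_rows, predicate):
    """Nested-loop join: keep every (left, right) pair the predicate accepts.

    Hypothetical helper for illustration; rows are dicts.
    """
    out = []
    for l in left_rows:
        for r in right_rows:
            if predicate(l, r):
                # Prefix column names so identical names cannot collide.
                row = {f"left.{k}": v for k, v in l.items()}
                row.update({f"right.{k}": v for k, v in r.items()})
                out.append(row)
    return out


left = [{"D": 10}, {"D": 3}]
right = [{"C": 1}, {"C": 7}]
matches = join_on_predicate(left, right, lambda l, r: l["D"] - r["C"] > 4)
print(matches)  # [{'left.D': 10, 'right.C': 1}]
```

This is just a filtered cross join, so it costs O(n * m) comparisons. It shows what an arbitrary boolean join condition means, not how an efficient implementation would evaluate it.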

magarick · 2023-07-27T00:15:28Z

> While DuckDB does have several non-equi-joins, the planner currently assumes that all equality predicates are more selective than inequalities and just generates a hash join

This is exactly why I'd like different functions for different concepts even if everything could be expressed through a generic non-equi join. It's more explicit and clear than trying to parse out what's happening from a generic expression that handles lots of cases.

> Note also that the IEJoin algorithm requires two inequalities and the query only has one.

Single inequality might be doable with a modified asof join. Again, conceptual differences matter.
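
A minimal sketch of why a single inequality is close to an asof join, in pure Python. The helper name `join_left_ge_right` is invented for this example; it assumes the condition is `left >= right` on one numeric key.

```python
from bisect import bisect_right


def join_left_ge_right(left_vals, right_vals):
    """Return all (l, r) pairs with l >= r (hypothetical helper)."""
    right_sorted = sorted(right_vals)
    pairs = []
    for l in left_vals:
        # Every right value <= l sits in a prefix of the sorted right side.
        n = bisect_right(right_sorted, l)
        pairs.extend((l, r) for r in right_sorted[:n])
    return pairs


pairs = join_left_ge_right([5, 1], [3, 1, 7])
print(pairs)  # [(5, 1), (5, 3), (1, 1)]
```

A backward asof join would keep only the last element of each prefix (`right_sorted[n - 1]` when `n > 0`), whereas the inequality join emits the whole prefix.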

magarick · 2023-07-27T00:21:32Z

> Thanks for taking this up!
>
> While a technical discussion on the API is essential (and way beyond my abilities), I want to share my thoughts on how the average user will specify join conditions. In SQL, we can (rather elegantly) specify join conditions like (pseudocode):
> DATETRUNC('month', A.date) + DAYS(10) >= B.date
> D.index - C.index + 1 >= 4
> DATEDIFF('day', E.date, B.date) <= (D.index)
> With or without creating special columns that host these calculations, we can't easily specify something like this in Python, and consequently in Polars. The most natural extension I can think of is using Polars expressions with some enhancements for column specifications in the join_inequi call. Perhaps a left.<col_name> and a right.<col_name> syntax to specify a list of expressions as the join condition? I'm thinking:
> df_left.join(df_right, join_expr = [pl.col("left.D") - pl.col("right.C") > 4])
> Any thoughts? Maybe the left. and right. qualifiers are necessary only when the column names are identical and need disambiguation.

This is why I want to start with simple common cases. Something like you're describing, I think, will require a lot more effort and probably more knowledge of the internals than I currently have. I know data.table has x. and i. for referring to columns in another table inside of a join, and I guess it works. But even by my standards it feels magic and awkward and confusing. A way to refer to columns in the right join table could be nice, but it also sounds like a lot of work for a limited use case. Though the other alternatives I can think of are incredibly awkward or require "stringly typed" functions.

magarick · 2023-07-29T02:35:38Z

I think the natural starting point is a "range join" as it's similar to an asof with some additional complexity, it's very common to want, and it's a clear semantic category.

Within this, there are two operations that could potentially be considered distinct enough:

Point on the left, interval on the right:

In addition to any equality terms we want $L \in [R_1, R_2)$ where the intervals could also be open, closed, or right-closed.
In an asof join there's a clear "nearest" point, but that no longer holds here. However, in many cases you have a preference for which interval matches $L$ if they overlap. If you don't want all matches, you might want the row where $R_1$ is the largest value $\leq L$ or $R_2$ is the smallest value $> L$. In this case, it's an asof join with variable look{ahead,behind}.
Sometimes you only want to join if each point on the left uniquely matches an interval on the right, often by making sure the intervals are disjoint per group. I've done this kind of tagging a lot with data.table. For example, you could have customer arrival times on the left and number of employees available on the right. But I don't know if this should be directly checked by the join. It does bring up the question of utilities for handling intervals (combining, splitting to disjoint, checking for overlaps, etc.)
In my experience, the right table is usually much smaller than the left for this case. Maybe that makes a difference.
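
The "tagging" case with disjoint intervals can be sketched in a few lines of pure Python. Everything here is illustrative: `tag_points` is a made-up name, intervals are half-open $[R_1, R_2)$ tuples of `(start, end, label)`, and the code assumes they are disjoint rather than checking it.

```python
from bisect import bisect_right


def tag_points(points, intervals):
    """Tag each point with the label of the interval [start, end) containing it.

    Hypothetical helper; assumes the intervals do not overlap.
    Points that fall in a gap get None.
    """
    ordered = sorted(intervals)
    starts = [start for start, _, _ in ordered]
    tagged = []
    for p in points:
        # Candidate: the interval with the largest start <= p.
        i = bisect_right(starts, p) - 1
        if i >= 0 and p < ordered[i][1]:
            tagged.append((p, ordered[i][2]))
        else:
            tagged.append((p, None))
    return tagged


# (start_hour, end_hour, employees_available)
shifts = [(9, 12, 3), (12, 17, 5), (18, 22, 2)]
tagged = tag_points([8, 9, 12, 17, 21], shifts)
print(tagged)  # [(8, None), (9, 3), (12, 5), (17, None), (21, 2)]
```

Because the right side is sorted once and each lookup is a binary search, a small right table and a large left table (the situation described above) is the cheap case for this approach.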

Interval on both sides:

This would be like data.table's foverlaps. That one also allows you to specify a tolerance for near matches which I think is useful in genomics? They only allow closed intervals. Not sure exactly why.
It could be implemented by checking the endpoints for each interval on one side with the interval of the other and declaring a match if either succeeds. This is how data.table does it and I can't think of anything better off the top of my head.
However, this type of join also adds the complexity of different types of overlap. You might want to exclude cases where one interval completely contains another, for instance. I don't know if this is something that needs to be done in the first iteration.
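
A small pure-Python sketch of the overlap test and of one way to exclude containment. The names `overlaps`, `contains`, and `overlap_join` and the `exclude_containment` flag are assumptions made for this example, not data.table's or Polars' API. Intervals are closed `(start, end)` tuples, and the nested loop is only there to show the semantics.

```python
def overlaps(a, b):
    """Closed intervals overlap iff each one starts before the other ends."""
    return a[0] <= b[1] and b[0] <= a[1]


def contains(outer, inner):
    """True if inner lies entirely within outer (closed intervals)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def overlap_join(left, right, exclude_containment=False):
    """Return overlapping (left, right) interval pairs (hypothetical helper)."""
    pairs = []
    for a in left:
        for b in right:
            if not overlaps(a, b):
                continue
            if exclude_containment and (contains(a, b) or contains(b, a)):
                continue
            pairs.append((a, b))
    return pairs


left = [(1, 5), (10, 20)]
right = [(4, 8), (12, 15), (30, 40)]
all_pairs = overlap_join(left, right)
partial_only = overlap_join(left, right, exclude_containment=True)
print(all_pairs)     # [((1, 5), (4, 8)), ((10, 20), (12, 15))]
print(partial_only)  # [((1, 5), (4, 8))]
```

For closed intervals, the two-comparison test in `overlaps` gives the same answer as checking whether any endpoint of one interval falls inside the other.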

avimallu · 2023-07-29T02:42:30Z

> This would be like data.table's foverlaps. That one also allows you to specify a tolerance for near matches which I think is useful in genomics? They only allow closed intervals. Not sure exactly why.

The author, Arun, mentioned he wrote this code in 4 hours for a conference he was attending. Don't remember where though; and it looks like it wasn't worked on much after that. That's probably why the tolerance was never implemented, and the function isn't very configurable.

> It could be implemented by checking the endpoints for each interval on one side with the interval of the other and declaring a match if either succeeds. This is how data.table does it and I can't think of anything better off the top of my head.

cmdlineluser came up with a very fast implementation in py-polars based on existing functions. You can check it out: #9467
It was able to closely match a DuckDB solution that used interval joins.

magarick · 2023-07-31T22:33:59Z

> cmdlineluser came up with a very fast implementation in py-polars based on existing functions. You can check it out: #9467 It was able to closely match a DuckDB solution that used interval joins.

That doesn't look like it even needs a join. But it does need some interval manipulation tools, which I think there's also demand for. Could be worth filing a separate issue.

Hoeze · 2023-09-27T12:29:47Z

Cross-linking #6856.
Interval joins are the last missing piece to replace pyranges in all of my code.

kszlim · 2024-04-03T06:05:18Z

Just a 👍 as this seems like a very useful feature that would help my use cases a fair bit.

Nicolas-SB · 2024-05-23T13:50:30Z

+1

nikita-balyschew-db · 2024-05-23T13:52:27Z

+1

jshinonome · 2024-06-03T03:28:34Z

+1

This would make it possible to perform kdb's wj (window join), the last feature I need to replace kdb.

adamreeve · 2024-07-11T21:56:42Z

Hi @ritchie46, the Polars 1.0 release announcement mentioned non-equi join support under "Other short term plans". Are you able to provide any more details on those plans here? I was thinking about implementing the first join subtype mentioned in this issue description (match a value from one table with an interval in the other), but won't do anything if someone is already planning on working on this.

ritchie46 · 2024-07-26T11:39:26Z

Hey @adamreeve, sorry for the late reply. Missed this. Yes, I want to implement this join type https://vldb.org/pvldb/vol8/p2074-khayyat.pdf

This includes multiple range joins. I am not entirely sure about the interface yet, but the backend can already be started. Can you maybe ping me on discord?

iliya-malecki · 2024-08-01T15:38:53Z

Hi @ritchie46, does this mean I can hope that in the future any expression that returns a boolean can be a valid join condition? e.g. pl.col('a').floordiv(42).is_between('x', 'y')?

adamreeve · 2024-08-09T03:31:13Z

I've started working on this and have a working implementation of the IEJoin algorithm: adamreeve/polars@main...iejoin

The Khayyat et al. paper doesn't account for duplicate values so I've also used some ideas from the DuckDB article.
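
For readers unfamiliar with the approach, here is a simplified pure-Python sketch in the spirit of IEJoin for two inequalities. It is not the paper's permutation-array formulation and not the implementation in the linked branch. The function name, the `(x, y)` row layout, and the fixed condition `left.x < right.x and left.y > right.y` are all assumptions for illustration. The core idea it keeps is: sort on one column, track matches for the other column in a bit array, and scan that array.

```python
from bisect import bisect_right


def two_inequality_join(left, right):
    """Return index pairs (i, j) with left[i].x < right[j].x and
    left[i].y > right[j].y. Rows are (x, y) tuples. Illustrative sketch only.
    """
    # Rank right rows by x; the bit array is indexed by this rank.
    by_x = sorted(range(len(right)), key=lambda j: right[j][0])
    xs = [right[j][0] for j in by_x]
    rank_of = {j: k for k, j in enumerate(by_x)}
    # Right rows in ascending y, so they can be switched on incrementally.
    by_y = sorted(range(len(right)), key=lambda j: right[j][1])
    active = [False] * len(right)
    nxt = 0
    out = []
    # Visit left rows in ascending y: the set of right rows with a smaller y
    # only ever grows.
    for i in sorted(range(len(left)), key=lambda i: left[i][1]):
        lx, ly = left[i]
        while nxt < len(by_y) and right[by_y[nxt]][1] < ly:
            active[rank_of[by_y[nxt]]] = True
            nxt += 1
        # Strict inequality: skip right rows whose x equals lx (duplicates).
        first = bisect_right(xs, lx)
        for k in range(first, len(xs)):
            if active[k]:
                out.append((i, by_x[k]))
    return out


left = [(1, 10), (5, 3), (2, 7)]
right = [(3, 4), (2, 8), (6, 1), (1, 0)]
result = two_inequality_join(left, right)
print(sorted(result))  # [(0, 0), (0, 1), (0, 2), (1, 2), (2, 0), (2, 2)]
```

Duplicate values are handled here through the strict comparisons (`<` in the activation loop and `bisect_right` for the scan start). A production version would use a packed bitmap rather than a list of booleans so the scan can skip empty words.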

I've just hacked this in as a DataFrame method initially to allow easy testing, but will start looking into integrating it as a proper join type.

adamreeve · 2024-08-26T04:52:18Z

I've opened a draft PR that adds the IEJoin type but I think the API needs some discussion before this will be ready: #18365

ritchie46 · 2024-09-07T13:16:04Z

Added in #18365

magarick added the enhancement New feature or an improvement of an existing feature label Jul 25, 2023

etiennebacher mentioned this issue Jul 25, 2023

TODO etiennebacher/tidypolars#2

Closed

13 tasks

cmdlineluser mentioned this issue Jul 28, 2023

.groupby_slices() #9467

Closed

etiennebacher mentioned this issue Nov 20, 2023

Implement the join_by() interface etiennebacher/tidypolars#60

Closed

This was referenced Jan 11, 2024

Join between Polars dataframes with inequality conditions #6856

Closed

Conditional join on expression #6131

Closed

stinodego added the accepted Ready for implementation label Jan 11, 2024

github-project-automation bot added this to Backlog Jan 11, 2024

github-project-automation bot moved this to Ready in Backlog Jan 11, 2024

stinodego added the P-high Priority: high label Jan 16, 2024

avimallu mentioned this issue Apr 3, 2024

Polars cross join 50x slower than DuckDB cross join #15456

Closed

2 tasks

cmdlineluser mentioned this issue Apr 26, 2024

Database-style indexes for efficient filtering? #7335

Open

cmdlineluser mentioned this issue May 29, 2024

Allow arbitrary column expressions in JOIN ON clause #3935

Closed

ritchie46 removed the P-high Priority: high label Jun 16, 2024

adamreeve mentioned this issue Aug 26, 2024

feat: Add IEJoin algorithm for non-equi joins and support Full non-equi joins #18365

Merged

ritchie46 closed this as completed Sep 7, 2024

github-project-automation bot moved this from Ready to Done in Backlog Sep 7, 2024

BrewTestBot mentioned this issue Jan 6, 2025

qsv 2.0.0 Homebrew/homebrew-core#203333

Merged
