Non-equi join tracking issue #10068
Comments
In case it's useful information:
https://stackoverflow.com/a/76483604 Originated from: #9376 |
Thanks for taking this up! While a technical discussion on the API is essential (and way beyond my abilities), I want to share my thoughts on how the average user will specify join conditions. In SQL, we can (rather elegantly) specify join conditions like (pseudocode):
DATETRUNC('month', A.date) + DAYS(10) >= B.date
D.index - C.index + 1 >= 4
DATEDIFF('day', E.date, B.date) <= (D.index)
With or without creating special columns that host these calculations, we can't easily specify something like this in Python, and consequently in Polars. The most natural extension I can think of is using Polars expressions with some enhancements for column specifications in the
df_left.join(df_right, join_expr = [pl.col("left.D") - pl.col("right.C") > 4])
Any thoughts? Maybe the |
This is exactly why I'd like different functions for different concepts even if everything could be expressed through a generic non-equi join. It's more explicit and clear than trying to parse out what's happening from a generic expression that handles lots of cases.
Single inequality might be doable with a modified asof join. Again, conceptual differences matter. |
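To make the asof comparison concrete, here is a minimal pure-Python sketch of a single-inequality join. It is not Polars code or any proposed API: the function name `inequality_join` and the list-of-dicts row layout are assumptions made up for illustration. It sorts the right side once and binary-searches per left row, which is the same lookup an asof join does, except that it keeps every matching row instead of only the nearest one.

```python
from bisect import bisect_right


def inequality_join(left, right, left_key, right_key):
    """Return all (l, r) pairs with r[right_key] <= l[left_key].

    Hypothetical helper for illustration only, not a Polars function.
    """
    # Sort the right side once so each left row needs a single binary search.
    right_sorted = sorted(right, key=lambda r: r[right_key])
    right_keys = [r[right_key] for r in right_sorted]
    out = []
    for l in left:
        # Every right row before this index satisfies right_key <= left_key.
        hi = bisect_right(right_keys, l[left_key])
        # An asof join would keep only right_sorted[hi - 1];
        # the inequality join keeps the whole matching prefix.
        out.extend((l, r) for r in right_sorted[:hi])
    return out


left = [{"id": "a", "t": 5}, {"id": "b", "t": 1}]
right = [{"name": "x", "t": 3}, {"name": "y", "t": 1}, {"name": "z", "t": 7}]
pairs = [(l["id"], r["name"]) for l, r in inequality_join(left, right, "t", "t")]
print(pairs)
```

The output can still be quadratic in size, so the sort only saves the comparisons, not the cost of materializing the matches.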
This is why I want to start with simple common cases. Something like you're describing, I think, will require a lot more effort and probably more knowledge than I have of the internals currently. I know data.table has |
I think the natural starting point is a "range join" as it's similar to an asof with some additional complexity, it's very common to want, and it's a clear semantic category. Within this, there are two operations that could potentially be considered distinct enough: Point on the left, interval on the right:
Interval on both sides:
|
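For readers skimming the thread, the two range-join variants named above can be sketched in plain Python. This is a naive nested-loop illustration of the semantics only, not an efficient algorithm and not a Polars API; the function names, the `start`/`end`/`value` field names, and the choice of closed interval bounds are all assumptions.

```python
def point_in_interval_join(points, intervals):
    """Pair each left point with every right interval [start, end] containing it.

    Closed bounds on both ends are an assumption of this sketch.
    """
    return [
        (p, iv)
        for p in points
        for iv in intervals
        if iv["start"] <= p["value"] <= iv["end"]
    ]


def interval_overlap_join(left, right):
    """Pair intervals from both sides that share at least one point (closed bounds)."""
    return [
        (a, b)
        for a in left
        for b in right
        if a["start"] <= b["end"] and b["start"] <= a["end"]
    ]


points = [{"value": 2}, {"value": 6}, {"value": 11}]
intervals = [
    {"name": "x", "start": 0, "end": 5},
    {"name": "y", "start": 5, "end": 10},
]
point_matches = [
    (p["value"], iv["name"]) for p, iv in point_in_interval_join(points, intervals)
]

spans = [{"name": "a", "start": 1, "end": 4}, {"name": "b", "start": 8, "end": 12}]
overlap_matches = [
    (a["name"], b["name"]) for a, b in interval_overlap_join(spans, intervals)
]
print(point_matches, overlap_matches)
```

Both variants are conjunctions of two inequalities, which is why a single general inequality-join backend could serve them even if they are exposed as separate functions.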
The author, Arun, mentioned he wrote this code in 4 hours for a conference he was attending. I don't remember where, though, and it looks like it wasn't worked on much after that. That's probably why the tolerance was never implemented and the function isn't very configurable.
cmdlineuser came up with a very fast implementation in py-polars based on existing functions. You can check it out: #9467 |
That doesn't look like it even needs a join. But it does need some interval manipulation tools, which I think there's also demand for. Could be worth filing a separate issue. |
Cross-linking #6856. |
Just a 👍 as this seems like a very useful feature that would help my use cases a fair bit. |
+1 |
+1 |
+1 This would make it possible to perform kdb's wj (window join), the last feature needed to replace kdb. |
Hi @ritchie46, the Polars 1.0 release announcement mentioned non-equi join support under "Other short term plans". Are you able to provide any more details on those plans here? I was thinking about implementing the first join subtype mentioned in this issue description (match a value from one table with an interval in the other), but won't do anything if someone is already planning on working on this. |
Hey @adamreeve, sorry for the late reply. Missed this. Yes, I want to implement this join type: https://vldb.org/pvldb/vol8/p2074-khayyat.pdf This includes multiple range joins. I am not entirely sure about the interface yet, but work on the backend can already start. Can you maybe ping me on Discord? |
Hi @ritchie46, does this mean I can hope that in the future any expression that returns a boolean can be a valid join condition? e.g. |
I've started working on this and have a working implementation of the IEJoin algorithm: adamreeve/polars@main...iejoin The Khayyat et al. paper doesn't account for duplicate values so I've also used some ideas from the DuckDB article. I've just hacked this in as a DataFrame method initially to allow easy testing, but will start looking into integrating it as a proper join type. |
I've opened a draft PR that adds the IEJoin type but I think the API needs some discussion before this will be ready: #18365 |
Added in #18365 |
Problem description
I know there have been a few issues raised about this before. I'd like to consolidate planning here since I'm going to start working on this but it's large and will have to proceed in parts. Probably a good idea to get agreement on what the API should look like too.
Further, there are sub-types of non-equi join that are conceptually distinct enough that they should probably have their own functions. It's also likely faster to implement them separately than to try to do it all one way, and separate functions make it easier to implement incrementally.
Obviously, teaming up on this would be great too.
Subtypes noted so far
foverlaps
So, does it make sense to have different functions for conceptually different joins? The underlying algorithm may end up being the same, since DuckDB seems to claim they can do it all super fast with one type of join (https://duckdb.org/2022/05/27/iejoin.html, https://vldb.org/pvldb/vol8/p2074-khayyat.pdf).