Polars cross join 50x slower than DuckDB cross join #15456
Comments
@stinodego Here is a reproducible example 😄
Execute the Polars code:
df_polars_only = df.lazy().join(
    parts.lazy(),
    how='cross'
).filter(
    # after the cross join, the right-hand `id` is suffixed as `id_right`
    (pl.col('id') == pl.col('id_right')) &
    (pl.col('start_date') <= pl.col('date')) &
    (pl.col('end_date') >= pl.col('date'))
).collect(streaming=True)
Timings: 16.7 s ± 1.98 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Now execute the same query in DuckDB:
sqlcode = """
SELECT *
FROM df
CROSS JOIN parts
WHERE
    df.id == parts.id
    AND df.start_date <= parts.date
    AND df.end_date >= parts.date
"""
duckdb.sql(sqlcode).pl()
Timings: 18.7 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This is why Polars really needs non-equi joins. DuckDB's query plan shows how it actually runs this query.
To get it, just use:
sqlcode = """
EXPLAIN
SELECT *
FROM df
CROSS JOIN parts
WHERE
    df.id == parts.id
    AND df.start_date <= parts.date
    AND df.end_date >= parts.date
"""
print(duckdb.sql(sqlcode).pl().get_column("explain_value").to_list()[0])
Underlying issue is #10068.
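To make the gap in the timings more concrete, here is a conceptual sketch in plain Python of the two strategies: materialising every pair and filtering afterwards, versus hashing on the equality key and applying the range conditions only to rows whose `id` already matches. This is not DuckDB's implementation, and it assumes the optimizer rewrites the query along these lines; the toy rows and helper names (`in_range`, `cross_then_filter`, `hash_join_then_filter`) are hypothetical.

```python
from collections import defaultdict
from datetime import date
from itertools import product

# Hypothetical toy rows standing in for `df` and `parts`.
df_rows = [
    {"id": 1, "start_date": date(2024, 1, 1), "end_date": date(2024, 1, 31)},
    {"id": 2, "start_date": date(2024, 2, 1), "end_date": date(2024, 2, 29)},
]
parts_rows = [
    {"id": 1, "date": date(2024, 1, 15)},
    {"id": 1, "date": date(2024, 3, 1)},
    {"id": 2, "date": date(2024, 2, 10)},
]


def in_range(left, right):
    # The two inequality conditions from the WHERE clause.
    return left["start_date"] <= right["date"] <= left["end_date"]


def cross_then_filter(left_rows, right_rows):
    # Visits every pair: len(left_rows) * len(right_rows) candidates.
    return [
        (l, r)
        for l, r in product(left_rows, right_rows)
        if l["id"] == r["id"] and in_range(l, r)
    ]


def hash_join_then_filter(left_rows, right_rows):
    # Build a hash table on the equality key, then test the range
    # conditions only on pairs whose ids already match.
    by_id = defaultdict(list)
    for r in right_rows:
        by_id[r["id"]].append(r)
    return [
        (l, r)
        for l in left_rows
        for r in by_id.get(l["id"], [])
        if in_range(l, r)
    ]


naive = cross_then_filter(df_rows, parts_rows)
hashed = hash_join_then_filter(df_rows, parts_rows)
```

Both functions return the same pairs; the second one simply never looks at pairs with different ids, which is where most of the work in a cross join goes.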
@avimallu right, then I can just rewrite it as an inner join ^^
But not as an inner join on inequality conditions, since Polars doesn't support those yet, right? (Don't know if an inner non-equi join has a specific name.)
Just doing an inner join on ID first and then filtering afterwards gives the same results.
Polars doesn't have non-equi joins yet. There is a tracking issue: #10068.
Checks
Reproducible example
See comment below.
Log output
No response
Issue description
Cross joins in Polars are a lot slower than in DuckDB, and streaming is the only way to get a result at all. If I don't first reduce the columns in df with a select, it takes 10+ minutes; if I select only the relevant columns, it takes only 40 seconds on 0.20.18 (in 0.20.10 it took twice as long, so that has improved).
This will take 10+ minutes
With DuckDB on Polars Dataframes:
Expected behavior
Be as fast as DuckDB xD
Installed versions