Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Polars cross join 50x slower than DuckDB cross join #15456

Closed
2 tasks done
ion-elgreco opened this issue Apr 3, 2024 · 6 comments
Closed
2 tasks done

Polars cross join 50x slower than DuckDB cross join #15456

ion-elgreco opened this issue Apr 3, 2024 · 6 comments
Assignees
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@ion-elgreco
Copy link
Contributor

ion-elgreco commented Apr 3, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

See comment below.

Log output

No response

Issue description

Cross joins with polars are lot's slower than in duckdb, with streaming is the only way to get a result. If I don't project less columns in df with a select than it will take 10+ minutes, if I just select the relevant columns than it will take only 40 second on 0.20.18 (in 0.20.10 it was double, so that improved).

This will take 10+ minutes

df_polars_only = df.lazy().join(
    parts.lazy(),
    how='cross'
).filter(
    (pl.col('id') == pl.col('id_right')) & 
    (pl.col('start') <= pl.col('date') ) &
    (pl.col('end') >= pl.col('date'))
).collect(streaming=True)

With DuckDB on Polars Dataframes:

"""
SELECT *
FROM df
    CROSS JOIN parts
WHERE
    df.id== parts.id
    and df.start<= parts.date
    and df.end>= parts.date
"""

image

Expected behavior

Be as fast as DuckDB xD

Installed versions

--------Version info---------
Polars:               0.20.18
Index type:           UInt32
Platform:             Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python:               3.10.12 (main, Aug  9 2023, 14:47:34) [GCC 9.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.1
connectorx:           <not installed>
deltalake:            0.15.3
fastexcel:            0.10.2
fsspec:               2024.3.1
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.3
nest_asyncio:         1.6.0
numpy:                1.23.5
openpyxl:             3.1.2
pandas:               1.5.3
pyarrow:              15.0.2
pydantic:             1.10.14
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           1.4.52
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@ion-elgreco ion-elgreco added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Apr 3, 2024
@ion-elgreco
Copy link
Contributor Author

@stinodego Here is a reproducible example 😄

import datetime
import random

def random_date(start, end):
    """Generate a random datetime between `start` and `end`"""
    return start + datetime.timedelta(
        # Get a random amount of seconds between `start` and `end`
        seconds=random.randint(0, int((end - start).total_seconds())),
    )

df = pl.DataFrame({
    "id":list(range(0,1000))*1500,
    "start_date": [random_date(datetime.datetime(2015,1,1), datetime.datetime(2020,1,1),) for i in range(1500000)]
}).with_columns(end_date = pl.col('start_date') + pl.duration(hours=random.randint(24,240)))

parts = df.sample(1700).select(
    'id',
    pl.concat_list(pl.col('start_date','end_date')).list.mean().alias('date').cast(pl.Datetime)
)

Execute polars code:

df_polars_only = df.lazy().join(
    parts.lazy(),
    how='cross'
).filter(
    (pl.col('id') == pl.col('id')) & 
    (pl.col('start_date') <= pl.col('date') ) &
    (pl.col('end_date') >= pl.col('date'))
).collect(streaming=True)

timings: 16.7 s ± 1.98 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now execute in DuckDB:

sqlcode = """
SELECT *
FROM df
    CROSS JOIN parts
WHERE
    df.id== parts.id
    and df.start_date<= parts.date
    and df.end_date>= parts.date
"""
duckdb.sql(sqlcode).pl()

Timings: 18.7 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@avimallu
Copy link
Contributor

avimallu commented Apr 3, 2024

This is why Polars really needs non-equi joins. DuckDB's EXPLAIN will tell you that it converts this cross join to an inner join:

┌───────────────────────────┐                             
│         PROJECTION        │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│             id            │                             
│         start_date        │                             
│          end_date         │                             
│             id            │                             
│            date           │                             
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
│         HASH_JOIN         │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│           INNER           │                             
│          id = id          │                             
│      end_date >= date     ├──────────────┐              
│     start_date <= date    │              │              
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │              │              
│           EC: 1           │              │              
│          Cost: 1          │              │              
└─────────────┬─────────────┘              │                                           
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         ARROW_SCAN        ││         ARROW_SCAN        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             id            ││             id            │
│         start_date        ││            date           │
│          end_date         ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││           EC: 1           │
│           EC: 1           ││                           │
└───────────────────────────┘└───────────────────────────┘                             

To get it, just use:

sqlcode = """
EXPLAIN
SELECT *
FROM df
    CROSS JOIN parts
WHERE
    df.id== parts.id
    and df.start_date<= parts.date
    and df.end_date>= parts.date
"""
print(duckdb.sql(sqlcode).pl().get_column("explain_value").to_list()[0])

Underlying issue is #10068.

@ion-elgreco
Copy link
Contributor Author

@avimallu right, then I can just rewrite it as an inner join ^^

@avimallu
Copy link
Contributor

avimallu commented Apr 3, 2024

But not as an inner join on inequality conditions, since Polars doesn't support those yet, right?

(Don't know if an inner non-equi join has a specific name.)

@ion-elgreco
Copy link
Contributor Author

Just doing an inner join on ID first and then filter afterwards is giving the same results

@ritchie46
Copy link
Member

Polars doesn't have non-equi joins yet. There is a tracking issue #10068

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars
Projects
None yet
Development

No branches or pull requests

4 participants