Add benchmark on Pandas DataFrame for Pandas, Polars, DuckDB, chDB #222

auxten · 2024-09-09T09:18:22Z

Add benchmark on Pandas DataFrame for Pandas, Polars, DuckDB, chDB

Pandas: 2.2.2
Polars: 1.6.0
DuckDB: 1.0.0
chDB: 2.0.3

The benchmark assumes the data is already in memory, so the time spent reading Parquet and converting it into a Pandas DataFrame is not counted as load time. But as Polars needs to convert the Pandas DataFrame into a Polars DataFrame to make it work, the conversion time is recorded in the "load_time".
The Pandas and Polars queries use their own APIs to emulate the SQL operations.
During the queries, Polars crashed the Python process several times and I don't know how to make it work, so Q39 and Q42 are marked as failed.
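
The split between load time and query time described above can be sketched as follows. This is only an illustration, not the benchmark's actual harness: timed, convert, and run_query are hypothetical names, and a plain list stands in for the in-memory DataFrame.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical stand-ins: in the real benchmark `source` would be a Pandas
# DataFrame that is already in memory, and `convert` the Pandas -> Polars step.
source = list(range(1_000_000))


def convert(data):
    return tuple(data)  # placeholder for the conversion step


def run_query(data):
    return sum(data)  # placeholder for one benchmark query


converted, load_time = timed(convert, source)     # reported as "load_time"
result, query_time = timed(run_query, converted)  # reported as query time

print(f"load_time={load_time:.4f}s query_time={query_time:.4f}s")
```

Reading the source data happens before either timer starts, which mirrors the assumption that the data is already in memory.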

Polars Q39 crash message:

Crash with:
  thread '<unnamed>' panicked at crates/polars-time/src/windows/duration.rs:215:21:
  expected leading integer in the duration string, found m
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
lambda x: x.filter(
    (pl.col("CounterID") == 62)
    & (pl.col("EventDate") >= pl.datetime(2013, 7, 1))
    & (pl.col("EventDate") <= pl.datetime(2013, 7, 31))
    & (pl.col("IsRefresh") == 0)
)
.group_by(
    [
        "TraficSourceID",
        "SearchEngineID",
        "AdvEngineID",
        # pl.when(pl.col("SearchEngineID").eq(0) & pl.col("AdvEngineID").eq(0))
        # .then(pl.col("Referer"))
        # .otherwise("")
        # .alias("Src"),
        "URL",
    ]
)
.agg(pl.len().alias("PageViews"))
.sort("PageViews", descending=True)
.slice(1000, 10),

Polars Q42 crash message:

Crash with:
  thread '<unnamed>' panicked at crates/polars-time/src/windows/duration.rs:215:21:
  expected leading integer in the duration string, found m
lambda x: x.filter(
    (pl.col("CounterID") == 62)
    & (pl.col("EventDate") >= pl.datetime(2013, 7, 14))
    & (pl.col("EventDate") <= pl.datetime(2013, 7, 15))
    & (pl.col("IsRefresh") == 0)
    & (pl.col("DontCountHits") == 0)
)
.group_by(pl.col("EventTime").dt.truncate("minute"))
.agg(pl.len().alias("PageViews"))
.slice(1000, 10),

auxten · 2024-09-09T09:59:07Z

@rschu1ze @alexey-milovidov Please have a look

rschu1ze · 2024-09-11T21:30:49Z

duckdb-dataframe/benchmark.sh

+
+sudo apt-get update
+sudo apt-get install -y python3-pip
+pip install pandas chdb


I had to use pip install --break-system-packages here (pip complained without the flag). Interestingly, the pip install commands in the other ClickBench scripts don't use that flag either, so I am okay with not having it (it might be some weirdness with my system).

I ran into the same problem and checked how the other Python packages handle it. I think this is a new problem introduced with Ubuntu 24.
In the end I decided to keep it identical to the other packages. Maybe we could fix all of them in another patch.

duckdb-dataframe/benchmark.sh

rschu1ze

Double-checked the measurements on a c6a.metal box, and they reproduced nicely (with a few percent deviation here and there, but that is expected).

qoega · 2024-09-12T09:02:28Z

Is it meant to be added to the main report? https://github.com/ClickHouse/ClickBench/blob/main/index.html

Or you can have a separate one as we have for hardware and versions.

auxten · 2024-09-12T09:07:17Z

Is it meant to be added to the main report? https://github.com/ClickHouse/ClickBench/blob/main/index.html

Or you can have a separate one as we have for hardware and versions.

I added a "dataframe" type. You can check it with this link

qoega · 2024-09-12T09:23:16Z

Polars Q0 and Q29 - did we check the result? Did we check that it is not using query result caching for 29?

qoega · 2024-09-12T09:25:47Z

polars/query.py

+        "Q29",
+        "SELECT SUM(ResolutionWidth), SUM(ResolutionWidth + 1), SUM(ResolutionWidth + 2), SUM(ResolutionWidth + 3), SUM(ResolutionWidth + 4), SUM(ResolutionWidth + 5), SUM(ResolutionWidth + 6), SUM(ResolutionWidth + 7), SUM(ResolutionWidth + 8), SUM(ResolutionWidth + 9), SUM(ResolutionWidth + 10), SUM(ResolutionWidth + 11), SUM(ResolutionWidth + 12), SUM(ResolutionWidth + 13), SUM(ResolutionWidth + 14), SUM(ResolutionWidth + 15), SUM(ResolutionWidth + 16), SUM(ResolutionWidth + 17), SUM(ResolutionWidth + 18), SUM(ResolutionWidth + 19), SUM(ResolutionWidth + 20), SUM(ResolutionWidth + 21), SUM(ResolutionWidth + 22), SUM(ResolutionWidth + 23), SUM(ResolutionWidth + 24), SUM(ResolutionWidth + 25), SUM(ResolutionWidth + 26), SUM(ResolutionWidth + 27), SUM(ResolutionWidth + 28), SUM(ResolutionWidth + 29), SUM(ResolutionWidth + 30), SUM(ResolutionWidth + 31), SUM(ResolutionWidth + 32), SUM(ResolutionWidth + 33), SUM(ResolutionWidth + 34), SUM(ResolutionWidth + 35), SUM(ResolutionWidth + 36), SUM(ResolutionWidth + 37), SUM(ResolutionWidth + 38), SUM(ResolutionWidth + 39), SUM(ResolutionWidth + 40), SUM(ResolutionWidth + 41), SUM(ResolutionWidth + 42), SUM(ResolutionWidth + 43), SUM(ResolutionWidth + 44), SUM(ResolutionWidth + 45), SUM(ResolutionWidth + 46), SUM(ResolutionWidth + 47), SUM(ResolutionWidth + 48), SUM(ResolutionWidth + 49), SUM(ResolutionWidth + 50), SUM(ResolutionWidth + 51), SUM(ResolutionWidth + 52), SUM(ResolutionWidth + 53), SUM(ResolutionWidth + 54), SUM(ResolutionWidth + 55), SUM(ResolutionWidth + 56), SUM(ResolutionWidth + 57), SUM(ResolutionWidth + 58), SUM(ResolutionWidth + 59), SUM(ResolutionWidth + 60), SUM(ResolutionWidth + 61), SUM(ResolutionWidth + 62), SUM(ResolutionWidth + 63), SUM(ResolutionWidth + 64), SUM(ResolutionWidth + 65), SUM(ResolutionWidth + 66), SUM(ResolutionWidth + 67), SUM(ResolutionWidth + 68), SUM(ResolutionWidth + 69), SUM(ResolutionWidth + 70), SUM(ResolutionWidth + 71), SUM(ResolutionWidth + 72), SUM(ResolutionWidth + 73), SUM(ResolutionWidth + 74), SUM(ResolutionWidth + 75), SUM(ResolutionWidth + 76), SUM(ResolutionWidth + 77), SUM(ResolutionWidth + 78), SUM(ResolutionWidth + 79), SUM(ResolutionWidth + 80), SUM(ResolutionWidth + 81), SUM(ResolutionWidth + 82), SUM(ResolutionWidth + 83), SUM(ResolutionWidth + 84), SUM(ResolutionWidth + 85), SUM(ResolutionWidth + 86), SUM(ResolutionWidth + 87), SUM(ResolutionWidth + 88), SUM(ResolutionWidth + 89) FROM hits;",
+        lambda x: x["ResolutionWidth"].sum()
+        + x["ResolutionWidth"].shift(1).sum()


What does shift do (link)? We just needed +1, +2, etc., not to play with indices.

You are right, both Pandas and Polars are incorrect. I'll fix them later.
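
To illustrate the point about shift: it moves values along the index and does not add anything to them, so it cannot stand in for ResolutionWidth + 1. A minimal pandas sketch on made-up values (an illustration only, not the fix that was later committed):

```python
import pandas as pd

# Made-up values standing in for the real ResolutionWidth column.
df = pd.DataFrame({"ResolutionWidth": [1, 2, 3]})
col = df["ResolutionWidth"]

# shift(1) moves every value down one row and leaves a missing value on top,
# so its sum simply loses the last element.
shifted_sum = col.shift(1).sum()

# SUM(ResolutionWidth + 1) adds 1 to every row before summing.
plus_one_sum = (col + 1).sum()

# One result per SUM(ResolutionWidth + i) term of Q29, for i = 0..89.
q29 = [(col + i).sum() for i in range(90)]

print(shifted_sum, plus_one_sum, q29[:3])
```

The list keeps the 90 sums separate, which matches the SQL returning 90 columns rather than one combined total.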

auxten · 2024-09-12T09:41:37Z

Polars Q0 and Q29 - did we check the result? Did we check that it is not using query result caching for 29?

As I mentioned:

But as Polars needs to convert the Pandas DataFrame into a Polars DataFrame to make it work, the conversion time is recorded in the "load_time".

For Q0, I think the Polars DataFrame just keeps some statistics that make Q0 super fast.

ritchie46 · 2024-11-25T11:30:16Z

Polars should use the lazy API. I see that the eager API was used, which forces Polars to materialize every operation and doesn't allow any optimizations.

alexey-milovidov · 2024-11-25T13:03:41Z

Thanks, let's edit and re-run.

auxten added 4 commits September 9, 2024 17:18

Ignore parquet

8f26e05

Add pandas bench code on dataframe

822f74d

Add polars benchmark code on dataframe

9a567df

Add polars, duckdb-dataframe, chdb-dataframe

986c9b3

auxten force-pushed the main branch 2 times, most recently from 8738493 to 8a49048 Compare September 9, 2024 09:50

Update benchmark data on c6a.metal

5ba709a

auxten force-pushed the main branch from 8a49048 to 5ba709a Compare September 9, 2024 09:51

rschu1ze self-assigned this Sep 9, 2024

rschu1ze reviewed Sep 11, 2024

Update benchmark.sh

925e831

rschu1ze approved these changes Sep 11, 2024

rschu1ze merged commit d16c5b2 into ClickHouse:main Sep 11, 2024

qoega reviewed Sep 12, 2024

rschu1ze mentioned this pull request Sep 12, 2024

Add --break-system-packages to 'pip install' #223

Merged

auxten mentioned this pull request Nov 26, 2024

Polars benchmark are very suboptimal. #268

Closed
