This project benchmarks two simple queries against Clickhouse and DuckDB. It imports ~1.2GB of Parquet data into each datastore and runs the benchmarks as native queries. It runs all benchmark iterations in the same session and does not reset caches.
Warning: This is far from a rigorous benchmark.
Benchmarks:
duckdb:groupby: avg=1.390s min=1.301s max=1.543s (10 runs)
clickhouse:groupby: avg=0.805s min=0.705s max=1.059s (10 runs)
duckdb:self-join: avg=0.635s min=0.616s max=0.743s (10 runs)
clickhouse:self-join: avg=0.701s min=0.660s max=0.743s (10 runs)
Executed on a Macbook Pro (2018) with 2.2 GHz 6-Core Intel Core i7 and 16 GB memory.
Disk usage:
Parquet: 1.2G
Clickhouse: 2.2G
DuckDB: 7.1G
Executable binary size:
Clickhouse: 363M
DuckDB: 37M
-
Clone this repo and
cd
into it -
Download data (~1.2GB of NYC taxi data)
mkdir -p data
cd data
curl https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv -o taxi_zone_lookup.csv
curl -O https://nyc-tlc.s3.amazonaws.com/trip+data/fhvhv_tripdata_2022-01.parquet
curl -O https://nyc-tlc.s3.amazonaws.com/trip+data/fhvhv_tripdata_2022-02.parquet
curl -O https://nyc-tlc.s3.amazonaws.com/trip+data/fhvhv_tripdata_2022-03.parquet
cd ..
- Install and run Clickhouse (source):
mkdir -p clickhouse
cd clickhouse
curl https://clickhouse.com/ | sh
./clickhouse server
- In a new terminal, import data into Clickhouse:
./clickhouse/clickhouse client --queries-file clickhouse_create_trips.sql
./clickhouse/clickhouse client --max_memory_usage 0 --query='INSERT INTO trips FORMAT Parquet' < data/fhvhv_tripdata_2022-01.parquet
./clickhouse/clickhouse client --max_memory_usage 0 --query='INSERT INTO trips FORMAT Parquet' < data/fhvhv_tripdata_2022-02.parquet
./clickhouse/clickhouse client --max_memory_usage 0 --query='INSERT INTO trips FORMAT Parquet' < data/fhvhv_tripdata_2022-03.parquet
- Install DuckDB:
mkdir -p duckdb
cd duckdb
# Works on Intel Macs (for ARM and Linux, see: https://duckdb.org/docs/installation/index)
curl -O -L https://github.com/duckdb/duckdb/releases/download/v0.3.2/duckdb_cli-osx-amd64.zip
unzip duckdb_cli-osx-amd64.zip
cd ..
- Insert data into DuckDB:
./duckdb/duckdb ./duckdb/db.duckdb -c "CREATE TABLE trips AS SELECT * FROM read_parquet('data/fhvhv_tripdata_2022-01.parquet')"
./duckdb/duckdb ./duckdb/db.duckdb -c "INSERT INTO trips SELECT * FROM read_parquet('data/fhvhv_tripdata_2022-02.parquet')"
./duckdb/duckdb ./duckdb/db.duckdb -c "INSERT INTO trips SELECT * FROM read_parquet('data/fhvhv_tripdata_2022-03.parquet')"
- Run the benchmark (requires Poetry):
poetry install
poetry run python benchmark.py
- Compute data size on disk:
echo "Parquet:" && du -hs data
echo "Clickhouse:" && du -hs clickhouse/store
echo "DuckDB:" && du -hs duckdb/db.duckdb
- Compute executable size:
echo "Clickhouse:" && du -hs clickhouse/clickhouse
echo "DuckDB:" && du -hs duckdb/duckdb