Polars support for streaming Parquet over HTTP
simonw authored Nov 17, 2023
1 parent 29bb8c8 commit 0d8818d
Showing 1 changed file with 22 additions and 0 deletions.
22 changes: 22 additions & 0 deletions duckdb/remote-parquet.md
@@ -385,3 +385,25 @@ Output:
Peak memory usage: 458.88 KiB.

This transfers around 290 MiB, effectively the same as DuckDB.

## Polars support for streaming Parquet over HTTP

[Polars](https://github.com/pola-rs/polars) provides "blazingly fast DataFrames in Rust, Python, Node.js, R and SQL". Apparently inspired by this post, they added support for answering this kind of query by streaming portions of Parquet files over HTTP in [#12493: Downloading https dataset doesn't appropriately push down predicates](https://github.com/pola-rs/polars/issues/12493).

In the released version, the following [should work](https://github.com/pola-rs/polars/issues/12493#issuecomment-1814763393):

```python
import polars as pl

base_url = "https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/"

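# Build the URLs of the 56 Parquet files: 000000.parquet through 000055.parquet
files = []

for i in range(56):
    url = f"{base_url}{i:06}.parquet"
    files.append(url)


# Build a lazy query that sums the "size" column; it runs when collect() is called
df = pl.scan_parquet(files).select(pl.col("size").sum())
print(df.collect())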
```
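
To check which files that loop asks Polars to read, here is a minimal standard-library sketch of the URL construction on its own. The `parquet_urls` helper is a hypothetical name introduced for illustration and is not part of Polars; the `{i:06}` format spec zero-pads the index to six digits.

```python
# Same URL prefix as in the Polars example.
base_url = "https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/"


def parquet_urls(count=56):
    """Hypothetical helper: return the URLs of the first `count` Parquet files."""
    # {i:06} pads the index with zeros to six digits, e.g. 7 -> "000007"
    return [f"{base_url}{i:06}.parquet" for i in range(count)]


files = parquet_urls()
print(len(files))   # number of files in the list
print(files[0])     # first URL, ending in 000000.parquet
print(files[-1])    # last URL, ending in 000055.parquet
```

The resulting `files` list can be passed to `pl.scan_parquet(files)` exactly as in the example above.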
