S3 support for Parquet #40

Open
wants to merge 2 commits into
base: master

catkins commented Nov 25, 2023

resolves #37

This is a rough cut of the extra plumbing required to work with Parquet files in S3. It leans on the smarts of object_store and polars to do predicate pushdown and retrieve only the required byte ranges from S3.

Example

df = Polars.scan_parquet(
  's3://ookla-open-data/parquet/performance/type=fixed/year=2023/quarter=3/2023-07-01_performance_fixed_tiles.parquet',
  storage_options: {
    aws_region: 'us-west-2',
    aws_access_key_id: ENV['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key: ENV['AWS_SECRET_ACCESS_KEY'],
  },
)

df.select("avg_d_kbps").median.collect
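
To make the "only retrieve the required byte ranges" idea concrete, here is a minimal conceptual sketch in plain Ruby. It is not the code in this PR, nor how object_store or polars implement it: the `RowGroup` struct, the offsets, and the min/max statistics are all made-up stand-ins for the per-row-group metadata in a Parquet footer. It only shows the general shape of the optimisation: use the statistics to skip row groups that cannot match a predicate, then request just the remaining byte ranges.

```ruby
# Hypothetical row-group metadata, shaped like the min/max statistics a
# Parquet footer stores per column chunk. All values here are invented.
RowGroup = Struct.new(:offset, :length, :min, :max)

ROW_GROUPS = [
  RowGroup.new(4, 1_000, 0, 49_999),
  RowGroup.new(1_004, 1_200, 50_000, 149_999),
  RowGroup.new(2_204, 900, 150_000, 900_000),
].freeze

# Return the inclusive byte ranges of the row groups that could contain a
# value greater than `threshold`. Row groups whose max is too small are
# skipped, so their bytes never need to be fetched.
def ranges_to_fetch(row_groups, threshold)
  row_groups
    .select { |rg| rg.max > threshold }
    .map { |rg| (rg.offset..(rg.offset + rg.length - 1)) }
end

# Format a range the way an HTTP Range header expects it.
def range_header(range)
  "bytes=#{range.begin}-#{range.end}"
end

ranges = ranges_to_fetch(ROW_GROUPS, 100_000)
puts ranges.map { |r| range_header(r) }
```

With a predicate such as `avg_d_kbps > 100_000`, the first row group in this toy metadata is ruled out by its statistics, so only two ranged reads would be issued instead of a full download.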

For the most part, I borrowed the existing patterns from py-polars.

https://github.com/pola-rs/polars/blob/main/py-polars/src/lazyframe.rs/#L258-L313

@catkins
Author

catkins commented Nov 29, 2023

Any thoughts on the approach @ankane?

@catkins
Author

catkins commented Jan 11, 2024

@ankane any interest in accepting a change like this? If so, I can rebase and add some test coverage for the changes.

@benben

benben commented Feb 14, 2024

I'd love to have this available too! Happy to contribute if help is needed.

@catkins
Author

catkins commented Feb 14, 2024

In #37, @ankane mentioned that he was hesitant about pulling a Rust TLS library into the pola.rs build 🤷‍♀️

I tried to do it in "userspace" on the Ruby side, but there was a lot of clever stuff around the range queries and pushdown that would have required quite a lot of surgery to the Magnus bindings, so I didn't persevere.
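
For contrast, the simplest "userspace" fallback would be to download the whole object and scan the local copy. The sketch below is an assumption-laden illustration, not something from this PR: `with_downloaded_parquet` and the `fetch` callable are hypothetical names, and the fetcher is a stand-in so the example runs without network access. It shows why this route loses the benefit discussed above: every byte is transferred before any predicate can be applied.

```ruby
require "tempfile"

# `fetch` is a hypothetical callable that returns the full object body as a
# String (for example, a thin wrapper around an S3 client's get-object call).
def with_downloaded_parquet(fetch)
  Tempfile.create(["download", ".parquet"]) do |file|
    file.binmode
    file.write(fetch.call) # the whole object is transferred; no range requests
    file.flush
    yield file.path        # e.g. hand the path to Polars.scan_parquet
  end
end

# Stand-in fetcher so the sketch runs offline; the bytes are not real Parquet.
fake_fetch = -> { "PAR1 placeholder bytes PAR1" }

size = with_downloaded_parquet(fake_fetch) { |path| File.size(path) }
puts size
```

The temporary file is removed when the block returns, so any lazy query would need to be collected inside the block.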

@benben

benben commented Feb 14, 2024

I think

a) this lib should be at feature parity with all the other libs
b) S3 should be handled in Polars/Rust itself to keep it fast and optimized

so 👍 for this PR

@DeflateAwning

This looks awesome! Would love it if this got rebased onto master so it can be reviewed/merged!

I would use this feature 100%!

@DeflateAwning

Any advice on the state of this one? @catkins, would you be willing to update this with the current state of the rest of the project (i.e., rebase or merge in main)?

@catkins
Author

catkins commented Sep 25, 2024

No updates, but if @ankane is keen on it, I can rebase and add some basic test coverage?

Development

Successfully merging this pull request may close these issues.

Feature: Add support for scanning parquet from cloud storage / S3
3 participants