ENH: Delta Lake file format support #40573
I'd love it if pandas could support Databricks' Delta Lake file format (https://github.com/delta-io/delta). It's a versioned, Parquet-based format that supports updates, inserts, and deletions.

In the past Delta Lake ran only on Spark, but there are now connectors that don't require Spark:

https://pypi.org/project/deltalake/
https://github.com/delta-io/connectors
https://github.com/delta-io/delta-rs/tree/main/python

Comments
Someone else can chime in, but I imagine the first step here would be for Databricks to write a connector to Apache Arrow, since pandas uses Arrow for its Parquet support; if Arrow supported Delta, this would be a much simpler addition. You may wish to suggest this to them.

That said, this might not be possible, at least not for Delta's primary use case of ACID transactions. The current connectors appear to assume an engine that uses, or can use, a Hive metastore (e.g. Spark, Hive, Presto, Athena) to preserve the atomicity guarantees of the Delta Lake format: Delta Lake rewrites the metastore's files atomically to preserve its ACID guarantees. To my knowledge, Arrow itself is not Hive-metastore aware and so would not be a candidate for this; pandas certainly is not.

Pandas or Arrow could in theory read the underlying files as with plain Parquet (sketched below), but since the point of Delta is ACID transactions, the metastore is crucial in a way that isn't true of a standard Parquet table. So if you need that support, you're back to using something that can read the files through a Hive metastore, i.e. Spark.
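To make that caveat concrete, here is a minimal sketch of the naive approach, assuming a hypothetical local path my_delta_table/:

```python
# Naive read that ignores the Delta transaction log entirely.
import pandas as pd

# pyarrow's dataset discovery skips underscore-prefixed paths such as
# _delta_log/, so this loads every Parquet data file under the directory,
# including files a later transaction has already removed from the table.
# The result can therefore mix rows from different table versions.
df = pd.read_parquet("my_delta_table/")
```

This works mechanically, but it gives no snapshot isolation, which is exactly the ACID gap described above.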
@Sutyke doesn't https://github.com/delta-io/delta-rs/tree/main/python#usage (i.e. https://pypi.org/project/deltalake/) already solve this use case? See the examples in that readme for how to load a Delta table into a pandas DataFrame. delta-rs reads Delta tables natively, without external dependencies or frameworks such as the JVM, Hive, or Spark, and it preserves ACID guarantees during reads and writes. The core of the deltalake PyPI package is a full cross-platform Delta Lake implementation in pure Rust.

https://github.com/delta-io/connectors, on the other hand, requires the JVM and Hive; it is for accessing Delta tables outside of Spark from other JVM-based big data query engines.
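For reference, a minimal sketch of the pattern that readme shows, assuming a recent deltalake release (the table path is a placeholder):

```python
# Load a Delta table into pandas without Spark or a JVM, via delta-rs.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

dt = DeltaTable("path/to/table")        # replays the _delta_log
df = dt.to_pyarrow_table().to_pandas()  # materialize as a DataFrame

# Writes go through the same transaction log, so the ACID guarantees
# mentioned above are preserved.
write_deltalake("path/to/table", df, mode="append")
```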
Perhaps we could add https://pypi.org/project/deltalake/ as an optional extra dependency of pandas itself, to make Delta Lake support work out of the box for pandas users?
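As an illustration of what an optional extra could look like at import time (a sketch only, not pandas' actual optional-dependency mechanism):

```python
# Guarded import for an optional dependency; the error message is
# illustrative, not pandas' real wording.
try:
    from deltalake import DeltaTable
except ImportError as err:
    raise ImportError(
        "Delta Lake support requires the optional 'deltalake' package: "
        "pip install deltalake"
    ) from err
```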
@houqp that would be a great idea; adding it as an extra dependency seems like the right approach. If engine is set to deltalake, it would be used to read Parquet files stored in Delta Lake format:
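pandas has no deltalake engine today, so the following is purely a hypothetical sketch of that proposal; read_table is an invented wrapper name, and engine="deltalake" is not a real pandas option:

```python
# Hypothetical dispatch on engine="deltalake", implemented outside pandas.
import pandas as pd
from deltalake import DeltaTable

def read_table(path: str, engine: str = "pyarrow") -> pd.DataFrame:
    if engine == "deltalake":
        # Let delta-rs resolve the transaction log and read only the
        # files belonging to the current table version.
        return DeltaTable(path).to_pyarrow_table().to_pandas()
    # Plain Parquet read with the requested pandas engine.
    return pd.read_parquet(path, engine=engine)

df = read_table("path/to/delta_table", engine="deltalake")
```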
Thanks for the suggestion, but based on the response from the core devs in #49692, there isn't much appetite to maintain a dedicated Delta Lake reader in pandas itself.