
ENH: Delta Lake file format support #40573

Closed
Sutyke opened this issue Mar 22, 2021 · 5 comments
Labels
Enhancement, IO Data (IO issues that don't fit into a more specific label)

Comments

@Sutyke

Sutyke commented Mar 22, 2021

I'd love it if pandas could support Databricks' Delta Lake file format (https://github.com/delta-io/delta). It's a versioned, Parquet-based file format that supports updates/inserts/deletions.

In the past, Delta Lake ran only on Spark, but there are now connectors that don't require Spark:

https://pypi.org/project/deltalake/
https://github.com/delta-io/connectors
https://github.com/delta-io/delta-rs/tree/main/python

@Liam3851
Contributor

Someone else can chime in, but I imagine the first step here would be for Databricks to write a connector to Apache Arrow. pandas uses Arrow for its Parquet support, so if Arrow were to support Delta, this would be a simpler addition. You may wish to suggest this to them.

That said, this might not be possible, at least for Delta's primary use case of ACID transaction support. The current connectors appear to assume something that uses or can use a Hive metastore (i.e. Spark, Hive, Presto, Athena) to preserve the atomicity guarantees of the Delta Lake format; Delta Lake rewrites the metastore files atomically to preserve its ACID guarantees. To my knowledge, Arrow itself is not Hive-metastore aware and so would not be a candidate for this; pandas certainly is not. pandas or Arrow could in theory read the underlying files as with plain Parquet (as sketched below), but since the point of Delta is ACID transactions, the metastore is crucial in a way that isn't true of a standard Parquet table. So if you need that support, you're back to using something that can read the files through a Hive metastore, i.e. Spark.
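To illustrate the last point, here is a minimal sketch (with a hypothetical path) of naively reading a Delta table directory as plain Parquet; since pandas knows nothing about the `_delta_log` transaction log, such a read has no snapshot isolation and can include superseded data files.

```python
import pandas as pd

# Hypothetical path to a Delta table directory. The pyarrow engine reads
# every *.parquet file under the path and ignores the _delta_log JSON
# transaction log, so removed or uncommitted data files are included too.
df = pd.read_parquet("/path/to/delta-table")
```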

@jbrockmendel added the Enhancement and IO Data labels Mar 24, 2021
@houqp
Contributor

houqp commented Mar 25, 2021

@Sutyke doesn't https://github.com/delta-io/delta-rs/tree/main/python#usage (i.e. https://pypi.org/project/deltalake/) already solve this use case? See the examples in that readme for how to load a Delta table into a pandas DataFrame. delta-rs natively reads Delta tables without the need for any external dependencies or frameworks like the JVM, Hive, or Spark, and it preserves ACID guarantees during reads/writes. The core of the deltalake PyPI package is a full cross-platform Delta Lake implementation in pure Rust.
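For reference, a minimal sketch of that readme's usage, assuming a hypothetical table path (exact methods may vary across deltalake versions):

```python
from deltalake import DeltaTable

# Open an existing Delta table (hypothetical path).
dt = DeltaTable("/path/to/delta-table")

print(dt.version())  # current table version
print(dt.files())    # data files belonging to this version

# Materialize the table as a pandas DataFrame via Arrow.
df = dt.to_pyarrow_table().to_pandas()
```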

https://github.com/delta-io/connectors, on the other hand, requires the JVM and Hive; it's for accessing Delta tables outside of Spark from other JVM-based big data query engines.

@houqp
Contributor

houqp commented Mar 25, 2021

Perhaps we could add https://pypi.org/project/deltalake/ as an optional extra dependency to pandas itself to make deltalake support work out of the box for pandas users?

@Sutyke
Author

Sutyke commented Mar 25, 2021

@houqp that would be a great idea, some way to add it as an extra dependency.
@Liam3851 Currently pandas.read_parquet, for example, takes an engine parameter. Since Delta tables are actually Parquet files, what about just adding an extra deltalake engine?

If engine is set to deltalake, it would be used to read Parquet files in Delta Lake format:
engine : {'auto', 'pyarrow', 'fastparquet', 'deltalake' (new)}, default 'auto'
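Hypothetically, usage under that proposal would look like the sketch below; this engine option was only proposed in this issue and never added to pandas.

```python
import pandas as pd

# Hypothetical: a 'deltalake' engine for read_parquet, as proposed in this
# issue. Never implemented in pandas; shown only to illustrate the idea.
df = pd.read_parquet("/path/to/delta-table", engine="deltalake")
```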

@mroeschke
Member

Thanks for the suggestion, but based on the response from core devs in #49692, there isn't much appetite to maintain a read_deltalake function, especially since we already link to deltalake in the ecosystem docs and the package provides a to_pandas function: https://pandas.pydata.org/docs/ecosystem.html#deltalake. Closing.
