
ENH: Delta Lake file format support #40573

Closed
Sutyke opened this issue Mar 22, 2021 · 5 comments
Labels
Enhancement, IO Data (IO issues that don't fit into a more specific label)

Comments

@Sutyke

Sutyke commented Mar 22, 2021

I'd love it if pandas could support Databricks' Delta Lake file format (https://github.com/delta-io/delta). It's a versioned, Parquet-based file format that supports updates/inserts/deletions.

In the past, Delta Lake ran only on Spark, but there are now connectors that don't require Spark:

https://pypi.org/project/deltalake/
https://github.com/delta-io/connectors
https://github.com/delta-io/delta-rs/tree/main/python

@Liam3851
Contributor

Someone else can chime in, but I imagine the first step here would be for Databricks to write a connector to Apache Arrow. pandas uses Arrow for its Parquet support, so if Arrow were to support Delta, this would be a simpler addition. You may wish to suggest this to them.

That said, this might not be possible, at least for Delta's primary use case of ACID transaction support. The current connectors appear to assume something that uses or can use a Hive metastore (i.e. Spark, Hive, Presto, Athena) to preserve the atomicity guarantees of the Delta Lake format; Delta Lake rewrites the metastore files atomically to preserve its ACID guarantees. To my knowledge, Arrow itself is not Hive-metastore aware and so would not be a candidate for this; pandas certainly is not. pandas or Arrow could in theory read the underlying files as with plain Parquet (as sketched below), but since the point of Delta is ACID transactions, the metastore is crucial in a way that isn't true of a standard Parquet table. So if you need that support, you're back to using something that can read the files through a Hive metastore, i.e. Spark.
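To illustrate the last point, here is a minimal sketch (with a hypothetical path) of naively reading a Delta table directory as plain Parquet; since pandas knows nothing about the `_delta_log` transaction log, such a read has no snapshot isolation and can include superseded data files.

```python
import pandas as pd

# Hypothetical path to a Delta table directory. The pyarrow engine reads
# every *.parquet file under the path and ignores the _delta_log JSON
# transaction log, so removed or uncommitted data files are included too.
df = pd.read_parquet("/path/to/delta-table")
```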

@jbrockmendel added the Enhancement and IO Data labels Mar 24, 2021
@houqp
Contributor

houqp commented Mar 25, 2021

@Sutyke doesn't https://github.com/delta-io/delta-rs/tree/main/python#usage (i.e. https://pypi.org/project/deltalake/) already solve this use case? See the examples in that readme for how to load a Delta table into a pandas DataFrame. delta-rs natively reads Delta tables without the need for any external dependencies or frameworks like the JVM, Hive, or Spark, and it preserves ACID guarantees during reads/writes. The core of the deltalake PyPI package is a full cross-platform Delta Lake implementation in pure Rust.
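For reference, a minimal sketch of that readme's usage, assuming a hypothetical table path (exact methods may vary across deltalake versions):

```python
from deltalake import DeltaTable

# Open an existing Delta table (hypothetical path).
dt = DeltaTable("/path/to/delta-table")

print(dt.version())  # current table version
print(dt.files())    # data files belonging to this version

# Materialize the table as a pandas DataFrame via Arrow.
df = dt.to_pyarrow_table().to_pandas()
```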

https://github.com/delta-io/connectors, on the other hand, requires the JVM and Hive; it's for accessing Delta tables outside of Spark from other JVM-based big data query engines.

@houqp
Contributor

houqp commented Mar 25, 2021

Perhaps we could add https://pypi.org/project/deltalake/ as an optional extra dependency to pandas itself to make deltalake support work out of the box for pandas users?

@Sutyke
Author

Sutyke commented Mar 25, 2021

@houqp that would be a great idea, some way to add it as an extra dependency.
@Liam3851 Currently pandas.read_parquet, for example, takes an engine parameter. Since Delta tables are actually Parquet files, what about just adding an extra deltalake engine?

If engine is set to deltalake, it would be used to read Parquet files in Delta Lake format:
engine : {'auto', 'pyarrow', 'fastparquet', 'deltalake' (new)}, default 'auto'
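Hypothetically, usage under that proposal would look like the sketch below; this engine option was only proposed in this issue and never added to pandas.

```python
import pandas as pd

# Hypothetical: a 'deltalake' engine for read_parquet, as proposed in this
# issue. Never implemented in pandas; shown only to illustrate the idea.
df = pd.read_parquet("/path/to/delta-table", engine="deltalake")
```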

@mroeschke
Member

Thanks for the suggestion, but based on the response from core devs in #49692, there isn't much appetite to maintain a read_deltalake function, especially since we already link to deltalake in the ecosystem docs and the package provides a to_pandas function: https://pandas.pydata.org/docs/ecosystem.html#deltalake. Closing.
