ENH: Add the support of the Delta Lake format in Pandas as an optional extra dependency #49692
Conversation
-1 on adding api
why isn't this just parquet?
Hello @jreback 👋, thank you for your review! I included Delta Lake as a new API file because (by definition) it uses a transaction log with its PROTOCOL to read different file formats, including data in Parquet. I don't have a strong opinion on this; I was actually asking myself the same question: what would be the best place for this format's integration? If a new engine in the parquet reader is the best place for you, I could easily move the integration there, WDYT?
I would also be -1 to adding a dedicated API for this. Additionally, since this functionality is essentially available through the `deltalake` package already, maybe it would be more appropriate to add this to the ecosystem docs?
Hello @mroeschke 👋, thanks for the input! This PR already includes an entry in the ecosystem docs. Should I close this PR or move the integration elsewhere?
I think the advantage of adding this is that Delta Lake is arguably the best way to read data into a pandas DataFrame. Disclosure: I am on the Delta Lake team.

The less data you load into a pandas DataFrame, the better. Delta Lake lets users easily load less data into pandas DataFrames (via column pruning, file skipping from metadata, and predicate pushdown filtering). pandas only allows for column pruning, via the `columns` argument of `read_parquet`.

I'd argue that the Z ORDERING offered by Delta Lake is better than the disk partitioning currently offered by pandas.

Another "why isn't this just parquet" argument is that Delta Lake allows for schema enforcement that Parquet does not provide. All the pandas data source connectors are schema-on-read. It'd be nice for pandas users to have a schema-on-write connector.

I wrote blog posts on reading Delta Lakes into pandas DataFrames and how to version your data / time travel with pandas, with some more background info.

If this gets added, we should probably add the `deltalake` package as an optional dependency.
Ah okay I didn't realize this was already in the ecosystem docs. |
Agree with the existing -1 comments here; we already have a pretty exhaustive API that we are very conservative about adding to. I think @mroeschke's suggestion covers the functionality.
Thanks for the pull request, but since we have a few -1's here from the core dev team, it appears there's not much appetite to maintain a dedicated function for this format, so closing.
Delta Lake is an open-source storage framework; the Python library delta-rs (`deltalake`) makes it accessible from pandas. Integrating `deltalake` into pandas as an optional extra dependency makes the support work out of the box for pandas users.