
ENH: Add the support of the Delta Lake format in Pandas as an optional extra dependency #49692

Closed · wants to merge 18 commits

Conversation

@fvaleye commented Nov 14, 2022

Delta Lake is an open-source storage framework. A Python library, delta-rs, is available to access it from pandas. Integrating deltalake into pandas as an optional extra dependency makes the support work out of the box for pandas users.
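For context, a minimal sketch of what the proposed reader could look like from the user's side (the `read_deltalake` name and its placement on the top-level namespace are assumptions for illustration, not a confirmed pandas API):

```python
import pandas as pd

# Hypothetical entry point for the optional Delta Lake reader
# proposed in this PR; the name `read_deltalake` is assumed here.
df = pd.read_deltalake("path/to/delta-table")  # placeholder path
print(df.head())
```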

@jreback (Contributor) left a comment

-1 on adding api

why isn't this just parquet?

@fvaleye (Author) commented Nov 14, 2022

Hello @jreback 👋,

Thank you for your review!

I included Delta Lake as a new API file because (by definition) it uses a transaction log with its PROTOCOL to read different file formats, including data in parquet. I don't have a strong opinion on this; I was actually asking myself the same question: what would be the best place for this format's integration?

If a new engine option in read_parquet is the best place for you, I could easily move the integration there. WDYT?
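For illustration, the engine-based alternative might look like this (the `engine="deltalake"` value is hypothetical; pandas has no such engine today):

```python
import pandas as pd

# Hypothetical alternative: surface Delta Lake through the existing
# parquet reader instead of a dedicated function. The engine value
# below is an assumption for illustration only.
df = pd.read_parquet("path/to/delta-table", engine="deltalake")
```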

@mroeschke (Member) commented

I would also be -1 on adding a dedicated API for this.

Additionally, since this functionality is essentially deltalake.DeltaTable(**kwargs).to_pandas(**kwargs), its simplicity makes me skeptical it needs to live in pandas, where it would incur testing and dependency-management overhead.
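Spelled out, that one-liner is all a user needs today (the table path is a placeholder):

```python
from deltalake import DeltaTable

# What the proposed pandas wrapper reduces to: open the table via
# its transaction log and materialize it as a pandas DataFrame.
dt = DeltaTable("path/to/delta-table")  # placeholder path
df = dt.to_pandas()
```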

Maybe it would be more appropriate to add this to the ecosystem docs?

@fvaleye (Author) commented Nov 16, 2022

Hello @mroeschke 👋,

Thanks for the input; this PR already adds deltalake to the ecosystem docs. I was inspired by the other APIs' implementations.

Should I close this PR, or move the integration into a new engine option in read_parquet?

@MrPowers commented
I think the advantage of adding this is that Delta Lake is arguably the best way to read data into a pandas DataFrame. Disclosure: I am on the Delta Lake team.

The less data you load into a pandas DataFrame, the better. Delta Lake lets users easily load less data into pandas DataFrames (via column pruning, file skipping from metadata, and predicate pushdown filtering). pandas only allows for column pruning via read_parquet(columns=[...]), but there is a lot more data skipping that could be provided to end users.

pandas to_parquet has a partition_cols argument that lets users take advantage of disk partitioning on the write side, but there isn't a corresponding read_parquet argument that lets them take advantage of file skipping on the read side. See the Dask read_parquet filters argument for the kind of interface that could be exposed. Delta Lake could fill the gap and let users take advantage of data skipping for partitioned datasets.
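To make the contrast concrete, a sketch comparing today's column pruning with the read-side pruning deltalake exposes (paths and column names are placeholders; `partitions` and `columns` are arguments of DeltaTable.to_pandas in recent deltalake releases, so check the docs for your version):

```python
import pandas as pd
from deltalake import DeltaTable

# What pandas offers today: column pruning only.
df = pd.read_parquet("data/table.parquet", columns=["user_id", "amount"])

# deltalake adds partition pruning and metadata-driven file skipping.
dt = DeltaTable("data/delta-table")  # placeholder path
df = dt.to_pandas(
    partitions=[("country", "=", "US")],  # skip non-matching files
    columns=["user_id", "amount"],        # column pruning
)
```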

I'd argue that the Z ORDERING offered by Delta Lake is better than the disk partitioning currently offered by pandas.

Another "why isn't this just parquet" argument is that Delta Lake allows for schema enforcement that Parquet does not provide. All the pandas data source connectors are schema-on-read. It'd be nice for pandas users to have a schema-on-write connector.
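As an illustration of the schema-on-write behavior, a minimal sketch using deltalake's write_deltalake (paths and frames are made up; the exact exception type varies across deltalake versions):

```python
import pandas as pd
from deltalake import write_deltalake

# Create a table whose schema is then enforced on later writes.
write_deltalake("data/delta-table", pd.DataFrame({"id": [1, 2]}))

try:
    # Appending a frame with a mismatched schema is rejected
    # instead of being silently coerced.
    write_deltalake(
        "data/delta-table",
        pd.DataFrame({"id": ["a"]}),  # string instead of int
        mode="append",
    )
except Exception as err:
    print(f"rejected: {err}")
```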

I wrote blog posts on reading Delta Lakes into pandas DataFrames and on how to version your data / time travel with pandas, with some more background info.

If this gets added, we should probably add the filters argument. I do think this would provide pandas users with a fundamentally better way to read data and would give them the data skipping capabilities they can get with other query engines. I think these features are especially pertinent for pandas users because of the strict pandas memory limits. I can also see the argument that this should be kept separate and would like to reiterate my bias here. Thank you.
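A sketch of what that filters argument could look like, borrowing the DNF tuple convention that Dask and pyarrow already use (the pandas function and signature below are hypothetical and do not exist today):

```python
import pandas as pd

# Hypothetical filters argument on the proposed reader, using the
# [(column, op, value), ...] DNF convention. Not a real pandas API.
df = pd.read_deltalake(
    "data/delta-table",  # placeholder path
    filters=[("amount", ">", 100), ("country", "=", "US")],
)
```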

@mroeschke (Member) commented

> Thanks for the input; this #40636 already adds deltalake to the ecosystem docs. I was inspired by the other APIs' implementations.

Ah okay I didn't realize this was already in the ecosystem docs.

@WillAyd (Member) commented Nov 23, 2022

Agree with the existing -1 comments here; we already have a pretty exhaustive API and are very conservative when it comes to additions. I think @mroeschke's suggestion covers the functionality.

@mroeschke (Member) commented

Thanks for the pull request, but since we have a few -1's here from the core dev team, it appears there's not much appetite to maintain a dedicated function for this format, so closing.

Linked issue: ENH: Delta Lake file format support

5 participants