ENH: Add the support of the Delta Lake format in Pandas as an optional extra dependency #49692
Conversation
-1 on adding api
why isn't this just parquet?
Hello @jreback 👋, thank you for your review! I included Delta Lake as a new API file because (by definition) it uses a transaction log with its PROTOCOL to read different file formats, including data in Parquet. I don't have a strong opinion on this; I was actually asking myself the same question: what would be the best place for this format's integration? If a new engine in the parquet reader is the best place for you, I could easily move the integration there, WDYT?
I would also be -1 to adding a dedicated API for this. Additionally, since this functionality is essentially available through the `deltalake` package already, maybe it would be more appropriate to add this to the ecosystem docs?
Hello @mroeschke 👋, thanks for the input! This PR already includes an entry in the ecosystem docs. Should I close this PR or move the integration elsewhere?
I think the advantage of adding this is that Delta Lake is arguably the best way to read data into a pandas DataFrame. Disclosure: I am on the Delta Lake team.

The less data you load into a pandas DataFrame, the better. Delta Lake lets users easily load less data into pandas DataFrames (via column pruning, file skipping from metadata, and predicate pushdown filtering). pandas only allows for column pruning, via the `columns` argument of `read_parquet`.

I'd argue that the Z ORDERING offered by Delta Lake is better than the disk partitioning currently offered by pandas.

Another "why isn't this just parquet" argument is that Delta Lake allows for schema enforcement that Parquet does not provide. All the pandas data source connectors are schema-on-read. It'd be nice for pandas users to have a schema-on-write connector.

I wrote blog posts on reading Delta Lakes into pandas DataFrames and how to version your data / time travel with pandas, with some more background info.

If this gets added, we should probably add the `deltalake` package as an optional dependency.
Ah okay I didn't realize this was already in the ecosystem docs. |
Agree with the existing -1 comments here; we already have a pretty exhaustive API that we are very conservative about adding to. I think @mroeschke's suggestion covers the functionality.
Thanks for the pull request, but since we have a few -1's here from the core dev team, it appears there's not much appetite to maintain a dedicated function for this format, so closing.
Delta Lake is an open-source storage framework; the Python library delta-rs (`deltalake`) makes it accessible from pandas. Integrating `deltalake` into pandas as an optional extra dependency makes the support work out of the box for pandas users.