feat: add hdfs support in python #1359
Conversation
ACTION NEEDED: delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
Thanks for doing this @Kimahriman. Could we add some documentation for users on what the prerequisites are for using this? However, one thing you mentioned made me a little concerned, so let's make sure I got this right: would this mean that we always require Java to be available when building the wheels and/or installing the library? In that case I think we would have to find another way to make HDFS available on the Python side. For me, not requiring Java is one of the main drivers to even get involved in the project 😆, and having that as a requirement would break a lot of use cases. One potential approach could be to do something like #900.
Yeah, it's kind of unfortunate. The pyarrow approach would be interesting, since that's how all the data files would be read at that point, right? I'm curious whether building pyarrow requires Java for however they do the libhdfs bindings. Definitely not my area of expertise; I'm just a pyspark power user who thought this would be interesting to figure out.
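For context, the pyarrow route would look roughly like the sketch below. pyarrow's `HadoopFileSystem` binds to `libhdfs` at call time (located via `HADOOP_HOME` or `ARROW_LIBHDFS_DIR`), so Java is only needed on machines that actually touch HDFS; the host, user, and file path here are placeholders:

```python
import pyarrow.fs as pafs
import pyarrow.parquet as pq

# libhdfs (and therefore the JVM) is loaded here, not when pyarrow is
# imported, so a Java-free environment can still import the library.
hdfs = pafs.HadoopFileSystem(host="namenode", port=8020, user="hadoop")

# Read one parquet file that a Delta table references; the path is
# a hypothetical example.
table = pq.read_table("/data/delta-table/part-00000.parquet", filesystem=hdfs)
print(table.num_rows)
```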
Could we make `hdfs` an optional (and non-default) feature for the Python package?
Seems like it adds Java as a build dependency. That's okay if it's not the default. But does it mean it's also a runtime dependency for users who aren't using HDFS? If not, I'm fine including it in the wheels by default. Otherwise, we might be better off finishing the fsspec wrapper.
python/Cargo.toml
Outdated

```diff
@@ -39,7 +39,7 @@ features = ["extension-module", "abi3", "abi3-py37"]
 [dependencies.deltalake]
 path = "../rust"
 version = "0"
-features = ["azure", "gcs", "python", "datafusion", "unity-experimental"]
+features = ["azure", "gcs", "python", "datafusion", "unity-experimental", "hdfs"]
```
Could we make this an optional feature of the Python library, since it has that extra Java dependency?
You can add the following:

```toml
hdfs = ["deltalake/hdfs"]
```

And then in the Python release workflow, add it to the feature flags for the release:

```yaml
FEATURES_FLAG: --no-default-features --features native-tls
```
Yeah, I think that would all work! I don't think Java would be required at runtime unless you try to use HDFS, but I guess I should try to verify that. I made it an optional dependency that gets installed for the integration tests, and didn't touch the release workflows.
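One quick way to check would be a smoke test like the following in a Java-free environment. It only touches local paths, so if even this fails, the JVM linkage is a hard runtime dependency; the path and data are made up:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Purely local round-trip with no hdfs:// URI involved; if the JVM is
# only loaded on demand, this should succeed without Java installed.
df = pd.DataFrame({"id": [1, 2, 3]})
write_deltalake("/tmp/smoke-table", df)
print(DeltaTable("/tmp/smoke-table").to_pandas())
```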
Well, never mind; I think I was wrong. I used a fresh Docker container, installed Java, built with HDFS support, then uninstalled Java, and the unit tests fail, so this might just not be viable. We might only be able to do the pyarrow-based approach.
That's unfortunate. Either datafusion-objectstore-hdfs needs to change to load the library lazily when called (using libloading), or we'll need to support fsspec instead.
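As a rough sketch of the fsspec fallback (not what this PR implements): fsspec's `hdfs` protocol wraps pyarrow's `HadoopFileSystem`, so the JVM is again only loaded when an HDFS path is first used. The host, port, and directory are placeholders:

```python
import fsspec

# "hdfs" resolves to fsspec's wrapper around pyarrow.fs.HadoopFileSystem;
# libhdfs (and the JVM) are loaded here, at first use, not at import.
fs = fsspec.filesystem("hdfs", host="namenode", port=8020)
print(fs.ls("/data"))
```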
Given the state of this work, I am going to convert it to a draft since I do not believe it would be suitable to merge at this point.
Closing in favor of #2612 |
Description
Adds support for HDFS in the Python package by enabling the `hdfs` feature from the Rust library. Additionally adds a simple integration test, and docker-compose support to run the integration test locally. Had to update some of the workflows to accommodate this; the trickiest one was the Python 3.7 stage, which uses an old OS that doesn't have a new enough clang to compile the HDFS object store, I guess?
This might require Java to build and develop anything Python-related now? Not totally sure what side effects this might induce; this is my first time playing around with this library (avid Spark user).
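For reference, a minimal version of the integration test might look like the sketch below; the namenode address and table path are placeholders for whatever the docker-compose setup exposes:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# hdfs://localhost:9000 stands in for the docker-compose namenode.
uri = "hdfs://localhost:9000/tmp/test-table"

df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
write_deltalake(uri, df)

# Round-trip: reading back should return the same two rows.
assert DeltaTable(uri).to_pandas().shape == (2, 2)
```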
Related Issue(s)
Closes #1358
Documentation