Python write_deltalake() to Non-AWS S3 failing #890
Comments
Possibly very related to the issue I recently filed: #883
I am not too deep into the S3 side of things, but from the error message it seems the underlying file system is trying to get credentials from an ECS metadata endpoint, which seems strange since that is not configured in the snippet. Just to rule it out: could there be an environment variable configured that causes this? Then again, I might be completely off; this is just from a quick scan.
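(A quick way to check that, as a minimal sketch not tied to delta-rs: dump any AWS-related variables from the container environment. AWS_CONTAINER_CREDENTIALS_RELATIVE_URI, for example, is one of the variables SDK credential chains use to locate the ECS credentials endpoint.)

```python
import os

# Print AWS-related environment variables that could steer the credential
# chain toward the ECS metadata endpoint (for example,
# AWS_CONTAINER_CREDENTIALS_RELATIVE_URI or AWS_CONTAINER_CREDENTIALS_FULL_URI).
for key, value in sorted(os.environ.items()):
    if key.startswith("AWS_"):
        print(f"{key}={value}")
```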
I currently get:
I am working around this now by:
Thanks, I will give this a try.
We just released 0.6.4, which includes several fixes related to passing down credentials. Could you check whether writing is working for you on that version?
I have tried with 0.6.4 and face a new error, which seems to be related to multi-part uploads. But I need to confirm whether this is an issue with our object store (SwiftStack) or not. For reference, I'm able to successfully write the same dataframe to S3 with pyarrow directly (using write_dataset).

In the meantime, I wasn't able to find options on the PyArrow S3 filesystem to configure multi-part uploads, so please let me know if you are aware of any and I'll try to test with different configs.

Update: our current theory is that the issue might be related to the complete-multipart-upload request not including all necessary XML components. Is it correct that the write path is using the rusoto S3 library, or is it using the Arrow C++ S3 implementation?

Update 2: if I write a very small table (14 KB), then everything works! So I think this points to the issue being multi-part uploads, as presumably they're disabled below a certain threshold.
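(For reference, a rough sketch of the kind of direct PyArrow write described above; the bucket name, endpoint, and credentials here are placeholders, not the poster's actual values.)

```python
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.fs import S3FileSystem

# Placeholder credentials and a hypothetical non-AWS endpoint.
fs = S3FileSystem(
    access_key="ACCESS_KEY",
    secret_key="SECRET_KEY",
    endpoint_override="https://s3.example.internal",
)

table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Plain Parquet write through PyArrow, bypassing delta-rs entirely.
ds.write_dataset(table, "my-bucket/comparison-test", format="parquet", filesystem=fs)
```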
I am also hitting another issue
storage_options is set like this:
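(The original snippet is not preserved in this thread; a plausible shape for it, with placeholder values, would be something like the following. The AWS_* key names follow what delta-rs's S3 backend accepts, though exact option names can vary by version.)

```python
# Placeholder values for a non-AWS S3 endpoint.
storage_options = {
    "AWS_ACCESS_KEY_ID": "ACCESS_KEY",
    "AWS_SECRET_ACCESS_KEY": "SECRET_KEY",
    "AWS_ENDPOINT_URL": "https://s3.example.internal",
    "AWS_REGION": "us-east-1",
}
```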
The failure happens here: https://github.com/delta-io/delta-rs/blob/main/python/deltalake/writer.py#L164, which I guess is because my source data is using a pyarrow filesystem that also defines the endpoint_url field, so the key ends up duplicated. Possibly?
Yes, that will be fixed in #912.
Hello! I will give it a go and will let you know as soon as possible!
I can confirm that I have a similar problem to @joshuarobinson.
I am also facing this error with some datasets; I am not entirely sure whether this is related or whether I should open a new issue. These datasets read fine with pandas or pyarrow, and I am able to upload them to Delta Lake using PySpark.
@shazamkash please open a new issue for that error. |
The
This will be fixed in the next release. |
Environment
Delta-rs version: 0.6.2
Binding: Python
Environment:
Docker container:
Python: 3.10.7
OS: Debian GNU/Linux 11 (bullseye)
S3: Non-AWS (Ceph based)
Bug
What happened:
The Delta Lake write is failing when trying to write a table to Ceph-based S3 (non-AWS). I am writing the table to a path that does not previously contain any delta table or files of any sort.
I have also tried different modes, but writing the table still does not work and throws the same error.
My code:
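(The original snippet is not reproduced here; the following is only an illustrative sketch of the call shape, with hypothetical bucket, path, endpoint, and credential values.)

```python
import pandas as pd
from deltalake.writer import write_deltalake

# Hypothetical endpoint and credentials for a Ceph-based, non-AWS S3 service.
storage_options = {
    "AWS_ACCESS_KEY_ID": "ACCESS_KEY",
    "AWS_SECRET_ACCESS_KEY": "SECRET_KEY",
    "AWS_ENDPOINT_URL": "https://ceph.example.internal",
    "AWS_REGION": "us-east-1",
}

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# The target path contains no existing delta table or files.
write_deltalake(
    "s3://my-bucket/delta/test-table",
    df,
    mode="overwrite",
    storage_options=storage_options,
)
```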
Fails with the following error:
Any idea what might be the problem? I am able to read the delta tables with the same storage_options.
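(For contrast, a minimal sketch of the read path that does work, assuming the same placeholder storage_options dict as above and a hypothetical existing table.)

```python
from deltalake import DeltaTable

# Reading an existing table with the same storage_options succeeds,
# which suggests the endpoint and credentials themselves are valid.
dt = DeltaTable("s3://my-bucket/delta/existing-table", storage_options=storage_options)
print(dt.to_pyarrow_table().num_rows)
```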