Python write_deltalake() to Non-AWS S3 failing #890
Comments
Possibly very related to the issue I recently filed: #883
I am not too deep into the S3 side of things, but from the error message it seems the underlying file system is trying to get credentials from an ECS metadata endpoint, which seems strange since that is not configured in the snippet. Just to rule it out: could there be an environment variable configured that causes this? Then again, I might be completely off; this is just from a quick scan.
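(A quick way to check that, as a minimal sketch not tied to delta-rs: dump any AWS-related variables from the container environment. AWS_CONTAINER_CREDENTIALS_RELATIVE_URI, for example, is one of the variables SDK credential chains use to locate the ECS credentials endpoint.)

```python
import os

# Print AWS-related environment variables that could steer the credential
# chain toward the ECS metadata endpoint (for example,
# AWS_CONTAINER_CREDENTIALS_RELATIVE_URI or AWS_CONTAINER_CREDENTIALS_FULL_URI).
for key, value in sorted(os.environ.items()):
    if key.startswith("AWS_"):
        print(f"{key}={value}")
```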
I currently get:
I am working around this now by:
Thanks, I will give this a try.
We just released 0.6.4, which includes several fixes related to passing down credentials. Could you check whether writing is working for you on that version?
I have tried with 0.6.4 and face a new error, which seems to be related to multi-part uploads. But I need to confirm whether this is an issue with our object store (SwiftStack) or not. For reference, I'm able to successfully write the same dataframe to S3 with pyarrow directly (using write_dataset).

In the meantime, I wasn't able to find options on the PyArrow S3 filesystem to configure multi-part uploads, so please let me know if you are aware of any and I'll try to test with different configs.

Update: our current theory is that the issue might be related to the complete-multipart-upload request not including all necessary XML components. Is it correct that the write path is using the rusoto S3 library, or is it using the Arrow C++ S3 implementation?

Update 2: if I write a very small table (14 KB), then everything works! So I think this points to the issue being multi-part uploads, as presumably they're disabled below a certain threshold.
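(For reference, a rough sketch of the kind of direct PyArrow write described above; the bucket name, endpoint, and credentials here are placeholders, not the poster's actual values.)

```python
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.fs import S3FileSystem

# Placeholder credentials and a hypothetical non-AWS endpoint.
fs = S3FileSystem(
    access_key="ACCESS_KEY",
    secret_key="SECRET_KEY",
    endpoint_override="https://s3.example.internal",
)

table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Plain Parquet write through PyArrow, bypassing delta-rs entirely.
ds.write_dataset(table, "my-bucket/comparison-test", format="parquet", filesystem=fs)
```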
I am also hitting another issue
storage_options is set like this:
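(The original snippet is not preserved in this thread; a plausible shape for it, with placeholder values, would be something like the following. The AWS_* key names follow what delta-rs's S3 backend accepts, though exact option names can vary by version.)

```python
# Placeholder values for a non-AWS S3 endpoint.
storage_options = {
    "AWS_ACCESS_KEY_ID": "ACCESS_KEY",
    "AWS_SECRET_ACCESS_KEY": "SECRET_KEY",
    "AWS_ENDPOINT_URL": "https://s3.example.internal",
    "AWS_REGION": "us-east-1",
}
```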
The failure happens here: https://github.com/delta-io/delta-rs/blob/main/python/deltalake/writer.py#L164, which I guess is because my source data is using a pyarrow filesystem that also defines the endpoint_url field, so the key ends up duplicated. Possibly?
Yes, that will be fixed in #912.
Hello! I will give it a go and will let you know as soon as possible!
I can confirm that I have a similar problem to @joshuarobinson.
I am also facing this error with some datasets; I am not entirely sure whether this is related or whether I should open a new issue. These datasets read fine with pandas or pyarrow, and I am able to upload them to Delta Lake using PySpark.
@shazamkash please open a new issue for that error. |
The
This will be fixed in the next release. |
Environment
Delta-rs version: 0.6.2
Binding: Python
Environment:
Docker container:
Python: 3.10.7
OS: Debian GNU/Linux 11 (bullseye)
S3: Non-AWS (Ceph based)
Bug
What happened:
The Delta Lake write is failing when trying to write a table to Ceph-based S3 (non-AWS). I am writing the table to a path that does not previously contain any delta table or files of any sort.
I have also tried different modes, but writing the table still does not work and throws the same error.
My code:
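(The original snippet is not reproduced here; the following is only an illustrative sketch of the call shape, with hypothetical bucket, path, endpoint, and credential values.)

```python
import pandas as pd
from deltalake.writer import write_deltalake

# Hypothetical endpoint and credentials for a Ceph-based, non-AWS S3 service.
storage_options = {
    "AWS_ACCESS_KEY_ID": "ACCESS_KEY",
    "AWS_SECRET_ACCESS_KEY": "SECRET_KEY",
    "AWS_ENDPOINT_URL": "https://ceph.example.internal",
    "AWS_REGION": "us-east-1",
}

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# The target path contains no existing delta table or files.
write_deltalake(
    "s3://my-bucket/delta/test-table",
    df,
    mode="overwrite",
    storage_options=storage_options,
)
```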
Fails with the following error:
Any idea what might be the problem? I am able to read the delta tables with the same storage_options.
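(For contrast, a minimal sketch of the read path that does work, assuming the same placeholder storage_options dict as above and a hypothetical existing table.)

```python
from deltalake import DeltaTable

# Reading an existing table with the same storage_options succeeds,
# which suggests the endpoint and credentials themselves are valid.
dt = DeltaTable("s3://my-bucket/delta/existing-table", storage_options=storage_options)
print(dt.to_pyarrow_table().num_rows)
```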