Generic S3 error: Converting table to pandas and pyarrow table fails. #1256
@shazamkash - Thanks for reporting this! From the response you showed, it seems like we are running into some sort of throttling on the storage side, though I'm not quite sure why. Could you see what happens if you configure the pyarrow S3 filesystem and pass that to `to_pyarrow_dataset`?
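A minimal sketch of that suggestion (not code from the thread): build the pyarrow S3 filesystem yourself and hand it to delta-rs. The credentials, endpoint URL, and table URI below are placeholders, and the assumed hand-off point is the `filesystem` argument of `to_pyarrow_dataset`.

```python
# Hedged sketch: read a Delta table through an explicitly configured
# pyarrow S3 filesystem instead of the built-in object store.

def read_via_pyarrow_fs(table_uri: str, endpoint: str):
    """Read a Delta table via a user-configured pyarrow S3 filesystem."""
    import pyarrow.fs as pafs
    from deltalake import DeltaTable

    s3 = pafs.S3FileSystem(
        access_key="<ACCESS_KEY>",   # placeholder credential
        secret_key="<SECRET_KEY>",   # placeholder credential
        endpoint_override=endpoint,  # non-AWS (e.g. Ceph RadosGW) endpoint
    )
    dt = DeltaTable(table_uri)
    # Pass the filesystem explicitly; the data files are then fetched
    # through pyarrow rather than the object store that raised the error.
    dataset = dt.to_pyarrow_dataset(filesystem=s3)
    return dataset.to_table().to_pandas()
```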
I tried what you suggested; please find the code and errors below.
Code:
Error from dt.to_pyarrow_dataset()
Error from dt.to_pandas()
Here is the list of files I get by running the following code, which also works.
List of files:
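The listing step above can be sketched as follows (a hedged sketch; the table URI is a placeholder). `DeltaTable.files()` returns the relative paths of the parquet files that make up the current table version.

```python
# Hedged sketch of listing the parquet files behind a Delta table.

def list_table_files(table_uri: str):
    from deltalake import DeltaTable

    dt = DeltaTable(table_uri)
    # Paths of the data files referenced by the current table version
    return dt.files()
```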
Another thing I noticed: this only happens with data that is "big", from a few hundred MB to a few GB, and split into multiple parquet files. I can read tables that are very small (a few tens of MB) and saved in a single file. Any help would be appreciated, because I have read the same data before with an older version of delta-rs and it worked fine then; unfortunately, I no longer remember the exact version. Also, here is the full error, which I was able to capture now:
Can I take it?
@tsafacjo - certainly :)
Environment
Delta-rs version: 0.8.1
Binding: Python
Environment:
Docker container:
Python: 3.10.7
OS: Debian GNU/Linux 11 (bullseye)
S3: Non-AWS (Ceph based)
Bug
What happened:
When reading the delta table, the table itself is read fine and clearly exists. But converting it to pandas, or converting the pyarrow dataset to a table, fails with the same error shown below.
I have tried reading the same table with PySpark and it works fine. The parquet data is about 1 GB compressed and 3 GB uncompressed. Furthermore, the table was written to the deltalake using the same delta-rs version.
Error:
How to reproduce it:
My Code:
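A minimal repro along these lines (a hedged reconstruction, not the reporter's actual code; the storage option keys follow delta-rs's S3 configuration, and all values and the endpoint are placeholder assumptions for the Ceph setup described above):

```python
# Hedged reconstruction of a minimal repro against a non-AWS S3 endpoint.
storage_options = {
    "AWS_ACCESS_KEY_ID": "<ACCESS_KEY>",        # placeholder credential
    "AWS_SECRET_ACCESS_KEY": "<SECRET_KEY>",    # placeholder credential
    "AWS_ENDPOINT_URL": "https://s3.example.internal",  # Ceph RadosGW endpoint
    "AWS_REGION": "us-east-1",
}

def reproduce(table_uri: str):
    from deltalake import DeltaTable

    dt = DeltaTable(table_uri, storage_options=storage_options)
    dt.to_pyarrow_dataset()  # building the dataset succeeds
    return dt.to_pandas()    # fails here with "Generic S3 error"
```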
More details:
I am not sure if this information helps, but I get the same error when reading the table with Polars.