
Python library error reading from large Delta Table on S3 #882

Closed
joshuarobinson opened this issue Oct 12, 2022 · 12 comments
Labels
bug Something isn't working

Comments

@joshuarobinson

Environment: Ubuntu 22.04, reading from an on-prem S3 object store into an Arrow Table

Delta-rs version: 0.6.2

Binding: Python


Bug

What happened:
PanicException when reading from a Delta table in an on-prem S3 object store (Swift).

Some tables read okay and others do not; anecdotally, the smaller tables are fine and the larger ones fail.

How to reproduce it:

from deltalake import DeltaTable

storage_options = {
    "AWS_ACCESS_KEY_ID": ACCESS_KEY, "AWS_SECRET_ACCESS_KEY": SECRET_KEY,
    "AWS_ENDPOINT_URL": ENDPOINT_URL, "AWS_REGION": "us-east-1",
}
TBL_PATH = "s3://warehouse/tpcds_sf1000_dlt/catalog_sales/"
dt = DeltaTable(TBL_PATH, storage_options=storage_options)
ds = dt.to_pyarrow_dataset()
df = ds.head(100).to_pandas()  # FAILS here

More details:

Error message:

thread '<unnamed>' panicked at 'dispatch dropped without returning error', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.14.20/src/client/conn.rs:329:35
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'dispatch dropped without returning error', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.14.20/src/client/conn.rs:329:35
Traceback (most recent call last):
  File "/delta_read_duckdb.py", line 22, in <module>
    df = ds.head(100).to_pandas()
  File "pyarrow/_dataset.pyx", line 365, in pyarrow._dataset.Dataset.head
  File "pyarrow/_dataset.pyx", line 2623, in pyarrow._dataset.Scanner.head
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
pyo3_runtime.PanicException: dispatch dropped without returning error
FATAL: exception not rethrown
@joshuarobinson joshuarobinson added the bug Something isn't working label Oct 12, 2022
@joshuarobinson
Author

When I run with RUST_BACKTRACE=full, I get the following stacktrace fragment (other frames are unknown):

30:     0x7fc4ff82f2ab - cfunction_call
                               at /usr/src/python/Objects/methodobject.c:543
  31:     0x7fc4ff7f230d - _PyObject_MakeTpCall
                               at /usr/src/python/Objects/call.c:215
  32:     0x7fc4ff7e239f - _PyObject_CallFunctionVa
                               at /usr/src/python/Objects/call.c:479:18
  33:     0x7fc4ff92b316 - _PyObject_CallMethod_SizeT
                               at /usr/src/python/Objects/call.c:651
  34:     0x7fc4f9a7a8a1 - _ZN5arrow2py14PyReadableFile4ReadEl
  35:     0x7fc4f9a7533b - _ZN5arrow2py14PyReadableFile6ReadAtEll.localalias
  36:     0x7fc4f92dbe2e - _ZN7parquet16ReaderProperties9GetStreamESt10shared_ptrIN5arrow2io16RandomAccessFileEEll
  37:     0x7fc4f93abae2 - _ZN7parquet18SerializedRowGroup19GetColumnPageReaderEi
  38:     0x7fc4f93553ef - _ZN7parquet14RowGroupReader19GetColumnPageReaderEi
  39:     0x7fc4f93ef833 - _ZN7parquet5arrow12_GLOBAL__N_19GetReaderERKNS0_11SchemaFieldERKSt10shared_ptrIN5arrow5FieldEERKS5_INS0_13ReaderContextEEPSt10unique_ptrINS0_16ColumnReaderImplESt14default_deleteISG_EE
  40:     0x7fc4f9409b57 - _ZN7parquet5arrow12_GLOBAL__N_114FileReaderImpl15GetFieldReadersERKSt6vectorIiSaIiEES7_PS3_ISt10shared_ptrINS0_16ColumnReaderImplEESaISA_EEPS8_IN5arrow6SchemaEE.constprop.0
  41:     0x7fc4f940c652 - _ZN7parquet5arrow12_GLOBAL__N_114FileReaderImpl15DecodeRowGroupsESt10shared_ptrIS2_ERKSt6vectorIiSaIiEES9_PN5arrow8internal8ExecutorE
  42:     0x7fc4f940e1ec - _ZN7parquet5arrow17RowGroupGenerator15ReadOneRowGroupEPN5arrow8internal8ExecutorESt10shared_ptrINS0_12_GLOBAL__N_114FileReaderImplEEiRKSt6vectorIiSaIiEE
  43:     0x7fc4f93e0b1e - _ZN5arrow8internal6FnOnceIFvvEE6FnImplISt5_BindIFNS_6detail14ContinueFutureENS_6FutureISt8functionIFNS8_ISt10shared_ptrINS_11RecordBatchEEEEvEEEEPFSG_PNS0_8ExecutorESA_IN7parquet5arrow12_GLOBAL__N_114FileReaderImplEEiRKSt6vectorIiSaIiEEESI_SN_iSQ_EEE6invokeEv
  44:     0x7fc4fac69e7b - _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5arrow8internal10ThreadPool21LaunchWorkersUnlockedEiEUlvE_EEEEE6_M_runEv
  45:     0x7fc4fbb805f0 - execute_native_thread_routine
  46:     0x7fc4ff64dfa3 - start_thread
  47:     0x7fc4ff3efeff - clone
  48:                0x0 - <unknown>

@roeap
Collaborator

roeap commented Oct 19, 2022

Hmm, this one is a bit of a puzzle, as it seems that the connection to the storage location fails at some point during the download of a file.

Could you try to use the pyarrow native S3 filesystem to see if the error still occurs? You have to wrap it in a subtree filesystem though, as described here: https://delta-io.github.io/delta-rs/python/usage.html#custom-storage-backends
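
Something along these lines should work (a rough sketch reusing the placeholders from the reproduction above, not tested against Swift; depending on your endpoint you may need to split the scheme out of endpoint_override):

from deltalake import DeltaTable
from pyarrow import fs

# pyarrow-native S3 filesystem pointed at the on-prem endpoint
s3 = fs.S3FileSystem(
    access_key=ACCESS_KEY,
    secret_key=SECRET_KEY,
    endpoint_override=ENDPOINT_URL,
    region="us-east-1",
)

# wrap it in a SubTreeFileSystem rooted at the table path (bucket + key, no scheme)
filesystem = fs.SubTreeFileSystem("warehouse/tpcds_sf1000_dlt/catalog_sales", s3)

dt = DeltaTable(TBL_PATH, storage_options=storage_options)
ds = dt.to_pyarrow_dataset(filesystem=filesystem)
df = ds.head(100).to_pandas()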

@ycrouin

ycrouin commented Oct 23, 2022

Using

from pyarrow import fs

filesystem = fs.SubTreeFileSystem("<bucket>/<table key>", fs.S3FileSystem())
dt.to_pandas(filesystem=filesystem)

solves the PanicException issue I had on my table (which is not that big)

@joshuarobinson
Author

I also tried with the SubTreeFileSystem and that has solved my PanicException too.

Does that give any hints as to what's wrong with the original version? I'm happy to debug further...

@wjones127
Collaborator

No need to debug further; I've found a couple issues in #893 while adding integration tests with S3.

@roeap
Collaborator

roeap commented Nov 18, 2022

@joshuarobinson - is it possible for you to build off current main and see if the error still exists? We have some indication that the dropped clients issue could be resolved now.

@jacobdanovitch

jacobdanovitch commented Nov 19, 2022

I got a similar (but not identical) traceback with a table of a few thousand rows on S3-compatible storage (LakeFS backed by Minio):

deltalake.PyDeltaTableError: Generic S3 error: Error performing get request main/silver/re/list/part-00003-a23a8231-bcb4-40f7-81d7-1dc3515ea120-c000.snappy.parquet: response error "request error", after 0 retries: error sending request for url (http://lakefs:8000/<bucket>/main/silver/re/list/part-00003-a23a8231-bcb4-40f7-81d7-1dc3515ea120-c000.snappy.parquet): dispatch task is gone: runtime dropped the dispatch task

The table was written using Spark 3.1.1 / delta-spark 2.1.1 and read with delta-rs==0.6.3. I hadn't been getting the error with a few hundred rows. It was also solved by using the PyArrow filesystem as suggested above.
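
For anyone else hitting this against an S3-compatible store, this is roughly the filesystem wrapping I ended up with (endpoint, credentials, paths, and region below are illustrative placeholders, not my real config):

from deltalake import DeltaTable
from pyarrow import fs

# S3-compatible endpoint over plain HTTP; all values are placeholders
s3 = fs.S3FileSystem(
    access_key="<access key>",
    secret_key="<secret key>",
    endpoint_override="lakefs:8000",
    scheme="http",
    region="us-east-1",  # dummy region for non-AWS stores
)
filesystem = fs.SubTreeFileSystem("<bucket>/main/silver/re/list", s3)

dt = DeltaTable(
    "s3://<bucket>/main/silver/re/list",
    storage_options={
        "AWS_ACCESS_KEY_ID": "<access key>",
        "AWS_SECRET_ACCESS_KEY": "<secret key>",
        "AWS_ENDPOINT_URL": "http://lakefs:8000",
        "AWS_REGION": "us-east-1",
    },
)
df = dt.to_pandas(filesystem=filesystem)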

@joshuarobinson
Author

@roeap I don't currently have the bandwidth to build off main, but I'll follow and try on the next release

@wjones127
Collaborator

@joshuarobinson We just released 0.6.4. Let us know if you still have any issue with reading tables.

@roeap
Collaborator

roeap commented Jan 27, 2023

@joshuarobinson - could you check if the latest release fixes this for you?

@joshuarobinson
Author

@roeap I can confirm that version 0.7.0 now allows me to write a Delta table successfully. Thanks for the follow-up!

@roeap
Collaborator

roeap commented Jan 29, 2023

Great! Will close the issue then :)

@roeap roeap closed this as completed Jan 29, 2023