
Python library error reading from large Delta Table on S3 #882

Closed
joshuarobinson opened this issue Oct 12, 2022 · 12 comments
Labels
bug Something isn't working

Comments

@joshuarobinson

Environment: Ubuntu 22.04, reading from an on-prem S3 object store into an Arrow Table

Delta-rs version: 0.6.2

Binding: Python


Bug

What happened:
PanicException when reading from a Delta table in an on-prem S3 object store (Swift).

Some tables read okay and others do not; anecdotally, the smaller tables are fine and the larger ones fail.

How to reproduce it:

from deltalake import DeltaTable

storage_options = {
    "AWS_ACCESS_KEY_ID": ACCESS_KEY, "AWS_SECRET_ACCESS_KEY": SECRET_KEY,
    "AWS_ENDPOINT_URL": ENDPOINT_URL, "AWS_REGION": "us-east-1",
}
TBL_PATH = "s3://warehouse/tpcds_sf1000_dlt/catalog_sales/"
dt = DeltaTable(TBL_PATH, storage_options=storage_options)
ds = dt.to_pyarrow_dataset()
df = ds.head(100).to_pandas()  # FAILS here

More details:

Error message:

thread '<unnamed>' panicked at 'dispatch dropped without returning error', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.14.20/src/client/conn.rs:329:35
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'dispatch dropped without returning error', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.14.20/src/client/conn.rs:329:35
Traceback (most recent call last):
  File "/delta_read_duckdb.py", line 22, in <module>
    df = ds.head(100).to_pandas()
  File "pyarrow/_dataset.pyx", line 365, in pyarrow._dataset.Dataset.head
  File "pyarrow/_dataset.pyx", line 2623, in pyarrow._dataset.Scanner.head
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
pyo3_runtime.PanicException: dispatch dropped without returning error
FATAL: exception not rethrown
@joshuarobinson joshuarobinson added the bug Something isn't working label Oct 12, 2022
@joshuarobinson
Author

When I run with RUST_BACKTRACE=full, I get the following stacktrace fragment (other frames are unknown):

30:     0x7fc4ff82f2ab - cfunction_call
                               at /usr/src/python/Objects/methodobject.c:543
  31:     0x7fc4ff7f230d - _PyObject_MakeTpCall
                               at /usr/src/python/Objects/call.c:215
  32:     0x7fc4ff7e239f - _PyObject_CallFunctionVa
                               at /usr/src/python/Objects/call.c:479:18
  33:     0x7fc4ff92b316 - _PyObject_CallMethod_SizeT
                               at /usr/src/python/Objects/call.c:651
  34:     0x7fc4f9a7a8a1 - _ZN5arrow2py14PyReadableFile4ReadEl
  35:     0x7fc4f9a7533b - _ZN5arrow2py14PyReadableFile6ReadAtEll.localalias
  36:     0x7fc4f92dbe2e - _ZN7parquet16ReaderProperties9GetStreamESt10shared_ptrIN5arrow2io16RandomAccessFileEEll
  37:     0x7fc4f93abae2 - _ZN7parquet18SerializedRowGroup19GetColumnPageReaderEi
  38:     0x7fc4f93553ef - _ZN7parquet14RowGroupReader19GetColumnPageReaderEi
  39:     0x7fc4f93ef833 - _ZN7parquet5arrow12_GLOBAL__N_19GetReaderERKNS0_11SchemaFieldERKSt10shared_ptrIN5arrow5FieldEERKS5_INS0_13ReaderContextEEPSt10unique_ptrINS0_16ColumnReaderImplESt14default_deleteISG_EE
  40:     0x7fc4f9409b57 - _ZN7parquet5arrow12_GLOBAL__N_114FileReaderImpl15GetFieldReadersERKSt6vectorIiSaIiEES7_PS3_ISt10shared_ptrINS0_16ColumnReaderImplEESaISA_EEPS8_IN5arrow6SchemaEE.constprop.0
  41:     0x7fc4f940c652 - _ZN7parquet5arrow12_GLOBAL__N_114FileReaderImpl15DecodeRowGroupsESt10shared_ptrIS2_ERKSt6vectorIiSaIiEES9_PN5arrow8internal8ExecutorE
  42:     0x7fc4f940e1ec - _ZN7parquet5arrow17RowGroupGenerator15ReadOneRowGroupEPN5arrow8internal8ExecutorESt10shared_ptrINS0_12_GLOBAL__N_114FileReaderImplEEiRKSt6vectorIiSaIiEE
  43:     0x7fc4f93e0b1e - _ZN5arrow8internal6FnOnceIFvvEE6FnImplISt5_BindIFNS_6detail14ContinueFutureENS_6FutureISt8functionIFNS8_ISt10shared_ptrINS_11RecordBatchEEEEvEEEEPFSG_PNS0_8ExecutorESA_IN7parquet5arrow12_GLOBAL__N_114FileReaderImplEEiRKSt6vectorIiSaIiEEESI_SN_iSQ_EEE6invokeEv
  44:     0x7fc4fac69e7b - _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5arrow8internal10ThreadPool21LaunchWorkersUnlockedEiEUlvE_EEEEE6_M_runEv
  45:     0x7fc4fbb805f0 - execute_native_thread_routine
  46:     0x7fc4ff64dfa3 - start_thread
  47:     0x7fc4ff3efeff - clone
  48:                0x0 - <unknown>

@roeap
Collaborator

roeap commented Oct 19, 2022

Hmm, this one is a bit of a puzzle, as it seems that the connection to the storage location fails at some point during the download of a file.

Could you try to use the pyarrow native S3 filesystem to see if the error still occurs? You have to wrap it in a subtree filesystem though, as described here: https://delta-io.github.io/delta-rs/python/usage.html#custom-storage-backends
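
Something along these lines should work (a rough sketch reusing the placeholders from the reproduction above, not tested against Swift; depending on your endpoint you may need to split the scheme out of endpoint_override):

from deltalake import DeltaTable
from pyarrow import fs

# pyarrow-native S3 filesystem pointed at the on-prem endpoint
s3 = fs.S3FileSystem(
    access_key=ACCESS_KEY,
    secret_key=SECRET_KEY,
    endpoint_override=ENDPOINT_URL,
    region="us-east-1",
)

# wrap it in a SubTreeFileSystem rooted at the table path (bucket + key, no scheme)
filesystem = fs.SubTreeFileSystem("warehouse/tpcds_sf1000_dlt/catalog_sales", s3)

dt = DeltaTable(TBL_PATH, storage_options=storage_options)
ds = dt.to_pyarrow_dataset(filesystem=filesystem)
df = ds.head(100).to_pandas()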

@ycrouin

ycrouin commented Oct 23, 2022

Using

from pyarrow import fs

filesystem = fs.SubTreeFileSystem("<bucket>/<table key>", fs.S3FileSystem())
dt.to_pandas(filesystem=filesystem)

solves the PanicException issue I had on my table (which is not that big)

@joshuarobinson
Author

I also tried with the SubTreeFileSystem and that has solved my PanicException too.

Does that give any hints as to what's wrong with the original version? I'm happy to debug further...

@wjones127
Collaborator

No need to debug further; I've found a couple issues in #893 while adding integration tests with S3.

@roeap
Collaborator

roeap commented Nov 18, 2022

@joshuarobinson - is it possible for you to build off current main and see if the error still exists? We have some indication that the dropped clients issue could be resolved now.

@jacobdanovitch

jacobdanovitch commented Nov 19, 2022

I got a similar (but not identical) traceback with a table of a few thousand rows on S3-compatible storage (LakeFS backed by Minio):

deltalake.PyDeltaTableError: Generic S3 error: Error performing get request main/silver/re/list/part-00003-a23a8231-bcb4-40f7-81d7-1dc3515ea120-c000.snappy.parquet: response error "request error", after 0 retries: error sending request for url (http://lakefs:8000/<bucket>/main/silver/re/list/part-00003-a23a8231-bcb4-40f7-81d7-1dc3515ea120-c000.snappy.parquet): dispatch task is gone: runtime dropped the dispatch task

The table was written using Spark 3.1.1 / delta-spark 2.1.1 and read with delta-rs==0.6.3. I hadn't been getting the error with a few hundred rows. It was also solved by using the PyArrow filesystem as suggested above.
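
For anyone else hitting this against an S3-compatible store, this is roughly the filesystem wrapping I ended up with (endpoint, credentials, paths, and region below are illustrative placeholders, not my real config):

from deltalake import DeltaTable
from pyarrow import fs

# S3-compatible endpoint over plain HTTP; all values are placeholders
s3 = fs.S3FileSystem(
    access_key="<access key>",
    secret_key="<secret key>",
    endpoint_override="lakefs:8000",
    scheme="http",
    region="us-east-1",  # dummy region for non-AWS stores
)
filesystem = fs.SubTreeFileSystem("<bucket>/main/silver/re/list", s3)

dt = DeltaTable(
    "s3://<bucket>/main/silver/re/list",
    storage_options={
        "AWS_ACCESS_KEY_ID": "<access key>",
        "AWS_SECRET_ACCESS_KEY": "<secret key>",
        "AWS_ENDPOINT_URL": "http://lakefs:8000",
        "AWS_REGION": "us-east-1",
    },
)
df = dt.to_pandas(filesystem=filesystem)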

@joshuarobinson
Author

@roeap I don't currently have the bandwidth to build off main, but I'll follow and try on the next release

@wjones127
Collaborator

@joshuarobinson We just released 0.6.4. Let us know if you still have any issue with reading tables.

@roeap
Collaborator

roeap commented Jan 27, 2023

@joshuarobinson - could you check if the latest release fixes this for you?

@joshuarobinson
Author

@roeap I can confirm that version 0.7.0 now allows me to write a Delta table successfully. Thanks for the follow-up!

@roeap
Collaborator

roeap commented Jan 29, 2023

Great! Will close the issue then :)

@roeap roeap closed this as completed Jan 29, 2023