Threading issues accessing ADLSGen2 table from Python #915
Comments
Thanks for reporting this. I think I might have reproduced something similar while working on #893. I'll look into this specific example (the other one was in S3) in #912. My guess is there are some places where we were panicking (calling …).
Let me know of any possible workarounds for the Jupyter hang or the ADLS threading issues (maybe the same root cause). I can file a separate issue for the Jupyter hang if that'd be helpful. I can try to help out with the fix / tests too, if you point me in the right direction.
Yes, we see the same behavior in v0.6.3.
Got it - thanks for the update! I'll dive into this a bit.
@craustin Quick confirmation: if you ran the below instead of the filtered statement, would the query work as expected? The reason I'm asking is that I'm wondering if there is something specific to your dataset (e.g. its size). At least a quick test on one of my datasets (528 rows, 30 columns) in ADLSgen2 resulted in the query working as expected.
I get the same errors when I remove the filter. When I run …
There are ~70 transactions in the delta log and ~50 parquet files (snappy format, from Databricks), with ~1M rows and 20 columns. Also, as I mentioned in the initial issue, I get an indefinite hang (or so it seems) when running from Jupyter. Are you able to run from a notebook? It seems likely to be a related threading issue. I don't see the hang when running from a raw `python`/`ipython` interpreter.
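For context, here is a minimal sketch of the kind of unfiltered read being discussed; the original snippet was not captured in this thread. The account, container, path, and key are placeholders, and it assumes `DeltaTable` accepts `storage_options` and that `to_pyarrow_table` is the read entry point in the installed version.

```python
from deltalake import DeltaTable

# Placeholder ADLS Gen2 credentials and location - substitute your own.
storage_options = {"AZURE_STORAGE_ACCOUNT_KEY": "myaccountkey"}
table_uri = "abfss://[email protected]/data"

# Read the whole table with no filter; the reported threading/dispatch
# error surfaces during this call in the affected environment.
dt = DeltaTable(table_uri, storage_options=storage_options)
df = dt.to_pyarrow_table().to_pandas()
print(df.shape)
```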
Got it, thanks for the additional info. I'll set up a larger example to see if I can repro. In the interim, any chance you can run this on a smaller dataset to see if you get the same error?
I think it might not be size-related, but rather something wrong with threading - at least in my environment somehow. I just tried with a table with 1 transaction, 6 snappy parquet files, 6 rows, and 4 columns - and I still get the same error.
Gotcha, thanks for diving in more. Okay, let me see if I can repro your specific environment and work backwards from there.
I was able to repro the error (…) with the following:

```python
import pandas as pd
from deltalake.writer import write_deltalake

storage_options = {"AZURE_STORAGE_ACCOUNT_KEY": "myaccountkey"}
df = pd.DataFrame({'x': range(100)})
table_root = "abfss://[email protected]/data"
write_deltalake(table_root, df, partition_by=["x"], storage_options=storage_options)
```
Thanks for the additional info. Can you do me a favor and ping me on Delta Users Slack (via dennyglee)? I have a few more questions. Don't worry, I will update this issue with the pertinent info; I just didn't want to flood it with lots of small questions.
Thanks @craustin - appreciate the Slack conversation and updating the issue as noted above to help clarify things. One possible reason for the dispatch issue appears to be associated with partitions. For example, the write works when partitions are not specified but fails when they are, with my error looking like …
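To make the contrast concrete, here is a hedged sketch reusing the earlier repro; the exact statements run above were not captured, and the paths and key are placeholders.

```python
import pandas as pd
from deltalake.writer import write_deltalake

storage_options = {"AZURE_STORAGE_ACCOUNT_KEY": "myaccountkey"}  # placeholder
df = pd.DataFrame({"x": range(100)})

# Works: no partition columns, so only a handful of files are written.
write_deltalake(
    "abfss://[email protected]/data_unpartitioned",
    df,
    storage_options=storage_options,
)

# Fails with the dispatch error: partitioning by "x" produces one file per
# distinct value, i.e. ~100 small files written concurrently.
write_deltalake(
    "abfss://[email protected]/data_partitioned",
    df,
    partition_by=["x"],
    storage_options=storage_options,
)
```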
Quick update: it appears that this may be too many open files via the Azure Rust SDK. For the above scenario, could you try reducing `max_open_files`? Note, when I write this locally, the …
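A hedged sketch of that suggestion, assuming the installed `write_deltalake` forwards `max_open_files` to pyarrow's dataset writer (the path and key are placeholders; check your version's signature):

```python
import pandas as pd
from deltalake.writer import write_deltalake

storage_options = {"AZURE_STORAGE_ACCOUNT_KEY": "myaccountkey"}  # placeholder
df = pd.DataFrame({"x": range(100)})
table_root = "abfss://[email protected]/data"  # placeholder

# Cap how many files the writer keeps open at once; a low value trades
# write throughput for fewer concurrent handles/connections.
write_deltalake(
    table_root,
    df,
    partition_by=["x"],
    storage_options=storage_options,
    max_open_files=8,
)
```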
This appears to be affecting more than just ADLSGen2, and it also affects reading. I'm having similar issues with AWS S3.
Results in: …
Per Denny's suggestion, I tried looking into reducing `max_open_files`, but this doesn't exist for the `to_table` function. Instead, we have `fragment_readahead` for files and `batch_readahead` for batches. I tried two things. Wondering if this might just be a speed/performance issue from reducing things so low, I then increased `batch_readahead` again while keeping `fragment_readahead` at 1.
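For reference, a sketch of that readahead tuning, assuming the installed pyarrow exposes `fragment_readahead` and `batch_readahead` on `Dataset.scanner`; the bucket path is a placeholder and AWS credentials are taken from the environment.

```python
from deltalake import DeltaTable

# Placeholder S3 location; credentials come from the environment.
dt = DeltaTable("s3://my-bucket/my-delta-table")
ds = dt.to_pyarrow_dataset()

# Read one fragment (file) at a time to limit concurrent object-store
# requests, but keep batch readahead higher to claw back some throughput.
table = ds.scanner(fragment_readahead=1, batch_readahead=16).to_table()
print(table.num_rows)
```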
This also appears to affect other dataset operations: I get the original dispatch error while trying to use the deltalake -> pyarrow dataset -> duckdb interoperability. However, it seems to be rather intermittent, perhaps supporting the idea that this may be related to the number of files accessed. Details: …
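A sketch of that deltalake -> pyarrow dataset -> duckdb path, with a placeholder table location and assuming `duckdb.arrow()` is available in the installed duckdb package:

```python
import duckdb
from deltalake import DeltaTable

# Placeholder table location.
dt = DeltaTable("s3://my-bucket/my-delta-table")
ds = dt.to_pyarrow_dataset()

# duckdb scans the pyarrow dataset lazily; the intermittent dispatch error
# reportedly shows up while duckdb pulls record batches from the dataset.
rel = duckdb.arrow(ds)
print(rel.limit(5).df())
```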
Let me know if this may be different enough to merit opening up a new ticket. Thanks!
We do have a PR open that - while its title is completely unrelated - should hopefully remedy the situation: #933. That it applies to more than just Azure is actually a good thing here, since the root cause we identified is not Azure-related. @wjones127, is there anything I can do to help drive that PR home? :)
I'll be able to look at it this weekend, but if you have time to reproduce this issue and see if it fixes it, that would be helpful.
# Description
This PR builds on top of the changes to handling the runtime in #933. In my local tests this fixed #915. Additionally, I added the runtime as a property on the fs handler to avoid re-creating it on every call. In some non-representative tests with a large number of very small partitions it cut the runtime roughly in half. cc @wjones127
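The actual change is in the Rust bindings, but the design choice generalizes: build the expensive runtime once and reuse it instead of re-creating it per call. A rough Python-flavored illustration of that pattern only; all names here are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import cached_property


class FsHandler:
    """Illustration only: cache an expensive-to-create resource on the
    handler instead of rebuilding it for every filesystem call."""

    @cached_property
    def runtime(self) -> ThreadPoolExecutor:
        # Stand-in for creating an async runtime / thread pool exactly once.
        return ThreadPoolExecutor(max_workers=4)

    def read(self, path: str) -> bytes:
        # Every call reuses the same cached runtime.
        return self.runtime.submit(lambda: b"data from " + path.encode()).result()


handler = FsHandler()
print(handler.read("part-0001.parquet"))
```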
Reopening to await confirmation from those affected :)
Let me know if I can help repro / test it as well, eh?!
If you could build current main and see if that fixes it, that would be awesome.
You got it! I’ll check later tonight :-)
Ran a few of the queries that were very often causing errors on main. No dispatch errors so far! Looking great! 💯
Like @PadenZach, I also re-ran my repro steps and it works like a charm :)
Just realized that #882 likely has the same root cause; noting it here just to keep track :)
Latest …
Environment

Delta-rs version: v0.6.2
Binding: Python 3.8
Environment:

Bug

What happened:
I'm following the tutorial here: … Sometimes the first line (`thread '<unnamed>'...`) is printed more than once. Sometimes I get this error instead: …

What you expected to happen:
No errors. A `pd.DataFrame` comes back.

How to reproduce it:
Run the above code.

More details:
Possibly related, running the `DeltaTable(...)` constructor in a Jupyter notebook hangs and never returns - so I'm running in a raw `python`/`ipython` interpreter. I don't see the hang there, but I do see the above threading issues. (`delta-rs` is awesome by the way. Thank you! 🙏)