Threading issues accessing ADLSGen2 table from Python #915
Comments
Thanks for reporting this. I think I might have reproduced something similar while working on #893. I'll look into this specific example (the other one was in S3) in #912. My guess is there are some places where we were panicking (calling …).
Let me know of any possible workarounds for the Jupyter hang or the ADLS threading issues (maybe the same root cause). I can file a separate issue for the Jupyter hang if that'd be helpful. I can try to help out with the fix / tests too, if you point me in the right direction.
Yes, we see the same behavior in v0.6.3.
Got it - thanks for the update! I'll dive into this a bit.
@craustin Quick confirmation: if you ran the below instead of the filtered statement, would the query work as expected? The reason I'm asking is that I'm wondering if there is something specific to your dataset (e.g. its size). At least a quick test on one of my datasets (528 rows, 30 columns) in ADLSgen2 resulted in the query working as expected.
I get the same errors when I remove the filter. When I run …
There are ~70 transactions in the delta log and ~50 parquet files (snappy format, from Databricks), with ~1M rows and 20 columns. Also, as I mentioned in the initial issue, I get an indefinite hang (or so it seems) when running from Jupyter. Are you able to run from a notebook? It seems likely to be a related threading issue. I don't see the hang when running from a raw `python`/`ipython` interpreter.
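For context, here is a minimal sketch of the kind of unfiltered read being discussed; the original snippet was not captured in this thread. The account, container, path, and key are placeholders, and it assumes `DeltaTable` accepts `storage_options` and that `to_pyarrow_table` is the read entry point in the installed version.

```python
from deltalake import DeltaTable

# Placeholder ADLS Gen2 credentials and location - substitute your own.
storage_options = {"AZURE_STORAGE_ACCOUNT_KEY": "myaccountkey"}
table_uri = "abfss://[email protected]/data"

# Read the whole table with no filter; the reported threading/dispatch
# error surfaces during this call in the affected environment.
dt = DeltaTable(table_uri, storage_options=storage_options)
df = dt.to_pyarrow_table().to_pandas()
print(df.shape)
```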
Got it, thanks for the additional info. I'll set up a larger example to see if I can repro. In the interim, any chance you can run this on a smaller dataset to see if you get the same error?
I think it might not be size-related, but rather something wrong with threading - at least in my environment somehow. I just tried with a table with 1 transaction, 6 snappy parquet files, 6 rows, and 4 columns - and I still get the same error.
Gotcha, thanks for diving in more. Okay, let me see if I can repro your specific environment and work backwards from there.
I was able to repro the error (…) with the following:

```python
import pandas as pd
from deltalake.writer import write_deltalake

storage_options = {"AZURE_STORAGE_ACCOUNT_KEY": "myaccountkey"}
df = pd.DataFrame({'x': range(100)})
table_root = "abfss://[email protected]/data"
write_deltalake(table_root, df, partition_by=["x"], storage_options=storage_options)
```
Thanks for the additional info. Can you do me a favor and ping me on Delta Users Slack (via dennyglee)? I have a few more questions. Don't worry, I will update this issue with the pertinent info; I just didn't want to flood it with lots of small questions.
Thanks @craustin - appreciate the Slack conversation and updating the issue as noted above to help clarify things. One possible reason for the dispatch issue appears to be associated with partitions. For example, the write works when partitions are not specified but fails when they are, with my error looking like …
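To make the contrast concrete, here is a hedged sketch reusing the earlier repro; the exact statements run above were not captured, and the paths and key are placeholders.

```python
import pandas as pd
from deltalake.writer import write_deltalake

storage_options = {"AZURE_STORAGE_ACCOUNT_KEY": "myaccountkey"}  # placeholder
df = pd.DataFrame({"x": range(100)})

# Works: no partition columns, so only a handful of files are written.
write_deltalake(
    "abfss://[email protected]/data_unpartitioned",
    df,
    storage_options=storage_options,
)

# Fails with the dispatch error: partitioning by "x" produces one file per
# distinct value, i.e. ~100 small files written concurrently.
write_deltalake(
    "abfss://[email protected]/data_partitioned",
    df,
    partition_by=["x"],
    storage_options=storage_options,
)
```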
Quick update: it appears that this may be too many open files via the Azure Rust SDK. For the above scenario, could you try reducing `max_open_files`? Note, when I write this locally, the …
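A hedged sketch of that suggestion, assuming the installed `write_deltalake` forwards `max_open_files` to pyarrow's dataset writer (the path and key are placeholders; check your version's signature):

```python
import pandas as pd
from deltalake.writer import write_deltalake

storage_options = {"AZURE_STORAGE_ACCOUNT_KEY": "myaccountkey"}  # placeholder
df = pd.DataFrame({"x": range(100)})
table_root = "abfss://[email protected]/data"  # placeholder

# Cap how many files the writer keeps open at once; a low value trades
# write throughput for fewer concurrent handles/connections.
write_deltalake(
    table_root,
    df,
    partition_by=["x"],
    storage_options=storage_options,
    max_open_files=8,
)
```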
This appears to be affecting more than just ADLSGen2, and it also affects reading. I'm having similar issues with AWS S3.
Results in: …
Per Denny's suggestion, I tried looking into reducing `max_open_files`, but this doesn't exist for the `to_table` function. Instead, we have `fragment_readahead` for files and `batch_readahead` for batches. I tried two things. Wondering if this might just be a speed/performance issue from reducing things so low, I then increased `batch_readahead` again while keeping `fragment_readahead` at 1.
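For reference, a sketch of that readahead tuning, assuming the installed pyarrow exposes `fragment_readahead` and `batch_readahead` on `Dataset.scanner`; the bucket path is a placeholder and AWS credentials are taken from the environment.

```python
from deltalake import DeltaTable

# Placeholder S3 location; credentials come from the environment.
dt = DeltaTable("s3://my-bucket/my-delta-table")
ds = dt.to_pyarrow_dataset()

# Read one fragment (file) at a time to limit concurrent object-store
# requests, but keep batch readahead higher to claw back some throughput.
table = ds.scanner(fragment_readahead=1, batch_readahead=16).to_table()
print(table.num_rows)
```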
This also appears to affect other dataset operations: I get the original dispatch error while trying to use the deltalake -> pyarrow dataset -> duckdb interoperability. However, it seems to be rather intermittent, perhaps supporting the idea that this may be related to the number of files accessed. Details: …
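A sketch of that deltalake -> pyarrow dataset -> duckdb path, with a placeholder table location and assuming `duckdb.arrow()` is available in the installed duckdb package:

```python
import duckdb
from deltalake import DeltaTable

# Placeholder table location.
dt = DeltaTable("s3://my-bucket/my-delta-table")
ds = dt.to_pyarrow_dataset()

# duckdb scans the pyarrow dataset lazily; the intermittent dispatch error
# reportedly shows up while duckdb pulls record batches from the dataset.
rel = duckdb.arrow(ds)
print(rel.limit(5).df())
```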
Let me know if this may be different enough to merit opening up a new ticket. Thanks!
We do have a PR open that - while its title is completely unrelated - should hopefully remedy the situation: #933. That it applies to more than just Azure is actually a good thing here, since the root cause we identified is not Azure-related. @wjones127, is there anything I can do to help drive that PR home? :)
I'll be able to look at it this weekend, but if you have time to reproduce this issue and see if it fixes it, that would be helpful.
# Description
This PR builds on top of the changes to handling the runtime in #933. In my local tests this fixed #915. Additionally, I added the runtime as a property on the fs handler to avoid re-creating it on every call. In some non-representative tests with a large number of very small partitions it cut the runtime roughly in half. cc @wjones127
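The actual change is in the Rust bindings, but the design choice generalizes: build the expensive runtime once and reuse it instead of re-creating it per call. A rough Python-flavored illustration of that pattern only; all names here are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import cached_property


class FsHandler:
    """Illustration only: cache an expensive-to-create resource on the
    handler instead of rebuilding it for every filesystem call."""

    @cached_property
    def runtime(self) -> ThreadPoolExecutor:
        # Stand-in for creating an async runtime / thread pool exactly once.
        return ThreadPoolExecutor(max_workers=4)

    def read(self, path: str) -> bytes:
        # Every call reuses the same cached runtime.
        return self.runtime.submit(lambda: b"data from " + path.encode()).result()


handler = FsHandler()
print(handler.read("part-0001.parquet"))
```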
Reopening to await confirmation from those affected :)
Let me know if I can help repro / test it as well, eh?!
If you could build current main and see if that fixes it, that would be awesome.
You got it! I’ll check later tonight :-)
Ran a few of the queries that were very often causing errors on main. No dispatch errors so far! Looking great! 💯
Like @PadenZach, I also re-ran my repro steps and it works like a charm :)
Just realized that #882 likely has the same root cause; noting it here just to keep track :)
Latest …
Environment

Delta-rs version: v0.6.2
Binding: Python 3.8
Environment:

Bug

What happened:
I'm following the tutorial here: … Sometimes the first line (`thread '<unnamed>'...`) is printed more than once. Sometimes I get this error instead: …

What you expected to happen:
No errors. A `pd.DataFrame` comes back.

How to reproduce it:
Run the above code.

More details:
Possibly related, running the `DeltaTable(...)` constructor in a Jupyter notebook hangs and never returns - so I'm running in a raw `python`/`ipython` interpreter. I don't see the hang there, but I do see the above threading issues. (`delta-rs` is awesome by the way. Thank you! 🙏)