-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deadlock when running complex query #14939
Comments
A minimal working example reproducing the problem would be much appreciated. |
Having a hard time creating something, will keep trying, but I've found that it seemingly is caused by:
Ie. the default EDIT |
Hmm, i'm actually not sure what the issue is. I'm still experiencing a deadlock, it happens so intermittently. |
I managed to capture the verbose output for when it's deadlocking, any ideas?
|
Ran into another instance of it, turns out it doesn't matter what combination of:
|
map_batches
@ritchie46 anything pop out to you? I'm also using a lot of window functions in this query if that's relevant. |
Do you use |
I removed the |
Here's the obfuscated query plan:
|
I've tried compiling
|
Not sure if this is helpful, but it seems to help a little if I turn down the concurrency budget, though that slows down my query quite dramatically. |
I've tried running it under valgrind's helgrind, and am unable to reproduce, it probably modifies the runtime charecteristics too much. Haven't tried https://github.com/tokio-rs/loom or https://github.com/awslabs/shuttle i'm curious if they should be integrated into testing for polars? |
@ritchie46 i locally compiled polars and replaced all usages of OnceLock/RwLock/Mutex with tracing-mutex and it didn't panic when running, so I'm guessing the deadlocking either has something to do with the GIL (though there's no |
Can you try generating a flamegraph? We can maybe see where the threads stall using it - let it deadlock for a while before killing the PID from another terminal. |
I did try to (I tried to generate a flamegraph by attaching to the deadlocked process) but since it wasn't making any progress it didn't seem to log anything in perf. |
You could get a backtrace in gdb. 👀 |
Alright, in that case I think we can try to use gdb -p <PID> -ex 'thread apply all bt' -ex detach -ex quit > backtrace.txt |
If i encounter this again I'll capture one, after upgrading to 0.20.20 I have yet to encounter a deadlock, perhaps it's been fixed by #15626 or something. Lets keep this issue open for a while, I'll close it once I'm sure it's been fixed? |
@nameexhaustion @ritchie46 looks like it's not fixed, had to run a 30s query over 500 times for it to deadlock, here's the backtrace. |
Nice. I'm guessing a timeout for now, given I can see 4 occurences of Thread 162 (Thread 0x7f80111ff700 (LWP 10692)):
#0 0x00007f824e6c92a9 in syscall () from /lib64/libc.so.6
#1 0x00007f8245c90a43 in tokio::runtime::park::Inner::park () from /path/to/venv/lib/python3.11/site-packages/polars/polars.abi3.so
#2 0x00007f824499da6c in polars_io::pl_async::RuntimeManager::block_on_potential_spawn::{{closure}} () from /path/to/venv/lib/python3.11/site-packages/polars/polars.abi3.so
#3 0x00007f824499a780 in <polars_lazy::physical_plan::executors::scan::parquet::ParquetExec as polars_lazy::physical_plan::executors::executor::Executor>::execute::{{closure}} () from /path/to/venv/lib/python3.11/site-packages/polars/polars.abi3.so Would you be able to give this a try - #15643. I have re-added a the connection timeout so I'm hoping that instead of hanging it should display an error. |
I gave that a try and it didn't work, still deadlocked/hung. |
Is there anything else i can do to help diagnose this @nameexhaustion ? |
Does this look like it might be the problem?
|
I think the backtrace is telling us we are stuck on an async future, but we need to try and confirm this. I have updated #15643 to set the timeout to 3 minutes instead of 7, and also set the non-connect timeout - please give it another try. Please also try using this function instead of directly calling collect_all_with_timeout_and_retrydef collect_all_with_timeout_and_retry(
lfs: list[pl.LazyFrame],
heartbeat_lf: pl.LazyFrame,
*,
timeout_secs=30 + int(os.getenv("POLARS_ASYNC_TIMEOUT", "3")) * 60,
heartbeat_interval_secs=10,
) -> list[pl.LazyFrame]:
import warnings
import asyncio
from time import perf_counter
print(
f"collect_all_with_timeout_and_retry {timeout_secs = } {heartbeat_interval_secs = }"
)
is_complete = asyncio.Event()
async def collect_loop():
excs = []
for i in range(1, 4):
if i == 3:
raise Exception(f"collect_loop failed after {i} tries ({excs = })")
task = pl.collect_all_async(lfs)
try:
out = await asyncio.wait_for(task, timeout=timeout_secs)
except Exception as e:
excs.append(e)
continue
break
if i > 1:
raise Exception(
f"collect_loop finished, but required {i} tries ({excs = })"
)
return out
async def heartbeat_loop():
while not is_complete.is_set():
start = perf_counter()
task = heartbeat_lf.select(pl.first()).head(1).collect_async()
try:
await asyncio.wait_for(task, timeout=heartbeat_interval_secs)
except Exception as e:
raise f"heartbeat_loop timed out {e = }"
elapsed = perf_counter() - start
sleep = heartbeat_interval_secs - elapsed
sleep = max(0, sleep)
with warnings.catch_warnings():
warnings.simplefilter("ignore")
next(asyncio.as_completed([is_complete.wait(), asyncio.sleep(sleep)]))
async def f():
task = asyncio.create_task(heartbeat_loop())
out = await collect_loop()
is_complete.set()
await task
return out
return asyncio.run(f()) |
Is it possible: // Try to get cached grouptuples
let (mut groups, _, cache_key) = if state.cache_window() {
let mut cache_key = String::with_capacity(32 * group_by_columns.len());
write!(&mut cache_key, "{}", state.branch_idx).unwrap();
for s in &group_by_columns {
cache_key.push_str(s.name());
}
let mut gt_map = state.group_tuples.write().unwrap();
// we run sequential and partitioned
// and every partition run the cache should be empty so we expect a max of 1.
debug_assert!(gt_map.len() <= 1);
if let Some(gt) = gt_map.get_mut(&cache_key) {
if df.height() > 0 {
assert!(!gt.is_empty());
};
// We take now, but it is important that we set this before we return!
// a next windows function may get this cached key and get an empty if this
// does not happen
(std::mem::take(gt), true, cache_key)
} else {
drop(gt_map); # Added a drop to the guard here
(create_groups()?, false, cache_key)
} Specifically dropping the guard in the Seems to prevent me from deadlocking (though I'm not 100% sure if it's working or not or just getting lucky). I am running a lot of window expressions, so it's possible it's a very tough deadlock to trigger that most people wouldn't encounter. @nameexhaustion |
Okay nvm that didn't work. |
@nameexhaustion gave your change a try, it didn't seem to timeout after deadlocking for 5m+ wasn't sure about the |
The python function was to:
Did you manage to get any output from it at all? You may have to ensure you don't have any |
@kszlim that's a good find. We should always drop mutexes before we spawn work with rayon as we don't want to hold the mutex when pushing other tasks on the work scheduler. This snippet in the window functions was indeed holding a rwlock when parallelizing work with rayon. Will issue a PR. |
When running with just a |
Still getting deadlock on 0.20.30, but this might be related I unfortunately don't have code to reproduce this, but I have some backtraces captured by gdb. latest_backtrace.txt Not sure if these are helpful. @stinodego this is related to what we discussed. |
Is it possible there's some sort of wrong assumption being made about the pyo3/GIL that's causing the deadlock? |
I wonder if this is the issue?
iirc if you run blocking tasks on the tokio runtime at all, it can cause a deadlock? |
Still no MRE, but i've noticed that |
Seems to also occur a lot more readily when doing something like: from itertools import batched
import psutil
all_ids = [...] # list containing thousands of ids
results = []
for ids in batched(all_ids, max(len(all_ids) // psutil.cpu_count(), 1)):
ldf = query_function(partition_on=ids)
results.append(ldf)
dfs = pl.collect_all(results)
pl.concat(dfs) vs from itertools import batched
import psutil
all_ids = [...] # list containing thousands of ids
results = []
for ids in batched(all_ids, max(len(all_ids) // psutil.cpu_count(), 1)):
ldf = query_function(partition_on=ids)
results.append(ldf.collect().lazy())
dfs = pl.collect_all(results)
pl.concat(dfs) I'm guessing the |
@nameexhaustion i'm starting to get convinced that this is caused somewhere here https://github.com/pola-rs/polars/blob/main/crates/polars-io/src/parquet/read/read_impl.rs#L734 i notice the oneshot channel in my backtrace when using |
Okay, I've simplified my code to: import os
os.environ["POLARS_CONCURRENCY_BUDGET"] = "1500"
import polars as pl
import psutil
import polars.selectors as cs
from collections.abc import Generator, Iterable
from itertools import islice
from typing import TypeVar
T = TypeVar("T")
def batched(iterable: Iterable[T], n: int) -> Generator[tuple[T, ...], None, None]:
"""Yield successive n-sized chunks from iterable."""
if n < 1:
msg = "n must be at least one"
raise ValueError(msg)
it = iter(iterable)
while batch := tuple(islice(it, n)):
yield batch
def get_ldf():
url = f"s3://my_s3_bucket/table_name/*/*.parquet"
ldf = pl.scan_parquet(
url,
retries=10, # TODO: Move a bunch of this stuff into proper config
hive_partitioning=True,
)
return ldf
base_ldf = get_ldf()
NUM_IDS = 2000
IDS = list(range(NUM_IDS))
ID_COL = 'some_hive_partition_id'
for i in range(100):
ldfs = []
for batch in batched(IDS, NUM_IDS // psutil.cpu_count()):
ldf = base_ldf.filter(pl.col(ID_COL).is_in(batch))
ldf = ldf.group_by(ID_COL).agg(cs.float().mean().name.suffix("_mean"), cs.float().median().name.suffix("_median"), cs.float().min().name.suffix("_min"), cs.float().max().name.suffix("_max"), cs.float().std().name.suffix("_std"))
ldfs.append(ldf)
print(f"On iteration: {i}")
dfs = pl.collect_all(ldfs)
df = pl.concat(dfs)
print(df) and I get this backtrace: @nameexhaustion @ritchie46 I'm not sure, but maybe this is enough for you guys to track down the issue? It's not a full repro, but you can see there's nothing exotic being done in the query. My polars version is |
Perhaps there needs to be a cleaner delineation between sync and async code? I think the |
Closing this in lieu of #18086 |
Checks
Reproducible example
Log output
Output right before deadlock:
Issue description
Deadlocks when running the collection in a loop many times.
Expected behavior
Doesn't deadlock
Installed versions
EDIT:
I no longer thing it has anything to do with
map_batches
ormap_elements
, i've removed them from my code and it still occurs. Seems like it gets hung shortly after:This is also a
scan_parquet
against the cloud with a few joins.The text was updated successfully, but these errors were encountered: