
Possible deadlock in impl Display for PyErr #4764

Closed
bschoenmaeckers opened this issue Dec 4, 2024 · 12 comments

@bschoenmaeckers (Contributor) commented Dec 4, 2024

Bug Description

When updating polars to pyo3 0.23 I encountered a deadlock when multiple threads try to print a PyErr. When I forced the PyErrState to be evaluated with the currently held py token, the deadlock went away.

Steps to Reproduce

When I remove the e.value(py); call, pyo3 will deadlock.

https://github.com/pola-rs/polars/blob/db2410f47fb6abd5770c9707a3e969711a9ae4a2/crates/polars-python/src/on_startup.rs#L58-L72

fn warning_function(msg: &str, warning: PolarsWarning) {
    Python::with_gil(|py| {
        let warn_fn = pl_utils(py)
            .bind(py)
            .getattr(intern!(py, "_polars_warn"))
            .unwrap();

        if let Err(e) = warn_fn.call1((msg, Wrap(warning).into_pyobject(py).unwrap())) {
            // Force load lazy state of the exception so we don't deadlock in Display impl of PyErr
            e.value(py);

            eprintln!("{e}")
        }
    });
}

The imported Python function _polars_warn looks like this:

import warnings

def _polars_warn(msg: str, category: type[Warning] = UserWarning) -> None:
    warnings.warn(
        msg,
        category=category,
        stacklevel=find_stacklevel(),
    )

Backtrace

No response

Your operating system and version

Windows 11

Your Python version (python --version)

Python 3.9.20

Your Rust version (rustc --version)

rustc 1.85.0-nightly (6b6a867ae 2024-11-27)

Your PyO3 version

0.23.2

How did you install python? Did you use a virtualenv?

python.org installer

Additional Info

ref pola-rs/polars#20111

@bschoenmaeckers (Contributor, Author)

Note that the warnings are converted to errors, and so will lead to a PyErr in warning_function.

@ritchie46

I can reproduce. Now I understand why it only happens during pytest: only there does it trigger the error branch.

@bschoenmaeckers (Contributor, Author)

> I can reproduce. Now I understand why it only happens during pytest. Only there it triggers the error branch.

Yes, I forgot to mention that. Sorry for the confusion.

@ritchie46

For full context: we can acquire the GIL. We start multithreaded in a py.allow_threads context, and every thread calls a Python lambda. This all works well as long as we don't trigger the error branch mentioned above.
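
Roughly this pattern (a simplified sketch with hypothetical names such as run_batch and py_callback, not the actual polars code):

use pyo3::prelude::*;

// Simplified sketch of the pattern described above: release the GIL with
// allow_threads, fan out to worker threads, and have each worker briefly
// re-acquire the GIL to call a Python callable. Names are illustrative.
fn run_batch(py: Python<'_>, py_callback: PyObject, items: Vec<String>) {
    py.allow_threads(|| {
        std::thread::scope(|s| {
            for item in items {
                let cb = &py_callback;
                s.spawn(move || {
                    Python::with_gil(|py| {
                        // If this call raises, the error branch runs while
                        // this thread still holds the GIL.
                        if let Err(e) = cb.call1(py, (item,)) {
                            eprintln!("{e}");
                        }
                    });
                });
            }
        });
    });
}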

@ngoldbaum (Contributor)

I wouldn't be surprised if bisecting pointed to #4671 as the culprit. It would be nice to get rid of the lazy error state entirely as alluded to there.

@davidhewitt (Member)

I would support that guess. @bschoenmaeckers can you give me more specific instructions to repro? I tried to do so just now by cloning your branch of the polars repo, removing that e.value(py) evaluation, and running make test, with no luck.

@bschoenmaeckers (Contributor, Author)

Try maturin dev followed by pytest tests/unit/operations/map/test_map_elements.py -n 4 --verbose in the py-polars directory. This should produce the deadlock when e.value(py); is removed. I got a more consistent result when I increased the number of concurrent jobs to 8 or more.

@davidhewitt (Member)

Thanks, I have now reproduced this and captured a thread dump.

One thread is holding the GIL and waiting to lock stderr:

  thread #4, name = 'polars-2'
    frame #0: 0x0000000194fa4a9c libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x0000000194fe03f8 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
    frame #2: 0x0000000194fdddbc libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 220
    frame #3: 0x000000030a4bff30 polars.abi3.so`std::io::stdio::Stderr::lock [inlined] std::sys::sync::mutex::pthread::Mutex::lock at pthread.rs:113:26 [opt]
    frame #4: 0x000000030a4bff24 polars.abi3.so`std::io::stdio::Stderr::lock [inlined] std::sync::reentrant_lock::ReentrantLock<T>::lock at reentrant_lock.rs:292:28 [opt]
    frame #5: 0x000000030a4bfea0 polars.abi3.so`std::io::stdio::Stderr::lock at stdio.rs:959:29 [opt]
    frame #6: 0x000000030a4c00d0 polars.abi3.so`<&std::io::stdio::Stderr as std::io::Write>::write_fmt at stdio.rs:1024:9 [opt]
    frame #7: 0x000000030a4c076c polars.abi3.so`std::io::stdio::_eprint [inlined] <std::io::stdio::Stderr as std::io::Write>::write_fmt at stdio.rs:998:9 [opt]
    frame #8: 0x000000030a4c075c polars.abi3.so`std::io::stdio::_eprint [inlined] std::io::stdio::print_to at stdio.rs:1122:21 [opt]
    frame #9: 0x000000030a4c0744 polars.abi3.so`std::io::stdio::_eprint at stdio.rs:1243:5 [opt]
    frame #10: 0x00000003028fe824 polars.abi3.so`polars_python::on_startup::warning_function::{{closure}}((null)=(__0 = core::marker::PhantomData<(&pyo3::gil::GILGuard, pyo3::impl_::not_send::NotSend)> @ 0x000000016c18509e)) at on_startup.rs:69:13
    frame #11: 0x0000000300efa144 polars.abi3.so`pyo3::marker::Python::with_gil(f={closure_env#0} @ 0x000000016c185180) at marker.rs:412:9
    frame #12: 0x00000003028fe604 polars.abi3.so`polars_python::on_startup::warning_function(msg=(data_ptr = "Calling `map_elements` without specifying `return_dtype` can lead to unpredictable results. Specify `return_dtype` to silence this warning.\xff\xff\xff\xff\xff", length = 139), warning=MapWithoutReturnDtypeWarning) at on_startup.rs:59:5
    frame #13: 0x00000003007c26bc polars.abi3.so`polars_python::series::map::<impl polars_python::series::PySeries>::map_elements(self=0x0000000111a3b880, function=0x000000016c185678, return_dtype=Option<polars_python::conversion::Wrap<polars_core::datatypes::dtype::DataType>> @ 0x000000016c1859b0, skip_nulls=true) at map.rs:24:13
    frame #14: 0x00000003007c2e50 polars.abi3.so`polars_python::series::map::<impl polars_python::series::PySeries>::__pymethod_map_elements__((null)=(__0 = core::marker::PhantomData<(&pyo3::gil::GILGuard, pyo3::impl_::not_send::NotSend)> @ 0x000000016c185d3f), _slf=0x0000000111a3b870, _args=0x0000000111a3bc70, _kwargs=0x0000000111a70800) at map.rs:12:1

Another thread is holding the stderr lock and waiting to acquire the GIL:

  thread #6, name = 'polars-4'
    frame #0: 0x0000000194fa55cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000194fe3894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
    frame #2: 0x00000001049df33c Python`take_gil + 480
    frame #3: 0x00000001049dfa58 Python`PyEval_RestoreThread + 24
    frame #4: 0x000000030a36ef3c polars.abi3.so`<pyo3::gil::SuspendGIL as core::ops::drop::Drop>::drop(self=0x000000016c59c818) at gil.rs:313:13
    frame #5: 0x000000030a36c5e4 polars.abi3.so`core::ptr::drop_in_place<pyo3::gil::SuspendGIL>((null)=0x000000016c59c818) at mod.rs:521:1
    frame #6: 0x000000030a365a4c polars.abi3.so`pyo3::marker::Python::allow_threads((null)=(__0 = core::marker::PhantomData<(&pyo3::gil::GILGuard, pyo3::impl_::not_send::NotSend)> @ 0x000000016c59c82f), f={closure_env#0} @ 0x000000016c59c830) at marker.rs:522:5
    frame #7: 0x000000030a809ea0 polars.abi3.so`pyo3::err::err_state::PyErrState::make_normalized(self=0x000000016c59d000, (null)=(__0 = core::marker::PhantomData<(&pyo3::gil::GILGuard, pyo3::impl_::not_send::NotSend)> @ 0x000000016c59c936)) at err_state.rs:95:9
    frame #8: 0x000000030a37914c polars.abi3.so`pyo3::err::err_state::PyErrState::as_normalized(self=0x000000016c59d000, (null)=(__0 = core::marker::PhantomData<(&pyo3::gil::GILGuard, pyo3::impl_::not_send::NotSend)> @ 0x000000016c59c9ce)) at err_state.rs:76:9
    frame #9: 0x000000030a364a40 polars.abi3.so`pyo3::err::PyErr::normalized(self=0x000000016c59d000, (null)=(__0 = core::marker::PhantomData<(&pyo3::gil::GILGuard, pyo3::impl_::not_send::NotSend)> @ 0x000000016c59c9fe)) at mod.rs:763:9
    frame #10: 0x000000030a363c60 polars.abi3.so`pyo3::err::PyErr::value(self=0x000000016c59d000, (null)=(__0 = core::marker::PhantomData<(&pyo3::gil::GILGuard, pyo3::impl_::not_send::NotSend)> @ 0x000000016c59ca1e)) at mod.rs:299:9
    frame #11: 0x000000030a364c3c polars.abi3.so`<pyo3::err::PyErr as core::fmt::Display>::fmt::{{closure}}((null)=(__0 = core::marker::PhantomData<(&pyo3::gil::GILGuard, pyo3::impl_::not_send::NotSend)> @ 0x000000016c59cc5b)) at mod.rs:782:25
    frame #12: 0x000000030a36588c polars.abi3.so`pyo3::marker::Python::with_gil(f={closure_env#0} @ 0x000000016c59cd20) at marker.rs:412:9
    frame #13: 0x000000030a364c0c polars.abi3.so`<pyo3::err::PyErr as core::fmt::Display>::fmt(self=0x000000016c59d000, f=0x000000016c59cd70) at mod.rs:781:9
    frame #14: 0x000000030a4e6f24 polars.abi3.so`core::fmt::write [inlined] core::fmt::rt::Argument::fmt at rt.rs:177:76 [opt]
    frame #15: 0x000000030a4e6f18 polars.abi3.so`core::fmt::write at mod.rs:1189:21 [opt]
    frame #16: 0x000000030a4c00f4 polars.abi3.so`<&std::io::stdio::Stderr as std::io::Write>::write_fmt [inlined] std::io::Write::write_fmt at mod.rs:1887:15 [opt]
    frame #17: 0x000000030a4c00dc polars.abi3.so`<&std::io::stdio::Stderr as std::io::Write>::write_fmt at stdio.rs:1024:9 [opt]
    frame #18: 0x000000030a4c076c polars.abi3.so`std::io::stdio::_eprint [inlined] <std::io::stdio::Stderr as std::io::Write>::write_fmt at stdio.rs:998:9 [opt]
    frame #19: 0x000000030a4c075c polars.abi3.so`std::io::stdio::_eprint [inlined] std::io::stdio::print_to at stdio.rs:1122:21 [opt]
    frame #20: 0x000000030a4c0744 polars.abi3.so`std::io::stdio::_eprint at stdio.rs:1243:5 [opt]
    frame #21: 0x00000003028fe824 polars.abi3.so`polars_python::on_startup::warning_function::{{closure}}((null)=(__0 = core::marker::PhantomData<(&pyo3::gil::GILGuard, pyo3::impl_::not_send::NotSend)> @ 0x000000016c59d09e)) at on_startup.rs:69:13
    frame #22: 0x0000000300efa144 polars.abi3.so`pyo3::marker::Python::with_gil(f={closure_env#0} @ 0x000000016c59d180) at marker.rs:412:9
    frame #23: 0x00000003028fe604 polars.abi3.so`polars_python::on_startup::warning_function(msg=(data_ptr = "Calling `map_elements` without specifying `return_dtype` can lead to unpredictable results. Specify `return_dtype` to silence this warning.", length = 139), warning=MapWithoutReturnDtypeWarning) at on_startup.rs:59:5
    frame #24: 0x00000003007c26bc polars.abi3.so`polars_python::series::map::<impl polars_python::series::PySeries>::map_elements(self=0x0000000111a3bca0, function=0x000000016c59d678, return_dtype=Option<polars_python::conversion::Wrap<polars_core::datatypes::dtype::DataType>> @ 0x000000016c59d9b0, skip_nulls=true) at map.rs:24:13

#4671 is the cause of the GIL switch here, which leads to the deadlock.
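
A rough pure-std analogue of the cycle shown in the two dumps above, with plain Mutexes standing in for the GIL and Rust's stderr lock (this sketch intentionally hangs when run):

use std::sync::Mutex;
use std::thread;
use std::time::Duration;

// Stand-ins for the GIL and Rust's global stderr lock.
static GIL: Mutex<()> = Mutex::new(());
static STDERR: Mutex<()> = Mutex::new(());

fn main() {
    // Like thread "polars-4": holds the GIL, takes the stderr lock via
    // eprintln!, then PyErrState::make_normalized releases the GIL
    // (allow_threads) and blocks re-acquiring it while still holding stderr.
    let t1 = thread::spawn(|| {
        let gil = GIL.lock().unwrap();
        let _stderr = STDERR.lock().unwrap();
        drop(gil); // the temporary GIL release inside the Display impl
        thread::sleep(Duration::from_millis(50));
        let _gil = GIL.lock().unwrap(); // blocks: the other thread holds it
    });

    // Like thread "polars-2": acquires the GIL, then blocks on the stderr
    // lock held by the thread above.
    let t2 = thread::spawn(|| {
        thread::sleep(Duration::from_millis(10));
        let _gil = GIL.lock().unwrap();
        let _stderr = STDERR.lock().unwrap(); // blocks -> deadlock
    });

    t1.join().unwrap();
    t2.join().unwrap();
}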

However, I now wonder whether, more generally, there is potential for GIL switches inside formatting traits under IO write locks to create deadlocks. These would be significantly rarer, but still possible, given that we call __str__ and __repr__, which can execute arbitrary Python code and theoretically lead to a GIL switch.

🤔

@ritchie46

Can you explain to me what happens? Can GIL switches happen involuntarily while holding a different lock (e.g. an IO write lock)? Then, I assume, another thread tries to acquire that lock and we're deadlocked?

@davidhewitt (Member) commented Dec 5, 2024

Essentially, yes. You have to assume that doing arbitrary Python operations can lead to GIL switches. I think that in practice, as long as you do very little Python work, the chance of a GIL switch while holding a lock can be realistically zero, because IIRC since 3.11 the interpreter only switches pure-Python code after sufficiently many turns of the eval loop (maybe only at function call boundaries?).

But there always exists the possibility of native code releasing the GIL to allow Python threads to make progress. This is essentially the cause of https://pyo3.rs/v0.23.3/faq.html#im-experiencing-deadlocks-using-pyo3-with-stdsynconcelock-stdsynclazylock-lazy_static-and-once_cell - in particular, Python::import releases the GIL while it does some of the IO work.
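
That FAQ pattern in miniature (a sketch with an arbitrary module name, not code from any particular project): a std::sync::OnceLock initializer that imports a module can release the GIL mid-initialization, so a second thread that already holds the GIL and hits the same cell deadlocks.

use std::sync::OnceLock;
use pyo3::prelude::*;
use pyo3::types::PyModule;

static UTILS: OnceLock<Py<PyModule>> = OnceLock::new();

// Problematic pattern from the FAQ: thread A enters get_or_init while
// holding the GIL, the import releases the GIL part-way through, thread B
// acquires the GIL and then blocks on the same OnceLock, and thread A can
// never re-acquire the GIL to finish initialization.
fn utils(py: Python<'_>) -> &'static Py<PyModule> {
    UTILS.get_or_init(|| {
        py.import("some_module") // arbitrary module name for illustration
            .expect("import failed")
            .unbind()
    })
}

The FAQ suggests GIL-aware alternatives such as pyo3::sync::GILOnceCell for this.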

In this case I think #4671 introduced a switch which wasn't previously present in your code path, and #4766 would probably resolve the deadlock.

In general the IO locks seem like a rare and sneaky source of deadlocks; I've never seen anyone hit this before. I think we could probably mitigate this in a few ways:

  • remove the Display / Debug impls to encourage up-front formatting before acquiring the lock (see the sketch after this list); this is a breaking change which hurts ergonomics, so I'm not super keen unless we really think we have to do it
  • writing to Python IO streams (Enable writing to python stdio streams #3920) is probably a neat improvement here
    • it avoids the need to take the Rust streams' global locks, which probably don't make sense when Python is also interacting with the output streams anyway
    • it's unlikely that people are using Rust for high-speed reads/writes to the process IO
    • and arguably it's nicer if Rust extensions respect any high-level override of sys.stdout and sys.stderr in Python code
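
On the user side, the up-front formatting mentioned in the first option above looks roughly like this (a sketch adapted from the warning_function snippet earlier in the thread, relying on the same surrounding context):

if let Err(e) = warn_fn.call1((msg, Wrap(warning).into_pyobject(py).unwrap())) {
    // Render the error to a plain String while only the GIL is held, so any
    // GIL release inside PyErr's Display impl happens before eprintln! takes
    // the global stderr lock.
    let rendered = e.to_string();
    eprintln!("{rendered}");
}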

@ritchie46

Ouch, that's more tricky than I assumed.

@bschoenmaeckers (Contributor, Author)

#4766 fixes the deadlock in polars 👍
