
Possible deadlock in impl Display for PyErr #4764

Closed
bschoenmaeckers opened this issue Dec 4, 2024 · 12 comments

@bschoenmaeckers (Contributor) commented Dec 4, 2024

Bug Description

When updating polars to pyo3 0.23 I encountered a deadlock when multiple threads try to print a PyErr. When I forced the PyErrState to be evaluated with the currently held py token, the deadlock went away.

Steps to Reproduce

When I remove the e.value(py); call, pyo3 will deadlock.

https://github.com/pola-rs/polars/blob/db2410f47fb6abd5770c9707a3e969711a9ae4a2/crates/polars-python/src/on_startup.rs#L58-L72

fn warning_function(msg: &str, warning: PolarsWarning) {
    Python::with_gil(|py| {
        let warn_fn = pl_utils(py)
            .bind(py)
            .getattr(intern!(py, "_polars_warn"))
            .unwrap();

        if let Err(e) = warn_fn.call1((msg, Wrap(warning).into_pyobject(py).unwrap())) {
            // Force load lazy state of the exception so we don't deadlock in Display impl of PyErr
            e.value(py);

            eprintln!("{e}")
        }
    });
}

The imported Python function _polars_warn looks like this:

import warnings

def _polars_warn(msg: str, category: type[Warning] = UserWarning) -> None:
    warnings.warn(
        msg,
        category=category,
        stacklevel=find_stacklevel(),
    )

Backtrace

No response

Your operating system and version

Windows 11

Your Python version (python --version)

Python 3.9.20

Your Rust version (rustc --version)

rustc 1.85.0-nightly (6b6a867ae 2024-11-27)

Your PyO3 version

0.23.2

How did you install python? Did you use a virtualenv?

python.org installer

Additional Info

ref pola-rs/polars#20111

@bschoenmaeckers (Contributor, Author)

Note that the warnings are converted to errors, and so will lead to a PyErr in warning_function.

@ritchie46

I can reproduce. Now I understand why it only happens during pytest: only there does it trigger the error branch.

@bschoenmaeckers (Contributor, Author)

> I can reproduce. Now I understand why it only happens during pytest. Only there it triggers the error branch.

Yes, I forgot to mention that. Sorry for the confusion.

@ritchie46

For full context: we can acquire the GIL. We start multithreaded in a py.allow_threads context, and every thread calls a Python lambda. This all works well as long as we don't trigger the error branch mentioned above.
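
Roughly this pattern (a simplified sketch with hypothetical names such as run_batch and py_callback, not the actual polars code):

use pyo3::prelude::*;

// Simplified sketch of the pattern described above: release the GIL with
// allow_threads, fan out to worker threads, and have each worker briefly
// re-acquire the GIL to call a Python callable. Names are illustrative.
fn run_batch(py: Python<'_>, py_callback: PyObject, items: Vec<String>) {
    py.allow_threads(|| {
        std::thread::scope(|s| {
            for item in items {
                let cb = &py_callback;
                s.spawn(move || {
                    Python::with_gil(|py| {
                        // If this call raises, the error branch runs while
                        // this thread still holds the GIL.
                        if let Err(e) = cb.call1(py, (item,)) {
                            eprintln!("{e}");
                        }
                    });
                });
            }
        });
    });
}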

@ngoldbaum (Contributor)

I wouldn't be surprised if bisecting pointed to #4671 as the culprit. It would be nice to get rid of the lazy error state entirely as alluded to there.

@davidhewitt (Member)

I would support that guess. @bschoenmaeckers can you give me more specific instructions to repro? I tried to do so just now by cloning your branch of the polars repo, removing that e.value(py) evaluation, and running make test, with no luck.

@bschoenmaeckers (Contributor, Author)

Try maturin dev followed by pytest tests/unit/operations/map/test_map_elements.py -n 4 --verbose in the py-polars directory. This should produce the deadlock when e.value(py); is removed. I got a more consistent result when I increased the number of concurrent jobs to 8 or more.

@davidhewitt (Member)

Thanks, I have now reproduced this and captured a thread dump.

One thread is holding the GIL and waiting to lock stderr:

  thread #4, name = 'polars-2'
    frame #0: 0x0000000194fa4a9c libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x0000000194fe03f8 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
    frame #2: 0x0000000194fdddbc libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 220
    frame #3: 0x000000030a4bff30 polars.abi3.so`std::io::stdio::Stderr::lock [inlined] std::sys::sync::mutex::pthread::Mutex::lock at pthread.rs:113:26 [opt]
    frame #4: 0x000000030a4bff24 polars.abi3.so`std::io::stdio::Stderr::lock [inlined] std::sync::reentrant_lock::ReentrantLock<T>::lock at reentrant_lock.rs:292:28 [opt]
    frame #5: 0x000000030a4bfea0 polars.abi3.so`std::io::stdio::Stderr::lock at stdio.rs:959:29 [opt]
    frame #6: 0x000000030a4c00d0 polars.abi3.so`<&std::io::stdio::Stderr as std::io::Write>::write_fmt at stdio.rs:1024:9 [opt]
    frame #7: 0x000000030a4c076c polars.abi3.so`std::io::stdio::_eprint [inlined] <std::io::stdio::Stderr as std::io::Write>::write_fmt at stdio.rs:998:9 [opt]
    frame #8: 0x000000030a4c075c polars.abi3.so`std::io::stdio::_eprint [inlined] std::io::stdio::print_to at stdio.rs:1122:21 [opt]
    frame #9: 0x000000030a4c0744 polars.abi3.so`std::io::stdio::_eprint at stdio.rs:1243:5 [opt]
    frame #10: 0x00000003028fe824 polars.abi3.so`polars_python::on_startup::warning_function::{{closure}}((null)=(__0 = core::marker::PhantomData<(&pyo3::gil::GILGuard, pyo3::impl_::not_send::NotSend)> @ 0x000000016c18509e)) at on_startup.rs:69:13
    frame #11: 0x0000000300efa144 polars.abi3.so`pyo3::marker::Python::with_gil(f={closure_env#0} @ 0x000000016c185180) at marker.rs:412:9
    frame #12: 0x00000003028fe604 polars.abi3.so`polars_python::on_startup::warning_function(msg=(data_ptr = "Calling `map_elements` without specifying `return_dtype` can lead to unpredictable results. Specify `return_dtype` to silence this warning.\xff\xff\xff\xff\xff", length = 139), warning=MapWithoutReturnDtypeWarning) at on_startup.rs:59:5
    frame #13: 0x00000003007c26bc polars.abi3.so`polars_python::series::map::<impl polars_python::series::PySeries>::map_elements(self=0x0000000111a3b880, function=0x000000016c185678, return_dtype=Option<polars_python::conversion::Wrap<polars_core::datatypes::dtype::DataType>> @ 0x000000016c1859b0, skip_nulls=true) at map.rs:24:13
    frame #14: 0x00000003007c2e50 polars.abi3.so`polars_python::series::map::<impl polars_python::series::PySeries>::__pymethod_map_elements__((null)=(__0 = core::marker::PhantomData<(&pyo3::gil::GILGuard, pyo3::impl_::not_send::NotSend)> @ 0x000000016c185d3f), _slf=0x0000000111a3b870, _args=0x0000000111a3bc70, _kwargs=0x0000000111a70800) at map.rs:12:1

Another thread is holding the stderr lock and waiting to acquire the GIL:

  thread #6, name = 'polars-4'
    frame #0: 0x0000000194fa55cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000194fe3894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
    frame #2: 0x00000001049df33c Python`take_gil + 480
    frame #3: 0x00000001049dfa58 Python`PyEval_RestoreThread + 24
    frame #4: 0x000000030a36ef3c polars.abi3.so`<pyo3::gil::SuspendGIL as core::ops::drop::Drop>::drop(self=0x000000016c59c818) at gil.rs:313:13
    frame #5: 0x000000030a36c5e4 polars.abi3.so`core::ptr::drop_in_place<pyo3::gil::SuspendGIL>((null)=0x000000016c59c818) at mod.rs:521:1
    frame #6: 0x000000030a365a4c polars.abi3.so`pyo3::marker::Python::allow_threads((null)=(__0 = core::marker::PhantomData<(&pyo3::gil::GILGuard, pyo3::impl_::not_send::NotSend)> @ 0x000000016c59c82f), f={closure_env#0} @ 0x000000016c59c830) at marker.rs:522:5
    frame #7: 0x000000030a809ea0 polars.abi3.so`pyo3::err::err_state::PyErrState::make_normalized(self=0x000000016c59d000, (null)=(__0 = core::marker::PhantomData<(&pyo3::gil::GILGuard, pyo3::impl_::not_send::NotSend)> @ 0x000000016c59c936)) at err_state.rs:95:9
    frame #8: 0x000000030a37914c polars.abi3.so`pyo3::err::err_state::PyErrState::as_normalized(self=0x000000016c59d000, (null)=(__0 = core::marker::PhantomData<(&pyo3::gil::GILGuard, pyo3::impl_::not_send::NotSend)> @ 0x000000016c59c9ce)) at err_state.rs:76:9
    frame #9: 0x000000030a364a40 polars.abi3.so`pyo3::err::PyErr::normalized(self=0x000000016c59d000, (null)=(__0 = core::marker::PhantomData<(&pyo3::gil::GILGuard, pyo3::impl_::not_send::NotSend)> @ 0x000000016c59c9fe)) at mod.rs:763:9
    frame #10: 0x000000030a363c60 polars.abi3.so`pyo3::err::PyErr::value(self=0x000000016c59d000, (null)=(__0 = core::marker::PhantomData<(&pyo3::gil::GILGuard, pyo3::impl_::not_send::NotSend)> @ 0x000000016c59ca1e)) at mod.rs:299:9
    frame #11: 0x000000030a364c3c polars.abi3.so`<pyo3::err::PyErr as core::fmt::Display>::fmt::{{closure}}((null)=(__0 = core::marker::PhantomData<(&pyo3::gil::GILGuard, pyo3::impl_::not_send::NotSend)> @ 0x000000016c59cc5b)) at mod.rs:782:25
    frame #12: 0x000000030a36588c polars.abi3.so`pyo3::marker::Python::with_gil(f={closure_env#0} @ 0x000000016c59cd20) at marker.rs:412:9
    frame #13: 0x000000030a364c0c polars.abi3.so`<pyo3::err::PyErr as core::fmt::Display>::fmt(self=0x000000016c59d000, f=0x000000016c59cd70) at mod.rs:781:9
    frame #14: 0x000000030a4e6f24 polars.abi3.so`core::fmt::write [inlined] core::fmt::rt::Argument::fmt at rt.rs:177:76 [opt]
    frame #15: 0x000000030a4e6f18 polars.abi3.so`core::fmt::write at mod.rs:1189:21 [opt]
    frame #16: 0x000000030a4c00f4 polars.abi3.so`<&std::io::stdio::Stderr as std::io::Write>::write_fmt [inlined] std::io::Write::write_fmt at mod.rs:1887:15 [opt]
    frame #17: 0x000000030a4c00dc polars.abi3.so`<&std::io::stdio::Stderr as std::io::Write>::write_fmt at stdio.rs:1024:9 [opt]
    frame #18: 0x000000030a4c076c polars.abi3.so`std::io::stdio::_eprint [inlined] <std::io::stdio::Stderr as std::io::Write>::write_fmt at stdio.rs:998:9 [opt]
    frame #19: 0x000000030a4c075c polars.abi3.so`std::io::stdio::_eprint [inlined] std::io::stdio::print_to at stdio.rs:1122:21 [opt]
    frame #20: 0x000000030a4c0744 polars.abi3.so`std::io::stdio::_eprint at stdio.rs:1243:5 [opt]
    frame #21: 0x00000003028fe824 polars.abi3.so`polars_python::on_startup::warning_function::{{closure}}((null)=(__0 = core::marker::PhantomData<(&pyo3::gil::GILGuard, pyo3::impl_::not_send::NotSend)> @ 0x000000016c59d09e)) at on_startup.rs:69:13
    frame #22: 0x0000000300efa144 polars.abi3.so`pyo3::marker::Python::with_gil(f={closure_env#0} @ 0x000000016c59d180) at marker.rs:412:9
    frame #23: 0x00000003028fe604 polars.abi3.so`polars_python::on_startup::warning_function(msg=(data_ptr = "Calling `map_elements` without specifying `return_dtype` can lead to unpredictable results. Specify `return_dtype` to silence this warning.", length = 139), warning=MapWithoutReturnDtypeWarning) at on_startup.rs:59:5
    frame #24: 0x00000003007c26bc polars.abi3.so`polars_python::series::map::<impl polars_python::series::PySeries>::map_elements(self=0x0000000111a3bca0, function=0x000000016c59d678, return_dtype=Option<polars_python::conversion::Wrap<polars_core::datatypes::dtype::DataType>> @ 0x000000016c59d9b0, skip_nulls=true) at map.rs:24:13

#4671 is the cause of the GIL switch here, which leads to the deadlock.
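
A rough pure-std analogue of the cycle shown in the two dumps above, with plain Mutexes standing in for the GIL and Rust's stderr lock (this sketch intentionally hangs when run):

use std::sync::Mutex;
use std::thread;
use std::time::Duration;

// Stand-ins for the GIL and Rust's global stderr lock.
static GIL: Mutex<()> = Mutex::new(());
static STDERR: Mutex<()> = Mutex::new(());

fn main() {
    // Like thread "polars-4": holds the GIL, takes the stderr lock via
    // eprintln!, then PyErrState::make_normalized releases the GIL
    // (allow_threads) and blocks re-acquiring it while still holding stderr.
    let t1 = thread::spawn(|| {
        let gil = GIL.lock().unwrap();
        let _stderr = STDERR.lock().unwrap();
        drop(gil); // the temporary GIL release inside the Display impl
        thread::sleep(Duration::from_millis(50));
        let _gil = GIL.lock().unwrap(); // blocks: the other thread holds it
    });

    // Like thread "polars-2": acquires the GIL, then blocks on the stderr
    // lock held by the thread above.
    let t2 = thread::spawn(|| {
        thread::sleep(Duration::from_millis(10));
        let _gil = GIL.lock().unwrap();
        let _stderr = STDERR.lock().unwrap(); // blocks -> deadlock
    });

    t1.join().unwrap();
    t2.join().unwrap();
}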

However, I now wonder whether, more generally, there is potential for GIL switches inside formatting traits under IO write locks to create deadlocks. These would be significantly rarer, but still possible, given that we call __str__ and __repr__, which can execute arbitrary Python code and theoretically lead to a GIL switch.

🤔

@ritchie46

Can you explain to me what happens? Can GIL switches happen involuntarily while holding a different lock (e.g. an IO write lock)? Then, I assume, another thread tries to acquire that lock and we're deadlocked?

@davidhewitt (Member) commented Dec 5, 2024

Essentially, yes. You have to assume that doing arbitrary Python operations can lead to GIL switches. I think that in practice, as long as you do very little Python work, the chance of a GIL switch while holding a lock can be realistically zero, because IIRC since 3.11 the interpreter only switches pure-Python code after sufficiently many turns of the eval loop (maybe only at function call boundaries?).

But there always exists the possibility of native code releasing the GIL to allow Python threads to make progress. This is essentially the cause of https://pyo3.rs/v0.23.3/faq.html#im-experiencing-deadlocks-using-pyo3-with-stdsynconcelock-stdsynclazylock-lazy_static-and-once_cell - in particular, Python::import releases the GIL while it does some of the IO work.
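
That FAQ pattern in miniature (a sketch with an arbitrary module name, not code from any particular project): a std::sync::OnceLock initializer that imports a module can release the GIL mid-initialization, so a second thread that already holds the GIL and hits the same cell deadlocks.

use std::sync::OnceLock;
use pyo3::prelude::*;
use pyo3::types::PyModule;

static UTILS: OnceLock<Py<PyModule>> = OnceLock::new();

// Problematic pattern from the FAQ: thread A enters get_or_init while
// holding the GIL, the import releases the GIL part-way through, thread B
// acquires the GIL and then blocks on the same OnceLock, and thread A can
// never re-acquire the GIL to finish initialization.
fn utils(py: Python<'_>) -> &'static Py<PyModule> {
    UTILS.get_or_init(|| {
        py.import("some_module") // arbitrary module name for illustration
            .expect("import failed")
            .unbind()
    })
}

The FAQ suggests GIL-aware alternatives such as pyo3::sync::GILOnceCell for this.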

In this case I think #4671 introduced a switch which wasn't previously present in your code path, and #4766 would probably resolve the deadlock.

In general the IO locks seem like a rare and sneaky source of deadlocks; I've never seen anyone hit this before. I think we could probably mitigate this in a few ways:

  • remove the Display / Debug impls to encourage up-front formatting before acquiring the lock (see the sketch after this list); this is a breaking change which hurts ergonomics, so I'm not super keen unless we really think we have to do it
  • writing to Python IO streams (Enable writing to python stdio streams #3920) is probably a neat improvement here
    • it avoids the need to take the Rust streams' global locks, which probably don't make sense when Python is also interacting with the output streams anyway
    • it's unlikely that people are using Rust for high-speed reads/writes to the process IO
    • and arguably it's nicer if Rust extensions respect any high-level override of sys.stdout and sys.stderr in Python code
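
On the user side, the up-front formatting mentioned in the first option above looks roughly like this (a sketch adapted from the warning_function snippet earlier in the thread, relying on the same surrounding context):

if let Err(e) = warn_fn.call1((msg, Wrap(warning).into_pyobject(py).unwrap())) {
    // Render the error to a plain String while only the GIL is held, so any
    // GIL release inside PyErr's Display impl happens before eprintln! takes
    // the global stderr lock.
    let rendered = e.to_string();
    eprintln!("{rendered}");
}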

@ritchie46

Ouch, that's more tricky than I assumed.

@bschoenmaeckers (Contributor, Author)

#4766 fixes the deadlock in polars 👍
