-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix weird behavior when panicking inside synchronization primitives #58042
Conversation
(rust_highfive has picked a reviewer for you, use r? to override) |
I haven't yet fixed the |
The job Click to expand the log.
I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact |
Sorry I don't really understand what's motivating this PR, this seems like something that surely would have been caught in the half a decade these types have been in use. Is this something like the platform is causing a panic at a location that wasn't expected? If so, where? |
The 4 test cases that I've added are motivating this. Run them today and you'll run into
I guess not. Like I said, it can be tricky to trigger unwinding while a thread is blocked.
Yes, inside a call to |
☔ The latest upstream changes (presumably #58316) made this pull request unmergeable. Please resolve the merge conflicts. |
Thanks for the explanations, but I don't think this is something that we'll want to support in the standard library. Thread cancellation seems like it's an additional features to the channels in the standard library, and I think it'd be fine to document that they're not compatible but otherwise feature development of channels is encouraged to happen externally in crates like crossbeam-channel for this use case. |
I don't think you've understood the heart of the issue yet. The problem is not specific to thread cancellation. The issue occurs when when the stack is unwound while an MPSC channel receiver is blocking on a receiver. Thread cancellation is just an easy way to showcase this on Linux, but there are other ways to trigger unwinding at that point. |
Can you show an example test case that deadlocks without using thread cancellation? |
According to the Linux man page for #![feature(rustc_private)] // for libc, you can also use crates.io libc instead.
extern crate libc;
#[no_mangle]
pub unsafe extern "C" fn pthread_cond_timedwait(
_cond: *mut libc::pthread_cond_t,
_mutex: *mut libc::pthread_mutex_t,
_abstime: *const libc::timespec
) -> libc::c_int {
*libc::__errno_location() = libc::EINTR;
return 1;
}
fn main() {
let (s, r) = std::sync::mpsc::channel();
s.send(()).unwrap();
r.recv().unwrap();
s.send(()).unwrap();
r.recv().unwrap();
r.recv_timeout(std::time::Duration::from_millis(1)).unwrap()
} |
If unwinding were supported in Wasm, I'm pretty this would hang right now after printing the panic message: #[cfg(test)]
extern crate wasm_bindgen_test;
#[cfg(test)]
use wasm_bindgen_test::*;
#[cfg(test)]
#[wasm_bindgen_test]
pub fn recv() {
let (tx, rx) = std::sync::mpsc::channel::<()>();
let _ = tx.clone();
let _ = rx.recv();
} |
EINTR should be handled by the condvar implementation and it's a bug if it isn't, and wasm is a special case that I don't think really counts for this because it doesn't have threads anyway |
Note to triage: I will return to this PR March 4. |
ping from triage @jethrogb it is march 4 today ;) you have a conflict to resolve |
@Dylan-DPC lol the conflict is the least of my problems on this PR :) |
@jethrogb i know :P |
In private chat message, Alex suggested I ping @Amanieu @aturon @BurntSushi @dtolnay @Kimundi @KodrAus @sfackler @SimonSapin @withoutboats This discussion is also relevant for #58461 as these issues address similar problems. I think So, these PRs consists of two parts:
NB. I'm sure these problems will come up again when replacing some of the primitives in std with parking_lot and crossbeam-channel. |
To be clear I recommended the possibility looping in other folks, not necessarily pinging every historical and current libs team member by name. |
I'm sorry if I pinged people who are not actually on the team right now, I got the list from https://www.rust-lang.org/governance/teams/library. |
If the underlying platform is violating its own ABI, either due to updates or because the user is deliberately messing with us (LD_PRELOAD), then I don't think that we can reasonably be expected to function correctly. A panic message followed by either a crash or a deadlock should be a pretty clear message to the user that says "YOUR PLATFORM IS BROKEN". |
My manpage (and the POSIX spec) specifically say that |
Note that I've provided 3 different test cases only one of which has to do with the "platform is violating its own ABI". I really think you should all focus less on the test cases and more on the general issue. The Ubuntu 16.04 man page definitely says EINTR may happen. |
Another way a panic might occur in “unexpected” places is if the process is using Linux seccomp. I must say that for a language that has a big focus on correct error handling, I'm somewhat surprised by the pushback on doing exactly that. |
I personally agree with @Amanieu that if fundamental APIs are changed we can't really do anything about that and crashing as soon as we can is the best bet. For example if If actual real-world APIs disagree with the spec I think it's fine to handle that, and as I've said a number of times on #58461 as well it's fine to handle EINTR for condvars, we'd just loop again or return that we timed out. If this comes up with seccomp as well that's a new error to handle and we can add that in. As a meta point the discussion is being strewn across this PR and #58461 and keeping track of them isn't the easiest to do. Can one of these PRs be closed and merged with the other or something like that? |
I've closed #58461 I'm only interested in solving the general problem of making the synchronization primitives robust against panicking. I'm not interested in auditing all code in sync/ and sys/ for possible error conditions that were missed. Maybe once Rust gets a way to enforce "no panics" as a function attribute that can be done. If you don't think there's a general problem to be solved, then I can't make any progress on this PR. |
ping from triage @jethrogb any updates? |
Should be |
Ping from triage @alexcrichton |
I'm going to close this PR because those on the libs team who have voiced their thoughts on this PR have indicated that they would not take this strategy. Otherwise it looks like this PR isn't making much progress. We're always interested in fixes to primitives if necessary, but I don't think we're going to place too many safeguards in place to have the standard library work as correctly as possible in the face of a system that is misbehaving. Rather we'll make sure an error message gets out and nothing segfaults at minimum. |
Deadlocks are not in the set of acceptable behaviors you just described ("an error message gets out and nothing segfaults"). |
Currently, when the stack is unwound while an MPSC channel receiver is blocking on a receive, this will result in the execution of
unreachable!()
or a deadlock. Various platforms may trigger unwinding while a thread is blocked, due to an error condition.This PR changes the MPSC drop logic so that unwinding works properly. It also adds tests for all MPSC variants. It can be tricky to trigger unwinding while a thread is blocked. Here I've used pthread_cancel on Linux.
Fixes fortanix/rust-sgx#86