Fix weird behavior when panicking inside synchronization primitives #58042

jethrogb · 2019-02-01T08:07:26Z

Currently, when the stack is unwound while an MPSC channel receiver is blocking on a receive, this will result in the execution of unreachable!() or a deadlock. Various platforms may trigger unwinding while a thread is blocked, due to an error condition.

This PR changes the MPSC drop logic so that unwinding works properly. It also adds tests for all MPSC variants. It can be tricky to trigger unwinding while a thread is blocked. Here I've used pthread_cancel on Linux.

Fixes fortanix/rust-sgx#86

MPSC oneshot
MPSC sync
MPSC shared
MPSC stream
Condvar (Don't poison the Mutex when panicking inside Condvar::wait #58461)
thread::park_timeout (Don't poison the Mutex when panicking inside Condvar::wait #58461 (comment))

Fixes fortanix/rust-sgx#86

rust-highfive · 2019-02-01T08:07:37Z

r? @alexcrichton

(rust_highfive has picked a reviewer for you, use r? to override)

jethrogb · 2019-02-01T08:07:45Z

I haven't yet fixed the stream and shared implementations, I wanted to get some feedback on this PR first.

rust-highfive · 2019-02-01T09:20:42Z

The job x86_64-gnu-llvm-6.0 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.

travis_time:end:01643448:start=1549008525142296323,finish=1549008597642748465,duration=72500452142
$ git checkout -qf FETCH_HEAD
travis_fold:end:git.checkout

Encrypted environment variables have been removed for security reasons.
See https://docs.travis-ci.com/user/pull-requests/#pull-requests-and-security-restrictions
$ export SCCACHE_BUCKET=rust-lang-ci-sccache2
$ export SCCACHE_REGION=us-west-1
Setting environment variables from .travis.yml
$ export IMAGE=x86_64-gnu-llvm-6.0
---
[01:10:27] FF..............................................
[01:10:27] thread 'main' panicked at 'Some tests failed', src/tools/compiletest/src/main.rs:502:22
[01:10:27] failures:
[01:10:27] 
[01:10:27] ---- [run-fail] run-fail/mpsc-recv-unwind/stream.rs stdout ----
[01:10:27] 
[01:10:27] error: Error: expected failure status (Signal(6)) but received status ExitStatus(ExitStatus(256)).
[01:10:27] status: exit code: 1
[01:10:27] command: "/checkout/obj/build/x86_64-unknown-linux-gnu/test/run-fail/mpsc-recv-unwind/stream/a"
[01:10:27] ------------------------------------------
[01:10:27] Deadlock detected
[01:10:27] 
[01:10:27] ------------------------------------------
[01:10:27] ------------------------------------------
[01:10:27] stderr:
[01:10:27] ------------------------------------------
[01:10:27] 
[01:10:27] ------------------------------------------
[01:10:27] 
[01:10:27] thread '[run-fail] run-fail/mpsc-recv-unwind/stream.rs' panicked at 'explicit panic', src/tools/compiletest/src/runtest.rs:3311:9
[01:10:27] 
[01:10:27] ---- [run-fail] run-fail/mpsc-recv-unwind/shared.rs stdout ----
[01:10:27] 
[01:10:27] error: Error: expected failure status (Signal(6)) but received status ExitStatus(ExitStatus(256)).
[01:10:27] status: exit code: 1
[01:10:27] command: "/checkout/obj/build/x86_64-unknown-linux-gnu/test/run-fail/mpsc-recv-unwind/shared/a"
[01:10:27] ------------------------------------------
[01:10:27] Deadlock detected
[01:10:27] 
[01:10:27] ------------------------------------------
[01:10:27] ------------------------------------------
[01:10:27] stderr:
[01:10:27] ------------------------------------------
[01:10:27] 
[01:10:27] ------------------------------------------
[01:10:27] 
[01:10:27] thread '[run-fail] run-fail/mpsc-recv-unwind/shared.rs' panicked at 'explicit panic', src/tools/compiletest/src/runtest.rs:3311:9
[01:10:27] 
[01:10:27] failures:
[01:10:27] failures:
[01:10:27]     [run-fail] run-fail/mpsc-recv-unwind/shared.rs
[01:10:27]     [run-fail] run-fail/mpsc-recv-unwind/stream.rs
[01:10:27] test result: FAILED. 145 passed; 2 failed; 1 ignored; 0 measured; 0 filtered out
[01:10:27] 
[01:10:27] 
[01:10:27] 
[01:10:27] 
[01:10:27] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/compiletest" "--compile-lib-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib" "--run-lib-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/lib" "--rustc-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/bin/rustc" "--src-base" "/checkout/src/test/run-fail" "--build-base" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/run-fail" "--stage-id" "stage2-x86_64-unknown-linux-gnu" "--mode" "run-fail" "--target" "x86_64-unknown-linux-gnu" "--host" "x86_64-unknown-linux-gnu" "--llvm-filecheck" "/usr/lib/llvm-6.0/bin/FileCheck" "--host-rustcflags" "-Crpath -O -Zunstable-options  -Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "--target-rustcflags" "-Crpath -O -Zunstable-options  -Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "--docck-python" "/usr/bin/python2.7" "--lldb-python" "/usr/bin/python2.7" "--gdb" "/usr/bin/gdb" "--quiet" "--llvm-version" "6.0.0\n" "--system-llvm" "--cc" "" "--cxx" "" "--cflags" "" "--llvm-components" "" "--llvm-cxxflags" "" "--adb-path" "adb" "--adb-test-dir" "/data/tmp/work" "--android-cross-path" "" "--color" "always"
[01:10:27] 
[01:10:27] 
[01:10:27] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test
[01:10:27] Build completed unsuccessfully in 0:11:04
[01:10:27] Build completed unsuccessfully in 0:11:04
[01:10:27] make: *** [check] Error 1
[01:10:27] Makefile:48: recipe for target 'check' failed
The command "stamp sh -x -c "$RUN_SCRIPT"" exited with 2.
travis_time:start:0103c0bb
$ date && (curl -fs --head https://google.com | grep ^Date: | sed 's/Date: //g' || true)
Fri Feb  1 09:20:35 UTC 2019
---
travis_time:end:1ba81ae2:start=1549012837125012095,finish=1549012837130425587,duration=5413492
travis_fold:end:after_failure.3
travis_fold:start:after_failure.4
travis_time:start:1b5084f8
$ ln -s . checkout && for CORE in obj/cores/core.*; do EXE=$(echo $CORE | sed 's|oBRT, Aborted.
#0  0x00007fce80c35428 in ?? ()
[Current thread is 1 (LWP 24071)]
#0  0x00007fce80c35428 in ?? ()
#1  0x00007fce80c3702a in ?? ()
#2  0x0000000000000020 in ?? ()
#3  0x0000000000000000 in ?? ()

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

alexcrichton · 2019-02-01T16:38:18Z

Sorry I don't really understand what's motivating this PR, this seems like something that surely would have been caught in the half a decade these types have been in use. Is this something like the platform is causing a panic at a location that wasn't expected? If so, where?

jethrogb · 2019-02-04T11:34:42Z

Sorry I don't really understand what's motivating this PR

The 4 test cases that I've added are motivating this. Run them today and you'll run into unreachable! or deadlocks. Ideally, the tests would be run-pass tests, but I couldn't figure out how to write them that way, so instead I'm checking for a specific failure condition that indicates everything as expected.

this seems like something that surely would have been caught in the half a decade these types have been in use.

I guess not. Like I said, it can be tricky to trigger unwinding while a thread is blocked.

Is this something like the platform is causing a panic at a location that wasn't expected? If so, where?

Yes, inside a call to mpsc::Receiver::recv

bors · 2019-02-09T04:03:19Z

☔ The latest upstream changes (presumably #58316) made this pull request unmergeable. Please resolve the merge conflicts.

alexcrichton · 2019-02-11T12:33:33Z

Thanks for the explanations, but I don't think this is something that we'll want to support in the standard library. Thread cancellation seems like it's an additional features to the channels in the standard library, and I think it'd be fine to document that they're not compatible but otherwise feature development of channels is encouraged to happen externally in crates like crossbeam-channel for this use case.

jethrogb · 2019-02-12T05:20:58Z

I don't think you've understood the heart of the issue yet. The problem is not specific to thread cancellation. The issue occurs when when the stack is unwound while an MPSC channel receiver is blocking on a receiver. Thread cancellation is just an easy way to showcase this on Linux, but there are other ways to trigger unwinding at that point.

alexcrichton · 2019-02-12T13:03:03Z

Can you show an example test case that deadlocks without using thread cancellation?

jethrogb · 2019-02-13T06:14:53Z

According to the Linux man page for pthread_cond_timedwait, it may return with EINTR. I haven't been able to figure out how to trigger that, but here I've simulated that by just implementing pthread_cond_timedwait. This program never exits.

#![feature(rustc_private)] // for libc, you can also use crates.io libc instead.

extern crate libc;

#[no_mangle]
pub unsafe extern "C" fn pthread_cond_timedwait(
    _cond: *mut libc::pthread_cond_t,
    _mutex: *mut libc::pthread_mutex_t,
    _abstime: *const libc::timespec
) -> libc::c_int {
    *libc::__errno_location() = libc::EINTR;
    return 1;
}

fn main() {
    let (s, r) = std::sync::mpsc::channel();
    s.send(()).unwrap();
    r.recv().unwrap();
    s.send(()).unwrap();
    r.recv().unwrap();
    r.recv_timeout(std::time::Duration::from_millis(1)).unwrap()
}

jethrogb · 2019-02-13T09:59:51Z

If unwinding were supported in Wasm, I'm pretty this would hang right now after printing the panic message:

#[cfg(test)]
extern crate wasm_bindgen_test;
#[cfg(test)]
use wasm_bindgen_test::*;

#[cfg(test)]
#[wasm_bindgen_test]
pub fn recv() {
    let (tx, rx) = std::sync::mpsc::channel::<()>();
    let _ = tx.clone();
    let _ = rx.recv();
}

alexcrichton · 2019-02-14T17:58:43Z

EINTR should be handled by the condvar implementation and it's a bug if it isn't, and wasm is a special case that I don't think really counts for this because it doesn't have threads anyway

jethrogb · 2019-02-20T01:47:10Z

Note to triage: I will return to this PR March 4.

Dylan-DPC-zz · 2019-03-04T20:30:05Z

ping from triage @jethrogb it is march 4 today ;) you have a conflict to resolve

jethrogb · 2019-03-04T21:29:13Z

@Dylan-DPC lol the conflict is the least of my problems on this PR :)

Dylan-DPC-zz · 2019-03-04T21:37:13Z

@jethrogb i know :P

jethrogb · 2019-03-04T22:43:17Z

In private chat message, Alex suggested I ping @Amanieu @aturon @BurntSushi @dtolnay @Kimundi @KodrAus @sfackler @SimonSapin @withoutboats

This discussion is also relevant for #58461 as these issues address similar problems.

I think std should always behave reasonably in the face of uncommon errors triggered by the platform. In my opinion, "reasonable behavior" does not include deadlocks, actually calling unreachable!, poisoning locks incorrectly, or double panics. The code in sync/ and sys/ is littered with unwrap and assert!, many of those could trigger the "unreasonable" behavior I described in these PRs.

So, these PRs consists of two parts:

A fix for the dubious behavior caused by uncommon platform errors.
A test for these fixes. The proposed test cases are pretty weird, but that's just to be able to test these kinds of failures consistently. Just because we can't trigger a particular failure easily without going through hoops doesn't mean these things can't happen. A future change to the Linux kernel could just start returning an error code from a syscall that we weren't expecting before. People might be using Rust with LD_PRELOAD wrappers for some kind of instrumentation. New platforms added to std might error in previously unexpected cases. Et cetera.

NB. I'm sure these problems will come up again when replacing some of the primitives in std with parking_lot and crossbeam-channel.

alexcrichton · 2019-03-04T22:51:01Z

To be clear I recommended the possibility looping in other folks, not necessarily pinging every historical and current libs team member by name.

jethrogb · 2019-03-04T22:53:24Z

I'm sorry if I pinged people who are not actually on the team right now, I got the list from https://www.rust-lang.org/governance/teams/library.

Amanieu · 2019-03-04T23:18:57Z

If the underlying platform is violating its own ABI, either due to updates or because the user is deliberately messing with us (LD_PRELOAD), then I don't think that we can reasonably be expected to function correctly.

A panic message followed by either a crash or a deadlock should be a pretty clear message to the user that says "YOUR PLATFORM IS BROKEN".

Amanieu · 2019-03-04T23:20:13Z

According to the Linux man page for pthread_cond_timedwait, it may return with EINTR.

My manpage (and the POSIX spec) specifically say that pthread_cond_timedwait will never return EINTR.

jethrogb · 2019-03-04T23:25:19Z

Note that I've provided 3 different test cases only one of which has to do with the "platform is violating its own ABI". I really think you should all focus less on the test cases and more on the general issue.

The Ubuntu 16.04 man page definitely says EINTR may happen.

jethrogb · 2019-03-04T23:41:41Z

Another way a panic might occur in “unexpected” places is if the process is using Linux seccomp.

I must say that for a language that has a big focus on correct error handling, I'm somewhat surprised by the pushback on doing exactly that.

alexcrichton · 2019-03-14T15:50:38Z

I personally agree with @Amanieu that if fundamental APIs are changed we can't really do anything about that and crashing as soon as we can is the best bet. For example if pthread_create actually spawned two threads, I wouldn't expect us to gracefully handle that and shut down one of them while the other continues to run user code.

If actual real-world APIs disagree with the spec I think it's fine to handle that, and as I've said a number of times on #58461 as well it's fine to handle EINTR for condvars, we'd just loop again or return that we timed out. If this comes up with seccomp as well that's a new error to handle and we can add that in.

As a meta point the discussion is being strewn across this PR and #58461 and keeping track of them isn't the easiest to do. Can one of these PRs be closed and merged with the other or something like that?

jethrogb · 2019-03-14T16:34:26Z

I've closed #58461

I'm only interested in solving the general problem of making the synchronization primitives robust against panicking. I'm not interested in auditing all code in sync/ and sys/ for possible error conditions that were missed. Maybe once Rust gets a way to enforce "no panics" as a function attribute that can be done. If you don't think there's a general problem to be solved, then I can't make any progress on this PR.

Dylan-DPC-zz · 2019-03-25T13:23:35Z

ping from triage @jethrogb any updates?

jethrogb · 2019-03-25T16:15:34Z

Should be S-waiting-for-review

Centril · 2019-03-30T12:41:45Z

Ping from triage @alexcrichton

alexcrichton · 2019-04-01T13:15:19Z

I'm going to close this PR because those on the libs team who have voiced their thoughts on this PR have indicated that they would not take this strategy. Otherwise it looks like this PR isn't making much progress.

We're always interested in fixes to primitives if necessary, but I don't think we're going to place too many safeguards in place to have the standard library work as correctly as possible in the face of a system that is misbehaving. Rather we'll make sure an error message gets out and nothing segfaults at minimum.

jethrogb · 2019-04-01T15:38:16Z

Deadlocks are not in the set of acceptable behaviors you just described ("an error message gets out and nothing segfaults").

Fix reaching unreachable code or deadlocks when unwinding in MPSC

95e8613

Fixes fortanix/rust-sgx#86

rust-highfive assigned alexcrichton Feb 1, 2019

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 1, 2019

jethrogb mentioned this pull request Feb 14, 2019

Don't poison the Mutex when panicking inside Condvar::wait #58461

Closed

Dylan-DPC-zz added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 4, 2019

jethrogb changed the title ~~Fix reaching unreachable code or deadlocks when unwinding in MPSC~~ Fix weird behavior when panicking inside synchronization primitives Mar 14, 2019

jethrogb mentioned this pull request Mar 19, 2019

Add SGX thread parker Amanieu/parking_lot#123

Merged

Centril added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Mar 30, 2019

alexcrichton closed this Apr 1, 2019

jethrogb mentioned this pull request Apr 1, 2019

Synchronization primitives not robust against unwinding #59614

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix weird behavior when panicking inside synchronization primitives #58042

Fix weird behavior when panicking inside synchronization primitives #58042

jethrogb commented Feb 1, 2019 •

edited

Loading

rust-highfive commented Feb 1, 2019

jethrogb commented Feb 1, 2019

rust-highfive commented Feb 1, 2019

alexcrichton commented Feb 1, 2019

jethrogb commented Feb 4, 2019

bors commented Feb 9, 2019

alexcrichton commented Feb 11, 2019

jethrogb commented Feb 12, 2019

alexcrichton commented Feb 12, 2019

jethrogb commented Feb 13, 2019

jethrogb commented Feb 13, 2019

alexcrichton commented Feb 14, 2019

jethrogb commented Feb 20, 2019

Dylan-DPC-zz commented Mar 4, 2019

jethrogb commented Mar 4, 2019

Dylan-DPC-zz commented Mar 4, 2019

jethrogb commented Mar 4, 2019

alexcrichton commented Mar 4, 2019

jethrogb commented Mar 4, 2019

Amanieu commented Mar 4, 2019

Amanieu commented Mar 4, 2019

jethrogb commented Mar 4, 2019

jethrogb commented Mar 4, 2019

alexcrichton commented Mar 14, 2019

jethrogb commented Mar 14, 2019 •

edited

Loading

Dylan-DPC-zz commented Mar 25, 2019

jethrogb commented Mar 25, 2019

Centril commented Mar 30, 2019

alexcrichton commented Apr 1, 2019

jethrogb commented Apr 1, 2019

Fix weird behavior when panicking inside synchronization primitives #58042

Fix weird behavior when panicking inside synchronization primitives #58042

Conversation

jethrogb commented Feb 1, 2019 • edited Loading

rust-highfive commented Feb 1, 2019

jethrogb commented Feb 1, 2019

rust-highfive commented Feb 1, 2019

alexcrichton commented Feb 1, 2019

jethrogb commented Feb 4, 2019

bors commented Feb 9, 2019

alexcrichton commented Feb 11, 2019

jethrogb commented Feb 12, 2019

alexcrichton commented Feb 12, 2019

jethrogb commented Feb 13, 2019

jethrogb commented Feb 13, 2019

alexcrichton commented Feb 14, 2019

jethrogb commented Feb 20, 2019

Dylan-DPC-zz commented Mar 4, 2019

jethrogb commented Mar 4, 2019

Dylan-DPC-zz commented Mar 4, 2019

jethrogb commented Mar 4, 2019

alexcrichton commented Mar 4, 2019

jethrogb commented Mar 4, 2019

Amanieu commented Mar 4, 2019

Amanieu commented Mar 4, 2019

jethrogb commented Mar 4, 2019

jethrogb commented Mar 4, 2019

alexcrichton commented Mar 14, 2019

jethrogb commented Mar 14, 2019 • edited Loading

Dylan-DPC-zz commented Mar 25, 2019

jethrogb commented Mar 25, 2019

Centril commented Mar 30, 2019

alexcrichton commented Apr 1, 2019

jethrogb commented Apr 1, 2019

jethrogb commented Feb 1, 2019 •

edited

Loading

jethrogb commented Mar 14, 2019 •

edited

Loading