Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cancel external triggers if one of multiple fails on start #2248

Merged

Conversation

itowlson
Copy link
Contributor

Fixes #2245.

There may still be a nasty if one of the child triggers fails to launch, or if a trigger unexpectedly exits without reporting an error. But this does fix, albeit uglily, the "typical" case of a trigger failing (and specifically a non-multi-aware trigger crashing during startup).

I'm concerned about using timers here, because they're obviously quite fragile; but I can't think of another way to get at "all spin trigger-foo process have had the chance to start their child trigger-foo processes and hook them up to Ctrl+C." Open to ideas!

@itowlson itowlson force-pushed the terminate-child-triggers-on-trigger-exit branch from f1ac4a3 to 6f1bcd5 Compare January 26, 2024 01:58
@itowlson itowlson marked this pull request as ready for review January 28, 2024 23:07
Copy link
Collaborator

@rylev rylev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooof. It certainly isn't pretty. I think it's fine for now. I wonder if there's some sort of supervisor tree architecture we can adopt that makes some of this book keeping easier, but that's really hard when you have to jump the process boundary... Hmm....


#[cfg(windows)]
fn get_pids(_trigger_processes: &[tokio::process::Child]) -> Vec<usize> {
vec![]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused why this would be correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chuckle It's not "correct". This PR tracks the original Ctrl+C propagation code, which was cfg-ed out on Windows (because the signals stuff doesn't have an equivalent); so although I need to have a function here (and could indeed get OS-level process handles) I can't do anything meaningful with them, at least not without a lot more research than "preserve what the old code did".

I can look at implementing Ctrl+C propagation on Windows for sure. But I'd be inclined to look at a slightly different design for that, with processes starting suspended and being placed in a job object. That's more than I want to bite off in this PR though...!

Comment on lines +104 to +114
#[cfg(windows)]
fn set_kill_on_ctrl_c(_child: &tokio::process::Child) {}

#[cfg(not(windows))]
fn set_kill_on_ctrl_c(child: &tokio::process::Child) {
if let Some(pid) = child.id().map(|id| nix::unistd::Pid::from_raw(id as i32)) {
_ = ctrlc::set_handler(move || {
_ = nix::sys::signal::kill(pid, nix::sys::signal::SIGTERM);
});
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option 🤷

Suggested change
#[cfg(windows)]
fn set_kill_on_ctrl_c(_child: &tokio::process::Child) {}
#[cfg(not(windows))]
fn set_kill_on_ctrl_c(child: &tokio::process::Child) {
if let Some(pid) = child.id().map(|id| nix::unistd::Pid::from_raw(id as i32)) {
_ = ctrlc::set_handler(move || {
_ = nix::sys::signal::kill(pid, nix::sys::signal::SIGTERM);
});
}
}
fn set_kill_on_ctrl_c(child: &tokio::process::Child) {
if cfg!(windows) {
return;
}
if let Some(pid) = child.id().map(|id| nix::unistd::Pid::from_raw(id as i32)) {
_ = ctrlc::set_handler(move || {
_ = nix::sys::signal::kill(pid, nix::sys::signal::SIGTERM);
});
}
}

// this because it kills the Spin process but that doesn't cascade to the
// child plugin trigger process.) So add a hopefully insignificant delay
// between them to reduce the chance of this happening.
const MULTI_TRIGGER_START_OFFSET: tokio::time::Duration = tokio::time::Duration::from_millis(20);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tokio::time::Duration is an alias for std::time::Duration.

I'm...not really sure why they did that...

https://github.com/tokio-rs/tokio/blob/131e7b4e49c8849298ba54b4e0c99f4b81d869e3/tokio/src/time/mod.rs#L107-L109

// Re-export for convenience

🤔

Comment on lines +254 to +259
if is_multi {
// Allow time for the child `spin` process to launch the trigger
// and hook up its cancellation. Mitigates the race condition
// noted on the constant (see there for more info).
tokio::time::sleep(MULTI_TRIGGER_START_OFFSET).await;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem like this should be necessary both here and above?

Comment on lines 429 to 452
#[cfg(windows)]
fn get_pids(_trigger_processes: &[tokio::process::Child]) -> Vec<usize> {
vec![]
}

#[cfg(not(windows))]
fn get_pids(trigger_processes: &[tokio::process::Child]) -> Vec<nix::unistd::Pid> {
use itertools::Itertools;
// https://github.com/nix-rust/nix/issues/656
let pids = trigger_processes
trigger_processes
.iter()
.flat_map(|child| child.id().map(|id| nix::unistd::Pid::from_raw(id as i32)))
.collect_vec();
ctrlc::set_handler(move || {
for pid in &pids {
if let Err(err) = nix::sys::signal::kill(*pid, nix::sys::signal::SIGTERM) {
tracing::warn!("Failed to kill trigger handler process: {:?}", err)
}
.collect_vec()
}

#[cfg(windows)]
fn kill_them_all_god_will_know_his_own(_pids: &[usize]) {}

#[cfg(not(windows))]
fn kill_them_all_god_will_know_his_own(pids: &[nix::unistd::Pid]) {
// https://github.com/nix-rust/nix/issues/656
for pid in pids {
// println!("killing {pid}");
if let Err(err) = nix::sys::signal::kill(*pid, nix::sys::signal::SIGTERM) {
tracing::warn!("Failed to kill trigger handler process: {:?}", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be simplified a bit by passing the u32s from Child::id into kill_... and doing the cast to Pids in there.

@itowlson itowlson force-pushed the terminate-child-triggers-on-trigger-exit branch from 6f1bcd5 to 9ad3591 Compare January 29, 2024 19:44
@itowlson itowlson merged commit 04aa3bb into fermyon:main Jan 29, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

In multi-trigger, if one trigger crashes, others may keep running
3 participants