Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shutdown gracefully on SIGPIPE, SIGHUP #2114

Closed
michaelsproul opened this issue Dec 22, 2020 · 3 comments
Closed

Shutdown gracefully on SIGPIPE, SIGHUP #2114

michaelsproul opened this issue Dec 22, 2020 · 3 comments
Labels
A1 good first issue Good for newcomers UX-and-logs v2.0.0 Altair on mainnet release (v2.0.0)

Comments

@michaelsproul
Copy link
Member

Description

When running Lighthouse in a shell pipeline it fails to exit gracefully because of the lack of signal handler for SIGPIPE. I noticed that my Lighthouse nodes wouldn't shutdown gracefully from my runner script, and it took a bit of digging to find the cause.

This command reproduces the issue (bash assumed):

$ lighthouse bn --network pyrmont |& tee -a pyrmont_bn.log

Or equivalently:

$ lighthouse bn --network pyrmont 2>&1 | tee -a pyrmont_bn.log

Both of these commands redirect lighthouse's stderr (2) to its stdout (&1), and then send stdout to tee for writing to the terminal and a file simultaneously (a common UNIX pattern). If you hit ^C while running either of these commands Lighthouse will exit immediately, without running the shutdown handlers, running drop on the BeaconChain, or printing these messages:

Dec 22 18:09:01.939 INFO Shutting down..
Dec 22 18:09:07.463 INFO Saved beacon chain to disk              service: beacon

Version

Lighthouse v1.0.4

Steps to resolve

Currently we use the ctrlc crate to register a signal handler for SIGINT and SIGTERM. Unfortunately it seems this crate doesn't allow us to register handlers for other signals.

ctrlc::set_handler(move || {
if let Some(ctrlc_send) = ctrlc_send_c.try_borrow_mut().unwrap().take() {
if let Err(e) = ctrlc_send.send(()) {
error!(
log,
"Error sending ctrl-c message";
"error" => e
);
}
}
})

Using another crate (or maybe tokio::signal), we should ensure that this handler runs for all of the signals we can reasonably handle. Based on the list here I think that should just be SIGINT, SIGTERM, plus:

  • SIGPIPE: The SIGPIPE signal is sent to a process when it attempts to write to a pipe without a process connected to the other end.
  • SIGHUP: The SIGHUP signal is sent to a process when its controlling terminal is closed.
@mstallmo
Copy link
Contributor

Would love to take a stab at resolving this if no one is working on it already!

@michaelsproul
Copy link
Member Author

@mstallmo That would be awesome! There's nobody working on it currently, it's all yours!

@paulhauner paulhauner added the v1.5.1 To be included in the v1.5.1 relase label Aug 2, 2021
@michaelsproul michaelsproul added v1.5.2 The release after v1.5.1 v2.0.0 Altair on mainnet release (v2.0.0) and removed v1.5.1 To be included in the v1.5.1 relase v1.5.2 The release after v1.5.1 labels Aug 26, 2021
bors bot pushed a commit that referenced this issue Aug 30, 2021
## Issue Addressed

Resolves #2114 

Swapped out the ctrlc crate for tokio signals to hook register handlers for SIGPIPE and SIGHUP along with SIGTERM and SIGINT.

## Proposed Changes

- Swap out the ctrlc crate for tokio signals for unix signal handing
- Register signals for SIGPIPE and SHIGUP that trigger the same shutdown procedure as SIGTERM and SIGINT

## Additional Info

I tested these changes against the examples in the original issue and noticed some interesting behavior on my machine. When running `lighthouse bn --network pyrmont |& tee -a pyrmont_bn.log` or `lighthouse bn --network pyrmont 2>&1 | tee -a pyrmont_bn.log` none of the above signals are sent to the lighthouse program in a way I was able to observe. 

The only time it seems that the signal gets sent to the lighthouse program is if there is no redirection of stderr to stdout. I'm not as familiar with the details of how unix signals work in linux with a redirect like that so I'm not sure if this is a bug in the program or expected behavior.

Signals are correctly received without the redirection and if the above signals are sent directly to the program with something like `kill`.
@michaelsproul
Copy link
Member Author

Finally got to the bottom of this here: #2486 (review)

For anyone wanting to see the shutdown output in tee, run with tee -i to stop tee from dying on interrupt

bors bot pushed a commit that referenced this issue Sep 3, 2021
## Proposed Changes

Remove the SIGPIPE handler added in #2486.

We saw some of the testnet nodes running under `systemd` being stopped due to `journald` restarts. The systemd docs state:

> If systemd-journald.service is stopped, the stream connections associated with all services are terminated. Further writes to those streams by the service will result in EPIPE errors. In order to react gracefully in this case it is recommended that programs logging to standard output/error ignore such errors. If the SIGPIPE UNIX signal handler is not blocked or turned off, such write attempts will also result in such process signals being generated, see signal(7).

From https://www.freedesktop.org/software/systemd/man/systemd-journald.service.html

## Additional Info

It turned out that the issue described in #2114 was due to `tee`'s behaviour rather than Lighthouse's, so the SIGPIPE handler isn't required for any current use case. An alternative to disabling it all together would be to exit with a non-zero code so that systemd knows to restart the process, but it seems more desirable to tolerate journald glitches than to restart frequently.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A1 good first issue Good for newcomers UX-and-logs v2.0.0 Altair on mainnet release (v2.0.0)
Projects
None yet
Development

No branches or pull requests

3 participants