Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve resiliency to process corruption #79

Open
casperisfine opened this issue Dec 8, 2023 · 1 comment
Open

Improve resiliency to process corruption #79

casperisfine opened this issue Dec 8, 2023 · 1 comment

Comments

@casperisfine
Copy link
Contributor

casperisfine commented Dec 8, 2023

Context

A fundamental problem Pitchfork has to deal with is that both POSIX and Linux don't quite support running anything but async-signal safe function after a fork().

In practice, as long as you never spawned any background thread, you are fine. But many ruby applications and gems do spawn threads, and in presence of such background threads if we happen to fork at the wrong time, it can result in a sub process that is in an unrecoverable state.

The typical case is forking while a background thread hold a lock, in the child this lock will remain locked and trying to access it will dead lock.

For instance this can happen with OpenSSL 3:

    [/usr/lib/x86_64-linux-gnu/libc.so.6] pthread_rwlock_wrlock
    [/usr/lib/x86_64-linux-gnu/libcrypto.so.3] CRYPTO_THREAD_write_lock
    [/usr/lib/x86_64-linux-gnu/libcrypto.so.3] CRYPTO_alloc_ex_data
    [/usr/lib/x86_64-linux-gnu/libcrypto.so.3] OPENSSL_thread_stop
    [/usr/lib/x86_64-linux-gnu/libcrypto.so.3] OPENSSL_cleanup
    [/usr/lib/x86_64-linux-gnu/libc.so.6] secure_getenv

So any background thread that use a SSL connection may break reforking.

That's what Pitchfork.prevent_fork is for, but still, we should try to handle such scenario as gracefully as possible.

Action Plan

  • If we detect such case we should terminate the affected process.
  • Ideally we replace that process with a new one, but if for some reason we can't, we should gracefully terminate the whole server (last resort).
  • We should consider "reverting" Spawn molds instead of promoting workers #42.
    • Spawning the new mold out of a worker has the nice property of not impacting capacity as much
    • However that fork is risky because workers are even more likely than molds to have background threads.
    • We should probably warn for every thread in the mold (Puma does something similar)
    • (optional) We could provide a way to run background threads in a dedicated process outside the mold.
    • Provide a callback to validate post-fork processes
      • Maybe even validate the usual suspects by default (OpenSSL)
casperisfine pushed a commit that referenced this issue Dec 8, 2023
Ref: #79

It can happen that the new mold was forked while at an unsafe
point, causing the middle process to crash.

When we detect this happens, we should abandon this mold.

Currently abandoning the mold cause a graceful shutdown,
in the future we could try creating a new mold to replace it.

Co-Authored-By: Étienne Barrié <[email protected]>
casperisfine pushed a commit that referenced this issue Dec 8, 2023
Ref: #79

It can happen that the new mold was forked while at an unsafe
point, causing the middle process to crash.

When we detect this happens, we should abandon this mold.

Currently abandoning the mold cause a graceful shutdown,
in the future we could try creating a new mold to replace it.

Co-Authored-By: Étienne Barrié <[email protected]>
casperisfine pushed a commit that referenced this issue Dec 8, 2023
Ref: #79

It can happen that the new mold was forked while at an unsafe
point, causing the middle process to crash.

When we detect this happens, we should abandon this mold.

Currently abandoning the mold cause a graceful shutdown,
in the future we could try creating a new mold to replace it.

Co-Authored-By: Étienne Barrié <[email protected]>
casperisfine pushed a commit that referenced this issue Dec 8, 2023
Ref: #79

It can happen that the new mold was forked while at an unsafe
point, causing the middle process to crash.

When we detect this happens, we should abandon this mold.

Currently abandoning the mold cause a graceful shutdown,
in the future we could try creating a new mold to replace it.

Co-Authored-By: Étienne Barrié <[email protected]>
casperisfine pushed a commit that referenced this issue Dec 8, 2023
Ref: #79

It can happen that the new mold was forked while at an unsafe
point, causing the middle process to crash.

When we detect this happens, we should abandon this mold.

Currently abandoning the mold cause a graceful shutdown,
in the future we could try creating a new mold to replace it.

Co-Authored-By: Étienne Barrié <[email protected]>
@casperisfine
Copy link
Contributor Author

Another possibility would be to "feature test" that the mold isn't corrupted before registering to the monitor.

e.g. most issues I've seen happen during exit, so Process.wait(fork {}) may be enough to confirm the process isn't hosed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant