
Pitchfork.fork_sibling: handle the middle process getting stuck #74

Merged
merged 1 commit into master from fork-sibling-timeout on Oct 31, 2023

Conversation

casperisfine
Contributor

That middle process only forks again and then exits, so it should be very fast.

However, because of after-fork callbacks (on `Process._fork`) and `at_exit` callbacks, it's possible for a bug to cause it to get stuck or be very slow.

To be resilient to this kind of bug, we hard-kill the middle process if it hasn't exited within 2 seconds.

And to make it easier to debug, we kill it with `SIGBUS` so it produces a crash report with a backtrace, core dump, etc.
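For illustration, here is a minimal sketch of that idea in plain Ruby. It is not the actual `fork_sibling` implementation; the polling wait, the stand-in sibling block, and the hard-coded 2-second deadline are assumptions used only to show the mechanism.

```ruby
# Sketch: fork a middle process that itself forks the real sibling, then
# hard-kill the middle process if it hasn't exited within 2 seconds.
middle_pid = Process.fork do
  # The middle process forks the sibling and then exits; after-fork and
  # at_exit callbacks run here and are what can make it hang.
  Process.fork { sleep 1 } # stand-in for the real sibling's work
end

deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + 2
loop do
  break if Process.waitpid(middle_pid, Process::WNOHANG) # middle process exited
  if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    # Likely stuck in an after-fork or at_exit callback: use SIGBUS rather
    # than SIGKILL so the process leaves a crash report / core dump behind.
    Process.kill(:BUS, middle_pid)
    Process.waitpid(middle_pid)
    break
  end
  sleep 0.01
end
```

The choice of `SIGBUS` over `SIGKILL` is purely about debuggability: a crashing signal leaves a backtrace and core dump pointing at whatever the middle process was stuck on.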

@casperisfine casperisfine merged commit 4f8f367 into master Oct 31, 2023
10 checks passed
@casperisfine casperisfine deleted the fork-sibling-timeout branch October 31, 2023 09:31
casperisfine pushed a commit to casperisfine/opentelemetry-ruby that referenced this pull request Oct 31, 2023
Ref: open-telemetry#363
Ref: Shopify/pitchfork#74

At Shopify we're using the Pitchfork web server, and one thing it
does is that it sometimes performs a "double fork", similar to the
classic daemonization dance.

Recently we noticed that the "middle process" of that daemonization
dance would sometimes get "stuck". After adding some debugging we
figured out it's because we have:

```ruby
at_exit do
  # Triggers flushing of buffer on shutdown
  OpenTelemetry.tracer_provider.shutdown(timeout: 60)
end
```

So the child process inherits lots of spans from its parent,
and normally OpenTelemetry deals gracefully with that by dropping
the inherited spans.

But here, because we called `shutdown`, `@keep_running` is set to `false`,
hence `reset_on_fork` is not called.

I don't understand why we shouldn't call it during shutdown though, aside
from avoiding respawning the thread.

With this change we expect our "middle process" to drop all the inherited
spans and to have nothing to flush.
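A rough sketch of the behavior being described, heavily simplified: this is not the actual `BatchSpanProcessor` source, and the class name, the instance variables other than `@keep_running`/`@pid`, and the method bodies are illustrative assumptions.

```ruby
# Simplified illustration of the shutdown/fork interaction described above.
class SketchBatchSpanProcessor
  def initialize
    @keep_running = true
    @pid = Process.pid
    @spans = [] # finished spans waiting to be exported
  end

  def shutdown(timeout: nil)
    @keep_running = false # before the fix, this meant reset_on_fork was skipped
    force_flush(timeout: timeout)
  end

  def force_flush(timeout: nil)
    # The fix: drop spans inherited from the parent process even when we are
    # already shutting down (e.g. flushing from an at_exit hook after a fork).
    reset_on_fork if @pid != Process.pid
    export(@spans)
    @spans.clear
  end

  private

  def reset_on_fork
    @pid = Process.pid
    @spans.clear # inherited spans belong to the parent; drop them
    # (the real processor also respawns its worker thread when still running)
  end

  def export(spans)
    # hand the spans to the configured exporter (omitted)
  end
end
```

In the stuck-middle-process scenario above, the forked process's `at_exit` hook calls `shutdown`, `force_flush` notices the PID changed, drops the inherited spans, and exits quickly instead of spending up to 60 seconds trying to export the parent's spans.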
fbogsany added a commit to open-telemetry/opentelemetry-ruby that referenced this pull request Oct 31, 2023
BatchSpanProcessor#force_flush: purge inherited spans even on shutdown (#1537)


Co-authored-by: Jean Boussier <[email protected]>
Co-authored-by: Francis Bogsanyi <[email protected]>