
Pitchfork.fork_sibling: handle the middle process getting stuck #74

Merged
merged 1 commit into master from fork-sibling-timeout on Oct 31, 2023

Conversation

casperisfine
Contributor

That middle process only forks again and then exits, so it should be very fast.

However, because of after-fork callbacks (on `Process._fork`) and `at_exit` callbacks, it's possible for a bug to cause it to get stuck or be very slow.

To be resilient to this kind of bug, we hard-kill the middle process if it hasn't exited within 2 seconds.

And to make it easier to debug, we kill it with `SIGBUS` so it produces a crash report with a backtrace, core dump, etc.
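For illustration, here is a minimal sketch of that idea in plain Ruby. It is not the actual `fork_sibling` implementation; the polling wait, the stand-in sibling block, and the hard-coded 2-second deadline are assumptions used only to show the mechanism.

```ruby
# Sketch: fork a middle process that itself forks the real sibling, then
# hard-kill the middle process if it hasn't exited within 2 seconds.
middle_pid = Process.fork do
  # The middle process forks the sibling and then exits; after-fork and
  # at_exit callbacks run here and are what can make it hang.
  Process.fork { sleep 1 } # stand-in for the real sibling's work
end

deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + 2
loop do
  break if Process.waitpid(middle_pid, Process::WNOHANG) # middle process exited
  if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    # Likely stuck in an after-fork or at_exit callback: use SIGBUS rather
    # than SIGKILL so the process leaves a crash report / core dump behind.
    Process.kill(:BUS, middle_pid)
    Process.waitpid(middle_pid)
    break
  end
  sleep 0.01
end
```

The choice of `SIGBUS` over `SIGKILL` is purely about debuggability: a crashing signal leaves a backtrace and core dump pointing at whatever the middle process was stuck on.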

@casperisfine casperisfine merged commit 4f8f367 into master Oct 31, 2023
10 checks passed
@casperisfine casperisfine deleted the fork-sibling-timeout branch October 31, 2023 09:31
casperisfine pushed a commit to casperisfine/opentelemetry-ruby that referenced this pull request Oct 31, 2023
Ref: open-telemetry#363
Ref: Shopify/pitchfork#74

At Shopify we're using the Pitchfork web server, and one thing it
does is that it sometimes performs a "double fork", similar to the
classic daemonization dance.

Recently we noticed that the "middle process" of that daemonization
dance would sometimes get "stuck". After adding some debugging we
figured out it's because we have:

```ruby
at_exit do
  # Triggers flushing of buffer on shutdown
  OpenTelemetry.tracer_provider.shutdown(timeout: 60)
end
```

So the child process inherits lots of spans from its parent,
and normally OpenTelemetry deals gracefully with that by dropping
the inherited spans.

But here, because we called `shutdown`, `@keep_running` is set to `false`,
hence `reset_on_fork` is not called.

I don't understand why we shouldn't call it during shutdown though, aside
from avoiding respawning the thread.

With this change we expect our "middle process" to drop all the inherited
spans and to have nothing to flush.
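A rough sketch of the behavior being described, heavily simplified: this is not the actual `BatchSpanProcessor` source, and the class name, the instance variables other than `@keep_running`/`@pid`, and the method bodies are illustrative assumptions.

```ruby
# Simplified illustration of the shutdown/fork interaction described above.
class SketchBatchSpanProcessor
  def initialize
    @keep_running = true
    @pid = Process.pid
    @spans = [] # finished spans waiting to be exported
  end

  def shutdown(timeout: nil)
    @keep_running = false # before the fix, this meant reset_on_fork was skipped
    force_flush(timeout: timeout)
  end

  def force_flush(timeout: nil)
    # The fix: drop spans inherited from the parent process even when we are
    # already shutting down (e.g. flushing from an at_exit hook after a fork).
    reset_on_fork if @pid != Process.pid
    export(@spans)
    @spans.clear
  end

  private

  def reset_on_fork
    @pid = Process.pid
    @spans.clear # inherited spans belong to the parent; drop them
    # (the real processor also respawns its worker thread when still running)
  end

  def export(spans)
    # hand the spans to the configured exporter (omitted)
  end
end
```

In the stuck-middle-process scenario above, the forked process's `at_exit` hook calls `shutdown`, `force_flush` notices the PID changed, drops the inherited spans, and exits quickly instead of spending up to 60 seconds trying to export the parent's spans.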
fbogsany added a commit to open-telemetry/opentelemetry-ruby that referenced this pull request Oct 31, 2023
BatchSpanProcessor#force_flush: purge inherited spans even on shutdown (#1537)


Co-authored-by: Jean Boussier <[email protected]>
Co-authored-by: Francis Bogsanyi <[email protected]>