Implement spawn_timeout configuration
Some low-level fork-safety bugs can cause workers to lock up before
they reach the client processing loop.

See: #67

While such bugs should be avoided, Pitchfork should still try
to recover from them: killing newly spawned workers that never
reach the ready state prevents cascading failures.
byroot committed Sep 8, 2023
1 parent 9899ef4 commit 18aa140
Showing 5 changed files with 50 additions and 1 deletion.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,7 @@
# Unreleased

- Implement `spawn_timeout` to protect against bugs causing workers to get stuck before they reach ready state.

# 0.8.0

- Add an `after_monitor_ready` callback, called in the monitor process at end of boot.
13 changes: 13 additions & 0 deletions docs/CONFIGURATION.md
@@ -238,6 +238,19 @@ exit or be SIGKILL-ed due to timeouts.
See https://nginx.org/en/docs/http/ngx_http_upstream_module.html
for more details on nginx upstream configuration.

### `spawn_timeout`

```ruby
spawn_timeout 5
```

Sets how long a newly spawned worker has to reach the ready state.

This timeout is a safeguard against various low-level fork-safety bugs that could cause
a process to deadlock.

The default of `10` seconds is quite generous and likely doesn't need to be adjusted.
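
If your `after_worker_fork` callback performs expensive warmup, spawned workers may
need more headroom before they are considered stuck. A hedged example (the value
`30` is purely illustrative, not a recommendation):

```ruby
# Workers warm caches in after_worker_fork before serving requests,
# so give them extra time to reach the ready state.
spawn_timeout 30
```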

### `logger`

```ruby
5 changes: 5 additions & 0 deletions lib/pitchfork/configurator.rb
@@ -33,6 +33,7 @@ class Configurator
DEFAULTS = {
:soft_timeout => 20,
:cleanup_timeout => 2,
:spawn_timeout => 10,
:timeout => 22,
:logger => default_logger,
:worker_processes => 1,
@@ -174,6 +175,10 @@ def timeout(seconds, cleanup: 2)
set_int(:timeout, soft_timeout + cleanup_timeout, 5)
end

def spawn_timeout(seconds)
set_int(:spawn_timeout, seconds, 1)
end

def worker_processes(nr)
set_int(:worker_processes, nr, 1)
end
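
For context, `set_int` is the configurator's integer-validation helper; it is not
part of this diff. A minimal sketch of its likely shape, assuming it mirrors
Unicorn's helper of the same name:

```ruby
# Illustrative sketch: store an Integer option, enforcing a lower bound.
def set_int(var, n, min)
  Integer === n or raise ArgumentError, "not an integer: #{var}=#{n.inspect}"
  n >= min or raise ArgumentError, "too low (< #{min}): #{var}=#{n.inspect}"
  set[var] = n
end
```

Under that assumption, `spawn_timeout` accepts any integer of at least 1 second.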
6 changes: 5 additions & 1 deletion lib/pitchfork/http_server.rb
@@ -74,7 +74,7 @@ def extend_deadline(extra_time)
end

# :stopdoc:
attr_accessor :app, :timeout, :soft_timeout, :cleanup_timeout, :worker_processes,
attr_accessor :app, :timeout, :soft_timeout, :cleanup_timeout, :spawn_timeout, :worker_processes,
:after_worker_fork, :after_mold_fork,
:listener_opts, :children,
:orig_app, :config, :ready_pipe,
@@ -556,6 +556,10 @@ def after_fork_internal
def spawn_worker(worker, detach:)
logger.info("worker=#{worker.nr} gen=#{worker.generation} spawning...")

# We set the deadline before spawning the child so that if for some
# reason it gets stuck before reaching the worker loop,
# the monitor process will kill it.
worker.update_deadline(@spawn_timeout)
Pitchfork.fork_sibling do
worker.pid = Process.pid

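The monitor-side half of this mechanism is the periodic deadline sweep that reaps
children past their deadline. A hedged sketch of the idea; names like
`murder_lazy_workers`, `worker.deadline`, and `kill_worker` are borrowed from the
Unicorn lineage rather than taken verbatim from this diff:

```ruby
# Illustrative sketch: on each monitor tick, SIGKILL any child whose
# deadline has passed. A worker stuck before the client loop still
# carries the deadline armed by update_deadline(@spawn_timeout) above,
# so it is reaped here instead of hanging forever.
def murder_lazy_workers
  now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  @children.workers.each do |worker|
    deadline = worker.deadline
    next if deadline == 0 || deadline > now # 0 means no deadline armed
    logger.error "worker=#{worker.nr} pid=#{worker.pid} timed out, killing"
    kill_worker(:KILL, worker.pid) # SIGKILL: the process may be unable to run Ruby
  end
end
```

The integration test below asserts exactly this `timed out, killing` log line.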
25 changes: 25 additions & 0 deletions test/integration/test_boot.rb
@@ -84,4 +84,29 @@ def test_boot_broken_after_mold_fork

assert_exited(pid, 1)
end

def test_boot_worker_stuck_in_spawn
addr, port = unused_port

pid = spawn_server(app: APP, config: <<~RUBY)
listen "#{addr}:#{port}"
worker_processes 2
refork_after [50, 100, 1000]
spawn_timeout 2
after_worker_fork do |_server, worker|
if worker.nr == 1
sleep 20 # simulate a stuck worker
end
end
RUBY


assert_healthy("http://#{addr}:#{port}")

assert_stderr("worker=0 gen=0 ready")
assert_stderr(/worker=1 pid=\d+ registered/)
assert_stderr(/worker=1 pid=\d+ timed out, killing/, timeout: 4)

assert_clean_shutdown(pid)
end
end
