Skip to content

Commit

Permalink
Handle helper program startup failure as its death (#1395)
Browse files Browse the repository at this point in the history
Squid quits when started helper programs die too fast without responding
to requests[^1] because such a Squid instance is unlikely to provide
acceptable service (and a full restart may actually fix the problem or,
at the very least, is more likely to bring the needed admin attention).

The same logic now applies when Squid fails to start a helper (i.e. when
ipcCreate() fails). There is no conceptual difference between those two
kinds of failures as far as helper handling code is concerned, so we now
treat them the same way.

Without these changes, helper start failures may result in an unusable
(but running) Squid instance, especially if no helpers can be started at
all, because new transactions get stuck waiting in the queue until
clients timeout. Such persistent ipcCreate() failures may be caused, for
example, by its fork() hitting an overcommit memory limits.

[^1]: The actual condition excludes cases where at least startup=N
helpers are still running. That exclusion and other helper failure
handling details are problematic, but adjusting that code is outside
_this_ fix scope: Here, we only apply _existing_ handling logic to a
missed case.
  • Loading branch information
eduard-bagdasaryan authored and squid-anubis committed Jul 11, 2023
1 parent a1b1756 commit 654835f
Show file tree
Hide file tree
Showing 2 changed files with 47 additions and 8 deletions.
50 changes: 42 additions & 8 deletions src/helper.cc
Original file line number Diff line number Diff line change
Expand Up @@ -251,6 +251,8 @@ helperOpenServers(helper * hlp)

assert(nargs <= HELPER_MAX_ARGS);

int successfullyStarted = 0;

for (k = 0; k < need_new; ++k) {
getCurrentTime();
rfd = wfd = -1;
Expand All @@ -268,6 +270,7 @@ helperOpenServers(helper * hlp)
continue;
}

++successfullyStarted;
++ hlp->childs.n_running;
++ hlp->childs.n_active;
srv = new helper_server;
Expand Down Expand Up @@ -317,6 +320,13 @@ helperOpenServers(helper * hlp)
comm_read(srv->readPipe, srv->rbuf, srv->rbuf_sz - 1, call);
}

// Call handleFewerServers() before hlp->last_restart is updated because
// that method uses last_restart to measure the delay since previous start.
// TODO: Refactor last_restart code to measure failure frequency rather than
// detecting a helper #X failure that is being close to the helper #Y start.
if (successfullyStarted < need_new)
hlp->handleFewerServers(false);

hlp->last_restart = squid_curtime;
safe_free(shortname);
safe_free(procname);
Expand Down Expand Up @@ -376,6 +386,8 @@ helperStatefulOpenServers(statefulhelper * hlp)

assert(nargs <= HELPER_MAX_ARGS);

int successfullyStarted = 0;

for (int k = 0; k < need_new; ++k) {
getCurrentTime();
int rfd = -1;
Expand All @@ -395,6 +407,7 @@ helperStatefulOpenServers(statefulhelper * hlp)
continue;
}

++successfullyStarted;
++ hlp->childs.n_running;
++ hlp->childs.n_active;
helper_stateful_server *srv = new helper_stateful_server;
Expand Down Expand Up @@ -436,6 +449,13 @@ helperStatefulOpenServers(statefulhelper * hlp)
comm_read(srv->readPipe, srv->rbuf, srv->rbuf_sz - 1, call);
}

// Call handleFewerServers() before hlp->last_restart is updated because
// that method uses last_restart to measure the delay since previous start.
// TODO: Refactor last_restart code to measure failure frequency rather than
// detecting a helper #X failure that is being close to the helper #Y start.
if (successfullyStarted < need_new)
hlp->handleFewerServers(false);

hlp->last_restart = squid_curtime;
safe_free(shortname);
safe_free(procname);
Expand Down Expand Up @@ -838,21 +858,35 @@ helper::handleKilledServer(HelperServerBase *srv, bool &needsNewServers)
--childs.n_active;
debugs(84, DBG_CRITICAL, "WARNING: " << id_name << " #" << srv->index << " exited");

if (childs.needNew() > 0) {
debugs(80, DBG_IMPORTANT, "Too few " << id_name << " processes are running (need " << childs.needNew() << "/" << childs.n_max << ")");
handleFewerServers(srv->stats.replies >= 1);

if (childs.n_active < childs.n_startup && last_restart > squid_curtime - 30) {
if (srv->stats.replies < 1)
fatalf("The %s helpers are crashing too rapidly, need help!\n", id_name);
else
debugs(80, DBG_CRITICAL, "ERROR: The " << id_name << " helpers are crashing too rapidly, need help!");
}
if (childs.needNew() > 0) {
srv->flags.shutdown = true;
needsNewServers = true;
}
}
}

void
helper::handleFewerServers(const bool madeProgress)
{
const auto needNew = childs.needNew();

if (!needNew)
return; // some server(s) have died, but we still have enough

debugs(80, DBG_IMPORTANT, "Too few " << id_name << " processes are running (need " << needNew << "/" << childs.n_max << ")" <<
Debug::Extra << "active processes: " << childs.n_active <<
Debug::Extra << "processes configured to start at (re)configuration: " << childs.n_startup);

if (childs.n_active < childs.n_startup && last_restart > squid_curtime - 30) {
if (madeProgress)
debugs(80, DBG_CRITICAL, "ERROR: The " << id_name << " helpers are crashing too rapidly, need help!");
else
fatalf("The %s helpers are crashing too rapidly, need help!", id_name);
}
}

void
helper_server::HelperServerClosed(helper_server *srv)
{
Expand Down
5 changes: 5 additions & 0 deletions src/helper.h
Original file line number Diff line number Diff line change
Expand Up @@ -103,6 +103,11 @@ class helper
/// \param needsNewServers true if new servers must started, false otherwise
void handleKilledServer(HelperServerBase *srv, bool &needsNewServers);

/// Reacts to unexpected server death(s), including a failure to start server(s)
/// and an unexpected exit of a previously started server. \sa handleKilledServer()
/// \param madeProgress whether the died server(s) responded to any requests
void handleFewerServers(bool madeProgress);

public:
wordlist *cmdline;
dlink_list servers;
Expand Down

0 comments on commit 654835f

Please sign in to comment.