Invoke restart drain for failed healthcheck #195
Conversation
The current dependency chain in monit is: nginx_cc -> cloud_controller_ng -> ccng_monit_http_healthcheck. The health check is defined as the base process so that a failure triggers a restart of the dependent processes. This is now changed to the logical dependency chain, i.e. the order in which the processes should be started: ccng_monit_http_healthcheck -> nginx_cc -> cloud_controller_ng. This removes the implicit restart trigger for the other processes in case the health check fails, but we can add a directive to explicitly invoke the 'restart_drain' script instead.
- The wait_for_server_to_become_healthy is not needed anymore as the health check is started after cc and nginx.
- The printed status code was always 0 - it was the exit code of the 'if' statement, not the 'curl' command.

Co-authored-by: Andy Paine <[email protected]>
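For illustration, a minimal sketch of the status-code fix described in the second bullet, assuming a curl-based probe. This is not the actual ccng_monit_http_healthcheck.sh.erb contents; the URL and curl flags are placeholders:

#!/bin/bash
# Sketch only: capture curl's exit code in a variable before testing it, so
# the logged value is curl's own status rather than the exit code of the
# surrounding 'if' statement.
URL="https://localhost:9024/healthz"    # hypothetical endpoint

curl --fail --silent --show-error --max-time 10 "${URL}" > /dev/null
curl_status=$?                          # exit code of curl itself

if [ "${curl_status}" -ne 0 ]; then
  echo "$(date --rfc-3339=ns) healthcheck failed, curl exit code: ${curl_status}"
  exit 1
fi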
jobs/cloud_controller_ng/templates/ccng_monit_http_healthcheck.sh.erb
start program "/var/vcap/jobs/bpm/bin/bpm start cloud_controller_ng -p ccng_monit_http_healthcheck"
stop program "/var/vcap/jobs/bpm/bin/bpm stop cloud_controller_ng -p ccng_monit_http_healthcheck"
if 1 restart within 2 cycles then exec "/var/vcap/jobs/cloud_controller_ng/bin/restart_drain"
depends on nginx_cc
Isn't this a circular dependency? cloud_controller_ng goes down (say totalmem > threshold), monit restarts nginx_cc, which causes this to call the restart_drain script? And does monit end up running restart_drain twice in that case? Once at line L#9 and once at L#14?
The restart_drain script does two things to prevent any interference with monit:

- It calls monit unmonitor here. This prevents monit from taking any actions in the following cycles. The same is done when BOSH stops a VM; it first calls monit unmonitor for all jobs and then the drain script of the jobs (if provided).
- It uses a PID guard to ensure that there is no parallel execution of this script (here).

So the scenario you described (totalmem > threshold) will result in the following:

- monit calls the restart_drain script,
- restart_drain calls monit unmonitor,
- graceful shutdown,
- restart_drain calls monit monitor,
- monit restarts the processes (as they are stopped).

There still might be an edge case that could lead to restart_drain being called twice. The totalmem (cloud_controller_ng) and the restart (ccng_monit_http_healthcheck) checks could be triggered independently at the same time (i.e. both conditions are true in the same monit cycle). Then the PID guard mentioned above should ensure that one invocation triggers the graceful shutdown and the other one just returns.
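For illustration, a minimal bash sketch of the unmonitor/PID-guard pattern described above. It is not the actual restart_drain script; the monit binary path, PID file location, process names, and shutdown command are assumptions:

#!/bin/bash
# Sketch of the pattern described above; NOT the actual restart_drain script.
set -u

PIDFILE=/var/vcap/sys/run/cloud_controller_ng/restart_drain.pid   # assumed location
MONIT=/var/vcap/bosh/bin/monit                                    # assumed monit path on a BOSH VM

# PID guard: if another invocation is already running, just return.
if [ -f "${PIDFILE}" ] && kill -0 "$(cat "${PIDFILE}")" 2>/dev/null; then
  echo "restart_drain already running, exiting"
  exit 0
fi
echo $$ > "${PIDFILE}"
trap 'rm -f "${PIDFILE}"' EXIT

# Keep monit from acting on these processes while the graceful shutdown runs.
"${MONIT}" unmonitor cloud_controller_ng
"${MONIT}" unmonitor nginx_cc

# Graceful shutdown (placeholder for the real drain logic).
/var/vcap/jobs/cloud_controller_ng/bin/shutdown_drain || true

# Re-enable monitoring; monit then restarts the (now stopped) processes.
"${MONIT}" monitor cloud_controller_ng
"${MONIT}" monitor nginx_cc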
Got it. Thanks for the thorough explanation.
Instead of re-using config options from the route registration health check, add dedicated options to configure timeout and retries for the ccng_monit_http_healthcheck process. There are situations when Cloud Controller is busy processing multiple long-running requests; although it makes sense to temporarily deregister its route to prevent additional load on the instance, the (internal) health check should not restart the instance at the same time, but only when it stays busy for too long.
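A minimal sketch of how such dedicated settings could drive the check, not the real template; the endpoint, property values, and sleep interval are assumptions, with the values standing in for ERB-rendered job properties:

#!/bin/bash
# Sketch only: probe the endpoint in a loop and only exit (letting monit act)
# after RETRIES consecutive failures, each probe with its own timeout.
set -u

URL="https://localhost:9024/healthz"    # hypothetical endpoint
TIMEOUT_PER_RETRY=30                    # would be rendered from a dedicated job property
RETRIES=5                               # would be rendered from a dedicated job property

failures=0
while true; do
  if curl --fail --silent --max-time "${TIMEOUT_PER_RETRY}" "${URL}" > /dev/null; then
    failures=0
  else
    failures=$((failures + 1))
    echo "$(date --rfc-3339=ns) healthcheck failed (${failures}/${RETRIES})"
    if [ "${failures}" -ge "${RETRIES}" ]; then
      # Exiting makes monit restart this process, which in turn triggers the
      # 'if 1 restart within 2 cycles then exec .../restart_drain' directive.
      exit 1
    fi
  fi
  sleep 10
done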
- bin/shutdown_drain does not need to echo '0'; this is already done by bin/drain (https://github.com/cloudfoundry/capi-release/blob/2b52d846f82c7dbe893ffc21f879c15a3fa1da36/jobs/cloud_controller_ng/templates/drain.sh.erb#L9).
- As the bin/shutdown_drain script is never invoked with arguments, logging the invocation is superfluous.
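A minimal sketch of the division of labour described in the first bullet, assuming a wrapper drain script around bin/shutdown_drain; the path is illustrative, and the echoed value follows the BOSH drain protocol (0 means draining is complete):

#!/bin/bash
# Sketch only: the outer drain script owns the BOSH drain protocol output,
# so the inner shutdown script does not need to echo anything itself.
set -e

# Graceful shutdown (illustrative path).
/var/vcap/jobs/cloud_controller_ng/bin/shutdown_drain

# BOSH drain protocol: printing 0 tells the BOSH agent draining is complete.
echo 0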
@@ -6,8 +6,5 @@ $LOAD_PATH.unshift('/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/l
require 'cloud_controller/drain'

@drain = VCAP::CloudController::Drain.new('/var/vcap/sys/log/cloud_controller_ng')
@drain.log_invocation(ARGV)
is there a particular reason to remove this log?
The bin/shutdown_drain script is never invoked with any arguments. So the log contained useless "Drain invoked with " messages.
echo $(date --rfc-3339=ns) 'Waiting for Cloud Controller to initially become healthy at'
wait_for_server_to_become_healthy "${URL}" "<%= p("cc.api_post_start_healthcheck_timeout_in_seconds") %>"
Is the removal of this initial wait period done with the idea that the health-check logic below should allow for sufficient time for CCNG to start up? We have done some digging and couldn't find a recorded need for this initial wait period, but were curious if you did any investigation around this.
This initial waiting was required before, when the healthcheck was started as the first process in the dependency chain. As it has now been moved to the end of the chain, both cc and nginx are already started by the time the healthcheck runs.
Just now saw that you already answered this (and our other questions) in the individual commit descriptions 😅 . Thanks.
Thanks for the PR! We tried this out and successfully witnessed a graceful restart after a failed healthcheck. This seems like a valuable change to include; we just had one last question (commented above). We will merge after that question is resolved.
Thanks for contributing to the capi_release. To speed up the process of reviewing your pull request please provide us with:

A short explanation of the proposed change:
A failing health check stops the Cloud Controller without invoking the restart_drain script. BPM will terminate the process with a grace period of 20s; any long-running requests will be cancelled and lead to a 502 Bad Gateway error at the client. With this change the restart_drain script will also be called for failing health checks, giving long-running requests a higher (configurable) timeout.

An explanation of the use cases your change solves:
On large foundations (i.e. many orgs, spaces, apps, services), we experience slow performance of Cloud Controllers. Sometimes Cloud Controllers don't even respond to new requests due to high load. Whereas it makes sense that these Cloud Controllers get deregistered from the Gorouters, the timeout for the internal health check should be higher, and a failing health check should trigger a restart with a graceful shutdown.

Links to any other associated PRs

I have viewed, signed, and submitted the Contributor License Agreement
I have made this pull request to the develop branch
I have run CF Acceptance Tests on bosh lite