Ensure silk-daemon-healthchecker doesn't restart silk forever on failure
geofffranks committed Dec 7, 2022
1 parent 6ccee62 commit 3279493
Showing 33 changed files with 498 additions and 330 deletions.
4 changes: 4 additions & 0 deletions jobs/silk-daemon/spec
@@ -106,3 +106,7 @@ properties:
 'rfc3339' is the recommended format. It will result in all timestamps in the drain log controlled by silk-daemon to be in RFC3339 format, which is human readable. This does not include stderr logs from golang libraries.
 'deprecated' will result in all timestamps being in the format they were before the rfc3339 flag was introduced for the drain log. We do not recommend using this flag unless you have scripts that expect a particular timestamp format.
 default: "rfc3339"
+
+healthchecker.failure_counter_file:
+description: "File used by the healthchecker to monitor consecutive failures."
+default: /var/vcap/data/silk-daemon/counters/consecutive_healthchecker_failures.count
3 changes: 3 additions & 0 deletions jobs/silk-daemon/templates/bpm.yml.erb
@@ -16,3 +16,6 @@ processes:
 args:
 - "-c"
 - "/var/vcap/jobs/silk-daemon/config/healthchecker.yml"
+additional_volumes:
+- path: /var/vcap/data/silk-daemon/counters
+writable: true
1 change: 1 addition & 0 deletions jobs/silk-daemon/templates/healthchecker.yml.erb
@@ -1,6 +1,7 @@
 <%=
 config = {
 'component_name' => 'silk-daemon-healthchecker',
+'failure_counter_file' => p('healthchecker.failure_counter_file'),
 'log_level' => p('logging.level'),
 'healthcheck_endpoint' => {
 'host' => '127.0.0.1',
23 changes: 15 additions & 8 deletions jobs/silk-daemon/templates/restart-silk-daemon.erb
@@ -1,13 +1,13 @@
 #!/usr/bin/env bash

 PIDFILE="/var/vcap/sys/run/silk-daemon/restart-silk-daemon.pid"
+FAILURE_COUNTER_FILE="<%= p("healthchecker.failure_counter_file") %>"

 # As this script might run longer than a monit cycle (10s) and thus might be
 # triggered several times, it must be ensured that it runs only once.
 [[ -s "$PIDFILE" ]] && exit

 function on_exit {
-/var/vcap/bosh/bin/monit reload silk-daemon-healthchecker
 rm -f $PIDFILE
 }

@@ -18,12 +18,19 @@ echo "$BASHPID" > "$PIDFILE"
 LOGFILE="/var/vcap/sys/log/silk-daemon/restart-silk-daemon.log"
 echo "$(date) - pid: $BASHPID - Monit triggered restart" >> "$LOGFILE"

-/var/vcap/bosh/bin/monit restart silk-daemon
-sleep 1
-echo "$(date) - pid: $BASHPID - Waiting for silk-daemon to be restarted" >> "$LOGFILE"
+failure_counter="$(cat ${FAILURE_COUNTER_FILE})"

-until /var/vcap/bosh/bin/monit summary | grep silk-daemon | grep -v healthchecker | grep running; do
+if (( failure_counter < 10 )); then
+/var/vcap/bosh/bin/monit restart silk-daemon
 sleep 1
-done
-
-echo "$(date) - pid: $BASHPID - silk-daemon was restarted" >> "$LOGFILE"
+echo "$(date) - pid: $BASHPID - Waiting for silk-daemon to be restarted" >> "$LOGFILE"
+
+until /var/vcap/bosh/bin/monit summary | grep silk-daemon | grep -v healthchecker | grep running; do
+sleep 1
+done
+/var/vcap/bosh/bin/monit reload silk-daemon-healthchecker
+echo "$(date) - pid: $BASHPID - silk-daemon was restarted" >> "$LOGFILE"
+else
+echo "$(date) - pid: $BASHPID - 10 consecutive failures in a row. Stopping healthcheck to avoid constantly bringing down the main service." >> "${LOGFILE}"
+/var/vcap/bosh/bin/monit unmonitor silk-daemon-healthchecker
+fi
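
The threshold check introduced above can be looked at in isolation. The following is a minimal sketch, not code from this commit: `should_restart` and its two arguments are hypothetical names, and it assumes the counter file holds a single non-negative integer. Unlike the script in the diff, it treats a missing, empty, or malformed file explicitly as zero failures.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the commit): succeeds (exit status 0)
# while the recorded number of consecutive failures is below the threshold.
should_restart() {
  local counter_file="$1"
  local threshold="${2:-10}"
  local count=0

  # A missing or unreadable file counts as zero failures.
  if [[ -r "$counter_file" ]]; then
    count="$(<"$counter_file")"
  fi

  # Anything other than a plain non-negative integer also counts as zero.
  [[ "$count" =~ ^[0-9]+$ ]] || count=0

  # 10# forces base 10 so that values such as "08" are not parsed as octal.
  (( 10#$count < threshold ))
}
```

A caller could then write `should_restart "$FAILURE_COUNTER_FILE" 10 && /var/vcap/bosh/bin/monit restart silk-daemon`, keeping the decision separate from the restart and unmonitor actions.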
4 changes: 2 additions & 2 deletions src/code.cloudfoundry.org/go.mod
@@ -15,7 +15,7 @@ replace github.com/gogo/protobuf => github.com/gogo/protobuf v1.3.2
 replace github.com/hashicorp/consul => github.com/hashicorp/consul v0.7.0

 require (
-code.cloudfoundry.org/cf-networking-helpers v0.0.0-20221117171434-3d123025a8c3
+code.cloudfoundry.org/cf-networking-helpers v0.0.0-20221205130414-742bd12bf674
 code.cloudfoundry.org/debugserver v0.0.0-20210608171006-d7658ce493f4
 code.cloudfoundry.org/diego-logging-client v0.0.0-20220314190632-277a9c460661
 code.cloudfoundry.org/executor v0.0.0-20220401134035-4e7113938d00
@@ -57,7 +57,7 @@ require (
 github.com/cloudfoundry/sonde-go v0.0.0-20200416163440-a42463ba266b // indirect
 github.com/containernetworking/plugins v0.0.0-00010101000000-000000000000 // indirect
 github.com/fsnotify/fsnotify v1.4.9 // indirect
-github.com/go-sql-driver/mysql v1.6.0 // indirect
+github.com/go-sql-driver/mysql v1.7.0 // indirect
 github.com/go-test/deep v1.0.8 // indirect
 github.com/gogo/protobuf v1.3.2 // indirect
 github.com/golang/protobuf v1.5.2 // indirect
7 changes: 4 additions & 3 deletions src/code.cloudfoundry.org/go.sum
@@ -3,8 +3,8 @@ cloud.google.com/go v0.26.0/go.mod h1:aQUYkXzVsufM+DwF1aE+0xfcU+56JwCaLick0ClmMT
 cloud.google.com/go v0.34.0/go.mod h1:aQUYkXzVsufM+DwF1aE+0xfcU+56JwCaLick0ClmMTw=
 code.cloudfoundry.org/bbs v0.0.0-20210727125654-2ad50317f7ed h1:lXyKwHvjX8AofvDkI/LVrjro0iXNMztPUBHjxxHxGKs=
 code.cloudfoundry.org/bbs v0.0.0-20210727125654-2ad50317f7ed/go.mod h1:XKlGVVXFi5EcHHMPzw3xgONK9PeEZuUbIC43XNwxD10=
-code.cloudfoundry.org/cf-networking-helpers v0.0.0-20221117171434-3d123025a8c3 h1:zG5o5H4cYWfZymMLRxRTDT/++y9TWh/nsMW0n61QO7k=
-code.cloudfoundry.org/cf-networking-helpers v0.0.0-20221117171434-3d123025a8c3/go.mod h1:AckMTCaBOprDW2tD4G9t1vuTPhFOrl6mItQna5MdC+c=
+code.cloudfoundry.org/cf-networking-helpers v0.0.0-20221205130414-742bd12bf674 h1:1RfyBh7rQ59pWhSzO02abAaPu5GTUjThz2i9a5ePRwY=
+code.cloudfoundry.org/cf-networking-helpers v0.0.0-20221205130414-742bd12bf674/go.mod h1:B49nvupgBP7F6MuIoOe1z0yXSG+1TgkDNT1oi1EOuIE=
 code.cloudfoundry.org/cfhttp/v2 v2.0.1-0.20210513172332-4c5ee488a657 h1:8rnhkeAe8Bnx+8r3unO++S3syBw8P22qPbw3LLFWEoc=
 code.cloudfoundry.org/cfhttp/v2 v2.0.1-0.20210513172332-4c5ee488a657/go.mod h1:Fwt0o/haXfwgOHMom4AM96pXCVw9EAiIcSsPb8hWK9s=
 code.cloudfoundry.org/clock v1.0.0 h1:kFXWQM4bxYvdBw2X8BbBeXwQNgfoWv1vqAk2ZZyBN2o=
@@ -160,8 +160,9 @@ github.com/go-logfmt/logfmt v0.4.0/go.mod h1:3RMwSq7FuexP4Kalkev3ejPJsZTpXXBr9+V
 github.com/go-logfmt/logfmt v0.5.0/go.mod h1:wCYkCAKZfumFQihp8CzCvQ3paCTfi41vtzG1KdI/P7A=
 github.com/go-logr/logr v1.2.3 h1:2DntVwHkVopvECVRSlL5PSo9eG+cAkDCuckLubN+rq0=
 github.com/go-sql-driver/mysql v1.5.0/go.mod h1:DCzpHaOWr8IXmIStZouvnhqoel9Qv2LBy8hT2VhHyBg=
-github.com/go-sql-driver/mysql v1.6.0 h1:BCTh4TKNUYmOmMUcQ3IipzF5prigylS7XXjEkfCHuOE=
 github.com/go-sql-driver/mysql v1.6.0/go.mod h1:DCzpHaOWr8IXmIStZouvnhqoel9Qv2LBy8hT2VhHyBg=
+github.com/go-sql-driver/mysql v1.7.0 h1:ueSltNNllEqE3qcWBTD0iQd3IpL/6U+mJxLkazJ7YPc=
+github.com/go-sql-driver/mysql v1.7.0/go.mod h1:OXbVy3sEdcQ2Doequ6Z5BW6fXNQTmx+9S1MCJN5yJMI=
 github.com/go-stack/stack v1.8.0/go.mod h1:v0f6uXyyMGvRgIKkXu+yp6POWl0qKG85gN/melR3HDY=
 github.com/go-task/slim-sprig v0.0.0-20210107165309-348f09dbbbc0/go.mod h1:fyg7847qk6SyHyPtNmDHnmrv/HOrqktSC+C9fM+CJOE=
 github.com/go-test/deep v1.0.8 h1:TDsG77qcSprGbC6vTN8OuXp5g+J+b5Pcguhf7Zt61VM=
