Skip to content

Commit

Permalink
roachprod: avoid extra dead event in roachprod monitor
Browse files Browse the repository at this point in the history
Prior to this commit, we'd get an extra dead event when restarting a
node:

```
// initially running
1: 5762
// issue roachprod stop
1: dead (exit status 137)
// issue roachprod start
1: dead (exit status 0)
1: 6254
```

This would in turn upset roachtest's `monitor`, as it keeps track of
the number of expected death events, and cause bogus test failures.

This was introduced in #66414 and is fixed in this commit.

As a small extra fix, if we don't observe the exit status of the
stopped systemd unit (i.e. if the process cycles rapidly), we now
correctly print that we don't know the exit status, where previously
we would print the "exit" status of the new running incarnation, i.e.
likely zero.

I tested this manually by exacerbating the sleeps in the shell snippet:

```
1: 10885
1: dead (exit status unknown)
1: 11220
```

I also verfied that repeated restarts produce the expected events in
general, i.e. one dead event following one pid event.

Closes #66522
Closes #66528

Release note: None
  • Loading branch information
tbg committed Jun 16, 2021
1 parent a021651 commit b0c459d
Showing 1 changed file with 22 additions and 5 deletions.
27 changes: 22 additions & 5 deletions pkg/cmd/roachprod/install/cluster_synced.go
Original file line number Diff line number Diff line change
Expand Up @@ -366,22 +366,39 @@ if [ ! -f "{{.Store}}/CURRENT" ]; then
fi
{{- end}}
lastpid=""
# Init with -1 so that when cockroach is initially dead, we print
# a dead event for it.
lastpid=-1
while :; do
{{ if .Local }}
pid=$(lsof -i :{{.Port}} -sTCP:LISTEN | awk '!/COMMAND/ {print $2}')
pid=${pid:-0} # default to 0
status=unknown
status="unknown"
{{- else }}
# When CRDB is not running, this is zero.
pid=$(systemctl show cockroach --property MainPID --value)
status=$(systemctl show cockroach --property ExecMainStatus --value)
{{- end }}
if [[ "${lastpid}" == -1 && "${pid}" != 0 ]]; then
# On the first iteration through the loop, if the process is running,
# don't register a PID change (which would trigger an erroneous dead
# event).
lastpid=0
fi
# Output a dead event whenever the PID changes from a nonzero value to
# any other value. In particular, we emit a dead event when the node stops
# (lastpid is nonzero, pid is zero), but not when the process then starts
# again (lastpid is zero, pid is nonzero).
if [ "${pid}" != "${lastpid}" ]; then
# Output a dead event on every PID change, except if initial PID is live.
if [[ ! ("${lastpid}" == "" && "${pid}" != 0) ]]; then
if [ "${lastpid}" != 0 ]; then
if [ "${pid}" != 0 ]; then
# If the PID changed but neither is zero, then the status refers to
# the new incarnation. We lost the actual exit status of the old PID.
status="unknown"
fi
echo "dead (exit status ${status})"
fi
fi
if [ "${pid}" != 0 ]; then
echo "${pid}"
fi
Expand Down

0 comments on commit b0c459d

Please sign in to comment.