roachprod: avoid extra dead event in roachprod monitor

Prior to this commit, we'd get an extra dead event when restarting a node: ``` // initially running 1: 5762 // issue roachprod stop 1: dead (exit status 137) // issue roachprod start 1: dead (exit status 0) 1: 6254 ``` This would in turn upset roachtest's `monitor`, as it keeps track of the number of expected death events, and cause bogus test failures. This was introduced in #66414 and is fixed in this commit. As a small extra fix, if we don't observe the exit status of the stopped systemd unit (i.e. if the process cycles rapidly), we now correctly print that we don't know the exit status, where previously we would print the "exit" status of the new running incarnation, i.e. likely zero. I tested this manually by exacerbating the sleeps in the shell snippet: ``` 1: 10885 1: dead (exit status unknown) 1: 11220 ``` I also verfied that repeated restarts produce the expected events in general, i.e. one dead event following one pid event. Closes #66522 Closes #66528 Release note: None
cockroachdb · Jun 16, 2021 · b0c459d · b0c459d
1 parent a021651
commit b0c459d
Showing 1 changed file with 22 additions and 5 deletions.
diff --git a/pkg/cmd/roachprod/install/cluster_synced.go b/pkg/cmd/roachprod/install/cluster_synced.go
@@ -366,22 +366,39 @@ if [ ! -f "{{.Store}}/CURRENT" ]; then
 fi
 {{- end}}
 
-lastpid=""
+# Init with -1 so that when cockroach is initially dead, we print
+# a dead event for it.
+lastpid=-1
 while :; do
 {{ if .Local }}
   pid=$(lsof -i :{{.Port}} -sTCP:LISTEN | awk '!/COMMAND/ {print $2}')
 	pid=${pid:-0} # default to 0
-	status=unknown
+	status="unknown"
 {{- else }}
+  # When CRDB is not running, this is zero.
 	pid=$(systemctl show cockroach --property MainPID --value)
 	status=$(systemctl show cockroach --property ExecMainStatus --value)
 {{- end }}
 
+  if [[ "${lastpid}" == -1 && "${pid}" != 0 ]]; then
+    # On the first iteration through the loop, if the process is running,
+    # don't register a PID change (which would trigger an erroneous dead
+    # event).
+    lastpid=0
+  fi
+  # Output a dead event whenever the PID changes from a nonzero value to
+  # any other value. In particular, we emit a dead event when the node stops
+  # (lastpid is nonzero, pid is zero), but not when the process then starts
+  # again (lastpid is zero, pid is nonzero).
   if [ "${pid}" != "${lastpid}" ]; then
-		# Output a dead event on every PID change, except if initial PID is live.
-		if [[ ! ("${lastpid}" == "" && "${pid}" != 0) ]]; then
+    if [ "${lastpid}" != 0 ]; then
+      if [ "${pid}" != 0 ]; then
+        # If the PID changed but neither is zero, then the status refers to
+        # the new incarnation. We lost the actual exit status of the old PID.
+        status="unknown"
+      fi
     	echo "dead (exit status ${status})"
-		fi
+    fi
 		if [ "${pid}" != 0 ]; then
 			echo "${pid}"
     fi