storage: write to local storage before updating liveness

Previously a disk stall could allow a node to continue heartbeating its liveness record and prevent other nodes from taking over its leases, despite being completely unresponsive. This was first addressed in cockroachdb#24591 (+ cockroachdb#33122). This was undone in cockroachdb#32978 (which introduced a stricter version of a similar check). cockroachdb#32978 was later disabled by default in cockroachdb#36484, leaving us without the protections first introduced in cockroachdb#24591. This PR re-adds the logic from cockroachdb#24591. Release note: None.
irfansharif · Oct 21, 2019 · 90b350b · 90b350b
1 parent 0a6705d
commit 90b350b
Showing 1 changed file with 10 additions and 0 deletions.
diff --git a/pkg/storage/node_liveness.go b/pkg/storage/node_liveness.go
@@ -769,6 +769,16 @@ func (nl *NodeLiveness) updateLiveness(
 		if err := ctx.Err(); err != nil {
 			return err
 		}
+
+		for _, eng := range nl.engines {
+			// We synchronously write to all disks before updating liveness because we
+			// don't want any excessively slow disks to prevent leases from being
+			// shifted to other nodes. A slow/stalled disk would block here and cause
+			// the node to lose its leases.
+			if err := engine.WriteSyncNoop(ctx, eng); err != nil {
+				return errors.Wrapf(err, "couldn't update node liveness because disk write failed")
+			}
+		}
 		if err := nl.updateLivenessAttempt(ctx, update, oldLiveness, handleCondFailed); err != nil {
 			// Intentionally don't errors.Cause() the error, or we'd hop past errRetryLiveness.
 			if _, ok := err.(*errRetryLiveness); ok {