Skip to content

Commit

Permalink
storage: write to local storage before updating liveness
Browse files Browse the repository at this point in the history
Previously a disk stall could allow a node to continue heartbeating its
liveness record and prevent other nodes from taking over its leases,
despite being completely unresponsive.

This was first addressed in cockroachdb#24591 (+ cockroachdb#33122). This was undone in cockroachdb#32978
(which introduced a stricter version of a similar check). cockroachdb#32978 was
later disabled by default in cockroachdb#36484, leaving us without the protections
first introduced in cockroachdb#24591. This PR re-adds the logic from cockroachdb#24591.

Release note: None.
  • Loading branch information
irfansharif committed Oct 21, 2019
1 parent 0a6705d commit 90b350b
Showing 1 changed file with 10 additions and 0 deletions.
10 changes: 10 additions & 0 deletions pkg/storage/node_liveness.go
Original file line number Diff line number Diff line change
Expand Up @@ -769,6 +769,16 @@ func (nl *NodeLiveness) updateLiveness(
if err := ctx.Err(); err != nil {
return err
}

for _, eng := range nl.engines {
// We synchronously write to all disks before updating liveness because we
// don't want any excessively slow disks to prevent leases from being
// shifted to other nodes. A slow/stalled disk would block here and cause
// the node to lose its leases.
if err := engine.WriteSyncNoop(ctx, eng); err != nil {
return errors.Wrapf(err, "couldn't update node liveness because disk write failed")
}
}
if err := nl.updateLivenessAttempt(ctx, update, oldLiveness, handleCondFailed); err != nil {
// Intentionally don't errors.Cause() the error, or we'd hop past errRetryLiveness.
if _, ok := err.(*errRetryLiveness); ok {
Expand Down

0 comments on commit 90b350b

Please sign in to comment.