
stability: Don't restart after consistency check failure #40442

Closed
bdarnell opened this issue Sep 3, 2019 · 2 comments · Fixed by #42401

Comments

bdarnell (Contributor) commented Sep 3, 2019

When the consistency checker fails, nodes crash. This is suboptimal in many ways, but especially because if the node is restarted automatically (by systemd or k8s, say), it may come back up and serve inconsistent data before the consistency checker's next cycle detects the problem and crashes it again (the checker runs on a 24h cycle by default). If the operator is monitoring downtime or flapping rather than individual crashes, they may not even notice these relatively infrequent crashes.

Instead, until we can implement a more graceful solution to this problem, we may want to take steps to keep the node from restarting at all if its last failure was due to the consistency checker. This has two downsides: collateral unavailability if the consistency checker takes down multiple replicas (the checker doesn't currently try to identify the minority node(s)), and an increased risk that keeping the node down leads to up-replication from an incorrect surviving replica. The former downside seems OK (and is at least compatible with our CAP-consistent philosophy), though I'm not sure about the latter.

This could be implemented either by writing a death-rattle file to disk before crashing the node and checking for it on startup, or by exiting with a special status code that tells systemd/k8s not to restart the node and alerts an operator.
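
A minimal Go sketch of the marker-file option (everything here is illustrative: the DEATH_RATTLE file name, the store directory, and the exit code are assumptions, not CockroachDB's actual implementation):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeDeathRattle records why the node is terminating, so that the next
// startup can detect the prior consistency failure and refuse to proceed.
func writeDeathRattle(storeDir, reason string) error {
	// "DEATH_RATTLE" is a hypothetical marker name.
	return os.WriteFile(filepath.Join(storeDir, "DEATH_RATTLE"), []byte(reason), 0o644)
}

func main() {
	storeDir := os.TempDir() // stand-in for the node's store directory
	if err := writeDeathRattle(storeDir, "consistency check failed"); err != nil {
		fmt.Fprintln(os.Stderr, "could not write death rattle:", err)
	}
	// The alternative mentioned above: exit with a dedicated status code and
	// configure the supervisor not to restart on it (e.g. via systemd's
	// RestartPreventExitStatus= setting).
	os.Exit(7) // illustrative exit code
}
```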

tbg (Member) commented Sep 3, 2019

We already write a RocksDB checkpoint in this case from 19.1 onwards. We could take the existence of a consistency-created checkpoint as a death rattle.
The special status code option has the downside that it needs to be taught to users; in the absence of that, there's a good chance that the corrupt node will get rebooted anyway.
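
A sketch of the checkpoint-as-death-rattle idea, assuming (purely for illustration) that consistency-check checkpoints land in an auxiliary/checkpoints subdirectory of the store; the real layout may differ:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// hasConsistencyCheckpoint reports whether a prior consistency failure left a
// checkpoint behind, treating its existence as an implicit death rattle.
func hasConsistencyCheckpoint(storeDir string) bool {
	entries, err := os.ReadDir(filepath.Join(storeDir, "auxiliary", "checkpoints"))
	return err == nil && len(entries) > 0
}

func main() {
	storeDir := os.TempDir() // stand-in for the node's store directory
	if hasConsistencyCheckpoint(storeDir) {
		fmt.Fprintln(os.Stderr, "found consistency-check checkpoint; refusing to start")
		os.Exit(1)
	}
	fmt.Println("no checkpoint found; proceeding with startup")
}
```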

petermattis (Collaborator) commented
I'm planning for #40509 to get into 19.2, as it is low risk (a debug tool that is only run manually). For 20.1, we could consider always running a RocksDB/Pebble consistency check at startup after a range consistency violation has been detected.

tbg added a commit to tbg/cockroach that referenced this issue Nov 12, 2019
Many deployments auto-restart crashed nodes unconditionally.  As a
result, we are often pulled into inconsistency failures relatively late,
which results in a loss of information (in particular if the source of
the problem turns out to be one of the nodes that did not actually
crash); if the node that crashes *is* the one with the problem,
restarting it lets it serve data again (until the next time the
consistency checker nukes it), which is bad as well.

This commit introduces a "death rattle" - if a node is told to terminate
as the result of a consistency check, it will write a marker file that
prevents subsequent restarts of the node with an informative message
alerting the operator that there is a serious problem that needs to be
addressed.

Fixes cockroachdb#40442.

Release note (general change): nodes that have been terminated as the
result of a failed consistency check now refuse to restart, making it
more likely that the operator notices that there is a persistent issue
in a timely manner.
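
The startup-side half of the mechanism this commit message describes could look roughly like the sketch below, reusing the hypothetical DEATH_RATTLE marker from the earlier sketch (CockroachDB's actual marker file, location, and message differ):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// checkDeathRattle returns an error if a previous run left a marker file
// behind, carrying the reason the node was terminated.
func checkDeathRattle(storeDir string) error {
	b, err := os.ReadFile(filepath.Join(storeDir, "DEATH_RATTLE"))
	if os.IsNotExist(err) {
		return nil // no prior consistency failure recorded
	}
	if err != nil {
		return err
	}
	return fmt.Errorf("node was terminated after a failed consistency check "+
		"and must not be restarted until the problem is investigated: %s", b)
}

func main() {
	if err := checkDeathRattle(os.TempDir()); err != nil { // illustrative store dir
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("starting node")
}
```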
craig bot pushed a commit that referenced this issue Nov 18, 2019
42401: storage: prevent node from restarting after consistency check fatal r=bdarnell a=tbg

NB: I'll update the inconsistency roachtest with an end-to-end check on the restart protection.


Co-authored-by: Tobias Schottdorf <[email protected]>
craig bot closed this as completed in ba6699b on Nov 18, 2019
tbg added a commit that referenced this issue Nov 18, 2019
Many deployments auto-restart crashed nodes unconditionally. As a
result, we are often pulled into inconsistency failures relatively late,
which results in a loss of information (in particular if the source of
the problem turns out to be one of the nodes that did not actually
crash); if the node that crashes *is* the one with the problem,
restarting it lets it serve data again (until the next time the
consistency checker nukes it), which is bad as well.

This commit introduces a "death rattle" - if a node is told to terminate
as the result of a consistency check, it will write a marker file that
prevents subsequent restarts of the node with an informative message
alerting the operator that there is a serious problem that needs to be
addressed. We use the same strategy when a replica finds that its
internal invariants are violated.

Fixes #40442.

Release note (general change): nodes that have been terminated as the
result of a failed consistency check now refuse to restart, making it
more likely that the operator notices that there is a persistent issue
in a timely manner.