stability: Don't restart after consistency check failure #40442
Comments
We already write a RocksDB checkpoint from 19.1 onwards in this case. We could take the existence of a consistency-created checkpoint as a deathrattle.
I'm planning for #40509 to get into 19.2 as it is low risk (a debug tool that is only run manually). For 20.1 we could consider always running a RocksDB/Pebble consistency check at startup after a range consistency violation has been detected.
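A minimal Go sketch of the idea in the two comments above: at startup, treat the presence of a checkpoint left behind by a consistency-check failure as the deathrattle and refuse to start. The `auxiliary/checkpoints` directory layout, the helper name, and the store path are assumptions for illustration, not the actual CockroachDB code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// checkNoConsistencyCheckpoint returns an error if a checkpoint directory
// created by a previous consistency-check failure exists under storeDir.
func checkNoConsistencyCheckpoint(storeDir string) error {
	ckptDir := filepath.Join(storeDir, "auxiliary", "checkpoints") // assumed layout
	entries, err := os.ReadDir(ckptDir)
	if os.IsNotExist(err) {
		return nil // no checkpoint directory: start normally
	}
	if err != nil {
		return err
	}
	if len(entries) > 0 {
		return fmt.Errorf("refusing to start: %d consistency checkpoint(s) found in %s; "+
			"a previous consistency check likely failed on this store", len(entries), ckptDir)
	}
	return nil
}

func main() {
	if err := checkNoConsistencyCheckpoint("/mnt/data1/cockroach"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```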
42401: storage: prevent node from restarting after consistency check fatal r=bdarnell a=tbg

NB: I'll update the inconsistency roachtest with an end-to-end check on the restart protection.

----

Many deployments auto-restart crashed nodes unconditionally. As a result, we are often pulled into inconsistency failures relatively late, which results in a loss of information (in particular if the source of the problem turns out to be one of the nodes that did not actually crash); if the node that crashes *is* the one with the problem, restarting it lets it serve data again (until the next time the consistency checker nukes it), which is bad as well.

This commit introduces a "death rattle": if a node is told to terminate as the result of a consistency check, it will write a marker file that prevents subsequent restarts of the node, with an informative message alerting the operator that there is a serious problem that needs to be addressed.

Fixes #40442.

Release note (general change): nodes that have been terminated as the result of a failed consistency check now refuse to restart, making it more likely that the operator notices that there is a persistent issue in a timely manner.

Co-authored-by: Tobias Schottdorf <[email protected]>
Many deployments auto-restart crashed nodes unconditionally. As a result, we are often pulled into inconsistency failures relatively late, which results in a loss of information (in particular if the source of the problem turns out to be one of the nodes that did not actually crash); if the node that crashes *is* the one with the problem, restarting it lets it serve data again (until the next time the consistency checker nukes it), which is bad as well.

This commit introduces a "death rattle": if a node is told to terminate as the result of a consistency check, it will write a marker file that prevents subsequent restarts of the node, with an informative message alerting the operator that there is a serious problem that needs to be addressed. We use the same strategy when a replica finds that its internal invariants are violated.

Fixes #40442.

Release note (general change): nodes that have been terminated as the result of a failed consistency check now refuse to restart, making it more likely that the operator notices that there is a persistent issue in a timely manner.
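To make the mechanism described in the commit message concrete, here is a rough Go sketch of a marker-file based death rattle: write a marker when a consistency check forces the node down, and refuse to start while the marker exists. The file name, message format, and function names are illustrative, not taken from the actual implementation.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// markerName is a hypothetical file name for the death rattle marker.
const markerName = "_CONSISTENCY_FAILURE_MARKER.txt"

// writeDeathRattle records why the node terminated so that later startups can
// refuse to proceed until an operator removes the marker.
func writeDeathRattle(storeDir, reason string) error {
	msg := fmt.Sprintf(
		"node terminated after a failed consistency check at %s: %s\n"+
			"remove this file only after the problem has been investigated\n",
		time.Now().UTC().Format(time.RFC3339), reason)
	return os.WriteFile(filepath.Join(storeDir, markerName), []byte(msg), 0o644)
}

// checkDeathRattle refuses startup while a marker from a previous
// consistency-check failure is present.
func checkDeathRattle(storeDir string) error {
	b, err := os.ReadFile(filepath.Join(storeDir, markerName))
	if os.IsNotExist(err) {
		return nil // no marker: start normally
	}
	if err != nil {
		return err
	}
	return fmt.Errorf("refusing to start node:\n%s", string(b))
}

func main() {
	storeDir := os.TempDir() // stand-in for the real store directory
	_ = writeDeathRattle(storeDir, "checksum mismatch detected by the consistency checker")
	if err := checkDeathRattle(storeDir); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```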
When the consistency checker fails, nodes crash. This is suboptimal in a lot of ways, but especially because if the node automatically restarts (such as by systemd or k8s), it may come back up and serve inconsistent data before the consistency checker's next cycle detects the problem and crashes it again (the checker runs on a 24h cycle by default). If the operator is monitoring downtime or flapping instead of individual crashes, they may not even notice these relatively infrequent crashes.
Instead, until we can implement a more graceful solution to this problem, we may want to take steps to keep the node from restarting at all if its last failure was due to the consistency checker. This has two downsides: collateral unavailability if the consistency checker takes down multiple replicas (the checker doesn't currently try to identify the minority node(s)), and an increased risk that the downed node might lead to upreplication of an incorrect replica. The former downside seems OK (and is at least compatible with our CAP-consistent philosophy), although I'm not sure about the latter.
This could be implemented either by writing a deathrattle file to disk before crashing the node and checking for it on startup, or by exiting with a special status code that tells systemd/k8s not to restart the node and alerts an operator.
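For comparison with the marker-file sketch above, a minimal Go sketch of the second option: terminating with a dedicated exit status that the supervisor is configured not to restart on. The exit code value and helper name are arbitrary choices; under systemd this would additionally require a matching `RestartPreventExitStatus=` setting in the unit file.

```go
package main

import (
	"fmt"
	"os"
)

// exitCodeConsistencyFailure is a hypothetical exit code reserved for
// consistency-check fatals; the supervisor must be configured to treat it
// as non-restartable (e.g. RestartPreventExitStatus=7 under systemd).
const exitCodeConsistencyFailure = 7

// fatalConsistencyFailure logs the reason and exits with the reserved code
// instead of the generic crash exit code.
func fatalConsistencyFailure(reason string) {
	fmt.Fprintf(os.Stderr, "fatal: consistency check failed: %s\n", reason)
	os.Exit(exitCodeConsistencyFailure)
}

func main() {
	fatalConsistencyFailure("replica checksum mismatch") // example trigger
}
```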