
stability: Don't restart after consistency check failure #40442

Closed
bdarnell opened this issue Sep 3, 2019 · 2 comments · Fixed by #42401

Comments

bdarnell (Contributor) commented Sep 3, 2019

When the consistency checker fails, nodes crash. This is suboptimal in many ways, but especially because if the node is restarted automatically (by systemd or k8s, say), it may come back up and serve inconsistent data before the consistency checker's next cycle detects the problem and crashes it again (the checker runs on a 24h cycle by default). If the operator is monitoring downtime or flapping rather than individual crashes, they may not even notice these relatively infrequent crashes.

Instead, until we can implement a more graceful solution to this problem, we may want to take steps to keep the node from restarting at all if its last failure was due to the consistency checker. This has two downsides: collateral unavailability if the consistency checker takes down multiple replicas (the checker doesn't currently try to identify the minority node(s)), and an increased risk that keeping the node down leads to up-replication from an incorrect surviving replica. The former downside seems OK (and is at least compatible with our CAP-consistent philosophy), though I'm not sure about the latter.

This could be implemented either by writing a death-rattle file to disk before crashing the node and checking for it on startup, or by exiting with a special status code that tells systemd/k8s not to restart the node and alerts an operator.
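
A minimal Go sketch of the marker-file option (everything here is illustrative: the DEATH_RATTLE file name, the store directory, and the exit code are assumptions, not CockroachDB's actual implementation):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeDeathRattle records why the node is terminating, so that the next
// startup can detect the prior consistency failure and refuse to proceed.
func writeDeathRattle(storeDir, reason string) error {
	// "DEATH_RATTLE" is a hypothetical marker name.
	return os.WriteFile(filepath.Join(storeDir, "DEATH_RATTLE"), []byte(reason), 0o644)
}

func main() {
	storeDir := os.TempDir() // stand-in for the node's store directory
	if err := writeDeathRattle(storeDir, "consistency check failed"); err != nil {
		fmt.Fprintln(os.Stderr, "could not write death rattle:", err)
	}
	// The alternative mentioned above: exit with a dedicated status code and
	// configure the supervisor not to restart on it (e.g. via systemd's
	// RestartPreventExitStatus= setting).
	os.Exit(7) // illustrative exit code
}
```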

tbg (Member) commented Sep 3, 2019

We already write a RocksDB checkpoint in this case from 19.1 onwards. We could take the existence of a consistency-created checkpoint as a death rattle.
The special status code option has the downside that it needs to be taught to users; in the absence of that, there's a good chance that the corrupt node will get rebooted anyway.
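
A sketch of the checkpoint-as-death-rattle idea, assuming (purely for illustration) that consistency-check checkpoints land in an auxiliary/checkpoints subdirectory of the store; the real layout may differ:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// hasConsistencyCheckpoint reports whether a prior consistency failure left a
// checkpoint behind, treating its existence as an implicit death rattle.
func hasConsistencyCheckpoint(storeDir string) bool {
	entries, err := os.ReadDir(filepath.Join(storeDir, "auxiliary", "checkpoints"))
	return err == nil && len(entries) > 0
}

func main() {
	storeDir := os.TempDir() // stand-in for the node's store directory
	if hasConsistencyCheckpoint(storeDir) {
		fmt.Fprintln(os.Stderr, "found consistency-check checkpoint; refusing to start")
		os.Exit(1)
	}
	fmt.Println("no checkpoint found; proceeding with startup")
}
```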

petermattis (Collaborator) commented
I'm planning for #40509 to get into 19.2, as it is low risk (a debug tool that is only run manually). For 20.1, we could consider always running a RocksDB/Pebble consistency check at startup after a range consistency violation has been detected.

tbg added a commit to tbg/cockroach that referenced this issue Nov 12, 2019
Many deployments auto-restart crashed nodes unconditionally.  As a
result, we are often pulled into inconsistency failures relatively late,
which results in a loss of information (in particular if the source of
the problem turns out to be one of the nodes that did not actually
crash); if the node that crashes *is* the one with the problem,
restarting it lets it serve data again (until the next time the
consistency checker nukes it), which is bad as well.

This commit introduces a "death rattle" - if a node is told to terminate
as the result of a consistency check, it will write a marker file that
prevents subsequent restarts of the node with an informative message
alerting the operator that there is a serious problem that needs to be
addressed.

Fixes cockroachdb#40442.

Release note (general change): nodes that have been terminated as the
result of a failed consistency check now refuse to restart, making it
more likely that the operator notices that there is a persistent issue
in a timely manner.
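
The startup-side half of the mechanism this commit message describes could look roughly like the sketch below, reusing the hypothetical DEATH_RATTLE marker from the earlier sketch (CockroachDB's actual marker file, location, and message differ):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// checkDeathRattle returns an error if a previous run left a marker file
// behind, carrying the reason the node was terminated.
func checkDeathRattle(storeDir string) error {
	b, err := os.ReadFile(filepath.Join(storeDir, "DEATH_RATTLE"))
	if os.IsNotExist(err) {
		return nil // no prior consistency failure recorded
	}
	if err != nil {
		return err
	}
	return fmt.Errorf("node was terminated after a failed consistency check "+
		"and must not be restarted until the problem is investigated: %s", b)
}

func main() {
	if err := checkDeathRattle(os.TempDir()); err != nil { // illustrative store dir
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("starting node")
}
```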
craig bot pushed a commit that referenced this issue Nov 18, 2019
42401: storage: prevent node from restarting after consistency check fatal r=bdarnell a=tbg

NB: I'll update the inconsistency roachtest with an end-to-end check on the restart protection.


Co-authored-by: Tobias Schottdorf <[email protected]>
craig bot closed this as completed in ba6699b on Nov 18, 2019
tbg added a commit that referenced this issue Nov 18, 2019
Many deployments auto-restart crashed nodes unconditionally. As a
result, we are often pulled into inconsistency failures relatively late,
which results in a loss of information (in particular if the source of
the problem turns out to be one of the nodes that did not actually
crash); if the node that crashes *is* the one with the problem,
restarting it lets it serve data again (until the next time the
consistency checker nukes it), which is bad as well.

This commit introduces a "death rattle" - if a node is told to terminate
as the result of a consistency check, it will write a marker file that
prevents subsequent restarts of the node with an informative message
alerting the operator that there is a serious problem that needs to be
addressed. We use the same strategy when a replica finds that its
internal invariants are violated.

Fixes #40442.

Release note (general change): nodes that have been terminated as the
result of a failed consistency check now refuse to restart, making it
more likely that the operator notices that there is a persistent issue
in a timely manner.