kvserver: resolve outages involving failure to apply quickly and easily #75903

Open
joshimhoff opened this issue Feb 2, 2022 · 7 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-sre For issues SRE opened or otherwise cares about tracking. T-kv KV Team

Comments

@joshimhoff
Collaborator

joshimhoff commented Feb 2, 2022

Is your feature request related to a problem? Please describe.
IMPORTANT NOTE: I am sorry if I misunderstand details. This is the SRE way sometimes. Good luck reading my scribbles, KV!!

Non-deterministic failures to apply a raft command can crash the node:

This poses a problem for replicas that fail for any reason to apply an entry. If
the failure wasn't deterministic across all replicas then they can't carry on
applying entries, as their state may have diverged from their peers. The only
reasonable recourse is to signal that the replica has become corrupted. This
demonstrates why it is necessary to separate deterministic command failures from
non-deterministic state transition failures. The former, which we call "command
rejection" is permissible as long as all replicas come to the same decision to
reject the command and handle the rejection in the same way (e.g. decide not to
make any state transition). The latter, on the other hand, is not permissible,
and is typically handled by crashing the node.
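
For illustration, here is a minimal Go sketch (not CockroachDB's actual apply code; the names are made up) of the split the excerpt describes: a deterministic "command rejection" is returned to the proposer with no state change anywhere, while any other apply error has to be treated as possible replica corruption.

```go
package applysketch

import (
	"errors"
	"log"
)

// command stands in for a replicated raft command (hypothetical type).
type command struct{ payload []byte }

// errRejected marks a deterministic rejection: every replica evaluating the
// same command against the same replicated state reaches the same decision,
// so no replica makes a state transition and the proposer just gets an error.
var errRejected = errors.New("command rejected deterministically")

// applyCommand sketches the apply-side decision. stateMachineApply is assumed
// to return errRejected for deterministic rejections and any other error for
// unexpected, possibly non-deterministic failures.
func applyCommand(cmd command, stateMachineApply func(command) error) {
	switch err := stateMachineApply(cmd); {
	case err == nil:
		// Applied successfully; acknowledge the proposer.
	case errors.Is(err, errRejected):
		// Safe: all replicas reject identically and make no state change.
	default:
		// Non-deterministic failure: this replica's state may have diverged
		// from its peers, so the only safe recourse is to signal corruption
		// (in CockroachDB this crashes the node).
		log.Fatalf("replica corruption: cannot apply command: %v", err)
	}
}
```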

If the non-deterministic failure to apply will actually happen repeatedly until an operator takes some action, we have a large-scale outage on our hands with high impact & MTTR. I won't say which customer, but a recent on-prem customer experienced just such an outage.

Describe the solution you'd like
I don't really see why CRDB can't determine that no node will ever successfully apply a command. If CRDB can determine this, can CRDB stop trying to apply the entry on all replicas automatically, without affecting correctness?

With MTTR = human response time, we could require some human input & even take downtime, to simplify implementation. Imagine something like this:

  1. An operator sets a cluster setting or CLI flag called delete_raft_entries_that_are_not_applied_on_any_nodes (prob CLI flag since cluster may be too down to set a cluster setting).
  2. Each CRDB node stops serving.
  3. Each CRDB node looks for entries that are committed but not applied. Entries that are committed but not applied are gossiped or something.
  4. If an entry has been committed but not applied on all nodes, we disable it from applying on all replicas, via a sweet trick from Nathan involving setting MaxLeaseIndex to 0 (see the sketch after this list).
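
To make step 4 concrete, here is a hedged, very simplified sketch (types and names are illustrative, not the real kvserver code) of the apply-time check the MaxLeaseIndex trick leans on: a command whose MaxLeaseIndex is not above the replica's lease applied index is rejected deterministically on every replica, so rewriting the stuck entry with MaxLeaseIndex = 0 turns it into a guaranteed no-op rejection everywhere.

```go
package applysketch

// raftCommand is a simplified stand-in for a replicated proposal.
type raftCommand struct {
	MaxLeaseIndex uint64
	// ... payload elided ...
}

// replicaState is a simplified stand-in for per-replica applied state.
type replicaState struct {
	LeaseAppliedIndex uint64
}

// shouldReject mirrors the "forced error" decision: because it depends only
// on replicated inputs, every replica reaches the same answer. A command
// with MaxLeaseIndex == 0 can never exceed the lease applied index, so it is
// rejected on every replica instead of being applied.
func shouldReject(cmd raftCommand, state replicaState) bool {
	return cmd.MaxLeaseIndex <= state.LeaseAppliedIndex
}
```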

Describe alternatives you've considered

Additional context
N/A

Jira issue: CRDB-12874

Epic CRDB-39898

@joshimhoff joshimhoff added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. O-sre For issues SRE opened or otherwise cares about tracking. labels Feb 2, 2022
@joshimhoff joshimhoff changed the title kvserver: resolve outages involving failure to apply quickly kvserver: resolve outages involving failure to apply quickly and easily Feb 2, 2022
@erikgrinaker
Contributor

erikgrinaker commented Feb 3, 2022

I agree with the goal here -- it's a significant weakness of Raft that a single failing command can cause head-of-line blocking and prevent all progress. That said, I'm not sure if an auto-repair mechanism is going to be practical -- certainly not in the short term.

In the outage you mentioned, the nodes were crash-looping due to a panic. We can't do anything while the node is crashed, so that's the first thing we'd have to look into: catching panics at the replica layer and limiting the blast radius to the range rather than the node. This is a good idea in principle, and we're going to be making some improvements in that direction, but it's tricky because...

The second problem is correctness. Ok, so we've now handled some arbitrary panic coming from "somewhere" (could be deep in Pebble, or in a third-party library). Or we could have some other arbitrary error bubble up. How do we know that this failure isn't affecting correctness? Maybe some partial state was written somewhere before it failed. Maybe the failure was in a shared component that can affect correctness in other unrelated ranges. Maybe the failure affected internal non-atomic Raft state leaving it inconsistent. How can we know? We can't, not without extensively classifying all possible failure modes in all possible components (including external ones such as libraries and the operating system).

But let's say that we try to return appropriate errors for all sorts of failure modes, and handle the ones that are safe to handle (or just accept that correctness may be violated). We now have these replicas continually trying (and failing) to apply this log entry. First of all, how do we detect that something is even wrong? And should we catch stuck entries too, or just erroring ones? We can use the new replica circuit breakers, but is it stuck due to a quorum loss or due to log application? What if it's only stuck on two replicas, but a third one managed to apply it? We'll need a new failure detector that tries to figure all of this out.
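
For a sense of what such a detector would even look at, here is an illustrative sketch (not an existing CockroachDB component) that flags "committed but not applying" on a single replica. Note that by itself it cannot distinguish quorum loss, a failing entry, or one lagging replica, which is exactly the ambiguity above.

```go
package applysketch

import "time"

// replicaProgress is an illustrative snapshot of one replica's raft progress.
type replicaProgress struct {
	CommitIndex  uint64    // highest committed raft index known to the replica
	AppliedIndex uint64    // highest index actually applied to the state machine
	LastAdvance  time.Time // last time AppliedIndex moved forward
}

// stuckApplying is a rough heuristic: the replica knows about committed
// entries it has not applied, and its applied index has not advanced for
// longer than threshold. It deliberately says nothing about *why*.
func stuckApplying(p replicaProgress, now time.Time, threshold time.Duration) bool {
	return p.AppliedIndex < p.CommitIndex && now.Sub(p.LastAdvance) > threshold
}
```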

Ok, so we find a stuck entry and would like to tell these replicas to skip it. Is it safe to skip the entry? Maybe there are subsequent entries in the log that depend on this entry for correctness? Let's chop off the entire remainder of the log to be sure, and abort all requests that depend on them. But what if someone was satisfied that once the entry was committed to Raft it's going to get applied at some point, and expects that to happen? That is the Raft contract, after all, and we've just violated it.

Ok, let's accept that some callers will get upset. How do we coordinate this? We'd normally use Raft to coordinate and replicate state changes, but Raft is borked, so we're going to have to have additional infrastructure to replicate this. Whose responsibility is it to do this coordination? We'd normally use the leaseholder, but what if there is no lease? Let's assume "someone" does it -- but we better make sure only one actor decides to do it, otherwise it can start chopping off logs after the range has been fixed by someone else. It sends off an RPC to the replicas. What if that RPC succeeds on two nodes, but not on a third node for some reason? What if one of the nodes somehow manages to recover and get the application through (maybe there was a temporary disk problem), while the others didn't? We now have replica divergence. Should we automatically recover from that too? Which diverged replica do we choose? What if we've already acknowledged some of these applied entries to clients?

I guess the point I'm trying to make is that here be dragons, and if something goes horribly wrong like this I would never trust a computer to automatically diagnose and recover from it. Users presumably pick CRDB because they value correctness over availability (to some extent), and so we should always err on the side of correctness. I think the best thing we can do here is to fail hard (preferably limited to the range), get a human involved to assess the failure, and provide some tooling for offline Raft surgery. And as always, lots more tests to ensure these sorts of failures don't happen in the first place.

@joshimhoff
Collaborator Author

joshimhoff commented Feb 3, 2022

This all makes sense. Thanks for all the color!!!

Is it safe to skip the entry? Maybe there are subsequent entries in the log that depend on this entry for correctness? Let's chop off the entire remainder of the log to be sure, and abort all requests that depend on them. But what if someone was satisfied that once the entry was committed to Raft it's going to get applied at some point, and expects that to happen? That is the Raft contract, after all, and we've just violated it.

Above is prob the most critical piece I was missing. Not all commands are safe to skip, as I was overly focused on the recent outage which involved a command that likely was safe to skip. In retrospect this is obvious :)

But what if someone was satisfied that once the entry was committed to Raft it's going to get applied at some point, and expects that to happen?

Could we block on returning success to clients on at least one node successfully applying a command? Or is that a non-starter given performance goals? If we did block on returning success, would we still have the above correctness concern or no?

Maybe some partial state was written somewhere before it failed. Maybe the failure was in a shared component that can affect correctness in other unrelated ranges. Maybe the failure affected internal non-atomic Raft state leaving it inconsistent. How can we know? We can't, not without extensively classifying all possible failure modes in all possible components (including external ones such as libraries and the operating system).

Is the partial bad state we are worried about in durable storage or in memory? If it's in durable storage, that sounds really tough & also unfortunate. If it's in memory, could we rely on restarting the node a single time to clear the partial bad state?

I agree with the goal here -- it's a significant weakness of Raft that a single failing command can cause head-of-line blocking and prevent all progress.

Here's a different weird idea: Could we make a problem like the recent pebble crashes happen before a command is committed somehow? If pebble had transactions (IIUC rocksdb did but pebble doesn't), could we open a transaction, do the application, and then roll it back? And if that transaction fails for some reason, the command is rejected without being proposed.

@erikgrinaker
Contributor

Is it safe to skip the entry? Maybe there are subsequent entries in the log that depend on this entry for correctness? Let's chop off the entire remainder of the log to be sure, and abort all requests that depend on them. But what if someone was satisfied that once the entry was committed to Raft it's going to get applied at some point, and expects that to happen? That is the Raft contract, after all, and we've just violated it.

Above is prob the most critical piece I was missing. Not all commands are safe to skip, as I was overly focused on the recent outage which involved a command that likely was safe to skip. In retrospect this is obvious :)

It may actually turn out that all the commands we use in CRDB would be safe to skip. But Raft can't know for sure, so we'd essentially have to audit all commands and uphold this through developer discipline, which is brittle.

But what if someone was satisfied that once the entry was committed to Raft it's going to get applied at some point, and expects that to happen?

Could we block on returning success to clients on at least one node successfully applying a command? Or is that a non-starter given performance goals? If we did block on returning success, would we still have the above correctness concern or no?

We often do this. But e.g. for intent writes we're satisfied as soon as it's committed to Raft, and don't bother waiting for application, as this would increase transaction latency.
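
Purely for illustration (this is not the actual proposal code), the difference comes down to which signal the proposer waits on before acknowledging the client:

```go
package applysketch

// proposalResult is an illustrative handle for an in-flight raft proposal.
type proposalResult struct {
	committed chan struct{} // closed once the entry is durably committed in the raft log
	applied   chan struct{} // closed once this replica has applied the entry
}

// ackOnCommit returns as soon as the entry is committed; the client is told
// "success" before any replica has applied it (lower latency, but the caller
// is now relying on the raft contract that application eventually happens).
func ackOnCommit(p proposalResult) { <-p.committed }

// ackOnApply additionally waits for local application, trading latency for
// the guarantee that at least one replica has actually applied the command.
func ackOnApply(p proposalResult) {
	<-p.committed
	<-p.applied
}
```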

Maybe some partial state was written somewhere before it failed. Maybe the failure was in a shared component that can affect correctness in other unrelated ranges. Maybe the failure affected internal non-atomic Raft state leaving it inconsistent. How can we know? We can't, not without extensively classifying all possible failure modes in all possible components (including external ones such as libraries and the operating system).

Is the partial bad state we are worried about in durable storage or memory? If it is in durable storage, sounds really tough & also unfortunate. If it it's in memory, could we rely on restarting the node a single time to clear the partial bad state?

It could be either, and we can't really know. At least with offline recovery, we only have to worry about persisted state.

I agree with the goal here -- it's a significant weakness of Raft that a single failing command can cause head-of-line blocking and prevent all progress.

Here's a different weird idea: Could we make a problem like the recent pebble crashes happen before a command is committed somehow? If pebble had transactions (IIUC rocksdb did but pebble doesn't), could we open a transaction, do the application, and then roll it back? And if that transaction fails for some reason, the command is rejected without being proposed.

That's typically what we do: we evaluate the request first, then we take the resulting write batch and put that through Raft (log truncation doesn't because that would be inefficient). That batch is atomic, like a transaction, and application basically commits that batch. But what if the error that failed the application came from within that batch commit machinery? Can we still trust that the batch was atomic and successfully rolled back? We really can't, and that's the crux of the problem: with arbitrary failures, we have no idea if the guarantees we normally rely on still hold.
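
As a standalone illustration of that "evaluate into a batch, apply by committing the batch atomically" shape, here is a small program using Pebble's batch API directly; this is not the replica apply path itself, just the underlying primitive it relies on.

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	db, err := pebble.Open("demo-data", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// "Evaluation": stage all effects of the command into a write batch.
	// Nothing below is visible in the store until the batch commits.
	b := db.NewBatch()
	defer b.Close()
	if err := b.Set([]byte("k1"), []byte("v1"), nil); err != nil {
		log.Fatal(err)
	}
	if err := b.Set([]byte("k2"), []byte("v2"), nil); err != nil {
		log.Fatal(err)
	}

	// "Application": commit the batch atomically. Either every write in the
	// batch becomes durable or none do -- unless the failure comes from
	// within this commit machinery itself, which is the case the comment
	// above worries about.
	if err := b.Commit(pebble.Sync); err != nil {
		log.Fatal(err)
	}
}
```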

@joshimhoff
Collaborator Author

joshimhoff commented Feb 3, 2022

It may actually turn out that all the commands we use in CRDB would be safe to skip. But Raft can't know for sure, so we'd essentially have to audit all commands and uphold this through developer discipline, which is brittle.

Ah I see. Very interesting.

Maybe I'm being overly optimistic, but it sounds like having a better story around quickly recovering from failures of this kind might actually be possible eventually, though it requires a lot of careful thought, engineering work, & especially automated testing, else we would be taking on way too much risk. Certainly relying on the discipline of engineers to keep a rarely run recovery codepath deep in the weeds of KV working is not such a nice idea. I totally feel you on rarely run recovery codepaths being exactly the place where things go very wrong. And my experience is as code owner of a system with way way way less complicated & stringent correctness goals than KV; the KV goals obviously raise the stakes.

Thx for the discussion!

@erikgrinaker
Contributor

Yeah, I think there's plenty of stuff to pull at here in terms of offline recovery tools, blast radii, testing, and general resilience. But I'd want a human in the loop when something does go very horribly wrong.

@joshimhoff
Collaborator Author

Wrote up one more ticket. Perhaps an exponential backoff on application is a low-eng-cost way to reduce blast radius, as it sidesteps all these correctness concerns (eventually the application will happen)?

#75944
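
For flavor, a rough sketch of what backing off on reapplication could look like (illustrative only, not a proposed implementation): the range stays unavailable while it backs off, but the node is no longer crash-looping at full speed, without making any judgment about whether the entry is safe to skip.

```go
package applysketch

import (
	"math/rand"
	"time"
)

// retryApply keeps retrying a failing apply with exponentially growing
// pauses, capped at maxBackoff.
func retryApply(apply func() error, maxBackoff time.Duration) {
	backoff := 10 * time.Millisecond
	for {
		if err := apply(); err == nil {
			return
		}
		// Jitter avoids all replicas retrying in lockstep.
		sleep := backoff/2 + time.Duration(rand.Int63n(int64(backoff/2)+1))
		time.Sleep(sleep)
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```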

@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!
