kvserver: resolve outages involving failure to apply quickly and easily #75903
Comments
I agree with the goal here -- it's a significant weakness of Raft that a single failing command can cause head-of-line blocking and prevent all progress. That said, I'm not sure if an auto-repair mechanism is going to be practical -- certainly not in the short term.

In the outage you mentioned, the nodes were crash-looping due to a panic. We can't do anything when the node is crashed, so that's the first thing we'll have to look into: consider trying to catch replica-level panics at the replica layer, and limit the blast radius to the range rather than the node. This is a good idea in principle, and we're going to be making some improvements in that direction, but it's tricky because...

The second problem is correctness. Ok, so we've now handled some arbitrary panic coming from "somewhere" (could be deep in Pebble, or in a third-party library). Or we could have some other arbitrary error bubble up. How do we know that this failure isn't affecting correctness? Maybe some partial state was written somewhere before it failed. Maybe the failure was in a shared component that can affect correctness in other unrelated ranges. Maybe the failure affected internal non-atomic Raft state, leaving it inconsistent. How can we know? We can't, not without extensively classifying all possible failure modes in all possible components (including external ones such as libraries and the operating system).

But let's say that we try to return appropriate errors for all sorts of failure modes, and handle the ones that are safe to handle (or just accept that correctness may be violated). We now have these replicas continually trying (and failing) to apply this log entry. First of all, how do we detect that something is even wrong? And should we catch stuck entries too, or just erroring ones? We can use the new replica circuit breakers, but is it stuck due to a quorum loss or due to log application? What if it's only stuck on two replicas, but a third one managed to apply it? We'll need a new failure detector that tries to figure all of this out.

Ok, so we find a stuck entry and would like to tell these replicas to skip it. Is it safe to skip the entry? Maybe there are subsequent entries in the log that depend on this entry for correctness? Let's chop off the entire remainder of the log to be sure, and abort all requests that depend on them. But what if someone was satisfied that once the entry was committed to Raft it's going to get applied at some point, and expects that to happen? That is the Raft contract, after all, and we've just violated it.

Ok, let's accept that some callers will get upset. How do we coordinate this? We'd normally use Raft to coordinate and replicate state changes, but Raft is borked, so we're going to need additional infrastructure to replicate this. Whose responsibility is it to do this coordination? We'd normally use the leaseholder, but what if there is no lease? Let's assume "someone" does it -- but we'd better make sure only one actor decides to do it, otherwise it can start chopping off logs after the range has been fixed by someone else.

It sends off an RPC to the replicas. What if that RPC succeeds on two nodes, but not on a third node for some reason? What if one of the nodes somehow manages to recover and get the application through (maybe there was a temporary disk problem), while the others didn't? We now have replica divergence. Should we automatically recover from that too? Which diverged replica do we choose? What if we've already acknowledged some of these applied entries to clients?

I guess the point I'm trying to make is that here be dragons, and if something goes horribly wrong like this I would never trust a computer to automatically diagnose and recover from it. Users presumably pick CRDB because they value correctness over availability (to some extent), and so we should always err on the side of correctness. I think the best thing we can do here is to fail hard (preferably limited to the range), get a human involved to assess the failure, and provide some tooling for offline Raft surgery. And as always, lots more tests to ensure these sorts of failures don't happen in the first place.
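For illustration only, here is a minimal Go sketch of the "catch replica-level panics and limit the blast radius to the range rather than the node" idea. The applyEntry/applyEntryGuarded names are hypothetical stand-ins, not CRDB's actual apply path, and note that recovering the panic tells us nothing about whether partial state was already written -- which is exactly the correctness problem described above.

```go
package main

import "fmt"

// applyEntry is a stand-in for applying one Raft entry to a replica's state
// machine; the real CRDB apply path is far more involved.
func applyEntry(rangeID int64, entry []byte) error {
	if len(entry) == 0 {
		panic("malformed entry") // simulate a bug deep in the apply path
	}
	return nil
}

// applyEntryGuarded turns a panic during application into an error scoped to
// this one range instead of a node crash. Crucially, it cannot tell us
// whether partial state was already written before the panic.
func applyEntryGuarded(rangeID int64, entry []byte) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("r%d: panic while applying entry: %v", rangeID, r)
		}
	}()
	return applyEntry(rangeID, entry)
}

func main() {
	if err := applyEntryGuarded(42, nil); err != nil {
		fmt.Println("apply failed, node still alive:", err)
	}
}
```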
This all makes sense. Thanks for all the color!!!
Above is prob the most critical piece I was missing. Not all commands are safe to skip, as I was overly focused on the recent outage which involved a command that likely was safe to skip. In retrospect this is obvious :)
Could we block on returning success to clients until at least one node has successfully applied the command? Or is that a non-starter given performance goals? If we did block on returning success, would we still have the above correctness concern or no?
Is the partial bad state we are worried about in durable storage or memory? If it is in durable storage, sounds really tough & also unfortunate. If it's in memory, could we rely on restarting the node a single time to clear the partial bad state?
Here's a different weird idea: Could we make a problem like the recent pebble crashes happen before a command is committed somehow? If pebble had transactions (IIUC rocksdb did but pebble doesn't), could we open a transaction, do the application, and then roll it back? And if that transaction fails for some reason, the command is rejected without being proposed.
It may actually turn out that all the commands we use in CRDB would be safe to skip. But Raft can't know for sure, so we'd essentially have to audit all commands and uphold this through developer discipline, which is brittle.
We often do this. But e.g. for intent writes we're satisfied as soon as it's committed to Raft, and don't bother waiting for application, as this would increase transaction latency.
It could be either, and we can't really know. At least with offline recovery, we only have to worry about persisted state.
That's typically what we do: we evaluate the request first, then we take the resulting write batch and put that through Raft (log truncation doesn't, because that would be inefficient). That batch is atomic, like a transaction, and application basically commits that batch. But what if the error that failed the application came from within that batch commit machinery? Can we still trust that the batch was atomic and successfully rolled back? We really can't, and that's the crux of the problem: with arbitrary failures, we have no idea if the guarantees we normally rely on still hold.
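To make the "evaluate into a write batch, commit it atomically at apply time" shape concrete, here is a small sketch against Pebble's public Batch API. It uses an in-memory store and made-up keys purely for illustration, and is a heavy simplification of CRDB's actual application machinery:

```go
package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble"
	"github.com/cockroachdb/pebble/vfs"
)

func main() {
	// In-memory Pebble instance, just for illustration.
	db, err := pebble.Open("demo", &pebble.Options{FS: vfs.NewMem()})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// "Evaluation" produces a write batch; nothing is visible to readers yet.
	b := db.NewBatch()
	if err := b.Set([]byte("k1"), []byte("v1"), nil); err != nil {
		log.Fatal(err)
	}
	if err := b.Set([]byte("k2"), []byte("v2"), nil); err != nil {
		log.Fatal(err)
	}

	// "Application" commits the batch atomically: both keys land or neither
	// does. If we chose not to apply, b.Close() would simply discard it.
	if err := b.Commit(pebble.Sync); err != nil {
		// The crux from the comment above: if this error came from inside
		// the commit machinery itself, we can no longer be sure what state
		// the engine is actually in.
		log.Fatal(err)
	}
	fmt.Println("batch applied atomically")
}
```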
Ah I see. Very interesting. Maybe I'm being overly optimistic, but it sounds like having a better story around quickly recovering from failures of this kind might actually be possible eventually, but it requires a lot of careful thought, engineering work, & especially automated testing, else we would be taking way too much risk. Certainly relying on the discipline of engineers to keep a rarely run recovery codepath deep in the weeds of KV working is not such a nice idea. I totally feel you on rarely run recovery codepaths being exactly the place where things go very wrong. And my experience is in being code owner of a system with way way way less complicated & stringent correctness goals than KV; the KV goals obviously raise the stakes. Thx for the discussion!
Yeah, I think there's plenty of stuff to pull at here in terms of offline recovery tools, blast radii, testing, and general resilience. But I'd want a human in the loop when something does go very horribly wrong.
Wrote up one more ticket. Perhaps an exponential backoff on application is a low eng cost way to reduce blast radius, as it sidesteps all these correctness concerns (eventually the application will happen)?
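As a rough sketch of what exponential backoff on application could look like (the tryApply placeholder is hypothetical, not CRDB code); the entry still gets applied eventually, so the Raft contract is preserved -- we just stop retrying a failing apply in a tight loop:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// attempts simulates an apply path that fails a few times before succeeding
// (e.g. a transient disk problem). Purely illustrative.
var attempts int

func tryApply() error {
	attempts++
	if attempts < 4 {
		return errors.New("transient apply failure")
	}
	return nil
}

func main() {
	backoff := 10 * time.Millisecond
	const maxBackoff = 5 * time.Second

	for {
		if err := tryApply(); err == nil {
			fmt.Printf("applied after %d attempts\n", attempts)
			return
		}
		// Back off exponentially (with a cap) so a repeatedly failing entry
		// doesn't spin a tight retry loop; the entry is still applied
		// eventually, just later.
		time.Sleep(backoff)
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```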
Is your feature request related to a problem? Please describe.
IMPORTANT NOTE: I am sorry if I misunderstand details. This is the SRE way sometimes. Good luck reading my scribbles, KV!!
Non-deterministic failures to apply a raft command can crash the node.
If the non-deterministic failure to apply will actually happen repeatedly until an operator takes some action, we have a large-scale outage on our hands with high impact & MTTR. I won't say which customer, but a recent on-prem customer experienced just such an outage.
Describe the solution you'd like
I don't really see why CRDB can't determine that no node will ever successfully apply a command. If CRDB can determine this, can CRDB stop trying to apply the entry on all replicas automatically, without affecting correctness?
With MTTR = human response time, we could require some human input & even take downtime, to simplify implementation. Imagine something like this:
1. A human runs delete_raft_entries_that_are_not_applied_on_any_nodes (prob a CLI flag, since the cluster may be too down to set a cluster setting).
2. The affected entries' MaxLeaseIndex is set to 0.

Describe alternatives you've considered
Additional context
N/A
Jira issue: CRDB-12874
Epic CRDB-39898