kvserver: reduce blast radius of Raft application errors #75944
There are two kinds of failures to consider. Both currently crash the node.
The recent escalation that triggered this was a panic. However, that was due to insufficient bounds/overflow checks in Pebble, and it should arguably have been an error instead. So this seems to be the path forward:
I think I'm suggesting a different approach. Keep crashing the node, regardless of whether it's a panic or an error. But when the node starts up again, don't retry the apply immediately, as we do today (IIUC, hence the crash loop experienced by the recent on-prem customer). Retry it with an exponential backoff policy governing the time between attempts. To determine how long to wait between apply attempts, we need some way to know, or at least approximate, how many attempts at this application have already happened. We can't use memory, since the node will be restarting. Perhaps the time since the command was committed gives us such an approximation? Or perhaps we can write a counter to disk keeping track of the number of application attempts? I wonder if this idea is also not feasible for reasons I will soon understand! But what is nice about it, if I do understand correctly, is that it reduces blast radius without requiring KV to ensure no inconsistent state is left around in case of errors / catch panics / stop crashing, which all sounds hard given all the correctness details you were mentioning in the last ticket, and which would need to be done on a case-by-case basis.
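For illustration, here is a minimal sketch of the disk-backed attempt counter floated above. This is not CockroachDB code: the one-file-per-range storage, the function names, and the 30-minute cap are assumptions for the sketch; a real implementation would presumably persist the counter via a store-local key in the storage engine with proper syncing.

```go
package applybackoff

import (
	"encoding/binary"
	"os"
	"path/filepath"
	"strconv"
	"time"
)

// attemptFilePath is hypothetical: one small file per range holding the
// number of apply attempts for the entry currently being applied.
func attemptFilePath(dir string, rangeID int64) string {
	return filepath.Join(dir, "apply-attempts-"+strconv.FormatInt(rangeID, 10))
}

// loadAttempts returns how many apply attempts have already been made
// (zero if no record exists).
func loadAttempts(dir string, rangeID int64) uint64 {
	b, err := os.ReadFile(attemptFilePath(dir, rangeID))
	if err != nil || len(b) < 8 {
		return 0
	}
	return binary.LittleEndian.Uint64(b)
}

// recordAttempt durably bumps the counter *before* retrying application, so
// that a crash during the retry still advances the backoff schedule. A real
// implementation would also fsync the write.
func recordAttempt(dir string, rangeID int64) (uint64, error) {
	n := loadAttempts(dir, rangeID) + 1
	var b [8]byte
	binary.LittleEndian.PutUint64(b[:], n)
	return n, os.WriteFile(attemptFilePath(dir, rangeID), b[:], 0o644)
}

// backoffFor converts the attempt count into a wait, doubling from one
// second up to an arbitrary 30-minute cap.
func backoffFor(attempts uint64) time.Duration {
	d := time.Second
	for i := uint64(1); i < attempts && d < 30*time.Minute; i++ {
		d *= 2
	}
	if d > 30*time.Minute {
		d = 30 * time.Minute
	}
	return d
}
```

On startup, the apply path would call loadAttempts for the affected range and sleep backoffFor(n) before rescheduling it, then recordAttempt before the retry.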
I'm extraordinarily nervous about anything which tries to recover a range from an unexpected condition and keep it available. I'd much rather take an approach suggested in your last message, @joshimhoff, which is to try and cordon off and isolate a range/replica that is continually causing a crash. Doing so would at least keep the cluster up, though it might not be usable as the range may be unavailable and part of a table that is critical to the health of the cluster or application. For example, consider what would happen if the unavailable range was holding system tables. The main advantage I see to isolating problematic ranges like this is that we'd hopefully have more online debugging tools to diagnose what is going on, such as the range debug pages.
Makes sense!
Plus, as you say, there's some chance that the outage doesn't actually break the whole application. Considering how the CC console uses CRDB, one can imagine various single range outages that would leave the CC console degraded but still working pretty well. And also with serverless, so long as the broken range is NOT a system range, we likely would have a KV cluster that works fine for all tenants other than the one whose range has the command that can't be applied. That's a big enough reduction in blast radius in an important CRL product to be motivating IMO.
We could combine the above error handling sketch with @tbg's new replica circuit breakers: if application errors, we trip the circuit breaker (cordoning off the range), then periodically retry application and untrip the breaker if it succeeds. The breaker already trips on hung application. I don't think I'd extend that to panic recovery, though, but we could if we wanted to.
I wouldn't do this as a first pass. When an error results, we might already be in an inconsistent state. I'd stay close to what Peter suggested above: permanently trip the breaker for that replica until the node is restarted. The story doesn't end there, though. If that node holds the lease, this makes the range unavailable, so it's desirable to be able to shed the lease. We probably don't want to let liveness expire - that's too heavy-handed - but if we had lease transfers that don't hop on the replication layer (i.e. an RPC to another node that tells it that we stopped using our lease as of timestamp XYZ and lets it steal the lease) then we could straightforwardly do it. I am sure we discussed such a mechanism before, but can't find the issue now. Might be worth filing separately but I'll see if anyone here remembers first.
Then there's the next layer to peel off - when there are lease preferences set, what stops the other nodes from sending the lease right back in case the unhealthy replica is the preferred leaseholder? It all comes down to a need to communicate per-replica health between nodes (or requiring opt-in from the recipient for any operation).
It's always seemed really unfortunate to me that a broken leaseholder replica won't passively lose the lease, since we tie the lease to node liveness rather than replica liveness (we have the same problem e.g. in the case of a replica deadlock). I don't know what the best solution is, but I think that's the problem we should solve rather than introducing side channels for active lease transfers. |
Yes, I sympathize with that point of view. Before epoch-based leases (when all leases were expiration-based), we had this property. To limit the per-range overhead, we went and introduced epoch-based leases and are now finding the opposite, that more granularity would be desired. To come back to out-of-band lease transfers, they might be simpler and more contained, so there is possibly still a role for them. But as mentioned before, they don't solve the general problem. Ultimately, the cluster needs to know which replicas are having trouble to avoid pathologies.
#62199 is related to some of the discussion here |
Just talked with @erikgrinaker. I'm playing around with a POC of this during breather work. At this point, my goal is more learning than building something useful. Though I follow the above discussion about shedding leases, if I set that aside (given that we have a separate ticket for it, #77035), I have an idea of how to implement this that doesn't seem so complicated. I figure the reason it doesn't is simply because I don't understand KV that well, and one way to "fix" that is to write enough code that I understand things better. What better week to do that than breather week.
I don't see why we need the breaker. What I imagine doing is:
The above will reduce the rate of crashes without operator involvement.
I'm not sure why we'd panic again? Seems like we should just recover the panic and keep the node running, with an exponential retry in the Raft scheduler. We're likely vulnerable to inconsistent data during the backoff delay anyway, so we don't gain anything by crashing the node (or maybe not because there may not be a lease but that's sort of beside the point here).
To prevent anyone from accessing the range and reading incorrect data. For your purposes here, you can probably ignore this.
I suppose the simplest possible solution would be to keep some state on the replica about the last attempt, and then return early from cockroach/pkg/kv/kvserver/store_raft.go, lines 505-506 (at 98a66b5).
Alternatively, you'd keep track of it in the Raft scheduler and avoid scheduling the range again until the backoff timer expires; see cockroach/pkg/kv/kvserver/scheduler.go, lines 307-309 (at 86f51ef).
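A rough sketch of that second option, with hypothetical names (backoffTracker, shouldSchedule) rather than the actual scheduler types: the scheduler consults a per-range "not before" time and skips ranges whose backoff has not yet expired.

```go
package raftbackoff

import (
	"sync"
	"time"
)

// backoffTracker remembers, per range, the earliest time at which the Raft
// scheduler should process that range again after an apply failure.
type backoffTracker struct {
	mu          sync.Mutex
	nextAttempt map[int64]time.Time
}

func newBackoffTracker() *backoffTracker {
	return &backoffTracker{nextAttempt: make(map[int64]time.Time)}
}

// shouldSchedule reports whether the range's backoff (if any) has expired.
// Ranges with no recorded failure are always schedulable.
func (t *backoffTracker) shouldSchedule(rangeID int64, now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return !now.Before(t.nextAttempt[rangeID])
}

// recordFailure pushes the range's next attempt out by delay; the caller
// would grow delay exponentially on repeated failures.
func (t *backoffTracker) recordFailure(rangeID int64, now time.Time, delay time.Duration) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.nextAttempt[rangeID] = now.Add(delay)
}
```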
Thanks for the pointers!!
It sounds like you are saying that any inconsistency issue will be eliminated once the entry for which application (at first) errored or panicked is successfully applied (that is, on retry)? As a result of ^^, there is no need to restart the server? OTOH, if we weren't confident about ^^, it would be wise to restart the server before backing off. I was partly basing the always-panic suggestion on what @tbg said here:
Ah I see! I didn't realize the post-apply-failure-or-panic inconsistency could lead to inconsistent reads. I thought it was on the write side instead. What I thought was: to retry the application, it is best to restart the server first, else we risk inconsistency issues. I think the reason I thought it was NOT on the read side is that this would suggest an apply failure or panic could lead to an inconsistent read today: reads & applies happen concurrently, and there is obviously some time delta between the apply failure or panic happening and the server actually restarting, so you'd just need to get unlucky with the timing of the apply failure or panic & the read.
You're right. Application is unlikely to succeed again on a retry, and we probably wouldn't want to recover even if application did succeed. But I think the main motivation here is to avoid restart loops, which are very disruptive, and even with exponential backoffs we would still be in a restart loop.
Yeah, that's probably prudent -- if application fails, just make the replica unschedulable until someone looks at it and restarts the node.
Yes, application is always on the write path, but if someone then comes along and tries to read data that was only partially updated during application it would read inconsistent data. You're right that it's probably safer to restart before retrying, to at least avoid inconsistent state in memory. However, application is very unlikely to succeed again after the restart, so I'd argue that it's better to not restart automatically and wait for someone to fix the underlying problem and restart the node.
Yeah, this depends on the Go scheduler, which may schedule some other goroutine during the panic. But we don't really have any control over that.
I've written a very hacky POC. I wrote the following patch to test locally:
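The patch itself isn't shown above; as a stand-in, something along these lines would inject a failure locally (the environment variable name and the hook point are made up for illustration, not what the actual patch does):

```go
package applytest

import (
	"os"
	"strconv"
)

// maybeInjectApplyPanic panics while applying commands for one specific
// range, selected via an environment variable, so the crash/backoff path
// can be exercised on a local single-node cluster.
func maybeInjectApplyPanic(rangeID int64) {
	target := os.Getenv("COCKROACH_TEST_PANIC_ON_APPLY_RANGE")
	if target == "" {
		return
	}
	if id, err := strconv.ParseInt(target, 10, 64); err == nil && id == rangeID {
		panic("injected apply panic for range " + target)
	}
}
```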
After a panic, I see a log line on startup indicating that backing off is indeed happening
The POC is here: #77858. The POC catches a panic in apply, records info about the panic needed to implement a backoff using the storage engine, and then re-throws the panic. The POC also backs off at startup time in case of a recent panic with the right current applied index. A new store local key is introduced for recording info about panics. This POC doesn't trip the breaker. I think I understand why we need to do that now: IIUC, the code I added introduces additional time between panic and server crash (the time waiting on storage engine to record info about the panic), and we don't want reads to be served then. This POC doesn't deal with shedding leases. To me, that is fine, as we have #77035 for that, and also even without that, an exponential backoff on application seems like an improvement over the status quo to me. There are some other TODOs in the code (e.g. the POC does a linear backoff only right now). I am sure I am missing rather obvious things, but I thought it'd be fun to write a POC over breather work.
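A sketch of the kind of record such a store-local key could hold and the startup check it enables; the field names, JSON encoding, and backoff constants are assumptions for illustration, not what #77858 actually does.

```go
package panicbackoff

import (
	"encoding/json"
	"time"
)

// panicRecord is the information needed to decide whether, and for how
// long, to back off at startup.
type panicRecord struct {
	AppliedIndex uint64    `json:"applied_index"` // applied index at the time of the panic
	Attempts     int       `json:"attempts"`      // how many times application has failed here
	LastPanic    time.Time `json:"last_panic"`
}

// encode/decode round-trip the record through the (new) store-local key.
func (r panicRecord) encode() ([]byte, error) {
	return json.Marshal(r)
}

func decodeRecord(b []byte) (panicRecord, error) {
	var r panicRecord
	err := json.Unmarshal(b, &r)
	return r, err
}

// startupBackoff returns how long to wait before letting the replica apply
// again. It only backs off if the persisted record matches the current
// applied index, i.e. we are about to retry the same entry that panicked.
func startupBackoff(r panicRecord, currentAppliedIndex uint64) time.Duration {
	if r.AppliedIndex != currentAppliedIndex {
		return 0 // the entry was applied since the panic; no backoff needed
	}
	d := 10 * time.Second
	for i := 1; i < r.Attempts && d < 10*time.Minute; i++ {
		d *= 2 // exponential, rather than the POC's current linear backoff
	}
	return d
}
```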
I hear you on this, but I still wonder if exponential backoff is a nice thing to do here. It eliminates the need to make a judgement call about the likelihood of an apply retry succeeding post server restart. If it turns out that there are some cases where a retry will actually succeed without operator involvement (e.g. temporary issues with infra like disks), that will happen quickly, because early in the backoff, the backoff time is very short. If it turns out that we do need an operator, the backoff time will quickly grow very large, large enough that the impact of the crashes is quite small (e.g. one crash per 10m... even one crash per 30m). The fact that one policy handles both these cases gracefully is one of the reasons I love a good ol' exponential backoff. There is also some precedent in the CRDB job system for following this pattern: #44594
If the goal here is to limit the blast radius of application errors, I think we need to prevent a range problem from becoming a node problem. A node crash is very disruptive: in the best case, it will cause up to 10 seconds of unavailability for leases on that node and abort all transactions that it's the gateway for. The failure is also likely to happen across all replicas, so it would crash multiple nodes, potentially taking out the whole cluster. That disruption does not seem worth it on the off-chance that a restart fixes the problem -- application is designed to be deterministic, so a failure is also likely to be deterministic. You do bring up a good point about e.g. disk failures and such that might be temporary, limited to a single component, and could resolve themselves. There are multiple failure modes here that we should try to handle. I think what we should do is:
Two meta Qs:
I know you already said this, but thanks for saying it again, as I am used to spaces where retries do have a good shot at fixing the issue!! I think the real reason that I like the exponential backoff approach is that with it there is no need to handle the multiple failure modes you list above separately. A relatively simple backoff-if-recent-panic component running locally on each node leads to a reduction in blast radius for the cluster as a whole. Even if the chances of a retry fixing the issue are very low, if they are not zero, we do want to handle that case, and to me at least it's nice that the solution I am pushing for handles that case and the more common repeated failure on application with a relatively low amount of complexity (assuming that the POC I wrote is on the right track (I ack that it might not be lol lol)). I hear you pointing out that panics are disruptive, and so questioning whether we are getting the reduction in blast radius we want. Here's an idea:
I wonder if you will not like this because the initial crashes will still cause node-wide impact? I'd argue this difference in POV is a values thing as opposed to a purely technical issue. That <= 3 nodes will experience a bounded number of crashes in case of a rare event like an application panic / failure doesn't bother me. To me, the blast radius reduction of what I suggest compared to the status quo is quite significant, and shooting for ~zero crashes isn't motivating. If I try to be more objective about that feeling, what I'd say is that CRDB will be reliable enough for our customers even if a node experiences a bounded number of crashes in case of a rare event like application failure. I am particularly motivated to not ever have extended multi-tenant outages in serverless (a goal I really believe in), and a bounded number of crashes means the outage will affect a single tenant only (assuming the affected range is in tenant key space) once the crash limit is hit.
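To make the "bounded number of crashes" idea concrete, a sketch (the names and the limit of 3 are invented here): crash for the first few apply failures so in-memory state is cleared, then stop crashing and cordon just the replica so the node stays up and only the affected range/tenant remains down.

```go
package applyfailure

// handleApplyFailure decides between restarting the node (clearing any
// inconsistent in-memory state, with startup backoff governing the retry)
// and cordoning just this replica once the crash budget is spent.
func handleApplyFailure(attempts uint64, crash func(error), cordon func(), err error) {
	const maxCrashes = 3 // hypothetical crash budget per failing entry
	if attempts <= maxCrashes {
		crash(err)
		return
	}
	cordon()
}
```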
Sure, ping me when you get in tomorrow.
Cordoning alone will handle both of these failure modes better:
It isn't clear to me why we would choose to restart instead. The failure detector idea is to try and improve things beyond these measures, e.g. with a borked range.
My question would be what do the restarts buy us. It doesn't seem like they really buy us anything, so why do it?
I don't think this is true? If a range has a deterministic application failure (which is what happened in the outage that motivated this), then all replicas of that range will cause node crashes at the same time. If 3 nodes go down, presumably all other tenants that have >=2 range replicas on those nodes will also be affected?
Yes, while the nodes are crashing, there will be impact on other tenants. Based on your feedback, I was imagining a world though where there is a limit to the number of crashes enforced in the backoff code (the bullet up above that starts with "cap the exponential backoff on a certain number of crashes + retries"). Once that limit is hit, no more crashes. In that world, the outage will stop affecting multiple tenants eventually, without operator involvement. But anyway, thanks for laying out the reasons for preferring cordoning so clearly!!! Looking forward to chatting more soon.
Just a heads up that I'm following the discussion. Don't have much to add to Erik's points. It would be nice to avoid the need for a per-Replica lease transfer, at least for a POC. Perhaps we can achieve that by gossiping (up to a certain number) the defunct replica descriptors (rangeid:replicaid:leaseseq pairs), and having the invariant that if a pair is gossiped as defunct, that replica is no longer using its lease (i.e. it can be acquired even if the node is still live). The same signal could be used by the allocator to remove the offending replica from the range, at which point it ceases to exist (pending replicaGC, which we should try to make tolerant to errors).
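The gossiped payload described here might look roughly like the following; the type names and the bounded-set bookkeeping are guesses for illustration only.

```go
package defunct

// defunctReplica identifies a replica whose node has promised to stop using
// its lease: the rangeid:replicaid:leaseseq triple mentioned above.
type defunctReplica struct {
	RangeID   int64
	ReplicaID int32
	LeaseSeq  int64
}

// defunctSet is the bounded collection a node would gossip. Other nodes can
// treat a listed lease as stealable even while the node is still live, and
// the allocator can use the same signal to remove the replica from the range.
type defunctSet struct {
	max      int
	replicas []defunctReplica
}

// add records a defunct replica, refusing once the gossip budget is spent.
func (s *defunctSet) add(r defunctReplica) bool {
	if len(s.replicas) >= s.max {
		return false
	}
	s.replicas = append(s.replicas, r)
	return true
}
```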
@erikgrinaker and I just chatted! I'm gonna POC:
One restart is not a big deal at a reliability level IMHO. We also think it sidesteps the need to actively shed the lease (see the above stuff from @tbg about gossip (can we avoid using gossip??)) or implement replica liveness (#77035), which is a big (& exciting) project. It may also be that cordoning on startup is simpler than cordoning post startup (e.g. no need to cancel anything).
This does confuse me a little bit, though, now that I am writing the idea down: does a single restart definitely shed the lease? That is, if KV starts up again so fast that no liveness heartbeat is missed, could the lease remain on the restarted node? If the answer to ^^ is "one restart may not shed the lease", enough restarts would, which suggests that an exponential backoff of sorts may be a way to improve the status quo (even though the restarts do cause impact) without biting off #77035. Could even restart until the lease is shed (doesn't sound elegant).
To add a bit more context on how I see this playing out:

Long term, the ideal solution would be to cordon the replica and have it passively lose its lease and get removed from the range, by some form of replica-level heartbeats/liveness. Cordoning involves blocking all read/write access to the range (including cancelling in-flight requests) and preventing it from being scheduled -- probably using a new type of circuit breaker (#77366).

Medium term, gossiping the faulty replicas might be a cheaper way to shed the lease and get removed from the range. We'll still need the same cordoning mechanism.

Short term, a single restart and then cordoning the replica would avoid the lease shedding issue and cancellation of in-flight requests. It would still need to cordon the replica, and ideally get it removed from the range (but the latter may be more work than it's worth, at least for a PoC). There also needs to be a way to un-cordon the replica when the problem has been resolved.

We'll discuss how much of this we want to address for the 22.2 cycle.
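For the cordoning piece specifically, a minimal sketch of the gate, assuming a simple flag rather than the actual #77366 breaker API: it is consulted on the request path and by the scheduler, with an explicit un-cordon for once the problem is resolved.

```go
package cordon

import (
	"errors"
	"sync/atomic"
)

// cordonState blocks read/write access to a replica and keeps it from being
// scheduled until an operator (or recovery logic) uncordons it.
type cordonState struct {
	cordoned atomic.Bool
	reason   atomic.Value // string describing the apply failure
}

// Cordon marks the replica as off-limits, recording why.
func (c *cordonState) Cordon(reason string) {
	c.reason.Store(reason)
	c.cordoned.Store(true)
}

// Uncordon re-enables the replica once the underlying problem is fixed.
func (c *cordonState) Uncordon() {
	c.cordoned.Store(false)
}

// checkRequest would be called before serving a request or scheduling Raft
// work for the replica.
func (c *cordonState) checkRequest() error {
	if !c.cordoned.Load() {
		return nil
	}
	reason, _ := c.reason.Load().(string)
	return errors.New("replica cordoned after apply failure: " + reason)
}
```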
No, I think this will give the node a new liveness epoch, so it won't be able to resume any of its existing leases. |
Ah got it!
OK, here is a new POC, @erikgrinaker & @tbg: #78092. I haven't tested on a multi-node cluster, but I wrote the following integration test to test the behavior on a single-node cluster. Was very satisfying. I love the CRDB test infra!!!
We also discussed cordoning in this weekly KV/Repl Eng meeting.
Adding O-support here, since this was a major factor in a large outage.
Is your feature request related to a problem? Please describe.
Non-deterministic failures to apply a raft command can crash the node:
If the non-deterministic failure to apply will actually happen repeatedly until an operator takes some action, we have a large-scale outage on our hands with high impact & MTTR. I won't say which customer, but a recent on-prem customer experienced just such an outage.
Describe the solution you'd like
Re: outages of this sort, @erikgrinaker says at #75903:
Building on that, I see why we crash the node, but I don't see why the scale of the outage needs to be quite as large as it is today.
Can we add an exponential backoff or similar to the per-replica apply loop? That is, if application keeps failing, continue crashing & retrying as needed, but retry application with greater and greater sleep times between the last application attempt and the retry?
Then the node will eventually stay up for a longer time, allowing it to serve other requests, in particular requests to ranges other than the one that has the command that cannot be applied.
This would especially be nice in a serverless context. In that context, we'd really like to keep outages from affecting many tenants at once. In the current state, an outage involving repeated (but non-deterministic) failure to apply some command post commit would affect all tenants with data hosted on the multiple crashing nodes. If we do something like what is proposed here, perhaps the impact would be mostly limited to the tenant whose range has the command that can't be applied.
Also, adding an exponential backoff sounds pretty straightforward, though maybe it's not, due to details I'm missing...
Describe alternatives you've considered
Additional context
N/A
Jira issue: CRDB-12888
Epic CRDB-39898