kvserver: more aggressive reproposals for lease requests #104024
Comments
cc @cockroachdb/replication
Here's an example trace:
We should also be careful to avoid quadratic log growth under unavailability, since the Raft scheduler will attempt to reacquire leases every 3 seconds. This is similar to the buildup we've seen with replica circuit breaker probes. See #103908.
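A minimal sketch of one way to bound that growth, assuming a retry loop with exponential backoff rather than the fixed 3-second reacquisition interval mentioned above. All names here are hypothetical and this is not the kvserver code; it only illustrates why a fixed interval produces linearly many proposals (and, with per-attempt reproposals, quadratic log growth) over a long outage, while doubling the delay keeps the attempt count logarithmic.

```go
package main

import (
	"fmt"
	"time"
)

// nextLeaseRetry returns the delay before the next lease-acquisition attempt.
// Doubling the delay per attempt (capped) keeps the number of proposals
// written to the log small during a long unavailability window.
func nextLeaseRetry(attempt int) time.Duration {
	const base = 3 * time.Second // matches the 3s reacquisition interval above
	const maxDelay = 1 * time.Minute
	d := base << uint(attempt) // 3s, 6s, 12s, ...
	if d <= 0 || d > maxDelay {
		d = maxDelay
	}
	return d
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Printf("attempt %d: wait %s\n", attempt, nextLeaseRetry(attempt))
	}
}
```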
I've looked at the above trace and the code, and while no thought experiment survives contact with reality, at least in theory our reproposals should already be near optimal for reducing latency. When no leader is known, we don't repropose (actually we do, but raft just drops the proposals). The first thing a new leader does is append an empty entry, and when anyone sees an empty entry being appended, we repropose everything. So the mechanism is reactive and should be really snappy in practice. (If a follower is behind on the raft log, it might as well catch up a bit before adding more work to the system; technically it is not exactly as reactive as it could be, since it learns of the leader before seeing the empty entry.)

The trace above seems likely to have originated from a slow follower. I am assuming it's in the wake of a split (can't tell from the trace). We see that the follower requests a lease, likely because the lease in the split looks expired to the follower, which makes sense if it is applying the split "much later" than it happened in real time. The proposal is flushed to raft; we don't see whether it's actually sent or dropped, but either way nothing happens for a bit. It is flushed again, notably with a 390ms delay, so this node is probably pretty slow, and is then rejected because by now there is a raft leader (note how we didn't hit this condition in the first round, so likely there wasn't a raft leader then). It's also not unlikely that the reproposal was actually triggered by the reactive mechanism described above.

All in all, there is something interesting going on here. It looks like we were perhaps waiting for a raft election, in which case the first proposal attempt was likely dropped. (With 6b8391c there would be an entry in the trace for that, but alas.) I'll run an experiment locally to see if I can get "bad" lease latencies by just splitting a bunch.
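A self-contained sketch of the reactive behavior described above, with hypothetical types and names (not the actual kvserver implementation): proposals are held while no leader is known, and everything pending is reproposed as soon as the new leader's empty entry is observed.

```go
package main

import "fmt"

type entry struct {
	term uint64
	data []byte // empty data marks the no-op a newly elected leader appends first
}

type replica struct {
	knownLeader bool
	pending     []string // queued commands awaiting (re)proposal
}

// onAppend is invoked when the replica sees an entry appended to its log.
// Seeing an empty entry implies a leader was just elected, so all pending
// proposals are flushed reactively rather than waiting for a timer.
func (r *replica) onAppend(e entry) {
	if len(e.data) == 0 {
		r.knownLeader = true
		for _, cmd := range r.pending {
			fmt.Printf("reproposing %q at term %d\n", cmd, e.term)
		}
		r.pending = nil
	}
}

func main() {
	r := &replica{pending: []string{"RequestLease"}}
	r.onAppend(entry{term: 7}) // empty entry from the new leader
}
```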
Definitely possible that we hit an election tie and/or election timeout here. There may not be anything to do if we flush pending proposals on leader changes.
My take-away here is that there's nothing "obviously" broken and that the problems are likely a layer below, having to do with raft election stalemates or missed places where we ought to trigger reproposals. I think we should focus our energy on improving observability for "all things below raft". I'll add metrics for reproposals and replication latencies.

In the above experiment, we did see dropped proposals, though they were all dropped on followers. I assume (but can't prove due to the lack of a replication latency metric) that these were all reactively reproposed once a leader was known, and so the delay incurred was near optimal (assuming the leader was known soon thereafter). #83262 would help prove this, so it's something we should get in.
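A hedged sketch of the kind of observability discussed above, written with Prometheus-style metrics rather than CockroachDB's own metric framework; the metric names and the `recordProposalOutcome` helper are made up for illustration.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// reproposals counts proposals that had to be proposed again below raft.
	reproposals = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "raft_proposals_reproposed_total",
		Help: "Number of Raft proposals that had to be reproposed.",
	})
	// replicationLatency tracks time from proposal submission to application,
	// which is the missing signal referenced in the comment above.
	replicationLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "raft_replication_latency_seconds",
		Help:    "Time from proposal submission to application.",
		Buckets: prometheus.ExponentialBuckets(0.001, 2, 16),
	})
)

func init() {
	prometheus.MustRegister(reproposals, replicationLatency)
}

// recordProposalOutcome would be called once a proposal has been applied.
func recordProposalOutcome(start time.Time, wasReproposed bool) {
	if wasReproposed {
		reproposals.Inc()
	}
	replicationLatency.Observe(time.Since(start).Seconds())
}

func main() {
	recordProposalOutcome(time.Now().Add(-50*time.Millisecond), true)
}
```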
The original experiments where we saw this had a cluster with 200 ms RTTs and heavy load (TPCC import). We probably hit a few election stalemates following election timeouts. Reproposing on leader change seems sufficient; I think we can close this for now.
In #98124 we saw that lease requests would sometimes get reproposed, typically due to leader changes and often following range splits. Since lease requests are latency-sensitive, especially with expiration lease extensions, we should consider more aggressively reproposing them (e.g. after 1 second).
Jira issue: CRDB-28319
Epic: CRDB-25199
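A minimal sketch of the proposal in this issue, using hypothetical names rather than the actual kvserver change: during the periodic proposal refresh, lease requests get a much shorter reproposal threshold (e.g. 1 second) than ordinary commands, since lease acquisition latency is user-visible.

```go
package main

import (
	"fmt"
	"time"
)

type proposal struct {
	id             string
	isLeaseRequest bool
	proposedAt     time.Time
}

// shouldRepropose applies a 1-second threshold to lease requests and a more
// conservative 3-second threshold (roughly the interval mentioned earlier in
// this thread) to everything else.
func shouldRepropose(p proposal, now time.Time) bool {
	threshold := 3 * time.Second
	if p.isLeaseRequest {
		threshold = 1 * time.Second
	}
	return now.Sub(p.proposedAt) >= threshold
}

func main() {
	now := time.Now()
	lease := proposal{id: "lease", isLeaseRequest: true, proposedAt: now.Add(-1500 * time.Millisecond)}
	write := proposal{id: "write", isLeaseRequest: false, proposedAt: now.Add(-1500 * time.Millisecond)}
	fmt.Println(shouldRepropose(lease, now), shouldRepropose(write, now)) // true false
}
```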