kvserver: persistent outage when liveness leaseholder deadlocks #80713
Comments
What's interesting is that we do have a disk stall roachtest, but it didn't catch this: `pkg/cmd/roachtest/tests/disk_stall.go` (lines 1 to 139 at 21a286f).
This seems closely related to #81100, so I'll pick this one up while I'm at it.
Took a first stab at this over in #81137, where DistSender will prioritize available nodes over unavailable ones (since the liveness leaseholder becomes unavailable when its disk stalls). However, this has a bunch of issues -- not least that when the liveness leaseholder goes down, all nodes will be considered unavailable, so requests can still get stuck on the old leaseholder anyway if it sorts first (e.g. by latency).

I think this issue points towards a more fundamental flaw in the whole model we use around lease acquisition: we rely on KV clients to detect failed leaseholders and to contact new candidates to trigger lease acquisitions. If the clients fail to do either of these two tasks, leases won't move. Some other random ideas:

Thoughts or ideas?
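For illustration only, here is a rough sketch of the kind of ordering #81137 describes, and of the failure mode mentioned above. The type and function names are made up for this sketch, not the actual DistSender code:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// ReplicaInfo is a hypothetical stand-in for a replica descriptor plus the
// health/latency data the client has on hand; it is not a CockroachDB type.
type ReplicaInfo struct {
	NodeID  int
	Healthy bool          // e.g. derived from node liveness / RPC health checks
	Latency time.Duration // observed RPC latency to the node
}

// sortByHealthThenLatency orders replicas so that replicas on nodes believed
// to be available are tried first, with latency as a tiebreaker. Note the
// failure mode discussed above: if every node looks unavailable (because the
// liveness leaseholder itself is stalled), this degenerates to latency-only
// ordering and can still put the stalled leaseholder first.
func sortByHealthThenLatency(replicas []ReplicaInfo) {
	sort.SliceStable(replicas, func(i, j int) bool {
		if replicas[i].Healthy != replicas[j].Healthy {
			return replicas[i].Healthy // healthy nodes sort first
		}
		return replicas[i].Latency < replicas[j].Latency
	})
}

func main() {
	replicas := []ReplicaInfo{
		{NodeID: 1, Healthy: false, Latency: 1 * time.Millisecond}, // stalled leaseholder, lowest latency
		{NodeID: 2, Healthy: true, Latency: 5 * time.Millisecond},
		{NodeID: 3, Healthy: true, Latency: 3 * time.Millisecond},
	}
	sortByHealthThenLatency(replicas)
	fmt.Println(replicas) // n3 and n2 are tried before the stalled n1
}
```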
Looking for something to play with over breather week. Maybe it's this one? Perhaps we can limit the scope by focusing on system ranges if needed. In #81100 (comment), @erikgrinaker says:
To me, this suggests a slightly different understanding of the outage than what @tbg said in this ticket:
Which is right? Was there a new available leaseholder, or was there not one because no kvclient ever poked a functional kvserver to trigger a lease acquisition, given that the NLHE was never returned? Erik's theory makes concrete sense to me, FWIW.
I like all these ideas. I have a weird related idea:
Then there'd be no need for the NLHE. At some level, is the problem that we require the kvserver to send an NLHE (when that kvserver may be non-functional), rather than that we require a kvclient to send a message to a kvserver in order to kick off a lease acquisition attempt? I sort of wonder if the cost of sending back an "I'm a leaseholder" message on every RPC to the kvserver is too high to stomach? But IDK; perf stuff like this is hard for me to judge. I thought I'd share the idea anyway.
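To make the shape of that idea concrete, here is a hypothetical sketch -- made-up types, not the real BatchResponse header or range cache -- of a per-response leaseholder assertion and the corresponding client-side cache update:

```go
package main

import "fmt"

// ResponseHeader is a hypothetical wire type sketching the "assert
// leaseholdership on every response" idea; it is not the actual proto.
type ResponseHeader struct {
	RangeID       int64
	IsLeaseholder bool  // set by the server on every RPC it serves as leaseholder
	LeaseSequence int64 // could be used to ignore stale assertions
}

// LeaseCache is a stand-in for the client-side leaseholder cache.
type LeaseCache struct {
	leaseholder map[int64]int // rangeID -> nodeID
}

// observe updates the cache from any response, so the client learns about
// lease changes without needing an NLHE from a possibly wedged node.
func (c *LeaseCache) observe(fromNodeID int, h ResponseHeader) {
	if h.IsLeaseholder {
		c.leaseholder[h.RangeID] = fromNodeID
	} else if c.leaseholder[h.RangeID] == fromNodeID {
		// The node we thought held the lease says it no longer does:
		// evict the entry so the next request probes other replicas.
		delete(c.leaseholder, h.RangeID)
	}
}

func main() {
	c := &LeaseCache{leaseholder: map[int64]int{7: 1}} // stale: we still think n1 holds the lease
	// Any response from n3 asserts that it now holds the lease for range 7,
	// so the cache moves off the (possibly wedged) n1 without an NLHE.
	c.observe(3, ResponseHeader{RangeID: 7, IsLeaseholder: true, LeaseSequence: 2})
	fmt.Println(c.leaseholder) // map[7:3]
}
```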
Most likely there was no other leaseholder, since we've found that nodes will keep retrying the previous leaseholder until they get a response. Unless someone else happened to send a request to a different replica, for some reason. Can probably find out by digging through the debug.zip, don't remember the details of the original incident. The no-leaseholder scenario will need a solution in any case, which I suspect will indirectly solve the other-leaseholder scenario.
Meh. I don't think any of them are particularly good, but given the current state this is what I could come up with for short/medium-term fixes. What I'd ideally want here is for replicas to sort out the lease even in the absence of client requests, similarly to how vanilla Raft will transfer leadership even if there are no proposals.
We sort of have this channel already, in the RPC heartbeats. But it doesn't carry lease information currently.
I think both. Ideally, we'd want replicas to sort out the lease between themselves, and for clients to efficiently discover lease changes. These are currently conflated, but they're really separate concerns.
Been a while since I looked at RPC heartbeats, but piggybacking on them might be viable. I think doing it per-batch would possibly be too expensive, and it's unclear whether it'd be better than an aggregate signal. We could e.g. send a sparse bitmap of ranges that we're actively asserting leases for, or something. This would have to integrate with quiescence somehow.
All makes sense! Good point re: aggregate signal maybe being just as good.
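As a very rough illustration of the aggregate-signal variant -- hypothetical types, not the actual heartbeat proto, and it glosses over quiescence entirely -- the heartbeat payload and a bulk cache update could look something like this:

```go
package main

import "fmt"

// HeartbeatPayload is a hypothetical aggregate "I'm asserting the lease for
// these ranges" signal. A real implementation would likely use a compressed
// or sparse bitmap rather than a plain slice, and would need to account for
// quiesced ranges.
type HeartbeatPayload struct {
	NodeID         int
	LeasedRangeIDs []int64
}

// LeaseCache maps rangeID -> nodeID of the last known leaseholder.
type LeaseCache map[int64]int

// applyHeartbeat bulk-updates the cache from one heartbeat. Entries that
// pointed at this node but are missing from the payload are evicted, so a
// node that has lost (or stopped asserting) a lease stops being contacted.
func (c LeaseCache) applyHeartbeat(hb HeartbeatPayload) {
	asserted := make(map[int64]bool, len(hb.LeasedRangeIDs))
	for _, id := range hb.LeasedRangeIDs {
		asserted[id] = true
		c[id] = hb.NodeID
	}
	for rangeID, nodeID := range c {
		if nodeID == hb.NodeID && !asserted[rangeID] {
			delete(c, rangeID)
		}
	}
}

func main() {
	cache := LeaseCache{7: 1, 9: 1, 12: 2}
	// n1's next heartbeat only asserts the lease for range 7: range 9 must
	// have moved (or n1 stopped asserting it), so the stale entry is dropped.
	cache.applyHeartbeat(HeartbeatPayload{NodeID: 1, LeasedRangeIDs: []int64{7}})
	fmt.Println(cache) // map[7:1 12:2]
}
```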
There may not have been one here, but it looks like https://github.com/cockroachlabs/support/issues/1808 is a possible other instance of the same problem. There, it looks as though only n5 didn't manage to update its cache; other nodes seem to be reaching liveness OK. But as Erik pointed out, solving one solves the other: the issue is that we can't rely on "proactive" invalidation of a cached leaseholder.
A customer ran into this again recently: https://cockroachdb.zendesk.com/agent/tickets/19526
Describe the problem
See #79648.
Quoting an internal support issue:
To Reproduce
See the PR above (note that the second commit contains a hacky fix; remove it to get a repro).
Expected behavior
Failover as usual
Additional data / screenshots
Environment:
Additional context
Persistent total cluster outage, since all nodes failed liveness heartbeats.
Resolved only when the deadlocked node was brought down.
Jira issue: CRDB-15539
gz#13737
Epic CRDB-19227
gz#19526