kvserver: leases appear stuck when draining a node #101885
Comments
It's possible that this is #100761, but there is some other weirdness that might be worth digging into. In the latest instance of this, we noticed the following (this is after three attempts to drain):
Also worth noting that the patch isn't in 23.1.0-beta.2, and it's unlikely to make 23.1.0 either. @yuzefovich - will defer to you to figure out whether we should slap a GA-blocker label on this or not.
I'll add a GA-blocker to justify backporting #101884 to 23.1.0.
Leases being stuck is a KV issue, though, so I'll add it to the KV board too, cc @kvoli
From Austen:
Looks like the incoming voter was transient. Still stuck, but with a leader/lease split only. When running manually via the replicate queue (RQ):
The full status entry for `r158617`:
Another range, r144488, had the same issue but resolved itself after 5 minutes:
The issue is that when we are in a configuration where the leader and leaseholder are different, the replicate queue is unable to fix it. It has a check […]. A reasonable fix is to reverse the check here to only exclude replicas if the state is […]. There is a separate question of why we have a leader/leaseholder split for this range in the first place, and it may be worth tracking that down as well, but it's possible that these just "organically appear" and this bug would prevent them from recovering.
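As a rough sketch of the shape of that fix (the elided check and state above weren't captured here, so the types, names, and probe/replicate/snapshot states below are assumptions, not CockroachDB's actual code): when the check excludes every replica whose progress is unknown, a non-leader leaseholder can never pick a candidate; reversing it to exclude only replicas known to be in a bad state keeps the queue unstuck.

```go
package main

import "fmt"

// progressState loosely mirrors Raft's probe/replicate/snapshot tracker states.
// All names here are illustrative, not CockroachDB identifiers.
type progressState int

const (
	stateProbe progressState = iota
	stateReplicate
	stateSnapshot
)

type candidate struct {
	replicaID int
	state     progressState
	known     bool // false when the local replica is not the Raft leader
}

// excludeStrict admits only replicas known to be in stateReplicate. A
// non-leader leaseholder knows nothing about peer progress, so every
// candidate is excluded and the queue can never repair the range.
func excludeStrict(c candidate) bool {
	return !c.known || c.state != stateReplicate
}

// excludeRelaxed reverses the check: only exclude replicas that are known to
// be in a problematic state (here, needing a snapshot). Unknown progress is
// no longer disqualifying.
func excludeRelaxed(c candidate) bool {
	return c.known && c.state == stateSnapshot
}

func main() {
	// Progress is unknown because the leaseholder is not the Raft leader.
	c := candidate{replicaID: 4, known: false}
	fmt.Println("strict check excludes it:", excludeStrict(c))   // true: queue is stuck
	fmt.Println("relaxed check excludes it:", excludeRelaxed(c)) // false: queue can proceed
}
```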
If a leaseholder is not the raft leader then it does not track the progress of peers. It would not know whether the lease transfer target is caught up on its log or whether it requires a snapshot. See #81561 for more detail about why this would be unsafe and the consequences that allowing such lease transfers could have.
Which bug are you referring to? Why would it prevent the leaseholder from receiving leadership through […]?
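To make the progress-tracking argument above concrete, here is a minimal sketch under assumed types and names (not CockroachDB's API) of why only the Raft leader can tell whether a transfer target is caught up on the log:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative model only; names do not correspond to CockroachDB's API.
type progress struct {
	match uint64 // highest log index known to be replicated to this follower
}

type replica struct {
	isLeader  bool
	lastIndex uint64
	followers map[int]progress // populated only on the Raft leader
}

// canTransferLease sketches the safety condition: only the Raft leader can
// tell whether the target is caught up on the log or would need a snapshot.
func (r *replica) canTransferLease(target int) error {
	if !r.isLeader {
		return errors.New("not the Raft leader: peer progress unknown, refusing lease transfer")
	}
	p, ok := r.followers[target]
	if !ok || p.match < r.lastIndex {
		return errors.New("target is behind on the log or may need a snapshot")
	}
	return nil
}

func main() {
	// A leaseholder that is not the Raft leader tracks no follower progress.
	follower := &replica{isLeader: false, lastIndex: 100}
	fmt.Println(follower.canTransferLease(4))

	// The leader can verify the target is caught up before transferring.
	leader := &replica{
		isLeader:  true,
		lastIndex: 100,
		followers: map[int]progress{4: {match: 100}},
	}
	fmt.Println(leader.canTransferLease(4)) // <nil>
}
```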
We see a split leader/leaseholder range that's quiescent:

➜ jq '.responsesByNodeId."3".infos[0].raftState.lead' status.json
"5"
➜ jq '.responsesByNodeId."3".infos[0].state.state.lease.replica.replicaId' status.json
4
➜ jq '.responsesByNodeId."3".infos[0].quiescent' status.json
true

This would explain why […]

Is this racy? Did we break things when we touched them earlier this release in #97289?
n6's lease has remained valid since 2023-04-19 21:46:03. The n3:5 replica (leader) only came into existence at 22:13:40:
So the lease has existed for as long as n3 has existed, and thus for as long as it can have been the leader, and it's still valid. The logic in #97289 should not allow it to quiesce. This shouldn't be racy, because if there are any outstanding proposals or unapplied log entries we don't quiesce either, so there can't be another lease in flight that we haven't seen. However, what would happen if n6 had a liveness blip, either globally or from n3 gossip's point of view? There are two conditions that mark an epoch-based lease as expired:
I think the second condition is problematic. The assumption we make when we allow expired leases to quiesce is that someone has to reacquire the expired lease, and this will wake up the Raft group. But I think this assumption is flawed in two ways:
It's still not clear that this is what happened here (I'll see if I can find any indications that n6's liveness may have blipped on n3), but I think 1 above is reason enough to revert this change. We no longer strictly need this logic anyway, now that we eagerly extend expiration leases.
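The two conditions themselves didn't survive in this excerpt. As an assumption consistent with the revert description below, here is a minimal Go sketch of an epoch-based lease that reads as expired either because the epoch was bumped or because the locally observed liveness record lapsed, and that can flip back to valid without anyone reacquiring it:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative types only; these are assumptions, not CockroachDB's.
type liveness struct {
	epoch      int64
	expiration time.Time
}

type epochLease struct {
	epoch int64
}

// expired sketches the two assumed conditions: the liveness epoch has been
// bumped past the lease's epoch, or the locally observed liveness record has
// lapsed (no recent heartbeat seen via gossip).
func expired(l epochLease, lv liveness, now time.Time) bool {
	return lv.epoch > l.epoch || !lv.expiration.After(now)
}

func main() {
	lease := epochLease{epoch: 5}
	now := time.Now()

	// During a heartbeat blip the lease reads as expired from this node's
	// point of view, so (before the revert) the range could quiesce.
	stale := liveness{epoch: 5, expiration: now.Add(-time.Second)}
	fmt.Println("during blip, expired:", expired(lease, stale, now)) // true

	// A later heartbeat arrives without any epoch increment: the very same
	// lease is valid again, nobody reacquires it, and nothing wakes the range.
	fresh := liveness{epoch: 5, expiration: now.Add(9 * time.Second)}
	fmt.Println("after heartbeat, expired:", expired(lease, fresh, now)) // false
}
```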
Submitted a revert in #101899. I'm going to keep this issue open and try to figure out if this is what happened, or if there's more going on. I have to deal with some L2 stuff first though.
101899: Revert "kvserver: allow expired leases to quiesce" r=erikgrinaker a=erikgrinaker

This reverts commit 76afb00.

This change was flawed, because it assumed an expired lease would have to be reacquired which would wake up the range, but that's not necessarily true. In cases where the Raft leader is not colocated with the leaseholder, the Raft leader may have a stale liveness record, or the leaseholder may temporarily fail to heartbeat liveness without anyone bumping its epoch. If this happens, it's possible for an expired lease to revert to a valid lease from the leader's point of view, but at that point it's too late: the range is already quiesced. This can prevent e.g. lease transfers, because we don't allow leaseholders to transfer their lease if they're not the Raft leader (they need to make sure the target is not behind on the Raft log). We won't attempt to colocate the Raft leader with the leaseholder for a quiesced range, since we don't tick it.

Touches #101885.
Touches #97289.

Epic: none

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
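A small sketch of the mechanism in that last sentence, with assumed names rather than the actual replica code: leadership-to-leaseholder colocation happens on the tick path, and quiesced replicas skip it entirely, so a quiesced range with a leader/leaseholder split stays that way.

```go
package main

import "fmt"

// Illustrative only; assumed names rather than the actual replica code.
type rangeState struct {
	rangeID     int64
	quiescent   bool
	isLeader    bool
	selfReplica int
	leaseholder int
}

// tick models the per-range tick: quiesced replicas are skipped, so the
// leadership-to-leaseholder handoff below never runs for them.
func tick(r *rangeState) {
	if r.quiescent {
		return
	}
	if r.isLeader && r.leaseholder != r.selfReplica {
		fmt.Printf("r%d: transferring Raft leadership to replica %d\n", r.rangeID, r.leaseholder)
	}
}

func main() {
	r := &rangeState{rangeID: 1, quiescent: true, isLeader: true, selfReplica: 5, leaseholder: 4}
	tick(r) // quiesced: nothing happens, the split persists

	// Once the range no longer quiesces on an invalid lease, ticking resumes
	// and leadership can move to the leaseholder.
	r.quiescent = false
	tick(r)
}
```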
Do we want to put a big warning or a Fatal in the replicate queue where it detects a quiescent range whose leaseholder is not the Raft leader? It would make this sort of issue much easier to find. It didn't register to me that this state was so problematic.
Sure, if we can do so without a risk of races. Let's not do fatal though, at least not in production builds.
For 23.1 we can just add a warning log for this. I agree it is hard to avoid the risk of races here. I can take a look and add something if there is an easy change. I created #101902 to track this.
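For illustration only (this is not the change tracked in #101902), a sketch of the kind of warning being discussed: when the replicate queue sees a quiescent range whose leaseholder is not the Raft leader, it logs instead of fataling.

```go
package main

import "log"

// Illustrative only; not the actual change tracked in #101902.
type rangeInfo struct {
	rangeID     int64
	quiescent   bool
	leader      int
	leaseholder int
}

// maybeWarnQuiescedSplit surfaces the suspicious state instead of silently
// skipping the range. A fatal would be too aggressive for production builds,
// per the discussion above, so this only logs.
func maybeWarnQuiescedSplit(r rangeInfo) {
	if !r.quiescent || r.leader == r.leaseholder {
		return
	}
	log.Printf("warning: r%d is quiescent but its leader (replica %d) is not the leaseholder (replica %d); "+
		"the replicate queue cannot repair this state", r.rangeID, r.leader, r.leaseholder)
}

func main() {
	maybeWarnQuiescedSplit(rangeInfo{rangeID: 158617, quiescent: true, leader: 5, leaseholder: 4})
}
```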
cc @cockroachdb/replication
I'm a bit stuck here. I only have time series to go by, since the debug.zip I grabbed only had 48 hours of logs, and the necessary logs are now rotated out. The timeline is something like this:
There were no liveness failures nor epoch increments in the interval, so the hypothesis above about n3's view of the lease flipping to […]. Over the next 20 minutes, we see n3 moving Raft leaderships to leaseholders, as we would expect for unquiesced leaders. I've reviewed the quiescence code, and I don't see any other explanation for why we fail to quiesce except the hypothesis above, but I also haven't been able to confirm that that's the bug we hit here. I don't know if I'll be able to make much more progress here, so I'll remove the GA blocker, but I will try some more restarts and drains on the test cluster to see if it reproduces.
Did another rolling drain+restart of the 23.1 test cluster. Except for #102574, the drains all went through successfully, and […]
Describe the problem
When draining a node in a 23.1.0-beta.2 cluster, I've noticed that occasionally a node will get stuck draining:
To Reproduce
We have a test cluster with instructions on the topology and workloads here.
This doesn't reproduce reliably, but I've been doing the following:
1. `roachprod ssh $CLUSTER:$NODE`
2. `./cockroach node drain --self --insecure`
3. `roachprod stop $CLUSTER:$NODE`, then `start`
Expected behavior
We expect the drain to complete in a timely fashion, rather than hanging.
Environment:
23.1.0-beta.2
Additional context
See this internal thread (internal).
Jira issue: CRDB-27173
Epic CRDB-27235