kvserver: properly handle invalid leases when draining #58613
ZenDesk: 6855
I started working on this by designing a unit test in #59371, and I found myself unable to block a drain on an invalid lease. The verbose logging I was getting confused me until I found this condition in the drain code:

```go
needsLeaseTransfer := len(r.Desc().Replicas().VoterDescriptors()) > 1 &&
	drainingLease.OwnedBy(s.StoreID()) &&
	r.IsLeaseValid(ctx, drainingLease, s.Clock().Now())
```

The last part especially.
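To make the effect of that last clause concrete, here is a small self-contained toy model of the condition. The `lease` and `replica` types below are simplified stand-ins of my own, not the actual kvserver types; the point is just that an expired lease makes `needsLeaseTransfer` false, so the drain takes no action on it:

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-ins for the kvserver types; illustrative only.
type lease struct {
	holderStoreID int
	expiration    time.Time
}

func (l lease) OwnedBy(storeID int) bool { return l.holderStoreID == storeID }
func (l lease) Valid(now time.Time) bool { return now.Before(l.expiration) }

type replica struct {
	voters int
	lease  lease
}

// needsLeaseTransfer mirrors the quoted condition: the drain only targets a
// lease that this store owns and that is still valid.
func needsLeaseTransfer(r replica, storeID int, now time.Time) bool {
	return r.voters > 1 &&
		r.lease.OwnedBy(storeID) &&
		r.lease.Valid(now)
}

func main() {
	now := time.Now()
	expired := replica{voters: 3, lease: lease{holderStoreID: 1, expiration: now.Add(-time.Minute)}}
	valid := replica{voters: 3, lease: lease{holderStoreID: 1, expiration: now.Add(time.Minute)}}

	fmt.Println(needsLeaseTransfer(expired, 1, now)) // false: the expired lease is never transferred
	fmt.Println(needsLeaseTransfer(valid, 1, now))   // true: the valid lease is transferred away
}
```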
Oh, then maybe the first part of my comment in the support repo isn't actually a problem. I think you will see the drain fail if you make the expired leaseholder the Raft leader, because then you hit the code at cockroach/pkg/kv/kvserver/store.go line 1134 (as of 457dffb), which I think was a noop for an expired lease. But I am seeing on master that this code has already changed? I think my comment will apply to 20.2, though.
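Assuming the leadership-transfer path really was gated on holding a valid lease (which is what the noop behavior suggests; the functions below are my own sketch, not the code at that line), the effect can be modeled like this:

```go
package main

import (
	"fmt"
	"time"
)

// leaseInfo is a simplified stand-in, not the kvserver lease type.
type leaseInfo struct {
	ownedByMe  bool
	expiration time.Time
}

func (l leaseInfo) valid(now time.Time) bool { return now.Before(l.expiration) }

// Gated on the lease (the suspected 20.2 behavior): an expired lease means no
// leadership transfer, so a draining node that is still the Raft leader keeps
// leadership indefinitely.
func transferLeadershipGatedOnLease(isRaftLeader bool, l leaseInfo, now time.Time) bool {
	return isRaftLeader && l.ownedByMe && l.valid(now)
}

// Not gated on the lease (what this issue asks for): transfer leadership
// whenever this store is the Raft leader, regardless of the lease.
func transferLeadershipAlways(isRaftLeader bool) bool {
	return isRaftLeader
}

func main() {
	now := time.Now()
	expired := leaseInfo{ownedByMe: true, expiration: now.Add(-time.Minute)}

	fmt.Println(transferLeadershipGatedOnLease(true, expired, now)) // false: leadership stays on the draining node
	fmt.Println(transferLeadershipAlways(true))                     // true: leadership moves away
}
```

With the gated variant, a draining node whose lease has expired never gives up Raft leadership.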
I am reproducing on 20.2 with the following steps:
The drain then takes forever and seems not to converge.
Now the question is what to do about this. It seems likely (to me) that the following PRs by Andrei have helped: #55148, #55624, #55619. Especially the third one.
I tried the following repro script on release-20.2 vs release-20.2+c1b11f24bb68e20704f40eb40a1519d21774c54d and the latter very reliably drains within seconds, while the former gets stuck or takes a very long time to complete.
I think backporting c1b11f2 to release-20.2 is the way to go. We should work on the testing on master, but given the constraints we should feel comfortable backporting this to 20.2 even if it does not fix all of the problems - it seems to reliably fix the major problem.
Describe the problem
Invalid leases are not handled properly on 20.1 and likely beyond. An invalid lease can persist when a node goes down (it is immaterial whether it comes back up or not), in which case inactive ranges may still reference a lease held by the down (or restarted) node.
The drain code takes no action on these leases, but still counts them as leases that need action. This means the drain will not complete.
Additionally, the Raft leadership transfer code also checks the lease, so Raft leadership similarly won't be transferred away, no matter who holds the lease. A rough model of the resulting non-convergence is sketched below.
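As a rough, self-contained model of that mismatch (the split between an accounting predicate and an action predicate is an assumption for illustration, not lifted from the 20.x code):

```go
package main

import (
	"fmt"
	"time"
)

// rangeState is a toy stand-in for a range's lease bookkeeping.
type rangeState struct {
	voters         int
	leaseOwnedByMe bool
	leaseExpires   time.Time
}

func (r rangeState) leaseValid(now time.Time) bool { return now.Before(r.leaseExpires) }

// drainOnce models one pass of the drain loop. The accounting predicate and
// the action predicate differ, which is the assumed shape of the bug: an
// invalid lease is counted as remaining work but never acted on.
func drainOnce(ranges []rangeState, now time.Time) (remaining int) {
	for i := range ranges {
		r := &ranges[i]
		// Accounting: any lease owned by the draining store counts as work left.
		if r.voters > 1 && r.leaseOwnedByMe {
			remaining++
		}
		// Action: only a lease that is still valid is actually transferred away.
		if r.voters > 1 && r.leaseOwnedByMe && r.leaseValid(now) {
			r.leaseOwnedByMe = false
		}
	}
	return remaining
}

func main() {
	now := time.Now()
	ranges := []rangeState{
		{voters: 3, leaseOwnedByMe: true, leaseExpires: now.Add(time.Minute)},  // valid lease
		{voters: 3, leaseOwnedByMe: true, leaseExpires: now.Add(-time.Minute)}, // invalid lease
	}
	for attempt := 1; attempt <= 3; attempt++ {
		fmt.Printf("attempt %d: %d ranges still need work\n", attempt, drainOnce(ranges, now))
	}
	// Prints 2, then 1, then 1: the invalid lease is never released, so the
	// count never reaches zero and the drain does not converge.
}
```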
To Reproduce
Do what the kv/gracefuldraining roachtest does, but restart the node before draining. Also restart another member of the cluster (or all members) before draining.
See https://github.com/cockroachlabs/support/issues/697#issuecomment-733941083 for an internal report on a manual reproduction of this, as well as code pointers into the problematic code (on 20.1).
Expected behavior
Invalid leases on the node that should be drained should not be considered as requiring a lease transfer.
Raft leadership transfers should occur regardless of the validity of the lease. A sketch of both expectations follows.
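A minimal sketch of those two expectations, again using simplified stand-in types rather than the actual kvserver signatures:

```go
package main

import (
	"fmt"
	"time"
)

// leaseState is a simplified stand-in, not the actual kvserver type.
type leaseState struct {
	ownedByMe  bool
	expiration time.Time
}

// needsLeaseTransfer: an invalid lease held by the draining node is not
// counted as requiring a transfer (first expectation).
func needsLeaseTransfer(voters int, l leaseState, now time.Time) bool {
	return voters > 1 && l.ownedByMe && now.Before(l.expiration)
}

// shouldTransferRaftLeadership: decided purely from Raft leadership, with no
// reference to the lease at all (second expectation).
func shouldTransferRaftLeadership(isRaftLeader bool, voters int) bool {
	return isRaftLeader && voters > 1
}

func main() {
	now := time.Now()
	invalid := leaseState{ownedByMe: true, expiration: now.Add(-time.Minute)}

	fmt.Println(needsLeaseTransfer(3, invalid, now))   // false: the invalid lease does not block the drain
	fmt.Println(shouldTransferRaftLeadership(true, 3)) // true: leadership moves regardless of the lease
}
```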
gz#6855