
kvserver: properly handle invalid leases when draining #58613

Closed
tbg opened this issue Jan 7, 2021 · 7 comments
Comments

tbg commented Jan 7, 2021

Describe the problem

Invalid leases are not handled properly on 20.1 and likely beyond. An invalid lease can persist when a node goes down (whether or not it comes back up is immaterial), in which case inactive ranges may still reference a lease held by the down (or restarted) node.

The drain code takes no action on such a lease, but still counts it as one that requires action. As a result, the drain does not complete.

Additionally, the raft leadership transfer code also checks the lease, so raft leadership is likewise not transferred away, no matter who holds the lease.
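
For illustration, a self-contained toy model of why such a lease ends up invalid yet still recorded (the types and names below are invented for this sketch and are not CockroachDB code): an expiration-based lease held by a node that went down simply times out, while the range state continues to name that node as the leaseholder.

package main

import (
    "fmt"
    "time"
)

// toyLease is a stand-in for a range lease with an expiration.
type toyLease struct {
    holderNodeID int
    expiration   time.Time
}

// valid reports whether the lease is still in effect at the given time.
func (l toyLease) valid(now time.Time) bool {
    return now.Before(l.expiration)
}

func main() {
    // The holder went down and the lease expired a minute ago, but the
    // range state still names node 1 as the leaseholder.
    l := toyLease{holderNodeID: 1, expiration: time.Now().Add(-time.Minute)}
    fmt.Println(l.holderNodeID, l.valid(time.Now())) // 1 false
}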

To Reproduce

Do what the kv/gracefuldraining roachtest does, but restart the node before draining. Before draining, also restart another member of the cluster (or all members).

See https://github.com/cockroachlabs/support/issues/697#issuecomment-733941083 for an internal report on a manual reproduction of this as well as code pointers into problematic code (on 20.1).

Expected behavior
  • Invalid leases on the node being drained should not be considered as requiring a lease transfer.
  • Raft leadership should be transferred away regardless of the validity of the lease (see the toy sketch below).

gz#6855
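
For illustration, a minimal self-contained sketch of both expected checks, using toy types invented for this example rather than the kvserver types (the real condition from kvserver/store.go is quoted later in this thread):

package main

import "fmt"

// Toy stand-ins for a range lease and a replica; not the kvserver types.
type toyLease struct {
    ownerStoreID int
    valid        bool // false once expired, e.g. after an ungraceful restart
}

type toyReplica struct {
    voters     int
    lease      toyLease
    raftLeader bool
}

// needsLeaseTransfer: an invalid lease must not block the drain, even if it
// still names the draining store as its owner.
func needsLeaseTransfer(r toyReplica, storeID int) bool {
    return r.voters > 1 && r.lease.ownerStoreID == storeID && r.lease.valid
}

// needsRaftTransfer: raft leadership is handed off regardless of the lease.
func needsRaftTransfer(r toyReplica) bool {
    return r.raftLeader
}

func main() {
    // The draining store (store 1) holds an expired lease and is still the raft leader.
    r := toyReplica{voters: 3, lease: toyLease{ownerStoreID: 1, valid: false}, raftLeader: true}
    fmt.Println(needsLeaseTransfer(r, 1)) // false: the drain should not wait on this lease
    fmt.Println(needsRaftTransfer(r))     // true: raft leadership is still moved away
}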

@tbg tbg added A-kv-server Relating to the KV-level RPC server C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Jan 7, 2021
@tbg tbg assigned knz Jan 7, 2021
@fabiog1901

ZenDesk: 6855

knz commented Jan 25, 2021

I started working on this by designing a unit test in #59371 and I found myself unable to block a drain on an invalid lease.

The verbose logging I get with -vmodule=store=2 tells me that the replica with an invalid lease is not even considered by the drain ("not moving out").

This was confusing me until I found this condition in kvserver/store.go:

    needsLeaseTransfer := len(r.Desc().Replicas().VoterDescriptors()) > 1 &&
                          drainingLease.OwnedBy(s.StoreID()) &&
                          r.IsLeaseValid(ctx, drainingLease, s.Clock().Now())

The last part, especially IsLeaseValid(), tells me we're already doing the right thing here. What did I miss?

tbg commented Jan 25, 2021

Oh, then maybe the first part of my comment in the support repo isn't actually a problem. I think you will see the drain fail if you make the expired leaseholder the raft leader, because then you hit

r.maybeTransferRaftLeadership(ctx)

which I think was a no-op for an expired lease. But I am seeing on master that this code has already changed? I think my comment will still apply to 20.2, though.
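
For illustration, a hedged sketch of the pattern described above; the name and signature are invented for this example and this is not the actual kvserver code. If leadership transfer early-returns when the local lease is invalid, an expired lease on the draining node leaves raft leadership parked there:

package main

import "fmt"

// maybeTransferRaftLeadershipSketch illustrates the problematic gating only;
// it is not the real kvserver function.
func maybeTransferRaftLeadershipSketch(leaseValid, isRaftLeader bool, transfer func()) {
    if !leaseValid {
        return // the no-op described above: an expired lease blocks the handoff
    }
    if isRaftLeader {
        transfer()
    }
}

func main() {
    transferred := false
    // Expired lease, but this replica is still the raft leader.
    maybeTransferRaftLeadershipSketch(false, true, func() { transferred = true })
    fmt.Println(transferred) // false: leadership stays put and the drain keeps waiting
}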

knz commented Jan 25, 2021

I am reproducing on 20.2 with the following steps:

  1. roachprod create local -n 3
  2. roachprod put local ./cockroach
  3. roachprod start local
  4. wait for up replication
  5. roachprod stop local (ungraceful stop)
  6. roachprod start local
  7. cockroach node drain

The drain then takes forever and seems not to converge.

With master, the problem does not repro any more.

knz commented Jan 25, 2021

Now the question is what to do about this.

It seems likely (to me) that the following PRs by Andrei have helped: #55148 #55624 #55619

Especially the 3rd one.

  • Are these things we want to backport to 20.2?
  • What's the minimally invasive thing we can do in v20.2?
  • I can't backport the new unit tests introduced in server: avoid expired leases during drain #59371 since they rely on Alex's refactor, unless I also backport Alex's refactor. Do we want to do this?

tbg commented Jan 28, 2021

I tried the following repro script on release-20.2 vs release-20.2+c1b11f24bb68e20704f40eb40a1519d21774c54d and the latter very reliably drains within seconds, while the former gets stuck or takes a very long time to complete.

#!/bin/bash
set -euxo pipefail

bin/roachprod destroy local || true
bin/roachprod create -n 3 local
bin/roachprod put local ./cockroach ./cockroach
bin/roachprod start local
# Wait for up-replication.
sleep 15s
./cockroach workload init kv --splits 100
./cockroach workload run kv --read-percent 0 --duration=10s
bin/roachprod stop local
bin/roachprod start local --args "--vmodule=store=2"
./cockroach node drain --insecure

I think backporting c1b11f2 to release-20.2 is the way to go. We should work on the testing on master, but given the constraints we should feel comfortable backporting this to 20.2 even if it does not fix all of the problems - it seems to reliably fix the major problem.

@lunevalex

@tbg @knz is there anything left to do here or can we close this?

@tbg tbg closed this as completed Apr 20, 2021