
kvserver: properly handle invalid leases when draining #58613

Closed
tbg opened this issue Jan 7, 2021 · 7 comments
Comments

tbg commented Jan 7, 2021

Describe the problem

Invalid leases are not handled properly on 20.1 and likely beyond. An invalid lease can persist when a node goes down (whether or not it comes back up is immaterial), in which case inactive ranges may still reference a lease held by the down (or restarted) node.

The drain code takes no action on such a lease, but still counts it as one that requires action. As a result, the drain does not complete.

Additionally, the raft leadership transfer code also checks the lease, so raft leadership is likewise not transferred away, no matter who holds the lease.
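
For illustration, a self-contained toy model of why such a lease ends up invalid yet still recorded (the types and names below are invented for this sketch and are not CockroachDB code): an expiration-based lease held by a node that went down simply times out, while the range state continues to name that node as the leaseholder.

package main

import (
    "fmt"
    "time"
)

// toyLease is a stand-in for a range lease with an expiration.
type toyLease struct {
    holderNodeID int
    expiration   time.Time
}

// valid reports whether the lease is still in effect at the given time.
func (l toyLease) valid(now time.Time) bool {
    return now.Before(l.expiration)
}

func main() {
    // The holder went down and the lease expired a minute ago, but the
    // range state still names node 1 as the leaseholder.
    l := toyLease{holderNodeID: 1, expiration: time.Now().Add(-time.Minute)}
    fmt.Println(l.holderNodeID, l.valid(time.Now())) // 1 false
}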

To Reproduce

Do what the kv/gracefuldraining roachtest does, but restart the node before draining. Before draining, also restart another member of the cluster (or all members).

See https://github.com/cockroachlabs/support/issues/697#issuecomment-733941083 for an internal report on a manual reproduction of this as well as code pointers into problematic code (on 20.1).

Expected behavior
  • Invalid leases on the node being drained should not be considered as requiring a lease transfer.
  • Raft leadership should be transferred away regardless of the validity of the lease (see the toy sketch below).

gz#6855
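
For illustration, a minimal self-contained sketch of both expected checks, using toy types invented for this example rather than the kvserver types (the real condition from kvserver/store.go is quoted later in this thread):

package main

import "fmt"

// Toy stand-ins for a range lease and a replica; not the kvserver types.
type toyLease struct {
    ownerStoreID int
    valid        bool // false once expired, e.g. after an ungraceful restart
}

type toyReplica struct {
    voters     int
    lease      toyLease
    raftLeader bool
}

// needsLeaseTransfer: an invalid lease must not block the drain, even if it
// still names the draining store as its owner.
func needsLeaseTransfer(r toyReplica, storeID int) bool {
    return r.voters > 1 && r.lease.ownerStoreID == storeID && r.lease.valid
}

// needsRaftTransfer: raft leadership is handed off regardless of the lease.
func needsRaftTransfer(r toyReplica) bool {
    return r.raftLeader
}

func main() {
    // The draining store (store 1) holds an expired lease and is still the raft leader.
    r := toyReplica{voters: 3, lease: toyLease{ownerStoreID: 1, valid: false}, raftLeader: true}
    fmt.Println(needsLeaseTransfer(r, 1)) // false: the drain should not wait on this lease
    fmt.Println(needsRaftTransfer(r))     // true: raft leadership is still moved away
}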

@tbg tbg added A-kv-server Relating to the KV-level RPC server C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Jan 7, 2021
@tbg tbg assigned knz Jan 7, 2021
@fabiog1901

ZenDesk: 6855

knz commented Jan 25, 2021

I started working on this by designing a unit test in #59371 and I found myself unable to block a drain on an invalid lease.

The verbose logging I get with -vmodule=store=2 tells me that the replica with an invalid lease is not even considered by the drain ("not moving out").

This was confusing me until I found this condition in kvserver/store.go:

    needsLeaseTransfer := len(r.Desc().Replicas().VoterDescriptors()) > 1 &&
                          drainingLease.OwnedBy(s.StoreID()) &&
                          r.IsLeaseValid(ctx, drainingLease, s.Clock().Now())

The last part, especially IsLeaseValid(), tells me we're already doing the right thing here. What did I miss?

tbg commented Jan 25, 2021

Oh, then maybe the first part of my comment in the support repo isn't actually a problem. I think you will see the drain fail if you make the expired leaseholder the raft leader, because then you hit

r.maybeTransferRaftLeadership(ctx)

which I think was a no-op for an expired lease. But I am seeing on master that this code has already changed? I think my comment will still apply to 20.2, though.
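
For illustration, a hedged sketch of the pattern described above; the name and signature are invented for this example and this is not the actual kvserver code. If leadership transfer early-returns when the local lease is invalid, an expired lease on the draining node leaves raft leadership parked there:

package main

import "fmt"

// maybeTransferRaftLeadershipSketch illustrates the problematic gating only;
// it is not the real kvserver function.
func maybeTransferRaftLeadershipSketch(leaseValid, isRaftLeader bool, transfer func()) {
    if !leaseValid {
        return // the no-op described above: an expired lease blocks the handoff
    }
    if isRaftLeader {
        transfer()
    }
}

func main() {
    transferred := false
    // Expired lease, but this replica is still the raft leader.
    maybeTransferRaftLeadershipSketch(false, true, func() { transferred = true })
    fmt.Println(transferred) // false: leadership stays put and the drain keeps waiting
}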

knz commented Jan 25, 2021

I am reproducing on 20.2 with the following steps:

  1. roachprod create local -n 3
  2. roachprod put local ./cockroach
  3. roachprod start local
  4. wait for up replication
  5. roachprod stop local (ungraceful stop)
  6. roachprod start local
  7. cockroach node drain

The drain then takes forever and seems not to converge.

With master, the problem does not repro any more.

knz commented Jan 25, 2021

Now the question is what to do about this.

It seems likely (to me) that the following PRs by Andrei have helped: #55148 #55624 #55619

Especially the 3rd one.

  • Are these things we want to backport to 20.2?
  • What's the minimally invasive thing we can do in v20.2?
  • I can't backport the new unit tests introduced in server: avoid expired leases during drain #59371 since they rely on Alex's refactor, unless I also backport Alex's refactor. Do we want to do this?

tbg commented Jan 28, 2021

I tried the following repro script on release-20.2 vs release-20.2+c1b11f24bb68e20704f40eb40a1519d21774c54d and the latter very reliably drains within seconds, while the former gets stuck or takes a very long time to complete.

#!/bin/bash
set -euxo pipefail

bin/roachprod destroy local || true
bin/roachprod create -n 3 local
bin/roachprod put local ./cockroach ./cockroach
bin/roachprod start local
# Wait for up-replication.
sleep 15s
./cockroach workload init kv --splits 100
./cockroach workload run kv --read-percent 0 --duration=10s
bin/roachprod stop local
bin/roachprod start local --args "--vmodule=store=2"
./cockroach node drain --insecure

I think backporting c1b11f2 to release-20.2 is the way to go. We should work on the testing on master, but given the constraints we should feel comfortable backporting this to 20.2 even if it does not fix all of the problems - it seems to reliably fix the major problem.

@lunevalex

@tbg @knz is there anything left to do here or can we close this?

@tbg tbg closed this as completed Apr 20, 2021