kv: controlling range leadership #6929
This would just be the existing […]. I think this sounds like the sanest, most non-invasive way to get what you want. Lease shortening also has some general use cases, though it could be a little tricky to get right. Rather than diving into the deep end of this, I think it's worth speccing out what you actually want from this magical test cluster before it turns into a moloch.
What I want out of this […]
I think it is valuable and is going to take enough time to develop that it is […]
Upon further consultation, it seems like we can drop the […]
@vivekmenezes I might put this in an RFC if more discussion follows or if the scope grows somehow. Otherwise I don't know what more to say than these paragraphs. I'm sure the devil is in the details to be seen at implementation time; I wish I understood the code, and indeed these issues, better beforehand, but alas.
The point of an RFC is to document the devilish details before making code […]
Changing leadership seems a lot simpler than picking a leader. Is there a strong need to pick a leader?
Well, what does it mean to change leadership if you can't assign a new one? If the same node gets the lease again, have you changed anything?
I think it's about giving up leadership and not participating in the […]
Leadership is opportunistic, so if a node gives up leadership it'll regain […]

-- Tobias
What is […]? What exactly is the flow you're now proposing? Does the client send […]?
Sorry, there's no […]. And again you're right - there's no […].
So if the old leader must propose the new LeaderLease, what causes that to happen? I'm still not clear on the overall flow.
Right, `GetLeader` would be a read RPC, not a raft command.
ping @andreimatei re: @bdarnell's questions above.
The old leader would propose the new […]
SGTM for tests and orderly shutdown; I think we'll probably want a full RFC once we're thinking about introducing an RPC for use in live clusters.
Allow the leader of a range to propose a LeaderLease request naming another replica as the leader. towards cockroachdb#6929
I started working on a `TestCluster` class that's supposed to wrap a configurable number of `TestServer`s and let tests control how ranges are replicated and who the leader of each one is - a higher-level `MultiTestContext`. It's going to be used for testing DistSQL in general, and in particular a wrapper I'm building over the `LeaderCache` and the `RangeDescriptorCache` for inquiring about the leadership of key spans. And David has tried to use something like this in the past for running more realistic ("multi-node") benchmarks. Seems like a good thing, right? It's gonna be great.

One area where I quickly ran into problems is how controlling the leadership of ranges is going to work (technically, leadership in the sense of holding the `LeaderLease`, not Raft leadership). I would like the `TestCluster` to let the client say that it wants node n1 to become the leader of range r1 - and 1ms later, that it wants n2 to become the leader of that range. It seems we don't currently have a way for a node to release its `LeaderLease`, and we also don't have a clear way to force a node to become the leader when there's no leader.

Besides being necessary for testing, we can imagine that this capability would be useful in the real world when we start talking about collocating ranges (and their leadership).
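To make the desired control concrete, here is a minimal sketch of the kind of surface such a `TestCluster` could expose. All of these names and types are hypothetical illustrations, not actual cockroachdb APIs; the toy implementation just records intent rather than driving real servers:

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for roachpb-style identifiers.
type NodeID int
type RangeID int

// TestCluster is a hypothetical sketch of the wrapper described above:
// it owns a set of test servers and lets a test steer lease holdership
// per range.
type TestCluster struct {
	// leaders maps each range to the node currently holding its lease.
	leaders map[RangeID]NodeID
	nodes   []NodeID
}

func NewTestCluster(numNodes int) *TestCluster {
	tc := &TestCluster{leaders: make(map[RangeID]NodeID)}
	for i := 1; i <= numNodes; i++ {
		tc.nodes = append(tc.nodes, NodeID(i))
	}
	return tc
}

// TransferLease is the operation the issue asks for: make `target` the
// leaseholder of `r`. In a real implementation this would go through the
// lease-shortening and re-election dance discussed below; here we just
// record the intent after validating the target node.
func (tc *TestCluster) TransferLease(r RangeID, target NodeID) error {
	for _, n := range tc.nodes {
		if n == target {
			tc.leaders[r] = target
			return nil
		}
	}
	return errors.New("no such node in cluster")
}

// LeaseHolder reports the current leaseholder of r, if any.
func (tc *TestCluster) LeaseHolder(r RangeID) (NodeID, bool) {
	n, ok := tc.leaders[r]
	return n, ok
}

func main() {
	tc := NewTestCluster(3)
	_ = tc.TransferLease(RangeID(1), NodeID(1))
	_ = tc.TransferLease(RangeID(1), NodeID(2)) // a moment later, move it again
	if n, ok := tc.LeaseHolder(RangeID(1)); ok {
		fmt.Println("leaseholder of r1:", n)
	}
}
```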
I've talked to Tobias a bit about the topic, and here's a plan I want to drop in the pool and see if it makes a splash.
A `ShortenLease(t)` raft command. This command can only be proposed by the current lease holder, and has the effect of moving up the expiration of the current lease to time t. t is chosen as the current clock of the leader (which is higher than the timestamps from its read cache - i.e. the highest timestamp it has already served a read at). Between when the command is proposed and when it's applied, the leader replica will remember this t and refuse to serve reads above it.

The refusal is done by returning a `NotLeaderError` redirecting to the nextLeader. nextLeader is a hint given to the leader replica when `ShortenLease` is proposed, indicating whether there's another replica that we intend to make the leader after the current lease expires. This is so that clients following the redirect don't proceed to a random node and force it to inadvertently become the leader.

This `ShortenLease` sounds like it will also be useful for draining a node, particularly since Tobi says our `LeaderLease`s are going to become longer than the current 1s when they get... tied... to... Raft ticks(?).

We now have a way to expire leases. We still need a way to (probabilistically) elect a leader of the puppeteer's choosing.
A `GetOrElectLeader(replicaID)` RPC method, which returns the current leader if there's a valid leader lease, or tries to acquire the lease otherwise - through the usual `LeaderLease` raft command. This is to be used for asking a node to become the leader. It's also to be used for `LeaderLease` cache population purposes, when you don't know who the leader is and would also like to favor one particular node to become the leader if there is none (this will be useful for SQL planning, when you need to figure out who the leader is but also have a preference on the subject).

An alternative here is to somehow use the `DistSender` directly to send a read command for the range to the node that you want to elect as leader. The downside is that some `DistSender` methods for talking directly to a particular replica would have to be exposed for testing, and that we'd be doing a read without caring about the actual values, just for its side effect of electing a leader.

So with these two together, a test has the means to try to move leadership at will. It's all probabilistic - there's no guarantee that a rando doesn't grab the lease in between shortening the old lease and acquiring the new one on the desired target, but the test can keep trying until it gets what it wants.
Does this sound sane?
@tschottdorf @bdarnell @cuongdo @RaduBerinde