kv: controlling range leadership #6929
This would just be the existing […]. I think this sounds like the sanest, most non-invasive way to get what you want. Lease shortening also has some general use cases, though it could be a little tricky to get right. Rather than diving into the deep end of this, I think it's worth speccing out what you actually want from this magical test cluster before it turns into a moloch.
What I want out of this […]
I think it is valuable and is going to take enough time to develop that it is […]
Upon further consultation, it seems like we can drop the […]
@vivekmenezes I might put this in an RFC if more discussion follows or if the scope grows somehow. Otherwise I don't know what more to say than these paragraphs. I'm sure the devil is in the details to be seen at implementation time; I wish I understood the code, and indeed these issues, better beforehand, but alas.
The point of an RFC is to document the devilish details before making code […]
Changing leadership seems a lot simpler than picking a leader. Is there a strong need to pick a leader?
Well, what does it mean to change leadership if you can't assign a new one? If the same node gets the lease again, have you changed anything?
I think it's about giving up leadership and not participating in the […]
Leadership is opportunistic, so if a node gives up leadership it'll regain […]

-- Tobias
What is […]? What exactly is the flow you're now proposing? Does the client send […]?
Sorry, there's no […]. And again you're right - there's no […].
So if the old leader must propose the new LeaderLease, what causes that to happen? I'm still not clear on the overall flow.
Right, `GetLeader` would be a read RPC, not a raft command.
ping @andreimatei re: @bdarnell's questions above.
The old leader would propose the new […]
SGTM for tests and orderly shutdown; I think we'll probably want a full RFC once we're thinking about introducing an RPC for use in live clusters.
Allow the leader of a range to propose a LeaderLease request naming another replica as the leader. towards cockroachdb#6929
I started working on a `TestCluster` class that's supposed to wrap a configurable number of `TestServer`s and let tests control how ranges are replicated and who the leader of each one is - a higher-level `MultiTestContext`. It's going to be used for testing DistSQL in general, and in particular a wrapper I'm building over the `LeaderCache` and the `RangeDescriptorCache` for inquiring about the leadership of key spans. And David has tried to use something like this in the past for running more realistic ("multi-node") benchmarks. Seems like a good thing, right? It's gonna be great.

One area where I quickly ran into problems is how controlling the leadership of ranges is going to work (technically, leadership in the sense of holding the `LeaderLease`, not Raft leadership). I would like the `TestCluster` to let the client say that it wants node n1 to become the leader of range r1 - and 1ms later, that it wants n2 to become the leader of that range. It seems we don't currently have a way for a node to release its `LeaderLease`, and we also don't have a clear way to force a node to become the leader when there's no leader.

Besides being necessary for testing, we can imagine that this capability would be useful in the real world when we start talking about collocating ranges (and their leadership).
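To make the desired control concrete, here is a minimal sketch of the kind of surface such a `TestCluster` could expose. All of these names and types are hypothetical illustrations, not actual cockroachdb APIs; the toy implementation just records intent rather than driving real servers:

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for roachpb-style identifiers.
type NodeID int
type RangeID int

// TestCluster is a hypothetical sketch of the wrapper described above:
// it owns a set of test servers and lets a test steer lease holdership
// per range.
type TestCluster struct {
	// leaders maps each range to the node currently holding its lease.
	leaders map[RangeID]NodeID
	nodes   []NodeID
}

func NewTestCluster(numNodes int) *TestCluster {
	tc := &TestCluster{leaders: make(map[RangeID]NodeID)}
	for i := 1; i <= numNodes; i++ {
		tc.nodes = append(tc.nodes, NodeID(i))
	}
	return tc
}

// TransferLease is the operation the issue asks for: make `target` the
// leaseholder of `r`. In a real implementation this would go through the
// lease-shortening and re-election dance discussed below; here we just
// record the intent after validating the target node.
func (tc *TestCluster) TransferLease(r RangeID, target NodeID) error {
	for _, n := range tc.nodes {
		if n == target {
			tc.leaders[r] = target
			return nil
		}
	}
	return errors.New("no such node in cluster")
}

// LeaseHolder reports the current leaseholder of r, if any.
func (tc *TestCluster) LeaseHolder(r RangeID) (NodeID, bool) {
	n, ok := tc.leaders[r]
	return n, ok
}

func main() {
	tc := NewTestCluster(3)
	_ = tc.TransferLease(RangeID(1), NodeID(1))
	_ = tc.TransferLease(RangeID(1), NodeID(2)) // a moment later, move it again
	if n, ok := tc.LeaseHolder(RangeID(1)); ok {
		fmt.Println("leaseholder of r1:", n)
	}
}
```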
I've talked to Tobias a bit about the topic, and here's a plan I want to drop in the pool and see if it makes a splash.
A `ShortenLease(t)` raft command. This command can only be proposed by the current lease holder, and has the effect of moving up the expiration of the current lease to time t. t is chosen as the current clock of the leader (which is higher than the timestamps from its read cache - i.e. the highest timestamp it has already served a read at). Between when the command is proposed and when it's applied, the leader replica will remember this t and refuse to serve reads above it.

The refusal is done by returning a `NotLeaderError` redirecting to the nextLeader. nextLeader is a hint given to the leader replica when `ShortenLease` is proposed, indicating whether there's another replica that we intend to make the leader after the current lease expires. This is so that clients following the redirect don't proceed to a random node and force it to inadvertently become the leader.

This `ShortenLease` sounds like it will also be useful for draining a node, particularly since Tobi says our `LeaderLease`s are going to become longer than the current 1s when they get... tied... to... Raft ticks(?).

We now have a way to expire leases. We still need a way to (probabilistically) elect a leader of the puppeteer's choosing.
A `GetOrElectLeader(replicaID)` RPC method, which returns the current leader if there's a valid leader lease, or tries to acquire the lease otherwise - through the usual `LeaderLease` raft command. This is to be used for asking a node to become the leader. It's also to be used for `LeaderLease` cache population purposes, when you don't know who the leader is and would also like to favor one particular node to become the leader if there is none (this will be useful for SQL planning, when you need to figure out who the leader is but also have a preference on the subject).

An alternative here is to somehow use the `DistSender` directly to send a read command for the range to the node that you want to elect as leader. The downside is that some `DistSender` methods for talking directly to a particular replica would have to be exposed for testing, and that we'd be doing a read without caring about the actual values, just for its side effect of electing a leader.

So with these two together, a test has the means to try to move leadership at will. It's all probabilistic - there's no guarantee that a rando doesn't grab the lease in between shortening the old lease and acquiring the new one on the desired target, but the test can keep trying until it gets what it wants.
Does this sound sane?
@tschottdorf @bdarnell @cuongdo @RaduBerinde