storage: leadership handoff #326

Closed
tbg opened this issue Feb 17, 2015 · 13 comments

Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@tbg
Member

tbg commented Feb 17, 2015

Some nodes may be better suited to be leader of a Raft group than others, for instance because a node

  • receives a lot of client traffic
  • has good latencies to the other nodes
  • simply isn't as bad as the current leader in some other way.

The leader should periodically consider stepping down, based on appropriate criteria. That should boil down to sending a special Raft message to the suggested new leader, which will then start an election.
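For reference, etcd/raft (which the thread below notes this codebase moved to) exposes exactly this handoff as Node.TransferLeadership: the leader sends MsgTimeoutNow to the transferee once it is caught up, and the transferee campaigns immediately. A minimal sketch of the periodic check; the fitness score, peer list, and margin are hypothetical placeholders, and the import path assumes the current standalone raft module:

```go
// Hedged sketch of the periodic step-down check described above, using
// etcd/raft's leadership-transfer primitive. The fitness scoring and the
// peer list are hypothetical placeholders, not CockroachDB code.
package handoff

import (
	"context"

	"go.etcd.io/raft/v3"
)

// score is a hypothetical fitness metric: higher means better suited to
// lead (more local client traffic, lower latency to the other replicas).
type score func(replicaID uint64) float64

// maybeTransferLeadership asks raft to hand off leadership if some peer
// scores meaningfully better than the current leader.
func maybeTransferLeadership(ctx context.Context, n raft.Node, self uint64, peers []uint64, fitness score, margin float64) {
	best, bestScore := self, fitness(self)
	for _, p := range peers {
		if s := fitness(p); s > bestScore {
			best, bestScore = p, s
		}
	}
	if best != self && bestScore-fitness(self) > margin {
		// etcd/raft sends MsgTimeoutNow to the transferee once it is
		// caught up; the transferee then campaigns immediately.
		n.TransferLeadership(ctx, self, best)
	}
}
```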

@tbg tbg changed the title Raft: Smart leader election Raft: Leadership handoff Feb 17, 2015
@bdarnell
Contributor

This is discussed in section 3.10 of https://ramcloud.stanford.edu/~ongaro/thesis.pdf

Related: instead of choosing election timeouts from a uniform random distribution, we could skew the results according to the nodes' fitness to lead, so the best leaders are more likely to win elections.
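A minimal sketch of that skew, assuming a node-global fitness score in [0, 1]; the base timeout and window are illustrative, and etcd/raft's randomization is internal to the library, so this only illustrates the idea rather than plugging into it:

```go
// Hedged sketch of skewing election timeouts by fitness, so that nodes
// better suited to lead tend to time out (and campaign) first. The base
// timeout, window, and the [0,1] fitness score are illustrative values.
package handoff

import (
	"math/rand"
	"time"
)

const (
	baseElectionTimeout = 150 * time.Millisecond // minimum timeout
	electionWindow      = 150 * time.Millisecond // random spread on top
)

// skewedElectionTimeout shrinks the random window for fitter nodes:
// fitness 1.0 draws from [base, base+window/2), fitness 0.0 from
// [base, base+window), so fitter nodes are more likely to campaign first.
func skewedElectionTimeout(fitness float64) time.Duration {
	window := float64(electionWindow) * (1 - 0.5*fitness)
	return baseElectionTimeout + time.Duration(rand.Float64()*window)
}
```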

@spencerkimball
Member

Yes, this is going to be important. Election timeouts sound like a good way to go with standard Raft, but won't the coalesced heartbeats interfere? I think we will want preemptive leader re-election when there's a large amount of traffic to a follower. Ben, we're not forwarding from followers, correct? Given the machinery we have in place at range replicas that are consensus group leaders, we're going to have to always redirect the client to the leader replica. I guess when we do that, we should include something in the request header which identifies the original, best-latency follower and keep a little histogram in the leader. The leader could then decide, based on a particularly lopsided histogram, whether to go for a new election.
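A rough sketch of that histogram idea; the tally structure, the "preferred replica" request-header field, and the thresholds are all hypothetical, not actual range code:

```go
// Hedged sketch of the lopsided-traffic check: the leader tallies which
// replica each client would have preferred (a hypothetical field in the
// request header) and suggests a handoff when another replica dominates.
// Names and thresholds are illustrative.
package handoff

type trafficTally struct {
	counts map[uint64]int // replica ID -> requests preferring that replica
	total  int
}

func (t *trafficTally) record(preferredReplica uint64) {
	if t.counts == nil {
		t.counts = make(map[uint64]int)
	}
	t.counts[preferredReplica]++
	t.total++
}

// suggestTransferee returns a replica that accounts for more than the
// given share of recent traffic, or 0 if no handoff looks worthwhile.
func (t *trafficTally) suggestTransferee(self uint64, minSamples int, share float64) uint64 {
	if t.total < minSamples {
		return 0
	}
	for replica, n := range t.counts {
		if replica != self && float64(n) > share*float64(t.total) {
			return replica
		}
	}
	return 0
}
```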

@bdarnell
Contributor

Yes, coalesced heartbeats will interfere, although we could still skew election timeouts for factors that are global to a node instead of specific to a range (so an underloaded node is more likely to win than an overloaded one).

We are forwarding commands from the followers, so the leader will be able to see where client traffic is coming from and decide to step down (assuming the clients are intelligently choosing their lowest-latency replica).

@spencerkimball
Member

We're forwarding through Raft? Because commands have to start at the store, then go down through the range, then into Raft...

@bdarnell
Contributor

Yes, we're forwarding through raft. Commands can be proposed on followers, in which case the raft layer will forward them to the leader (this is something that came for free with the move to etcd/raft).

@spencerkimball
Member

Well, I suppose it's not even relevant, because the code in range.go won't allow the command to be proposed on a non-leader replica anyway. The redirect will happen at a level above Raft. But that means the optimization @tschottdorf was considering, from the paper cited in the quorum leases paper, is also going to be inapplicable.

@bdarnell
Contributor

Maybe the code in range.go should be changed. A node can lose its leadership at any time, so even if we pass an IsLeader check we still have to either allow for forwarding at the raft level or fail due to lost leadership. Given that, why not just allow commands to be proposed on the follower, and return the current leader as a redirection hint with the response?
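A hedged sketch of that "propose anywhere, hint in the response" approach; the types and method names below are hypothetical, not the actual range.go code:

```go
// Hedged sketch of "propose anywhere, return the leader as a hint": the
// replica proposes regardless of leadership and attaches its current
// notion of the leader to the response, so the client can go straight to
// the leader next time. All types here are illustrative.
package handoff

import "context"

type response struct {
	// leaderHint, if nonzero, tells the client which replica to contact
	// directly for subsequent requests to this range.
	leaderHint uint64
}

type rangeReplica interface {
	propose(ctx context.Context, cmd []byte) error // raft forwards if we are a follower
	replicaID() uint64
	leaderID() uint64 // 0 if unknown
}

func execute(ctx context.Context, r rangeReplica, cmd []byte) (response, error) {
	if err := r.propose(ctx, cmd); err != nil {
		return response{}, err
	}
	var resp response
	if lead := r.leaderID(); lead != 0 && lead != r.replicaID() {
		resp.leaderHint = lead
	}
	return resp, nil
}
```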

@cockroach-team

Forwarding to the leader for a response you mostly already know the answer to sounds daunting in a wide-area cluster. This is going to be a pretty common case.


@bdarnell
Contributor

Why do you mostly already know the answer (answer to what)? I'm not sure what exactly you're referring to as a common case, but with the redirection hint in the response clients should converge on talking to the leaders most of the time.

@xiang90
Contributor

xiang90 commented Feb 19, 2015

@bdarnell @spencerkimball @tschottdorf

I think someone has tried this on our raft implementation (handing off leadership when there is a GC or something). I can dig it up later.

@spencerkimball
Member

I meant the answer to who the current leader is. Assume a request arrives at a range replica which happens to be a follower. That follower usually knows (or thinks it knows) which replica is the leader, so it can just return that immediately as a redirect. If the request then arrives at the specified replica and it isn't the leader, it again redirects. If a replica doesn't know the leader, then the client can be instructed to back off and retry.

I'm saying that forwarding to the leader is a pretty high latency operation. If a replica already believes it's not the leader, why not just immediately return the location of the replica it thinks is the leader?
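A hedged sketch of that immediate-redirect path, with an illustrative NotLeaderError-style type returned above Raft; the names and fields are hypothetical:

```go
// Hedged sketch of the immediate redirect: a follower rejects the command
// up front, returning whoever it believes is the leader, and tells the
// client to back off and retry if it has no idea. The error type is
// illustrative, not the project's actual error.
package handoff

import "fmt"

type notLeaderError struct {
	leader uint64 // replica believed to be leading; 0 if unknown
}

func (e *notLeaderError) Error() string {
	if e.leader == 0 {
		return "replica is not the leader and does not know who is; back off and retry"
	}
	return fmt.Sprintf("replica is not the leader; retry against replica %d", e.leader)
}

// checkLeadership is called before proposing: nil means "we think we
// lead, propose locally"; otherwise the caller returns the redirect.
func checkLeadership(self, believedLeader uint64) error {
	if believedLeader == self {
		return nil
	}
	return &notLeaderError{leader: believedLeader}
}
```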

@bdarnell
Contributor

Forwarding to the leader is high latency, but so is returning to the client and having the client re-submit to the leader. If we assume that clients start by talking to their nearest replica (and that that nearest replica is not necessarily in the same DC as the client), then forwarding to the leader is likely to be lower latency because the client is no closer to the leader than the follower is.

In any case, whether we redirect before handling the original request or forward and attach a redirect hint for the next request, I don't think it makes that much difference because it only affects the client's first request for a given range. Once the client's location cache has warmed up it should be able to find the leaders more reliably.

The advantage of forwarding actually has to do with the corner cases of coalesced heartbeats (#315). The client will find out the current leader regardless of what we do here, but by forwarding we allow the server to learn if its view of the leader is out of date. This can break up the deadlocks we're concerned about in that thread.

@tamird tamird added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 22, 2015
@petermattis petermattis modified the milestone: 1.0 Feb 14, 2016
@petermattis petermattis changed the title Raft: Leadership handoff storage: leadership handoff Mar 31, 2016
@tbg
Member Author

tbg commented Sep 26, 2016

See #9462 and #9465. Since this is at this point an ancient issue and the discussion has moved elsewhere, I'm going to close this.

@tbg tbg closed this as completed Sep 26, 2016