storage: leadership handoff #326
This is discussed in section 3.10 of https://ramcloud.stanford.edu/~ongaro/thesis.pdf. Related: instead of choosing election timeouts from a uniform random distribution, we could skew the distribution according to each node's fitness to lead, so the best leaders are more likely to win elections.
Yes, this is going to be important. Election timeouts sound like a good way to go with standard Raft. But won't the coalesced heartbeats interfere? I think we will want preemptive leader re-election when a follower is receiving a large amount of traffic. Ben, we're not forwarding from followers, correct? Given the machinery we have in place at range replicas that are consensus group leaders, we're going to have to always redirect the client to the leader replica. I guess when we do that, we should include something in the request header that identifies the original, best-latency follower, and keep a little histogram in the leader. The leader could then decide, based on a particularly lopsided histogram, whether to go for a new election.
Yes, coalesced heartbeats will interfere, although we could still skew election timeouts for factors that are global to a node instead of specific to a range (so an underloaded node is more likely to win than an overloaded one). We are forwarding commands from the followers, so the leader will be able to see where client traffic is coming from and decide to step down (assuming the clients are intelligently choosing their lowest-latency replica).
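As a rough illustration of the histogram idea, the leader could count which replica each request originally arrived at and look for a lopsided distribution. Everything here (`originHistogram`, `handoffCandidate`, the `minSamples` and `threshold` parameters) is a hypothetical sketch, not code from the repository.

```go
package main

import "fmt"

// ReplicaID identifies a replica in the range's consensus group (hypothetical type).
type ReplicaID int

// originHistogram counts, per replica, how many client requests first
// arrived at that replica before reaching the leader.
type originHistogram struct {
	counts map[ReplicaID]int
	total  int
}

func newOriginHistogram() *originHistogram {
	return &originHistogram{counts: make(map[ReplicaID]int)}
}

func (h *originHistogram) record(origin ReplicaID) {
	h.counts[origin]++
	h.total++
}

// handoffCandidate returns the replica that received the most traffic, but
// only if it is not the leader itself, at least minSamples requests were
// seen, and that replica accounts for at least threshold of all traffic.
func (h *originHistogram) handoffCandidate(self ReplicaID, minSamples int, threshold float64) (ReplicaID, bool) {
	if h.total < minSamples {
		return 0, false
	}
	best, bestCount := self, -1
	for id, c := range h.counts {
		// Break ties by lowest ID so the result does not depend on map order.
		if c > bestCount || (c == bestCount && id < best) {
			best, bestCount = id, c
		}
	}
	if best == self || float64(bestCount) < threshold*float64(h.total) {
		return 0, false
	}
	return best, true
}

func main() {
	h := newOriginHistogram()
	for i := 0; i < 80; i++ {
		h.record(2)
	}
	for i := 0; i < 20; i++ {
		h.record(1)
	}
	id, ok := h.handoffCandidate(1, 50, 0.75)
	fmt.Println(id, ok)
}
```

A real version would presumably decay or reset the counts over time so that old traffic patterns do not trigger a handoff.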
We're forwarding through Raft? Because commands have to start at store, then go down through the range, then into Raft...
Yes, we're forwarding through Raft. Commands can be proposed on followers, in which case the Raft layer will forward them to the leader (this is something that came for free with the move to etcd/raft).
Well, I suppose it's not even relevant, because the code in range.go won't allow the command to be proposed on a non-leader replica anyway. The redirect will happen at a level above Raft. But what that means is that the optimization @tschottdorf was considering, from the paper cited in the quorum leases paper, is also going to be inapplicable.
Maybe the code in range.go should be changed. A node can lose its leadership at any time, so even if we pass an IsLeader check we still have to either allow for forwarding at the Raft level or fail due to lost leadership. Given that, why not just allow commands to be proposed on the follower, and return the current leader as a redirection hint with the response?
The latency of forwarding to the leader, for a response you mostly already know. (In reply to Ben Darnell's comment of Wednesday, February 18, 2015.)
Why do you mostly already know the answer (answer to what)? I'm not sure what exactly you're referring to as a common case, but with the redirection hint in the response clients should converge on talking to the leaders most of the time.
@bdarnell @spencerkimball @tschottdorf I think someone has tried this on our Raft implementation (handing off leadership when there is a GC or something). I can look it up later.
I meant the answer to who the current leader is. Assume a request arrives at a range replica which happens to be a follower. That follower usually knows (or thinks it knows) which replica is the leader, so it can just return that immediately as a redirect. If the request then arrives at the specified replica and it isn't the leader, it again redirects. If a replica doesn't know the leader, then the client can be instructed to back off and retry. I'm saying that forwarding to the leader is a pretty high-latency operation. If a replica already believes it's not the leader, why not just immediately return the location of the replica it thinks is the leader?
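The redirect scheme described above can be sketched from the client's side as a small loop. This is a toy simulation under stated assumptions: `replica`, `reply`, `sendWithRedirects`, and the `backoff` callback are hypothetical names, and each replica's view of the leader is a fixed field rather than live Raft state.

```go
package main

import (
	"errors"
	"fmt"
)

type ReplicaID int

const unknownLeader ReplicaID = 0 // sentinel: replica has no idea who leads

// replica is a toy stand-in for a range replica and its view of the leader.
type replica struct {
	id             ReplicaID
	believedLeader ReplicaID
}

// reply either reports that the request was served or carries a leader hint.
type reply struct {
	served     bool
	leaderHint ReplicaID
}

func (r *replica) handle() reply {
	if r.believedLeader == r.id {
		return reply{served: true, leaderHint: r.id}
	}
	return reply{leaderHint: r.believedLeader} // immediate redirect, no forwarding
}

var errNoLeader = errors.New("no leader found within attempt budget")

// sendWithRedirects follows leader hints until a replica serves the request.
// When a replica does not know the leader, it calls backoff and retries the
// same replica. It returns the serving replica and the attempts used.
func sendWithRedirects(cluster map[ReplicaID]*replica, start ReplicaID,
	maxAttempts int, backoff func(attempt int)) (ReplicaID, int, error) {
	target := start
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		r, ok := cluster[target]
		if !ok {
			return 0, attempt, fmt.Errorf("unknown replica %d", target)
		}
		rep := r.handle()
		switch {
		case rep.served:
			return target, attempt, nil
		case rep.leaderHint == unknownLeader:
			backoff(attempt)
		default:
			target = rep.leaderHint
		}
	}
	return 0, maxAttempts, errNoLeader
}

func main() {
	// Replica 1 has a stale view (thinks 2 leads); replica 3 is the leader.
	cluster := map[ReplicaID]*replica{
		1: {id: 1, believedLeader: 2},
		2: {id: 2, believedLeader: 3},
		3: {id: 3, believedLeader: 3},
	}
	leader, attempts, err := sendWithRedirects(cluster, 1, 5, func(int) {})
	fmt.Println(leader, attempts, err)
}
```

A client would cache the replica that finally served the request, so only the first request for a range pays for the extra hops.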
Forwarding to the leader is high latency, but so is returning to the client and having the client re-submit to the leader. If we assume that clients start by talking to their nearest replica (and that that nearest replica is not necessarily in the same DC as the client), then forwarding to the leader is likely to be lower latency because the client is no closer to the leader than the follower is. In any case, whether we redirect before handling the original request or forward and attach a redirect hint for the next request, I don't think it makes that much difference because it only affects the client's first request for a given range. Once the client's location cache has warmed up it should be able to find the leaders more reliably. The advantage of forwarding actually has to do with the corner cases of coalesced heartbeats (#315). The client will find out the current leader regardless of what we do here, but by forwarding we allow the server to learn if its view of the leader is out of date. This can break up the deadlocks we're concerned about in that thread.
Some nodes may be better suited to be leader of a Raft group than others, for instance because it
The leader should periodically consider stepping down, based on appropriate criteria. That should boil down to sending a special Raft message to the suggested new leader, which then starts an election.
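A minimal sketch of that handoff, assuming a hypothetical `handoffMsg` and toy `node` type (this is not the etcd/raft API): the leader only sends the message once the target's log has caught up, and the target reacts by becoming a candidate in a new term without waiting for its election timeout.

```go
package main

import "fmt"

type role int

const (
	follower role = iota
	candidate
	leader
)

// node is a toy Raft participant holding only the state this sketch needs.
type node struct {
	id        int
	term      uint64
	role      role
	lastIndex uint64 // index of the last entry in this node's log
}

// handoffMsg is the hypothetical "start an election now" message.
type handoffMsg struct {
	from int
	term uint64
}

// handOff is called on the leader. targetMatchIndex is the highest log index
// the leader knows to be replicated on the target. The message is produced
// only if the target is fully caught up, so it can win the election.
func (n *node) handOff(targetMatchIndex uint64) (handoffMsg, bool) {
	if n.role != leader || targetMatchIndex < n.lastIndex {
		return handoffMsg{}, false
	}
	return handoffMsg{from: n.id, term: n.term}, true
}

// onHandoff is called on the target. It bumps the term and becomes a
// candidate immediately; a real implementation would then request votes.
func (n *node) onHandoff(m handoffMsg) bool {
	if m.term < n.term || n.role == leader {
		return false // stale message, or nothing to do
	}
	n.term = m.term + 1
	n.role = candidate
	return true
}

func main() {
	old := &node{id: 1, term: 7, role: leader, lastIndex: 100}
	next := &node{id: 2, term: 7, role: follower, lastIndex: 100}

	if msg, ok := old.handOff(next.lastIndex); ok {
		started := next.onHandoff(msg)
		fmt.Println("election started:", started, "new term:", next.term)
	}
}
```

The old leader steps down through the normal Raft rules once it sees the candidate's higher term.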