storage: leadership handoff #326

Closed
tbg opened this issue Feb 17, 2015 · 13 comments

Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@tbg
Member

tbg commented Feb 17, 2015

Some nodes may be better suited to be leader of a Raft group than others, for instance because a node

  • receives a lot of client traffic
  • has good latencies to the other nodes
  • simply isn't as bad as the current leader in some other way.

The leader should periodically consider stepping down, based on appropriate criteria. That should boil down to sending a special Raft message to the suggested new leader, which will then start an election.
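For reference, etcd/raft (which the thread below notes this codebase moved to) exposes exactly this handoff as Node.TransferLeadership: the leader sends MsgTimeoutNow to the transferee once it is caught up, and the transferee campaigns immediately. A minimal sketch of the periodic check; the fitness score, peer list, and margin are hypothetical placeholders, and the import path assumes the current standalone raft module:

```go
// Hedged sketch of the periodic step-down check described above, using
// etcd/raft's leadership-transfer primitive. The fitness scoring and the
// peer list are hypothetical placeholders, not CockroachDB code.
package handoff

import (
	"context"

	"go.etcd.io/raft/v3"
)

// score is a hypothetical fitness metric: higher means better suited to
// lead (more local client traffic, lower latency to the other replicas).
type score func(replicaID uint64) float64

// maybeTransferLeadership asks raft to hand off leadership if some peer
// scores meaningfully better than the current leader.
func maybeTransferLeadership(ctx context.Context, n raft.Node, self uint64, peers []uint64, fitness score, margin float64) {
	best, bestScore := self, fitness(self)
	for _, p := range peers {
		if s := fitness(p); s > bestScore {
			best, bestScore = p, s
		}
	}
	if best != self && bestScore-fitness(self) > margin {
		// etcd/raft sends MsgTimeoutNow to the transferee once it is
		// caught up; the transferee then campaigns immediately.
		n.TransferLeadership(ctx, self, best)
	}
}
```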

@tbg tbg changed the title Raft: Smart leader election Raft: Leadership handoff Feb 17, 2015
@bdarnell
Contributor

This is discussed in section 3.10 of https://ramcloud.stanford.edu/~ongaro/thesis.pdf

Related: instead of choosing election timeouts from a uniform random distribution, we could skew the results according to the nodes' fitness to lead, so the best leaders are more likely to win elections.
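A minimal sketch of that skew, assuming a node-global fitness score in [0, 1]; the base timeout and window are illustrative, and etcd/raft's randomization is internal to the library, so this only illustrates the idea rather than plugging into it:

```go
// Hedged sketch of skewing election timeouts by fitness, so that nodes
// better suited to lead tend to time out (and campaign) first. The base
// timeout, window, and the [0,1] fitness score are illustrative values.
package handoff

import (
	"math/rand"
	"time"
)

const (
	baseElectionTimeout = 150 * time.Millisecond // minimum timeout
	electionWindow      = 150 * time.Millisecond // random spread on top
)

// skewedElectionTimeout shrinks the random window for fitter nodes:
// fitness 1.0 draws from [base, base+window/2), fitness 0.0 from
// [base, base+window), so fitter nodes are more likely to campaign first.
func skewedElectionTimeout(fitness float64) time.Duration {
	window := float64(electionWindow) * (1 - 0.5*fitness)
	return baseElectionTimeout + time.Duration(rand.Float64()*window)
}
```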

@spencerkimball
Member

Yes, this is going to be important. Election timeouts sound like a good way to go with standard Raft, but won't the coalesced heartbeats interfere? I think we will want preemptive leader re-election when there's a large amount of traffic to a follower. Ben, we're not forwarding from followers, correct? Given the machinery we have in place at range replicas that are consensus group leaders, we're going to have to always redirect the client to the leader replica. I guess when we do that, we should include something in the request header which identifies the original, best-latency follower and keep a little histogram in the leader. The leader could then decide, based on a particularly lopsided histogram, whether to go for a new election.
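A rough sketch of that histogram idea; the tally structure, the "preferred replica" request-header field, and the thresholds are all hypothetical, not actual range code:

```go
// Hedged sketch of the lopsided-traffic check: the leader tallies which
// replica each client would have preferred (a hypothetical field in the
// request header) and suggests a handoff when another replica dominates.
// Names and thresholds are illustrative.
package handoff

type trafficTally struct {
	counts map[uint64]int // replica ID -> requests preferring that replica
	total  int
}

func (t *trafficTally) record(preferredReplica uint64) {
	if t.counts == nil {
		t.counts = make(map[uint64]int)
	}
	t.counts[preferredReplica]++
	t.total++
}

// suggestTransferee returns a replica that accounts for more than the
// given share of recent traffic, or 0 if no handoff looks worthwhile.
func (t *trafficTally) suggestTransferee(self uint64, minSamples int, share float64) uint64 {
	if t.total < minSamples {
		return 0
	}
	for replica, n := range t.counts {
		if replica != self && float64(n) > share*float64(t.total) {
			return replica
		}
	}
	return 0
}
```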

@bdarnell
Contributor

Yes, coalesced heartbeats will interfere, although we could still skew election timeouts for factors that are global to a node instead of specific to a range (so an underloaded node is more likely to win than an overloaded one).

We are forwarding commands from the followers, so the leader will be able to see where client traffic is coming from and decide to step down (assuming the clients are intelligently choosing their lowest-latency replica).

@spencerkimball
Member

We're forwarding through Raft? Because commands have to start at the store, then go down through the range, then into Raft...

@bdarnell
Contributor

Yes, we're forwarding through raft. Commands can be proposed on followers, in which case the raft layer will forward them to the leader (this is something that came for free with the move to etcd/raft).

@spencerkimball
Member

Well, I suppose it's not even relevant, because the code in range.go won't allow the command to be proposed on a non-leader replica anyway. The redirect will happen at a level above Raft. But that means the optimization @tschottdorf was considering, from the paper cited in the quorum leases paper, is also going to be inapplicable.

@bdarnell
Contributor

Maybe the code in range.go should be changed. A node can lose its leadership at any time, so even if we pass an IsLeader check we still have to either allow for forwarding at the raft level or fail due to lost leadership. Given that, why not just allow commands to be proposed on the follower, and return the current leader as a redirection hint with the response?
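A hedged sketch of that "propose anywhere, hint in the response" approach; the types and method names below are hypothetical, not the actual range.go code:

```go
// Hedged sketch of "propose anywhere, return the leader as a hint": the
// replica proposes regardless of leadership and attaches its current
// notion of the leader to the response, so the client can go straight to
// the leader next time. All types here are illustrative.
package handoff

import "context"

type response struct {
	// leaderHint, if nonzero, tells the client which replica to contact
	// directly for subsequent requests to this range.
	leaderHint uint64
}

type rangeReplica interface {
	propose(ctx context.Context, cmd []byte) error // raft forwards if we are a follower
	replicaID() uint64
	leaderID() uint64 // 0 if unknown
}

func execute(ctx context.Context, r rangeReplica, cmd []byte) (response, error) {
	if err := r.propose(ctx, cmd); err != nil {
		return response{}, err
	}
	var resp response
	if lead := r.leaderID(); lead != 0 && lead != r.replicaID() {
		resp.leaderHint = lead
	}
	return resp, nil
}
```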

@cockroach-team

Forwarding to the leader for a response you mostly already know the answer to sounds daunting in a wide-area cluster. This is going to be a pretty common case.


@bdarnell
Contributor

Why do you mostly already know the answer (answer to what)? I'm not sure what exactly you're referring to as a common case, but with the redirection hint in the response clients should converge on talking to the leaders most of the time.

@xiang90
Contributor

xiang90 commented Feb 19, 2015

@bdarnell @spencerkimball @tschottdorf

I think someone has tried this on our raft implementation (handing off leadership when there is a GC or something). I can dig it up later.

@spencerkimball
Member

I meant the answer to who the current leader is. Assume a request arrives at a range replica which happens to be a follower. That follower usually knows (or thinks it knows) which replica is the leader, so it can just return that immediately as a redirect. If the request then arrives at the specified replica and it isn't the leader, it again redirects. If a replica doesn't know the leader, then the client can be instructed to back off and retry.

I'm saying that forwarding to the leader is a pretty high latency operation. If a replica already believes it's not the leader, why not just immediately return the location of the replica it thinks is the leader?
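A hedged sketch of that immediate-redirect path, with an illustrative NotLeaderError-style type returned above Raft; the names and fields are hypothetical:

```go
// Hedged sketch of the immediate redirect: a follower rejects the command
// up front, returning whoever it believes is the leader, and tells the
// client to back off and retry if it has no idea. The error type is
// illustrative, not the project's actual error.
package handoff

import "fmt"

type notLeaderError struct {
	leader uint64 // replica believed to be leading; 0 if unknown
}

func (e *notLeaderError) Error() string {
	if e.leader == 0 {
		return "replica is not the leader and does not know who is; back off and retry"
	}
	return fmt.Sprintf("replica is not the leader; retry against replica %d", e.leader)
}

// checkLeadership is called before proposing: nil means "we think we
// lead, propose locally"; otherwise the caller returns the redirect.
func checkLeadership(self, believedLeader uint64) error {
	if believedLeader == self {
		return nil
	}
	return &notLeaderError{leader: believedLeader}
}
```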

@bdarnell
Contributor

Forwarding to the leader is high latency, but so is returning to the client and having the client re-submit to the leader. If we assume that clients start by talking to their nearest replica (and that that nearest replica is not necessarily in the same DC as the client), then forwarding to the leader is likely to be lower latency because the client is no closer to the leader than the follower is.

In any case, whether we redirect before handling the original request or forward and attach a redirect hint for the next request, I don't think it makes that much difference because it only affects the client's first request for a given range. Once the client's location cache has warmed up it should be able to find the leaders more reliably.

The advantage of forwarding actually has to do with the corner cases of coalesced heartbeats (#315). The client will find out the current leader regardless of what we do here, but by forwarding we allow the server to learn if its view of the leader is out of date. This can break up the deadlocks we're concerned about in that thread.

@tamird tamird added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 22, 2015
@petermattis petermattis modified the milestone: 1.0 Feb 14, 2016
@petermattis petermattis changed the title Raft: Leadership handoff storage: leadership handoff Mar 31, 2016
@tbg
Member Author

tbg commented Sep 26, 2016

See #9462 and #9465. Since this is at this point an ancient issue and the discussion has moved elsewhere, I'm going to close this.

@tbg tbg closed this as completed Sep 26, 2016