-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Transfer leadership when establishLeadership fails #5247
Conversation
This PR is implementing the leadership transfer extension described in the thesis chap 3.10. Background: Consul is performing some setup after acquiring leadership. It is possible that the setup fails, but there is no good way to step down as a leader. It is possible to use DemoteVoter as show in hashicorp/consul#5247, but this is suboptimal because it relies on Consul's autopilot to promote the old leader to a voter again. Since there is a perfectly fine way described in the thesis: leadership transfer extension, we decided to implement that instead. Doing it this way also helps other teams, since it is more generic. The necessary steps to perform are: 1. Leader picks target to transition to 2. Leader stops accepting client requests 3. Leader makes sure to replicate logs to the target 4. Leader sends TimeoutNow RPC request 5. Target receives TimeoutNow request, which triggers an election 6a. If the election is successful, a message with the new term will make the old leader step down 6b. if after electiontimeout the leadership transfer did not complete, the old leader resumes operation Resources: https://github.com/etcd-io/etcd/tree/master/raft
833070e
to
b8e9fa5
Compare
5a706d7
to
018d3e5
Compare
Level: hclog.LevelFromString(s.config.LogLevel), | ||
Output: s.config.LogOutput, | ||
}) | ||
s.config.RaftConfig.Logger = raftLogger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raft logging changed and we need to provide the hclogger now.
agent/consul/leader.go
Outdated
@@ -178,7 +195,7 @@ RECONCILE: | |||
if err := s.revokeLeadership(); err != nil { | |||
s.logger.Printf("[ERR] consul: failed to revoke leadership: %v", err) | |||
} | |||
goto WAIT | |||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the first usecase: we couldn't establish leadership and instead of retrying it after interval
in WAIT
, an error is returned which leads to a raft leadership transfer.
agent/consul/leader.go
Outdated
err := reassert() | ||
errCh <- err | ||
if err != nil { | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the second usecase: we couldn't reassert and instead of retrying it after interval
, an error is returned which leads to a raft leadership transfer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, when does this happen? I don't recall the specifics and wonder if by failing and stepping down we might end up causing new leadership stability issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines 245 to 246 in 09a1a32
// reassertLeaderCh is used to signal the leader loop should re-run | |
// leadership actions after a snapshot restore. |
This happens after snapshot restore. Since the agent revokes leadership and immediately tries to establish it again, there is the possibility that it fails. When it does we are in the same situation as above - raft thinks this agent is a leader, but consul disagrees.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just have a couple questions inline 🔍
88b9490
to
1534af3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about these Hans but just wondering if this solves the problem fully vs just making it more likely to get out of the bad state.
Please let me know if I'm missing something though - it's all pretty subtle!
agent/consul/leader.go
Outdated
err := reassert() | ||
errCh <- err | ||
if err != nil { | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, when does this happen? I don't recall the specifics and wonder if by failing and stepping down we might end up causing new leadership stability issues?
agent/consul/leader.go
Outdated
s.leaderLoop(ch) | ||
err := s.leaderLoop(ch) | ||
if err != nil { | ||
s.leadershipTransfer() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if leadership transfer fails? We exit the leader loop anyway and no longer act as the leader, but does raft still think we are the leader? That seems like a bad case to be in - roughly the same as the bug this is meant to be fixing although we at least made an attempt at telling raft we didn't want to be leader any more.
I can think of a couple of possible options here:
- Make a method in Raft lib that allows the leader to "StepDown" - it will stop running heartbeats etc. but not stop being a voter - basically, force a leader into follower state.
- Keep trying the leadership Transfer indefinitely because the cluster is broken until it works anyway.
- Keep retrying for a limited length of time and then crash the whole process to force a step down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a couple of changes to my PR, so that leaderLoop
is only left if leadershipTransfer
was successful. I think that mitigates the issues you are talking about.
What happens now is that in case leadershipTransfer
fails, the agent stays in leaderLoop
and waits until ReconcileInterval
to retry to establishLeadership
(and to transferLeadership
in case it fails).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is OK although ReconcileInterval
is 60 seconds which seems a lot to wait around for when we know the cluster is down. 🤔
Should we instead consider looping indefinitely retrying every 5 seconds or something?
I think this is way better than before and eventually it should recover so it's good just wondering if it's easy to make that better quicker?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reset the interval
to 5 seconds when transfer leadership fails so that it retries establishLeadership again faster than before.
@i0rek could you also double check how the |
// leader, but consul disagrees. | ||
if err != nil { | ||
if err := s.leadershipTransfer(); err != nil { | ||
goto WAIT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So now we attempt to transfer 3 times but if it fails we still hang out in non-leader limbo land for a bit before retrying?
I guess this is what i mentioned as "retry indefinitiely" and it should really work immediately if rest of the cluster is in an OK state so I think this is good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking was that since leadershipTransfer failed, we try to establishLeadership again. I think in general establishLeadership is more likely to succeed than transferLeadership. I think if I make the interval smaller - like 5 seconds - before it retries it is the better solution than trying to transfer leadership indefinitely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I'll leave it at request changes to make sure we don't forget to check out the log formatting for raft before merge.
@banks I checked the logs and fixed the timestamps, but sometimes there are two whitespaces before raft: master:
pr:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
// establishedLeader needs to be set to | ||
// false. | ||
establishedLeader = false | ||
interval = time.After(5 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is OK because we'll wait for 5 seconds in next wait loop and then timeout and goto RECONCILE
. Right after that label we will hit interval := time.After(s.config.ReconcileInterval)
again which presumably will set the interval back again.
I must admit I'm a little confused about how :=
works after a GOTO - it's reassigning the same variable name but in the same scope on subsequent jumps? I wonder if this is some strange variant of variable shadowing even though they are in the same scope? Maybe Go just has a special case to allow this when using GOTO but not in serial code? If it works I guess it's fine 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what you mean. WAIT
is well after the RECONCILE
and the interval
variable declaration and the code should just be using the same variable.
I reproed your question in a playground: https://play.golang.org/p/6AqssHXg3Wt. If you jump before a variable declaration golang will create a new variable. Or did you mean something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I did a similar repro and it's fine, just was a strange one.
The path here is:
- we set interval to be a timer chan that will go off in 5 seconds
- we
goto WAIT
which enters a select on a few things including that chan - when the chan fires, that select branch does
goto RECONCILE
which immeidately re-assigns a timer chan for the original ReconcileInterval (using:=
).
My original concern was that we might end up regaining leadership and then doing reconcile every 5 seconds after that but it's not the case due to the path mentioned above.
It also occurs to me that we always have had a re-assignment after goto RECONCILE
so it's not really any different than before, it's just that was the only assignment before and I wondered if some strange form of shadowing might cause issues. That appears not to be the case so I think this is fine!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For fun - Go does create a new variable with the same name.
You can see that here: https://play.golang.org/p/FU0ZxictDXE capturing the variable in a lambda and then looping with GOTO leaves the lamda holding the original value not the redefined one after the GOTO jump.
It's just weird to me because it's in the same scope - shadowing across scopes seems fine but this seems to be a special case you can't normally do outside of GOTO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow, now I see what you mean.
@silenceper Looks like you are correct. One of the other engineers has been working on a circle ci check to ensure we aren't accidentally doing this (which appears to be a common mistake for us). |
This PR is implementing the leadership transfer extension described in the thesis chap 3.10. Background: Consul is performing some setup after acquiring leadership. It is possible that the setup fails, but there is no good way to step down as a leader. It is possible to use DemoteVoter as show in hashicorp/consul#5247, but this is suboptimal because it relies on Consul's autopilot to promote the old leader to a voter again. Since there is a perfectly fine way described in the thesis: leadership transfer extension, we decided to implement that instead. Doing it this way also helps other teams, since it is more generic. The necessary steps to perform are: 1. Leader picks target to transition to 2. Leader stops accepting client requests 3. Leader makes sure to replicate logs to the target 4. Leader sends TimeoutNow RPC request 5. Target receives TimeoutNow request, which triggers an election 6a. If the election is successful, a message with the new term will make the old leader step down 6b. if after electiontimeout the leadership transfer did not complete, the old leader resumes operation Resources: https://github.com/etcd-io/etcd/tree/master/raft
Fixes #5047. I am throwing this out here to have a place to discuss the approaches.
TLDR; transition leadership is now implemented in raft which is being used in this PR.
Background:
Leadership is established in raft and there is basically no way to influence that from consul. And there has been no need to. But when we setup leadership in consul in
establishLeadership
things can go wrong and in the past we retried to fix it. This is problematic since even though raft considers this server the leader, it is not fully prepared! Consistent reads for example are not possible.Abandoned Ideas:
revokeLeadership
: that tears down some consul things, but has no influence on raft leadershipleaderLoop
: that resets the consul side of things, but doesn't change the raft leadershipSolutions:
DemoteVoter
as the leader, will make this server step down by marking the current leader as nonvoter. The same server is promoted to a voter afterwards again by autopilot from the new leader:consul/agent/consul/autopilot/autopilot.go
Lines 304 to 329 in 884b2e0
Concerns with demotevoter:
establishLeadership
would have to fail often during that process to make this scenario possible. Without this fix though, we would be stuck with a leader that is not performing all leader tasks, without a chance of recovery.Concerns with leadership transfer:
none
This PR uses the leadership transfer feature. It requires a revendoring though and fails because of it.