If leader can't load snapshot cluster won't recover #522
Comments
Hi @benbuzbee, I'm not persuaded this is something that ought to be handled in the raft library itself. Moreover, the log you cite ("error waiting for Raft index") doesn't look like something from the library, but from Nomad, so it may be that what you're experiencing isn't purely a raft issue. I suggest you file this proposal as an issue on the https://github.com/hashicorp/nomad repo, and the maintainers of that project can decide whether it's better addressed in Nomad or here in the raft library.
If that is where you think this best lives. I suggested it here largely because this is where the leader health heartbeating lives, and the failure to load snapshots comes from raft's file_snapshot.go. Offhand I am not sure where the retry loop lives, but I suspect it is in raft. Does Nomad actually have what it needs to detect that raft is failing to load the snapshot, abort the retries, and modify the cluster?
Hi @benbuzbee, I retract what I said earlier: I agree with your original statement.
Possible fix: in replicateTo, if we can't load a snapshot, we should step down as leader. The current code specifically doesn't stop replication for this error; it probably should, but there are likely other details we need to consider here.
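To make the proposal concrete, here is a minimal, self-contained sketch of the idea. It is a toy model, not the library's code: the `leader` type, the `loadSnapshot` hook, `replicateOnce`, and `errSnapshotUnreadable` are all hypothetical names invented for illustration, and it steps down on any load error, whereas a real fix would probably need to distinguish transient failures from permanent ones.

```go
package main

import (
	"errors"
	"fmt"
)

// errSnapshotUnreadable is a hypothetical sentinel error used only in this sketch.
var errSnapshotUnreadable = errors.New("snapshot unreadable")

// leader is a toy stand-in for a Raft leader; it is not the raft library's type.
type leader struct {
	isLeader     bool
	loadSnapshot func() error // hypothetical hook that opens the latest snapshot
}

// replicateOnce models one replication attempt to a follower that needs a
// snapshot to catch up. It reports whether replication should keep running.
func (l *leader) replicateOnce() bool {
	if !l.isLeader {
		return false
	}
	if err := l.loadSnapshot(); err != nil {
		// Proposed behavior: a leader that cannot load its own snapshot
		// gives up leadership instead of retrying forever.
		l.isLeader = false
		return false
	}
	return true
}

func main() {
	l := &leader{
		isLeader:     true,
		loadSnapshot: func() error { return errSnapshotUnreadable },
	}
	fmt.Println(l.replicateOnce(), l.isLeader) // prints "false false"
}
```

Once the leader has stepped down it stops sending heartbeats, so the remaining servers would be free to hold an election.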
Hello folks! I have a pretty lazy bug report here, so apologies for not going deeper, but I wanted to float a stance by you and see if I can get away with it.
We had a cluster of Nomad servers that lost quorum and would not elect a new leader.
Looking at the logs, the leader at the time was logging this
And other servers were logging this
So here is my stance:
If the leader is broken because it cannot load the snapshots (I have no idea how we got into this situation, but let's ignore that for now), the other servers should realize the leader is useless and usurp him, perhaps by invoking the Praetorian Guard.
Or, more down to earth: this state should cause a heartbeat failure in some way so that we can move past it and elect a new leader.
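As a rough illustration of the behavior being asked for, the sketch below models a follower's election timer with logical ticks. Everything in it is an assumption made for the example (the `follower` type, `simulate`, and the `leaderHealthy` callback are hypothetical, and the raft library uses real timers rather than ticks). The point is only that if a broken leader stops heartbeating, the follower's timeout fires and an election can begin.

```go
package main

import "fmt"

// follower is a toy model driven by logical ticks; it is not the raft library's type.
type follower struct {
	ticksSinceContact int
	electionTimeout   int // ticks without a heartbeat before starting an election
}

// onHeartbeat resets the election timer.
func (f *follower) onHeartbeat() { f.ticksSinceContact = 0 }

// tick advances time by one tick and reports whether an election should start.
func (f *follower) tick() bool {
	f.ticksSinceContact++
	return f.ticksSinceContact >= f.electionTimeout
}

// simulate runs up to maxTicks ticks. The hypothetical leader heartbeats on a
// tick only while leaderHealthy returns true for it. It returns the tick at
// which the follower starts an election, or -1 if it never does.
func simulate(f *follower, maxTicks int, leaderHealthy func(tick int) bool) int {
	for t := 1; t <= maxTicks; t++ {
		if leaderHealthy(t) {
			f.onHeartbeat()
			continue
		}
		if f.tick() {
			return t
		}
	}
	return -1
}

func main() {
	f := &follower{electionTimeout: 3}
	// Assume the leader can no longer load its snapshot from tick 5 onward
	// and therefore stops heartbeating.
	start := simulate(f, 20, func(t int) bool { return t < 5 })
	fmt.Println("election started at tick", start) // prints "election started at tick 7"
}
```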
What do you think?