Failing to set the votedFor variable for the Candidate/Leader may lead to multiple leaders in the same term. #44

Open

fwhdzh opened this issue Oct 30, 2024 · 3 comments

@fwhdzh commented Oct 30, 2024

Failing to set the votedFor variable for the Candidate/Leader may lead to multiple leaders in the same term.

In the original Raft protocol, when a node times out, it sets votedFor to Nil. After that, it sends a RequestVote to all nodes in the cluster (including itself). When the node receives its own RequestVote request, it sets votedFor to itself. Once the node receives votes from a majority of nodes, it updates its state to Leader.
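
For reference, here is a minimal Java sketch (not xraft code; class and method names are illustrative) of the persistent-state rule described above. Whichever way the implicit self-vote is delivered, votedFor ends up pointing at the candidate itself and is persisted together with currentTerm before the node asks for or grants any other vote; the log up-to-date check is omitted for brevity.

class RaftState {
    int currentTerm = 0;
    String votedFor = null; // persisted together with currentTerm

    // Election timeout: bump the term, record the implicit self-vote, persist, then request votes.
    void onElectionTimeout(String selfId) {
        currentTerm++;
        votedFor = selfId;
        persist();
        // ... send RequestVote to every peer ...
    }

    // RequestVote handling: grant at most one vote per term, persisted before replying.
    boolean onRequestVote(int candidateTerm, String candidateId) {
        if (candidateTerm < currentTerm) {
            return false;
        }
        if (candidateTerm > currentTerm) {
            currentTerm = candidateTerm;
            votedFor = null;
            persist();
        }
        if (votedFor == null || votedFor.equals(candidateId)) {
            votedFor = candidateId;
            persist();
            return true;
        }
        return false;
    }

    void persist() { /* write currentTerm and votedFor to stable storage; omitted here */ }
}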

In XRaft, however, when a node times out, it transitions to the Candidate role. At this point, it sets its own votesCount to 1 and no longer sends a RequestVote request to itself.

However, there is an issue in the current XRaft implementation. Neither the Leader class nor the Candidate class in XRaft includes a votedFor field, so the node that times out does not record itself as the votedFor value. Therefore, even though the node persists the votedFor value to disk when transitioning to the Candidate or Leader state, the persisted value is in fact Null. If a node restarts after becoming a Leader, it may vote for another node in the same term, because it does not retrieve any votedFor information from disk.
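
To make the gap concrete, here is a rough reconstruction of the persistence step as described above (illustrative only; it is pieced together from this description rather than copied from NodeImpl.java, and the NodeStore/context accessors are assumptions): changeToRole writes the role's term and, only for a Follower, its votedFor, so nothing meaningful is recorded for a Candidate or a Leader.

void changeToRole(AbstractNodeRole newRole) {
    NodeStore store = context.store();                  // assumed accessor
    store.setTerm(newRole.getTerm());
    if (newRole instanceof FollowerNodeRole) {
        store.setVotedFor(((FollowerNodeRole) newRole).getVotedFor());
    }
    // CandidateNodeRole and LeaderNodeRole carry no votedFor field, so the
    // persisted votedFor remains null after this transition.
    this.role = newRole;
}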

Consider the following fault scenario in a 3-node cluster, where initially all nodes have term = 0:

  • Node 1 times out, setting its term = 1 and votesCount = 1.
  • Node 2 receives Node 1's vote request, sets its term to 1, and votes for Node 1. Upon receiving Node 2's vote, Node 1 becomes the Leader.
  • When Node 1 becomes the Leader, it persists term and votedFor via the changeToRole method. However, because neither the Candidate class nor the Leader class modifies votedFor, the persisted value of votedFor remains Null.
  • Node 1 restarts and loads the persisted votedFor, which is Null.
  • Node 3 times out, setting its term = 1 and votesCount = 1.
  • Node 3 sends a vote request to Node 1, and since Node 1's votedFor == Null, Node 1 votes for Node 3, making Node 3 the Leader for term 1.

This scenario results in both Node 1 and Node 3 acting as Leaders in the same term, which violates the Raft protocol.

Please check if this is indeed a bug, and if so, I would be happy to submit a PR to fix it.

@xnnyygn (Owner) commented Nov 3, 2024

In the series of events you imagine, after node 1 becomes the leader, it will soon append and replicate a special log entry for its term (i.e. term 1), as the Raft protocol requires, so node 1 and node 2 will no longer vote for node 3. If you mean that node 1 crashes immediately after it receives the vote from node 2, then node 2 either knows or does not know that node 1 is the leader. Since the AppendEntries message serves both log replication and the election (it is how a node learns who the leader is), if node 2 knows, it must already have the special log entry for term 1. To sum up, node 3 won't become the leader unless node 1 fails to append the special log entry during the election.
As for the votedFor not being saved, I'd say it is not critical. Suppose node 1 fails to become the leader after receiving the vote from node 2; then the cluster starts a new round (term 2), and the votedFor saved for term 1 no longer matters. Or node 3 becomes the leader because it gains the vote from node 1. In either case, there won't be two leaders. A thorough analysis might still be needed.
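
For context, the reason node 1 and node 2 would refuse to vote for node 3 once they hold the term-1 no-op entry is Raft's log up-to-date check in RequestVote handling. A minimal sketch of that comparison (not the xraft implementation):

class VoteCheck {
    // Raft's "log up to date" comparison used when deciding whether to grant a vote.
    // A voter that already holds the term-1 no-op entry compares higher than a
    // candidate whose log is still empty, so the vote is rejected.
    static boolean candidateLogIsUpToDate(int candidateLastLogTerm, int candidateLastLogIndex,
                                          int voterLastLogTerm, int voterLastLogIndex) {
        if (candidateLastLogTerm != voterLastLogTerm) {
            return candidateLastLogTerm > voterLastLogTerm;
        }
        return candidateLastLogIndex >= voterLastLogIndex;
    }
}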

@fwhdzh (Author) commented Nov 3, 2024

Thank you for your reply. Yes, the no-op log entry can prevent safety issues in the Raft system. However, there can be a time interval between when Node 1 becomes the leader and when the other nodes receive the no-op entry, and during that interval the scenario described above can occur.

Let's take a look at the relevant code in NodeImpl.java:

changeToRole(new LeaderNodeRole(role.getTerm(), scheduleLogReplicationTask()));
context.log().appendEntry(role.getTerm()); // no-op log
context.connector().resetChannels(); // close all inbound channels

In the first line, Node 1 sets its role to Leader (this is the point at which it actually becomes the leader) and schedules a log replication task. In the second line, Node 1 appends the no-op log entry to its log sequence. In DefaultScheduler.scheduleLogReplicationTask, we can see that Node 1 runs the first LogReplicationTask immediately (with the default logReplicationDelay = 0), followed by subsequent LogReplicationTasks once per second (with the default logReplicationInterval = 1000).
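
As a standalone illustration of that timing (this is not the DefaultScheduler code, and whether xraft uses scheduleWithFixedDelay or scheduleAtFixedRate is not shown here), a scheduler configured with an initial delay of 0 ms and a 1000 ms interval behaves as follows:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class ReplicationTimingSketch {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        long logReplicationDelay = 0;       // first replication round runs immediately
        long logReplicationInterval = 1000; // later rounds run once per second
        ScheduledFuture<?> task = scheduler.scheduleWithFixedDelay(
                () -> System.out.println("replicate log at " + System.currentTimeMillis()),
                logReplicationDelay, logReplicationInterval, TimeUnit.MILLISECONDS);
        Thread.sleep(3500); // a no-op entry appended just after the first round waits ~1 s
        task.cancel(false);
        scheduler.shutdown();
    }
}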

Now, there are two cases. If the no-op entry is appended before the first LogReplicationTask runs, it can be broadcast to the other nodes right away. However, if the append is delayed and misses the first LogReplicationTask, there is a one-second gap between when Node 1 becomes the leader and when it broadcasts the no-op entry to the others. In this case, if we use the MemoryEntrySequence, a restart of Node 1 during this interval could lead to the bug scenario described above.

From a straightforward perspective, maybe we can fix this issue simply by swapping the first and second lines in the code above. However, I recommend persisting the votedFor variable for both Candidate and Leader, as described in the original Raft paper, since this also avoids further potential problems. For instance, if the network is very slow, what happens if Node 1 restarts while the messages carrying the no-op log entry are still in flight? Will those messages be lost along with the log entry on Node 1, resulting in the scenario described above again?
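
A minimal sketch of the second option (illustrative only; context.selfId(), context.store().setVotedFor(...) and the Candidate constructor arguments are assumptions rather than xraft's actual API): record the implicit self-vote at the moment the node transitions to Candidate, so that whatever changeToRole persists is non-null.

// Inside the election-timeout handling, shown schematically:
context.store().setVotedFor(context.selfId());                     // persist the self-vote (assumed accessors)
changeToRole(new CandidateNodeRole(newTerm, electionTimeoutTask)); // existing transition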

It is true that this may not be a very important issue, since it does not cause any log inconsistency. However, consider a user who first sees Node 1 become the leader in term 1 and then sees Node 3 become the leader in term 1; they may well be confused by this system behavior.

Best wishes.

@xnnyygn (Owner) commented Nov 4, 2024

XRaft runs on a single-threaded architecture, which means the log being replicated to the other nodes always contains the no-op entry, no matter the order of those lines of code. Also, the in-memory log is for demo purposes only; it does not meet the persistence requirements of Raft, so I don't think we should discuss cases based on the in-memory log.

The nature of your question is whether an unsaved implicit vote by a candidate (or a leader; in either case, a vote for itself) could cause issues in the election, e.g. multiple leaders or an unstable election. I don't have time to work out a proof; perhaps you can make one, or just submit a change to save the self-vote. I don't see any problem in doing that.
