
VotedFor is not stored when a node is candidate and receives an AppendEntriesRpc #29

Open
Icysandwich opened this issue Mar 29, 2022 · 8 comments


@Icysandwich

Here the candidate should store the message sender, i.e., the current leader, in votedFor:

becomeFollower(rpc.getTerm(), null, rpc.getLeaderId(), true);
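
Assuming becomeFollower's parameters are (term, votedFor, leaderId, scheduleElectionTimeout), the suggested change would look something like this (a sketch, not a tested patch):

// hypothetical fix: record the current leader as votedFor when stepping down
becomeFollower(rpc.getTerm(), rpc.getLeaderId(), rpc.getLeaderId(), true);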

This issue can result in two leaders with the same term in the following complex scenario.
Consider a cluster of 5 nodes: Nodes 1, 2 and 3 become candidates with term 1, while Nodes 4 and 5 are followers.

  1. Node 1 receives votes from Nodes 4 and 5 and becomes the leader. However, Node 2 and 3 still remain candidates.
  2. Node 1 receives an AppendEntry request and sends messages to Node 2 and 3.
  3. Node 2 and 3 step down to followers and set votedFor to null (by the code above). At this point they have not written any log entries yet.
  4. Node 4 restarts and becomes a candidate with term 1. It requests votes from Node 2 and Node 3.
  5. Node 2 and 3 find that their votedFor is null and that their logs are not newer than Node 4's. Thus, they vote for Node 4.
  6. Node 4 becomes the leader.

Here both Node 1 and Node 4 are leaders with term 1, which violates Raft's safety property.
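
To make step 5 concrete, here is a minimal standalone sketch of the vote check (simplified, not the actual xraft code) with the values from this scenario; grantVote evaluates to true, so Node 2 and 3 grant their votes:

// minimal standalone sketch of the vote check in step 5 (not the actual xraft code)
public class VoteCheckSketch {
    // true when the receiver's log is strictly newer than the candidate's
    static boolean isNewerThan(int lastTerm, int lastIndex, int rpcLastTerm, int rpcLastIndex) {
        return lastTerm > rpcLastTerm || lastIndex > rpcLastIndex;
    }

    public static void main(String[] args) {
        Integer votedFor = null;               // reset by becomeFollower(..., null, ...)
        int lastTerm = 0, lastIndex = 0;       // Node 2/3: no log entries written yet
        int rpcLastTerm = 0, rpcLastIndex = 0; // Node 4's log is empty as well
        boolean grantVote = votedFor == null
                && !isNewerThan(lastTerm, lastIndex, rpcLastTerm, rpcLastIndex);
        System.out.println(grantVote);         // true -> Node 4 collects the votes
    }
}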

@xnnyygn
Owner

xnnyygn commented Mar 29, 2022

The votedFor is not saved because the node hasn't voted for anyone in that term.

In the scenario you described, when Node 2 and 3 become followers, they will save the first log entry, a NoOp entry that contains only the term. So Node 4 won't have a chance to become a leader without that entry.

The NoOp log is a tricky part of the Raft algorithm. I cannot remember the exact chapter where it is discussed; maybe the edge cases in the last several chapters.
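
For reference, a NoOp entry can be sketched as a log entry whose only payload is the term (class and field names here are illustrative, not the exact xraft classes):

// illustrative sketch of a NoOp entry, not the exact xraft classes
public class NoOpEntrySketch {
    static final class NoOpEntry {
        final int index;
        final int term; // the only payload: the new leader's term
        NoOpEntry(int index, int term) { this.index = index; this.term = term; }
    }

    public static void main(String[] args) {
        NoOpEntry first = new NoOpEntry(1, 1); // first entry written for term 1
        System.out.println("index=" + first.index + ", term=" + first.term);
    }
}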

@Icysandwich
Author

But the log is written after the node turns from candidate into follower.
Thus, in a concurrent scenario, Node 4 can request votes from Node 2 and Node 3 right after they become followers and before they write the log.

@xnnyygn
Owner

xnnyygn commented Mar 30, 2022

Sorry, my explanation might not be clear. The NoOp log is inserted by the new leader and replicated to followers.

  1. Node 1 receives enough votes and becomes the leader. Node 1 inserts a NoOp log entry and starts to replicate its log immediately.
  2. Node 2 and 3 receive AppendEntriesRpc from the leader, set their leader to Node 1, and merge the logs from the AppendEntriesRpc.

So Node 4 won't get enough votes either before step 2 (Node 2 and 3 are still candidates that have voted for themselves) or after step 2 (their logs already contain the NoOp entry).
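
In rough code terms, step 1 looks like this (a standalone illustration; the method and field names are not the actual xraft API):

import java.util.ArrayList;
import java.util.List;

// standalone illustration of step 1; names are not the actual xraft API
public class LeaderStepOneSketch {
    static final List<Integer> log = new ArrayList<>(); // each entry stores only its term

    public static void main(String[] args) {
        int term = 1;
        log.add(term);          // the new leader inserts the NoOp entry first
        replicateToFollowers(); // and starts replication immediately
    }

    static void replicateToFollowers() {
        System.out.println("AppendEntriesRpc carries entries: " + log);
    }
}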

Hope this answers your question.

@Icysandwich
Author

Okay, I see. Thanks for your further explanation.

You mean the leader writes the NoOp entry in this task, right?

changeToRole(new LeaderNodeRole(role.getTerm(), scheduleLogReplicationTask()));

And in this task the leader will send AppendEntriesRpc requests to other nodes.

But the 2nd step in your comment is not atomic, i.e., setting the leader and merging the logs are executed as two separate statements:

becomeFollower(rpc.getTerm(), null, rpc.getLeaderId(), true);
return new AppendEntriesResult(rpc.getMessageId(), rpc.getTerm(), appendEntries(rpc));

So here's the concurrency scenario: before appendEntries(rpc) is executed (i.e., before the NoOp entry is written), Node 4 can launch a new leader election and collect votes from Node 2 and 3.

@xnnyygn
Owner

xnnyygn commented Apr 6, 2022

I guess you have misunderstood the threading model of XRaft. It's a single-threaded application, except for the connection handlers.

For the node receiving an AppendEntriesRpc, while it is processing this handler:

public void onReceiveAppendEntriesRpc(AppendEntriesRpcMessage rpcMessage) {
    context.taskExecutor().submit(() ->
            context.connector().replyAppendEntries(doProcessAppendEntriesRpc(rpcMessage), rpcMessage),
            LOGGING_FUTURE_CALLBACK
    );
}

any new messages will be queued and processed later, not at the same time.

The task executor comes from here:

context.setTaskExecutor(taskExecutor != null ? taskExecutor : new ListeningTaskExecutor(
        Executors.newSingleThreadExecutor(r -> new Thread(r, "node"))
));
@Icysandwich
Author

Sorry for my misunderstanding. Now I understand the threading model. Thanks again for your explanation.

I think I have found a possible buggy scenario, which can occur when using FileLog instead of the default MemoryLog.

After Node 1 finishes synchronizing the NoOp log entry (lastLogTerm = 1) with all nodes (Node 2, 3 and 4), a subsequent generate-snapshot command persists the log on disk. Then Node 4, after a restart, launches a new election process with term = 1 (volatile) and lastLogTerm = 1 (persistent), and requests votes from Node 2 and 3.
Here Node 2 and 3 can vote for Node 4 if both conditions hold:

if ((votedFor == null && !context.log().isNewerThan(rpc.getLastLogIndex(), rpc.getLastLogTerm())) ||

We know that votedFor == null holds on Node 2 and 3.

Here, in the isNewerThan method, Node 2 and 3's logTerm equals 1 while the rpc's lastLogTerm is also 1, and all logIndexes equal 0. Thus the method returns false, and Node 2 and 3 can vote for Node 4.

return lastEntryMeta.getTerm() > lastLogTerm || lastEntryMeta.getIndex() > lastLogIndex;

@xnnyygn
Owner

xnnyygn commented Apr 7, 2022

Of course Node 2 and 3 should vote for Node 4 if Node 4 effectively has the latest log.

After Node 4 receives AppendEntriesRpc from Node 1, its term will be the same as Node 1, 2, and 3. If Node 4 restarts due to some unexpected error, it will continue to be a follower as long as it keeps receiving AppendEntriesRpc or heartbeat messages from Node 1. If there were a weird network partition in which Node 1 could not contact Node 4 but both were still able to connect to Node 2 and 3, then Node 4 would attempt to become a leader and would be sure to succeed. However, the cluster would be unstable.

@Icysandwich
Author

A network partition is not essential: the situation also occurs when AppendEntriesRpc requests are merely delayed, and Raft should be tolerant of message-delay faults.
