
VotedFor is not stored when a node is candidate and receives an AppendEntriesRpc #29

Open
Icysandwich opened this issue Mar 29, 2022 · 8 comments


@Icysandwich

Here the candidate should store the message sender, i.e., the current leader, in votedFor:

becomeFollower(rpc.getTerm(), null, rpc.getLeaderId(), true);
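
Assuming becomeFollower's parameters are (term, votedFor, leaderId, scheduleElectionTimeout), the suggested change would look something like this (a sketch, not a tested patch):

// hypothetical fix: record the current leader as votedFor when stepping down
becomeFollower(rpc.getTerm(), rpc.getLeaderId(), rpc.getLeaderId(), true);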

This issue can result in two leaders with the same term in the following complex scenario.
Consider a cluster of 5 nodes: Nodes 1, 2 and 3 become candidates with term 1, while Nodes 4 and 5 are followers.

  1. Node 1 receives votes from Nodes 4 and 5 and becomes the leader. However, Node 2 and 3 still remain candidates.
  2. Node 1 receives an AppendEntry request and sends messages to Node 2 and 3.
  3. Node 2 and 3 step down to followers and set votedFor to null (by the code above). At this point they have not written any log entries yet.
  4. Node 4 restarts and becomes a candidate with term 1. It requests votes from Node 2 and Node 3.
  5. Node 2 and 3 find that their votedFor is null and that their logs are not newer than Node 4's. Thus, they vote for Node 4.
  6. Node 4 becomes the leader.

Here both Node 1 and Node 4 are leaders with term 1, which violates Raft's safety property.
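
To make step 5 concrete, here is a minimal standalone sketch of the vote check (simplified, not the actual xraft code) with the values from this scenario; grantVote evaluates to true, so Node 2 and 3 grant their votes:

// minimal standalone sketch of the vote check in step 5 (not the actual xraft code)
public class VoteCheckSketch {
    // true when the receiver's log is strictly newer than the candidate's
    static boolean isNewerThan(int lastTerm, int lastIndex, int rpcLastTerm, int rpcLastIndex) {
        return lastTerm > rpcLastTerm || lastIndex > rpcLastIndex;
    }

    public static void main(String[] args) {
        Integer votedFor = null;               // reset by becomeFollower(..., null, ...)
        int lastTerm = 0, lastIndex = 0;       // Node 2/3: no log entries written yet
        int rpcLastTerm = 0, rpcLastIndex = 0; // Node 4's log is empty as well
        boolean grantVote = votedFor == null
                && !isNewerThan(lastTerm, lastIndex, rpcLastTerm, rpcLastIndex);
        System.out.println(grantVote);         // true -> Node 4 collects the votes
    }
}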

@xnnyygn
Owner

xnnyygn commented Mar 29, 2022

The votedFor is not saved because the node hasn't voted for anyone in that term.

In the scenario you described, when Node 2 and 3 become followers, they will save the first log entry, a NoOp entry that contains only the term. So Node 4 won't have a chance to become a leader without that entry.

The NoOp log is a tricky part of the Raft algorithm. I cannot remember the exact chapter where it is discussed; maybe the edge cases in the last several chapters.
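
For reference, a NoOp entry can be sketched as a log entry whose only payload is the term (class and field names here are illustrative, not the exact xraft classes):

// illustrative sketch of a NoOp entry, not the exact xraft classes
public class NoOpEntrySketch {
    static final class NoOpEntry {
        final int index;
        final int term; // the only payload: the new leader's term
        NoOpEntry(int index, int term) { this.index = index; this.term = term; }
    }

    public static void main(String[] args) {
        NoOpEntry first = new NoOpEntry(1, 1); // first entry written for term 1
        System.out.println("index=" + first.index + ", term=" + first.term);
    }
}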

@Icysandwich
Author

But the log is written after the node turns from candidate into follower.
Thus, in a concurrent scenario, Node 4 can request votes from Node 2 and Node 3 right after they become followers and before they write the log.

@xnnyygn
Owner

xnnyygn commented Mar 30, 2022

Sorry, my explanation might not be clear. The NoOp log is inserted by the new leader and replicated to followers.

  1. Node 1 receives enough votes and becomes the leader. Node 1 inserts a NoOp log entry and starts to replicate its log immediately.
  2. Node 2 and 3 receive AppendEntriesRpc from the leader, set their leader to Node 1, and merge the logs from the AppendEntriesRpc.

So Node 4 won't get enough votes either before step 2 (Node 2 and 3 are still candidates that have voted for themselves) or after step 2 (their logs already contain the NoOp entry).
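
In rough code terms, step 1 looks like this (a standalone illustration; the method and field names are not the actual xraft API):

import java.util.ArrayList;
import java.util.List;

// standalone illustration of step 1; names are not the actual xraft API
public class LeaderStepOneSketch {
    static final List<Integer> log = new ArrayList<>(); // each entry stores only its term

    public static void main(String[] args) {
        int term = 1;
        log.add(term);          // the new leader inserts the NoOp entry first
        replicateToFollowers(); // and starts replication immediately
    }

    static void replicateToFollowers() {
        System.out.println("AppendEntriesRpc carries entries: " + log);
    }
}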

Hope this answers your question.

@Icysandwich
Author

Okay, I see. Thanks for your further explanation.

You mean the leader writes the NoOp entry in this task, right?

changeToRole(new LeaderNodeRole(role.getTerm(), scheduleLogReplicationTask()));

And in this task the leader will send AppendEntriesRpc requests to other nodes.

But the 2nd step in your comment is not atomic, i.e., setting the leader and merging the logs are executed as two separate statements:

becomeFollower(rpc.getTerm(), null, rpc.getLeaderId(), true);
return new AppendEntriesResult(rpc.getMessageId(), rpc.getTerm(), appendEntries(rpc));

So here's the concurrency scenario: before appendEntries(rpc) is executed (i.e., before the NoOp entry is written), Node 4 can launch a new leader election and collect votes from Node 2 and 3.

@xnnyygn
Owner

xnnyygn commented Apr 6, 2022

I guess you have misunderstood the threading model of XRaft. It's a single-threaded application, except for the connection handlers.

For the node receiving an AppendEntriesRpc, while it is processing this handler:

public void onReceiveAppendEntriesRpc(AppendEntriesRpcMessage rpcMessage) {
    context.taskExecutor().submit(() ->
            context.connector().replyAppendEntries(doProcessAppendEntriesRpc(rpcMessage), rpcMessage),
            LOGGING_FUTURE_CALLBACK
    );
}

any new messages will be queued and processed later, not at the same time.

The task executor comes from here:

context.setTaskExecutor(taskExecutor != null ? taskExecutor : new ListeningTaskExecutor(
        Executors.newSingleThreadExecutor(r -> new Thread(r, "node"))
));
@Icysandwich
Author

Sorry for my misunderstanding. Now I understand the threading model. Thanks again for your explanation.

I think I have found a possible buggy scenario, which can occur when using FileLog instead of the default MemoryLog.

After Node 1 finishes synchronizing the NoOp log entry (lastLogTerm = 1) with all nodes (Node 2, 3 and 4), a subsequent generate-snapshot command persists the log on disk. Then Node 4, after a restart, launches a new election process with term = 1 (volatile) and lastLogTerm = 1 (persistent), and requests votes from Node 2 and 3.
Here Node 2 and 3 can vote for Node 4 if both conditions hold:

if ((votedFor == null && !context.log().isNewerThan(rpc.getLastLogIndex(), rpc.getLastLogTerm())) ||

We know that votedFor == null holds on Node 2 and 3.

Here, in the isNewerThan method, Node 2 and 3's logTerm equals 1 while the rpc's lastLogTerm is also 1, and all logIndexes equal 0. Thus the method returns false, and Node 2 and 3 can vote for Node 4.

return lastEntryMeta.getTerm() > lastLogTerm || lastEntryMeta.getIndex() > lastLogIndex;

@xnnyygn
Owner

xnnyygn commented Apr 7, 2022

Of course Node 2 and 3 should vote for Node 4 if Node 4 effectively has the latest log.

After Node 4 receives AppendEntriesRpc from Node 1, its term will be the same as Node 1, 2, and 3. If Node 4 restarts due to some unexpected error, it will continue to be a follower as long as it keeps receiving AppendEntriesRpc or heartbeat messages from Node 1. If there were a weird network partition in which Node 1 could not contact Node 4 but both were still able to connect to Node 2 and 3, then Node 4 would attempt to become a leader and would be sure to succeed. However, the cluster would be unstable.

@Icysandwich
Author

A network partition is not essential: the situation also occurs when AppendEntriesRpc requests are merely delayed, and Raft should be tolerant of message-delay faults.
