-
Notifications
You must be signed in to change notification settings - Fork 157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Heartbeat should not introduce additional IO #698
Comments
👋 Thanks for opening this issue! Get help or engage by:
|
@schreter This design has two problems:
|
- databendlabs#698 The blank log heartbeat design has two problems: - The heartbeat that sends a blank log introduces additional I/O, as a follower has to persist every log to maintain correctness. - Although `(term, log_index)` serves as a pseudo time in Raft, measuring whether a node has caught up with the leader and is capable of becoming a new leader, leadership is not solely determined by this pseudo time. Wall clock time is also taken into account. There may be a case where the pseudo time is not upto date but the clock time is, and the node should not become the leader. For example, in a cluster of three nodes, if the leader (node-1) is busy sending a snapshot to node-2(it has not yet replicated the latest logs to a quorum, but node-2 received message from the leader(node-1), thus it knew there is an active leader), node-3 should not seize leadership from node-1. This is why there needs to be two types of time, pseudo time `(term, log_index)` and wall clock time, to protect leadership. In the follow graph: - node-1 is the leader, has 4 log entries, and is sending a snapshot to node-2, - node-2 received several chunks of snapshot, and it perceived an active leader thus extended leader lease. - node-3 tried to send vote request to node-2, although node-2 do not have as many logs as node-3, it should still reject node-3's vote request because the leader lease has not yet expired. In the obsolete design, extending pseudo time `(term, index)` with a `tick`, in this case node-3 will seize the leadership from node-2. ```text Ni: Node i Ei: log entry i N1 E1 E2 E3 E4 | v N2 snapshot +-----------------+ ^ | | leader lease | N3 E1 E2 E3 | vote-request ---------------+----------------------------> clock time now ``` The original document is presented below for reference.
- databendlabs#698 The blank log heartbeat design has two problems: - The heartbeat that sends a blank log introduces additional I/O, as a follower has to persist every log to maintain correctness. - Although `(term, log_index)` serves as a pseudo time in Raft, measuring whether a node has caught up with the leader and is capable of becoming a new leader, leadership is not solely determined by this pseudo time. Wall clock time is also taken into account. There may be a case where the pseudo time is not upto date but the clock time is, and the node should not become the leader. For example, in a cluster of three nodes, if the leader (node-1) is busy sending a snapshot to node-2(it has not yet replicated the latest logs to a quorum, but node-2 received message from the leader(node-1), thus it knew there is an active leader), node-3 should not seize leadership from node-1. This is why there needs to be two types of time, pseudo time `(term, log_index)` and wall clock time, to protect leadership. In the follow graph: - node-1 is the leader, has 4 log entries, and is sending a snapshot to node-2, - node-2 received several chunks of snapshot, and it perceived an active leader thus extended leader lease. - node-3 tried to send vote request to node-2, although node-2 do not have as many logs as node-3, it should still reject node-3's vote request because the leader lease has not yet expired. In the obsolete design, extending pseudo time `(term, index)` with a `tick`, in this case node-3 will seize the leadership from node-2. ```text Ni: Node i Ei: log entry i N1 E1 E2 E3 E4 | v N2 snapshot +-----------------+ ^ | | leader lease | N3 E1 E2 E3 | vote-request ---------------+----------------------------> clock time now ``` The original document is presented below for reference.
Just an understanding question: In the original Raft design, after there is no further work coming at the leader, it will periodically send heartbeat, which IIRC is nothing else than AppendEntries with zero entries is sent. That would effectively cause zero I/O on the follower, since there is no new log to persist (only the committed index is updated on the first heartbeat). So what does openraft do here differently? Why does it require an I/O in the first place? |
You are right. And I reverted it to the standard raft way now. At first, I tried to avoid involving the wall clock by only using the term and raft-log index as a pseudo clock. But such a way can not prevent a candidate from seizing leadership until the leader replicated enough logs to a quorum. |
Statues:
In OpenRaft 0.8, the heartbeat is currently implemented as a blank log. However, this approach introduces additional IO, as every log, including the blank log, must be persisted for correctness.
Solution
To avoid this unnecessary IO, we propose making the heartbeat a sub-field of the pseudo time. The pseudo time in OpenRaft is currently represented by
(leader_id{term, node_id}, log_index)
, but we suggest upgrading it to(leader_id{term, node_id}, log_index, tick)
. The tick field does not need to be persisted.By implementing a pseudo time design like this, we can achieve the same ability as the current blank-log heartbeat to prevent a candidate with a smaller pseudo time from seizing leadership from a healthy cluster.
Explaination
A consensus should work barely with a pseudo time, the pseudo time does not have to be completely persistent.
Thus just give the pseudo time a non-persistent suffix to extend the precision:
In standard raft, the pseudo time is
(last_term, last_log_index)
.In openraft, the pseudo time for now is
(last_term, node_id, last_log_index)
.The next version pseudo time would be:
(last_term, node_id, last_log_index, heartbeat_count)
, where the heartbeat_count:heartbeat_count
field to the append-entries RPC body.When handling a vote request, a node compares two values to determine whether to grant it:
(term, node_id, nil, nil)
, which is just theVote
struct, without any change.(last_term, node_id, last_log_index, tick)
.This way the pseudo time keeps increasing while there is an active leader. There won't be additional IO introduced.
Originally posted by @drmingdrmer in #551 (reply in thread)
The text was updated successfully, but these errors were encountered: