Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Heartbeat should not introduce additional IO #698

Closed
drmingdrmer opened this issue Mar 2, 2023 · 4 comments
Closed

Heartbeat should not introduce additional IO #698

drmingdrmer opened this issue Mar 2, 2023 · 4 comments

Comments

@drmingdrmer
Copy link
Member

drmingdrmer commented Mar 2, 2023

Statues:

In OpenRaft 0.8, the heartbeat is currently implemented as a blank log. However, this approach introduces additional IO, as every log, including the blank log, must be persisted for correctness.

Solution

To avoid this unnecessary IO, we propose making the heartbeat a sub-field of the pseudo time. The pseudo time in OpenRaft is currently represented by (leader_id{term, node_id}, log_index), but we suggest upgrading it to (leader_id{term, node_id}, log_index, tick). The tick field does not need to be persisted.

By implementing a pseudo time design like this, we can achieve the same ability as the current blank-log heartbeat to prevent a candidate with a smaller pseudo time from seizing leadership from a healthy cluster.

Explaination

A consensus should work barely with a pseudo time, the pseudo time does not have to be completely persistent.
Thus just give the pseudo time a non-persistent suffix to extend the precision:

In standard raft, the pseudo time is (last_term, last_log_index).
In openraft, the pseudo time for now is (last_term, node_id, last_log_index).
The next version pseudo time would be: (last_term, node_id, last_log_index, heartbeat_count), where the heartbeat_count:

  • heartbeat_count is not persisted, and is reset to 0 when a node restarts;
  • heartbeat_count is increased for every heartbeat(append-entries without any entries);
  • Add a heartbeat_count field to the append-entries RPC body.

When handling a vote request, a node compares two values to determine whether to grant it:

  • the next pseudo time the candidate wants to work in: (term, node_id, nil, nil), which is just the Vote struct, without any change.
  • the pseudo time of the data: (last_term, node_id, last_log_index, tick).

This way the pseudo time keeps increasing while there is an active leader. There won't be additional IO introduced.

Originally posted by @drmingdrmer in #551 (reply in thread)

@github-actions
Copy link

github-actions bot commented Mar 2, 2023

👋 Thanks for opening this issue!

Get help or engage by:

  • /help : to print help messages.
  • /assignme : to assign this issue to you.

@drmingdrmer
Copy link
Member Author

@schreter
This design won't work. I'm going to close this issue:

This design has two problems:

  • The heartbeat that sends a blank log introduces additional I/O, as a follower has to persist every log to maintain correctness.

  • Although (term, log_index) serves as a pseudo time in Raft, measuring whether a node has caught up with the leader and is capable of becoming a new leader, leadership is not solely determined by this pseudo time.
    Wall clock time is also taken into account.

    There may be a case where the pseudo time is not upto date but the clock time is, and the node should not become the leader.
    For example, in a cluster of three nodes, if the leader (node-1) is busy sending a snapshot to node-2(it has not yet replicated the latest logs to a quorum, but node-2 received message from the leader(node-1), thus it knew there is an active leader), node-3 should not seize leadership from node-1.
    This is why there needs to be two types of time, pseudo time (term, log_index) and wall clock time, to protect leadership.

    In the follow graph:

    • node-1 is the leader, has 4 log entries, and is sending a snapshot to
      node-2,
    • node-2 received several chunks of snapshot, and it perceived an active
      leader thus extended leader lease.
    • node-3 tried to send vote request to node-2, although node-2 do not have
      as many logs as node-3, it should still reject node-3's vote request
      because the leader lease has not yet expired.

    In the obsolete design, extending pseudo time (term, index) with a
    tick, in this case node-3 will seize the leadership from node-2.

    Ni: Node i
    Ei: log entry i
    
    N1 E1 E2 E3 E4
          |
          v
    N2    snapshot
          +-----------------+
                   ^        |
                   |        leader lease
                   |
    N3 E1 E2 E3    | vote-request
    ---------------+----------------------------> clock time
                   now
    
    

drmingdrmer added a commit to drmingdrmer/openraft that referenced this issue Mar 4, 2023
- databendlabs#698

The blank log heartbeat design has two problems:

- The heartbeat that sends a blank log introduces additional I/O, as a follower has to persist every log to maintain correctness.

- Although `(term, log_index)` serves as a pseudo time in Raft, measuring whether a node has caught up with the leader and is capable of becoming a new leader, leadership is not solely determined by this pseudo time.
  Wall clock time is also taken into account.

  There may be a case where the pseudo time is not upto date but the clock time is, and the node should not become the leader.
  For example, in a cluster of three nodes, if the leader (node-1) is busy sending a snapshot to node-2(it has not yet replicated the latest logs to a quorum, but node-2 received message from the leader(node-1), thus it knew there is an active leader), node-3 should not seize leadership from node-1.
  This is why there needs to be two types of time, pseudo time `(term, log_index)` and wall clock time, to protect leadership.

  In the follow graph:
  - node-1 is the leader, has 4 log entries, and is sending a snapshot to
      node-2,
  - node-2 received several chunks of snapshot, and it perceived an active
      leader thus extended leader lease.
  - node-3 tried to send vote request to node-2, although node-2 do not have
      as many logs as node-3, it should still reject node-3's vote request
      because the leader lease has not yet expired.

  In the obsolete design, extending pseudo time `(term, index)` with a
  `tick`, in this case node-3 will seize the leadership from node-2.

  ```text
  Ni: Node i
  Ei: log entry i

  N1 E1 E2 E3 E4
        |
        v
  N2    snapshot
        +-----------------+
                 ^        |
                 |        leader lease
                 |
  N3 E1 E2 E3    | vote-request
  ---------------+----------------------------> clock time
                 now

  ```

The original document is presented below for reference.
drmingdrmer added a commit to drmingdrmer/openraft that referenced this issue Mar 4, 2023
- databendlabs#698

The blank log heartbeat design has two problems:

- The heartbeat that sends a blank log introduces additional I/O, as a follower has to persist every log to maintain correctness.

- Although `(term, log_index)` serves as a pseudo time in Raft, measuring whether a node has caught up with the leader and is capable of becoming a new leader, leadership is not solely determined by this pseudo time.
  Wall clock time is also taken into account.

  There may be a case where the pseudo time is not upto date but the clock time is, and the node should not become the leader.
  For example, in a cluster of three nodes, if the leader (node-1) is busy sending a snapshot to node-2(it has not yet replicated the latest logs to a quorum, but node-2 received message from the leader(node-1), thus it knew there is an active leader), node-3 should not seize leadership from node-1.
  This is why there needs to be two types of time, pseudo time `(term, log_index)` and wall clock time, to protect leadership.

  In the follow graph:
  - node-1 is the leader, has 4 log entries, and is sending a snapshot to
      node-2,
  - node-2 received several chunks of snapshot, and it perceived an active
      leader thus extended leader lease.
  - node-3 tried to send vote request to node-2, although node-2 do not have
      as many logs as node-3, it should still reject node-3's vote request
      because the leader lease has not yet expired.

  In the obsolete design, extending pseudo time `(term, index)` with a
  `tick`, in this case node-3 will seize the leadership from node-2.

  ```text
  Ni: Node i
  Ei: log entry i

  N1 E1 E2 E3 E4
        |
        v
  N2    snapshot
        +-----------------+
                 ^        |
                 |        leader lease
                 |
  N3 E1 E2 E3    | vote-request
  ---------------+----------------------------> clock time
                 now

  ```

The original document is presented below for reference.
@schreter
Copy link
Collaborator

schreter commented Mar 4, 2023

Just an understanding question:

In the original Raft design, after there is no further work coming at the leader, it will periodically send heartbeat, which IIRC is nothing else than AppendEntries with zero entries is sent. That would effectively cause zero I/O on the follower, since there is no new log to persist (only the committed index is updated on the first heartbeat). So what does openraft do here differently? Why does it require an I/O in the first place?

@drmingdrmer
Copy link
Member Author

You are right. And I reverted it to the standard raft way now.

At first, I tried to avoid involving the wall clock by only using the term and raft-log index as a pseudo clock. But such a way can not prevent a candidate from seizing leadership until the leader replicated enough logs to a quorum.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants