Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Notify user code about follower loss #517

Merged
merged 5 commits into from
Jul 18, 2024
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions include/libnuraft/callback.hxx
Original file line number Diff line number Diff line change
Expand Up @@ -207,6 +207,13 @@ public:
*/
ResignationFromLeader = 27,

/**
* When a peer RPC errors count exceeds raft_server::limits.warning_limit_, or
* a peer doesn't respond for a long time (raft_params::leadership_expiry_),
* the peer is considered lost.
* ctx: null.
*/
FollowerLost = 100,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no other special reason, please use 28.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

};

struct Param {
Expand Down
6 changes: 6 additions & 0 deletions include/libnuraft/peer.hxx
Original file line number Diff line number Diff line change
Expand Up @@ -302,6 +302,10 @@ public:
ptr<req_msg> get_rsv_msg() const { return rsv_msg_; }
rpc_handler get_rsv_msg_handler() const { return rsv_msg_handler_; }

bool IsLost() const { return lost_by_leader_; }
void SetLost() { lost_by_leader_ = true; }
void SetRecovered() { lost_by_leader_ = false; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow the naming convention: is_lost, set_lost, and set_recovered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


private:
void handle_rpc_result(ptr<peer> myself,
ptr<rpc_client> my_rpc_client,
Expand Down Expand Up @@ -498,6 +502,8 @@ private:
*/
std::atomic<bool> abandoned_;

std::atomic<bool> lost_by_leader_ {false};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comment about this member variable.

Also, instead of brace initializer here, please put the default value into constructor, to be aligned with others.


/**
* Reserved message that should be sent next time.
*/
Expand Down
3 changes: 2 additions & 1 deletion include/libnuraft/raft_server.hxx
Original file line number Diff line number Diff line change
Expand Up @@ -886,7 +886,8 @@ protected:
int32 get_quorum_for_election();
int32 get_quorum_for_commit();
int32 get_leadership_expiry();
size_t get_not_responding_peers();
std::vector<ptr<peer>> get_not_responding_peers();
size_t get_not_responding_peers_count();
size_t get_num_stale_peers();

ptr<resp_msg> handle_append_entries(req_msg& req);
Expand Down
4 changes: 2 additions & 2 deletions src/handle_append_entries.cxx
Original file line number Diff line number Diff line change
Expand Up @@ -162,7 +162,7 @@ bool raft_server::request_append_entries(ptr<peer> p) {
chk_timer.timeout_and_reset() ) {
// If auto adjust mode is on for 2-node cluster, and
// the follower is not responding, adjust the quorum.
size_t num_not_responding_peers = get_not_responding_peers();
size_t num_not_responding_peers = get_not_responding_peers_count();
size_t cur_quorum_size = get_quorum_for_commit();
size_t num_stale_peers = get_num_stale_peers();
if (cur_quorum_size >= 1) {
Expand Down Expand Up @@ -1191,7 +1191,7 @@ ulong raft_server::get_expected_committed_log_idx() {

size_t quorum_idx = get_quorum_for_commit();
if (ctx_->get_params()->use_full_consensus_among_healthy_members_) {
size_t not_responding_peers = get_not_responding_peers();
size_t not_responding_peers = get_not_responding_peers_count();
if (not_responding_peers < voting_members - quorum_idx) {
// If full consensus option is on, commit should be
// agreed by all healthy members, and the number of
Expand Down
45 changes: 43 additions & 2 deletions src/raft_server.cxx
Original file line number Diff line number Diff line change
Expand Up @@ -608,7 +608,25 @@ int32 raft_server::get_leadership_expiry() {
return expiry;
}

size_t raft_server::get_not_responding_peers() {
std::vector<ptr<peer>> raft_server::get_not_responding_peers() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me std::list<...> is enough for the purpose of this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, changed to std::list

ptr<raft_params> params = ctx_->get_params();
const auto expiry =
params->heart_beat_interval_ * raft_server::raft_limits_.response_limit_;
std::vector<ptr<peer>> rs;
for (const auto& [id, peer_ptr]: peers_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a c++17 feature, and this project is based on c++11. Please use const auto& p and p.second. Looks like id is not used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, my bad, sorry. Fixed.

if (!is_regular_member(peer_ptr)) {
continue;
}
const auto resp_elapsed_ms =
static_cast<int32>(peer_ptr->get_resp_timer_us() / 1000);
if (resp_elapsed_ms > expiry) {
rs.push_back(peer_ptr);
}
}
return rs;
}

size_t raft_server::get_not_responding_peers_count() {
// Check if quorum nodes are not responding
// (i.e., don't respond 20x heartbeat time long).
size_t num_not_resp_nodes = 0;
Expand Down Expand Up @@ -804,6 +822,14 @@ void raft_server::handle_peer_resp(ptr<resp_msg>& resp, ptr<rpc_exception>& err)
} else if (rpc_errs == raft_server::raft_limits_.warning_limit_) {
p_wn("too verbose RPC error on peer (%d), "
"will suppress it from now", peer_id);
if (!pp || !pp->IsLost()) {
if (pp) {
pp->SetLost();
}
cb_func::Param param(id_, leader_, peer_id);
const auto rc = ctx_->cb_func_.call(cb_func::FollowerLost, &param);
assert(rc == cb_func::ReturnCode::Ok);
}
}

if (pp && pp->is_leave_flag_set()) {
Expand Down Expand Up @@ -837,6 +863,9 @@ void raft_server::handle_peer_resp(ptr<resp_msg>& resp, ptr<rpc_exception>& err)
peer* pp = entry->second.get();
int rpc_errs = pp->get_rpc_errs();
if (rpc_errs >= raft_server::raft_limits_.warning_limit_) {
if (pp) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the flow enters here only when entry != peers.end(), this if (pp) is unnecessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't guarantee that a shared pointer has a value, does it? But I see that you don't check this at the line above (pp->get_rpc_errs()), so ok, I removed this if (pp) here as well

pp->SetRecovered();
}
p_wn("recovered from RPC failure from peer %d, %d errors",
resp->get_src(), rpc_errs);
}
Expand Down Expand Up @@ -1085,7 +1114,7 @@ bool raft_server::check_leadership_validity() {
int32 num_voting_members = get_num_voting_members();

int leadership_expiry = get_leadership_expiry();
int32 nr_peers = (int32)get_not_responding_peers();
int32 nr_peers = (int32)get_not_responding_peers_count();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since get_not_responding_peers is also used in the same function, can we just call it once?

const auto nr_peers_list = get_not_responding_peers();
int32 nr_peers = nr_peers_list.size();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding this point:

  • on one hand, yes nothing prevents us from using get_not_responding_peers() once;
  • but on another hand, this way we can get an overhead: creating an std::list, filling it up with ptr<peer>s, returning it (though, RVO should work here) - and we need the list only in the case if the condition in the if below is true ((num_voting_members - nr_peers < min_quorum_size) == true), in all other cases we don't need the list; I assume that in the majority of cases, in "normal" circumstances, that condition should be false

But if you are ok with this overhead, plese let me know and I will change the code

Also, please pay attention, I have changed the code of the get_not_responding_peers_count/get_not_responding_peers functions. They contained a duplicated part of code, which could potentially lead to some bugs in the future (if someone change the code of one function, but not another). Now I extracted that common part into a separate function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, your change looks good.

if (leadership_expiry < 0) {
// Negative expiry: leadership will never expire.
nr_peers = 0;
Expand All @@ -1102,6 +1131,18 @@ bool raft_server::check_leadership_validity() {
get_leadership_expiry(),
min_quorum_size);

const auto nr_peers_list = get_not_responding_peers();
assert(nr_peers_list.size() == nr_peers);
for (auto& peer : nr_peers_list) {
if (peer->IsLost()) {
continue;
}
peer->SetLost();
cb_func::Param param(id_, leader_, peer->get_id());
const auto rc = ctx_->cb_func_.call(cb_func::FollowerLost, &param);
assert(rc == cb_func::ReturnCode::Ok);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use 4-space indent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


// NOTE:
// For a cluster where the number of members is the same
// as the size of quorum, we should not expire leadership,
Expand Down
Loading