Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Notify user code about follower loss #517

Merged
merged 5 commits into from
Jul 18, 2024

Conversation

apostolpav113
Copy link
Contributor

Hello, within this PR we suggest to introduce a new cb_func::Type enum item: FollowerLost, and, on detection of follower loss, call the callback function with the type parameter equal to FollowerLost.

Pavel Yurin,
C++ developer
NXLog Ltd.
[email protected]
www.nxlog.co

Copy link
Contributor

@greensky00 greensky00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting PR. I put a few comments, please take a look.

* the peer is considered lost.
* ctx: null.
*/
FollowerLost = 100,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no other special reason, please use 28.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines 305 to 307
bool IsLost() const { return lost_by_leader_; }
void SetLost() { lost_by_leader_ = true; }
void SetRecovered() { lost_by_leader_ = false; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow the naming convention: is_lost, set_lost, and set_recovered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

const auto expiry =
params->heart_beat_interval_ * raft_server::raft_limits_.response_limit_;
std::vector<ptr<peer>> rs;
for (const auto& [id, peer_ptr]: peers_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a c++17 feature, and this project is based on c++11. Please use const auto& p and p.second. Looks like id is not used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, my bad, sorry. Fixed.

@@ -837,6 +863,9 @@ void raft_server::handle_peer_resp(ptr<resp_msg>& resp, ptr<rpc_exception>& err)
peer* pp = entry->second.get();
int rpc_errs = pp->get_rpc_errs();
if (rpc_errs >= raft_server::raft_limits_.warning_limit_) {
if (pp) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the flow enters here only when entry != peers.end(), this if (pp) is unnecessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't guarantee that a shared pointer has a value, does it? But I see that you don't check this at the line above (pp->get_rpc_errs()), so ok, I removed this if (pp) here as well

Comment on lines 1136 to 1144
for (auto& peer : nr_peers_list) {
if (peer->IsLost()) {
continue;
}
peer->SetLost();
cb_func::Param param(id_, leader_, peer->get_id());
const auto rc = ctx_->cb_func_.call(cb_func::FollowerLost, &param);
assert(rc == cb_func::ReturnCode::Ok);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use 4-space indent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@@ -1085,7 +1114,7 @@ bool raft_server::check_leadership_validity() {
int32 num_voting_members = get_num_voting_members();

int leadership_expiry = get_leadership_expiry();
int32 nr_peers = (int32)get_not_responding_peers();
int32 nr_peers = (int32)get_not_responding_peers_count();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since get_not_responding_peers is also used in the same function, can we just call it once?

const auto nr_peers_list = get_not_responding_peers();
int32 nr_peers = nr_peers_list.size();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding this point:

  • on one hand, yes nothing prevents us from using get_not_responding_peers() once;
  • but on another hand, this way we can get an overhead: creating an std::list, filling it up with ptr<peer>s, returning it (though, RVO should work here) - and we need the list only in the case if the condition in the if below is true ((num_voting_members - nr_peers < min_quorum_size) == true), in all other cases we don't need the list; I assume that in the majority of cases, in "normal" circumstances, that condition should be false

But if you are ok with this overhead, plese let me know and I will change the code

Also, please pay attention, I have changed the code of the get_not_responding_peers_count/get_not_responding_peers functions. They contained a duplicated part of code, which could potentially lead to some bugs in the future (if someone change the code of one function, but not another). Now I extracted that common part into a separate function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, your change looks good.

@@ -608,7 +608,25 @@ int32 raft_server::get_leadership_expiry() {
return expiry;
}

size_t raft_server::get_not_responding_peers() {
std::vector<ptr<peer>> raft_server::get_not_responding_peers() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me std::list<...> is enough for the purpose of this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, changed to std::list

@apostolpav113
Copy link
Contributor Author

apostolpav113 commented Jun 26, 2024

Thanks for submitting PR. I put a few comments, please take a look.

@greensky00, thank you for the review. I have made the fixes, and regarding this comment #517 (comment) - need one more round of discussion. Please take a look

Copy link
Contributor

@greensky00 greensky00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, another couple of comments.

@@ -1085,7 +1114,7 @@ bool raft_server::check_leadership_validity() {
int32 num_voting_members = get_num_voting_members();

int leadership_expiry = get_leadership_expiry();
int32 nr_peers = (int32)get_not_responding_peers();
int32 nr_peers = (int32)get_not_responding_peers_count();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, your change looks good.

@@ -498,6 +502,8 @@ private:
*/
std::atomic<bool> abandoned_;

std::atomic<bool> lost_by_leader_ {false};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comment about this member variable.

Also, instead of brace initializer here, please put the default value into constructor, to be aligned with others.

@@ -837,6 +861,7 @@ void raft_server::handle_peer_resp(ptr<resp_msg>& resp, ptr<rpc_exception>& err)
peer* pp = entry->second.get();
int rpc_errs = pp->get_rpc_errs();
if (rpc_errs >= raft_server::raft_limits_.warning_limit_) {
pp->set_recovered();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set_lost() is invoked in two places:

  1. handle_peer_resp: based on raft_limits_.warning_limit_, and
  2. check_leadership_validity: from get_not_responding_peers which is based on raft_limits_.response_limit_.

Although both limits have the same default value, they can be configured with different numbers. If warning_limit_ > response_limit_, there can be a possibility that set_recovered() may not be called, when a leader loses its role and then becomes leader later again.

Please confirm this. I think calling set_recovered outside of this if-clause is ok. It is just setting an atomic variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, calling set_recovered for all peers when a server becomes leader (in raft_server::become_leader) is also an option. Leader may need a fresh start.

@greensky00
Copy link
Contributor

@apostolpav113
Since it has been pending for a while, I made those changes. Please let me know if you have any concerns. You can push follow-up PRs anytime.

@greensky00 greensky00 merged commit 0e28077 into eBay:master Jul 18, 2024
1 check passed
@apostolpav113
Copy link
Contributor Author

greensky00, sorry for delaying and thank you for making the changes! :-) I was going to start fixing today, and was surprised by the fact that PR had been already merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants