Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raft problem #2483

Merged
merged 3 commits into from
Jul 23, 2021
Merged

Raft problem #2483

merged 3 commits into from
Jul 23, 2021

Conversation

sherlockkenan
Copy link

@sherlockkenan sherlockkenan commented Jun 30, 2021

What changes were proposed in this pull request?

####. fix inconsistency problem in raftpart appendlog method.

*Hi, when I read the implementation of raft in neblua, I found out that may exist a inconsistency problem. In the raftpart, if a follower receive logs from leader conflicting with itself, it will rollback the log to lastCommitID. such behavior may cause inconsistency commit logs.

here is a picture to show how it may happend :
image

I think that the follower could not deletes the log arbitrarily. because some logs may have been already committed.

Why are the changes needed?

I refine the implementation to follow the guildline in the paper of raft .
image

Will break the compatibility? How if so?

no

Does this PR introduce any user-facing change?

no

How was this patch tested?

Checklist

  • I've run the tests to see all new and existing tests pass
  • If this Pull Request resolves an issue, I linked to the issue in the text above
  • I've informed the technical writer about the documentation change if necessary

@CLAassistant
Copy link

CLAassistant commented Jun 30, 2021

CLA assistant check
All committers have signed the CLA.

@sherlockkenan sherlockkenan changed the title fix inconsistency problem in appendlog Raft prom Jun 30, 2021
@sherlockkenan sherlockkenan changed the title Raft prom Raft problem Jun 30, 2021
@Shylock-Hg Shylock-Hg requested a review from a team July 1, 2021 02:45
@Shylock-Hg
Copy link
Contributor

I have a problem about the flow in above picture, in that, the leader must commit after committed by major, but the stage 1,2 doesn't.

@critical27
Copy link
Contributor

👍 , this is a known issue (long story). I'll review later.

@sherlockkenan
Copy link
Author

sherlockkenan commented Jul 1, 2021

I have a problem about the flow in above picture, in that, the leader must commit after committed by major, but the stage 1,2 doesn't.

the leader could commit after it sends logs to major successfully. after that leader update its commit message to followers in the next request. Before transferring the commit message, commitId between leader and followers could be different.

Copy link
Contributor

@critical27 critical27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for contribution, some place need to have a discuss, see the comments.

req.get_log_str_list().end());

// may be need to rollback wal_
if (!( req.get_last_log_id_sent() == wal_->lastLogId() && req.get_last_log_term_sent() == wal_->lastLogTerm())) {
Copy link
Contributor

@critical27 critical27 Jul 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we didn't check the condition like in line 1551 req.get_last_log_id_sent() == lastLogId_ && req.get_last_log_term_sent() == lastLogTerm_?

My point is, when it is true, we don't need to check term in each log according to the property of Log Matching in paper.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, using

req.get_last_log_id_sent() == lastLogId_ && req.get_last_log_term_sent() == lastLogTerm_

to check is better, I will modify it later.

I agree that when it is true, it not need to check term in each log. that is why I add this condition check.

if (!( req.get_last_log_id_sent() == wal_->lastLogId() && req.get_last_log_term_sent() == wal_->lastLogTerm()))

In most case, above check is false, therefore we don not need check term of log one by one, just append the new log。

Copy link
Contributor

@critical27 critical27 Jul 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In most case, above check is false, therefore we don not need check term of log one by one, just append the new log。

Agree.

The reason I think lastLogId_ and lastLogTerm_ will be better is that: If a host converts from follower -> leader -> follower again, in the leader phase, it could write some log in wal (although the case is rare).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i get it

src/kvstore/wal/FileBasedWal.cpp Show resolved Hide resolved
TermID committedLogTerm = wal_->getLogTerm(committedLogId_);
if (committedLogTerm > 0 ) {
resp.set_last_log_id(committedLogId_);
resp.set_last_log_term(committedLogTerm);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could consider do not set_last_log_term, IIRC, it is not used, so we don't nedd to wal_->getLogTerm(committedLogId_);

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have considered it before, but if we don't set it,then the resp is: (commmitedLogId_,last_log_term).

when leader receive the response, it will send a new request with (commmitedLogId_,last_log_term) not (commmitedLogId_, commmitedLogIdTerm_), see self->lastLogTermSent_ = resp.get_last_log_term();

append log request with (commmitedLogId_,last_log_term) could not pass the check in the follower , which will causing a dead loop

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're right.

@critical27 critical27 added the ready-for-testing PR: ready for the CI test label Jul 2, 2021
@codecov-commenter
Copy link

codecov-commenter commented Jul 2, 2021

Codecov Report

Merging #2483 (06a816b) into master (2249c03) will decrease coverage by 0.23%.
The diff coverage is 82.39%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2483      +/-   ##
==========================================
- Coverage   86.46%   86.23%   -0.24%     
==========================================
  Files         649      647       -2     
  Lines       64374    67326    +2952     
==========================================
+ Hits        55662    58059    +2397     
- Misses       8712     9267     +555     
Impacted Files Coverage Δ
src/common/base/EitherOr.h 95.69% <ø> (ø)
src/common/base/StatusOr.h 91.85% <ø> (ø)
src/common/filter/geo/GeoParams.h 100.00% <ø> (ø)
src/common/fs/FileUtils.h 85.71% <ø> (ø)
src/common/hdfs/HdfsCommandHelper.cpp 0.00% <0.00%> (-75.00%) ⬇️
src/common/stats/StatsManager.h 100.00% <ø> (ø)
src/common/time/Duration.cpp 75.86% <ø> (ø)
src/common/time/Duration.h 100.00% <ø> (ø)
src/common/time/detail/TscHelper.cpp 100.00% <ø> (ø)
src/common/utils/NebulaKeyUtils.cpp 92.16% <0.00%> (-5.29%) ⬇️
... and 165 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2faf9bd...06a816b. Read the comment docs.

@sherlockkenan
Copy link
Author

I only ran CTest in the centos environment , it is ok, but not in ubuntu clang-9. I don't known why the error happend.

size_t numLogs = req.get_log_str_list().size();
LogID firstId = req.get_last_log_id_sent() + 1;

std::vector<cpp2::LogEntry> logEntries (req.get_log_str_list().begin(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try use std::make_move_iterator here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I have refined it

}

// update msg
logEntries = std::vector<cpp2::LogEntry> (req.get_log_str_list().begin() + diffIndex,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be quite expensive as well, emm, can't figure out a better way for now.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have refined it to avoid copy

@critical27
Copy link
Contributor

I only ran CTest in the centos environment , it is ok, but not in ubuntu clang-9. I don't known why the error happend.

OK, leave it alone.

@sherlockkenan
Copy link
Author

sherlockkenan commented Jul 7, 2021

I only ran CTest in the centos environment , it is ok, but not in ubuntu clang-9. I don't known why the error happend.

OK, leave it alone.

Is there anything else that needs to be updated

Copy link
Contributor

@critical27 critical27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for late response, LGTM now. Thx!

Copy link
Contributor

@liuyu85cn liuyu85cn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jamieliu1023
Copy link
Contributor

@sherlockkenan Thanks for your contribution to the Nebula Graph community! This is Jamie with Nebula and I'd like to email you the Nebula Contributor certificate and ship you a mug to mark this special moment. Could you please kindly reach me via jamie.liu(at)vesoft.com? Again, thanks for being a part of the Nebula community!

@sherlockkenan
Copy link
Author

sherlockkenan commented Jul 25, 2021 via email

@critical27 critical27 mentioned this pull request Oct 21, 2021
yixinglu pushed a commit to yixinglu/nebula that referenced this pull request Sep 14, 2023
* Fix GraphSessionManager::addSessionCount() in multu-thread case

* Fix compilation

---------

Co-authored-by: Yichen Wang <[email protected]>
Co-authored-by: Sophie <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-testing PR: ready for the CI test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants