-
Notifications
You must be signed in to change notification settings - Fork 312
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Fault-tolerant storage engine errors for write operations #1399
Conversation
f90ffd7
to
7784c7e
Compare
src/replica/test/mock_utils.h
Outdated
@@ -46,7 +46,7 @@ class mock_replication_app_base : public replication_app_base | |||
explicit mock_replication_app_base(replica *replica) : replication_app_base(replica) {} | |||
|
|||
error_code start(int, char **) override { return ERR_NOT_IMPLEMENTED; } | |||
error_code stop(bool) override { return ERR_NOT_IMPLEMENTED; } | |||
error_code stop(bool) override { return ERR_OK; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why change to ERR_OK
?
src/replica/prepare_list.cpp
Outdated
|
||
int count = 0; | ||
mutation_ptr mu = get_mutation_by_decree(last_committed_decree() + 1); | ||
|
||
while (mu != nullptr && mu->is_ready_for_commit() && mu->data.header.ballot >= last_bt) { | ||
_last_committed_decree++; | ||
last_bt = mu->data.header.ballot; | ||
_committer(mu); | ||
ERR_LOG_PREFIX_AND_RETURN_NOT_OK(_committer(mu), "commit error in COMMIT_ALL_READY"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once failed, should everything be rolled back ?
@@ -542,7 +545,10 @@ void replica::on_prepare(dsn::message_ex *request) | |||
} | |||
|
|||
error_code err = _prepare_list->prepare(mu, status(), pop_all_committed_mutations); | |||
CHECK_EQ_MSG(err, ERR_OK, "prepare mutation failed"); | |||
if (err != ERR_OK) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since for prepare()
there is a call chain prepare_list::prepare() => prepare_list::commit() => replica::execute_mutation() => _app->apply_mutation() => pegasus_server_impl::on_batched_write_requests() => pegasus_server_write::on_batched_write_requests() => pegasus_server_write::on_batched_writes()
where rocksdb write interface will be called, is it necessary to call handle_local_failure()
for error ?
Currently primary replica server respond to client after rocksdb has been written. However, rocksdb write interface may return To solve this problem, I think we can make rocksdb write asynchronous. Once fail to write rocksdb asynchronously, for example, |
464603e
to
fcfd1a9
Compare
#1383
Make Pegasus fault-tolerant storage engine (i.e. RocksDB) error
for write operations. The unrecoverable RocksDB instance
directory will be moved to a .err path and the replica will be
closed and removed from stub. Then meta server will detect the
missing replica and recover automatically.
This patch only deal with storage engine unrecoverable errors
for write operations.