Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fault-tolerant storage engine errors for write operations #1399

Closed
wants to merge 5 commits into from

Conversation

acelyc111
Copy link
Member

@acelyc111 acelyc111 commented Mar 16, 2023

#1383

Make Pegasus fault-tolerant storage engine (i.e. RocksDB) error
for write operations. The unrecoverable RocksDB instance
directory will be moved to a .err path and the replica will be
closed and removed from stub. Then meta server will detect the
missing replica and recover automatically.
This patch only deal with storage engine unrecoverable errors
for write operations.

@github-actions github-actions bot added the cpp label Mar 16, 2023
@acelyc111 acelyc111 changed the title fix: Trash the unrecoverable rocksDB instance to .err path fix: Fault-tolerant storage engine errors for write operations Mar 17, 2023
@acelyc111 acelyc111 force-pushed the return_code branch 2 times, most recently from f90ffd7 to 7784c7e Compare March 17, 2023 12:52
@acelyc111 acelyc111 marked this pull request as ready for review March 18, 2023 05:31
src/common/fs_manager.cpp Outdated Show resolved Hide resolved
src/common/fs_manager.cpp Outdated Show resolved Hide resolved
@@ -46,7 +46,7 @@ class mock_replication_app_base : public replication_app_base
explicit mock_replication_app_base(replica *replica) : replication_app_base(replica) {}

error_code start(int, char **) override { return ERR_NOT_IMPLEMENTED; }
error_code stop(bool) override { return ERR_NOT_IMPLEMENTED; }
error_code stop(bool) override { return ERR_OK; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change to ERR_OK ?


int count = 0;
mutation_ptr mu = get_mutation_by_decree(last_committed_decree() + 1);

while (mu != nullptr && mu->is_ready_for_commit() && mu->data.header.ballot >= last_bt) {
_last_committed_decree++;
last_bt = mu->data.header.ballot;
_committer(mu);
ERR_LOG_PREFIX_AND_RETURN_NOT_OK(_committer(mu), "commit error in COMMIT_ALL_READY");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once failed, should everything be rolled back ?

src/replica/replica_stub.cpp Outdated Show resolved Hide resolved
src/replica/replica_stub.cpp Outdated Show resolved Hide resolved
@@ -542,7 +545,10 @@ void replica::on_prepare(dsn::message_ex *request)
}

error_code err = _prepare_list->prepare(mu, status(), pop_all_committed_mutations);
CHECK_EQ_MSG(err, ERR_OK, "prepare mutation failed");
if (err != ERR_OK) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since for prepare() there is a call chain prepare_list::prepare() => prepare_list::commit() => replica::execute_mutation() => _app->apply_mutation() => pegasus_server_impl::on_batched_write_requests() => pegasus_server_write::on_batched_write_requests() => pegasus_server_write::on_batched_writes() where rocksdb write interface will be called, is it necessary to call handle_local_failure() for error ?

@empiredan
Copy link
Contributor

Currently primary replica server respond to client after rocksdb has been written. However, rocksdb write interface may return kCorruption or kIOError, which will be returned to client and client will think this request has failed. In fact all of primary and secondary logs have been written successfully thus this request should be considered successful. Client will choose to write again and will lead to inconsistency for the non-idempotent writes such as incr, check_and_set and check_and_mutate.

To solve this problem, I think we can make rocksdb write asynchronous. Once fail to write rocksdb asynchronously, for example, kCorruption or kIOError, just remove the replica and move the rocksdb directory to .err and move this primary replica to other secondary replica. The consistency will be guaranteed if and only if logs are consistent. We can just write rocksdb asynchrously.

@acelyc111 acelyc111 force-pushed the return_code branch 3 times, most recently from 464603e to fcfd1a9 Compare March 27, 2023 08:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants