This repository has been archived by the owner on Jun 23, 2022. It is now read-only.

feat(dup): support multiple fail modes for duplication #429

Merged (11 commits) on Mar 27, 2020


neverchanje (Contributor) commented Mar 25, 2020

This PR introduces a new concept for duplication: the "fail_mode". When a permanent failure occurs, fail modes allow flexible and extensible failure-handling strategies.

replication.thrift

// How duplication reacts on permanent failure.
enum duplication_fail_mode
{
    // The default mode. If a permanent failure blocks duplication,
    // it will retry forever until external intervention.
    FAIL_SLOW = 0,

    // Skip the writes that failed to duplicate, which means minor data loss on the remote cluster.
    // In exchange, the system stays more stable.
    FAIL_SKIP,

    // Stop immediately once duplication determines that it is unable to proceed.
    // WARN: this mode kills the server process; all replicas on the server will be affected.
    FAIL_FAST
}

We currently support 3 fail-modes.

  • FAIL_SLOW: the default mode. It relies on monitoring to report the failure and requires human intervention, such as downgrading the failing primary, to resolve the problem.

  • FAIL_SKIP: skips the failing file (when the file is corrupted) or the failing RPC (when the remote cluster is possibly down). This mode is useful for those who care more about high availability than about data safety.

  • FAIL_FAST: fail-fast (suicide and coredump) is always the simplest form of error handling, but it is not ops-friendly: its blast radius covers many irrelevant replicas on the server.

Users can change the fail mode dynamically through 'modify_dup', using the shell command 'update_fail_mode'.
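
For illustration only, the three modes could be dispatched roughly as in the sketch below. This is not the code from this PR: `fail_mode`, `next_action`, and `on_permanent_failure` are hypothetical names, and the enum is a stand-in for the Thrift-generated `duplication_fail_mode`.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-in for the Thrift-generated duplication_fail_mode enum.
enum class fail_mode { FAIL_SLOW = 0, FAIL_SKIP, FAIL_FAST };

// Hypothetical result type: what the caller does next with the failing write.
enum class next_action { RETRY, SKIP };

// Decide how to proceed after a permanent duplication failure.
inline next_action on_permanent_failure(fail_mode mode)
{
    switch (mode) {
    case fail_mode::FAIL_SLOW:
        // Keep retrying; monitoring is expected to alert an operator.
        return next_action::RETRY;
    case fail_mode::FAIL_SKIP:
        // Drop the failing write and move on (minor data loss on the remote).
        return next_action::SKIP;
    case fail_mode::FAIL_FAST:
        // Kill the whole server process; every replica on it is affected.
        std::abort();
    }
    // A value outside the enum means the state is corrupted: fail early.
    assert(false && "unknown fail mode");
    std::abort();
}
```

Under these assumptions, a caller would loop on RETRY and advance past the write on SKIP; FAIL_FAST never returns.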

levy5307 previously approved these changes Mar 26, 2020
@@ -43,6 +43,15 @@ namespace replication {
    return it->second;
}

/*extern*/ const char *duplication_fail_mode_to_string(duplication_fail_mode::type fmode)
{
    auto it = _duplication_fail_mode_VALUES_TO_NAMES.find(fmode);

Review comment (Member):

duplication_fail_mode::type is just a struct enum, so it can be set to any value. If an invalid value is set, will the server coredump here?

Reply (Contributor Author):

Yes. If that happens, something must be wrong, so failing early is necessary.
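
A minimal sketch of the fail-early lookup discussed above, assuming a Thrift-style values-to-names map. The struct, the map, and `fail_mode_to_string` are hypothetical stand-ins written for this example; the diff hunk above is truncated, so this is not the PR's actual function body.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <map>

// Hypothetical stand-in for the Thrift-generated struct enum.
struct duplication_fail_mode
{
    enum type { FAIL_SLOW = 0, FAIL_SKIP, FAIL_FAST };
};

// Stand-in for the generated _duplication_fail_mode_VALUES_TO_NAMES map.
static const std::map<int, const char *> fail_mode_names = {
    {duplication_fail_mode::FAIL_SLOW, "FAIL_SLOW"},
    {duplication_fail_mode::FAIL_SKIP, "FAIL_SKIP"},
    {duplication_fail_mode::FAIL_FAST, "FAIL_FAST"},
};

// Convert a fail mode to its name; abort on a value outside the enum,
// since that can only come from a bug or corrupted state.
inline const char *fail_mode_to_string(duplication_fail_mode::type fmode)
{
    auto it = fail_mode_names.find(fmode);
    if (it == fail_mode_names.end()) {
        std::fprintf(stderr, "unexpected duplication_fail_mode: %d\n", static_cast<int>(fmode));
        std::abort();
    }
    return it->second;
}

// Example usage: a valid value maps to its name.
inline void example_usage()
{
    assert(std::strcmp(fail_mode_to_string(duplication_fail_mode::FAIL_SLOW), "FAIL_SLOW") == 0);
}
```

The explicit check keeps the fail-early behavior in release builds too, where a plain assert would be compiled out.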

@neverchanje neverchanje deleted the dup-fail-mode branch May 26, 2020 16:55
3 participants