Add flag to VTOrc to enable/disable its ability to run ERS #13259

Merged 3 commits on Jun 7, 2023
24 changes: 24 additions & 0 deletions changelog/18.0/18.0.0/summary.md
@@ -0,0 +1,24 @@
## Summary

### Table of Contents

- **[Major Changes](#major-changes)**
  - **[Breaking Changes](#breaking-changes)**
  - **[New command line flags and behavior](#new-flag)**
    - [VTOrc flag `--allow-emergency-reparent`](#new-flag-toggle-ers)
  - **[Deprecations and Deletions](#deprecations-and-deletions)**


## <a id="major-changes"/>Major Changes

### <a id="breaking-changes"/>Breaking Changes

### <a id="new-flag"/>New command line flags and behavior

#### <a id="new-flag-toggle-ers"/>VTOrc flag `--allow-emergency-reparent`

VTOrc has a new flag, `--allow-emergency-reparent`, that lets users toggle VTOrc's ability to run emergency reparent operations.
Users who want VTOrc to fix replication issues but not run any reparents should use this flag.
By default, VTOrc is allowed to run `EmergencyReparentShard`. Users must set the flag to `false` to change this behaviour.
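
As a rough illustration of the default-on behaviour described above, the sketch below parses the flag with Go's standard `flag` package. This is a stand-in chosen to keep the example dependency-free; the PR itself registers the flag through `pflag`. The helper name `parseAllowERS` and the flag-set name `vtorc-sketch` are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
)

// parseAllowERS is a hypothetical helper: it parses args with a throwaway
// flag set and returns the value of --allow-emergency-reparent.
// The default of true mirrors the changelog text above.
func parseAllowERS(args []string) (bool, error) {
	fs := flag.NewFlagSet("vtorc-sketch", flag.ContinueOnError)
	allow := fs.Bool(
		"allow-emergency-reparent",
		true,
		"Whether VTOrc should be allowed to run emergency reparent operation when it detects a dead primary",
	)
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *allow, nil
}

func main() {
	for _, args := range [][]string{
		{}, // flag omitted: the default applies
		{"--allow-emergency-reparent=false"}, // explicit opt-out
	} {
		allow, err := parseAllowERS(args)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("args=%v allow=%v\n", args, allow)
	}
}
```

Because this is a boolean flag, the `=false` form is the one to use; writing `false` as a separate argument after the flag name is not read as the flag's value.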

### <a id="deprecations-and-deletions"/>Deprecations and Deletions
1 change: 1 addition & 0 deletions changelog/18.0/README.md
@@ -0,0 +1 @@
## v18.0
1 change: 1 addition & 0 deletions go/flags/endtoend/vtorc.txt
@@ -1,4 +1,5 @@
Usage of vtorc:
--allow-emergency-reparent Whether VTOrc should be allowed to run emergency reparent operation when it detects a dead primary (default true)
--alsologtostderr log to standard error as well as files
--audit-file-location string File location where the audit logs are to be stored
--audit-purge-duration duration Duration for which audit logs are held before being purged. Should be in multiples of days (default 168h0m0s)
12 changes: 12 additions & 0 deletions go/vt/vtorc/config/config.go
@@ -67,6 +67,7 @@ var (
	waitReplicasTimeout            = 30 * time.Second
	topoInformationRefreshDuration = 15 * time.Second
	recoveryPollDuration           = 1 * time.Second
	ersEnabled                     = true
)

// RegisterFlags registers the flags required by VTOrc
@@ -86,6 +87,7 @@ func RegisterFlags(fs *pflag.FlagSet) {
	fs.DurationVar(&waitReplicasTimeout, "wait-replicas-timeout", waitReplicasTimeout, "Duration for which to wait for replica's to respond when issuing RPCs")
	fs.DurationVar(&topoInformationRefreshDuration, "topo-information-refresh-duration", topoInformationRefreshDuration, "Timer duration on which VTOrc refreshes the keyspace and vttablet records from the topology server")
	fs.DurationVar(&recoveryPollDuration, "recovery-poll-duration", recoveryPollDuration, "Timer duration on which VTOrc polls its database to run a recovery")
	fs.BoolVar(&ersEnabled, "allow-emergency-reparent", ersEnabled, "Whether VTOrc should be allowed to run emergency reparent operation when it detects a dead primary")
}

// Configuration makes for vtorc configuration input, which can be provided by user via JSON formatted file.
@@ -137,6 +139,16 @@ func UpdateConfigValuesFromFlags() {
	Config.RecoveryPollSeconds = int(recoveryPollDuration / time.Second)
}

// ERSEnabled reports whether VTOrc is allowed to run ERS or not.
func ERSEnabled() bool {
	return ersEnabled
}

// SetERSEnabled sets the value for the ersEnabled variable. This should only be used from tests.
func SetERSEnabled(val bool) {
	ersEnabled = val
}

// LogConfigValues is used to log the config values.
func LogConfigValues() {
	b, _ := json.MarshalIndent(Config, "", "\t")
4 changes: 4 additions & 0 deletions go/vt/vtorc/logic/topology_recovery.go
@@ -433,6 +433,10 @@ func getCheckAndRecoverFunctionCode(analysisCode inst.AnalysisCode, analyzedInst
	switch analysisCode {
	// primary
	case inst.DeadPrimary, inst.DeadPrimaryAndSomeReplicas:
		// If ERS is disabled, we have no way of repairing the cluster.
		if !config.ERSEnabled() {
			return noRecoveryFunc
		}
		if isInEmergencyOperationGracefulPeriod(analyzedInstanceKey) {
			return recoverGenericProblemFunc
		}
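
The ordering of checks in the hunk above can be sketched in isolation. The example below is a simplified stand-in, not the VTOrc implementation: the `recovery` type, the string analysis codes, the `pickRecovery` function, and the fall-through default are all hypothetical, and only the dead-primary and stopped-replication branches are modelled. It shows that the ERS toggle is consulted before the grace-period check, and that non-reparent recoveries do not consult it at all.

```go
package main

import "fmt"

// recovery is a hypothetical stand-in for VTOrc's recoveryFunction type.
type recovery string

const (
	noRecovery         recovery = "noRecovery"
	recoverGeneric     recovery = "recoverGenericProblem"
	recoverDeadPrimary recovery = "recoverDeadPrimary"
	fixReplica         recovery = "fixReplica"
)

// pickRecovery mirrors the order of checks in the hunk: for a dead primary,
// the ERS toggle is checked first, then the grace period.
func pickRecovery(analysis string, ersEnabled, inGracePeriod bool) recovery {
	switch analysis {
	case "DeadPrimary", "DeadPrimaryAndSomeReplicas":
		if !ersEnabled {
			// With ERS disabled there is nothing this branch can do.
			return noRecovery
		}
		if inGracePeriod {
			return recoverGeneric
		}
		return recoverDeadPrimary
	case "ReplicationStopped":
		// Replication fixes are independent of the ERS toggle.
		return fixReplica
	}
	// Assumption for this sketch: unknown codes map to no recovery.
	return noRecovery
}

func main() {
	fmt.Println(pickRecovery("DeadPrimary", true, false))
	fmt.Println(pickRecovery("DeadPrimary", false, false))
	fmt.Println(pickRecovery("ReplicationStopped", false, false))
}
```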
70 changes: 68 additions & 2 deletions go/vt/vtorc/logic/topology_recovery_test.go
@@ -26,10 +26,9 @@ import (

	topodatapb "vitess.io/vitess/go/vt/proto/topodata"
	"vitess.io/vitess/go/vt/topo/memorytopo"
	"vitess.io/vitess/go/vt/vtorc/config"
	"vitess.io/vitess/go/vt/vtorc/db"
	"vitess.io/vitess/go/vt/vtorc/inst"

	// import the gRPC client implementation for tablet manager
	_ "vitess.io/vitess/go/vt/vttablet/grpctmclient"
)

@@ -190,3 +189,70 @@ func TestDifferentAnalysescHaveDifferentCooldowns(t *testing.T) {
	_, err = AttemptRecoveryRegistration(&primaryAnalysisEntry, true, true)
	require.Nil(t, err)
}

func TestGetCheckAndRecoverFunctionCode(t *testing.T) {
	tests := []struct {
		name                 string
		ersEnabled           bool
		analysisCode         inst.AnalysisCode
		analyzedInstanceKey  *inst.InstanceKey
		wantRecoveryFunction recoveryFunction
	}{
		{
			name:         "DeadPrimary with ERS enabled",
			ersEnabled:   true,
			analysisCode: inst.DeadPrimary,
			analyzedInstanceKey: &inst.InstanceKey{
				Hostname: hostname,
				Port:     1,
			},
			wantRecoveryFunction: recoverDeadPrimaryFunc,
		}, {
			name:         "DeadPrimary with ERS disabled",
			ersEnabled:   false,
			analysisCode: inst.DeadPrimary,
			analyzedInstanceKey: &inst.InstanceKey{
				Hostname: hostname,
				Port:     1,
			},
			wantRecoveryFunction: noRecoveryFunc,
		}, {
			name:                 "PrimaryHasPrimary",
			ersEnabled:           false,
			analysisCode:         inst.PrimaryHasPrimary,
			wantRecoveryFunction: recoverPrimaryHasPrimaryFunc,
		}, {
			name:                 "ClusterHasNoPrimary",
			ersEnabled:           false,
			analysisCode:         inst.ClusterHasNoPrimary,
			wantRecoveryFunction: electNewPrimaryFunc,
		}, {
			name:                 "ReplicationStopped",
			ersEnabled:           false,
			analysisCode:         inst.ReplicationStopped,
			wantRecoveryFunction: fixReplicaFunc,
		}, {
			name:                 "PrimarySemiSyncMustBeSet",
			ersEnabled:           false,
			analysisCode:         inst.PrimarySemiSyncMustBeSet,
			wantRecoveryFunction: fixPrimaryFunc,
		},
	}

	// Needed for the test to work
	oldMap := emergencyOperationGracefulPeriodMap
	emergencyOperationGracefulPeriodMap = cache.New(time.Second*5, time.Millisecond*500)
	defer func() {
		emergencyOperationGracefulPeriodMap = oldMap
	}()
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			prevVal := config.ERSEnabled()
			config.SetERSEnabled(tt.ersEnabled)
			defer config.SetERSEnabled(prevVal)

			gotFunc := getCheckAndRecoverFunctionCode(tt.analysisCode, tt.analyzedInstanceKey)
			require.EqualValues(t, tt.wantRecoveryFunction, gotFunc)
		})
	}
}
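
Each subtest above saves the current toggle, overrides it, and restores it with `defer`. A minimal standalone sketch of that save/override/restore pattern follows. The package-level variable and its two accessors are reproduced locally as stand-ins for the ones in `config.go`, and the `withERSEnabled` helper is hypothetical; it is not part of the PR.

```go
package main

import "fmt"

// ersEnabled is a local stand-in for the package-level variable in config.go.
var ersEnabled = true

// ERSEnabled and SetERSEnabled are local copies of the accessors from the PR.
func ERSEnabled() bool        { return ersEnabled }
func SetERSEnabled(val bool) { ersEnabled = val }

// withERSEnabled is a hypothetical helper: it runs fn with the toggle set to
// val and restores the previous value afterwards, even if fn panics.
func withERSEnabled(val bool, fn func()) {
	prev := ERSEnabled()
	SetERSEnabled(val)
	defer SetERSEnabled(prev)
	fn()
}

func main() {
	withERSEnabled(false, func() {
		fmt.Println("inside:", ERSEnabled()) // prints "inside: false"
	})
	fmt.Println("after:", ERSEnabled()) // prints "after: true"
}
```

Since the toggle is a plain package-level variable with no synchronization, tests that override it this way should not run in parallel with each other.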