roachtest: allow more errors for REGION survivability DRT #80526

Merged 1 commit on May 19, 2022
29 changes: 18 additions & 11 deletions pkg/cmd/roachtest/tests/tpcc.go
@@ -614,6 +614,21 @@ func registerTPCC(r registry.Registry) {
if err != nil {
return tpccChaosEventProcessor{}, err
}
+// We see a slow trickle of errors after a server has been force shutdown due
+// to queries before the shutdown not fully completing. You can inspect this
+// by looking at the workload logs and corresponding the errors with the
+// prometheus graphs.
+// The errors seen can be of the form:
+// * ERROR: inbox communication error: rpc error: code = Canceled
+// desc = context canceled (SQLSTATE 58C01)
+// Setting this allows some errors to occur.
+allowedErrorsMultiplier := 5
+if tc.survivalGoal == "region" {
+// REGION failures last a bit longer after a region has gone down.
+allowedErrorsMultiplier *= 20
+}
+maxErrorsDuringUptime := warehousesPerRegion * tpcc.NumWorkersPerWarehouse * allowedErrorsMultiplier
+
return tpccChaosEventProcessor{
workloadInstances: workloadInstances,
workloadNodeIP: prometheusNodeIP[0],
@@ -624,17 +639,9 @@ func registerTPCC(r registry.Registry) {
"orderStatus",
"stockLevel",
},
-ch: chaosEventCh,
-promClient: promv1.NewAPI(client),
-// We see a slow trickle of errors after a server has been force shutdown due
-// to queries before the shutdown not fully completing. You can inspect this
-// by looking at the workload logs and corresponding the errors with the
-// prometheus graphs.
-// The errors seen can be be of the form:
-// * ERROR: inbox communication error: rpc error: code = Canceled
-// desc = context canceled (SQLSTATE 58C01)
-// Setting this allows some errors to occur.
-maxErrorsDuringUptime: warehousesPerRegion * tpcc.NumWorkersPerWarehouse,
+ch: chaosEventCh,
+promClient: promv1.NewAPI(client),
+maxErrorsDuringUptime: maxErrorsDuringUptime,
// "delivery" does not trigger often.
allowZeroSuccessDuringUptime: true,
}, nil