
roachtest: failover variant for partial network partitions #94614

Closed
erikgrinaker opened this issue Jan 3, 2023 · 4 comments · Fixed by #95394 or #103254
Labels
A-kv-distribution Relating to rebalancing and leasing. A-testing Testing tools and infrastructure O-qa

Comments

@erikgrinaker

erikgrinaker commented Jan 3, 2023

We should add failover roachtest variants that benchmark range unavailability under partial network partitions. There are two main variants, and in both of them all nodes can still reach the liveness leaseholder: the range leaseholder partitioned away from the Raft leader, and the SQL gateway partitioned away from the range leaseholder.
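To make the two variants concrete, here is a minimal sketch that models a partial partition as a set of blocked node pairs and checks the property both variants share, namely that every node can still reach the liveness leaseholder. The node IDs, type names, and helper functions are all hypothetical and chosen for illustration; this is not how roachtest represents partitions.

```go
package main

import "fmt"

// Hypothetical node roles, for illustration only.
const (
	livenessLeaseholder = 1
	raftLeader          = 2
	rangeLeaseholder    = 3
	gateway             = 4
)

// nodes lists every node in the toy cluster.
var nodes = []int{livenessLeaseholder, raftLeader, rangeLeaseholder, gateway}

// link is a directed network path from one node to another.
type link struct{ from, to int }

// network records which directed links are blocked.
type network struct {
	blocked map[link]bool
}

func newNetwork() *network { return &network{blocked: map[link]bool{}} }

// partition blocks traffic between a and b in both directions.
func (n *network) partition(a, b int) {
	n.blocked[link{a, b}] = true
	n.blocked[link{b, a}] = true
}

// connected reports whether a and b can exchange traffic in both directions.
func (n *network) connected(a, b int) bool {
	return !n.blocked[link{a, b}] && !n.blocked[link{b, a}]
}

// allReach reports whether every other node is connected to target.
func (n *network) allReach(target int) bool {
	for _, id := range nodes {
		if id != target && !n.connected(id, target) {
			return false
		}
	}
	return true
}

func main() {
	// Variant 1: range leaseholder partitioned away from the Raft leader.
	v1 := newNetwork()
	v1.partition(rangeLeaseholder, raftLeader)

	// Variant 2: SQL gateway partitioned away from the range leaseholder.
	v2 := newNetwork()
	v2.partition(gateway, rangeLeaseholder)

	for i, n := range []*network{v1, v2} {
		fmt.Printf("variant %d: all nodes reach liveness leaseholder: %t\n",
			i+1, n.allReach(livenessLeaseholder))
	}
}
```

In both variants only a single pair of nodes is cut off from each other, which is what distinguishes these cases from a full partition that isolates a node from the rest of the cluster.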

See also internal doc.

Jira issue: CRDB-23045

Epic CRDB-25212

@erikgrinaker erikgrinaker added O-qa A-kv-distribution Relating to rebalancing and leasing. A-testing Testing tools and infrastructure T-kv-replication labels Jan 3, 2023
@blathers-crl

blathers-crl bot commented Jan 3, 2023

cc @cockroachdb/replication

@andrewbaptist andrewbaptist self-assigned this Jan 4, 2023
@erikgrinaker

erikgrinaker commented Jan 4, 2023

@andrewbaptist Note that we already have roachtest variants for asymmetric partitions, which is what's most relevant for #84289: failover/non-system/blackhole-recv and failover/non-system/blackhole-send.

r.Add(registry.TestSpec{
	Name:    fmt.Sprintf("failover/non-system/%s", failureMode),
	Owner:   registry.OwnerKV,
	Timeout: 30 * time.Minute,
	Cluster: r.MakeClusterSpec(7, spec.CPU(4)),
	Run: func(ctx context.Context, t test.Test, c cluster.Cluster) {
		runFailoverNonSystem(ctx, t, c, failureMode)
	},
})

The pMax recovery time is graphed nightly, and looks terrible because we just don't handle these failures at all (60 seconds is the current maximum latency the tests measure).
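For readers unfamiliar with the asymmetric variants named above, the following sketch shows one way the difference between blackhole-recv and blackhole-send could be expressed as firewall rules: dropping traffic on the inbound chain versus the outbound chain. The helper name `blackholeRule` is hypothetical, and the use of iptables with port 26257 (CockroachDB's default port) is an assumption for illustration; the thread does not say how the tests actually inject the failure.

```go
package main

import "fmt"

// blackholeRule builds an iptables command that silently drops TCP packets
// on the given port in one direction only. This is an illustrative
// assumption, not the roachtest implementation.
func blackholeRule(mode string, port int) (string, error) {
	var chain string
	switch mode {
	case "blackhole-recv":
		chain = "INPUT" // drop packets arriving at the node
	case "blackhole-send":
		chain = "OUTPUT" // drop packets leaving the node
	default:
		return "", fmt.Errorf("unknown failure mode %q", mode)
	}
	// The multiport --ports match covers both source and destination ports,
	// so established connections in either role are affected.
	return fmt.Sprintf("iptables -A %s -p tcp -m multiport --ports %d -j DROP",
		chain, port), nil
}

func main() {
	for _, mode := range []string{"blackhole-recv", "blackhole-send"} {
		cmd, err := blackholeRule(mode, 26257)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: %s\n", mode, cmd)
	}
}
```

Because packets are dropped rather than rejected, peers see no error and must rely on timeouts to notice the failure, which is consistent with the long recovery times described above.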

@andrewbaptist andrewbaptist linked a pull request Jan 17, 2023 that will close this issue
@craig craig bot closed this as completed in #95394 Mar 6, 2023
@erikgrinaker

This is not done yet. The test in #95394 does not cover two important classes of partition: leader/leaseholder and leaseholder/gateway.

@erikgrinaker erikgrinaker reopened this Mar 6, 2023
@erikgrinaker

I'll pick this up as part of the Raft prevote work.
