release-23.1: roachtest: add failover variants for partial network partitions #103974

Merged: 5 commits into cockroachdb:release-23.1 on May 27, 2023

@erikgrinaker erikgrinaker commented May 26, 2023

Backport 5/5 commits from #103254.

/cc @cockroachdb/release

Release justification: test-only change.


roachtest: improve zone config readability in failover tests

This patch tweaks the zone config APIs in the failover tests to improve readability. It also fixes a bug that placed replicas on the wrong nodes in failover/partial/lease-liveness.

roachtest: improve failover/partial/lease-liveness

This test did not sufficiently constrain range and lease placement, so it occasionally hit failure modes other than the one it is meant to test, resulting in permanent unavailability.

This patch makes the test more prescriptive, by separating out the system ranges, SQL gateways, liveness leaseholder, and user ranges, and only introducing a partial partition between a user leaseholder and the liveness leaseholder.
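
To illustrate what "partial" means here, the following is a minimal sketch, not the roachtest code: a single link is blocked while every node stays reachable through some other path. The node role names and helper functions are hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical node roles, loosely mirroring the separation described above.
NODES = ["system", "gateway", "liveness", "user_leaseholder", "user_follower"]


def link(a, b):
    """An undirected link between two nodes."""
    return frozenset((a, b))


def can_talk(blocked, a, b):
    """True if a and b can communicate directly (their link is not blocked)."""
    return a == b or link(a, b) not in blocked


def is_connected(nodes, blocked):
    """Breadth-first search over unblocked links, starting from the first node."""
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        cur = queue.popleft()
        for other in nodes:
            if other not in seen and can_talk(blocked, cur, other):
                seen.add(other)
                queue.append(other)
    return len(seen) == len(nodes)


def is_partial_partition(nodes, blocked):
    """Partial: at least one link is blocked, but no node is cut off entirely."""
    return bool(blocked) and is_connected(nodes, blocked)


# Only the link between the user leaseholder and the liveness leaseholder is cut.
blocked = {link("user_leaseholder", "liveness")}
```

In this model the two nodes cannot talk to each other directly, yet both can still reach the gateway, the system node, and the follower, which is what distinguishes this failure mode from a full partition.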

roachtest: add failover/partial/lease-gateway

This patch adds a roachtest that benchmarks the pMax unavailability during a partial network partition between a SQL gateway and a leaseholder. We currently don't handle this failure mode at all, and expect this to result in permanent unavailability.

kvserver: add COCKROACH_DISABLE_LEADER_FOLLOWS_LEASEHOLDER

This patch adds the COCKROACH_DISABLE_LEADER_FOLLOWS_LEASEHOLDER environment variable, which disables colocation of the Raft leader and leaseholder. This is useful for tests.
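
As a rough sketch of how a boolean toggle like this is typically consumed, the snippet below parses the variable and derives whether colocation stays enabled. It is illustrative only: the accepted values mirror Go's `strconv.ParseBool`, and the assumptions that the real code parses the variable this way and defaults to false are mine, not taken from the patch. The helper names are hypothetical.

```python
import os

ENV_VAR = "COCKROACH_DISABLE_LEADER_FOLLOWS_LEASEHOLDER"

# Values accepted by Go's strconv.ParseBool (assumed parsing behavior).
_TRUE = {"1", "t", "T", "TRUE", "true", "True"}
_FALSE = {"0", "f", "F", "FALSE", "false", "False"}


def env_bool(name, default=False, environ=None):
    """Parse a boolean environment variable, falling back to default if unset."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    if raw in _TRUE:
        return True
    if raw in _FALSE:
        return False
    raise ValueError(f"invalid boolean value {raw!r} for {name}")


def leader_follows_leaseholder(environ=None):
    """Colocation stays enabled unless the disable flag is set to true."""
    return not env_bool(ENV_VAR, environ=environ)
```

For example, `leader_follows_leaseholder({ENV_VAR: "true"})` returns `False`, while an empty environment leaves colocation enabled.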

roachtest: add failover/partial/lease-leader

This patch adds a roachtest that benchmarks unavailability during a partial partition between a Raft leader and leaseholder.

Resolves #94614.
Touches #93503.
Epic: none

Release note: None

@erikgrinaker erikgrinaker requested a review from a team as a code owner May 26, 2023 20:47
@erikgrinaker erikgrinaker self-assigned this May 26, 2023
@erikgrinaker erikgrinaker requested review from herkolategan and smg260 and removed request for a team May 26, 2023 20:47
blathers-crl bot commented May 26, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.

If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.

  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@erikgrinaker erikgrinaker force-pushed the backport23.1-103254 branch from 215cf69 to 3f53f25 Compare May 26, 2023 21:37
@msbutler msbutler merged commit afe0240 into cockroachdb:release-23.1 May 27, 2023