Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-22.2: server: retry ambiguous kv/sql operations on node startup #100463

Merged
merged 3 commits into from
Apr 27, 2023

Conversation

aliher1911
Copy link
Contributor

@aliher1911 aliher1911 commented Apr 3, 2023

In TestClusters with more than 4 nodes, wait for full replication
doesn't take updated config into account.
This commit adds new check that zone config was propagated prior to
making replication state checks.

Epic: none

Release note: None

Backport 2/2 commits from #97710.

/cc @cockroachdb/release


Previously, nodes could fail on startup if node itself was needed to restore quorum on ranges. This problem was caused by circuit breaker failing fast on live node when talking back to restarting node instead of waiting for raft operation to run again (and succeed).
This PR adds explicit retries and also adds assertions enabled in race build that would fire if startup reaches dist sender or internal sql executor without wrapping it in startup retry helper.

Release note: None

Fixes #74714


Includes fix #100583

Release justification: fixes a stability issue during node restarts

In TestClusters with more than 4 nodes, wait for full replication
doesn't take updated config into account.
This commit adds new check that zone config was propagated prior to
making replication state checks.

Epic: none

Release note: None
@aliher1911 aliher1911 self-assigned this Apr 3, 2023
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@aliher1911 aliher1911 changed the title testutil: zone config propagation wait for test cluster server: retry ambiguous kv/sql operations on node startup Apr 4, 2023
@erikgrinaker erikgrinaker changed the title server: retry ambiguous kv/sql operations on node startup release-22.2: server: retry ambiguous kv/sql operations on node startup Apr 21, 2023
@aliher1911 aliher1911 force-pushed the backport22.2-97710 branch 8 times, most recently from 4c7a842 to fcbf4b7 Compare April 25, 2023 17:36
aliher1911 and others added 2 commits April 25, 2023 20:36
Previously, nodes could fail on startup if node itself was needed to
restore quorum on ranges. This problem was caused by circuit breaker
failing fast on live node when talking back to restarting node instead
of waiting for raft operation to run again (and succeed).
This PR adds explicit retries and also adds assertions enabled in race
build that would fire if startup reaches dist sender or internal sql
executor without wrapping it in startup retry helper.

Release note: None
We need to have the `RunIdempotentWithRetry` calls abort if the
surrounding server shuts down prematurely. This happens most
frequently in tests that fail for another reason; and also
in multitenancy tests with multiple tenant server side-by-side.

Release note: None
@aliher1911 aliher1911 marked this pull request as ready for review April 26, 2023 07:50
@aliher1911 aliher1911 requested a review from a team as a code owner April 26, 2023 07:50
@aliher1911 aliher1911 requested a review from a team April 26, 2023 07:50
@aliher1911 aliher1911 requested review from a team as code owners April 26, 2023 07:50
@aliher1911 aliher1911 requested a review from a team April 26, 2023 07:50
@aliher1911 aliher1911 requested a review from a team as a code owner April 26, 2023 07:50
@aliher1911 aliher1911 requested a review from tbg April 26, 2023 07:51
@aliher1911
Copy link
Contributor Author

TestTransientClusterSimulateLatencies is unrelated to backport and are flaking on the branch with the same error previously.

@aliher1911
Copy link
Contributor Author

@tbg this is a backport PR

@aliher1911 aliher1911 merged commit e558649 into cockroachdb:release-22.2 Apr 27, 2023
@aliher1911 aliher1911 deleted the backport22.2-97710 branch April 27, 2023 16:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants