Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-22.1: server: retry ambiguous kv/sql operations on node startup #102245

Merged

Conversation

aliher1911
Copy link
Contributor

@aliher1911 aliher1911 commented Apr 25, 2023

server,util: avoid retry forever during startup upon premature shutdown

We need to have the RunIdempotentWithRetry calls abort if the
surrounding server shuts down prematurely. This happens most
frequently in tests that fail for another reason; and also
in multitenancy tests with multiple tenant server side-by-side.

Release note: None

server: retry ambiguous kv/sql operations on node startup

Previously, nodes could fail on startup if node itself was needed to
restore quorum on ranges. This problem was caused by circuit breaker
failing fast on live node when talking back to restarting node instead
of waiting for raft operation to run again (and succeed).
This PR adds explicit retries and also adds assertions enabled in race
build that would fire if startup reaches dist sender or internal sql
executor without wrapping it in startup retry helper.

Release note: None

testutil: zone config propagation wait for test cluster

In TestClusters with more than 4 nodes, wait for full replication
doesn't take updated config into account.
This commit adds new check that zone config was propagated prior to
making replication state checks.

Epic: none

Release note: None


Backport:

/cc @cockroachdb/release

Release justification: This backport addresses production issue with nodes failing start.

In TestClusters with more than 4 nodes, wait for full replication
doesn't take updated config into account.
This commit adds new check that zone config was propagated prior to
making replication state checks.

Epic: none

Release note: None
@aliher1911 aliher1911 self-assigned this Apr 25, 2023
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@aliher1911 aliher1911 changed the title testutil: zone config propagation wait for test cluster release-22.1: server: retry ambiguous kv/sql operations on node startup Apr 25, 2023
@aliher1911 aliher1911 force-pushed the backport22.1-97710-100583 branch 10 times, most recently from 97bdda1 to 5398c94 Compare April 26, 2023 11:34
aliher1911 and others added 2 commits April 26, 2023 16:49
Previously, nodes could fail on startup if node itself was needed to
restore quorum on ranges. This problem was caused by circuit breaker
failing fast on live node when talking back to restarting node instead
of waiting for raft operation to run again (and succeed).
This PR adds explicit retries and also adds assertions enabled in race
build that would fire if startup reaches dist sender or internal sql
executor without wrapping it in startup retry helper.

Release note: None
We need to have the `RunIdempotentWithRetry` calls abort if the
surrounding server shuts down prematurely. This happens most
frequently in tests that fail for another reason; and also
in multitenancy tests with multiple tenant server side-by-side.

Release note: None
@aliher1911 aliher1911 force-pushed the backport22.1-97710-100583 branch from 5398c94 to 3a217ee Compare April 26, 2023 15:50
@aliher1911 aliher1911 marked this pull request as ready for review April 27, 2023 08:46
@aliher1911 aliher1911 requested review from a team as code owners April 27, 2023 08:46
@aliher1911 aliher1911 requested review from a team and tbg April 27, 2023 08:46
@@ -479,7 +480,7 @@ func ExpectedDescriptorIDs(
defaultZoneConfig *zonepb.ZoneConfig,
defaultSystemZoneConfig *zonepb.ZoneConfig,
) (descpb.IDs, error) {
completedMigrations, err := getCompletedMigrations(ctx, db, codec)
completedMigrations, err := getCompletedMigrations(ctx, make(chan struct{}), db, codec)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't passing the struct here lead to stalls during quiesce?
Ah I see the description says this is test only. I also no longer see this method on master. Ok, let's leave as is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go land even shows me this method as grey and I don't see any use of it, so it is likely a leftover but I decided not to clutter the PR with unrelated change.

@aliher1911 aliher1911 merged commit f816de4 into cockroachdb:release-22.1 Apr 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants