-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-22.1: server: retry ambiguous kv/sql operations on node startup #102245
release-22.1: server: retry ambiguous kv/sql operations on node startup #102245
Conversation
In TestClusters with more than 4 nodes, wait for full replication doesn't take updated config into account. This commit adds new check that zone config was propagated prior to making replication state checks. Epic: none Release note: None
97bdda1
to
5398c94
Compare
Previously, nodes could fail on startup if node itself was needed to restore quorum on ranges. This problem was caused by circuit breaker failing fast on live node when talking back to restarting node instead of waiting for raft operation to run again (and succeed). This PR adds explicit retries and also adds assertions enabled in race build that would fire if startup reaches dist sender or internal sql executor without wrapping it in startup retry helper. Release note: None
We need to have the `RunIdempotentWithRetry` calls abort if the surrounding server shuts down prematurely. This happens most frequently in tests that fail for another reason; and also in multitenancy tests with multiple tenant server side-by-side. Release note: None
5398c94
to
3a217ee
Compare
@@ -479,7 +480,7 @@ func ExpectedDescriptorIDs( | |||
defaultZoneConfig *zonepb.ZoneConfig, | |||
defaultSystemZoneConfig *zonepb.ZoneConfig, | |||
) (descpb.IDs, error) { | |||
completedMigrations, err := getCompletedMigrations(ctx, db, codec) | |||
completedMigrations, err := getCompletedMigrations(ctx, make(chan struct{}), db, codec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't passing the struct here lead to stalls during quiesce?
Ah I see the description says this is test only. I also no longer see this method on master
. Ok, let's leave as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Go land even shows me this method as grey and I don't see any use of it, so it is likely a leftover but I decided not to clutter the PR with unrelated change.
server,util: avoid retry forever during startup upon premature shutdown
We need to have the
RunIdempotentWithRetry
calls abort if thesurrounding server shuts down prematurely. This happens most
frequently in tests that fail for another reason; and also
in multitenancy tests with multiple tenant server side-by-side.
Release note: None
server: retry ambiguous kv/sql operations on node startup
Previously, nodes could fail on startup if node itself was needed to
restore quorum on ranges. This problem was caused by circuit breaker
failing fast on live node when talking back to restarting node instead
of waiting for raft operation to run again (and succeed).
This PR adds explicit retries and also adds assertions enabled in race
build that would fire if startup reaches dist sender or internal sql
executor without wrapping it in startup retry helper.
Release note: None
testutil: zone config propagation wait for test cluster
In TestClusters with more than 4 nodes, wait for full replication
doesn't take updated config into account.
This commit adds new check that zone config was propagated prior to
making replication state checks.
Epic: none
Release note: None
Backport:
/cc @cockroachdb/release
Release justification: This backport addresses production issue with nodes failing start.