-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: fix the tenant server error handling #100436
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This comment was marked as outdated.
This comment was marked as outdated.
knz
force-pushed
the
20230402-tenant-start
branch
2 times, most recently
from
April 4, 2023 09:46
ef0f50f
to
738e5ec
Compare
Encountering this issue while troubleshooting TestServerControllerMultiNodeTenantStartup. Will come back to this later. |
knz
force-pushed
the
20230402-tenant-start
branch
from
April 4, 2023 11:10
738e5ec
to
4eaffc1
Compare
Adding this diff to help with #100578: @@ -73,6 +74,12 @@ func (u Updater) update(ctx context.Context, useReadLock bool, updateFn UpdateFn
ctx, sp := tracing.ChildSpan(ctx, "update-job")
defer sp.Finish()
+ // Disable DistSQL query distribution. This ensures that the job operations do not
+ // require SQL servers to be ready on other nodes.
+ prevMode := u.txn.SessionData().DistSQLMode
+ defer func(prevMode sessiondatapb.DistSQLExecMode) { u.txn.SessionData().DistSQLMode = prevMode }(prevMode)
+ u.txn.SessionData().DistSQLMode = sessiondatapb.DistSQLOff
+ |
This was referenced Apr 4, 2023
knz
force-pushed
the
20230402-tenant-start
branch
from
April 4, 2023 13:41
4eaffc1
to
e73eab3
Compare
knz
force-pushed
the
20230402-tenant-start
branch
from
April 5, 2023 21:58
e73eab3
to
927c1c4
Compare
knz
force-pushed
the
20230402-tenant-start
branch
from
April 5, 2023 22:00
927c1c4
to
dc5eb63
Compare
knz
force-pushed
the
20230402-tenant-start
branch
from
April 6, 2023 12:45
dc5eb63
to
07dd47f
Compare
craig bot
pushed a commit
that referenced
this pull request
Apr 6, 2023
100579: jobs: prevent a deadlock during upgrades r=yuzefovich a=knz Needed for #100436. Might address #100578 (unsure) Epic: None Prior to this patch, if multiple SQL instances were started side-by-side but behind on migrations, they would deadlock on performing their migrations because distsql on each instance would be unable to reach other other instances. More generally, we're finding it undesirable for the jobs subsystem to operate on system.jobs / job_info using distributed queries. This patch fixes it by disabling query distribution during job operations. The testing here is implicit when the test TestServerControllerMultiNodeTenantStartup is stressed - this used to deadlock under stress without this patch. 100764: roachtest: remove old executables before installing ruby r=rafiss a=rafiss This should prevent errors during installation. fixes #95004 fixes #100428 backport fixes #100586 Release note: None Co-authored-by: Raphael 'kena' Poss <[email protected]> Co-authored-by: Rafi Shamim <[email protected]>
knz
force-pushed
the
20230402-tenant-start
branch
from
April 6, 2023 13:51
07dd47f
to
53f6be2
Compare
…enantStartup Release note: None
Release note: None
Release note: None
Release note: None
ahead of splitting "new" vs "start" during construction. Release note: None
Release note: None
This peels the call to "start" from the `newTenantServer` interface and pulls it into the orchestration retry loop. This change also incidentally reveals an earlier misdesign: we are calling `newTenantServer` _then_ `start` in the same retry loop. If `new` succeeds but `start` fails, the next retry will call `newTenantServer` again *with the same stopper*, which will leak closers from the previous call to `new`. Release note: None
Prior to this patch, if an error occurred during the initialization or startup of a secondary tenant server, the initialization would leak state into the stopper defined for that tenant. Generally, reusing a stopper across server startup failures is not safe (and API violation). This patch fixes it by decoupling the intermediate stopper used for orchestration from the one used per tenant server. Release note: None
Prior to this patch, the test was not cleaning up its server stopper reliably at the end of each sub-test. This patch fixes it. Release note: None
knz
force-pushed
the
20230402-tenant-start
branch
from
April 6, 2023 14:50
53f6be2
to
1491929
Compare
craig bot
pushed a commit
that referenced
this pull request
Apr 6, 2023
99958: jobs,server: graceful shutdown for secondary tenant servers r=stevendanna a=knz Epic: CRDB-23559 Fixes #92523. All commits but the last are from #100436. This change ensures that tenant servers managed by the server controller receive a graceful drain request as part of the graceful drain process of the surrounding KV node. This change, in turn, ensures that SQL clients connected to these secondary tenant servers benefit from the same guarantees (and graceful periods) as clients to the system tenant. 100726: upgrades: use TestingBinaryMinSupportedVersion in tests r=rafiss a=rafiss As described in #100552, it's important for this API to use TestingBinaryMinSupportedVersion in order to correctly bootstrap on the older version. informs #100552 Release note: None 100741: contextutil: teach TimeoutError to redact only the operation name r=andreimatei a=andreimatei Before this patch, the whole message of TimeoutError was redacted in logs. Now, only the operation name is. Release note: None Epic: None 100778: norm: update prune cols to match PruneJoinLeftCols/PruneJoinRightCols r=msirek a=msirek In #90599 adjustments where made to the PruneJoinLeftCols and PruneJoinRightCols normalization rules to avoid pruning columns which might be needed when deriving new predicates based on foreign key constraints for lookup join. However, this caused a problem where rules might sometimes fire in an infinite loop because the same columns to prune keep getting added as PruneCols in calls to DerivePruneCols. The logic in prune_cols.opt and DerivePruneCols must be kept in sync to avoid such problems, and this PR brings it back in sync. Epic: none Fixes: #100478 Release note: None 100821: cmd/roachtest: adjust disk-stalled roachtests TPS calculation r=itsbilal a=jbowens Previously, the post-stall TPS calculation included the time that the node was stalled but before the stall triggered the node's exit. During this period, overall TPS drops until the gray failure is converted into a hard failure. This commit adjusts the post-stall TPS calculation to exclude the stalled time when TPS is expected to tank. Epic: None Informs: #97705. Release note: None Co-authored-by: Raphael 'kena' Poss <[email protected]> Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: Andrei Matei <[email protected]> Co-authored-by: Mark Sirek <[email protected]> Co-authored-by: Jackson Owens <[email protected]>
This was referenced Apr 10, 2023
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Needed for #99958.
Fixes #97661.
Fixes #98868.
Epic: CRDB-23559
First commit from #100579.
Prior to this patch, if an error occurred during the initialization or
startup of a secondary tenant server, the initialization would leak
state into the stopper defined for that tenant. Generally, reusing
a stopper across server startup failures is not safe (and API
violation).
This patch fixes it by decoupling the intermediate stopper used for
orchestration from the one used per tenant server.