Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: fix the tenant server error handling #100436

Merged
merged 9 commits into from
Apr 6, 2023

Conversation

knz
Copy link
Contributor

@knz knz commented Apr 2, 2023

Needed for #99958.
Fixes #97661.
Fixes #98868.
Epic: CRDB-23559
First commit from #100579.

Prior to this patch, if an error occurred during the initialization or
startup of a secondary tenant server, the initialization would leak
state into the stopper defined for that tenant. Generally, reusing
a stopper across server startup failures is not safe (and API
violation).

This patch fixes it by decoupling the intermediate stopper used for
orchestration from the one used per tenant server.

@blathers-crl

This comment was marked as outdated.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz knz requested a review from stevendanna April 2, 2023 18:30
@knz knz marked this pull request as ready for review April 2, 2023 18:30
@knz knz requested a review from a team as a code owner April 2, 2023 18:30
@knz knz requested a review from a team April 2, 2023 18:30
@knz knz requested review from a team as code owners April 2, 2023 18:30
@knz knz force-pushed the 20230402-tenant-start branch 2 times, most recently from ef0f50f to 738e5ec Compare April 4, 2023 09:46
@knz
Copy link
Contributor Author

knz commented Apr 4, 2023

Encountering this issue while troubleshooting TestServerControllerMultiNodeTenantStartup. Will come back to this later.

@knz knz force-pushed the 20230402-tenant-start branch from 738e5ec to 4eaffc1 Compare April 4, 2023 11:10
@knz knz requested a review from a team as a code owner April 4, 2023 11:10
@knz
Copy link
Contributor Author

knz commented Apr 4, 2023

Adding this diff to help with #100578:

@@ -73,6 +74,12 @@ func (u Updater) update(ctx context.Context, useReadLock bool, updateFn UpdateFn
        ctx, sp := tracing.ChildSpan(ctx, "update-job")
        defer sp.Finish()

+       // Disable DistSQL query distribution. This ensures that the job operations do not
+       // require SQL servers to be ready on other nodes.
+       prevMode := u.txn.SessionData().DistSQLMode
+       defer func(prevMode sessiondatapb.DistSQLExecMode) { u.txn.SessionData().DistSQLMode = prevMode }(prevMode)
+       u.txn.SessionData().DistSQLMode = sessiondatapb.DistSQLOff
+

@knz knz force-pushed the 20230402-tenant-start branch from 4eaffc1 to e73eab3 Compare April 4, 2023 13:41
@abarganier abarganier removed the request for review from a team April 4, 2023 14:36
@knz knz force-pushed the 20230402-tenant-start branch from e73eab3 to 927c1c4 Compare April 5, 2023 21:58
@knz knz requested a review from a team as a code owner April 5, 2023 21:58
@knz knz force-pushed the 20230402-tenant-start branch from 927c1c4 to dc5eb63 Compare April 5, 2023 22:00
@knz knz force-pushed the 20230402-tenant-start branch from dc5eb63 to 07dd47f Compare April 6, 2023 12:45
@knz knz requested a review from a team as a code owner April 6, 2023 12:45
craig bot pushed a commit that referenced this pull request Apr 6, 2023
100579: jobs: prevent a deadlock during upgrades r=yuzefovich a=knz

Needed for #100436.
Might address #100578 (unsure)
Epic: None

Prior to this patch, if multiple SQL instances were started
side-by-side but behind on migrations, they would deadlock on
performing their migrations because distsql on each instance would be
unable to reach other other instances.

More generally, we're finding it undesirable for the jobs subsystem to
operate on system.jobs / job_info using distributed queries.

This patch fixes it by disabling query distribution during job
operations.

The testing here is implicit when the test TestServerControllerMultiNodeTenantStartup is stressed - this used to deadlock under stress without this patch.

100764: roachtest: remove old executables before installing ruby r=rafiss a=rafiss

This should prevent errors during installation.

fixes #95004
fixes #100428
backport fixes #100586

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
@knz knz force-pushed the 20230402-tenant-start branch from 07dd47f to 53f6be2 Compare April 6, 2023 13:51
knz added 7 commits April 6, 2023 16:50
ahead of splitting "new" vs "start" during construction.

Release note: None
This peels the call to "start" from the `newTenantServer`
interface and pulls it into the orchestration retry loop.

This change also incidentally reveals an earlier misdesign: we are
calling `newTenantServer` _then_ `start` in the same retry loop. If
`new` succeeds but `start` fails, the next retry will call
`newTenantServer` again *with the same stopper*, which will leak
closers from the previous call to `new`.

Release note: None
Prior to this patch, if an error occurred during the initialization or
startup of a secondary tenant server, the initialization would leak
state into the stopper defined for that tenant. Generally, reusing
a stopper across server startup failures is not safe (and API
violation).

This patch fixes it by decoupling the intermediate stopper used for
orchestration from the one used per tenant server.

Release note: None
Prior to this patch, the test was not cleaning up its server stopper
reliably at the end of each sub-test. This patch fixes it.

Release note: None
@knz knz force-pushed the 20230402-tenant-start branch from 53f6be2 to 1491929 Compare April 6, 2023 14:50
craig bot pushed a commit that referenced this pull request Apr 6, 2023
99958: jobs,server: graceful shutdown for secondary tenant servers r=stevendanna a=knz

Epic: CRDB-23559
Fixes #92523.

All commits but the last are from #100436.

This change ensures that tenant servers managed by the server
controller receive a graceful drain request as part of the graceful
drain process of the surrounding KV node.

This change, in turn, ensures that SQL clients connected to these
secondary tenant servers benefit from the same guarantees (and
graceful periods) as clients to the system tenant.


100726: upgrades: use TestingBinaryMinSupportedVersion in tests r=rafiss a=rafiss

As described in #100552, it's important for this API to use TestingBinaryMinSupportedVersion in order to correctly bootstrap on the older version.

informs #100552 
Release note: None

100741: contextutil: teach TimeoutError to redact only the operation name r=andreimatei a=andreimatei

Before this patch, the whole message of TimeoutError was redacted in logs. Now, only the operation name is.

Release note: None
Epic: None

100778: norm: update prune cols to match PruneJoinLeftCols/PruneJoinRightCols r=msirek a=msirek

In #90599 adjustments where made to the PruneJoinLeftCols and PruneJoinRightCols
normalization rules to avoid pruning columns which might be needed when
deriving new predicates based on foreign key constraints for lookup join.

However, this caused a problem where rules might sometimes fire in an
infinite loop because the same columns to prune keep getting added as
PruneCols in calls to DerivePruneCols. The logic in prune_cols.opt and
DerivePruneCols must be kept in sync to avoid such problems, and this
PR brings it back in sync.

Epic: none
Fixes: #100478

Release note: None

100821: cmd/roachtest: adjust disk-stalled roachtests TPS calculation r=itsbilal a=jbowens

Previously, the post-stall TPS calculation included the time that the node was stalled but before the stall triggered the node's exit. During this period, overall TPS drops until the gray failure is converted into a hard failure. This commit adjusts the post-stall TPS calculation to exclude the stalled time when TPS is expected to tank.

Epic: None
Informs: #97705.
Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Mark Sirek <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
@craig craig bot merged commit 1491929 into cockroachdb:master Apr 6, 2023
@knz knz deleted the 20230402-tenant-start branch April 6, 2023 18:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants