jobs,server: graceful shutdown for secondary tenant servers #99958
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Overall, this looks reasonable to me. startControlledServer is a bit of a beast, but that is a fish to fry for another day.
pkg/jobs/registry.go
Outdated
@@ -1066,6 +1071,9 @@ func (r *Registry) Start(ctx context.Context, stopper *stop.Stopper) error {
	})

	if err := stopper.RunAsyncTask(ctx, "jobs/cancel", func(ctx context.Context) {
		r.startedControllerTasksWG.Add(1)
We have examples of both patterns in the database, but I almost always put the Add() before launching the goroutine/async task. You don't have a guarantee that the scheduler has started the goroutine before you end up calling Wait().
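A minimal sketch of the ordering the reviewer describes, using a plain goroutine in place of the stopper's async task (the `startTask`/`runDemo` names are hypothetical, for illustration only):

```go
package main

import (
	"fmt"
	"sync"
)

// startTask launches a worker and records it in wg. The Add(1) happens
// before the goroutine is launched, so a concurrent Wait() cannot return
// before the worker has been accounted for. If Add(1) were the first
// statement inside the goroutine instead, the scheduler might not have
// run it yet when Wait() is called, and Wait() could return early.
func startTask(wg *sync.WaitGroup, done chan<- string) {
	wg.Add(1) // increment first: the goroutine may be scheduled late
	go func() {
		defer wg.Done()
		done <- "worker finished"
	}()
}

func runDemo() string {
	var wg sync.WaitGroup
	done := make(chan string, 1)
	startTask(&wg, done)
	wg.Wait() // guaranteed to see the worker, even if it ran late
	return <-done
}

func main() {
	fmt.Println(runDemo())
}
```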
drainCtx := logtags.WithTags(context.Background(), logtags.FromContext(ctx))

for ; ; prevRemaining = remaining {
[nit] I have a feeling this would become a little clearer to read if remaining became local to the scope of the block, and we set prevRemaining = remaining at the end of the loop, and then used for { here.
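A hedged sketch of the suggested loop shape, with `drainOnce`/`waitForDrain` as hypothetical stand-ins for the real drain reporting: `remaining` is local to each iteration and `prevRemaining` is updated at the bottom of the loop, so the `for` header stays bare.

```go
package main

import "fmt"

// drainOnce simulates one drain pass that makes progress on one item;
// a hypothetical stand-in for a server reporting how much work remains.
func drainOnce(remaining int) int {
	if remaining > 0 {
		return remaining - 1
	}
	return 0
}

// waitForDrain loops until a pass reports nothing remaining, returning
// the number of passes. remaining is scoped to the loop body, and
// prevRemaining is assigned at the end of each iteration, as suggested.
func waitForDrain(start int) int {
	prevRemaining := start
	passes := 0
	for {
		remaining := drainOnce(prevRemaining)
		passes++
		if remaining == 0 {
			return passes
		}
		prevRemaining = remaining
	}
}

func main() {
	fmt.Println(waitForDrain(3)) // three passes to fully drain
}
```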
drainServer := func() interface {
	gracefulDrain(ctx context.Context, verbose bool) (uint64, redact.RedactableString, error)
} {
	shutdownInterface.Lock()
	defer shutdownInterface.Unlock()
	return shutdownInterface.drainableServer
}()
[nit] Perhaps I'm missing something, but this is a lot of readability overhead. I know we like defers for locks, but in this case

	shutdownInterface.Lock()
	drainServer := shutdownInterface.drainableServer
	shutdownInterface.Unlock()

feels pretty attractive.
Overall, I think we would benefit from refactoring this a bit so that we had a set of methods on serverEntry, or some new type, that wrapped this state management up in functions, so it didn't all have to be inline in this function.
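A minimal sketch of the lock-copy-unlock pattern the reviewer prefers; `shutdownState`, `drainable`, and `tenantServer` here are hypothetical stand-ins for the PR's shutdownInterface and server types.

```go
package main

import (
	"fmt"
	"sync"
)

// drainable is a hypothetical stand-in for the interface the drain
// code needs from a server.
type drainable interface {
	name() string
}

type tenantServer struct{}

func (tenantServer) name() string { return "tenant" }

// shutdownState holds the currently drainable server behind a mutex,
// roughly mirroring the PR's shutdownInterface.
type shutdownState struct {
	mu              sync.Mutex
	drainableServer drainable
}

// currentServer reads the pointer with an explicit Lock/copy/Unlock
// instead of a wrapping closure with defer: the critical section is a
// single read, so the shorter form stays obviously correct.
func (s *shutdownState) currentServer() drainable {
	s.mu.Lock()
	srv := s.drainableServer
	s.mu.Unlock()
	return srv
}

func main() {
	st := &shutdownState{drainableServer: tenantServer{}}
	fmt.Println(st.currentServer().name())
}
```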
As a quick follow up, the refactor I mentioned is strictly future looking. Not required for this PR.
Force-pushed from 75a863c to 8326c94.
startControlledServer is a bit of a beast, but that is a fish to fry for another day.
I reduced the quantity of changes to the function. PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson and @stevendanna)
pkg/jobs/registry.go
line 1074 at r2 (raw file):
Previously, stevendanna (Steven Danna) wrote…
We have examples of both patterns in the database, but I almost always put the Add() before launching the goroutine/async task. You don't have a guarantee that the scheduler has started the goroutine before you end up calling Wait().
Done.
pkg/server/server_controller_orchestration.go
line 229 at r2 (raw file):
Previously, stevendanna (Steven Danna) wrote…
As a quick follow up, the refactor I mentioned is strictly future looking. Not required for this PR.
Done.
pkg/server/server_controller_orchestration.go
line 561 at r2 (raw file):
Previously, stevendanna (Steven Danna) wrote…
[nit] I have a feeling this would become a little clearer to read if remaining became local to the scope of the block, and we set prevRemaining = remaining at the end of the loop, and then used for { here.
Done.
Also I made this code shared with pkg/cli/start.go as suggested earlier.
Force-pushed from 8326c94 to 83edc7f.
Left a few nits and one minor race/leak. I'll leave the latter to your judgement.
// by the tenant server any more.
drainCtx, cancel := c.stopper.WithCancelOnQuiesce(tenantCtx)
defer cancel()
shutdownInterface.maybeCallDrain(drainCtx)
I think we probably have other problems of this nature, so perhaps it isn't worth fixing, but I think the current structure will fail to drain the server if the following sequence happens:
- The propagate-close task starts and is running.
- The managed-tenant-server task starts and reaches line 349 but has not run line 350.
- createServerEntryLocked returns the server entry, the entry is added to the servers slice, and the servers slice mutex is unlocked.
- requestAllStopped() is called.
- propagate-close sees the stop request and runs maybeCallDrain(), but we haven't set the server yet, so it reports nothing to drain, even though our server already has drainable components started.
I think this is minor because the server won't be serving any user connections yet (since the orchestrator doesn't think it is started). So it just means any jobs or other internal components might not get drained.
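The window described above can be sketched in miniature: if the drain request races ahead of the registration of the drainable server, the drain call is a silent no-op. The `shutdownState`/`maybeCallDrain` names are hypothetical simplifications of the PR's types.

```go
package main

import "fmt"

// shutdownState mimics the orchestration state: drainableServer stays
// nil until the managed-tenant-server task registers the server.
type shutdownState struct {
	drainableServer func() string // nil until set by the server task
}

// maybeCallDrain drains the registered server, or does nothing if no
// server has been registered yet, which is exactly the race window the
// review describes: the server exists, but drain can't see it.
func (s *shutdownState) maybeCallDrain() string {
	if s.drainableServer == nil {
		return "nothing to drain"
	}
	return s.drainableServer()
}

func main() {
	st := &shutdownState{}
	// Stop is requested before the server task registered itself:
	fmt.Println(st.maybeCallDrain())
	// Once registered, the same call performs the drain:
	st.drainableServer = func() string { return "drained" }
	fmt.Println(st.maybeCallDrain())
}
```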
pkg/sql/stats/automatic_stats.go
Outdated
@@ -364,14 +375,34 @@ func (r *Refresher) getTableDescriptor(
	return desc
}

// WaitForJobShutdown(ctx context.Context) {
Stray comment
pkg/jobs/registry.go
Outdated
// NB: Check the implementation of drain before adding code that would
// make this block.
[nit] We can probably remove this comment now.
pkg/server/drain.go
Outdated
statsProvider.Flush(ctx)
statsProvider.Stop(ctx)

// Inform the async tasks for table stats that the ode is draining
[nit] ode is draining
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @mgartner, and @stevendanna)
pkg/jobs/registry.go
line 1958 at r6 (raw file):
Previously, stevendanna (Steven Danna) wrote…
[nit] We can probably remove this comment now.
Done.
pkg/server/drain.go
line 405 at r6 (raw file):
Previously, stevendanna (Steven Danna) wrote…
[nit] ode is draining
Done.
pkg/server/server_controller_orchestration.go
line 257 at r6 (raw file):
Previously, stevendanna (Steven Danna) wrote…
I think we probably have other problems of this nature, so perhaps it isn't worth fixing, but I think the current structure will fail to drain the server if the following sequence happens:
- The propagate-close task starts and is running.
- The managed-tenant-server task starts and reaches line 349 but has not run line 350.
- createServerEntryLocked returns the server entry, the entry is added to the servers slice, and the servers slice mutex is unlocked.
- requestAllStopped() is called.
- propagate-close sees the stop request and runs maybeCallDrain(), but we haven't set the server yet, so it reports nothing to drain, even though our server already has drainable components started.
I think this is minor because the server won't be serving any user connections yet (since the orchestrator doesn't think it is started). So it just means any jobs or other internal components might not get drained.
I'm impressed and curious how you went about analyzing this condition 💯
The solution here is to split the instantiation of a new server and the starting of the new server into two separate steps (i.e. split (s *Server) startTenantServerInternal). We can then have the shutdown interface ready after instantiation and before start. This incidentally mirrors the dance done in cli/start.go for exactly the same reason.
I believe this is needed anyway to solve #97661 / #98868. I'll mull it over.
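The proposed split can be sketched as follows; `newTenantServer` and `start` here are simplified, hypothetical signatures, not the actual CockroachDB ones. The point is only the ordering: construct, wire up the shutdown/drain hooks, then start.

```go
package main

import (
	"errors"
	"fmt"
)

// server is a hypothetical tenant server whose construction and startup
// are separate steps, mirroring the fix described above.
type server struct {
	started bool
}

// newTenantServer only instantiates. Nothing long-running begins here,
// so a drain hook can be registered before any drainable work exists.
func newTenantServer() (*server, error) {
	return &server{}, nil
}

// start performs the startup work. Because the shutdown interface can
// be wired up between newTenantServer and start, a drain request that
// arrives in that window still finds the server.
func (s *server) start() error {
	if s.started {
		return errors.New("already started")
	}
	s.started = true
	return nil
}

func main() {
	srv, err := newTenantServer()
	if err != nil {
		panic(err)
	}
	// Wire up shutdown/drain hooks here, before start.
	if err := srv.start(); err != nil {
		panic(err)
	}
	fmt.Println("server started:", srv.started)
}
```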
pkg/sql/stats/automatic_stats.go
line 378 at r6 (raw file):
Previously, stevendanna (Steven Danna) wrote…
Stray comment
Done.
Force-pushed from 83edc7f to 8d04053.
Force-pushed from 8d04053 to d0e1081.
Rebased on top of #100436.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @mgartner, and @stevendanna)
pkg/server/server_controller_orchestration.go
line 257 at r6 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
I'm impressed and curious how you went about analyzing this condition 💯
The solution here is to split the instantiation of a new server and the starting of the new server into two separate steps (i.e. split (s *Server) startTenantServerInternal). We can then have the shutdown interface ready after instantiation and before start. This incidentally mirrors the dance done in cli/start.go for exactly the same reason. I believe this is needed anyway to solve #97661 / #98868. I'll mull it over.
Done.
98459: server,autoconfig: automatic configuration via config tasks r=adityamaru a=knz
Epic: CRDB-23559. Informs #98431. All commits but the last are from #98993. This change introduces "auto config tasks", a mechanism through which configuration payloads ("tasks") can be injected into a running SQL service. This is driven via the "auto config runner" job that was introduced in the previous commit. The job listens for the arrival of new task definitions via a `Provider` interface. When new tasks are known, and previous tasks have completed, the runner creates a job for the first next task. Release note: None

100476: server/drain: shut down SQL subsystems gracefully before releasing table leases r=JeffSwenson,rytaft a=knz
Needed for #99941 and #99958. Epic: CRDB-23559. See individual commits for details.

100511: sqlccl: deflake TestGCTenantJobWaitsForProtectedTimestamps r=adityamaru,arulajmani a=knz
Fixes #94808. The tenant server must be shut down before the tenant record is removed; otherwise the tenant's SQL server will self-terminate by calling Stop() on its stopper, which in this case was shared with the parent cluster. Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Force-pushed from d0e1081 to a82e2c7.
100583: server,util: avoid retry forever during startup upon premature shutdown r=aliher1911 a=knz
Needed for #99958. Probably helps with #100578. We need to have the `RunIdempotentWithRetry` calls abort if the surrounding server shuts down prematurely. This happens most frequently in tests that fail for another reason, and also in multitenancy tests with multiple tenant servers side-by-side. Release note: None
Epic: None
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Force-pushed from 6950c76 to 19c93cd.
Force-pushed from 19c93cd to 6009cae.
…enantStartup Release note: None
Release note: None
Release note: None
Release note: None
ahead of splitting "new" vs "start" during construction. Release note: None
Release note: None
This peels the call to "start" from the `newTenantServer` interface and pulls it into the orchestration retry loop. This change also incidentally reveals an earlier misdesign: we are calling `newTenantServer` _then_ `start` in the same retry loop. If `new` succeeds but `start` fails, the next retry will call `newTenantServer` again *with the same stopper*, which will leak closers from the previous call to `new`. Release note: None
Prior to this patch, if an error occurred during the initialization or startup of a secondary tenant server, the initialization would leak state into the stopper defined for that tenant. Generally, reusing a stopper across server startup failures is not safe (an API violation). This patch fixes it by decoupling the intermediate stopper used for orchestration from the one used per tenant server. Release note: None
Prior to this patch, the test was not cleaning up its server stopper reliably at the end of each sub-test. This patch fixes it. Release note: None
This change ensures that tenant servers managed by the server controller receive a graceful drain request as part of the graceful drain process of the surrounding KV node. This change, in turn, ensures that SQL clients connected to these secondary tenant servers benefit from the same guarantees (and graceful periods) as clients to the system tenant. Release note: None
Force-pushed from 6009cae to 4d4c111.
bors r=stevendanna
Build succeeded.
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
error creating merge commit from 8b7093b to blathers/backport-release-23.1-99958: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []; you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 23.1.x failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Epic: CRDB-23559
Fixes #92523.
All commits but the last are from #100436.
This change ensures that tenant servers managed by the server
controller receive a graceful drain request as part of the graceful
drain process of the surrounding KV node.
This change, in turn, ensures that SQL clients connected to these
secondary tenant servers benefit from the same guarantees (and
graceful periods) as clients to the system tenant.