server/drain: shut down SQL subsystems gracefully before releasing table leases #100476

knz · 2023-04-03T14:22:12Z

Needed for #99941 and #99958.
Epic: CRDB-23559

See individual commits for details.

cockroach-teamcity · 2023-04-03T14:22:28Z

This change is

miretskiy · 2023-04-03T14:23:45Z

pkg/server/drain.go

@@ -358,6 +355,12 @@ func (s *drainServer) drainClients(
 		return err
 	}

+	// Inform the job system that the node is draining.
+	s.sqlServer.jobRegistry.SetDraining()


Do we need this to happen after the above step?

knz

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @miretskiy, and @rytaft)

pkg/server/drain.go line 359 at r4 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Do we need this to happen after the above step?

How so? I figured that if we mark the registry as draining before we wait for SQL clients to disconnect, we create the risk that one of these sessions will issue a BACKUP or some other job-based statement before it disconnects, and encounter a job error as a result (that the registry is now unavailable). Do you need a comment here to explain that?

miretskiy · 2023-04-03T14:33:08Z

knz · 2023-04-03T14:34:05Z

huh why did you close it?

miretskiy · 2023-04-03T14:35:50Z

How so? I figured that if we mark the registry as draining before we wait for SQL clients to disconnect, we create the risk that one of these sessions will issue a BACKUP or some other job-based statement before it disconnects, and encounter a job error as a result (that the registry is now unavailable). Do you need a comment here to explain that?

Just curious. The whole shutdown is a bit of a "dance"... I was working on some other code, and wanted to get as much
"notification" as possible when node starts draining, so moving this removes that time.
On the other hand, this dance is pretty brittle, and sort of doesn't provide any guarantees anyway (guarantees in a sense
that the caller may want to do some cleanup work before connections/sessions are cancelled.

huh why did you close it?

I have NO idea how that happen. Apologies. I guess I was typing reply, and hit tab or something to change focus...

knz

I was working on some other code, and wanted to get as much
"notification" as possible when node starts draining, so moving this removes that time.

If we find there is a benefit to notifying various subsystems when a drain is starting (as opposed to the hard deadline of it completing), we could also build that. It's out of scope of this PR but I think it'd be an interesting project.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @miretskiy, and @rytaft)

knz · 2023-04-03T14:57:27Z

@jeffswenson i have added the extra logging and test knob in a last commit as we discussed.

jeffswenson

LGTM

rytaft

SQL Queries changes LGTM

Reviewed 3 of 3 files at r2, 3 of 3 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @miretskiy)

-- commits line 20 at r2:
These changes look good. Do you have any explicit tests that this is working?

pkg/sql/stats/automatic_stats.go line 814 at r2 (raw file):

				// helps avoiding log spam.
				err = errors.New("server is shutting down")
			}

If I understand correctly how this works, one of these cases will be chosen at random if all are ready. Should we bias selection of the latter two cases to avoid the log spam?

knz

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @rytaft)

-- commits line 20 at r2:

Previously, rytaft (Rebecca Taft) wrote…

These changes look good. Do you have any explicit tests that this is working?

Will add those as part of #100436 which will exercise this code path.

pkg/server/drain.go line 359 at r4 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

How so? I figured that if we mark the registry as draining before we wait for SQL clients to disconnect, we create the risk that one of these sessions will issue a BACKUP or some other job-based statement before it disconnects, and encounter a job error as a result (that the registry is now unavailable). Do you need a comment here to explain that?

Added a comment.

pkg/sql/stats/automatic_stats.go line 814 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

If I understand correctly how this works, one of these cases will be chosen at random if all are ready. Should we bias selection of the latter two cases to avoid the log spam?

As discussed in #100482 it is actually undesirable to fail to queue the mutation during a server shutdown. So probably we benefit from the extra warning in that case.

This change ensures that the job registry shuts down and cancels all currently running job before the KV layer gets shut down. This avoids noise in the logs and also creates a natural place in the control flow of a drain where the SQL instance can de-register itself with a guarantee that the record is not needed any more. Release note: None

This ensures that the auto-stats task shut down before the SQL lease manager and other infrastructure has started quiescing. It avoids log spam caused by the auto-stats tasks and provide a guarantee of no further SQL activity caused by auto-stats after that stage of a graceful drain. Release note: None

This commit ensures that the background tasks that manage the persistence of SQL stats are shut down during graceful drain, to reduce log spam at the end of the shutdown sequence and also to provide a sequence point where we are confident there is no more SQL activity. Release note: None

Release note: None

Prior to this patch, it wasn't possible for a test to assert that a graceful drain occurred prior to shutting down a server. This patch adds it. It also clarifies the logging output around the last phases of a graceful drain to ease troubleshooting. Release note:

knz · 2023-04-03T19:40:33Z

bors r=JeffSwenson,rytaft

craig · 2023-04-03T20:56:20Z

Build succeeded:

Bazel Essential CI (Cockroach)

The test becamse flaky after cockroachdb#100476 merged Fixes cockroachdb#100903 Release note: None

100893: ui: search criteria ux improvements r=maryliag a=maryliag Some of the names of sort on search criteria were not a match for the column name on the tables, which could cause confusion. This commit updates the values of "P99" to "P99 Latency" and "Service Latency" to "Statement time" and "Transaction time". Epic: None Release note (ui change): Update sort label on Search Criteria to match the name on the table columns. 101058: roachtest: bump tpccbench timeout r=srosenberg a=renatolabs Looking at the test history, we see that tpccbench may sometimes take longer than 5h, especially in multi-region setups. For that reason, we bump the timeout for this test to 7h, which should be sufficient and avoid failures due to timeouts. This commit also removes an unused `MinVersion` field in the `tpccBenchSpec` struct. Resolves #100975. Release note: None 101077: changefeedccl: Fix TestChangefeedHandlesDrainingNodes test r=miretskiy a=miretskiy The test becamse flaky after #100476 merged Fixes #100903 Release note: None 101097: server: fix a race condition during server initialization r=irfansharif a=knz Fixes #91414. Fixes #101010. Fixes #100902. The call to `registerEnginesForDiskStatsMap` needs to wait until the store IDs are known. Release note: None Epic: None Co-authored-by: maryliag <[email protected]> Co-authored-by: Renato Costa <[email protected]> Co-authored-by: Yevgeniy Miretskiy <[email protected]> Co-authored-by: Raphael 'kena' Poss <[email protected]>

The test becamse flaky after #100476 merged Fixes #100903 Release note: None

knz requested a review from jeffswenson April 3, 2023 14:22

knz requested review from a team as code owners April 3, 2023 14:22

knz requested a review from a team April 3, 2023 14:22

knz requested a review from a team as a code owner April 3, 2023 14:22

knz requested a review from rytaft April 3, 2023 14:22

miretskiy reviewed Apr 3, 2023

View reviewed changes

knz commented Apr 3, 2023

View reviewed changes

miretskiy closed this Apr 3, 2023

knz reopened this Apr 3, 2023

knz requested a review from miretskiy April 3, 2023 14:34

knz commented Apr 3, 2023

View reviewed changes

knz requested a review from a team as a code owner April 3, 2023 14:55

knz force-pushed the 20230403-shutdown-sql-first branch from 1006874 to 920ac9b Compare April 3, 2023 14:57

jeffswenson approved these changes Apr 3, 2023

View reviewed changes

rytaft approved these changes Apr 3, 2023

View reviewed changes

knz commented Apr 3, 2023

View reviewed changes

knz force-pushed the 20230403-shutdown-sql-first branch from 920ac9b to e04a286 Compare April 3, 2023 16:16

knz added 5 commits April 3, 2023 21:02

sqlliveness: properly annotate logging context for hearbeat loop

cae8c3b

Release note: None

knz force-pushed the 20230403-shutdown-sql-first branch from e04a286 to 0a0985e Compare April 3, 2023 19:02

craig bot merged commit 04a92b6 into cockroachdb:master Apr 3, 2023

jeffswenson mentioned this pull request Apr 5, 2023

release-23.1: backport sql session and sql instance clean up #100720

Merged

knz mentioned this pull request Apr 10, 2023

ccl/changefeedccl: TestChangefeedHandlesDrainingNodes failed #100903

Closed

miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Apr 10, 2023

changefeedccl: Fix TestChangefeedHandlesDrainingNodes test

e0b9e1c

The test becamse flaky after cockroachdb#100476 merged Fixes cockroachdb#100903 Release note: None

miretskiy mentioned this pull request Apr 10, 2023

changefeedccl: Fix TestChangefeedHandlesDrainingNodes test #101077

Merged

blathers-crl bot mentioned this pull request Apr 10, 2023

release-23.1: changefeedccl: Fix TestChangefeedHandlesDrainingNodes test #101124

Merged

blathers-crl bot pushed a commit that referenced this pull request Apr 10, 2023

changefeedccl: Fix TestChangefeedHandlesDrainingNodes test

f6df923

The test becamse flaky after #100476 merged Fixes #100903 Release note: None

erikgrinaker mentioned this pull request Apr 28, 2023

sqlstats: PersistedSQLStats.Stop() blocking server drain #102574

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server/drain: shut down SQL subsystems gracefully before releasing table leases #100476

server/drain: shut down SQL subsystems gracefully before releasing table leases #100476

knz commented Apr 3, 2023

cockroach-teamcity commented Apr 3, 2023

miretskiy Apr 3, 2023 •

edited

Loading

knz left a comment

miretskiy commented Apr 3, 2023

knz commented Apr 3, 2023

miretskiy commented Apr 3, 2023

knz left a comment

knz commented Apr 3, 2023

jeffswenson left a comment

rytaft left a comment

knz left a comment

knz commented Apr 3, 2023

craig bot commented Apr 3, 2023

server/drain: shut down SQL subsystems gracefully before releasing table leases #100476

server/drain: shut down SQL subsystems gracefully before releasing table leases #100476

Conversation

knz commented Apr 3, 2023

cockroach-teamcity commented Apr 3, 2023

miretskiy Apr 3, 2023 • edited Loading

Choose a reason for hiding this comment

knz left a comment

Choose a reason for hiding this comment

miretskiy commented Apr 3, 2023

knz commented Apr 3, 2023

miretskiy commented Apr 3, 2023

knz left a comment

Choose a reason for hiding this comment

knz commented Apr 3, 2023

jeffswenson left a comment

Choose a reason for hiding this comment

rytaft left a comment

Choose a reason for hiding this comment

knz left a comment

Choose a reason for hiding this comment

knz commented Apr 3, 2023

craig bot commented Apr 3, 2023

miretskiy Apr 3, 2023 •

edited

Loading