Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server/drain: shut down SQL subsystems gracefully before releasing table leases #100476

Merged
merged 5 commits into from
Apr 3, 2023

Conversation

knz
Copy link
Contributor

@knz knz commented Apr 3, 2023

Needed for #99941 and #99958.
Epic: CRDB-23559

See individual commits for details.

@knz knz requested a review from jeffswenson April 3, 2023 14:22
@knz knz requested review from a team as code owners April 3, 2023 14:22
@knz knz requested a review from a team April 3, 2023 14:22
@knz knz requested a review from a team as a code owner April 3, 2023 14:22
@knz knz requested a review from rytaft April 3, 2023 14:22
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@@ -358,6 +355,12 @@ func (s *drainServer) drainClients(
return err
}

// Inform the job system that the node is draining.
s.sqlServer.jobRegistry.SetDraining()
Copy link
Contributor

@miretskiy miretskiy Apr 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this to happen after the above step?

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @miretskiy, and @rytaft)


pkg/server/drain.go line 359 at r4 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Do we need this to happen after the above step?

How so? I figured that if we mark the registry as draining before we wait for SQL clients to disconnect, we create the risk that one of these sessions will issue a BACKUP or some other job-based statement before it disconnects, and encounter a job error as a result (that the registry is now unavailable). Do you need a comment here to explain that?

@miretskiy
Copy link
Contributor

@miretskiy miretskiy closed this Apr 3, 2023
@knz
Copy link
Contributor Author

knz commented Apr 3, 2023

huh why did you close it?

@knz knz reopened this Apr 3, 2023
@knz knz requested a review from miretskiy April 3, 2023 14:34
@miretskiy
Copy link
Contributor

How so? I figured that if we mark the registry as draining before we wait for SQL clients to disconnect, we create the risk that one of these sessions will issue a BACKUP or some other job-based statement before it disconnects, and encounter a job error as a result (that the registry is now unavailable). Do you need a comment here to explain that?

Just curious. The whole shutdown is a bit of a "dance"... I was working on some other code, and wanted to get as much
"notification" as possible when node starts draining, so moving this removes that time.
On the other hand, this dance is pretty brittle, and sort of doesn't provide any guarantees anyway (guarantees in a sense
that the caller may want to do some cleanup work before connections/sessions are cancelled.

huh why did you close it?

I have NO idea how that happen. Apologies. I guess I was typing reply, and hit tab or something to change focus...

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was working on some other code, and wanted to get as much
"notification" as possible when node starts draining, so moving this removes that time.

If we find there is a benefit to notifying various subsystems when a drain is starting (as opposed to the hard deadline of it completing), we could also build that. It's out of scope of this PR but I think it'd be an interesting project.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @miretskiy, and @rytaft)

@knz knz requested a review from a team as a code owner April 3, 2023 14:55
@knz knz force-pushed the 20230403-shutdown-sql-first branch from 1006874 to 920ac9b Compare April 3, 2023 14:57
@knz
Copy link
Contributor Author

knz commented Apr 3, 2023

@jeffswenson i have added the extra logging and test knob in a last commit as we discussed.

Copy link
Collaborator

@jeffswenson jeffswenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQL Queries changes LGTM

Reviewed 3 of 3 files at r2, 3 of 3 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @miretskiy)


-- commits line 20 at r2:
These changes look good. Do you have any explicit tests that this is working?


pkg/sql/stats/automatic_stats.go line 814 at r2 (raw file):

				// helps avoiding log spam.
				err = errors.New("server is shutting down")
			}

If I understand correctly how this works, one of these cases will be chosen at random if all are ready. Should we bias selection of the latter two cases to avoid the log spam?

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @rytaft)


-- commits line 20 at r2:

Previously, rytaft (Rebecca Taft) wrote…

These changes look good. Do you have any explicit tests that this is working?

Will add those as part of #100436 which will exercise this code path.


pkg/server/drain.go line 359 at r4 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

How so? I figured that if we mark the registry as draining before we wait for SQL clients to disconnect, we create the risk that one of these sessions will issue a BACKUP or some other job-based statement before it disconnects, and encounter a job error as a result (that the registry is now unavailable). Do you need a comment here to explain that?

Added a comment.


pkg/sql/stats/automatic_stats.go line 814 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

If I understand correctly how this works, one of these cases will be chosen at random if all are ready. Should we bias selection of the latter two cases to avoid the log spam?

As discussed in #100482 it is actually undesirable to fail to queue the mutation during a server shutdown. So probably we benefit from the extra warning in that case.

@knz knz force-pushed the 20230403-shutdown-sql-first branch from 920ac9b to e04a286 Compare April 3, 2023 16:16
knz added 5 commits April 3, 2023 21:02
This change ensures that the job registry shuts down and cancels all
currently running job before the KV layer gets shut down.

This avoids noise in the logs and also creates a natural place in the
control flow of a drain where the SQL instance can de-register itself
with a guarantee that the record is not needed any more.

Release note: None
This ensures that the auto-stats task shut down before the SQL lease
manager and other infrastructure has started quiescing. It avoids log
spam caused by the auto-stats tasks and provide a guarantee of no
further SQL activity caused by auto-stats after that stage of a
graceful drain.

Release note: None
This commit ensures that the background tasks that manage the
persistence of SQL stats are shut down during graceful drain, to
reduce log spam at the end of the shutdown sequence and also to
provide a sequence point where we are confident there is no more SQL
activity.

Release note: None
Prior to this patch, it wasn't possible for a test to assert that a
graceful drain occurred prior to shutting down a server. This patch
adds it.

It also clarifies the logging output around the last phases of a
graceful drain to ease troubleshooting.

Release note:
@knz knz force-pushed the 20230403-shutdown-sql-first branch from e04a286 to 0a0985e Compare April 3, 2023 19:02
@knz
Copy link
Contributor Author

knz commented Apr 3, 2023

bors r=JeffSwenson,rytaft

@craig
Copy link
Contributor

craig bot commented Apr 3, 2023

Build succeeded:

@craig craig bot merged commit 04a92b6 into cockroachdb:master Apr 3, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Apr 10, 2023
The test becamse flaky after cockroachdb#100476 merged
Fixes cockroachdb#100903

Release note: None
craig bot pushed a commit that referenced this pull request Apr 10, 2023
100893: ui: search criteria ux improvements r=maryliag a=maryliag

Some of the names of sort on search criteria were
not a match for the column name on the tables, which could cause confusion. This commit updates the values of "P99" to "P99 Latency" and "Service Latency" to "Statement time" and "Transaction time".

Epic: None

Release note (ui change): Update sort label
on Search Criteria to match the name on the table columns.

101058: roachtest: bump tpccbench timeout r=srosenberg a=renatolabs

Looking at the test history, we see that tpccbench may sometimes take longer than 5h, especially in multi-region setups. For that reason, we bump the timeout for this test to 7h, which should be sufficient and avoid failures due to timeouts.

This commit also removes an unused `MinVersion` field in the `tpccBenchSpec` struct.

Resolves #100975.

Release note: None

101077: changefeedccl: Fix TestChangefeedHandlesDrainingNodes test r=miretskiy a=miretskiy

The test becamse flaky after #100476 merged
Fixes #100903

Release note: None

101097: server: fix a race condition during server initialization r=irfansharif a=knz

Fixes #91414.
Fixes  #101010.
Fixes #100902.

The call to `registerEnginesForDiskStatsMap` needs to wait until the store IDs are known.

Release note: None
Epic: None

Co-authored-by: maryliag <[email protected]>
Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Apr 10, 2023
The test becamse flaky after #100476 merged
Fixes #100903

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants