server/drain: shut down SQL subsystems gracefully before releasing table leases #100476
Conversation
```go
@@ -358,6 +355,12 @@ func (s *drainServer) drainClients(
	return err
}

// Inform the job system that the node is draining.
s.sqlServer.jobRegistry.SetDraining()
```
Do we need this to happen after the above step?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @miretskiy, and @rytaft)
pkg/server/drain.go
line 359 at r4 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Do we need this to happen after the above step?
How so? I figured that if we mark the registry as draining before we wait for SQL clients to disconnect, we create the risk that one of these sessions will issue a BACKUP or some other job-based statement before it disconnects, and encounter a job error as a result (that the registry is now unavailable). Do you need a comment here to explain that?
huh why did you close it?
Just curious. The whole shutdown is a bit of a "dance"... I was working on some other code, and wanted to get as much "notification" as possible when the node starts draining, so moving this removes that time.
I have NO idea how that happened. Apologies. I guess I was typing a reply, and hit tab or something to change focus...
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I was working on some other code, and wanted to get as much "notification" as possible when the node starts draining, so moving this removes that time.
If we find there is a benefit to notifying various subsystems when a drain is starting (as opposed to the hard deadline of it completing), we could also build that. It's out of scope of this PR but I think it'd be an interesting project.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @miretskiy, and @rytaft)
Force-pushed from 1006874 to 920ac9b.
@jeffswenson I have added the extra logging and test knob in a last commit, as we discussed.
LGTM
SQL Queries changes LGTM
Reviewed 3 of 3 files at r2, 3 of 3 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @miretskiy)
-- commits
line 20 at r2:
These changes look good. Do you have any explicit tests that this is working?
pkg/sql/stats/automatic_stats.go
line 814 at r2 (raw file):
```go
	// helps avoiding log spam.
	err = errors.New("server is shutting down")
}
```
If I understand correctly how this works, one of these cases will be chosen at random if all are ready. Should we bias selection of the latter two cases to avoid the log spam?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @rytaft)
Previously, rytaft (Rebecca Taft) wrote…
These changes look good. Do you have any explicit tests that this is working?
Will add those as part of #100436 which will exercise this code path.
pkg/server/drain.go
line 359 at r4 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
How so? I figured that if we mark the registry as draining before we wait for SQL clients to disconnect, we create the risk that one of these sessions will issue a BACKUP or some other job-based statement before it disconnects, and encounter a job error as a result (that the registry is now unavailable). Do you need a comment here to explain that?
Added a comment.
pkg/sql/stats/automatic_stats.go
line 814 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
If I understand correctly how this works, one of these cases will be chosen at random if all are ready. Should we bias selection of the latter two cases to avoid the log spam?
As discussed in #100482 it is actually undesirable to fail to queue the mutation during a server shutdown. So probably we benefit from the extra warning in that case.
Force-pushed from 920ac9b to e04a286.
This change ensures that the job registry shuts down and cancels all currently running jobs before the KV layer gets shut down. This avoids noise in the logs and also creates a natural place in the control flow of a drain where the SQL instance can de-register itself, with a guarantee that the record is not needed any more. Release note: None
This ensures that the auto-stats tasks shut down before the SQL lease manager and other infrastructure have started quiescing. It avoids log spam caused by the auto-stats tasks and provides a guarantee of no further SQL activity caused by auto-stats after that stage of a graceful drain. Release note: None
This commit ensures that the background tasks that manage the persistence of SQL stats are shut down during graceful drain, to reduce log spam at the end of the shutdown sequence and also to provide a sequence point where we are confident there is no more SQL activity. Release note: None
Release note: None
Prior to this patch, it wasn't possible for a test to assert that a graceful drain occurred prior to shutting down a server. This patch adds it. It also clarifies the logging output around the last phases of a graceful drain to ease troubleshooting. Release note:
Force-pushed from e04a286 to 0a0985e.
bors r=JeffSwenson,rytaft
Build succeeded:
The test became flaky after cockroachdb#100476 merged. Fixes cockroachdb#100903. Release note: None
100893: ui: search criteria ux improvements r=maryliag a=maryliag

Some of the names of sort on search criteria were not a match for the column name on the tables, which could cause confusion. This commit updates the values of "P99" to "P99 Latency" and "Service Latency" to "Statement time" and "Transaction time".

Epic: None
Release note (ui change): Update sort label on Search Criteria to match the name on the table columns.

101058: roachtest: bump tpccbench timeout r=srosenberg a=renatolabs

Looking at the test history, we see that tpccbench may sometimes take longer than 5h, especially in multi-region setups. For that reason, we bump the timeout for this test to 7h, which should be sufficient and avoid failures due to timeouts. This commit also removes an unused `MinVersion` field in the `tpccBenchSpec` struct.

Resolves #100975.
Release note: None

101077: changefeedccl: Fix TestChangefeedHandlesDrainingNodes test r=miretskiy a=miretskiy

The test became flaky after #100476 merged.

Fixes #100903.
Release note: None

101097: server: fix a race condition during server initialization r=irfansharif a=knz

Fixes #91414. Fixes #101010. Fixes #100902.

The call to `registerEnginesForDiskStatsMap` needs to wait until the store IDs are known.

Release note: None
Epic: None

Co-authored-by: maryliag <[email protected]>
Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Needed for #99941 and #99958.
Epic: CRDB-23559
See individual commits for details.