cli,server: make quit/drain more friendly to clusters with many ranges #45149
Conversation
cc @ajwerner if you could help with the first pass of reviews, that would be swell.
Reviewed 2 of 2 files at r1, 6 of 6 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)
pkg/cli/start.go, line 1416 at r2 (raw file):
```go
// via --drain-wait.
var drainTimeoutCh <-chan time.Time
if quitCtx.drainWait == 0 {
```

Should this be `> 0`?
pkg/cli/start.go, line 1419 at r2 (raw file):
```go
	drainTimeoutCh = time.After(quitCtx.drainWait)
} else {
	drainTimeoutCh = make(chan time.Time)
```

nit: a nil channel blocks forever on recv, no need to allocate this.
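For readers following along: the nit leans on a Go property worth making explicit. Below is a minimal, self-contained sketch (not the PR's actual code) showing that a receive from a nil channel blocks forever, so the timeout case of the `select` simply never fires when no `--drain-wait` is set. It also uses the corrected `> 0` condition from the comment above.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Pretend --drain-wait was left unset; with the corrected condition
	// (> 0), no timer is created and drainTimeoutCh stays nil.
	drainWait := 0 * time.Second

	var drainTimeoutCh <-chan time.Time // nil channel: recv blocks forever
	if drainWait > 0 {
		drainTimeoutCh = time.After(drainWait)
	}

	// Simulate the drain finishing.
	done := make(chan struct{})
	go close(done)

	select {
	case <-drainTimeoutCh:
		// Never taken while drainTimeoutCh is nil.
		fmt.Println("drain timed out")
	case <-done:
		fmt.Println("drain completed")
	}
}
```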
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default remains at 1 minute.
Is this a good default? I think we want to have a large default, if not outright unlimited, the thinking being that some kind of supervisor should be in charge of enforcing a timeout. But maybe we should discuss and make that change separately.
Code looks good modulo existing comments; it just needs tests.
Reviewed 2 of 2 files at r1, 6 of 6 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/cli/start.go, line 1275 at r2 (raw file):
connections, and finally pushes range leases onto other nodes, subject to various timeout parameters configurable via cluster settings. After the first stage completes,
I'm on the fence about `--only-drain`. How is a node that has drained portrayed in the admin UI? I'm worried that it will look mostly live, but it is not accepting any leases, and so operators using `--only-drain` might get themselves into some amount of trouble if they forget to then actually terminate the node.
pkg/cli/start.go, line 1396 at r2 (raw file):
```go
for {
	if _, err = stream.Recv(); err != nil {
		if grpcutil.IsClosedConnection(err) {
```

Isn't it a problem if the server crashes while undraining? This looks like it's OK with the server shutting down, but we're not shutting down; we're undraining.
pkg/cli/start.go, line 1406 at r2 (raw file):
```go
// If --decommission was passed, perform the decommission as first
// step.
if quitCtx.serverDecommission {
```

Reminds me that we (should) want to deprecate `--decommission`. But it's probably not worth it.
I just rebased this and addressed the most egregious correctness problems. I'm now going to iterate on this once more about the default delay.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @knz, and @tbg)
pkg/cli/start.go, line 1396 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…

Isn't it a problem if the server crashes while undraining? This looks like it's OK with the server shutting down, but we're not shutting down; we're undraining.

Oh yes, you're right. We're not expecting a conn close here.
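To make the distinction concrete for readers outside the review, here is a hedged sketch (the helper name is hypothetical, not the PR's actual code): a closed connection may only be treated as success when the command expects the server to go away, as `quit` does; while undraining, the server must stay up, so the same error has to be surfaced.

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/cockroach/pkg/util/grpcutil"
)

// handleStreamError is a hypothetical helper illustrating the fix.
// expectServerShutdown is true for `quit`, where the server going away
// is the success condition, and false while undraining, where it is not.
func handleStreamError(err error, expectServerShutdown bool) error {
	if grpcutil.IsClosedConnection(err) {
		if expectServerShutdown {
			return nil
		}
		return fmt.Errorf("server closed the connection while undraining: %w", err)
	}
	return err
}

func main() {
	// Non-connection errors always pass through unchanged.
	fmt.Println(handleStreamError(fmt.Errorf("boom"), false))
}
```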
pkg/cli/start.go, line 1406 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Reminds me that we (should) want to deprecate `--decommission`. But it's probably not worth it.
It's deprecated now.
pkg/cli/start.go, line 1416 at r2 (raw file):
Previously, ajwerner wrote…
Should this be `> 0`?
oops. yes.
pkg/cli/start.go, line 1419 at r2 (raw file):
Previously, ajwerner wrote…
nit: a nil channel blocks forever on recv, no need to allocate this
Thanks.
Sorry I let this slip under my radar for so long. Code LGTM. I sympathize with Tobi's concern that our UX around observing the draining state is lacking.
Seems that the acceptance tests need some updating.
Reviewed 2 of 9 files at r3, 5 of 6 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @tbg)
pkg/cli/start.go, line 1275 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I'm on the fence about `--only-drain`. How is a node that has drained portrayed in the admin UI? I'm worried that it will look mostly live, but it is not accepting any leases, and so operators using `--only-drain` might get themselves into some amount of trouble if they forget to then actually terminate the node.
This seems like a reasonable concern. We could add a new node state but it feels like a pretty heavy lift. All that being said, it feels like a draining state even without `only-drain` would be valuable, as draining may take a long time. I'd be okay with a node in `only-drain` waiting to be quit explicitly appearing the same in the admin UI as a node which is in the process of draining.
pkg/cli/start.go, line 1500 at r4 (raw file):
```go
}
if !quitCtx.onlyDrain {
```

Should we `log.Info` that we've drained the node but are not exiting, in an `else` block?
The default remains at 1 minute.
Is this a good default? I think we want to have a large default [...] maybe we should discuss and make that change separately.
I wanted this PR to be back-portable to 19.x for our customers, where we can't make such a backward-incompatible change. But I can see this default changing for 20.x. Will send a separate PR.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @tbg)
pkg/cli/start.go, line 1275 at r2 (raw file):
Previously, ajwerner wrote…

This seems like a reasonable concern. We could add a new node state but it feels like a pretty heavy lift. All that being said, it feels like a draining state even without `only-drain` would be valuable, as draining may take a long time. I'd be okay with a node in `only-drain` waiting to be quit explicitly appearing the same in the admin UI as a node which is in the process of draining.
The reason for `--only-drain` is that we need a mode for preparing a node to be shut down by a process/service manager. If we did not have that, then `cockroach quit` would drain the node then terminate the process, and the service manager would then auto-restart it and make the node live again. That's counterproductive. My vision is that we're introducing `--only-drain` in v20.1, but we'll make it the default in 20.2 or later (and make the shutdown bit opt-in instead).

Adding this explanation to the release note.
I motivated `--undrain` as a way to cancel a mistaken `cockroach quit` and bring a node back up. However, further testing reveals this was a bad idea: at that point, there are several background tasks (SQL stats collection, event log reader, etc.) which were stalled, and letting them resume creates a burst of activity and a catch-up that probably violates the periodicity assumptions of these services. Until we can make `--undrain` cancel all this backed-up work, I don't think we should bring a node that is shutting down back from the brink of extinction. So I'm removing that code. (But `--only-drain` remains, for the reason above.)
pkg/cli/start.go, line 1500 at r4 (raw file):
Previously, ajwerner wrote…

Should we `log.Info` that we've drained the node but are not exiting, in an `else` block?
This is the client code; by default the info logs are not displayed. But I see what you mean.

Added logging here and also server-side.
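For readers skimming the thread, a hedged sketch of the suggestion (hypothetical names; stdlib `log` stands in for CockroachDB's own context-aware log package):

```go
package main

import "log"

// finishQuit is a hypothetical sketch of the suggestion above, not the
// PR's actual code: report explicitly when a drain completes without a
// process shutdown, so operators are not left guessing.
func finishQuit(onlyDrain bool) {
	if !onlyDrain {
		log.Println("drain complete; proceeding with process shutdown")
		// ... terminate the server process ...
	} else {
		log.Println("drain complete; leaving the server process running")
	}
}

func main() {
	finishQuit(true) // node drain: drained, but the process stays up
}
```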
Ok so the functionality here looks good but I have zero idea how to unit test this properly. @nvanbenschoten had suggested that I look at some existing tests that check that leaseholders get pushed away upon server shutdown. However, I've been searching a bit and I cannot find such tests. In fact, I cannot find any existing test for the […]
I could create an ad-hoc test for […]
also FWIW:
No they don't. The changes here are behavior-neutral in the default configuration. But I had made a mistake in the code and CI caught it :) - it is now corrected.
I added a commit at the end with a new roachtest ([…])
I have rebased and squashed all the fixups. Planning to merge this after TeamCity thinks it's good.
Yay, it's green.

bors r=tbg,ajwerner
Much applause, Raphael. Thanks for pushing this over the finish line.
Build succeeded
The plan is to backport as soon as the final release SHA is selected. I have removed the release-blocker tag on the 3 issues; do we need to do more?
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
If we're planning on backporting shortly, then no, this is good enough. Thanks for the update.
Fixes #43901.
Fixes #43015.
Fixes #46388.
Release justification: fixes for high-priority or high-severity bugs in existing functionality
Table of contents:

- Overview of this PR
- Shutdown sequence
- What it means for "large" deployments
- Follow-up work
- List of release notes

Overview of this PR
- The first 4 commits remove arbitrary limits that prevented graceful shutdown on clusters with many ranges, and make the remaining limits configurable for `cockroach quit` (fixes "server,cli: add a `--timeout` flag to `cockroach quit` and the Drain RPC" #43901). Additionally, the second commit adds:
  - `cockroach node drain`, which creates new operational flexibility: drain the server but keep the process alive. From now on, we want to recommend in docs to only use `node drain` and then use a process manager / orchestration to stop the server.
  - `--drain-wait`, valid for both `quit` and `node drain`, which makes the timeout configurable.

  (5 release notes, see below)
- The two commits after that are code movements to make the subsequent refactoring easier. (no release note)
- The commit after that fixes "server: simplify/rationalize the Drain API" #46388 by removing unneeded+untested flexibility in the drain RPC. (no release note)
- The commit after that makes `cockroach node drain` report the progress of the drain on the terminal, so that an operator can ascertain when it's safe to stop the process. It also makes `cockroach quit` print out a warning if it detects that there is still some state on the server that wasn't drained successfully prior to a shutdown. (1 release note, see below)
- The commit after that adds a test that verifies that the first two commits from step (1), combined with the facility in (4), make "large" clusters drain gracefully and in predictable ways.
(The reviewers are invited to examine the individual commit messages for more details, rationales and examples.)
Shutdown sequence
Summary of the shutdown sequence upon `cockroach quit` (this needs to be documented):

1. `server.shutdown.drain_wait` (def 0s)
2. `server.shutdown.query_wait` timeout (def 10s)
3. `server.shutdown.query_wait` (def 10s). Note: this is the same cluster setting, but a different/successive wait.
4. `server.shutdown.lease_transfer_wait` (def 5s). (Previously: for a maximum of 10000 attempts or a hard limit of 5 seconds. The number of attempts is no longer limited and the timeout has become configurable.)
With `cockroach node drain`, the node process remains running so it can be shut down cleanly via orchestration or a service manager.

SEPARATELY, the commands `cockroach quit` and `cockroach node drain` both wait for a maximum duration of `--drain-wait` (default 1 minute) for the drain steps above to complete.
(default 1 minute) for the steps 1-5 to complete.What it means for "large" deployments
A "large" deployment with many clients and/or many ranges should increase multiple parameters:
server.shutdown.query_wait
,server.shutdown.lease_transfer_wait
and also runcockroach node drain
(of possiblyquit
) with a larger--drain-wait
timeout value.Follow-up work
I have another PR in the works to change the default duration (#46396).
However, I wanted a change that would also work for 19.1/19.2 clusters, because we will want users/customers using these versions to be willing to consider an upgrade to 20.1, and they need a way to gracefully roll over their nodes without the hard timeouts. So this PR provides that, with fewer UX goodies but the option to backport.
A follow-up action to this code PR is a docs PR that documents how the various timeout parameters stack on top of each other. I have reverse-engineered the code into a timeline (see individual commit messages) and this could be rendered in a more beautiful diagram.
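To illustrate the stacking concretely in the meantime, here is a small Go sketch (illustrative only, not CockroachDB code; the names and durations are invented for the example): the client enforces one overall `--drain-wait` deadline while the server works through its successive per-phase budgets (the `server.shutdown.*` settings), so the client-side deadline can cut a late phase short.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runPhase stands in for one server-side drain phase bounded by a cluster
// setting such as server.shutdown.query_wait. The phase stops early if the
// overall client-side deadline (--drain-wait) fires first.
func runPhase(ctx context.Context, name string, budget time.Duration) error {
	select {
	case <-time.After(budget): // pretend the phase used its full budget
		fmt.Println("phase done:", name)
		return nil
	case <-ctx.Done():
		return fmt.Errorf("phase %s interrupted: %w", name, ctx.Err())
	}
}

func main() {
	// Client-side: one overall deadline, as with --drain-wait.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	// Server-side: successive per-phase budgets (illustrative durations).
	phases := []struct {
		name   string
		budget time.Duration
	}{
		{"drain_wait", 0},
		{"query_wait (clients)", 20 * time.Millisecond},
		{"query_wait (queries)", 20 * time.Millisecond},
		{"lease_transfer_wait", 30 * time.Millisecond},
	}
	for _, p := range phases {
		if err := runPhase(ctx, p.name, p.budget); err != nil {
			fmt.Println(err)
			return
		}
	}
	fmt.Println("drain completed within --drain-wait")
}
```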
List of release notes
Release note (general change): Prior to this patch, the phase of server shutdown responsible for range lease transfers to other nodes had a hard timeout of 5 seconds. This patch makes this timeout configurable via the new cluster setting `server.shutdown.lease_transfer_wait`.

Release note (backward-incompatible change): Prior to this patch, the phase of server shutdown responsible for range lease transfers to other nodes would give up after 10000 attempts of transferring replica leases away, regardless of the value of `server.shutdown.lease_transfer_wait`. The limit of 10000 attempts has been removed, so that now only the maximum duration `server.shutdown.lease_transfer_wait` applies.

Release note (backward-incompatible change): The textual error and warning messages displayed by `cockroach quit` under various circumstances have been updated. Meanwhile, the message "ok" remains as the indicator that the operation has likely succeeded.

Release note (cli change): The time that `cockroach quit` waits client-side for the node to drain (remove existing clients and push range leases away) is now configurable via the command-line flag `--drain-wait`. Note that separate server-side timeouts also apply; check the `server.shutdown.*` cluster settings for details.

Release note (cli change): It is now possible to drain a node without shutting down the process, using `cockroach node drain`. This makes it easier to integrate with service managers and orchestration: it now becomes safe to issue `cockroach node drain` and then separately stop the service via a process manager or orchestrator. Without this new mode, there is a risk of misconfiguring the service manager to auto-restart the node after it shuts down via `quit`, in a way that's surprising or unwanted. The new command `node drain` also recognizes the new `--drain-wait` flag.

Release note (cli change): The commands `cockroach quit` and `cockroach node drain` now report a "work remaining" metric on their standard error stream. The value reduces until it reaches 0 to indicate that the graceful shutdown has completed server-side. An operator can now rely on `cockroach node drain` to obtain confidence of a graceful shutdown prior to terminating the server process.

Release note (backward-incompatible change): `cockroach quit` now prints out progress details on its standard error stream, even when `--logtostderr` is not specified. Previously, nothing was printed on standard error. Scripts that wish to ignore this output can redirect the stderr stream.