-
Notifications
You must be signed in to change notification settings - Fork 468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update rolling upgrade docs with node drain, etc. #7542
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The recommendation written here is "have systemd send two signals".
@taroface The current word on the street (and what I told SEs/TAMs) is to first issue node drain
, then wait for that to complete, and only then send SIGTERM, wait 1 minute, then SIGKILL if necessary.
However, I wonder if this can be automated. @bdarnell Do you know if we can have systemd use node drain
as first graceful command instead of sending a signal, or before?
(NB I appreciate your concerns about the authentication aspects of node drain
, but let us set this discussion aside for this doc PR)
@knz Thanks for these clarifications. When is it necessary to send How are these signals handled by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my view cockroach node drain
is similar to the SAVEPOINT cockroach_restart
protocol - it has some benefits so we should consider using it in our own tooling (i.e. the k8s operator) but it's so much of a pain to automate that we should not expect most customers to use/script it themselves. I think systemd
users should only use signals here.
However, I wonder if this can be automated. @bdarnell Do you know if we can have systemd use node drain as first graceful command instead of sending a signal, or before?
Yes, this is possible (see the ExecStop
option). It just requires you to keep a root cert cert and key around. But I don't know what kind of user experience this would have if the node fails to drain and the command just hangs (does it proceed to the SIGKILL after one minute? If so what good is it to use node drain
instead of SIGTERM
? If not, is it just a black box or can the user see what's going on?)
When is it necessary to send SIGKILL? Does this also mean that SIGTERM is "normally" the only required signal?
Normally SIGTERM
is all that is required; the SIGKILL
is only a fallback if something is going wrong and the node isn't shutting down as expected. In cases where SIGTERM isn't enough, cockroach node drain
is likely to hang too.
How are these signals handled by systemctl stop (this is the currently documented command on the upgrade doc)?
systemd
will have the desired behavior automatically if you set TimeoutStopSecs=60
(which is set in the config templates we publish).
Reviewable status: complete! 0 of 0 LGTMs obtained
@bdarnell Thank you for clarifying! Are the updated instructions accurate? |
@bdarnell, ping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The include file now looks good, but the "just use signals" message in that file is inconsistent with the instructions to use node drain
in the other two.
Reviewed 2 of 6 files at r2, 2 of 4 files at r3, 2 of 2 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @taroface)
v20.1/cockroach-quit.md, line 10 at r4 (raw file):
{{site.data.alerts.callout_danger}} `cockroach quit` is no longer recommended, and will be deprecated in v20.2. To stop a node, it's best to first run [`cockroach node drain`](cockroach-node.html), wait for the node to be completely drained, and then do one of the following:
Again, I don't think we want to recommend node drain
as a replacement for quit
. Just stop the node as described in the include file.
v20.1/upgrade-cockroach-version.md, line 124 at r4 (raw file):
{{site.data.alerts.end}} 1. Drain the node of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries:
I think we want to mark this step as optional, if not remove it completely.
@bdarnell OK, noted! Also cc @knz |
Doesn't decommissioning imply draining? I'm pretty sure it's completely redundant there. Anyway, yes, I'd remove |
@bdarnell My understanding is that |
Yes that understanding is correct. The two features are independent from each other. |
The latest updates to this PR:
@bdarnell @knz Would you mind re-reviewing based on the latest conversation about our node shutdown guidance? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I start to like where this PR is going. Thanks for the improvements.
v20.1/remove-nodes.md
Outdated
|
||
{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %} | ||
|
||
{{site.data.alerts.callout_info}} | ||
In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you need maximum cluster availability during node decommissioning, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and actively monitor the draining process instead of automating it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... stopping a node forcefully using SIGKILL or another signal than SIGTERM can result...
v20.1/remove-nodes.md
Outdated
|
||
{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %} | ||
|
||
{{site.data.alerts.callout_info}} | ||
In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you want to minimize these occurrences, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and monitor the draining process instead of automating it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
v20.1/upgrade-cockroach-version.md
Outdated
|
||
{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %} | ||
|
||
{{site.data.alerts.callout_info}} | ||
In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you need maximum cluster availability during an upgrade, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and actively monitor the draining process instead of automating it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
v20.1/remove-nodes.md
Outdated
@@ -23,7 +23,7 @@ A node is considered to be decommissioned when it meets two criteria: | |||
|
|||
The decommissioning process transfers all range replicas on the node to other nodes. During and after this process, the node is considered "decommissioning" and continues to accept new SQL connections. Even without replicas, the node can still function as a gateway to route connections to relevant data. For this reason, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) continues to consider the node "ready" so load balancers can continue directing traffic to the node. | |||
|
|||
After all range replicas have been transferred, it's typical to drain the node of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries. The node can then be stopped via a process manager or orchestration tool, or by sending `SIGTERM` manually. When stopped, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status response code so that load balancers stop directing traffic to the node. At this point the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned. | |||
After all range replicas have been transferred, the node can be drained of SQL clients, [distributed SQL](architecture/sql-layer.html#distsql) queries, and range leases, and then stopped. This can be done with a process manager or orchestration tool, or by sending `SIGTERM` manually. When stopped, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status response code so that load balancers stop directing traffic to the node. At this point the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding this bit: "When stopped, ..."
I think the paragraph should do a better job to distinguish the initiation of the graceful shutdown (a signal is sent), the process of shutting down/draining, and the process termination.
Then in that context, you can say "at some point during the process of shutting down, the monitoring endpoint starts reporting the node as non-ready, so that load balancers can redirect traffic".
The current text uses the verb "Stopped" which implicitly refers only to the end of the process. That's misleading.
v20.2/remove-nodes.md
Outdated
@@ -23,7 +23,7 @@ A node is considered to be decommissioned when it meets two criteria: | |||
|
|||
The decommissioning process transfers all range replicas on the node to other nodes. During and after this process, the node is considered "decommissioning" and continues to accept new SQL connections. Even without replicas, the node can still function as a gateway to route connections to relevant data. For this reason, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) continues to consider the node "ready" so load balancers can continue directing traffic to the node. | |||
|
|||
After all range replicas have been transferred, it's typical to drain the node of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries. The node can then be stopped via a process manager or orchestration tool, or by sending `SIGTERM` manually. When stopped, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status response code so that load balancers stop directing traffic to the node. At this point the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned. | |||
After all range replicas have been transferred, the node can be drained of SQL clients, [distributed SQL](architecture/sql-layer.html#distsql) queries, and range leases, and then stopped. This can be done with a process manager or orchestration tool, or by sending `SIGTERM` manually. When stopped, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status response code so that load balancers stop directing traffic to the node. At this point the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my note above
v20.2/remove-nodes.md
Outdated
|
||
{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %} | ||
|
||
{{site.data.alerts.callout_info}} | ||
In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you want to minimize these occurrences, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and monitor the draining process instead of automating it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see note above
v20.2/upgrade-cockroach-version.md
Outdated
|
||
{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %} | ||
|
||
{{site.data.alerts.callout_info}} | ||
In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you need maximum cluster availability during an upgrade, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and actively monitor the draining process instead of automating it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see note above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @knz)
v20.1/cockroach-quit.md, line 10 at r4 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Again, I don't think we want to recommend
node drain
as a replacement forquit
. Just stop the node as described in the include file.
Done.
v20.1/remove-nodes.md, line 26 at r5 (raw file):
Previously, knz (kena) wrote…
Regarding this bit: "When stopped, ..."
I think the paragraph should do a better job to distinguish the initiation of the graceful shutdown (a signal is sent), the process of shutting down/draining, and the process termination.
Then in that context, you can say "at some point during the process of shutting down, the monitoring endpoint starts reporting the node as non-ready, so that load balancers can redirect traffic".
The current text uses the verb "Stopped" which implicitly refers only to the end of the process. That's misleading.
Thanks, this was indeed misleading! Hopefully I have the correct sequence now. Let me know if not.
v20.1/remove-nodes.md, line 168 at r5 (raw file):
Previously, knz (kena) wrote…
... stopping a node forcefully using SIGKILL or another signal than SIGTERM can result...
Done.
v20.1/remove-nodes.md, line 325 at r5 (raw file):
Previously, knz (kena) wrote…
ditto
Done.
v20.1/upgrade-cockroach-version.md, line 124 at r4 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I think we want to mark this step as optional, if not remove it completely.
Done.
v20.1/upgrade-cockroach-version.md, line 129 at r5 (raw file):
Previously, knz (kena) wrote…
ditto
Done.
v20.2/remove-nodes.md, line 26 at r5 (raw file):
Previously, knz (kena) wrote…
see my note above
Done.
v20.2/remove-nodes.md, line 325 at r5 (raw file):
Previously, knz (kena) wrote…
see note above
Done.
v20.2/upgrade-cockroach-version.md, line 89 at r5 (raw file):
Previously, knz (kena) wrote…
see note above
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 4 files at r3, 3 of 14 files at r5, 5 of 8 files at r6, 5 of 5 files at r7, 1 of 1 files at r8.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @taroface)
_includes/v20.1/prod-deployment/node-shutdown.md, line 2 at r8 (raw file):
<ul> <li>If the node was started with a process manager, stop the node by sending <code>SIGTERM</code> with the process manager. If the node is not shutting down after 1 minute, send <code>SIGKILL</code>. When using <code><a href="https://www.freedesktop.org/wiki/Software/systemd/" target="_blank">systemd</a></code>, for example, set <code>TimeoutStopSecs=60</code> in your configuration template and run <code>systemctl stop <systemd config filename></code> to stop the node without <code>systemd</code> restarting it.</li>
This "one minute" is problematic. (it's actually incorrect)
This was discussed at this week's KV team meeting.
Also in John's doc notes.
_includes/v20.2/prod-deployment/node-shutdown.md, line 2 at r8 (raw file):
<ul> <li>If the node was started with a process manager, stop the node by sending <code>SIGTERM</code> with the process manager. If the node is not shutting down after 1 minute, send <code>SIGKILL</code>. When using <code><a href="https://www.freedesktop.org/wiki/Software/systemd/" target="_blank">systemd</a></code>, for example, set <code>TimeoutStopSecs=60</code> in your configuration template and run <code>systemctl stop <systemd config filename></code> to stop the node without <code>systemd</code> restarting it.</li>
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @knz, and @taroface)
_includes/v20.1/prod-deployment/node-shutdown.md, line 2 at r8 (raw file):
Previously, knz (kena) wrote…
This "one minute" is problematic. (it's actually incorrect)
This was discussed at this week's KV team meeting.
Also in John's doc notes.
Following our convo today, I tried to capture more of the nuance (as I understand it) in the callout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 6 files at r9.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @taroface)
_includes/v20.1/prod-deployment/node-shutdown.md, line 2 at r8 (raw file):
Previously, taroface (Ryan Kuo) wrote…
Following our convo today, I tried to capture more of the nuance (as I understand it) in the callout.
The callout below is fine.
However there should be a forward reference to the callout at "for example, set TimeoutStopSecs=60
in your configuration template" because we need a heavy disclaimer on that part and the user is unlikely to read further after they get actionable advice from this bullet list.
v20.1/remove-nodes.md, line 168 at r9 (raw file):
{{site.data.alerts.callout_info}} The amount of time you should wait before sending `SIGKILL` can vary depending on your cluster configuration and workload, which affects how long it takes your nodes to complete a graceful shutdown. In certain edge cases, forcefully terminating the process before the node has completed shutdown can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you need maximum cluster availability during node decommissioning, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and actively monitor the draining process instead of automating it.
I like this, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 14 files at r5, 5 of 8 files at r6, 1 of 5 files at r7, 1 of 1 files at r8, 6 of 6 files at r9.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @taroface)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @knz, and @taroface)
_includes/v20.1/prod-deployment/node-shutdown.md, line 2 at r8 (raw file):
Previously, knz (kena) wrote…
The callout below is fine.
However there should be a forward reference to the callout at "for example, set
TimeoutStopSecs=60
in your configuration template" because we need a heavy disclaimer on that part and the user is unlikely to read further after they get actionable advice from this bullet list.
Good point. I moved the callout directly below the list item, where it's most relevant.
@knz ping on review. thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 6 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @taroface)
Addresses: https://cockroachlabs.slack.com/archives/C65RYUY3Z/p1592851886246500
(Also reformats the
node-shutdown.md
include into HTML so that it works inside another list)