
update rolling upgrade docs with node drain, etc. #7542

Merged · 10 commits merged into master on Aug 6, 2020

Conversation

@taroface (Contributor) commented Jun 22, 2020

Addresses: https://cockroachlabs.slack.com/archives/C65RYUY3Z/p1592851886246500

(Also reformats the node-shutdown.md include into HTML so that it works inside another list)

@taroface taroface requested a review from knz June 22, 2020 23:44
@cockroach-teamcity (Member) commented:

This change is Reviewable

@knz (Contributor) left a comment


The recommendation written here is "have systemd send two signals".

@taroface The current word on the street (and what I told SEs/TAMs) is to first issue node drain, then wait for that to complete, and only then send SIGTERM, wait 1 minute, then SIGKILL if necessary.

However, I wonder if this can be automated. @bdarnell Do you know if we can have systemd use node drain as first graceful command instead of sending a signal, or before?

(NB I appreciate your concerns about the authentication aspects of node drain, but let us set this discussion aside for this doc PR)
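The manual sequence described above (drain, wait for completion, SIGTERM, then a grace period before SIGKILL as a last resort) can be sketched as a small shell helper. This is a rough illustration rather than official tooling: `graceful_stop` is a hypothetical function, and the drain invocation shown in the comment uses illustrative flags.

```shell
# Hypothetical helper: send SIGTERM, wait up to a grace period, and
# escalate to SIGKILL only if the process is still running.
graceful_stop() {
  pid=$1
  grace=${2:-60}   # seconds to wait after SIGTERM

  kill -TERM "$pid" 2>/dev/null

  i=0
  while kill -0 "$pid" 2>/dev/null && [ "$i" -lt "$grace" ]; do
    sleep 1
    i=$((i + 1))
  done

  # Escalate only if the node has not shut down within the grace period.
  if kill -0 "$pid" 2>/dev/null; then
    kill -KILL "$pid"
  fi
}

# Per the sequence above, you would first drain the node and wait for
# the drain to complete, e.g. (flags and paths illustrative):
#   cockroach node drain --certs-dir=certs --host=localhost:26257
# and only then call graceful_stop with the node's PID.
```

The drain step is left as a comment because, as discussed later in this thread, it is hard to automate well; the helper only covers the signal escalation.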

@taroface (Contributor, Author) commented:

and only then send SIGTERM, wait 1 minute, then SIGKILL if necessary.

@knz Thanks for these clarifications.

When is it necessary to send SIGKILL? Does this also mean that SIGTERM is "normally" the only required signal?

How are these signals handled by systemctl stop <systemd config filename> (this is the currently documented command on the upgrade doc)?

@bdarnell (Contributor) left a comment


In my view cockroach node drain is similar to the SAVEPOINT cockroach_restart protocol - it has some benefits so we should consider using it in our own tooling (i.e. the k8s operator) but it's so much of a pain to automate that we should not expect most customers to use/script it themselves. I think systemd users should only use signals here.

However, I wonder if this can be automated. @bdarnell Do you know if we can have systemd use node drain as first graceful command instead of sending a signal, or before?

Yes, this is possible (see the ExecStop option). It just requires you to keep a root cert and key around. But I don't know what kind of user experience this would have if the node fails to drain and the command just hangs (does it proceed to the SIGKILL after one minute? If so, what good is it to use node drain instead of SIGTERM? If not, is it just a black box, or can the user see what's going on?)
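A unit using ExecStop this way might look roughly like the following. This is a sketch of the approach described above, not a published CockroachDB template; the binary path, cert locations, and flags are assumptions.

```ini
# Illustrative systemd unit fragment (not an official template);
# paths, flags, and cert locations are assumptions.
[Service]
ExecStart=/usr/local/bin/cockroach start --certs-dir=/var/lib/cockroach/certs --store=/var/lib/cockroach
# ExecStop runs before systemd sends any stop signal; it requires a
# client cert and key readable by the service user.
ExecStop=/usr/local/bin/cockroach node drain --certs-dir=/var/lib/cockroach/certs --host=localhost:26257
# Bounds the stop phase (including ExecStop); once this window elapses,
# systemd escalates to SIGKILL.
TimeoutStopSec=60
```

If the drain hangs, the TimeoutStopSec window is what eventually forces termination, which is exactly the UX question raised above.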

When is it necessary to send SIGKILL? Does this also mean that SIGTERM is "normally" the only required signal?

Normally SIGTERM is all that is required; the SIGKILL is only a fallback if something is going wrong and the node isn't shutting down as expected. In cases where SIGTERM isn't enough, cockroach node drain is likely to hang too.

How are these signals handled by systemctl stop (this is the currently documented command on the upgrade doc)?

systemd will have the desired behavior automatically if you set TimeoutStopSec=60 (which is set in the config templates we publish).
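For context on why `systemctl stop` behaves this way: with no ExecStop= configured, systemd first sends the unit's KillSignal= (SIGTERM by default) and escalates to SIGKILL once TimeoutStopSec= elapses. A minimal fragment expressing that behavior (values illustrative):

```ini
[Service]
# systemctl stop sends this signal first (SIGTERM is already the default).
KillSignal=SIGTERM
# If the process is still running after this many seconds, systemd sends
# SIGKILL. Note the directive name is TimeoutStopSec (singular).
TimeoutStopSec=60
```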

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@taroface (Contributor, Author) commented Jul 1, 2020

@bdarnell Thank you for clarifying! Are the updated instructions accurate?

@jseldess (Contributor) commented:

@bdarnell, ping.

@bdarnell (Contributor) left a comment


The include file now looks good, but the "just use signals" message in that file is inconsistent with the instructions to use node drain in the other two.

Reviewed 2 of 6 files at r2, 2 of 4 files at r3, 2 of 2 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @taroface)


v20.1/cockroach-quit.md, line 10 at r4 (raw file):

{{site.data.alerts.callout_danger}}
`cockroach quit` is no longer recommended, and will be deprecated in v20.2. To stop a node, it's best to first run [`cockroach node drain`](cockroach-node.html), wait for the node to be completely drained, and then do one of the following:

Again, I don't think we want to recommend node drain as a replacement for quit. Just stop the node as described in the include file.


v20.1/upgrade-cockroach-version.md, line 124 at r4 (raw file):

{{site.data.alerts.end}}

1. Drain the node of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries:

I think we want to mark this step as optional, if not remove it completely.

@taroface (Contributor, Author) commented Jul 13, 2020

  1. Drain the node of SQL clients and distributed SQL queries:

I think we want to mark this step as optional, if not remove it completely.

@bdarnell OK, noted! cockroach node drain is also listed as a required step for node decommissioning. Do you mean that draining should always be optional before stopping a node, or only when stopping a node to upgrade?

Also cc @knz

@bdarnell (Contributor) commented:

Doesn't decommissioning imply draining? I'm pretty sure it's completely redundant there. Anyway, yes, I'd remove node drain from the decommissioning process too. The only place I feel node drain is appropriate is for advanced users who are building it in to reusable infrastructure such as a k8s operator.

@taroface (Contributor, Author) commented:

Doesn't decommissioning imply draining?

@bdarnell My understanding is that node drain is for draining SQL clients and distributed SQL queries, and that running node decommission mainly transfers ranges off the node, while continuing to accept SQL connections. Is it OK to decommission a node without draining SQL clients and DistSQL?

@knz (Contributor) commented Jul 13, 2020

My understanding is that node drain is for draining SQL clients and distributed SQL queries, and that running node decommission mainly transfers ranges off the node, while continuing to accept SQL connections.

Yes that understanding is correct. The two features are independent from each other.

@taroface (Contributor, Author) commented:

The latest updates to this PR:

  • Remove node drain as a required step during decommissioning and upgrades
  • Add a callout on the data unavailability / latency edge cases with SIGTERM + SIGKILL
  • Replace instances of pkill cockroach with the node shutdown steps discussed here

@bdarnell @knz Would you mind re-reviewing based on the latest conversation about our node shutdown guidance?

@knz (Contributor) left a comment

I'm starting to like where this PR is going. Thanks for the improvements.


{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

{{site.data.alerts.callout_info}}
In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you need maximum cluster availability during node decommissioning, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and actively monitor the draining process instead of automating it.
@knz (Contributor) commented:

... stopping a node forcefully using SIGKILL or a signal other than SIGTERM can result...


{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

{{site.data.alerts.callout_info}}
In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you want to minimize these occurrences, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and monitor the draining process instead of automating it.
@knz (Contributor) commented:

ditto


{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

{{site.data.alerts.callout_info}}
In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you need maximum cluster availability during an upgrade, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and actively monitor the draining process instead of automating it.
@knz (Contributor) commented:

ditto

@@ -23,7 +23,7 @@ A node is considered to be decommissioned when it meets two criteria:

The decommissioning process transfers all range replicas on the node to other nodes. During and after this process, the node is considered "decommissioning" and continues to accept new SQL connections. Even without replicas, the node can still function as a gateway to route connections to relevant data. For this reason, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) continues to consider the node "ready" so load balancers can continue directing traffic to the node.

After all range replicas have been transferred, it's typical to drain the node of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries. The node can then be stopped via a process manager or orchestration tool, or by sending `SIGTERM` manually. When stopped, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status response code so that load balancers stop directing traffic to the node. At this point the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned.
After all range replicas have been transferred, the node can be drained of SQL clients, [distributed SQL](architecture/sql-layer.html#distsql) queries, and range leases, and then stopped. This can be done with a process manager or orchestration tool, or by sending `SIGTERM` manually. When stopped, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status response code so that load balancers stop directing traffic to the node. At this point the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned.
@knz (Contributor) commented:

Regarding this bit: "When stopped, ..."

I think the paragraph should do a better job to distinguish the initiation of the graceful shutdown (a signal is sent), the process of shutting down/draining, and the process termination.

Then in that context, you can say "at some point during the process of shutting down, the monitoring endpoint starts reporting the node as non-ready, so that load balancers can redirect traffic".

The current text uses the verb "Stopped" which implicitly refers only to the end of the process. That's misleading.

@@ -23,7 +23,7 @@ A node is considered to be decommissioned when it meets two criteria:

The decommissioning process transfers all range replicas on the node to other nodes. During and after this process, the node is considered "decommissioning" and continues to accept new SQL connections. Even without replicas, the node can still function as a gateway to route connections to relevant data. For this reason, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) continues to consider the node "ready" so load balancers can continue directing traffic to the node.

After all range replicas have been transferred, it's typical to drain the node of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries. The node can then be stopped via a process manager or orchestration tool, or by sending `SIGTERM` manually. When stopped, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status response code so that load balancers stop directing traffic to the node. At this point the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned.
After all range replicas have been transferred, the node can be drained of SQL clients, [distributed SQL](architecture/sql-layer.html#distsql) queries, and range leases, and then stopped. This can be done with a process manager or orchestration tool, or by sending `SIGTERM` manually. When stopped, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status response code so that load balancers stop directing traffic to the node. At this point the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned.
@knz (Contributor) commented:

see my note above


{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

{{site.data.alerts.callout_info}}
In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you want to minimize these occurrences, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and monitor the draining process instead of automating it.
@knz (Contributor) commented:

see note above


{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

{{site.data.alerts.callout_info}}
In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you need maximum cluster availability during an upgrade, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and actively monitor the draining process instead of automating it.
@knz (Contributor) commented:

see note above

@taroface (Contributor, Author) left a comment

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @knz)


v20.1/cockroach-quit.md, line 10 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Again, I don't think we want to recommend node drain as a replacement for quit. Just stop the node as described in the include file.

Done.


v20.1/remove-nodes.md, line 26 at r5 (raw file):

Previously, knz (kena) wrote…

Regarding this bit: "When stopped, ..."

I think the paragraph should do a better job to distinguish the initiation of the graceful shutdown (a signal is sent), the process of shutting down/draining, and the process termination.

Then in that context, you can say "at some point during the process of shutting down, the monitoring endpoint starts reporting the node as non-ready, so that load balancers can redirect traffic".

The current text uses the verb "Stopped" which implicitly refers only to the end of the process. That's misleading.

Thanks, this was indeed misleading! Hopefully I have the correct sequence now. Let me know if not.


v20.1/remove-nodes.md, line 168 at r5 (raw file):

Previously, knz (kena) wrote…

... stopping a node forcefully using SIGKILL or a signal other than SIGTERM can result...

Done.


v20.1/remove-nodes.md, line 325 at r5 (raw file):

Previously, knz (kena) wrote…

ditto

Done.


v20.1/upgrade-cockroach-version.md, line 124 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think we want to mark this step as optional, if not remove it completely.

Done.


v20.1/upgrade-cockroach-version.md, line 129 at r5 (raw file):

Previously, knz (kena) wrote…

ditto

Done.


v20.2/remove-nodes.md, line 26 at r5 (raw file):

Previously, knz (kena) wrote…

see my note above

Done.


v20.2/remove-nodes.md, line 325 at r5 (raw file):

Previously, knz (kena) wrote…

see note above

Done.


v20.2/upgrade-cockroach-version.md, line 89 at r5 (raw file):

Previously, knz (kena) wrote…

see note above

Done.

@knz (Contributor) left a comment

Reviewed 1 of 4 files at r3, 3 of 14 files at r5, 5 of 8 files at r6, 5 of 5 files at r7, 1 of 1 files at r8.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @taroface)


_includes/v20.1/prod-deployment/node-shutdown.md, line 2 at r8 (raw file):

<ul>
	<li>If the node was started with a process manager, stop the node by sending <code>SIGTERM</code> with the process manager. If the node is not shutting down after 1 minute, send <code>SIGKILL</code>. When using <code><a href="https://www.freedesktop.org/wiki/Software/systemd/" target="_blank">systemd</a></code>, for example, set <code>TimeoutStopSec=60</code> in your configuration template and run <code>systemctl stop &lt;systemd config filename&gt;</code> to stop the node without <code>systemd</code> restarting it.</li>

This "one minute" is problematic. (it's actually incorrect)

This was discussed at this week's KV team meeting.
Also in John's doc notes.


_includes/v20.2/prod-deployment/node-shutdown.md, line 2 at r8 (raw file):

<ul>
	<li>If the node was started with a process manager, stop the node by sending <code>SIGTERM</code> with the process manager. If the node is not shutting down after 1 minute, send <code>SIGKILL</code>. When using <code><a href="https://www.freedesktop.org/wiki/Software/systemd/" target="_blank">systemd</a></code>, for example, set <code>TimeoutStopSec=60</code> in your configuration template and run <code>systemctl stop &lt;systemd config filename&gt;</code> to stop the node without <code>systemd</code> restarting it.</li>

ditto

@taroface (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @knz, and @taroface)


_includes/v20.1/prod-deployment/node-shutdown.md, line 2 at r8 (raw file):

Previously, knz (kena) wrote…

This "one minute" is problematic. (it's actually incorrect)

This was discussed at this week's KV team meeting.
Also in John's doc notes.

Following our convo today, I tried to capture more of the nuance (as I understand it) in the callout.

@knz (Contributor) left a comment

Reviewed 6 of 6 files at r9.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @taroface)


_includes/v20.1/prod-deployment/node-shutdown.md, line 2 at r8 (raw file):

Previously, taroface (Ryan Kuo) wrote…

Following our convo today, I tried to capture more of the nuance (as I understand it) in the callout.

The callout below is fine.

However there should be a forward reference to the callout at "for example, set TimeoutStopSecs=60 in your configuration template" because we need a heavy disclaimer on that part and the user is unlikely to read further after they get actionable advice from this bullet list.


v20.1/remove-nodes.md, line 168 at r9 (raw file):

{{site.data.alerts.callout_info}}
The amount of time you should wait before sending `SIGKILL` can vary depending on your cluster configuration and workload, which affects how long it takes your nodes to complete a graceful shutdown. In certain edge cases, forcefully terminating the process before the node has completed shutdown can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you need maximum cluster availability during node decommissioning, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and actively monitor the draining process instead of automating it.

I like this, thanks

@bdarnell (Contributor) left a comment

Reviewed 3 of 14 files at r5, 5 of 8 files at r6, 1 of 5 files at r7, 1 of 1 files at r8, 6 of 6 files at r9.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @taroface)

@taroface (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @knz, and @taroface)


_includes/v20.1/prod-deployment/node-shutdown.md, line 2 at r8 (raw file):

Previously, knz (kena) wrote…

The callout below is fine.

However there should be a forward reference to the callout at "for example, set TimeoutStopSecs=60 in your configuration template" because we need a heavy disclaimer on that part and the user is unlikely to read further after they get actionable advice from this bullet list.

Good point. I moved the callout directly below the list item, where it's most relevant.

@taroface (Contributor, Author) commented Aug 6, 2020

@knz ping on review. thanks!

@knz (Contributor) left a comment

Reviewed 6 of 6 files at r10.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @taroface)

@taroface taroface requested a review from Amruta-Ranade August 6, 2020 15:19
@taroface taroface merged commit 255b375 into master Aug 6, 2020
@taroface taroface deleted the rolling-upgrade branch August 6, 2020 15:41