
update rolling upgrade docs with node drain, etc. #7542

Merged · 10 commits merged into master on Aug 6, 2020

Conversation

@taroface (Contributor) commented Jun 22, 2020

Addresses: https://cockroachlabs.slack.com/archives/C65RYUY3Z/p1592851886246500

(Also reformats the node-shutdown.md include into HTML so that it works inside another list)

@taroface taroface requested a review from knz June 22, 2020 23:44
@cockroach-teamcity (Member) commented:

This change is Reviewable

@knz (Contributor) left a comment


The recommendation written here is "have systemd send two signals".

@taroface The current word on the street (and what I told SEs/TAMs) is to first issue node drain, then wait for that to complete, and only then send SIGTERM, wait 1 minute, then SIGKILL if necessary.

However, I wonder if this can be automated. @bdarnell Do you know if we can have systemd use node drain as first graceful command instead of sending a signal, or before?

(NB I appreciate your concerns about the authentication aspects of node drain, but let us set this discussion aside for this doc PR)
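The manual sequence described above (drain, wait for completion, SIGTERM, then a grace period before SIGKILL as a last resort) can be sketched as a small shell helper. This is a rough illustration rather than official tooling: `graceful_stop` is a hypothetical function, and the drain invocation shown in the comment uses illustrative flags.

```shell
# Hypothetical helper: send SIGTERM, wait up to a grace period, and
# escalate to SIGKILL only if the process is still running.
graceful_stop() {
  pid=$1
  grace=${2:-60}   # seconds to wait after SIGTERM

  kill -TERM "$pid" 2>/dev/null

  i=0
  while kill -0 "$pid" 2>/dev/null && [ "$i" -lt "$grace" ]; do
    sleep 1
    i=$((i + 1))
  done

  # Escalate only if the node has not shut down within the grace period.
  if kill -0 "$pid" 2>/dev/null; then
    kill -KILL "$pid"
  fi
}

# Per the sequence above, you would first drain the node and wait for
# the drain to complete, e.g. (flags and paths illustrative):
#   cockroach node drain --certs-dir=certs --host=localhost:26257
# and only then call graceful_stop with the node's PID.
```

The drain step is left as a comment because, as discussed later in this thread, it is hard to automate well; the helper only covers the signal escalation.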

@taroface (Contributor, Author) commented:

and only then send SIGTERM, wait 1 minute, then SIGKILL if necessary.

@knz Thanks for these clarifications.

When is it necessary to send SIGKILL? Does this also mean that SIGTERM is "normally" the only required signal?

How are these signals handled by systemctl stop <systemd config filename> (this is the currently documented command on the upgrade doc)?

@bdarnell (Contributor) left a comment


In my view cockroach node drain is similar to the SAVEPOINT cockroach_restart protocol - it has some benefits so we should consider using it in our own tooling (i.e. the k8s operator) but it's so much of a pain to automate that we should not expect most customers to use/script it themselves. I think systemd users should only use signals here.

However, I wonder if this can be automated. @bdarnell Do you know if we can have systemd use node drain as first graceful command instead of sending a signal, or before?

Yes, this is possible (see the ExecStop option). It just requires you to keep a root cert and key around. But I don't know what kind of user experience this would have if the node fails to drain and the command just hangs (does it proceed to the SIGKILL after one minute? If so, what good is it to use node drain instead of SIGTERM? If not, is it just a black box, or can the user see what's going on?)
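A unit using ExecStop this way might look roughly like the following. This is a sketch of the approach described above, not a published CockroachDB template; the binary path, cert locations, and flags are assumptions.

```ini
# Illustrative systemd unit fragment (not an official template);
# paths, flags, and cert locations are assumptions.
[Service]
ExecStart=/usr/local/bin/cockroach start --certs-dir=/var/lib/cockroach/certs --store=/var/lib/cockroach
# ExecStop runs before systemd sends any stop signal; it requires a
# client cert and key readable by the service user.
ExecStop=/usr/local/bin/cockroach node drain --certs-dir=/var/lib/cockroach/certs --host=localhost:26257
# Bounds the stop phase (including ExecStop); once this window elapses,
# systemd escalates to SIGKILL.
TimeoutStopSec=60
```

If the drain hangs, the TimeoutStopSec window is what eventually forces termination, which is exactly the UX question raised above.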

When is it necessary to send SIGKILL? Does this also mean that SIGTERM is "normally" the only required signal?

Normally SIGTERM is all that is required; the SIGKILL is only a fallback if something is going wrong and the node isn't shutting down as expected. In cases where SIGTERM isn't enough, cockroach node drain is likely to hang too.

How are these signals handled by systemctl stop (this is the currently documented command on the upgrade doc)?

systemd will have the desired behavior automatically if you set TimeoutStopSec=60 (which is set in the config templates we publish).
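For context on why `systemctl stop` behaves this way: with no ExecStop= configured, systemd first sends the unit's KillSignal= (SIGTERM by default) and escalates to SIGKILL once TimeoutStopSec= elapses. A minimal fragment expressing that behavior (values illustrative):

```ini
[Service]
# systemctl stop sends this signal first (SIGTERM is already the default).
KillSignal=SIGTERM
# If the process is still running after this many seconds, systemd sends
# SIGKILL. Note the directive name is TimeoutStopSec (singular).
TimeoutStopSec=60
```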

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@taroface (Contributor, Author) commented Jul 1, 2020

@bdarnell Thank you for clarifying! Are the updated instructions accurate?

@jseldess (Contributor) commented:

@bdarnell, ping.

@bdarnell (Contributor) left a comment


The include file now looks good, but the "just use signals" message in that file is inconsistent with the instructions to use node drain in the other two.

Reviewed 2 of 6 files at r2, 2 of 4 files at r3, 2 of 2 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @taroface)


v20.1/cockroach-quit.md, line 10 at r4 (raw file):

{{site.data.alerts.callout_danger}}
`cockroach quit` is no longer recommended, and will be deprecated in v20.2. To stop a node, it's best to first run [`cockroach node drain`](cockroach-node.html), wait for the node to be completely drained, and then do one of the following:

Again, I don't think we want to recommend node drain as a replacement for quit. Just stop the node as described in the include file.


v20.1/upgrade-cockroach-version.md, line 124 at r4 (raw file):

{{site.data.alerts.end}}

1. Drain the node of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries:

I think we want to mark this step as optional, if not remove it completely.

@taroface (Contributor, Author) commented Jul 13, 2020

  1. Drain the node of SQL clients and distributed SQL queries:

I think we want to mark this step as optional, if not remove it completely.

@bdarnell OK, noted! cockroach node drain is also listed as a required step for node decommissioning. Do you mean that draining should always be optional before stopping a node, or only when stopping a node to upgrade?

Also cc @knz

@bdarnell (Contributor) commented:

Doesn't decommissioning imply draining? I'm pretty sure it's completely redundant there. Anyway, yes, I'd remove node drain from the decommissioning process too. The only place I feel node drain is appropriate is for advanced users who are building it in to reusable infrastructure such as a k8s operator.

@taroface (Contributor, Author) commented:

Doesn't decommissioning imply draining?

@bdarnell My understanding is that node drain is for draining SQL clients and distributed SQL queries, and that running node decommission mainly transfers ranges off the node, while continuing to accept SQL connections. Is it OK to decommission a node without draining SQL clients and DistSQL?

@knz (Contributor) commented Jul 13, 2020

My understanding is that node drain is for draining SQL clients and distributed SQL queries, and that running node decommission mainly transfers ranges off the node, while continuing to accept SQL connections.

Yes that understanding is correct. The two features are independent from each other.

@taroface (Contributor, Author) commented:

The latest updates to this PR:

  • Remove node drain as a required step during decommissioning and upgrades
  • Add a callout on the data unavailability / latency edge cases with SIGTERM + SIGKILL
  • Replace instances of pkill cockroach with the node shutdown steps discussed here

@bdarnell @knz Would you mind re-reviewing based on the latest conversation about our node shutdown guidance?

@knz (Contributor) left a comment

I'm starting to like where this PR is going. Thanks for the improvements.


{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

{{site.data.alerts.callout_info}}
In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you need maximum cluster availability during node decommissioning, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and actively monitor the draining process instead of automating it.
@knz (Contributor) commented:

... stopping a node forcefully using SIGKILL or a signal other than SIGTERM can result...


{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

{{site.data.alerts.callout_info}}
In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you want to minimize these occurrences, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and monitor the draining process instead of automating it.
@knz (Contributor) commented:

ditto


{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

{{site.data.alerts.callout_info}}
In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you need maximum cluster availability during an upgrade, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and actively monitor the draining process instead of automating it.
@knz (Contributor) commented:

ditto

@@ -23,7 +23,7 @@ A node is considered to be decommissioned when it meets two criteria:

The decommissioning process transfers all range replicas on the node to other nodes. During and after this process, the node is considered "decommissioning" and continues to accept new SQL connections. Even without replicas, the node can still function as a gateway to route connections to relevant data. For this reason, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) continues to consider the node "ready" so load balancers can continue directing traffic to the node.

After all range replicas have been transferred, it's typical to drain the node of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries. The node can then be stopped via a process manager or orchestration tool, or by sending `SIGTERM` manually. When stopped, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status response code so that load balancers stop directing traffic to the node. At this point the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned.
After all range replicas have been transferred, the node can be drained of SQL clients, [distributed SQL](architecture/sql-layer.html#distsql) queries, and range leases, and then stopped. This can be done with a process manager or orchestration tool, or by sending `SIGTERM` manually. When stopped, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status response code so that load balancers stop directing traffic to the node. At this point the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned.
@knz (Contributor) commented:

Regarding this bit: "When stopped, ..."

I think the paragraph should do a better job to distinguish the initiation of the graceful shutdown (a signal is sent), the process of shutting down/draining, and the process termination.

Then in that context, you can say "at some point during the process of shutting down, the monitoring endpoint starts reporting the node as non-ready, so that load balancers can redirect traffic".

The current text uses the verb "Stopped" which implicitly refers only to the end of the process. That's misleading.

@@ -23,7 +23,7 @@ A node is considered to be decommissioned when it meets two criteria:

The decommissioning process transfers all range replicas on the node to other nodes. During and after this process, the node is considered "decommissioning" and continues to accept new SQL connections. Even without replicas, the node can still function as a gateway to route connections to relevant data. For this reason, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) continues to consider the node "ready" so load balancers can continue directing traffic to the node.

After all range replicas have been transferred, it's typical to drain the node of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries. The node can then be stopped via a process manager or orchestration tool, or by sending `SIGTERM` manually. When stopped, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status response code so that load balancers stop directing traffic to the node. At this point the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned.
After all range replicas have been transferred, the node can be drained of SQL clients, [distributed SQL](architecture/sql-layer.html#distsql) queries, and range leases, and then stopped. This can be done with a process manager or orchestration tool, or by sending `SIGTERM` manually. When stopped, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status response code so that load balancers stop directing traffic to the node. At this point the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned.
@knz (Contributor) commented:

see my note above


{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

{{site.data.alerts.callout_info}}
In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you want to minimize these occurrences, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and monitor the draining process instead of automating it.
@knz (Contributor) commented:

see note above


{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

{{site.data.alerts.callout_info}}
In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you need maximum cluster availability during an upgrade, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and actively monitor the draining process instead of automating it.
@knz (Contributor) commented:

see note above

@taroface (Contributor, Author) left a comment

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @knz)


v20.1/cockroach-quit.md, line 10 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Again, I don't think we want to recommend node drain as a replacement for quit. Just stop the node as described in the include file.

Done.


v20.1/remove-nodes.md, line 26 at r5 (raw file):

Previously, knz (kena) wrote…

Regarding this bit: "When stopped, ..."

I think the paragraph should do a better job to distinguish the initiation of the graceful shutdown (a signal is sent), the process of shutting down/draining, and the process termination.

Then in that context, you can say "at some point during the process of shutting down, the monitoring endpoint starts reporting the node as non-ready, so that load balancers can redirect traffic".

The current text uses the verb "Stopped" which implicitly refers only to the end of the process. That's misleading.

Thanks, this was indeed misleading! Hopefully I have the correct sequence now. Let me know if not.


v20.1/remove-nodes.md, line 168 at r5 (raw file):

Previously, knz (kena) wrote…

... stopping a node forcefully using SIGKILL or a signal other than SIGTERM can result...

Done.


v20.1/remove-nodes.md, line 325 at r5 (raw file):

Previously, knz (kena) wrote…

ditto

Done.


v20.1/upgrade-cockroach-version.md, line 124 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think we want to mark this step as optional, if not remove it completely.

Done.


v20.1/upgrade-cockroach-version.md, line 129 at r5 (raw file):

Previously, knz (kena) wrote…

ditto

Done.


v20.2/remove-nodes.md, line 26 at r5 (raw file):

Previously, knz (kena) wrote…

see my note above

Done.


v20.2/remove-nodes.md, line 325 at r5 (raw file):

Previously, knz (kena) wrote…

see note above

Done.


v20.2/upgrade-cockroach-version.md, line 89 at r5 (raw file):

Previously, knz (kena) wrote…

see note above

Done.

@knz (Contributor) left a comment

Reviewed 1 of 4 files at r3, 3 of 14 files at r5, 5 of 8 files at r6, 5 of 5 files at r7, 1 of 1 files at r8.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @taroface)


_includes/v20.1/prod-deployment/node-shutdown.md, line 2 at r8 (raw file):

<ul>
	<li>If the node was started with a process manager, stop the node by sending <code>SIGTERM</code> with the process manager. If the node is not shutting down after 1 minute, send <code>SIGKILL</code>. When using <code><a href="https://www.freedesktop.org/wiki/Software/systemd/" target="_blank">systemd</a></code>, for example, set <code>TimeoutStopSec=60</code> in your configuration template and run <code>systemctl stop &lt;systemd config filename&gt;</code> to stop the node without <code>systemd</code> restarting it.</li>

This "one minute" is problematic. (it's actually incorrect)

This was discussed at this week's KV team meeting.
Also in John's doc notes.


_includes/v20.2/prod-deployment/node-shutdown.md, line 2 at r8 (raw file):

<ul>
	<li>If the node was started with a process manager, stop the node by sending <code>SIGTERM</code> with the process manager. If the node is not shutting down after 1 minute, send <code>SIGKILL</code>. When using <code><a href="https://www.freedesktop.org/wiki/Software/systemd/" target="_blank">systemd</a></code>, for example, set <code>TimeoutStopSec=60</code> in your configuration template and run <code>systemctl stop &lt;systemd config filename&gt;</code> to stop the node without <code>systemd</code> restarting it.</li>

ditto

@taroface (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @knz, and @taroface)


_includes/v20.1/prod-deployment/node-shutdown.md, line 2 at r8 (raw file):

Previously, knz (kena) wrote…

This "one minute" is problematic. (it's actually incorrect)

This was discussed at this week's KV team meeting.
Also in John's doc notes.

Following our convo today, I tried to capture more of the nuance (as I understand it) in the callout.

@knz (Contributor) left a comment

Reviewed 6 of 6 files at r9.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @taroface)


_includes/v20.1/prod-deployment/node-shutdown.md, line 2 at r8 (raw file):

Previously, taroface (Ryan Kuo) wrote…

Following our convo today, I tried to capture more of the nuance (as I understand it) in the callout.

The callout below is fine.

However there should be a forward reference to the callout at "for example, set TimeoutStopSecs=60 in your configuration template" because we need a heavy disclaimer on that part and the user is unlikely to read further after they get actionable advice from this bullet list.


v20.1/remove-nodes.md, line 168 at r9 (raw file):

{{site.data.alerts.callout_info}}
The amount of time you should wait before sending `SIGKILL` can vary depending on your cluster configuration and workload, which affects how long it takes your nodes to complete a graceful shutdown. In certain edge cases, forcefully terminating the process before the node has completed shutdown can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you need maximum cluster availability during node decommissioning, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and actively monitor the draining process instead of automating it.

I like this, thanks

@bdarnell (Contributor) left a comment

Reviewed 3 of 14 files at r5, 5 of 8 files at r6, 1 of 5 files at r7, 1 of 1 files at r8, 6 of 6 files at r9.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @taroface)

@taroface (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @knz, and @taroface)


_includes/v20.1/prod-deployment/node-shutdown.md, line 2 at r8 (raw file):

Previously, knz (kena) wrote…

The callout below is fine.

However there should be a forward reference to the callout at "for example, set TimeoutStopSecs=60 in your configuration template" because we need a heavy disclaimer on that part and the user is unlikely to read further after they get actionable advice from this bullet list.

Good point. I moved the callout directly below the list item, where it's most relevant.

@taroface (Contributor, Author) commented Aug 6, 2020

@knz ping on review. thanks!

@knz (Contributor) left a comment

Reviewed 6 of 6 files at r10.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @taroface)

@taroface taroface requested a review from Amruta-Ranade August 6, 2020 15:19
@taroface taroface merged commit 255b375 into master Aug 6, 2020
@taroface taroface deleted the rolling-upgrade branch August 6, 2020 15:41