update rolling upgrade docs with node drain, etc. #7542

Merged
merged 10 commits into from
Aug 6, 2020
11 changes: 8 additions & 3 deletions _includes/v20.1/prod-deployment/node-shutdown.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,8 @@
- If the node was started with a process manager like [systemd](https://www.freedesktop.org/wiki/Software/systemd/), stop the node using the process manager. The process manager should be configured to send `SIGTERM` and then, after about 1 minute, `SIGKILL`.
- If the node was started using [`cockroach start`](cockroach-start.html) and is running in the foreground, press `ctrl-c` in the terminal.
- If the node was started using [`cockroach start`](cockroach-start.html) and the `--background` and `--pid-file` flags, run `kill <pid>`, where `<pid>` is the process ID of the node.
<ul>
<li>If the node was started with a process manager, gracefully stop the node by sending <code>SIGTERM</code> with the process manager. If the node is not shutting down after 1 minute, send <code>SIGKILL</code> to terminate the process. When using <code><a href="https://www.freedesktop.org/wiki/Software/systemd/" target="_blank">systemd</a></code>, for example, set <code>TimeoutStopSec=60</code> in your configuration template and run <code>systemctl stop &lt;systemd config filename&gt;</code> to stop the node without <code>systemd</code> restarting it.</li>
<div class="bs-callout bs-callout--info"><div class="bs-callout__label">Note:</div>
<p>The amount of time you should wait before sending <code>SIGKILL</code> can vary depending on your cluster configuration and workload, which affect how long it takes your nodes to complete a graceful shutdown. In certain edge cases, forcefully terminating the process before the node has completed shutdown can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you need maximum cluster availability, you can run <a href="cockroach-node.html"><code>cockroach node drain</code></a> prior to node shutdown and actively monitor the draining process instead of automating it.</p>
</div>
<li>If the node was started using <a href="cockroach-start.html"><code>cockroach start</code></a> and is running in the foreground, press <code>ctrl-c</code> in the terminal.</li>
<li>If the node was started using <a href="cockroach-start.html"><code>cockroach start</code></a> and the <code>--background</code> and <code>--pid-file</code> flags, run <code>kill &lt;pid&gt;</code>, where <code>&lt;pid&gt;</code> is the process ID of the node.</li>
</ul>
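
The `SIGTERM`-then-`SIGKILL` sequence described in the list above can be sketched as a small shell helper. This is a minimal illustration and not part of the CockroachDB tooling: the function name `graceful_stop` and its arguments are hypothetical, and the 60-second default only mirrors the 1-minute guidance; the right wait time depends on your cluster and workload.

```shell
# Hypothetical helper (not a CockroachDB command):
#   graceful_stop <pid> [timeout_seconds]
# Sends SIGTERM, polls until the process exits, and falls back to SIGKILL
# if the process is still running when the timeout elapses.
graceful_stop() {
  local pid="$1"
  local timeout="${2:-60}"   # assumption: 60s default, per the 1-minute guidance
  local waited=0

  # Request a graceful shutdown; if the process is already gone, we are done.
  kill -TERM "$pid" 2>/dev/null || return 0

  # Poll once per second while the process still exists.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "pid $pid still running after ${timeout}s; sending SIGKILL" >&2
      kill -KILL "$pid" 2>/dev/null
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

A call such as `graceful_stop "$(cat <pid file>)" 60` returns 0 if the process exited on its own and 1 if it had to be killed, so a wrapper script can log or alert on forced terminations.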
11 changes: 8 additions & 3 deletions _includes/v20.2/prod-deployment/node-shutdown.md
@@ -1,3 +1,8 @@
- If the node was started with a process manager like [systemd](https://www.freedesktop.org/wiki/Software/systemd/), stop the node using the process manager. The process manager should be configured to send `SIGTERM` and then, after about 1 minute, `SIGKILL`.
- If the node was started using [`cockroach start`](cockroach-start.html) and is running in the foreground, press `ctrl-c` in the terminal.
- If the node was started using [`cockroach start`](cockroach-start.html) and the `--background` and `--pid-file` flags, run `kill <pid>`, where `<pid>` is the process ID of the node.
<ul>
<li>If the node was started with a process manager, gracefully stop the node by sending <code>SIGTERM</code> with the process manager. If the node is not shutting down after 1 minute, send <code>SIGKILL</code> to terminate the process. When using <code><a href="https://www.freedesktop.org/wiki/Software/systemd/" target="_blank">systemd</a></code>, for example, set <code>TimeoutStopSec=60</code> in your configuration template and run <code>systemctl stop &lt;systemd config filename&gt;</code> to stop the node without <code>systemd</code> restarting it.</li>
<div class="bs-callout bs-callout--info"><div class="bs-callout__label">Note:</div>
<p>The amount of time you should wait before sending <code>SIGKILL</code> can vary depending on your cluster configuration and workload, which affect how long it takes your nodes to complete a graceful shutdown. In certain edge cases, forcefully terminating the process before the node has completed shutdown can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you need maximum cluster availability, you can run <a href="cockroach-node.html"><code>cockroach node drain</code></a> prior to node shutdown and actively monitor the draining process instead of automating it.</p>
</div>
<li>If the node was started using <a href="cockroach-start.html"><code>cockroach start</code></a> and is running in the foreground, press <code>ctrl-c</code> in the terminal.</li>
<li>If the node was started using <a href="cockroach-start.html"><code>cockroach start</code></a> and the <code>--background</code> and <code>--pid-file</code> flags, run <code>kill &lt;pid&gt;</code>, where <code>&lt;pid&gt;</code> is the process ID of the node.</li>
</ul>
2 changes: 1 addition & 1 deletion v19.1/remove-nodes.md
@@ -23,7 +23,7 @@ A node is considered to be decommissioned when it meets two criteria:

The decommissioning process transfers all range replicas on the node to other nodes. During and after this process, the node is considered "decommissioning" and continues to accept new SQL connections. Even without replicas, the node can still function as a gateway to route connections to relevant data. However, note that the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) considers the node "unready" and returns a `503 Service Unavailable` status response code so load balancers stop directing traffic to the node. In v20.1, the health endpoint correctly considers the node "ready".

After all range replicas have been transferred, it's typical to use [`cockroach node drain`](view-node-details.html) to drain the node of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries. The node can then be stopped via a process manager or orchestration tool, or by sending `SIGTERM` manually. You can also use [`cockroach quit`](stop-a-node.html) to drain and shut down the node. When stopped, the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned.
After all range replicas have been transferred, the node can be drained of SQL clients, [distributed SQL](architecture/sql-layer.html#distsql) queries, and range leases, and then stopped. This can be done with a process manager or orchestration tool, or by sending `SIGTERM` manually. You can also use [`cockroach quit`](stop-a-node.html) to drain and shut down the node. When stopped, the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned.

You can [check the status of node decommissioning](#check-the-status-of-decommissioning-nodes) with the CLI.

2 changes: 1 addition & 1 deletion v19.1/view-node-details.md
@@ -16,7 +16,7 @@ Subcommand | Usage
`status` | View the status of one or all nodes, excluding nodes that have been decommissioned and taken offline. Depending on flags used, this can include details about range/replicas, disk usage, and decommissioning progress.
`decommission` | Decommission nodes for removal from the cluster. See [Decommission Nodes](remove-nodes.html) for more details.
`recommission` | Recommission nodes that have been decommissioned. See [Recommission Nodes](remove-nodes.html#recommission-nodes) for more details.
`drain` | Drain nodes of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries, and prevent ranges from rebalancing onto the node. This is usually done prior to [stopping the node](stop-a-node.html).
`drain` | Drain nodes of SQL clients, [distributed SQL](architecture/sql-layer.html#distsql) queries, and range leases, and prevent ranges from rebalancing onto the node. This is normally done during [node shutdown](stop-a-node.html), but the `drain` subcommand provides operators an option to interactively monitor, and if necessary intervene in, the draining process.

## Synopsis

2 changes: 1 addition & 1 deletion v19.2/cockroach-node.md
@@ -18,7 +18,7 @@ Subcommand | Usage
`status` | View the status of one or all nodes, excluding nodes that have been decommissioned and taken offline. Depending on flags used, this can include details about range/replicas, disk usage, and decommissioning progress.
`decommission` | Decommission nodes for removal from the cluster. See [Decommission Nodes](remove-nodes.html) for more details.
`recommission` | Recommission nodes that have been decommissioned. See [Recommission Nodes](remove-nodes.html#recommission-nodes) for more details.
`drain` | Drain nodes of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries, and prevent ranges from rebalancing onto the node. This is usually done prior to [stopping the node](cockroach-quit.html).
`drain` | Drain nodes of SQL clients, [distributed SQL](architecture/sql-layer.html#distsql) queries, and range leases, and prevent ranges from rebalancing onto the node. This is normally done during [node shutdown](cockroach-quit.html), but the `drain` subcommand provides operators an option to interactively monitor, and if necessary intervene in, the draining process.

## Synopsis

4 changes: 2 additions & 2 deletions v19.2/remove-nodes.md
@@ -21,9 +21,9 @@ A node is considered to be decommissioned when it meets two criteria:
1. The node has completed the decommissioning process.
2. The node has been stopped and has not [updated its liveness record](architecture/replication-layer.html#epoch-based-leases-table-data) for the duration configured via [`server.time_until_store_dead`](cluster-settings.html), which defaults to 5 minutes.

The decommissioning process transfers all range replicas on the node to other nodes. During and after this process, the node is considered "decommissioning" and continues to accept new SQL connections. Even without replicas, the node can still function as a gateway to route connections to relevant data. However, note that the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) considers the node "unready" and returns a `503 Service Unavailable` status response code so load balancers stop directing traffic to the node. In v20.1, the health endpoint correctly considers the node "ready".
The decommissioning process transfers all range replicas on the node to other nodes. During and after this process, the node is considered "decommissioning" and continues to accept new SQL connections. Even without replicas, the node can still function as a gateway to route connections to relevant data. However, note that the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) considers the node "unready" and returns a `503 Service Unavailable` status response code so load balancers stop directing traffic to the node. In v20.1, the health endpoint correctly considers the node "ready" during decommissioning.

After all range replicas have been transferred, it's typical to use [`cockroach node drain`](cockroach-node.html) to drain the node of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries. The node can then be stopped via a process manager or orchestration tool, or by sending `SIGTERM` manually. You can also use [`cockroach quit`](cockroach-quit.html) to drain and shut down the node. When stopped, the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned.
After all range replicas have been transferred, a graceful shutdown is initiated by sending `SIGTERM` or running [`cockroach quit`](cockroach-quit.html), during which the node is drained of SQL clients, [distributed SQL](architecture/sql-layer.html#distsql) queries, and range leases. Once draining completes and the process is terminated, the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned.

You can [check the status of node decommissioning](#check-the-status-of-decommissioning-nodes) with the CLI.

2 changes: 1 addition & 1 deletion v20.1/cockroach-node.md
@@ -18,7 +18,7 @@ Subcommand | Usage
`status` | View the status of one or all nodes, excluding nodes that have been decommissioned and taken offline. Depending on flags used, this can include details about range/replicas, disk usage, and decommissioning progress.
`decommission` | Decommission nodes for removal from the cluster. See [Decommission Nodes](remove-nodes.html) for more details.
`recommission` | Recommission nodes that have been decommissioned. See [Recommission Nodes](remove-nodes.html#recommission-nodes) for more details.
`drain` | Drain nodes of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries, and prevent ranges from rebalancing onto the node. This is usually done prior to [stopping the node](cockroach-quit.html).
`drain` | Drain nodes of SQL clients, [distributed SQL](architecture/sql-layer.html#distsql) queries, and range leases, and prevent ranges from rebalancing onto the node. This is normally done by sending `SIGTERM` during [node shutdown](cockroach-quit.html), but the `drain` subcommand provides operators an option to interactively monitor, and if necessary intervene in, the draining process.

## Synopsis

2 changes: 1 addition & 1 deletion v20.1/cockroach-quit.md
@@ -7,7 +7,7 @@ key: stop-a-node.html
---

{{site.data.alerts.callout_danger}}
`cockroach quit` is no longer recommended, and will be deprecated in v20.2. To stop a node, it's best to first run [`cockroach node drain`](cockroach-node.html) and then do one of the following:
`cockroach quit` is no longer recommended, and will be deprecated in v20.2. To stop a node, do one of the following:

{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}
{{site.data.alerts.end}}
7 changes: 3 additions & 4 deletions v20.1/common-errors.md
@@ -31,10 +31,9 @@ To resolve this issue, do one of the following:

If you're not sure what the IP address/hostname and port values might have been, you can look in the node's [logs](debug-and-error-logs.html). If necessary, you can also end the `cockroach` process, and then restart the node:

{% include copy-clipboard.html %}
~~~ shell
$ pkill cockroach
~~~
{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

Then restart the node:

{% include copy-clipboard.html %}
~~~ shell
58 changes: 3 additions & 55 deletions v20.1/remove-nodes.md
@@ -23,7 +23,7 @@ A node is considered to be decommissioned when it meets two criteria:

The decommissioning process transfers all range replicas on the node to other nodes. During and after this process, the node is considered "decommissioning" and continues to accept new SQL connections. Even without replicas, the node can still function as a gateway to route connections to relevant data. For this reason, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) continues to consider the node "ready" so load balancers can continue directing traffic to the node.

After all range replicas have been transferred, it's typical to drain the node of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries. The node can then be stopped via a process manager or orchestration tool, or by sending `SIGTERM` manually. When stopped, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status response code so that load balancers stop directing traffic to the node. At this point the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned.
After all range replicas have been transferred, a graceful shutdown is initiated by sending `SIGTERM`, during which the node is drained of SQL clients, [distributed SQL](architecture/sql-layer.html#distsql) queries, and range leases. Meanwhile, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status response code so that load balancers stop directing traffic to the node. Once draining completes and the process is terminated, the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned.
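
The readiness behavior described above can be observed from a script. The following is a minimal sketch under stated assumptions: the helper names `readiness_from_status` and `check_ready` are hypothetical, the base URL is whatever HTTP address your node serves on, and a `200` response is assumed to mean "ready" (the text above only states that `503` means "unready").

```shell
# Hypothetical helper: map an HTTP status code returned by /health?ready=1
# to a readiness label. 503 is the "unready" code described above; treating
# 200 as "ready" is an assumption of this sketch.
readiness_from_status() {
  case "$1" in
    200) echo "ready" ;;
    503) echo "unready" ;;
    *)   echo "unknown" ;;
  esac
}

# Hypothetical helper: query a node's readiness endpoint and print the label.
#   check_ready <base_url>    e.g. check_ready "http://<node address>:<http port>"
check_ready() {
  local base_url="$1"
  local code
  # -s: silent, -o /dev/null: discard the body, -w: print only the status code.
  code="$(curl -s -o /dev/null -w '%{http_code}' "${base_url}/health?ready=1")" || code="000"
  readiness_from_status "$code"
}
```

Polling `check_ready` while a node drains should show the label change from `ready` to `unready`, which is the signal a load balancer uses to stop routing traffic to the node.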

You can [check the status of node decommissioning](#check-the-status-of-decommissioning-nodes) with the CLI.

@@ -160,33 +160,7 @@ Even with zero replicas on a node, its [status](admin-ui-cluster-overview-page.h

### Step 5. Stop the decommissioning node

A node should be drained of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries before being shut down.

Run the [`cockroach node drain`](cockroach-node.html) command with the address of the node to drain:

<div class="filter-content" markdown="1" data-scope="secure">
{% include copy-clipboard.html %}
~~~ shell
cockroach node drain --certs-dir=certs --host=<address of node to drain>
~~~
</div>

<div class="filter-content" markdown="1" data-scope="insecure">
{% include copy-clipboard.html %}
~~~ shell
cockroach node drain --insecure --host=<address of node to drain>
~~~
</div>

Once the node has been drained, you'll see a confirmation:

~~~
node is draining... remaining: 1
node is draining... remaining: 0 (complete)
ok
~~~

Stop the node using one of the following methods:
Drain and stop the node using one of the following methods:

{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

@@ -339,33 +313,7 @@ Even with zero replicas on a node, its [status](admin-ui-cluster-overview-page.h

### Step 5. Stop the decommissioning nodes

Nodes should be drained of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries before being shut down.

For each node, run the [`cockroach node drain`](cockroach-node.html) command with the address of the node to drain:

<div class="filter-content" markdown="1" data-scope="secure">
{% include copy-clipboard.html %}
~~~ shell
cockroach node drain --certs-dir=certs --host=<address of node to drain>
~~~
</div>

<div class="filter-content" markdown="1" data-scope="insecure">
{% include copy-clipboard.html %}
~~~ shell
cockroach node drain --insecure --host=<address of node to drain>
~~~
</div>

Once each node has been drained, you'll see a confirmation:

~~~
node is draining... remaining: 1
node is draining... remaining: 0 (complete)
ok
~~~

Stop each node using one of the following methods:
Drain and stop each node using one of the following methods:

{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}
