update rolling upgrade docs with node drain, etc. #7542

Merged
merged 10 commits into from
Aug 6, 2020
2 changes: 1 addition & 1 deletion v19.1/remove-nodes.md
@@ -23,7 +23,7 @@ A node is considered to be decommissioned when it meets two criteria:

The decommissioning process transfers all range replicas on the node to other nodes. During and after this process, the node is considered "decommissioning" and continues to accept new SQL connections. Even without replicas, the node can still function as a gateway to route connections to relevant data. However, note that the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) considers the node "unready" and returns a `503 Service Unavailable` status response code so load balancers stop directing traffic to the node. In v20.1, the health endpoint correctly considers the node "ready".

After all range replicas have been transferred, it's typical to use [`cockroach node drain`](view-node-details.html) to drain the node of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries. The node can then be stopped via a process manager or orchestration tool, or by sending `SIGTERM` manually. You can also use [`cockroach quit`](stop-a-node.html) to drain and shut down the node. When stopped, the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned.
After all range replicas have been transferred, the node can be drained of SQL clients, [distributed SQL](architecture/sql-layer.html#distsql) queries, and range leases, and then stopped. This can be done with a process manager or orchestration tool, or by sending `SIGTERM` manually. You can also use [`cockroach quit`](stop-a-node.html) to drain and shut down the node. When stopped, the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned.

You can [check the status of node decommissioning](#check-the-status-of-decommissioning-nodes) with the CLI.

2 changes: 1 addition & 1 deletion v19.1/view-node-details.md
@@ -16,7 +16,7 @@ Subcommand | Usage
`status` | View the status of one or all nodes, excluding nodes that have been decommissioned and taken offline. Depending on flags used, this can include details about range/replicas, disk usage, and decommissioning progress.
`decommission` | Decommission nodes for removal from the cluster. See [Decommission Nodes](remove-nodes.html) for more details.
`recommission` | Recommission nodes that have been decommissioned. See [Recommission Nodes](remove-nodes.html#recommission-nodes) for more details.
`drain` | Drain nodes of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries, and prevent ranges from rebalancing onto the node. This is usually done prior to [stopping the node](stop-a-node.html).
`drain` | Drain nodes of SQL clients, [distributed SQL](architecture/sql-layer.html#distsql) queries, and range leases, and prevent ranges from rebalancing onto the node. This is normally done during [node shutdown](cockroach-quit.html), but the `drain` subcommand provides operators an option to interactively monitor, and if necessary intervene in, the draining process.

## Synopsis

2 changes: 1 addition & 1 deletion v19.2/cockroach-node.md
@@ -18,7 +18,7 @@ Subcommand | Usage
`status` | View the status of one or all nodes, excluding nodes that have been decommissioned and taken offline. Depending on flags used, this can include details about range/replicas, disk usage, and decommissioning progress.
`decommission` | Decommission nodes for removal from the cluster. See [Decommission Nodes](remove-nodes.html) for more details.
`recommission` | Recommission nodes that have been decommissioned. See [Recommission Nodes](remove-nodes.html#recommission-nodes) for more details.
`drain` | Drain nodes of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries, and prevent ranges from rebalancing onto the node. This is usually done prior to [stopping the node](cockroach-quit.html).
`drain` | Drain nodes of SQL clients, [distributed SQL](architecture/sql-layer.html#distsql) queries, and range leases, and prevent ranges from rebalancing onto the node. This is normally done during [node shutdown](cockroach-quit.html), but the `drain` subcommand provides operators an option to interactively monitor, and if necessary intervene in, the draining process.

## Synopsis

2 changes: 1 addition & 1 deletion v19.2/remove-nodes.md
@@ -23,7 +23,7 @@ A node is considered to be decommissioned when it meets two criteria:

The decommissioning process transfers all range replicas on the node to other nodes. During and after this process, the node is considered "decommissioning" and continues to accept new SQL connections. Even without replicas, the node can still function as a gateway to route connections to relevant data. However, note that the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) considers the node "unready" and returns a `503 Service Unavailable` status response code so load balancers stop directing traffic to the node. In v20.1, the health endpoint correctly considers the node "ready".

After all range replicas have been transferred, it's typical to use [`cockroach node drain`](cockroach-node.html) to drain the node of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries. The node can then be stopped via a process manager or orchestration tool, or by sending `SIGTERM` manually. You can also use [`cockroach quit`](cockroach-quit.html) to drain and shut down the node. When stopped, the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned.
After all range replicas have been transferred, the node can be drained of SQL clients, [distributed SQL](architecture/sql-layer.html#distsql) queries, and range leases, and then stopped. This can be done with a process manager or orchestration tool, or by sending `SIGTERM` manually. You can also use [`cockroach quit`](cockroach-quit.html) to drain and shut down the node. When stopped, the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned.

You can [check the status of node decommissioning](#check-the-status-of-decommissioning-nodes) with the CLI.

2 changes: 1 addition & 1 deletion v20.1/cockroach-node.md
@@ -18,7 +18,7 @@ Subcommand | Usage
`status` | View the status of one or all nodes, excluding nodes that have been decommissioned and taken offline. Depending on flags used, this can include details about range/replicas, disk usage, and decommissioning progress.
`decommission` | Decommission nodes for removal from the cluster. See [Decommission Nodes](remove-nodes.html) for more details.
`recommission` | Recommission nodes that have been decommissioned. See [Recommission Nodes](remove-nodes.html#recommission-nodes) for more details.
`drain` | Drain nodes of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries, and prevent ranges from rebalancing onto the node. This is usually done prior to [stopping the node](cockroach-quit.html).
`drain` | Drain nodes of SQL clients, [distributed SQL](architecture/sql-layer.html#distsql) queries, and range leases, and prevent ranges from rebalancing onto the node. This is normally done by sending `SIGTERM` during [node shutdown](cockroach-quit.html), but the `drain` subcommand provides operators an option to interactively monitor, and if necessary intervene in, the draining process.

## Synopsis

2 changes: 1 addition & 1 deletion v20.1/cockroach-quit.md
@@ -7,7 +7,7 @@ key: stop-a-node.html
---

{{site.data.alerts.callout_danger}}
`cockroach quit` is no longer recommended, and will be deprecated in v20.2. To stop a node, it's best to first run [`cockroach node drain`](cockroach-node.html), wait for the node to be completely drained, and then do one of the following:
`cockroach quit` is no longer recommended, and will be deprecated in v20.2. To stop a node, do one of the following:

{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}
{{site.data.alerts.end}}
9 changes: 4 additions & 5 deletions v20.1/common-errors.md
@@ -29,12 +29,11 @@ To resolve this issue, do one of the following:
- If the node hasn't yet been started, [start the node](cockroach-start.html).
- If you specified a [`--listen-addr` and/or a `--advertise-addr` flag](cockroach-start.html#networking) when starting the node, you must include the specified IP address/hostname and port with all other [`cockroach` commands](cockroach-commands.html) or change the `COCKROACH_HOST` environment variable.

If you're not sure what the IP address/hostname and port values might have been, you can look in the node's [logs](debug-and-error-logs.html). If necessary, you can also kill the `cockroach` process, and then restart the node:
If you're not sure what the IP address/hostname and port values might have been, you can look in the node's [logs](debug-and-error-logs.html). If necessary, you can also stop the node:

{% include copy-clipboard.html %}
~~~ shell
$ pkill cockroach
~~~
{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

Then restart the node:

{% include copy-clipboard.html %}
~~~ shell
66 changes: 11 additions & 55 deletions v20.1/remove-nodes.md
@@ -23,7 +23,7 @@ A node is considered to be decommissioned when it meets two criteria:

The decommissioning process transfers all range replicas on the node to other nodes. During and after this process, the node is considered "decommissioning" and continues to accept new SQL connections. Even without replicas, the node can still function as a gateway to route connections to relevant data. For this reason, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) continues to consider the node "ready" so load balancers can continue directing traffic to the node.
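
A load balancer style readiness check against this endpoint could be scripted along the following lines. This is a minimal sketch rather than a documented procedure: `readiness_from_status` is a hypothetical helper name, and the sketch assumes the endpoint answers `200` when the node is ready and `503 Service Unavailable` when it is not.

```shell
# Hypothetical helper: interpret the HTTP status code returned by /health?ready=1.
# Assumption: 200 means "ready" and 503 means "unready"; anything else is unknown.
readiness_from_status() {
  case "$1" in
    200) echo "ready" ;;
    503) echo "unready" ;;
    *)   echo "unknown" ;;
  esac
}

# Example (not run here): fetch only the status code with curl.
# NODE_HTTP_ADDR is a placeholder for the node's HTTP host:port.
# code=$(curl -s -o /dev/null -w '%{http_code}' "http://${NODE_HTTP_ADDR}/health?ready=1")
# readiness_from_status "$code"
```

The commented `curl` line assumes an insecure node; a secure cluster would likely need `https` and the appropriate certificates.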

After all range replicas have been transferred, it's typical to drain the node of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries. The node can then be stopped via a process manager or orchestration tool, or by sending `SIGTERM` manually. When stopped, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status response code so that load balancers stop directing traffic to the node. At this point the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned.
After all range replicas have been transferred, the node can be drained of SQL clients, [distributed SQL](architecture/sql-layer.html#distsql) queries, and range leases, and then stopped. This can be done with a process manager or orchestration tool, or by sending `SIGTERM` manually. When stopped, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status response code so that load balancers stop directing traffic to the node. At this point the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) is considered to be decommissioned.

Review comment (Contributor):

Regarding this bit: "When stopped, ..."

I think the paragraph should do a better job of distinguishing the initiation of the graceful shutdown (a signal is sent), the process of shutting down/draining, and the process termination.

Then in that context, you can say "at some point during the process of shutting down, the monitoring endpoint starts reporting the node as non-ready, so that load balancers can redirect traffic".

The current text uses the verb "Stopped" which implicitly refers only to the end of the process. That's misleading.
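
The three phases distinguished in this comment (signal sent, draining in progress, process terminated) can be illustrated with a small shell sketch. `graceful_stop` and `wait_for_exit` are hypothetical helper names, not part of the `cockroach` CLI, and the sketch assumes the PID of the `cockroach` process is already known and owned by the calling user.

```shell
# Hypothetical helper: wait until process $1 is gone, checking once per second
# for at most $2 seconds (default 60). Returns 0 if the process is gone,
# 1 if it is still running when the timeout is reached.
wait_for_exit() {
  pid="$1"
  timeout="${2:-60}"
  elapsed=0
  # kill -0 sends no signal; it only checks whether the PID can be signalled.
  # It also fails if the caller lacks permission, hence the ownership assumption.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Phase 1: initiate the graceful shutdown by sending SIGTERM.
# Phase 2: the node drains while the process is still running.
# Phase 3: the process terminates, which wait_for_exit detects.
graceful_stop() {
  pid="$1"
  kill -TERM "$pid" || return 2
  wait_for_exit "$pid" "${2:-60}"
}
```

During phase 2 the process is still alive, which is why "stopped" describes only the end of the sequence. A caller might run `graceful_stop "$COCKROACH_PID" 120`, where `COCKROACH_PID` is a placeholder, and treat a return value of `1` as "still draining after the timeout".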


You can [check the status of node decommissioning](#check-the-status-of-decommissioning-nodes) with the CLI.

@@ -160,36 +160,14 @@ Even with zero replicas on a node, its [status](admin-ui-cluster-overview-page.h

### Step 5. Stop the decommissioning node

A node should be drained of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries before being shut down.

Run the [`cockroach node drain`](cockroach-node.html) command with the address of the node to drain:

<div class="filter-content" markdown="1" data-scope="secure">
{% include copy-clipboard.html %}
~~~ shell
cockroach node drain --certs-dir=certs --host=<address of node to drain>
~~~
</div>

<div class="filter-content" markdown="1" data-scope="insecure">
{% include copy-clipboard.html %}
~~~ shell
cockroach node drain --insecure --host=<address of node to drain>
~~~
</div>

Once the node has been drained, you'll see a confirmation:

~~~
node is draining... remaining: 1
node is draining... remaining: 0 (complete)
ok
~~~

Stop the node using one of the following methods:
Drain and stop the node using one of the following methods:

{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

{{site.data.alerts.callout_info}}
In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you need maximum cluster availability during node decommissioning, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and actively monitor the draining process instead of automating it.

Review comment (Contributor):

... stopping a node forcefully using SIGKILL or a signal other than SIGTERM can result...

{{site.data.alerts.end}}
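
Monitoring the drain before sending a signal, as the note suggests, could be scripted roughly like this. The sketch assumes the drain progress lines have the form shown in the earlier version of this step (`node is draining... remaining: 0 (complete)`); `drain_completed` is a hypothetical helper, and the exact output format may differ between versions.

```shell
# Hypothetical helper: succeed only if the drain output read from stdin
# contains the completion line. -F treats the pattern as a fixed string,
# so the parentheses are matched literally.
drain_completed() {
  grep -qF 'remaining: 0 (complete)'
}

# Example (not run here), using the command form from the v20.1 docs:
# cockroach node drain --certs-dir=certs --host=<address of node to drain> 2>&1 \
#   | drain_completed && echo "drain complete; the node can now be stopped"
```

Because the helper only inspects text, an operator can still watch the raw output and intervene before sending `SIGTERM`.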

After the duration configured via [`server.time_until_store_dead`](cluster-settings.html), you'll see the stopped node listed under **Recently Decommissioned Nodes**:

<div style="text-align: center;"><img src="{{ 'images/v20.1/cluster-status-after-decommission2.png' | relative_url }}" alt="Decommission a single live node" style="border:1px solid #eee;max-width:100%" /></div>
@@ -339,36 +317,14 @@ Even with zero replicas on a node, its [status](admin-ui-cluster-overview-page.h

### Step 5. Stop the decommissioning nodes

Nodes should be drained of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries before being shut down.

For each node, run the [`cockroach node drain`](cockroach-node.html) command with the address of the node to drain:

<div class="filter-content" markdown="1" data-scope="secure">
{% include copy-clipboard.html %}
~~~ shell
cockroach node drain --certs-dir=certs --host=<address of node to drain>
~~~
</div>

<div class="filter-content" markdown="1" data-scope="insecure">
{% include copy-clipboard.html %}
~~~ shell
cockroach node drain --insecure --host=<address of node to drain>
~~~
</div>

Once each node has been drained, you'll see a confirmation:

~~~
node is draining... remaining: 1
node is draining... remaining: 0 (complete)
ok
~~~

Stop each node using one of the following methods:
Drain and stop each node using one of the following methods:

{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

{{site.data.alerts.callout_info}}
In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you want to minimize these occurrences, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and monitor the draining process instead of automating it.

Review comment (Contributor):

ditto

{{site.data.alerts.end}}

After the duration configured via [`server.time_until_store_dead`](cluster-settings.html), you'll see the stopped nodes listed under **Recently Decommissioned Nodes**:

<div style="text-align: center;"><img src="{{ 'images/v20.1/decommission-multiple7.png' | relative_url }}" alt="Decommission multiple nodes" style="border:1px solid #eee;max-width:100%" /></div>
20 changes: 5 additions & 15 deletions v20.1/upgrade-cockroach-version.md
@@ -121,24 +121,14 @@ Note that this behavior is specific to upgrades from v19.2 to v20.1; it does not
We recommend creating scripts to perform these steps instead of performing them manually. Also, if you are running CockroachDB on Kubernetes, see our documentation on [single-cluster](orchestrate-cockroachdb-with-kubernetes.html#upgrade-the-cluster) and/or [multi-cluster](orchestrate-cockroachdb-with-kubernetes-multi-cluster.html#upgrade-the-cluster) orchestrated deployments for upgrade guidance instead.
{{site.data.alerts.end}}

1. Drain the node of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries:

{% include copy-clipboard.html %}
~~~ shell
cockroach node drain --certs-dir=certs --host=<address of node to drain>
~~~

Once the node has been drained, you'll see a confirmation:

~~~
node is draining... remaining: 0 (complete)
ok
~~~

1. After the node is completely drained, stop the node:
1. Drain and stop the node using one of the following methods:

{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

{{site.data.alerts.callout_info}}
In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you need maximum cluster availability during an upgrade, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and actively monitor the draining process instead of automating it.

Review comment (Contributor):

ditto

{{site.data.alerts.end}}

Verify that the process has stopped:

{% include copy-clipboard.html %}
2 changes: 1 addition & 1 deletion v20.2/cockroach-node.md
@@ -18,7 +18,7 @@ Subcommand | Usage
`status` | View the status of one or all nodes, excluding nodes that have been decommissioned and taken offline. Depending on flags used, this can include details about range/replicas, disk usage, and decommissioning progress.
`decommission` | Decommission nodes for removal from the cluster. See [Decommission Nodes](remove-nodes.html) for more details.
`recommission` | Recommission nodes that have been decommissioned. See [Recommission Nodes](remove-nodes.html#recommission-nodes) for more details.
`drain` | Drain nodes of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries, and prevent ranges from rebalancing onto the node. This is usually done prior to [stopping the node](cockroach-quit.html).
`drain` | Drain nodes of SQL clients, [distributed SQL](architecture/sql-layer.html#distsql) queries, and range leases, and prevent ranges from rebalancing onto the node. This is normally done by sending `SIGTERM` during [node shutdown](cockroach-quit.html), but the `drain` subcommand provides operators an option to interactively monitor, and if necessary intervene in, the draining process.

## Synopsis

2 changes: 1 addition & 1 deletion v20.2/cockroach-quit.md
@@ -7,7 +7,7 @@ key: stop-a-node.html
---

{{site.data.alerts.callout_danger}}
`cockroach quit` is deprecated. To stop a node, it's best to first run [`cockroach node drain`](cockroach-node.html), wait for the node to be completely drained, and then do one of the following:
`cockroach quit` is deprecated. To stop a node, do one of the following:

{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}
{{site.data.alerts.end}}
9 changes: 4 additions & 5 deletions v20.2/common-errors.md
@@ -29,12 +29,11 @@ To resolve this issue, do one of the following:
- If the node hasn't yet been started, [start the node](cockroach-start.html).
- If you specified a [`--listen-addr` and/or a `--advertise-addr` flag](cockroach-start.html#networking) when starting the node, you must include the specified IP address/hostname and port with all other [`cockroach` commands](cockroach-commands.html) or change the `COCKROACH_HOST` environment variable.

If you're not sure what the IP address/hostname and port values might have been, you can look in the node's [logs](debug-and-error-logs.html). If necessary, you can also kill the `cockroach` process, and then restart the node:
If you're not sure what the IP address/hostname and port values might have been, you can look in the node's [logs](debug-and-error-logs.html). If necessary, you can also stop the node:

{% include copy-clipboard.html %}
~~~ shell
$ pkill cockroach
~~~
{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

Then restart the node:

{% include copy-clipboard.html %}
~~~ shell