# update rolling upgrade docs with node drain, etc. #7542
@@ -23,7 +23,7 @@ A node is considered to be decommissioned when it meets two criteria:

 The decommissioning process transfers all range replicas on the node to other nodes. During and after this process, the node is considered "decommissioning" and continues to accept new SQL connections. Even without replicas, the node can still function as a gateway to route connections to relevant data. For this reason, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) continues to consider the node "ready" so load balancers can continue directing traffic to the node.

-After all range replicas have been transferred, it's typical to drain the node of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries. The node can then be stopped via a process manager or orchestration tool, or by sending `SIGTERM` manually. When stopped, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status response code so that load balancers stop directing traffic to the node. At this point the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html) it is considered to be decommissioned.
+After all range replicas have been transferred, the node can be drained of SQL clients, [distributed SQL](architecture/sql-layer.html#distsql) queries, and range leases, and then stopped. This can be done with a process manager or orchestration tool, or by sending `SIGTERM` manually. When stopped, the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status response code so that load balancers stop directing traffic to the node. At this point the node stops updating its liveness record, and after the duration configured via [`server.time_until_store_dead`](cluster-settings.html), the node is considered to be decommissioned.

 You can [check the status of node decommissioning](#check-the-status-of-decommissioning-nodes) with the CLI.
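The readiness behavior described in this hunk can be probed directly, the way a load balancer would. Below is a minimal sketch; the HTTP port (`8080`) and the `check_ready` helper name are assumptions for illustration, not from the docs:

```shell
# Load-balancer-style readiness probe against /health?ready=1.
# A "ready" node returns HTTP 200; a draining or stopped node
# returns 503 Service Unavailable.
check_ready() {
  status=$(curl -s -o /dev/null -w '%{http_code}' "http://$1:8080/health?ready=1")
  if [ "$status" = "200" ]; then
    echo "ready"
  else
    echo "not ready (HTTP $status)"
  fi
}
```

A balancer health check would run this periodically and remove the node from rotation once it stops returning 200.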
@@ -160,36 +160,14 @@ Even with zero replicas on a node, its [status](admin-ui-cluster-overview-page.h

 ### Step 5. Stop the decommissioning node

-A node should be drained of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries before being shut down.
-
-Run the [`cockroach node drain`](cockroach-node.html) command with the address of the node to drain:
-
-<div class="filter-content" markdown="1" data-scope="secure">
-{% include copy-clipboard.html %}
-~~~ shell
-cockroach node drain --certs-dir=certs --host=<address of node to drain>
-~~~
-</div>
-
-<div class="filter-content" markdown="1" data-scope="insecure">
-{% include copy-clipboard.html %}
-~~~ shell
-cockroach node drain --insecure --host=<address of node to drain>
-~~~
-</div>
-
-Once the node has been drained, you'll see a confirmation:
-
-~~~
-node is draining... remaining: 1
-node is draining... remaining: 0 (complete)
-ok
-~~~
-
-Stop the node using one of the following methods:
+Drain and stop the node using one of the following methods:

 {% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

+{{site.data.alerts.callout_info}}
+In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you need maximum cluster availability during node decommissioning, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and actively monitor the draining process instead of automating it.
+{{site.data.alerts.end}}

> **Review comment:** "... stopping a node forcefully using SIGKILL or another signal than SIGTERM can result..."

 After the duration configured via [`server.time_until_store_dead`](cluster-settings.html), you'll see the stopped node listed under **Recently Decommissioned Nodes**:

 <div style="text-align: center;"><img src="{{ 'images/v20.1/cluster-status-after-decommission2.png' | relative_url }}" alt="Decommission a single live node" style="border:1px solid #eee;max-width:100%" /></div>
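The callout above suggests running `cockroach node drain` manually and monitoring it rather than automating shutdown. A hedged sketch of such a wrapper follows; `NODE_ADDR`, `CERTS_DIR`, and the `pkill` pattern are assumptions about the deployment, not part of the docs:

```shell
#!/bin/sh
# Illustrative wrapper: drain the node first, and send SIGTERM only
# once the drain has completed. The defaults below are assumptions.
NODE_ADDR="${NODE_ADDR:-localhost:26257}"
CERTS_DIR="${CERTS_DIR:-certs}"

drain_then_stop() {
  # `cockroach node drain` blocks until draining finishes and exits
  # non-zero on failure, so its exit status gates the shutdown.
  if cockroach node drain --certs-dir="$CERTS_DIR" --host="$NODE_ADDR"; then
    echo "drain complete; sending SIGTERM"
    pkill -TERM -f "cockroach start"
  else
    echo "drain failed; leaving the node running" >&2
    return 1
  fi
}
```

Gating the `SIGTERM` on the drain's exit status is the point of the sketch: the node never receives the signal while clients are still being moved off.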
@@ -339,36 +317,14 @@ Even with zero replicas on a node, its [status](admin-ui-cluster-overview-page.h

 ### Step 5. Stop the decommissioning nodes

-Nodes should be drained of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries before being shut down.
-
-For each node, run the [`cockroach node drain`](cockroach-node.html) command with the address of the node to drain:
-
-<div class="filter-content" markdown="1" data-scope="secure">
-{% include copy-clipboard.html %}
-~~~ shell
-cockroach node drain --certs-dir=certs --host=<address of node to drain>
-~~~
-</div>
-
-<div class="filter-content" markdown="1" data-scope="insecure">
-{% include copy-clipboard.html %}
-~~~ shell
-cockroach node drain --insecure --host=<address of node to drain>
-~~~
-</div>
-
-Once each node has been drained, you'll see a confirmation:
-
-~~~
-node is draining... remaining: 1
-node is draining... remaining: 0 (complete)
-ok
-~~~
-
-Stop each node using one of the following methods:
+Drain and stop each node using one of the following methods:

 {% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

+{{site.data.alerts.callout_info}}
+In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you want to minimize these occurrences, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and monitor the draining process instead of automating it.
+{{site.data.alerts.end}}

> **Review comment:** ditto (same as the comment above).

 After the duration configured via [`server.time_until_store_dead`](cluster-settings.html), you'll see the stopped nodes listed under **Recently Decommissioned Nodes**:

 <div style="text-align: center;"><img src="{{ 'images/v20.1/decommission-multiple7.png' | relative_url }}" alt="Decommission multiple nodes" style="border:1px solid #eee;max-width:100%" /></div>
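When several nodes are decommissioning, the drain step is naturally repeated per node. A sketch of such a loop, with placeholder addresses and a secure (`--certs-dir`) cluster assumed:

```shell
# Illustrative loop: drain several nodes one at a time before stopping
# them. The three addresses below are placeholders.
drain_all() {
  for node in node1:26257 node2:26257 node3:26257; do
    echo "draining $node"
    cockroach node drain --certs-dir=certs --host="$node" || {
      echo "drain failed for $node" >&2
      return 1
    }
  done
  echo "all nodes drained"
}
```

Draining sequentially (and aborting on the first failure) keeps the cluster from losing more than one gateway at a time.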
@@ -121,24 +121,14 @@ Note that this behavior is specific to upgrades from v19.2 to v20.1; it does not

 We recommend creating scripts to perform these steps instead of performing them manually. Also, if you are running CockroachDB on Kubernetes, see our documentation on [single-cluster](orchestrate-cockroachdb-with-kubernetes.html#upgrade-the-cluster) and/or [multi-cluster](orchestrate-cockroachdb-with-kubernetes-multi-cluster.html#upgrade-the-cluster) orchestrated deployments for upgrade guidance instead.
 {{site.data.alerts.end}}

-1. Drain the node of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries:
-
-    {% include copy-clipboard.html %}
-    ~~~ shell
-    cockroach node drain --certs-dir=certs --host=<address of node to drain>
-    ~~~
-
-    Once the node has been drained, you'll see a confirmation:
-
-    ~~~
-    node is draining... remaining: 0 (complete)
-    ok
-    ~~~
-
-1. After the node is completely drained, stop the node:
+1. Drain and stop the node using one of the following methods:

     {% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

+    {{site.data.alerts.callout_info}}
+    In certain edge cases, stopping a node using signals can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you need maximum cluster availability during an upgrade, you can run [`cockroach node drain`](cockroach-node.html) prior to node shutdown and actively monitor the draining process instead of automating it.
+    {{site.data.alerts.end}}

> **Review comment:** ditto (same as the comment above).

     Verify that the process has stopped:

     {% include copy-clipboard.html %}
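The "verify that the process has stopped" step can be scripted. A sketch follows; the process pattern `"cockroach start"` is an assumption about how the node was launched, and the docs' own verification command (truncated in this diff) may differ:

```shell
# Illustrative post-shutdown check that the cockroach process is gone.
# The pattern passed to pgrep is an assumption; adjust it to match
# how your deployment starts the node.
verify_stopped() {
  if pgrep -f "cockroach start" > /dev/null 2>&1; then
    echo "cockroach is still running"
    return 1
  else
    echo "cockroach has stopped"
  fi
}
```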
> **Review comment:** Regarding this bit: "When stopped, ..."
>
> I think the paragraph should do a better job of distinguishing the initiation of the graceful shutdown (a signal is sent), the process of shutting down/draining, and the process termination. Then, in that context, you can say "at some point during the process of shutting down, the monitoring endpoint starts reporting the node as non-ready, so that load balancers can redirect traffic." The current text uses the verb "stopped," which implicitly refers only to the end of the process. That's misleading.