
Update 'Stop a Node' with more draining info #2671

Merged
merged 13 commits into master from fixitday-node-draining-update on Mar 19, 2018

Conversation

rmloveland
Contributor

@rmloveland rmloveland commented Mar 8, 2018

Fixes #2436
Fixes #2620

@cockroach-teamcity
Member

This change is Reviewable

@rmloveland rmloveland requested a review from asubiotto March 8, 2018 21:15
@rmloveland rmloveland requested review from a-robinson and removed request for asubiotto March 8, 2018 21:28
@rmloveland
Contributor Author

@a-robinson - Alfonso is on holiday for 2 weeks and suggested you as a potential reviewer. Mind giving these changes a look?

@a-robinson
Contributor

Reviewed 3 of 3 files at r1.
Review status: 2 of 3 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


v2.0/cluster-settings.md, line 25 at r1 (raw file):

The following settings can be configured without further input from Cockroach Labs:

{% include settings/v2.0/settings.md %}

We shouldn't replace these with the entire list -- they were curated to be non-experimental settings that end users could reasonably understand and use on their own without additional documentation or handholding from us. I know it's a pain to keep this up-to-date, but we should either continue trying to do so or change the sentence above to not claim that they can all be configured without our advice.


v2.0/stop-a-node.md, line 19 at r1 (raw file):

When you stop a node, it performs the following steps:

- Finishes in-flight requests. Note that this is a best effort that times out at the `server.shutdown.query_wait` [cluster setting](cluster-settings.html).

I'd phrase this as "that times out after the duration specified by the server.shutdown.query_wait cluster setting"


v2.0/stop-a-node.md, line 21 at r1 (raw file):

- Finishes in-flight requests. Note that this is a best effort that times out at the `server.shutdown.query_wait` [cluster setting](cluster-settings.html).
- Transfers all *range leases* and Raft leadership to other nodes.
- Gossips its draining state to the cluster, so that other nodes do not try to distribute query planning to the draining node, and no leases are transferred to the draining node.  Note that this is best effort that times out at the `server.shutdown.drain_wait` [cluster setting](cluster-settings.html), so other nodes may not receive the gossip info in time.

s/best effort/a best effort/

And a similar rephrasing as above would make sense here, too.
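
Both timeouts discussed in these comments are ordinary cluster settings, so they can be inspected or changed from any SQL shell connected to the cluster. A minimal sketch for reference (the `10s` value is illustrative, not a recommendation from this thread):

```sql
-- Inspect the drain timeouts discussed above (v2.0 settings).
SHOW CLUSTER SETTING server.shutdown.query_wait;
SHOW CLUSTER SETTING server.shutdown.drain_wait;

-- Illustrative only: allow in-flight queries more time to finish on shutdown.
SET CLUSTER SETTING server.shutdown.query_wait = '10s';
```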


v2.0/stop-a-node.md, line 21 at r1 (raw file):

- Finishes in-flight requests. Note that this is a best effort that times out at the `server.shutdown.query_wait` [cluster setting](cluster-settings.html).
- Transfers all *range leases* and Raft leadership to other nodes.
- Gossips its draining state to the cluster, so that other nodes do not try to distribute query planning to the draining node, and no leases are transferred to the draining node.  Note that this is best effort that times out at the `server.shutdown.drain_wait` [cluster setting](cluster-settings.html), so other nodes may not receive the gossip info in time.

I'm not positive how effective the "distribute query planning" part actually is (particularly given cockroachdb/cockroach#23601), but that's certainly how we'd like it to work so including it here is probably fine.

Out of curiosity, @solongordon do you know whether other nodes attempt to avoid scheduling distsql processing on draining nodes? Is it purely based on the leaseholder cache?



@jseldess
Contributor

jseldess commented Mar 9, 2018

v2.0/cluster-settings.md, line 25 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

We shouldn't replace these with the entire list -- they were curated to be non-experimental settings that end users could reasonably understand and use on their own without additional documentation or handholding from us. I know it's a pain to keep this up-to-date, but we should either continue trying to do so or change the sentence above to not claim that they can all be configured without our advice.

cc @dt, who did the work to auto-generate this markdown.



@solongordon
Contributor

Review status: 2 of 3 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


v2.0/stop-a-node.md, line 21 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I'm not positive how effective the "distribute query planning" part actually is (particularly given cockroachdb/cockroach#23601), but that's certainly how we'd like it to work so including it here is probably fine.

Out of curiosity, @solongordon do you know whether other nodes attempt to avoid scheduling distsql processing on draining nodes? Is it purely based on the leaseholder cache?

I haven't seen it in action but it does look like there's logic for avoiding processing on draining nodes: https://github.com/cockroachdb/cockroach/blob/afdf5e12bfdd11054582be7e33280fa185974d9e/pkg/sql/distsql_physical_planner.go#L540-L556



@dt
Member

dt commented Mar 9, 2018

Review status: 2 of 3 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


v2.0/cluster-settings.md, line 25 at r1 (raw file):

Previously, jseldess wrote…

cc @dt, who did the work to auto-generate this markdown.

I think we should put the whole list here and we should remove the sentence above.

Note that above that sentence is a call-out box saying that you should know what you're doing, or talk to us, before changing settings, which I believe adequately sets expectations around these.

For settings that are riskier or whatever, I think we should mention that in their defined description (we might want to add a longer-form description field).

For extreme cases, we also have the ability to define settings as Hidden, which excludes them from SHOW ALL CLUSTER SETTINGS and this list, though I'm generally in favor of documenting over hiding.
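
For context, the generated list under discussion mirrors what users see from a SQL shell, and per the comment above, settings defined as Hidden are excluded from both. A minimal example (output elided):

```sql
-- Hidden settings do not appear in this output or in the generated docs list.
SHOW ALL CLUSTER SETTINGS;
```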



@a-robinson
Contributor

:lgtm:


Reviewed 1 of 1 files at r2.
Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.



@jseldess
Contributor

LGTM, @rmloveland, with one minor request.

The new node shutdown description is a big improvement. Thank you!

Are you sure you don't want to handle the other versions in this PR?


- Finishes in-flight requests. Note that this is a best effort that times out at the `server.shutdown.query_wait` [cluster setting](cluster-settings.html).
- Transfers all *range leases* and Raft leadership to other nodes.
- Gossips its draining state to the cluster, so that other nodes do not try to distribute query planning to the draining node, and no leases are transferred to the draining node. Note that this is best effort that times out at the `server.shutdown.drain_wait` [cluster setting](cluster-settings.html), so other nodes may not receive the gossip info in time.
@jseldess
Contributor

nit: Use only one space after the first period. I know you prefer 2, @rmloveland, but convention in our docs is 1.

| `sql.metrics.statement_details.dump_to_logs` | On each node, also copy collected per-statement statistics to the [logging output](debug-and-error-logs.html) when automatic reporting is enabled. | Boolean | `false` |
| `sql.metrics.statement_details.threshold` | Only collect per-statement statistics for statements that run longer than this threshold. | Interval | 0 seconds (all statements) |
| `sql.trace.log_statement_execute` | On each node, copy all executed statements to the [logging output](debug-and-error-logs.html). | Boolean | `false` |
{% include settings/v2.0/settings.md %}

<!-- Add this section back in once `system.settings` has been fleshed out.
@jseldess
Contributor

Please remove this old commented-out text.

@rmloveland
Contributor Author

Review status: 1 of 3 files reviewed at latest revision, 6 unresolved discussions, some commit checks pending.


v2.0/cluster-settings.md, line 25 at r1 (raw file):

Previously, dt (David Taylor) wrote…

I think we should put the whole list here and we should remove the sentence above.

Note that above that sentence is a call-out box saying that you should know what you're doing, or talk to us, before changing settings, which I believe adequately sets expectations around these.

For settings that are riskier or whatever, I think we should mention that it in their defined description (we might want to add a longer-form description field).

For extreme cases, we also have the ability to define settings as Hidden which excludes them from SHOW ALL CLUSTER SETTINGS and this list, though i'm generally in favor of documenting over hiding.

Removed the sentence above the list, as recommended, in 898fa26


v2.0/cluster-settings.md, line 25 at r2 (raw file):

Previously, jseldess wrote…

Please remove this old commented-out text.

Removed the commented-out text in 94b0f11


v2.0/stop-a-node.md, line 19 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I'd phrase this as "that times out after the duration specified by the server.shutdown.query_wait cluster setting"

Updated as recommended in 7fc7a2f


v2.0/stop-a-node.md, line 21 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

s/best effort/a best effort/

And a similar rephrasing as above would make sense here, too.

Thanks - fixed the "a" best effort thing in 94f88ce

Updated the cluster setting language in 253660f


v2.0/stop-a-node.md, line 21 at r2 (raw file):

Previously, jseldess wrote…

nit: Use only one space after the first period. I know you prefer 2, @rmloveland, but convention in our docs is 1.

Oops, sorry - fixed in 94f88ce



@rmloveland
Contributor Author

rmloveland commented Mar 14, 2018

Hey @a-robinson, at @jseldess' recommendation I'm going to try to add the 1.1.{5,6} changes to this PR as well. Do you mind answering some additional questions? I'm a lot less clear on these versions, as you will see from the below. :-)

I have some text below (which I also pushed in 64e718f so you can more easily comment) that I prepared in person with Alfonso on FixitDay before he left for holiday. Each bullet is followed by the version in which Alfonso said the changes occurred.

I made it a numbered list here so I can ask related questions for each item below.

First, the draft text from 64e718f:

1. Finishes in-flight requests. Note that this is a best effort that times out after the duration specified by the `???` cluster setting (1.1.5)
2. Transfers all *range leases* and Raft leadership to other nodes. (1.1.6)
3. Gossips its draining state to the cluster so that no leases are transferred to the draining node. Note that this is a best effort that times out after the duration specified by the `???` cluster setting, so other nodes may not receive the gossip info in time. (1.1.6)
4. No new ranges are transferred to the draining node, to avoid a possible loss of quorum after the node shuts down. (1.1.5)

My questions:

  1. Is there a cluster setting for this in 1.1.6? I'm not seeing one (at least with the server.* prefix) in the cluster settings SQL output below. If no setting, is there a default we can document?
  2. Is this accurate?
  3. Same as question 1 re: is there a cluster setting?
  4. Is this accurate?

The binary I'm running to look up these cluster settings is v1.1.6:

build:      CCL v1.1.6 @ 2018/03/12 17:55:09 (go1.8.3)

Cluster settings don't seem to allow for control of the features in items 1 and 3 above, but I'm not very familiar:

=> SHOW ALL CLUSTER SETTINGS;

...
| server.consistency_check.interval   | 24h0m0s  | d | the time between range consistency checks; set to 0 to disable consistency checking                    |
| server.declined_reservation_timeout | 1s       | d | the amount of time to consider the store throttled for up-replication after a reservation was declined |
| server.failed_reservation_timeout   | 5s       | d | the amount of time to consider the store throttled for up-replication after a failed reservation call  |
| server.remote_debugging.mode        | local    | s | set to enable remote debugging, localhost-only or disable (any, local, off)                            |
| server.time_until_store_dead        | 5m0s     | d | the time after which if there is no new gossiped information about a store, it is considered dead      |
| server.web_session_timeout          | 168h0m0s | d | the duration that a newly created web session will be valid                                            |
...

@rmloveland rmloveland force-pushed the fixitday-node-draining-update branch from 4f5db79 to 64e718f on March 14, 2018 15:19
@@ -14,6 +14,11 @@ For information about permanently removing nodes to downsize a cluster or react

### How It Works

- Finishes in-flight requests. Note that this is a best effort that times out after the duration specified by the `???` cluster setting (1.1.5)
Contributor

This setting is not in v1.1. The server just cancels all current sessions without waiting.

Contributor Author

Thanks, updated to say that in 14139e9

@@ -14,6 +14,11 @@ For information about permanently removing nodes to downsize a cluster or react

### How It Works

- Finishes in-flight requests. Note that this is a best effort that times out after the duration specified by the `???` cluster setting (1.1.5)
- Transfers all *range leases* and Raft leadership to other nodes. (1.1.6)
- Gossips its draining state to the cluster so that no leases are transferred to the draining node. Note that this is a best effort that times out after the duration specified by the `???` cluster setting, so other nodes may not receive the gossip info in time. (1.1.6)
Contributor

This still happens as of 1.1.6. In 1.1.5 and earlier, this part was broken.

Contributor Author

Thanks for confirming. Do you know if there is a cluster setting for this in 1.1.6? None of the documented cluster settings for 1.1.6 look to be the one. And it wasn't clear from a quick SHOW ALL CLUSTER SETTINGS on a 1.1.6 binary (though I may have missed it).

@@ -14,6 +14,11 @@ For information about permanently removing nodes to downsize a cluster or react

### How It Works

- Finishes in-flight requests. Note that this is a best effort that times out after the duration specified by the `???` cluster setting (1.1.5)
- Transfers all *range leases* and Raft leadership to other nodes. (1.1.6)
Contributor

This still happens as of 1.1.6. In 1.1.5 and earlier, these parts didn't always happen as intended.

Contributor Author

Great, thanks! Since the stable web docs show 1.1.6 now, I'll leave this in (minus the version number).

- Finishes in-flight requests. Note that this is a best effort that times out after the duration specified by the `???` cluster setting (1.1.5)
- Transfers all *range leases* and Raft leadership to other nodes. (1.1.6)
- Gossips its draining state to the cluster so that no leases are transferred to the draining node. Note that this is a best effort that times out after the duration specified by the `???` cluster setting, so other nodes may not receive the gossip info in time. (1.1.6)
- No new ranges are transferred to the draining node, to avoid a possible loss of quorum after the node shuts down. (1.1.5)
Contributor

This is true for all 1.1.x versions, as far as I'm aware.

Contributor Author

Thanks! Again, leaving in but removing the version number.

@a-robinson
Contributor

:lgtm_strong:


Reviewed 1 of 1 files at r5, 1 of 1 files at r8, 1 of 1 files at r13.
Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


v1.1/stop-a-node.md, line 19 at r9 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

Thanks for confirming. Do you know if there is a cluster setting for this in 1.1.6? None of the documented cluster settings for 1.1.6 look to be the one. And it wasn't clear from a quick SHOW ALL CLUSTER SETTINGS on a 1.1.6 binary (though I may have missed it).

Nope, no cluster setting in any of the v1.1 releases. It'd have to be something pretty critical for us to introduce/backport a new cluster setting in a patch release.



@rmloveland
Contributor Author

@jseldess do you want a final editorial look at the v1.1 changes before merge?

@jseldess
Contributor

LGTM, with one nit.

@@ -14,7 +14,12 @@ For information about permanently removing nodes to downsize a cluster or react

### How It Works

When you stop a node, CockroachDB lets the node finish in-flight requests and transfers all **range leases** off the node before shutting it down. If the node then stays offline for a certain amount of time (5 minutes by default), the cluster considers the node dead and starts to transfer its **range replicas** to other nodes as well.
- Cancels all current sessions without waiting.
- Transfers all *range leases* and Raft leadership to other nodes.
Contributor

nit: bold instead of italics.

Contributor Author

Thanks, fixed in bb14e1b
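
One last pointer for readers of the quoted passage: the "5 minutes by default" window appears to correspond to the `server.time_until_store_dead` setting shown earlier in this conversation (default `5m0s`). A sketch of adjusting it; the `10m0s` value is illustrative:

```sql
-- Time without gossiped store info before the cluster considers a store dead.
SHOW CLUSTER SETTING server.time_until_store_dead;

-- Illustrative only: give a stopped node more time before re-replication starts.
SET CLUSTER SETTING server.time_until_store_dead = '10m0s';
```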

@rmloveland rmloveland merged commit 9d34ae6 into master Mar 19, 2018
@rmloveland rmloveland deleted the fixitday-node-draining-update branch March 19, 2018 18:02