
Replication layer docs: add load-based rebalancing #3921

Merged (1 commit) on Oct 26, 2018

Conversation

rmloveland (Contributor)

Fixes #2051.

Summary of changes:

  • Add a paragraph to *Architecture > Replication Layer* describing that,
    as of v2.1, in addition to the rebalancing that occurs when nodes are
    added or removed, we also rebalance leases and replicas based on load.
    Also add links to the relevant cluster setting and zone config docs for
    those who want more info.

@cockroach-teamcity (Member)

This change is Reviewable

@rmloveland rmloveland requested a review from a-robinson October 25, 2018 16:07

@a-robinson a-robinson left a comment


:lgtm:



v2.1/architecture/replication-layer.md, line 95 at r1 (raw file):

This is achieved by using a snapshot of a replica from the leaseholder, and then sending the data to another node over [gRPC](distribution-layer.html#grpc). After the transfer has been completed, the node with the new replica joins that range's Raft group; it then detects that its latest timestamp is behind the most recent entries in the Raft log and it replays all of the actions in the Raft log on itself.

<span class="version-tag">New in v2.1:</span> In addition to the rebalancing that occurs when nodes join or leave a cluster, leases and replicas are rebalanced automatically based on the relative load across the nodes within a cluster. For more information, see the `kv.allocator.load_based_rebalancing` [cluster setting](../cluster-settings.html). Note that, depending on the needs of your deployment, you can exercise additional control over the location of leases and replicas by [configuring replication zones](../configure-replication-zones.html).
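For concreteness, here is a minimal SQL sketch of how these knobs are inspected and adjusted from a SQL shell. The `SHOW`/`SET CLUSTER SETTING` and `CONFIGURE ZONE` statements are standard CockroachDB SQL, but the enum values shown for the setting are from my understanding of v2.1 (verify against the linked cluster-settings page), and the `users` table in the zone config example is hypothetical.

```sql
-- Inspect the current load-based rebalancing mode.
SHOW CLUSTER SETTING kv.allocator.load_based_rebalancing;

-- Set the mode explicitly. The values below are the v2.1 enum as I
-- understand it (verify against the cluster-settings docs):
--   'off', 'leases', 'leases and replicas' (the default).
SET CLUSTER SETTING kv.allocator.load_based_rebalancing = 'leases and replicas';

-- Additional control over replica placement via a replication zone
-- (hypothetical table name).
ALTER TABLE users CONFIGURE ZONE USING num_replicas = 5;
```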

I might also mention the `kv.allocator.qps_rebalance_threshold` cluster setting. Load-based rebalancing attempts to keep each store's QPS within that fraction of the mean QPS across stores. It defaults to 0.25, meaning that it tries to keep each store no more than 25% above the mean QPS.
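To make the threshold concrete, a small sketch. The statements are standard CockroachDB SQL; the store counts and QPS figures in the comments are illustrative numbers, and the two-sided band (both above and below the mean) is my reading of how the fraction applies.

```sql
-- Defaults to 0.25, i.e., a store's QPS may drift 25% from the mean
-- before it is considered for load-based rebalancing.
SHOW CLUSTER SETTING kv.allocator.qps_rebalance_threshold;

-- Worked example (illustrative numbers): three stores serving
-- 2000, 1000, and 600 QPS have a mean of 1200 QPS. With the default
-- threshold, the acceptable band is 1200 * (1 +/- 0.25) = 900..1500 QPS,
-- so the 2000-QPS store is a candidate to shed leases/replicas and the
-- 600-QPS store a candidate to receive them.
SET CLUSTER SETTING kv.allocator.qps_rebalance_threshold = 0.25;
```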

@rmloveland rmloveland requested a review from jseldess October 25, 2018 17:20
@rmloveland rmloveland force-pushed the load-based-replica-rebalancing branch from 8c1d910 to e619594 Compare October 25, 2018 17:23

@rmloveland rmloveland left a comment




v2.1/architecture/replication-layer.md, line 95 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I might also mention the `kv.allocator.qps_rebalance_threshold` cluster setting. Load-based rebalancing attempts to keep each store's QPS within that fraction of the mean QPS across stores. It defaults to 0.25, meaning that it tries to keep each store no more than 25% above the mean QPS.

Thanks Alex - I added that setting to the sentence since it might be non-obvious they are related (but did not add more description beyond what is on the cluster settings page).

PS thank you for the review!


@jseldess jseldess left a comment


LGTM, with some nits.



v2.1/architecture/replication-layer.md, line 95 at r1 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

Thanks Alex - I added that setting to the sentence since it might be non-obvious they are related (but did not add more description beyond what is on the cluster settings page).

PS thank you for the review!

This is a nice summary. I think we probably should spell out the implications of this type of rebalancing, but I'd be fine doing that later in a tutorial or a more comprehensive doc about rebalancing, like the blog post Bram wrote. It might be nice to demonstrate both types of rebalancing in this local tutorial as well: https://www.cockroachlabs.com/docs/stable/demo-automatic-rebalancing.html. I'll open an issue.


v2.1/architecture/replication-layer.md, line 85 at r2 (raw file):

Whenever there are changes to a cluster's number of nodes, the members of Raft groups change and, to ensure optimal survivability and performance, replicas need to be rebalanced. What rebalancing looks like varies depending on whether nodes are being added or going offline.

**Nodes added**: The new node communicates information about itself to other nodes, indicating that it has space available. The cluster then rebalances some replicas onto the new node.

Let's make Nodes added and Nodes going offline bullets.


v2.1/architecture/replication-layer.md, line 89 at r2 (raw file):

**Nodes going offline**: If a member of a Raft group ceases to respond, after 5 minutes, the cluster begins to rebalance by replicating the data the downed node held onto other nodes.

#### Rebalancing replicas
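As an aside on the excerpt above, the five-minute window corresponds, as far as I know, to the `server.time_until_store_dead` cluster setting. That setting is not mentioned anywhere in this thread, so treat the name and default as an assumption to verify against the cluster-settings docs.

```sql
-- How long a store can be unresponsive before its replicas are
-- considered dead and rebalanced onto other nodes (default 5m, per my
-- understanding; verify against the cluster-settings docs).
SHOW CLUSTER SETTING server.time_until_store_dead;
SET CLUSTER SETTING server.time_until_store_dead = '5m0s';
```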

This pre-exists your PR, but I don't think we need the `Rebalancing replicas` subheading. I think it will flow well if we:

  • Remove the `Rebalancing replicas` heading.
  • Remove the first sentence after that heading, which is basically a repetition of the first sentence of this section.
  • Change "This is achieved by using a snapshot..." to "Rebalancing is achieved by using a snapshot..."

@jseldess jseldess mentioned this pull request Oct 26, 2018
@rmloveland rmloveland force-pushed the load-based-replica-rebalancing branch from e619594 to 9a7b531 Compare October 26, 2018 14:38

@rmloveland rmloveland left a comment


Thanks for the reviews, Alex and Jesse!



v2.1/architecture/replication-layer.md, line 95 at r1 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

This is a nice summary. I think we probably should spell out the implications of this type of rebalancing, but I'd be fine doing that later in a tutorial or a more comprehensive doc about rebalancing, like the blog post Bram wrote. It might be nice to demonstrate both types of rebalancing in this local tutorial as well: https://www.cockroachlabs.com/docs/stable/demo-automatic-rebalancing.html. I'll open an issue.

Sounds good - thanks Jesse!


v2.1/architecture/replication-layer.md, line 85 at r2 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

Let's make Nodes added and Nodes going offline bullets.

Fixed.


v2.1/architecture/replication-layer.md, line 89 at r2 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

This pre-exists your PR, but I don't think we need the `Rebalancing replicas` subheading. I think it will flow well if we:

  • Remove the `Rebalancing replicas` heading.
  • Remove the first sentence after that heading, which is basically a repetition of the first sentence of this section.
  • Change "This is achieved by using a snapshot..." to "Rebalancing is achieved by using a snapshot..."

Fixed.
