WIP: storage: Support scattering replicas in addition to leases #26438

a-robinson · 2018-06-05T19:30:06Z

Release note: TBD

This currently isn't a very satisfying experience. It doesn't scatter things as widely as I'd like, with a lot of replicas for a table still winding up on the same handful of nodes that they were predominantly on to begin with. It helps, as documented on #26059, especially if you run it multiple times and on all of your tables rather than just one, but needs to be better. I think that putting everything through the allocator logic, while necessary for respecting zone constraints, is hurting the variety of stores on which things end up.

Sending out as a WIP because I'm going to focus on improving the actual rebalancing logic as the first-order goal, with this being a less important addition to follow up on later. It still needs an option added to the SQL grammar so that it isn't used every time SCATTER is run.

Release note: None

cockroach-teamcity · 2018-06-05T19:30:13Z

This change is

bdarnell · 2018-06-06T15:28:40Z

Review status: 0 of 7 files reviewed at latest revision, all discussions resolved, some commit checks failed.

pkg/sql/scatter.go, line 139 at r2 (raw file):

		RequestHeader:   roachpb.RequestHeader{Key: n.run.span.Key, EndKey: n.run.span.EndKey},
		RandomizeLeases: true,
		// TODO: Require an additional option for this.

Why? I'd think that normally when you issue a scatter command you'd want both.

Comments from Reviewable

a-robinson · 2018-06-06T15:38:56Z

Review status: 0 of 7 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.

pkg/sql/scatter.go, line 139 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why? I'd think that normally when you issue a scatter command you'd want both.

Primarily because moving so many replicas is a lot more resource-intensive than just moving leases, and if anyone has made SCATTER a part of their workflow they could be in for quite a surprise when they run it after upgrading to 2.1.

Perhaps that's overly protective, though. I'm certainly open to leaving it like this.

Comments from Reviewable

bdarnell · 2018-06-06T15:40:39Z

Review status: 0 of 7 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.

pkg/sql/scatter.go, line 139 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Primarily because moving so many replicas is a lot more resource-intensive than just moving leases, and if anyone has made SCATTER a part of their workflow they could be in for quite a surprise when they run it after upgrading to 2.1.

Perhaps that's overly protective, though. I'm certainly open to leaving it like this.

Yeah, it's more expensive, but I think that's what people generally want when they're calling this. We could have a separate lease-only option if there's concern about this, but I'm not sure if anyone would use it (check with Robert).

Comments from Reviewable

This avoids a flood of snapshots all at once and allows later rebalancing decisions to be based on updated store info from the earlier decisions, rather than all the decisions happening at about the same time with the same information, which will make the same few nodes look most desirable. Release note: None

petermattis · 2018-06-06T23:46:13Z

Review status: 0 of 7 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.

pkg/sql/scatter.go, line 139 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yeah, it's more expensive, but I think that's what people generally want when they're calling this. We could have a separate lease-only option if there's concern about this, but I'm not sure if anyone would use it (check with Robert).

I'm ok with not adding an option for this. Prior to this PR ALTER TABLE ... SCATTER would move replicas if there was an imbalance (it looped until the replicate queue decided there was nothing to do). Yes, forcing replica rebalancing is a change in behavior, but I don't think it is drastic. One thing I'm interested in is how this affects stuff like kv --splits=1000. Does the forced rebalancing significantly slow down that operation? That might be an argument for having an ALTER TABLE ... SCATTER LEASES.

pkg/storage/replica_command.go, line 2121 at r3 (raw file):

		}
		numReplicas := int(zone.NumReplicas)
		replicasToAdd := rand.Intn(liveStores) + 1 - numReplicas

Rather than liveStores, don't you want to know how many candidate stores there are given the zone config constraints?

pkg/storage/replica_command.go, line 2126 at r3 (raw file):

		} else if replicasToAdd > numReplicas {
			replicasToAdd = numReplicas
		}

So you're adding up to N more replicas and then letting replicate queue chose which ones to remove. What is preventing replicate queue from removing the newly added replicas? Does this approach work well in practice? I don't have a better suggestion. The code could you a comment describing what it is doing.

pkg/storage/replica_command.go, line 2130 at r3 (raw file):

		var replicasAdded int
		for re := retry.StartWithCtx(ctx, retryOpts); re.Next() && replicasAdded < replicasToAdd; {
			requeue, err := rq.processOneChange(ctx, r, sysCfg, func() bool { return false } /* canTransferLease */, false /* dryRun */, true /* disableStatsBasedRebalancing */, true /* forceAddReplica */)

That's a lot of boolean arguments. I think you should change processOneChange to take an options struct. I see you've already added replicateQueueOptions, but it appears to be unused.

Comments from Reviewable

As outlined in recent comments on cockroachdb#26059, we need to bring back some form of stats-based rebalancing in order to perform well on TPC-C without manual partitioning and replica placement. This commit contains a prototype that demonstrates the effectiveness of changing our approach to making rebalancing decisions from making them in the replicate queue, which operates on arbitrarily ordered replicas of the ranges on a store, to making them at a higher-level. This prototype makes them at a cluster level by running the logic on only one node, but my real proposal is to make them at the store level. This change in abstraction reflects what a human would do if asked to even out the load on a cluster given perfect information about everything happening in the cluster: 1. First, determine which stores have the most load on them (or overfull -- but for the prototype I only considered the one dimension that affects TPC-C the most) 2. Decide whether the most loaded stores are so overloaded that action needs to be taken. 3. Examine the hottest replicas on the store (maybe not the absolute hottest in practice, since moving that one could disrupt user traffic, but in the prototype this seems to work fine) and attempt to move them to under-utilized stores. If this can be done simply by transferring leases to under-utilized stores, then do so. If moving leases isn't enough, then also rebalance replicas from the hottest store to under-utilized stores. 4. Repeat periodically to handle changes in load or cluster membership. In a real versino of this code, the plan is roughly: 1. Each store will independently run their own control loop like this that is only responsible for moving leases/replicas off itself, not off other stores. This avoids needing a centralized coordinator, and will avoid the need to use the raft debug endpoint as long as we start gossiping QPS per store info, since the store already has details about the replicas on itself. 2. The existing replicate queue will stop making decisions motivated by balance. It will switch to only making decisions based on constraints/diversity/lease preferences, which is still needed since the new store-level logic will only check for store-level balance, not that all replicas' constraints are properly met. 3. The new code will have to avoid violating constraints/diversity/lease preferences. 4. The new code should consider range count, disk fullness, and maybe writes per second as well. 5. In order to avoid making decisions based on bad data, I'd like to extend lease transfers to pass along QPS data to the new leaseholder and preemptive snapshots to pass along WPS data to the new replica. This may not be strictly necessary, as shown by the success of this prototype, but should make for more reliable decision making. I tested this out on TPC-C 5k on 15 nodes and am able to consistently get 94% efficiency, which is the max I've seen using a build of the workload generator that erroneously includes the ramp-up period in its final stats. The first run with this code only got 85% because it took a couple minutes to make all the lease transfers it wanted, but then all subsequent runs got the peak efficiency while making negligibly few lease transfers. Note that I didn't even have to implement replica rebalancing to get these results, which oddly contradicts my previous claims. However, I believe that's because I did the initial split/scatter using a binary containing cockroachdb#26438, so the replicas were already better scattered than by default. I ran TPC-C on that build without these changes a couple times, though, and didn't get better than 65% efficiency, so the scatter wasn't the cause of the good results here. Touches cockroachdb#26059, cockroachdb#17979 Release note: None

With this update, TPC-C 10k on 30 went from overloaded to running at peak efficiency over the course of about 4 hours (the manual partitioning approach takes many hours to move all the replicas as well, for a point of comparison). This is without having to run the replica scatter from cockroachdb#26438. Doing a 5 minute run to get a result that doesn't include all the rebalancing time shows: _elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms) 290.9s 124799.1 97.0% 548.6 486.5 872.4 1140.9 2281.7 10200.5 I think it may have a small bug in it still, since at one point early on one of the replicas from the warehouse table on the node doing the relocating thought that it had 16-17k QPS, which wasn't true by any other metric in the system. Restarting the node fixed it though. I'm not too concerned about the bug, since I assume I just made a code mistake, not that anything about the approach fundamentally leads to a random SQL table replica gets 10s of thousands of QPS. Range 1 is also back to getting a ton of QPS (~3k) even though I raised the range cache size from 1M to 50M. Looking at slow query traces shows a lot of range lookups, way more than I'd expect given that ranges weren't moving around at the time of the traces. Release note: None

a-robinson · 2018-08-02T15:00:52Z

I was using this while testing out load-based lease rebalancing and ran into a panic:

I180802 06:28:10.925661 3983220 storage/store.go:2551  [replicaGC,n1,s1,r7476/2:/Table/57/1/2{89/757…-90/115…}] removing replica
I180802 06:28:10.928514 3327671 util/mon/bytes_usage.go:592  [n1,client=10.142.0.36:46658,user=root] txn: bytes usage increases to 1.7 MiB (+1781760)
I180802 06:28:10.930257 3983220 storage/replica.go:842  [replicaGC,n1,s1,r7476/2:/Table/57/1/2{89/757…-90/115…}] removed 103972 (103965+7) keys in 4ms [clear=2ms commit=2ms]
E180802 06:28:10.931386 1768951 sql/conn_executor.go:657  [n1,client=127.0.0.1:48520,user=root] a SQL panic has occurred while executing "ALTER TABLE tpcc.public.stock SCATTER": context canceled
I180802 06:28:10.934749 3327671 util/mon/bytes_usage.go:592  [n1,client=10.142.0.36:46658,user=root] txn: bytes usage increases to 3.2 MiB (+3389440)
E180802 06:28:10.936822 1768951 util/log/crash_reporting.go:203  [n1,client=127.0.0.1:48520,user=root] a panic has occurred!
E180802 06:28:10.951230 1768951 util/log/crash_reporting.go:477  [n1,client=127.0.0.1:48520,user=root] Reported as error a1f4e80dd4e04ca699587d56f4eeb48d
panic: context canceled [recovered]
        panic: panic while executing 1 statements: ALTER TABLE _._._ SCATTER; caused by context canceled

goroutine 1768951 [running]:
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).closeWrapper(0xc436288000, 0x2d41880, 0xc42ce7ce80, 0x278aa00, 0xc42a1002d0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:671 +0x36f
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn.func1(0xc436288000, 0x2d41880, 0xc42ce7ce80)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:413 +0x61
panic(0x278aa00, 0xc42a1002d0)
        /usr/local/go/src/runtime/panic.go:502 +0x229
github.com/cockroachdb/cockroach/pkg/kv.(*RangeIterator).Desc(0xc431f08640, 0x2d41940)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/range_iter.go:70 +0x6a
github.com/cockroachdb/cockroach/pkg/sql.(*scatterNode).Next(0xc42c1891a0, 0x2d41940, 0xc42a1ab680, 0xc4362884c0, 0xc436288430, 0xc421d74c01, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/scatter.go:164 +0xb4
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).forEachRow(0xc436288000, 0x2d41940, 0xc42a1ab680, 0xc4362884c0, 0xc436288430, 0x2d335c0, 0xc42c1891a0, 0xc429492b28, 0x19c42c1, 0x2d41940)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:945 +0xb0
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithLocalEngine(0xc436288000, 0x2d41940, 0xc42a1ab680, 0xc436288430, 0x3, 0x7fd4bf799350, 0xc428e28b40, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:887 +0x286
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine(0xc436288000, 0x2d41940, 0xc42a1ab680, 0x2d441c0, 0xc42ce7cf40, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:802 +0xa8f
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(0xc436288000, 0x2d41940, 0xc42a1ab680, 0x2d441c0, 0xc42ce7cf40, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:401 +0xb1e
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(0xc436288000, 0x2d41940, 0xc42a1ab680, 0x2d441c0, 0xc42ce7cf40, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:95 +0x358
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc436288000, 0x2d41880, 0xc42ce7ce80, 0xc4272c1220, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1078 +0x2055
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0xc42058cea0, 0x2d41880, 0xc42ce7ce80, 0x0, 0x0, 0xc462b11050, 0x4, 0xc462b1103d, 0xd, 0x2d22e40, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:415 +0x1bb
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl.func3(0xc42058cea0, 0x2d41880, 0xc42ce7ce80, 0xc450fa8a80, 0x5400, 0x15000, 0xc4208a4600, 0xc4272c1220, 0xc4272c11f0, 0xc43019a830, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:267 +0x122
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:266 +0xf27

As outlined in recent comments on cockroachdb#26059, we need to bring back some form of stats-based rebalancing in order to perform well on TPC-C without manual partitioning and replica placement. This commit contains a prototype that demonstrates the effectiveness of changing our approach to making rebalancing decisions from making them in the replicate queue, which operates on arbitrarily ordered replicas of the ranges on a store, to making them at a higher-level. This prototype makes them at a cluster level by running the logic on only one node, but my real proposal is to make them at the store level. This change in abstraction reflects what a human would do if asked to even out the load on a cluster given perfect information about everything happening in the cluster: 1. First, determine which stores have the most load on them (or overfull -- but for the prototype I only considered the one dimension that affects TPC-C the most) 2. Decide whether the most loaded stores are so overloaded that action needs to be taken. 3. Examine the hottest replicas on the store (maybe not the absolute hottest in practice, since moving that one could disrupt user traffic, but in the prototype this seems to work fine) and attempt to move them to under-utilized stores. If this can be done simply by transferring leases to under-utilized stores, then do so. If moving leases isn't enough, then also rebalance replicas from the hottest store to under-utilized stores. 4. Repeat periodically to handle changes in load or cluster membership. In a real versino of this code, the plan is roughly: 1. Each store will independently run their own control loop like this that is only responsible for moving leases/replicas off itself, not off other stores. This avoids needing a centralized coordinator, and will avoid the need to use the raft debug endpoint as long as we start gossiping QPS per store info, since the store already has details about the replicas on itself. 2. The existing replicate queue will stop making decisions motivated by balance. It will switch to only making decisions based on constraints/diversity/lease preferences, which is still needed since the new store-level logic will only check for store-level balance, not that all replicas' constraints are properly met. 3. The new code will have to avoid violating constraints/diversity/lease preferences. 4. The new code should consider range count, disk fullness, and maybe writes per second as well. 5. In order to avoid making decisions based on bad data, I'd like to extend lease transfers to pass along QPS data to the new leaseholder and preemptive snapshots to pass along WPS data to the new replica. This may not be strictly necessary, as shown by the success of this prototype, but should make for more reliable decision making. I tested this out on TPC-C 5k on 15 nodes and am able to consistently get 94% efficiency, which is the max I've seen using a build of the workload generator that erroneously includes the ramp-up period in its final stats. The first run with this code only got 85% because it took a couple minutes to make all the lease transfers it wanted, but then all subsequent runs got the peak efficiency while making negligibly few lease transfers. Note that I didn't even have to implement replica rebalancing to get these results, which oddly contradicts my previous claims. However, I believe that's because I did the initial split/scatter using a binary containing cockroachdb#26438, so the replicas were already better scattered than by default. I ran TPC-C on that build without these changes a couple times, though, and didn't get better than 65% efficiency, so the scatter wasn't the cause of the good results here. Touches cockroachdb#26059, cockroachdb#17979 Release note: None [prototype] storage: Extend new allocator to also move range replicas With this update, TPC-C 10k on 30 went from overloaded to running at peak efficiency over the course of about 4 hours (the manual partitioning approach takes many hours to move all the replicas as well, for a point of comparison). This is without having to run the replica scatter from cockroachdb#26438. Doing a 5 minute run to get a result that doesn't include all the rebalancing time shows: _elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms) 290.9s 124799.1 97.0% 548.6 486.5 872.4 1140.9 2281.7 10200.5 I think it may have a small bug in it still, since at one point early on one of the replicas from the warehouse table on the node doing the relocating thought that it had 16-17k QPS, which wasn't true by any other metric in the system. Restarting the node fixed it though. I'm not too concerned about the bug, since I assume I just made a code mistake, not that anything about the approach fundamentally leads to a random SQL table replica gets 10s of thousands of QPS. Range 1 is also back to getting a ton of QPS (~3k) even though I raised the range cache size from 1M to 50M. Looking at slow query traces shows a lot of range lookups, way more than I'd expect given that ranges weren't moving around at the time of the traces. Release note: None Release note: None

bobvawter · 2020-03-10T16:34:29Z

All committers have signed the CLA.

a-robinson added 2 commits May 31, 2018 15:49

storage: Add option to scatter replicas in addition to leases

10f3f00

Release note: None

storage: Continue trying replica scatter when stores are throttled

2d56c10

Release note: None

a-robinson requested review from a team June 5, 2018 19:30

a-robinson mentioned this pull request Jun 11, 2018

[prototype] storage: Make rebalance decisions at store-level #26608

Closed

a-robinson mentioned this pull request Oct 11, 2018

scatter: downreplication takes a long time #17000

Closed

a-robinson mentioned this pull request Jan 11, 2019

storage: SCATTER should have an option to randomize replicas #23358

Closed

tbg added the X-noremind Bots won't notify about PRs with X-noremind label Jun 19, 2019

a-robinson closed this Apr 2, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: storage: Support scattering replicas in addition to leases #26438

WIP: storage: Support scattering replicas in addition to leases #26438

a-robinson commented Jun 5, 2018

cockroach-teamcity commented Jun 5, 2018

bdarnell commented Jun 6, 2018

a-robinson commented Jun 6, 2018

bdarnell commented Jun 6, 2018

petermattis commented Jun 6, 2018

a-robinson commented Aug 2, 2018

bobvawter commented Mar 10, 2020 •

edited by cockroach-teamcity

Loading

WIP: storage: Support scattering replicas in addition to leases #26438

WIP: storage: Support scattering replicas in addition to leases #26438

Conversation

a-robinson commented Jun 5, 2018

cockroach-teamcity commented Jun 5, 2018

bdarnell commented Jun 6, 2018

a-robinson commented Jun 6, 2018

bdarnell commented Jun 6, 2018

petermattis commented Jun 6, 2018

a-robinson commented Aug 2, 2018

bobvawter commented Mar 10, 2020 • edited by cockroach-teamcity Loading

bobvawter commented Mar 10, 2020 •

edited by cockroach-teamcity

Loading