storage: replicas not balanced across multiple stores in same node #6782
Moving replicas between stores on the same node is tracked in #2067 (and I'll add some more comments there), but I think this is a separate issue. Those ranges should be moved somewhere, even if it's to an underloaded store on another node.
Closing as stale.
I get that we're very unlikely to work on this anytime soon, but it's a legitimate flaw in the product, not just a feature request. Do we not keep that sort of thing around anymore?
Oh, I missed that facet. I thought this was just a long-fixed replication bug. Sorry and thanks for complaining. Lots of issues to look into.
Update from @tbg: with the recent changes to allow atomic replication changes, we're moving closer to resolving this issue. However:
A casual search also reveals related issue #39415, which we need to solve (maybe via PR #39497 from @nvanbenschoten) before we can consider multi-store nodes safe for use. (cc @johnrk as this is an important long-standing issue which we're aiming to solve soon-ish)
@lunevalex this is the issue I was thinking about. This has been unblocked by #12768. Most of the work here will be in the
I would start here by trying out a smoke test that proves that #12768 really fundamentally does allow the lateral rebalance. Concretely, I think this would be a test with boilerplate similar to
except that we want to start only a single server, with two stores: cockroach/pkg/base/test_server_args.go Line 66 in 7b7fcb0
and then

```go
desc = runChange(desc, []roachpb.ReplicationChange{
	{ChangeType: roachpb.ADD_REPLICA, Target: roachpb.ReplicationTarget{NodeID: 1, StoreID: 2}},
	{ChangeType: roachpb.REMOVE_REPLICA, Target: roachpb.ReplicationTarget{NodeID: 1, StoreID: 1}},
})
```

This should work (after removing some checks that probably refuse this for legacy reasons). Assuming this is in the bag, we can turn our attention to the replicate queue. Rebalances generally start out here: cockroach/pkg/kv/kvserver/replicate_queue.go Line 412 in 76c63de
We should ultimately be able to end up here with replication targets similar to the test: cockroach/pkg/kv/kvserver/allocator.go Lines 748 to 755 in f291061
I browsed through the code and the only code in the allocator that rules out lateral rebalances now is here: cockroach/pkg/kv/kvserver/allocator_scorer.go Lines 613 to 621 in 4cf5a97
We can't just remove that - we want to allow a second replica on the same node only if we are in the context of a lateral move (we would not want to add a second replica to the node without atomically removing the first one). This will need to be established and unit tested.

Finally, we should have some sort of high-level validation that this works end-to-end. For example, we could start a three-node cluster on which the first node has two stores and ensure that we organically see replicas move to that store (which means there must have been lateral rebalances).

I'll point out that there are likely deficiencies in how the system tries to balance the load. For example, if n1 has 100 stores and n2 and n3 have one store each, then n1 will effectively end up with 100/102 of the data (I think), which may not be what we want (n1 might be CPU bound long before that). But for mostly symmetric configurations, hopefully it would do the right thing.

While we're looking at that code, we should document as much as possible the expected behavior and its strengths. I believe we have some simulated rebalancing unit tests; we should update those.
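The lateral move described above can be sketched with stand-in types. This is a minimal, self-contained illustration of why the add and remove must be atomic - none of these names are CockroachDB's real API (`ReplicaDesc`, `applyChanges`, etc. are invented for the sketch):

```go
package main

import "fmt"

// ReplicaDesc is a stand-in for a replica's placement.
type ReplicaDesc struct{ NodeID, StoreID int }

// RangeDesc is a stand-in for a range descriptor.
type RangeDesc struct{ Replicas []ReplicaDesc }

type ChangeType int

const (
	AddReplica ChangeType = iota
	RemoveReplica
)

type ReplicationChange struct {
	ChangeType ChangeType
	Target     ReplicaDesc
}

// applyChanges applies all changes as one step: both the ADD and the
// REMOVE take effect together, so the range never ends the change with
// two replicas pinned to the same node.
func applyChanges(desc RangeDesc, chgs []ReplicationChange) RangeDesc {
	out := desc
	for _, c := range chgs {
		switch c.ChangeType {
		case AddReplica:
			out.Replicas = append(out.Replicas, c.Target)
		case RemoveReplica:
			kept := out.Replicas[:0:0]
			for _, r := range out.Replicas {
				if r != c.Target {
					kept = append(kept, r)
				}
			}
			out.Replicas = kept
		}
	}
	return out
}

func main() {
	desc := RangeDesc{Replicas: []ReplicaDesc{{1, 1}, {2, 3}, {3, 5}}}
	// Lateral move on n1: add (n1, s2) and remove (n1, s1) atomically.
	desc = applyChanges(desc, []ReplicationChange{
		{AddReplica, ReplicaDesc{1, 2}},
		{RemoveReplica, ReplicaDesc{1, 1}},
	})
	fmt.Println(desc.Replicas) // replica count stays 3; n1's replica now lives on s2
}
```

If the two changes were issued as separate steps instead, the intermediate state would briefly hold two replicas on n1, which is exactly what the allocator guard rails discussed above are meant to forbid.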
Closes cockroachdb#6782

This change modifies the replica_queue to allow rebalances between multiple stores within a single node. To make this correct we add a number of guard rails into the Allocator to prevent a rebalance from placing multiple replicas of a range on the same node. This is not desired because in a simple 3x replication scenario a single node crash may result in a whole range becoming unavailable. Unfortunately these changes are not enough, and in the naive scenario the allocator ends up in an endless rebalance loop between the two stores on the same node. An additional set of changes was necessary in the allocator heuristics to better detect when the stores on a single node are balanced and to stop attempting to move ranges around.

Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.
Closes cockroachdb#6782

This change modifies the replica_queue to allow rebalances between multiple stores within a single node. To make this correct we add a number of guard rails into the Allocator to prevent a rebalance from placing multiple replicas of a range on the same node. This is not desired because in a simple 3x replication scenario a single node crash may result in a whole range becoming unavailable. Unfortunately these changes are not enough, and in the naive scenario the allocator ends up in an endless rebalance loop between the two stores on the same node.

This change leverages the existing allocator heuristics to accomplish this goal, specifically balance_score and diversity_score. The balance_score is used to compute the right balance of replicas per node, so anytime we compare stores we factor the range count by the number of stores on a node. This allows the balance_score to be used across a heterogeneous cockroach topology, where each node may have a different number of stores on it. To prevent replicas ending up on the same node, we extend the failure domain definition to include the node and leverage the locality feature to add the node as the last locality tier.

Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.
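One plausible reading of "factor the range count by the number of stores on a node" is to normalize a node's load by its store count before comparing, so nodes with different store counts can be weighed on equal footing. A minimal sketch of that idea (the function name is invented; this is not the actual allocator code):

```go
package main

import "fmt"

// normalizedRangeCount divides a node's total range count by the number
// of stores on that node, yielding a per-store load figure that is
// comparable across nodes with different store counts. This sketches one
// reading of the balance_score normalization, not the real heuristic.
func normalizedRangeCount(rangeCount, storesOnNode int) float64 {
	return float64(rangeCount) / float64(storesOnNode)
}

func main() {
	// n1 has 2 stores holding 60 ranges in total; n2 has 1 store holding
	// 30 ranges. Raw per-node totals (60 vs 30) look imbalanced, but the
	// normalized figures show both nodes equally loaded per store.
	fmt.Println(normalizedRangeCount(60, 2)) // 30
	fmt.Println(normalizedRangeCount(30, 1)) // 30
}
```

Under this kind of normalization, a heterogeneous topology settles when every store carries a similar share, rather than when every node does.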
Closes cockroachdb#6782

This change modifies the replica_queue to allow rebalances between multiple stores within a single node. This is possible because we previously introduced atomic rebalances in cockroachdb#12768.

The first step was to remove the constraints in the allocator that prevented same-node rebalances and update the validation in the replica_queue to accept these rebalance proposals. There is one caveat that with 1x replication an atomic rebalance is not possible, so we now support adding multiple replicas of the range to the same node under this condition.

With the constraints removed, there would be nothing in the allocator to prevent it from placing multiple replicas of a range on the same node across multiple stores. This is not desired because in a simple 3x replication scenario a single node crash may result in a whole range becoming unavailable. The allocator uses locality tags to model failure domains, but a node was not considered to be a locality. It is thus natural to extend the failure domain definition to the node and model it as a locality tier. Now stores on the same node are factored into the diversity_score and repel each other, just like nodes in the same datacenter do in a multi-region setup.

Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.
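Modeling the node as the last locality tier can be sketched with a toy diversity function. This mirrors the shape of the heuristic described above - replicas whose localities diverge higher in the hierarchy score as more diverse - but it is a simplified, self-contained illustration, not CockroachDB's real implementation (types and names are invented, and it assumes both replicas list the same tiers in the same order):

```go
package main

import "fmt"

// Tier is one level of a locality hierarchy, e.g. region > zone > node.
type Tier struct{ Key, Value string }

// diversityScore returns the fraction of locality tiers at which two
// replicas diverge: 1.0 means they differ at the topmost tier, 0 means
// their localities are identical all the way down to the node.
func diversityScore(a, b []Tier) float64 {
	n := len(a)
	for i := range a {
		if a[i].Value != b[i].Value {
			return float64(n-i) / float64(n)
		}
	}
	return 0
}

func main() {
	// With the node appended as the last tier, two stores on the same
	// node share every tier: their diversity score is 0, so the allocator
	// avoids placing two replicas of a range across them.
	s1 := []Tier{{"region", "us-east"}, {"zone", "b"}, {"node", "n1"}}
	s2 := []Tier{{"region", "us-east"}, {"zone", "b"}, {"node", "n1"}}
	s3 := []Tier{{"region", "us-east"}, {"zone", "b"}, {"node", "n2"}}
	fmt.Println(diversityScore(s1, s2)) // 0: same node, no diversity
	fmt.Println(diversityScore(s1, s3)) // positive: different nodes in the same zone
}
```

This is the same mechanism that keeps replicas spread across datacenters in a multi-region deployment; the change just extends the hierarchy one level deeper.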
50053: Script for the PublishRelease TC build configuration r=pbardea a=jlinder

Before: the script wasn't implemented. Now: part of the new release process, this script
- tags the selected SHA to the provided name
- compiles the binaries and archive and uploads them to S3 as the versioned name
- uploads the docker image to docker.io/cockroachdb/cockroach
- pushes the tag to github.com/cockroachdb/cockroach
- pushes all the artifacts to their respective `latest` locations as appropriate

Release note: None

51567: kvserver: Allow rebalances between stores on the same nodes. r=lunevalex a=lunevalex

Closes #6782

This change modifies the replica_queue to allow rebalances between multiple stores within a single node. This is possible because we previously introduced atomic rebalances in #12768. The first step was to remove the constraints in the allocator that prevented same-node rebalances and update the validation in the replica_queue to accept these rebalance proposals. There is one caveat that with 1x replication an atomic rebalance is not possible, so we now support adding multiple replicas of the range to the same node under this condition. With the constraints removed, there would be nothing in the allocator to prevent it from placing multiple replicas of a range on the same node across multiple stores. This is not desired because in a simple 3x replication scenario a single node crash may result in a whole range becoming unavailable. The allocator uses locality tags to model failure domains, but a node was not considered to be a locality. It is thus natural to extend the failure domain definition to the node and model it as a locality tier. Now stores on the same node are factored into the diversity_score and repel each other, just like nodes in the same datacenter do in a multi-region setup.

Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.

52754: importccl: speed up revert of IMPORT INTO empty table r=dt a=dt

When IMPORT INTO fails, it reverts the tables to their pre-IMPORT state. Typically this requires running a somewhat expensive RevertRange operation that finds the keys written by the IMPORT in amongst all the table data and deletes just those keys. This is somewhat expensive -- we need to iterate the keys in the target table and check them to see if they need to be reverted. Non-INTO style IMPORTs create the table into which they will IMPORT and thus can just drop it wholesale on failure, instead of doing this expensive revert. However, INTO-style IMPORTs could use a similarly fast/cheap wholesale delete *if they knew the table was empty* when the IMPORT was started. This change tracks which tables were empty when the IMPORT started and then deletes, rather than reverts, the table span on failure.

Release note (performance improvement): Cleaning up after a failure during IMPORT INTO a table which was empty is now faster.

53023: opt: add index acceleration support for ~ and && bounding box operators r=rytaft a=rytaft

This commit adds index acceleration support for the bounding box comparison operators, `~` and `&&`. It maps `~` to Covers and `&&` to Intersects.

Release note (performance improvement): The ~ and && geospatial bounding box operations can now benefit from index acceleration if one of the operands is an indexed geometry column.

53049: bulkio: Fix transaction semantics in job scheduler. r=miretskiy a=miretskiy

Fixes #53033
Fixes #52959

Use a transaction when querying for the schedules to run. In addition, ensure that a single bad schedule does not cause all of the previous work to be wasted by using transaction savepoints.

Release Notes: None

53132: sql/opt: add implicit SELECT FOR UPDATE support for UPSERT statements r=nvanbenschoten a=nvanbenschoten

Fixes #50180.

This commit adds implicit SELECT FOR UPDATE support for UPSERT statements with a VALUES clause. This should improve throughput and latency for contended UPSERT statements in much the same way that 435fa43 did for UPDATE statements. However, this only has an effect on UPSERT statements into tables with multiple indexes, because UPSERT statements into single-index tables hit a fast-path where they perform a blind write without doing an initial row scan. Conceptually, if we picture an UPSERT statement as the composition of a SELECT statement and an INSERT statement (with loosened semantics around existing rows), then this change performs the following transformation:

```
UPSERT t = SELECT FROM t + INSERT INTO t
=>
UPSERT t = SELECT FROM t FOR UPDATE + INSERT INTO t
```

I plan to test this out on a contended `indexes` workload at some point in the future.

Release note (sql change): UPSERT statements now acquire locks using the FOR UPDATE locking mode during their initial row scan, which improves performance for contended workloads when UPSERTing into tables with multiple indexes. This behavior is configurable using the enable_implicit_select_for_update session variable and the sql.defaults.implicit_select_for_update.enabled cluster setting.

Co-authored-by: James H. Linder <[email protected]>
Co-authored-by: Alex Lunev <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Rebecca Taft <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
When running a cluster where nodes have multiple stores, one store often seems to get a disproportionate number of replicas allocated to it.
Here's an example cluster with 5 nodes, each with 5 stores:
This is happening on 3 of 5 nodes in this cluster. We currently do not rebalance between multiple stores in the same node, but we may want to consider it for performance reasons. For example, maybe each store is backed by a different SSD. Spreading our I/O evenly across stores could benefit performance.