
storage: replicas not balanced across multiple stores in same node #6782

Closed
cuongdo opened this issue May 18, 2016 · 8 comments · Fixed by #51567
Labels
A-kv-distribution Relating to rebalancing and leasing. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@cuongdo
Contributor

cuongdo commented May 18, 2016

When running a cluster where nodes have multiple stores, one store often seems to get a disproportionate number of replicas allocated to it.

Here's an example cluster with 5 nodes, each with 5 stores:

[screenshot: per-store replica counts for the example cluster]

This is happening on 3 of 5 nodes in this cluster. We currently do not rebalance between multiple stores in the same node, but we may want to consider it for performance reasons. For example, maybe each store is backed by a different SSD. Spreading our I/O evenly across stores could benefit performance.

@bdarnell
Contributor

Moving replicas between stores on the same node is tracked in #2067 (and I'll add some more comments there), but I think this is a separate issue. Those ranges should be moved somewhere, even if it's to an underloaded store on another node.

@petermattis petermattis added this to the Q3 milestone Jul 21, 2016
@petermattis petermattis modified the milestone: Q3 Aug 3, 2017
@dianasaur323 dianasaur323 added this to the Later milestone Sep 17, 2017
@petermattis petermattis added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-distribution Relating to rebalancing and leasing. labels Jul 21, 2018
@petermattis petermattis assigned a-robinson and unassigned a-robinson Jul 21, 2018
@petermattis petermattis removed this from the Later milestone Oct 5, 2018
@tbg
Member

tbg commented Oct 11, 2018

Closing as stale.

@tbg tbg closed this as completed Oct 11, 2018
@a-robinson
Contributor

I get that we're very unlikely to work on this anytime soon, but it's a legitimate flaw in the product, not just a feature request. Do we not keep that sort of thing around anymore?

@tbg
Member

tbg commented Oct 11, 2018

Oh, I missed that facet. I thought this was just a long-fixed replication bug. Sorry and thanks for complaining. Lots of issues to look into.

@tbg tbg reopened this Oct 11, 2018
@knz
Contributor

knz commented Aug 27, 2019

Update from @tbg: with the recent changes to allow atomic replication changes, we're moving closer to resolving this issue. However:

- This doesn’t work yet, but only because it’s out of scope. The work here is to remove the check that prevents us from doing it, and to make sure there aren’t any other places where we’ve since bought into the “no two replicas of a range on a node” pattern.
- For the second part, I think an audit of any code that accesses NodeID should be enough. We want to make sure nobody tries to resolve a replica via its NodeID alone (see the sketch below).
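
A minimal illustration of that second point, assuming the roachpb package is imported; the findReplica helper is hypothetical and exists only to show the shape of a safe lookup:

```go
// Hypothetical helper, for illustration only. With multiple stores per node,
// a NodeID no longer identifies a replica, so any lookup has to key on the
// (NodeID, StoreID) pair (or the ReplicaID), never on NodeID alone.
func findReplica(
	replicas []roachpb.ReplicaDescriptor, n roachpb.NodeID, s roachpb.StoreID,
) (roachpb.ReplicaDescriptor, bool) {
	for _, r := range replicas {
		if r.NodeID == n && r.StoreID == s {
			return r, true
		}
	}
	return roachpb.ReplicaDescriptor{}, false
}
```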

A casual search also reveals related issue #39415, which we need to solve (maybe via PR #39497 from @nvanbenschoten) before we can consider multi-store nodes safe for use.

(cc @johnrk as this is an important long-standing issue which we're aiming to solve soon-ish)

@nvanbenschoten
Member

@lunevalex this is the issue I was thinking about. This has been unblocked by #12768.

Most of the work here will be in the kvserver.Allocator, where we need to lift the restriction about rebalancing between stores on the same node if we can rebalance atomically between them. I haven't personally touched this code in a long time, so @tbg might be better able to advise you on the details and subtleties that will go into addressing this.

@lunevalex lunevalex self-assigned this Jun 4, 2020
@tbg
Member

tbg commented Jun 4, 2020

I would start here by trying out a smoke test that proves that #12768 really does allow the lateral rebalance. Concretely, I think this would be a test with boilerplate similar to

func TestAtomicReplicationChange(t *testing.T) {

except that we want to start only a single server, with two stores:

StoreSpecs []StoreSpec

and then

	desc = runChange(desc, []roachpb.ReplicationChange{
		{ChangeType: roachpb.ADD_REPLICA, Target: roachpb.ReplicationTarget{NodeID: 1, StoreID: 2}},
		{ChangeType: roachpb.REMOVE_REPLICA, Target: roachpb.ReplicationTarget{NodeID: 1, StoreID: 1}},
	})

This should work (after removing some checks that probably refuse this for legacy reasons).
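
A minimal sketch of such a smoke test, patterned on TestAtomicReplicationChange. The runChange and scratchRangeDesc helpers, the test name, and the final check are assumptions that would need to be adapted to whatever the existing test file actually provides:

```go
func TestAtomicRebalanceBetweenStoresOnSameNode(t *testing.T) {
	defer leaktest.AfterTest(t)()
	ctx := context.Background()

	// One server with two in-memory stores, so StoreIDs 1 and 2 both live on NodeID 1.
	s, _, _ := serverutils.StartServer(t, base.TestServerArgs{
		StoreSpecs: []base.StoreSpec{{InMemory: true}, {InMemory: true}},
	})
	defer s.Stopper().Stop(ctx)

	// Assumed helper: returns the descriptor of a scratch range whose only
	// replica currently lives on (n1, s1).
	desc := scratchRangeDesc(t, s)

	// Atomically add a replica on store 2 and remove the one on store 1.
	desc = runChange(desc, []roachpb.ReplicationChange{
		{ChangeType: roachpb.ADD_REPLICA, Target: roachpb.ReplicationTarget{NodeID: 1, StoreID: 2}},
		{ChangeType: roachpb.REMOVE_REPLICA, Target: roachpb.ReplicationTarget{NodeID: 1, StoreID: 1}},
	})

	// desc should now list exactly one replica, located on (n1, s2).
	_ = desc
}
```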

Assuming this is in the bag, we can turn our attention to the replicate queue. Rebalances generally start out here:

return rq.considerRebalance(ctx, repl, voterReplicas, canTransferLease, dryRun)

We should ultimately be able to end up here with replication targets similar to the test:

addTarget := roachpb.ReplicationTarget{
	NodeID:  target.store.Node.NodeID,
	StoreID: target.store.StoreID,
}
removeTarget := roachpb.ReplicationTarget{
	NodeID:  removeReplica.NodeID,
	StoreID: removeReplica.StoreID,
}

I browsed through the code and the only code in the allocator that rules out lateral rebalances now is here:

// Nodes that already have a replica on one of their stores aren't valid
// rebalance targets. We do include stores that currently have a replica
// because we want them to be considered as valid stores in the
// ConvergesOnMean calculations below. This is subtle but important.
if nodeHasReplica(store.Node.NodeID, existingReplicas) &&
	!storeHasReplica(store.StoreID, existingReplicas) {
	log.VEventf(ctx, 2, "nodeHasReplica(n%d, %v)=true",
		store.Node.NodeID, existingReplicas)
	continue
}
We can't just remove that check: we want to allow a second replica on the same node only if we are in the context of a lateral move (we would not want to add a second replica to the node without atomically removing the first one). This will need to be established and unit tested.
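
As a rough sketch of the shape this could take inside the loop quoted above (the lateralMoveOK flag is hypothetical; it stands for whatever signal tells the allocator that this candidate is the destination of an atomic same-node swap):

```go
// Sketch only: lateralMoveOK does not exist in the allocator today.
onNode := nodeHasReplica(store.Node.NodeID, existingReplicas)
onStore := storeHasReplica(store.StoreID, existingReplicas)
if onNode && !onStore && !lateralMoveOK {
	// Without an atomic lateral move, adding a replica here would put two
	// replicas of the range behind a single node, i.e. one failure domain.
	log.VEventf(ctx, 2, "nodeHasReplica(n%d, %v)=true",
		store.Node.NodeID, existingReplicas)
	continue
}
```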

Finally, we should have some sort of high-level validation that this works end-to-end. For example, we could start a three-node cluster on which the first node has two stores and ensure that we organically see replicas move to that store (which means there must have been lateral rebalances).

I'll point out that there are likely deficiencies in how the system tries to balance the load. For example, if n1 has 100 stores and n2 and n3 have one each, then effectively n1 will end up with 100/102 of the data (I think: each of the 102 stores would hold roughly an equal share of replicas, so n1 gets about 98%), which may not be what we want (n1 might be CPU bound long before that). But for mostly symmetric configurations, hopefully it would do the right thing. While we're looking at that code, we should document the expected behavior and its strengths as much as possible. I believe we have some simulated rebalancing unit tests; we should update those.

@lunevalex lunevalex assigned darinpp and unassigned lunevalex Jun 10, 2020
@johnrk-zz

johnrk-zz commented Jun 11, 2020

We have access to a VMware test environment until September. We should consider using that environment to test with vSAN. cc @a-entin

@lunevalex lunevalex assigned lunevalex and unassigned darinpp Jul 13, 2020
lunevalex added a commit to lunevalex/cockroach that referenced this issue Jul 18, 2020
Closes cockroachdb#6782

This change modifies the replica_queue to allow rebalances between multiple stores within a single node. To make this correct we add a number of guard rails into the Allocator to prevent a rebalance from placing multiple replicas of a range on the same node. This is not desired because in a simple 3x replication scenario a single node crash may result in a whole range becoming unavailable. Unfortunately these changes are not enough, and in the naive scenario the allocator ends up in an endless rebalance loop between the two stores on the same node. An additional set of changes was necessary in the allocator heuristics to better detect when the stores on a single node are balanced and stop attempting to move ranges around.

Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.
lunevalex added a commit to lunevalex/cockroach that referenced this issue Jul 21, 2020
lunevalex added a commit to lunevalex/cockroach that referenced this issue Aug 7, 2020
Closes cockroachdb#6782

This change modifies the replica_queue to allow rebalances between multiple stores within a single node. To make this correct we add a number of guard rails into the Allocator to prevent a rebalance from placing multiple replicas of a range on the same node. This is not desired because in a simple 3x replication scenario a single node crash may result in a whole range becoming unavailable. Unfortunately these changes are not enough, and in the naive scenario the allocator ends up in an endless rebalance loop between the two stores on the same node. This change leverages the existing allocator heuristics to accomplish this goal, specifically balance_score and diversity_score. The balance_score is used to compute the right balance of replicas per node, so anytime we compare stores we factor the range count by the number of stores on a node. This allows the balance_score to be used across a heterogeneous cockroach topology, where each node may have a different number of stores on it. To prevent replicas ending up on the same node, we extend the failure domain definition to include the node and leverage the locality feature to add the node as the last locality tier.

Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.
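
A minimal sketch of the locality-tier idea described in the commit message above, assuming a synthetic tier key of "node" (the real change may name it differently): appending the node as the innermost tier makes two stores on the same node look, to diversity scoring, like two nodes in the same datacenter.

```go
// Sketch only: extend a node's locality with a synthetic innermost tier so
// that diversity_score repels placing two replicas of a range on stores that
// share a node, just as it repels co-locating replicas in one zone or region.
func withNodeTier(nodeID roachpb.NodeID, loc roachpb.Locality) roachpb.Locality {
	tiers := append([]roachpb.Tier(nil), loc.Tiers...) // copy, e.g. region=us-east1,zone=us-east1-b
	tiers = append(tiers, roachpb.Tier{Key: "node", Value: nodeID.String()})
	return roachpb.Locality{Tiers: tiers}
}
```
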
lunevalex added a commit to lunevalex/cockroach that referenced this issue Aug 17, 2020
lunevalex added a commit to lunevalex/cockroach that referenced this issue Aug 19, 2020
craig bot pushed a commit that referenced this issue Aug 20, 2020
50053: Script for the PublishRelease TC build configuration r=pbardea a=jlinder

Before: the script wasn't implemented.

Now:

Part of the new release process, this script

- tags the selected SHA to the provided name
- compiles the binaries and archive and uploads them to S3 as the
  versioned name
- uploads the docker image to docker.io/cockroachdb/cockroach
- pushes the tag to github.com/cockroachdb/cockroach
- pushes all the artifacts to their respective `latest` locations as
  appropriate

Release note: None

51567: kvserver: Allow rebalances between stores on the same nodes. r=lunevalex a=lunevalex

Closes #6782

This change modifies the replica_queue to allow rebalances between multiple stores within a single node. 
This is possible because we previously introduced atomic rebalances in #12768. 

The first step was to remove the constraints in the allocator that prevented same node rebalances and update the validation in the replica_queue to accept these rebalance proposals. There is one caveat that with 1x replication an atomic rebalance is not possible, so we now support adding multiple replicas of the range to the same node under this condition. 

With the constraints removed there would be nothing in the allocator to prevent it from placing multiple replicas of a range on the same node across multiple stores. This is not desired because in a simple 3x replication scenario a single node crash may result in a whole range becoming unavailable. Allocator uses locality tags to model failure domains, but a node was not considered to be a locality. It is thus natural to extend the failure domain definition to the node and model it as a locality tier. Now stores on the same node would be factored into the diversity_score and repel each other, just like nodes in the same datacenter do in a multi-region setup. 

Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.

52754: importccl: speed up revert of IMPORT INTO empty table r=dt a=dt

When IMPORT INTO fails, it reverts the tables to their pre-IMPORT state.
Typically this requires running a somewhat expensive RevertRange operation
that finds the keys written by the IMPORT in amongst all the table data
and deletes just those keys. This is somewhat expensive -- we need to
iterate the keys in the target table and check them to see if they
need to be reverted.

Non-INTO style IMPORTs create the table into which they will IMPORT and
thus can just drop it wholesale on failure, instead of doing this expensive
revert. However INTO-style IMPORTs could use a similarly fast/cheap
wholesale delete *if they knew the table was empty* when the IMPORT was
started.

This change tracks which tables were empty when the IMPORT started and
then deletes, rather than reverts, the table span on failure.

Release note (performance improvement): Cleaning up after a failure during IMPORT INTO a table which was empty is now faster.

53023: opt: add index acceleration support for ~ and && bounding box operators r=rytaft a=rytaft

This commit adds index acceleration support for the bounding box comparison
operators, `~` and `&&`. It maps `~` to Covers and `&&` to Intersects.

Release note (performance improvement): The ~ and && geospatial bounding
box operations can now benefit from index acceleration if one of the
operands is an indexed geometry column.

53049: bulkio: Fix transaction semantics in job scheduler. r=miretskiy a=miretskiy

Fixes #53033
Fixes #52959 

Use transaction when querying for the schedules to run.
In addition, ensure that a single bad schedule does not cause
all of the previous work to be wasted by using transaction savepoints.


Release Notes: None

53132: sql/opt: add implicit SELECT FOR UPDATE support for UPSERT statements r=nvanbenschoten a=nvanbenschoten

Fixes #50180.

This commit adds support for implicit SELECT FOR UPDATE support for UPSERT statements with a VALUES clause. This should improve throughput and latency for contended UPSERT statements in much the same way that 435fa43 did for UPDATE statements. However, this only has an effect on UPSERT statements into tables with multiple indexes because UPSERT statements into single-index tables hit a fast-path where they perform a blind-write without doing an initial row scan.

Conceptually, if we picture an UPSERT statement as the composition of a SELECT statement and an INSERT statement (with loosened semantics around existing rows) then this change performs the following transformation:
```
UPSERT t = SELECT FROM t + INSERT INTO t
=>
UPSERT t = SELECT FROM t FOR UPDATE + INSERT INTO t
```

I plan to test this out on a contended `indexes` workload at some point in the future.

Release note (sql change): UPSERT statements now acquire locks using the FOR UPDATE locking mode during their initial row scan, which improves performance for contended workloads when UPSERTing into tables with multiple indexes. This behavior is configurable using the enable_implicit_select_for_update session variable and the sql.defaults.implicit_select_for_update.enabled cluster setting.

Co-authored-by: James H. Linder <[email protected]>
Co-authored-by: Alex Lunev <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Rebecca Taft <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in ed34965 Aug 20, 2020