storage: replicas not balanced across multiple stores in same node #6782
Moving replicas between stores on the same node is tracked in #2067 (and I'll add some more comments there), but I think this is a separate issue. Those ranges should be moved somewhere, even if it's to an underloaded store on another node.
Closing as stale.
I get that we're very unlikely to work on this anytime soon, but it's a legitimate flaw in the product, not just a feature request. Do we not keep that sort of thing around anymore?
Oh, I missed that facet. I thought this was just a long-fixed replication bug. Sorry and thanks for complaining. Lots of issues to look into.
Update from @tbg: with the recent changes to allow atomic replication changes, we're moving closer to resolving this issue. However:
A casual search also reveals related issue #39415, which we need to solve (maybe via PR #39497 from @nvanbenschoten) before we can consider multi-store nodes safe for use. (cc @johnrk as this is an important long-standing issue which we're aiming to solve soon-ish)
@lunevalex this is the issue I was thinking about. This has been unblocked by #12768. Most of the work here will be in the
I would start here by trying out a smoke test that proves that #12768 really fundamentally does allow the lateral rebalance. Concretely, I think this would be a test with boilerplate similar to
except that we want to start only a single server, with two stores: cockroach/pkg/base/test_server_args.go Line 66 in 7b7fcb0
and then

```go
desc = runChange(desc, []roachpb.ReplicationChange{
	{ChangeType: roachpb.ADD_REPLICA, Target: roachpb.ReplicationTarget{NodeID: 1, StoreID: 2}},
	{ChangeType: roachpb.REMOVE_REPLICA, Target: roachpb.ReplicationTarget{NodeID: 1, StoreID: 1}},
})
```

This should work (after removing some checks that probably refuse this for legacy reasons). Assuming this is in the bag, we can turn our attention to the replicate queue. Rebalances generally start out here: cockroach/pkg/kv/kvserver/replicate_queue.go Line 412 in 76c63de
We should ultimately be able to end up here with replication targets similar to the test: cockroach/pkg/kv/kvserver/allocator.go Lines 748 to 755 in f291061
I browsed through the code and the only code in the allocator that rules out lateral rebalances now is here: cockroach/pkg/kv/kvserver/allocator_scorer.go Lines 613 to 621 in 4cf5a97
We can't just remove that - we want to allow a second replica on the same node only if we are in the context of a lateral move (we would not want to add a second replica to the node without atomically removing the first one). This will need to be established and unit tested.

Finally, we should have some sort of high-level validation that this works end-to-end. For example, we could start a three-node cluster on which the first node has two stores and ensure that we organically see replicas move to that store (which means there must have been lateral rebalances).

I'll point out that there are likely deficiencies in how the system tries to balance the load. For example, if n1 has 100 stores and n2 and n3 have one store each, then n1 will effectively end up with 100/102 of the data (I think), which may not be what we want (n1 might be CPU bound long before that). But for mostly symmetric configurations, hopefully it would do the right thing.

While we're looking at that code, we should document as much as possible the expected behavior and its strengths. I believe we have some simulated rebalancing unit tests; we should update those.
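The lateral move described above can be sketched with stand-in types. This is a minimal, self-contained illustration of why the add and remove must be atomic - none of these names are CockroachDB's real API (`ReplicaDesc`, `applyChanges`, etc. are invented for the sketch):

```go
package main

import "fmt"

// ReplicaDesc is a stand-in for a replica's placement.
type ReplicaDesc struct{ NodeID, StoreID int }

// RangeDesc is a stand-in for a range descriptor.
type RangeDesc struct{ Replicas []ReplicaDesc }

type ChangeType int

const (
	AddReplica ChangeType = iota
	RemoveReplica
)

type ReplicationChange struct {
	ChangeType ChangeType
	Target     ReplicaDesc
}

// applyChanges applies all changes as one step: both the ADD and the
// REMOVE take effect together, so the range never ends the change with
// two replicas pinned to the same node.
func applyChanges(desc RangeDesc, chgs []ReplicationChange) RangeDesc {
	out := desc
	for _, c := range chgs {
		switch c.ChangeType {
		case AddReplica:
			out.Replicas = append(out.Replicas, c.Target)
		case RemoveReplica:
			kept := out.Replicas[:0:0]
			for _, r := range out.Replicas {
				if r != c.Target {
					kept = append(kept, r)
				}
			}
			out.Replicas = kept
		}
	}
	return out
}

func main() {
	desc := RangeDesc{Replicas: []ReplicaDesc{{1, 1}, {2, 3}, {3, 5}}}
	// Lateral move on n1: add (n1, s2) and remove (n1, s1) atomically.
	desc = applyChanges(desc, []ReplicationChange{
		{AddReplica, ReplicaDesc{1, 2}},
		{RemoveReplica, ReplicaDesc{1, 1}},
	})
	fmt.Println(desc.Replicas) // replica count stays 3; n1's replica now lives on s2
}
```

If the two changes were issued as separate steps instead, the intermediate state would briefly hold two replicas on n1, which is exactly what the allocator guard rails discussed above are meant to forbid.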
Closes cockroachdb#6782

This change modifies the replica_queue to allow rebalances between multiple stores within a single node. To make this correct we add a number of guard rails into the Allocator to prevent a rebalance from placing multiple replicas of a range on the same node. This is not desired because in a simple 3x replication scenario a single node crash may result in a whole range becoming unavailable. Unfortunately these changes are not enough, and in the naive scenario the allocator ends up in an endless rebalance loop between the two stores on the same node. An additional set of changes was necessary in the allocator heuristics to better detect when the stores on a single node are balanced and to stop attempting to move ranges around.

Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.
Closes cockroachdb#6782

This change modifies the replica_queue to allow rebalances between multiple stores within a single node. To make this correct we add a number of guard rails into the Allocator to prevent a rebalance from placing multiple replicas of a range on the same node. This is not desired because in a simple 3x replication scenario a single node crash may result in a whole range becoming unavailable. Unfortunately these changes are not enough, and in the naive scenario the allocator ends up in an endless rebalance loop between the two stores on the same node.

This change leverages the existing allocator heuristics to accomplish this goal, specifically balance_score and diversity_score. The balance_score is used to compute the right balance of replicas per node, so anytime we compare stores we factor the range count by the number of stores on a node. This allows the balance_score to be used across a heterogeneous cockroach topology, where each node may have a different number of stores on it. To prevent replicas ending up on the same node, we extend the failure domain definition to include the node and leverage the locality feature to add the node as the last locality tier.

Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.
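One plausible reading of "factor the range count by the number of stores on a node" is to normalize a node's load by its store count before comparing, so nodes with different store counts can be weighed on equal footing. A minimal sketch of that idea (the function name is invented; this is not the actual allocator code):

```go
package main

import "fmt"

// normalizedRangeCount divides a node's total range count by the number
// of stores on that node, yielding a per-store load figure that is
// comparable across nodes with different store counts. This sketches one
// reading of the balance_score normalization, not the real heuristic.
func normalizedRangeCount(rangeCount, storesOnNode int) float64 {
	return float64(rangeCount) / float64(storesOnNode)
}

func main() {
	// n1 has 2 stores holding 60 ranges in total; n2 has 1 store holding
	// 30 ranges. Raw per-node totals (60 vs 30) look imbalanced, but the
	// normalized figures show both nodes equally loaded per store.
	fmt.Println(normalizedRangeCount(60, 2)) // 30
	fmt.Println(normalizedRangeCount(30, 1)) // 30
}
```

Under this kind of normalization, a heterogeneous topology settles when every store carries a similar share, rather than when every node does.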
Closes cockroachdb#6782

This change modifies the replica_queue to allow rebalances between multiple stores within a single node. This is possible because we previously introduced atomic rebalances in cockroachdb#12768.

The first step was to remove the constraints in the allocator that prevented same-node rebalances and update the validation in the replica_queue to accept these rebalance proposals. There is one caveat that with 1x replication an atomic rebalance is not possible, so we now support adding multiple replicas of the range to the same node under this condition.

With the constraints removed, there would be nothing in the allocator to prevent it from placing multiple replicas of a range on the same node across multiple stores. This is not desired because in a simple 3x replication scenario a single node crash may result in a whole range becoming unavailable. The allocator uses locality tags to model failure domains, but a node was not considered to be a locality. It is thus natural to extend the failure domain definition to the node and model it as a locality tier. Now stores on the same node are factored into the diversity_score and repel each other, just like nodes in the same datacenter do in a multi-region setup.

Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.
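Modeling the node as the last locality tier can be sketched with a toy diversity function. This mirrors the shape of the heuristic described above - replicas whose localities diverge higher in the hierarchy score as more diverse - but it is a simplified, self-contained illustration, not CockroachDB's real implementation (types and names are invented, and it assumes both replicas list the same tiers in the same order):

```go
package main

import "fmt"

// Tier is one level of a locality hierarchy, e.g. region > zone > node.
type Tier struct{ Key, Value string }

// diversityScore returns the fraction of locality tiers at which two
// replicas diverge: 1.0 means they differ at the topmost tier, 0 means
// their localities are identical all the way down to the node.
func diversityScore(a, b []Tier) float64 {
	n := len(a)
	for i := range a {
		if a[i].Value != b[i].Value {
			return float64(n-i) / float64(n)
		}
	}
	return 0
}

func main() {
	// With the node appended as the last tier, two stores on the same
	// node share every tier: their diversity score is 0, so the allocator
	// avoids placing two replicas of a range across them.
	s1 := []Tier{{"region", "us-east"}, {"zone", "b"}, {"node", "n1"}}
	s2 := []Tier{{"region", "us-east"}, {"zone", "b"}, {"node", "n1"}}
	s3 := []Tier{{"region", "us-east"}, {"zone", "b"}, {"node", "n2"}}
	fmt.Println(diversityScore(s1, s2)) // 0: same node, no diversity
	fmt.Println(diversityScore(s1, s3)) // positive: different nodes in the same zone
}
```

This is the same mechanism that keeps replicas spread across datacenters in a multi-region deployment; the change just extends the hierarchy one level deeper.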
50053: Script for the PublishRelease TC build configuration r=pbardea a=jlinder

Before: the script wasn't implemented. Now: part of the new release process, this script
- tags the selected SHA to the provided name
- compiles the binaries and archive and uploads them to S3 as the versioned name
- uploads the docker image to docker.io/cockroachdb/cockroach
- pushes the tag to github.com/cockroachdb/cockroach
- pushes all the artifacts to their respective `latest` locations as appropriate

Release note: None

51567: kvserver: Allow rebalances between stores on the same nodes. r=lunevalex a=lunevalex

Closes #6782

This change modifies the replica_queue to allow rebalances between multiple stores within a single node. This is possible because we previously introduced atomic rebalances in #12768. The first step was to remove the constraints in the allocator that prevented same-node rebalances and update the validation in the replica_queue to accept these rebalance proposals. There is one caveat that with 1x replication an atomic rebalance is not possible, so we now support adding multiple replicas of the range to the same node under this condition. With the constraints removed, there would be nothing in the allocator to prevent it from placing multiple replicas of a range on the same node across multiple stores. This is not desired because in a simple 3x replication scenario a single node crash may result in a whole range becoming unavailable. The allocator uses locality tags to model failure domains, but a node was not considered to be a locality. It is thus natural to extend the failure domain definition to the node and model it as a locality tier. Now stores on the same node are factored into the diversity_score and repel each other, just like nodes in the same datacenter do in a multi-region setup.

Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.

52754: importccl: speed up revert of IMPORT INTO empty table r=dt a=dt

When IMPORT INTO fails, it reverts the tables to their pre-IMPORT state. Typically this requires running a somewhat expensive RevertRange operation that finds the keys written by the IMPORT in amongst all the table data and deletes just those keys. This is somewhat expensive -- we need to iterate the keys in the target table and check them to see if they need to be reverted. Non-INTO style IMPORTs create the table into which they will IMPORT and thus can just drop it wholesale on failure, instead of doing this expensive revert. However, INTO-style IMPORTs could use a similarly fast/cheap wholesale delete *if they knew the table was empty* when the IMPORT was started. This change tracks which tables were empty when the IMPORT started and then deletes, rather than reverts, the table span on failure.

Release note (performance improvement): Cleaning up after a failure during IMPORT INTO a table which was empty is now faster.

53023: opt: add index acceleration support for ~ and && bounding box operators r=rytaft a=rytaft

This commit adds index acceleration support for the bounding box comparison operators, `~` and `&&`. It maps `~` to Covers and `&&` to Intersects.

Release note (performance improvement): The ~ and && geospatial bounding box operations can now benefit from index acceleration if one of the operands is an indexed geometry column.

53049: bulkio: Fix transaction semantics in job scheduler. r=miretskiy a=miretskiy

Fixes #53033
Fixes #52959

Use a transaction when querying for the schedules to run. In addition, ensure that a single bad schedule does not cause all of the previous work to be wasted by using transaction savepoints.

Release Notes: None

53132: sql/opt: add implicit SELECT FOR UPDATE support for UPSERT statements r=nvanbenschoten a=nvanbenschoten

Fixes #50180.

This commit adds implicit SELECT FOR UPDATE support for UPSERT statements with a VALUES clause. This should improve throughput and latency for contended UPSERT statements in much the same way that 435fa43 did for UPDATE statements. However, this only has an effect on UPSERT statements into tables with multiple indexes, because UPSERT statements into single-index tables hit a fast-path where they perform a blind write without doing an initial row scan. Conceptually, if we picture an UPSERT statement as the composition of a SELECT statement and an INSERT statement (with loosened semantics around existing rows), then this change performs the following transformation:

```
UPSERT t = SELECT FROM t + INSERT INTO t
=>
UPSERT t = SELECT FROM t FOR UPDATE + INSERT INTO t
```

I plan to test this out on a contended `indexes` workload at some point in the future.

Release note (sql change): UPSERT statements now acquire locks using the FOR UPDATE locking mode during their initial row scan, which improves performance for contended workloads when UPSERTing into tables with multiple indexes. This behavior is configurable using the enable_implicit_select_for_update session variable and the sql.defaults.implicit_select_for_update.enabled cluster setting.

Co-authored-by: James H. Linder <[email protected]>
Co-authored-by: Alex Lunev <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Rebecca Taft <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
When running a cluster where nodes have multiple stores, one store often seems to get a disproportionate number of replicas allocated to it.
Here's an example cluster with 5 nodes, each with 5 stores:
This is happening on 3 of 5 nodes in this cluster. We currently do not rebalance between multiple stores in the same node, but we may want to consider it for performance reasons. For example, maybe each store is backed by a different SSD. Spreading our I/O evenly across stores could benefit performance.