[Segment Replication] Update shard allocation to evenly distribute primaries. #5240
Looking into it.

Tried a quick exercise to see the existing shard allocation behavior: created 6 indices with SEGMENT as the replication type. All shards were distributed evenly across nodes, but primaries alone were not. Rebalancing here is governed by the existing EnableAllocationDecider, yet the primary shard allocation on nodes remains uneven even after enabling rebalancing based on primaries. I retried the exercise with primary rebalancing enabled and it did not change the result.
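For reference, a minimal sketch of restricting rebalancing to primaries via the standard `cluster.routing.rebalance.enable` setting; the commented-out client call is indicative only, not the exact commands used in the exercise above.

```java
import org.opensearch.common.settings.Settings;

public class EnablePrimaryRebalance {
    public static void main(String[] args) {
        // Valid values for this setting: all (default), primaries, replicas, none.
        Settings transientSettings = Settings.builder()
            .put("cluster.routing.rebalance.enable", "primaries")
            .build();

        // Typically applied through the cluster settings update API, e.g.:
        // client.admin().cluster().prepareUpdateSettings()
        //       .setTransientSettings(transientSettings)
        //       .get();
        System.out.println(transientSettings.get("cluster.routing.rebalance.enable"));
    }
}
```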
There are a couple of approaches to solve this:

1. Handle segment-replication-enabled shards differently: add new parameters in WeightFunction.

Pros:

Cons:
2. Change allocation logic for all types of shards: change the existing WeightFunction for all shard types. The algorithm introduces another balance factor.

Pros:
Cons:
[Edit]: Based on the above, approach 2 seems promising as it is simpler and does not handle SegRep indices separately, thus avoiding the combination of two different balancing algorithms. This approach can be made less intrusive by not enabling the new algorithm by default.
I think we need to associate default weights per shard based on various criteria, e.g. primary/replica or remote/local, such that these weights are representative of the multi-dimensional work (compute/memory/IO/disk) they do relative to one another. This will ensure we are able to tune these vectors dynamically based on heat as well (long term), once indices turn read-only or request distribution shifts across shards under custom routing logic.
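As a thought experiment only (none of these types exist in OpenSearch today), a minimal sketch of what per-shard multi-dimensional weights could look like: each shard carries a weight vector over compute/memory/IO/disk, and a node's scalar load is a tunable-coefficient combination that a balancer could compare against the cluster average.

```java
import java.util.List;

// Hypothetical types illustrating per-shard, multi-dimensional weights.
public class ShardWeightSketch {

    /** Relative cost of a shard along each resource dimension. */
    record ShardWeight(float compute, float memory, float io, float disk) {}

    /** Tunable coefficients, adjustable dynamically as "heat" shifts. */
    record Coefficients(float compute, float memory, float io, float disk) {}

    /** Scalar load of one shard under the current coefficients. */
    static float load(ShardWeight w, Coefficients c) {
        return c.compute() * w.compute()
             + c.memory()  * w.memory()
             + c.io()      * w.io()
             + c.disk()    * w.disk();
    }

    /** A node's load is the sum over its shards; a balancer would compare
     *  this against the cluster-wide average instead of raw shard counts. */
    static float nodeLoad(List<ShardWeight> shards, Coefficients c) {
        float total = 0f;
        for (ShardWeight w : shards) {
            total += load(w, c);
        }
        return total;
    }

    public static void main(String[] args) {
        Coefficients c = new Coefficients(1.0f, 0.5f, 0.75f, 0.25f);
        // Illustrative values: a segrep primary might weigh heavier on
        // compute/IO than its replica, which mostly copies segments.
        ShardWeight primary = new ShardWeight(1.0f, 1.0f, 1.0f, 1.0f);
        ShardWeight replica = new ShardWeight(0.3f, 0.8f, 0.5f, 1.0f);
        System.out.println(nodeLoad(List.of(primary, replica), c));
    }
}
```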
Thanks @Bukhtawar for the feedback and the suggestion. This is definitely useful in the larger scheme of things and something we may need to implement. It will need more thought, discussion, and probably an overhaul of the existing allocation. I am thinking of starting with updating the existing WeightFunction.
Requirement

Ensure even distribution of same-index primaries (and replicas). Replicas also need even distribution because primaries for an index may do varying amounts of work (hot shards), and so do the corresponding replicas.

Goals

As part of phase 1, targeting the below goals for this task.
Assumption

Balancing primary shards evenly for docrep indices does not impact cluster performance.

Approach

As the overall goal is to have uniform resource utilization among nodes containing various types of shards, we need to use the same scale to measure different shard types; separate allocation logic per shard type, working in isolation, will NOT work. I think it makes sense to update the existing WeightFunction to factor in primary shards, which solves the use case here and is simpler to start with. This will be the initial work to evolve the existing weighing function and incorporate different factors, which might need an allocation logic overhaul. As discussed here, proceeding with approach 2, which introduces a new primary shard balance factor.

Future Improvement

The weight function can be updated to cater to future needs, where different shards have a built-in weight against an attribute (or weighing factor). E.g. a primary with more replicas (higher fan-out) should have a higher weight compared to other primaries, so that the weight function prioritizes even distribution of these primaries FIRST.

POC

Tried a POC which introduces a new setting to balance shards based on primary shard count, with a corresponding update to WeightFunction. [Edit]: This can still result in skewness on segrep shards, i.e. non-balanced primary segrep shards on a node even though the overall primary shard balance is within the threshold. Thanks @mch2 for pointing this out. There are two already existing balance factors; this POC adds a third:
```java
float weight(ShardsBalancer balancer, ModelNode node, String index) {
    final float weightShard = node.numShards() - balancer.avgShardsPerNode();
    final float weightIndex = node.numShards(index) - balancer.avgShardsPerNode(index);
    // New primary balance factor introduced by the POC.
    final float primaryWeightShard = node.numPrimaryShards() - balancer.avgPrimaryShardsPerNode();
    return theta0 * weightShard + theta1 * weightIndex + theta2 * primaryWeightShard;
}
```

There is no single value for PRIMARY_BALANCE_FACTOR_SETTING that distributes shards satisfying all three balance factors for every possible cluster configuration; this is true even today with only INDEX_BALANCE_FACTOR_SETTING and SHARD_BALANCE_FACTOR_SETTING. A higher value for the primary shard balance factor may result in primary balance but not necessarily an equal number of shards per node. This probably needs some analysis of shard distribution for different constant weight factors. Added a subtask in the description.

@nknize @Bukhtawar @kotwanikunal @mch2: requesting feedback.
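To make the relationship between the balance-factor settings and the theta coefficients concrete, here is a minimal sketch assuming the thetas are normalized over the sum of the configured factors, as the existing WeightFunction does for the index and shard factors; the primary factor value shown is purely illustrative, since its sane default is exactly the open question.

```java
// Sketch: deriving theta0/theta1/theta2 from the balance-factor settings.
public class ThetaSketch {
    public static void main(String[] args) {
        // Defaults of the two existing settings; the primary value is illustrative.
        float shardBalance = 0.45f;   // cluster.routing.allocation.balance.shard
        float indexBalance = 0.55f;   // cluster.routing.allocation.balance.index
        float primaryBalance = 0.40f; // hypothetical primary balance factor

        float sum = shardBalance + indexBalance + primaryBalance;
        if (sum <= 0.0f) {
            throw new IllegalArgumentException("Balance factors must sum to a positive value");
        }
        float theta0 = shardBalance / sum;   // weight of per-node shard count
        float theta1 = indexBalance / sum;   // weight of per-index shard count
        float theta2 = primaryBalance / sum; // weight of per-node primary count

        System.out.printf("theta0=%.3f theta1=%.3f theta2=%.3f%n", theta0, theta1, theta2);
    }
}
```

Raising the primary factor shifts weight away from the other two terms, which is why a value that achieves primary balance may simultaneously loosen the equal-shards-per-node guarantee.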
For subtask 2 mentioned here: the previous change in #6017 introduces a primary shard balance factor but doesn't differentiate between shard types (docrep vs segrep), with the end result of an overall balanced primary shard distribution but not necessarily a balanced one per shard type. For example, if node removal results in 4 unassigned primary shards (2 docrep and 2 segrep), there is a chance of one node getting both docrep shards while the other gets both segrep shards (illustrated in the sketch below). Changing this logic to accommodate segrep indices is not required because:
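As a concrete illustration of the example above, a small standalone sketch with hypothetical counts: each node ends up with two primaries in total, so the overall primary balance factor is satisfied, while all segrep primaries sit on a single node.

```java
import java.util.Map;

// After 2 docrep + 2 segrep primaries are reassigned, each node holds
// 2 primaries in total (balanced overall), yet all segrep primaries
// land on node2 (per-type skew).
public class PerTypeSkewSketch {
    public static void main(String[] args) {
        // node -> (replication type -> primary count); values are illustrative.
        Map<String, Map<String, Integer>> primaries = Map.of(
            "node1", Map.of("DOCUMENT", 2, "SEGMENT", 0),
            "node2", Map.of("DOCUMENT", 0, "SEGMENT", 2)
        );

        primaries.forEach((node, byType) -> {
            int total = byType.values().stream().mapToInt(Integer::intValue).sum();
            System.out.printf("%s: total=%d (balanced) docrep=%d segrep=%d (skewed)%n",
                node, total, byType.get("DOCUMENT"), byType.get("SEGMENT"));
        });
    }
}
```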
For logging purposes, added an unsuccessful test to mimic the scenario where the LocalShardsBalancer allocation logic is applied here.
Tracking the remaining work in #6210 for benchmarking, guidance on the default value, and a single sane default value (if possible). Closing this issue.
With segment replication, primary shards will use more node resources than replicas. While this is still a net reduction compared to docrep, it will lead to uneven utilization of resources across the cluster.
We need to explore updating allocation with segrep enabled to evenly balance primary shards across a cluster.