Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a guardrail to limit maximum number of shard on the cluster #6143

Merged

Conversation

RS146BIJAY
Copy link
Contributor

@RS146BIJAY RS146BIJAY commented Feb 1, 2023

Signed-off-by: Rishav Sagar [email protected]

Description

We have observed that as shard count on the cluster increases, it becomes more prone to instability and availability. We have observed multiple instances where user had issues in the past because of too many shards on their cluster.

As of now, OpenSearch allows us to restrict the number of shards per node using setting cluster.max_shards_per_node inside ShardLimitValidator. The maximum number of shards that can be present on the cluster is determined by maxShardsPerNode * nodeCount. Issue here is, we can increase this total shard limit on the cluster just by increasing the number of node. There is no way we can control the maximum number of shards on an entire cluster.

I would suggest to add a similar setting inside ShardLimitValidator which will control the maximum number of shards that can be present on a cluster. If the total number of shards exceeds this threshold, ShardLimitValidator will block any operation which creates new shards. The threshold value will be an optional configurable cluster setting, which user can set dynamically.

Issues Resolved

#6050

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@RS146BIJAY RS146BIJAY force-pushed the cluster-shard-limit-guardrail branch from 5fcf038 to 0ee3948 Compare February 1, 2023 20:54
@github-actions
Copy link
Contributor

github-actions bot commented Feb 1, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Feb 1, 2023

Gradle Check (Jenkins) Run Completed with:

@RS146BIJAY RS146BIJAY force-pushed the cluster-shard-limit-guardrail branch 2 times, most recently from 0d89de4 to ad53c45 Compare February 1, 2023 21:09
@github-actions
Copy link
Contributor

github-actions bot commented Feb 1, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Feb 1, 2023

Gradle Check (Jenkins) Run Completed with:

Copy link
Collaborator

@Bukhtawar Bukhtawar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, ensure we handle min case of node and cluster limit

@RS146BIJAY RS146BIJAY force-pushed the cluster-shard-limit-guardrail branch from ad53c45 to 9e5a63e Compare February 2, 2023 09:36
@github-actions
Copy link
Contributor

github-actions bot commented Feb 2, 2023

Gradle Check (Jenkins) Run Completed with:

@codecov-commenter
Copy link

codecov-commenter commented Feb 2, 2023

Codecov Report

Merging #6143 (b1397cf) into main (c5a1bdf) will decrease coverage by 0.01%.
The diff coverage is 87.09%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@             Coverage Diff              @@
##               main    #6143      +/-   ##
============================================
- Coverage     70.87%   70.86%   -0.01%     
+ Complexity    58830    58823       -7     
============================================
  Files          4776     4776              
  Lines        280993   281020      +27     
  Branches      40598    40599       +1     
============================================
  Hits         199141   199141              
- Misses        65488    65557      +69     
+ Partials      16364    16322      -42     
Impacted Files Coverage Δ
...rg/opensearch/common/settings/ClusterSettings.java 91.89% <ø> (ø)
...va/org/opensearch/indices/ShardLimitValidator.java 92.68% <87.09%> (-3.69%) ⬇️
.../index/shard/IndexShardNotRecoveringException.java 0.00% <0.00%> (-100.00%) ⬇️
...adonly/AddIndexBlockClusterStateUpdateRequest.java 0.00% <0.00%> (-75.00%) ⬇️
...readonly/TransportVerifyShardIndexBlockAction.java 9.75% <0.00%> (-58.54%) ⬇️
.../admin/cluster/reroute/ClusterRerouteResponse.java 60.00% <0.00%> (-40.00%) ⬇️
...h/action/ingest/SimulateDocumentVerboseResult.java 60.71% <0.00%> (-39.29%) ⬇️
...indices/readonly/TransportAddIndexBlockAction.java 20.68% <0.00%> (-37.94%) ⬇️
...luster/routing/allocation/RoutingExplanations.java 62.06% <0.00%> (-37.94%) ⬇️
...cluster/routing/allocation/RerouteExplanation.java 65.00% <0.00%> (-35.00%) ⬇️
... and 453 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@RS146BIJAY RS146BIJAY force-pushed the cluster-shard-limit-guardrail branch from 9e5a63e to 367840e Compare February 2, 2023 15:25
Comment on lines 264 to 268
final String inValidMaxShardSettingErrorMessage = "Max shard per cluster threshold "
+ maxShardsInCluster
+ "should be greater than or equal to max shard per node "
+ maxShardsPerNode;
return Optional.of(inValidMaxShardSettingErrorMessage);
Copy link
Member

@ashking94 ashking94 Feb 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious on how would this behave since the node count is dynamic. Lets says initially max shard per node is 100 and there are 10 nodes with max shard per cluster as 1010. Now another node is added to the cluster in which case 1010 (max shard per cluster) is lesser than 1000*11. May be we can just put a condition as mentioned in this comment and remove this condition here? wdyt @sachinpkale @Bukhtawar @reta?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and maxShardsInCluster = Math.min(maxShardsInCluster, maxShardsPerNode * nodeCount) can stay as a single else block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed the comment.

@github-actions
Copy link
Contributor

github-actions bot commented Feb 2, 2023

Gradle Check (Jenkins) Run Completed with:

@RS146BIJAY RS146BIJAY force-pushed the cluster-shard-limit-guardrail branch from 367840e to dd5e23d Compare February 3, 2023 11:11
@github-actions
Copy link
Contributor

github-actions bot commented Feb 3, 2023

Gradle Check (Jenkins) Run Completed with:

@RS146BIJAY RS146BIJAY force-pushed the cluster-shard-limit-guardrail branch from dd5e23d to b1397cf Compare February 3, 2023 12:25
@github-actions
Copy link
Contributor

github-actions bot commented Feb 3, 2023

Gradle Check (Jenkins) Run Completed with:

Copy link
Collaborator

@Bukhtawar Bukhtawar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Bukhtawar Bukhtawar merged commit e42b76f into opensearch-project:main Feb 5, 2023
@Bukhtawar Bukhtawar added the backport 2.x Backport to 2.x branch label Feb 5, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Feb 5, 2023
Signed-off-by: Rishav Sagar <[email protected]>
(cherry picked from commit e42b76f)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
kotwanikunal pushed a commit that referenced this pull request Feb 6, 2023
Signed-off-by: Rishav Sagar <[email protected]>
(cherry picked from commit e42b76f)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
dreamer-89 pushed a commit that referenced this pull request Feb 6, 2023
(cherry picked from commit e42b76f)

Signed-off-by: Rishav Sagar <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants