Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prioritize primary shard movement during shard relocation #1445

Merged
merged 12 commits into from
Jan 20, 2022

Conversation

jainankitk
Copy link
Collaborator

@jainankitk jainankitk commented Oct 26, 2021

Description

The primary shards are always picked up first from node for shard movement. That is achieved by bucketing the shards into primary/replicas and iterating over primaries first.

Issues Resolved

#1349

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@opensearch-ci-bot
Copy link
Collaborator

✅   Gradle Wrapper Validation success d85c55c

@opensearch-ci-bot
Copy link
Collaborator

Can one of the admins verify this patch?

@opensearch-ci-bot
Copy link
Collaborator

✅   DCO Check Passed d85c55c

@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Precommit failure d85c55c
Log 1430

@opensearch-ci-bot
Copy link
Collaborator

✅   DCO Check Passed 9f5b306

@opensearch-ci-bot
Copy link
Collaborator

✅   Gradle Wrapper Validation success 9f5b306

@opensearch-ci-bot
Copy link
Collaborator

✅   Gradle Precommit success 9f5b306

@nknize nknize self-requested a review October 28, 2021 15:34
Copy link
Collaborator

@nknize nknize left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things missing here:

  1. What problem is this solving? The linked issue doesn't explain the problem.
  2. The PR is missing tests. This is a change to code in the critical path. Unit and integration tests should be included.
  3. Are there benchmarks that show this isn't introducing undesired regressions for common cases? Are there benchmarks that show this is improving performance for certain use cases?
  4. Is this an enhancement or new feature?
  5. What version is this targeting? 2.0? 1.2? 1.3?

@nknize nknize added the feedback needed Issue or PR needs feedback label Oct 28, 2021
@jainankitk
Copy link
Collaborator Author

jainankitk commented Oct 28, 2021

Thank you @nknize for initial review.

What problem is this solving? The linked issue doesn't explain the problem.

Let us say we exclude some set of nodes (based on some attribute) from cluster setting, current implementation of BalancedShardsAllocator iterates over them in breadth first over nodes picking 1 shard from each node and repeating the process. The shards from each node are picked randomly. It could happen that we relocate the p and r of shard1 first leaving behind both p and r of shard2. If the excluded nodes were to go down the cluster becomes red. Instead if we prioritize the p of both shard1 and shard2 first, cluster does not become red if the excluded nodes were to go down before relocating other shards

The PR is missing tests. This is a change to code in the critical path. Unit and integration tests should be included.

While existing UTs give sufficient coverage, I am planning to add more unit and integration tests for testing this specific functionality

Are there benchmarks that show this isn't introducing undesired regressions for common cases? Are there benchmarks that show this is improving performance for certain use cases?

I don't expect this to improve performance as the change is looking to improve robustness. There should not be any or regression as the implementation clearly abstracts out functionality

Is this an enhancement or new feature?

This is an enhancement to improve robustness.

What version is this targeting? 2.0? 1.2? 1.3?

I am looking to target this for 1.3

@jainankitk
Copy link
Collaborator Author

jainankitk commented Nov 2, 2021

Benchmark results for shard relocation:

benchmark                                       indices         shards          replica         source nodes    target nodes    conc recovery   without change          with change             percentage diff    
measureExclusionOnZoneAwareStartedShard         10              2               0               1               1               1               2.739 ± 0.078 ms/op     2.782 ± 0.082 ms/op     1.57
measureExclusionOnZoneAwareStartedShard         10              3               0               1               1               2               2.701 ± 0.080 ms/op     2.689 ± 0.085 ms/op     -0.44
measureExclusionOnZoneAwareStartedShard         10              10              0               1               1               5               2.783 ± 0.082 ms/op     2.800 ± 0.076 ms/op     0.61
measureExclusionOnZoneAwareStartedShard         100             1               0               1               1               10              2.946 ± 0.086 ms/op     2.961 ± 0.075 ms/op     0.51
measureExclusionOnZoneAwareStartedShard         100             3               0               1               1               10              3.139 ± 0.116 ms/op     3.128 ± 0.088 ms/op     -0.35
measureExclusionOnZoneAwareStartedShard         100             10              0               1               1               10              3.996 ± 0.104 ms/op     4.044 ± 0.112 ms/op     1.20
measureExclusionOnZoneAwareStartedShard         10              2               0               10              10              1               2.802 ± 0.071 ms/op     2.790 ± 0.072 ms/op     -0.43
measureExclusionOnZoneAwareStartedShard         10              3               0               10              5               2               2.818 ± 0.086 ms/op     2.804 ± 0.082 ms/op     -0.50
measureExclusionOnZoneAwareStartedShard         10              10              0               10              5               5               3.007 ± 0.083 ms/op     3.024 ± 0.107 ms/op     0.57
measureExclusionOnZoneAwareStartedShard         100             1               0               5               10              5               3.104 ± 0.080 ms/op     3.095 ± 0.076 ms/op     -0.29
measureExclusionOnZoneAwareStartedShard         100             3               0               10              5               5               3.332 ± 0.090 ms/op     3.416 ± 0.100 ms/op     2.52
measureExclusionOnZoneAwareStartedShard         100             10              0               10              20              6               4.843 ± 0.146 ms/op     4.947 ± 0.150 ms/op     2.15
measureExclusionOnZoneAwareStartedShard         10              1               1               10              10              1               2.831 ± 0.075 ms/op     2.825 ± 0.072 ms/op     -0.21
measureExclusionOnZoneAwareStartedShard         10              3               1               10              3               3               2.807 ± 0.069 ms/op     2.823 ± 0.076 ms/op     0.57
measureExclusionOnZoneAwareStartedShard         10              10              1               5               12              5               3.192 ± 0.103 ms/op     3.140 ± 0.074 ms/op     -1.63
measureExclusionOnZoneAwareStartedShard         100             1               1               10              10              6               3.708 ± 0.085 ms/op     3.620 ± 0.084 ms/op     -2.37
measureExclusionOnZoneAwareStartedShard         100             3               1               10              5               8               3.927 ± 0.107 ms/op     3.976 ± 0.121 ms/op     1.25
measureExclusionOnZoneAwareStartedShard         100             10              1               8               17              8               5.873 ± 0.137 ms/op     6.216 ± 0.189 ms/op     5.84
measureExclusionOnZoneAwareStartedShard         10              1               2               10              10              1               2.891 ± 0.079 ms/op     2.824 ± 0.071 ms/op     -2.32
measureExclusionOnZoneAwareStartedShard         10              3               2               10              5               3               3.039 ± 0.077 ms/op     3.042 ± 0.089 ms/op     0.10
measureExclusionOnZoneAwareStartedShard         10              10              2               5               10              5               3.657 ± 0.103 ms/op     3.673 ± 0.150 ms/op     0.44
measureExclusionOnZoneAwareStartedShard         100             1               2               10              8               7               4.465 ± 0.107 ms/op     4.581 ± 0.121 ms/op     2.60
measureExclusionOnZoneAwareStartedShard         100             3               2               13              17              5               5.197 ± 0.180 ms/op     5.330 ± 0.180 ms/op     2.56
measureExclusionOnZoneAwareStartedShard         100             10              2               10              20              8               8.709 ± 0.275 ms/op     8.737 ± 0.288 ms/op     0.32
measureExclusionOnZoneAwareStartedShard         10              2               1               20              20              1               3.026 ± 0.074 ms/op     2.995 ± 0.069 ms/op     -1.02
measureExclusionOnZoneAwareStartedShard         10              3               1               20              30              1               3.084 ± 0.093 ms/op     3.086 ± 0.082 ms/op     0.06
measureExclusionOnZoneAwareStartedShard         10              10              1               20              10              3               3.244 ± 0.072 ms/op     3.280 ± 0.089 ms/op     1.11
measureExclusionOnZoneAwareStartedShard         100             1               1               20              5               5               3.306 ± 0.102 ms/op     3.350 ± 0.162 ms/op     1.33
measureExclusionOnZoneAwareStartedShard         100             3               1               20              23              6               5.350 ± 0.142 ms/op     5.380 ± 0.162 ms/op     0.56
measureExclusionOnZoneAwareStartedShard         100             10              1               40              20              8               8.227 ± 0.315 ms/op     8.410 ± 0.294 ms/op     2.22
measureExclusionOnZoneAwareStartedShard         10              3               2               50              30              1               3.632 ± 0.108 ms/op     3.552 ± 0.119 ms/op     -2.20
measureExclusionOnZoneAwareStartedShard         10              3               2               50              25              1               3.492 ± 0.096 ms/op     3.413 ± 0.101 ms/op     -2.26
measureExclusionOnZoneAwareStartedShard         10              10              1               50              33              2               4.184 ± 0.183 ms/op     4.258 ± 0.127 ms/op     1.77
measureExclusionOnZoneAwareStartedShard         100             1               1               40              50              2               4.653 ± 0.133 ms/op     4.570 ± 0.130 ms/op     -1.78
measureExclusionOnZoneAwareStartedShard         100             3               1               50              70              3               7.001 ± 0.205 ms/op     7.151 ± 0.221 ms/op     2.14
measureExclusionOnZoneAwareStartedShard         100             10              1               60              50              3               9.056 ± 0.314 ms/op     9.085 ± 0.303 ms/op     0.32
measureExclusionOnZoneAwareStartedShard         10              10              2               50              50              1               4.539 ± 0.138 ms/op     4.381 ± 0.113 ms/op     -3.48
measureExclusionOnZoneAwareStartedShard         10              3               2               50              30              1               3.611 ± 0.107 ms/op     3.526 ± 0.077 ms/op     -2.35
measureExclusionOnZoneAwareStartedShard         10              10              2               50              40              2               5.209 ± 0.165 ms/op     5.049 ± 0.153 ms/op     -3.07
measureExclusionOnZoneAwareStartedShard         100             1               2               40              50              2               5.591 ± 0.164 ms/op     5.223 ± 0.173 ms/op     -6.58
measureExclusionOnZoneAwareStartedShard         100             3               2               50              30              6               8.810 ± 0.307 ms/op     8.626 ± 0.283 ms/op     -2.09
measureExclusionOnZoneAwareStartedShard         100             10              2               33              55              6               12.814 ± 0.413 ms/op    13.018 ± 0.410 ms/op    1.59
measureExclusionOnZoneAwareStartedShard         500             60              1               100             100             12              241.278 ± 9.901 ms/op   255.860 ± 8.410 ms/op   6.04
measureExclusionOnZoneAwareStartedShard         500             60              1               100             40              12              210.525 ± 8.003 ms/op   222.254 ± 6.678 ms/op   5.57
measureExclusionOnZoneAwareStartedShard         500             60              1               40              100             12              192.286 ± 8.678 ms/op   203.785 ± 7.117 ms/op   5.98
measureExclusionOnZoneAwareStartedShard         50              60              1               100             100             6               33.909 ± 1.247 ms/op    34.418 ± 1.283 ms/op    1.50
measureExclusionOnZoneAwareStartedShard         50              60              1               100             40              6               19.245 ± 0.744 ms/op    19.199 ± 0.621 ms/op    -0.24
measureExclusionOnZoneAwareStartedShard         50              60              1               40              100             6               18.052 ± 0.677 ms/op    18.617 ± 0.735 ms/op    3.13
benchmark                               indices         shards          replica         source nodes    target nodes    conc recovery   without change          with change             percentage diff 
measureShardRelocationComplete          10              2               0               1               1               1               2.840 ± 0.084 ms/op     2.836 ± 0.085 ms/op     -0.14
measureShardRelocationComplete          10              3               0               1               1               2               2.764 ± 0.070 ms/op     2.788 ± 0.082 ms/op     0.87
measureShardRelocationComplete          10              10              0               1               1               5               3.002 ± 0.079 ms/op     3.011 ± 0.070 ms/op     0.30
measureShardRelocationComplete          100             1               0               1               1               10              3.338 ± 0.085 ms/op     3.372 ± 0.094 ms/op     1.02
measureShardRelocationComplete          100             3               0               1               1               10              3.833 ± 0.114 ms/op     3.838 ± 0.110 ms/op     0.13
measureShardRelocationComplete          100             10              0               1               1               10              5.827 ± 0.191 ms/op     6.012 ± 0.181 ms/op     3.17
measureShardRelocationComplete          10              2               0               10              10              1               2.914 ± 0.072 ms/op     2.926 ± 0.093 ms/op     0.41
measureShardRelocationComplete          10              3               0               10              5               2               2.936 ± 0.084 ms/op     2.994 ± 0.081 ms/op     1.98
measureShardRelocationComplete          10              10              0               10              5               5               3.297 ± 0.090 ms/op     3.321 ± 0.090 ms/op     0.73
measureShardRelocationComplete          100             1               0               5               10              5               3.596 ± 0.079 ms/op     3.611 ± 0.073 ms/op     0.42
measureShardRelocationComplete          100             3               0               10              5               5               4.206 ± 0.116 ms/op     4.283 ± 0.108 ms/op     1.83
measureShardRelocationComplete          100             10              0               10              20              6               7.453 ± 0.211 ms/op     7.685 ± 0.249 ms/op     3.11
measureShardRelocationComplete          10              1               1               10              10              1               2.966 ± 0.078 ms/op     2.997 ± 0.083 ms/op     1.05
measureShardRelocationComplete          10              3               1               10              3               3               2.996 ± 0.095 ms/op     3.023 ± 0.078 ms/op     0.90
measureShardRelocationComplete          10              10              1               5               12              5               3.597 ± 0.113 ms/op     3.624 ± 0.091 ms/op     0.75
measureShardRelocationComplete          100             1               1               10              10              6               4.482 ± 0.156 ms/op     4.576 ± 0.131 ms/op     2.10
measureShardRelocationComplete          100             3               1               10              5               8               5.260 ± 0.156 ms/op     5.485 ± 0.144 ms/op     4.28
measureShardRelocationComplete          100             10              1               8               17              8               10.017 ± 0.313 ms/op    10.639 ± 0.307 ms/op    6.21
measureShardRelocationComplete          10              1               2               10              10              1               3.059 ± 0.077 ms/op     3.027 ± 0.084 ms/op     -1.05
measureShardRelocationComplete          10              3               2               10              5               3               3.217 ± 0.072 ms/op     3.295 ± 0.089 ms/op     2.42
measureShardRelocationComplete          10              10              2               5               10              5               4.161 ± 0.110 ms/op     4.197 ± 0.115 ms/op     0.87
measureShardRelocationComplete          100             1               2               10              8               7               5.320 ± 0.178 ms/op     5.659 ± 0.163 ms/op     6.37
measureShardRelocationComplete          100             3               2               13              17              5               7.248 ± 0.276 ms/op     7.607 ± 0.211 ms/op     4.95
measureShardRelocationComplete          100             10              2               10              20              8               14.381 ± 0.527 ms/op    15.240 ± 0.410 ms/op    5.97
measureShardRelocationComplete          10              2               1               20              20              1               3.263 ± 0.088 ms/op     3.251 ± 0.082 ms/op     -0.37
measureShardRelocationComplete          10              3               1               20              30              1               3.333 ± 0.072 ms/op     3.340 ± 0.077 ms/op     0.21
measureShardRelocationComplete          10              10              1               20              10              3               3.716 ± 0.119 ms/op     3.764 ± 0.090 ms/op     1.29
measureShardRelocationComplete          100             1               1               20              5               5               3.858 ± 0.104 ms/op     3.990 ± 0.128 ms/op     3.42
measureShardRelocationComplete          100             3               1               20              23              6               7.026 ± 0.199 ms/op     7.357 ± 0.273 ms/op     4.71
measureShardRelocationComplete          100             10              1               40              20              8               12.798 ± 0.400 ms/op    13.682 ± 0.396 ms/op    6.91
measureShardRelocationComplete          10              3               2               50              30              1               4.012 ± 0.128 ms/op     3.930 ± 0.099 ms/op     -2.04
measureShardRelocationComplete          10              3               2               50              25              1               3.902 ± 0.147 ms/op     3.724 ± 0.092 ms/op     -4.56
measureShardRelocationComplete          10              10              1               50              33              2               4.776 ± 0.133 ms/op     4.868 ± 0.145 ms/op     1.93
measureShardRelocationComplete          100             1               1               40              50              2               5.557 ± 0.131 ms/op     5.739 ± 0.185 ms/op     3.28
measureShardRelocationComplete          100             3               1               50              70              3               8.994 ± 0.285 ms/op     9.114 ± 0.305 ms/op     1.33
measureShardRelocationComplete          100             10              1               60              50              3               13.973 ± 0.465 ms/op    14.191 ± 0.415 ms/op    1.56
measureShardRelocationComplete          10              10              2               50              50              1               5.255 ± 0.150 ms/op     5.182 ± 0.154 ms/op     -1.39
measureShardRelocationComplete          10              3               2               50              30              1               3.991 ± 0.103 ms/op     3.934 ± 0.110 ms/op     -1.43
measureShardRelocationComplete          10              10              2               50              40              2               5.965 ± 0.210 ms/op     5.915 ± 0.193 ms/op     -0.84
measureShardRelocationComplete          100             1               2               40              50              2               6.508 ± 0.223 ms/op     6.691 ± 0.189 ms/op     2.81
measureShardRelocationComplete          100             3               2               50              30              6               11.108 ± 0.412 ms/op    11.218 ± 0.262 ms/op    0.99
measureShardRelocationComplete          100             10              2               33              55              6               20.016 ± 0.746 ms/op    21.353 ± 0.765 ms/op    6.68
measureShardRelocationComplete          500             60              1               100             100             12              540.946 ± 20.389 ms/op  561.109 ± 20.277 ms/op  3.73
measureShardRelocationComplete          500             60              1               100             40              12              504.320 ± 19.405 ms/op  519.140 ± 17.360 ms/op  2.94
measureShardRelocationComplete          500             60              1               40              100             12              461.199 ± 19.855 ms/op  468.763 ± 17.183 ms/op  1.64
measureShardRelocationComplete          50              60              1               100             100             6               50.915 ± 2.835 ms/op    53.403 ± 2.160 ms/op    4.89
measureShardRelocationComplete          50              60              1               100             40              6               33.995 ± 1.205 ms/op    36.410 ± 1.409 ms/op    7.10
measureShardRelocationComplete          50              60              1               40              100             6               32.178 ± 1.052 ms/op    33.855 ± 1.202 ms/op    5.21

@jainankitk
Copy link
Collaborator Author

jainankitk commented Nov 2, 2021

@nknize - I have uploaded the results from shard relocation benchmark. Kindly review.
IMO, there is not much difference in performance with and without the change.

@opensearch-ci-bot
Copy link
Collaborator

✅   Gradle Wrapper Validation success 8e7c453

@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Check failure 8e7c453
Log 964

Reports 964

@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Precommit failure 8e7c453
Log 1503

@nknize
Copy link
Collaborator

nknize commented Nov 4, 2021

Thank you so much for the information @jainankitk! This is super useful and I think a good example of a thorough PR for a critical core change.

If the excluded nodes were to go down the cluster becomes red. ...While existing UTs give sufficient coverage

It looks like this is the crux of the problem? Can you add an integration test that simulates this use case? I don't think existing UTs provide sufficient coverage beyond detecting basic logic failures. We need a thorough integration test that validates and verifies the change is doing what we expect it to do and preventing those RED scenarios.

IMO, there is not much difference in performance with and without the change.

I'm not so sure I agree. It looks like performance is a function of indices, shards, replica, and source nodes? As these values increase so does the regression. This is a bit of a flag that this may not scale well with users operating with a large number of indices.

measureExclusionOnZoneAwareStartedShard         100             10              1               8               17              8               5.873 ± 0.137 ms/op     6.216 ± 0.189 ms/op     5.84
measureExclusionOnZoneAwareStartedShard         100             10              2               33              55              6               12.814 ± 0.413 ms/op    13.018 ± 0.410 ms/op    1.59
measureExclusionOnZoneAwareStartedShard         500             60              1               100             100             12              241.278 ± 9.901 ms/op   255.860 ± 8.410 ms/op   6.04
measureExclusionOnZoneAwareStartedShard         500             60              1               100             40              12              210.525 ± 8.003 ms/op   222.254 ± 6.678 ms/op   5.57
measureExclusionOnZoneAwareStartedShard         500             60              1               40              100             12              192.286 ± 8.678 ms/op   203.785 ± 7.117 ms/op   5.98


measureShardRelocationComplete          100             1               1               10              10              6               4.482 ± 0.156 ms/op     4.576 ± 0.131 ms/op     2.10
measureShardRelocationComplete          100             3               1               10              5               8               5.260 ± 0.156 ms/op     5.485 ± 0.144 ms/op     4.28
measureShardRelocationComplete          100             10              1               8               17              8               10.017 ± 0.313 ms/op    10.639 ± 0.307 ms/op    6.21
measureShardRelocationComplete          100             10              2               33              55              6               20.016 ± 0.746 ms/op    21.353 ± 0.765 ms/op    6.68
measureShardRelocationComplete          500             60              1               100             100             12              540.946 ± 20.389 ms/op  561.109 ± 20.277 ms/op  3.73
measureShardRelocationComplete          500             60              1               100             40              12              504.320 ± 19.405 ms/op  519.140 ± 17.360 ms/op  2.94
measureShardRelocationComplete          500             60              1               40              100             12              461.199 ± 19.855 ms/op  468.763 ± 17.183 ms/op  1.64
measureShardRelocationComplete          50              60              1               100             100             6               50.915 ± 2.835 ms/op    53.403 ± 2.160 ms/op    4.89
measureShardRelocationComplete          50              60              1               100             40              6               33.995 ± 1.205 ms/op    36.410 ± 1.409 ms/op    7.10
measureShardRelocationComplete          50              60              1               40              100             6               32.178 ± 1.052 ms/op    33.855 ± 1.202 ms/op    5.21

This is an enhancement to improve robustness.

Possibly at the cost of performance. We might consider placing this behind a cluster-wide setting and discuss what the default should be? Or is this concerning enough that it's worth paying the performance penalty?

@nknize nknize added v1.3.0 discuss Issues intended to help drive brainstorming and decision making labels Nov 4, 2021
@jainankitk
Copy link
Collaborator Author

It looks like this is the crux of the problem? Can you add an integration test that simulates this use case? I don't think existing UTs provide sufficient coverage beyond detecting basic logic failures. We need a thorough integration test that validates and verifies the change is doing what we expect it to do and preventing those RED scenarios.

I will try to add integration test for simulating this use case and ensure it passes after this code change

I'm not so sure I agree. It looks like performance is a function of indices, shards, replica, and source nodes? As these values increase so does the regression. This is a bit of a flag that this may not scale well with users operating with a large number of indices.
Possibly at the cost of performance. We might consider placing this behind a cluster-wide setting and discuss what the default should be? Or is this concerning enough that it's worth paying the performance penalty?

Though performance numbers don't look concerning to me, as the absolute time difference is fairly small, it is worth placing behind cluster-wide setting. We can begin with default disabled, and consider switching the default once we get more confidence?

Copy link
Contributor

@malpani malpani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today, shards are balanced across nodes based on count as the key factor - irrespective of whether they are primary/replica. There will be multiple scenarios where certain nodes (eg. a restarted node) will only have replica shards.

By ordering the relocation and considering primaries first, I like how this change can help with robustness but i have concerns that this change can yield a net reduction to relocation speeds. By only considering primaries first, overall relocation speed for the cluster can drop with a tail on these nodes that have more replicas than primary. Could you please add a test case on relocation benchmarks for the same?

Modifying from simple count to primary count based balancing logic in the allocator will address this limitation. In the meantime, gating this change under a setting (default to disabled) could be a good option to experiment

Comment on lines 60 to 65
private static Map<Boolean, Integer> map = new HashMap<Boolean, Integer>() {
{
put(true, 0);
put(false, 1);
}
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the only use case for indexing into the shards array (0 or 1), will enum be better?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I am missing something here, but enum will define new type which is different than returned from shardRouting.primary() boolean?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right enum wont work here. Can leave it as is.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe

org.opensearch.common.collect.Map.of(true, 0, false, 1)

@opensearch-ci-bot
Copy link
Collaborator

✅   Gradle Wrapper Validation success fc281a8

@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Precommit failure fc281a8
Log 1524

@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Check failure fc281a8
Log 1013

Reports 1013

@nknize nknize added enhancement Enhancement or improvement to existing feature or request Indexing & Search and removed feedback needed Issue or PR needs feedback discuss Issues intended to help drive brainstorming and decision making labels Jan 20, 2022
Copy link
Collaborator

@nknize nknize left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thx for putting this behind a cluster setting! Tests look good as well. LGTM!

@nknize nknize merged commit 6eb8f6f into opensearch-project:main Jan 20, 2022
@nknize
Copy link
Collaborator

nknize commented Jan 20, 2022

Merged to main. I updated the commit message to be more descriptive of the problem this change addresses. Can we get a separate PR to add some more thorough documentation of this new cluster setting and behavior (including the performance as a function of scale?)

@nknize nknize added pending backport Identifies an issue or PR that still needs to be backported v2.0.0 Version 2.0.0 labels Jan 20, 2022
Copy link
Member

@andrross andrross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple minor comments, otherwise looks good to me.

@jainankitk jainankitk deleted the prim-first branch January 21, 2022 00:17
opensearch-trigger-bot bot pushed a commit that referenced this pull request Feb 9, 2022
When some node or set of nodes is excluded (based on some cluster setting)
BalancedShardsAllocator iterates over them in breadth first order picking 1 shard from
each node and repeating the process until all shards are balanced. Since shards from
each node are picked randomly it's possible the p and r of shard1 is relocated first
leaving behind both p and r of shard2. If the excluded nodes were to go down the
cluster becomes red.

This commit introduces a new setting  "cluster.routing.allocation.move.primary_first"
that prioritizes the p of both shard1 and shard2 first so the cluster does not become
red if the excluded nodes were to go down before relocating other shards. Note that
with this setting enabled performance of this change is a direct function of number
of indices, shards, replicas, and nodes. The larger the indices, replicas, and
distribution scale, the slower the allocation becomes. This should be used with care.

Signed-off-by: Ankit Jain <[email protected]>
(cherry picked from commit 6eb8f6f)
dblock pushed a commit that referenced this pull request Feb 10, 2022
)

When some node or set of nodes is excluded (based on some cluster setting)
BalancedShardsAllocator iterates over them in breadth first order picking 1 shard from
each node and repeating the process until all shards are balanced. Since shards from
each node are picked randomly it's possible the p and r of shard1 is relocated first
leaving behind both p and r of shard2. If the excluded nodes were to go down the
cluster becomes red.

This commit introduces a new setting  "cluster.routing.allocation.move.primary_first"
that prioritizes the p of both shard1 and shard2 first so the cluster does not become
red if the excluded nodes were to go down before relocating other shards. Note that
with this setting enabled performance of this change is a direct function of number
of indices, shards, replicas, and nodes. The larger the indices, replicas, and
distribution scale, the slower the allocation becomes. This should be used with care.

Signed-off-by: Ankit Jain <[email protected]>
(cherry picked from commit 6eb8f6f)

Co-authored-by: Ankit Jain <[email protected]>
@jainankitk jainankitk changed the title Prioritize primary shard movement during shard allocation Prioritize primary shard movement during shard relocation Feb 21, 2022
tlfeng pushed a commit that referenced this pull request Aug 7, 2023
…relocation (#8875) (#9153)

When some node or set of nodes is excluded, the shards are moved away in random order. When segment replication is enabled for a cluster, we might end up in a mixed version state where replicas will be on lower version and unable to read segments sent from higher version primaries and fail.

To avoid this, we could prioritize replica shard movement to avoid entering this situation.

Adding a new setting called shard movement strategy - `SHARD_MOVEMENT_STRATEGY_SETTING` - that will allow us to specify in which order we want to move our shards: `NO_PREFERENCE` (default), `PRIMARY_FIRST` or `REPLICA_FIRST`. 

The `PRIMARY_FIRST` option will perform the same behavior as the previous setting `SHARD_MOVE_PRIMARY_FIRST_SETTING` which will be now deprecated in favor of the shard movement strategy setting. 

Expected behavior: 

If `SHARD_MOVEMENT_STRATEGY_SETTING` is changed from its default behavior to be either `PRIMARY_FIRST` or `REPLICA_FIRST` then we perform this behavior whether or not `SHARD_MOVE_PRIMARY_FIRST_SETTING` is enabled. 

If `SHARD_MOVEMENT_STRATEGY_SETTING` is still at its default setting of `NO_PREFERENCE` and `SHARD_MOVE_PRIMARY_FIRST_SETTING` is enabled we move the primary shards first. This ensures that users still using this setting will not see any changes in behavior. 

Reference: #1445

Parent issue: #3881
---------

Signed-off-by: Poojita Raj <[email protected]>
(cherry picked from commit c6e4bcd)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 1.x enhancement Enhancement or improvement to existing feature or request Indexing & Search pending backport Identifies an issue or PR that still needs to be backported v1.3.0 v2.0.0 Version 2.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants