Stabilizing org.opensearch.cluster.routing.MovePrimaryFirstTests.test… #2048
…ClusterGreenAfterPartialRelocation Signed-off-by: Ankit Jain <[email protected]>
Can one of the admins verify this patch?
@owaiskazi19, @andrross - Can you review this PR?
Signed-off-by: Ankit Jain <[email protected]>
// All primaries are relocated before 60% of overall shards are started on new nodes
if (primaryShardCount <= startedCount && startedCount <= 6 * primaryShardCount / 5) {
Maybe it's obvious, but it is not immediately clear to me why the 6 * primaryShardCount / 5 math is correct for calculating that 60% of shards are started on new nodes. Can you explain how this works?
The total number of shards is double the primary shard count (1 replica per primary), i.e. 2 * primaryShardCount. Hence, 60% of the total shards is 3 * (total number of shards) / 5, which is the same as 6 * primaryShardCount / 5.
Got it. I suggest creating an intermediate variable just to make this more readable, like:
final int totalShardCount = primaryShardCount * 2;
if (primaryShardCount <= startedCount && startedCount <= totalShardCount * 3 / 5) {
Sure, makes sense
@@ -113,6 +120,6 @@ public void testClusterGreenAfterPartialRelocation() throws InterruptedException
internalCluster().stopRandomNode(InternalTestCluster.nameFilter(z1n1));
internalCluster().stopRandomNode(InternalTestCluster.nameFilter(z1n2));
Do we have to shut down nodes z2n1 and z2n2 as well here?
There are 4 nodes in the cluster. If we shut down all 4, the cluster will not be green. We want to shut down only the excluded nodes (2 in this case) after 60% of the total shards have relocated to z2n1 and z2n2. Because of [#1445], all primaries will have started within that 60%, so the cluster will eventually become green.
Signed-off-by: Ankit Jain <[email protected]>
Gradle check failure did not reproduce locally:
start gradle check
The fact that gradle check 2275 failed test MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation suggests that this PR doesn't actually fix the issue with that flaky test, right?
The backport to
To backport manually, run these commands in your terminal:
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-1.x 1.x
# Navigate to the new working tree
cd .worktrees/backport-1.x
# Create a new branch
git switch --create backport/backport-2048-to-1.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 343b82fe24525bbab01ef5a0d9bb8917068c71bf
# Push it to GitHub
git push --set-upstream origin backport/backport-2048-to-1.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-1.x
Then, create a pull request where the
The fix is definitely helping, as I observed the following: without the fix, the test passed 93 out of 100 times, and most of the failures were due to unassigned primary shards:
@jainankitk maybe try adding a 120-second timeout to the ensureGreen() call so that the nodes have enough time to become available. Not sure if it will solve the above issue, but let's give it a try.
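To illustrate the general idea behind waiting on a condition with a deadline, here is a minimal, framework-free sketch. It is not the OpenSearch test API and not what ensureGreen() does internally; TimedWait and waitUntil are hypothetical names, and the 10 ms poll interval is an arbitrary choice.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of OpenSearch: polls a condition until it holds or a timeout elapses.
final class TimedWait {
    // Returns true if the condition became true before the timeout, false otherwise.
    static boolean waitUntil(BooleanSupplier condition, Duration timeout) {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            // Overflow-safe comparison of nanoTime values.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(10); // arbitrary poll interval
            } catch (InterruptedException e) {
                // Preserve the interrupt status and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A caller would pass the health check as the condition and something like Duration.ofSeconds(120) as the timeout; a longer timeout only helps if the condition eventually becomes true.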
The 2 times the test failed with the fix did not seem to have anything to do with the timeout. I will wait and see whether we observe more such instances, though.
opensearch-project#2048)
* Stabilizing org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation
  Signed-off-by: Ankit Jain <[email protected]>
* Removing unused import
  Signed-off-by: Ankit Jain <[email protected]>
* Making code more readable
  Signed-off-by: Ankit Jain <[email protected]>
(cherry picked from commit 343b82f)
* Stabilizing org.opensearch.cluster.routing.MovePrimaryFirstTests.test… (#2048)
* Stabilizing org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation
  Signed-off-by: Ankit Jain <[email protected]>
* Removing unused import
  Signed-off-by: Ankit Jain <[email protected]>
* Making code more readable
  Signed-off-by: Ankit Jain <[email protected]>
(cherry picked from commit 343b82f)
* Added timeout to ensureGreen() for testClusterGreenAfterPartialRelocation (#2074)
  Signed-off-by: Ankit Jain <[email protected]>
(cherry picked from commit f0984eb)
…ClusterGreenAfterPartialRelocation
Signed-off-by: Ankit Jain [email protected]
Description
The flakiness occurs when one of the primary shards is still initializing while some replicas have already started. The latch is then counted down because half of the shards are already initialized, even though not every primary has relocated. This change makes the check more robust by ensuring that no primaries are initializing and that no more than 20% of the replicas have started on the new nodes.
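The check described here can be sketched as a small predicate. This is an assumption-laden illustration rather than the code in the PR: RelocationCheck, primariesMovedFirst, and releaseIfReady are hypothetical names, the counts are passed in as plain integers instead of being read from cluster state, and one replica per primary is assumed.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of the latch condition; not the actual test code from the PR.
final class RelocationCheck {
    // True when no primary is still initializing and the number of shards started on the
    // new nodes lies between "all primaries" and "60% of all shards" (at most 20% of replicas).
    static boolean primariesMovedFirst(int primaryShardCount, int initializingPrimaries, int startedOnNewNodes) {
        final int totalShardCount = primaryShardCount * 2; // assumes 1 replica per primary
        return initializingPrimaries == 0
            && primaryShardCount <= startedOnNewNodes
            && startedOnNewNodes <= totalShardCount * 3 / 5;
    }

    // Counts the latch down only when the condition above holds.
    static void releaseIfReady(CountDownLatch latch, int primaryShardCount, int initializingPrimaries, int startedOnNewNodes) {
        if (primariesMovedFirst(primaryShardCount, initializingPrimaries, startedOnNewNodes)) {
            latch.countDown();
        }
    }
}
```

With 50 primaries, for instance, the latch would be released for 50 to 60 started shards on the new nodes, but not while any primary is still initializing.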
Issues Resolved
#1957
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.