Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stabilizing org.opensearch.cluster.routing.MovePrimaryFirstTests.test… #2048

Merged
merged 3 commits into from
Feb 8, 2022

Conversation

jainankitk
Copy link
Collaborator

…ClusterGreenAfterPartialRelocation

Signed-off-by: Ankit Jain [email protected]

Description

The issue is caused due to one of the primary shard being initialized and some replica starts meanwhile. Hence, latch is counted down as half shards are already initialized. Making the check more robust by ensuring no primaries are initializing and not more than 20% of replicas have started on new nodes

Issues Resolved

#1957

Check List

  • New functionality includes testing.
    • All tests pass
  • [] New functionality has been documented.
    • [] New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…ClusterGreenAfterPartialRelocation

Signed-off-by: Ankit Jain <[email protected]>
@jainankitk jainankitk requested a review from a team as a code owner February 3, 2022 02:40
@opensearch-ci-bot
Copy link
Collaborator

Can one of the admins verify this patch?

@jainankitk
Copy link
Collaborator Author

@owaiskazi19 , @andrross - Can you review this PR?

@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Check failure 59f7dc0
Log 2210

Reports 2210

Signed-off-by: Ankit Jain <[email protected]>
@opensearch-ci-bot
Copy link
Collaborator

✅   Gradle Check success 31559da
Log 2212

Reports 2212

@dblock dblock requested a review from andrross February 3, 2022 17:12
Comment on lines 103 to 104
// All primaries are relocated before 60% of overall shards are started on new nodes
if (primaryShardCount <= startedCount && startedCount <= 6 * primaryShardCount / 5) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's obvious, but it is not immediately clear to me why the 6 * primaryShardCount / 5 math is correct for calculating that 60% of shards are started on new nodes. Can you explain how this works?

Copy link
Collaborator Author

@jainankitk jainankitk Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Total number of shards are double the primary shard count (1 replica) - 2 * primaryShardCount. Hence, 60% of total shards is 3 * total number of shards / 5 which is same as 6 * primaryShardCount / 5

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I suggest creating an intermediate variable just to make this more readable, like:

final int totalShardCount = primaryShardCount * 2;
if (primaryShardCount <= startedCount && startedCount <= totalShardCount * 3 / 5) {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, makes sense

@@ -113,6 +120,6 @@ public void testClusterGreenAfterPartialRelocation() throws InterruptedException
internalCluster().stopRandomNode(InternalTestCluster.nameFilter(z1n1));
internalCluster().stopRandomNode(InternalTestCluster.nameFilter(z1n2));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have to shutdown nodes z2n1 and z2n2 as well here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 4 nodes in the cluster. If we shutdown all 4, cluster will not be green. We want to shutdown all excluded nodes (in this case 2) after 60% of total shards have relocated to z2n1 and z2n2. Due to [#1445 ] all primaries would have started in those 60% and hence, cluster will become eventually green

Signed-off-by: Ankit Jain <[email protected]>
@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Check failure 1ed647e
Log 2275

Reports 2275

@jainankitk
Copy link
Collaborator Author

Gradle check failure did not reproduce locally:

akjain@88665a3f79e7 OpenSearch % ./gradlew ':server:test' --tests "org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation" -Dtests.seed=EFD233475A0C0A16 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sr-ME -Dtests.timezone=Indian/Reunion -Druntime.java=17 > /tmp/log
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
OpenJDK 64-Bit Server VM warning: Ignoring option --illegal-access=warn; support was removed in 17.0
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.BootstrapForTesting (file:/Users/akjain/ws/OpenSearch/test/framework/build/distributions/framework-2.0.0-SNAPSHOT.jar)
WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.BootstrapForTesting
WARNING: System::setSecurityManager will be removed in a future release
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.gradle.api.internal.tasks.testing.worker.TestWorker (file:/Users/akjain/.gradle/wrapper/dists/gradle-7.3.3-all/4295vidhdd9hd3gbjyw1xqxpo/gradle-7.3.3/lib/plugins/gradle-testing-base-7.3.3.jar)
WARNING: Please consider reporting this to the maintainers of org.gradle.api.internal.tasks.testing.worker.TestWorker
WARNING: System::setSecurityManager will be removed in a future release
akjain@88665a3f79e7 OpenSearch %

@owaiskazi19
Copy link
Member

start gradle check

@opensearch-ci-bot
Copy link
Collaborator

✅   Gradle Check success 1ed647e
Log 2276

Reports 2276

@andrross
Copy link
Member

andrross commented Feb 8, 2022

The fact that gradle check 2275 failed test MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation suggests that this PR doesn't actually fix the issue with that flaky test, right?

@saratvemulapalli saratvemulapalli added :test Adding or fixing a test bug Something isn't working v2.0.0 Version 2.0.0 v1.3.0 backport 1.x labels Feb 8, 2022
@saratvemulapalli saratvemulapalli merged commit 343b82f into opensearch-project:main Feb 8, 2022
@github-actions
Copy link
Contributor

github-actions bot commented Feb 8, 2022

The backport to 1.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-1.x 1.x
# Navigate to the new working tree
cd .worktrees/backport-1.x
# Create a new branch
git switch --create backport/backport-2048-to-1.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 343b82fe24525bbab01ef5a0d9bb8917068c71bf
# Push it to GitHub
git push --set-upstream origin backport/backport-2048-to-1.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-1.x

Then, create a pull request where the base branch is 1.x and the compare/head branch is backport/backport-2048-to-1.x.

@jainankitk
Copy link
Collaborator Author

jainankitk commented Feb 8, 2022

The fact that gradle check 2275 failed test MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation suggests that this PR doesn't actually fix the issue with that flaky test, right?

The fix is definitely helping as I observed the below:
I ran it locally with the fix for 100 times and it passed 98 times. Even 2 times that it failed looked unrelated reasons
Screen Shot 2022-02-07 at 7 37 10 PM

Without the fix, it passed 93 out of 100 times. And, most of the failed tests are due to unassigned primary shards:
Screen Shot 2022-02-07 at 10 33 22 PM

@owaiskazi19
Copy link
Member

@jainankitk may be try adding a timeout of 120 sec for ensureGreen() method so that nodes get available within the same time. Not sure if it will solve the above issue but let's give it a try.

@jainankitk jainankitk deleted the fix-1957 branch February 8, 2022 17:31
@jainankitk
Copy link
Collaborator Author

jainankitk commented Feb 8, 2022

@jainankitk may be try adding a timeout of 120 sec for ensureGreen() method so that nodes get available within the same time. Not sure if it will solve the above issue but let's give it a try.

The 2 times that test failed with the fix did not seem anything to do with the timeout. Though, I will wait and see if we observe more of these instances

jainankitk added a commit to jainankitk/OpenSearch that referenced this pull request Feb 10, 2022
opensearch-project#2048)

* Stabilizing org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation

Signed-off-by: Ankit Jain <[email protected]>

* Removing unused import

Signed-off-by: Ankit Jain <[email protected]>

* Making code more readable

Signed-off-by: Ankit Jain <[email protected]>
(cherry picked from commit 343b82f)
dblock pushed a commit that referenced this pull request Feb 11, 2022
* Stabilizing org.opensearch.cluster.routing.MovePrimaryFirstTests.test… (#2048)

* Stabilizing org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation

Signed-off-by: Ankit Jain <[email protected]>

* Removing unused import

Signed-off-by: Ankit Jain <[email protected]>

* Making code more readable

Signed-off-by: Ankit Jain <[email protected]>
(cherry picked from commit 343b82f)

* Added timeout to ensureGreen() for testClusterGreenAfterPartialRelocation (#2074)

Signed-off-by: Ankit Jain <[email protected]>
(cherry picked from commit f0984eb)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 1.x bug Something isn't working :test Adding or fixing a test v1.3.0 v2.0.0 Version 2.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants