Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes Flakey Tests #211

Merged
merged 2 commits into from
Dec 7, 2021
Merged

Conversation

downsrob
Copy link
Contributor

@downsrob downsrob commented Dec 3, 2021

Signed-off-by: Robert Downs [email protected]

Issue #, if available:
#90
Description of changes:

  • Running multi node integTest on a t2.large EC2 instance would fail around 10% of the time before this change, this was tested using a shell script to run the integration test 100 times.
  • The majority of failures were due to jobs not starting. This appeared to be due to the job scheduler not setting the index operation listener on the index before the force start was attempted. These failures would often say "metadata document not created" or "target index not created".
  • By initializing the job index before any tests are run, the job sweeper can register a listener on the job index before the job is created, and catch the force start when it occurs. The github hosted runners are slightly lower resource than the t2.large, and they have additional background tasks, so it is possible that some flakiness could persist here, if that is the case, sleeping for a few seconds after initializing the index should help.
  • Fixes the 'test disabling and reenabling ism' test flakiness by allowing 4 retryOnConflicts. It seems that in all of the forcing starts and updating cluster settings, that sometimes the managed index config would be modified while forcing the start. This additional option allows for retrying the force start if it fails. In my tests, this removed all flakiness.

CheckList:

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@codecov-commenter
Copy link

Codecov Report

Merging #211 (8408320) into main (c739190) will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff              @@
##               main     #211      +/-   ##
============================================
- Coverage     76.03%   76.01%   -0.02%     
+ Complexity     1938     1936       -2     
============================================
  Files           260      260              
  Lines         11094    11094              
  Branches       1746     1746              
============================================
- Hits           8435     8433       -2     
- Misses         1704     1706       +2     
  Partials        955      955              
Impacted Files Coverage Δ
...nt/indexstatemanagement/model/destination/Chime.kt 40.90% <0.00%> (-13.64%) ⬇️
...anagement/indexstatemanagement/model/Transition.kt 63.07% <0.00%> (-4.62%) ⬇️
...ort/action/getpolicy/TransportGetPoliciesAction.kt 75.00% <0.00%> (-4.55%) ⬇️
...transport/action/explain/TransportExplainAction.kt 69.36% <0.00%> (-4.05%) ⬇️
.../rollup/action/start/TransportStartRollupAction.kt 65.06% <0.00%> (-3.62%) ⬇️
...gedindex/TransportRetryFailedManagedIndexAction.kt 61.80% <0.00%> (-2.78%) ⬇️
...nt/indexstatemanagement/ManagedIndexCoordinator.kt 71.08% <0.00%> (-1.75%) ⬇️
...nt/rollup/action/stop/TransportStopRollupAction.kt 69.51% <0.00%> (-1.22%) ⬇️
...management/indexstatemanagement/MetadataService.kt 60.86% <0.00%> (-1.09%) ⬇️
...sport/action/getpolicy/TransportGetPolicyAction.kt 70.27% <0.00%> (ø)
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c739190...8408320. Read the comment docs.

@dbbaughe
Copy link
Contributor

dbbaughe commented Dec 3, 2021

Any insights around the 'test disabling and reenabling ism' test?

dbbaughe
dbbaughe previously approved these changes Dec 3, 2021
@downsrob
Copy link
Contributor Author

downsrob commented Dec 3, 2021

Any insights around the 'test disabling and reenabling ism' test?

Not yet, the failure is:
{"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[5kpYRYqfQqCycTwvkqxqhA]: version conflict, required seqNo [7], primary term [1]. current document has seqNo [8] and primary term [1]", "index":".opendistro-ism-config","shard":"0","index_uuid":"rrtpcbBuR0CGImHh8rNATQ"},"status":409}
On my EC2 instances I can only replicate this every 1 or 2/100 attempts, so it is much slower to debug, but I am going to add additional debugging around this case and see if I can identify what the intermediate operation is that is causing the sequence number to be out of date.

@dbbaughe
Copy link
Contributor

dbbaughe commented Dec 3, 2021

@downsrob Are you able to replicate it every 1 or 2/100 attempts even when it's the only test that is run? Should be able to run that test 100 times in succession a lot faster than the entire test suite. If not perhaps it has to do with another test bleeding over somehow.

@downsrob
Copy link
Contributor Author

downsrob commented Dec 3, 2021

@dbbaughe That is a good idea, I started the single test run to see. If that doesn't work then I can try the test suite. Using a failing test seed might be important too, as these flakey failures often depend on test order.

Do you know how the test seeds are applied? I have often wondered if the test seed is set per test, per suite, or per overall run. I am curious, as if the seed is set once at the start of the overall run, then it may not guarantee the same test order or values when only the specific tests or test suites are run. I would assume the seed only works to replicate the randomized values in a single test, so it doesn't help replicate test order.

@downsrob downsrob changed the title Fixes Flakey Tests by initializing the job index before the tests Fixes Flakey Tests Dec 6, 2021
@downsrob downsrob merged commit 7f29718 into opensearch-project:main Dec 7, 2021
@downsrob downsrob deleted the flakeySimpleWait branch December 7, 2021 21:11
@ohltyler ohltyler mentioned this pull request Feb 1, 2022
downsrob added a commit to downsrob/index-management that referenced this pull request Mar 9, 2022
* Initializes the job index before each test

Signed-off-by: Robert Downs <[email protected]>

* Fixes test disabling and reenabling ism flakiness

Signed-off-by: Robert Downs <[email protected]>
downsrob added a commit to downsrob/index-management that referenced this pull request Mar 9, 2022
* Initializes the job index before each test

Signed-off-by: Robert Downs <[email protected]>

* Fixes test disabling and reenabling ism flakiness

Signed-off-by: Robert Downs <[email protected]>
downsrob added a commit that referenced this pull request Mar 9, 2022
* Initializes the job index before each test

Signed-off-by: Robert Downs <[email protected]>

* Fixes test disabling and reenabling ism flakiness

Signed-off-by: Robert Downs <[email protected]>
wuychn pushed a commit to ochprince/index-management that referenced this pull request Mar 16, 2023
* Initializes the job index before each test

Signed-off-by: Robert Downs <[email protected]>

* Fixes test disabling and reenabling ism flakiness

Signed-off-by: Robert Downs <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants