[Meta] Fix random test failures #1715

anasalkouz · 2021-12-13T19:36:25Z

PRs were blocked by transient gradle check errors multiple times. Provide a plan to stabilize the tests.

andrross · 2021-12-14T18:21:42Z

I ran a quick experiment on my dev machine, running internalClusterTest in a loop overnight:

for i in $(seq 0 1000) ; do echo "Iteration: $i" && ./gradlew ':server:internalClusterTest' >> test-output.txt 2>&1 ; done

Results:

$ egrep 'BUILD (SUCCESSFUL|FAILED)' test-output.txt | wc -l
152
$ egrep 'BUILD FAILED' test-output.txt | wc -l
3

$ egrep '^REPRODUCE' test-output.txt | less -S | uniq
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureLastSuccessfulSettingsUpdate" -Dtests.seed=7B8B067879F3C91F -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en -Dtests.timezone=Brazil/West -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureEnforcedEnabledDisabledSetting" -Dtests.seed=9F8306D99E2C2EF1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=id -Dtests.timezone=Asia/Aqtau -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureLastSuccessfulSettingsUpdate" -Dtests.seed=6D39D8439C254FF0 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-VE -Dtests.timezone=Pacific/Honolulu -Druntime.java=17

All 3 failures were caused by "Suite timeout exceeded (>= 1200000 msec)."
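
The counting done with egrep above can also be sketched in Python, which makes it easy to get the failure rate and a per-test tally in one pass. This is a minimal sketch, not the method used in the experiment: `summarize_log` is a hypothetical helper, and it assumes every Gradle invocation logs one `BUILD SUCCESSFUL` or `BUILD FAILED` line and that each failed test produces a `REPRODUCE WITH:` line like the ones shown above.

```python
import re
from collections import Counter


def summarize_log(text):
    """Summarize concatenated Gradle output from repeated test runs.

    Hypothetical helper: returns (total builds, failed builds,
    failure rate, Counter of failing test names).
    """
    # Assumption: each invocation logs one BUILD SUCCESSFUL / BUILD FAILED line.
    outcomes = re.findall(r"BUILD (SUCCESSFUL|FAILED)", text)
    total = len(outcomes)
    failed = outcomes.count("FAILED")
    # A REPRODUCE WITH line names the failing test after the --tests flag.
    tests = Counter(
        re.findall(r'^REPRODUCE WITH:.*--tests "([^"]+)"', text, flags=re.MULTILINE)
    )
    rate = failed / total if total else 0.0
    return total, failed, rate, tests
```

Passing the contents of test-output.txt to `summarize_log` gives the same two counts as the egrep commands, and 3 failures out of 152 builds works out to roughly 2%.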

From this I'll make a couple hypotheses:

1. There is a bug in the logic of ShardIndexingPressureSettingsIT that sometimes causes it to hang and fail with the overall test timeout. See previous issue where this same failure occurred.
2. While failures we see in the PR workflows that run ./gradlew check often manifest as a failure somewhere in :server:internalClusterTest, they are not the result of buggy logic within the tests themselves, but instead are the result of interference between gradle tasks running concurrently, or some other problem with the CI environment. (I make this claim because the ~2% failure rate observed in my experiment seems much lower than the failure rate we're observing in the PR checks.)

I'm going to repeat my experiment but run the full check task instead of just :server:internalClusterTest. If hypothesis 2 is correct then I should see a higher failure rate than 3 out of 152 observed in this first experiment.

Dev environment:

OS: Ubuntu 20.04
Host type: c6i.8xlarge
Branch: main at 309649ce8a

saratvemulapalli · 2021-12-14T20:07:56Z

Another flaky test:
Coming from: #1725

* What went wrong:
Execution failed for task ':qa:rolling-upgrade:v1.3.0#oldClusterTest'.
> `node{:qa:rolling-upgrade:v1.3.0-0}` failed to wait for ports files after 120000 MILLISECONDS

dreamer-89 · 2021-12-15T18:44:46Z

Looking into it.

dreamer-89 · 2021-12-16T18:00:27Z

A simple plan to begin with could involve the steps below:

1. Analyze.
Analyze last X failed Jenkins builds (X=20), identify the failed tests, and count how often each one fails. This will help in prioritizing the right failures.
2. Reproduce.
Failures identified above may need a deeper dive into root causes, as well as the ability to reproduce them locally. The expected outcome of this step is a dev setup where failures can be replicated. Begin with the targeted test (fast); if that does not help, run the entire test suite (slow). Failures may not always happen, so the tests need to be repeated multiple times, as done by @andrross above. Replication may need a setup similar to the one used in Jenkins (worst case, an actual Jenkins setup). Add logging wherever necessary to dig into the issue. Replication may uncover new bugs in the tests; these failures should be documented and fixed as well in order to increase overall test stability.
3. Fix.
Fixing tests depends on the type of failure, which can broadly be classified into the categories below. This step may run in sequence after step 2 or in parallel, depending on the failures identified in step 1.
a. True transient failures.
Failures which happen randomly and are out of our control, e.g. node connection timeouts caused by a bad node, networking issues, etc. The only fix in this case is to either increase the corresponding parameters (timeouts) or skip the test until a proper fix is identified.
b. Setup related.
There may be a class of failures related to misconfiguration (backward compatibility tests etc.), which should be the easiest to identify. These tests may need minor configuration changes.
c. Bug fix.
The remaining class of failures are corner cases which are trickier to root-cause and may need specific areas of expertise. Based on the area of failure, the relevant engineer needs to be involved to debug the issue further.
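
The "Analyze" step could be sketched roughly as below. This is only an illustration under stated assumptions: `rank_failures` is a hypothetical helper, and its input (a mapping from Jenkins build number to the names of the tests that failed in that build) would have to be collected from the build results by some other means.

```python
from collections import Counter


def rank_failures(failed_builds, last_x=20):
    """Rank tests by the number of recent failed builds they appear in.

    failed_builds: dict of build number -> list of failed test names
    (hypothetical input format, not something Jenkins emits directly).
    """
    recent = sorted(failed_builds)[-last_x:]
    counts = Counter()
    for build in recent:
        # Count a test at most once per build, even if it is listed repeatedly.
        counts.update(set(failed_builds[build]))
    # Most frequent first; ties broken by name so the output is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Counting each test once per build keeps a single very noisy build from dominating the ranking; whether that is the right choice depends on what the prioritization is meant to capture.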

andrross · 2021-12-17T00:11:16Z

Analyze last X failed Jenkins builds (X=20)

I think it is a good idea to collect this data. It might be a bit hard to separate out the failures that were caused by the change in the PR that triggered the build. Setting up a test machine to run checks continually should be able to get similar data, and will have the benefit of running against a static code base.

Reproduce

We've probably seen enough of these to know they aren't reproducible when re-run in isolation. We have open issues with quite a few errors and none of them can be reproduced even when re-running the individual test many times. I think running the entire test suite is the way to go, but we probably don't need to worry about the Jenkins stuff and can just trigger the ./gradlew check command directly.

saratvemulapalli · 2021-12-17T23:19:31Z

Another one, coming from: #1766

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.discovery.StableMasterDisruptionIT.testStaleMasterNotHijackingMajority" -Dtests.seed=28AD28E1A3FF50C7 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en-PH -Dtests.timezone=Etc/GMT+8 -Druntime.java=15

org.opensearch.discovery.StableMasterDisruptionIT > testStaleMasterNotHijackingMajority FAILED
    java.lang.AssertionError: node_t1: [Tuple [v1=node_t2, v2=null]]
        at __randomizedtesting.SeedInfo.seed([28AD28E1A3FF50C7:77AB65EE82248FCB]:0)
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.opensearch.discovery.StableMasterDisruptionIT.lambda$testStaleMasterNotHijackingMajority$5(StableMasterDisruptionIT.java:253)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1048)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1021)
        at org.opensearch.discovery.StableMasterDisruptionIT.testStaleMasterNotHijackingMajority(StableMasterDisruptionIT.java:250)

andrross · 2021-12-20T19:27:32Z

I ran another experiment over the weekend, the theory being that maybe :qa:mixed-cluster:v1.2.2#mixedClusterTest was interfering with :server:internalClusterTest:

for i in $(seq 0 1000) ; do echo "Iteration: $i" && ./gradlew clean > /dev/null 2>&1 && ./gradlew :server:internalClusterTest :qa:mixed-cluster:v1.2.2#mixedClusterTest >> ../build-failure-tests/test-output-2021-12-17_2.txt 2>&1 ; done

but the results were 7 failures out of 330, which is in line with the ~2% failure rate of the integ tests in isolation. The failures were:

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover" -Dtests.seed=60436199814D8A58 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sr-CS -Dtests.timezone=Etc/GMT+5 -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover" -Dtests.seed=8EC37C710AA42BCE -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=no-NO -Dtests.timezone=EET -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover" -Dtests.seed=B4175006736B7460 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-US -Dtests.timezone=Africa/Casablanca -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites" -Dtests.seed=6AF32DFBEB864CEE -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=zh-Hant-TW -Dtests.timezone=PRC -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites" -Dtests.seed=D921821394B6DBAA -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en-GB -Dtests.timezone=America/Nipigon -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureEnforcedEnabledDisabledSetting" -Dtests.seed=FA529FAA49915455 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar-SY -Dtests.timezone=AET -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureEnforcedEnabledDisabledSetting" -Dtests.seed=FC550CFC70BBB318 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=zh-Hans-CN -Dtests.timezone=America/Knox_IN -Druntime.java=17

There are likely bugs within ClusterHealthIT, ShardIndexingPressureIT, and ShardIndexingPressureSettingsIT that cause rare failures. But it remains a mystery what is causing ./gradlew check to fail at a much higher rate in the CI workflow than in these experiments.
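
One way to check whether 7 out of 330 is really "in line with" 3 out of 152 is to put a confidence interval around each observed rate. The sketch below uses the Wilson score interval, a standard statistical tool that was not part of the experiments in this thread. The 95% level (z = 1.96) and the treatment of each loop iteration as an independent run are assumptions.

```python
import math


def wilson_interval(failures, runs, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = failures / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return center - half, center + half


# Failure counts reported in this thread for the two overnight experiments.
for label, failures, runs in [
    ("internalClusterTest alone", 3, 152),
    ("internalClusterTest + mixedClusterTest", 7, 330),
]:
    low, high = wilson_interval(failures, runs)
    print(f"{label}: {failures}/{runs} = {failures / runs:.2%}, "
          f"interval {low:.2%} to {high:.2%}")
```

With these counts the two intervals overlap, which is consistent with the reading that adding the mixed-cluster task did not measurably change the failure rate. The intervals are wide, though, so the samples could not rule out a modest difference either.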

dblock · 2021-12-22T19:30:08Z

#1725

I opened #1793 for this one specifically.

nknize · 2022-01-14T17:57:20Z

/cc @getsaurabh02

ShardIndexingPressureSettingsIT is a problem child. Can y'all investigate the recurring "Suite timeout exceeded (>= 1200000 msec)." and see if this is either a real issue with the Indexing Pressure implementation or simply a test cluster resourcing issue when run in the context of the entire check suite?

andrross · 2022-01-14T18:01:00Z

Suraj @dreamer-89 has been digging into the ShardIndexingPressureSettingsIT failures, tracked in #1843

nknize · 2022-01-14T18:06:23Z

Suraj @dreamer-89 has been digging into the ShardIndexingPressureSettingsIT failures, tracked in #1843

👍 Also note open PR #1592

reta · 2022-01-14T18:10:15Z

Few more flaky tests:

dblock · 2022-01-14T18:17:10Z

I copied some links into the body of this issue... it's quite a list.

penghuo · 2022-02-18T20:09:16Z

another one #2176.

dblock · 2022-11-10T21:35:05Z

Between gradle check 6786 and 6688 (100 builds) the following tests failed more than once:

org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Snapshot and Restore with repository-s3 using permanent credentials}: 12
org.opensearch.test.rest.ClientYamlTestSuiteIT/test {p0=search/30_limits/Regexp length limit}: 6
org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT/test {yaml=search/30_limits/Regexp length limit}: 6
org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests/testCoordinatingPrimaryThreadedUpdateToShardLimitsAndRejections: 5
org.opensearch.action.support.AutoCreateIndexTests/testParseFailed: 2
org.opensearch.cluster.metadata.IndexMetadataTests/testNumberOfReplicasIsNonNegative: 2
org.opensearch.cluster.metadata.IndexMetadataTests/testNumberOfShardsIsNotZero: 2
org.opensearch.cluster.metadata.IndexMetadataTests/testNumberOfShardsIsNotNegative: 2
org.opensearch.cluster.metadata.IndexMetadataTests/testNumberOfRoutingShards: 2
org.opensearch.cluster.routing.allocation.DiskThresholdSettingsTests/testInvalidHighDiskThreshold: 2
org.opensearch.cluster.allocation.AwarenessAllocationIT/testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness: 2
org.opensearch.common.settings.ScopedSettingsTests/testLoggingUpdates: 2
org.opensearch.cluster.coordination.NoClusterManagerBlockServiceTests/testRejectsInvalidSetting: 2
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT/test {p0=search/320_disallow_queries/Test disallow expensive queries}: 2
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT/test {p0=cluster.put_settings/10_basic/Test put and reset persistent settings}: 2
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT/test {p0=search.aggregation/240_max_buckets/Max bucket}: 2
org.opensearch.action.support.AutoCreateIndexTests/testParseFailedMissingIndex: 2
org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Delete a non existing snapshot}: 2
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT/test {p0=cluster.put_settings/10_basic/Test put and reset transient settings}: 2
org.opensearch.search.MultiClusterSearchYamlTestSuiteIT/test {yaml=multi_cluster/15_connection_mode_configuration/Add transient remote cluster in sniff mode with invalid proxy settings}: 2
org.opensearch.search.MultiClusterSearchYamlTestSuiteIT/test {yaml=multi_cluster/15_connection_mode_configuration/Switch connection mode for configured cluster}: 2
org.opensearch.search.MultiClusterSearchYamlTestSuiteIT/test {yaml=multi_cluster/15_connection_mode_configuration/Add transient remote cluster in proxy mode with invalid sniff settings}: 2
org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT/testNodesRemovedAfterZoneDecommission_ClusterManagerNotInToBeDecommissionedZone: 2
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT/test {p0=scroll/20_keep_alive/Max keep alive}: 2
org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Register a repository with a non existing client}: 2
org.opensearch.cluster.coordination.ElectionSchedulerFactoryTests/testSettingsValidation: 2
org.opensearch.common.settings.ScopedSettingsTests/testValidate: 2
org.opensearch.repositories.gcs.GoogleCloudStorageBlobStoreRepositoryTests/testChunkSize: 2
org.opensearch.action.admin.cluster.settings.SettingsUpdaterTests/testUpdateOfValidationDependentSettings: 2
org.opensearch.cluster.routing.OperationRoutingTests/testWeightedOperationRoutingWeightUndefinedForOneZone: 2
org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Try to create repository with broken endpoint override and named client}: 2
org.opensearch.action.admin.cluster.settings.SettingsUpdaterTests/testAllOrNothing: 2
org.opensearch.cluster.metadata.AutoExpandReplicasTests/testInvalidValues: 2

Another ~100 failed once.

anasalkouz · 2022-11-11T17:55:46Z

I am aiming to bring these flaky tests down to zero by Dec 30, 2022. If anyone wants to help with this effort, please feel free to pick one of the flaky test issues in this list.

anasalkouz · 2022-11-12T02:16:55Z

I have added the following 2 issues as proactive mechanisms to detect flaky test failures and prevent newly introduced flaky tests.
#5226
#5227

Poojita-Raj · 2022-11-15T15:15:48Z

Url	Status	Group	Owner	Reproducible	Note
[BUG] DecommissionControllerTests.testTimesOut f...	Closed	decommission	andrross
[BUG] AwarenessAttributeDecommissionIT.testNodes...	assigned	decommission	pranikum	no,100 passing tests on local
[BUG] Failed Integ test testDecommissionStatusUp...	assigned	decommission	pranikum		@imRishN opened and merged a fix - #4822 - doesn't resolve issue since it's seen since then
[Meta] Fix random test failures	untriaged				meta issue
[BUG] testCoordinatingPrimaryThreadedUpdateToSha...	pending	shardIndexing
[BUG] ShardIndexingPressureIT.testShardIndexingP...	pending	shardIndexing
[BUG] org.opensearch.action.bulk.BulkIntegration...	pending			yes, failed 2/100 tests
[BUG] org.opensearch.persistent.PersistentTasksE...	pending			no, 200 tests passing on local
[BUG] Failures with org.opensearch.smoketest.Smo...	pending			yes
[BUG] DedicatedClusterSnapshotRestoreIT.testInde...	assigned		xuezhou	yes, failed 3/100 tests	Xue wrote original test
[BUG] Deterministic failure of AggregationsTests...	pending			yes
[BUG] flaky test index/80_geo_point/Single point...	Closed	MixedClusterClientYamlTestSuiteIT
[BUG] Fix flaky test org.opensearch.index.ShardI...	assigned	shardIndexing	rrpasham	yes
[CI] flaky test failure - o.o.indices.stats.Inde...	pending			yes, failed 3/100 tests	off by 1 error
[CI] Test Failure org.opensearch.cluster.allocat...	pending				@imRishN worked on original PR, had a fix out and merged in (#3646), still seeing failures after that
[BUG] org.opensearch.gateway.QuorumGatewayIT > t...	pending			no, passing 100 tests
[BUG] org.opensearch.repositories.s3.RepositoryS...	untriaged	RepositoryS3ClientYamlTestSuiteIT
[BUG] Intermittent test failure - Snapshot and R...	untriaged	RepositoryS3ClientYamlTestSuiteIT
[BUG] OperationRoutingTests.testWeightedOperatio...	pending			yes	There's one PR out for a fix currently - #4980 - not sure if it resolves issue
[BUG] org.opensearch.search.aggregations.metrics...	pending			yes	There's one PR out for this - #4850
[BUG] Fix flaky org.opensearch.search.PitMultiNo...	pending	PitMultiNode		yes, failed 1/100
[CI] o.o.aliases.IndexAliasesIT.testSameAlias fa...	pending	AcknowledgedResponse failed		no
[CI] o.o.gateway.RecoveryFromGatewayIT.testReuse...	untriaged				No occurrences since April, can be closed out?
[BUG] Fix new flaky test org.opensearch.search.D...	pending	PitMultiNode		yes, failed 1/100 times
[CI] o.o.cluster.remote.test.RemoteClustersIT.te...	untriaged				No occurrences since June, can be closed out?
[TEST] Failures in IndexingMemoryControllerTests...	untriaged			no	Not seen since Jan, can be closed out?
[BUG] org.opensearch.discovery.DiscoveryDisrupti...	untriaged				Only 1 occurrence in May
[BUG] org.opensearch.action.admin.cluster.tasks....	untriaged				timeout issue
[BUG] :test:logger-usage:test failure flakey tes...	untriaged
[BUG] o.o.search.SearchCancellationIT.testCancel...	pending	SearchCancellationIT		no, passed 100 tests
[BUG] node drop on o.o.cluster.routing.allocatio...	pending			no, passed 100 tests
[CI] o.o.blocks.SimpleBlocksIT.testAddBlockWhile...	pending			no, passed 100 tests	also documented in issue -#2442
[CI] o.o.versioning.ConcurrentSeqNoVersioningIT....	pending			no, passed 100 tests
[CI] flaky test faiure - o.o.indices.recovery.In...	pending			no, passed 100 tests
[CI] o.o.discovery.SnapshotDisruptionIT.testDisr...	pending	SnapshotDisruptionIT		no, passed 100 times
[BUG] testCancellationDuringQueryPhaseUsingReque...	pending	SearchCancellationIT		no, passed 150 times
[BUG] cluster.routing.PrimaryAllocationIT.testPr...	pending			no, passed 100 times
[BUG] org.opensearch.search.SearchCancellationIT...	pending	SearchCancellationIT		no, passed 100 times
[BUG] StableMasterDisruptionIT.testStaleMasterNo...	pending			no, passed 100 times
[CI] flaky test faiure - o.o.upgrades.IndexingIT...	untriaged
[BUG] Flaky test failure - v1.2.5#mixedClusterTe...	untriaged	MixedClusterClientYamlTestSuiteIT
[BUG] Master bootstrap takes time causing interm...	pending			no, passed 100 tests	renamed test
[BUG] ClusterRerouteIT.testDelayWithALargeAmount...	untriaged	AcknowledgedResponse failed		no, passed 100 times
[BUG] Flaky test failure - org.opensearch.blocks...	closed				Same as #33 -#2472
[BUG] org.opensearch.snapshots.ConcurrentSnapsho...	pending			no, passed 100 times
[BUG] testRestartIndexCreationAfterFullClusterRe...	pending			no,passed 100 times
[BUG] org.opensearch.cluster.routing.allocation....	untriaged
[BUG] org.opensearch.discovery.SnapshotDisruptio...	untriaged	SnapshotDisruptionIT
[CI] Test failure in "org.opensearch.cluster.coo...	untriaged
[BUG] Upgrade cli test failure while detecting e...	untriaged
[CI] oldClusterTest fails intermittently	untriaged
[BUG] Netty Transport test failing with large re...	pending			No
[BUG] InstallPluginCommandTests.testOfficialPlug...	pending			No
[BUG] :distribution:packages:rpm:checkExtraction...	pending			No
[BUG] Transport NIO test intermittently failing ...	pending			No
[BUG] :rest-api-spec:yamlRestTest org.opensearch...	pending			No
[BUG] MinimumMasterNodesIT.testThreeNodesNoMaste...	pending			No	Test doesn't exist? renamed to MinimumClusterManagerNodesIT
[BUG] SharedClusterSnapshotRestoreIT.testSnapsho...	pending			No

andrross · 2022-12-03T00:19:32Z

I wrote a script to crawl the Jenkins output for unstable builds: https://gist.github.com/andrross/ee07a8a05beb63f1173bcb98523918b9

Below are the results for the last 1000 builds. There is a long tail of tests with a few failures, but the top 4 failures have issues already (#5219, #4212, #5157, #3603).

41 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Snapshot and Restore with repository-s3 using permanent credentials} (6561,6561,6561,6577,6587,6591,6591,6598,6645,6709,6711,6711,6717,6750,6751,6766,6778,6778,6779,6779,6779,6782,6879,6879,6880,6880,6952,6953,6953,7074,7074,7074,7080,7082,7082,7177,7200,7201,7224,7277,7310)
23 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testReplicaThreadedUpdateToShardLimitsAndRejections (6585,6681,6962,7046,7090,7095,7149,7149,7149,7158,7188,7206,7206,7253,7253,7253,7274,7274,7274,7327,7463,7483,7492)
22 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testCoordinatingPrimaryThreadedUpdateToShardLimitsAndRejections (6607,6616,6628,6700,6700,6720,6759,6759,6762,6828,6887,6971,6971,6975,7027,7112,7115,7168,7168,7202,7315,7315)
17 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness (6562,6601,6627,6717,6741,6908,6921,6925,7036,7047,7112,7149,7422,7447,7495,7517,7555)
11 org.opensearch.clustermanager.ClusterManagerTaskThrottlingIT.testTimeoutWhileThrottling (6556,6593,6594,6594,6598,6599,6601,6602,6602,6602,6742)
9 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testIndexDeletionDuringSnapshotCreationInQueue (6790,6828,6965,7220,7256,7315,7361,7447,7543)
8 org.opensearch.cluster.service.MasterServiceTests.classMethod (6894,6894,6894,6894,7074,7074,7177,7177)
8 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Try to create repository with broken endpoint override and named client} (6589,6709,6952,6952,6953,6953,7200,7277)
7 org.opensearch.index.IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex (6769,7062,7077,7207,7453,7464,7517)
7 org.opensearch.indices.stats.IndexStatsIT.testFilterCacheStats (6585,7154,7183,7255,7292,7300,7551)
4 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testNodesRemovedAfterZoneDecommission_ClusterManagerNotInToBeDecommissionedZone (6599,6602,6731,6771)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a repository with a non existing bucket} (6952,6953,7077,7320)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a repository with a non existing client} (6711,6711,6711,6952)
4 org.opensearch.action.bulk.BulkIntegrationIT.testDeleteIndexWhileIndexing (6624,6635,6723,6979)
4 org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=pit/10_basic/Delete all} (7185,7212,7231,7342)
4 org.opensearch.cluster.service.MasterServiceTests.testThrottlingForMultipleTaskTypes (6894,6894,7074,7177)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a read-only repository with a non existing client} (6591,6591,6952,7201)
4 org.opensearch.clustermanager.ClusterManagerTaskThrottlingIT.testThrottlingForSingleNode (6593,6615,6664,6682)
3 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/teardown} (6766,6953,6956)
3 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Restore a non existing snapshot} (6782,6952,7309)
3 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testNodesRemovedAfterZoneDecommission_ClusterManagerInToBeDecommissionedZone (6606,6709,6895)
3 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testNRTReplicaPromotedAsPrimary (6894,7091,7144)
3 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testInvariantsAndLogsOnDecommissionedNodes (6738,6792,6825)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testShrinkIndexPrimaryTerm (6685,7406)
2 org.opensearch.gateway.QuorumGatewayIT.testQuorumRecovery (6562,7201)
2 org.opensearch.action.bulk.BulkIntegrationIT.testBulkWithWriteIndexAndRouting (6723,6979)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndexToN (6685,7406)
2 org.opensearch.action.bulk.BulkIntegrationIT.testBulkWithGlobalDefaults (6723,6979)
2 org.opensearch.action.bulk.BulkIntegrationIT.testExternallySetAutoGeneratedTimestamp (6723,6979)
2 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a read-only repository with a non existing bucket} (6766,7076)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndex (6685,7406)
2 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationDuringFetchPhase (7167,7463)
2 org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.testTaskResourceTrackingDuringTaskCancellation (6893,7166)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndexFails (6685,7406)
1 org.opensearch.action.admin.indices.create.CreateIndexIT.classMethod (7464)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testSnapshotWithLargeSegmentFiles (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testDeleteBlobs (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testList (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testMultipleSnapshotAndRollback (6589)
1 org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout (6606)
1 org.opensearch.action.admin.cluster.tasks.PendingTasksBlocksIT.testPendingTasksWithClusterNotRecoveredBlock (6653)
1 org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites (6667)
1 org.opensearch.action.bulk.BulkIntegrationIT.testBulkIndexCreatesMapping (6723)
1 org.opensearch.cluster.decommission.DecommissionControllerTests.testTimesOut (6747)
1 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Delete a non existing snapshot} (6758)
1 org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart (6764)
1 org.opensearch.client.PitIT.testDeleteAllAndListAllPits (6781)
1 org.opensearch.client.PitIT.testCreateAndDeletePit (6781)
1 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testReplicaReceivesGenIncrease (6824)
1 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Get a non existing snapshot} (6953)
1 org.opensearch.client.ReindexIT.testReindexTask (6962)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=search.aggregation/20_terms/string profiler via global ordinals} (6970)
1 org.opensearch.cluster.routing.allocation.decider.ConcurrentRecoveriesAllocationDeciderTests.testClusterConcurrentRecoveries (7022)
1 org.opensearch.search.aggregations.metrics.TDigestPercentilesIT.testMultiValuedFieldWithValueScriptReverse (7208)
1 org.opensearch.cluster.ClusterHealthIT.testHealthOnClusterManagerFailover (7272)
1 org.opensearch.search.SearchCancellationIT.testCancellationDuringFetchPhaseUsingRequestParameter (7318)
1 org.opensearch.indices.state.CloseWhileRelocatingShardsIT.testCloseWhileRelocatingShards (7345)
1 org.opensearch.action.admin.indices.create.SplitIndexIT.testCreateSplitIndex (7415)
1 org.opensearch.action.admin.indices.create.SplitIndexIT.testCreateSplitIndexToN (7415)
1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadBlobWithRetries (7422)
1 org.opensearch.action.admin.indices.create.CreateIndexIT.testCreateAndDeleteIndexConcurrently (7464)
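
Many lines in the listing above repeat the same build number several times, so the leading count is not the same as the number of distinct builds affected. A minimal sketch for separating the two is below. The line format is inferred from the listing itself rather than from the crawler script, and `parse_line` is a hypothetical helper.

```python
import re

# Inferred format: "<count> <test name> (<build>,<build>,...)"
LINE = re.compile(r"^(\d+) (.+) \(([\d,]+)\)$")


def parse_line(line):
    """Return (test name, reported failures, distinct builds) or None."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    count, name, builds = match.groups()
    build_ids = [int(b) for b in builds.split(",")]
    return name, int(count), len(set(build_ids))


example = (
    "8 org.opensearch.cluster.service.MasterServiceTests.classMethod "
    "(6894,6894,6894,6894,7074,7074,7177,7177)"
)
print(parse_line(example))
```

For the MasterServiceTests.classMethod line, the 8 reported failures come from only 3 distinct builds, so ranking by distinct builds can give a different ordering than ranking by raw count.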

dblock · 2022-12-06T00:11:02Z

@andrross I swear I wrote very similar code to produce #1715 (comment), but where did I put it? :) thank you!

dblock · 2022-12-06T18:15:29Z

Found it! https://github.com/dblock/gradle-checks

Rishikesh1159 · 2022-12-06T20:31:26Z

Thanks @andrross for the script. I ran @andrross's script to get all flaky tests from the past 2 months (Sep 30, 2022 to Dec 5, 2022). Here is the list of 104 flaky tests found:

Will crawl builds from 3600 to 7680
------------------
130 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Snapshot and Restore with repository-s3 using permanent credentials} (3619,3619,3695,3719,3719,3720,3720,3743,3744,3744,3902,3902,4173,4279,4382,4602,4602,4602,4751,4751,4752,4752,4793,4793,4793,4946,4946,4946,5122,5123,5123,5298,5298,5341,5341,5354,5354,5396,5396,5396,5399,5489,5533,5533,5533,5556,5557,5557,5557,5572,5954,5955,5955,6060,6061,6061,6061,6132,6132,6133,6151,6155,6156,6172,6188,6218,6218,6221,6221,6233,6234,6234,6254,6254,6389,6389,6391,6436,6469,6469,6470,6470,6475,6476,6476,6476,6547,6547,6548,6561,6561,6561,6577,6587,6591,6591,6598,6645,6709,6711,6711,6717,6750,6751,6766,6778,6778,6779,6779,6779,6782,6879,6879,6880,6880,6952,6953,6953,7074,7074,7074,7080,7082,7082,7177,7200,7201,7224,7277,7310)
38 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness (3666,3679,4180,4207,4679,4691,4866,4953,5343,5395,5396,5437,5577,5733,5897,5923,6096,6175,6205,6562,6601,6627,6717,6741,6908,6921,6925,7036,7047,7112,7149,7422,7447,7495,7517,7555,7563,7612)
38 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testIndexDeletionDuringSnapshotCreationInQueue (3858,3914,3961,4292,4293,4332,4382,4514,4539,4603,4858,4897,5426,5467,5489,5525,5530,5552,5788,5973,6081,6130,6132,6199,6234,6343,6376,6546,6790,6828,6965,7220,7256,7315,7361,7447,7543,7644)
37 org.opensearch.clustermanager.ClusterManagerTaskThrottlingIT.testTimeoutWhileThrottling (6028,6199,6350,6350,6351,6359,6359,6365,6365,6365,6371,6399,6399,6411,6413,6413,6415,6436,6436,6436,6458,6458,6468,6547,6547,6554,6556,6593,6594,6594,6598,6599,6601,6602,6602,6602,6742)
35 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Try to create repository with broken endpoint override and named client} (3619,3719,4382,4638,4792,5122,5294,5354,5395,5531,5556,5878,6060,6128,6133,6151,6152,6152,6156,6156,6218,6254,6390,6436,6436,6475,6548,6589,6709,6952,6952,6953,6953,7200,7277)
29 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testCoordinatingPrimaryThreadedUpdateToShardLimitsAndRejections (6474,6481,6607,6616,6628,6700,6700,6720,6759,6759,6762,6828,6887,6971,6971,6975,7027,7112,7115,7168,7168,7202,7315,7315,7596,7596,7611,7617,7617)
25 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testReplicaThreadedUpdateToShardLimitsAndRejections (6585,6681,6962,7046,7090,7095,7149,7149,7149,7158,7188,7206,7206,7253,7253,7253,7274,7274,7274,7327,7463,7483,7492,7651,7651)
17 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testRestoreSnapshotAllocationDoesNotExceedWatermark (3635,3641,3798,3920,3928,4137,4189,4240,4279,4447,4511,4536,4787,4793,4818,4818,5134)
14 org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=pit/10_basic/Delete all} (3658,3695,4453,4599,5142,5347,5740,5858,5894,6183,7185,7212,7231,7342)
14 org.opensearch.indices.stats.IndexStatsIT.testFilterCacheStats (4100,4514,5829,6238,6332,6336,6337,6585,7154,7183,7255,7292,7300,7551)
12 org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests.testSortMissingLast (3964,4234,4268,4272,4446,4826,4879,4891,4975,4975,5114,5121)
12 org.opensearch.cluster.service.MasterServiceTests.classMethod (6894,6894,6894,6894,7074,7074,7177,7177,7634,7634,7634,7634)
9 org.opensearch.action.bulk.BulkIntegrationIT.testDeleteIndexWhileIndexing (3607,3757,3789,3839,4952,6624,6635,6723,6979)
8 org.opensearch.action.admin.indices.create.CreateIndexIT.testCreateAndDeleteIndexConcurrently (3608,3957,4100,4200,5853,6126,6220,7464)
8 org.opensearch.action.admin.indices.create.CreateIndexIT.classMethod (3608,3957,4100,4200,5853,6126,6220,7464)
8 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a repository with a non existing bucket} (4638,5556,6151,6156,6952,6953,7077,7320)
8 org.opensearch.index.IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex (6172,6769,7062,7077,7207,7453,7464,7517)
7 org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart (3616,4279,4700,4802,5396,6554,6764)
7 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a repository with a non existing client} (4450,6156,6390,6711,6711,6711,6952)
7 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a read-only repository with a non existing client} (5341,5341,6233,6591,6591,6952,7201)
6 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testNRTReplicaPromotedAsPrimary (3700,3852,6371,6894,7091,7144)
6 org.opensearch.client.PitIT.testDeleteAllAndListAllPits (3715,4173,4293,5557,6259,6781)
6 org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests.testSortMissingLastReverse (4271,4329,4533,5011,5114,5114)
6 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testDecommissionStatusUpdatePublishedToAllNodes (5165,5379,5530,5612,5642,5677)
6 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testNodesRemovedAfterZoneDecommission_ClusterManagerNotInToBeDecommissionedZone (6356,6359,6599,6602,6731,6771)
6 org.opensearch.cluster.service.MasterServiceTests.testThrottlingForMultipleTaskTypes (6894,6894,7074,7177,7634,7634)
5 org.opensearch.upgrades.RecoveryIT.testRelocationWithConcurrentIndexing (4124,4131,4131,4142,4142)
5 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a read-only repository with a non existing bucket} (4450,4792,6389,6766,7076)
5 org.opensearch.clustermanager.ClusterManagerTaskThrottlingIT.testThrottlingForSingleNode (6463,6593,6615,6664,6682)
4 org.opensearch.action.bulk.BulkIntegrationIT.testBulkIndexCreatesMapping (3607,3789,4952,6723)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Delete a non existing snapshot} (3619,4042,4281,6758)
4 org.opensearch.cluster.decommission.DecommissionControllerTests.testTimesOut (3651,3805,6468,6747)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Get a non existing snapshot} (3695,6390,6476,6953)
4 org.opensearch.search.PitMultiNodeTests.testCreatePitWhileNodeDropWithAllowPartialCreationFalse (3755,4539,5576,6073)
4 org.opensearch.action.bulk.BulkIntegrationIT.testBulkWithGlobalDefaults (3789,4952,6723,6979)
4 org.opensearch.action.bulk.BulkIntegrationIT.testExternallySetAutoGeneratedTimestamp (3789,4952,6723,6979)
4 org.opensearch.action.bulk.BulkIntegrationIT.testBulkWithWriteIndexAndRouting (3789,4952,6723,6979)
4 org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites (3932,4946,6391,6667)
4 org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests.testSortMissingFirstReverse (4279,4294,4420,4714)
4 org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.testTaskResourceTrackingDuringTaskCancellation (4320,5358,6893,7166)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Restore a non existing snapshot} (4751,6782,6952,7309)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/teardown} (5363,6766,6953,6956)
4 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testInvariantsAndLogsOnDecommissionedNodes (5908,6738,6792,6825)
3 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testSegmentReplication_Index_Update_Delete (3739,4867,6401)
3 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testReplicaRestarts (4420,4889,6401)
3 org.opensearch.indices.state.CloseWhileRelocatingShardsIT.testCloseWhileRelocatingShards (4894,6393,7345)
3 org.opensearch.index.shard.IndexShardIT.testIndexCanChangeCustomDataPath (4953,4953,4953)
3 org.opensearch.gateway.QuorumGatewayIT.testQuorumRecovery (5165,6562,7201)
3 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndex (6241,6685,7406)
3 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndexToN (6241,6685,7406)
3 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testNodesRemovedAfterZoneDecommission_ClusterManagerInToBeDecommissionedZone (6606,6709,6895)
2 org.opensearch.http.nio.NioHttpServerTransportTests.testLargeCompressedResponse (3618,7628)
2 org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout (3648,6606)
2 org.opensearch.client.BulkProcessorRetryIT.testBulkRejectionLoadWithBackoff (3802,3821)
2 org.opensearch.search.basic.SearchWithRandomIOExceptionsIT.testRandomDirectoryIOExceptions (3814,5399)
2 org.opensearch.search.basic.SearchWithRandomIOExceptionsIT.classMethod (3814,5399)
2 org.opensearch.action.admin.indices.create.SplitIndexIT.testCreateSplitIndex (4178,7415)
2 org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests.testSortMissingFirst (4925,4975)
2 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=search.aggregation/20_terms/string profiler via global ordinals} (5302,6970)
2 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testWriteBlobWithRetries (5361,5600)
2 org.opensearch.client.ReindexIT.testReindexTask (6007,6962)
2 org.opensearch.action.admin.cluster.tasks.PendingTasksBlocksIT.testPendingTasksWithClusterNotRecoveredBlock (6170,6653)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndexFails (6685,7406)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testShrinkIndexPrimaryTerm (6685,7406)
2 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationDuringFetchPhase (7167,7463)
1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadRangeBlobWithRetries (3778)
1 org.opensearch.client.BulkProcessorRetryIT.testBulkRejectionLoadWithoutBackoff (3821)
1 org.opensearch.gateway.RecoveryFromGatewayIT.testReuseInFileBasedPeerRecovery (3837)
1 org.opensearch.action.admin.indices.create.SplitIndexIT.testSplitFromOneToN (4178)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=indices.split/30_copy_settings/Copy settings during split index} (4236)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=indices.shrink/30_copy_settings/Copy settings during shrink index} (4236)
1 org.opensearch.index.shard.SegmentReplicationIndexShardTests.classMethod (4420)
1 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testCoordinatingPrimaryThreadedUpdateToShardLimits (4758)
1 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationMultiSearchDuringQueryPhase (4926)
1 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testReplicaReceivesLowerGeneration (5234)
1 org.opensearch.cluster.routing.allocation.RemoteShardsMoveShardsTests.testIndexLevelExclusions (5484)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testSnapshotWithLargeSegmentFiles (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testIndicesDeletedFromRepository (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testDeleteBlobs (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testWriteRead (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testRequestStats (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testSnapshotAndRestore (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testMultipleSnapshotAndRollback (5620)
1 org.opensearch.search.SearchCancellationIT.testCancellationDuringQueryPhaseUsingRequestParameter (5760)
1 org.opensearch.discovery.StableClusterManagerDisruptionIT.testStaleClusterManagerNotHijackingMajority (5915)
1 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testShrinkCommitsMergeOnIdle (6241)
1 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testShrinkThenSplitWithFailedNode (6241)
1 org.opensearch.gradle.BuildPluginIT.testInsecureMavenRepository (6406)
1 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationDuringQueryPhase (6430)
1 org.opensearch.search.aggregations.bucket.terms.StringTermsIT.classMethod (6465)
1 org.opensearch.upgrade.DetectEsInstallationTaskTests.testTaskExecution (6537)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testSnapshotWithLargeSegmentFiles (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testDeleteBlobs (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testList (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testMultipleSnapshotAndRollback (6589)
1 org.opensearch.client.PitIT.testCreateAndDeletePit (6781)
1 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testReplicaReceivesGenIncrease (6824)
1 org.opensearch.cluster.routing.allocation.decider.ConcurrentRecoveriesAllocationDeciderTests.testClusterConcurrentRecoveries (7022)
1 org.opensearch.search.aggregations.metrics.TDigestPercentilesIT.testMultiValuedFieldWithValueScriptReverse (7208)
1 org.opensearch.cluster.ClusterHealthIT.testHealthOnClusterManagerFailover (7272)
1 org.opensearch.search.SearchCancellationIT.testCancellationDuringFetchPhaseUsingRequestParameter (7318)
1 org.opensearch.action.admin.indices.create.SplitIndexIT.testCreateSplitIndexToN (7415)
1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadBlobWithRetries (7422)
1 org.opensearch.test.rest.ClientYamlTestSuiteIT.test {p0=search.aggregation/20_terms/string profiler via global ordinals} (7668)
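
For anyone who wants to slice the tally above, here is a minimal Python sketch for parsing one line of it. It assumes each line has the shape `<count> <test name> (<id>,<id>,...)` and that the parenthesized numbers identify the runs in which the test failed; the thread does not spell this out, so treat both as assumptions. The helper names `parse_tally_line` and `repeat_failures` are hypothetical.

```python
import re
from collections import Counter

# Assumed line shape: "<count> <test name> (<id>,<id>,...)".
# The test name may itself contain spaces and braces, so the run list is
# anchored to the end of the line.
LINE_RE = re.compile(r"^(\d+)\s+(.+)\s+\(([\d,]+)\)$")


def parse_tally_line(line):
    """Split one tally line into (count, test name, list of run ids)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized tally line: {line!r}")
    count = int(match.group(1))
    name = match.group(2)
    runs = [int(run) for run in match.group(3).split(",")]
    return count, name, runs


def repeat_failures(runs):
    """Return {run id: occurrences} for ids listed more than once."""
    return {run: n for run, n in Counter(runs).items() if n > 1}


# One line copied from the tally above.
sample = (
    "5 org.opensearch.upgrades.RecoveryIT.testRelocationWithConcurrentIndexing "
    "(4124,4131,4131,4142,4142)"
)
count, name, runs = parse_tally_line(sample)
print(count == len(runs), repeat_failures(runs))
```

On the sample line the leading count matches the number of listed ids, and two ids appear twice, which would be consistent with a test failing more than once in the same run (for example, on a retry).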

dbwiddis · 2023-08-08T17:53:26Z

How flaky is acceptable? I closed #6739 after calculating the expected failure rate of a random-alpha-of-length-5 collision at 1 in 19,164. It failed once on run 12,467. It'll probably fail again in a few years. Is that OK?
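
To put those numbers in perspective, here is a small Python sketch of the arithmetic. It takes the quoted 1-in-19,164 per-run failure rate as given and assumes runs are independent; it does not re-derive the collision rate, since the alphabet size and number of draws behind it are not stated here. The function name is hypothetical.

```python
def prob_at_least_one_failure(one_in, runs):
    """Chance of seeing at least one failure in `runs` independent runs,
    when each run fails with probability 1 / one_in."""
    p = 1.0 / one_in
    return 1.0 - (1.0 - p) ** runs


ONE_IN = 19_164  # per-run failure odds quoted in the comment above
RUNS = 12_467    # run on which the failure was observed

chance = prob_at_least_one_failure(ONE_IN, RUNS)
print(f"P(at least one failure in {RUNS} runs) = {chance:.2f}")
```

Under these assumptions, a failure somewhere in the first 12,467 runs is close to a coin flip, so the observed failure is unremarkable, and the expected gap between failures stays at about 19,164 runs.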

anasalkouz · 2023-10-25T23:35:39Z

Closing this campaign.

anasalkouz added the enhancement (Enhancement or improvement to existing feature or request) and flaky-test (Random test failure that succeeds on second run) labels Dec 13, 2021

anasalkouz changed the title ~~Put a Plan~~ Put a plan for the flaky random test failures Dec 13, 2021

anasalkouz changed the title ~~Put a plan for the flaky random test failures~~ Put a plan for the flakey random test failures Dec 13, 2021

Poojita-Raj assigned dreamer-89 Dec 22, 2021

This was referenced Dec 29, 2021

[CI] Intermittent java.lang.Exception: Suite timeout exceeded (>= 1200000 msec). #1826

Closed

[BUG] Intermittent java.lang.Exception: Suite timeout exceeded (>= 1200000 msec). #1843

Closed

meghasaik mentioned this issue Jan 12, 2022

[BUG] Netty Transport test failing with large responses #1847

Closed

dblock mentioned this issue Jan 14, 2022

[BWC] Ensure 2.x compatibility with Legacy 7.10.x #1902

Merged

nknize mentioned this issue Jan 14, 2022

[BUG] node drop on o.o.cluster.routing.allocation.decider.MockDiskUsagesIT.testRerouteOccursOnDiskPassingHighWatermark #1907

Closed

dblock added the Meta (Meta issue, not directly linked to a PR) label Jan 14, 2022

dblock changed the title ~~Put a plan for the flakey random test failures~~ [Meta] Fix random test failures Jan 14, 2022

tlfeng unassigned dreamer-89 Jan 21, 2022

nknize mentioned this issue Jan 26, 2022

[Deprecate] Setting explicit version on analysis component #1978

Merged

This was referenced Mar 17, 2022

[Proposal] Automate detection and quarantine of flakey tests #2496

Open

Retry failed flakey tests #2547

Closed

dblock assigned anasalkouz Nov 11, 2022

Rishikesh1159 mentioned this issue Dec 5, 2022

[BUG] Fix flaky test org.opensearch.clustermanager.ClusterManagerTaskThrottlingIT.testTimeoutWhileThrottling #5452

Closed

dreamer-89 mentioned this issue Dec 8, 2022

[BUG] org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testIndexDeletionDuringSnapshotCreationInQueue flaky #5031

Closed

cwperks mentioned this issue Dec 13, 2022

[Feature/Identity] Identity Module and tokens for internal authentication #5471

Closed

6 tasks

ntantri mentioned this issue Dec 29, 2022

Added support for feature flags in opensearch.yml #4959

Merged

6 tasks

dreamer-89 mentioned this issue Dec 31, 2022

[Meta] Segment Replication flaky test failures #5669

Closed

8 tasks

dreamer-89 mentioned this issue Jan 20, 2023

MasterServiceTests.testThrottlingForMultipleTaskTypes #5958

Closed

VachaShah mentioned this issue Feb 9, 2023

Bump reactor-netty-http from 1.0.24 to 1.1.2 in /plugins/repository-azure #5973

Merged

navneet1v mentioned this issue Feb 10, 2023

[AUTOCUT] Gradle Check Failure on push to main #6286

Closed

nknize mentioned this issue May 22, 2023

[Refactor] OpenSearchException streamables to a registry #7646

Merged

BhumikaSaini-Amazon mentioned this issue Jun 19, 2023

[Remote Segment Store] Add Lucene major version to UploadedSegmentMetadata #8088

Merged

6 tasks

harishbhakuni mentioned this issue Sep 7, 2023

[Snapshot Interop] Fix Flaky Snapshot Interop Tests #9795

Merged

6 tasks

nknize mentioned this issue Oct 4, 2023

Block merging code/features into 2.12 until all flaky tests are fixed. #10371

Closed

reta mentioned this issue Oct 20, 2023

[Backport 2.x] Update Github pull request template to have a task for inspecting failing checks #10792

Merged

dhwanilpatel mentioned this issue Oct 21, 2023

[Backport 2.x] Fix flaky remote cluster state UT #10784

Merged

7 tasks

anasalkouz closed this as completed Oct 25, 2023

dbwiddis mentioned this issue Apr 17, 2024

[MAINTAINERS] [Action Required] Your help reviewing changes and issues in opensearch-project/OpenSearch #12970

Closed
