Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests/fix randomly failing testWatcherRestart #35243

Merged
merged 6 commits into from
Jan 10, 2019

Conversation

pgomulka
Copy link
Contributor

@pgomulka pgomulka commented Nov 5, 2018

As noted on an original ticket, the test fails when it starts to perform a stop while watchers were still in a starting phase. This is reproducible on 6.x (fails around 10% of a time)
When adding additional assertion waiting for watchers to be in a started phase, this seems to be fixed. No failures out of 1000times run.

closes #33753

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@pgomulka
Copy link
Contributor Author

pgomulka commented Nov 5, 2018

@danielmitterdorfer would you be interested in reviewing this?

@danielmitterdorfer
Copy link
Member

@pgomulka sorry missed your review request earlier today. It seems the test has failed on the PR build in CI?

12:01:17 FAILURE 10.5s | WatchBackwardsCompatibilityIT.testWatcherRestart <<< FAILURES!
12:01:17    > Throwable #1: java.lang.AssertionError: 
12:01:17    > Expected: not a string containing "\"watcher_state\":\"stopped\""
12:01:17    >      but: was "{\"_nodes\":{\"total\":3,\"successful\":3,\"failed\":0},\"cluster_name\":\"rolling-upgrade\",\"manually_stopped\":false,\"stats\":[{\"node_id\":\"qEaJXBb1QpykMWg1zoHQXQ\",\"watcher_state\":\"stopped\",\"watch_count\":-1,\"execution_thread_pool\":{\"queue_size\":-1,\"max_size\":-1}},{\"node_id\":\"c4ECo32dS-iJma93i_bT6Q\",\"watcher_state\":\"started\",\"watch_count\":1,\"execution_thread_pool\":{\"queue_size\":0,\"max_size\":5}},{\"node_id\":\"iDdnsQzmQi6NTgadGo7JoQ\",\"watcher_state\":\"started\",\"watch_count\":0,\"execution_thread_pool\":{\"queue_size\":0,\"max_size\":0}}]}"
12:01:17    > 	at __randomizedtesting.SeedInfo.seed([B30C18912E5C52C3:2654161E65408B9A]:0)
12:01:17    > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
12:01:17    > 	at org.elasticsearch.upgrades.WatchBackwardsCompatibilityIT.lambda$ensureWatcherStarted$13(WatchBackwardsCompatibilityIT.java:315)
12:01:17    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:848)
12:01:17    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:822)
12:01:17    > 	at org.elasticsearch.upgrades.WatchBackwardsCompatibilityIT.lambda$ensureWatcherStarted$14(WatchBackwardsCompatibilityIT.java:309)
12:01:17    > 	at org.elasticsearch.upgrades.WatchBackwardsCompatibilityIT.executeAgainstMasterNode(WatchBackwardsCompatibilityIT.java:274)
12:01:17    > 	at org.elasticsearch.upgrades.WatchBackwardsCompatibilityIT.ensureWatcherStarted(WatchBackwardsCompatibilityIT.java:309)
12:01:17    > 	at org.elasticsearch.upgrades.WatchBackwardsCompatibilityIT.testWatcherRestart(WatchBackwardsCompatibilityIT.java:182)
12:01:17    > 	at java.lang.Thread.run(Thread.java:748)
12:01:17    > 	Suppressed: java.lang.AssertionError: 
12:01:17    > Expected: not a string containing "\"watcher_state\":\"stopped\""
12:01:17    >      but: was "{\"_nodes\":{\"total\":3,\"successful\":3,\"failed\":0},\"cluster_name\":\"rolling-upgrade\",\"manually_stopped\":false,\"stats\":[{\"node_id\":\"qEaJXBb1QpykMWg1zoHQXQ\",\"watcher_state\":\"stopped\",\"watch_count\":-1,\"execution_thread_pool\":{\"queue_size\":-1,\"max_size\":-1}},{\"node_id\":\"c4ECo32dS-iJma93i_bT6Q\",\"watcher_state\":\"started\",\"watch_count\":1,\"execution_thread_pool\":{\"queue_size\":0,\"max_size\":5}},{\"node_id\":\"iDdnsQzmQi6NTgadGo7JoQ\",\"watcher_state\":\"started\",\"watch_count\":0,\"execution_thread_pool\":{\"queue_size\":0,\"max_size\":0}}]}"
12:01:17    > 		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
12:01:17    > 		at org.elasticsearch.upgrades.WatchBackwardsCompatibilityIT.lambda$ensureWatcherStarted$13(WatchBackwardsCompatibilityIT.java:315)
12:01:17    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:836)
12:01:17    > 		... 42 more
...

fix typos and refactor to DRY up documentation for bulk, reindex and migration apis
relates elastic#35345
@pgomulka
Copy link
Contributor Author

The testWatcherRestart seems to be failing on 6.x branch because it is running on a cluster which is not in an expected state (all watchers started). This is most likely because previous tests did not clean up cluster after its run properly. It could be either testWatcherStats or testWatcherCrud

 1> [2018-11-16T09:00:54,801][INFO ][o.e.u.WatchBackwardsCompatibilityIT] [testWatcherStats] There are still tasks running after this test that might break subsequent tests [internal:index/shard/recovery/start_recovery, internal:index/shard/recovery/translog_ops].
  1> [2018-11-16T09:00:54,802][INFO ][o.e.u.WatchBackwardsCompatibilityIT] [testWatcherStats] after test
  1> [2018-11-16T09:00:54,814][INFO ][o.e.u.WatchBackwardsCompatibilityIT] [testWatcherRestart] before test

Here we can see the received state of the cluster

1> [2018-11-16T09:00:55,182][INFO ][o.e.u.WatchBackwardsCompatibilityIT] [testWatcherRestart] Nodes in cluster before test: bwc [[Node{id='_GNLC00iRCG0iaNtbh0OpQ', nodeName='node-2', version=5.6.14, address=http://127.0.0.1:42343}]], new [[Node{id='pWFKeECcQxKU4HISKwKRWA', nodeName='upgraded-node-1', version=6.6.0, address=http://127.0.0.1:42958}, Node{id='VV2NI2eNTKOPbxkF582xdQ', no
deName='upgraded-node-0', version=6.6.0, address=http://127.0.0.1:37879}]], master [Node{id='VV2NI2eNTKOPbxkF582xdQ', nodeName='upgraded-node-0', version=6.6.0, address=http://127.0.0.1:37879}]

Additional assertion was added to wait until the cluster is in expected state

1> [2018-11-16T09:00:55,417][INFO ][o.e.u.WatchBackwardsCompatibilityIT] [testWatcherRestart] ensureWatcherStarted(), stats response [{"_nodes":{"total":3,"successful":3,"failed":0},"cluster_name":"rolling-upgrade","manually_stopped":false,"stats":[{"node_id":"VV2NI2eNTKOPbxkF582xdQ","watcher_state":"started","watch_count":1,"execution_thread_pool":{"queue_size":0,"max_size":3}},{"no
de_id":"pWFKeECcQxKU4HISKwKRWA","watcher_state":"started","watch_count":0,"execution_thread_pool":{"queue_size":0,"max_size":0}},{"node_id":"_GNLC00iRCG0iaNtbh0OpQ","watcher_state":"stopped","watch_count":-1,"execution_thread_pool":{"queue_size":-1,"max_size":-1}}]}]

Test failing as the watcher never recovers from failed state.

 > Throwable #1: java.lang.AssertionError:
   > Expected: not a string containing "\"watcher_state\":\"stopped\""
   >      but: was "{\"_nodes\":{\"total\":3,\"successful\":3,\"failed\":0},\"cluster_name\":\"rolling-upgrade\",\"manually_stopped\":false,\"stats\":[{\"node_id\":\"VV2NI2eNTKOPbxkF582xdQ\",\"watcher_state\":\"started\",\"watch_count\":1,\"execution_thread_pool\":{\"queue_size\":0,\"max_size\":3}},{\"node_id\":\"pWFKeECcQxKU4HISKwKRWA\",\"watcher_state\":\"started\",\"watch_count\":0,\"execution_thread_pool\":{\"queue_size\":0,\"max_size\":0}},{\"node_id\":\"_GNLC00iRCG0iaNtbh0OpQ\",\"watcher_state\":\"stopped\",\"watch_count\":-1,\"execution_thread_pool\":{\"queue_size\":-1,\"max_size\":-1}}]}”

However, this test is also on master as WatcherRestartIT and it tend to pass all the time. It only contains one test testWatcherRestart so it is never receiving a cluster from previous run.
The fix could be to refactor testWatcherRestart to a separate class, but that would be just a workaround as the main reason - not cleaning the cluster - would be still broken

@cbuescher
Copy link
Member

@pgomulka fyi I muted some of the tests in #33753 (comment)
Whats the current state of this PR? Do you think it would fix the failures at least on the 6.x/6.5 branches? Sorry I didn't have time to dig into the details deeper so far, just wanted to ping you on this.

@pgomulka
Copy link
Contributor Author

@cbuescher The fix from this PR won't be enough. I will look into this again as #35271 might help here

only muting two failing test cases in a subclass where they fail.
issue referencing the problem: elastic#36598
@jasontedor jasontedor added v6.7.0 and removed v6.6.0 labels Dec 19, 2018
@cbuescher
Copy link
Member

@pgomulka any updates on this? Has the merge of #35271 proven to be useful or does the issue still exist?

@pgomulka
Copy link
Contributor Author

pgomulka commented Jan 8, 2019

retest this please

@pgomulka
Copy link
Contributor Author

@cbuescher it was usually not easy to reproduce locally, but I run the failing jobs again on my machine and seems like the problem is at least not existing on these seeds.
I think we can merge that change where I remove the suppression and that extra line ensureWatcherStarted(); That extra line is there as in some unfortunate timings the watcher was not started by the time the test was running and was helping to prevent failures.

Copy link
Member

@cbuescher cbuescher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pgomulka great, lets try this on 6.x then

@cbuescher cbuescher removed the v6.4.2 label Jan 10, 2019
@pgomulka pgomulka merged commit 7f20c0b into elastic:6.x Jan 10, 2019
@cbuescher
Copy link
Member

I have a test failure in a 6.x PR that contains the merged commit:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-2/4142/console

20:18:13 Tests with failures:
20:18:13   - org.elasticsearch.upgrades.WatchBackwardsCompatibilityIT.testWatcherRestart
20:18:08 FAILURE 10.6s | WatchBackwardsCompatibilityIT.testWatcherRestart <<< FAILURES!
20:18:08    > Throwable #1: java.lang.AssertionError: 
20:18:08    > Expected: not a string containing "\"watcher_state\":\"stopped\""
20:18:08    >      but: was "{\"_nodes\":{\"total\":3,\"successful\":3,\"failed\":0},\"cluster_name\":\"rolling-upgrade\",\"manually_stopped\":false,\"stats\":[{\"node_id\":\"07ZkAMMdQO-st1-Ig3bb4A\",\"watcher_state\":\"started\",\"watch_count\":1,\"execution_thread_pool\":{\"queue_size\":0,\"max_size\":3}},{\"node_id\":\"d6hg5rTOR1OCl7hd3lrUoQ\",\"watcher_state\":\"started\",\"watch_count\":0,\"execution_thread_pool\":{\"queue_size\":0,\"max_size\":0}},{\"node_id\":\"mEAFzcpiQWyshkvsH0_oUA\",\"watcher_state\":\"stopped\",\"watch_count\":-1,\"execution_thread_pool\":{\"queue_size\":-1,\"max_size\":-1}}]}"
20:18:08    > 	at __randomizedtesting.SeedInfo.seed([4FC250442116C19D:DA9A5ECB6A0A18C4]:0)
20:18:08    > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
20:18:08    > 	at org.elasticsearch.upgrades.WatchBackwardsCompatibilityIT.lambda$ensureWatcherStarted$13(WatchBackwardsCompatibilityIT.java:315)
20:18:08    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:848)
20:18:08    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:822)
20:18:08    > 	at org.elasticsearch.upgrades.WatchBackwardsCompatibilityIT.lambda$ensureWatcherStarted$14(WatchBackwardsCompatibilityIT.java:309)
20:18:08    > 	at org.elasticsearch.upgrades.WatchBackwardsCompatibilityIT.executeAgainstMasterNode(WatchBackwardsCompatibilityIT.java:274)
20:18:08    > 	at org.elasticsearch.upgrades.WatchBackwardsCompatibilityIT.ensureWatcherStarted(WatchBackwardsCompatibilityIT.java:309)
20:18:08    > 	at org.elasticsearch.upgrades.WatchBackwardsCompatibilityIT.testWatcherRestart(WatchBackwardsCompatibilityIT.java:182)
20:18:08    > 	at java.lang.Thread.run(Thread.java:748)
20:18:08    > 	Suppressed: java.lang.AssertionError: 
20:18:08    > Expected: not a string containing "\"watcher_state\":\"stopped\""
20:18:08    >      but: was "{\"_nodes\":{\"total\":3,\"successful\":3,\"failed\":0},\"cluster_name\":\"rolling-upgrade\",\"manually_stopped\":false,\"stats\":[{\"node_id\":\"07ZkAMMdQO-st1-Ig3bb4A\",\"watcher_state\":\"started\",\"watch_count\":1,\"execution_thread_pool\":{\"queue_size\":0,\"max_size\":3}},{\"node_id\":\"d6hg5rTOR1OCl7hd3lrUoQ\",\"watcher_state\":\"started\",\"watch_count\":0,\"execution_thread_pool\":{\"queue_size\":0,\"max_size\":0}},{\"node_id\":\"mEAFzcpiQWyshkvsH0_oUA\",\"watcher_state\":\"stopped\",\"watch_count\":-1,\"execution_thread_pool\":{\"queue_size\":-1,\"max_size\":-1}}]}"
20:18:08    > 		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
20:18:08    > 		at org.elasticsearch.upgrades.WatchBackwardsCompatibilityIT.lambda$ensureWatcherStarted$13(WatchBackwardsCompatibilityIT.java:315)
20:18:08    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:836)
20:18:08    > 		... 42 more
20:18:08    > 	Suppressed: java.lang.AssertionError: 
20:18:08    > Expected: not a string containing "\"watcher_state\":\"stopped\""
20:18:08    >      but: was "{\"_nodes\":{\"total\":3,\"successful\":3,\"failed\":0},\"cluster_name\":\"rolling-upgrade\",\"manually_stopped\":false,\"stats\":[{\"node_id\":\"07ZkAMMdQO-st1-Ig3bb4A\",\"watcher_state\":\"started\",\"watch_count\":1,\"execution_thread_pool\":{\"queue_size\":0,\"max_size\":3}},{\"node_id\":\"d6hg5rTOR1OCl7hd3lrUoQ\",\"watcher_state\":\"started\",\"watch_count\":0,\"execution_thread_pool\":{\"queue_size\":0,\"max_size\":0}},{\"node_id\":\"mEAFzcpiQWyshkvsH0_oUA\",\"watcher_state\":\"stopped\",\"watch_count\":-1,\"execution_thread_pool\":{\"queue_size\":-1,\"max_size\":-1}}]}"
20:18:08    > 		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
20:18:08    > 		at org.elasticsearch.upgrades.WatchBackwardsCompatibilityIT.lambda$ensureWatcherStarted$13(WatchBackwardsCompatibilityIT.java:315)
20:18:08    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:836)
20:18:08    > 		... 42 more
./gradlew :x-pack:qa:rolling-upgrade:with-system-key:v5.6.15#twoThirdsUpgradedTestRunner -Dtests.seed=4FC250442116C19D -Dtests.class=org.elasticsearch.upgrades.WatchBackwardsCompatibilityIT -Dtests.method="testWatcherRestart" -Dtests.security.manager=true -Dtests.locale=en-IE -Dtests.timezone=SystemV/YST9YDT -Dcompiler.java=11 -Druntime.java=8

I'll quickly scan the last 6.x CI failures and will revert 7f20c0b if I find more WatchBackwardsCompatibilityIT failures.

@cbuescher
Copy link
Member

There seem to be several, I just add the last one here:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/933/console

Will revert 7f20c0b

cbuescher pushed a commit that referenced this pull request Jan 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants