[CI] FullClusterRestartIT#testWatcher failures #48381
Pinging @elastic/es-core-features (:Core/Features/Watcher)
Increased the timeout for yellow state in #48434 (will backport) and will look into the assertion error: https://gradle-enterprise.elastic.co/s/kcjrg2hoa7zqe/
Another failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob+fast+bwc/1837/console I think the last failures had the
…r yellow The timeout was increased to 60s to allow this test more time to reach a yellow state. However, the test will still on occasion fail even with the 60s timeout. Related: elastic#48381 Related: elastic#48434 Related: elastic#47950 Related: elastic#40178
…tic#48848) The timeout was increased to 60s to allow this test more time to reach a yellow state. However, the test will still on occasion fail even with the 60s timeout. Related: elastic#48381 Related: elastic#48434 Related: elastic#47950 Related: elastic#40178
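For context, a minimal sketch of what a 60s wait for yellow looks like with the low-level REST client these qa tests use; the helper name and exact parameters here are illustrative, not the actual FullClusterRestartIT code:

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

import java.io.IOException;

// Sketch: ask the cluster health API to wait for at least yellow status,
// giving it up to 60 seconds before giving up.
static void waitForYellow(RestClient client, String index) throws IOException {
    Request request = new Request("GET", "/_cluster/health/" + index);
    request.addParameter("wait_for_status", "yellow");
    request.addParameter("timeout", "60s");
    Response response = client.performRequest(request);
    // Callers typically also check the "timed_out" flag in the response body
    // rather than relying on the HTTP status alone.
}
```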
This test has been re-muted across branches and the timeout reduced back to the original 30s.
Previously this test failed waiting for yellow: https://gradle-enterprise.elastic.co/s/fv55holsa36tg/console-log#L2676 Oddly, cluster health returned red status, but there were no unassigned, relocating, or initializing shards. Placed the wait for green in a try-catch block, so that when this fails again the cluster state gets printed. Relates to elastic#48381
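A rough illustration of the try-catch approach that commit message describes; the method name and logging here are placeholders, not the real test code:

```java
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

import java.io.IOException;

// Sketch: wait for green, and if the wait fails, fetch and print the cluster
// state so the next CI failure leaves something useful in the logs.
static void waitForGreenOrDumpState(RestClient client, String index) throws IOException {
    Request health = new Request("GET", "/_cluster/health/" + index);
    health.addParameter("wait_for_status", "green");
    health.addParameter("timeout", "60s");
    try {
        client.performRequest(health);
    } catch (IOException e) {
        String state = EntityUtils.toString(
            client.performRequest(new Request("GET", "/_cluster/state")).getEntity());
        System.err.println("cluster state on failed health check: " + state);
        throw e;
    }
}
```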
This test hasn't failed since it was enabled on
removed unchecked suppress warnings. See #48381
This test failed twice yesterday on the master and 7.x branches. I will mute the test now and investigate these failures with the additional logging that was added recently.
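For reference, muting a test in this repository is normally done with an @AwaitsFix annotation pointing at the tracking issue; a hedged sketch of the shape (the actual mute commit may look different):

```java
import org.apache.lucene.util.LuceneTestCase.AwaitsFix;

// Sketch: the annotation skips the test until the linked issue is fixed.
@AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/48381")
public void testWatcher() throws Exception {
    // ... existing test body ...
}
```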
Unmuted the test; when it fails again, the test will capture watcher stats as well. The watcher debug logs don't indicate that a watch executes, which could result in the index action not being executed. I think the watch may be stuck, and hopefully the watcher stats will capture this.
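A minimal sketch of capturing those stats via the _watcher/stats API; the helper name is made up, and the real change may hook into the test's failure handling differently:

```java
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

import java.io.IOException;

// Sketch: dump the full watcher stats, including currently executing and
// queued watches, so a stuck watch would show up in the CI output.
static String watcherStats(RestClient client) throws IOException {
    Request request = new Request("GET", "/_watcher/stats/_all");
    return EntityUtils.toString(client.performRequest(request).getEntity());
}
```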
This test hasn't failed for almost a month. I will close this issue.
@martijnvg I just saw that this test is still muted on 7.x. Should your unmute commit 9d7b80f be backported, or was this done on purpose?
@cbuescher I think I forgot the backport... I will backport the commit you mention today.
Reopening the issue, as we had another failure on master today.
Log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-darwin-compatibility/176/console
REPRODUCE WITH: ./gradlew ':x-pack:qa:full-cluster-restart:v8.0.0#upgradedClusterTest' --tests "org.elasticsearch.xpack.restart.FullClusterRestartIT.testWatcher"
The failure doesn't reproduce locally.
The logging that was added to help debug this issue is still there today - this line:
That line was added over a year ago in 106c3ce. It means that the server-side logs for the X-Pack full cluster restart tests are 99% watcher debug, making it hard to debug anything else. Is this extra debug-level logging still required today?
@droberts195 I will remove that line. If needed, we can enable watcher debug logging later when we get back to investigating this test failure (and not forget to remove it).
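If the debug logging is ever needed again, one option is the dynamic logger setting in the cluster settings API rather than a permanent line in the build files; a hedged sketch, and note the logger name here is an assumption since the original line is not quoted above:

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

import java.io.IOException;

// Sketch: logger.* settings are dynamic, so debug logging can be switched on
// (and later reset by setting it back to null) while investigating.
static void enableWatcherDebugLogging(RestClient client) throws IOException {
    Request request = new Request("PUT", "/_cluster/settings");
    request.setJsonEntity(
        "{\"transient\": {\"logger.org.elasticsearch.xpack.watcher\": \"debug\"}}");
    client.performRequest(request);
}
```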
maybe enabling it again when investigating watcher full cluster restart qa tests (#48381)
A different instance of this test failing: https://gradle-enterprise.elastic.co/s/alpg4ojedmfhg
Started happening again (failed 5 times over the last week).
Another failure, again due to timeout: https://gradle-enterprise.elastic.co/s/o6w6vll7qjego
Another failure: https://gradle-enterprise.elastic.co/s/htn2dtr4ajl5c
This one seems to be a genuine test failure.
And another, a timeout this time: https://gradle-enterprise.elastic.co/s/snmnlk6itki4w
Looks like this fits here: https://gradle-enterprise.elastic.co/s/tb7rzfmtfukac
Pinging @elastic/es-data-management (Team:Data Management)
I think that the version is getting incremented because watcher sometimes happens to run, and it updates the document in
The timeout failure is more interesting though. From the latest failure, it appears to be happening because the
I realized I hadn't been paying attention when I said that the watch had run 29 times -- the threadpool max size was 29. In the run that failed there were only 2 watches in the watch stats. The test creates 3 watches. I dumped out the watcher stats in a successful run, and one node showed 2 watches and the other 1 watch (as you'd expect). So it appears that for some reason in this timeout run the
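For comparison, a rough way to total watch_count across nodes from the stats response, assuming the entityAsMap helper from ESRestTestCase and the documented per-node "stats" array in the _watcher/stats output:

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

import java.io.IOException;
import java.util.List;
import java.util.Map;

// Sketch: sum watch_count over the per-node "stats" entries; for this test
// the total should be 3 once all watches have been loaded and distributed.
@SuppressWarnings("unchecked")
static int totalWatchCount(RestClient client) throws IOException {
    Response response = client.performRequest(new Request("GET", "/_watcher/stats"));
    Map<String, Object> body = entityAsMap(response); // helper from ESRestTestCase
    List<Map<String, Object>> nodeStats = (List<Map<String, Object>>) body.get("stats");
    return nodeStats.stream()
        .mapToInt(node -> ((Number) node.get("watch_count")).intValue())
        .sum();
}
```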
This issue has been closed because it has been open for too long with no activity. Any muted tests that were associated with this issue have been unmuted. If the tests begin failing again, a new issue will be opened, and they may be muted again.
This has been failing a couple of times a day since it was re-enabled in #48000.
Failures fall into two types. On 7.x and 7.5:
e.g. https://build-stats.elastic.co/app/kibana#/doc/b646ed00-7efc-11e8-bf69-63c8ef516157/build-*/t?id=20191023094703-5F65A9C2&_g=()
and on master:
e.g. https://build-stats.elastic.co/app/kibana#/doc/b646ed00-7efc-11e8-bf69-63c8ef516157/build-*/t?id=20191022054329-BF6E5EA6&_g=()