-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CI] BasicWatcherTests.testModifyWatches fails in CI #35503
Comments
Pinging @elastic/es-core-infra |
After lots of trying, (10s of thousands of runs of this test) under various conditions, including on a heavily-loaded system, on 6.x and master, I have not been able to reproduce this failure once. This test isn't muted, and has not failed on CI since this issue was opened 3 months ago. While it has failed a few times on CI within the past year, it has only failed with this particular failure once, and the other times were before some significant Watcher refactorings/bug fixes, so I'm not sure if those failures are relevant. I'm very hesitant to make any changes to this test to try to address this failure given that I can't reproduce it, and thus have no way of verifying if any changes actually increased the stability of the test, so I'm wondering what we should do with this issue: Unless someone else has some insight into how we might be able to better reproduce this, I don't think we'll be able to do much for the moment. |
Build-stats link |
I dug into these failures while investigating #37882, which seems to have the same root cause as both this issue and #39306. From looking at logs and stepping through code, I saw the same cause that @jaymode identified above. There is a change in shard routing that causes the watch service to be reloaded, which then clears the list of registered watches from the mock trigger engine. The assertion errors we're seeing can be consistently reproduced if we simulate a change in shard routing, and repeat a test case ~10 times:
I don't think that waiting on a green cluster health will improve the situation, since we already wait for the watcher indices to be green in
From looking at the mock trigger engine, I actually don't think it's correct to be clearing the list of watches when watcher is reloaded. I opened a PR that proposes to remove this behavior, which could also help with these test failures: #39724. With that change, I can no longer reproduce the failures when simulating a change in shard routing. |
I'm closing this now that #39724 has merged. Feel free to reopen the issue if similar failures pop up again. |
Previous issues around these tests (but not convinced that they are the same cause or not). #35503 #39306 Previous investigations have brought about changes in this PR #39724 Previous guesses at the cause: reloading causes watches to be cleared (fixed via the PR), internalCluster not shutting down properly. Failing builds: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+periodic/951/consoleFull
Relevant trace:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/2761/console Reproduction:
Trace:
Looking at the two traces, it seems that the watcher reload occurs BEFORE the watch is created. So, I am not sure it is the same suspected cause as before (reloading causing the watches to be removed). Similar failure:
|
Another failure around the same thing:
Muting these BasicWatcherTests and WebhookIntegrationTests as many of them appear to be failing across master and 7.x |
Another test suffering from the same For example:
I muted this test too: master: dbbad81 It didn't reproduce locally for me with:
|
This should be fixed in master by #40658, if anyone runs across this failure let me know. I'll wait for the backport before closing this. |
#40658 has been merged to all currently maintained branches after a week of running in master with no failures in this test, so I'm closing this issue. |
Note: Later in the logs we see the output along the lines of |
Closing this issue as I can not find any actionable items here. None of the tests referenced here are currently disabled and (outside a very small handful of PRs) could not find any evidence of failures in the past 90 days. The small handful of PRs appear to have larger problems and include many other test failures. If any of the referenced tests fail please log a new issue. |
BasicWatcherTests testModify watches failed in CI. I looked at the test and one possible cause could be a concurrent reloading of the watch service, which will remove all watches from the trigger engine since the trigger service gets paused. Looking at the logs we see a reload get triggered right after adding the watch due to new local watcher shard allocation ids. Maybe we should wait for a green cluster health or use
ensureClusterStateConsistency
in this test?Failure link: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+release-tests/92/consoleFull
Reproduce line:
The text was updated successfully, but these errors were encountered: