[CI] FileSettingsRoleMappingsStartupIT testFailsOnStartMasterNodeWithError failing #98391
Comments
Pinging @elastic/es-core-infra (Team:Core/Infra) |
This doesn't fail in over 1000 local runs. The 20s timeout may occasionally be too short when the test runs on a cloud machine |
investigating failing tests relates #98391
Failed in https://gradle-enterprise.elastic.co/s/ehay4b7m3bshm/console-log?task=:x-pack:plugin:security:internalClusterTest (8.10 so no extra logging) |
Here's a build on Main from September 9th where the test failed: https://gradle-enterprise.elastic.co/s/cydubbrfqdqv6/tests/task/:x-pack:plugin:security:internalClusterTest/details/org.elasticsearch.xpack.security.FileSettingsRoleMappingsStartupIT/testFailsOnStartMasterNodeWithError?top-execution=1 |
In order to continue debugging #98391, this commit adds more debug logging to the test, to determine if the error metadata is not being placed in the cluster state correctly.
I added some more debug logging in #100313. It still appears that the cluster state update containing the error metadata disappears. We do see that the update made it into a cluster state update task:
Hopefully the next time this occurs, the new logging will help determine whether the cluster state update appears in a different form than we expect, causing us to never count down the latch. |
Another failure:
|
I've investigated several more recent failures (1, 2, 3, 4, 5). The pattern in all of them looks a bit different from what @rjernst observed above: the file settings changes were never picked up at all, so we never reach the expected error state. That suggests an issue with the filesystem watcher service. Also noticeable: the failures happened almost exclusively on GCP hosts (I've seen one exception), though running on GCP is not a sufficient condition either. Wondering if we should be using |
Looking at the first example failure you linked @mosche, it looks like the write to the json file may be happening before the file watcher service is actually started.
I suspect the problem is we start the master only node with:
But we don't wait to ensure the file watcher service is running before writing the json file. We do account for the file being created after the service is started, but perhaps we get lucky most of the time and the file already exists when the service starts. @mosche Could you try creating an explicit test for the file being created after the file watcher is set up?
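To illustrate the ordering described above, here is a minimal sketch of a check that writes the file only after the watcher has signalled it is registered. It uses the plain JDK `WatchService`, not Elasticsearch's file watcher service, and the class name `WatcherStartRace`, the method `writeAfterWatcherReady`, and the file name `settings.json` are all hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only: this is not the Elasticsearch test code.
public class WatcherStartRace {

    /**
     * Starts a watcher thread on dir, waits until it has registered,
     * then writes settings.json and waits for the watcher to observe it.
     * Returns true if the file was observed within the timeout.
     */
    static boolean writeAfterWatcherReady(Path dir, long timeoutSeconds) throws Exception {
        CountDownLatch watcherReady = new CountDownLatch(1);
        CountDownLatch fileSeen = new CountDownLatch(1);

        Thread watcher = new Thread(() -> {
            try (WatchService ws = FileSystems.getDefault().newWatchService()) {
                dir.register(ws,
                    StandardWatchEventKinds.ENTRY_CREATE,
                    StandardWatchEventKinds.ENTRY_MODIFY);
                // Signal readiness only after registration has completed.
                watcherReady.countDown();
                while (true) {
                    WatchKey key = ws.take();
                    for (WatchEvent<?> event : key.pollEvents()) {
                        if ("settings.json".equals(String.valueOf(event.context()))) {
                            fileSeen.countDown();
                            return;
                        }
                    }
                    if (!key.reset()) {
                        return;
                    }
                }
            } catch (IOException | InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        watcher.setDaemon(true);
        watcher.start();

        // Without this wait, the write below could race with registration.
        if (!watcherReady.await(timeoutSeconds, TimeUnit.SECONDS)) {
            return false;
        }
        Files.writeString(dir.resolve("settings.json"), "{}", StandardCharsets.UTF_8);

        boolean seen = fileSeen.await(timeoutSeconds, TimeUnit.SECONDS);
        watcher.interrupt();
        return seen;
    }
}
```

The explicit `watcherReady` latch is the design point here: it removes the dependence on timing that the comment above suspects, at the cost of requiring the watcher to expose a readiness signal.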
Got another one: https://gradle-enterprise.elastic.co/s/ziecy75cdooqc |
And another today: https://gradle-enterprise.elastic.co/s/v6zn6stsyd3c2 |
This test no longer exists on main. @albertzaharovits Given that this test was always a little flaky, wdyt about muting on 8.14 (where it still exists) and closing this issue? |
@rjernst Yes, this issue should be closed, and ideally the test muted on 8.14. |
Done in 5c28758 |
The test seems very unlikely to fail, but this isn't the first time: it fails roughly once a month.
Build scan:
https://gradle-enterprise.elastic.co/s/endf3egql5efm/tests/:x-pack:plugin:security:internalClusterTest/org.elasticsearch.xpack.security.FileSettingsRoleMappingsStartupIT/testFailsOnStartMasterNodeWithError
Reproduction line:
Applicable branches:
main
Reproduces locally?:
No
Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.security.FileSettingsRoleMappingsStartupIT&tests.test=testFailsOnStartMasterNodeWithError
Failure excerpt: