Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] FileSettingsRoleMappingsStartupIT testFailsOnStartMasterNodeWithError failing #98391

Closed
luigidellaquila opened this issue Aug 11, 2023 · 16 comments
Labels
:Core/Infra/Core Core issues without another label medium-risk An open issue or test failure that is a medium risk to future releases Team:Core/Infra Meta label for core/infra team >test-failure Triaged test failures from CI

Comments

@luigidellaquila
Copy link
Contributor

The test seems very unlikely to fail, but it's not the first time: it fails every month more or less

Build scan:
https://gradle-enterprise.elastic.co/s/endf3egql5efm/tests/:x-pack:plugin:security:internalClusterTest/org.elasticsearch.xpack.security.FileSettingsRoleMappingsStartupIT/testFailsOnStartMasterNodeWithError

Reproduction line:

./gradlew ':x-pack:plugin:security:internalClusterTest' --tests "org.elasticsearch.xpack.security.FileSettingsRoleMappingsStartupIT.testFailsOnStartMasterNodeWithError" -Dtests.seed=93A257C80B269D49 -Dtests.locale=es-US -Dtests.timezone=America/Monterrey -Druntime.java=20

Applicable branches:
main

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.security.FileSettingsRoleMappingsStartupIT&tests.test=testFailsOnStartMasterNodeWithError

Failure excerpt:

java.lang.AssertionError: (No message provided)

  at __randomizedtesting.SeedInfo.seed([93A257C80B269D49:1F823D8B47ABA87D]:0)
  at org.junit.Assert.fail(Assert.java:86)
  at org.junit.Assert.assertTrue(Assert.java:41)
  at org.junit.Assert.assertTrue(Assert.java:52)
  at org.elasticsearch.xpack.security.FileSettingsRoleMappingsStartupIT.testFailsOnStartMasterNodeWithError(FileSettingsRoleMappingsStartupIT.java:122)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
  at java.lang.reflect.Method.invoke(Method.java:578)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1623)

@luigidellaquila luigidellaquila added :Core/Infra/Core Core issues without another label >test-failure Triaged test failures from CI labels Aug 11, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Core/Infra Meta label for core/infra team label Aug 11, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@thecoop
Copy link
Member

thecoop commented Aug 22, 2023

This doesn't fail when run over 1000 times locally. It may be that the 20s timeout is very occasionally not long enough when run on a cloud machine

@pgomulka pgomulka self-assigned this Sep 5, 2023
pgomulka added a commit to pgomulka/elasticsearch that referenced this issue Sep 6, 2023
pgomulka added a commit that referenced this issue Sep 8, 2023
@DaveCTurner
Copy link
Contributor

rjernst added a commit to rjernst/elasticsearch that referenced this issue Oct 5, 2023
In order to continue debugging elastic#98391, this commit adds more debug
logging to the test, to determine if the error metadata is not being
placed in the cluster state correctly.
rjernst added a commit that referenced this issue Oct 5, 2023
In order to continue debugging #98391, this commit adds more debug
logging to the test, to determine if the error metadata is not being
placed in the cluster state correctly.
@rjernst
Copy link
Member

rjernst commented Oct 5, 2023

I added some more debug logging in #100313. It continues to appear as if the cluster state update containing the error metadata disappears. We do see that the update made it into a cluster state update task:

[2023-09-08T21:56:04,197][INFO ][o.e.r.s.ReservedClusterStateService] [node_t0] Successfully applied new reserved error state for namespace [file_settings]

Hopefully the next time this occurs the new logging will help determine if the cluster state update appears in a different form than we expect, causing us to never countdown the latch.

@stu-elastic stu-elastic added the medium-risk An open issue or test failure that is a medium risk to future releases label Oct 17, 2023
@maxhniebergall
Copy link
Member

Another failure:

REPRODUCE WITH: ./gradlew ':x-pack:plugin:security:internalClusterTest' --tests "org.elasticsearch.xpack.security.FileSettingsRoleMappingsStartupIT.testFailsOnStartMasterNodeWithError" -Dtests.seed=3B8EE670640C4AEB -Dtests.locale=ko-KR -Dtests.timezone=Asia/Krasnoyarsk -Druntime.java=21

https://gradle-enterprise.elastic.co/s/4cu6xticbpvou/console-log/task/:x-pack:plugin:security:internalClusterTest?anchor=318&page=1

@mosche
Copy link
Contributor

mosche commented Jan 17, 2024

I've investigated a couple of more recent failures (1, 2, 3, 4, 5), the pattern I observed for all of these looks to be a bit different from what @rjernst observed above.

It looks like the file setting changes were never picked up at all. And with that we're never getting into the expected error state. It looks as if there's an issue with the filesystem watcher service.

Another thing that was kind of noticeable is that all failures almost exclusively (I've seen 1 exception) happened on GCP hosts, though that's also not a sufficient condition.

Wondering if we should be using watchService.poll(long timeout, TimeUnit unit) rather than the blocking watchService.take()?

@rjernst
Copy link
Member

rjernst commented Jan 18, 2024

Looking at the first example failure you linked @mosche, it looks like the write to the json file may be happening before the file watcher service is actually started.

[2024-01-02T02:15:55,952][INFO ][o.e.c.f.AbstractFileWatchingService] [node_t0] starting file watcher ...	
[2024-01-02T02:15:55,953][DEBUG][o.e.c.f.AbstractFileWatchingService] [node_t0] watched directory [/opt/buildkite-agent/builds/bk-agent-prod-gcp-1704172408764267528/elastic/elasticsearch-periodic/x-pack/plugin/security/build/testrun/internalClusterTest/temp/org.elasticsearch.xpack.security.FileSettingsRoleMappingsStartupIT_F610F38063F87943-001/tempDir-003/TEST-0/config/operator] not found, will watch for its creation...
[2024-01-02T02:15:55,953][INFO ][o.e.x.s.FileSettingsRoleMappingsStartupIT] [testFailsOnStartMasterNodeWithError] --> writing JSON config to node node_t0 with path /opt/buildkite-agent/builds/bk-agent-prod-gcp-1704172408764267528/elastic/elasticsearch-periodic/x-pack/plugin/security/build/testrun/internalClusterTest/temp/org.elasticsearch.xpack.security.FileSettingsRoleMappingsStartupIT_F610F38063F87943-001/tempFile-001.tmp
...
[2024-01-02T02:15:55,953][INFO ][o.e.c.f.AbstractFileWatchingService] [[elasticsearch[file-watcher[/opt/buildkite-agent/builds/bk-agent-prod-gcp-1704172408764267528/elastic/elasticsearch-periodic/x-pack/plugin/security/build/testrun/internalClusterTest/temp/org.elasticsearch.xpack.security.FileSettingsRoleMappingsStartupIT_F610F38063F87943-001/tempDir-003/TEST-0/config/operator/settings.json]]]] file settings service up and running [tid=423]

I suspect the problem is we start the master only node with:

internalCluster().startMasterOnlyNode();

But we don't actually wait to ensure the file watcher service is actually running before writing the json file. We do account for the file being created after the service is started, but perhaps we get lucky most of the time and see the file already when the service is started.

@mosche Could you try creating an explicit test for the file being created after the file watcher is setup?

Wondering if we should be using watchService.poll(long timeout, TimeUnit unit) rather than the blocking watchService.take()?

take() is blocking, while poll is effectively not with its timeout. If we timeout, what would we do other than poll again?

@iverase
Copy link
Contributor

iverase commented Mar 7, 2024

@alex-spies
Copy link
Contributor

@ldematte
Copy link
Contributor

And another today: https://gradle-enterprise.elastic.co/s/v6zn6stsyd3c2

@rjernst
Copy link
Member

rjernst commented May 21, 2024

This test no longer exists on main. @albertzaharovits Given that this test was always a little flaky, wdyt about muting on 8.14 (where it still exists) and closing this issue?

@albertzaharovits
Copy link
Contributor

@rjernst This issue should be closed, yes, and ideally muting it on 8.14.

@rjernst
Copy link
Member

rjernst commented May 21, 2024

Done in 5c28758

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Core Core issues without another label medium-risk An open issue or test failure that is a medium risk to future releases Team:Core/Infra Meta label for core/infra team >test-failure Triaged test failures from CI
Projects
None yet
Development

No branches or pull requests