Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] ClusterRerouteIT.testDelayWithALargeAmountOfShards timed out waiting for green state #52000

Closed
tlrx opened this issue Feb 6, 2020 · 9 comments
Assignees
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >test-failure Triaged test failures from CI

Comments

@tlrx
Copy link
Member

tlrx commented Feb 6, 2020

The test ClusterRerouteIT » testDelayWithALargeAmountOfShards failed today on CI for the 7.x branch with the error:

timed out waiting for green state
java.lang.AssertionError
: 
timed out waiting for green state
Close stacktrace
at __randomizedtesting.SeedInfo.seed([B25CE8847D0ACBC2:9E2DE2656C14E120]:0)
at org.junit.Assert.fail(Assert.java:88)
at org.elasticsearch.test.ESIntegTestCase.ensureColor(ESIntegTestCase.java:929)
at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:882)
at org.elasticsearch.cluster.allocation.ClusterRerouteIT.testDelayWithALargeAmountOfShards(ClusterRerouteIT.java:224)

Build scan: https://gradle-enterprise.elastic.co/s/3l2rtho4qznu6

Looking at the build stats it also fails on 7.6 (https://gradle-enterprise.elastic.co/s/bxlllkb4jubxs) and seems to be related to the new way of persisting metadata state (see #50907), in which case tests might need to be adapted?

@tlrx tlrx added >test-failure Triaged test failures from CI :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Feb 6, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

@DaveCTurner DaveCTurner self-assigned this Feb 12, 2020
@pugnascotia
Copy link
Contributor

Failed again on 7.6 - https://gradle-enterprise.elastic.co/s/wmy4bjcpxqnte

@henningandersen
Copy link
Contributor

henningandersen commented Feb 18, 2020

Failed again on master: https://gradle-enterprise.elastic.co/s/getynn6hnzfm2

Looked a bit into it, looks like retention lease sync causes the shards to become stale one by one, not sure if this is part of why it is timing out, but it is at least a significant difference since this test was written.

Did not reproduce locally.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 18, 2020
It looks like elastic#52000 is caused by a slowdown in cluster state application
(maybe due to elastic#50907) but I would like to understand the details to ensure that
there's nothing else going on here too before simply increasing the timeout.
This commit enables some relevant `DEBUG` loggers and also captures stack
traces from all threads rather than just the three hottest ones.
DaveCTurner added a commit that referenced this issue Feb 18, 2020
It looks like #52000 is caused by a slowdown in cluster state application
(maybe due to #50907) but I would like to understand the details to ensure that
there's nothing else going on here too before simply increasing the timeout.
This commit enables some relevant `DEBUG` loggers and also captures stack
traces from all threads rather than just the three hottest ones.
DaveCTurner added a commit that referenced this issue Feb 18, 2020
It looks like #52000 is caused by a slowdown in cluster state application
(maybe due to #50907) but I would like to understand the details to ensure that
there's nothing else going on here too before simply increasing the timeout.
This commit enables some relevant `DEBUG` loggers and also captures stack
traces from all threads rather than just the three hottest ones.
@DaveCTurner
Copy link
Contributor

Hmm. All 7 of the failures of this test over the last 30 days were using org.apache.lucene.mockfile.WindowsFS. It typically completes in 30-60 seconds but the failures are 6-8 minutes long. WindowsFS adds extra validation that we behave correctly on filesystems that don't let you delete open files; I wonder if this is introducing extra slowness.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 18, 2020
Issue elastic#52000 looks like a case of cluster state updates being slower than
expected, but it seems that these slowdowns are relatively rare: most
invocations of `testDelayWithALargeAmountOfShards` take well under a minute in
CI, but there are occasional failures that take 6+ minutes instead.  When it
fails like this, cluster state persistence seems generally slow: most are
slower than expected, with some small updates even taking over 2 seconds to
complete.

The failures all have in common that they use `WindowsFS` to emulate Windows'
behaviour of refusing to delete files that are still open, by tracking all
files (really, inodes) and validating that deleted files are really closed
first. There is a suggestion that this is a little slow in the Lucene test
framework [1]. To see if we can attribute the slowdown to that common factor,
this commit suppresses the use of `WindowsFS` for this test suite.

[1] https://github.com/apache/lucene-solr/blob/4a513fa99f638cb65e0cae59bfdf7af410c0327a/lucene/test-framework/src/java/org/apache/lucene/util/TestRuleTemporaryFilesCleanup.java#L166
@DaveCTurner
Copy link
Contributor

I opened #52488 to suppress WindowsFS on this test suite.

I am not marking #52488 as closes #52000 since I am not fully certain that this is the underlying problem. I would prefer to have a couple of weeks without failures before declaring this issue closed.

DaveCTurner added a commit that referenced this issue Feb 19, 2020
Issue #52000 looks like a case of cluster state updates being slower than
expected, but it seems that these slowdowns are relatively rare: most
invocations of `testDelayWithALargeAmountOfShards` take well under a minute in
CI, but there are occasional failures that take 6+ minutes instead.  When it
fails like this, cluster state persistence seems generally slow: most are
slower than expected, with some small updates even taking over 2 seconds to
complete.

The failures all have in common that they use `WindowsFS` to emulate Windows'
behaviour of refusing to delete files that are still open, by tracking all
files (really, inodes) and validating that deleted files are really closed
first. There is a suggestion that this is a little slow in the Lucene test
framework [1]. To see if we can attribute the slowdown to that common factor,
this commit suppresses the use of `WindowsFS` for this test suite.

[1] https://github.com/apache/lucene-solr/blob/4a513fa99f638cb65e0cae59bfdf7af410c0327a/lucene/test-framework/src/java/org/apache/lucene/util/TestRuleTemporaryFilesCleanup.java#L166
DaveCTurner added a commit that referenced this issue Feb 19, 2020
Issue #52000 looks like a case of cluster state updates being slower than
expected, but it seems that these slowdowns are relatively rare: most
invocations of `testDelayWithALargeAmountOfShards` take well under a minute in
CI, but there are occasional failures that take 6+ minutes instead.  When it
fails like this, cluster state persistence seems generally slow: most are
slower than expected, with some small updates even taking over 2 seconds to
complete.

The failures all have in common that they use `WindowsFS` to emulate Windows'
behaviour of refusing to delete files that are still open, by tracking all
files (really, inodes) and validating that deleted files are really closed
first. There is a suggestion that this is a little slow in the Lucene test
framework [1]. To see if we can attribute the slowdown to that common factor,
this commit suppresses the use of `WindowsFS` for this test suite.

[1] https://github.com/apache/lucene-solr/blob/4a513fa99f638cb65e0cae59bfdf7af410c0327a/lucene/test-framework/src/java/org/apache/lucene/util/TestRuleTemporaryFilesCleanup.java#L166
@DaveCTurner
Copy link
Contributor

Echoing Henning's comment here: if after a couple of weeks we decide that WindowsFS was the source of our slowdown we have a few options to discuss:

  • reduce the scope of the suppression by moving the one slow test into its own suite and suppress WindowsFS there
  • reinstate WindowsFS and either
    • extend the timeout anyway, possibly only if WindowsFS is in use, or
    • dig further into why WindowsFS is slow for us here and fix it upstream
  • take no further action

@DaveCTurner
Copy link
Contributor

We discussed this today and decided that we'll take no further action unless we see another failure, and therefore we can close this; please reopen it if you come across another failure.

sbourke pushed a commit to sbourke/elasticsearch that referenced this issue Feb 19, 2020
It looks like elastic#52000 is caused by a slowdown in cluster state application
(maybe due to elastic#50907) but I would like to understand the details to ensure that
there's nothing else going on here too before simply increasing the timeout.
This commit enables some relevant `DEBUG` loggers and also captures stack
traces from all threads rather than just the three hottest ones.
sbourke pushed a commit to sbourke/elasticsearch that referenced this issue Feb 19, 2020
Issue elastic#52000 looks like a case of cluster state updates being slower than
expected, but it seems that these slowdowns are relatively rare: most
invocations of `testDelayWithALargeAmountOfShards` take well under a minute in
CI, but there are occasional failures that take 6+ minutes instead.  When it
fails like this, cluster state persistence seems generally slow: most are
slower than expected, with some small updates even taking over 2 seconds to
complete.

The failures all have in common that they use `WindowsFS` to emulate Windows'
behaviour of refusing to delete files that are still open, by tracking all
files (really, inodes) and validating that deleted files are really closed
first. There is a suggestion that this is a little slow in the Lucene test
framework [1]. To see if we can attribute the slowdown to that common factor,
this commit suppresses the use of `WindowsFS` for this test suite.

[1] https://github.com/apache/lucene-solr/blob/4a513fa99f638cb65e0cae59bfdf7af410c0327a/lucene/test-framework/src/java/org/apache/lucene/util/TestRuleTemporaryFilesCleanup.java#L166
@probakowski
Copy link
Contributor

Another failure from today: https://gradle-enterprise.elastic.co/s/gws7m6tms6coy

@probakowski probakowski reopened this Mar 4, 2020
DaveCTurner added a commit that referenced this issue Mar 5, 2020
Issue #52000 looks like a case of cluster state updates being slower than
expected, but it seems that these slowdowns are relatively rare: most
invocations of `testDelayWithALargeAmountOfShards` take well under a minute in
CI, but there are occasional failures that take 6+ minutes instead.  When it
fails like this, cluster state persistence seems generally slow: most are
slower than expected, with some small updates even taking over 2 seconds to
complete.

The failures all have in common that they use `WindowsFS` to emulate Windows'
behaviour of refusing to delete files that are still open, by tracking all
files (really, inodes) and validating that deleted files are really closed
first. There is a suggestion that this is a little slow in the Lucene test
framework [1]. To see if we can attribute the slowdown to that common factor,
this commit suppresses the use of `WindowsFS` for this test suite.

[1] https://github.com/apache/lucene-solr/blob/4a513fa99f638cb65e0cae59bfdf7af410c0327a/lucene/test-framework/src/java/org/apache/lucene/util/TestRuleTemporaryFilesCleanup.java#L166
@DaveCTurner
Copy link
Contributor

WindowsFS strikes again. I backported #52488 to the 7.6 branch in 2878ec8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >test-failure Triaged test failures from CI
Projects
None yet
Development

No branches or pull requests

6 participants