
[CI] SLMSnapshotBlockingIntegTests.testRetentionWhileSnapshotInProgress failing #47834

Closed

mark-vieira opened this issue Oct 9, 2019 · 6 comments · Fixed by #47841, #48433 or #48944

Labels: :Data Management/ILM+SLM Index and Snapshot lifecycle management · >test-failure Triaged test failures from CI

Comments

@mark-vieira (Contributor)

org.elasticsearch.xpack.slm.SLMSnapshotBlockingIntegTests.testRetentionWhileSnapshotInProgress has started failing pretty often now with the following error:

java.lang.AssertionError: expected at least one master-eligible node left in {node_sc1=org.elasticsearch.test.InternalTestCluster$NodeAndClient@6e458321}

Here are some example build scans:
https://gradle-enterprise.elastic.co/s/hlldgfw3cx4bc/tests/rfsroxnx4sflo-uvkkyo6qd6mno
https://gradle-enterprise.elastic.co/s/i4ncjbmiflt6a/tests/rfsroxnx4sflo-uvkkyo6qd6mno
https://gradle-enterprise.elastic.co/s/33mp5rqowzsbi/tests/rfsroxnx4sflo-uvkkyo6qd6mno

Looking at build-stats, this appears to be happening on both master and 7.x.

@mark-vieira mark-vieira added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels Oct 9, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@mark-vieira (Contributor, Author)

This has been failing often enough that I've muted this in master and 7.x.

@original-brownbear original-brownbear self-assigned this Oct 10, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 10, 2019
One of the tests in this suite stops a master node,
plus we're doing other node starts in this suite.
=> the internal test cluster should be TEST and not `SUITE`
scoped to avoid random failures like the one in elastic#47834

Closes elastic#47834
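For context, the scope change amounts to something like the annotation below; this is a minimal sketch under assumed parameters, not the actual PR diff:

```java
// A SUITE-scoped InternalTestCluster is shared by all tests in the class, so a
// test that stops the elected master can leave the shared cluster without a
// master-eligible node for whatever test runs next. TEST scope rebuilds the
// cluster for every test method, so node stops/starts cannot leak across tests.
import org.elasticsearch.test.ESIntegTestCase;
import org.elasticsearch.test.ESIntegTestCase.ClusterScope;
import org.elasticsearch.test.ESIntegTestCase.Scope;

@ClusterScope(scope = Scope.TEST)
public class SLMSnapshotBlockingIntegTests extends ESIntegTestCase {
    // test bodies unchanged
}
```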
@original-brownbear original-brownbear added :Data Management/ILM+SLM Index and Snapshot lifecycle management and removed :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Oct 10, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

original-brownbear added a commit that referenced this issue Oct 10, 2019
One of the tests in this suite stops a master node,
plus we're doing other node starts in this suite.
=> the internal test cluster should be TEST and not `SUITE`
scoped to avoid random failures like the one in #47834

Closes #47834
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 10, 2019
One of the tests in this suite stops a master node,
plus we're doing other node starts in this suite.
=> the internal test cluster should be TEST and not `SUITE`
scoped to avoid random failures like the one in elastic#47834

Closes elastic#47834
original-brownbear added a commit that referenced this issue Oct 10, 2019
One of the tests in this suite stops a master node,
plus we're doing other node starts in this suite.
=> the internal test cluster should be TEST and not `SUITE`
scoped to avoid random failures like the one in #47834

Closes #47834
@mayya-sharipova (Contributor)

mayya-sharipova commented Oct 23, 2019

The test failed on the intake: https://gradle-enterprise.elastic.co/s/anqtao57gamow

org.elasticsearch.repositories.RepositoryException: [my-repo] Could not determine repository generation from root blobs
at __randomizedtesting.SeedInfo.seed([AF89D7C3B10D5638:D253C9D96B797893]:0)
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:906)
at org.elasticsearch.snapshots.SnapshotsService.getRepositoryData(SnapshotsService.java:163)
at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.buildResponse(TransportSnapshotsStatusAction.java:201)
at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.masterOperation(TransportSnapshotsStatusAction.java:105)
at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.masterOperation(TransportSnapshotsStatusAction.java:65)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.lambda$doStart$3(TransportMasterNodeAction.java:166)
at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:769)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:834)
Caused by: java.nio.file.NoSuchFileException: /dev/shm/elastic+elasticsearch+master+multijob+fast+part2/x-pack/plugin/ilm/build/testrun/test/temp/org.elasticsearch.xpack.slm.SLMSnapshotBlockingIntegTests_AF89D7C3B10D5638-001/tempDir-002/repos/UOoOkN/index-0
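For context, the failing step derives the repository generation from the root-level `index-N` blobs, roughly like this simplified sketch (an assumed helper, not the real `BlobStoreRepository` code); a concurrent repository modification can delete the chosen blob between the listing and the read, producing the `NoSuchFileException` above:

```java
import java.util.Set;

// List root blobs named "index-N" and take the highest N as the generation.
final class RepoGen {
    static long latestGeneration(Set<String> rootBlobNames) {
        return rootBlobNames.stream()
            .filter(name -> name.startsWith("index-"))
            .mapToLong(name -> Long.parseLong(name.substring("index-".length())))
            .max()
            .orElse(-1L); // no index-N blob: empty repository
    }
}
```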

@original-brownbear (Member)

This will be trivial to fix now thanks to #48329; I'm on it :)

original-brownbear added a commit that referenced this issue Oct 24, 2019
Just like in #48329 (and using the changes from that PR),
we can run into a concurrent repo modification that we
will throw on and must retry until consistent handling of
this situation is implemented.

Closes #47834
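The retry described in that commit looks roughly like the fragment below inside an `ESIntegTestCase` test method; this is a hedged sketch where the repository and snapshot names are placeholders, not the PR's actual diff:

```java
// assertBusy retries on AssertionError, so a RepositoryException caused by a
// concurrent repo modification is wrapped and the status check is repeated
// until the repository settles.
assertBusy(() -> {
    try {
        SnapshotsStatusResponse status = client().admin().cluster()
            .prepareSnapshotStatus("my-repo")
            .setSnapshots("my-snapshot")
            .get();
        assertThat(status.getSnapshots(), hasSize(1));
    } catch (RepositoryException e) {
        // concurrent index-N modification: treat as "not consistent yet" and retry
        throw new AssertionError(e);
    }
});
```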
@dnhatn (Member)

dnhatn commented Oct 30, 2019

org.elasticsearch.client.SnapshotIT > testCreateSnapshot FAILED
    org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [type=repository_exception, reason=[test_repository] Could not determine repository generation from root blobs]
        at __randomizedtesting.SeedInfo.seed([BCF37B148E73CF08:4295EEA7B8C03D11]:0)
        at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:177)
        at org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:1793)
        at org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:1770)
        at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1527)
        at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1484)
        at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1454)
        at org.elasticsearch.client.SnapshotClient.delete(SnapshotClient.java:344)
        at org.elasticsearch.client.ESRestHighLevelClientTestCase.execute(ESRestHighLevelClientTestCase.java:90)
        at org.elasticsearch.client.ESRestHighLevelClientTestCase.execute(ESRestHighLevelClientTestCase.java:81)
        at org.elasticsearch.client.SnapshotIT.testCreateSnapshot(SnapshotIT.java:167)

        Caused by:
        org.elasticsearch.ElasticsearchException: Elasticsearch exception [type=no_such_file_exception, reason=/dev/shm/elastic+elasticsearch+master+multijob+fast+part1/client/rest-high-level/build/testclusters/integTest-0/repo/index-18]
            at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:496)
            at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:407)
            at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:437)
            at org.elasticsearch.ElasticsearchException.failureFromXContent(ElasticsearchException.java:603)
            at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:169)
            ... 9 more
REPRODUCE WITH: ./gradlew ':client:rest-high-level:integTestRunner' --tests "org.elasticsearch.client.SnapshotIT.testCreateSnapshot" -Dtests.seed=BCF37B148E73CF08 -Dtests.security.manager=true -Dtests.locale=es-US -Dtests.timezone=America/Jujuy -Dcompiler.java=12 -Druntime.java=11

This failed again on master, but via the HLRC (high-level REST client): https://gradle-enterprise.elastic.co/s/5abbkuha6acvy/console-log

@dnhatn dnhatn reopened this Oct 30, 2019
original-brownbear added a commit that referenced this issue Nov 14, 2019
This is intended as a stop-gap solution/improvement to #38941 that
prevents repo modifications without an intermittent master failover
from causing inconsistent (outdated due to inconsistent listing of index-N blobs)
`RepositoryData` to be written.

Tracking the latest repository generation will move to the cluster state in a
separate pull request. This is intended as a low-risk change to be backported as
far as possible and motivated by the recently increased chance of #38941
causing trouble via SLM (see #47520).

Closes #47834
Closes #49048
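The stop-gap amounts to remembering the highest generation ever observed and rejecting stale listings, roughly like this sketch (class, field, and method names here are assumptions, not the actual `BlobStoreRepository` change):

```java
import java.util.concurrent.atomic.AtomicLong;

// Track the highest repository generation this node has ever observed; if a
// fresh listing of index-N blobs comes back lower, the listing is stale, so
// throw instead of writing outdated RepositoryData over a newer generation.
final class LatestRepoGenTracker {
    private final AtomicLong latestKnownRepoGen = new AtomicLong(-1L);

    long validate(long listedGen) {
        long known = latestKnownRepoGen.updateAndGet(prev -> Math.max(prev, listedGen));
        if (listedGen < known) {
            throw new IllegalStateException("stale listing: generation " + listedGen
                + " is older than known generation " + known);
        }
        return listedGen;
    }
}
```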
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 14, 2019
This is intended as a stop-gap solution/improvement to elastic#38941 that
prevents repo modifications without an intermittent master failover
from causing inconsistent (outdated due to inconsistent listing of index-N blobs)
`RepositoryData` to be written.

Tracking the latest repository generation will move to the cluster state in a
separate pull request. This is intended as a low-risk change to be backported as
far as possible and motivated by the recently increased chance of elastic#38941
causing trouble via SLM (see elastic#47520).

Closes elastic#47834
Closes elastic#49048
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 14, 2019
This is intended as a stop-gap solution/improvement to elastic#38941 that
prevents repo modifications without an intermittent master failover
from causing inconsistent (outdated due to inconsistent listing of index-N blobs)
`RepositoryData` to be written.

Tracking the latest repository generation will move to the cluster state in a
separate pull request. This is intended as a low-risk change to be backported as
far as possible and motivated by the recently increased chance of elastic#38941
causing trouble via SLM (see elastic#47520).

Closes elastic#47834
Closes elastic#49048
original-brownbear added a commit that referenced this issue Nov 15, 2019
This is intended as a stop-gap solution/improvement to #38941 that
prevents repo modifications without an intermittent master failover
from causing inconsistent (outdated due to inconsistent listing of index-N blobs)
`RepositoryData` to be written.

Tracking the latest repository generation will move to the cluster state in a
separate pull request. This is intended as a low-risk change to be backported as
far as possible and motivated by the recently increased chance of #38941
causing trouble via SLM (see #47520).

Closes #47834
Closes #49048
original-brownbear added a commit that referenced this issue Nov 15, 2019
This is intended as a stop-gap solution/improvement to #38941 that
prevents repo modifications without an intermittent master failover
from causing inconsistent (outdated due to inconsistent listing of index-N blobs)
`RepositoryData` to be written.

Tracking the latest repository generation will move to the cluster state in a
separate pull request. This is intended as a low-risk change to be backported as
far as possible and motivated by the recently increased chance of #38941
causing trouble via SLM (see #47520).

Closes #47834
Closes #49048