[CI] Failure in org.elasticsearch.snapshots.SnapshotResiliencyTests.testSnapshotWithNodeDisconnects #47550

Closed
original-brownbear opened this issue Oct 4, 2019 · 5 comments · Fixed by #51416
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI

Comments

@original-brownbear
Member

original-brownbear commented Oct 4, 2019

Failed here: https://gradle-enterprise.elastic.co/s/tr5z6fea45tsu/console-log#L2589

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.elasticsearch.snapshots.SnapshotResiliencyTests.testSnapshotWithNodeDisconnects" -Dtests.seed=C9164645C24DC7E4 -Dtests.security.manager=true -Dtests.locale=ti -Dtests.timezone=Asia/Novosibirsk -Dcompiler.java=12 -Druntime.java=11

fails with


org.elasticsearch.snapshots.SnapshotResiliencyTests > testSnapshotWithNodeDisconnects FAILED
java.lang.AssertionError:
Expected: map containing ["snap-atluIEEOR0-Jg-L2LxO3wg.dat"->ANYTHING]
but: map was []
at __randomizedtesting.SeedInfo.seed([C9164645C24DC7E4:C7D9E98DD88BFC24]:0)
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
at org.junit.Assert.assertThat(Assert.java:956)
at org.junit.Assert.assertThat(Assert.java:923)
at org.elasticsearch.repositories.blobstore.BlobStoreTestUtil.assertSnapshotUUIDs(BlobStoreTestUtil.java:175)
at org.elasticsearch.repositories.blobstore.BlobStoreTestUtil$1.doRun(BlobStoreTestUtil.java:110)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.repositories.blobstore.BlobStoreTestUtil.assertConsistency(BlobStoreTestUtil.java:92)
at org.elasticsearch.snapshots.SnapshotResiliencyTests.verifyReposThenStopServices(SnapshotResiliencyTests.java:238)
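For context, the check that trips here is the repository consistency assertion in BlobStoreTestUtil.assertSnapshotUUIDs: every snapshot tracked in the repository metadata must have a matching snap-<uuid>.dat blob at the repository root, and in this failure the listed blob map is empty. A simplified sketch of that kind of check (illustrative names, not the exact test-utility code):

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.hasKey;

import java.util.Map;

// For every snapshot the repository metadata knows about, a snap-<uuid>.dat
// blob must exist at the repository root; an empty listing fails immediately.
Map<String, BlobMetaData> rootBlobs =
    repository.blobStore().blobContainer(repository.basePath()).listBlobs();
for (SnapshotId snapshotId : snapshotIds) {
    assertThat(rootBlobs, hasKey("snap-" + snapshotId.getUUID() + ".dat"));
}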

I can reproduce this locally when running the test in a loop with the given seed.
This means there are two problems:

  1. The test fails.
  2. The test is supposed to be fully deterministic, but apparently it isn't, since it doesn't reliably fail with the given seed.
original-brownbear added the :Distributed Coordination/Snapshot/Restore and >test-failure labels on Oct 4, 2019
original-brownbear self-assigned this on Oct 4, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 4, 2019
This fixes a failure to mark shard snapshots as failed when
multiple data nodes are lost during the snapshot process or when
shard snapshot failures have occurred before a node left the cluster.

The problem was that we were simply not adding any shard entries for completed
shards on node-left events. This has no effect for a successful shard, but
for a failed shard it would lead to that shard not being marked as failed during
snapshot finalization. Fixed by correctly keeping track of all previously completed
shard states as well in this case.
Also, added an assertion that without this fix would trip on almost every run of the
resiliency tests, and adjusted the serialization of SnapshotsInProgress.Entry so
we have a proper assertion message.

Relates elastic#47550 (not closing, since the issue that the test isn't 100% deterministic remains)
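In other words, when the shard map of an in-progress snapshot is recomputed after nodes leave the cluster, completed entries, failed ones included, have to be carried over verbatim. A minimal sketch of that idea, using simplified types and hypothetical names rather than the actual SnapshotsService code:

import java.util.HashMap;
import java.util.Map;

// Rebuild the shard map after node-left events. Completed entries (SUCCESS
// and FAILED alike) are kept as-is; dropping them is what made failed shards
// go missing at snapshot finalization.
Map<ShardId, ShardSnapshotStatus> updated = new HashMap<>();
for (Map.Entry<ShardId, ShardSnapshotStatus> shard : entry.shards().entrySet()) {
    ShardSnapshotStatus status = shard.getValue();
    if (status.state().completed()) {
        updated.put(shard.getKey(), status); // the bug: these entries were not re-added
    } else if (removedNodeIds.contains(status.nodeId())) {
        updated.put(shard.getKey(), new ShardSnapshotStatus(
            status.nodeId(), ShardState.FAILED, "node left the cluster"));
    } else {
        updated.put(shard.getKey(), status);
    }
}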
@ywelsch
Contributor

ywelsch commented Oct 4, 2019

Try using -Dhppc.bitmixer=DETERMINISTIC to see if that makes the failures deterministic (it's the same trick we have to use for the CoordinatorTests).

@original-brownbear
Member Author

Thanks @ywelsch, that does indeed make things deterministic (albeit in this case deterministically passing :)).
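A note for anyone hitting this later: as I understand it, the seed alone isn't enough here because hppc randomizes its hash mixing per JVM run unless that property is set, so iteration order over hppc-backed containers (which the cluster state uses heavily) can differ between runs with the same test seed. A tiny standalone demo of the effect (assumes hppc on the classpath; names are just for illustration):

import com.carrotsearch.hppc.ObjectHashSet;
import com.carrotsearch.hppc.cursors.ObjectCursor;

public class HppcOrderDemo {
    public static void main(String[] args) {
        ObjectHashSet<String> keys = new ObjectHashSet<>();
        for (int i = 0; i < 8; i++) {
            keys.add("key-" + i);
        }
        // Without -Dhppc.bitmixer=DETERMINISTIC this iteration order can change
        // between JVM invocations, breaking seed-based reproducibility of any
        // test whose behavior depends on iterating hppc containers.
        for (ObjectCursor<String> cursor : keys) {
            System.out.println(cursor.value);
        }
    }
}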

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 5, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 5, 2019
original-brownbear added a commit that referenced this issue Oct 5, 2019
original-brownbear added a commit that referenced this issue Oct 5, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 7, 2019
original-brownbear added a commit that referenced this issue Oct 7, 2019
@dnhatn
Member

dnhatn commented Jan 23, 2020

@original-brownbear This test failed on my backport PR: https://gradle-enterprise.elastic.co/s/4zqn6knvgu64s.

@original-brownbear
Member Author

original-brownbear commented Jan 23, 2020

Ah, thanks for pinging @dnhatn ... I was wondering if this could happen but could never find a seed to reproduce :)

I'll fix the test shortly. This is fallout from handling the index.latest blob only on a best-effort basis starting in 7.6, while still asserting that index.latest is fully consistent even during master failover.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 24, 2020
This fix was necessary to allow for the below test enhancement:
We were not adding shard failure entries to a failed snapshot for those
shard snapshot entries that were never attempted because the snapshot failed during
the init stage and wasn't partial. This caused the never-attempted shards
to be counted towards the successful shard count, which seems wrong and
broke the repository consistency tests.

Also, this change adjusts the snapshot resiliency tests to run another snapshot
at the end of each test run to guarantee that a correct `index.latest` blob exists
after each run.

Closes elastic#47550
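Put differently, every shard of a failed non-partial snapshot that did not actually succeed needs an explicit failure entry at finalization, including shards that were never attempted at all. A rough sketch of the corrected accounting, with simplified types and hypothetical names rather than the actual finalization code:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Count successes and record an explicit failure for everything else;
// previously, never-attempted shards fell through without a failure entry
// and were effectively counted as successful.
int successfulShards = 0;
List<SnapshotShardFailure> shardFailures = new ArrayList<>();
for (Map.Entry<ShardId, ShardSnapshotStatus> shard : entry.shards().entrySet()) {
    ShardSnapshotStatus status = shard.getValue();
    if (status.state() == ShardState.SUCCESS) {
        successfulShards++;
    } else {
        shardFailures.add(new SnapshotShardFailure(
            status.nodeId(), shard.getKey(), status.reason()));
    }
}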
original-brownbear added a commit that referenced this issue Jan 24, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 24, 2020
original-brownbear added a commit that referenced this issue Jan 24, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Mar 31, 2020
original-brownbear added a commit that referenced this issue Mar 31, 2020