[CI] Failure in org.elasticsearch.snapshots.SnapshotResiliencyTests.testSnapshotWithNodeDisconnects #47550

Closed
original-brownbear opened this issue Oct 4, 2019 · 5 comments · Fixed by #51416
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI

Comments

@original-brownbear
Member

original-brownbear commented Oct 4, 2019

Failed here: https://gradle-enterprise.elastic.co/s/tr5z6fea45tsu/console-log#L2589

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.elasticsearch.snapshots.SnapshotResiliencyTests.testSnapshotWithNodeDisconnects" -Dtests.seed=C9164645C24DC7E4 -Dtests.security.manager=true -Dtests.locale=ti -Dtests.timezone=Asia/Novosibirsk -Dcompiler.java=12 -Druntime.java=11

fails with


org.elasticsearch.snapshots.SnapshotResiliencyTests > testSnapshotWithNodeDisconnects FAILED
java.lang.AssertionError:
Expected: map containing ["snap-atluIEEOR0-Jg-L2LxO3wg.dat"->ANYTHING]
but: map was []
at __randomizedtesting.SeedInfo.seed([C9164645C24DC7E4:C7D9E98DD88BFC24]:0)
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
at org.junit.Assert.assertThat(Assert.java:956)
at org.junit.Assert.assertThat(Assert.java:923)
at org.elasticsearch.repositories.blobstore.BlobStoreTestUtil.assertSnapshotUUIDs(BlobStoreTestUtil.java:175)
at org.elasticsearch.repositories.blobstore.BlobStoreTestUtil$1.doRun(BlobStoreTestUtil.java:110)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.repositories.blobstore.BlobStoreTestUtil.assertConsistency(BlobStoreTestUtil.java:92)
at org.elasticsearch.snapshots.SnapshotResiliencyTests.verifyReposThenStopServices(SnapshotResiliencyTests.java:238)
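For context, the check that trips here is the repository consistency assertion in BlobStoreTestUtil.assertSnapshotUUIDs: every snapshot tracked in the repository metadata must have a matching snap-<uuid>.dat blob at the repository root, and in this failure the listed blob map is empty. A simplified sketch of that kind of check (illustrative names, not the exact test-utility code):

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.hasKey;

import java.util.Map;

// For every snapshot the repository metadata knows about, a snap-<uuid>.dat
// blob must exist at the repository root; an empty listing fails immediately.
Map<String, BlobMetaData> rootBlobs =
    repository.blobStore().blobContainer(repository.basePath()).listBlobs();
for (SnapshotId snapshotId : snapshotIds) {
    assertThat(rootBlobs, hasKey("snap-" + snapshotId.getUUID() + ".dat"));
}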

I can reproduce this locally when running the test in a loop with the given seed.
This means there are two problems:

  1. The test fails.
  2. The test is supposed to be fully deterministic, but apparently it isn't, since it doesn't reliably fail with the given seed.
original-brownbear added the :Distributed Coordination/Snapshot/Restore and >test-failure labels on Oct 4, 2019
original-brownbear self-assigned this on Oct 4, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 4, 2019
This fixes a failure to mark shard snapshots as failed when
multiple data nodes are lost during the snapshot process or when
shard snapshot failures have occurred before a node left the cluster.

The problem was that we were simply not adding any shard entries for completed
shards on node-left events. This has no effect for a successful shard, but
for a failed shard it would lead to that shard not being marked as failed during
snapshot finalization. Fixed by correctly keeping track of all previously completed
shard states as well in this case.
Also, added an assertion that without this fix would trip on almost every run of the
resiliency tests, and adjusted the serialization of SnapshotsInProgress.Entry so
we have a proper assertion message.

Relates elastic#47550 (not closing, since the issue that the test isn't 100% deterministic remains)
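In other words, when the shard map of an in-progress snapshot is recomputed after nodes leave the cluster, completed entries, failed ones included, have to be carried over verbatim. A minimal sketch of that idea, using simplified types and hypothetical names rather than the actual SnapshotsService code:

import java.util.HashMap;
import java.util.Map;

// Rebuild the shard map after node-left events. Completed entries (SUCCESS
// and FAILED alike) are kept as-is; dropping them is what made failed shards
// go missing at snapshot finalization.
Map<ShardId, ShardSnapshotStatus> updated = new HashMap<>();
for (Map.Entry<ShardId, ShardSnapshotStatus> shard : entry.shards().entrySet()) {
    ShardSnapshotStatus status = shard.getValue();
    if (status.state().completed()) {
        updated.put(shard.getKey(), status); // the bug: these entries were not re-added
    } else if (removedNodeIds.contains(status.nodeId())) {
        updated.put(shard.getKey(), new ShardSnapshotStatus(
            status.nodeId(), ShardState.FAILED, "node left the cluster"));
    } else {
        updated.put(shard.getKey(), status);
    }
}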
@ywelsch
Contributor

ywelsch commented Oct 4, 2019

Try using -Dhppc.bitmixer=DETERMINISTIC to see if that makes the failures deterministic (it's the same trick we have to use for the CoordinatorTests).

@original-brownbear
Member Author

Thanks @ywelsch, that does indeed make things deterministic (albeit in this case deterministically passing :)).
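A note for anyone hitting this later: as I understand it, the seed alone isn't enough here because hppc randomizes its hash mixing per JVM run unless that property is set, so iteration order over hppc-backed containers (which the cluster state uses heavily) can differ between runs with the same test seed. A tiny standalone demo of the effect (assumes hppc on the classpath; names are just for illustration):

import com.carrotsearch.hppc.ObjectHashSet;
import com.carrotsearch.hppc.cursors.ObjectCursor;

public class HppcOrderDemo {
    public static void main(String[] args) {
        ObjectHashSet<String> keys = new ObjectHashSet<>();
        for (int i = 0; i < 8; i++) {
            keys.add("key-" + i);
        }
        // Without -Dhppc.bitmixer=DETERMINISTIC this iteration order can change
        // between JVM invocations, breaking seed-based reproducibility of any
        // test whose behavior depends on iterating hppc containers.
        for (ObjectCursor<String> cursor : keys) {
            System.out.println(cursor.value);
        }
    }
}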

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 5, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 5, 2019
original-brownbear added a commit that referenced this issue Oct 5, 2019
original-brownbear added a commit that referenced this issue Oct 5, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 7, 2019
original-brownbear added a commit that referenced this issue Oct 7, 2019
@dnhatn
Member

dnhatn commented Jan 23, 2020

@original-brownbear This test failed on my backport PR: https://gradle-enterprise.elastic.co/s/4zqn6knvgu64s.

@original-brownbear
Member Author

original-brownbear commented Jan 23, 2020

Ah, thanks for pinging @dnhatn ... I was wondering if this could happen but could never find a seed to reproduce :)

I'll fix the test shortly. This is fallout from handling the index.latest blob only on a best-effort basis starting in 7.6, while still asserting that index.latest is fully consistent even during master failover.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 24, 2020
This fix was necessary to allow for the below test enhancement:
We were not adding shard failure entries to a failed snapshot for those
shard snapshot entries that were never attempted because the snapshot failed during
the init stage and wasn't partial. This caused the never-attempted shards
to be counted towards the successful shard count, which seems wrong and
broke the repository consistency tests.

Also, this change adjusts the snapshot resiliency tests to run another snapshot
at the end of each test run to guarantee that a correct `index.latest` blob exists
after each run.

Closes elastic#47550
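Put differently, every shard of a failed non-partial snapshot that did not actually succeed needs an explicit failure entry at finalization, including shards that were never attempted at all. A rough sketch of the corrected accounting, with simplified types and hypothetical names rather than the actual finalization code:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Count successes and record an explicit failure for everything else;
// previously, never-attempted shards fell through without a failure entry
// and were effectively counted as successful.
int successfulShards = 0;
List<SnapshotShardFailure> shardFailures = new ArrayList<>();
for (Map.Entry<ShardId, ShardSnapshotStatus> shard : entry.shards().entrySet()) {
    ShardSnapshotStatus status = shard.getValue();
    if (status.state() == ShardState.SUCCESS) {
        successfulShards++;
    } else {
        shardFailures.add(new SnapshotShardFailure(
            status.nodeId(), shard.getKey(), status.reason()));
    }
}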
original-brownbear added a commit that referenced this issue Jan 24, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 24, 2020
original-brownbear added a commit that referenced this issue Jan 24, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Mar 31, 2020
original-brownbear added a commit that referenced this issue Mar 31, 2020