
[CI] SharedClusterSnapshotRestoreIT.testSnapshotSucceedsAfterSnapshotFailure fails #30507

Closed
talevy opened this issue May 10, 2018 · 5 comments
Labels
blocker :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI v6.3.0

Comments

@talevy
Contributor

talevy commented May 10, 2018

This test reminds me of older SharedClusterSnapshotRestoreIT failures on CI, but what is unique about this one is that it only recently popped up: although it has been encountered just twice in CI in the past 90 days, both of those failures happened this week.

link to CI:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=debian/1047/console

Command to reproduce:

./gradlew :server:integTest -Dtests.seed=A207A05308BD2693 -Dtests.class=org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT -Dtests.method="testSnapshotSucceedsAfterSnapshotFailure" -Dtests.security.manager=true -Dtests.locale=uk-UA -Dtests.timezone=Europe/Belfast

stacktrace:

FAILURE 0.22s J1 | SharedClusterSnapshotRestoreIT.testSnapshotSucceedsAfterSnapshotFailure <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: expected:<0> but was:<1>
   > 	at __randomizedtesting.SeedInfo.seed([A207A05308BD2693:DD692EFC33F4422E]:0)
   > 	at org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT.testSnapshotSucceedsAfterSnapshotFailure(SharedClusterSnapshotRestoreIT.java:3150)
   > 	at java.lang.Thread.run(Thread.java:748)
  1> [2018-05-10T01:13:14,639][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotStatusOnFailedIndex]: before test
  1> [2018-05-10T01:13:14,639][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [SharedClusterSnapshotRestoreIT#testSnapshotStatusOnFailedIndex]: setting up test
  1> [2018-05-10T01:13:14,646][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_s0] adding template [random_index_template] for index patterns [*]
@talevy talevy added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels May 10, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@bleskes
Contributor

bleskes commented May 10, 2018

@tlrx @ywelsch can one of you have a look?

@jtibshirani
Contributor

jtibshirani commented May 10, 2018

This test failed again in a recent build, so we've decided to mute it.

I've included the build information of the latest failure, in case it's helpful to have another example.

CI logs: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1874/console
Command to reproduce:

./gradlew :server:integTest \
  -Dtests.seed=7FCF996A1699F999 \
  -Dtests.class=org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT \
  -Dtests.method="testSnapshotSucceedsAfterSnapshotFailure" \
  -Dtests.security.manager=true \
  -Dtests.locale=lv \
  -Dtests.timezone=NZ-CHAT

Note that this command didn't reproduce the problem for me locally.

jtibshirani added a commit that referenced this issue May 10, 2018
…Failure with @AwaitsFix.

The issue is being tracked in #30507.
@ywelsch
Contributor

ywelsch commented May 11, 2018

@tlrx I've taken a quick look at these failures and I think they're caused by #30332. As we now write the new index shard snapshots file first, a leftover blob from a previous (older) failed attempt can block us from doing so:

Caused by: java.nio.file.FileAlreadyExistsException: blob [pending-index-0] already exists, cannot overwrite
05:26:19   1> 	at org.elasticsearch.common.blobstore.fs.FsBlobContainer.writeBlob(FsBlobContainer.java:127) ~[main/:?]
05:26:19   1> 	at org.elasticsearch.snapshots.mockstore.BlobContainerWrapper.writeBlob(BlobContainerWrapper.java:53) ~[test/:?]
05:26:19   1> 	at org.elasticsearch.snapshots.mockstore.MockRepository$MockBlobStore$MockBlobContainer.writeBlob(MockRepository.java:361) ~[test/:?]
05:26:19   1> 	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.writeBlob(ChecksumBlobStoreFormat.java:191) ~[main/:?]
05:26:19   1> 	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.writeAtomic(ChecksumBlobStoreFormat.java:136) ~[main/:?]
05:26:19   1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:955) ~[main/:?]
05:26:19   1> 	... 10 more

This looks serious to me, and I think we should fix it asap, before it gets released. I see two options:

  1. Similar to before, start deleting temp files (`indexShardSnapshotsFormat.isTempBlobName(blobName)`) before writing the new index file.
  2. Adapt `ChecksumBlobStoreFormat.writeAtomic` to use a temp blob name that includes a UUID.
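To illustrate option 2, here is a minimal sketch of an atomic blob write using a UUID-suffixed temp name, so a stale `pending-` blob left behind by a failed attempt can never collide with a retry. It uses plain `java.nio` instead of Elasticsearch's `BlobContainer` API, and the class and method names are hypothetical, not the actual `ChecksumBlobStoreFormat` code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.UUID;

// Hypothetical sketch of option 2: write to a uniquely named temp blob,
// then atomically rename it to the final blob name.
public class AtomicBlobWrite {

    static void writeAtomic(Path dir, String blobName, byte[] data) throws IOException {
        // The UUID makes each attempt's temp name unique, so a leftover
        // "pending-" blob from an earlier failed snapshot cannot trigger
        // FileAlreadyExistsException via CREATE_NEW.
        Path tmp = dir.resolve("pending-" + blobName + "-" + UUID.randomUUID());
        Files.write(tmp, data, StandardOpenOption.CREATE_NEW);
        try {
            // On POSIX filesystems, rename atomically replaces an existing target.
            Files.move(tmp, dir.resolve(blobName), StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            // Best-effort cleanup so we don't accumulate temp blobs.
            Files.deleteIfExists(tmp);
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("repo");
        writeAtomic(dir, "index-0", "first".getBytes());
        // A second attempt no longer collides with any leftover temp blob.
        writeAtomic(dir, "index-0", "second".getBytes());
        System.out.println(new String(Files.readAllBytes(dir.resolve("index-0"))));
    }
}
```

With a fixed temp name like `pending-index-0`, the second `writeAtomic` call above would fail exactly as in the stack trace; the UUID suffix makes retries independent of earlier crashed attempts.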

@ywelsch
Contributor

ywelsch commented May 11, 2018

I've opened #30528 as a fix.

ywelsch added a commit that referenced this issue May 11, 2018
Fixes an (un-released) bug introduced in #30332.

Closes #30507
@bleskes bleskes added v6.3.0 and removed v6.3.1 labels May 16, 2018