
[CI] SharedClusterSnapshotRestoreIT.testSnapshotSucceedsAfterSnapshotFailure fails #30507

Closed
talevy opened this issue May 10, 2018 · 5 comments
Labels
blocker :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI v6.3.0

Comments

@talevy
Contributor

talevy commented May 10, 2018

This test reminds me of older SharedClusterSnapshotRestoreIT failures on CI, but what is unique about this one is that it only recently popped up: although it has been encountered just twice in CI in the past 90 days, both of those failures happened this week.

link to CI:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=debian/1047/console

Command to reproduce:

./gradlew :server:integTest -Dtests.seed=A207A05308BD2693 -Dtests.class=org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT -Dtests.method="testSnapshotSucceedsAfterSnapshotFailure" -Dtests.security.manager=true -Dtests.locale=uk-UA -Dtests.timezone=Europe/Belfast

stacktrace:

FAILURE 0.22s J1 | SharedClusterSnapshotRestoreIT.testSnapshotSucceedsAfterSnapshotFailure <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: expected:<0> but was:<1>
   > 	at __randomizedtesting.SeedInfo.seed([A207A05308BD2693:DD692EFC33F4422E]:0)
   > 	at org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT.testSnapshotSucceedsAfterSnapshotFailure(SharedClusterSnapshotRestoreIT.java:3150)
   > 	at java.lang.Thread.run(Thread.java:748)
  1> [2018-05-10T01:13:14,639][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotStatusOnFailedIndex]: before test
  1> [2018-05-10T01:13:14,639][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [SharedClusterSnapshotRestoreIT#testSnapshotStatusOnFailedIndex]: setting up test
  1> [2018-05-10T01:13:14,646][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_s0] adding template [random_index_template] for index patterns [*]
@talevy talevy added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels May 10, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@bleskes
Contributor

bleskes commented May 10, 2018

@tlrx @ywelsch can one of you have a look?

@jtibshirani
Contributor

jtibshirani commented May 10, 2018

This test failed again in a recent build, so we've decided to mute it.

I've included the build information of the latest failure, in case it's helpful to have another example.

CI logs: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1874/console
Command to reproduce:

./gradlew :server:integTest \
  -Dtests.seed=7FCF996A1699F999 \
  -Dtests.class=org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT \
  -Dtests.method="testSnapshotSucceedsAfterSnapshotFailure" \
  -Dtests.security.manager=true \
  -Dtests.locale=lv \
  -Dtests.timezone=NZ-CHAT

Note that this command didn't reproduce the problem for me locally.

jtibshirani added a commit that referenced this issue May 10, 2018
…Failure with @AwaitsFix.

The issue is being tracked in #30507.
@ywelsch
Contributor

ywelsch commented May 11, 2018

@tlrx I've taken a quick look at these failures and I think they're caused by #30332. As we now write the new index shard snapshots file first, a leftover blob from a previous (older) failed attempt can block us from doing so:

Caused by: java.nio.file.FileAlreadyExistsException: blob [pending-index-0] already exists, cannot overwrite
05:26:19   1> 	at org.elasticsearch.common.blobstore.fs.FsBlobContainer.writeBlob(FsBlobContainer.java:127) ~[main/:?]
05:26:19   1> 	at org.elasticsearch.snapshots.mockstore.BlobContainerWrapper.writeBlob(BlobContainerWrapper.java:53) ~[test/:?]
05:26:19   1> 	at org.elasticsearch.snapshots.mockstore.MockRepository$MockBlobStore$MockBlobContainer.writeBlob(MockRepository.java:361) ~[test/:?]
05:26:19   1> 	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.writeBlob(ChecksumBlobStoreFormat.java:191) ~[main/:?]
05:26:19   1> 	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.writeAtomic(ChecksumBlobStoreFormat.java:136) ~[main/:?]
05:26:19   1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:955) ~[main/:?]
05:26:19   1> 	... 10 more

This looks serious to me, and I think we should fix it asap, before it gets released. I see two options:

  1. Similar to before, start deleting temp files (`indexShardSnapshotsFormat.isTempBlobName(blobName)`) before writing the new index file.
  2. Adapt `ChecksumBlobStoreFormat.writeAtomic` to use a temp blob name that includes a UUID.
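To illustrate option 2, here is a minimal sketch of an atomic blob write using a UUID-suffixed temp name, so a stale `pending-` blob left behind by a failed attempt can never collide with a retry. It uses plain `java.nio` instead of Elasticsearch's `BlobContainer` API, and the class and method names are hypothetical, not the actual `ChecksumBlobStoreFormat` code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.UUID;

// Hypothetical sketch of option 2: write to a uniquely named temp blob,
// then atomically rename it to the final blob name.
public class AtomicBlobWrite {

    static void writeAtomic(Path dir, String blobName, byte[] data) throws IOException {
        // The UUID makes each attempt's temp name unique, so a leftover
        // "pending-" blob from an earlier failed snapshot cannot trigger
        // FileAlreadyExistsException via CREATE_NEW.
        Path tmp = dir.resolve("pending-" + blobName + "-" + UUID.randomUUID());
        Files.write(tmp, data, StandardOpenOption.CREATE_NEW);
        try {
            // On POSIX filesystems, rename atomically replaces an existing target.
            Files.move(tmp, dir.resolve(blobName), StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            // Best-effort cleanup so we don't accumulate temp blobs.
            Files.deleteIfExists(tmp);
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("repo");
        writeAtomic(dir, "index-0", "first".getBytes());
        // A second attempt no longer collides with any leftover temp blob.
        writeAtomic(dir, "index-0", "second".getBytes());
        System.out.println(new String(Files.readAllBytes(dir.resolve("index-0"))));
    }
}
```

With a fixed temp name like `pending-index-0`, the second `writeAtomic` call above would fail exactly as in the stack trace; the UUID suffix makes retries independent of earlier crashed attempts.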

@ywelsch
Contributor

ywelsch commented May 11, 2018

I've opened #30528 as a fix.

ywelsch added a commit that referenced this issue May 11, 2018
Fixes an (un-released) bug introduced in #30332.

Closes #30507
@bleskes bleskes added v6.3.0 and removed v6.3.1 labels May 16, 2018