[CI] SnapshotStressTestsIT failure #101029

DaveCTurner · 2023-10-18T07:15:39Z

CI Link

https://gradle-enterprise.elastic.co/s/q7czw7twlsxik

Repro line

':server:internalClusterTest' --tests "org.elasticsearch.snapshots.SnapshotStressTestsIT.testRandomActivities" -Dtests.seed=F1980E360247D5B1 -Dtests.locale=ar-LY -Dtests.timezone=Chile/Continental -Druntime.java=21

Does it reproduce?

No

Applicable branches

main

Failure history

No response

Failure excerpt

    java.lang.AssertionError: java.lang.AssertionError: java.nio.file.NoSuchFileException: /home/davidturner/src/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.snapshots.SnapshotStressTestsIT_F1980E360247D5B1-001/tempDir-002/repos/PylHhYAhiy/indices/OANV69t0QOCAiQkr30IfHw/2/index--feK6pXDR9e7yalgyGFAEg
        at org.elasticsearch.repositories.blobstore.BlobStoreTestUtil.assertConsistency(BlobStoreTestUtil.java:94)
        at org.elasticsearch.snapshots.AbstractSnapshotIntegTestCase.lambda$assertRepoConsistency$1(AbstractSnapshotIntegTestCase.java:153)
        at java.base/java.lang.Iterable.forEach(Iterable.java:75)
        at org.elasticsearch.snapshots.AbstractSnapshotIntegTestCase.assertRepoConsistency(AbstractSnapshotIntegTestCase.java:147)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
        at java.base/java.lang.reflect.Method.invoke(Method.java:580)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:1004)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
        at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
        at java.base/java.lang.Thread.run(Thread.java:1583)

        Caused by:
        java.lang.AssertionError: java.nio.file.NoSuchFileException: /home/davidturner/src/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.snapshots.SnapshotStressTestsIT_F1980E360247D5B1-001/tempDir-002/repos/PylHhYAhiy/indices/OANV69t0QOCAiQkr30IfHw/2/index--feK6pXDR9e7yalgyGFAEg
            at org.elasticsearch.repositories.blobstore.BlobStoreTestUtil$2.onResponse(BlobStoreTestUtil.java:268)
            at org.elasticsearch.repositories.blobstore.BlobStoreTestUtil$2.onResponse(BlobStoreTestUtil.java:262)
            at org.elasticsearch.repositories.GetSnapshotInfoContext.onResponse(GetSnapshotInfoContext.java:117)
            at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$getOneSnapshotInfo$24(BlobStoreRepository.java:1849)
            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916)
            at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
            at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
            ... 1 more

            Caused by:
            java.nio.file.NoSuchFileException: /home/davidturner/src/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.snapshots.SnapshotStressTestsIT_F1980E360247D5B1-001/tempDir-002/repos/PylHhYAhiy/indices/OANV69t0QOCAiQkr30IfHw/2/index--feK6pXDR9e7yalgyGFAEg
                at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
                at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
                at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
                at java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:261)
                at java.base/java.nio.file.Files.newByteChannel(Files.java:379)
                at java.base/java.nio.file.Files.newByteChannel(Files.java:431)
                at java.base/java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:420)
                at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newInputStream(FilterFileSystemProvider.java:193)
                at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newInputStream(FilterFileSystemProvider.java:193)
                at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newInputStream(FilterFileSystemProvider.java:193)
                at org.apache.lucene.tests.mockfile.HandleTrackingFS.newInputStream(HandleTrackingFS.java:94)
                at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newInputStream(FilterFileSystemProvider.java:193)
                at java.base/java.nio.file.Files.newInputStream(Files.java:159)
                at org.elasticsearch.common.blobstore.fs.FsBlobContainer.readBlob(FsBlobContainer.java:188)
                at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.read(ChecksumBlobStoreFormat.java:121)
                at org.elasticsearch.repositories.blobstore.BlobStoreRepository.buildBlobStoreIndexShardSnapshots(BlobStoreRepository.java:3656)
                at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getBlobStoreIndexShardSnapshots(BlobStoreRepository.java:3633)
                at org.elasticsearch.repositories.blobstore.BlobStoreTestUtil.assertSnapshotInfosConsistency(BlobStoreTestUtil.java:351)
                at org.elasticsearch.repositories.blobstore.BlobStoreTestUtil$2.onResponse(BlobStoreTestUtil.java:266)
                ... 7 more

This failure is super rare: I've seen it maybe twice after running this test in a loop for several days. It indicates we're still doing something wrong with shard generations.
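For reference, the missing blob in the trace above appears to follow the layout `<repo root>/indices/<index id>/<shard number>/index-<shard generation>`. The sketch below assumes that reading of the path is right and shows the kind of existence check that fails here. `ShardIndexBlobLocator` and its methods are hypothetical names made up for illustration; they are not Elasticsearch APIs, and the real check lives in `BlobStoreTestUtil`.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper, NOT part of Elasticsearch: locates a shard-level index blob in an FS repository. */
final class ShardIndexBlobLocator {

    private ShardIndexBlobLocator() {}

    /** Builds repoRoot/indices/indexId/shardId/index-generation (layout assumed from the path in the trace). */
    static Path shardIndexBlob(Path repoRoot, String indexId, int shardId, String generation) {
        return repoRoot.resolve("indices")
            .resolve(indexId)
            .resolve(Integer.toString(shardId))
            .resolve("index-" + generation);
    }

    /** Returns true if the blob for the given shard generation is present on disk. */
    static boolean generationBlobExists(Path repoRoot, String indexId, int shardId, String generation) {
        return Files.isRegularFile(shardIndexBlob(repoRoot, indexId, shardId, generation));
    }

    public static void main(String[] args) {
        // Values copied from the failure excerpt; the repository root is shortened to a relative path.
        Path repoRoot = Paths.get("repos", "PylHhYAhiy");
        Path blob = shardIndexBlob(repoRoot, "OANV69t0QOCAiQkr30IfHw", 2, "-feK6pXDR9e7yalgyGFAEg");
        System.out.println(blob + " exists: " + Files.isRegularFile(blob));
    }
}
```

Under this reading, the repository metadata still refers to shard generation `-feK6pXDR9e7yalgyGFAEg` for shard 2 of that index, but the corresponding blob is no longer on disk.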

elasticsearchmachine · 2023-10-18T07:16:03Z

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner · 2023-12-27T07:23:20Z

I've found a way to reproduce this - it requires deleting an index during out-of-order snapshot finalization, so it's pretty delicate.

Today if an index is deleted during a very specific order of snapshot finalizations then it's possible we'll miscalculate the latest shard generations for the shards in that index, causing the deletion of a shard-level `index-UUID` blob which prevents further snapshots of that shard.

Closes elastic#101029
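To illustrate the consequence described above (not its cause), here is a toy model. It is a deliberately simplified sketch and not the Elasticsearch implementation: `ToyShardGenerations`, its methods, and the `gen-A`/`gen-B`/`gen-C` names are all hypothetical. It assumes finalization does two things: records a shard generation in the repository metadata, and deletes the other shard-level `index-*` blobs as obsolete. It does not model why the two can disagree; it only shows what happens when they do.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model only, NOT Elasticsearch code: one shard's index-* blobs plus the generation recorded for it. */
final class ToyShardGenerations {
    /** Shard-level blobs currently "on disk", e.g. "index-gen-A". */
    private final Set<String> blobs = new HashSet<>();
    /** Generation that the repository metadata records as the latest for this shard. */
    private String recordedGeneration;

    /** A shard snapshot writes a new index-<generation> blob. */
    void writeBlob(String generation) {
        blobs.add("index-" + generation);
    }

    /**
     * Finalization records one generation and deletes every blob except the one it
     * believes is the latest. In a correct run both arguments are the same generation.
     */
    void finalizeSnapshot(String generationToRecord, String generationToKeep) {
        recordedGeneration = generationToRecord;
        blobs.removeIf(name -> !name.equals("index-" + generationToKeep));
    }

    /** Consistency check: the recorded generation must still have its blob. */
    boolean recordedBlobExists() {
        return recordedGeneration != null && blobs.contains("index-" + recordedGeneration);
    }

    public static void main(String[] args) {
        ToyShardGenerations shard = new ToyShardGenerations();
        shard.writeBlob("gen-A");
        shard.writeBlob("gen-B");
        shard.finalizeSnapshot("gen-B", "gen-B");
        System.out.println("consistent run: " + shard.recordedBlobExists());    // prints true

        shard.writeBlob("gen-C");
        // Miscalculated latest generation: gen-C is recorded, but only gen-B is kept.
        shard.finalizeSnapshot("gen-C", "gen-B");
        System.out.println("miscalculated run: " + shard.recordedBlobExists()); // prints false
    }
}
```

In the second run the recorded generation points at a blob that has been deleted, which is the state the `NoSuchFileException` in the failure excerpt reflects.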

kingherc · 2024-01-05T10:58:07Z

One more at elasticsearch / periodic / platform-support / main / debian-11 / platform-support-unix

    java.lang.AssertionError: java.lang.AssertionError: java.nio.file.NoSuchFileException: /opt/local-ssd/buildkite/builds/bk-agent-prod-gcp-1704430908311388851/elastic/elasticsearch-periodic-platform-support/server/build/testrun/internalClusterTest/temp/org.elasticsearch.snapshots.SnapshotStressTestsIT_D5C3A255A514ABB2-001/tempDir-002/repos/wkujWTVxAr/indices/uiEFev9pSX2Ii-6iR6aMBg/0/index-7dx3dt7bQiG13lDoLKWIag	
        at org.elasticsearch.repositories.blobstore.BlobStoreTestUtil.assertConsistency(BlobStoreTestUtil.java:96)
...

Today if an index is deleted during a very specific order of snapshot finalizations then it's possible we'll miscalculate the latest shard generations for the shards in that index, causing the deletion of a shard-level `index-UUID` blob which prevents further snapshots of that shard.

Backports elastic#103817 to 8.12

Closes elastic#101029

Today if an index is deleted during a very specific order of snapshot finalizations then it's possible we'll miscalculate the latest shard generations for the shards in that index, causing the deletion of a shard-level `index-UUID` blob which prevents further snapshots of that shard.

Backports elastic#103817 to 7.17

Closes elastic#101029

* Fix deleting index during snapshot finalization

  Today if an index is deleted during a very specific order of snapshot finalizations then it's possible we'll miscalculate the latest shard generations for the shards in that index, causing the deletion of a shard-level `index-UUID` blob which prevents further snapshots of that shard.

  Backports #103817 to 7.17

  Closes #101029

* Test fixup

DaveCTurner added the `:Distributed Coordination/Snapshot/Restore` and `>test-failure` labels on Oct 18, 2023

elasticsearchmachine added the `blocker` and `Team:Distributed (Obsolete)` labels on Oct 18, 2023

volodk85 added the `low-risk` label and removed the `blocker` label on Oct 20, 2023

DaveCTurner added the `medium-risk` label and removed the `low-risk` label on Oct 24, 2023

This was referenced Oct 25, 2023

[CI] SnapshotStressTestsIT testRandomActivities failing #101352

Closed

[CI] SnapshotStressTestsIT testRandomActivities failing #101410

Closed

DaveCTurner mentioned this issue Jan 2, 2024

Fix deleting index during snapshot finalization #103817

Merged

elasticsearchmachine closed this as completed in #103817 Jan 15, 2024

DaveCTurner mentioned this issue Jan 15, 2024

Fix deleting index during snapshot finalization #104378

Merged

DaveCTurner mentioned this issue Jan 15, 2024

Fix deleting index during snapshot finalization #104380

Merged
