
SystemIndicesSnapshotIT#testParallelIndexDeleteRemovesFeatureState failures #69014

Closed
DaveCTurner opened this issue Feb 16, 2021 · 1 comment · Fixed by #69150
Assignees: original-brownbear
Labels: :Core/Infra/Core, Team:Core/Infra, >test-failure

Comments

@DaveCTurner
Contributor

Build scan: e.g. https://gradle-enterprise.elastic.co/s/q5qondriqjpbm/console-log?task=:server:internalClusterTest, https://gradle-enterprise.elastic.co/s/vntohmkcphpm6/console-log?task=:server:internalClusterTest

Repro line: e.g. ./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.snapshots.SystemIndicesSnapshotIT.testParallelIndexDeleteRemovesFeatureState" -Dtests.seed=486F4AB75B1B3573 -Dtests.security.manager=true -Dtests.locale=de-LU -Dtests.timezone=Asia/Saigon -Druntime.java=11

Reproduces locally?: Not for me

Applicable branches: master, 7.x

Failure history: A few times a day since introduction in #63513.

Failure excerpt:

org.elasticsearch.snapshots.SystemIndicesSnapshotIT > testParallelIndexDeleteRemovesFeatureState FAILED
    java.lang.AssertionError:
    Expected: <OK>
         but: was <INTERNAL_SERVER_ERROR>
        at __randomizedtesting.SeedInfo.seed([3D4C5F9F619C6859:BD5380622B536EEB]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.snapshots.SystemIndicesSnapshotIT.testParallelIndexDeleteRemovesFeatureState(SystemIndicesSnapshotIT.java:870)

    Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
    REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.snapshots.SystemIndicesSnapshotIT.testParallelIndexDeleteRemovesFeatureState" -Dtests.seed=3D4C5F9F619C6859 -Dtests.security.manager=true -Dtests.locale=en -Dtests.timezone=Asia/Gaza -Druntime.java=8 -Dtests.fips.enabled=true
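For context, here is a hypothetical reconstruction of roughly what the failing check amounts to; it is a sketch only, not the actual test source at SystemIndicesSnapshotIT.java:870, and the class/method names are invented:

```java
// Hypothetical reconstruction of the failing check, not the actual test
// source. The surrounding setup (repository creation, indexing, blocking
// the repository, deleting the index mid-snapshot) is elided.
import static org.hamcrest.Matchers.equalTo;

import org.elasticsearch.action.admin.cluster.snapshots.create.CreateSnapshotResponse;
import org.elasticsearch.rest.RestStatus;
import org.elasticsearch.test.ESIntegTestCase;

public class ParallelDeleteSnapshotSketchIT extends ESIntegTestCase {

    public void testPartialSnapshotSurvivesConcurrentIndexDelete() throws Exception {
        // ... create "test-repo", index into the system index, block the repo,
        // delete the index while the shard uploads are blocked ...
        CreateSnapshotResponse response = client().admin().cluster()
            .prepareCreateSnapshot("test-repo", "test-partial-snap")
            .setPartial(true)
            .setWaitForCompletion(true)
            .get();
        // The assertion that fails in the excerpt above: OK is expected, but the
        // concurrent delete leaves shard failures behind, which maps the snapshot
        // status to INTERNAL_SERVER_ERROR.
        assertThat(response.status(), equalTo(RestStatus.OK));
    }
}
```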

Possibly related: the log excerpt below shows the repository hitting its block, the concurrent delete of `.test-system-idx`, and the remaining shard snapshots then failing with `IndexNotFoundException`:

  1> [2021-02-15T16:56:26,144][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot [test-repo:test-partial-snap/gc60KLpMT2WxruB1UnyOLw] started
  1> [2021-02-15T16:56:26,166][INFO ][o.e.s.m.MockRepository   ] [node_t4] [test-repo] blocking I/O operation for file [__k2PP4I0dTTSh4X8eUqH_Dg] at path [[indices][rSosZS4TQKKX_1zXyu5uHA][1]]
  1> [2021-02-15T16:56:26,171][INFO ][o.e.s.m.MockRepository   ] [node_t4] [test-repo] blocking I/O operation for file [__IZ_atGHURkCKt1I6JXt1Zw] at path [[indices][rSosZS4TQKKX_1zXyu5uHA][1]]
  1> [2564-02-15T16:56:26,173][INFO ][o.e.s.SystemIndicesSnapshotIT] [testParallelIndexDeleteRemovesFeatureState] --> Repo hit block, deleting the index...
  1> [2021-02-15T16:56:26,185][INFO ][o.e.c.m.MetadataDeleteIndexService] [node_t0] [.test-system-idx/kjxPo3W1QjG4E2OJi0MiZw] deleting index
  1> [2021-02-15T16:56:26,216][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][0] start check index
  1> [2021-02-15T16:56:26,217][WARN ][o.e.s.SnapshotShardsService] [node_t2] [[.test-system-idx][1]][test-repo:test-partial-snap/gc60KLpMT2WxruB1UnyOLw] failed to snapshot shard
  1> org.elasticsearch.index.IndexNotFoundException: no such index [.test-system-idx]
  1> 	at org.elasticsearch.indices.IndicesService.indexServiceSafe(IndicesService.java:521) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:313) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:250) ~[main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673) [main/:7.12.0-SNAPSHOT]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
  1> [2021-02-15T16:56:26,218][WARN ][o.e.s.SnapshotShardsService] [node_t2] [[.test-system-idx][2]][test-repo:test-partial-snap/gc60KLpMT2WxruB1UnyOLw] failed to snapshot shard
  1> org.elasticsearch.index.IndexNotFoundException: no such index [.test-system-idx]
  1> 	at org.elasticsearch.indices.IndicesService.indexServiceSafe(IndicesService.java:521) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:313) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:250) ~[main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673) [main/:7.12.0-SNAPSHOT]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
  1> [2021-02-15T16:56:26,230][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][0] end check index
  1> [2021-02-15T16:56:26,236][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][1] start check index
  1> [2021-02-15T16:56:26,239][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][1] end check index
  1> [2021-02-15T16:56:26,247][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][2] start check index
  1> [2021-02-15T16:56:26,250][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][2] end check index
  1> [2021-02-15T16:56:26,256][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][3] start check index
  1> [2021-02-15T16:56:26,259][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][3] end check index
  1> [2021-02-15T16:56:26,278][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][4] start check index
  1> [2021-02-15T16:56:26,309][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][4] end check index
  1> [2021-02-15T16:56:26,316][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][5] start check index
  1> [2021-02-15T16:56:26,318][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][5] end check index
  1> [2564-02-15T16:56:26,328][INFO ][o.e.s.SystemIndicesSnapshotIT] [testParallelIndexDeleteRemovesFeatureState] --> Index deleted, unblocking repo...
@DaveCTurner added the :Core/Infra/Core and >test-failure labels on Feb 16, 2021
@elasticmachine added the Team:Core/Infra label on Feb 16, 2021
DaveCTurner added a commit that referenced this issue Feb 16, 2021
DaveCTurner added a commit that referenced this issue Feb 16, 2021
@DaveCTurner
Contributor Author

I muted this in 5bd935e (master) and f03e70d (7.x)

@original-brownbear original-brownbear self-assigned this Feb 16, 2021
TommyWind added a commit to TommyWind/elasticsearch that referenced this issue Feb 17, 2021
If an index is deleted while a partial snapshot is running, the behavior
was not deterministic.
If an index was deleted just as one of its shard snapshots was about to
start, the delete would be recorded as a shard snapshot failure in the
snapshot result and the snapshot would show up as `PARTIAL`.
If the index delete happened after the shard had already been
snapshotted, the snapshot would show `SUCCESS`.
In both cases, however, the snapshot would contain exactly the same data,
because the deleted index would not be part of the final snapshot either
way.
It was also confusing that in the `PARTIAL` case errors would be recorded
for shards whose indices were not part of the snapshot.

This commit makes it such that deleted indices are filtered not only from
the list of indices in a snapshot but also from the shard snapshot errors
in a snapshot entry, so that the snapshot always shows up as `SUCCESS`:
concurrent index deletes are not a failure but are allowed during partial
snapshots.

Closes elastic#69014
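A minimal, self-contained sketch of the filtering behaviour described in the commit message above (all names invented; the real change lives in the snapshot finalization code, see #69150):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical, simplified illustration of the fix described above;
// not the actual Elasticsearch snapshot code.
final class PartialSnapshotFiltering {

    /** Indices deleted during a partial snapshot are dropped from the snapshot's index list. */
    static List<String> filterIndices(List<String> snapshotIndices, Set<String> deletedIndices) {
        return snapshotIndices.stream()
            .filter(index -> deletedIndices.contains(index) == false)
            .collect(Collectors.toList());
    }

    /**
     * With the fix, shard snapshot failures that only refer to deleted indices are
     * dropped as well, so a concurrent delete no longer turns the snapshot PARTIAL.
     * Failures are keyed by index name; values are the per-shard failure reasons.
     */
    static Map<String, List<String>> filterShardFailures(
        Map<String, List<String>> failuresByIndex,
        Set<String> deletedIndices
    ) {
        return failuresByIndex.entrySet().stream()
            .filter(entry -> deletedIndices.contains(entry.getKey()) == false)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```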
original-brownbear pushed a commit that referenced this issue Feb 17, 2021
original-brownbear pushed a commit to original-brownbear/elasticsearch that referenced this issue Feb 22, 2021
original-brownbear pushed a commit to original-brownbear/elasticsearch that referenced this issue Feb 22, 2021
original-brownbear added a commit that referenced this issue Feb 22, 2021
original-brownbear added a commit that referenced this issue Feb 23, 2021