
SystemIndicesSnapshotIT#testParallelIndexDeleteRemovesFeatureState failures #69014

Closed
DaveCTurner opened this issue Feb 16, 2021 · 1 comment · Fixed by #69150
Assignees: original-brownbear
Labels: :Core/Infra/Core, Team:Core/Infra, >test-failure

Comments

@DaveCTurner
Contributor

Build scan: e.g. https://gradle-enterprise.elastic.co/s/q5qondriqjpbm/console-log?task=:server:internalClusterTest, https://gradle-enterprise.elastic.co/s/vntohmkcphpm6/console-log?task=:server:internalClusterTest

Repro line: e.g. ./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.snapshots.SystemIndicesSnapshotIT.testParallelIndexDeleteRemovesFeatureState" -Dtests.seed=486F4AB75B1B3573 -Dtests.security.manager=true -Dtests.locale=de-LU -Dtests.timezone=Asia/Saigon -Druntime.java=11

Reproduces locally?: Not for me

Applicable branches: master, 7.x

Failure history: A few times a day since introduction in #63513.

Failure excerpt:

org.elasticsearch.snapshots.SystemIndicesSnapshotIT > testParallelIndexDeleteRemovesFeatureState FAILED
    java.lang.AssertionError:
    Expected: <OK>
         but: was <INTERNAL_SERVER_ERROR>
        at __randomizedtesting.SeedInfo.seed([3D4C5F9F619C6859:BD5380622B536EEB]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.snapshots.SystemIndicesSnapshotIT.testParallelIndexDeleteRemovesFeatureState(SystemIndicesSnapshotIT.java:870)

    Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
    REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.snapshots.SystemIndicesSnapshotIT.testParallelIndexDeleteRemovesFeatureState" -Dtests.seed=3D4C5F9F619C6859 -Dtests.security.manager=true -Dtests.locale=en -Dtests.timezone=Asia/Gaza -Druntime.java=8 -Dtests.fips.enabled=true
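For context, here is a hypothetical reconstruction of roughly what the failing check amounts to; it is a sketch only, not the actual test source at SystemIndicesSnapshotIT.java:870, and the class/method names are invented:

```java
// Hypothetical reconstruction of the failing check, not the actual test
// source. The surrounding setup (repository creation, indexing, blocking
// the repository, deleting the index mid-snapshot) is elided.
import static org.hamcrest.Matchers.equalTo;

import org.elasticsearch.action.admin.cluster.snapshots.create.CreateSnapshotResponse;
import org.elasticsearch.rest.RestStatus;
import org.elasticsearch.test.ESIntegTestCase;

public class ParallelDeleteSnapshotSketchIT extends ESIntegTestCase {

    public void testPartialSnapshotSurvivesConcurrentIndexDelete() throws Exception {
        // ... create "test-repo", index into the system index, block the repo,
        // delete the index while the shard uploads are blocked ...
        CreateSnapshotResponse response = client().admin().cluster()
            .prepareCreateSnapshot("test-repo", "test-partial-snap")
            .setPartial(true)
            .setWaitForCompletion(true)
            .get();
        // The assertion that fails in the excerpt above: OK is expected, but the
        // concurrent delete leaves shard failures behind, which maps the snapshot
        // status to INTERNAL_SERVER_ERROR.
        assertThat(response.status(), equalTo(RestStatus.OK));
    }
}
```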

Possibly related: the log excerpt below shows the repository hitting its block, the concurrent delete of `.test-system-idx`, and the remaining shard snapshots then failing with `IndexNotFoundException`:

  1> [2021-02-15T16:56:26,144][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot [test-repo:test-partial-snap/gc60KLpMT2WxruB1UnyOLw] started
  1> [2021-02-15T16:56:26,166][INFO ][o.e.s.m.MockRepository   ] [node_t4] [test-repo] blocking I/O operation for file [__k2PP4I0dTTSh4X8eUqH_Dg] at path [[indices][rSosZS4TQKKX_1zXyu5uHA][1]]
  1> [2021-02-15T16:56:26,171][INFO ][o.e.s.m.MockRepository   ] [node_t4] [test-repo] blocking I/O operation for file [__IZ_atGHURkCKt1I6JXt1Zw] at path [[indices][rSosZS4TQKKX_1zXyu5uHA][1]]
  1> [2564-02-15T16:56:26,173][INFO ][o.e.s.SystemIndicesSnapshotIT] [testParallelIndexDeleteRemovesFeatureState] --> Repo hit block, deleting the index...
  1> [2021-02-15T16:56:26,185][INFO ][o.e.c.m.MetadataDeleteIndexService] [node_t0] [.test-system-idx/kjxPo3W1QjG4E2OJi0MiZw] deleting index
  1> [2021-02-15T16:56:26,216][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][0] start check index
  1> [2021-02-15T16:56:26,217][WARN ][o.e.s.SnapshotShardsService] [node_t2] [[.test-system-idx][1]][test-repo:test-partial-snap/gc60KLpMT2WxruB1UnyOLw] failed to snapshot shard
  1> org.elasticsearch.index.IndexNotFoundException: no such index [.test-system-idx]
  1> 	at org.elasticsearch.indices.IndicesService.indexServiceSafe(IndicesService.java:521) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:313) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:250) ~[main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673) [main/:7.12.0-SNAPSHOT]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
  1> [2021-02-15T16:56:26,218][WARN ][o.e.s.SnapshotShardsService] [node_t2] [[.test-system-idx][2]][test-repo:test-partial-snap/gc60KLpMT2WxruB1UnyOLw] failed to snapshot shard
  1> org.elasticsearch.index.IndexNotFoundException: no such index [.test-system-idx]
  1> 	at org.elasticsearch.indices.IndicesService.indexServiceSafe(IndicesService.java:521) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:313) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:250) ~[main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673) [main/:7.12.0-SNAPSHOT]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
  1> [2021-02-15T16:56:26,230][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][0] end check index
  1> [2021-02-15T16:56:26,236][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][1] start check index
  1> [2021-02-15T16:56:26,239][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][1] end check index
  1> [2021-02-15T16:56:26,247][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][2] start check index
  1> [2021-02-15T16:56:26,250][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][2] end check index
  1> [2021-02-15T16:56:26,256][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][3] start check index
  1> [2021-02-15T16:56:26,259][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][3] end check index
  1> [2021-02-15T16:56:26,278][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][4] start check index
  1> [2021-02-15T16:56:26,309][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][4] end check index
  1> [2021-02-15T16:56:26,316][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][5] start check index
  1> [2021-02-15T16:56:26,318][INFO ][o.e.t.s.M.Listener       ] [node_t2] [.test-system-idx][5] end check index
  1> [2564-02-15T16:56:26,328][INFO ][o.e.s.SystemIndicesSnapshotIT] [testParallelIndexDeleteRemovesFeatureState] --> Index deleted, unblocking repo...
@DaveCTurner added the :Core/Infra/Core and >test-failure labels on Feb 16, 2021
@elasticmachine added the Team:Core/Infra label on Feb 16, 2021
DaveCTurner added a commit that referenced this issue Feb 16, 2021
DaveCTurner added a commit that referenced this issue Feb 16, 2021
@DaveCTurner
Contributor Author

I muted this in 5bd935e (master) and f03e70d (7.x)

@original-brownbear original-brownbear self-assigned this Feb 16, 2021
TommyWind added a commit to TommyWind/elasticsearch that referenced this issue Feb 17, 2021
If an index is deleted while a partial snapshot is running, the behavior
was not deterministic.
If an index was deleted just as one of its shard snapshots was about to
start, the delete would be recorded as a shard snapshot failure in the
snapshot result and the snapshot would show up as `PARTIAL`.
If the index delete happened after the shard had already been
snapshotted, the snapshot would show `SUCCESS`.
In both cases, however, the snapshot would contain exactly the same data,
because the deleted index would not be part of the final snapshot either
way.
It was also confusing that in the `PARTIAL` case errors would be recorded
for shards whose indices were not part of the snapshot.

This commit makes it such that deleted indices are filtered not only from
the list of indices in a snapshot but also from the shard snapshot errors
in a snapshot entry, so that the snapshot always shows up as `SUCCESS`:
concurrent index deletes are not a failure but are allowed during partial
snapshots.

Closes elastic#69014
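A minimal, self-contained sketch of the filtering behaviour described in the commit message above (all names invented; the real change lives in the snapshot finalization code, see #69150):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical, simplified illustration of the fix described above;
// not the actual Elasticsearch snapshot code.
final class PartialSnapshotFiltering {

    /** Indices deleted during a partial snapshot are dropped from the snapshot's index list. */
    static List<String> filterIndices(List<String> snapshotIndices, Set<String> deletedIndices) {
        return snapshotIndices.stream()
            .filter(index -> deletedIndices.contains(index) == false)
            .collect(Collectors.toList());
    }

    /**
     * With the fix, shard snapshot failures that only refer to deleted indices are
     * dropped as well, so a concurrent delete no longer turns the snapshot PARTIAL.
     * Failures are keyed by index name; values are the per-shard failure reasons.
     */
    static Map<String, List<String>> filterShardFailures(
        Map<String, List<String>> failuresByIndex,
        Set<String> deletedIndices
    ) {
        return failuresByIndex.entrySet().stream()
            .filter(entry -> deletedIndices.contains(entry.getKey()) == false)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```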
original-brownbear pushed a commit that referenced this issue Feb 17, 2021
original-brownbear pushed a commit to original-brownbear/elasticsearch that referenced this issue Feb 22, 2021
original-brownbear pushed a commit to original-brownbear/elasticsearch that referenced this issue Feb 22, 2021
original-brownbear added a commit that referenced this issue Feb 22, 2021
original-brownbear added a commit that referenced this issue Feb 23, 2021