AssertionError: Missing assignment for [[index-0][4]] during snapshot #75423

Closed
DaveCTurner opened this issue Jul 16, 2021 · 1 comment · Fixed by #75501
Labels: >bug, :Distributed Coordination/Snapshot/Restore, Team:Distributed (Obsolete), v8.0.0-alpha1

Comments

@DaveCTurner
Contributor

Elasticsearch version (bin/elasticsearch --version): https://github.com/DaveCTurner/elasticsearch/tree/2021-07-16-snapshot-finalization-order-testing-WIP which includes #75362 plus some extra stress tests.

After many iterations of SnapshotStressTestsIT on the linked branch we tripped an assertion in SnapshotsService$RemoveSnapshotDeletionAndContinueTask, see below.

Provide logs (if relevant):

        java.lang.AssertionError: Missing assignment for [[index-0][4]]
            at __randomizedtesting.SeedInfo.seed([A599955E699077CA]:0)
            at org.elasticsearch.snapshots.SnapshotsService$RemoveSnapshotDeletionAndContinueTask.updatedSnapshotsInProgress(SnapshotsService.java:2662)
            at org.elasticsearch.snapshots.SnapshotsService$RemoveSnapshotDeletionAndContinueTask.execute(SnapshotsService.java:2502)
            at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:48)
            at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:701)
            at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:323)
            at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:218)
            at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:155)
            at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:139)
            at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:177)
            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:678)
            at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:259)
            at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:222)
            at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
            at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
            at java.base/java.lang.Thread.run(Thread.java:831)

testoutput-1626444817.tar.gz

@DaveCTurner added the >bug, :Distributed Coordination/Snapshot/Restore, and v8.0.0 labels Jul 16, 2021
@elasticmachine added the Team:Distributed (Obsolete) label Jul 16, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@original-brownbear self-assigned this Jul 16, 2021
original-brownbear added a commit that referenced this issue Jul 30, 2021
…#75501)

This refactors the snapshots-in-progress logic to work from `RepositoryShardId` when working out which parts of the repository are in use by writes, for snapshot concurrency safety. This change does not go all the way on this topic yet; there are a number of possible follow-up improvements to simplify the logic that I'd work through over time.
But for now, this allows fixing the remaining known issues that snapshot stress testing surfaced, when combined with the fix in #75530.

These issues all come from the fact that `ShardId` is not a stable key across multiple snapshots if snapshots are partial. The broken scenarios all look roughly like this:
* snapshot-1 for index-A with uuid-A runs and is partial
* index-A is deleted and re-created and now has uuid-B
* snapshot-2 for index-A is started and is queued up behind snapshot-1 for the index
* snapshot-1 finishes and the logic tries to start the next snapshot for the same shard-id
  * this fails because the shard-id is not the same: we can't compare index uuids, just index name + shard id
  * this change fixes all these spots by always taking the round trip via `RepositoryShardId`
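
The scenario above can be illustrated with a minimal Java sketch. The `ShardKey` and `RepoShardKey` records are hypothetical, simplified stand-ins invented for this example; they are not the real `ShardId` and `RepositoryShardId` classes. The sketch assumes that the shard-level key includes the index uuid and that the repository-level key identifies a shard by index name and shard number only.

```java
import java.util.HashMap;
import java.util.Map;

public class ShardKeyDemo {
    // Hypothetical stand-in for a ShardId-style key: includes the index uuid.
    record ShardKey(String indexName, String indexUuid, int shard) {}

    // Hypothetical stand-in for a RepositoryShardId-style key: no index uuid.
    record RepoShardKey(String indexName, int shard) {}

    // The "round trip": drop the uuid so both snapshots map to the same key.
    static RepoShardKey toRepoKey(ShardKey key) {
        return new RepoShardKey(key.indexName(), key.shard());
    }

    // snapshot-2 is queued under uuid-B; snapshot-1 looks it up under uuid-A.
    static boolean queuedFoundByShardKey() {
        Map<ShardKey, String> queued = new HashMap<>();
        queued.put(new ShardKey("index-A", "uuid-B", 4), "snapshot-2");
        return queued.containsKey(new ShardKey("index-A", "uuid-A", 4));
    }

    // Same lookup, but both sides are converted to the uuid-free key first.
    static boolean queuedFoundByRepoKey() {
        Map<RepoShardKey, String> queued = new HashMap<>();
        queued.put(toRepoKey(new ShardKey("index-A", "uuid-B", 4)), "snapshot-2");
        return queued.containsKey(toRepoKey(new ShardKey("index-A", "uuid-A", 4)));
    }

    public static void main(String[] args) {
        System.out.println("found via shard key: " + queuedFoundByShardKey()); // false
        System.out.println("found via repo key:  " + queuedFoundByRepoKey());  // true
    }
}
```

In this toy model the queued operation is missed when keyed by the uuid-bearing key and found when keyed by the uuid-free one, which is the kind of missing assignment the assertion guards against.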
 
Planned follow-ups here are:
* dry up the logic across cloning and snapshotting more, as both now essentially run the same code in many state-machine steps
* serialize snapshots-in-progress efficiently instead of re-computing the index and by-repository-shard-id lookups in the constructor every time
    * to this end, refactor the logic in snapshots-in-progress away from maps keyed by shard-id in almost all spots, and just keep an index name to `Index` map to work out what exactly is being snapshotted
* refactor snapshots-in-progress to be a map of lists of operations keyed by repository shard id, instead of a list of maps as it currently is, to make the concurrency simpler and more obviously correct
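
As a rough illustration of the last follow-up, the sketch below regroups a "list of maps" layout into a "map of lists" keyed by repository shard. All type names (`RepoShardKey`, `SnapshotEntry`, `Operation`) and the state labels `RUNNING` and `QUEUED` are hypothetical and chosen only for this example; the actual snapshots-in-progress data structures are more involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OperationsByShard {
    // Hypothetical types for illustration only.
    record RepoShardKey(String indexName, int shard) {}
    record SnapshotEntry(String snapshot, Map<RepoShardKey, String> shardStates) {}
    record Operation(String snapshot, String state) {}

    // Turn a list of per-snapshot maps into one map of per-shard operation queues.
    // Each queue keeps the order in which the snapshot entries were listed.
    static Map<RepoShardKey, List<Operation>> regroup(List<SnapshotEntry> entries) {
        Map<RepoShardKey, List<Operation>> byShard = new LinkedHashMap<>();
        for (SnapshotEntry entry : entries) {
            for (Map.Entry<RepoShardKey, String> e : entry.shardStates().entrySet()) {
                byShard.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                       .add(new Operation(entry.snapshot(), e.getValue()));
            }
        }
        return byShard;
    }

    static Map<RepoShardKey, List<Operation>> example() {
        RepoShardKey shard0 = new RepoShardKey("index-0", 0);
        RepoShardKey shard4 = new RepoShardKey("index-0", 4);
        return regroup(List.of(
            new SnapshotEntry("snapshot-1", Map.of(shard0, "RUNNING", shard4, "RUNNING")),
            new SnapshotEntry("snapshot-2", Map.of(shard4, "QUEUED"))));
    }

    public static void main(String[] args) {
        example().forEach((shard, ops) -> System.out.println(shard + " -> " + ops));
    }
}
```

With this layout, finding the next operation for a shard means reading the next element of that shard's list, rather than scanning every snapshot entry for a matching key.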

closes #75423 

relates (#75339 ... should also fix this, but I have to verify by testing with a backport to 7.x)
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Aug 15, 2021
…elastic#75501)

original-brownbear added a commit that referenced this issue Aug 16, 2021
…#75501) (#76539)

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Aug 16, 2021
…elastic#75501) (elastic#76539)

original-brownbear added a commit that referenced this issue Aug 16, 2021
…#75501) (#76539) (#76547)
