More Snapshot Resiliency Testing #39504

@original-brownbear original-brownbear commented Feb 28, 2019

This PR contains all the steps needed to simulate issues with eventually consistent blob stores such as AWS S3 in `org.elasticsearch.snapshots.SnapshotResiliencyTests`, so that various failure scenarios can be reproduced deterministically.
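
To make the idea of a simulated eventually consistent blob store concrete, here is a minimal sketch. It is not code from this PR: `EventuallyConsistentBlobStore` and its methods are hypothetical names, and the model (writes stay invisible until the test explicitly lets the store converge) is an assumption chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory blob store, not an Elasticsearch class: writes are
// buffered and only become visible to readers once the test calls sync().
class EventuallyConsistentBlobStore {
    private final Map<String, byte[]> visible = new HashMap<>();
    private final Map<String, byte[]> pending = new HashMap<>();

    // Accepts the write but does not expose it to read() or list() yet.
    void write(String name, byte[] data) {
        pending.put(name, data.clone());
    }

    // Returns null while the blob is not yet visible, mimicking a stale read.
    byte[] read(String name) {
        byte[] data = visible.get(name);
        return data == null ? null : data.clone();
    }

    // Lists only the blobs that are currently visible, in sorted order.
    List<String> list() {
        List<String> names = new ArrayList<>(visible.keySet());
        Collections.sort(names);
        return names;
    }

    // The test decides when the store converges, which keeps runs repeatable.
    void sync() {
        visible.putAll(pending);
        pending.clear();
    }
}

class BlobStoreDemo {
    public static void main(String[] args) {
        EventuallyConsistentBlobStore store = new EventuallyConsistentBlobStore();
        store.write("snap-1.dat", new byte[] {1, 2, 3});
        System.out.println(store.list().isEmpty()); // true: write not visible yet
        store.sync();
        System.out.println(store.list()); // [snap-1.dat]
    }
}
```

Because the test controls the point of convergence, a stale listing or a missing blob can be provoked at a chosen step rather than by chance.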

As a prerequisite, index and search operations had to be executable via the deterministic task queue. That in turn required removing all blocking logic from bulk request execution, which was done in this PR.
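
The following is a simplified sketch of what a deterministic task queue does, assuming a seeded random scheduler. `SeededTaskQueue` is a hypothetical stand-in; the real Elasticsearch test utility has a richer API that is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical, simplified stand-in for a deterministic task queue.
class SeededTaskQueue {
    private final List<Runnable> runnable = new ArrayList<>();
    private final Random random;

    SeededTaskQueue(long seed) {
        this.random = new Random(seed);
    }

    // Tasks are queued instead of being handed to a real thread.
    void schedule(Runnable task) {
        runnable.add(task);
    }

    // Runs queued tasks one at a time on the calling thread, in an order that
    // depends only on the seed. Tasks may schedule further tasks.
    void runAllTasks() {
        while (!runnable.isEmpty()) {
            runnable.remove(random.nextInt(runnable.size())).run();
        }
    }
}

class SeededTaskQueueDemo {
    // Schedules three tasks (one of which schedules a follow-up) and returns
    // the order in which they executed for the given seed.
    static List<String> runOnce(long seed) {
        SeededTaskQueue queue = new SeededTaskQueue(seed);
        List<String> order = new ArrayList<>();
        queue.schedule(() -> order.add("index"));
        queue.schedule(() -> order.add("snapshot"));
        queue.schedule(() -> queue.schedule(() -> order.add("restore")));
        queue.runAllTasks();
        return order;
    }

    public static void main(String[] args) {
        // Same seed, same interleaving: a failing run can be replayed exactly.
        System.out.println(runOnce(42L).equals(runOnce(42L))); // true
    }
}
```

A task that blocks while waiting for another task would never return in such a single-threaded loop, which is why blocking logic has to go before operations can run on a queue like this.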

--- WIP more description incoming ---

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Mar 5, 2019
* Soften redundant cast to allow use of `DeterministicTaskQueue` in this class for elastic#39504
* Remove two redundant variables and lower visibility in two possible spots
* Make field `final`
original-brownbear added a commit that referenced this pull request Mar 5, 2019
* Use threadpool's time in `ClusterApplierService` to allow for deterministic tests
* This is a part of/requirement for #39504
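
As an illustration of why reading time from an injected source helps deterministic tests, here is a minimal sketch. `TimedApplier` is a hypothetical class, not the real `ClusterApplierService`, and using a `LongSupplier` as the clock is an assumption made for this example.

```java
import java.util.function.LongSupplier;

// Hypothetical component: it reads time from an injected source instead of
// calling the system clock directly.
class TimedApplier {
    private final LongSupplier relativeTimeMillis;

    TimedApplier(LongSupplier relativeTimeMillis) {
        this.relativeTimeMillis = relativeTimeMillis;
    }

    // Returns how long the task took according to the injected clock.
    long applyAndMeasure(Runnable task) {
        long start = relativeTimeMillis.getAsLong();
        task.run();
        return relativeTimeMillis.getAsLong() - start;
    }
}

class TimedApplierDemo {
    public static void main(String[] args) {
        long[] fakeNow = {0L};
        TimedApplier applier = new TimedApplier(() -> fakeNow[0]);
        // The "task" advances the fake clock by exactly 250 ms.
        long elapsed = applier.applyAndMeasure(() -> fakeNow[0] += 250L);
        System.out.println(elapsed); // 250
    }
}
```

With a fake clock the measured duration is the same on every run, so timing-dependent behavior no longer varies with machine load.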
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Mar 5, 2019
* Use threadpool's time in `ClusterApplierService` to allow for deterministic tests
* This is a part of/requirement for elastic#39504
original-brownbear added a commit that referenced this pull request Mar 5, 2019
* Soften redundant cast to allow use of `DeterministicTaskQueue` in this class for #39504
* Remove two redundant variables and lower visibility in two possible spots
* Make field `final`
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Mar 7, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Mar 29, 2019
* Expand the successful snapshot test case to also include restoring the snapshot
  * Also index documents so that the restore can be meaningfully verified
* This is part of the larger effort to test eventually consistent blob stores in elastic#39504
original-brownbear added a commit that referenced this pull request Apr 3, 2019
* Add Restore Operation to SnapshotResiliencyTests

* Expand the successful snapshot test case to also include restoring the snapshot
  * Also index documents so that the restore can be meaningfully verified
* This is part of the larger effort to test eventually consistent blob stores in #39504
original-brownbear added a commit that referenced this pull request Apr 6, 2019
This is a dependency of #39504 

Motivation: 
By refactoring `TransportShardBulkAction#shardOperationOnPrimary` to be async, we enable `DeterministicTaskQueue`-based tests to run indexing operations. This was previously impossible because we blocked on the `write` thread until the `update` thread finished the mapping update.
With this change, the mapping update triggers a new task in the `write` queue instead.
This change significantly increases the coverage we get from `SnapshotResiliencyTests` (and potential future tests) when it comes to tracking down concurrency issues with distributed state machines.

The logical change is effectively all in `TransportShardBulkAction`; the rest of the changes mechanically move the caller code and tests to async and pass the `ActionListener` down.

Since the move to async would have added more parameters to the `private static` steps in this logic, I decided to inline and DRY up the logic (shared between delete and update) as much as I could instead of passing the listener and wait-consumer down through all of them.
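
The shape of that change can be sketched as follows. This is not the real `TransportShardBulkAction`: `AsyncBulkSketch`, its two queues, and the `Consumer<String>` used in place of an `ActionListener` are all hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical sketch of a non-blocking primary operation.
class AsyncBulkSketch {
    final Queue<Runnable> writeQueue = new ArrayDeque<>();
    final Queue<Runnable> updateQueue = new ArrayDeque<>();
    final List<String> log = new ArrayList<>();

    // Instead of blocking the write task until the mapping update finishes,
    // the update's completion enqueues a follow-up task on the write queue.
    void indexOnPrimary(String doc, boolean needsMappingUpdate, Consumer<String> listener) {
        writeQueue.add(() -> {
            if (needsMappingUpdate) {
                log.add("write: request mapping update for " + doc);
                updateQueue.add(() -> {
                    log.add("update: mapping updated");
                    // Retry the operation as a new write task.
                    indexOnPrimary(doc, false, listener);
                });
            } else {
                log.add("write: indexed " + doc);
                listener.accept(doc);
            }
        });
    }

    // Drains both queues on the calling thread; no task ever waits on another.
    void runAll() {
        while (!writeQueue.isEmpty() || !updateQueue.isEmpty()) {
            Runnable next = writeQueue.isEmpty() ? updateQueue.poll() : writeQueue.poll();
            next.run();
        }
    }
}

class AsyncBulkSketchDemo {
    public static void main(String[] args) {
        AsyncBulkSketch sketch = new AsyncBulkSketch();
        List<String> completed = new ArrayList<>();
        sketch.indexOnPrimary("doc-1", true, completed::add);
        sketch.runAll();
        sketch.log.forEach(System.out::println);
        System.out.println("completed: " + completed); // completed: [doc-1]
    }
}
```

Since every step is a plain queued task, the whole sequence can be driven by a single-threaded scheduler, which a blocking wait on the `write` thread would have prevented.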
@original-brownbear

closing here since all of this is now part of other non-draft PRs

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 11, 2019
This is a dependency of elastic#39504

Motivation:
By refactoring `TransportShardBulkAction#shardOperationOnPrimary` to be async, we enable `DeterministicTaskQueue`-based tests to run indexing operations. This was previously impossible because we blocked on the `write` thread until the `update` thread finished the mapping update.
With this change, the mapping update triggers a new task in the `write` queue instead.
This change significantly increases the coverage we get from `SnapshotResiliencyTests` (and potential future tests) when it comes to tracking down concurrency issues with distributed state machines.

The logical change is effectively all in `TransportShardBulkAction`; the rest of the changes mechanically move the caller code and tests to async and pass the `ActionListener` down.

Since the move to async would have added more parameters to the `private static` steps in this logic, I decided to inline and DRY up the logic (shared between delete and update) as much as I could instead of passing the listener and wait-consumer down through all of them.
original-brownbear added a commit that referenced this pull request Apr 11, 2019
This is a dependency of #39504

Motivation:
By refactoring `TransportShardBulkAction#shardOperationOnPrimary` to be async, we enable `DeterministicTaskQueue`-based tests to run indexing operations. This was previously impossible because we blocked on the `write` thread until the `update` thread finished the mapping update.
With this change, the mapping update triggers a new task in the `write` queue instead.
This change significantly increases the coverage we get from `SnapshotResiliencyTests` (and potential future tests) when it comes to tracking down concurrency issues with distributed state machines.

The logical change is effectively all in `TransportShardBulkAction`; the rest of the changes mechanically move the caller code and tests to async and pass the `ActionListener` down.

Since the move to async would have added more parameters to the `private static` steps in this logic, I decided to inline and DRY up the logic (shared between delete and update) as much as I could instead of passing the listener and wait-consumer down through all of them.
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 25, 2019
* Add Restore Operation to SnapshotResiliencyTests

* Expand the successful snapshot test case to also include restoring the snapshot
  * Also index documents so that the restore can be meaningfully verified
* This is part of the larger effort to test eventually consistent blob stores in elastic#39504
original-brownbear added a commit that referenced this pull request Apr 26, 2019
* Add Restore Operation to SnapshotResiliencyTests

* Expand the successful snapshot test case to also include restoring the snapshot
  * Also index documents so that the restore can be meaningfully verified
* This is part of the larger effort to test eventually consistent blob stores in #39504
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
* Add Restore Operation to SnapshotResiliencyTests

* Expand the successful snapshot test case to also include restoring the snapshot
  * Also index documents so that the restore can be meaningfully verified
* This is part of the larger effort to test eventually consistent blob stores in elastic#39504
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
This is a dependency of elastic#39504 

Motivation: 
By refactoring `TransportShardBulkAction#shardOperationOnPrimary` to be async, we enable `DeterministicTaskQueue`-based tests to run indexing operations. This was previously impossible because we blocked on the `write` thread until the `update` thread finished the mapping update.
With this change, the mapping update triggers a new task in the `write` queue instead.
This change significantly increases the coverage we get from `SnapshotResiliencyTests` (and potential future tests) when it comes to tracking down concurrency issues with distributed state machines.

The logical change is effectively all in `TransportShardBulkAction`; the rest of the changes mechanically move the caller code and tests to async and pass the `ActionListener` down.

Since the move to async would have added more parameters to the `private static` steps in this logic, I decided to inline and DRY up the logic (shared between delete and update) as much as I could instead of passing the listener and wait-consumer down through all of them.
Labels
:Distributed Coordination/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), >refactoring, >test (issues or PRs that are addressing/adding tests), v7.2.0, v8.0.0-alpha1, WIP
3 participants