Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wipe Snapshots Before Indices in RestTests #39662

Conversation

original-brownbear
Copy link
Member

  • If we have a snapshot ongoing from the previous test and enter this method, then deleting the indices fails, which in turn fails the whole wipe
    • Fixed by first deleting/aborting snapshots

Example failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+bwc-tests/109/console

@markharwood fyi :)

* If we have a snapshot ongoing from the previous test and enter this method, then deleting the indices fails, which in turn fails the whole wipe
   * Fixed by first deleting/aborting snapshots
@original-brownbear original-brownbear added >test Issues or PRs that are addressing/adding tests :Delivery/Build Build or test infrastructure v7.0.0 v6.7.0 v8.0.0 v7.2.0 labels Mar 4, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I am concerned that this might miss bugs. I don't yet understand the failure. Here are some (edited) logs:

Suite: org.elasticsearch.backwards.MixedClusterClientYamlTestSuiteIT
  1> [2562-03-04T13:32:52,610][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.get_alias/10_basic/Get all aliases via /_alias] before test
  1> [2562-03-04T13:32:52,616][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] initializing REST clients against [http://[::1]:40899, http://[::1]:41395, http://[::1]:41282, http://[::1]:44755]
  1> [2562-03-04T13:32:52,786][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] initializing yaml client, minimum es version: [5.6.1] master version: [5.6.1] hosts: [http://[::1]:40899, http://[::1]:41395, http://[::1]:41282, http://[::1]:44755]
  1> [2562-03-04T13:32:52,844][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.get_alias/10_basic/Get all aliases via /_alias] after test
  1> [2562-03-04T13:32:52,875][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=search.aggregation/110_max_metric/Missing field with missing param] before test
  1> [2562-03-04T13:32:53,385][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=search.aggregation/110_max_metric/Missing field with missing param] after test
  1> [2562-03-04T13:32:53,400][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=search.aggregation/130_sum_metric/Aggregating wrong datatype test] before test
  1> [2562-03-04T13:32:54,108][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=search.aggregation/130_sum_metric/Aggregating wrong datatype test] after test
  1> [2562-03-04T13:32:54,133][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=search.aggregation/180_percentiles_tdigest_metric/Metadata test] before test
  1> [2562-03-04T13:32:54,164][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=search.aggregation/180_percentiles_tdigest_metric/Metadata test] after test
  1> [2562-03-04T13:32:54,183][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=delete/20_internal_version/Internal version] before test
  1> [2562-03-04T13:32:54,213][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=delete/20_internal_version/Internal version] after test
  1> [2562-03-04T13:32:54,239][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.open/10_basic/Open index with wait_for_active_shards set to all] before test
  1> [2562-03-04T13:32:54,270][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.open/10_basic/Open index with wait_for_active_shards set to all] after test
  1> [2562-03-04T13:32:54,281][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.get_mapping/30_missing_index/Raise 404 when index doesn't exist] before test
  1> [2562-03-04T13:32:54,322][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.get_mapping/30_missing_index/Raise 404 when index doesn't exist] after test
  1> [2562-03-04T13:32:54,332][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.rollover/10_basic/Rollover no condition matched] before test
  1> [2562-03-04T13:32:54,693][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.rollover/10_basic/Rollover no condition matched] after test
  1> [2562-03-04T13:32:54,707][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.stats/13_fields/Fields - star] before test
  1> [2562-03-04T13:32:55,274][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.stats/13_fields/Fields - star] after test
  1> [2562-03-04T13:32:55,288][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=cluster.reroute/11_explain/Explain API with empty command list] before test
  1> [2562-03-04T13:32:55,518][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=cluster.reroute/11_explain/Explain API with empty command list] after test
  1> [2562-03-04T13:32:55,532][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=search/110_field_collapsing/no hits and inner_hits] before test
  1> [2562-03-04T13:32:56,276][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=search/110_field_collapsing/no hits and inner_hits] after test
  1> [2562-03-04T13:32:56,293][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.get_mapping/70_legacy_multi_type/Get /index*/_mapping/{type}] before test
  1> [2562-03-04T13:32:57,054][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.get_mapping/70_legacy_multi_type/Get /index*/_mapping/{type}] after test
  1> [2562-03-04T13:32:57,065][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=snapshot.get/10_basic/Get snapshot info contains include_global_state] before test
  1> [2562-03-04T13:32:57,091][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=snapshot.get/10_basic/Get snapshot info contains include_global_state] after test
  1> [2562-03-04T13:32:57,108][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=search/160_exists_query/Test exists query on mapped float field with no doc values] before test
  1> [2562-03-04T13:32:59,071][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=search/160_exists_query/Test exists query on mapped float field with no doc values] after test
  1> [2562-03-04T13:32:59,087][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.stats/20_translog/Translog retention] before test
  1> [2562-03-04T13:32:59,119][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.stats/20_translog/Translog retention] after test
  1> [2562-03-04T13:32:59,133][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=termvectors/10_basic/Basic tests for termvector get] before test
  1> [2562-03-04T13:32:59,559][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=termvectors/10_basic/Basic tests for termvector get] after test
  1> [2562-03-04T13:32:59,572][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=search.aggregation/10_histogram/Format test] before test
  1> [2562-03-04T13:32:59,997][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=search.aggregation/10_histogram/Format test] after test
  1> [2562-03-04T13:33:00,009][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=cat.shards/10_basic/Help] before test
  1> [2562-03-04T13:33:00,037][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=cat.shards/10_basic/Help] after test
  1> [2562-03-04T13:33:00,050][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.get_mapping/11_basic_with_types/Get /{index}/_mapping/_all] before test
  1> [2562-03-04T13:33:00,797][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.get_mapping/11_basic_with_types/Get /{index}/_mapping/_all] after test
  1> [2562-03-04T13:33:00,806][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.get_mapping/50_wildcard_expansion/Get test-* with wildcard_expansion=open,closed] before test
  1> [2562-03-04T13:33:01,670][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.get_mapping/50_wildcard_expansion/Get test-* with wildcard_expansion=open,closed] after test
  1> [2562-03-04T13:33:01,681][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=search/160_exists_query/Test exists query on mapped long field with no doc values] before test
  1> [2562-03-04T13:33:03,172][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=search/160_exists_query/Test exists query on mapped long field with no doc values] after test
  1> [2562-03-04T13:33:03,182][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=get/70_source_filtering/Source filtering] before test
  1> [2562-03-04T13:33:03,704][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=get/70_source_filtering/Source filtering] after test
  1> [2562-03-04T13:33:03,714][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.get_template/10_basic/Get template with flat settings and master timeout] before test
  1> [2562-03-04T13:33:03,736][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.get_template/10_basic/Get template with flat settings and master timeout] after test
  1> [2562-03-04T13:33:03,746][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.get_field_mapping/50_field_wildcards/Get field mapping with * for fields] before test
  1> [2562-03-04T13:33:03,767][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.get_field_mapping/50_field_wildcards/Get field mapping with * for fields] after test
  1> [2562-03-04T13:33:03,776][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=search.aggregation/190_percentiles_hdr_metric/Missing field with missing param] before test
  1> [2562-03-04T13:33:04,115][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=search.aggregation/190_percentiles_hdr_metric/Missing field with missing param] after test
  1> [2562-03-04T13:33:04,125][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.put_settings/all_path_options/put settings in * index] before test
  1> [2562-03-04T13:33:04,893][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.put_settings/all_path_options/put settings in * index] after test
  1> [2562-03-04T13:33:04,901][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.put_mapping/all_path_options/put mapping in * index] before test
  1> [2562-03-04T13:33:04,929][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=indices.put_mapping/all_path_options/put mapping in * index] after test
  1> [2562-03-04T13:33:04,946][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=snapshot.restore/10_basic/Create a snapshot and then restore it] before test
  1> [2562-03-04T13:33:35,158][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] Stash dump on test failure [{
  1>   "stash" : {
  1>     "body" : {
  1>       "error" : {
  1>         "root_cause" : [
  1>           {
  1>             "type" : "concurrent_snapshot_execution_exception",
  1>             "reason" : "[test_repo_restore_1:test_snapshot]a snapshot is already running",
  1>             "stack_trace" : "ConcurrentSnapshotExecutionException[[test_repo_restore_1:test_snapshot]a snapshot is already running]
  1>  at org.elasticsearch.snapshots.SnapshotsService$1.execute(SnapshotsService.java:266)
  1>  at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45)
  1>  at org.elasticsearch.cluster.service.ClusterService.executeTasks(ClusterService.java:634)
  1>  at org.elasticsearch.cluster.service.ClusterService.calculateTaskOutputs(ClusterService.java:612)
  1>  at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:571)
  1>  at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.run(ClusterService.java:263)
  1>  at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
  1>  at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
  1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569)
  1>  at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:247)
  1>  at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:210)
  1>  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  1>  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)

As far as I can see there's only one test that does anything with snapshots before the one that fails, snapshot.get/10_basic/Get snapshot info contains include_global_state, it shouldn't be finishing while there are still snapshots running. So I am wondering what was the snapshot running here? I note that this was a 6.7/5.6.1 mixed cluster with a 5.6.1 master so maybe this is a bug that has since been fixed?

@original-brownbear
Copy link
Member Author

original-brownbear commented Mar 5, 2019

@DaveCTurner sorry I wasn't so clear in the PR description. There's def. some other issue in that test (on that too), but imo the current behavior is just annoying: We leak in one test => all tests fail and one has to go through MBs of logs to track down what triggered this because the teardown method isn't stable. I guess we could register the fact that we had in-progress snapshots in the teardown method and then fail after the teardown actually happened if we wanted to make a leaked snapshot a failure condition?

@original-brownbear
Copy link
Member Author

@DaveCTurner I think one possible explanation for the first failure seen in that specific log would be a master failure while the snapshot request was made (that situation is known to behave weirdly and can fail in exactly this way due to master request retries). But then again, there could be a number of other issues like the one fixed in #27214 here I think since it's 5.6.1 => I still like this change to make things a little easier to debug log wise :D

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'm with you on the current behaviour being annoying. Could we make it fail this one time if it finds a snapshot in progress, but then clean them up anyway, so at least the build will fail if a test leaks a snapshot?

@original-brownbear
Copy link
Member Author

@DaveCTurner thanks, added code to first wipe the snapshots and then fail in the end if we found any that weren't in a completed state :)

@original-brownbear
Copy link
Member Author

original-brownbear commented Mar 6, 2019

The real issue here is fixed in #39737, but I'm still a fan of this change just to make things a little more ergonomic :)

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@original-brownbear
Copy link
Member Author

@DaveCTurner thanks!

@original-brownbear original-brownbear merged commit 42649ff into elastic:master Mar 6, 2019
@original-brownbear original-brownbear deleted the wipe-snapshots-before-indices branch March 6, 2019 18:26
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Mar 6, 2019
* Wipe Snapshots Before Indices in RestTests

* If we have a snapshot ongoing from the previous test and enter this method, then deleting the indices fails, which in turn fails the whole wipe
   * Fixed by first deleting/aborting snapshots
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Mar 6, 2019
* Wipe Snapshots Before Indices in RestTests

* If we have a snapshot ongoing from the previous test and enter this method, then deleting the indices fails, which in turn fails the whole wipe
   * Fixed by first deleting/aborting snapshots
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Mar 6, 2019
* Wipe Snapshots Before Indices in RestTests

* If we have a snapshot ongoing from the previous test and enter this method, then deleting the indices fails, which in turn fails the whole wipe
   * Fixed by first deleting/aborting snapshots
@original-brownbear
Copy link
Member Author

backports in #39765 #39764 #39763

original-brownbear added a commit that referenced this pull request Mar 6, 2019
* Wipe Snapshots Before Indices in RestTests

* If we have a snapshot ongoing from the previous test and enter this method, then deleting the indices fails, which in turn fails the whole wipe
   * Fixed by first deleting/aborting snapshots
original-brownbear added a commit that referenced this pull request Mar 6, 2019
* Wipe Snapshots Before Indices in RestTests

* If we have a snapshot ongoing from the previous test and enter this method, then deleting the indices fails, which in turn fails the whole wipe
   * Fixed by first deleting/aborting snapshots
original-brownbear added a commit that referenced this pull request Mar 6, 2019
* Wipe Snapshots Before Indices in RestTests

* If we have a snapshot ongoing from the previous test and enter this method, then deleting the indices fails, which in turn fails the whole wipe
   * Fixed by first deleting/aborting snapshots
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 7, 2019
* 6.7:
  Fix CCR HLRC docs
  Introduce forget follower API (elastic#39718)
  6.6.2 release notes.
  Update release notes for 6.7.0
  Add documentation for min_hash filter (elastic#39671)
  Unmute testIndividualActionsTimeout
  Unmute testFollowIndexAndCloseNode
  Use unwrapped cause to determine if node is closing (elastic#39723)
  Don’t ack if unable to remove failing replica (elastic#39584)
  Wipe Snapshots Before Indices in RestTests (elastic#39662) (elastic#39765)
  Bug fix for AnnotatedTextHighlighter (elastic#39525)
  Fix Snapshot BwC with Version 5.6.x (elastic#39737)
  Fix occasional SearchServiceTests failure (elastic#39697)
  Correct date in daterange-aggregation.asciidoc (elastic#39727)
  Add a note to purge the ingest-geoip plugin (elastic#39553)
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Mar 25, 2019
* Mistake was made in elastic#39662
* The response deserialized here is `org.elasticsearch.action.admin.cluster.snapshots.get.GetSnapshotsResponse` which uses `org.elasticsearch.snapshots.SnapshotInfo` which uses `org.elasticsearch.snapshots.SnapshotState` and not the shard state
original-brownbear added a commit that referenced this pull request Mar 26, 2019
* Mistake was made in #39662
* The response deserialized here is `org.elasticsearch.action.admin.cluster.snapshots.get.GetSnapshotsResponse` which uses `org.elasticsearch.snapshots.SnapshotInfo` which uses `org.elasticsearch.snapshots.SnapshotState` and not the shard state
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Mar 26, 2019
* Mistake was made in elastic#39662
* The response deserialized here is `org.elasticsearch.action.admin.cluster.snapshots.get.GetSnapshotsResponse` which uses `org.elasticsearch.snapshots.SnapshotInfo` which uses `org.elasticsearch.snapshots.SnapshotState` and not the shard state
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Mar 26, 2019
* Mistake was made in elastic#39662
* The response deserialized here is `org.elasticsearch.action.admin.cluster.snapshots.get.GetSnapshotsResponse` which uses `org.elasticsearch.snapshots.SnapshotInfo` which uses `org.elasticsearch.snapshots.SnapshotState` and not the shard state
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Mar 26, 2019
* Mistake was made in elastic#39662
* The response deserialized here is `org.elasticsearch.action.admin.cluster.snapshots.get.GetSnapshotsResponse` which uses `org.elasticsearch.snapshots.SnapshotInfo` which uses `org.elasticsearch.snapshots.SnapshotState` and not the shard state
original-brownbear added a commit that referenced this pull request Mar 26, 2019
* Mistake was made in #39662
* The response deserialized here is `org.elasticsearch.action.admin.cluster.snapshots.get.GetSnapshotsResponse` which uses `org.elasticsearch.snapshots.SnapshotInfo` which uses `org.elasticsearch.snapshots.SnapshotState` and not the shard state
original-brownbear added a commit that referenced this pull request Mar 26, 2019
* Mistake was made in #39662
* The response deserialized here is `org.elasticsearch.action.admin.cluster.snapshots.get.GetSnapshotsResponse` which uses `org.elasticsearch.snapshots.SnapshotInfo` which uses `org.elasticsearch.snapshots.SnapshotState` and not the shard state
original-brownbear added a commit that referenced this pull request Mar 26, 2019
* Mistake was made in #39662
* The response deserialized here is `org.elasticsearch.action.admin.cluster.snapshots.get.GetSnapshotsResponse` which uses `org.elasticsearch.snapshots.SnapshotInfo` which uses `org.elasticsearch.snapshots.SnapshotState` and not the shard state
@mark-vieira mark-vieira added the Team:Delivery Meta label for Delivery team label Nov 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Delivery/Build Build or test infrastructure Team:Delivery Meta label for Delivery team >test Issues or PRs that are addressing/adding tests v6.7.0 v7.0.0-rc2 v7.2.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants