
MixedClusterClientYamlTestSuiteIT failure due to trying to delete indices being snapshotted/creating indices that already exist #39721

Closed
gwbrown opened this issue Mar 5, 2019 · 6 comments
Assignees
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI

Comments

gwbrown (Contributor) commented Mar 5, 2019

This hit a 6.7 intake build on one of my commits. I'm pretty sure it's not related to the changes in the commit, as they pertain mostly to Watcher.

CI Link: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+intake/306/console

A ton of tests in MixedClusterClientYamlTestSuiteIT failed; none reproduces locally. Sample reproduce line:

./gradlew :qa:mixed-cluster:v5.6.16#mixedClusterTestRunner \
  -Dtests.seed=AFFFC69E10895A69 \
  -Dtests.class=org.elasticsearch.backwards.MixedClusterClientYamlTestSuiteIT \
  -Dtests.method="test {p0=cat.snapshots/10_basic/Test cat snapshots output}" \
  -Dtests.security.manager=true \
  -Dtests.locale=he-IL \
  -Dtests.timezone=America/Grenada \
  -Dcompiler.java=11 \
  -Druntime.java=8

There are two kinds of exceptions that keep popping up in the logs and look like they might be related.

One is a failure to delete some indices that are currently being snapshotted:

org.elasticsearch.client.ResponseException: method [DELETE], host [http://[::1]:33913], URI [*], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[node-0][127.0.0.1:44482][indices:admin/delete]"}],"type":"illegal_argument_exception","reason":"Cannot delete indices that are being snapshotted: [[index2/vK05jwCyTIiYeTcVLqKQpw], [index1/TvYH0C5FSHSG4l_PhEQMHA]]. Try again after snapshot finishes or cancel the currently running snapshot."},"status":400}
	at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:936)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:233)
	at org.elasticsearch.test.rest.ESRestTestCase.wipeCluster(ESRestTestCase.java:455)
	at org.elasticsearch.test.rest.ESRestTestCase.cleanUpCluster(ESRestTestCase.java:273)
	at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.client.ResponseException: method [DELETE], host [http://[::1]:33913], URI [*], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[node-0][127.0.0.1:44482][indices:admin/delete]"}],"type":"illegal_argument_exception","reason":"Cannot delete indices that are being snapshotted: [[index2/vK05jwCyTIiYeTcVLqKQpw], [index1/TvYH0C5FSHSG4l_PhEQMHA]]. Try again after snapshot finishes or cancel the currently running snapshot."},"status":400}
	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:552)
	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:537)
	at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119)
	at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177)
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436)
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326)
	at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
	at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
	... 1 more
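
For context, the delete is rejected because the cleanup in ESRestTestCase.wipeCluster runs while a snapshot started by an earlier test is still in flight. A minimal sketch of the order of operations the cleanup needs is below, using the low-level REST client; the repository and snapshot names are hypothetical, and this is not the actual ESRestTestCase code:

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class SnapshotAwareWipe {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200)).build()) {
            // Deleting an in-progress snapshot aborts it and releases the indices it holds...
            client.performRequest(new Request("DELETE", "/_snapshot/test_repo/test_snapshot"));
            // ...after which the wildcard index delete that wipeCluster issues can succeed.
            client.performRequest(new Request("DELETE", "/*"));
        }
    }
}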

And the other is an attempt to create an index that already exists, which happens to be one of the indices that couldn't be deleted because it was being snapshotted. This one is a bit harder to get a clean stack trace for, so here's a snippet of the returned JSON:

"error" : {
  1>         "root_cause" : [
  1>           {
  1>             "type" : "index_already_exists_exception",
  1>             "reason" : "index [index1/TvYH0C5FSHSG4l_PhEQMHA] already exists",
  1>             "index_uuid" : "TvYH0C5FSHSG4l_PhEQMHA",
  1>             "index" : "index1",
  1>             "stack_trace" : "[index1/TvYH0C5FSHSG4l_PhEQMHA] ResourceAlreadyExistsException[index [index1/TvYH0C5FSHSG4l_PhEQMHA] already exists]
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:147)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:512)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.access$000(MetaDataCreateIndexService.java:106)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$1.execute(MetaDataCreateIndexService.java:239)
  1> 	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45)
  1> 	at org.elasticsearch.cluster.service.ClusterService.executeTasks(ClusterService.java:634)
  1> 	at org.elasticsearch.cluster.service.ClusterService.calculateTaskOutputs(ClusterService.java:612)
  1> 	at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:571)
  1> 	at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.run(ClusterService.java:263)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:576)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:247)
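
The YAML setup trips here because it recreates index1 unconditionally and the index survived the failed wipe above. Purely as an illustration (not the actual test harness code), a guarded create with the low-level REST client would look roughly like this; the real fix is still to clean up the running snapshot rather than to tolerate leftovers:

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class CreateIfMissing {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200)).build()) {
            // HEAD returns 404 if the index is absent; the client-side "ignore" parameter
            // keeps the client from turning that 404 into a ResponseException.
            Request exists = new Request("HEAD", "/index1");
            exists.addParameter("ignore", "404");
            Response response = client.performRequest(exists);
            if (response.getStatusLine().getStatusCode() == 404) {
                // Only create the index when it is genuinely missing.
                client.performRequest(new Request("PUT", "/index1"));
            }
        }
    }
}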

Logs, for posterity: consoleText.txt.zip

[edit: pasted the wrong second snippet the first time]

@gwbrown gwbrown added >test-failure Triaged test failures from CI :Delivery/Build Build or test infrastructure labels Mar 5, 2019
elasticmachine (Collaborator)

Pinging @elastic/es-core-infra

gwbrown (Contributor, Author) commented Mar 5, 2019

I've tagged this as core/infra/build mostly because there's a ton of stuff in this test suite and I'm not sure whether this is a problem with one of the tests or with the test infrastructure.

@original-brownbear original-brownbear self-assigned this Mar 5, 2019
@original-brownbear original-brownbear added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs and removed :Delivery/Build Build or test infrastructure labels Mar 5, 2019
elasticmachine (Collaborator)

Pinging @elastic/es-distributed

original-brownbear (Member) commented:

This is a snapshot BwC issue between 6.7.0 and 5.6.x resulting from #39550. I'll deal with it tomorrow.
Also linking #39662, which makes the logging for this failure a lot less painful.

original-brownbear (Member) commented:

I think I've tracked this down: we are sending the wrong snapshot shard status update message to 5.6 nodes from 6.7. Fixing now.
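
For readers unfamiliar with the wire-format handling: messages exchanged in a mixed cluster are serialized differently depending on the receiver's version, so the shard status update has to branch on the stream version. Below is a minimal, purely illustrative sketch of that pattern, assuming a made-up message class rather than the actual code the fix touches; the bug described above is what happens when the branch picks the wrong format for a 5.6.x master:

import org.elasticsearch.Version;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.io.stream.Writeable;

import java.io.IOException;

// Hypothetical message class; only the version-gated writeTo pattern is the point here.
public class ShardStatusUpdate implements Writeable {
    private final String snapshot;
    private final int shardId;
    private final String state;

    public ShardStatusUpdate(String snapshot, int shardId, String state) {
        this.snapshot = snapshot;
        this.shardId = shardId;
        this.state = state;
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeString(snapshot);
        out.writeVInt(shardId);
        if (out.getVersion().onOrAfter(Version.V_6_0_0)) {
            // Nodes on 6.x can read the richer status field...
            out.writeString(state);
        } else {
            // ...while a 5.6.x master expects the older shape, so write that instead.
            out.writeBoolean("SUCCESS".equals(state));
        }
    }
}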

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Mar 6, 2019
* We were sending the wrong snapshot shard status update format to 5.6 (but reading the correct version) so tests would fail with 5.6 masters but not with 5.6 nodes running against a 6.7 master
* Closes elastic#39721
original-brownbear added a commit that referenced this issue Mar 6, 2019
* Fix Snapshot BwC with Version 5.6.x

* We were sending the wrong snapshot shard status update format to 5.6 (but reading the correct version) so tests would fail with 5.6 masters but not with 5.6 nodes running against a 6.7 master
* Closes #39721
original-brownbear (Member) commented:

Closed via #39737
