
[CI] resource_already_exists_exception failures #45605

Closed
droberts195 opened this issue Aug 15, 2019 · 12 comments
Assignees
Labels
:Delivery/Build Build or test infrastructure :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. Team:Delivery Meta label for Delivery team >test-failure Triaged test failures from CI

Comments

@droberts195
Contributor

This problem occurred while running indices.update_aliases/10_basic/Basic test for multiple aliases in the qa/smoke-test-multinode directory. However, I do not think that particular test is to blame for the failure - it seems more like a very rare problem with index management in multi-node clusters.

The build where this problem was seen was https://gradle-enterprise.elastic.co/s/fknt4cque3p4e

The client-side problem that made the test fail was "index already exists":

  1> [2019-08-15T18:29:39,851][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] [yaml=indices.update_aliases/10_basic/Basic test for multiple aliases] before test
  1> [2019-08-15T18:30:09,895][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] Stash dump on test failure [{
  1>   "stash" : {
  1>     "body" : {
  1>       "error" : {
  1>         "root_cause" : [
  1>           {
  1>             "type" : "resource_already_exists_exception",
  1>             "reason" : "index [test_index/3APvix2mRpuc5B52QMFNbA] already exists",
  1>             "index_uuid" : "3APvix2mRpuc5B52QMFNbA",
  1>             "index" : "test_index",
  1>             "stack_trace" : "[test_index/3APvix2mRpuc5B52QMFNbA] ResourceAlreadyExistsException[index [test_index/3APvix2mRpuc5B52QMFNbA] already exists]
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:148)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:607)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:285)
  1> 	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)
  1> 	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697)
  1> 	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319)
  1> 	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214)
  1> 	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  1> 	at java.lang.Thread.run(Thread.java:834)
  1> "
  1>           }
  1>         ],
  1>         "type" : "resource_already_exists_exception",
  1>         "reason" : "index [test_index/3APvix2mRpuc5B52QMFNbA] already exists",
  1>         "index_uuid" : "3APvix2mRpuc5B52QMFNbA",
  1>         "index" : "test_index",
  1>         "stack_trace" : "[test_index/3APvix2mRpuc5B52QMFNbA] ResourceAlreadyExistsException[index [test_index/3APvix2mRpuc5B52QMFNbA] already exists]
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:148)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:607)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:285)
  1> 	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)
  1> 	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697)
  1> 	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319)
  1> 	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214)
  1> 	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  1> 	at java.lang.Thread.run(Thread.java:834)
  1> "
  1>       },
  1>       "status" : 400
  1>     }
  1>   }
  1> }]
  1> [2019-08-15T18:30:29,541][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] There are still tasks running after this test that might break subsequent tests [internal:index/shard/recovery/start_recovery].
  1> [2019-08-15T18:30:29,541][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] [yaml=indices.update_aliases/10_basic/Basic test for multiple aliases] after test

The underlying reason from the server-side logs was that a previous test's index of the same name could not be deleted:

[2019-08-15T08:28:37,221][WARN ][o.e.i.IndicesService     ] [integTest-0] [test_index/THgD8CcvQyGEf_ADVnVoCw] failed to delete index
org.elasticsearch.env.ShardLockObtainFailedException: [test_index][0]: obtaining shard lock timed out after 0ms, previous lock details: [shard creation] trying to lock for [deleting index directory]
    at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:860) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:775) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.env.NodeEnvironment.lockAllForIndex(NodeEnvironment.java:718) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.env.NodeEnvironment.deleteIndexDirectorySafe(NodeEnvironment.java:670) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.deleteIndexStoreIfDeletionAllowed(IndicesService.java:800) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.deleteIndexStore(IndicesService.java:787) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.removeIndex(IndicesService.java:691) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.deleteIndices(IndicesClusterStateService.java:375) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:241) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:508) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at java.lang.Iterable.forEach(Iterable.java:75) [?:?]
    at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:505) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:482) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:176) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]
[2019-08-15T08:28:37,240][WARN ][o.e.i.s.RetentionLeaseSyncAction] [integTest-0] unexpected error during the primary phase for action [indices:admin/seq_no/retention_lease_sync], request [RetentionLeaseSyncAction.Request{retentionLeases=RetentionLeases{primaryTerm=1, version=2, leases={peer_recovery/QZ3ZxWTXT5WjQjs7JuH_sA=RetentionLease{id='peer_recovery/QZ3ZxWTXT5WjQjs7JuH_sA', retainingSequenceNumber=0, timestamp=1565857716949, source='peer recovery'}, peer_recovery/bnbl33BbQveq_mL2_DzTcg=RetentionLease{id='peer_recovery/bnbl33BbQveq_mL2_DzTcg', retainingSequenceNumber=0, timestamp=1565857717149, source='peer recovery'}}}, shardId=[test_index][0], timeout=1m, index='test_index', waitForActiveShards=0}]
org.elasticsearch.index.IndexNotFoundException: no such index [test_index]
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:190) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:116) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteSingleIndex(IndexNameExpressionResolver.java:278) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction.concreteIndex(TransportReplicationAction.java:234) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:651) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onNewClusterState(TransportReplicationAction.java:795) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onNewClusterState(ClusterStateObserver.java:311) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.postAdded(ClusterStateObserver.java:220) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService$1.run(ClusterApplierService.java:299) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]

The repro command is:

./gradlew :qa:smoke-test-multinode:integTestRunner --tests "org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT" \
  -Dtests.method="test {yaml=indices.update_aliases/10_basic/Basic test for multiple aliases}" \
  -Dtests.seed=8C4DDB5E411196EE \
  -Dtests.security.manager=true \
  -Dtests.locale=fr-ML \
  -Dtests.timezone=Australia/ACT \
  -Dcompiler.java=12 \
  -Druntime.java=11

This did not reproduce for me.

The other thing that might be relevant is that, around the time this was failing, a completely different test against a different test cluster failed on the same CI worker. It was:

./gradlew :docs:integTestRunner --tests "org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT.test {yaml=reference/docs/update-by-query/line_253}" \
  -Dtests.seed=8C4DDB5E411196EE \
  -Dtests.security.manager=true \
  -Dtests.locale=dyo \
  -Dtests.timezone=Asia/Jerusalem \
  -Dcompiler.java=12 \
  -Druntime.java=11

The client-side error for that was:

  1> [2019-08-15T11:29:39,823][INFO ][o.e.s.DocsClientYamlTestSuiteIT] [test] [yaml=reference/docs/update-by-query/line_253] before test
  1> [2019-08-15T11:30:09,868][INFO ][o.e.s.DocsClientYamlTestSuiteIT] [test] Stash dump on test failure [{
  1>   "stash" : {
  1>     "body" : {
  1>       "error" : {
  1>         "root_cause" : [
  1>           {
  1>             "type" : "resource_already_exists_exception",
  1>             "reason" : "index [twitter/I-JDSTD4S6KmKfAEkI95iA] already exists",
  1>             "index_uuid" : "I-JDSTD4S6KmKfAEkI95iA",
  1>             "index" : "twitter",
  1>             "stack_trace" : "[twitter/I-JDSTD4S6KmKfAEkI95iA] ResourceAlreadyExistsException[index [twitter/I-JDSTD4S6KmKfAEkI95iA] already exists]
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:148)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:607)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:285)
  1> 	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)
  1> 	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697)
  1> 	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319)
  1> 	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214)
  1> 	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  1> 	at java.base/java.lang.Thread.run(Thread.java:834)
  1> "
  1>           }
  1>         ],
  1>         "type" : "resource_already_exists_exception",
  1>         "reason" : "index [twitter/I-JDSTD4S6KmKfAEkI95iA] already exists",
  1>         "index_uuid" : "I-JDSTD4S6KmKfAEkI95iA",
  1>         "index" : "twitter",
  1>         "stack_trace" : "[twitter/I-JDSTD4S6KmKfAEkI95iA] ResourceAlreadyExistsException[index [twitter/I-JDSTD4S6KmKfAEkI95iA] already exists]
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:148)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:607)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:285)
  1> 	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)
  1> 	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697)
  1> 	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319)
  1> 	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214)
  1> 	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  1> 	at java.base/java.lang.Thread.run(Thread.java:834)
  1> "
  1>       },
  1>       "status" : 400
  1>     }
  1>   }
  1> }]
  1> [2019-08-15T11:30:36,334][INFO ][o.e.s.DocsClientYamlTestSuiteIT] [test] [yaml=reference/docs/update-by-query/line_253] after test

The server-side logs for that docs test failure show nothing apart from a big gap:

[2019-08-15T08:29:39,832][INFO ][o.e.c.m.MetaDataCreateIndexService] [node-0] [twitter] creating index, cause [api], templates [], shards [1]/[1], mappings [_doc]
[2019-08-15T08:30:24,229][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node-0] [twitter/I-JDSTD4S6KmKfAEkI95iA] deleting index

These two test clusters were different - it's not that the docs tests and the multi-node smoke tests were simultaneously talking to the same test cluster. But they were running at the same time and suffered failures around the same time.

It may be that these failures occurred because the CI worker was completely overloaded by multiple test clusters running simultaneously around the time of the failures.

But maybe there's something interesting in the ShardLockObtainFailedException that might be worth considering in case users see it on machines that are under heavy load.
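
To make the failure mode concrete: the stack trace above shows the index deletion asking for the per-shard lock with a zero wait ("obtaining shard lock timed out after 0ms") while shard creation still holds it, so under heavy load the deletion gives up immediately. What follows is only a minimal, self-contained sketch of that zero-wait pattern using a plain java.util.concurrent.Semaphore; the class and method names are invented for illustration and this is not the actual NodeEnvironment code.

import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Illustration only: one permit per shard, similar in spirit to the per-shard
// lock that the stack trace above goes through. The deleting side asks for the
// lock with a zero wait, so if shard creation still holds it the deletion
// gives up at once.
public class ZeroWaitShardLockSketch {

    private final Semaphore shardLock = new Semaphore(1);

    // e.g. "shard creation" takes the lock and holds it while the node is busy
    void lockForShardCreation() throws InterruptedException {
        shardLock.acquire();
    }

    void releaseShardLock() {
        shardLock.release();
    }

    // index deletion tries the same lock with a 0ms timeout
    void lockForIndexDeletion() throws InterruptedException {
        if (shardLock.tryAcquire(0, TimeUnit.MILLISECONDS) == false) {
            // stand-in for ShardLockObtainFailedException
            throw new IllegalStateException("obtaining shard lock timed out after 0ms");
        }
    }

    public static void main(String[] args) throws Exception {
        ZeroWaitShardLockSketch lock = new ZeroWaitShardLockSketch();
        lock.lockForShardCreation();   // previous lock details: [shard creation]
        lock.lockForIndexDeletion();   // fails immediately, as in the logs above
    }
}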

@droberts195 droberts195 added >test-failure Triaged test failures from CI :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Aug 15, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@droberts195
Contributor Author

droberts195 commented Aug 15, 2019

I just noticed this is a more detailed investigation of one specific case of #45600, so this may not be a rare problem at all.

@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@alpar-t
Contributor

alpar-t commented Aug 15, 2019

Could it be that somehow the clusters unite, so in fact the tests are running against one larger cluster rather than two separate ones? I think that would explain the behavior in the logs.
I remember @ywelsch suggested we use unique cluster names for tests to make sure this doesn't happen, even though I don't think it should with the current configuration.

@droberts195
Contributor Author

Could it be that somehow the clusters unite

Maybe that has happened in other CI runs, but in the CI run described in this issue the two test clusters did not unite. The server-side logs for the docs cluster show that it only had 1 node (as expected) and the server-side logs for the qa/smoke-test-multinode cluster show that it only had 2 nodes (as expected).

@droberts195
Contributor Author

The two examples we have so far, where tests for qa/smoke-test-multinode and docs failed around the same time, may be a coincidence.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob-windows-compatibility/os=windows-2016/71/console is an example where the failure to delete an index between tests occurs and just a single test fails in the CI run:

  1> [2019-08-15T04:48:49,931][WARN ][o.e.i.IndicesService     ] [node_t1] [test/y12jwHyUSiS70AhJ_iAD3g] failed to delete index
  1> org.elasticsearch.env.ShardLockObtainFailedException: [test][0]: obtaining shard lock timed out after 0ms, previous lock details: [shard creation] trying to lock for [deleting index directory]

The failure that follows on from that is:

    java.lang.IllegalStateException: Some shards are still open after the threadpool terminated. Something is leaking index readers or store references.

The repro command for this failure was:

gradlew :server:integTest --tests "org.elasticsearch.cluster.allocation.ClusterRerouteIT.testRerouteExplain" \
  -Dtests.seed=13BCAC0C1240CA7E \
  -Dtests.security.manager=true \
  -Dtests.locale=fi \
  -Dtests.timezone=Canada/Yukon \
  -Dcompiler.java=12 \
  -Druntime.java=8

Given that the problem occurs in the cleanup between tests, I'm not sure individual repro commands are that helpful, though.

@alpar-t
Contributor

alpar-t commented Aug 15, 2019

As a side note, it seems that we should fail if the cleanup fails instead of attempting to run the tests.
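
A rough sketch of what such a check might look like in a REST test's between-test cleanup, assuming the low-level RestClient API; wipeAllIndices is a hypothetical helper for illustration, not the method the test framework actually uses.

import java.io.IOException;

import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

// Hypothetical cleanup helper: delete all indices and fail loudly if the
// delete was not acknowledged, instead of silently letting the next test
// trip over a leftover index.
final class WipeIndicesHelper {

    private WipeIndicesHelper() {}

    static void wipeAllIndices(RestClient adminClient) throws IOException {
        Response response = adminClient.performRequest(new Request("DELETE", "/*"));
        String body = EntityUtils.toString(response.getEntity());
        // the delete-index API reports {"acknowledged":true} once the master
        // has applied the deletion to the cluster state
        if (body.contains("\"acknowledged\":true") == false) {
            throw new AssertionError("wiping all indices was not acknowledged: " + body);
        }
    }
}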

@alpar-t
Contributor

alpar-t commented Aug 22, 2019

We looked at this failure on Slack a couple of days ago with @original-brownbear and did not update here; doing so now in case someone finds it useful.

There were some CCR test failures that looked similar (index not found) that had to do with caching test tasks when we shouldn't have. CCR tests are usually made up of a test that sets CCR up and another one that checks on it, so if part of it gets cached the index will never replicate. This was fixed in #45659 and is unrelated to this issue, as here we are dealing with tests that run with testclusters, which can correctly deal with caching.

It also seems that this fails on tests that run against a multi-node cluster.

The following exception shows up after the test when the test framework tries to clean up all indices:

[2019-08-19T12:37:44,585][WARN ][o.e.i.IndicesService     ] [integTest-0] [goodbad/29xCvTM8T9GIAsfnTPWONQ] failed to delete index
org.elasticsearch.env.ShardLockObtainFailedException: [goodbad][0]: obtaining shard lock timed out after 0ms, previous lock details: [shard creation] trying to lock for [deleting index directory]
    at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:860) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:775) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.env.NodeEnvironment.lockAllForIndex(NodeEnvironment.java:718) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.env.NodeEnvironment.deleteIndexDirectorySafe(NodeEnvironment.java:670) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.deleteIndexStoreIfDeletionAllowed(IndicesService.java:800) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.deleteIndexStore(IndicesService.java:787) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.removeIndex(IndicesService.java:691) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.deleteIndices(IndicesClusterStateService.java:375) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:241) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:508) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at java.lang.Iterable.forEach(Iterable.java:75) [?:?]
    at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:505) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:482) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:176) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]

In this case we had 2 tests that used the same index.
The first one to do so failed, saying the index already exists, and then the cleanup failed (silently) again when trying to delete that existing index, yet the next test that used that index name passed.

There is no evidence that the index that exists was ever created. No tests that use that index seem to have run prior to this, and the test is connected to the right cluster, with the right nodes, ruling out cross-talk with other tests.
This also seems to happen in bwc tests that do not run in parallel.
checkPart1 seems to fail most often, so running ./gradlew checkPart1 --parallel on a CI worker is likely to reproduce this.

@dnhatn suggested this could be related to #45409 but we seem to have seen failures after that was merged as well.

@original-brownbear original-brownbear self-assigned this Aug 22, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Aug 22, 2019
In internal test clusters tests we check that wiping all indices was acknowledged
but in REST tests we didn't.
This aligns the behavior in both kinds of tests.
Relates elastic#45605 which might be caused by unacked deletes that were just slow.
@williamrandolph
Contributor

On this intake build, OpenCloseIndexIT failed with symptoms that were similar to what @atorok described above: mysterious test failure, then ShardLockObtainFailedException at test cleanup time. https://gradle-enterprise.elastic.co/s/t26bxyvqgurkq/tests/kyv2y2z3r4v7m-rg5rz5gnpfyvq

original-brownbear added a commit that referenced this issue Aug 22, 2019
In internal test clusters tests we check that wiping all indices was acknowledged
but in REST tests we didn't.
This aligns the behavior in both kinds of tests.
Relates #45605 which might be caused by unacked deletes that were just slow.
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Aug 22, 2019

In internal test clusters tests we check that wiping all indices was acknowledged
but in REST tests we didn't.
This aligns the behavior in both kinds of tests.
Relates elastic#45605 which might be caused by unacked deletes that were just slow.
original-brownbear added a commit that referenced this issue Aug 22, 2019
…45842)

In internal test clusters tests we check that wiping all indices was acknowledged
but in REST tests we didn't.
This aligns the behavior in both kinds of tests.
Relates #45605 which might be caused by unacked deletes that were just slow.
@DaveCTurner
Contributor

As noted in #45956 (comment) I do not think that "org.elasticsearch.env.ShardLockObtainFailedException: [goodbad][0]: obtaining shard lock timed out after 0ms, previous lock details: [shard creation] trying to lock for [deleting index directory]" is necessarily a problem. But looking at the build failure linked in #45605 (comment) (i.e. https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob-windows-compatibility/os=windows-2016/71/consoleText) I see this:

  1> [2019-08-15T04:48:59,949][WARN ][o.e.i.c.IndicesClusterStateService] [node_t0] [[test/y12jwHyUSiS70AhJ_iAD3g]] failed to complete pending deletion for index
  1> org.elasticsearch.env.ShardLockObtainFailedException: [test][0]: thread interrupted while trying to obtain shard lock
  1> 	at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:775) ~[main/:?]
  1> 	at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:684) ~[main/:?]
  1> 	at org.elasticsearch.env.NodeEnvironment.lockAllForIndex(NodeEnvironment.java:627) ~[main/:?]
  1> 	at org.elasticsearch.indices.IndicesService.processPendingDeletes(IndicesService.java:1077) ~[main/:?]
  1> 	at org.elasticsearch.indices.cluster.IndicesClusterStateService$2.doRun(IndicesClusterStateService.java:356) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]
  1> Caused by: java.lang.InterruptedException
  1> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039) ~[?:1.8.0_221]
  1> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) ~[?:1.8.0_221]
  1> 	at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:409) ~[?:1.8.0_221]
  1> 	at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:766) ~[main/:?]
  1> 	... 9 more

Interrupting this wait for the shard lock while the node is closing does seem like it might be problematic.
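
For illustration only, here is a tiny standalone sketch (made-up class name, plain java.util.concurrent primitives, not the real NodeEnvironment code) of how a thread waiting on the shard-lock semaphore ends up with that InterruptedException when something interrupts it, for example during node close.

import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Illustration: a pending-deletion thread blocks waiting for the per-shard
// semaphore, and interrupting it (as a shutdown would do to its workers)
// makes the wait end in InterruptedException instead of the lock ever
// being obtained.
public class InterruptedShardLockSketch {
    public static void main(String[] args) throws Exception {
        Semaphore shardLock = new Semaphore(1);
        shardLock.acquire(); // some other task still holds the shard lock

        Thread pendingDelete = new Thread(() -> {
            try {
                // wait with a long timeout, as the pending-deletes handling does
                shardLock.tryAcquire(30, TimeUnit.MINUTES);
            } catch (InterruptedException e) {
                // surfaces as "thread interrupted while trying to obtain shard lock"
                System.out.println("interrupted while waiting for the shard lock: " + e);
            }
        });
        pendingDelete.start();

        Thread.sleep(100);         // let the deleter block on the semaphore
        pendingDelete.interrupt(); // shutdown interrupts the waiting thread
        pendingDelete.join();
    }
}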

@ywelsch ywelsch changed the title [CI] "failed to delete index" due to ShardLockObtainFailedException [CI] resource_already_exists_exception failures Sep 5, 2019
@ywelsch
Contributor

ywelsch commented Sep 5, 2019

I've edited the title here. The ShardLockObtainFailedException and IndexNotFoundException warnings are harmless and also happen on successful runs. I've created #46368 to quiet the ShardLockObtainFailedException down, as it seems to cause confusion (as does any warning that we log).

We still need to figure out what's causing the resource_already_exists_exception.

@droberts195
Contributor Author

Subsequent updates moved to #46091, so closing this one.

@mark-vieira mark-vieira added the Team:Delivery Meta label for Delivery team label Nov 11, 2020