X-pack rolling upgrade failures #31827

Closed
davidkyle opened this issue Jul 5, 2018 · 3 comments
Labels
:Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. >test-failure Triaged test failures from CI v6.3.2 v6.4.0

Comments

@davidkyle
Member

There have been a number of failures in the x-pack:qa:rolling-upgrade suite recently with what appears to be the same error.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=sles/1155
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+matrix-java-periodic/ES_BUILD_JAVA=java10,ES_RUNTIME_JAVA=java10,nodes=virtual&&linux/126
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java10,ES_RUNTIME_JAVA=java8,nodes=virtual&&linux/142
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+periodic/388
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+bwc-tests/175

The tests are failing in ESRestTestCase.waitForClusterStateUpdatesToFinish with the assertion:

{time_in_queue=15ms, time_in_queue_millis=15, source=shard-failed, executing=true, priority=HIGH, insert_order=197}
        at org.junit.Assert.fail(Assert.java:88)
        at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755)
        ... 37 more

There are lots of these queued, and time_in_queue goes up to 4 seconds. The assertBusy waits for 30 seconds, so either something has blocked the updates or the cluster is very slow. Note source=shard-failed. Sometimes the ML tests fail with a timeout on updating the persistent task state, which suggests that failure is also due to the slow cluster state updates.
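
For readers unfamiliar with the check, the polling pattern can be sketched roughly as follows. This is a minimal Python illustration, not the actual `ESRestTestCase` code: `assert_busy`, `no_pending_tasks`, and the backoff schedule are hypothetical, and it assumes the check fails whenever the list returned by `GET /_cluster/pending_tasks` is non-empty.

```python
import time


def assert_busy(check, timeout_s=30.0, initial_delay_s=0.001):
    """Re-run `check` with exponential backoff until it passes or time runs out."""
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        try:
            check()
            return
        except AssertionError:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise  # surface the last failure once the timeout is exhausted
        time.sleep(min(delay, remaining))
        delay *= 2


def no_pending_tasks(fetch_pending_tasks):
    """Build a check that fails while any cluster state update is still queued.

    `fetch_pending_tasks` is a hypothetical callable standing in for a
    GET /_cluster/pending_tasks request; it returns a list of task dicts.
    """
    def check():
        tasks = fetch_pending_tasks()
        assert not tasks, "there are still running tasks:\n" + "\n".join(map(str, tasks))
    return check
```

Under these assumptions, a steady stream of shard-failed tasks keeps the list non-empty on every poll, so the check never passes even though each individual task is short-lived.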

In all cases it is the twoThirdsUpgradedTestRunner that fails (2 out of 3 nodes have been upgraded). This runner has 3 extra tests that don't run in the oneThirdUpgradedTestRunner. Let's start with the idea that one of these tests is causing the failure:

 'mixed_cluster/10_basic/Start scroll in mixed cluster on upgraded node that we will continue after upgrade',
 'mixed_cluster/30_ml_jobs_crud/Create a job in the mixed cluster and write some data',
 'mixed_cluster/40_ml_datafeed_crud/Put job and datafeed in mixed cluster'

When the tests fail, this message is repeated in the logs approximately every 5 seconds:

[2018-07-01T20:31:31,474][WARN ][o.e.i.c.IndicesClusterStateService] [upgraded-node-1] [[upgraded_scroll][1]] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
    at org.elasticsearch.index.IndexService.createShard(IndexService.java:392) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:565) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:152) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:550) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:527) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:230) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:495) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at java.lang.Iterable.forEach(Iterable.java:75) [?:?]
    at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:492) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:479) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:430) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:160) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [upgraded_scroll][1]: obtaining shard lock timed out after 5000ms
    at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:678) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:597) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.index.IndexService.createShard(IndexService.java:329) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    ... 17 more

This appears to be related to the test mixed_cluster/10_basic/Start scroll in mixed cluster on upgraded node that we will continue after upgrade, which creates the upgraded_scroll index:

  - do:
      indices.create:
        index: upgraded_scroll
        wait_for_active_shards: all
        body:
          settings:
            number_of_replicas: 0
            index.routing.allocation.include.upgraded: true

The upgraded nodes have node.attr.upgraded: true set, so at this point the index can be allocated to 2 of the 3 nodes in the cluster.
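
To make the allocation constraint concrete, here is a small Python sketch of how a single include filter narrows the candidate nodes. It is a simplification under stated assumptions: `eligible_nodes` is a hypothetical helper, only exact value matches are handled (no wildcards), and the node names other than upgraded-node-1 are invented for illustration.

```python
def eligible_nodes(node_attrs, attr, values):
    """Return the nodes whose `attr` equals one of the comma-separated `values`.

    Mirrors the idea behind index.routing.allocation.include.<attr> for a
    single attribute; this is not Elasticsearch's implementation.
    """
    allowed = {v.strip() for v in values.split(",")}
    return sorted(name for name, attrs in node_attrs.items()
                  if attrs.get(attr) in allowed)


# Two upgraded nodes carry node.attr.upgraded: true; the old node has no such attribute.
cluster = {
    "upgraded-node-0": {"upgraded": "true"},  # hypothetical name
    "upgraded-node-1": {"upgraded": "true"},
    "old-node-2": {},                         # hypothetical name
}
candidates = eligible_nodes(cluster, "upgraded", "true")
```

With only two candidates, any relocation of an upgraded_scroll shard has to move it between the same two nodes, which makes it plausible for a shard to return to a node it was on shortly before.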

It's not clear whether this or the ML jobs is the root cause.

@davidkyle davidkyle added >test-failure Triaged test failures from CI v6.4.0 v6.3.2 labels Jul 5, 2018
@dnhatn dnhatn added the :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. label Jul 5, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@davidkyle
Member Author

@bleskes
Contributor

bleskes commented Jul 9, 2018

I started looking at these and figured I had to wait for a fresh failure to grab all the logs from the machines. Thanks for getting these, @davidkyle.

bleskes added a commit that referenced this issue Jul 10, 2018
…ning a scroll

A scroll holds a reference to the shard store. If the cluster is moving shards
around, that reference can prevent a shard from relocating back to the node it
used to be on, causing test failures.

Closes #31827
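
The mechanism described in the commit message can be sketched in Python as follows. This is an illustration under assumptions, not Elasticsearch code: `Store`, `create_shard`, and `ShardLockObtainFailed` are hypothetical stand-ins, and the sketch assumes the shard lock is released only when the last reference to the store is dropped.

```python
import threading


class ShardLockObtainFailed(Exception):
    """Raised when the per-shard lock cannot be acquired in time."""


class Store:
    """Ref-counted store that releases the shard lock when the last reference goes."""

    def __init__(self, shard_lock):
        self._shard_lock = shard_lock
        self._refs = 1  # initial reference owned by the shard itself
        self._guard = threading.Lock()

    def inc_ref(self):
        with self._guard:
            if self._refs == 0:
                raise RuntimeError("store is already closed")
            self._refs += 1

    def dec_ref(self):
        with self._guard:
            self._refs -= 1
            last = self._refs == 0
        if last:
            self._shard_lock.release()


def create_shard(shard_lock, timeout_s=5.0):
    """Acquire the shard lock or give up, loosely like the 5000ms timeout in the trace."""
    if not shard_lock.acquire(timeout=timeout_s):
        raise ShardLockObtainFailed(
            f"obtaining shard lock timed out after {timeout_s}s")
    return Store(shard_lock)
```

In this model, a scroll calls `inc_ref()` on the store. When the shard relocates away, the shard's own `dec_ref()` leaves one reference outstanding, so a later `create_shard` for the same shard on that node times out until the scroll is cleared.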