X-pack rolling upgrade failures #31827
Labels
:Distributed Indexing/Recovery
Anything around constructing a new shard, either from a local or a remote source.
>test-failure
Triaged test failures from CI
v6.3.2
v6.4.0
There have been a number of failures in the x-pack:qa:rolling-upgrade suite recently with what appears to be the same error.
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=sles/1155
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+matrix-java-periodic/ES_BUILD_JAVA=java10,ES_RUNTIME_JAVA=java10,nodes=virtual&&linux/126
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java10,ES_RUNTIME_JAVA=java8,nodes=virtual&&linux/142
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+periodic/388
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+bwc-tests/175
The tests are failing in ESRestTestCase.waitForClusterStateUpdatesToFinish with the assertion:
There are lots of these queued, and time_in_queue goes up to 4 seconds. The assertBusy is for 30 seconds, so something has blocked the updates or the cluster is very slow. Note source=shard-failed. Sometimes the ML tests fail with a timeout on updating the persistent task state, suggesting this is also due to the slow cluster state updates.
In all cases it is the
twoThirdsUpgradedTestRunner that fails (2 out of 3 nodes have been upgraded). This runner has 3 extra tests that don't run in the oneThirdUpgradedTestRunner. Let's start with the idea that one of these tests is causing the failure:
When the tests fail, this message is repeated in the logs approximately every 5 seconds
which appears to be related to the test mixed_cluster/10_basic/Start scroll in mixed cluster on upgraded node that we will continue after upgrade, which creates the upgraded_scroll index.
The upgraded nodes have node.attr.upgraded: true set, so at this point the index can be allocated to 2 of the 3 nodes in the cluster.
It's not clear if this or the ML jobs is the root cause.
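To help narrow down which updates are piling up, the pending-task queue that waitForClusterStateUpdatesToFinish waits on could be summarized by source. The following is only a rough sketch: it assumes a response shaped like the one returned by GET _cluster/pending_tasks (a tasks array whose entries carry source and time_in_queue_millis), and the function name and sample payload are hypothetical, not taken from the CI logs.

```python
from collections import Counter


def summarize_pending_tasks(response):
    """Count pending cluster tasks per source and report the longest wait."""
    tasks = response.get("tasks", [])
    # Tally how many queued tasks share each source string.
    by_source = Counter(task["source"] for task in tasks)
    # Longest time any task has spent in the queue, in milliseconds (0 if empty).
    max_wait_ms = max((task["time_in_queue_millis"] for task in tasks), default=0)
    return {
        "count": len(tasks),
        "by_source": dict(by_source),
        "max_wait_ms": max_wait_ms,
    }


# Hypothetical payload for illustration only; real values come from the cluster.
sample = {
    "tasks": [
        {"insert_order": 101, "priority": "HIGH", "source": "shard-failed",
         "time_in_queue_millis": 4012},
        {"insert_order": 102, "priority": "HIGH", "source": "shard-failed",
         "time_in_queue_millis": 3870},
        {"insert_order": 103, "priority": "HIGH", "source": "put-mapping",
         "time_in_queue_millis": 950},
    ]
}

summary = summarize_pending_tasks(sample)
print(summary)
```

If one source such as shard-failed dominates the counts across failing runs, that would point toward the allocation of the upgraded_scroll index rather than the ML jobs, though this alone would not be conclusive.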