ML: better handle task state race condition #38040

benwtrent · 2019-01-30T15:56:54Z

There are certain cluster conditions that may cause task state to change in unexpected ways while we are trying to set the cluster's upgrade_mode flag.

Tasks could go away in an unanticipated order
A task's allocation_id could change unexpectedly

This change handles these conditions in the following way:

Continues to try to unallocate tasks, even if a previous one was not found with the current name + allocation_id
Fails if we time out waiting for tasks to complete, or if there are other node failures. This potentially means that some other action changed the task allocation and the task is still running. This is a failure because upgrade_mode was not set cleanly and ML indices could still be in use via those running tasks.
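
The two behaviours above can be illustrated with a minimal sketch. This is not the code in this PR: `UpgradeModeSketch`, `TaskStore`, `TaskNotFoundException`, `unallocateAll` and `awaitCompletion` are hypothetical names made up for the example, and the real change works against Elasticsearch's persistent task APIs rather than this simplified interface. It only shows the control flow: a "not found" for one name + allocation_id pair is skipped, and a timeout is surfaced as a failure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of the control flow; not the actual Elasticsearch implementation. */
final class UpgradeModeSketch {

    /** Assumed signal that no task matches the given name + allocation id any more. */
    static final class TaskNotFoundException extends RuntimeException {
        TaskNotFoundException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for whatever service unallocates a persistent task. */
    @FunctionalInterface
    interface TaskStore {
        void unallocate(String taskName, long allocationId);
    }

    /**
     * Tries to unallocate every task. A task that is no longer found is skipped
     * so the remaining tasks are still processed; any other exception propagates.
     * Returns the names of the tasks that were skipped.
     */
    static List<String> unallocateAll(Map<String, Long> tasks, TaskStore store) {
        List<String> skipped = new ArrayList<>();
        for (Map.Entry<String, Long> task : tasks.entrySet()) {
            try {
                store.unallocate(task.getKey(), task.getValue());
            } catch (TaskNotFoundException e) {
                // The task went away or its allocation id changed: keep going.
                skipped.add(task.getKey());
            }
        }
        return skipped;
    }

    /**
     * Polls until all tasks report completion. Timing out is treated as a hard
     * failure, because a task may still be running under a new allocation.
     */
    static void awaitCompletion(BooleanSupplier allTasksDone, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!allTasksDone.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new IllegalStateException("timed out waiting for tasks; upgrade_mode not set cleanly");
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for tasks", e);
            }
        }
    }
}
```

In this sketch a caller would run `unallocateAll` first and then `awaitCompletion`, treating the `IllegalStateException` as the signal that upgrade_mode could not be enabled cleanly.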

This race condition is highlighted by the following build failures:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=amazon/212/console

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/1255/console

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+corretto-periodic/ES_BUILD_JAVA=java11,label=amazon/84/console

elasticmachine · 2019-01-30T15:56:56Z

Pinging @elastic/ml-core

droberts195

LGTM

…ersion * elastic/master: Do not set up NodeAndClusterIdStateListener in test (elastic#38110) ML: better handle task state race condition (elastic#38040) Soft-deletes policy should always fetch latest leases (elastic#37940) Handle scheduler exceptions (elastic#38014) Minor logging improvements (elastic#38084) Fix Painless void return bug (elastic#38046) Update PutFollowAction serialization post-backport (elastic#37989) fix a few versionAdded values in ElasticsearchExceptions (elastic#37877) Reenable BWC tests after backport of elastic#37899 (elastic#38093) Mute failing test Mute failing test Fail start on obsolete indices documentation (elastic#37786) SQL: Implement FIRST/LAST aggregate functions (elastic#37936) Un-mute NoMasterNodeIT.testNoMasterActionsWriteMasterBlock remove unused parser fields in RemoteResponseParsers

ML: better handle task state race condition

1938ac0

benwtrent added >non-issue v7.0.0 :ml Machine learning v6.7.0 labels Jan 30, 2019

droberts195 approved these changes Jan 31, 2019

benwtrent merged commit be381b4 into elastic:master Jan 31, 2019

benwtrent deleted the feature/ml-upgrade-mode-race-condition-failure branch January 31, 2019 17:07

benwtrent added a commit that referenced this pull request Jan 31, 2019

ML: better handle task state race condition (#38040)

4cdc4bd

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

droberts195 mentioned this pull request Mar 27, 2020

[ML] Integrate data frame analytics with ML upgrade mode #54326

Closed
