ML: better handle task state race condition #38040

benwtrent
Member

There are certain cluster conditions that may cause task state to change in unexpected ways while we are trying to set the cluster's upgrade_mode flag.

  • Tasks could disappear in an order we did not anticipate
  • A task's allocation_id could change unexpectedly

This change handles these conditions in the following way:

  • Continues trying to unallocate the remaining tasks, even if a previous one was not found with the current name + allocation_id
  • Fails if we time out waiting for tasks to complete, or if there are other node failures. This potentially means that some other action changed the task allocation and the task is still running. This is a failure because upgrade_mode was not set cleanly and ML indices could still be in use via those running tasks.
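
A rough sketch of the two behaviours above, for illustration only. Every type here (TaskRef, TaskRegistry, TaskNotFoundException) is a hypothetical in-memory stand-in rather than an Elasticsearch class, and the "still running" check is a simplified substitute for waiting on tasks with a timeout.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: none of these types are Elasticsearch classes.
final class UpgradeModeSketch {

    /** Hypothetical task reference: the name and allocation_id seen in our snapshot. */
    static final class TaskRef {
        final String name;
        final long allocationId;

        TaskRef(String name, long allocationId) {
            this.name = name;
            this.allocationId = allocationId;
        }
    }

    /** Hypothetical error for "no task with this name + allocation_id". */
    static final class TaskNotFoundException extends RuntimeException {
        TaskNotFoundException(String message) {
            super(message);
        }
    }

    /** Toy in-memory registry mapping a task name to its current allocation_id. */
    static final class TaskRegistry {
        private final Map<String, Long> allocations = new HashMap<>();

        void put(String name, long allocationId) {
            allocations.put(name, allocationId);
        }

        boolean isAllocated(String name) {
            return allocations.containsKey(name);
        }

        /** Removes the task only if both name and allocation_id still match. */
        void unallocate(TaskRef ref) {
            Long current = allocations.get(ref.name);
            if (current == null || current.longValue() != ref.allocationId) {
                throw new TaskNotFoundException(ref.name + " / " + ref.allocationId);
            }
            allocations.remove(ref.name);
        }
    }

    /** Tries every task in the snapshot; a "not found" does not stop the loop. */
    static List<TaskRef> unallocateAll(TaskRegistry registry, List<TaskRef> snapshot) {
        List<TaskRef> notFound = new ArrayList<>();
        for (TaskRef ref : snapshot) {
            try {
                registry.unallocate(ref);
            } catch (TaskNotFoundException e) {
                notFound.add(ref); // remember it, then keep going with the rest
            }
        }
        return notFound;
    }

    /** Fails if a task we could not unallocate is in fact still allocated. */
    static void failIfStillRunning(TaskRegistry registry, List<TaskRef> notFound) {
        for (TaskRef ref : notFound) {
            if (registry.isAllocated(ref.name)) {
                throw new IllegalStateException(
                    "task [" + ref.name + "] was reallocated and is still running");
            }
        }
    }

    public static void main(String[] args) {
        TaskRegistry registry = new TaskRegistry();
        registry.put("job-a", 1L);
        registry.put("job-b", 7L); // reallocated after the snapshot was taken

        List<TaskRef> snapshot = new ArrayList<>();
        snapshot.add(new TaskRef("job-gone", 3L)); // already removed by someone else
        snapshot.add(new TaskRef("job-a", 1L));
        snapshot.add(new TaskRef("job-b", 2L));    // stale allocation_id

        List<TaskRef> notFound = unallocateAll(registry, snapshot);
        try {
            failIfStillRunning(registry, notFound);
        } catch (IllegalStateException e) {
            System.out.println("upgrade_mode not set cleanly: " + e.getMessage());
        }
    }
}
```

In this sketch a task that has simply disappeared is tolerated, while a task that is still allocated under a different allocation_id causes the whole operation to fail.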

This race condition is highlighted by the following build failures:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=amazon/212/console

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/1255/console

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+corretto-periodic/ES_BUILD_JAVA=java11,label=amazon/84/console

@elasticmachine
Collaborator

Pinging @elastic/ml-core

Contributor

@droberts195 droberts195 left a comment

LGTM

@benwtrent benwtrent merged commit be381b4 into elastic:master Jan 31, 2019
@benwtrent benwtrent deleted the feature/ml-upgrade-mode-race-condition-failure branch January 31, 2019 17:07
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 31, 2019
…ersion

* elastic/master:
  Do not set up NodeAndClusterIdStateListener in test (elastic#38110)
  ML: better handle task state race condition (elastic#38040)
  Soft-deletes policy should always fetch latest leases (elastic#37940)
  Handle scheduler exceptions (elastic#38014)
  Minor logging improvements (elastic#38084)
  Fix Painless void return bug (elastic#38046)
  Update PutFollowAction serialization post-backport (elastic#37989)
  fix a few versionAdded values in ElasticsearchExceptions (elastic#37877)
  Reenable BWC tests after backport of elastic#37899 (elastic#38093)
  Mute failing test
  Mute failing test
  Fail start on obsolete indices documentation (elastic#37786)
  SQL: Implement FIRST/LAST aggregate functions (elastic#37936)
  Un-mute NoMasterNodeIT.testNoMasterActionsWriteMasterBlock
  remove unused parser fields in RemoteResponseParsers
4 participants