[7.x] [ML] fix machine learning job close/kill race condition (#71656) #71750

Merged
merged 1 commit into from
Apr 15, 2021


benwtrent
Member

Backports the following commit to 7.x: [ML] fix machine learning job close/kill race condition (#71656)

If a machine learning job is killed while it is attempting to open, a race condition can prevent it from closing.

This is most evident during the `reset_feature` API call, which kills the jobs and then immediately calls close to wait for the persistent tasks to complete.

However, if this happens while a job is being assigned to a node, there is a window in which the process continues to start even though we have attempted to kill and close it.

This commit locks the process context on `kill` and sets the job to `closing`. This way, if the process context is already locked by the start path, `kill` waits and does not run until the process has fully started.

Setting the job to `closing` allows the starting process to exit early if the `kill` command has already been completed (before the communicator was created).
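
For reviewers who want the shape of the fix without reading the diff, here is a minimal sketch of the locking idea described above. It is not the code in this commit: `JobProcessContext`, its `State` enum, and the `processStarted` flag are hypothetical stand-ins for the real process context and communicator, and the sketch assumes a single `ReentrantLock` shared by the open and kill paths.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified stand-in for a per-job process context.
// None of these names are taken from the Elasticsearch code base.
class JobProcessContext {
    enum State { NOT_RUNNING, RUNNING, CLOSING }

    private final ReentrantLock lock = new ReentrantLock();
    private State state = State.NOT_RUNNING;
    private boolean processStarted = false; // stands in for the communicator

    // Open path: returns true if the process was started, false if it exited early.
    boolean open() {
        lock.lock();
        try {
            if (state == State.CLOSING) {
                // kill() completed before the communicator was created: exit early.
                return false;
            }
            processStarted = true;
            state = State.RUNNING;
            return true;
        } finally {
            lock.unlock();
        }
    }

    // Kill path: takes the same lock, so it waits for an in-flight open() to finish.
    void kill() {
        lock.lock();
        try {
            state = State.CLOSING;   // lets a later open() bail out
            processStarted = false;  // stands in for killing the process, if any
        } finally {
            lock.unlock();
        }
    }

    State state() {
        lock.lock();
        try {
            return state;
        } finally {
            lock.unlock();
        }
    }

    boolean isProcessStarted() {
        lock.lock();
        try {
            return processStarted;
        } finally {
            lock.unlock();
        }
    }
}

class KillOpenRaceSketch {
    public static void main(String[] args) {
        // Case 1: kill arrives before open; open must not start the process.
        JobProcessContext killedFirst = new JobProcessContext();
        killedFirst.kill();
        System.out.println("open after kill started process: " + killedFirst.open());

        // Case 2: open finishes first; kill then stops the running process.
        JobProcessContext openedFirst = new JobProcessContext();
        openedFirst.open();
        openedFirst.kill();
        System.out.println("process still started after kill: " + openedFirst.isProcessStarted());
    }
}
```

Because both paths take the same lock, either `kill` runs first and the later `open` sees `CLOSING` and returns, or `open` runs first and `kill` waits until the process exists before stopping it. In both orderings the job ends up in `CLOSING` with no process left running.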

closes elastic#71646
@benwtrent benwtrent added :ml Machine learning backport labels Apr 15, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Apr 15, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@benwtrent benwtrent merged commit f6e1a8f into elastic:7.x Apr 15, 2021
@benwtrent benwtrent deleted the backport/7.x/pr-71656 branch April 15, 2021 15:06