[CI] Failure of {p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled} #71646
Pinging @elastic/ml-core (Team:ML)
I see the race condition 🤦 take a look:
We attempt to kill the job while it's awaiting assignment, THEN we call close. But still, the task STILL gets assigned and opens 🤦. So, if the task is just recently assigned (reverting the snapshot), and reset mode is enabled, it ignores the stop/kill requests and finishes opening...
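The failure mode described above can be sketched as a toy example. The names below are hypothetical, not the actual Elasticsearch classes; the point is only that once assignment has started, the open path never re-checks whether a kill was requested:

```java
// Hypothetical sketch of the race (not real Elasticsearch code): a kill that
// arrives while the task is awaiting assignment is recorded, but the open
// path never re-checks it, so the job finishes opening anyway.
class RaceSketch {
    volatile boolean killRequested = false;
    volatile String state = "unassigned";

    void kill() {
        // Kill arrives before the task is assigned...
        killRequested = true;
    }

    void assignAndOpen() {
        state = "assigned";
        // ...but this path ignores killRequested and completes the open.
        state = "opened";
    }

    public static void main(String[] args) {
        RaceSketch job = new RaceSketch();
        job.kill();           // kill while the job awaits assignment
        job.assignAndOpen();  // the task still gets assigned and opens
        System.out.println(job.state);  // prints "opened" despite the kill
    }
}
```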
I'm seeing loads of other ML test failures as well; are these all related?
The characteristics to look for to say whether a test failure is collateral damage of this bug:
Are those the same failure? There are a lot of similar failures on Windows.
If a machine learning job is killed while it is attempting to open, there is a race condition that may cause it to not close. This is most evident during the `reset_feature` API call. The reset feature API will kill the jobs, then call close quickly to wait for the persistent tasks to complete. But, if this is called while a job is attempting to be assigned to a node, there is a window where the process continues to start even though we attempted to kill and close it. This commit locks the process context on `kill`, and sets the job to `closing`. This way if the process context is already locked (to start), we won't try to kill it until it is fully started. Setting the job to `closing` allows the starting process to exit early if the `kill` command has already been completed (before the communicator was created). closes #71646
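The locking scheme the commit message describes can be sketched in miniature. This is a hypothetical illustration, not the actual Elasticsearch implementation; it assumes a single lock guards both the start and kill paths, with a `closing` flag that lets a starting process exit early:

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the fix: kill() takes the same lock as the start
// path, so it cannot run mid-start, and it marks the job "closing" so a
// process that has not yet started can bail out early.
class KillFixSketch {
    final ReentrantLock processContextLock = new ReentrantLock();
    volatile boolean closing = false;
    volatile String state = "unassigned";

    void kill() {
        processContextLock.lock();   // blocks until any in-flight start completes
        try {
            closing = true;          // the start path checks this flag
            state = "killed";
        } finally {
            processContextLock.unlock();
        }
    }

    void start() {
        processContextLock.lock();
        try {
            if (closing) {
                return;              // kill already happened: exit early
            }
            state = "opened";
        } finally {
            processContextLock.unlock();
        }
    }

    public static void main(String[] args) {
        KillFixSketch job = new KillFixSketch();
        job.kill();
        job.start();                 // sees closing == true and bails out
        System.out.println(job.state);  // prints "killed"
    }
}
```

With the lock held across both operations, the two interleavings that remain are "start fully, then kill" and "kill, then start sees `closing` and exits", which closes the window the race exploited.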
Build scan:
https://gradle-enterprise.elastic.co/s/7fst6frfszn2q
Repro line:
Reproduces locally?:
No
Applicable branches:
master
Failure history:
Just once for this particular failure - it's a rare side effect of #71552 which I merged today.
Failure excerpt:
Also, from the server-side logs:
When we close immediately after killing the process it seems that we might report the close as successful while the persistent task still exists.
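One way to avoid that premature success report is to poll until the persistent task is actually gone before acknowledging the close. A hypothetical sketch (the task registry and method names are invented for illustration):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: do not report a close as successful until the
// persistent task has really been removed, polling a bounded number of times.
class CloseWaitSketch {
    final Set<String> persistentTasks = new HashSet<>();

    boolean closeAndWait(String taskId, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            if (!persistentTasks.contains(taskId)) {
                return true;          // task really gone: safe to report closed
            }
            try {
                Thread.sleep(10);     // wait briefly, then poll again
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;                 // task still present: do not report success
    }

    public static void main(String[] args) {
        CloseWaitSketch cluster = new CloseWaitSketch();
        cluster.persistentTasks.add("job-task-1");
        System.out.println(cluster.closeAndWait("job-task-1", 3));  // false: task lingers
        cluster.persistentTasks.remove("job-task-1");
        System.out.println(cluster.closeAndWait("job-task-1", 3));  // true: task gone
    }
}
```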