[Jobs] Stopping a Job with SIGTERM doesn't work as expected #31274
Comments
@architkulkarni @sihanwang41 @edoakes Opening this so it can be tracked. Two options discussed so far:
I see, so we're currently effectively sending SIGTERM as a fire-and-forget and never sending SIGKILL? Option 2 sounds like a good option, but if we kill process 2 instead of the whole process group, do we need to worry about child processes of process 2 staying alive?
The actual job (process 2) never gets explicitly killed until the actor itself exits. That's a valid concern for option 2. It seems like before this we also didn't take care of child processes, because we just sent SIGKILL to process 1. We could possibly send a SIGTERM signal to all the child processes of process 2, but in that case would it make sense to track all child processes here, instead of tracking only process 1? (By track I mean send SIGTERM, wait for the process to exit, and send SIGKILL after a timeout.) Either way, it seems that when the actor exits, all processes are force-killed.
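For concreteness, the tracking primitive described here (send SIGTERM, wait for the process to exit, send SIGKILL after a timeout) for a single tracked process could look like the sketch below; the helper name and the 3-second default are illustrative assumptions, not Ray's actual code.

```python
import signal
import subprocess

def stop_tracked_process(proc: subprocess.Popen, timeout: float = 3.0) -> None:
    """Send SIGTERM, wait for the process to exit, and SIGKILL it on timeout."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()  # reap the force-killed process
```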
Ah true, but I think before the SIGTERM PR we were sending SIGKILL to the entire process group.
That's surprising; any ideas how this happens? Maybe the first question is what the user would expect the behavior to be. If a non-Ray user runs a command that starts some background child processes, then when the process for the command exits (naturally or by being killed) the child processes won't die, and this is expected. In that case, maybe we should have the same behavior in Ray Jobs, where we only kill the entrypoint process and don't take care of child processes from the user's entrypoint command. @zcin @edoakes any thoughts on this? Would this scenario happen in the real world? If we do need to kill all child processes, and you know a way to track all of them (or track the "process group" somehow), then I think that would be the right thing to do.
Hmm, not sure. If this is surprising I should probably double-check; it was what I observed after many tests, so I had just assumed.
Ah, we send SIGKILL to the entire process group when the actor exits.
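For reference, force-killing the whole group on POSIX looks roughly like the sketch below, assuming the shell process was started in its own session (e.g. with `start_new_session=True`) so that its process group also covers the entrypoint's children; the function name here is illustrative.

```python
import os
import signal

def kill_job_process_group(shell_pid: int) -> None:
    """Force-kill every process in the job's process group."""
    # getpgid() resolves the group of the shell process; killpg() then
    # delivers SIGKILL to the shell, the entrypoint, and any children
    # that stayed in the same process group.
    os.killpg(os.getpgid(shell_pid), signal.SIGKILL)
```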
I experimented, and we can use
Hi @zcin, option 2 sounds good to me. We can have a timeout for all the child processes; if it times out, we send SIGKILL and kill all processes in the same group.
Ah nice, at first I thought it was some generic thing for all Ray actors, which would have been surprising. This makes more sense. Maybe we can make the decision by just preserving the original behavior as much as possible (where we SIGKILL the entire group). If we see a strong need (like a user request for the background-processes scenario) then we can consider changing it. So to summarize what you/Sihan have come up with (let me know if this is off):
If this turns out to be hard or complicated, maybe Step 2 could be replaced by "Track process 2 and if it's still alive, SIGKILL the group", and it would be the user's responsibility to make sure child processes of process 2 exit whenever process 2 exits.
I'm wondering whether we even want to execute Step 1 at all: is it possible that the user's job is itself tracking its child processes, and upon receiving a SIGTERM signal might want to terminate them in a custom way? Do we want to give that option to the user?
That's a good point; I'm not sure (@edoakes @sihanwang41, thoughts?). Two weak points in favor of signaling the whole process group:
My vague impression is that this is a corner case which we should be able to change in the future without too much trouble if we get feedback on it.
This should be the standard way of all terminations. It is also how

Also, I think sending signals to the group seems more useful in general. If there's a compelling use case for managing child processes themselves, we should support it with a flag, but I think defaulting to that option would be more error-prone for the many users who don't know about this.
Resolves the issue described in #31274. On Linux systems, when a stop signal is sent, instead of killing + waiting on only the shell process (which starts the actual job as a child process), we want to kill all the children of the shell process along with the shell process itself, and poll all processes until they exit or send a force SIGKILL on timeout. (This change is compatible with Mac OSX systems as well) Co-authored-by: shrekris-anyscale <[email protected]>
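A rough sketch of the pattern that description implies, using `psutil` (the actual patch may differ in its details; the helper name and the 3-second timeout here are assumptions):

```python
import psutil

def stop_job_process_tree(shell_pid: int, timeout: float = 3.0) -> None:
    """SIGTERM the shell process and all of its descendants, poll them,
    and SIGKILL anything still alive after `timeout` seconds."""
    try:
        shell = psutil.Process(shell_pid)
    except psutil.NoSuchProcess:
        return
    procs = [shell] + shell.children(recursive=True)
    for proc in procs:
        try:
            proc.terminate()  # SIGTERM
        except psutil.NoSuchProcess:
            pass
    # Poll until every process exits, then force-kill any stragglers.
    _, alive = psutil.wait_procs(procs, timeout=timeout)
    for proc in alive:
        try:
            proc.kill()  # SIGKILL
        except psutil.NoSuchProcess:
            pass
```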
What happened + What you expected to happen
The following is how we start a new Ray Job:
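A simplified sketch of that call, assuming `subprocess.Popen` with `shell=True` and `start_new_session=True` (the exact arguments in Ray's job supervisor may differ):

```python
import subprocess

entrypoint = "python script.py"  # the job's entrypoint command

# Because shell=True is used, the shell itself is the tracked child
# process and the user's command runs underneath it.
child_process = subprocess.Popen(
    entrypoint,
    shell=True,              # spawns /bin/sh -c "python script.py"
    start_new_session=True,  # the job gets its own session / process group
)
```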
But because we use `shell=True`, this will kick off the following two processes (assuming `_entrypoint="python script.py"`):

[1] `/bin/sh -c python script.py` (`child_process` points to this process)
[2] `python script.py`

So when we terminate the process group that contains both process 1 and 2, process 1 will terminate, but process 2 (the actual job) may handle `SIGTERM` in a custom way. But we only track the status of `child_process` = process 1, so the job is assumed to be done and the actor is killed -> all processes are killed.

Versions / Dependencies
Linux
Reproduction script
You can try to submit the following `script.py` as a job (a sketch of such a script is shown below). When the job is stopped, this should attempt to wait 5 seconds, but then get forcefully killed at 3 seconds. Instead, the logs show that the job gracefully terminated by itself.
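A minimal sketch of a script with the described behavior (an assumed reconstruction, not the original attachment): it traps `SIGTERM` and tries to sleep for 5 seconds before exiting, so a correctly working stop path should SIGKILL it at the 3-second timeout.

```python
import signal
import sys
import time

def handle_sigterm(signum, frame):
    # Simulate slow cleanup: a correct stop flow should not wait for this
    # to finish, but force-kill the job once its stop timeout (3s) expires.
    print("Received SIGTERM, sleeping 5 seconds before exiting...")
    time.sleep(5)
    print("Graceful shutdown finished.")
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

# Keep the job alive until it is stopped.
while True:
    time.sleep(1)
```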
Issue Severity
None