drivers: Capture exit code when task is killed #10494
LGTM. I left one question but really it's on the existing code so not a blocker.
waitCh, err := handle.WaitCh(tr.shutdownCtx)
if resultCh == nil {
	var err error
	resultCh, err = handle.WaitCh(tr.shutdownCtx)
I realize this is existing code that just has a new conditional, but why do we get the `WaitCh` after we send the `killTask`? Wouldn't we avoid the error handling code here if we got the `WaitCh` first and then blocked on it after we send the `killTask`?
Good point. Not sure, it wasn't the cause of the bug but it might lead to other interesting cases. I'll merge this PR as-is, do some testing, and follow up in another PR.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
This PR ensures Nomad captures the task exit code more reliably even when the task is killed. This issue affects the `raw_exec` driver, as noted in #10430.
We fix this issue by ensuring that the TaskRunner only calls `driver.WaitTask` once. The TaskRunner monitors the completion of the task by calling `driver.WaitTask`, which should return the task exit code on completion. However, it could also return a "context canceled" error if the agent/executor is shut down.
Previously, when a task was to be stopped, the killTask path made two WaitTask calls, and the second occasionally returned "context canceled" because of a race during task shutdown. Whether the race was hit depended on the driver and on how fast it shut down after the task completed.
By having a single WaitTask call and consistently waiting for the task, we ensure we capture the exit code reliably before the executor is shut down or the contexts expire.
I opted to change the TaskRunner implementation to avoid changing the driver interface or requiring 3rd party drivers to update.
Additionally, the PR ensures that attempts to kill the task terminate when the task "naturally" dies. Without this change, if the task dies at the right moment, the `killTask` call may retry to kill an already-dead task for up to 5 minutes before giving up.
You can see the failing test in https://app.circleci.com/pipelines/github/hashicorp/nomad/15996/workflows/0da22972-e21b-45ca-b1ac-4845154e5d69/jobs/153250 .