Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drivers: Capture exit code when task is killed #10494

Merged
merged 2 commits into from
May 4, 2021
Merged

Conversation

notnoop
Copy link
Contributor

@notnoop notnoop commented May 3, 2021

This PR ensures Nomad captures the task code more reliably even when the task is killed. This issue affect to raw_exec driver, as noted in #10430 .

We fix this issue by ensuring that the TaskRunner only calls driver.WaitTask once. The TaskRunner monitors the completion of the task by calling driver.WaitTask which should return the task exit code on completion. However, it also could return a "context canceled" error if the agent/executor is shutdown.

Previously, when a task is to be stopped, the killTask path makes two WaitTask calls, and the second returns "context canceled" occasionally because of a "race" in task shutting down and depending on driver, and how fast it shuts down after task completes.

By having a single WaitTask call and consistently waiting for the task, we ensure we capture the exit code reliably before the executor is shutdown or the contexts expired.

I opted to change the TaskRunner implementation to avoid changing the driver interface or requiring 3rd party drivers to update.

Additionally, the PR ensures that attempts to kill the task terminate when the task "naturally" dies. Without this change, if the task dies at the right moment, the killTask call may retry to kill an already-dead task for up to 5 minutes before giving up.

You can see the failing test in https://app.circleci.com/pipelines/github/hashicorp/nomad/15996/workflows/0da22972-e21b-45ca-b1ac-4845154e5d69/jobs/153250 .

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I left one question but really it's on the existing code so not a blocker.

waitCh, err := handle.WaitCh(tr.shutdownCtx)
if resultCh == nil {
var err error
resultCh, err = handle.WaitCh(tr.shutdownCtx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this is existing code that just has a new conditional, but why do we get the WaitCh after we send the killTask? Wouldn't we avoid the error handling code here if we got the WaitCh first and then blocked on it after we send the killTask?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Not sure, it wasn't the cause of the bug but it might lead to other interesting cases. I'll merge this PR as-is, do some testing, and follow up in another PR.

@notnoop notnoop merged commit 61a3b73 into main May 4, 2021
@notnoop notnoop deleted the b-killed-exit-code branch May 4, 2021 14:54
tgross pushed a commit that referenced this pull request May 17, 2021
This commit ensures Nomad captures the task code more reliably even when the task is killed. This issue affect to `raw_exec` driver, as noted in #10430 .

We fix this issue by ensuring that the TaskRunner only calls `driver.WaitTask` once. The TaskRunner monitors the completion of the task by calling `driver.WaitTask` which should return the task exit code on completion. However, it also could return a "context canceled" error if the agent/executor is shutdown.

Previously, when a task is to be stopped, the killTask path makes two WaitTask calls, and the second returns "context canceled" occasionally because of a "race" in task shutting down and depending on driver, and how fast it shuts down after task completes.

By having a single WaitTask call and consistently waiting for the task, we ensure we capture the exit code reliably before the executor is shutdown or the contexts expired.

I opted to change the TaskRunner implementation to avoid changing the driver interface or requiring 3rd party drivers to update.

Additionally, the PR ensures that attempts to kill the task terminate when the task "naturally" dies. Without this change, if the task dies at the right moment, the `killTask` call may retry to kill an already-dead task for up to 5 minutes before giving up.
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 21, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants