Executor (e.g. raw_exec driver) WaitTask returns an error and incorrect exit code #10430
Comments
It looks like the job is trying to do … If you build the binary and run that, it should work:
Then the logs are what we expect after …
However, the exit code is still incorrectly reporting 0. This appears to be a valid bug.
Thanks @benbuzbee ! Good point - this is confusing indeed. Looking at your case and the code, I believe this is localized to allocs that were explicitly stopped by Nomad (due to a user request or a job upgrade), probably with the belief that the exit code doesn't reflect the task state properly in that case. If an alloc is stopped, the Exit Message should be meaningful to users, e.g. "Alloc was stopped due to {job update|user request}", rather than the cryptic message. I would love your suggestions on how to handle the exit code: what do you think about reporting a 0 exit code (or removing the field) and including the true exit code in the message? Reporting non-zero may signal to downstream monitoring services that the task failed and needs attention - operators may need to update their monitoring dashboards to account for job upgrades and other reasons for stopping. There are also backward compatibility concerns about impacting users unexpectedly.
An example where the exit code is relevant even after stop: on shutdown, a service does some cleanup - it deletes entries in a database, or flushes state to a file. If that fails, someone might want to know from the exit code. So for me the simplest thing would be for it to be non-zero so we can check that, but that's because I have no sympathy for backwards compatibility :) If there was a status message that indicated "Exit code 123 after stopped", I suppose that would be a good compromise. I do think it might be confusing though; e.g. if I were writing some automation to check the exit code the API returns, I would not expect this behavior.
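To make the automation point concrete, here is a rough sketch (not from the thread) of what such a check against the Nomad Go API could look like. The allocation ID and task name are placeholders, and the field names reflect the `github.com/hashicorp/nomad/api` package as I understand it:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// Hypothetical automation: find the last "Terminated" event for a task
	// and alert if the recorded exit code is non-zero.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	allocID := "REPLACE-WITH-ALLOC-ID" // placeholder
	alloc, _, err := client.Allocations().Info(allocID, nil)
	if err != nil {
		log.Fatal(err)
	}

	state, ok := alloc.TaskStates["bug"] // "bug" is a placeholder task name
	if !ok {
		log.Fatal("task state not found")
	}

	// Walk the task events newest-first and report the exit code from the
	// Terminated event, if any.
	for i := len(state.Events) - 1; i >= 0; i-- {
		ev := state.Events[i]
		if ev.Type == api.TaskTerminated {
			fmt.Printf("exit code: %d, message: %s\n", ev.ExitCode, ev.DisplayMessage)
			if ev.ExitCode != 0 {
				fmt.Println("task exited non-zero; shutdown cleanup may have failed")
			}
			return
		}
	}
	fmt.Println("no Terminated event found")
}
```

With the bug described in this issue, automation like this would see the "context canceled" style message instead of the real exit code for stopped raw_exec allocs.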
Thanks @benbuzbee for the explanation. I'm convinced :). Upon investigation, I noticed that the current behavior depends on the driver (docker seemed to report the proper exit code regardless) and some raciness due to the order of operations. It's indeed best to report the exit code the app returns without re-interpretation by Nomad. I have opened a PR to address the issue in #10494 . I'll discuss back-porting it to a 1.0.x release with the team. Let us know if you have any questions!
This commit ensures Nomad captures the task exit code more reliably, even when the task is killed. This issue affects the `raw_exec` driver, as noted in #10430 . We fix it by ensuring that the TaskRunner only calls `driver.WaitTask` once. The TaskRunner monitors the completion of the task by calling `driver.WaitTask`, which should return the task exit code on completion. However, it could also return a "context canceled" error if the agent/executor is shut down. Previously, when a task was to be stopped, the killTask path made two WaitTask calls, and the second occasionally returned "context canceled" because of a race in task shutdown, depending on the driver and how quickly it shuts down after the task completes. By having a single WaitTask call and consistently waiting for the task, we ensure we capture the exit code reliably before the executor is shut down or the contexts expire. I opted to change the TaskRunner implementation to avoid changing the driver interface or requiring third-party drivers to update. Additionally, the PR ensures that attempts to kill the task terminate when the task "naturally" dies. Without this change, if the task dies at the right moment, the `killTask` call may keep retrying to kill an already-dead task for up to 5 minutes before giving up.
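The shape of the fix can be sketched roughly as follows. This is a simplified, hypothetical illustration of the "wait once, fan out the result" pattern described above, not the actual TaskRunner code; the `runner`, `fakeHandle`, and `exitResult` names are invented for the example:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// exitResult stands in for the driver's exit result (exit code plus error).
type exitResult struct {
	ExitCode int
	Err      error
}

// fakeHandle simulates a driver task handle: Kill signals the task, and
// WaitTask delivers the exit result once the task finishes.
type fakeHandle struct {
	exited chan exitResult
}

func (h *fakeHandle) WaitTask(ctx context.Context) <-chan exitResult { return h.exited }

func (h *fakeHandle) Kill() {
	// Simulate the task trapping the signal, cleaning up, and exiting 123.
	go func() {
		time.Sleep(50 * time.Millisecond)
		h.exited <- exitResult{ExitCode: 123}
	}()
}

// runner shows the "call WaitTask once" idea: one goroutine waits on the
// driver, and the kill path waits on done instead of calling WaitTask again,
// so a racy second call can never replace the real exit code.
type runner struct {
	handle *fakeHandle
	done   chan struct{}
	result exitResult
}

func (r *runner) start(ctx context.Context) {
	go func() {
		r.result = <-r.handle.WaitTask(ctx) // the only WaitTask call
		close(r.done)
	}()
}

func (r *runner) kill() exitResult {
	r.handle.Kill()
	<-r.done // reuse the single waiter's result
	return r.result
}

func main() {
	h := &fakeHandle{exited: make(chan exitResult, 1)}
	r := &runner{handle: h, done: make(chan struct{})}
	r.start(context.Background())
	res := r.kill()
	fmt.Printf("exit code after stop: %d\n", res.ExitCode)
}
```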
Shipped in 1.1.0-rc1
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Dev build
Issue
The executor that most drivers, including exec and raw_exec, use to wait on child processes to exit does not wait correctly.
Reproduction steps
sudo ./bin/nomad agent -dev
./bin/nomad job run bug.hcl
./bin/nomad job stop bug
Expected Result
Exit Code: 123, Message: Non-error
Actual Result
Note: Exit code incorrect, error from executor
Job file (if appropriate)
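The job file itself is not included in this capture. For illustration only, a raw_exec task matching the expected result could run something like the following Go binary; this is a hypothetical reproduction, not taken from the issue, and it simply traps the stop signal, does its cleanup, and exits 123:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// Wait for the stop signal from Nomad (handling both SIGINT and SIGTERM,
	// since the configured kill signal may vary).
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)
	s := <-sig

	// Pretend to do shutdown cleanup here (flush state, delete DB entries, ...),
	// then report a non-zero code so the outcome is visible in the task events.
	fmt.Printf("received %s, exiting with code 123\n", s)
	os.Exit(123)
}
```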