Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Executor (e.g raw_exec driver) WaitTask returns an error and incorrect exit code #10430

Closed
benbuzbee opened this issue Apr 22, 2021 · 6 comments
Assignees
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. type/bug

Comments

@benbuzbee
Copy link
Contributor

benbuzbee commented Apr 22, 2021

Nomad version

Dev build

$ ./bin/nomad version
Nomad v1.1.0-dev (f7437800eb6e676653fb73f2f946a8e75bf6a79d+CHANGES)

Issue

The executor that most drivers, including exec and raw_exec, use to wait on child processes' to exit does not wait correctly

Reproduction steps

  1. sudo ./bin/nomad agent -dev
  2. create bug.hcl - a runs a simple inline bash script that exits with code "123" when it receives SIGTERM or SIGINT - see "Job file" section for copy
  3. ./bin/nomad job run bug.hcl
  4. ./bin/nomad job stop bug
  5. Inspect status of alloc
  6. Note that exit code is incorrect, exit message shows an error in the executor wait function

Expected Result

Exit Code: 123, Message: Non-error

Actual Result

Note: Exit code incorrect, error from executor

$ ./bin/nomad status 3089
ID                  = 30892551-9007-3030-192b-9daaa26765ea
Eval ID             = b0b9c5e1
Name                = bug.bug[0]
Node ID             = c15bbff6
Node Name           = browser-devbox
Job ID              = bug
Job Version         = 18
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = stop
Desired Description = alloc not needed due to job update
Created             = 20s ago
Modified            = 6s ago

Task "bug" is "dead"
Task Resources
CPU        Memory          Disk     Addresses
0/100 MHz  33 MiB/300 MiB  300 MiB  

Task Events:
Started At     = 2021-04-23T17:49:00Z
Finished At    = 2021-04-23T17:49:13Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2021-04-23T17:49:13Z  Killed      Task successfully killed
2021-04-23T17:49:13Z  Terminated  Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Canceled desc = grpc: the client connection is closing"
2021-04-23T17:49:05Z  Killing     Sent interrupt. Waiting 1m0s before force killing
2021-04-23T17:49:00Z  Started     Task started by client
2021-04-23T17:49:00Z  Task Setup  Building Task Directory
2021-04-23T17:49:00Z  Received    Task received by client
vagrant@browser-devbox:~/nomad$ ./bin/nomad logs 3089
Hello world
Sleeping now!
Goodbye world

Job file (if appropriate)

job "bug" {
  datacenters = ["dc1"]
  type        = "batch"

  group "bug" {

    task "bug" {
      driver      = "raw_exec"
      kill_signal = "SIGTERM"
      kill_timeout = "1m"
      config {
        command = "/bin/bash"
        args = [
          "${NOMAD_TASK_DIR}/bug.sh",
        ]
      }

      template {
        data = <<EOH
#!/bin/bash
echo "Hello world"
trap "echo Goodbye world && exit 123" EXIT
trap "echo Sleeping now! && sleep 10" SIGINT SIGTERM
sleep 1000&
wait $!

EOH

        destination = "local/bug.sh"
      }
    }
  }
}
@prestonp
Copy link

prestonp commented Apr 23, 2021

It looks like the job is trying to do go run as root in which you probably don't have env vars set for GOPATH, GOROOT.

If you build the binary and run that it should work:

go build -o /path/to/bug bug.go
job "bug" {
  datacenters = ["dc1"]
  type        = "batch"

  group "bug" {

    task "bug" {
      driver      = "raw_exec"
      kill_signal = "SIGTERM"
      config {
        command = "/path/to/bug"
      }
    }
  }
}

Then the logs are what we expect after nomad job stop:

$ nomad logs ab0e149b
Hello world!
Goodbye world!

However, the exit code is still incorrectly reporting 0. This appears to be a valid bug

2021-04-23T16:33:57Z  Terminated  Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Canceled desc = grpc: the client connection is closing"

@benbuzbee benbuzbee changed the title Executor (e.g raw_exec driver) does not wait on process correctly Executor (e.g raw_exec driver) WaitTask returns an error and incorrect exit code Apr 23, 2021
@notnoop
Copy link
Contributor

notnoop commented Apr 25, 2021

Thanks @benbuzbee ! Good point - this is confusing indeed. Looking at your case and the code, I believe this is localized to allocs that were explicitly stopped by Nomad (due to user request or job upgrade); probably with the believe that exit code doesn't reflect the task state properly.

If an alloc is stopped, the Exit Message should be meaningful to users: e.g. "Alloc was stopped due to {job update|user request}", rather than the cryptic message.

Would love your suggestions on how to handle the exit code: How do you think about reporting 0 exit code (or removing the field) and including the true exit code in message? Reporting non-zero may signal to downstream monitoring services) that the task is failed and needs attention - operators may need to update their monitoring dashboard to account for job upgrades and other reasons of stopping. There is also a backward compatibility concerns impacting users unexpectedly.

@notnoop notnoop self-assigned this Apr 25, 2021
@notnoop notnoop added the stage/accepted Confirmed, and intend to work on. No timeline committment though. label Apr 25, 2021
@benbuzbee
Copy link
Contributor Author

An example where the exit code is relevant even after stop: on shutdown, a service does some cleanup - it deletes entries in a database, or flushes state to a file. If that fails, someone might want to know from the exit code.

So for me the simplest thing would be for it to be non-zero so we can check that, but that's because I have no sympathy for backwards compatibility :) If there was a status message that indicated "Exit code 123 after stopped" I suppose that would be a good compromise. I do think it might be confusing though, e.g. if i were writing some automation to check the exitcode the API returns I would not expect this behavior.

@notnoop
Copy link
Contributor

notnoop commented May 3, 2021

Thanks @benbuzbee for the explanation. I'm convinced :). Upon investigation, I noticed that the current behavior depends on the driver (docker seemed to report the proper exit code regardless) and some raciness due to order of operations. It's indeed best to report the exit code the app returns without re-interpretation by Nomad.

I have opened a PR to address the issue in #10494 . I'll discuss back-porting it to a 1.0.x releaes with the team. Let us know if you have any questions!

notnoop pushed a commit that referenced this issue May 4, 2021
This commit ensures Nomad captures the task code more reliably even when the task is killed. This issue affect to `raw_exec` driver, as noted in #10430 .

We fix this issue by ensuring that the TaskRunner only calls `driver.WaitTask` once. The TaskRunner monitors the completion of the task by calling `driver.WaitTask` which should return the task exit code on completion. However, it also could return a "context canceled" error if the agent/executor is shutdown.

Previously, when a task is to be stopped, the killTask path makes two WaitTask calls, and the second returns "context canceled" occasionally because of a "race" in task shutting down and depending on driver, and how fast it shuts down after task completes.

By having a single WaitTask call and consistently waiting for the task, we ensure we capture the exit code reliably before the executor is shutdown or the contexts expired.

I opted to change the TaskRunner implementation to avoid changing the driver interface or requiring 3rd party drivers to update.

Additionally, the PR ensures that attempts to kill the task terminate when the task "naturally" dies. Without this change, if the task dies at the right moment, the `killTask` call may retry to kill an already-dead task for up to 5 minutes before giving up.
@tgross
Copy link
Member

tgross commented May 17, 2021

Shipped in 1.1.0-rc1

@tgross tgross closed this as completed May 17, 2021
tgross pushed a commit that referenced this issue May 17, 2021
This commit ensures Nomad captures the task code more reliably even when the task is killed. This issue affect to `raw_exec` driver, as noted in #10430 .

We fix this issue by ensuring that the TaskRunner only calls `driver.WaitTask` once. The TaskRunner monitors the completion of the task by calling `driver.WaitTask` which should return the task exit code on completion. However, it also could return a "context canceled" error if the agent/executor is shutdown.

Previously, when a task is to be stopped, the killTask path makes two WaitTask calls, and the second returns "context canceled" occasionally because of a "race" in task shutting down and depending on driver, and how fast it shuts down after task completes.

By having a single WaitTask call and consistently waiting for the task, we ensure we capture the exit code reliably before the executor is shutdown or the contexts expired.

I opted to change the TaskRunner implementation to avoid changing the driver interface or requiring 3rd party drivers to update.

Additionally, the PR ensures that attempts to kill the task terminate when the task "naturally" dies. Without this change, if the task dies at the right moment, the `killTask` call may retry to kill an already-dead task for up to 5 minutes before giving up.
@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 19, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. type/bug
Projects
None yet
Development

No branches or pull requests

4 participants