Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nomad 0.9: raw_exec task processes leaking from allocations stopped after agent restart #5593

Closed
cheeseprocedure opened this issue Apr 22, 2019 · 3 comments · Fixed by #5598

Comments

@cheeseprocedure
Copy link
Contributor

cheeseprocedure commented Apr 22, 2019

Nomad version

Nomad v0.9.0 (18dd59056ee1d7b2df51256fe900a98460d3d6b9)

Operating system and Environment details

  • Ubuntu 16.04LTS (our prod environment)
  • macOS Mojave (reproduction case).

Issue

tl;dr: processes launched by raw_exec tasks are orphaned if the Nomad agent restarts and a job stop/update stops allocations on the host that were live before the agent restart.

We use Nomad to run a simple guest VM supervisor process via raw_exec. After deploying Nomad 0.9 to a limited set of datacenters, a monitor started firing intermittently which indicated an instance of our supervisor found running without a Nomad executor parent process.

After some digging, we found this was happening to allocations started prior to a restart of the local Nomad agent. A while back, we tweaked our configuration management tooling as a workaround for #4413 so Nomad agents running in client mode would simply restart (regardless of whether TLS key material had been updated). Because of this behaviour, we often have allocations running that are older than the current Nomad agent process.

Allocations impacted by this issue have an event log much like this:

Recent Events:
Time                  Type        Description
2019-04-18T18:51:17Z  Killed      Task successfully killed
2019-04-18T18:51:17Z  Terminated  Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin415396184: connect: connection refused\""
2019-04-18T18:51:12Z  Killing     Sent interrupt
2019-04-17T18:42:49Z  Started     Task started by client
2019-04-17T18:42:48Z  Task Setup  Building Task Directory
2019-04-17T18:42:48Z  Received    Task received by client

Despite Nomad reporting Task successfully killed, the process launched by the raw_exec driver is still running at this point.

When Nomad reports Sent interrupt, the allocation's process does receive the appropriate signal; our tooling kicks off graceful shutdown/eventual forced termination of our guest VMs, so the orphaned process doesn't stick around for long. (This is why our orphaned process monitor was flapping.)

While submitting this issue, I realized the same failure mode occurs after restarting the Nomad agent and calling nomad stop [job_name] to stop that job's allocations on the agent's host. That's the case documented below.

Reproduction steps

Configs/resources

/tmp/nomad_config.json:

{
    "datacenter": "macbook",
    "data_dir": "/tmp/nomad_data",
    "disable_update_check": true,
    "enable_syslog": true,
    "leave_on_terminate": true,
    "syslog_facility": "LOCAL0",
    "log_level": "DEBUG",
    "server": {
        "bootstrap_expect": 1,
        "enabled": true
    },
    "client": {
        "enabled": true,
        "options": {
            "driver.whitelist": "raw_exec",
            "driver.raw_exec.enable": "1"
        }
    }
}

/tmp/testjob.sh:

#!/bin/bash

function log {
    echo "$(date) - testjob - $(echo ${NOMAD_GROUP_NAME}) - $(echo ${NOMAD_ALLOC_ID} | cut -f1 -d-) - PID ${$?} - ${1}" \
        | tee -a /tmp/testjob.log
}

function handleint {
    log "received interrupt but ignoring it"
}
trap handleint INT

SLEEPTIME="5"
while true; do
    log "sleeping for ${SLEEPTIME} seconds..."
    sleep "${SLEEPTIME}"
done

Procedure

  1. Run the Nomad agent in its own terminal session: nomad agent -config /tmp/nomad_config.json
  2. Run the simple job definition below: nomad run /tmp/testjob.hcl
  3. tail -f /tmp/testjob.log to confirm allocation is live and generating output
  4. In the Nomad agent's terminal session, hit Ctrl-C to exit gracefully; once the agent has exited, run it again.
  5. Stop the Nomad job: nomad stop testjob
  6. /tmp/testjob.log should immediately include a received interrupt but ignoring it line from the current allocation, and nomad status [alloc_id] should include aSent interrupt event.
  7. Wait a few seconds (if necessary), then look at the allocation status again. It should include a Task successfully killed event following an RPC error with all SubConns are in TransientFailure and /tmp/testjob.log should continue showing output from the still-running bash script (now with a parent PID of 1) launched by the supposedly-complete allocation:

2019-04-22T14:08:28-07:00 Terminated Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: error while dialing: dial unix /var/folders/f1/004zzwdd3ml7x9j1q8thh3b00000gn/T/plugin954529704: connect: connection refused\""

Job file (if appropriate)

/tmp/testjob.hcl:

job "testjob" {
  datacenters = ["macbook"]
  type = "service"
  update {
    stagger      = "3s"
    max_parallel = 1
  }
  group "group1" {
    task "task1" {
      kill_timeout = "30s"
      driver = "raw_exec"
      config {
        command = "/tmp/testjob.sh"
      }
      resources {
        cpu = 50
      }
    }
  }
}

Nomad Client logs (if appropriate)

Nomad agent logs after a Nomad client restart and nomad stop testjob:

    2019-04-22T14:43:50.921-0700 [DEBUG] http: request complete: method=GET path=/v1/jobs?prefix=testjob duration=1.000755ms
    2019-04-22T14:43:50.924-0700 [DEBUG] http: request complete: method=GET path=/v1/job/testjob duration=446.263µs
    2019-04-22T14:43:50.928-0700 [DEBUG] worker: dequeued evaluation: eval_id=8ca24f07-4e90-a5d5-00ea-47ebfd2be47c
    2019-04-22T14:43:50.928-0700 [DEBUG] http: request complete: method=DELETE path=/v1/job/testjob?purge=false duration=2.704206ms
    2019-04-22T14:43:50.928-0700 [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=8ca24f07-4e90-a5d5-00ea-47ebfd2be47c job_id=testjob namespace=default results="Total changes: (place 0) (destructive 0) (inplace 0) (stop 1)
Deployment Update for ID "9d3f5652-5912-9e71-32b0-ad1499e8155f": Status "cancelled"; Description "Cancelled because job is stopped"
Desired Changes for "group1": (place 0) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)
Desired Changes for "group2": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)"
    2019-04-22T14:43:50.930-0700 [DEBUG] http: request complete: method=GET path=/v1/evaluation/8ca24f07-4e90-a5d5-00ea-47ebfd2be47c duration=289.389µs
    2019-04-22T14:43:50.931-0700 [DEBUG] worker: submitted plan for evaluation: eval_id=8ca24f07-4e90-a5d5-00ea-47ebfd2be47c
    2019-04-22T14:43:50.931-0700 [DEBUG] worker.service_sched: setting eval status: eval_id=8ca24f07-4e90-a5d5-00ea-47ebfd2be47c job_id=testjob namespace=default status=complete
    2019-04-22T14:43:50.931-0700 [DEBUG] http: request complete: method=GET path=/v1/evaluation/8ca24f07-4e90-a5d5-00ea-47ebfd2be47c/allocations duration=169.652µs
    2019-04-22T14:43:50.933-0700 [DEBUG] worker: updated evaluation: eval="<Eval "8ca24f07-4e90-a5d5-00ea-47ebfd2be47c" JobID: "testjob" Namespace: "default">"
    2019-04-22T14:43:50.933-0700 [DEBUG] worker: ack evaluation: eval_id=8ca24f07-4e90-a5d5-00ea-47ebfd2be47c
    2019-04-22T14:43:50.933-0700 [DEBUG] client: updated allocations: index=1460 total=5 pulled=5 filtered=0
    2019-04-22T14:43:50.933-0700 [DEBUG] client: allocation updates: added=0 removed=0 updated=5 ignored=0
    2019-04-22T14:43:50.937-0700 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=5 ignored=0 errors=0
    2019-04-22T14:43:50.939-0700 [ERROR] client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 driver=raw_exec task_name=task1 error="rpc error: code = Unavailable desc = transport is closing"
    2019-04-22T14:43:50.940-0700 [ERROR] client.alloc_runner.task_runner: failed to kill task: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 task=task1 backoff=5s error="executor Shutdown failed: rpc error: code = Unavailable desc = transport is closing"
    2019-04-22T14:43:50.940-0700 [DEBUG] client.alloc_runner.task_runner.task_hook.stats_hook: error fetching stats of task: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 task=task1 error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: error while dialing: dial unix /var/folders/f1/004zzwdd3ml7x9j1q8thh3b00000gn/T/plugin630219183: connect: connection refused""
    2019-04-22T14:43:51.185-0700 [DEBUG] client: updated allocations: index=1462 total=5 pulled=4 filtered=1
    2019-04-22T14:43:51.186-0700 [DEBUG] client: allocation updates: added=0 removed=0 updated=4 ignored=1
    2019-04-22T14:43:51.188-0700 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=4 ignored=1 errors=0
    2019-04-22T14:43:51.926-0700 [DEBUG] client.driver_mgr.raw_exec.executor: reattached plugin process exited: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 driver=raw_exec task_name=task1
    2019-04-22T14:43:51.936-0700 [DEBUG] http: request complete: method=GET path=/v1/evaluation/8ca24f07-4e90-a5d5-00ea-47ebfd2be47c duration=230.683µs
    2019-04-22T14:43:51.937-0700 [DEBUG] http: request complete: method=GET path=/v1/evaluation/8ca24f07-4e90-a5d5-00ea-47ebfd2be47c/allocations duration=92.024µs
    2019-04-22T14:43:55.944-0700 [INFO ] client.gc: marking allocation for GC: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6
    2019-04-22T14:43:56.186-0700 [DEBUG] client: updated allocations: index=1463 total=5 pulled=4 filtered=1
    2019-04-22T14:43:56.186-0700 [DEBUG] client: allocation updates: added=0 removed=0 updated=4 ignored=1
    2019-04-22T14:43:56.189-0700 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=4 ignored=1 errors=0
    2019-04-22T14:43:58.931-0700 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: reattached plugin process exited: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 task=task1
    2019-04-22T14:43:58.931-0700 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 task=task1
    2019-04-22T14:43:58.932-0700 [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 task=task1
    2019-04-22T14:43:59.187-0700 [DEBUG] client: updated allocations: index=1464 total=5 pulled=4 filtered=1
    2019-04-22T14:43:59.187-0700 [DEBUG] client: allocation updates: added=0 removed=0 updated=4 ignored=1
    2019-04-22T14:43:59.190-0700 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=4 ignored=1 errors=0
@cheeseprocedure cheeseprocedure changed the title Nomad 0.9: raw_exec task processes leaking after agent restart Nomad 0.9: raw_exec task processes leaking from stopped allocations after agent restart Apr 22, 2019
@cheeseprocedure cheeseprocedure changed the title Nomad 0.9: raw_exec task processes leaking from stopped allocations after agent restart Nomad 0.9: raw_exec task processes leaking from allocations stopped after agent restart Apr 22, 2019
notnoop pushed a commit that referenced this issue Apr 23, 2019
Fixes #5593

Executor seems to die unexpectedly after nomad agent dies or is
restarted.  The crash seems to occur at the first log message after
the nomad agent dies.

To ease debugging we forward executor log messages to executor.log as
well as to Stderr.  `go-plugin` sets up plugins with Stderr pointing to
a pipe being read by plugin client, the nomad agent in our case[1].
When the nomad agent dies, the pipe is closed, and any subsequent
executor logs fail with ErrClosedPipe and SIGPIPE signal.  SIGPIPE
results into executor process dying.

I considered adding a handler to ignore SIGPIPE, but hc-log library
currently panics when logging write operation fails[2]

This we opt to revert to v0.8 behavior of exclusively writing logs to
executor.log, while we investigate alternative options.

[1] https://github.com/hashicorp/nomad/blob/v0.9.0/vendor/github.com/hashicorp/go-plugin/client.go#L528-L535
[2] https://github.com/hashicorp/nomad/blob/v0.9.0/vendor/github.com/hashicorp/go-hclog/int.go#L320-L323
notnoop pushed a commit that referenced this issue Apr 23, 2019
Fixes #5593

Executor seems to die unexpectedly after nomad agent dies or is
restarted.  The crash seems to occur at the first log message after
the nomad agent dies.

To ease debugging we forward executor log messages to executor.log as
well as to Stderr.  `go-plugin` sets up plugins with Stderr pointing to
a pipe being read by plugin client, the nomad agent in our case[1].
When the nomad agent dies, the pipe is closed, and any subsequent
executor logs fail with ErrClosedPipe and SIGPIPE signal.  SIGPIPE
results into executor process dying.

I considered adding a handler to ignore SIGPIPE, but hc-log library
currently panics when logging write operation fails[2]

This we opt to revert to v0.8 behavior of exclusively writing logs to
executor.log, while we investigate alternative options.

[1] https://github.com/hashicorp/nomad/blob/v0.9.0/vendor/github.com/hashicorp/go-plugin/client.go#L528-L535
[2] https://github.com/hashicorp/nomad/blob/v0.9.0/vendor/github.com/hashicorp/go-hclog/int.go#L320-L323
@preetapan
Copy link
Contributor

@cheeseprocedure - We just released 0.9.1-rc which resolves this issue, available at https://releases.hashicorp.com/nomad/0.9.1-rc1/. Would be great if you can try it out and let us know if it fixed what you saw.

@cheeseprocedure
Copy link
Contributor Author

Thanks @preetapan - I've been unable to reproduce these failures on 0.9.1-rc1!

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants