use of -purge leads to leak of goroutines #19988

Closed
shoenig opened this issue Feb 14, 2024 · 0 comments · Fixed by #20348
Labels
hcc/cst (Admin - internal), stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/deployments, type/bug

shoenig commented Feb 14, 2024

When stopping a job with the -purge flag, a Nomad client may leak goroutines. This becomes readily apparent when rapidly starting and stopping a job as described below. Two goroutines show up in large quantity in a dump taken after applying the reproduction steps, both associated with the deployment watcher. If the -purge flag is not set, no leak is observed.
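
For reference, the traces below come from a full goroutine dump. A minimal sketch of one way to capture such a dump, assuming the agent is running locally on the default HTTP port and was started with enable_debug = true so the standard Go pprof handlers are exposed (adjust the address for your setup):

package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// debug=2 asks the standard Go pprof handler for a full, human-readable
	// stack dump, like the traces quoted below.
	resp, err := http.Get("http://127.0.0.1:4646/debug/pprof/goroutine?debug=2")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Write the dump to stdout so it can be redirected to a file.
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		log.Fatal(err)
	}
}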

goroutine 1246 [select]:
runtime.gopark(0xc0016b9f78?, 0x4?, 0x90?, 0x5e?, 0xc0016b9eb0?)
        runtime/proc.go:398 +0xce fp=0xc0016b9ce8 sp=0xc0016b9cc8 pc=0x4426ce
runtime.selectgo(0xc0016b9f78, 0xc0016b9ea8, 0x0?, 0x0, 0xedd5f13c8?, 0x1)
        runtime/select.go:327 +0x725 fp=0xc0016b9e08 sp=0xc0016b9ce8 pc=0x452e45
github.com/hashicorp/nomad/nomad/deploymentwatcher.(*deploymentWatcher).watch(0xc001826840)
        github.com/hashicorp/nomad/nomad/deploymentwatcher/deployment_watcher.go:440 +0x226 fp=0xc0016b9fc8 sp=0xc0016b9e08 pc=0x196afe6
github.com/hashicorp/nomad/nomad/deploymentwatcher.newDeploymentWatcher.func1()
        github.com/hashicorp/nomad/nomad/deploymentwatcher/deployment_watcher.go:134 +0x25 fp=0xc0016b9fe0 sp=0xc0016b9fc8 pc=0x19697c5
runtime.goexit()
        runtime/asm_amd64.s:1650 +0x1 fp=0xc0016b9fe8 sp=0xc0016b9fe0 pc=0x4764a1
created by github.com/hashicorp/nomad/nomad/deploymentwatcher.newDeploymentWatcher in goroutine 327
        github.com/hashicorp/nomad/nomad/deploymentwatcher/deployment_watcher.go:134 +0x3aa
goroutine 20345 [select]:
runtime.gopark(0xc00121fac8?, 0x21?, 0x20?, 0x0?, 0xc00121f986?)
        runtime/proc.go:398 +0xce fp=0xc00121f7e8 sp=0xc00121f7c8 pc=0x4426ce
runtime.selectgo(0xc00121fac8, 0xc00121f944, 0x2ac1820?, 0x0, 0x42955c?, 0x1)
        runtime/select.go:327 +0x725 fp=0xc00121f908 sp=0xc00121f7e8 pc=0x452e45
github.com/hashicorp/go-memdb.watchFew({0x3261788, 0xc000e024b0}, {0xc00121fd78?, 0x10?, 0xc0010a89b8?})
        github.com/hashicorp/[email protected]/watch_few.go:16 +0x5a5 fp=0xc00121fce8 sp=0xc00121f908 pc=0x18ee625
github.com/hashicorp/go-memdb.WatchSet.WatchCtx(0x292a840?, {0x3261788, 0xc000e024b0})
        github.com/hashicorp/[email protected]/watch.go:86 +0x125 fp=0xc00121fe88 sp=0xc00121fce8 pc=0x18ed945
github.com/hashicorp/nomad/nomad/state.(*StateStore).BlockingQuery(0xc000a9fd70, 0xc00121ff60, 0x257, {0x3261788, 0xc000e024b0})
        github.com/hashicorp/nomad/nomad/state/state_store.go:371 +0x22e fp=0xc00121ff20 sp=0xc00121fe88 pc=0x191aa2e
github.com/hashicorp/nomad/nomad/deploymentwatcher.(*deploymentWatcher).getAllocs(0xc001816000, 0x0?)
        github.com/hashicorp/nomad/nomad/deploymentwatcher/deployment_watcher.go:928 +0x50 fp=0xc00121ff80 sp=0xc00121ff20 pc=0x196db70
github.com/hashicorp/nomad/nomad/deploymentwatcher.(*deploymentWatcher).getAllocsCh.func1()
        github.com/hashicorp/nomad/nomad/deploymentwatcher/deployment_watcher.go:914 +0x28 fp=0xc00121ffe0 sp=0xc00121ff80 pc=0x196da68
runtime.goexit()
        runtime/asm_amd64.s:1650 +0x1 fp=0xc00121ffe8 sp=0xc00121ffe0 pc=0x4764a1
created by github.com/hashicorp/nomad/nomad/deploymentwatcher.(*deploymentWatcher).getAllocsCh in goroutine 2950
        github.com/hashicorp/nomad/nomad/deploymentwatcher/deployment_watcher.go:913 +0x98

Script to repeatedly run and stop a job, causing the Nomad client to leak goroutines:

#!/usr/bin/env bash

set -xeuo pipefail

for v in $(seq 1 1000);
do
  nomad job run -detach -var=v=$v sleep0.hcl
  sleep 5
  nomad job stop -purge=true sleep0 || true
  sleep 5
done

A simple "sleep" service job:
variable "v" {
  type    = number
  default = 1
}

job "sleep0" {

  group "group" {
    restart {
      attempts = 0
      mode     = "fail"
    }

    update {
      min_healthy_time = "3s"
    }

    meta {
      v = var.v
    }

    task "task" {
      driver = "raw_exec"

      config {
        command = "bash"
        args    = ["-c", "cat local/file.txt && sleep infinity"]
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}
@jrasell jrasell added the stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) label Feb 16, 2024
@louievandyke louievandyke added the hcc/cst Admin - internal label Apr 8, 2024
@tgross tgross self-assigned this Apr 10, 2024
tgross added a commit that referenced this issue Apr 10, 2024
The deployment watcher on the leader makes blocking queries to detect when the
set of active deployments changes. It takes the resulting list of deployments
and adds or removes watchers based on whether the deployment is active. But when
a job is purged, the deployment will be deleted. This unblocks the query but
the query result only shows the remaining deployments.

When the query unblocks, ensure that all active watchers have a corresponding
deployment in state. If not, remove the watcher so that the goroutine stops.

Fixes: #19988
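
For illustration, here is a minimal sketch of the pruning step the commit message describes; the names, types, and locking are assumptions made for the sketch, not the actual change in #20348:

package main

import (
	"fmt"
	"sync"
)

// watcher is a stand-in (assumed, not Nomad's real type) for a per-deployment
// watcher whose goroutine exits once stop is called.
type watcher struct{ stopped bool }

func (w *watcher) stop() { w.stopped = true }

// pruneWatchers removes and stops every watcher whose deployment ID no longer
// appears in the latest blocking-query result over active deployments, so a
// purged deployment does not leave its watcher goroutine behind.
func pruneWatchers(mu *sync.Mutex, watchers map[string]*watcher, existing map[string]struct{}) {
	mu.Lock()
	defer mu.Unlock()
	for id, w := range watchers {
		if _, ok := existing[id]; !ok {
			w.stop()
			delete(watchers, id)
		}
	}
}

func main() {
	var mu sync.Mutex
	watchers := map[string]*watcher{"dep-1": {}, "dep-2": {}}
	// Only dep-1 still exists in state; dep-2 was purged along with its job.
	pruneWatchers(&mu, watchers, map[string]struct{}{"dep-1": {}})
	fmt.Println(len(watchers)) // 1
}
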
tgross added a commit that referenced this issue Apr 10, 2024
@tgross tgross added this to the 1.7.x milestone Apr 10, 2024
tgross added a commit that referenced this issue Apr 11, 2024
tgross added a commit that referenced this issue Apr 11, 2024
tgross added a commit that referenced this issue Apr 11, 2024
…#20348) (#20359)

Co-authored-by: Tim Gross <[email protected]>
philrenaud pushed a commit that referenced this issue Apr 18, 2024