use of -purge leads to leak of goroutines #19988

Closed
shoenig opened this issue Feb 14, 2024 · 0 comments · Fixed by #20348
Labels
hcc/cst (Admin - internal), stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/deployments, type/bug

shoenig commented Feb 14, 2024

When stopping a job with the -purge flag, a Nomad client may leak goroutines. This becomes readily apparent when rapidly starting and stopping a job as described below. Two goroutines show up in large quantity in a dump taken after applying the reproduction steps, both associated with the deployment watcher. If the -purge flag is not set, no leak is observed.
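
For reference, the traces below come from a full goroutine dump. A minimal sketch of one way to capture such a dump, assuming the agent is running locally on the default HTTP port and was started with enable_debug = true so the standard Go pprof handlers are exposed (adjust the address for your setup):

package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// debug=2 asks the standard Go pprof handler for a full, human-readable
	// stack dump, like the traces quoted below.
	resp, err := http.Get("http://127.0.0.1:4646/debug/pprof/goroutine?debug=2")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Write the dump to stdout so it can be redirected to a file.
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		log.Fatal(err)
	}
}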

goroutine 1246 [select]:
runtime.gopark(0xc0016b9f78?, 0x4?, 0x90?, 0x5e?, 0xc0016b9eb0?)
        runtime/proc.go:398 +0xce fp=0xc0016b9ce8 sp=0xc0016b9cc8 pc=0x4426ce
runtime.selectgo(0xc0016b9f78, 0xc0016b9ea8, 0x0?, 0x0, 0xedd5f13c8?, 0x1)
        runtime/select.go:327 +0x725 fp=0xc0016b9e08 sp=0xc0016b9ce8 pc=0x452e45
github.com/hashicorp/nomad/nomad/deploymentwatcher.(*deploymentWatcher).watch(0xc001826840)
        github.com/hashicorp/nomad/nomad/deploymentwatcher/deployment_watcher.go:440 +0x226 fp=0xc0016b9fc8 sp=0xc0016b9e08 pc=0x196afe6
github.com/hashicorp/nomad/nomad/deploymentwatcher.newDeploymentWatcher.func1()
        github.com/hashicorp/nomad/nomad/deploymentwatcher/deployment_watcher.go:134 +0x25 fp=0xc0016b9fe0 sp=0xc0016b9fc8 pc=0x19697c5
runtime.goexit()
        runtime/asm_amd64.s:1650 +0x1 fp=0xc0016b9fe8 sp=0xc0016b9fe0 pc=0x4764a1
created by github.com/hashicorp/nomad/nomad/deploymentwatcher.newDeploymentWatcher in goroutine 327
        github.com/hashicorp/nomad/nomad/deploymentwatcher/deployment_watcher.go:134 +0x3aa
goroutine 20345 [select]:
runtime.gopark(0xc00121fac8?, 0x21?, 0x20?, 0x0?, 0xc00121f986?)
        runtime/proc.go:398 +0xce fp=0xc00121f7e8 sp=0xc00121f7c8 pc=0x4426ce
runtime.selectgo(0xc00121fac8, 0xc00121f944, 0x2ac1820?, 0x0, 0x42955c?, 0x1)
        runtime/select.go:327 +0x725 fp=0xc00121f908 sp=0xc00121f7e8 pc=0x452e45
github.com/hashicorp/go-memdb.watchFew({0x3261788, 0xc000e024b0}, {0xc00121fd78?, 0x10?, 0xc0010a89b8?})
        github.com/hashicorp/[email protected]/watch_few.go:16 +0x5a5 fp=0xc00121fce8 sp=0xc00121f908 pc=0x18ee625
github.com/hashicorp/go-memdb.WatchSet.WatchCtx(0x292a840?, {0x3261788, 0xc000e024b0})
        github.com/hashicorp/[email protected]/watch.go:86 +0x125 fp=0xc00121fe88 sp=0xc00121fce8 pc=0x18ed945
github.com/hashicorp/nomad/nomad/state.(*StateStore).BlockingQuery(0xc000a9fd70, 0xc00121ff60, 0x257, {0x3261788, 0xc000e024b0})
        github.com/hashicorp/nomad/nomad/state/state_store.go:371 +0x22e fp=0xc00121ff20 sp=0xc00121fe88 pc=0x191aa2e
github.com/hashicorp/nomad/nomad/deploymentwatcher.(*deploymentWatcher).getAllocs(0xc001816000, 0x0?)
        github.com/hashicorp/nomad/nomad/deploymentwatcher/deployment_watcher.go:928 +0x50 fp=0xc00121ff80 sp=0xc00121ff20 pc=0x196db70
github.com/hashicorp/nomad/nomad/deploymentwatcher.(*deploymentWatcher).getAllocsCh.func1()
        github.com/hashicorp/nomad/nomad/deploymentwatcher/deployment_watcher.go:914 +0x28 fp=0xc00121ffe0 sp=0xc00121ff80 pc=0x196da68
runtime.goexit()
        runtime/asm_amd64.s:1650 +0x1 fp=0xc00121ffe8 sp=0xc00121ffe0 pc=0x4764a1
created by github.com/hashicorp/nomad/nomad/deploymentwatcher.(*deploymentWatcher).getAllocsCh in goroutine 2950
        github.com/hashicorp/nomad/nomad/deploymentwatcher/deployment_watcher.go:913 +0x98

Script to repeatedly run and stop a job, causing the Nomad client to leak goroutines:

#!/usr/bin/env bash

set -xeuo pipefail

for v in $(seq 1 1000);
do
  nomad job run -detach -var=v=$v sleep0.hcl
  sleep 5
  nomad job stop -purge=true sleep0 || true
  sleep 5
done

A simple "sleep" service job:
variable "v" {
  type    = number
  default = 1
}

job "sleep0" {

  group "group" {
    restart {
      attempts = 0
      mode     = "fail"
    }

    update {
      min_healthy_time = "3s"
    }

    meta {
      v = var.v
    }

    task "task" {
      driver = "raw_exec"

      config {
        command = "bash"
        args    = ["-c", "cat local/file.txt && sleep infinity"]
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}
@jrasell jrasell added the stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) label Feb 16, 2024
@louievandyke louievandyke added the hcc/cst Admin - internal label Apr 8, 2024
@tgross tgross self-assigned this Apr 10, 2024
tgross added a commit that referenced this issue Apr 10, 2024
The deployment watcher on the leader makes blocking queries to detect when the
set of active deployments changes. It takes the resulting list of deployments
and adds or removes watchers based on whether the deployment is active. But when
a job is purged, the deployment will be deleted. This unblocks the query but
the query result only shows the remaining deployments.

When the query unblocks, ensure that all active watchers have a corresponding
deployment in state. If not, remove the watcher so that the goroutine stops.

Fixes: #19988
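
For illustration, here is a minimal sketch of the pruning step the commit message describes; the names, types, and locking are assumptions made for the sketch, not the actual change in #20348:

package main

import (
	"fmt"
	"sync"
)

// watcher is a stand-in (assumed, not Nomad's real type) for a per-deployment
// watcher whose goroutine exits once stop is called.
type watcher struct{ stopped bool }

func (w *watcher) stop() { w.stopped = true }

// pruneWatchers removes and stops every watcher whose deployment ID no longer
// appears in the latest blocking-query result over active deployments, so a
// purged deployment does not leave its watcher goroutine behind.
func pruneWatchers(mu *sync.Mutex, watchers map[string]*watcher, existing map[string]struct{}) {
	mu.Lock()
	defer mu.Unlock()
	for id, w := range watchers {
		if _, ok := existing[id]; !ok {
			w.stop()
			delete(watchers, id)
		}
	}
}

func main() {
	var mu sync.Mutex
	watchers := map[string]*watcher{"dep-1": {}, "dep-2": {}}
	// Only dep-1 still exists in state; dep-2 was purged along with its job.
	pruneWatchers(&mu, watchers, map[string]struct{}{"dep-1": {}})
	fmt.Println(len(watchers)) // 1
}
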
tgross added a commit that referenced this issue Apr 10, 2024
@tgross tgross added this to the 1.7.x milestone Apr 10, 2024
tgross added a commit that referenced this issue Apr 11, 2024
tgross added a commit that referenced this issue Apr 11, 2024
tgross added a commit that referenced this issue Apr 11, 2024
…#20348) (#20359)

Co-authored-by: Tim Gross <[email protected]>
philrenaud pushed a commit that referenced this issue Apr 18, 2024