Health check routine leaks using new nomad provider #15477

Closed
thetooth opened this issue Dec 6, 2022 · 1 comment · Fixed by #15855
Labels: stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/service-discovery/nomad, type/bug

thetooth commented Dec 6, 2022

Nomad version

Nomad v1.4.3 (f464aca)

Issue

I have a service that logs HTTP requests and noticed that the endpoint used for health checking is being hit a few hundred times per second. The job has a fairly aggressive restart policy, and a netsplit last night led to the service restarting around 600 times, so the logs are quite busy to say the least.

Reproduction steps

Run the job below and either stop and resubmit the job or have the process crash. The number of requests hitting the service increases until Nomad is restarted.

Job file (if appropriate)

job "signaling" {
  datacenters = ["cloud"]
  type        = "service"

  reschedule {
    unlimited      = true
    delay          = "15s"
    delay_function = "constant"
    attempts       = 0
  }

  group "signaling" {
    restart {
      attempts = 2
      delay    = "1s"
      interval = "15s"
      mode     = "fail"
    }

    volume "opt" {
      type      = "host"
      source    = "opt"
      read_only = true
    }

    network {
      mode = "host"
      port "api" {
        static = 8000
      }
    }

    task "signaling" {
      driver = "exec"

      config {
        command = "entry.bash"
      }

      template {
        data = <<EOH
#!/bin/bash

/local/opt/signaling -bind=:8000
EOH

        destination = "local/entry.bash"
        perms       = "755"
      }

      template {
        data        = <<EOH
{{ with nomadVar "proapps" -}}
PROAPPS_ID={{ .id }}
PROAPPS_SECRET={{ .secret }}
{{- end }}
EOH
        destination = "local/file.env"
        env         = true
      }

      volume_mount {
        volume      = "opt"
        destination = "/local/opt"
      }

      resources {
        cpu    = 100 # Mhz
        memory = 512 # Mb
      }

      service {
        name     = "api"
        port     = "api"
        provider = "nomad"
        check {
          type     = "http"
          path     = "/debug/pprof/"
          interval = "5s"
          timeout  = "1s"
          method   = "GET"
        }
      }
    }
  }
}
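
Not part of the original job, but a minimal sketch (in Go, with hypothetical names) of a stand-in for the signaling binary that counts health-check hits per second, which makes the growing request rate after each reschedule easy to see in the logs:

package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

func main() {
	var hits int64

	// Same path and port the job's check block probes.
	http.HandleFunc("/debug/pprof/", func(w http.ResponseWriter, r *http.Request) {
		atomic.AddInt64(&hits, 1)
		w.WriteHeader(http.StatusOK)
	})

	// Log the request rate once per second; with a single healthy check on a
	// 5s interval this should stay at or below 1 rather than climbing after
	// every reschedule.
	go func() {
		for range time.Tick(time.Second) {
			log.Printf("health-check hits in last second: %d", atomic.SwapInt64(&hits, 0))
		}
	}()

	log.Fatal(http.ListenAndServe(":8000", nil))
}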
@shoenig shoenig self-assigned this Dec 6, 2022
@shoenig shoenig added stage/accepted and removed stage/needs-investigation labels Jan 23, 2023
@shoenig shoenig added this to the 1.4.x milestone Jan 23, 2023
shoenig commented Jan 23, 2023

Thanks for the report @thetooth, and apologies for the slow response. I was able to reproduce this with the simpler job and bash script below. AFAICT the duplication in requests to the health check happens on reschedule, which TBH is surprising, but at least now I know where to look.

job "demo" {
  datacenters = ["dc1"]

  group "group1" {
    network {
      mode = "host"
      port "http" {
        static = 8888
      }
    }
    
    reschedule {
      unlimited = true
      delay = "15s"
      delay_function = "constant"
      attempts = 0
    }
    
    restart {
      attempts = 2
      delay = "1s"
      interval = "15s"
      mode = "fail"
    }

    task "task1" {
      driver = "raw_exec"
      user = "shoenig"

      config {
        command = "python3"
        args = ["-m", "http.server", "8888", "--directory", "/tmp"]
      }

      service {
        provider = "nomad"
        port     = "http"
        check {
          path     = "/"
          type     = "http"
          interval = "3s"
          timeout  = "1s"
        }
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
#!/usr/bin/env bash

while true
do
  sleep 5
  # grab the PID of the python http.server process so it can be killed to force a restart
  pid=$(pgrep -f http.server | head -n1)
  echo "kill pid $pid"
  kill -9 $pid
done

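For context on why reschedules matter here, a minimal sketch of the failure mode (hypothetical names, not Nomad's actual internals): the nomad provider starts a check loop per allocation, and that loop only exits when a cleanup hook cancels it. If that hook never runs when the allocation is rescheduled, each replacement allocation adds another loop polling the same endpoint.

package main

import (
	"context"
	"net/http"
	"time"
)

// startCheckLoop launches a goroutine that probes url every interval until its
// context is cancelled. The returned cancel func is what the cleanup hook must
// call when the allocation goes away.
func startCheckLoop(url string, interval time.Duration) context.CancelFunc {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return // cleanup ran; the check loop exits
			case <-ticker.C:
				if resp, err := http.Get(url); err == nil {
					resp.Body.Close()
				}
			}
		}
	}()
	return cancel
}

func main() {
	// Simulate three reschedules where the cleanup hook never fires: the cancel
	// funcs are dropped, so three loops now poll the same endpoint concurrently.
	for i := 0; i < 3; i++ {
		_ = startCheckLoop("http://localhost:8888/", 3*time.Second)
	}
	time.Sleep(10 * time.Second)
}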
shoenig added a commit that referenced this issue Jan 23, 2023
This PR fixes a bug where alloc pre-kill hooks were not run in the
edge case where there are no live tasks remaining, but it is also
the final update to process for the (terminal) allocation. We need
to run cleanup hooks here, otherwise they will not run until the
allocation gets garbage collected (i.e. via Destroy()), possibly
at a distant time in the future.

Fixes #15477
shoenig added a commit that referenced this issue Jan 27, 2023
* client: run alloc pre-kill hooks on last pass despite no live tasks

This PR fixes a bug where alloc pre-kill hooks were not run in the
edge case where there are no live tasks remaining, but it is also
the final update to process for the (terminal) allocation. We need
to run cleanup hooks here, otherwise they will not run until the
allocation gets garbage collected (i.e. via Destroy()), possibly
at a distant time in the future.

Fixes #15477

* client: do not run ar cleanup hooks if client is shutting down
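
A rough sketch of the shape of the fix described in the commit message above (hypothetical types and method names, not the actual nomad/client/allocrunner code): on the final update for a terminal allocation with no live tasks left, run the pre-kill/cleanup hooks immediately rather than waiting for garbage collection, and skip that teardown when the client itself is shutting down.

package main

type allocRunner struct {
	shuttingDown func() bool // reports whether the client is shutting down
	liveTasks    func() int  // number of tasks still running
	preKillHooks func()      // e.g. stop the nomad-provider check watcher
}

func (ar *allocRunner) handleTerminalUpdate() {
	if ar.liveTasks() > 0 {
		return // the normal kill path will run the hooks
	}
	if ar.shuttingDown() {
		// Follow-up change: do not run cleanup hooks on client shutdown.
		return
	}
	// Before the fix this case returned without running the hooks, so the
	// health-check routines kept polling until Destroy() eventually ran.
	ar.preKillHooks()
}

func main() {
	ar := &allocRunner{
		shuttingDown: func() bool { return false },
		liveTasks:    func() int { return 0 },
		preKillHooks: func() { println("cleanup hooks ran") },
	}
	ar.handleTerminalUpdate()
}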