Can't reschedule "system" type job #19043

Closed
eshcheglov opened this issue Nov 9, 2023 · 1 comment · Fixed by #19147
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) theme/cli type/bug

Comments

@eshcheglov

Nomad version

Nomad v1.6.1
BuildDate 2023-07-21T13:49:42Z
Revision 515895c

Operating system and Environment details

Ubuntu 20.04, arm64, nVidia Jetson

Issue

I'm forced to reschedule all Nomad jobs after a node reboot or init because of #16812.
However, if the job type is "system", I can't reschedule it: Nomad gets stuck in an endless "Still waiting for allocation to be replaced" loop.

==> 2023-11-09T05:51:48-05:00: Restarting 1 allocation
    2023-11-09T05:51:48-05:00: Rescheduling allocation "6fe8d748" for group "brokers"
    2023-11-09T05:52:48-05:00: Still waiting for allocation "6fe8d748" to be replaced
    2023-11-09T05:53:48-05:00: Still waiting for allocation "6fe8d748" to be replaced
    2023-11-09T05:54:48-05:00: Still waiting for allocation "6fe8d748" to be replaced
    2023-11-09T05:55:48-05:00: Still waiting for allocation "6fe8d748" to be replaced
    2023-11-09T05:56:48-05:00: Still waiting for allocation "6fe8d748" to be replaced
    2023-11-09T05:57:48-05:00: Still waiting for allocation "6fe8d748" to be replaced

Reproduction steps

  1. Deploy example job (see below)
  2. Reschedule it: nomad job restart -yes -reschedule mqtt-broker

Job file (if appropriate)

job "mqtt-broker" {
  type = "system"

  group "brokers" {

    network {
      mode = "bridge"
      port "mqtt" { static = 1883 }
    }

    task "broker-1" {
      driver = "docker"

      config {
        image          = "eclipse-mosquitto"
        auth_soft_fail = "true"
        ports          = ["mqtt"]
        command        = "/usr/sbin/mosquitto"
        args           = ["-c", "local/mosquitto.conf"]
      }

      template {
        change_mode = "signal"
        change_signal = "SIGHUP"

        data = <<EOH
        listener 1883
        log_dest stdout
        allow_anonymous true
        EOH

        destination = "local/mosquitto.conf"
      }

      service {
        name = "mqtt-broker"
        port = "mqtt"
        tags = ["mqtt","mosquitto","broker"]

        check {
          type     = "tcp"
          port     = "mqtt"
          interval = "2s"
          timeout  = "2s"
        }
      }
    }
  }
}

@lgfa29
Contributor

lgfa29 commented Nov 22, 2023

Hi @eshcheglov 👋

Thanks for the report. The -reschedule flag was indeed probably designed for non-system jobs: it assumes the Nomad reconciler will recreate the allocation, which doesn't happen for system jobs.

For system jobs I think we can just re-register the job, and that should trigger the reconciler to create the replacements.

But batch and sysbatch jobs don't sound like they should be allowed to restart: they should run to completion unless stopped, so the command should check for them and exit early.

I have a draft PR up in #19147 to fix this; I just need to write some extra tests.

It's far from ideal but, as a workaround for now, you can call nomad job eval every time an allocation is rescheduled. This will trigger Nomad to create its replacement.
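
A minimal sketch of that workaround, assuming the job name from the report. The helper name `reschedule_with_eval` and the polling interval are hypothetical and not part of Nomad; only `nomad job restart -yes -reschedule` and `nomad job eval` come from this thread. It starts the reschedule in the background and keeps requesting a new evaluation until the restart command returns:

```shell
#!/bin/sh
# Hypothetical helper (not part of Nomad): reschedule a job, then nudge the
# scheduler with `nomad job eval` until the restart command stops waiting.
reschedule_with_eval() {
  job="$1"
  interval="${2:-10}"   # seconds between evals; an arbitrary choice

  # The restart blocks while waiting for replacements, so run it in the background.
  nomad job restart -yes -reschedule "$job" &
  restart_pid=$!

  # While the restart is still running, ask Nomad to evaluate the job again
  # so that the replacement allocation gets created.
  while kill -0 "$restart_pid" 2>/dev/null; do
    sleep "$interval"
    nomad job eval "$job" >/dev/null || true
  done

  # Propagate the exit status of the restart command.
  wait "$restart_pid"
}
```

It could be called as `reschedule_with_eval mqtt-broker 10`. Polling on a timer is a blunt approach; it may issue one evaluation more than strictly needed, which should be harmless.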
