change_mode = "script" sometimes fails with Docker #15851
Hi @dani 👋 Thanks for the bug report. Do you have any other details you can share, such as what the script does or whether anything else runs at the same time?
Nope, nothing specific in the script, which looks like this:
And nothing else that could be triggered at the same time. The error message is quite surprising. If the script were simply failing, the message would be different (I don't remember the exact error when the script fails, but it clearly states its exit code). In this case, it looks like the script isn't even fired ("because task driver doesn't support the exec operation").
I was able to reproduce this with this job:

job "example" {
datacenters = ["dc1"]
group "client" {
task "curl" {
driver = "docker"
config {
image = "curlimages/curl:7.87.0"
command = "/bin/ash"
args = ["local/script.sh"]
}
template {
data = <<EOF
#!/usr/bin/env ash
while true; do
{{range nomadService "server"}}
curl http://{{.Address}}:{{.Port}}/
{{end}}
sleep 1
done
EOF
destination = "local/script.sh"
change_mode = "script"
change_script {
command = "/bin/ash"
args = ["local/change.sh"]
}
}
template {
data = <<EOF
#!/usr/bin/env ash
date
echo "change"
EOF
destination = "local/change.sh"
}
resources {
cpu = 10
memory = 50
}
}
}
group "server" {
network {
port "http" {}
}
service {
name = "server"
provider = "nomad"
port = "http"
}
task "http" {
driver = "docker"
config {
image = "busybox:1"
command = "httpd"
args = ["-v", "-f", "-p", "${NOMAD_PORT_http}", "-h", "/local"]
ports = ["http"]
}
template {
data = <<EOF
hello world
EOF
destination = "local/index.html"
}
resources {
cpu = 10
memory = 50
}
}
}
}

Changing the data the template reads re-renders it and can reproduce the failure. The problem is that if the Nomad or Vault token changes, the template hook shuts down the existing template manager and creates a new one. The new template manager does not have access to the task driver handle that was set in the poststart hook. #15915 restores the driver handle when the template manager is recreated.

@dani do you have more than one task in your group, or are you using Vault by any chance?
Yes to both: there are 3 tasks plus one prestart task, and I'm using Vault (to get certificates).
Ah cool, so yeah, I think what I reproduced above is the main cause of the issue you're having. The next release of Nomad will have a fix for it.
Still experiencing this error in Nomad 1.5.6. change_script runs every day. It can run successfully several times, but sooner or later I'm getting:
Any suggestions?
Hi @kkornienko-aparavi 👋 Do you have any logs from around when the problem happens? And would you be able to provide a sample of your job?
Hello @lgfa29, thanks for the response. I'm posting the important parts of the job. It's a Grafana deployment job, which consists of 2 tasks.
mariadb_ssl_reload.sh

Logs from Nomad client:
Thanks for the extra info @kkornienko-aparavi. Unfortunately I have not been able to reproduce this yet, even after restarting the Nomad agent.
Just to clarify this part, do you mean the Nomad client process is restarted or is it the entire machine, like a server reboot? I'm reopening this issue just in case until we find a better answer.
hey @lgfa29 😄 I've run into the same issue on 1.6.3 with a script used to update the Java keystore on changes to a certificate rendered from Vault's PKI via Consul template:

template {
destination = "secrets/kafka.pem"
change_mode = "script"
change_script {
command = "/local/hot-reload-keystore.sh"
timeout = "20s"
fail_on_error = false
}
splay = "0ms"
data = <<-EOF
{{- with secret "pki/issue/kafka-brokers" "common_name=xxx" "ttl=10m" "alt_names=*.yyyy" "private_key_format=pkcs8" -}}
{{ .Data.private_key }}
{{ .Data.certificate }}
{{ range .Data.ca_chain }}
{{ . }}
{{ end }}
{{- end -}}
EOF
}

While investigating the issue I was able to reproduce it with a for loop updating a key in Consul K/V and restarting the Nomad client while the loop is running.
job "test" {
region = "dc"
datacenters = ["dca"]
constraint {
attribute = node.unique.name
value = "xxx"
}
group "test" {
count = 1
task "test" {
driver = "docker"
config {
image = "alpine:3.18.2"
args = ["tail", "-f", "/dev/null"]
}
template {
data = <<EOF
{{ range ls "services/my-ns/test" }}
{{.Key}}={{.Value}}
{{ end }}
EOF
destination = "local/test.tmpl"
change_mode = "script"
change_script {
command = "/local/change.sh"
}
}
template {
data = <<EOF
#!/bin/sh
date >> local/run.log
EOF
destination = "local/change.sh"
perms = "555"
}
resources {
cores = 4
memory = 64
}
}
}
}
while : ; do consul kv put services/my-ns/test/foo a=$RANDOM ; sleep 5 ; done
Following runs succeed:
Can you please check if you are able to reproduce it as well?
I've reproduced this in my lab as well, using the jobspecs provided by the-nando and lgfa29 above.

Nomad version used: 1.6.2+ent

Client restart timestamps / nomad log grep:
client_docker.log.zip |
Anecdotally (I don't have any of the above dumps currently), this behavior continues in 1.7.5, and it seems to continue to be related to a nomad client restart. |
Hi folks, I'm just getting started digging into this. At first glance it's pretty bizarre:
While I could easily imagine, given that this bug happens around client restarts, that we've got some kind of race condition where the handle is not yet valid, that's not what the bug appears to be. My suspicion is that when the lazy handle's
Oh, looks like I was looking at the wrong error message. That's embarrassing. 😊

In that case, the error isn't happening in the poststop hook but in the prestart hook. But the handle we need to fire the change script isn't being set until the poststart hook. So the template renders for the "first" time (according to the task runner lifecycle), immediately detects a change to the file, and that triggers the change script even though we don't have a handle to the task yet. The very helpful client logs that @ron-savoia posted show the order of events pretty clearly for the alloc:
We're never going to have a valid task handle at the point of the prestart, so we simply can't run the script at that point. What I'll do next is write a unit test demonstrating the exact behavior we're seeing, and then see if I can figure out a reasonable way to "queue up" the script execution so that if we restore a task handle we can run it afterwards. Without getting stuck or trying to execute if the task handle fails to be restored (e.g. the task gets killed between the time we render and the time we reach the poststart hook method), of course.

(Edit: I thought this sounded familiar, and as it turns out we didn't trigger change mode on templates during restore until I fixed that way back in #9636. So either this is a regression or a bug in that original implementation years ago.)
At this point reproduction is easy and deterministic when running at least one Nomad server and a separate Nomad client. Create a variable:
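For example, something along these lines (the path matches the nomadVar lookup in the jobspec below; the value is arbitrary):

# Create the Nomad variable the template reads; the key "name" matches {{ .name }} in the template.
nomad var put nomad/jobs/example name=hello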
Run the following job:

job "example" {
group "group" {
network {
mode = "bridge"
port "www" {
to = 8001
}
}
task "task" {
driver = "docker"
config {
image = "busybox:1"
command = "httpd"
args = ["-vv", "-f", "-p", "8001", "-h", "/local"]
ports = ["www"]
}
template {
data = <<EOT
<html>
<h1>hello, {{ with nomadVar "nomad/jobs/example" -}}{{.name}}{{- end -}}</h1>
</html>
EOT
destination = "local/index.html"
change_mode = "script"
change_script {
command = "/bin/sh"
args = ["-c", "echo ok"]
fail_on_error = true
}
}
resources {
cpu = 50
memory = 50
}
}
}
}

Stop the client.
Update the variable.
Restart the client.
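Put together, on a client managed by systemd the sequence might look like this (the unit name and new value are assumptions):

# Assumes the Nomad client runs as a systemd unit named "nomad".
sudo systemctl stop nomad
# Change the variable while the client is down; this talks to the still-running server.
nomad var put nomad/jobs/example name=goodbye
sudo systemctl start nomad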
See the allocation fail:
For templates with `change_mode = "script"`, we set a driver handle in the poststart method so that the template runner can execute the script inside the task. But when the client is restarted and the template contents change during that window, we trigger a change_mode in the prestart method. In that case, the hook will not have the handle and so returns an error trying to run the change mode.

We restore the driver handle before we call any prestart hooks, so we can pass that handle in the constructor whenever it's available. In the normal task start case the handle will be empty, but it also won't be called.

The error messages are also misleading, as there's no capabilities check happening here. Update the error messages to match.

Fixes: #15851
Ref: https://hashicorp.atlassian.net/browse/NET-9338
I've got a fix up here: #23663. The task driver handle was actually already available, but we simply weren't providing it to the template hook in time for this use.
We are using Nomad v1.8.2+ent. The template block looks as follows:
However, when the secret in Vault is updated, the post hook fails with "Template ran script /bin/sh with arguments [sleep 5 && touch /local/test] on change but it exited with code code: 2".
@aisvarya2 that's a different issue. In your case, the change script is running but you're getting an error. That's because you're passing invalid arguments to /bin/sh. You want something like:

change_script {
command = "/bin/sh"
args = ["-c", "'sleep 180 && touch /local/test'"]
timeout = "90s"
}

Also, your timeout is less than the sleep window, but I suspect you've got some redaction going on here.
@tgross we tried using the change_script as you mentioned, with an appropriate timeout, but it does not work as expected when we pass it in the args.
@aisvarya2 you'll want to look at some trace-level logging to see what's going on with the script then. Getting exit code 2 means there's something wrong with the script itself, not the change_script mechanism.
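One thing worth double-checking, in case it applies here: the args are passed to /bin/sh directly rather than through another shell, so the extra single quotes inside the -c string shown above become part of the command sh tries to run. A change_script shaped like this avoids that (timings are illustrative):

change_script {
  command = "/bin/sh"
  # The whole command is a single string after "-c"; no inner quotes are needed,
  # because nothing strips them before /bin/sh receives the argument.
  args    = ["-c", "sleep 5 && touch /local/test"]
  timeout = "90s"
}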
@aisvarya2 I suppose it could be something wrong with permissions:
Or
Add more permissions to the script (do not forget to remove them after debugging): |
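For example, assuming the script is a rendered file at a path like local/script.sh (hypothetical), something like:

# Hypothetical path; overly broad permissions for debugging only.
chmod 777 local/script.sh
# Tighten again once debugging is done, e.g. read and execute only.
chmod 555 local/script.sh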
Nomad version
Nomad v1.4.3 (f464aca)
Operating system and Environment details
AlmaLinux 8.7
Nomad installed manually from the pre-built binary. Using Docker CE 20.10.22
Issue
For some of my tasks, I trigger a script when a template changes. Typically, I use this to trigger a custom reload action when a Vault-generated certificate gets renewed. Most of the time this works great, but sometimes the execution fails with this error:
Reproduction steps
A job with a templated cert, a reload script, and change_mode = "script", for example with:
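Roughly like this (the PKI role, common name, and destination are placeholders; the reload script path matches the one mentioned below):

template {
  data = <<EOF
{{ with secret "pki/issue/my-role" "common_name=example.internal" "ttl=5m" }}
{{ .Data.certificate }}
{{ .Data.private_key }}
{{ .Data.issuing_ca }}
{{ end }}
EOF
  destination = "secrets/cert.pem"
  change_mode = "script"
  change_script {
    command = "/local/bin/rotate-cert.sh"
  }
}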
Using a short TTL like in this example makes the problem more visible, but it's still quite random
Expected Result
The /local/bin/rotate-cert.sh script is fired every time the cert is renewed
Actual Result
Most of the time it's working, but sometimes (couldn't find any common pattern) the script fails with:
As a workaround, I've added the fail_on_error stanza
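Roughly, the change_script block ends up looking like this (same script path as above):

change_script {
  command       = "/local/bin/rotate-cert.sh"
  # If the script cannot be executed, fail the task so it is killed and restarted
  # and picks up the renewed certificate.
  fail_on_error = true
}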
So when this happens, the task is killed and restarted (so it gets its new cert)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)
Nothing interesting; we can see the template being rendered and the task being stopped (because of the fail_on_error)