change_mode = "script" sometimes fails with Docker #15851
Hi @dani 👋 Thanks for the bug report. Do you have any other details you can share, such as what the script does or whether anything else runs at the same time?
Nope, nothing specific in the script, which looks like this:
And nothing else that could be triggered at the same time. The error message is quite surprising. If the script were simply failing, the message would be different (I don't remember the exact error when the script fails, but it clearly states its exit code). In this case, it looks like the script isn't even fired ("because task driver doesn't support the exec operation").
I was able to reproduce this with this job:

job "example" {
datacenters = ["dc1"]
group "client" {
task "curl" {
driver = "docker"
config {
image = "curlimages/curl:7.87.0"
command = "/bin/ash"
args = ["local/script.sh"]
}
template {
data = <<EOF
#!/usr/bin/env ash
while true; do
{{range nomadService "server"}}
curl http://{{.Address}}:{{.Port}}/
{{end}}
sleep 1
done
EOF
destination = "local/script.sh"
change_mode = "script"
change_script {
command = "/bin/ash"
args = ["local/change.sh"]
}
}
template {
data = <<EOF
#!/usr/bin/env ash
date
echo "change"
EOF
destination = "local/change.sh"
}
resources {
cpu = 10
memory = 50
}
}
}
group "server" {
network {
port "http" {}
}
service {
name = "server"
provider = "nomad"
port = "http"
}
task "http" {
driver = "docker"
config {
image = "busybox:1"
command = "httpd"
args = ["-v", "-f", "-p", "${NOMAD_PORT_http}", "-h", "/local"]
ports = ["http"]
}
template {
data = <<EOF
hello world
EOF
destination = "local/index.html"
}
resources {
cpu = 10
memory = 50
}
}
}
}

Changing the data the template reads re-renders it and can reproduce the failure. The problem is that if the Nomad or Vault token changes, the template hook shuts down the existing template manager and creates a new one. The new template manager does not have access to the task driver handle that was set in the poststart hook. #15915 restores the driver handle when the template manager is recreated.

@dani do you have more than one task in your group, or are you using Vault by any chance?
Yes to both: there are 3 tasks plus one prestart task, and I'm using Vault (to get certificates).
Ah cool, so yeah, I think what I reproduced above is the main cause of the issue you're having. The next release of Nomad will have a fix for it.
Still experiencing this error in Nomad 1.5.6. change_script runs every day. It can run successfully several times, but sooner or later I'm getting:
Any suggestions?
Hi @kkornienko-aparavi 👋 Do you have any logs from around when the problem happens? And would you be able to provide a sample of your job?
Hello @lgfa29, thanks for the response. I'm posting the important parts of the job. It's a Grafana deployment job, which consists of 2 tasks.
mariadb_ssl_reload.sh

Logs from Nomad client:
Thanks for the extra info @kkornienko-aparavi. Unfortunately I have not been able to reproduce this yet, even after restarting the Nomad agent.
Just to clarify this part, do you mean the Nomad client process is restarted or is it the entire machine, like a server reboot? I'm reopening this issue just in case until we find a better answer.
hey @lgfa29 😄 I've run into the same issue on 1.6.3 with a script used to update the Java keystore on changes to a certificate rendered from Vault's PKI via Consul template:

template {
destination = "secrets/kafka.pem"
change_mode = "script"
change_script {
command = "/local/hot-reload-keystore.sh"
timeout = "20s"
fail_on_error = false
}
splay = "0ms"
data = <<-EOF
{{- with secret "pki/issue/kafka-brokers" "common_name=xxx" "ttl=10m" "alt_names=*.yyyy" "private_key_format=pkcs8" -}}
{{ .Data.private_key }}
{{ .Data.certificate }}
{{ range .Data.ca_chain }}
{{ . }}
{{ end }}
{{- end -}}
EOF
}

While investigating the issue I was able to reproduce it with a for loop updating a key in Consul K/V and restarting the Nomad client while the loop is running.
job "test" {
region = "dc"
datacenters = ["dca"]
constraint {
attribute = node.unique.name
value = "xxx"
}
group "test" {
count = 1
task "test" {
driver = "docker"
config {
image = "alpine:3.18.2"
args = ["tail", "-f", "/dev/null"]
}
template {
data = <<EOF
{{ range ls "services/my-ns/test" }}
{{.Key}}={{.Value}}
{{ end }}
EOF
destination = "local/test.tmpl"
change_mode = "script"
change_script {
command = "/local/change.sh"
}
}
template {
data = <<EOF
#!/bin/sh
date >> local/run.log
EOF
destination = "local/change.sh"
perms = "555"
}
resources {
cores = 4
memory = 64
}
}
}
}
while : ; do consul kv put services/my-ns/test/foo a=$RANDOM ; sleep 5 ; done
Following runs succeed:
Can you please check if you are able to reproduce it as well?
I've reproduced this in my lab as well, using the jobspecs provided by the-nando and lgfa29 above.

Nomad version used: 1.6.2+ent

Client restart timestamps / nomad log grep:
client_docker.log.zip |
Anecdotally (I don't have any of the above dumps currently), this behavior continues in 1.7.5, and it seems to continue to be related to a nomad client restart. |
Hi folks, I'm just getting started digging into this. At first glance it's pretty bizarre:
While I could easily imagine, given that this bug happens around client restarts, that we've got some kind of race condition where the handle is not yet valid, that's not what the bug appears to be. My suspicion is that when the lazy handle's
Oh, looks like I was looking at the wrong error message. That's embarrassing. 😊

In that case, the error isn't happening in the poststop hook but in the prestart hook. But the handle we need to fire the change script isn't being set until the poststart hook. So the template renders for the "first" time (according to the task runner lifecycle), immediately detects a change to the file, and that triggers the change script even though we don't have a handle to the task yet. The very helpful client logs that @ron-savoia posted show the order of events pretty clearly for the alloc:
We're never going to have a valid task handle at the point of the prestart, so we simply can't run the script at that point. What I'll do next is write a unit test demonstrating the exact behavior we're seeing, and then see if I can figure out a reasonable way to "queue up" the script execution so that if we restore a task handle we can run it afterwards. Without getting stuck or trying to execute if the task handle fails to be restored (e.g. the task gets killed between the time we render and the time we reach the poststart hook method), of course.

(Edit: I thought this sounded familiar, and as it turns out we didn't trigger change mode on templates during restore until I fixed that way back in #9636. So either this is a regression or a bug in that original implementation years ago.)
At this point reproduction is easy and deterministic when running at least one Nomad server and a separate Nomad client. Create a variable:
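For example, something along these lines (the path matches the nomadVar lookup in the jobspec below; the value is arbitrary):

# Create the Nomad variable the template reads; the key "name" matches {{ .name }} in the template.
nomad var put nomad/jobs/example name=hello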
Run the following job:

job "example" {
group "group" {
network {
mode = "bridge"
port "www" {
to = 8001
}
}
task "task" {
driver = "docker"
config {
image = "busybox:1"
command = "httpd"
args = ["-vv", "-f", "-p", "8001", "-h", "/local"]
ports = ["www"]
}
template {
data = <<EOT
<html>
<h1>hello, {{ with nomadVar "nomad/jobs/example" -}}{{.name}}{{- end -}}</h1>
</html>
EOT
destination = "local/index.html"
change_mode = "script"
change_script {
command = "/bin/sh"
args = ["-c", "echo ok"]
fail_on_error = true
}
}
resources {
cpu = 50
memory = 50
}
}
}
}

Stop the client.
Update the variable.
Restart the client.
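Put together, on a client managed by systemd the sequence might look like this (the unit name and new value are assumptions):

# Assumes the Nomad client runs as a systemd unit named "nomad".
sudo systemctl stop nomad
# Change the variable while the client is down; this talks to the still-running server.
nomad var put nomad/jobs/example name=goodbye
sudo systemctl start nomad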
See the allocation fail:
For templates with `change_mode = "script"`, we set a driver handle in the poststart method so that the template runner can execute the script inside the task. But when the client is restarted and the template contents change during that window, we trigger a change_mode in the prestart method. In that case, the hook will not have the handle and so returns an error trying to run the change mode.

We restore the driver handle before we call any prestart hooks, so we can pass that handle in the constructor whenever it's available. In the normal task start case the handle will be empty, but it also won't be called.

The error messages are also misleading, as there's no capabilities check happening here. Update the error messages to match.

Fixes: #15851
Ref: https://hashicorp.atlassian.net/browse/NET-9338
I've got a fix up here: #23663. The task driver handle was actually already available, but we simply weren't providing it to the template hook in time for this use.
We are using Nomad v1.8.2+ent. The template block looks as follows:
However, when the secret in Vault is updated, the post hook fails with "Template ran script /bin/sh with arguments [sleep 5 && touch /local/test] on change but it exited with code code: 2".
@aisvarya2 that's a different issue. In your case, the change script is running but you're getting an error. That's because you're passing invalid arguments to /bin/sh. You want something like:

change_script {
command = "/bin/sh"
args = ["-c", "'sleep 180 && touch /local/test'"]
timeout = "90s"
}

Also, your timeout is less than the sleep window, but I suspect you've got some redaction going on here.
@tgross we tried using the change_script as you mentioned, with an appropriate timeout, but it does not work as expected when we pass it in the args.
@aisvarya2 you'll want to look at some trace-level logging to see what's going on with the script then. Getting exit code 2 means there's something wrong with the script itself, not the change_script mechanism.
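One thing worth double-checking, in case it applies here: the args are passed to /bin/sh directly rather than through another shell, so the extra single quotes inside the -c string shown above become part of the command sh tries to run. A change_script shaped like this avoids that (timings are illustrative):

change_script {
  command = "/bin/sh"
  # The whole command is a single string after "-c"; no inner quotes are needed,
  # because nothing strips them before /bin/sh receives the argument.
  args    = ["-c", "sleep 5 && touch /local/test"]
  timeout = "90s"
}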
@aisvarya2 I suppose it could be something wrong with permissions:
Or
Add more permissions to the script (do not forget to remove them after debugging): |
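For example, assuming the script is a rendered file at a path like local/script.sh (hypothetical), something like:

# Hypothetical path; overly broad permissions for debugging only.
chmod 777 local/script.sh
# Tighten again once debugging is done, e.g. read and execute only.
chmod 555 local/script.sh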
Nomad version
Nomad v1.4.3 (f464aca)
Operating system and Environment details
AlmaLinux 8.7
Nomad installed manually from the pre-built binary. Using Docker CE 20.10.22
Issue
For some of my tasks, I trigger a script when a template changes. Typically, I use this to trigger a custom reload action when a Vault-generated certificate gets renewed. Most of the time this works great, but sometimes the execution fails with this error:
Reproduction steps
A job with a templated cert, a reload script, and change_mode = "script", for example with:
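Roughly like this (the PKI role, common name, and destination are placeholders; the reload script path matches the one mentioned below):

template {
  data = <<EOF
{{ with secret "pki/issue/my-role" "common_name=example.internal" "ttl=5m" }}
{{ .Data.certificate }}
{{ .Data.private_key }}
{{ .Data.issuing_ca }}
{{ end }}
EOF
  destination = "secrets/cert.pem"
  change_mode = "script"
  change_script {
    command = "/local/bin/rotate-cert.sh"
  }
}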
Using a short TTL like in this example makes the problem more visible, but it's still quite random
Expected Result
The /local/bin/rotate-cert.sh script is fired every time the cert is renewed
Actual Result
Most of the time it's working, but sometimes (couldn't find any common pattern) the script fails with:
As a workaround, I've added the fail_on_error stanza
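Roughly, the change_script block ends up looking like this (same script path as above):

change_script {
  command       = "/local/bin/rotate-cert.sh"
  # If the script cannot be executed, fail the task so it is killed and restarted
  # and picks up the renewed certificate.
  fail_on_error = true
}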
So when this happens, the task is killed and restarted (so it gets its new cert)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)
Nothing interesting; we can see the template being rendered and the task being stopped (because of the fail_on_error)