
Driver config appears to fail to parse, but it is inconsistent #6036

Closed
ships opened this issue Jul 30, 2019 · 5 comments

Comments


ships commented Jul 30, 2019

Nomad version

Nomad v0.9.3 (c5e8b66)

Operating system and Environment details

I am running Ubuntu 18.04. The cluster topology is all on a single host, virtualized with VirtualBox in bridged networking mode:

  • 3 servers, which run Nomad and Consul both in server mode
  • 2 clients, which run Nomad and Consul in client mode

Issue

This issue presents similarly to these issues:
#5680
#5694

But with one feature I can't find a reference to: my jobs do eventually start up, they just fail about 3 times before succeeding. My job asks for 2 instances of the worker group, but on average I only get about 1.2 running at any time.

I suspect that it is related to the use of

        dns_servers = [
          "${attr.unique.network.ip-address}",
        ]

But the funny thing is, if you look at the job file you see the same directive in the web task, which does not fail. Only the work task fails, and only some of the time.

Perhaps this variable is only set some of the time?
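
If so, one way to test that theory would be to pin the group to nodes where the attribute exists. A sketch only, assuming I am reading the constraint docs right about the is_set operator:

        constraint {
          attribute = "${attr.unique.network.ip-address}"
          operator  = "is_set"
        }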

I also note that the failed allocations show a template with change_mode restart being re-rendered, and the successful allocations do not, which I think is what takes my happy instance and kicks it into a spiral for about 10 minutes. The web node, which does not fail on account of the driver, also gets restarted due to a template change. Notably, the web node has only 1 desired instance at the moment.
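
If the re-render is what kicks off the spiral, staggering the restarts might soften it. A sketch, assuming the template splay parameter works the way I think it does:

        template {
          # ... same data and destination as the env template above ...
          change_mode = "restart"
          splay       = "1m" # random delay before restarting, so both instances don't bounce at once
        }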

Reproduction steps

I deploy this job file
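
Roughly like so, where concourse.nomad is just a placeholder for the file name:

    $ nomad job run concourse.nomad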

Job file (if appropriate)

job "concourse" {
  datacenters = ["dc1"]
  type = "service"

  update {
    max_parallel = 1
    min_healthy_time = "10s"
    healthy_deadline = "3m"
    progress_deadline = "10m"
    auto_revert = false
    auto_promote = true
    canary = 1
  }

  migrate {
    max_parallel = 1
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  vault {
    policies = ["concourse"]

    change_mode = "noop"
  }

# groups


  group "web" {
    count = 1

    restart {
      attempts = 2
      interval = "30m"
      delay = "15s"
      mode = "fail"
    }

    task "web" {
      driver = "docker"

      config {
        image = "concourse/concourse:5.4"
        dns_servers = [ "${attr.unique.network.ip-address}" ]
        args = [
          "web",
          "--tsa-host-key", "${NOMAD_SECRETS_DIR}/concourse-keys/tsa_host_key",
          "--tsa-authorized-keys", "${NOMAD_SECRETS_DIR}/concourse-keys/authorized_worker_keys",
          "--tsa-session-signing-key", "${NOMAD_SECRETS_DIR}/concourse-keys/session_signing_key",
          "--vault-ca-cert", "${NOMAD_SECRETS_DIR}/ssl/vault/vault_ca_certificate.pem",
          "--tls-cert", "${NOMAD_SECRETS_DIR}/ssl/web_tls/certificate.pem",
          "--tls-key", "${NOMAD_SECRETS_DIR}/ssl/web_tls/key.pem",
					"--vault-client-token", "${VAULT_TOKEN}",
        ]

        port_map = {
          "atc" = 443
          "tsa" = 2222
        }
      }

      env {
        CONCOURSE_POSTGRES_USER = "pgadmin"
        CONCOURSE_POSTGRES_DATABASE = "concourse"
        CONCOURSE_MAIN_TEAM_LOCAL_USER = "test"
        CONCOURSE_VAULT_PATH_PREFIX = "/kvv1/concourse"
        CONCOURSE_DEFAULT_BUILD_LOGS_TO_RETAIN = 40
        CONCOURSE_MAX_BUILD_LOGS_TO_RETAIN = 100
        CONCOURSE_TLS_BIND_PORT = 443
      }

      template {
        data = <<EOH
          CONCOURSE_EXTERNAL_URL="https://ci.service.skelter:50808"
          {{ with service "postgres" }}
          {{ with index . 0}}
          CONCOURSE_POSTGRES_HOST="{{.Address}}"
          CONCOURSE_POSTGRES_PORT="{{.Port}}"
          {{end}}{{end}}
          {{with secret "kv/data/ci/web"}}
          CONCOURSE_POSTGRES_PASSWORD={{.Data.data.pg_password}}
          CONCOURSE_GITHUB_CLIENT_ID={{.Data.data.github_client_id}}
          CONCOURSE_GITHUB_CLIENT_SECRET={{.Data.data.github_client_secret}}
          CONCOURSE_MAIN_TEAM_GITHUB_USER={{.Data.data.github_main_user}}
          {{end}}
          {{ with service "active.vault" }}
          {{ with index . 0 }}
          CONCOURSE_VAULT_URL="https://active.vault.service.skelter:{{.Port}}"
          {{end}}{{end}}
        EOH

        env = true
        destination = "run/secrets.env"
				change_mode = "restart"
      }

      template {
        source = "/var/vcap/jobs/nomad-client/ssl/vault_ca_certificate.pem"
			  destination = "secrets/ssl/vault/vault_ca_certificate.pem"
      }

      template {
        data = <<EOH
{{with secret "kv/data/ci/web"}}{{.Data.data.tsa_host_key}}{{end}}EOH

        destination = "secrets/concourse-keys/tsa_host_key"
      }

      template {
        data = <<EOH
{{with secret "pki/issue/skelter-services" "common_name=ci.service.skelter"}}{{.Data.private_key}}{{end}}EOH

        destination = "secrets/ssl/web_tls/key.pem"
      }

      template {
        data = <<EOH
{{with secret "pki/issue/skelter-services" "common_name=ci.service.skelter"}}{{.Data.certificate}}
{{.Data.issuing_ca}}{{end}}EOH

        destination = "secrets/ssl/web_tls/certificate.pem"
      }

      template {
        data = <<EOH
{{with secret "kv/data/ci/web"}}{{.Data.data.authorized_worker_keys}}{{end}}EOH

        destination = "secrets/concourse-keys/authorized_worker_keys"
      }

      template {
        data = <<EOH
{{with secret "kv/data/ci/web"}}{{.Data.data.session_signing_key}}{{end}}EOH

        destination = "secrets/concourse-keys/session_signing_key"
      }

      resources {
        cpu    = 2000 # 2000 MHz
        memory = 1536
        network {
          port "atc" {
            static = 50808
          }
          port "tsa" {}
        }
      }

      service {
        name = "ci-tsa"
        tags = ["internal"]
        port = "tsa"
      }

      service {
        name = "ci"
        tags = ["global"]
        port = "atc"
        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }

  group "worker" {
    count = 2

    restart {
      attempts = 2
      interval = "30m"
      delay = "15s"
      mode = "fail"
    }

    ephemeral_disk {
      size = 30000
    }

    task "work" {
      driver = "docker"

      config {
        image = "concourse/concourse:5.4"
        dns_servers = [
          "${attr.unique.network.ip-address}",
        ]
        privileged = true
        args = [
          "worker",
          "--work-dir", "${NOMAD_TASK_DIR}/worker",
          "--tsa-worker-private-key", "${NOMAD_SECRETS_DIR}/concourse-keys/worker_ssh_key",
          "--tsa-public-key", "${NOMAD_SECRETS_DIR}/concourse-keys/tsa_host_key.pub",
          "--tsa-host", "${CONCOURSE_TSA_HOST}",
          "--baggageclaim-bind-port",  "${NOMAD_PORT_baggageclaim}",
          "--bind-port", "${NOMAD_PORT_garden}",
        ]
      }

      template {
        data = <<EOH
{{with secret "kv/data/ci/worker"}}{{.Data.data.tsa_host_key_pub}}{{end}}EOH

        destination = "secrets/concourse-keys/tsa_host_key.pub"
      }

      template {
        data = <<EOH
{{with secret "kv/data/ci/worker"}}{{.Data.data.worker_ssh_key}}{{end}}EOH

        destination = "secrets/concourse-keys/worker_ssh_key"
      }

      template {
        data = <<EOH
        {{ with service "ci-tsa" }}
        {{ with index . 0}}
        CONCOURSE_TSA_HOST="{{.Address}}:{{.Port}}"
        {{end}}{{end}}
        EOH

        env = true
        destination = "run/secrets.env"
      }


      resources {
        cpu    = 2000 # 2000 MHz
        memory = 1024 # 1024 MB
        network {
          port "garden" {}
          port "baggageclaim" {}
          port "garbagecollection" {}
        }
      }
    }
  }
}

Nomad Client logs (if appropriate)

    2019-07-24T18:23:13.094Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=501b36c2-6ec0-3286-4955-70ba03639be2 task=work error="failed to decode driver config: [pos 201]: readContainerLen: Unrecognized descriptor byte: hex: d4, decimal: 212"

notnoop commented Jul 31, 2019

@ships That looks very odd. Would you mind providing nomad node status --verbose <node-id> output for the web and worker nodes? If you can provide the Vagrantfile for us to replicate your configuration, we can debug this more effectively.

My hypothesis is that the worker node is missing the unique.network.ip-address attribute for some reason, so config rendering fails, resulting in the obscure error you see.
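
A quick way to check that on each client would be something like:

    $ nomad node status --verbose <node-id> | grep unique.network.ip-address

If the attribute is missing on the node running the failing worker allocations, that would point to it.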


ships commented Aug 1, 2019

@notnoop answering your questions in reverse order:

My deployment topology is a bit awkward, since I am not using Vagrant. I am using BOSH, which packages Nomad 0.9.3 along with Consul and Vault.

If it means anything to you, my configuration is public, but I can also try to answer any specific questions you think of.

To be clear, there are 2 "client" nodes of Consul/Nomad, so the web and worker nodes I describe are groups in my Nomad job. That means the worker allocations currently share Nomad nodes with the web allocations. I scaled web up to 2, in case that difference explained anything, but I can still see this huge difference in how the worker and web groups respond:
[screenshot: Screen Shot 2019-07-31 at 17 26 23]

Anyway, I have gotten the requested output on the two Nomad client nodes:

$ nomad node status --verbose 58b6e2c4
ID          = 58b6e2c4-fd9b-9868-007a-8d6d33d5976e
Name        = carrack/0
Class       = <none>
DC          = dc1
Drain       = false
Eligibility = eligible
Status      = ready
Uptime      = 219h8m48s

Drivers
Driver    Detected  Healthy  Message                                                                         Time
docker    true      true     Healthy                                                                         2019-07-30T00:29:17Z
exec      true      true     Healthy                                                                         2019-07-30T00:29:17Z
java      false     false    <none>                                                                          2019-07-30T00:29:17Z
qemu      false     false    <none>                                                                          2019-07-30T00:29:17Z
raw_exec  false     false    disabled                                                                        2019-07-30T00:29:17Z
rkt       false     false    Failed to execute rkt version: exec: "rkt": executable file not found in $PATH  2019-07-30T00:29:17Z

Node Events
Time                  Subsystem  Message                  Details
2019-07-30T00:29:17Z  Cluster    Node re-registered       <none>
2019-07-30T00:29:13Z  Cluster    Node heartbeat missed    <none>
2019-07-30T00:28:22Z  Drain      Node drain complete      <none>
2019-07-30T00:28:22Z  Drain      Node drain strategy set  <none>
2019-07-30T00:26:59Z  Cluster    Node re-registered       <none>
2019-07-30T00:26:18Z  Cluster    Node heartbeat missed    <none>
2019-07-30T00:25:35Z  Drain      Node drain complete      <none>
2019-07-30T00:25:35Z  Drain      Node drain strategy set  <none>
2019-07-30T00:20:46Z  Cluster    Node re-registered       <none>
2019-07-30T00:20:15Z  Cluster    Node heartbeat missed    <none>

Allocated Resources
CPU             Memory           Disk
4000/10368 MHz  2.5 GiB/5.8 GiB  30 GiB/112 GiB

Allocation Resource Utilization
CPU           Memory
14/10368 MHz  148 MiB/5.8 GiB

Host Resource Utilization
CPU            Memory           Disk
128/10368 MHz  1.1 GiB/5.8 GiB  21 GiB/119 GiB

Allocations
ID                                    Eval ID                               Node ID                               Task Group  Version  Desired  Status   Created                    Modified
730fb131-77ea-9aa7-1b55-3efd75027f83  04ee4342-06b9-52b2-4dc5-69f16e929560  58b6e2c4-fd9b-9868-007a-8d6d33d5976e  web         39       run      running  2019-07-30T10:08:36-07:00  2019-07-31T17:06:59-07:00
be67f724-b73a-cee3-4cfb-c84762feb463  04ee4342-06b9-52b2-4dc5-69f16e929560  58b6e2c4-fd9b-9868-007a-8d6d33d5976e  worker      39       run      running  2019-07-30T10:01:34-07:00  2019-07-31T17:07:01-07:00

Attributes
consul.datacenter                = yao
consul.revision                  = a82e6a7fd
consul.server                    = false
consul.version                   = 1.5.2
cpu.arch                         = amd64
cpu.frequency                    = 2592
cpu.modelname                    = Intel(R) Core(TM) i7-6770HQ CPU @ 2.60GHz
cpu.numcores                     = 4
cpu.totalcompute                 = 10368
driver.docker                    = 1
driver.docker.bridge_ip          = 172.17.0.1
driver.docker.os_type            = linux
driver.docker.privileged.enabled = true
driver.docker.runtimes           = runc
driver.docker.version            = 18.06.3-ce
driver.docker.volumes.enabled    = true
driver.exec                      = 1
kernel.name                      = linux
kernel.version                   = 4.15.0-50-generic
memory.totalbytes                = 6250000384
nomad.advertise.address          = 192.168.1.27:4646
nomad.revision                   = c5e8b66c3789e4e7f9a83b4e188e9a937eea43ce
nomad.version                    = 0.9.3
os.name                          = ubuntu
os.signals                       = SIGINT,SIGUSR1,SIGWINCH,SIGXFSZ,SIGABRT,SIGFPE,SIGHUP,SIGQUIT,SIGSTOP,SIGTERM,SIGTSTP,SIGPROF,SIGUSR2,SIGILL,SIGIOT,SIGKILL,SIGTTOU,SIGALRM,SIGPIPE,SIGSEGV,SIGTRAP,SIGBUS,SIGCONT,SIGTTIN,SIGXCPU,SIGCHLD,SIGIO,SIGSYS,SIGURG
os.version                       = 16.04
unique.cgroup.mountpoint         = /sys/fs/cgroup
unique.consul.name               = work-carrack-0
unique.hostname                  = cc3b187a-82e1-48b0-866a-eeebf637811a
unique.network.ip-address        = 192.168.1.27
unique.storage.bytesfree         = 120080527360
unique.storage.bytestotal        = 128135868416
unique.storage.volume            = /dev/sdb2
vault.accessible                 = true
vault.cluster_id                 = ded3a7b5-19bb-4dd4-18bc-976f1114f88c
vault.cluster_name               = vault-cluster-eee58161
vault.version                    = 1.1.1

Meta
node_name = carrack/0
$ nomad node status --verbose 5a2d6a66
ID          = 5a2d6a66-203f-c3a8-d864-7643df0cc04c
Name        = carrack/1
Class       = <none>
DC          = dc1
Drain       = false
Eligibility = eligible
Status      = ready
Uptime      = 219h7m45s

Drivers
Driver    Detected  Healthy  Message                                                                         Time
docker    true      true     Healthy                                                                         2019-07-23T03:35:34Z
exec      true      true     Healthy                                                                         2019-07-23T03:35:34Z
java      false     false    <none>                                                                          2019-07-23T03:35:34Z
qemu      false     false    <none>                                                                          2019-07-23T03:35:34Z
raw_exec  false     false    disabled                                                                        2019-07-23T03:35:34Z
rkt       false     false    Failed to execute rkt version: exec: "rkt": executable file not found in $PATH  2019-07-23T03:35:34Z

Node Events
Time                  Subsystem  Message                  Details
2019-07-23T03:34:42Z  Drain      Node drain complete      <none>
2019-07-23T03:34:39Z  Drain      Node drain strategy set  <none>
2019-07-22T21:22:06Z  Cluster    Node registered          <none>

Allocated Resources
CPU             Memory           Disk
5100/10368 MHz  3.5 GiB/5.8 GiB  69 GiB/112 GiB

Allocation Resource Utilization
CPU            Memory
522/10368 MHz  294 MiB/5.8 GiB

Host Resource Utilization
CPU            Memory           Disk
825/10368 MHz  1.4 GiB/5.8 GiB  7.2 GiB/119 GiB

Allocations
ID                                    Eval ID                               Node ID                               Task Group  Version  Desired  Status   Created                    Modified
27b85eef-90d3-7fba-9008-c3e7d4a4ecf1  b3e39820-5251-d8d6-e069-07df1fe9a521  5a2d6a66-203f-c3a8-d864-7643df0cc04c  worker      39       run      running  2019-07-31T14:09:21-07:00  2019-07-31T17:07:06-07:00
486d5a3e-76eb-5fc7-1407-080418735a2b  e8f056cd-9d5c-86d5-2595-1fe0c939b7e9  5a2d6a66-203f-c3a8-d864-7643df0cc04c  node        5        run      running  2019-07-29T20:06:39-07:00  2019-07-29T20:07:27-07:00
18221f87-6011-563f-c9f8-19f5824e0e3a  0c80eca5-e897-6af8-0b30-1b08b6de6375  5a2d6a66-203f-c3a8-d864-7643df0cc04c  example     4        run      running  2019-07-29T17:19:21-07:00  2019-07-29T17:19:35-07:00
7b90aabe-deb3-f4ae-8337-083d97728eb8  04ee4342-06b9-52b2-4dc5-69f16e929560  5a2d6a66-203f-c3a8-d864-7643df0cc04c  web         39       run      running  2019-07-25T14:46:34-07:00  2019-07-31T17:06:56-07:00

Attributes
consul.datacenter                = yao
consul.revision                  = a82e6a7fd
consul.server                    = false
consul.version                   = 1.5.2
cpu.arch                         = amd64
cpu.frequency                    = 2592
cpu.modelname                    = Intel(R) Core(TM) i7-6770HQ CPU @ 2.60GHz
cpu.numcores                     = 4
cpu.totalcompute                 = 10368
driver.docker                    = 1
driver.docker.bridge_ip          = 172.17.0.1
driver.docker.os_type            = linux
driver.docker.privileged.enabled = true
driver.docker.runtimes           = runc
driver.docker.version            = 18.06.3-ce
driver.docker.volumes.enabled    = true
driver.exec                      = 1
kernel.name                      = linux
kernel.version                   = 4.15.0-50-generic
memory.totalbytes                = 6250000384
nomad.advertise.address          = 192.168.1.28:4646
nomad.revision                   = c5e8b66c3789e4e7f9a83b4e188e9a937eea43ce
nomad.version                    = 0.9.3
os.name                          = ubuntu
os.signals                       = SIGBUS,SIGSTOP,SIGTERM,SIGTTOU,SIGABRT,SIGFPE,SIGILL,SIGURG,SIGUSR1,SIGCONT,SIGINT,SIGPIPE,SIGQUIT,SIGIO,SIGXCPU,SIGALRM,SIGHUP,SIGKILL,SIGSYS,SIGSEGV,SIGXFSZ,SIGCHLD,SIGPROF,SIGTSTP,SIGTTIN,SIGIOT,SIGTRAP,SIGUSR2,SIGWINCH
os.version                       = 16.04
unique.cgroup.mountpoint         = /sys/fs/cgroup
unique.consul.name               = work-carrack-1
unique.hostname                  = 312cc0be-b393-4f07-bb58-9f4346c5579f
unique.network.ip-address        = 192.168.1.28
unique.storage.bytesfree         = 119911559168
unique.storage.bytestotal        = 128135868416
unique.storage.volume            = /dev/sdb2
vault.accessible                 = true
vault.cluster_id                 = ded3a7b5-19bb-4dd4-18bc-976f1114f88c
vault.cluster_name               = vault-cluster-eee58161
vault.version                    = 1.1.1

Meta
node_name = carrack/1


stale bot commented Oct 30, 2019

Hey there

Since this issue hasn't had any activity in a while, we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!


stale bot commented Nov 29, 2019

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

stale bot closed this as completed Nov 29, 2019
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 16, 2022