
failed to submit plan for evaluation: ... no such key \"<snip>\" in keyring error after moving cluster to 1.4.1 #14981

Closed
bfqrst opened this issue Oct 20, 2022 · 48 comments · Fixed by #14987, #15092 or #15227

bfqrst commented Oct 20, 2022

Nomad version

Nomad v1.4.1 (2aa7e66)

Operating system and Environment details

Ubuntu 22.04, Nomad 1.4.1

Issue

After upgrading the Nomad servers and clients to v1.4.1, I noticed that sometimes (unfortunately not always), after cycling the Nomad server and client ASGs with new AMIs, jobs scheduled on the workers can't be allocated. To be precise:

  1. Pipeline creates new Nomad AMIs via Packer
  2. Pipeline terraforms Nomad server ASG with server config
  3. Pipeline terraforms client ASG or dedicated instances with updated AMI
  4. Lost jobs on the workers (for instance the Traefik ingress job) can't be allocated

This literally never happened before 1.4.x.

Client output looks like this:

nomad eval list

ID        Priority  Triggered By      Job ID                Namespace  Node ID  Status   Placement Failures
427e9905  50        failed-follow-up  plugin-aws-ebs-nodes  default             pending  false
35f4fdfb  50        failed-follow-up  plugin-aws-efs-nodes  default             pending  false
46152dcd  50        failed-follow-up  spot-drainer          default             pending  false
71e3e58a  50        failed-follow-up  plugin-aws-ebs-nodes  default             pending  false
e86177a6  50        failed-follow-up  plugin-aws-efs-nodes  default             pending  false
2289ba5f  50        failed-follow-up  spot-drainer          default             pending  false
da3fdad6  50        failed-follow-up  plugin-aws-ebs-nodes  default             pending  false
b445b976  50        failed-follow-up  plugin-aws-efs-nodes  default             pending  false
48a6771e  50        failed-follow-up  ingress               default             pending  false

Reproduction steps

Unclear at this point. I seem to be able to somewhat force the issue when I cycle the Nomad server ASG with updated AMIs.

Expected Result

Client work that was lost should be rescheduled once the Nomad client comes up and reports readiness.

Actual Result

Lost jobs can't be allocated on workers with an updated AMI.

nomad status

ID                         Type     Priority  Status   Submit Date
auth-service               service  50        pending  2022-10-09T11:32:57+02:00
ingress                    service  50        pending  2022-10-17T14:57:26+02:00
plugin-aws-ebs-controller  service  50        running  2022-10-09T14:48:11+02:00
plugin-aws-ebs-nodes       system   50        running  2022-10-09T14:48:11+02:00
plugin-aws-efs-nodes       system   50        running  2022-10-09T11:37:04+02:00
prometheus                 service  50        pending  2022-10-18T21:19:24+02:00
spot-drainer               system   50        running  2022-10-11T18:04:49+02:00

Job file (if appropriate)

variable "stage" {
  type        = string
  description = "The stage for this jobfile."
}

variable "domain_suffix" {
  type        = string
  description = "The HDI stage specific domain suffix."
}

variable "acme_route" {
  type = string
}

variables {
  step_cli_version = "0.22.0"
  traefik_version  = "2.9.1"
}

job "ingress" {

  datacenters = [join("-", ["pd0011", var.stage])]

  type = "service"

  group "ingress" {

    constraint {
      attribute = "${meta.instance_type}"
      value     = "ingress"
    }

    count = 1

    service {
      name = "traefik"
      tags = [
        "traefik.enable=true",

        "traefik.http.routers.api.rule=Host(`ingress.dsp.${var.domain_suffix}`)",
        "traefik.http.routers.api.entrypoints=secure",
        "traefik.http.routers.api.service=api@internal",
        "traefik.http.routers.api.tls.certresolver=hdi_acme_resolver",
        "traefik.http.routers.api.tls.options=tls13@file",
        "traefik.http.routers.api.middlewares=dspDefaultPlusAdmin@file",

        "traefik.http.routers.ping.rule=Host(`ingress.dsp.${var.domain_suffix}`) && Path(`/ping`)",
        "traefik.http.routers.ping.entrypoints=secure",
        "traefik.http.routers.ping.service=ping@internal",
        "traefik.http.routers.ping.tls.certresolver=hdi_acme_resolver",
        "traefik.http.routers.ping.tls.options=tls13@file",
        "traefik.http.routers.ping.middlewares=dspDefault@file"
      ]

      port = "https"

      check {
        name     = "Traefik Ping Endpoint"
        type     = "http"
        protocol = "http"
        port     = "http"
        path     = "/ping"
        interval = "10s"
        timeout  = "2s"
      }
    }

    network {

      port "http" {
        static = 80
        to     = 80
      }
      port "https" {
        static = 443
        to     = 443
      }
    }

    ephemeral_disk {
      size    = 300
      sticky  = true
      migrate = true
    }

    task "generate_consul_cert" {
<snip>
    }

    task "generate_nomad_cert" {
<snip>
    }


    task "traefik" {

      driver = "docker"

      env {
        LEGO_CA_CERTIFICATES = join(":", ["${NOMAD_SECRETS_DIR}/cacert.pem", "${NOMAD_SECRETS_DIR}/root_ca_${var.stage}.crt"])
        # LEGO_CA_SYSTEM_CERT_POOL = true
      }

      config {
        image = "traefik:${var.traefik_version}"
        volumes = [
          # Use absolute paths to mount arbitrary paths on the host
          "local/:/etc/traefik/",
          "/etc/timezone:/etc/timezone:ro",
          "/etc/localtime:/etc/localtime:ro",
        ]
        network_mode = "host"
        ports        = ["http", "https"]
      }

      resources {
        cpu    = 800
        memory = 128
      }
      # Controls the timeout between signalling a task it will be killed
      # and killing the task. If not set a default is used.
      kill_timeout = "60s"

      template {
        data        = <<EOH
<snip>
EOH
      }
    }
  }
}

Nomad Server logs (if appropriate)

Oct 20 15:00:30 uat-nomad-95I nomad[485]:     2022-10-20T15:00:30.571+0200 [ERROR] worker: error invoking scheduler: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 15:00:42 uat-nomad-95I nomad[485]:     2022-10-20T15:00:42.948+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab eval_id=827f0dfe-0584-b44a-92e2-9a92ab649c48 error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"

Nomad Client logs (if appropriate)

Oct 20 11:55:00 uat-worker-wZz nomad[464]:              Log Level: INFO
Oct 20 11:55:00 uat-worker-wZz nomad[464]:                 Region: europe (DC: pd0011-uat)
Oct 20 11:55:00 uat-worker-wZz nomad[464]:                 Server: false
Oct 20 11:55:00 uat-worker-wZz nomad[464]:                Version: 1.4.1
Oct 20 11:55:00 uat-worker-wZz nomad[464]: ==> Nomad agent started! Log data will stream in below:
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.798+0200 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.798+0200 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.798+0200 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.798+0200 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.798+0200 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.817+0200 [INFO]  client: using state directory: state_dir=/opt/hsy/nomad/data/client
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.826+0200 [INFO]  client: using alloc directory: alloc_dir=/opt/hsy/nomad/data/alloc
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.826+0200 [INFO]  client: using dynamic ports: min=20000 max=32000 reserved=""
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.831+0200 [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.852+0200 [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=ens5
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.856+0200 [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=lo
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.870+0200 [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=ens5
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.897+0200 [INFO]  client.plugin: starting plugin manager: plugin-type=csi
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.900+0200 [INFO]  client.plugin: starting plugin manager: plugin-type=driver
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.900+0200 [INFO]  client.plugin: starting plugin manager: plugin-type=device
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.906+0200 [ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get \"https://127.0.0.1:8501/v1/catalog/datacenters\": dial tcp 127.0.0.1:8501: connect: connection refused"
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:55:00.437+0200 [INFO]  client: started client: node_id=5f21ebef-e0a9-8bd2-775a-61b3e32cac6e
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:55:00.437+0200 [WARN]  agent: not registering Nomad HTTPS Health Check because verify_https_client enabled
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:55:00.438+0200 [WARN]  client.server_mgr: no servers available
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:55:00.439+0200 [WARN]  client.server_mgr: no servers available
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:55:00.453+0200 [INFO]  client.consul: discovered following servers: servers=[10.194.73.146:4647, 10.194.74.253:4647, 10.194.75.103:4647]
Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:55:00.501+0200 [INFO]  client: node registration complete
Oct 20 11:55:06 uat-worker-wZz nomad[464]:     2022-10-20T11:55:06.856+0200 [INFO]  client: node registration complete
Oct 20 11:55:14 uat-worker-wZz nomad[464]:     2022-10-20T11:55:14.893+0200 [INFO]  client.fingerprint_mgr.consul: consul agent is available
Oct 20 11:55:21 uat-worker-wZz nomad[464]:     2022-10-20T11:55:21.417+0200 [INFO]  client: node registration complete
@tgross tgross self-assigned this Oct 20, 2022

tgross commented Oct 20, 2022

Hi @bfqrst! Thanks for opening this issue. The keyring being referred to here is new in 1.4.x and supports the new Workload Identity feature. When a new server joins the cluster, it streams the raft snapshot from the old cluster and also starts up keyring replication from it. The keyring replication loop on the server reads key metadata from raft, sees a key it doesn't have in its local keyring, and then sends an RPC to the leader to get that key, falling back to polling all the other peers if the leader doesn't have it, so that the key can still be fetched even if a leader election immediately followed the key's creation.
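
For orientation, here's a rough Go sketch of the shape of that loop. Every type and name below is invented for illustration; Nomad's real implementation lives in nomad/encrypter.go:

package keyring

import (
	"context"
	"log"
	"time"
)

// replicator is a hypothetical stand-in for the real encrypter state.
type replicator struct {
	local  map[string][]byte                          // key material this server already holds
	metas  func() []string                            // key IDs seen in raft-replicated metadata
	fetch  func(server, keyID string) ([]byte, error) // keyring RPC to one server (hypothetical)
	leader func() string
	peers  func() []string
}

func (r *replicator) run(ctx context.Context) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
		for _, keyID := range r.metas() {
			if _, ok := r.local[keyID]; ok {
				continue // already replicated
			}
			// Ask the leader for the key material first.
			key, err := r.fetch(r.leader(), keyID)
			if err != nil {
				// Fall back to polling every peer, in case a leader
				// election immediately followed the key's creation.
				for _, peer := range r.peers() {
					if key, err = r.fetch(peer, keyID); err == nil {
						break
					}
				}
			}
			if err != nil {
				log.Printf("failed to fetch key %q from any peer: %v", keyID, err)
				continue
			}
			r.local[keyID] = key
		}
	}
}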

What seems to be happening in your case is that the new servers aren't replicating the keyring, which means the leader can't sign the workload identity for the replacement allocations. Do you have any server logs containing the word keyring around the time the new servers were started up? That might help track this down.

We've had a similar report from an internal user in their staging environment as well. Their workflow seems somewhat similar to what you're reporting here, so I want to double-check that:

  • When you bring up the new AMI, are you bringing them all up at once?
  • When you bring up the new AMI, are you ensuring that all servers are current on their raft index before shutting down the old servers? (A rough way to check is sketched after this list.)
  • What is bootstrap_expect set to?
  • Is your cluster federated with another cluster?
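
For the raft-index question, one rough way to check before terminating each old server (both are standard Nomad CLI commands; compare the raft values across all servers):

nomad operator raft list-peers                             # every expected server present, as a voter
nomad agent-info | grep -E 'last_log_index|commit_index'   # run on each server; indexes should match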

Also, it might help if we could get a stack trace from the servers. You can trigger this via SIGQUIT, which will dump it to stderr. If it's really long, you can email it to [email protected] with a subject line pointing to this issue and I'll see it.
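
For example, something like this should do it (assuming the agent runs under systemd and the unit is named nomad, which is a guess based on your logs):

sudo kill -QUIT "$(pidof nomad)"                             # agent dumps all goroutine stacks to stderr
journalctl -u nomad --since "-10min" > nomad-goroutines.txt  # collect the dump from the journal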


bfqrst commented Oct 20, 2022

Thanks for looking into it @tgross!

So bootstrap_expect is set to 3 with no federation whatsoever. No ACLs but full TLS and gossip encryption. I really don't know if the logs I caught are from the cluster formation...

Oct 19 09:58:50 uat-nomad-95I systemd[1]: Started "HashiCorp Nomad - A workload orchestration solution".
Oct 19 09:58:51 uat-nomad-95I nomad[485]: ==> Loaded configuration from /opt/hsy/nomad/config/nomad_server.hcl
Oct 19 09:58:51 uat-nomad-95I nomad[485]: ==> Starting Nomad agent...
Oct 19 09:58:51 uat-nomad-95I nomad[485]: ==> Nomad agent configuration:
Oct 19 09:58:51 uat-nomad-95I nomad[485]:        Advertise Addrs: HTTP: 10.194.73.146:4646; RPC: 10.194.73.146:4647; Serf: 10.194.73.146:4648
Oct 19 09:58:51 uat-nomad-95I nomad[485]:             Bind Addrs: HTTP: [127.0.0.1:4646 10.194.73.146:4646]; RPC: 10.194.73.146:4647; Serf: 10.194.73.146:4648
Oct 19 09:58:51 uat-nomad-95I nomad[485]:                 Client: false
Oct 19 09:58:51 uat-nomad-95I nomad[485]:              Log Level: INFO
Oct 19 09:58:51 uat-nomad-95I nomad[485]:                 Region: europe (DC: pd0011-uat)
Oct 19 09:58:51 uat-nomad-95I nomad[485]:                 Server: true
Oct 19 09:58:51 uat-nomad-95I nomad[485]:                Version: 1.4.1
Oct 19 09:58:51 uat-nomad-95I nomad[485]: ==> Nomad agent started! Log data will stream in below:
Oct 19 09:58:51 uat-nomad-95I nomad[485]:     2022-10-19T09:58:51.878+0200 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Oct 19 09:58:51 uat-nomad-95I nomad[485]:     2022-10-19T09:58:51.878+0200 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Oct 19 09:58:51 uat-nomad-95I nomad[485]:     2022-10-19T09:58:51.878+0200 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Oct 19 09:58:51 uat-nomad-95I nomad[485]:     2022-10-19T09:58:51.878+0200 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Oct 19 09:58:51 uat-nomad-95I nomad[485]:     2022-10-19T09:58:51.878+0200 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
Oct 19 09:58:51 uat-nomad-95I nomad[485]:     2022-10-19T09:58:51.917+0200 [INFO]  nomad: setting up raft bolt store: no_freelist_sync=false
Oct 19 09:58:51 uat-nomad-95I nomad[485]:     2022-10-19T09:58:51.920+0200 [INFO]  nomad.raft: initial configuration: index=0 servers=[]
Oct 19 09:58:51 uat-nomad-95I nomad[485]:     2022-10-19T09:58:51.920+0200 [INFO]  nomad.raft: entering follower state: follower="Node at 10.194.73.146:4647 [Follower]" leader-address= leader-id=
Oct 19 09:58:51 uat-nomad-95I nomad[485]:     2022-10-19T09:58:51.923+0200 [INFO]  nomad: serf: EventMemberJoin: uat-nomad-95I.europe 10.194.73.146
Oct 19 09:58:51 uat-nomad-95I nomad[485]:     2022-10-19T09:58:51.923+0200 [INFO]  nomad: starting scheduling worker(s): num_workers=2 schedulers=["sysbatch", "service", "batch", "system", "_core"]
Oct 19 09:58:51 uat-nomad-95I nomad[485]:     2022-10-19T09:58:51.923+0200 [INFO]  nomad: started scheduling worker(s): num_workers=2 schedulers=["sysbatch", "service", "batch", "system", "_core"]
Oct 19 09:58:51 uat-nomad-95I nomad[485]:     2022-10-19T09:58:51.923+0200 [WARN]  agent: not registering Nomad HTTPS Health Check because verify_https_client enabled
Oct 19 09:58:51 uat-nomad-95I nomad[485]:     2022-10-19T09:58:51.926+0200 [INFO]  nomad: adding server: server="uat-nomad-95I.europe (Addr: 10.194.73.146:4647) (DC: pd0011-uat)"
Oct 19 09:58:51 uat-nomad-95I nomad[485]:     2022-10-19T09:58:51.926+0200 [ERROR] nomad: error looking up Nomad servers in Consul: error="server.nomad: unable to query Consul datacenters: Get \"https://127.0.0.1:8501/v1/catalog/datacenters\": dial tcp 127.0.0.1:8501: connect: connection refused"
Oct 19 09:58:53 uat-nomad-95I nomad[485]:     2022-10-19T09:58:53.180+0200 [WARN]  nomad.raft: no known peers, aborting election
Oct 19 09:59:02 uat-nomad-95I nomad[485]:     2022-10-19T09:59:02.001+0200 [ERROR] worker: failed to dequeue evaluation: worker_id=4375ac27-f83a-7f0c-710e-2c5a221e438d error="No cluster leader"
Oct 19 09:59:02 uat-nomad-95I nomad[485]:     2022-10-19T09:59:02.268+0200 [ERROR] worker: failed to dequeue evaluation: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab error="No cluster leader"
Oct 19 09:59:07 uat-nomad-95I nomad[485]:     2022-10-19T09:59:07.186+0200 [ERROR] worker: failed to dequeue evaluation: worker_id=4375ac27-f83a-7f0c-710e-2c5a221e438d error="No cluster leader"
Oct 19 09:59:07 uat-nomad-95I nomad[485]:     2022-10-19T09:59:07.441+0200 [ERROR] worker: failed to dequeue evaluation: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab error="No cluster leader"
Oct 19 09:59:12 uat-nomad-95I nomad[485]:     2022-10-19T09:59:12.601+0200 [ERROR] worker: failed to dequeue evaluation: worker_id=4375ac27-f83a-7f0c-710e-2c5a221e438d error="No cluster leader"
Oct 19 09:59:13 uat-nomad-95I nomad[485]:     2022-10-19T09:59:13.057+0200 [ERROR] worker: failed to dequeue evaluation: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab error="No cluster leader"
Oct 19 09:59:19 uat-nomad-95I nomad[485]:     2022-10-19T09:59:19.070+0200 [ERROR] worker: failed to dequeue evaluation: worker_id=4375ac27-f83a-7f0c-710e-2c5a221e438d error="No cluster leader"
Oct 19 09:59:19 uat-nomad-95I nomad[485]:     2022-10-19T09:59:19.350+0200 [ERROR] worker: failed to dequeue evaluation: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab error="No cluster leader"
Oct 19 09:59:29 uat-nomad-95I nomad[485]:     2022-10-19T09:59:29.259+0200 [ERROR] worker: failed to dequeue evaluation: worker_id=4375ac27-f83a-7f0c-710e-2c5a221e438d error="No cluster leader"
Oct 19 09:59:29 uat-nomad-95I nomad[485]:     2022-10-19T09:59:29.548+0200 [ERROR] worker: failed to dequeue evaluation: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab error="No cluster leader"
Oct 19 09:59:44 uat-nomad-95I nomad[485]:     2022-10-19T09:59:44.354+0200 [ERROR] worker: failed to dequeue evaluation: worker_id=4375ac27-f83a-7f0c-710e-2c5a221e438d error="No cluster leader"
Oct 19 09:59:44 uat-nomad-95I nomad[485]:     2022-10-19T09:59:44.787+0200 [ERROR] worker: failed to dequeue evaluation: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab error="No cluster leader"
Oct 19 09:59:48 uat-nomad-95I nomad[485]:     2022-10-19T09:59:48.942+0200 [INFO]  nomad: serf: EventMemberJoin: uat-nomad-ggK.europe 10.194.74.112
Oct 19 09:59:48 uat-nomad-95I nomad[485]:     2022-10-19T09:59:48.942+0200 [INFO]  nomad: serf: EventMemberJoin: uat-nomad-072.europe 10.194.75.246
Oct 19 09:59:48 uat-nomad-95I nomad[485]:     2022-10-19T09:59:48.942+0200 [INFO]  nomad: adding server: server="uat-nomad-ggK.europe (Addr: 10.194.74.112:4647) (DC: pd0011-uat)"
Oct 19 09:59:48 uat-nomad-95I nomad[485]:     2022-10-19T09:59:48.946+0200 [INFO]  nomad: successfully contacted Nomad servers: num_servers=2
Oct 19 09:59:48 uat-nomad-95I nomad[485]:     2022-10-19T09:59:48.950+0200 [WARN]  nomad.raft: failed to get previous log: previous-index=24648 last-index=0 error="log not found"
Oct 19 09:59:48 uat-nomad-95I nomad[485]:     2022-10-19T09:59:48.952+0200 [INFO]  nomad: disabling bootstrap mode because existing Raft peers being reported by peer: peer_name=uat-nomad-ggK.europe peer_address=10.194.74.112:4647
Oct 19 09:59:48 uat-nomad-95I nomad[485]:     2022-10-19T09:59:48.952+0200 [INFO]  nomad: adding server: server="uat-nomad-072.europe (Addr: 10.194.75.246:4647) (DC: pd0011-uat)"
Oct 19 09:59:48 uat-nomad-95I nomad[485]:     2022-10-19T09:59:48.956+0200 [INFO]  snapshot: creating new snapshot: path=/opt/hsy/nomad/data/server/raft/snapshots/4-24584-1666166388956.tmp
Oct 19 09:59:48 uat-nomad-95I nomad[485]:     2022-10-19T09:59:48.962+0200 [INFO]  nomad.raft: snapshot network transfer progress: read-bytes=394042 percent-complete="100.00%"
Oct 19 09:59:48 uat-nomad-95I nomad[485]:     2022-10-19T09:59:48.971+0200 [INFO]  nomad.raft: copied to local snapshot: bytes=394042
Oct 19 09:59:49 uat-nomad-95I nomad[485]:     2022-10-19T09:59:49.004+0200 [INFO]  nomad.raft: snapshot restore progress: id=4-24584-1666166388956 last-index=24584 last-term=4 size-in-bytes=394042 read-bytes=394042 percent-complete="100.00%"
Oct 19 09:59:49 uat-nomad-95I nomad[485]:     2022-10-19T09:59:49.004+0200 [INFO]  nomad.raft: Installed remote snapshot
Oct 19 10:03:22 uat-nomad-95I nomad[485]:     2022-10-19T10:03:22.569+0200 [INFO]  nomad: serf: EventMemberLeave: uat-nomad-ggK.europe 10.194.74.112
Oct 19 10:03:22 uat-nomad-95I nomad[485]:     2022-10-19T10:03:22.569+0200 [INFO]  nomad: removing server: server="uat-nomad-ggK.europe (Addr: 10.194.74.112:4647) (DC: pd0011-uat)"
Oct 19 10:04:25 uat-nomad-95I nomad[485]:     2022-10-19T10:04:25.357+0200 [INFO]  nomad: serf: EventMemberJoin: uat-nomad-Kxk.europe 10.194.74.253
Oct 19 10:04:25 uat-nomad-95I nomad[485]:     2022-10-19T10:04:25.357+0200 [INFO]  nomad: adding server: server="uat-nomad-Kxk.europe (Addr: 10.194.74.253:4647) (DC: pd0011-uat)"
Oct 19 10:08:45 uat-nomad-95I nomad[485]:     2022-10-19T10:08:45.357+0200 [ERROR] worker: failed to dequeue evaluation: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab error="rpc error: eval broker disabled"
Oct 19 10:08:45 uat-nomad-95I nomad[485]:     2022-10-19T10:08:45.357+0200 [ERROR] worker: failed to dequeue evaluation: worker_id=4375ac27-f83a-7f0c-710e-2c5a221e438d error="rpc error: eval broker disabled"
Oct 19 10:08:46 uat-nomad-95I nomad[485]:     2022-10-19T10:08:46.369+0200 [WARN]  nomad.raft: rejecting vote request since we have a leader: from=10.194.74.253:4647 leader=10.194.75.246:4647 leader-id=4d46ab0f-aaae-002a-f916-ee8670d327bc
Oct 19 10:08:46 uat-nomad-95I nomad[485]:     2022-10-19T10:08:46.544+0200 [INFO]  nomad: serf: EventMemberLeave: uat-nomad-072.europe 10.194.75.246
Oct 19 10:08:46 uat-nomad-95I nomad[485]:     2022-10-19T10:08:46.544+0200 [INFO]  nomad: removing server: server="uat-nomad-072.europe (Addr: 10.194.75.246:4647) (DC: pd0011-uat)"
Oct 19 10:08:46 uat-nomad-95I nomad[485]:     2022-10-19T10:08:46.918+0200 [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader-addr=10.194.75.246:4647 last-leader-id=4d46ab0f-aaae-002a-f916-ee8670d327bc
Oct 19 10:08:46 uat-nomad-95I nomad[485]:     2022-10-19T10:08:46.918+0200 [INFO]  nomad.raft: entering candidate state: node="Node at 10.194.73.146:4647 [Candidate]" term=7
Oct 19 10:08:48 uat-nomad-95I nomad[485]:     2022-10-19T10:08:48.046+0200 [ERROR] worker: failed to dequeue evaluation: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab error="rpc error: No cluster leader"
Oct 19 10:08:48 uat-nomad-95I nomad[485]:     2022-10-19T10:08:48.046+0200 [ERROR] worker: failed to dequeue evaluation: worker_id=4375ac27-f83a-7f0c-710e-2c5a221e438d error="rpc error: No cluster leader"
Oct 19 10:08:48 uat-nomad-95I nomad[485]:     2022-10-19T10:08:48.374+0200 [INFO]  nomad.raft: entering follower state: follower="Node at 10.194.73.146:4647 [Follower]" leader-address= leader-id=
Oct 19 10:09:54 uat-nomad-95I nomad[485]:     2022-10-19T10:09:54.850+0200 [INFO]  nomad: serf: EventMemberJoin: uat-nomad-tMd.europe 10.194.75.103
Oct 19 10:09:54 uat-nomad-95I nomad[485]:     2022-10-19T10:09:54.851+0200 [INFO]  nomad: adding server: server="uat-nomad-tMd.europe (Addr: 10.194.75.103:4647) (DC: pd0011-uat)"
Oct 20 09:16:38 uat-nomad-95I nomad[485]:     2022-10-20T09:16:38.333+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=4375ac27-f83a-7f0c-710e-2c5a221e438d eval_id=03b77c02-1121-9a63-0d64-48868310227c error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:16:38 uat-nomad-95I nomad[485]:     2022-10-20T09:16:38.335+0200 [ERROR] worker: error invoking scheduler: worker_id=4375ac27-f83a-7f0c-710e-2c5a221e438d error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:16:39 uat-nomad-95I nomad[485]:     2022-10-20T09:16:39.336+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=4375ac27-f83a-7f0c-710e-2c5a221e438d eval_id=9b8451a5-b344-6295-0556-1e863005b076 error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:16:39 uat-nomad-95I nomad[485]:     2022-10-20T09:16:39.336+0200 [ERROR] worker: error invoking scheduler: worker_id=4375ac27-f83a-7f0c-710e-2c5a221e438d error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:16:39 uat-nomad-95I nomad[485]:     2022-10-20T09:16:39.339+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab eval_id=03b77c02-1121-9a63-0d64-48868310227c error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:16:39 uat-nomad-95I nomad[485]:     2022-10-20T09:16:39.339+0200 [ERROR] worker: error invoking scheduler: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:16:59 uat-nomad-95I nomad[485]:     2022-10-20T09:16:59.339+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=4375ac27-f83a-7f0c-710e-2c5a221e438d eval_id=9b8451a5-b344-6295-0556-1e863005b076 error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:16:59 uat-nomad-95I nomad[485]:     2022-10-20T09:16:59.339+0200 [ERROR] worker: error invoking scheduler: worker_id=4375ac27-f83a-7f0c-710e-2c5a221e438d error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:16:59 uat-nomad-95I nomad[485]:     2022-10-20T09:16:59.342+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab eval_id=03b77c02-1121-9a63-0d64-48868310227c error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:16:59 uat-nomad-95I nomad[485]:     2022-10-20T09:16:59.342+0200 [ERROR] worker: error invoking scheduler: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:18:52 uat-nomad-95I nomad[485]:     2022-10-20T09:18:52.789+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab eval_id=11e43b96-6f71-51d7-5c1f-37118b07ea5e error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:18:52 uat-nomad-95I nomad[485]:     2022-10-20T09:18:52.789+0200 [ERROR] worker: error invoking scheduler: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:18:53 uat-nomad-95I nomad[485]:     2022-10-20T09:18:53.792+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=4375ac27-f83a-7f0c-710e-2c5a221e438d eval_id=11e43b96-6f71-51d7-5c1f-37118b07ea5e error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:18:53 uat-nomad-95I nomad[485]:     2022-10-20T09:18:53.792+0200 [ERROR] worker: error invoking scheduler: worker_id=4375ac27-f83a-7f0c-710e-2c5a221e438d error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:21:14 uat-nomad-95I nomad[485]:     2022-10-20T09:21:14.762+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=4375ac27-f83a-7f0c-710e-2c5a221e438d eval_id=bfe00218-ccb1-cd80-f4da-a3e8f66002c7 error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:21:14 uat-nomad-95I nomad[485]:     2022-10-20T09:21:14.762+0200 [ERROR] worker: error invoking scheduler: worker_id=4375ac27-f83a-7f0c-710e-2c5a221e438d error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:22:01 uat-nomad-95I nomad[485]:     2022-10-20T09:22:01.891+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab eval_id=f4324e9c-59f7-fed9-ef3a-139720072cb0 error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:22:01 uat-nomad-95I nomad[485]:     2022-10-20T09:22:01.891+0200 [ERROR] worker: error invoking scheduler: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:22:02 uat-nomad-95I nomad[485]:     2022-10-20T09:22:02.894+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab eval_id=f4324e9c-59f7-fed9-ef3a-139720072cb0 error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:22:02 uat-nomad-95I nomad[485]:     2022-10-20T09:22:02.894+0200 [ERROR] worker: error invoking scheduler: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:22:22 uat-nomad-95I nomad[485]:     2022-10-20T09:22:22.899+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab eval_id=f4324e9c-59f7-fed9-ef3a-139720072cb0 error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:22:22 uat-nomad-95I nomad[485]:     2022-10-20T09:22:22.899+0200 [ERROR] worker: error invoking scheduler: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:22:37 uat-nomad-95I nomad[485]:     2022-10-20T09:22:37.904+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab eval_id=023e409c-f644-7e0f-33d8-813623ce622e error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:22:37 uat-nomad-95I nomad[485]:     2022-10-20T09:22:37.904+0200 [ERROR] worker: error invoking scheduler: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:22:38 uat-nomad-95I nomad[485]:     2022-10-20T09:22:38.907+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab eval_id=023e409c-f644-7e0f-33d8-813623ce622e error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 09:22:38 uat-nomad-95I nomad[485]:     2022-10-20T09:22:38.907+0200 [ERROR] worker: error invoking scheduler: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"

After that it's basically "failed to submit plan for evaluation" until infinity.

> When you bring up the new AMI, are you bringing them all up at once?

Nope, that would be one at a time.

> When you bring up the new AMI, are you ensuring that all servers are current on their raft index before shutting down the old servers?

Hmm, I don't think it's actively ensured, but between server instance switches I think there's a healthy 5 minutes on a cluster with very few jobs...


tgross commented Oct 20, 2022

The total lack of keyring.replicator logs here is very suspicious to me. There's a trace log at the start of replication (which we won't see here, as your logs are at info level), but there should be a whole bunch of logs around failures if there's been a failure to fetch the key. Now I'm wondering if we might be stuck waiting on a mutex in the replication loop in encrypter.go#L419-L512. I've been trying to reproduce this bug, but it may be that my rig isn't hitting the right timing to hit the problematic lock -- so that goroutine dump would be extra useful if it's possible to get.
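
To make the suspected failure mode concrete, here's a purely illustrative Go sketch (invented names, not the actual encrypter code) of how holding a lock across a blocking RPC would both silence the replicator's logs and wedge key lookups:

package keyring

import "sync"

type encrypter struct {
	lock sync.RWMutex
	keys map[string][]byte
}

func (e *encrypter) replicateKey(keyID string, rpc func(string) ([]byte, error)) error {
	e.lock.Lock()
	defer e.lock.Unlock()
	// Suspected bug shape: this call can block indefinitely while the
	// write lock is held, so no failure is ever logged...
	key, err := rpc(keyID)
	if err != nil {
		return err
	}
	e.keys[keyID] = key
	return nil
}

// ...and every reader, e.g. the leader signing a workload identity while
// submitting a plan, blocks here forever, stalling the scheduler.
func (e *encrypter) getKey(keyID string) ([]byte, bool) {
	e.lock.RLock()
	defer e.lock.RUnlock()
	key, ok := e.keys[keyID]
	return key, ok
}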


bfqrst commented Oct 20, 2022

Okay @tgross, I sent SIGQUIT to the server process; the logs are on their way as we speak! Meanwhile, after the restart, we do see keyring.replicator activity:

Oct 20 17:06:53 uat-nomad-95I systemd[1]: Started "HashiCorp Nomad - A workload orchestration solution".
Oct 20 17:06:53 uat-nomad-95I nomad[5241]: WARNING: keyring exists but -encrypt given, using keyring
Oct 20 17:06:53 uat-nomad-95I nomad[5241]: ==> Loaded configuration from /opt/hsy/nomad/config/nomad_server.hcl
Oct 20 17:06:53 uat-nomad-95I nomad[5241]: ==> Starting Nomad agent...
Oct 20 17:06:54 uat-nomad-95I nomad[5241]: ==> Nomad agent configuration:
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:        Advertise Addrs: HTTP: 10.194.73.146:4646; RPC: 10.194.73.146:4647; Serf: 10.194.73.146:4648
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:             Bind Addrs: HTTP: [127.0.0.1:4646 10.194.73.146:4646]; RPC: 10.194.73.146:4647; Serf: 10.194.73.146:4648
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:                 Client: false
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:              Log Level: INFO
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:                 Region: europe (DC: pd0011-uat)
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:                 Server: true
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:                Version: 1.4.1
Oct 20 17:06:54 uat-nomad-95I nomad[5241]: ==> Nomad agent started! Log data will stream in below:
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:53.985+0200 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:53.985+0200 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:53.985+0200 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:53.985+0200 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:53.985+0200 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:53.994+0200 [INFO]  nomad: setting up raft bolt store: no_freelist_sync=false
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:53.999+0200 [INFO]  nomad.raft: starting restore from snapshot: id=4-24584-1666166388956 last-index=24584 last-term=4 size-in-bytes=394042
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:54.014+0200 [INFO]  nomad.raft: snapshot restore progress: id=4-24584-1666166388956 last-index=24584 last-term=4 size-in-bytes=394042 read-bytes=394042 percent-complete="100.00%"
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:54.015+0200 [INFO]  nomad.raft: restored from snapshot: id=4-24584-1666166388956 last-index=24584 last-term=4 size-in-bytes=394042
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:54.193+0200 [INFO]  nomad.raft: initial configuration: index=24750 servers="[{Suffrage:Voter ID:209b28b1-d4ba-111a-93b4-5433c76ca54d Address:10.194.73.146:4647} {Suffrage:Voter ID:29c95ced-78be-e7cc-2ca9-d24f4045cc2a Address:10.194.74.253:4647} {Suffrage:Voter ID:437dd9d3-c70a-640a-7bc3-b74c04a09bb0 Address:10.194.75.103:4647}]"
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:54.193+0200 [INFO]  nomad.raft: entering follower state: follower="Node at 10.194.73.146:4647 [Follower]" leader-address= leader-id=
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:54.194+0200 [INFO]  nomad: serf: EventMemberJoin: uat-nomad-95I.europe 10.194.73.146
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:54.194+0200 [INFO]  nomad: starting scheduling worker(s): num_workers=2 schedulers=["service", "batch", "system", "sysbatch", "_core"]
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:54.194+0200 [INFO]  nomad: started scheduling worker(s): num_workers=2 schedulers=["service", "batch", "system", "sysbatch", "_core"]
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:54.194+0200 [WARN]  agent: not registering Nomad HTTPS Health Check because verify_https_client enabled
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:54.194+0200 [INFO]  nomad: serf: Attempting re-join to previously known node: uat-nomad-Kxk.europe: 10.194.74.253:4648
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:54.194+0200 [INFO]  nomad: adding server: server="uat-nomad-95I.europe (Addr: 10.194.73.146:4647) (DC: pd0011-uat)"
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:54.196+0200 [INFO]  nomad: serf: EventMemberJoin: uat-nomad-tMd.europe 10.194.75.103
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:54.196+0200 [WARN]  nomad: memberlist: Refuting an alive message for 'uat-nomad-95I.europe' (10.194.73.146:4648) meta:([255 138 164 114 111 108 101 165 110 111 109 97 100 166 114 101 103 105 111 110 166 101 117 114 111 112 101 163 118 115 110 161 49 168 114 97 102 116 95 118 115 110 161 51 162 105 100 218 0 36 50 48 57 98 50 56 98 49 45 100 52 98 97 45 49 49 49 97 45 57 51 98 52 45 53 52 51 51 99 55 54 99 97 53 52 100 168 114 112 99 95 97 100 100 114 173 49 48 46 49 57 52 46 55 51 46 49 52 54 162 100 99 170 112 100 48 48 49 49 45 117 97 116 165 98 117 105 108 100 165 49 46 52 46 49 164 112 111 114 116 164 52 54 52 55 166 101 120 112 101 99 116 161 51] VS [255 138 168 114 97 102 116 95 118 115 110 161 51 165 98 117 105 108 100 165 49 46 52 46 49 163 118 115 110 161 49 162 100 99 170 112 100 48 48 49 49 45 117 97 116 162 105 100 218 0 36 50 48 57 98 50 56 98 49 45 100 52 98 97 45 49 49 49 97 45 57 51 98 52 45 53 52 51 51 99 55 54 99 97 53 52 100 168 114 112 99 95 97 100 100 114 173 49 48 46 49 57 52 46 55 51 46 49 52 54 164 112 111 114 116 164 52 54 52 55 166 101 120 112 101 99 116 161 51 164 114 111 108 101 165 110 111 109 97 100 166 114 101 103 105 111 110 166 101 117 114 111 112 101]), vsn:([1 5 2 2 5 4] VS [1 5 2 2 5 4])
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:54.197+0200 [INFO]  nomad: serf: EventMemberJoin: uat-nomad-Kxk.europe 10.194.74.253
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:54.197+0200 [INFO]  nomad: adding server: server="uat-nomad-tMd.europe (Addr: 10.194.75.103:4647) (DC: pd0011-uat)"
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:54.197+0200 [INFO]  nomad: adding server: server="uat-nomad-Kxk.europe (Addr: 10.194.74.253:4647) (DC: pd0011-uat)"
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:54.197+0200 [INFO]  nomad: serf: Re-joined to previously known node: uat-nomad-Kxk.europe: 10.194.74.253:4648
Oct 20 17:06:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:54.355+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=55231009-66eb-497c-2980-d6f2e02de791 error=<nil>
Oct 20 17:06:55 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:55.364+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:06:56 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:56.368+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:06:57 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:57.373+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:06:58 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:58.377+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:06:59 uat-nomad-95I nomad[5241]:     2022-10-20T17:06:59.382+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:00 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:00.386+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:01 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:01.391+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:02 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:02.395+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:03 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:03.400+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:04 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:04.405+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:05 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:05.409+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:05 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:05.864+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 eval_id=b8c5ba8b-a8c4-6339-d898-203aacac6835 error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:05 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:05.864+0200 [ERROR] worker: error invoking scheduler: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:06 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:06.414+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:06 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:06.867+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 eval_id=b8c5ba8b-a8c4-6339-d898-203aacac6835 error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:06 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:06.867+0200 [ERROR] worker: error invoking scheduler: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:07 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:07.418+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:08 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:08.422+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:09 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:09.428+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:10 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:10.432+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:11 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:11.436+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:12 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:12.441+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:13 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:13.446+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:14 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:14.450+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:15 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:15.454+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:16 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:16.459+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:17 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:17.464+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:18 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:18.468+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:18 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:18.890+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 eval_id=38e612e5-c8fd-8ed7-2609-c08a04647154 error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:18 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:18.890+0200 [ERROR] worker: error invoking scheduler: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:19 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:19.239+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=a2b9dd2d-8d9f-065e-b040-2d85951a7af0 eval_id=5816ec5a-9562-e659-018b-568fd995bcd5 error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:19 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:19.239+0200 [ERROR] worker: error invoking scheduler: worker_id=a2b9dd2d-8d9f-065e-b040-2d85951a7af0 error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:19 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:19.472+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:20 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:20.243+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=a2b9dd2d-8d9f-065e-b040-2d85951a7af0 eval_id=5816ec5a-9562-e659-018b-568fd995bcd5 error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:20 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:20.243+0200 [ERROR] worker: error invoking scheduler: worker_id=a2b9dd2d-8d9f-065e-b040-2d85951a7af0 error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:20 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:20.477+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:21 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:21.481+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:22 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:22.486+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:23 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:23.491+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:24 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:24.496+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:25 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:25.501+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:26 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:26.505+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:26 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:26.871+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 eval_id=b8c5ba8b-a8c4-6339-d898-203aacac6835 error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:26 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:26.871+0200 [ERROR] worker: error invoking scheduler: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:27 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:27.510+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:28 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:28.515+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:29 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:29.519+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:30 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:30.524+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:31 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:31.528+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:32 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:32.533+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:33 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:33.538+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:34 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:34.543+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:35 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:35.547+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:36 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:36.552+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:37 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:37.557+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:38 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:38.562+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:39 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:39.567+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:40 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:40.571+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:40 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:40.599+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=a2b9dd2d-8d9f-065e-b040-2d85951a7af0 eval_id=db52a78a-49e4-ce40-6b92-da6aad3a0246 error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:40 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:40.599+0200 [ERROR] worker: error invoking scheduler: worker_id=a2b9dd2d-8d9f-065e-b040-2d85951a7af0 error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:41 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:41.575+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:41 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:41.603+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=a2b9dd2d-8d9f-065e-b040-2d85951a7af0 eval_id=db52a78a-49e4-ce40-6b92-da6aad3a0246 error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:41 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:41.603+0200 [ERROR] worker: error invoking scheduler: worker_id=a2b9dd2d-8d9f-065e-b040-2d85951a7af0 error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:41 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:41.820+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 eval_id=0e089a81-ab92-d515-c1ae-cea06122b86a error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:41 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:41.820+0200 [ERROR] worker: error invoking scheduler: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:42 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:42.580+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:42 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:42.823+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 eval_id=0e089a81-ab92-d515-c1ae-cea06122b86a error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:42 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:42.823+0200 [ERROR] worker: error invoking scheduler: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:43 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:43.584+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:44 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:44.588+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:45 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:45.593+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:46 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:46.597+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:47 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:47.603+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:48 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:48.583+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 eval_id=5c0dbc14-26d7-2b25-2721-e0ae940768de error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:48 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:48.583+0200 [ERROR] worker: error invoking scheduler: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:48 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:48.607+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:49 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:49.587+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 eval_id=5c0dbc14-26d7-2b25-2721-e0ae940768de error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:49 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:49.587+0200 [ERROR] worker: error invoking scheduler: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:49 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:49.612+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:50 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:50.616+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:51 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:51.621+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:52 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:52.626+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:53 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:53.631+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:54 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:54.635+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:55 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:55.639+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:56 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:56.644+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:57 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:57.649+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:58 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:58.653+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:07:59 uat-nomad-95I nomad[5241]:     2022-10-20T17:07:59.657+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:00 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:00.662+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:01 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:01.606+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 eval_id=db52a78a-49e4-ce40-6b92-da6aad3a0246 error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:01 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:01.607+0200 [ERROR] worker: error invoking scheduler: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:01 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:01.666+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:01 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:01.962+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=a2b9dd2d-8d9f-065e-b040-2d85951a7af0 eval_id=d7d0cb60-47dd-0d97-1265-30e184e599ad error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:01 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:01.962+0200 [ERROR] worker: error invoking scheduler: worker_id=a2b9dd2d-8d9f-065e-b040-2d85951a7af0 error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:02 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:02.671+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:02 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:02.826+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=a2b9dd2d-8d9f-065e-b040-2d85951a7af0 eval_id=0e089a81-ab92-d515-c1ae-cea06122b86a error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:02 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:02.826+0200 [ERROR] worker: error invoking scheduler: worker_id=a2b9dd2d-8d9f-065e-b040-2d85951a7af0 error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:02 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:02.966+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 eval_id=d7d0cb60-47dd-0d97-1265-30e184e599ad error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:02 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:02.966+0200 [ERROR] worker: error invoking scheduler: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:03 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:03.676+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:03 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:03.836+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 eval_id=2e334c81-9285-afdf-9915-accbe180b489 error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:03 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:03.836+0200 [ERROR] worker: error invoking scheduler: worker_id=c5a91435-08a4-dbef-4513-99c91519d039 error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:04 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:04.680+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:05 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:05.685+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:06 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:06.689+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:07 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:07.697+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:08 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:08.702+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:09 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:09.706+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:10 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:10.711+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:11 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:11.715+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:12 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:12.719+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:13 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:13.723+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:14 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:14.728+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:15 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:15.732+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:16 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:16.737+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:17 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:17.741+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:18 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:18.745+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:19 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:19.750+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:20 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:20.754+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:21 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:21.759+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:22 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:22.764+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:23 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:23.769+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:24 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:24.186+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=a2b9dd2d-8d9f-065e-b040-2d85951a7af0 eval_id=5343aaa3-5def-cd73-2a25-cce84f9105f7 error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:24 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:24.186+0200 [ERROR] worker: error invoking scheduler: worker_id=a2b9dd2d-8d9f-065e-b040-2d85951a7af0 error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:24 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:24.774+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:25 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:25.190+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=a2b9dd2d-8d9f-065e-b040-2d85951a7af0 eval_id=5343aaa3-5def-cd73-2a25-cce84f9105f7 error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:25 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:25.190+0200 [ERROR] worker: error invoking scheduler: worker_id=a2b9dd2d-8d9f-065e-b040-2d85951a7af0 error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:25 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:25.778+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:26 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:26.782+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:27 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:27.787+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:28 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:28.792+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:29 uat-nomad-95I systemd[1]: Stopping "HashiCorp Nomad - A workload orchestration solution"...
Oct 20 17:08:29 uat-nomad-95I nomad[5241]: ==> Caught signal: interrupt
Oct 20 17:08:29 uat-nomad-95I nomad[5241]: ==> Gracefully shutting down agent...
Oct 20 17:08:29 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:29.541+0200 [INFO]  nomad: server starting leave
Oct 20 17:08:29 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:29.796+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:30 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:30.195+0200 [INFO]  nomad: serf: EventMemberLeave: uat-nomad-95I.europe 10.194.73.146
Oct 20 17:08:30 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:30.195+0200 [INFO]  nomad: removing server: server="uat-nomad-95I.europe (Addr: 10.194.73.146:4647) (DC: pd0011-uat)"
Oct 20 17:08:30 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:30.801+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:31 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:31.805+0200 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
Oct 20 17:08:32 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:32.245+0200 [INFO]  agent: requesting shutdown
Oct 20 17:08:32 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:32.245+0200 [INFO]  nomad: shutting down server
Oct 20 17:08:32 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:32.245+0200 [ERROR] worker: error waiting for Raft index: worker_id=a2b9dd2d-8d9f-065e-b040-2d85951a7af0 error="context canceled" index=31516
Oct 20 17:08:32 uat-nomad-95I nomad[5241]:     2022-10-20T17:08:32.294+0200 [INFO]  agent: shutdown complete
Oct 20 17:08:32 uat-nomad-95I systemd[1]: nomad.service: Deactivated successfully.
Oct 20 17:08:32 uat-nomad-95I systemd[1]: Stopped "HashiCorp Nomad - A workload orchestration solution".
Oct 20 17:08:32 uat-nomad-95I systemd[1]: nomad.service: Consumed 2.503s CPU time.

@tgross
Member

tgross commented Oct 20, 2022

From that goroutine dump I see a couple of important bits of information:

  • There is only one goroutine anywhere in encrypter.go, which means we're not deadlocked on any of the mutexes that the encrypter.go code owns.
  • That one goroutine is in select at encrypter.go#L438, which is where we rate-limit state store queries to 10/second so we don't hammer the state store. I'm noting that the select doesn't say something like [select, 2 minutes], which would indicate a long-lived select (it's parked in rate.go#L256):
goroutine 91 [select]:
runtime.gopark(0x40000d0cd0?, 0x2?, 0x98?, 0xb?, 0x40000d0c8c?)
        runtime/proc.go:363 +0xe4 fp=0x40000d0b10 sp=0x40000d0af0 pc=0x43f144
runtime.selectgo(0x40000d0cd0, 0x40000d0c88, 0x449f680?, 0x0, 0x65efc40b3946?, 0x1)
        runtime/select.go:328 +0x6b4 fp=0x40000d0c20 sp=0x40000d0b10 pc=0x44ffb4
golang.org/x/time/rate.(*Limiter).WaitN(0x40008ef9f0, {0x2edd2d8, 0x4000828280}, 0x1)
        golang.org/x/[email protected]/rate/rate.go:256 +0x2f4 fp=0x40000d0d90 sp=0x40000d0c20 pc=0xc37534
golang.org/x/time/rate.(*Limiter).Wait(...)
        golang.org/x/[email protected]/rate/rate.go:216
github.com/hashicorp/nomad/nomad.(*KeyringReplicator).run(0x400082a150, {0x2edd2d8, 0x4000828280})
        github.com/hashicorp/nomad/nomad/encrypter.go:438 +0x27c fp=0x40000d0fa0 sp=0x40000d0d90 pc=0x194cafc
github.com/hashicorp/nomad/nomad.NewKeyringReplicator.func1()
        github.com/hashicorp/nomad/nomad/encrypter.go:410 +0x30 fp=0x40000d0fd0 sp=0x40000d0fa0 pc=0x194c840
runtime.goexit()
        runtime/asm_arm64.s:1172 +0x4 fp=0x40000d0fd0 sp=0x40000d0fd0 pc=0x471174
created by github.com/hashicorp/nomad/nomad.NewKeyringReplicator
        github.com/hashicorp/nomad/nomad/encrypter.go:410 +0x180

Next step for me is to go dig into that rate limiting library and make sure we're not misusing it in some way. I'll report back here later today.
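For readers following along, the loop in question boils down to the standard golang.org/x/time/rate pattern. Here's a minimal sketch of a rate-limited replication loop in that shape (illustrative only, not Nomad's actual code):

package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func run(ctx context.Context) {
	// At most 10 state store queries per second, as described above.
	limiter := rate.NewLimiter(rate.Limit(10), 10)
	for {
		// Wait parks the goroutine in a select inside rate.go until the
		// limiter permits another event -- this is the select frame shown
		// in the goroutine dump.
		if err := limiter.Wait(ctx); err != nil {
			return // context canceled or deadline exceeded
		}
		fmt.Println("checking for missing keys at", time.Now())
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	run(ctx)
}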

@tgross
Member

tgross commented Oct 20, 2022

I've got a draft PR open with one fix I can see we need, but I don't think it's the cause here at all: #14987. Investigation continues!

@tgross
Member

tgross commented Oct 20, 2022

Just wanted to give a summary of where this is at. #14987 fixes some real bugs I've found while investigating, but I can't be 100% sure it's the fix we need until I can reproduce the issue, and so far I've been unable to. Not seeing that "failed to fetch key from any peer" log on servers until after the restart is weird, and it makes me hesitate to say we're done here.

I'm going to pick this up again tomorrow and consult with some of my colleagues again and see what we can come up with.

@bfqrst
Author

bfqrst commented Oct 20, 2022

Appreciate the update and the investigation! For the time being, as I have a non-working cluster, what's the best course of action? Since I thankfully don't use any of the new 1.4 features, I'm inclined to roll everything back to 1.3.6...

That should work, right?

@doppelc

doppelc commented Oct 21, 2022

nomad operator root keyring rotate solves the issue, although it seems it may re-appear if a leader election occurs.

@tgross
Member

tgross commented Oct 21, 2022

Since I thankfully don't use any of the new 1.4 features, I'm inclined to roll everything back to 1.3.6...
That should work, right?

Once any new raft log type has been written, you can't downgrade. That includes the keyring metadata in this case. 😦

nomad operator root keyring rotate solves the issue

Yes, that will force a rotation and re-replication of the key.

, although it seems it may re-appear if a leader election occurs.

Have you seen that happen without adding new servers, @doppelc? Because if so, that's a useful new data point for my investigation.

@bfqrst
Author

bfqrst commented Oct 21, 2022

Yeah, I had to redo the cluster in question; @doppelc's response came after I had started cleaning up.

It's kinda funny from a reproducibility standpoint, really: it literally hit me in each of my three stages when I went up to 1.4.x, and this time around is the fourth occurrence. At first I thought it was the bug that was in 1.4.0 and brushed it off...

For now, good to know that I'm able to help myself to some extent with the nomad operator root keyring rotate command.

@doppelc

doppelc commented Oct 21, 2022

Have you seen that happen without adding new servers? Because if so that's a useful new data point for my investigation.

Yes, it has happened in both scenarios.

@tgross tgross reopened this Oct 21, 2022
@tgross
Member

tgross commented Oct 21, 2022

Based on some feedback from the internal user, I've got a clue that the underlying issue here is actually related to a bug in garbage collection, which is why I wasn't able to reproduce it with fresh test clusters. I'm working on verifying that and I'm hoping to have a fix for that later this afternoon.

@tgross
Member

tgross commented Oct 21, 2022

The GC behavior fixes are in #15009. I still haven't been able to reproduce the reported bug exactly, but this at least explains where the extraneous keys were coming from: not from leader election but from misbehaving GC. That gives me a lot more confidence in the fix in #14987, even if I don't have an exact reproduction yet.

@bfqrst
Author

bfqrst commented Oct 21, 2022

I think the GC is triggered by default every 48 hours or so, but I'm in the habit of manually firing nomad system gc fairly regularly to clean up old stuff... While this doesn't prove anything in terms of repro, I think you might be on to something, @tgross...

@madsholden

I have the same issue. It happened right after the upgrade from 1.3.6 to 1.4.1 on two separate clusters, and it also recurs at seemingly random times, like the middle of the night yesterday.

@tgross
Member

tgross commented Oct 26, 2022

We've landed our fixes for this and that'll go out in Nomad 1.4.2, which I'm in the process of prepping. Expect this shortly, and thanks for your patience all!

@tgross
Member

tgross commented Oct 27, 2022

Several keyring fixes have shipped in Nomad 1.4.2 which should put this issue to rest. I'm going to close this out but please feel free to let us know here if you're encountering the issue again! Thanks!

@Fuco1
Contributor

Fuco1 commented Nov 6, 2022

What is the interim solution here? Downgrading to pre-1.4? Today this happened in our cluster, running 1.4.2, and jobs stopped being evaluated. I rotated the keys just as @HINT-SJ did, but it's obviously not an ideal production setup.

Can we turn this feature off somehow with settings?

@Great-Stone Great-Stone unpinned this issue Nov 7, 2022
@tgross tgross pinned this issue Nov 7, 2022
@tgross
Member

tgross commented Nov 7, 2022

Here's the current status on this issue:

  • We've landed a patch that will go out in Nomad 1.4.3, which should fix all the known problems.
  • However, our internal user is reporting that keys are never getting written to the on-disk keystore, even after a key rotation on a healthy and unchanging cluster (even on the leader!). The trouble is that we don't know why, because of a log-shipping issue in their staging environment. 😊
  • I'm hoping to get logs from them to assess what the "writing to disk" problem is (it may be specific to their machine image... I can't reproduce this anywhere else).

What is the interim solution here? Downgrading to pre-1.4? Today this happened in our cluster, running 1.4.2, and jobs stopped being evaluated. I rotated the keys just as @HINT-SJ but it's obviously not ideal production setup.

There are a number of interim solutions, in increasing order of complexity/annoyance:

  1. Rotate the keys after a leader election. Wait for Nomad 1.4.3 to come out.
  2. Downgrade to pre-1.4: note that a snapshot from 1.4.x can't be loaded into a 1.3.x cluster because of new raft log types, so you'll be rebuilding the cluster (which is probably ok if you've deployed to a staging environment first).
  3. Build yourself a Nomad off of main and run that: this would also help us validate the 1.4.3 patches.
  4. Build yourself a Nomad off of 1.4.2 with the alloc identity signing removed from the plan applier. Note: this will break template rendering of Nomad native service discovery and after 1.4.3 you'll need to reschedule all your workloads if you want them to have identities.

If you've got a staging/test environment where you can do (3) while you're doing (1), that'd be helpful too!

Can we turn this feature off somehow with settings?

The keyring is used to sign the workload identities that template blocks rely on, so it can't be turned off without breaking features like template rendering for Nomad native service discovery, and we didn't want to expose that knob. (Generally speaking, WIs are going to get used more widely in Nomad for future features, so it's better to keep this available and fix the bug rather than have it be something folks can turn off and then be unhappy when they don't get the features they expect.)

@Fuco1
Contributor

Fuco1 commented Nov 9, 2022

I did (2), killing the whole prod cluster :D 🤦‍♂️ It took me only about an hour to get it back up, so not a big deal. (1) unfortunately stopped working after about 4 or 5 rotations. I listed the keys and tried to remove the old ones, but I wasn't able to; the remove command did nothing. In the logs I saw a lot of messages about leaders trying to load one of the old (inactive) keys instead of the new active one.

We currently have only one Nomad deployment; I occasionally set up a simple dev cluster when playing with things, but in general no workloads run there, so it wouldn't test much.

For now I downgraded to the latest 1.3.x release, but man, I'm missing the 1.4.x UI improvements.

@seanamos

We've got Nomad 1.4.2 running in Dev and this just happened to us, effectively knocking the cluster out. No new or existing jobs can be scheduled and failed evaluations are piling up.

This is Dev so I've not scrubbed anything:

Nov 11 17:27:42 ip-11-0-5-202 nomad[1264]:     2022-11-11T17:27:42.857Z [ERROR] nomad: failed to submit plan: error="no such key \"8eaeb825-cdd4-eff5-2fe2-6a6bd8b6e922\" in keyring"
Nov 11 17:27:47 ip-11-0-5-202 nomad[1264]:     2022-11-11T17:27:47.865Z [ERROR] nomad: failed to submit plan: error="no such key \"8eaeb825-cdd4-eff5-2fe2-6a6bd8b6e922\" in keyring"
Nov 11 17:27:52 ip-11-0-5-202 nomad[1264]:     2022-11-11T17:27:52.886Z [ERROR] nomad: failed to submit plan: error="no such key \"8eaeb825-cdd4-eff5-2fe2-6a6bd8b6e922\" in keyring"
Nov 11 17:27:52 ip-11-0-5-202 nomad[1264]:     2022-11-11T17:27:52.886Z [ERROR] worker: failed to submit plan for evaluation: worker_id=65065deb-921e-6cc2-8dbd-b32e41374b12 eval_id=4cf60a5a-9aec-8102-5f8a-5afb18a5d365 error="no such key \"8eaeb825-cdd4-eff5-2fe2-6a6bd8b6e922\" in keyring"
Nov 11 17:27:52 ip-11-0-5-202 nomad[1264]:     2022-11-11T17:27:52.886Z [ERROR] worker: error invoking scheduler: worker_id=65065deb-921e-6cc2-8dbd-b32e41374b12 error="failed to process evaluation: no such key \"8eaeb825-cdd4-eff5-2fe2-6a6bd8b6e922\" in keyring"
Nov 11 17:27:52 ip-11-0-5-202 nomad[1264]:     2022-11-11T17:27:52.886Z [WARN]  nomad: eval reached delivery limit, marking as failed: eval="<Eval \"4cf60a5a-9aec-8102-5f8a-5afb18a5d365\" JobID: \"datadog-agent\" Namespace: \"default\">"
Nov 11 17:27:57 ip-11-0-5-202 nomad[1264]:     2022-11-11T17:27:57.893Z [ERROR] nomad: failed to submit plan: error="no such key \"8eaeb825-cdd4-eff5-2fe2-6a6bd8b6e922\" in keyring"
Nov 11 17:28:02 ip-11-0-5-202 nomad[1264]:     2022-11-11T17:28:02.907Z [ERROR] nomad: failed to submit plan: error="no such key \"8eaeb825-cdd4-eff5-2fe2-6a6bd8b6e922\" in keyring"
Nov 11 17:28:08 ip-11-0-5-202 nomad[1264]:     2022-11-11T17:28:08.920Z [ERROR] nomad: failed to submit plan: error="no such key \"8eaeb825-cdd4-eff5-2fe2-6a6bd8b6e922\" in keyring"
Nov 11 17:28:13 ip-11-0-5-202 nomad[1264]:     2022-11-11T17:28:13.927Z [ERROR] nomad: failed to submit plan: error="no such key \"8eaeb825-cdd4-eff5-2fe2-6a6bd8b6e922\" in keyring"
Nov 11 17:28:13 ip-11-0-5-202 nomad[1264]:     2022-11-11T17:28:13.927Z [ERROR] worker: failed to submit plan for evaluation: worker_id=65065deb-921e-6cc2-8dbd-b32e41374b12 eval_id=db1e0a4f-7901-5793-1c6a-857e2503af3f error="no such key \"8eaeb825-cdd4-eff5-2fe2-6a6bd8b6e922\" in keyring"
Nov 11 17:28:13 ip-11-0-5-202 nomad[1264]:     2022-11-11T17:28:13.927Z [ERROR] worker: error invoking scheduler: worker_id=65065deb-921e-6cc2-8dbd-b32e41374b12 error="failed to process evaluation: no such key \"8eaeb825-cdd4-eff5-2fe2-6a6bd8b6e922\" in keyring"
Nov 11 17:28:13 ip-11-0-5-202 nomad[1264]:     2022-11-11T17:28:13.927Z [WARN]  nomad: eval reached delivery limit, marking as failed: eval="<Eval \"db1e0a4f-7901-5793-1c6a-857e2503af3f\" JobID: \"ingress-edge\" Namespace: \"default\">"
Nov 11 17:28:19 ip-11-0-5-202 nomad[1264]:     2022-11-11T17:28:19.297Z [ERROR] nomad: failed to submit plan: error="no such key \"8eaeb825-cdd4-eff5-2fe2-6a6bd8b6e922\" in keyring"
Nov 11 17:28:24 ip-11-0-5-202 nomad[1264]:     2022-11-11T17:28:24.320Z [ERROR] nomad: failed to submit plan: error="no such key \"8eaeb825-cdd4-eff5-2fe2-6a6bd8b6e922\" in keyring"
Nov 11 17:28:24 ip-11-0-5-202 nomad[1264]:     2022-11-11T17:28:24.321Z [ERROR] worker: failed to submit plan for evaluation: worker_id=65065deb-921e-6cc2-8dbd-b32e41374b12 eval_id=ce62c7e2-1628-76ec-4fd4-6a8d6556cc92 error="no such key \"8eaeb825-cdd4-eff5-2fe2-6a6bd8b6e922\" in keyring"
Nov 11 17:28:24 ip-11-0-5-202 nomad[1264]:     2022-11-11T17:28:24.321Z [ERROR] worker: error invoking scheduler: worker_id=65065deb-921e-6cc2-8dbd-b32e41374b12 error="failed to process evaluation: no such key \"8eaeb825-cdd4-eff5-2fe2-6a6bd8b6e922\" in keyring"
Nov 11 17:28:24 ip-11-0-5-202 nomad[1264]:     2022-11-11T17:28:24.321Z [WARN]  nomad: eval reached delivery limit, marking as failed: eval="<Eval \"ce62c7e2-1628-76ec-4fd4-6a8d6556cc92\" JobID: \"account-api\" Namespace: \"default\">"
Nov 11 17:28:29 ip-11-0-5-202 nomad[1264]:     2022-11-11T17:28:29.345Z [ERROR] nomad: failed to submit plan: error="no such key \"8eaeb825-cdd4-eff5-2fe2-6a6bd8b6e922\" in keyring"
Nov 11 17:28:34 ip-11-0-5-202 nomad[1264]:     2022-11-11T17:28:34.355Z [ERROR] nomad: failed to submit plan: error="no such key \"8eaeb825-cdd4-eff5-2fe2-6a6bd8b6e922\" in keyring"

@tgross
Member

tgross commented Nov 11, 2022

@seanamos you didn't provide information on what led to this state (e.g. were new nodes added to the cluster, was there a leader election, or something else?). But it might help me out if you could run nomad operator debug -duration 1m -interval 10s -server-id all -max-nodes=1 -log-level trace and then, while that's running, do a nomad operator root keyring rotate. The resulting server logs in that bundle would be hugely valuable. And once you've run that, verify whether or not the key material is getting written to the keystore on disk at $datadir/server/keystore/*.nks.json

@seanamos

@tgross Unfortunately the timing of this couldn't be worse as I was busy reworking the monitoring of the Dev cluster today. I will gather what I can.

I have already done a nomad operator root keyring rotate to get deployments working again, so I don't think nomad operator debug will have much value.

It does look like the keystore is being updated after rotation.
The KeyID from nomad operator root keyring list -verbose matches the keystore:

nomad operator root keyring list -verbose
Key                                   State   Create Time
5a7d853d-35cd-e5da-5d25-b43f8004a2d5  active  2022-11-11T19:31:44+02:00

sudo cat keystore/5a7d853d-35cd-e5da-5d25-b43f8004a2d5.nks.json
{"Meta":{"KeyID":"5a7d853d-35cd-e5da-5d25-b43f8004a2d5","Algorithm":"aes256-gcm","CreateTime":1668187904426600174,"CreateIndex":0,"ModifyIndex":0,"State":"active"},"DEK":"WK0ct5Vtd5+IgqJB3a9bDVr4R0quY2/Zo6gkHlzqCfXPy1cyACqjcd0u1uN+9qfVxcbSbZYKOKdMIe8p","KEK":"b7JKABGfBau9rGW1oDFuMAW9aPLcp3BaWrOoGEAXoH0="}

@seanamos

While log collection wasn't working at the time, metrics gathering was:

[Screenshot: raft transaction time metrics, 2022-11-11 21:37:28]

There was a leadership election a couple of hours before the outage (or at least before we noticed it). The point where the raft transaction time starts climbing is where the outage was noticed; it drops back down after I ran the keyring rotation.

@tgross
Member

tgross commented Nov 11, 2022

Ok, if you're willing to try out a build from main as I've noted above, that could help us figure out whether the patch we have is safe to ship. I haven't been able to reproduce this problem otherwise.

@tgross
Member

tgross commented Nov 11, 2022

With the help of our internal user, I was finally able to reproduce this and even write a test that can cause it to happen consistently. PR is up at #15227. We're going to test this out internally (but please feel free to try that PR in your own acceptance environments!) early next week and hopefully be able to finally put this one to bed. Thanks for your patience folks!

@tgross
Member

tgross commented Nov 17, 2022

We've got #15227 merged and will be doing some additional testing internally before 1.4.3 ships. We can't give an exact date on that other than to say we're moving as quickly as feasible on it.

@chrismo

chrismo commented Nov 17, 2022

@tgross is there any sort of hacky fix before 1.4.3 is released? I'm seeing this in a staging environment that I can start over from scratch if I have to, but would rather not if there's a hammer I can bang it with.

@tgross
Member

tgross commented Nov 18, 2022

@chrismo there are two sets of workarounds, depending on where you're at, described in #14981 (comment) and #14981 (comment)

@tgross
Member

tgross commented Nov 21, 2022

Just a heads up that we're kicking off the process for releasing 1.4.3 shortly, so expect the patch soon. Thanks again for your patience with this, all!

@bfqrst
Author

bfqrst commented Nov 22, 2022

Hey @tgross, here's a quick report on how the update of the first cluster to v1.4.3 went... TL;DR: had to redo it!

Status quo:

Up until today, basically after server leader elections, I would check the clients tab in the GUI, and if the #alloc counter on any of the clients read 0, I would routinely issue nomad operator root keyring rotate and, voilà, a couple of minutes later allocs would trickle in... I would not always run nomad operator root keyring delete xxx, though...

At this point only one job has nomadVars going... So to recap:

  • servers and clients all on v1.4.2
  • 1 active key
  • probably couple of uncleaned, inactive keys
  • 1 job makes use of the WI feature (Nomad vars)

At this point I'm pack(er)ing new AMIs with v1.4.3 and terraforming the ASGs. First server goes -> okay, second server goes -> okay, third server goes -> the WI job is listed as pending. Server logs show something along the lines of couldn't replicate _some_inactive_key... At this point I rotate, gc a couple of times, and try to read the vars and put them back in, but the error still names some key (long since deleted, and inactive before that), to no avail... I decided to kill the state and bring the cluster back up cleanly with only v1.4.3 servers and clients.

Present hour:

Now I'm a bit reluctant to move on to the second cluster. Couple of thoughts here:

  • I'd issue a replicate to make sure there's a fresh active key
  • I'd make sure that there's no inactive keys dangling around
  • This cluster doesn't have a WI job like the last cluster so maybe this would just work?!

Sooo yeah, not optimal... Are there any pointers as to how we should go about updating this thing securely?

Cheers
Ralph

@tgross
Member

tgross commented Nov 22, 2022

Hi @bfqrst! Thanks for the feedback on that. Something that's a little unclear for me here is what happened when you replaced the 3rd server. You said "WI job listed as pending". Was that job's deployment in-flight, or did it get rescheduled for some reason after the 3rd server was replaced, or...?

My suspicion here is that the job had allocations with a WI signed by the missing key (or a variable encrypted by it), which wasn't around in the keystore.

In 1.4.3 we track a SigningKeyID so that we don't garbage collect keys that were used to sign a WI (although allocations created before 1.4.3 won't have that field set), but we also check whether the key was used to encrypt variables, which you probably do have. So if the 2nd cluster doesn't have any variables encrypted by a missing key, you should be ok. Otherwise you might want to rotate with the -full flag to make sure that all the variables are using a key that you're sure exists. And then make sure that all the existing nodes have an identical keystore in the server datadir.
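To make that concrete, the GC rule described above amounts to something like the following sketch (types and helper invented for illustration; this is not Nomad's actual code):

package main

import "fmt"

type Alloc struct{ SigningKeyID string }
type Variable struct{ KeyID string }

// keyEligibleForGC reports whether a root key may be garbage collected
// under the rule above: it must not have signed any live allocation's
// workload identity and must not have encrypted any stored variable.
// (Pre-1.4.3 allocations have an empty SigningKeyID, so they can't
// protect their key.)
func keyEligibleForGC(keyID string, allocs []Alloc, vars []Variable) bool {
	for _, a := range allocs {
		if a.SigningKeyID == keyID {
			return false
		}
	}
	for _, v := range vars {
		if v.KeyID == keyID {
			return false
		}
	}
	return true
}

func main() {
	allocs := []Alloc{{SigningKeyID: "key-a"}}
	vars := []Variable{{KeyID: "key-b"}}
	fmt.Println(keyEligibleForGC("key-a", allocs, vars)) // false: signed a live WI
	fmt.Println(keyEligibleForGC("key-c", allocs, vars)) // true: unused
}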

@tgross
Member

tgross commented Nov 22, 2022

As noted, 1.4.3 is out. We'll keep this issue pinned for a little while so that folks can see it if they're having upgrade issues. I'll keep an eye on this thread.

@bfqrst
Author

bfqrst commented Nov 22, 2022

Was that job's deployment in-flight, or did it get rescheduled for some reason after the 3rd server was replaced, or...?

No, it definitely wasn't in flight; it was already there, and it stopped the moment the last server went away, specifically complaining about a key that wasn't there anymore.

My suspicion here is that the job had allocations with a WI signed by the missing key (or a variable encrypted by it), which wasn't around in the keystore.

So this would be my suspicion too, although I would have thought that an inactive key is truly obsolete and can be safely removed without compromising already-signed vars. But I'm just wild-guessing and assuming here... I wasn't aware of the -full flag, so I think I'd issue that beforehand just as an insurance policy.

@bfqrst
Author

bfqrst commented Nov 23, 2022

...on to the next cluster today then...

  • gc'd the cluster
  • ran a rotate -full
  • ended up with most keys in rekeying which never finishes
  • one key in deactivated or invalid (can't remember the wording...)
  • deleted those keys
  • terraform'd the servers
  • this stage didn't have nomadVars so jobs remained scheduled, BUT the server log still complains about keys that no longer exist
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]: ==> Nomad agent configuration:
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:        Advertise Addrs: HTTP: <snip>:4646; RPC: <snip>:4647; Serf: <snip>:4648
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:             Bind Addrs: HTTP: [127.0.0.1:4646 <snip>:4646]; RPC: <snip>:4647; Serf:<snip>:4648
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:                 Client: false
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:              Log Level: INFO
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:                 Region: europe (DC: <snip>-dev)
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:                 Server: true
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:                Version: 1.4.3
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]: ==> Nomad agent started! Log data will stream in below:
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.529+0100 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.529+0100 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.529+0100 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.529+0100 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.529+0100 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.541+0100 [INFO]  nomad: setting up raft bolt store: no_freelist_sync=false
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.543+0100 [INFO]  nomad.raft: starting restore from snapshot: id=31-71375-1669208654546 last-index=71375 last-term=31 size-in-bytes=377390
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.557+0100 [INFO]  nomad.raft: snapshot restore progress: id=31-71375-1669208654546 last-index=71375 last-term=31 size-in-bytes=377390 read-bytes=377390 percent-complete="100.00%"
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.557+0100 [INFO]  nomad.raft: restored from snapshot: id=31-71375-1669208654546 last-index=71375 last-term=31 size-in-bytes=377390
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.613+0100 [INFO]  nomad.raft: initial configuration: index=77308 servers="[{Suffrage:Voter ID:87be3427-4801-68b3-aca1-1caf1d6d935e Address:10.194.78.168:4647} {Suffrage:Voter ID:19caa1a6-48b3-5063-ab8b-022aa1329811 Address:10.194.77.197:4647}]"
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.613+0100 [INFO]  nomad.raft: entering follower state: follower="Node at 10.194.79.251:4647 [Follower]" leader-address= leader-id=
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.613+0100 [INFO]  nomad: serf: Ignoring previous leave in snapshot
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.613+0100 [INFO]  nomad: serf: Ignoring previous leave in snapshot
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.613+0100 [INFO]  nomad: serf: Ignoring previous leave in snapshot
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.614+0100 [INFO]  nomad: serf: EventMemberJoin: dev-nomad-tYZ.europe 10.194.79.251
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.614+0100 [INFO]  nomad: starting scheduling worker(s): num_workers=2 schedulers=["service", "batch", "system", "sysbatch", "_core"]
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.614+0100 [INFO]  nomad: started scheduling worker(s): num_workers=2 schedulers=["service", "batch", "system", "sysbatch", "_core"]
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.614+0100 [WARN]  agent: not registering Nomad HTTPS Health Check because verify_https_client enabled
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.614+0100 [INFO]  nomad: serf: Attempting re-join to previously known node: dev-nomad-T3z.europe: 10.194.77.197:4648
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.614+0100 [INFO]  nomad: adding server: server="dev-nomad-tYZ.europe (Addr: 10.194.79.251:4647) (DC: pd0011-dev)"
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.616+0100 [INFO]  nomad: serf: EventMemberJoin: dev-nomad-CZ4.europe 10.194.78.168
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.616+0100 [WARN]  nomad: memberlist: Refuting a dead message (from: dev-nomad-tYZ.europe)
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.617+0100 [INFO]  nomad: serf: EventMemberJoin: dev-nomad-T3z.europe 10.194.77.197
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.617+0100 [INFO]  nomad: serf: Re-joined to previously known node: dev-nomad-T3z.europe: 10.194.77.197:4648
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.617+0100 [INFO]  nomad: adding server: server="dev-nomad-CZ4.europe (Addr: 10.194.78.168:4647) (DC: pd0011-dev)"
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.617+0100 [INFO]  nomad: adding server: server="dev-nomad-T3z.europe (Addr: 10.194.77.197:4647) (DC: pd0011-dev)"
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.623+0100 [INFO]  nomad: successfully contacted Nomad servers: num_servers=2
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.732+0100 [WARN]  nomad.raft: failed to get previous log: previous-index=77309 last-index=77308 error="log not found"
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.788+0100 [WARN]  nomad.keyring.replicator: failed to fetch key from current leader, trying peers: key=19b1867f-1f0f-a4e1-962f-0974918b3a78 error=<nil>
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.811+0100 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=19b1867f-1f0f-a4e1-962f-0974918b3a78 error="rpc error: no such key \"19b1867f-1f0f-a4e1-962f-0974918b3a78\" in keyring"
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.811+0100 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: rpc error: no such key "19b1867f-1f0f-a4e1-962f-0974918b3a78" in keyring: key=19b1867f-1f0f-a4e1-962f-0974918b3a78
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.814+0100 [WARN]  nomad.keyring.replicator: failed to fetch key from current leader, trying peers: key=3d60d705-4b08-8353-cacc-a4c520dbe2ad error=<nil>
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.819+0100 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=3d60d705-4b08-8353-cacc-a4c520dbe2ad error=<nil>
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.819+0100 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: <nil>: key=3d60d705-4b08-8353-cacc-a4c520dbe2ad
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.821+0100 [WARN]  nomad.keyring.replicator: failed to fetch key from current leader, trying peers: key=6057518c-ec0a-af9c-6104-ad5dab3be510 error=<nil>
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.828+0100 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=6057518c-ec0a-af9c-6104-ad5dab3be510 error=<nil>
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.828+0100 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: <nil>: key=6057518c-ec0a-af9c-6104-ad5dab3be510
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.830+0100 [WARN]  nomad.keyring.replicator: failed to fetch key from current leader, trying peers: key=19b1867f-1f0f-a4e1-962f-0974918b3a78 error=<nil>
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.832+0100 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=19b1867f-1f0f-a4e1-962f-0974918b3a78 error=<nil>
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.833+0100 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: <nil>: key=19b1867f-1f0f-a4e1-962f-0974918b3a78
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.834+0100 [WARN]  nomad.keyring.replicator: failed to fetch key from current leader, trying peers: key=2eb3eed5-90a6-c7ba-25c0-65669b53043a error=<nil>
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.836+0100 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=2eb3eed5-90a6-c7ba-25c0-65669b53043a error=<nil>
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.836+0100 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: <nil>: key=2eb3eed5-90a6-c7ba-25c0-65669b53043a
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.837+0100 [WARN]  nomad.keyring.replicator: failed to fetch key from current leader, trying peers: key=3d60d705-4b08-8353-cacc-a4c520dbe2ad error=<nil>
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.838+0100 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=3d60d705-4b08-8353-cacc-a4c520dbe2ad error=<nil>
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.839+0100 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: <nil>: key=3d60d705-4b08-8353-cacc-a4c520dbe2ad
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.839+0100 [WARN]  nomad.keyring.replicator: failed to fetch key from current leader, trying peers: key=6057518c-ec0a-af9c-6104-ad5dab3be510 error=<nil>
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.841+0100 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=6057518c-ec0a-af9c-6104-ad5dab3be510 error=<nil>
Nov 23 16:07:01 dev-nomad-tYZ nomad[1353]:     2022-11-23T16:07:01.841+0100 [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: <nil>: key=6057518c-ec0a-af9c-6104-ad5dab3be510

These are the keys:

nomad operator root keyring list -verbose
Key                                   State     Create Time
b2d887f8-399f-e7f1-3d2c-5f35b47adca2  active    2022-11-23T14:10:29+01:00
e6e0b12f-8a0d-0aaf-6653-335d10d27778  rekeying  2022-11-23T13:50:26+01:00

Result:

  • cluster seems functional, I'm able to put and get vars, so okay I guess
  • server log still looks kinda polluted (see replicator entries above) which worries me somewhat...

EDIT: keep in mind this was a native v1.4.2 cluster, so the state may suffer from the poisoning mentioned in this thread. As far as I've seen, the redone cluster from yesterday looks clean. But I wouldn't know what might happen if one were to start rotating keys and such...

Sorry if I'm overly chatty on this matter, but I'm trying to keep others from messing up their clusters... Let me know if there's something I can assist with.

Cheers

@tgross
Member

tgross commented Nov 23, 2022

I'm sorry @bfqrst, I'm realizing I gave you bad advice to do a rekey. If you have missing keys, a rekey will never work, because we need to decrypt each Variable and then re-encrypt it, and we obviously can't decrypt a Variable if we're missing the key we used to encrypt it!
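To spell that out: a full rekey has to walk every variable, decrypt it with the key that originally encrypted it, and re-encrypt it under the new active key, so a single missing key stalls the whole pass. A hedged sketch with invented helpers (not Nomad's API):

package main

import "fmt"

type Variable struct {
	KeyID      string
	Ciphertext []byte
}

// decrypt/encrypt stand in for the real AES-GCM operations.
func decrypt(key, ct []byte) ([]byte, error) { return ct, nil }
func encrypt(key, pt []byte) ([]byte, error) { return pt, nil }

// rekeyVariable re-encrypts one variable under the new active key. If the
// key that originally encrypted it is gone, there's nothing to decrypt
// with, and the rekey can never complete.
func rekeyVariable(v *Variable, keys map[string][]byte, newKeyID string) error {
	oldKey, ok := keys[v.KeyID]
	if !ok {
		return fmt.Errorf("no such key %q in keyring", v.KeyID)
	}
	pt, err := decrypt(oldKey, v.Ciphertext)
	if err != nil {
		return err
	}
	ct, err := encrypt(keys[newKeyID], pt)
	if err != nil {
		return err
	}
	v.Ciphertext, v.KeyID = ct, newKeyID
	return nil
}

func main() {
	keys := map[string][]byte{"new-key": []byte("k2")} // the old key is missing
	v := &Variable{KeyID: "old-key", Ciphertext: []byte("...")}
	fmt.Println(rekeyVariable(v, keys, "new-key")) // no such key "old-key" in keyring
}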

server log still looks kinda polluted (see replicator entries above) which worries me somewhat...

If you've got key metadata for key material that no longer exists anywhere on the cluster, there's no path to recovering that key or any variables/allocations that need it. So you should be able to do nomad operator root keyring remove e6e0b12f-8a0d-0aaf-6653-335d10d27778 for that old key stuck in rekeying. I'm not sure why keys other than e6e0b12f show up in the logs, though. Are these logs from the new machines, after the keys had been removed?

Your active key b2d887f8 doesn't show up in the logs though, which is good. Just to belt-and-suspenders ourselves here, I would verify that the key file exists in $datadir/server/keystore/b2d887f8-399f-e7f1-3d2c-5f35b47adca2.nks.json.
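If you want to script that check, something like the following lists the key IDs present on disk so you can compare them against nomad operator root keyring list (a sketch based only on the .nks.json structure shown earlier in this thread; the real format may have more fields):

package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// keystoreFile mirrors just the fields we need from the .nks.json sample
// shown earlier in this thread.
type keystoreFile struct {
	Meta struct {
		KeyID string
		State string
	}
}

func main() {
	dir := os.Args[1] // e.g. /opt/nomad/data/server/keystore
	paths, err := filepath.Glob(filepath.Join(dir, "*.nks.json"))
	if err != nil {
		panic(err)
	}
	for _, p := range paths {
		raw, err := os.ReadFile(p)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		var kf keystoreFile
		if err := json.Unmarshal(raw, &kf); err != nil {
			fmt.Fprintln(os.Stderr, p, err)
			continue
		}
		fmt.Printf("%s  state=%s\n", kf.Meta.KeyID, kf.Meta.State)
	}
}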

Sorry if I'm overly chatty on this matter, but I am trying to keep others from messing up their clusters... Let me know if I there's something I can assist with.

Not at all, this is all helpful! We screwed up the initial implementation, and since this is a stateful system, simply upgrading won't fix everything, so I'm glad we're chatting through recovery here in a way that other folks can use.

@bfqrst
Author

bfqrst commented Nov 23, 2022

I'm not sure why there are keys in the logs other than e6e0b12f though. Are these logs from the new machines after the keys have been removed?

That is correct, yes, the log is from a new v1.4.3 server cluster member...

Just to belt-and-suspenders ourselves here, I would verify that the key file exists in $datadir/server/keystore/b2d887f8-399f-e7f1-3d2c-5f35b47adca2.nks.json

And this file exists, so thumbs up! BUT there is a second file present even after deleting the key that was stuck in rekeying AND running a system gc...

/opt/hsy/nomad/data/server/keystore# ll
total 16
drwx------ 2 nomad nomad 4096 Nov 23 14:10 ./
drwxr-xr-x 5 nomad nomad 4096 Nov 23 14:04 ../
-rw------- 1 nomad nomad  314 Nov 23 16:07 b2d887f8-399f-e7f1-3d2c-5f35b47adca2.nks.json
-rw------- 1 nomad nomad  314 Nov 23 16:07 e6e0b12f-8a0d-0aaf-6653-335d10d27778.nks.json

@shoenig shoenig unpinned this issue Dec 20, 2022
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 24, 2023