Connect jobs failing in 12.0 #8423

Closed
dclfan opened this issue Jul 12, 2020 · 25 comments · Fixed by #9299 or #9356

@dclfan

dclfan commented Jul 12, 2020

When I upgraded my cluster to 12.0, all of my Connect jobs began to fail. I started messing around with syntax of the job definition, but couldn't get it to work.

Nomad Version

Nomad v0.12.0 (8f7fbc8)

OS

CentOS 7 and Ubuntu 20.04

Issue

After upgrade to nomad 12.0, allocations fail because of a "missing network" constraint.

Reproduction Steps

To take any environmental variables out of the mix, I followed the steps outlined at https://www.nomadproject.io/docs/integrations/consul-connect exactly. The only difference between the two runs was the version of the Nomad binary running in -dev-connect mode.

"nomad job plan connect.nomad" - Nomad v0.11.3 (8918fc8)

  • Job: "countdash"

  • Task Group: "api" (1 create)

    • Task: "connect-proxy-count-api" (forces create)
    • Task: "web" (forces create)
  • Task Group: "dashboard" (1 create)

    • Task: "connect-proxy-count-dashboard" (forces create)
    • Task: "dashboard" (forces create)

Scheduler dry-run:

  • All tasks successfully allocated.

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 connect.nomad -

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
For reporting security vulnerabilities please refer to the website.

"nomad job plan connect.nomad" - Nomad v0.12.0 (8f7fbc8)

  • Job: "countdash"

  • Task Group: "api" (1 create)

    • Task: "connect-proxy-count-api" (forces create)
    • Task: "web" (forces create)
  • Task Group: "dashboard" (1 create)

    • Task: "connect-proxy-count-dashboard" (forces create)
    • Task: "dashboard" (forces create)

Scheduler dry-run:

  • WARNING: Failed to place all allocations.
    Task Group "api" (failed to place 1 allocation):

    • Constraint "missing network": 1 nodes excluded by filter

    Task Group "dashboard" (failed to place 1 allocation):

    • Constraint "missing network": 1 nodes excluded by filter

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 connect.nomad

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

@dclfan dclfan changed the title Connect Jobs failing in 12.0 Connect jobs failing in 12.0 Jul 12, 2020
@nickethier
Member

Hey @t0nyhays, I think I fixed this in #8407, sorry for the issue. We should be cutting a bug fix release soon.

@spuder
Contributor

spuder commented Jul 13, 2020

I am also seeing this issue: Constraint "missing network" filtered 2 nodes. This is a major bug that blocks all jobs; unfortunately, we are going to have to roll back to 0.11.x until this is fixed.

@shoenig shoenig added type/bug theme/consul/connect Consul Connect integration labels Jul 13, 2020
@shoenig
Contributor

shoenig commented Jul 13, 2020

I believe it should also be possible to workaround this by manually configuring network_interface on the Nomad Client. Before, Nomad leaned hard on the simple network model for the Connect integration. With multi-interface networking in Nomad v0.12+ we missed an edge case here where we don't detect an interface with multiple default routes as usable by default (hence the scheduling error).

@spuder
Contributor

spuder commented Jul 13, 2020

I'm running 0.12 on the Nomad servers, but still running 0.11 on the clients/agents. I tried adding the following to the client/agent config with no change in behavior. I suspect that running different versions on the servers and the agents invalidates the workaround.

"client": {
  "network_interface": "ens3"
}

@dclfan
Author

dclfan commented Jul 13, 2020

I pushed out 0.12 to all servers (3) and all clients (10) and added "client": { "network_interface": "bond0" } to all clients. A job without the connect stanza worked fine. A job with the connect stanza failed:
'network: no addresses availale for "" network':10. The typo ("availale") is in the error message itself, not a copy-paste mistake.

@dclfan
Author

dclfan commented Jul 13, 2020

2nd workaround test:
On a single node I added:
"client":{"enabled":true,"host_network":[{"public":[{"cidr":"x.x.x.x/23","reserved_ports":"22,80"}]}],"meta":[{"rack":"D1","workload":"generic"}],"network_interface":"bond0"}

Restarted nomad.

Added host_network to job description:

network {
  mode = "bridge"
  port "grafana" {
    to = 3000
    host_network = "public"
  }
}

Nomad job plan failed again, this time with:
{'NodesEvaluated': 10, 'NodesFiltered': 9, 'NodesAvailable': {'sim': 0, 'dc1': 0, 'stage': 0, 'prod': 10}, 'ClassFiltered': None, 'ConstraintFiltered': {'missing host network "public" for port "grafana"': 9}, 'NodesExhausted': 1, 'ClassExhausted': None, 'DimensionExhausted': {'network: no addresses availale for "" network': 1}

Does this help at all? It makes me think it was picking up the default correctly, since now that the job description has changed, it clearly knows the node has the named network. I will probably set back to 11.3 unless you can think of another potential fix, other than the bug fix roll.
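The two failure buckets in that plan output can be read separately: `ConstraintFiltered` counts nodes rejected by a feasibility check, and `DimensionExhausted` counts nodes that passed the checks but ran out of a resource. A minimal Python sketch, assuming the metrics arrive as a plain dictionary shaped like the output above; `summarize_placement_failure` is a hypothetical helper, not part of any Nomad client library:

```python
from typing import Dict, List


def summarize_placement_failure(metrics: Dict) -> List[str]:
    """Return one human-readable line per reason a node was rejected."""
    lines = []
    # Nodes rejected outright by a feasibility check (e.g. a missing host network).
    for reason, count in (metrics.get("ConstraintFiltered") or {}).items():
        lines.append(f"filtered: {reason} ({count} nodes)")
    # Nodes that passed the filters but ran out of a resource dimension.
    for reason, count in (metrics.get("DimensionExhausted") or {}).items():
        lines.append(f"exhausted: {reason} ({count} nodes)")
    return lines


# Values copied from the plan output quoted above (typo in the message included).
metrics = {
    "NodesEvaluated": 10,
    "NodesFiltered": 9,
    "ConstraintFiltered": {'missing host network "public" for port "grafana"': 9},
    "NodesExhausted": 1,
    "DimensionExhausted": {'network: no addresses availale for "" network': 1},
}

for line in summarize_placement_failure(metrics):
    print(line)
```

Read this way, the nine filtered nodes are the ones without the "public" host network, and the single node that has it is the one reporting the address exhaustion.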

@spuder
Contributor

spuder commented Jul 14, 2020

Is there an ETA for when this will be released? 0.12 is completely unusable and the workarounds aren't viable.

@shoenig
Contributor

shoenig commented Jul 15, 2020

Yeah sorry we don't have an ETA for rolling out the fix - it might be sooner than later since we're coordinating around the recent Go security fix but nothing exact. Definitely hold off for now if you can.

In the meantime I'm trying to incorporate this condition in some tests to make sure we don't run into something like this again.

@t0nyhays do you mind sharing the output of
curl -s "localhost:4646/v1/node/<nodeID>" | jq '.NodeResources | {Networks, NodeNetworks}' (with Nomad 0.12)? Feel free to redact IP addresses, but label them as public or private - so we can make sure there isn't something else funny going on.

@dclfan
Author

dclfan commented Jul 15, 2020

@shoenig Sorry but I set my cluster back to 0.11.3 and have a deadline to stand up some new services. I will potentially have some time next week to get you the output. Thanks for working on this one!

@chrisboulton

chrisboulton commented Jul 21, 2020

@shoenig If it's helpful, we just slammed into this and here's the output of the above - right now we don't have any of the workarounds that @t0nyhays attempted in place (should that be specifically what you're looking for).

{
  "Networks": [
    {
      "Mode": "host",
      "Device": "ens4",
      "CIDR": "10.133.0.51/32",
      "IP": "10.133.0.51",
      "MBits": 1000,
      "DNS": null,
      "ReservedPorts": null,
      "DynamicPorts": null
    }
  ],
  "NodeNetworks": [
    {
      "Mode": "host",
      "Device": "ens4",
      "MacAddress": "42:01:0a:85:00:33",
      "Speed": 1000,
      "Addresses": [
        {
          "Family": "ipv4",
          "Alias": "default",
          "Address": "10.133.0.51",
          "ReservedPorts": "",
          "Gateway": ""
        }
      ]
    }
  ]
}

Interestingly, on my local test cluster, this seems more correct:

{
  "Networks": [
    {
      "Mode": "bridge",
      "Device": "",
      "CIDR": "",
      "IP": "",
      "MBits": 0,
      "DNS": null,
      "ReservedPorts": null,
      "DynamicPorts": null
    },
    {
      "Mode": "host",
      "Device": "ens4",
      "CIDR": "10.128.0.11/32",
      "IP": "10.128.0.11",
      "MBits": 1000,
      "DNS": null,
      "ReservedPorts": null,
      "DynamicPorts": null
    }
  ],
  "NodeNetworks": [
    {
      "Mode": "bridge",
      "Device": "",
      "MacAddress": "",
      "Speed": 0,
      "Addresses": null
    },
    {
      "Mode": "host",
      "Device": "ens4",
      "MacAddress": "42:01:0a:80:00:0b",
      "Speed": 1000,
      "Addresses": [
        {
          "Family": "ipv4",
          "Alias": "default",
          "Address": "10.128.0.11",
          "ReservedPorts": "",
          "Gateway": ""
        }
      ]
    }
  ]
}
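The visible difference between the two outputs is the extra entry with `"Mode": "bridge"`. A small Python sketch of that check, assuming the JSON from the `curl ... | jq` command has already been parsed into a dictionary; `advertises_bridge` is a hypothetical helper name, and the sample data is trimmed from the outputs above:

```python
from typing import Dict


def advertises_bridge(node_resources: Dict) -> bool:
    """True if any entry in NodeNetworks reports Mode == "bridge"."""
    networks = node_resources.get("NodeNetworks") or []
    return any(net.get("Mode") == "bridge" for net in networks)


# Trimmed copies of the two outputs above.
first_output = {"NodeNetworks": [{"Mode": "host", "Device": "ens4"}]}
second_output = {
    "NodeNetworks": [
        {"Mode": "bridge", "Device": ""},
        {"Mode": "host", "Device": "ens4"},
    ]
}

print(advertises_bridge(first_output))
print(advertises_bridge(second_output))
```

Running something like this against every node would be one quick way to list which clients a bridge-mode job could land on.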

Immediately, I can spot a difference in the configuration we use in our real environments:

client {
  enabled = true
  network_interface = "ens4" # i don't have this on my test instance (though in our development environment, we seem to have this and it also reports a network block like the above)
...
}

^ Update, tried removing the network interface configuration, it made no difference. I also suspected this could be Open CNI plugin related, but between all these instances the plugins are there and are in similar locations.

We thought this might also have something to do with the fingerprinting which was reworked in #7518 and tweaked in #8208, but everything seems OK:

bridge 188416 1 br_netfilter, Live 0xffffffffc040f000
stp 16384 1 bridge, Live 0xffffffffc040a000
llc 16384 2 bridge,stp, Live 0xffffffffc0401000

Debug mode startup logs:

 ==> Nomad agent started! Log data will stream in below:
     2020-07-21T17:07:46.318Z [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/var/lib/nomad/data/plugins
     2020-07-21T17:07:46.319Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/var/lib/nomad/data/plugins
     2020-07-21T17:07:46.319Z [INFO]  agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
     2020-07-21T17:07:46.319Z [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
     2020-07-21T17:07:46.319Z [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
     2020-07-21T17:07:46.319Z [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
     2020-07-21T17:07:46.319Z [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
     2020-07-21T17:07:46.319Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
     2020-07-21T17:07:46.319Z [INFO]  client: using state directory: state_dir=/var/lib/nomad/data/client
     2020-07-21T17:07:46.319Z [INFO]  client: using alloc directory: alloc_dir=/var/lib/nomad/data/alloc
     2020-07-21T17:07:46.355Z [DEBUG] client.fingerprint_mgr: built-in fingerprints: fingerprinters=[arch, bridge, cgroup, cni, consul, cpu, host, memory, network, nomad, signal, storage, vault, env_aws, env_gce]
     2020-07-21T17:07:46.355Z [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
     2020-07-21T17:07:46.355Z [DEBUG] client.fingerprint_mgr: CNI config dir is not set or does not exist, skipping: cni_config_dir=
     2020-07-21T17:07:46.519Z [DEBUG] client: updated allocations: index=10508121 total=0 pulled=0 filtered=0
     2020-07-21T17:07:46.355Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=cgroup period=15s
     2020-07-21T17:07:46.358Z [INFO]  client.fingerprint_mgr.consul: consul agent is available
     2020-07-21T17:07:46.358Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=consul period=15s
     2020-07-21T17:07:46.359Z [DEBUG] client.fingerprint_mgr.cpu: detected cpu frequency: MHz=2300
     2020-07-21T17:07:46.359Z [DEBUG] client.fingerprint_mgr.cpu: detected core count: cores=12
     2020-07-21T17:07:46.396Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/sbin/ethtool device=ens4
     2020-07-21T17:07:46.396Z [DEBUG] client.fingerprint_mgr.network: unable to parse link speed: path=/sys/class/net/ens4/speed
     2020-07-21T17:07:46.396Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected and no speed specified by user, falling back to default speed: mbits=1000
     2020-07-21T17:07:46.396Z [DEBUG] client.fingerprint_mgr.network: detected interface IP: interface=ens4 IP=10.133.0.41
     2020-07-21T17:07:46.397Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/sbin/ethtool device=lo
     2020-07-21T17:07:46.397Z [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/lo/speed
     2020-07-21T17:07:46.397Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: mbits=1000
     2020-07-21T17:07:46.399Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/sbin/ethtool device=ens4
     2020-07-21T17:07:46.399Z [DEBUG] client.fingerprint_mgr.network: unable to parse link speed: path=/sys/class/net/ens4/speed
     2020-07-21T17:07:46.399Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: mbits=1000
     2020-07-21T17:07:46.401Z [WARN]  client.fingerprint_mgr.network: error calling ethtool: error="exit status 75" path=/sbin/ethtool device=dummy0
     2020-07-21T17:07:46.401Z [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/dummy0/speed
     2020-07-21T17:07:46.401Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: mbits=1000
     2020-07-21T17:07:46.404Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/sbin/ethtool device=docker0
     2020-07-21T17:07:46.404Z [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/docker0/speed
     2020-07-21T17:07:46.404Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: mbits=1000
     2020-07-21T17:07:46.451Z [INFO]  client.fingerprint_mgr.vault: Vault is available
     2020-07-21T17:07:46.451Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=vault period=15s
     2020-07-21T17:07:46.520Z [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=0
     2020-07-21T17:07:46.520Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=0 errors=0
     2020-07-21T17:07:46.470Z [DEBUG] client.fingerprint_mgr: detected fingerprints: node_attrs=[arch, bridge, cgroup, consul, cpu, host, network, nomad, signal, storage, vault, env_gce]
     2020-07-21T17:07:46.470Z [INFO]  client.plugin: starting plugin manager: plugin-type=csi
     2020-07-21T17:07:46.470Z [INFO]  client.plugin: starting plugin manager: plugin-type=driver
     2020-07-21T17:07:46.470Z [INFO]  client.plugin: starting plugin manager: plugin-type=device
     2020-07-21T17:07:46.471Z [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=driver
     2020-07-21T17:07:46.471Z [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=device
     2020-07-21T17:07:46.471Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=raw_exec health=undetected description=disabled
     2020-07-21T17:07:46.471Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=qemu health=undetected description=
     2020-07-21T17:07:46.471Z [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=device
     2020-07-21T17:07:46.471Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=exec health=healthy description=Healthy
     2020-07-21T17:07:46.475Z [DEBUG] client.consul: bootstrap contacting Consul DCs: consul_dcs=[int-us-central1, int-australia-southeast1]
     2020-07-21T17:07:46.492Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=docker health=healthy description=Healthy
     2020-07-21T17:07:46.495Z [INFO]  client.consul: discovered following servers: servers=[10.133.1.212:4647, 10.133.0.63:4647, 10.133.0.40:4647]
     2020-07-21T17:07:46.495Z [DEBUG] client.server_mgr: new server list: new_servers=[10.133.0.40:4647, 10.133.0.63:4647, 10.133.1.212:4647] old_servers=[]
     2020-07-21T17:07:46.517Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=java health=healthy description=Healthy
     2020-07-21T17:07:46.517Z [DEBUG] client.driver_mgr: detected drivers: drivers="map[healthy:[exec docker java] undetected:[raw_exec qemu]]"
     2020-07-21T17:07:46.517Z [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=driver
     2020-07-21T17:07:46.518Z [INFO]  client: started client: node_id=5517b270-051b-ea62-79c4-7b70657d1a84
     2020-07-21T17:07:46.528Z [INFO]  client: node registration complete

Update:

We think we've tracked this down, at least in our case. A Nomad restart fixes the problem (the above output is actually good/working); rebooting our infrastructure breaks it. We see the following in the logs:

2020-07-21T18:14:02.312Z [WARN]  client.fingerprint_mgr: failed to detect bridge kernel module, bridge network mode disabled: error="could not detect kernel module bridge"

In our environment, I believe something is triggering a load of the bridge kernel module AFTER Nomad has started, so the detection fails during Nomad startup. It's possibly Docker that's triggering the load. For now we'll adjust /etc/modules and force a load of the bridging module (we probably should be doing this anyway).

I suspect this is a behaviour change with the introduction of the fingerprinting - either previously it was just assumed bridge support was available, and everything just worked because Docker (presumably) had loaded bridge support, or the first calls out to the Open CNI plugins loaded the kernel module.

Update for the update:

Now that we've worked around the above, we're running into the error @t0nyhays posted above:

      "DimensionExhausted": {
        "network: no addresses availale for \"\" network": 8
      },

Still looking into what's causing this one.

Hopefully the last update:

I can confirm #8407 fixes the "no addresses available" error we were seeing (tested off the latest master commit)


So in summary, at least for us:

  • The first issue, with bridging being unavailable for our Connect tasks is/was caused by the bridge kernel module being unavailable at Nomad startup time, and changes to the fingerprinting code here.
  • Our continued inability to schedule jobs (now receiving no addresses available) is fixed by nomad: recanonicalize network after connect hook #8407

@jharley

jharley commented Sep 5, 2020

I am now getting a warning (not failure) when doing plans, the jobs do get scheduled and run:

$ nomad plan myservice.nomad
[ ... ]
Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "myservice" (failed to place 1 allocation):
    * Resources exhausted on 2 nodes
    * Dimension "network: no addresses available for \"\" network" exhausted on 2 nodes

Nomad 0.12.3, CNI plugins 0.8.6, Ubuntu 20.04, Docker 19.03.12

@spuder
Contributor

spuder commented Sep 18, 2020

We just tried to upgrade to nomad 0.12.4 again but had to roll back to 0.11 because of this issue. All nomad connect jobs fail with error:

Get "http://10.46.24.2:22489/actuator/health": dial tcp 10.46.24.2:22489: connect: connection refused

It's not obvious that there is a problem until jobs are resubmitted, or we reboot a nomad worker agent.

Update:

I'm able to reproduce this now on 0.12.4 and can confirm the same behavior that @chrisboulton reported

  • service nomad restart on nomad agents fixes the problem
  • sudo reboot causes nomad to break again.

CNI Plugins 0.8.5


Updated to nomad 0.12.5 and consul 1.8.4, problem persists

@CarelvanHeerden

Getting the same as @jharley, using Nomad 0.12.4, on Ubuntu 18.04.

@nickethier
Member

nickethier commented Sep 24, 2020

Hey folks, this issue turned a bit into a dumping ground for Connect issues. From my read of the history it looks like #8407 should have solved the original issue that @t0nyhays posted about the "no addresses available" error.

Thanks @chrisboulton for the excellent write-up and continuing updates. It looks like you're experiencing an issue now where the bridge kernel module is loaded just in time for use, which breaks Nomad's fingerprinting of the bridge network (since it checks the currently loaded modules).

If I'm missing anything please shout. If not I plan to close this issue out soon and create one specifically to track the bridge module issue.

@nickethier nickethier self-assigned this Sep 24, 2020
@ibayer

ibayer commented Oct 20, 2020

I'm experiencing exactly the same behavior as described in #8423 (comment) by @spuder .

Updated to nomad 0.12.5 and consul 1.8.4, problem persists

I'm on the same versions.

@apollo13
Contributor

apollo13 commented Nov 3, 2020

@chrisboulton & @nickethier Regarding

The first issue, with bridging being unavailable for our Connect tasks is/was caused by the bridge kernel module being unavailable at Nomad startup time, and changes to the fingerprinting code here.

I think we are running into similar things. Seems to be a race condition. After a crash of our whole cluster, one job had one allocation blocked due to "missing network". Draining that node and stopping nomad, docker and restarting them allowed a reschedule. Is there a ticket for this already?

@apollo13
Contributor

apollo13 commented Nov 3, 2020

Also weirdly enough:

Nov 03 08:44:32 nomad03 dockerd[500]: time="2020-11-03T08:44:32.688453474+01:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Nov 03 08:44:32 nomad03 dockerd[500]: time="2020-11-03T08:44:32.760807915+01:00" level=info msg="Loading containers: done."
Nov 03 08:44:32 nomad03 dockerd[500]: time="2020-11-03T08:44:32.949868942+01:00" level=info msg="Docker daemon" commit=4484c46d9d graphdriver(s)=overlay2 version=19.03.13
Nov 03 08:44:32 nomad03 dockerd[500]: time="2020-11-03T08:44:32.963466533+01:00" level=info msg="Daemon has completed initialization"
Nov 03 08:44:32 nomad03 dockerd[500]: time="2020-11-03T08:44:32.995203542+01:00" level=info msg="API listen on /var/run/docker.sock"
Nov 03 08:44:32 nomad03 systemd[1]: Started Docker Application Container Engine.

....

Nov 03 08:44:33 nomad03 nomad[499]:     2020-11-03T08:44:25.905+0100 [WARN]  client.fingerprint_mgr: failed to detect bridge kernel module, bridge network mode disabled: error="could not detect kernel module bridge, could not detect kerne

As you can see, timing-wise the bridge should probably already have been loaded (Docker had already started successfully and logged as much).

Edit:// dmesg shows

[Di Nov  3 08:44:29 2020] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[Di Nov  3 08:44:29 2020] Bridge firewalling registered

If you look at the timestamps I think there might be something wrong with the nomad fingerprinting.
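Lining up the timestamps quoted above is one way to read the sequence. A minimal Python sketch, assuming all three sources show the same local time and keeping in mind that `dmesg` wall-clock stamps are only approximate; the variable names are illustrative:

```python
from datetime import datetime

# Timestamps copied from the log excerpts above (same day, local time).
# Embedded in the Nomad WARN line (journald recorded that line at 08:44:33).
nomad_fingerprint = datetime(2020, 11, 3, 8, 44, 25, 905000)
# dmesg: "Bridge firewalling registered"
bridge_registered = datetime(2020, 11, 3, 8, 44, 29)
# dockerd: "Daemon has completed initialization"
docker_ready = datetime(2020, 11, 3, 8, 44, 32, 963466)

gap = (bridge_registered - nomad_fingerprint).total_seconds()
print(f"fingerprint ran {gap:.3f}s before the bridge message in dmesg")
print("fingerprint preceded dockerd init:", nomad_fingerprint < docker_ready)
```

On that reading, the timestamp Nomad itself wrote into the warning is earlier than both the bridge message in dmesg and the Docker startup lines, even though journald shows the line later.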

@shoenig shoenig self-assigned this Nov 9, 2020
shoenig added a commit that referenced this issue Nov 9, 2020
In Nomad v0.12.0, the client added additional fingerprinting around the
presence of the bridge kernel module. The fingerprinter only checked in
`/proc/modules` which is a list of loaded modules. In some cases, the
bridge kernel module is builtin rather than dynamically loaded. The fix
for that case is in #8721. However we were still missing the case where
the bridge module is dynamically loaded, but not yet loaded during the
startup of the Nomad agent. In this case the fingerprinter would believe
the bridge module was unavailable when really it gets loaded on demand.

This PR now has the fingerprinter scan the kernel module dependency file,
which will contain an entry for the bridge module even if it is not yet
loaded.

In summary, the client now looks for the bridge kernel module in
 - /proc/modules
 - /lib/modules/<kernel>/modules.builtin
 - /lib/modules/<kernel>/modules.dep

Closes #8423
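The three-location lookup described in that commit message can be sketched as follows. This is a Python illustration under stated assumptions, not Nomad's Go implementation: the function names are hypothetical, and the parsing relies only on the usual file layouts (module name as the first field in /proc/modules, one module path per line in modules.builtin, and "path: dependencies" lines in modules.dep).

```python
import os
import re
from typing import Optional

# Matches ".../bridge.ko", optionally compressed (e.g. ".ko.xz").
_BRIDGE_PATH = re.compile(r"(^|/)bridge\.ko(\.\w+)?$")


def in_proc_modules(text: str) -> bool:
    """/proc/modules: the module name is the first whitespace-separated field."""
    return any(line.split()[:1] == ["bridge"] for line in text.splitlines())


def in_modules_builtin(text: str) -> bool:
    """modules.builtin: one module path per line."""
    return any(_BRIDGE_PATH.search(line.strip()) for line in text.splitlines())


def in_modules_dep(text: str) -> bool:
    """modules.dep: "<module path>: <dependencies...>" per line."""
    return any(
        _BRIDGE_PATH.search(line.split(":", 1)[0].strip())
        for line in text.splitlines()
    )


def _read(path: str) -> str:
    """Return the file contents, or an empty string if it cannot be read."""
    try:
        with open(path) as fh:
            return fh.read()
    except OSError:
        return ""


def bridge_available(kernel: Optional[str] = None) -> bool:
    """Check loaded, built-in, and loadable-on-demand modules, in that order."""
    kernel = kernel or os.uname().release
    base = f"/lib/modules/{kernel}"
    return (
        in_proc_modules(_read("/proc/modules"))
        or in_modules_builtin(_read(f"{base}/modules.builtin"))
        or in_modules_dep(_read(f"{base}/modules.dep"))
    )
```

Keeping the three parsers separate from the file reads makes each one easy to exercise with sample text, such as the lsmod-style lines quoted earlier in the thread.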
@jorgemarey
Contributor

jorgemarey commented Nov 13, 2020

Hi @shoenig, just found out about this when upgrading our cluster to 0.12.X (from 0.11.3).

When the servers are in 0.12.X and the clients remain in 0.11.3 we're getting this error when trying to deploy connect jobs:

    * Constraint "missing network": 3 nodes excluded by filter

Seems that the servers can't find any node able to run the jobs.
Following up on this, I saw that old nodes don't have bridge information in the NodeResources, but new nodes (0.12.X) do.

# 0.11 node
➜  nomad ✗ curl -k -X GET -H "X-Nomad-Token: ${NOMAD_TOKEN}" "${NOMAD_ADDR}/v1/node/af81f181-c3dc-9187-0992-128e9ea70a59" | jq '.NodeResources | {Networks, NodeNetworks}'
{
  "Networks": [
    {
      "Mode": "",
      "Device": "eth0",
      "CIDR": "10.10.37.109/32",
      "IP": "10.10.37.109",
      "MBits": 1000,
      "DNS": null,
      "ReservedPorts": null,
      "DynamicPorts": null
    }
  ],
  "NodeNetworks": [
    {
      "Mode": "",
      "Device": "eth0",
      "MacAddress": "",
      "Speed": 1000,
      "Addresses": [
        {
          "Family": "",
          "Alias": "default",
          "Address": "10.10.37.109",
          "ReservedPorts": "",
          "Gateway": ""
        }
      ]
    }
  ]
}
# 0.12 node
➜  nomad ✗ curl -k -X GET -H "X-Nomad-Token: ${NOMAD_TOKEN}" "${NOMAD_ADDR}/v1/node/e33d252c-bacd-f8f5-0f3d-fcaa9efe4f17" | jq '.NodeResources | {Networks, NodeNetworks}'
{
  "Networks": [
    {
      "Mode": "bridge",
      "Device": "",
      "CIDR": "",
      "IP": "",
      "MBits": 0,
      "DNS": null,
      "ReservedPorts": null,
      "DynamicPorts": null
    },
    {
      "Mode": "host",
      "Device": "eth0",
      "CIDR": "100.88.66.99/32",
      "IP": "100.88.66.99",
      "MBits": 1000,
      "DNS": null,
      "ReservedPorts": null,
      "DynamicPorts": null
    },
    {
      "Mode": "host",
      "Device": "eth0",
      "CIDR": "100.88.66.99/32",
      "IP": "100.88.66.99",
      "MBits": 1000,
      "DNS": null,
      "ReservedPorts": null,
      "DynamicPorts": null
    }
  ],
  "NodeNetworks": [
    {
      "Mode": "bridge",
      "Device": "",
      "MacAddress": "",
      "Speed": 0,
      "Addresses": null
    },
    {
      "Mode": "host",
      "Device": "eth0",
      "MacAddress": "",
      "Speed": 1000,
      "Addresses": [
        {
          "Family": "",
          "Alias": "default",
          "Address": "100.88.66.99",
          "ReservedPorts": "",
          "Gateway": ""
        }
      ]
    },
    {
      "Mode": "host",
      "Device": "eth0",
      "MacAddress": "",
      "Speed": 1000,
      "Addresses": [
        {
          "Family": "",
          "Alias": "default",
          "Address": "100.88.66.99",
          "ReservedPorts": "",
          "Gateway": ""
        }
      ]
    }
  ]
}

I saw in the code that between 0.11 and 0.12 the file https://github.com/hashicorp/nomad/blob/master/client/fingerprint/bridge_linux.go was added, which adds the bridge information to the NodeResources, and that in the scheduler a new NetworkChecker was added to verify client networks.

nomad/scheduler/feasible.go

Lines 319 to 327 in 2fce235

type NetworkChecker struct {
	ctx         Context
	networkMode string
	ports       []structs.Port
}

func NewNetworkChecker(ctx Context) *NetworkChecker {
	return &NetworkChecker{ctx: ctx, networkMode: "host"}
}

It seems that, with these changes, while using nodes on 0.11 we can't run new Connect jobs, and I guess that running allocations won't be able to reschedule to other nodes if something fails.

Is there anything we can do to sort this out? Should we upgrade the nodes first (we add new instances with the new version)?
I thought about modifying the code and adding the following to the NetworkChecker (to allow the previous behaviour), but I don't know if this has any other implications.

 func (c *NetworkChecker) Feasible(option *structs.Node) bool {
        if !c.hasNetwork(option) {
-               c.ctx.Metrics().FilterNode(option, "missing network")
-               return false
+               cstr, _ := version.NewConstraint("< 0.12")
+               nodeVersion, err := version.NewVersion(option.Attributes["nomad.version"])
+               if !cstr.Check(nodeVersion) || err != nil {
+                       c.ctx.Metrics().FilterNode(option, "missing network")
+                       return false
+               }
        }
 
        if c.ports != nil {

@shoenig
Contributor

shoenig commented Nov 13, 2020

The -beta3 that went out yesterday contains #9299 which should fix the fingerprinting issue. Can you give that a try, @jorgemarey ?

@ibayer

ibayer commented Nov 13, 2020

@shoenig
I just tested with v0.12.8 and the issue still persists (I can't test beta).

@jorgemarey
Contributor

The -beta3 that went out yesterday contains #9299 which should fix the fingerprinting issue. Can you give that a try, @jorgemarey ?

Hi @shoenig. I saw that PR, but from what I could see it fixes the fingerprint on the client side. My problem is that when using servers on 0.12 with clients on 0.11, jobs with Connect can't be deployed because those nodes don't expose the bridge in the NodeResources and the server is checking that the node has the bridge network.

This should be fixed on the server side making it compatible during the upgrade process.

I don't know if I'm explaining my problem correctly.

@shoenig
Contributor

shoenig commented Nov 13, 2020

Ahh sorry I understand now @jorgemarey , thanks. I'll reopen this and look into it.

@shoenig shoenig reopened this Nov 13, 2020
@jorgemarey
Contributor

Ahh sorry I understand now @jorgemarey , thanks. I'll reopen this and look into it.

Thanks @shoenig. My Connect jobs work fine on 0.12.8 clients with the bridge module loaded prior to Nomad start.

For me the only problem is during the upgrade from 0.11 to 0.12, as we upgrade the servers first and then the clients.

I was able to make it work by applying the diff in my previous comment (#8423 (comment)) and deploying the servers on 0.12 with that patch. With that, even with the servers on 0.12 and clients on 0.11, the Connect jobs were scheduled correctly.

@ibayer

ibayer commented Nov 13, 2020

It's still not clear to me whether v0.12.8 contains the fix.
Is there an easy way to check this?

shoenig added a commit that referenced this issue Nov 13, 2020
This PR enables users of Nomad < 0.12 to upgrade to Nomad 0.12
and beyond. Nomad 0.12 introduced a network fingerprinter for
bridge networks, which is a constraint checked for if bridge
network is being used. If users upgrade servers first as is
recommended, suddenly no clients running older versions of Nomad
will satisfy the bridge network resource constraint. Instead,
this change only enforces the constraint if the Nomad client
version is also >= 0.12.

Closes #8423
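The rule in that commit message, enforcing the bridge constraint only when the client itself is on 0.12 or newer, can be illustrated with a short Python sketch. It is not the scheduler's Go code: `parse_version`, `must_advertise_bridge` and `feasible` are hypothetical names, and ignoring pre-release suffixes such as "-beta3" is an assumption made here for simplicity.

```python
import re
from typing import Tuple


def parse_version(raw: str) -> Tuple[int, ...]:
    """Parse the leading numeric part of a version such as "0.12.0-beta3"."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", raw.strip())
    if match is None:
        raise ValueError(f"unparseable version: {raw!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def must_advertise_bridge(client_version: str) -> bool:
    """Enforce the bridge constraint only for clients at 0.12 or newer."""
    return parse_version(client_version)[:2] >= (0, 12)


def feasible(client_version: str, has_bridge: bool) -> bool:
    """Hypothetical stand-in for the bridge feasibility decision."""
    if has_bridge:
        return True
    # Older clients never fingerprint a bridge network, so do not hold it against them.
    return not must_advertise_bridge(client_version)


print(feasible("0.11.3", has_bridge=False))
print(feasible("0.12.8", has_bridge=False))
```

With this gate, a 0.11 client that reports no bridge network stays eligible while the servers are already on 0.12, which is the upgrade order described in the thread.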
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 28, 2022
jorgemarey pushed a commit to jorgemarey/nomad that referenced this issue Nov 1, 2023
jorgemarey pushed a commit to jorgemarey/nomad that referenced this issue Nov 1, 2023