Connect jobs failing in 12.0 #8423
Comments
Hey @t0nyhays I think I fixed this in #8407, sorry for the issue. We should be cutting a bug fix release soon. |
I am also seeing this issue |
I believe it should also be possible to workaround this by manually configuring |
I'm running 0.12 on the Nomad servers, but still running 0.11 on the clients/agents. I tried adding the following to the client/agent config with no change in behavior. I suspect that running different versions on the servers and the agents invalidates the workaround.
|
I pushed out 12 to all servers(3) and all clients(10). Added |
2nd workaround test: Restarted nomad. Added host_network to job description:
Nomad job plan failed again, this time with: Does this help at all? It makes me think it was picking up the default correctly, since now that the job description has changed, it clearly knows the node has the named network. I will probably set back to 11.3 unless you can think of another potential fix, other than the bug fix roll. |
Is there an ETA for when this will be released? 0.12 is completely unusable and the work arounds aren't viable. |
Yeah sorry we don't have an ETA for rolling out the fix - it might be sooner than later since we're coordinating around the recent Go security fix but nothing exact. Definitely hold off for now if you can. In the meantime I'm trying to incorporate this condition in some tests to make sure we don't run into something like this again. @t0nyhays do you mind sharing the output of |
@shoenig Sorry but I set my cluster back to 0.11.3 and have a deadline to stand up some new services. I will potentially have some time next week to get you the output. Thanks for working on this one! |
@shoenig If it's helpful, we just slammed into this and here's the output of the above - right now we don't have any of the workarounds that @t0nyhays attempted in place (should that be specifically what you're looking for). {
"Networks": [
{
"Mode": "host",
"Device": "ens4",
"CIDR": "10.133.0.51/32",
"IP": "10.133.0.51",
"MBits": 1000,
"DNS": null,
"ReservedPorts": null,
"DynamicPorts": null
}
],
"NodeNetworks": [
{
"Mode": "host",
"Device": "ens4",
"MacAddress": "42:01:0a:85:00:33",
"Speed": 1000,
"Addresses": [
{
"Family": "ipv4",
"Alias": "default",
"Address": "10.133.0.51",
"ReservedPorts": "",
"Gateway": ""
}
]
}
]
} Interestingly, on my local test cluster, this seems more correct: {
"Networks": [
{
"Mode": "bridge",
"Device": "",
"CIDR": "",
"IP": "",
"MBits": 0,
"DNS": null,
"ReservedPorts": null,
"DynamicPorts": null
},
{
"Mode": "host",
"Device": "ens4",
"CIDR": "10.128.0.11/32",
"IP": "10.128.0.11",
"MBits": 1000,
"DNS": null,
"ReservedPorts": null,
"DynamicPorts": null
}
],
"NodeNetworks": [
{
"Mode": "bridge",
"Device": "",
"MacAddress": "",
"Speed": 0,
"Addresses": null
},
{
"Mode": "host",
"Device": "ens4",
"MacAddress": "42:01:0a:80:00:0b",
"Speed": 1000,
"Addresses": [
{
"Family": "ipv4",
"Alias": "default",
"Address": "10.128.0.11",
"ReservedPorts": "",
"Gateway": ""
}
]
}
]
} Immediately, I can spot a difference in the configuration we use in our real environments: client {
enabled = true
network_interface = "ens4" # i don't have this on my test instance (though in our development environment, we seem to have this and it also reports a network block like the above)
...
} ^ Update, tried removing the network interface configuration, it made no difference. I also suspected this could be Open CNI plugin related, but between all these instances the plugins are there and are in similar locations. We thought this might also have something to do with the fingerprinting which was reworked in #7518 and tweaked in #8208, but everything seems OK:
Debug mode startup logs:
Update: We think we've tracked this back, at least in our situation. A Nomad restart fixed the problem in our situation (the above output is actually good/working). Rebooting our infrastructure breaks it. We see the following in the logs:
In our environment, I believe something is triggering a load of the bridge kernel module AFTER Nomad has started. The detection fails during the Nomad startup. It's possibly Docker that's triggering the load of it - for now we'll adjust I suspect this is a behaviour change with the introduction of the fingerprinting - either previously it was just assumed bridge support was available, and everything just worked because Docker (presumably) had loaded bridge support, or the first calls out to the Open CNI plugins loaded the kernel module. Update for the update: Now that we've worked around the above, we're running into the error @t0nyhays posted above: "DimensionExhausted": {
"network: no addresses availale for \"\" network": 8
},
Still looking into what's causing this one.
Hopefully the last update: I can confirm #8407 fixes the "no addresses available" error we were seeing (tested off the latest master commit).
So in summary, at least for us:
|
I am now getting a warning (not failure) when doing plans, the jobs do get scheduled and run:
Nomad 0.12.3, CNI plugins 0.8.6, Ubuntu 20.04, Docker 19.03.12 |
We just tried to upgrade to nomad 0.12.4 again but had to roll back to 0.11 because of this issue. All nomad connect jobs fail with error:
It's not obvious that there is a problem until jobs are resubmitted, or we reboot a nomad worker agent. Update: I'm able to reproduce this now on 0.12.4 and can confirm the same behavior that @chrisboulton reported
CNI Plugins 0.8.5 Updated to nomad 0.12.5 and consul 1.8.4, problem persists |
Getting the same as @jharley, using Nomad 0.12.4, on Ubuntu 18.04. |
Hey folks, this issue turned a bit into a dumping ground for connect issues. From my read of the history it looks like #8407 should have solved the original issue that @t0nyhays posted about the "no addresses available" error. Thanks @chrisboulton for the excellent write-up and continuing updates. It looks like you're experiencing an issue now where the bridge kernel module is loaded just in time for use, which breaks Nomad's fingerprinting of the bridge network (since it checks the currently loaded modules). If I'm missing anything please shout. If not I plan to close this issue out soon and create one specifically to track the bridge module issue. |
I'm experiencing exactly the same behavior as described in #8423 (comment) by @spuder .
I'm on the same versions. |
@chrisboulton & @nickethier Regarding
I think we are running into similar things. Seems to be a race condition. After a crash of our whole cluster, one job had one allocation blocked due to "missing network". Draining that node and stopping nomad, docker and restarting them allowed a reschedule. Is there a ticket for this already? |
Also weirdly enough:
As you can see, timing-wise the bridge should probably already have been loaded (docker already started successfully and logged as much) Edit:// dmesg shows
If you look at the timestamps I think there might be something wrong with the nomad fingerprinting. |
In Nomad v0.12.0, the client added additional fingerprinting around the presence of the bridge kernel module. The fingerprinter only checked `/proc/modules`, which is a list of loaded modules. In some cases, the bridge kernel module is built in rather than dynamically loaded. The fix for that case is in #8721. However, we were still missing the case where the bridge module is dynamically loaded, but not yet loaded during the startup of the Nomad agent. In this case the fingerprinter would believe the bridge module was unavailable when really it gets loaded on demand. This PR now has the fingerprinter scan the kernel module dependency file, which will contain an entry for the bridge module even if it is not yet loaded. In summary, the client now looks for the bridge kernel module in
- /proc/modules
- /lib/modules/<kernel>/modules.builtin
- /lib/modules/<kernel>/modules.dep
Closes #8423
Hi @shoenig, just found out about this when upgrading our cluster 0.12.X (from 0.11.3). When the servers are in 0.12.X and the clients remain in 0.11.3 we're getting this error when trying to deploy connect jobs:
Seems that they can't find any node able to run the jobs.
I saw in the code that between 0.11 and 0.12 this file: https://github.com/hashicorp/nomad/blob/master/client/fingerprint/bridge_linux.go was added, which adds the bridge information to the NodeResources, and in the scheduler a new NetworkChecker was added to verify client networks. Lines 319 to 327 in 2fce235
It seems that, with these changes, while using nodes on 0.11 we can't run new connect jobs, and I guess that running allocations won't be able to reschedule to other nodes if something fails. Is there anything we can do to sort this out? Should we upgrade the nodes first (we add new instances with the new version)?
|
The |
@shoenig |
Hi @shoenig. I saw that PR, but from what I could see it fixes the fingerprint on the client side. My problem is that with servers on 0.12 and clients on 0.11, jobs with connect can't be deployed, because those nodes don't expose the bridge in the NodeResources and the server checks that the node has the bridge network. This should be fixed on the server side, making it compatible during the upgrade process. I don't know if I'm explaining my problem correctly. |
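To make the server-side check concrete, here is a small Go sketch of the idea, assuming node resources shaped like the JSON output posted earlier in this thread. The type and function names (`network`, `nodeResources`, `hasNetworkMode`) are hypothetical and this is not the scheduler's NetworkChecker; the sample documents are trimmed to three fields from the two outputs above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// network mirrors a few fields of the "Networks" entries shown in the thread.
type network struct {
	Mode   string
	Device string
	IP     string
}

// nodeResources holds only the part of the node output this sketch needs.
type nodeResources struct {
	Networks []network
}

// Trimmed from the first output in the thread: host network only.
const hostOnlyNode = `{"Networks": [
  {"Mode": "host", "Device": "ens4", "IP": "10.133.0.51"}
]}`

// Trimmed from the second output in the thread: bridge and host networks.
const bridgeNode = `{"Networks": [
  {"Mode": "bridge", "Device": "", "IP": ""},
  {"Mode": "host", "Device": "ens4", "IP": "10.128.0.11"}
]}`

// hasNetworkMode reports whether the node advertises a network with the given mode.
func hasNetworkMode(raw []byte, mode string) (bool, error) {
	var nr nodeResources
	if err := json.Unmarshal(raw, &nr); err != nil {
		return false, err
	}
	for _, n := range nr.Networks {
		if n.Mode == mode {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	for name, doc := range map[string]string{"host-only": hostOnlyNode, "bridge": bridgeNode} {
		ok, err := hasNetworkMode([]byte(doc), "bridge")
		fmt.Println(name, "node has bridge network:", ok, err)
	}
}
```

A node that never fingerprints a bridge network, which per the comment above includes every 0.11 client, would fail a check of this shape no matter what the job specifies.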
Ahh sorry I understand now @jorgemarey , thanks. I'll reopen this and look into it. |
Thanks @shoenig. My connect jobs work fine on 0.12.8 clients with the bridge module loaded prior to nomad start. For me the only problem is during the upgrade from 0.11 to 0.12, as we first upgrade the servers and then the clients. I was able to make it work by applying the diff in my previous comment (#8423 (comment)) and deploying the servers on 0.12 with that patch. With that, even with the servers on 0.12 and clients on 0.11, the connect jobs were scheduled correctly. |
It's still not clear to me if |
This PR enables users of Nomad < 0.12 to upgrade to Nomad 0.12 and beyond. Nomad 0.12 introduced a network fingerprinter for bridge networks, which is a constraint checked when a bridge network is being used. If users upgrade servers first, as is recommended, suddenly no clients running older versions of Nomad will satisfy the bridge network resource constraint. Instead, this change only enforces the constraint if the Nomad client version is also >= 0.12. Closes #8423
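The version gate described here can be sketched in a few lines of Go. This is an illustration under assumptions, not the code from the PR: `parseMajorMinor`, `enforceBridgeCheck`, and `nodeFeasible` are hypothetical names, only the standard library is used for version parsing, and treating an unparseable version as "enforce the constraint" is a choice made for this sketch.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMajorMinor extracts the major and minor numbers from strings such as
// "0.12.4" or "v0.11.3".
func parseMajorMinor(v string) (int, int, error) {
	parts := strings.SplitN(strings.TrimPrefix(v, "v"), ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("malformed version %q", v)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, fmt.Errorf("malformed major in %q: %v", v, err)
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, 0, fmt.Errorf("malformed minor in %q: %v", v, err)
	}
	return major, minor, nil
}

// enforceBridgeCheck reports whether the bridge constraint applies to a client,
// which is only the case for clients on 0.12 or newer.
func enforceBridgeCheck(clientVersion string) bool {
	major, minor, err := parseMajorMinor(clientVersion)
	if err != nil {
		return true // assumption for this sketch: be strict when the version is unknown
	}
	return major > 0 || minor >= 12
}

// nodeFeasible combines the version gate with the fingerprinted bridge result.
func nodeFeasible(clientVersion string, hasBridge bool) bool {
	if !enforceBridgeCheck(clientVersion) {
		return true // older clients never fingerprint the bridge, so skip the check
	}
	return hasBridge
}

func main() {
	fmt.Println(nodeFeasible("0.11.3", false))
	fmt.Println(nodeFeasible("0.12.0", false))
	fmt.Println(nodeFeasible("0.12.8", true))
}
```

With a gate like this, servers upgraded first keep placing connect jobs on 0.11 clients, while 0.12 clients are still held to the fingerprinted result.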
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
When I upgraded my cluster to 0.12.0, all of my Connect jobs began to fail. I started messing around with the syntax of the job definition, but couldn't get it to work.
Nomad Version
Nomad v0.12.0 (8f7fbc8)
OS
CentOS 7 and Ubuntu 20.04
Issue
After upgrading to Nomad 0.12.0, allocations fail because of a "missing network" constraint.
Reproduction Steps
To take any environmental variables out of the mix, I followed the steps outlined on (https://www.nomadproject.io/docs/integrations/consul-connect) exactly. The only difference between the two runs was the version of the nomad binary running in -dev-connect
"nomad job plan connect.nomad" - Nomad v0.11.3 (8918fc8)
Job: "countdash"
Task Group: "api" (1 create)
Task Group: "dashboard" (1 create)
Scheduler dry-run:
Job Modify Index: 0
To submit the job with version verification run:
nomad job run -check-index 0 connect.nomad
When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
"nomad job plan connect.nomad" - Nomad v0.12.0 (8f7fbc8)
Job: "countdash"
Task Group: "api" (1 create)
Task Group: "dashboard" (1 create)
Scheduler dry-run:
WARNING: Failed to place all allocations.
Task Group "api" (failed to place 1 allocation):
Task Group "dashboard" (failed to place 1 allocation):
Job Modify Index: 0
To submit the job with version verification run:
nomad job run -check-index 0 connect.nomad
When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.