Every Nomad executor queries all consul services and health checks every 5 seconds. #1995
Definitely needs improving. Might be related to #1968 - still investigating that one.
@schmichael Have you made progress on this? Pinging because we have Nomad clients in production pushing 167 Mbps over the loopback interface.
@parasyte The good news is: yes! Much progress has been made! The bad news is that it won't land in time for 0.5.5 (which should be out any day now). It's definitely going to land in the next release, though.

Unless you use a lot of script checks it should dramatically reduce Consul chatter. Since script checks have to heartbeat to Consul there's no getting around that traffic, but at least you can tune how often those run yourself (and the traffic should always incur far less CPU than exec'ing a script anyway).

I'll try to post binaries here when the PR is up if anyone wants to test. If it gets merged soon enough after 0.5.5 you may just want to run it, since there's really no workaround for the current Consul implementation's excessively chatty nature.

(If you care about the implementation details: the existing implementation runs a periodic per-task reconcile loop against Consul. It's far more defensive in approach than is needed, so I'm moving to a more straightforward talk-to-Consul-only-on-state-change approach.)
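To make that distinction concrete, here is a minimal sketch of the two approaches, assuming the standard Consul API client (`github.com/hashicorp/consul/api`). This is not Nomad's actual code; all function names here are hypothetical:

```go
// A minimal sketch, not Nomad's actual code: it contrasts the periodic
// reconcile loop with a sync-on-state-change approach.
package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

// reconcileLoop is roughly the old behavior: every few seconds each executor
// fetches *every* service the agent knows about, then re-registers its own
// entry if it has gone missing. Traffic scales with total service count.
func reconcileLoop(client *api.Client, reg *api.AgentServiceRegistration) {
	for range time.Tick(5 * time.Second) {
		services, err := client.Agent().Services() // unfiltered: all services
		if err != nil {
			log.Println("consul query failed:", err)
			continue
		}
		if _, ok := services[reg.ID]; !ok {
			client.Agent().ServiceRegister(reg)
		}
	}
}

// syncOnChange is roughly the new behavior: Consul is contacted only when
// local task state actually changes, so steady-state traffic is near zero.
func syncOnChange(client *api.Client, reg *api.AgentServiceRegistration, running bool) error {
	if running {
		return client.Agent().ServiceRegister(reg)
	}
	return client.Agent().ServiceDeregister(reg.ID)
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	reg := &api.AgentServiceRegistration{ID: "web-1", Name: "web", Port: 8080}
	if err := syncOnChange(client, reg, true); err != nil {
		log.Fatal(err)
	}
	_ = reconcileLoop // old approach shown for contrast only
}
```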
Awesome!!! This sounds great, actually. I'm really looking forward to it. The description you provided matches my observations. Thanks for the update!
Fixes #2478 #2474 #1995 #2294 The new client only handles agent and task service advertisement. Server discovery is mostly unchanged. The Nomad client agent now handles all Consul operations instead of the executor handling task related operations. When upgrading from an earlier version of Nomad, existing executors will be told to deregister from Consul so that the Nomad agent can re-register the task's services and checks. Drivers - other than qemu - now support an Exec method for executing arbitrary commands in a task's environment. This is used to implement script checks. Interfaces are used extensively to avoid interacting with Consul in tests that don't assert any Consul related behavior.
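For illustration, a hedged sketch of what an Exec-style driver hook can look like; Nomad's real driver interface differs, and all names below are hypothetical. The exit-code mapping follows Consul's documented script-check convention (0 = passing, 1 = warning, anything else = critical):

```go
// Hypothetical sketch of a driver Exec hook used to run script checks
// inside a task's environment (chroot, container, etc.).
package taskexec

import (
	"context"
	"time"
)

// ScriptExecutor models a driver that can run arbitrary commands in a
// task's environment.
type ScriptExecutor interface {
	// Exec runs cmd with args and returns combined output and exit code.
	Exec(ctx context.Context, cmd string, args []string) (output []byte, exitCode int, err error)
}

// RunScriptCheck executes a script check via the driver and maps the exit
// code to a Consul check status (0 = passing, 1 = warning, else critical).
func RunScriptCheck(exec ScriptExecutor, cmd string, args []string, timeout time.Duration) string {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	_, code, err := exec.Exec(ctx, cmd, args)
	switch {
	case err != nil:
		return "critical"
	case code == 0:
		return "passing"
	case code == 1:
		return "warning"
	default:
		return "critical"
	}
}
```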
Fixed in master by #2467 -- will be in the 0.6.0 release. Actually demo'd this particular fix internally at HashiCorp! For 250 tasks with checks running on a single node, loopback traffic went from 60 Mbps in 0.5.6 to 60 Kbps. CPU usage was significantly lower, but still not as low as we'd like. Some quick profiling revealed task stats collection as the next most likely CPU hog on idle systems. If you have the time/ability/desire to test the Consul fix I've attached a binary here:
Nomad version
Operating system and Environment details
The cluster is running on AWS EC2-VPC.
Issue
Every 4-5 seconds, the Nomad executor process queries Consul's services endpoint and health checks endpoint with no filtering, fetching the full list of all services and health checks known to Consul.
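For reference, a tiny Go sketch of what each poll fetches. It assumes the executor hits the local agent's standard `/v1/agent/services` and `/v1/agent/checks` HTTP endpoints on the default port; the exact endpoints are my assumption, not something confirmed in this issue:

```go
// Measure the size of the unfiltered payloads a single poll retrieves,
// assuming a local Consul agent on the default port 8500.
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	for _, path := range []string{"/v1/agent/services", "/v1/agent/checks"} {
		resp, err := http.Get("http://127.0.0.1:8500" + path)
		if err != nil {
			fmt.Println(path, "error:", err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		// Every executor fetches this full, unfiltered payload on every
		// poll, so traffic grows with (executors x registered services).
		fmt.Printf("%s: %d bytes\n", path, len(body))
	}
}
```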
This creates a scaling problem where `n` executors frequently fetch information about `n` services.

Here's an example TCP dump session, watching requests from a unique client port:
The client port maps back to an executor:
I'm running a job with `Count: 800` on 4 Nomad client machines, and the services response is pretty big:

The health checks endpoint response is very small, because I've disabled health checks on this big job while running the test. (Our initial assumption was that health checks were responsible for the CPU load and `lo` interface pressure.)

Observations:
- The `Accept-Encoding` header is ignored (no compression on the response).
- `800 Count / 4 clients` = 200 jobs on localhost
- `200 executors / 5 seconds` = 40 requests per second
- `40 requests per second * 100 KB` = 4 MB/s
- `4 MB/s * 8 bits` = 32 Mbit/s over localhost

Librato metrics agrees:
The 800-count jobs themselves are idle (e.g. a bash script: `while true ; do sleep 60 ; done`). Nomad executors and Consul consume a steady 25% CPU (on each core):
Reproduction steps
`lo` traffic increases from ~36 Mbit/s to ~54 Mbit/s when health checks are included.

Nomad Server logs (if appropriate)
N/A
Nomad Client logs (if appropriate)
N/A
Job file (if appropriate)