Consul 1.6.3 DDos using consul-template (1.6.2 working fine) #7259
Comments
After deeper investigation, it seems like a DoS is indeed the issue.
Also, looking deeper into the changes between 1.6.2 and 1.6.3, maybe this commit could explain the new behavior.
According to #7257, it seems it is also reproducible with a 3-node server cluster.
@obourdon try out the new limits entries: https://www.consul.io/docs/agent/options.html#limits, specifically the *_conns_per_client entries. This may also imply that something is heavily overusing one of the ways into the server cluster and not going through local agents. For instance, do you have Prometheus running with a mostly default consul_sd_config?
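For reference, a minimal sketch of what such a limits block can look like in an agent configuration file (HCL assumed; the values shown are placeholders, not recommendations from this thread):

```hcl
# Illustrative agent configuration fragment; the numbers are placeholders.
# These options cap how many concurrent connections a single client IP may
# hold open against the agent's HTTP and RPC listeners.
limits {
  http_max_conns_per_client = 200
  rpc_max_conns_per_client  = 100
}
```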
@chuckyz thanks for the info. I have reproduced the issue, isolated from my complete (encrypted and SSL-secured) environment, using the official consul Docker container, and I get the same behaviour: OK with 1.6.2, KO with 1.6.3 and later. You can find the code to reproduce it yourself here. In fact, consul-template is running [on an agent node/in a docker container] and, from the packet capture I have made, it is using the local consul agent to try to get KVs, and using the same ... Activating debug on both the consul server docker container and the consul client did not show much. I have also tried to add some more Consul client/server configuration parameters, without more success.
I think the issue/configuration values to solve this can be in both consul AND consul-template, hence the posting of 2 separate issues: see consul-template issue #1346.
@obourdon I don't know the exact template you are using, but if you have issues with consul-template DoSing consul, you might try consul-templaterb, which has a few protections included to avoid those kinds of behaviors (that's one of the reasons I developed this tool, and it is production-grade, in use for years at Criteo on very large clusters).
I asked around and it looks like this has to do with the ... setting. I suggest first trying to adjust that setting. Here are the docs on it. Hope this helps.
This is one we were hit with: running a Nomad client cluster with ~40 scheduled allocations on each node generated almost 1000 established HTTP connections per client... may look at https://github.com/criteo/consul-templaterb to help diagnose exactly what's going on!
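In case it helps anyone diagnosing a similar setup, a rough way to count established connections on an agent's HTTP port (8500 assumed here; adjust to your configuration) is:

```sh
# Rough count of established TCP connections involving the agent's HTTP port
# (8500 assumed); run on the node whose agent is suspected of being overwhelmed.
ss -tan | grep ':8500' | grep -c ESTAB
```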
@brydoncheyney-slc if the limit is 100 (assuming you are using 1.6.3+), it probably won't work either (assuming you are using 100+ endpoints), but if you test, use ...
@pierresouchay interesting. I ran a fairly naive ... Appreciate the heads-up. If anything interesting crops up I will report back...
@eikenb indeed, increasing the ... I am still confused by the naming of this variable, as it seems that the client only makes one http connection to make all requests, but ... I first tried this with all fixed versions above 1.7.1 (where the limit was put back to 200), but I was still having the issue. Digging deeper showed that even though I have ~150 keys in consul, my consul-templates are requesting ~300 KVs (this can be seen after launching ...).

However, adding ... does the trick.
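The exact commands and configuration referenced above are not visible here. Purely as a hypothetical way to get a similar count of the KV endpoints being requested (assuming, as in the repro setup, plain HTTP to the local agent on port 8500), one could filter a short packet capture on the loopback interface:

```sh
# Hypothetical diagnostic, not necessarily the commenter's method: capture
# 60 seconds of loopback traffic to the agent's HTTP port (8500 assumed,
# plain HTTP) and count the distinct /v1/kv/ paths that were requested.
sudo timeout 60 tcpdump -l -A -i lo 'tcp dst port 8500' 2>/dev/null \
  | grep -o 'GET /v1/kv/[^ ?]*' \
  | sort -u \
  | wc -l
```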
Overview of the Issue
A call to `curl -s -D - http://consul-server:8500/v1/agent/members` fails with `curl: (56) Recv failure: Connection reset by peer` (even though `consul members` still seems to work) after a consul-template failure.
Reproduction Steps
Note that under the exact same scenario, consul 1.6.2 does not exhibit the same behaviour as far as the curl call mentioned above is concerned. Looking at the code differences between consul 1.6.3 and 1.6.2 did not ring any bell on my side.
We have a set of uniform versions: 1 or 3 consul servers and 2 consul client machines (all 1.6.2 or all 1.6.3).
We are running 2 sets of nomad jobs using consul-template (0.24.0). One set of jobs is targeted at consul client 1 (using constraints & tags); the second set of nomad jobs is targeted at consul client 2.
In job set 2, we had an error, in the sense that one of the consul keys which parametrizes one of the jobs is missing from the consul KV store, therefore making consul-template "hang" waiting for this key to be defined (we fixed this on our side by using keyOrDefault instead of key; a template snippet illustrating this follows these steps).
However, this error seems to have a dramatic effect on the consul client node where the job causing the error is executing. We reversed the job target rules to make sure that this behaviour is not linked to the host itself but really to the fact that the failing job was executed on it, and this proved to be the case (anyway, all hosts are running the same OS, software versions, and configurations for the same roles ...).
We are using both `consul members` and `curl -s -D - http://consul-server:8500/v1/agent/members` to make sure that everything is working properly on all nodes, and both commands succeed on every node when using consul 1.6.2.

In the case of consul 1.6.2, the nomad job waits for consul-template to get the missing key, and once we run `consul kv put missing-key some-value` everything comes back to normal; the curl and CLI members calls keep working at all times.

However, using consul 1.6.3, once consul-template is hanging, the consul node where it was executed cannot make any successful curl members calls. The `consul members` CLI command still seems to execute properly, but if we run `systemctl restart consul-agent` then the node seems to have left the cluster. If I remember correctly, restarting the consul server on the server node(s) also shows that the cluster has lost all members.

One thing I would also like to state is that, due to the large set of jobs we run in job set 2, we get the known following warning message at job startup, but, again, this still works properly on 1.6.2 and not on 1.6.3.
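To illustrate the keyOrDefault workaround mentioned in the steps above (the key path and default value below are hypothetical, for illustration only): `key` blocks template rendering until the key exists in the KV store, while `keyOrDefault` renders a fallback value right away.

```
{{/* Hypothetical key path, for illustration only. */}}

{{/* Blocks rendering until the key exists in the KV store: */}}
param = "{{ key "jobs/set2/some-parameter" }}"

{{/* Renders immediately, using the fallback if the key is missing: */}}
param = "{{ keyOrDefault "jobs/set2/some-parameter" "fallback-value" }}"
```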
Consul info for both Client and Server
Client info
CoreOS 2303.3.0
Server info
CoreOS 2303.3.0
Operating system and Environment details
OS, Architecture, and any other information you can provide about the environment.
See above
Log Fragments
Include appropriate Client or Server log fragments. If the log is longer than a few dozen lines, please include the URL to the gist of the log instead of posting it in the issue. Use `-log-level=TRACE` on the client and server to capture the maximum log detail.

Currently gathering these.