unable to use many services #146
Comments
Hi @chen23, thanks for submitting a bug report. This behavior is not ideal. I was able to reproduce both errors you included: the first occurs while initializing the task (CTS exits), and on subsequent restarts Consul Terraform Sync fails with an EOF when querying for Consul leadership. My setup uses the config file and the following script to register services:
#!/bin/bash
# Register 100 services (app001 ... app100) with the local Consul agent.
for i in $(seq -f "%03g" 1 100)
do
  echo "$i"
  curl -X PUT localhost:8500/v1/agent/service/register \
    --data '{ "name": "app'"$i"'", "id": "app'"$i"'" }'
done
The errors look related to the DDoS tickets on Consul involving Consul Template (hashicorp/consul#7259). And changing the Consul agent configuration
However, this doesn't seem to resolve the underlying problem. It seems that connections are being held open, preventing new HTTP calls to the Consul agent. This was my finding from a first pass at evaluating this bug. Next I will look into why open connections are lingering.
@findkim thank you for confirming. I was able to work around the issue by setting
Oops, my PR summary quoted "doesn't fix 146", and GitHub probably parsed it to mean "fix 146" and automatically closed this out. Reopening!
I dug a bit more into this and wanted to share some of my findings and conclusions.
Agent connection limits
The underlying monitoring logic uses TCP connections for long polling of service changes using Consul blocking queries. The way Consul-Terraform-Sync uses this mechanism effectively results in one TCP connection to the agent per service, so it quickly reaches the agent limits. Since these are blocking queries, the load from these connections isn't high, so I have fewer reservations about increasing the limit than I initially did. I added documentation around this that will be included as part of the next release: hashicorp/consul#9371. I haven't looked much into the Consul 1.9 streaming feature and how it could improve this use case, but it would be a direction to look into for reducing the number of open connections.
EOF / 429 errors from Consul when restarting CTS
Depending on the version of Consul running, you may observe an EOF or 429 error resulting from the agent limiting CTS long-running TCP connections. This was very peculiar because the TCP connections should have been closed properly on CTS shutdown. A few connection cleanups are in place (#151 and hashicorp/hcat#31), but I am still observing ghost TCP connections stuck after CTS has shut down.
And the Consul agent still processes those blocking query requests and holds onto the TCP connections after the client closes. About 5 minutes later I see Consul logs completing the requests, which matches the blocking query default max wait of 5 minutes.
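The long-polling mechanism described above can be sketched as a single blocking query. This is only an illustration: the agent address `localhost:8500` is taken from the script earlier in the thread, while the health endpoint and the helper names `blocking_query_url` and `watch_once` are assumptions for this sketch and not necessarily what CTS calls internally. The `index` and `wait` query parameters and the `X-Consul-Index` response header are the standard Consul blocking-query interface.

```shell
#!/bin/bash
# Sketch of one Consul blocking query (long poll) for a single service.
# Assumptions: agent at localhost:8500, health endpoint; CTS may query differently.

# Build the URL for a blocking query.
#   $1 service name, $2 last seen X-Consul-Index, $3 max wait (defaults to 5m)
blocking_query_url() {
  local service="$1" index="$2" wait="${3:-5m}"
  printf 'http://localhost:8500/v1/health/service/%s?index=%s&wait=%s\n' \
    "$service" "$index" "$wait"
}

# Issue the query once and print the new index from the X-Consul-Index header.
# The request holds one TCP connection open until the service changes or the
# wait time expires, which is why many watched services mean many connections.
watch_once() {
  curl -s -o /dev/null -D - "$(blocking_query_url "$1" "$2")" \
    | tr -d '\r' \
    | awk -F': ' 'tolower($1) == "x-consul-index" { print $2 }'
}
```

Calling `watch_once app001 0` against a running agent would return right away with the current index; passing that index back in makes the next call block for up to the wait time.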
Seeing that the client side is stuck on
So unfortunately, I think restarting CTS may run into agent limits with a high number of services until ^ is resolved. In this case, either restart the agent to release those connections, or wait about 5 minutes for the server to finally reap them before starting CTS up again.
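One way to tell whether lingering connections have been released before starting CTS again is to count the connections to the agent port by TCP state. The sketch below is a hypothetical helper, not something from CTS or Consul: `count_consul_conns` is a made-up name, it assumes the agent's HTTP port is 8500 as in this thread, and it expects `ss -tan` style lines on stdin (state in the first column, peer address in the fifth).

```shell
#!/bin/bash
# Count TCP connections to the Consul HTTP port, grouped by state.
# Input: `ss -tan` style lines on stdin (State Recv-Q Send-Q Local Peer).
# Assumption: the agent listens on port 8500; pass another port as $1 if not.
count_consul_conns() {
  awk -v port="${1:-8500}" '
    $5 ~ (":" port "$") { n[$1]++ }        # peer address ends with :<port>
    END { for (s in n) print s, n[s] }     # one "STATE COUNT" line per state
  ' | sort
}
```

Run on the CTS host as `ss -tan | count_consul_conns`. If the output is empty, no connections from that host to port 8500 remain.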
I've been racking my brain over why 100 services require 200 connections. I just realized that my repro steps, which use the bash script with curl to register services, keep the TCP connections alive for a minute. So the 100 curl registrations together took up 100 of the default quota of 200 connections on the Consul server from that host, and running a task with 100 services always hit the limit while the curl connections were still open. I was not able to reproduce the 429 errors from Consul once I accounted for the curl requests I had just made (that is for the first run; the restart errors still occur, since they are caused by a different bug). @chen23 do you recall how you registered the 100 services with Consul? Were they registered from the same host shortly before running Consul-Terraform-Sync?
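The arithmetic above can be written down as a small budget check. This is only a sketch: `conns_remaining` is a hypothetical helper, and it assumes one connection per watched service (as described earlier in the thread) plus whatever connections are still lingering from the same host. The limit of 200 is the default quota mentioned above.

```shell
#!/bin/bash
# Rough connection budget for one client host against the agent limit.
#   $1 per-client limit, $2 lingering connections, $3 watched services
# Assumption: each watched service holds exactly one connection.
conns_remaining() {
  local limit="$1" lingering="$2" services="$3"
  echo $(( limit - lingering - services ))
}

# Example from the thread: 100 keep-alive curl connections still open,
# then a task watching 100 services, against the default limit of 200.
conns_remaining 200 100 100   # prints 0: the quota is fully used
```

With no lingering curl connections, `conns_remaining 200 0 100` leaves 100 connections of headroom, which is consistent with the 429 errors not reproducing on a first run.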
@findkim I just retried my test and am still getting a similar result. In this case I only had 3 of the 100 services defined and it still managed to throw an error. I'm still running an older version of Consul, 1.8.6. In my test environment I have 2 servers
Here are the errors that I see
Hi @chen23, thank you for your patience; we appreciate your interest in pushing CTS to run at scale. I wanted to let you know that we have a solution in place that will be part of the next release, 0.1.0-beta. We identified an area where we could improve the efficiency of the TCP connections that Consul Terraform Sync establishes with the local Consul agent. CTS now supports HTTP/2, which lets it make multiple blocking query requests on the same connection (hashicorp/hcat#37, #207). With this option, CTS no longer requires one TCP connection per service it monitors. There are now 2 options for running CTS with a large number of services.
The potential bug affecting option 1 (hashicorp/consul#8524) is likely still in play, but running CTS at scale is no longer blocked on a fix for it. We therefore suggest that operators configure CTS to use HTTP/2. I'm going to go ahead and close this out. Please re-open if you find that the changes in the master branch or our upcoming release do not resolve your issue.
Enable HTTP/2
There are a few steps that are required to enable HTTP/2.
Describe the bug
When configuring 100 services the client does not work reliably
Versions
Consul Terraform Sync
Consul Version
Terraform Version
Configuration File(s)
Reminder to redact any sensitive information that may be present in this file
Terraform Configuration Files Generated by Consul-Terraform-Sync
Reminder to redact any sensitive information that may be present in the files
see above
Task Variable Files
If passing in task variable file(s), share relevant parts of your variable file(s) here.
n/a
Expected Behavior
Ability to run with 100 services
Actual Behavior
After restarting the client you are unable to run the client
Steps to Reproduce
Additional Context
Add any other context about the problem here.
running the command