v1.2.0+ trying to connect to CA when Connect protocol is disabled #4421
Comments
Hi @pztrn, I asked in the mailing list thread for a couple of additional things that would help me understand exactly how you hit this issue and got into the current state.
But even without knowing exactly how you got into this state, I can guess what the bug is: when Connect is disabled and there is no CA configured, we assume nothing will attempt to fetch certificates. If something does, it hits this bug during the watch because the CA response never blocks.
Unverified, but I think the fix is either to make the CA and certificate endpoints on the server still block even when Connect is not enabled, or to make them actually error with a 500 or similar so that clients back off or even stop retrying instead of busy-looping.
If you are interested in understanding what's going on in your cluster (I would be), here is my train of thought for you to follow and confirm whether it's correct.
You have Connect disabled in config and you are not registering a managed proxy in your service call. But something is requesting a leaf Connect certificate. The possible explanations I can think of are:
If you can confirm any one of those is true then great. If not then it would be useful to see:
Hope this is useful. |
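To illustrate the "back off instead of busy loop" idea from the comment above, here is a minimal sketch of a watch loop that waits longer after each consecutive failure. It is not Consul's actual watch code or the eventual fix; `fetchFunc`, `nextDelay`, `watch`, `alwaysFail` and `noSleep` are hypothetical names invented for this example, and the base and maximum delays are arbitrary.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// fetchFunc is a hypothetical stand-in for one blocking query against the
// agent (for example a leaf certificate request). It returns the new index
// on success, or an error if the server rejects the request immediately.
type fetchFunc func(index uint64) (uint64, error)

// nextDelay returns the wait before the next retry: base after the first
// failure, doubling with every further consecutive failure, capped at max.
func nextDelay(failures int, base, max time.Duration) time.Duration {
	if failures <= 0 {
		return 0
	}
	d := base
	for i := 1; i < failures; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// watch calls fetch up to attempts times. After every failed call it sleeps
// for nextDelay(...) instead of retrying at once, and it records the delays
// it used so the behaviour can be inspected.
func watch(fetch fetchFunc, attempts int, base, max time.Duration, sleep func(time.Duration)) []time.Duration {
	var delays []time.Duration
	var index uint64
	failures := 0
	for i := 0; i < attempts; i++ {
		newIndex, err := fetch(index)
		if err != nil {
			failures++
			d := nextDelay(failures, base, max)
			delays = append(delays, d)
			sleep(d)
			continue
		}
		failures = 0
		index = newIndex
	}
	return delays
}

// alwaysFail simulates a server that returns an error without blocking.
func alwaysFail(uint64) (uint64, error) {
	return 0, errors.New("cluster has no CA bootstrapped")
}

// noSleep replaces time.Sleep so the example finishes instantly.
func noSleep(time.Duration) {}

func main() {
	delays := watch(alwaysFail, 5, 100*time.Millisecond, time.Second, noSleep)
	fmt.Println(delays) // [100ms 200ms 400ms 800ms 1s]
}
```

In a real client `sleep` would be `time.Sleep` (or a timer that can be cancelled); without such a delay, a request that fails immediately is retried as fast as the CPU allows.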
Hey, sorry for being quiet, I was out of the city (and without internet) :) First, I haven't registered any Connect proxies. The code was:
func (s *Service) RegisterAtConsul(port int, versionMajor string, versionFull string, registerHealthCheck bool) error {
	s.versionFull = versionFull
	s.versionFullDashes = strings.Replace(s.versionFull, ".", "-", -1)
	s.versionMajor = versionMajor
	// Register with consul for major version.
	agent := s.Client.Agent()
	serviceMajorVersion := &api.AgentServiceRegistration{
		Name:    s.Name + "-" + s.versionMajor,
		Tags:    s.Tags,
		Address: s.localIP,
		Port:    port,
	}
	serviceMajorVersion.ID = s.ID + "-" + s.versionMajor
	if registerHealthCheck {
		serviceMajorVersion.Check = &api.AgentServiceCheck{
			Name:     serviceMajorVersion.ID + " HTTP check",
			HTTP:     "http://" + s.localIP + ":" + strconv.Itoa(port) + "/api/v1/healthCheck/",
			Method:   http.MethodPost,
			Interval: "5s",
			Timeout:  "2s",
		}
	}
	err := agent.ServiceRegister(serviceMajorVersion)
	if err != nil {
		return err
	}
	// Register at consul with full version.
	serviceFullVersion := &api.AgentServiceRegistration{
		Name:    s.Name + "-" + s.versionFullDashes,
		Tags:    s.Tags,
		Address: s.localIP,
		Port:    port,
	}
	serviceFullVersion.ID = s.ID + "-" + s.versionFullDashes
	if registerHealthCheck {
		serviceFullVersion.Check = &api.AgentServiceCheck{
			Name:     serviceFullVersion.ID + " HTTP check",
			HTTP:     "http://" + s.localIP + ":" + strconv.Itoa(port) + "/api/v1/healthCheck/",
			Method:   http.MethodPost,
			Interval: "5s",
			Timeout:  "2s",
		}
	}
	err1 := agent.ServiceRegister(serviceFullVersion)
	if err1 != nil {
		return err1
	}
	return nil
}
where
As you can see, there is only service registration, without any proxying. Moreover, every experiment was done after removing persistent storage, so no state was restored. Also, the last log line from the agent was about successful cluster synchronization after master election, at any log level.
About the possible explanations: it's number 4. Again, nothing was registered with (or as) a Connect proxy, unless Consul's Golang API package does something nasty and, despite Connect being disabled, still tries to use it.
I'll attach logs and outputs in the next message, within the next couple of minutes. |
ps aux output BEFORE register-unregister thing:
Here we go, no proxy, no connect, nothing :)
Logs:
And after REGISTER-UNREGISTER exec:
ps aux:
Full logs: |
Hmm @pztrn Apologies if I'm confused. The logs you posted there are indeed clean with no sign of proxies, however they also don't have the bug you described here - no failed attempts to fetch a leaf certificate, so it's hard to know how much that helps. Have you completely wiped this agent?
I'm still not convinced mostly because this is the only case we've seen in nearly a month - even if it is a bug it's clearly not an obvious one where we accidentally forgot to check about connect and always run proxies no matter what! We'll fix the known bug here which is CPU burn loop in this case regardless, but I still don't have a good story for how you have some Connect client running despite apparently never intending to start one. I realise you've pretty much said this lots of times but just to be totally sure we are not talking past each other could you answer these questions explicitly by number/quote so we can rule it out completely?
Here is another clue: The last segment of that URL is the
Hope this helps you track down what happened. If you aren't seeing it anymore then it's up to you whether you want to keep trying to work out how it started; as mentioned, we'll fix the actual CPU bug here anyway. I'm 99.9% sure it's not a bug that started a proxy completely on its own without ever being asked, but even if you clear out state, that isn't a full reset, since proxy processes are left running by the agent and keep going in an orphaned state. That's still my best guess for your case. |
Yes. Storage was definitely wiped.
Probably I have a unique setup :D I'm sorry for not answering your questions, because I've just found that I lied - I've just found
So, is a CA required for a Connect service to work? If so, then why is there no message in the log like "trying to register Connect proxy without bootstrapping CA"? Anyway, we have two questions then:
Probably separate issues should be created for them? |
No worries! Glad we figured that out.
We do have https://www.consul.io/intro/getting-started/connect.html as well as documentation on native app integration. We can certainly do better though. Can you give me some feedback I can share with the team about what you were trying to do and whether or not it's covered in those docs? If it is, then maybe we need to make them more discoverable somehow.
Yes. Connect is a TLS feature so it certainly needs to be enabled and certainly needs a CA to actually sign certificates!
Great suggestion, we should try to make that more obvious. Mostly it just didn't occur to us that people would try to run proxies/native apps without Connect support enabled, but it's totally reasonable to make that not buggy and to have a better error message. In practice though, the agent doesn't necessarily know that Connect is disabled (since we only require that config on the servers) until something (the proxy or a native app) tries to request certificates. In this case your
I'll fix the real bug and do what I can to make the logs helpful in this case, so no separate issue is needed. Thanks! |
As expected this is trivial to reproduce:
|
You have a Golang API: make a quick start for using it, with common caveats such as using Connect without bootstrapping a CA. I was one step away from forking Consul and deleting the whole Connect thing for internal usage :D
There is always a way to return such an error, for example by adding an additional HTTP endpoint that checks whether Connect is enabled on the Consul cluster. Since it would use a configuration value cached in RAM, it would be blazing fast, and it would make it possible to additionally print "Connect is disabled but remote client tries to register" in the agent logs and return the same error to the client that tried to register. This could also prevent launching the SSL certificate watch in a separate goroutine. |
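The suggestion above could look roughly like the following sketch: a guard around the certificate endpoint that consults an in-memory flag, logs a warning and returns an error to the caller. This is only an illustration, not how the Consul agent is implemented; `agentConfig`, `requireConnect` and `statusFor` are hypothetical names, the request path is modelled on the one in the reported log line, and the 500 status follows the "500 or similar" idea mentioned earlier in the thread.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
)

// agentConfig is a hypothetical stand-in for configuration the agent
// already holds in memory.
type agentConfig struct {
	ConnectEnabled bool
}

// requireConnect wraps a handler and rejects the request with an error,
// plus a log line on the agent side, when Connect is disabled.
func requireConnect(cfg *agentConfig, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !cfg.ConnectEnabled {
			log.Printf("[WARN] Connect is disabled but remote client tries to use it: %s %s from=%s",
				r.Method, r.URL.Path, r.RemoteAddr)
			http.Error(w, "Connect is disabled on this cluster", http.StatusInternalServerError)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor sends one GET through the guard and returns the HTTP status
// the client would see.
func statusFor(enabled bool) int {
	// Placeholder for the real certificate handler.
	leaf := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "certificate would be issued here")
	})
	h := requireConnect(&agentConfig{ConnectEnabled: enabled}, leaf)
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/v1/agent/connect/ca/leaf/test", nil)
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(false), statusFor(true)) // 500 200
}
```

Because the check only reads a flag from memory, it adds no round trip to the servers; as noted elsewhere in the thread, though, a client agent may not know the servers' Connect setting, so a real implementation would need some way to learn it.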
Thanks for the feedback!
Yeah, specifically in this case when you use the SDK
The fact that this is not doing service registration is a known issue: it surprises people and isn't clearly documented, so we'll certainly be fixing that. The actual certificate request does get an error, both in the agent logs and in the application, that currently says
Anyway, I'll fix this and take that on board, thanks! |
Is there a method to disable SPIFFE authentication on endpoints and CA certificate generation? I'm wondering what the value of SPIFFE authentication is when the underlying SNI isn't validated against the host records? |
That's my setup:
Where s.Client comes from api.NewClient().
I haven't created any AgentServiceConnect structures, so I think it should be disabled by default.
Everything works: the service registers and its address can be reached via HTTP or DNS requests. The problem is with CA bootstrapping, which I want to disable, as the Consul cluster runs in a private environment without internet access. On service registration and de-registration it executes a request which produces this error:
2018/07/17 09:45:30 [ERR] http: Request GET /v1/agent/connect/ca/leaf/test-192.168.137.144, error: cluster has no CA bootstrapped from=192.168.99.1:54062
After the first message appeared, it started to eat 100-150% CPU of a 4-core host constantly. The first message appeared right after the first service registration.
consul info output from master (which eats CPU):
Complete log for register-deregister with TRACE level:
Configuration for Kubernetes:
Cluster was created with these commands: