[ERR] consul: failed to establish leadership: unknown CA provider "" #4954
Hey @wysenynja, thanks for the bug report. Regarding the config you gave for connect: I tried a few things to reproduce the state you show where it's failing to establish leadership, but couldn't get the same result:
Any steps you have to help reproduce this are appreciated. It may take a fresh cluster to do so if you've got things back to a working state though, as it sounds like the invalid CA config that was preventing the leader election has been fixed (or wasn't an issue on another server).
@kyhavlov I'm experiencing the same issue (without Vault). After upgrading the cluster to 1.3.0 (and then 1.3.1) and enabling connect I receive the same error:
After adding some debug statements to
I attempted to set the
Dumping the leader's agent config shows that connect is enabled, and depending on whether I set the provider it's either
I cannot reproduce this on test clusters with the same scenario. The problematic cluster is moderate in size, with ~1500 nodes.
I have replaced all of the Consul servers in this cluster with new machines and the issue remains.
Attempting to update the CA config also does not work:
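For reference, CA config updates go through `consul connect ca set-config -config-file ca.json`; a minimal file for the built-in provider would look roughly like the following (the values below are illustrative, not the exact payload that was attempted here):

```json
{
  "Provider": "consul",
  "Config": {
    "LeafCertTTL": "72h",
    "RotationPeriod": "2160h"
  }
}
```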
While the server the
Killing the leader when in this state does not trigger an election. The follower nodes do report that there is no leader (as expected). I'm unsure what the expected behaviour should be when in this state.
Regarding this comment: #4954 (comment)
And the error message:
It may be worth investigating if #5016 is showing a symptom of a similar bug. Adding a new server in this case and restoring from a snapshot manually in the case of #5016 could be hitting the same condition. That case includes a repro.
The Docker image referenced in #5016 doesn't seem to include the snapshot (unless I'm missing something obvious). I can get a test cluster in the same state as my broken one by importing the raft db. But I cannot reproduce any other way. I should also note that at no point did I manually restore a snapshot.
Re: the snapshot - that's correct; there's no snapshot built into the image currently. We had been mounting the snapshot from a local directory into the container and then running the restore from there. The issue we are seeing there is also from a
Thanks @agy and @PurrBiscuit for the information in both cases... we're continuing to look into this.
@agy Can you clarify what version you're upgrading from to 1.3.0? Our current hunch is that you could be seeing a symptom of #4535, which was fixed in 1.2.3. If you were ever on a previous version utilizing Connect CA configuration that wrote to the state store, you'd be seeing this because the state store would have been corrupted. If this is the case we're considering adding something like a
Alternatively (the better option, we think) we could add automatic handling of this corrupted CA configuration, which would then allow you to bypass this issue by treating it as a nil configuration. If we added something like that, could you potentially jump to
@pearkes Note: I had tested this upgrade path on a newly provisioned test cluster, and I have used this upgrade path for all the testing that I've done so far. Since I'm able to reproduce this issue on a test cluster by importing the raft store from the current broken cluster, I can test whatever fixes you propose. I agree that your alternate, "better" solution is preferable. Upgrading to
Unfortunately, since I rotated all the members of the broken cluster, I cannot verify if I had enabled connect when the cluster was
@agy Our concern and assumption was that it corrupted the state in
@agy can you also clarify from this comment:
What operation did you do here?
@pearkes I have done both. The
I have also copied over a tarball of the Consul
What I do is:
As previously mentioned, this is far from ideal and is only to allow me to attempt some workarounds/fixes. File listing:
@wysenynja you mentioned that you had a similar issue when upgrading
I don’t believe so. I’m running 1.4 now without issue. I am able to rebuild this cluster easily and it didn’t happen when I started fresh.
This PR both prevents a blank CA config from being written out to a snapshot and allows Consul to gracefully recover from a snapshot with an invalid CA config. Fixes #4954.
I opened #5061 which should fix this - any older versions from 1.2.3 forward will be able to cherry-pick this fix using these steps: #5016 (comment)
I got this issue when enabling connect in 1.4.0, too.
#5061 solved my problem.
Prevent blank CA config from being committed to the snapshot. hashicorp#4954
I was able to reproduce this error with the following server configuration
This is on consul 1.5.1. My agents report
I can remove this configuration section, but when I do I start getting registration errors that RPC resources are exhausted, try again. So I assumed I should increase the number of certificate requests per second (and in parallel), as I'm kicking off some ~175 sidecars across the whole cluster on deployment.
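If those registration errors really are the Connect CSR rate limit, the built-in CA exposes tuning knobs for it in the agent configuration. This is a sketch using the documented `csr_max_per_second` option for the built-in provider; the value here is a guess for ~175 simultaneous sidecars, not something verified against this cluster:

```hcl
connect {
  enabled = true

  ca_config {
    # Rate limit for leaf-certificate signing requests handled by the
    # built-in CA. The default is 50 per second; raising it (or using
    # csr_max_concurrent instead) lets a large simultaneous deployment
    # get its certificates signed without tripping the limiter.
    csr_max_per_second = 200
  }
}
```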
Overview of the Issue
I'm trying to use consul connect and now my cluster is partially broken.
Reproduction Steps
I'm not sure how exactly to reproduce. I've been poking at this cluster too much to have clear steps. I'll try more tomorrow.
I put this in /etc/consul.d/config.hcl:
When I upgraded consul from 1.2.3 to 1.3.0 and added that config, something went wrong and leader election failed.
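For comparison, the simplest Connect server configuration described in the docs is just the enabled flag; this is an illustration, not necessarily the exact stanza that was added here:

```hcl
connect {
  enabled = true
}
```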
I manually recovered by poking peers.json and now the cluster has a leader:
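For anyone else recovering this way: the peers.json outage-recovery file (placed in each server's raft/ directory while the servers are stopped, per the Consul outage guide) looks like this for Raft protocol 3. The IDs and addresses below are placeholders:

```json
[
  {
    "id": "8bce52b7-0000-0000-0000-000000000001",
    "address": "10.0.0.11:8300",
    "non_voter": false
  },
  {
    "id": "8bce52b7-0000-0000-0000-000000000002",
    "address": "10.0.0.12:8300",
    "non_voter": false
  }
]
```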
However, something is still wrong with the cluster. The UI is showing stale data (#4923, maybe?) and vault is stuck in standby.
I would like to use vault for consul connect, but was just trying to get it working with the simplest setup first. How do I migrate to vault from this broken connect provider?
From my reading of the docs, the ca stuff was all supposed to be automatic so I don't know how to do it manually. I think I need to run
consul connect ca set-config -config-file ca.json
but don't know what to put for ca.json (also, why no HCL support here?).
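For the Vault provider specifically, the documented shape of ca.json is along these lines; the address, token, and PKI mount paths below are placeholders, not values from this cluster:

```json
{
  "Provider": "vault",
  "Config": {
    "Address": "https://vault.example.com:8200",
    "Token": "<vault-token-with-pki-access>",
    "RootPKIPath": "connect-root",
    "IntermediatePKIPath": "connect-intermediate"
  }
}
```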
Consul info for both Client and Server
Client info
Server info
Operating system and Environment details
I'm running everything inside docker containers on a single host.
Log Fragments