Consul Connect configuration in multi datacenter setup with Vault provider #6819

peimanja opened this issue Nov 20, 2019 · 10 comments

@peimanja

I am currently setting up Consul clusters in a multi-datacenter configuration with Connect enabled, using the Vault CA provider.

Consul 1.6.2

We have a root-pki mount, and we set up two intermediate PKI mounts for Connect: consul-connect-west-pki and consul-connect-east-pki.
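
For reference, a hedged sketch of how those PKI mounts might be enabled on the Vault side (assuming an already-authenticated Vault CLI and the mount names above):

# One root PKI mount plus one intermediate PKI mount per datacenter,
# using the mount names mentioned above.
vault secrets enable -path=root-pki pki
vault secrets enable -path=consul-connect-west-pki pki
vault secrets enable -path=consul-connect-east-pki pki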

Connect configuration on the primary datacenter:

{
    "connect": {
        "ca_config": {
            "address": "https://our-vault-address",
            "intermediate_pki_path": "consul-connect-west-pki",
            "root_pki_path": "root-pki",
            "token": "REDACTED"
        },
        "ca_provider": "vault",
        "enabled": true
    }
}

Connect configuration on the secondary datacenter:

{
    "connect": {
        "ca_config": {
            "address": "https://our-vault-address",
            "intermediate_pki_path": "consul-connect-east-pki",
            "root_pki_path": "root-pki",
            "token": "REDACTED"
        },
        "ca_provider": "vault",
        "enabled": true
    }
}

Is this the right approach? Or should I set root_pki_path for the secondary datacenter to point to the primary's intermediate PKI, like this:

{
    "connect": {
        "ca_config": {
            "address": "https://our-vault-address",
            "intermediate_pki_path": "consul-connect-east-pki",
            "root_pki_path": "consul-connect-west-pki",
            "token": "REDACTED"
        },
        "ca_provider": "vault",
        "enabled": true
    }
}
@crhino
Contributor

crhino commented Nov 26, 2019

Hi there. Have you gotten your current configuration working?

We definitely do not recommend setting the secondary DC root_pki_path to the intermediate_pki_path of the primary. Whether or not the two datacenters should have the same root_pki_path is a little less clear. Another possibility is to have completely separate root_pki_path entries for each datacenter.

For future reference, Issues on GitHub for Consul are intended to be related to bugs or feature requests, so we recommend using our other community resources instead of asking here.

@crhino crhino added the waiting-reply Waiting on response from Original Poster or another individual in the thread label Nov 26, 2019
@peimanja
Author

peimanja commented Nov 27, 2019

Thanks for the reply @crhino. I guess a blog post on a multi-datacenter setup with Connect enabled using the Vault provider would be useful here.

Another possibility is to have completely separate root_pki_path entries for each datacenter.

But should they share the same CA? Connect is new to us, so we are wondering how mesh gateways and proxies work in a multi-datacenter setup.
For now we are not enabling Connect and will figure this out later.

@stale stale bot removed the waiting-reply Waiting on response from Original Poster or another individual in the thread label Nov 27, 2019
@crhino
Contributor

crhino commented Dec 2, 2019

They do not need to share the same CA.

I did some digging to understand this myself, and the key insight is that, in a secondary datacenter, the root_pki_path for the Vault provider is not used to sign Connect certificates at all. A secondary datacenter requests an intermediate CA signed by the primary datacenter's root CA, and then uses that intermediate to sign all leaf certificates. See here for more info: https://www.consul.io/docs/connect/connect-internals.html#certificate-authority-federation
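
One way to sanity-check this (a hedged sketch; /v1/connect/ca/roots is Consul's Connect CA roots endpoint, and the jq filter is purely illustrative) is to compare the CA roots reported in each datacenter:

# Run against a server in each datacenter; both should report the same
# trust domain and active root, since the secondary only holds an
# intermediate signed by the primary's root CA.
curl -s http://127.0.0.1:8500/v1/connect/ca/roots | jq '.TrustDomain, .ActiveRootID'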

I agree that we can improve the documentation and guidance around setting up this type of cluster, and I'll take an action to see what can be done about that.

Let me know if you have any more questions!

@crhino crhino self-assigned this Dec 2, 2019
@crhino crhino added type/docs Documentation needs to be created/updated/clarified theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies labels Dec 2, 2019
@peimanja
Author

peimanja commented Dec 3, 2019

So the Vault token on the secondary datacenter should have the following policies, right?

path "secondary-dc-root-pki/*" {
  capabilities = ["read", "list"]
}

path "secondary-dc-root-pki/root/sign-intermediate" {
  capabilities = ["sudo", "create", "update"]
}

path "primary-dc-connect-pki/*" {
  capabilities = ["read", "list"]
}

path "primary-dc-connect-pki/root/sign-intermediate" {
  capabilities = ["sudo", "create", "update"]
}

path "secondary-dc-connect-pki/*" {
  capabilities = ["create", "read", "update", "list", "sudo"]
}
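
A hedged sketch of how such a policy might be loaded and a token minted for the secondary datacenter's Consul servers (the policy and file names are examples, not something established in this thread):

# Load the policy and create a token that carries it.
vault policy write consul-connect-secondary secondary-connect-policy.hcl
vault token create -policy=consul-connect-secondary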

If the required Vault policies were documented, that would be very helpful, because in most enterprise setups they need to be well defined.

@jsosulska jsosulska added the theme/consul-vault Relating to Consul & Vault interactions label Jun 23, 2020
@quinndiggity

Just a note - if multiple Consul datacenters share ANY PKI mounts under Vault, they will compete and overwrite the CA key+cert, eventually failing with "error fetching CA certificate: stored CA information not able to be parsed" when they inevitably hit the following flow:

  • dc2 hits /pki_consul_intermediate/intermediate/generate/
  • dc3 hits /pki_consul_intermediate/intermediate/generate/
  • dc2 hits /pki_consul_intermediate/intermediate/set-signed

hashicorp/vault#2685
this ☝️ issue will eventually happen as a direct result

This makes it extremely painful if you want to scale to N secondary datacenters with a single primary datacenter, as you end up needing N PKI mounts; when you are working with 30+ datacenters it gets ridiculous.
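
As a hedged illustration of the resulting layout (mount names are assumptions), every secondary datacenter ends up with its own intermediate mount, so that no two datacenters call intermediate/generate and intermediate/set-signed against the same path:

# One intermediate PKI mount per datacenter, e.g.:
vault secrets enable -path=pki_consul_intermediate_dc2 pki
vault secrets enable -path=pki_consul_intermediate_dc3 pki
# ...and so on for each of the N secondary datacenters.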

I believe there may be a race condition of some sort (or it's just broken; after the number of hours I've invested in troubleshooting Consul problems, I don't care to track down the specifics anymore, only to have the issue closed by someone like @jefferai without looking very far into it).

Incidentally, there also seems to be another issue that can trigger hashicorp/vault#2685: when using the Vault provider for Terraform, you can set /pki_consul_intermediate/config/urls before Consul hits /pki_consul_intermediate/intermediate/generate/, which also seems to corrupt the CA keypair under some circumstances.

Lastly, make sure you include both the Consul PKI root AND intermediate certs in the SAME file under either ca_path or ca_file, otherwise you'll be up against "remote error: tls: bad certificate" until you realize that only a self-signed CA certificate can ever be considered, well, a certificate authority (specifically a root). You CANNOT have /etc/consul.d/ssl.ca.d/root.pem AND /etc/consul.d/ssl.ca.d/intermediate.pem as separate files; they MUST be in a single file as a chain.
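
For example, a minimal sketch reusing the paths above (the combined file name and pointing ca_file at it are assumptions):

# Concatenate the root and intermediate certs into a single chain file,
# then reference the combined file via ca_file.
cat /etc/consul.d/ssl.ca.d/root.pem \
    /etc/consul.d/ssl.ca.d/intermediate.pem \
    > /etc/consul.d/ca-chain.pem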

Fuck, this was frustrating to sort out, and Consul/Envoy do very little to help with their logs.

@quinndiggity

My last major complaint is that there is absolutely no way to have a deterministic cluster id, so it's literally impossible to bootstrap a cluster with a static:

connect = {
  enabled = true
  ca_provider = "consul"
  ca_config   = {
    private_key        = "...CONSUL_CA_KEY_CONTENTS..."
    root_cert          = "...CONSUL_CA_CRT_CONTENTS..."
  }
}

or the equivalent with Vault and vault write pki/config/ca pem_bundle="@/tmp/vault.bundle.pem", because the cluster id is only ever known after the first bootstrap, and it is needed in order to sign a CA cert with a SAN of URI: spiffe://...cluster id....consul
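
As a hedged illustration of why this is a chicken-and-egg problem, the trust domain (which embeds the cluster id) can only be read back after the first CA bootstrap has already happened:

# Only available once the cluster has bootstrapped its Connect CA.
curl -s http://127.0.0.1:8500/v1/connect/ca/roots | jq -r '.TrustDomain'
# => e.g. <cluster-id>.consul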

Consul Connect (specifically the PKI) is easily one of the most frustrating tools I've ever worked with - it's right up there with Asterisk and complex multiple layer NAT traversal.

@GordonMcKinney

Comcast has experienced the same issue as outlined above.

Impact: New service instances cannot join the mesh when the intermediate is corrupt. This causes significant issues for services scaling up to handle more customer demand.

The following screenshot demonstrates the error in production:

[Screenshot: the error in production, Nov 29, 2021]

@GordonMcKinney

@dnephin NOTE: Your doc PR #11143 should address this issue also. cc @quinndiggity

@dnephin
Contributor

dnephin commented Nov 29, 2021

Thanks for the note! Looking at some of the other discussion in this issue, my proposal and findings in #11159 may also be of interest.

@GordonMcKinney

Yes, #11159 has value for this issue.
