Acme in HA Traefik Scenario #348

Closed
stongo opened this issue May 2, 2016 · 24 comments

@stongo
Contributor

stongo commented May 2, 2016

Acme integration works great when Traefik is running as a single instance.
When running Traefik as an HA service, for instance behind DNS round robin, the Acme integration model breaks down because Let's Encrypt will only issue a certificate to one Traefik instance in a given cluster, causing all other Traefik instances to throw an error and shut down.
I think the easiest solution would be to write the certificates to whatever shared backend is already configured (e.g. etcd, consul) as a secondary source for the certificates.
Any feedback would be appreciated on this.

@kbroughton

+1
this might also be useful
https://github.com/alex/letsencrypt-aws

@emilevauge added the kind/enhancement label May 2, 2016
@emilevauge
Member

@stongo For sure.
I would also like to be able to store everything from traefik.toml in a KV store, in order to have a centralized configuration.
Maybe that should be done in the same PR.

@ekozan

ekozan commented May 6, 2016

I love this idea 👍 We could also add Kubernetes secrets as a storage option.

@nlf

nlf commented May 10, 2016

to take this a step further, there are also complications responding to acme challenges in a DNS round robin situation. each node generates its own challenge response, which causes the challenge to fail unless by some miracle the same node responds.

to get around this, we should store the challenge response certs in the shared backend as well
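
A minimal sketch of that idea, assuming an HTTP-01 style challenge (a token served under `/.well-known/acme-challenge/`) for simplicity; a TLS-based challenge would store a certificate instead. `SharedStore` and `Node` are hypothetical names, and the in-memory dict only stands in for a shared backend such as Consul or etcd. It is not Traefik's or libkv's API.

```python
# Sketch only: SharedStore is a hypothetical in-memory stand-in for a shared
# KV backend; Node is a hypothetical proxy instance.
from typing import Dict, Optional

CHALLENGE_PREFIX = "/.well-known/acme-challenge/"


class SharedStore:
    """Stand-in for a KV store that every node can reach."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class Node:
    """One proxy instance behind DNS round robin."""

    def __init__(self, name: str, store: SharedStore) -> None:
        self.name = name
        self.store = store

    def start_challenge(self, token: str, key_authorization: str) -> None:
        # Publish the response before asking the CA to validate.
        self.store.put("acme/challenges/" + token, key_authorization)

    def handle_http(self, path: str) -> Optional[str]:
        # Any node can answer, whichever node started the challenge.
        if not path.startswith(CHALLENGE_PREFIX):
            return None
        token = path[len(CHALLENGE_PREFIX):]
        return self.store.get("acme/challenges/" + token)


store = SharedStore()
node_a, node_b = Node("a", store), Node("b", store)
node_a.start_challenge("tok123", "tok123.thumbprint")
answer = node_b.handle_http(CHALLENGE_PREFIX + "tok123")
```

Here node `b` can answer a validation request for a challenge that node `a` started, which is the case that fails when each node keeps its challenge response to itself.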

@nlf

nlf commented May 10, 2016

also libkv supports boltdb as a local (non-shared) store, how do we feel about just making all storage go through libkv? so instead of acme.json we'd create a boltdb with the certificates stored in it, and if you configured a shared backend (currently consul, etcd or zookeeper) they would go there. this means there are no real conditionals, just a configured backend.
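
A rough sketch of the "no conditionals, just a configured backend" idea. Everything here is hypothetical: `CertStore`, `LocalFileStore`, `MemorySharedStore` and `open_store` are illustration names, the JSON file stands in for a local boltdb, and the class-level dict stands in for Consul, etcd or ZooKeeper. None of it is libkv's actual API.

```python
# Sketch only: hypothetical stand-ins, not libkv's API.
import json
import os
from typing import Dict, Optional, Protocol


class CertStore(Protocol):
    def put(self, key: str, value: str) -> None: ...
    def get(self, key: str) -> Optional[str]: ...


class LocalFileStore:
    """Local, non-shared store: the role boltdb would play in the proposal."""

    def __init__(self, path: str) -> None:
        self.path = path

    def _load(self) -> Dict[str, str]:
        if not os.path.exists(self.path):
            return {}
        with open(self.path, encoding="utf-8") as fh:
            return json.load(fh)

    def put(self, key: str, value: str) -> None:
        data = self._load()
        data[key] = value
        with open(self.path, "w", encoding="utf-8") as fh:
            json.dump(data, fh)

    def get(self, key: str) -> Optional[str]:
        return self._load().get(key)


class MemorySharedStore:
    """Stand-in for a shared backend; all instances see the same dict."""

    _cluster: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        MemorySharedStore._cluster[key] = value

    def get(self, key: str) -> Optional[str]:
        return MemorySharedStore._cluster.get(key)


def open_store(backend: str, path: str = "acme-store.json") -> CertStore:
    # The backend is chosen once, from configuration.
    if backend == "local":
        return LocalFileStore(path)
    if backend == "shared":
        return MemorySharedStore()
    raise ValueError("unknown backend: " + backend)


def save_certificate(store: CertStore, domain: str, pem: str) -> None:
    # ACME code talks to the interface only and never checks the backend type.
    store.put("acme/certs/" + domain, pem)
```

The point of the design is that `save_certificate` is identical for a single node and for a cluster; only the configured backend differs.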

@emilevauge
Member

emilevauge commented May 12, 2016

@nlf: that's basically what I'm thinking about ;)

I would also like to be able to store everything from traefik.toml in a KV store, in order to have a centralized configuration.

Your point on DNS round robin is a really good catch!

@kbroughton

Hashicorp vault would be good for the kv store

@emilevauge
Member

@kbroughton as Traefik relies on libkv: docker/libkv#123

@abourget

does libkv handle Kubernetes or does Traefik talk directly to etcd in a kubernetes cluster ?

@errm
Contributor

errm commented May 26, 2016

Traefik talks to the kubernetes api. It does not have any access to underlying storage.

I think that on kubernetes we could use secrets as a backend store...

@abourget

abourget commented May 26, 2016

so all Traefik instances could watch a certain secret ? and configure to respond to challenges, and then also dispatch the newly gotten SSL certs to all Traefik instances, again, watching the certs as they become available ?

that'd be reaaaallly cool :)

@abourget

abourget commented Jul 5, 2016

Is there anything I could do to help out ? Having Traefik as a load-balanced front-end, able to negotiate new certificates would be exceptionally delightful. It would replace CloudFlare + ELB altogether for me.

I see two issues:

  • Storing the challenges somewhere so that all Traefik instances can reply to the ACME requests.
  • Storing the certificates so that any cert granted for a domain can be accessed (or reloaded) by all the Traefik instances.

I'm thinking more Kubernetes here. Maybe this discussion belongs in a new issue.

I'd assume we could store the challenges in a Kubernetes Secret. Traefik could watch the given secret and keep an in-memory copy of the challenges, ready for when an ACME request arrives.

For the certificates, it seems a bit more tricky. Could we listen for a bunch of secrets? Or a single secret with prefixed key=value pairs that would be translated to Traefik domains? Something like:

apiVersion: v1
kind: Secret
metadata:
  name: traefik-secrets
type: Opaque
data:
  acme-ABCDEFGHIJKL123123123213: ABCDEF12345678==
  acme-MNOPQRSTUVWX456456456456: ABCDEF12345678==
  cert-domain.com-ca: ABCDEF12345678...ABCDEF12345678==
  cert-domain.com-cert: ABCDEF12345678...ABCDEF12345678==
  cert-domain.com-key: ABCDEF12345678...ABCDEF12345678==

Being able to atomically update certain keys in data would be great... otherwise we'd risk overwriting things that another Traefik instance attempts to write in there.
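
One common way to get that kind of safety is optimistic concurrency: read the object together with its version, modify a copy, and write back only if the version has not changed. Kubernetes exposes this through `resourceVersion`, rejecting a stale update with a conflict. The sketch below assumes that model but uses a hypothetical in-memory `VersionedSecret` rather than a real Kubernetes client.

```python
# Sketch only: VersionedSecret is a hypothetical in-memory object that mimics
# version-checked updates; it is not a Kubernetes client.
from typing import Dict, Tuple


class Conflict(Exception):
    """Raised when the write is based on a stale version."""


class VersionedSecret:
    def __init__(self) -> None:
        self._version = 0
        self._data: Dict[str, str] = {}

    def read(self) -> Tuple[int, Dict[str, str]]:
        # Return a copy so callers cannot mutate the stored data directly.
        return self._version, dict(self._data)

    def replace(self, expected_version: int, data: Dict[str, str]) -> None:
        # Compare-and-swap: only accept writes based on the latest version.
        if expected_version != self._version:
            raise Conflict("secret changed since it was read")
        self._data = dict(data)
        self._version += 1


def update_key(secret: VersionedSecret, key: str, value: str, retries: int = 5) -> None:
    """Set one key without clobbering keys written concurrently by others."""
    for _ in range(retries):
        version, data = secret.read()
        data[key] = value
        try:
            secret.replace(version, data)
            return
        except Conflict:
            continue  # Someone else wrote first: re-read and try again.
    raise Conflict("gave up after %d attempts" % retries)
```

A writer that loses the race re-reads the latest data, so its update is merged on top of the other instance's write instead of replacing it.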

Other option: we could have a small redis service somewhere to centralize that information.. on Kubernetes it's not that costly if it provides full HA Traefik with automatic SSL :)

What do you think ?

@abourget

abourget commented Jul 6, 2016

Thought of another option: running a single pod RC and forwarding traffic from traefik instances for /.well-known/... to that pod so it can centrally manage challenges and update TLS secrets that traefik instances would pick up.
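
As a rough illustration of that routing rule (not Traefik configuration), the decision is a plain path-prefix match. `pick_backend` and the backend names are hypothetical.

```python
# Sketch only: pick_backend and the backend names are hypothetical.
WELL_KNOWN_PREFIX = "/.well-known/"


def pick_backend(path: str, acme_backend: str, default_backend: str) -> str:
    """Send challenge traffic to the single ACME pod, everything else onward."""
    if path.startswith(WELL_KNOWN_PREFIX):
        return acme_backend
    return default_backend
```

Every Traefik instance applies the same rule, so the validation request reaches the ACME pod whichever instance receives it.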

For TLS certs storage, I thought maybe annotations could be used to declare the name of the secret to use when writing let's encrypt certs.

@emilevauge
Member

Hi @abourget, for now, we are working on using kv-stores (etcd, consul, zookeeper...) as a global configuration storage #481.
We will first implement ACME HA using these KV stores.
It would also be great to use k8s secrets in the future :) Or we could also use the ConfigMap.
We would have to add kubernetes secrets/configmap support in https://github.com/docker/libkv and then use directly in traefik.
Sounds good ?

@dts

dts commented Jul 6, 2016

The way I did this (in a hurry), was to have a top-level rule that sent all .well-known traffic to a container running an acme manager. The long-lived process managed all the certs, I stored them in a Kubernetes Secret, and then killed the LB pods to load the fresh secret. I wound up using HAProxy (because I found better docs for complex rulesets), but it would work exactly the same with Traefik.

@abourget

abourget commented Jul 6, 2016

@dts HAProxy with some Kubernetes glue in front ? Anything publicly available ?

@cretz

cretz commented Jul 6, 2016

Just a heads up, I am in the process of implementing this for Caddy (abstraction is at caddyserver/caddy#913 and Consul impl not pushed to a repo yet). Some things that I hit that y'all might want to consider when developing:

  • Goes without saying that data needs to be encrypted. Could use Vault, but I just chose to AES encrypt the vals in Consul.
  • Things can get racy on auto renewal because you do not know which server is doing the renewal so you'll be tempted to do a shared mutex. The problem is the conditions for renewal often hit multiple servers at the same time, so you need to bail quickly if you can't take the lock (a master/slave leader election process resolves some of these issues, but can damage full HA-ness).
  • Things can get racy on the second-level local cache you undoubtedly have to have for your certs. Essentially you need to send out an event to the cluster when a cert is updated. Consul supports this of course, just food for thought. Just make sure you don't hold up a request waiting on this new cert. So, renewing w/ enough time to spare helps.
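
The "bail quickly if you can't take the lock" point can be sketched as a non-blocking acquire. This is an assumption-laden illustration: `threading.Lock` stands in for a distributed lock in the shared store, `try_renew` is a hypothetical helper, and the 30-day renewal window is an arbitrary choice, not a value from Caddy or Traefik.

```python
# Sketch only: threading.Lock stands in for a distributed lock, and the
# 30-day window is an assumed value.
import threading
from datetime import datetime, timedelta
from typing import Callable

RENEW_BEFORE = timedelta(days=30)


def needs_renewal(not_after: datetime, now: datetime) -> bool:
    # Renew well before expiry so no request ever waits on a new certificate.
    return not_after - now <= RENEW_BEFORE


def try_renew(lock: threading.Lock, not_after: datetime, now: datetime,
              renew: Callable[[], None]) -> str:
    if not needs_renewal(not_after, now):
        return "fresh"
    # Do not wait: if another server holds the lock, it is already renewing.
    if not lock.acquire(blocking=False):
        return "skipped"
    try:
        renew()
        return "renewed"
    finally:
        lock.release()
```

A server that gets `"skipped"` keeps serving the current certificate and picks up the new one when the cluster is notified of the update.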

@dts

dts commented Jul 7, 2016

@abourget: WIP, caveat emptor, etc, but the HAProxy component is here: https://github.com/dts/kubernetes-haproxy-lb (originally stolen from kubernetes/contrib). The bit for connecting ACME and Kubernetes secrets is here: https://github.com/dts/kubernetes-acme. They don't depend on each other, you could easily use the ACME part with traefik.

@stongo
Contributor Author

stongo commented Jul 12, 2016

The other option would be to implement a leadership election where the leader is the only one to attempt to generate certificates and respond to acme challenges. Not sure if this is any easier, but definitely another solution to this problem. It would at least solve distributed locking complexities.

@emilevauge
Member

@stongo that's what I was thinking about, using https://github.com/docker/leadership. But in this case, slave nodes will have to forward ACME requests to the master node.
I still don't know which solution will be easier to implement.
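
A minimal sketch of the leader-plus-forwarding variant, assuming a lease with a time-to-live held in the shared store. `LeaseStore` and `route_acme_request` are hypothetical and do not reflect the docker/leadership API; the clock is passed in explicitly to keep the example deterministic.

```python
# Sketch only: LeaseStore is a hypothetical in-memory lease, not the
# docker/leadership API.
from typing import Optional


class LeaseStore:
    """A single leadership lease that expires unless the holder renews it."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def campaign(self, node: str, now: float) -> bool:
        # A node wins if the lease is free or expired, or if it already holds it.
        if self.holder is None or now >= self.expires_at or self.holder == node:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False

    def leader(self, now: float) -> Optional[str]:
        if self.holder is not None and now < self.expires_at:
            return self.holder
        return None


def route_acme_request(node: str, lease: LeaseStore, now: float) -> str:
    """Decide what a node does with an incoming ACME challenge request."""
    leader = lease.leader(now)
    if leader is None:
        return "retry-later"  # No leader yet: the challenge cannot be answered.
    if leader == node:
        return "handle-locally"
    return "forward-to:" + leader
```

Only the lease holder requests certificates, which avoids a per-certificate distributed lock at the cost of a forwarding hop on the other nodes.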

@stongo
Contributor Author

stongo commented Jul 12, 2016

@emilevauge do you know if slaves would be able to abort acme requests and make the assumption that eventually the master node will receive the acme challenge? it adds more time delay to the initial generation of the cert, but reduces implementation complexity

@munnerz

munnerz commented Jul 15, 2016

I've implemented a fully HA letsencrypt microservice for Kubernetes, that relies upon Secrets to acquire locks on certificate resources, and ConfigMaps for storing the actual certificates that can be shared. The locking is implemented through a generic interface with only a Secrets-based implementation, but could be swapped out for anything that'll allow for atomic updating (eg. etcd)

You can see the full implementation here: https://github.com/munnerz/kube-acme

I'd be happy to answer any questions or help out with any dev on this :)

@niieani

niieani commented Aug 30, 2016

It would be better to do this in a backend-independent way.
I think the solution could be as simple as auto-negotiation between all the instances of Traefik:

  1. Traefik does a DNS query for each domain listed in acme.domains
  2. If there is more than one A/AAAA record in the responses, it:
    1. waits for all the Traefik instances to be online (all servers must respond properly)
    2. contacts all the Traefik instances via the API and synchronizes the certificates in a secure way (e.g. using a hand-generated key pair / secret, which is pre-shared by the user to all the instances)
  3. If no valid certificate is present:
    1. it negotiates which server will start an ACME query to get a renewed certificate
    2. .well-known traffic is "shared" - i.e. the instance which is sent the query asks the instance that initiated the ACME negotiation for the data to reply to ACME
    3. the new certificate is synchronized across the Traefik instances

This only requires the user to generate and share a key-pair (or better yet, a configuration secret!) across all servers. Additionally, the same key-pair could be also shared using various secure Secret backends and vaults.
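
Steps 1, 2 and 3.1 could be sketched as follows. The proposal does not say how instances negotiate, so the tie-break used here (the numerically lowest address starts the ACME request) is purely an assumption, and `resolve_addresses`, `is_ha` and `pick_initiator` are hypothetical names. Only `socket.getaddrinfo` and the `ipaddress` module are real standard-library calls.

```python
# Sketch only: the "lowest address wins" rule is an assumed tie-break.
import ipaddress
import socket
from typing import Iterable, List, Tuple


def _sort_key(address: str) -> Tuple[int, int]:
    # Order by IP version first so IPv4 and IPv6 never compare directly.
    ip = ipaddress.ip_address(address)
    return (ip.version, int(ip))


def resolve_addresses(domain: str) -> List[str]:
    """Step 1: collect the distinct A/AAAA addresses for a domain."""
    infos = socket.getaddrinfo(domain, None)
    return sorted({info[4][0] for info in infos}, key=_sort_key)


def is_ha(addresses: Iterable[str]) -> bool:
    """Step 2: more than one address means several instances serve the domain."""
    return len(set(addresses)) > 1


def pick_initiator(addresses: Iterable[str]) -> str:
    """Step 3.1: every instance computes the same winner without extra messages."""
    return min(set(addresses), key=_sort_key)
```

Because each instance derives the same answer from the same DNS data, no extra round of messages is needed to agree on who starts the ACME request; the certificate synchronization in steps 2.2 and 3.3 still needs the pre-shared secret.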

@strarsis

A dedicated secrets store such as Vault could be a good fit here, since TLS/SSL private keys are secrets.

@ldez added the area/acme label Jun 11, 2017
@traefik locked and limited conversation to collaborators Sep 1, 2019