Skip to content
This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

Implement automatic certificate rotation #1201

Closed
wants to merge 1 commit into from

Conversation

invidian
Copy link
Member

@invidian invidian commented Nov 19, 2020

Draft on top of #1193

Closes #1215

TODO:

  • Convert from running on every cluster apply to separate sub-command
  • Perhaps don't run updates when rotating the certificates.
  • Figure out how to tell user what's the expiration time of certificates.
  • Add documentation for certificate rotation process, which should mention that ideally all nodes in the cluster should be drained and rebooted after rotation to ensure that all pods using CA certificate etc got restarted.
  • Figure out automated restart of kube-proxy, Calico-node, CoreDNS and other core components.

This commit is a WIP version of automated certificate rotation
implementation, which right now occurs on every cluster update, but as
we discussed, this functionality needs to be moved into separate
subcommand.

Some highlights:
- Added config checksum to kube-apiserver and kube-controller-manager
  to trigger automatic recreation of pods when secrets are updated.

- Added ca-syncer init container to kubelet which ensures that
  bootstrap kubeconfig and one generated by TLS bootstrapping used by
  kubelet and client-ca certificate are always up to date, as kubelet
  seems to be not capable of updating them themselve.

- Temporarily force kube-controller-manager to sign certificates with
  only 45 minutes duration to make sure certificate rotation works as
  expected and all used certificates are rotated.

- When any of certificates used by etcd is changed, trigger
  copy-controller-secrets null_resource to copy etcd certificates on
  all controller nodes and then restart etcd to ensure it picks up
  new certificates, as it can only automatically pick up new client
  certificates and private key, but not CA certificate.

- Temporarily switch certificate validity hours on AWS to 1 hour to
  enforce running 'lokoctl cluster apply' every hour to keep cluster
  alive to make sure certificate rotation works as expected.

- 'lokoctl cluster apply' now taints all certificate resources on
  cluster update and after cluster is updated, it waits for all service
  account token secrets to be updated with the new CA certificate, then
  triggers a restart on all system deployments to make sure they pick up
  new CA certificate. This is core of certificate rotation logic.

TODO:
- Split into multiple commits.
- Move rotation code to separate subcommand.
- Restart etcd on other platforms.
- Figure out if new k8sutil function can be used in e2e tests.

Signed-off-by: Mateusz Gozdek <[email protected]>
@invidian invidian force-pushed the invidian/implement-certificate-rotation branch from 97e674b to 0b360b3 Compare December 7, 2020 16:07
Copy link
Contributor

@meyskens meyskens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving some thoughts mainly for myself when picking this up.

I have one main concern which is that is the CA rotation. Rotating the CA by replacing it means the cluster will be non-responsive quite soon when kubelets pick up the new CA but the API i the old (and vice versa) that will no longer be able to comminucate.
To solve this we should be able to distribute an historic CA chain (just 1 old one will be enough) as trust bundle. Followed by making it trusted by all worker nodes then restarting the API server.

Etcd certs should follow a similar strategy.


const (
// Time to wait between updating DaemonSet/Deployment and start looking if
// workload has converged. This should account for kube-controller-manager election time,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might it be an idea to watch the leader election to ensure this instead of guessing a time

@invidian
Copy link
Member Author

invidian commented Mar 6, 2021

I have one main concern which is that is the CA rotation. Rotating the CA by replacing it means the cluster will be non-responsive quite soon when kubelets pick up the new CA but the API i the old (and vice versa) that will no longer be able to comminucate.
To solve this we should be able to distribute an historic CA chain (just 1 old one will be enough) as trust bundle. Followed by making it trusted by all worker nodes then restarting the API server.

The situation you described only occurs when you replace private key of the CA certificate. We only rotate the certificate itself, so the signatures remains the same, so the communication remains functional. This is a deliberate decision as first step, as changing only the certificates is relatively easy. Changing private keys is indeed harder and remains chaining and further cleanup of old CA certificates etc.

@meyskens
Copy link
Contributor

meyskens commented Mar 7, 2021

I over read that we didn't taint the keys so that indeed helps a lot, in the first iteration of this.
I also looked a bit it my kubeadm built cluster (just by coincidence my admin user cert expired) and i confirmed my suspicion that the CA certs kubadm generates are valid for 10 years and client only for 1/2 years (1.5 i guess, need to look into the code just to be sure) which brings me to another thing i wanted to discuss which is having a longer lived CA duration in Lokomtive, but that of course also depends on how we'll handle the certificate rollovers.

@meyskens
Copy link
Contributor

meyskens commented Mar 7, 2021

Also sorry I am throwing a lot of things here :) Some might just be better as a follow up PR in a new version ;)

@invidian
Copy link
Member Author

invidian commented Mar 7, 2021

I also looked a bit it my kubeadm built cluster (just by coincidence my admin user cert expired) and i confirmed my suspicion that the CA certs kubadm generates are valid for 10 years and client only for 1/2 years (1.5 i guess, need to look into the code just to be sure) which brings me to another thing i wanted to discuss which is having a longer lived CA duration in Lokomtive, but that of course also depends on how we'll handle the certificate rollovers.

We already have CA expiration time configurable, so user may choose to extend it, but I think by default we should offer short-lived certificates and, ideally, fully automated rotation.

@meyskens
Copy link
Contributor

meyskens commented Mar 7, 2021

We already have CA expiration time configurable

I only find one certificate lifetime variable used everywhere, am I missing a variable?

@invidian
Copy link
Member Author

invidian commented Mar 7, 2021

Ah yes, we only have certs_validity_period_hours. Hmm...

@invidian
Copy link
Member Author

Closing in favor of #1435.

@invidian invidian closed this Mar 17, 2021
@invidian invidian deleted the invidian/implement-certificate-rotation branch March 17, 2021 18:19
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement subcommand for certificate rotation and document it's usage
2 participants