No-Downtime CA Certificate Rotation #1430

Open
6 tasks
jeremyharisch opened this issue Mar 26, 2024 · 1 comment
Assignees
Labels
Epic kind/feature Categorizes issue or PR as related to a new feature.

Comments

@jeremyharisch
Contributor

Description

In #1428 it was decided to implement a No-Downtime solution for our CA-Certificate which is used in our PKI for the runtime-watcher.

This approach involves a two-phase migration of clients, allowing for a slow, gradual transition with zero downtime. The table below lists the detailed steps of the procedure when the CA-Certificate gets rotated. 'rootA+rootB' signifies that the two CA-Certificates have been concatenated and set as the 'ca.crt' value in the certificate secret. Transitioning from 'rootA+rootB' to 'rootB' means truncating the CA-Certificate string by removing the first certificate from the concatenation.
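As a rough illustration of the two bundle operations described above (concatenating into 'rootA+rootB' and later dropping the first certificate), here is a minimal Go sketch using only the standard library. The function names `appendToBundle`, `dropFirstCert`, `countCerts` and `fakePEM` are hypothetical and not taken from the Lifecycle-Manager code base, and the PEM payloads are placeholders rather than real certificates.

```go
package main

import (
	"bytes"
	"encoding/pem"
	"errors"
	"fmt"
)

// fakePEM wraps a placeholder payload in a CERTIFICATE PEM block.
// It stands in for a real CA certificate in this sketch.
func fakePEM(payload string) []byte {
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: []byte(payload)})
}

// appendToBundle returns a new bundle consisting of bundle followed by extra,
// separated by exactly one newline. The inputs are not modified.
func appendToBundle(bundle, extra []byte) []byte {
	out := append([]byte{}, bytes.TrimRight(bundle, "\n")...)
	if len(out) > 0 {
		out = append(out, '\n')
	}
	return append(out, extra...)
}

// countCerts counts the PEM blocks in a bundle.
func countCerts(bundle []byte) int {
	n := 0
	for {
		var block *pem.Block
		block, bundle = pem.Decode(bundle)
		if block == nil {
			return n
		}
		n++
	}
}

// dropFirstCert removes the first PEM block from the bundle and returns the
// re-encoded remainder. It refuses to return an empty bundle.
func dropFirstCert(bundle []byte) ([]byte, error) {
	first, rest := pem.Decode(bundle)
	if first == nil {
		return nil, errors.New("bundle contains no PEM block")
	}
	var out []byte
	for {
		var block *pem.Block
		block, rest = pem.Decode(rest)
		if block == nil {
			break
		}
		out = append(out, pem.EncodeToMemory(block)...)
	}
	if len(out) == 0 {
		return nil, errors.New("refusing to produce an empty bundle")
	}
	return out, nil
}

func main() {
	rootA, rootB := fakePEM("rootA"), fakePEM("rootB")

	bundle := appendToBundle(rootA, rootB) // 'rootA+rootB'
	fmt.Println("certs in bundle:", countCerts(bundle))

	onlyB, err := dropFirstCert(bundle) // 'rootA+rootB' -> 'rootB'
	if err != nil {
		panic(err)
	}
	fmt.Println("remainder equals rootB:", bytes.Equal(onlyB, rootB))
}
```

Refusing to emit an empty bundle is a design choice of this sketch: an empty 'ca.crt' would reject every peer, so failing loudly seems safer than writing it to a secret.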

Detailed description of the procedure

| Step | Step-Name | Gateway Server Cert | Gateway Accepts Clients (CACert on KCP) | Clients Accept Server (CACert on SKR) | Client Cert | Note |
|------|-----------|---------------------|------------------------------------------|----------------------------------------|-------------|------|
| 01 | Initial setup | rootA | rootA | rootA | rootA | |
| 02 | Generate rootB cert in KCP | rootA | rootA | rootA | rootA | |
| 03 | Reconfigure the Gateway in the KCP | rootA | rootA+rootB | rootA | rootA | All clients with the old certificates signed by rootA still work |
| 04 | Migrate clients to certificates signed by rootB | rootA | rootA+rootB | rootA+rootB | rootB | |
| 05 | After all clients are migrated, switch the Gateway to accept only certs signed by rootB | rootB | rootB | rootB | rootB | |
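To make the table easier to reason about, the following Go sketch encodes each row as data and checks one invariant: at every step, the mutual TLS handshake succeeds both for clients already at that step and for clients still one step behind. The type and function names (`config`, `handshakeOK`, `lagSafe`) are assumptions made for illustration, and trust is modelled simply as set membership of CA names, not as real X.509 verification.

```go
package main

import "fmt"

// caSet is the set of CA names one side trusts (the content of its ca.crt).
type caSet map[string]bool

func set(names ...string) caSet {
	s := caSet{}
	for _, n := range names {
		s[n] = true
	}
	return s
}

// config is one row of the table. Field names are illustrative only.
type config struct {
	name          string
	serverCert    string // CA behind the Gateway server cert
	gatewayTrusts caSet  // CACert on KCP
	clientTrusts  caSet  // CACert on SKR
	clientCert    string // CA behind the client cert
}

var rotation = []config{
	{"01 initial setup", "rootA", set("rootA"), set("rootA"), "rootA"},
	{"02 generate rootB", "rootA", set("rootA"), set("rootA"), "rootA"},
	{"03 reconfigure gateway", "rootA", set("rootA", "rootB"), set("rootA"), "rootA"},
	{"04 migrate clients", "rootA", set("rootA", "rootB"), set("rootA", "rootB"), "rootB"},
	{"05 switch gateway", "rootB", set("rootB"), set("rootB"), "rootB"},
}

// handshakeOK reports whether a client with the client-side settings of row
// `client` can talk to a Gateway with the gateway-side settings of row `gateway`.
func handshakeOK(gateway, client config) bool {
	return client.clientTrusts[gateway.serverCert] &&
		gateway.gatewayTrusts[client.clientCert]
}

// lagSafe checks every step against clients at the same step and clients
// that are still one step behind.
func lagSafe(steps []config) bool {
	for i := range steps {
		if !handshakeOK(steps[i], steps[i]) {
			return false
		}
		if i > 0 && !handshakeOK(steps[i], steps[i-1]) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("safe for clients lagging one step:", lagSafe(rotation))
	// A client that never left the step-03 configuration meets a step-05 Gateway.
	fmt.Println("step-03 client vs step-05 gateway:", handshakeOK(rotation[4], rotation[2]))
}
```

In this model a client that skipped step 04 cannot connect once the Gateway reaches step 05, which is why the switch to step 05 has to wait until clients are migrated.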

For this procedure we need to introduce a new component that drives this stateful rotation process. We should evaluate the following ideas:

  • Have a Go routine running in the KLM
  • Have a new reconciler, reconciling CA Certificates, in the KLM
  • Cron Job running in the KCP
  • Sidecar attached to KLM (could also include tiny reconciler, reconciling over CA Certificates)
  • TBD

Depending on the implementation details, we also need to find a proper way to monitor and test this process. In addition, once this solution has been implemented and tested, we should delete the temporary solution implemented with #1061 to shorten the reconciliation time again.

Reasons

Please have a look at the following issues:

Acceptance Criteria

  • Implement a new process to cover No-Downtime CA Certificate rotation
  • Cover the new process with unit tests as well as E2E tests
  • Come up with a proper monitoring solution
    • Implement needed metrics
    • Cover new metrics with a corresponding dashboard
    • Think about creating alert rules

Feature Testing

End-to-End tests

Testing approach

No response

Attachments

Implementation Hints

The Gardener project had a very similar problem, and they have a written proposal describing how it was implemented. Since we decided to also go for a CA-Bundle solution, we can have a look at their proposal and implementation: https://github.com/gardener/gardener/blob/master/docs/proposals/18-shoot-CA-rotation.md

@jeremyharisch jeremyharisch added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 26, 2024
@Tomasz-Smelcerz-SAP
Member

Tomasz-Smelcerz-SAP commented Apr 22, 2024

The POC is done:

The procedure worked as expected.

The findings

  1. A minor adjustment is required in the runtime-watcher. Please refer to the code in the POC.
  2. The Istio-Gateway secret in KCP, which is directly managed by the Cert-Manager, can no longer be used as before. This is due to the Cert-Manager's inability to compose a ca-cert array as required for the zero-downtime procedure.
  3. As a result of point 2, we need an additional secret for the Istio-Gateway as compared to the current setup.
  4. In the POC, two additional secrets are used. However, it might be possible to use just the Istio-Gateway secret, as it contains the same data as the proposed "ca-bundle" secret.
  5. It appears that we can divide the implementation into two parts (this is not an acceptance criterion, but merely a proposal):
    • Lifecycle-Manager: Responsible solely for syncing the SKR secrets to the respective SKRs and managing the SKR secrets by creating and deleting/invalidating them when a migration is necessary.
    • "Cert Migration Agent": A new entity that can be implemented as a set of goroutines (possibly just one) in the Lifecycle-Manager or as a separate process/pod. Its main responsibility is to manage the Istio Gateway Cert with the appropriate TLS Cert and related CaCert array.
  6. In the POC, there is no solution for determining WHEN it is safe to switch the TLS Cert on the Gateway. Please see below for an explanation.

Determining when to switch the TLS Cert on the Gateway
The most significant challenge for the entire migration is determining when it is safe to switch the TLS certificate on the Gateway (from the "oldCert" to the "newCert"). Ideally, this should be done once all the clients have the new "ca-bundle" injected, which consists of two entries: ["newCert", "oldCert"]. The question is: How can we detect if all the clients have received the necessary update?

If any client has not yet been updated (its ca-cert is just ["oldCert"]), then switching the certificate on the Gateway means this client will no longer be able to connect. This is exactly how it works at the moment, which means downtime for such a client. On the other hand, we cannot wait indefinitely for configuration propagation. Assuming we started the migration a week before "oldCert" expires, we have just one week. If we don't switch to the "newCert" certificate on the Gateway, then ALL clients will lose connections, not just the ones that, for whatever reason, are not migrated on time.

Considering this, I believe that any solution that attempts to "enumerate" migrated vs non-migrated clients is not necessary. Even if we have such a procedure, we still MUST switch the Gateway to the "newCert" at some point, otherwise we will experience downtime due to an expired certificate. It seems that relying on a reasonable time-based policy is a better approach.

For example:

  • We start the migration one week before "oldCert" expires.
  • It is expected that all of the SKRs are migrated within 24 hours (at most; currently it's much faster than that).
  • In the remaining time, we continue to actively sync SKR configuration, "fixing" any clients that have not yet been migrated (this is already implemented in the Lifecycle-Manager).
  • A day before "oldCert" expiration, we switch the certificate on the Istio Gateway to the "newCert".
  • If there is still any SKR that hasn't received the ca-cert update, it means we weren't able to reconcile it for six days, which indicates a more serious problem with that particular SKR, unrelated to the cert management.
  • After the switch: For every not-yet-migrated SKR (if any), as soon as the problem that blocked the proper reconciliation is removed, the SKR receives a valid certificate configuration and its Watcher connection to the KCP is restored.
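
The time-based policy from the example above could be expressed as a small pure function. This is only a sketch: the `policy` type, the phase names, `exampleExpiry` and `hoursBefore` are all hypothetical, and the one-week and one-day thresholds are simply the values used in the example, not agreed-upon settings.

```go
package main

import (
	"fmt"
	"time"
)

type phase int

const (
	phaseIdle             phase = iota // "oldCert" is not close to expiry yet
	phaseDistributeBundle              // sync the two-entry ca-bundle to all SKRs
	phaseSwitchGateway                 // the Gateway must serve "newCert" now
)

func (p phase) String() string {
	return [...]string{"idle", "distribute-bundle", "switch-gateway"}[p]
}

// policy holds the two thresholds, both measured back from the expiry of "oldCert".
type policy struct {
	startBefore  time.Duration // when the migration starts
	switchBefore time.Duration // when the Gateway certificate is switched
}

// defaultPolicy mirrors the example: start a week before expiry, switch a day before.
var defaultPolicy = policy{startBefore: 7 * 24 * time.Hour, switchBefore: 24 * time.Hour}

// phaseAt returns the phase the rotation should be in at time now.
func (p policy) phaseAt(now, oldCertNotAfter time.Time) phase {
	remaining := oldCertNotAfter.Sub(now)
	switch {
	case remaining <= p.switchBefore:
		return phaseSwitchGateway
	case remaining <= p.startBefore:
		return phaseDistributeBundle
	default:
		return phaseIdle
	}
}

// exampleExpiry is an arbitrary placeholder for the NotAfter of "oldCert".
var exampleExpiry = time.Date(2024, time.June, 30, 0, 0, 0, 0, time.UTC)

// hoursBefore returns the instant h hours before expiry.
func hoursBefore(expiry time.Time, h int) time.Time {
	return expiry.Add(-time.Duration(h) * time.Hour)
}

func main() {
	for _, h := range []int{8 * 24, 6 * 24, 12} {
		now := hoursBefore(exampleExpiry, h)
		fmt.Printf("%3dh before expiry: %s\n", h, defaultPolicy.phaseAt(now, exampleExpiry))
	}
}
```

Keeping the decision a pure function of the current time and the certificate's expiry means it can be called from a reconciler, a goroutine, or a cron job alike, and it needs no knowledge of which SKRs have already been migrated.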
