No-Downtime CA Certificate Rotation #1430

jeremyharisch · 2024-03-26T14:19:23Z

Description

In #1428, we decided to implement a no-downtime rotation for the CA certificate used in our PKI for the runtime-watcher.

This approach involves a two-phase migration of clients, allowing a slow, gradual transition with zero downtime. The table below lists the detailed steps of the procedure when the CA certificate is rotated. 'rootA+rootB' means that both CA certificates have been concatenated and set as the 'ca.crt' value in the certificate secret. Transitioning from 'rootA+rootB' to 'rootB' means truncating the CA certificate string by removing the first certificate from the concatenation.
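
The concatenation and truncation described above could look roughly like the following sketch. It only uses the Go standard library; the function names (`concatBundle`, `dropFirst`, `countCerts`, `pemBlock`) are hypothetical and not taken from the Lifecycle-Manager code, and the "certificates" are placeholder PEM blocks rather than real CA certificates.

```go
package main

import (
	"bytes"
	"encoding/pem"
	"errors"
	"fmt"
)

// pemBlock wraps arbitrary bytes in a CERTIFICATE PEM block. Demo helper
// only (assumption): real input would be actual CA certificates.
func pemBlock(payload string) []byte {
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: []byte(payload)})
}

// concatBundle joins PEM certificates into one ca.crt value ("rootA+rootB").
func concatBundle(certs ...[]byte) []byte {
	var out bytes.Buffer
	for _, c := range certs {
		out.Write(bytes.TrimSpace(c))
		out.WriteByte('\n')
	}
	return out.Bytes()
}

// countCerts returns the number of PEM blocks found in the bundle.
func countCerts(bundle []byte) int {
	n := 0
	for {
		var b *pem.Block
		b, bundle = pem.Decode(bundle)
		if b == nil {
			return n
		}
		n++
	}
}

// dropFirst removes the first certificate from the bundle ("rootA+rootB" -> "rootB").
// It refuses to run on a bundle with fewer than two certificates, so the
// result is never an empty trust store.
func dropFirst(bundle []byte) ([]byte, error) {
	if countCerts(bundle) < 2 {
		return nil, errors.New("bundle must hold at least two certificates")
	}
	_, rest := pem.Decode(bundle)
	return rest, nil
}

func main() {
	rootA, rootB := pemBlock("rootA"), pemBlock("rootB")
	bundle := concatBundle(rootA, rootB)
	trimmed, err := dropFirst(bundle)
	if err != nil {
		panic(err)
	}
	fmt.Printf("certs in bundle: %d, after truncation: %d\n", countCerts(bundle), countCerts(trimmed))
}
```

Parsing the bundle with `encoding/pem` instead of cutting the string at a fixed offset keeps the truncation independent of certificate length and line endings.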

Detailed description of the procedure

Step	Step-Name	Gateway Server Cert	Gateway Accepts Clients (CACert on KCP)	Clients Accept Server (CACert on SKR)	Client Cert	Note
01	Initial setup	rootA	rootA	rootA	rootA	""
02	Generate rootB cert in KCP	rootA	rootA	rootA	rootA	""
03	Reconfigure the Gateway in the KCP	rootA	rootA+rootB	rootA	rootA	All clients with the old Certificates signed by rootA still work
04	Migrate Clients to Certificates signed by rootB	rootA	rootA+rootB	rootA+rootB	rootB	""
05	After all clients are migrated, switch the Gateway to accept only certs signed by rootB	rootB	rootB	rootB	rootB	""
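
As a sanity check of the table, the rows can be modelled as data and tested for the property that makes the rotation downtime-free: in every step, the CA that signed the server certificate is in the clients' trust bundle, and the CA that signed the client certificate is in the Gateway's trust bundle. This is only an illustrative model with hypothetical type and function names; it compares CA names as strings and performs no real TLS verification.

```go
package main

import "fmt"

// step mirrors one row of the rotation table.
type step struct {
	name          string
	serverCert    string   // CA that signed the Gateway server certificate
	gatewayTrusts []string // CA bundle the Gateway uses to verify clients (KCP)
	clientTrusts  []string // CA bundle clients use to verify the Gateway (SKR)
	clientCert    string   // CA that signed the client certificate
}

func contains(set []string, ca string) bool {
	for _, s := range set {
		if s == ca {
			return true
		}
	}
	return false
}

// handshakeOK reports whether both sides of the mutual TLS handshake
// would accept the other side's certificate in this step.
func (s step) handshakeOK() bool {
	return contains(s.clientTrusts, s.serverCert) && contains(s.gatewayTrusts, s.clientCert)
}

// rotationPlan returns the five table rows in order.
func rotationPlan() []step {
	a, b, ab := []string{"rootA"}, []string{"rootB"}, []string{"rootA", "rootB"}
	return []step{
		{"Initial setup", "rootA", a, a, "rootA"},
		{"Generate rootB cert in KCP", "rootA", a, a, "rootA"},
		{"Reconfigure the Gateway in the KCP", "rootA", ab, a, "rootA"},
		{"Migrate clients to certificates signed by rootB", "rootA", ab, ab, "rootB"},
		{"Switch Gateway to rootB only", "rootB", b, b, "rootB"},
	}
}

func main() {
	for i, s := range rotationPlan() {
		fmt.Printf("step %02d %-48s handshake ok: %v\n", i+1, s.name, s.handshakeOK())
	}
}
```

The same model also shows the failure case: a client that still trusts only rootA and presents a rootA-signed certificate after step 05 fails the check in both directions.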

This procedure is stateful, so we need to introduce a new process that drives it. We should evaluate the following ideas:

Have a goroutine running in the KLM
Have a new reconciler, reconciling CA Certificates, in the KLM
Cron Job running in the KCP
Sidecar attached to KLM (could also include tiny reconciler, reconciling over CA Certificates)
TBD

Depending on the implementation details, we also need to find a proper way to monitor and test this process. In addition, once this solution has been implemented and tested, we should delete the temporary solution implemented with #1061 to shorten the reconciliation time again.

Reasons

Please have a look at the following issues:

Acceptance Criteria

Implement a new process to cover no-downtime CA certificate rotation
Cover the new process with unit tests as well as E2E tests
Come up with a proper monitoring solution
- Implement needed metrics
- Cover new metrics with a corresponding dashboard
- Think about creating alert rules

Feature Testing

End-to-End tests

Testing approach

No response

Attachments

Implementation Hints

The Gardener project had a very similar problem and has a written proposal describing how it was implemented. Since we also decided to go for a CA-bundle solution, we can have a look at their proposal and implementation: https://github.com/gardener/gardener/blob/master/docs/proposals/18-shoot-CA-rotation.md

Tomasz-Smelcerz-SAP · 2024-04-22T08:27:08Z

The POC is done:

The procedure worked as expected.

The findings

1. A minor adjustment is required in the runtime-watcher. Please refer to the code in the POC.
2. The Istio-Gateway secret in KCP, which is directly managed by the Cert-Manager, can no longer be used as before. This is because the Cert-Manager cannot compose a ca-cert array as required for the zero-downtime procedure.
3. As a result of point 2, we need an additional secret for the Istio-Gateway compared to the current setup.
4. In the POC, two additional secrets are used. However, it might be possible to use just the Istio-Gateway secret, as it contains the same data as the proposed "ca-bundle" secret.
5. It appears that we can divide the implementation into two parts (this is not an acceptance criterion, merely a proposal):
- Lifecycle-Manager: Responsible solely for syncing the SKR secrets to the respective SKRs and managing the SKR secrets by creating and deleting/invalidating them when a migration is necessary.
- "Cert Migration Agent": A new entity that can be implemented as a set of goroutines (possibly just one) in the Lifecycle-Manager or as a separate process/pod. Its main responsibility is to manage the Istio Gateway Cert with the appropriate TLS Cert and related CaCert array.
6. In the POC, there is no solution for determining WHEN it is safe to switch the TLS Cert on the Gateway. Please see below for an explanation.

Determining when to switch the TLS Cert on the Gateway
The most significant challenge for the entire migration is determining when it is safe to switch the TLS certificate on the Gateway (from the "oldCert" to the "newCert"). Ideally, this should be done once all the clients have the new "ca-bundle" injected, which consists of two entries: ["newCert", "oldCert"]. The question is: How can we detect if all the clients have received the necessary update?

If any client has not yet been updated (its ca-cert is just ["oldCert"]), then switching the certificate on the Gateway means this client will no longer be able to connect. This is exactly how it works at the moment, which means downtime for such a client. On the other hand, we cannot wait indefinitely for configuration propagation. Assuming we started the migration a week before "oldCert" expires, we have just one week. If we don't switch to the "newCert" certificate on the Gateway, then ALL clients will lose connections, not just the ones that, for whatever reason, are not migrated on time.
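
The difference between a migrated and a non-migrated client can be reproduced with the Go standard library alone. The sketch below creates two throwaway self-signed CAs and one server certificate per CA, then verifies the server certificates against different trust bundles. All names (`issue`, `mustIssue`, `trusts`, the common names) are hypothetical, and the real runtime-watcher and Gateway configuration are not modelled here.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// issued bundles a certificate with its private key.
type issued struct {
	cert *x509.Certificate
	key  *ecdsa.PrivateKey
}

// issue creates a certificate named cn. With a nil parent it is a
// self-signed CA; otherwise it is a server certificate signed by parent.
func issue(cn string, serial int64, parent *issued) (*issued, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(serial),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(24 * time.Hour),
		BasicConstraintsValid: true,
	}
	signerCert, signerKey := tmpl, key
	if parent == nil {
		tmpl.IsCA = true
		tmpl.KeyUsage = x509.KeyUsageCertSign
	} else {
		tmpl.KeyUsage = x509.KeyUsageDigitalSignature
		tmpl.ExtKeyUsage = []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth}
		signerCert, signerKey = parent.cert, parent.key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, signerCert, &key.PublicKey, signerKey)
	if err != nil {
		return nil, err
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		return nil, err
	}
	return &issued{cert: cert, key: key}, nil
}

// mustIssue is a demo convenience wrapper that panics on error.
func mustIssue(cn string, serial int64, parent *issued) *issued {
	c, err := issue(cn, serial, parent)
	if err != nil {
		panic(err)
	}
	return c
}

// trusts reports whether a client holding the given CA bundle accepts the server certificate.
func trusts(server *issued, bundle ...*issued) bool {
	pool := x509.NewCertPool()
	for _, ca := range bundle {
		pool.AddCert(ca.cert)
	}
	_, err := server.cert.Verify(x509.VerifyOptions{Roots: pool})
	return err == nil
}

func main() {
	oldCA, newCA := mustIssue("oldCert", 1, nil), mustIssue("newCert", 2, nil)
	oldServer, newServer := mustIssue("gateway", 3, oldCA), mustIssue("gateway", 4, newCA)
	fmt.Println("migrated client, old server cert:", trusts(oldServer, newCA, oldCA))
	fmt.Println("migrated client, new server cert:", trusts(newServer, newCA, oldCA))
	fmt.Println("stale client, new server cert:", trusts(newServer, oldCA))
}
```

A client with the two-entry bundle accepts the Gateway before and after the switch, which is why the bundle has to reach the clients first.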

Considering this, I believe that any solution that attempts to "enumerate" migrated vs non-migrated clients is not necessary. Even if we have such a procedure, we still MUST switch the Gateway to the "newCert" at some point, otherwise we will experience downtime due to an expired certificate. It seems that relying on a reasonable time-based policy is a better approach.

For example:

We start the migration one week before "oldCert" expires.
It is expected that all of the SKRs are migrated within 24 hours (at most; currently it's much faster than that).
In the remaining time, we continue to actively sync SKR configuration, "fixing" any clients that have not yet been migrated (this is already implemented in the Lifecycle-Manager).
A day before "oldCert" expiration, we switch the certificate on the Istio Gateway to the "newCert".
If there is still any SKR that hasn't received the ca-cert update, it means we weren't able to reconcile it for six days, which indicates a more serious problem with that particular SKR, unrelated to the cert management.
After the switch: For every not-yet-migrated SKR (if any), as soon as the problem that blocked the proper reconciliation is removed, the SKR receives a valid certificate configuration and its Watcher connection to the KCP is restored.
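
The time-based policy from this example can be expressed as a small pure function of the current time and the expiry of "oldCert". The sketch below hard-codes the one-week and one-day lead times from the example; the names (`phaseAt`, `migrationLead`, `switchLead`, `demoExpiry`) and the three-phase split are assumptions made for illustration, not an agreed design.

```go
package main

import (
	"fmt"
	"time"
)

type phase int

const (
	phaseIdle           phase = iota // oldCert is still valid for longer than migrationLead
	phaseMigrateClients              // distribute the ["newCert", "oldCert"] bundle to the SKRs
	phaseSwitchGateway               // serve "newCert" on the Istio Gateway
)

func (p phase) String() string {
	return [...]string{"idle", "migrate-clients", "switch-gateway"}[p]
}

const (
	migrationLead = 7 * 24 * time.Hour // start the migration one week before expiry
	switchLead    = 24 * time.Hour     // switch the Gateway one day before expiry
)

// demoExpiry is an arbitrary expiry date used only for the demo below.
var demoExpiry = time.Date(2024, time.May, 1, 0, 0, 0, 0, time.UTC)

// phaseAt returns the phase the rotation should be in at time now,
// given the NotAfter timestamp of the old certificate.
func phaseAt(now, oldCertNotAfter time.Time) phase {
	remaining := oldCertNotAfter.Sub(now)
	switch {
	case remaining <= switchLead:
		return phaseSwitchGateway
	case remaining <= migrationLead:
		return phaseMigrateClients
	default:
		return phaseIdle
	}
}

func main() {
	for _, daysLeft := range []int{10, 7, 3, 1, 0} {
		now := demoExpiry.AddDate(0, 0, -daysLeft)
		fmt.Printf("%2d days before expiry: %s\n", daysLeft, phaseAt(now, demoExpiry))
	}
}
```

Because the decision depends only on timestamps, it can run idempotently in a reconciler, a goroutine, or a cron job, and it is straightforward to unit test.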

jeremyharisch added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 26, 2024

This was referenced Mar 26, 2024

No-Downtime CA Certificate Rotation #1428

Closed

Design a no-downtime solution for the CA certificate rotation #1073

Closed

janmedrek assigned Tomasz-Smelcerz-SAP Apr 24, 2024

This was referenced Apr 29, 2024

feat: Support for multiple certificates in the ca-certificates data kyma-project/runtime-watcher#264

Closed

feat: [Zero-Downtime] Istio-Gateway Secret Management #1506

Open

feat: [Zero-Downtime] Runtime-Watcher TLS configuration management #1507

Open

jeremyharisch added the Epic label May 8, 2024

Tomasz-Smelcerz-SAP mentioned this issue Sep 25, 2024

feat: [Zero-Downtime] - safe migration #1890

Closed

4 tasks
