No-Downtime CA Certificate Rotation #1430
The POC is done: the procedure worked as expected.

The findings:
Determining when to switch the TLS certificate on the Gateway

If any client has not yet been updated (its ca-cert is still just ["oldCert"]), switching the certificate on the Gateway means that this client can no longer connect. This is exactly how it works at the moment, and it means downtime for such a client. On the other hand, we cannot wait indefinitely for the configuration to propagate. Assuming we start the migration one week before "oldCert" expires, we have just one week. If we don't switch the Gateway to the "newCert" certificate in time, then ALL clients will lose their connections, not just the ones that, for whatever reason, were not migrated on time.

Considering this, I believe that any solution that attempts to "enumerate" migrated vs. non-migrated clients is unnecessary. Even with such a procedure, we still MUST switch the Gateway to "newCert" at some point, otherwise we will experience downtime due to an expired certificate. Relying on a reasonable time-based policy seems to be the better approach. For example:
Description
In #1428 it was decided to implement a No-Downtime solution for our CA certificate, which is used in our PKI for the runtime-watcher.
This approach involves a two-phase migration of clients, allowing a slow, gradual transition with zero downtime. The table below shows the detailed steps of the procedure when the CA certificate is rotated. 'rootA+rootB' means that the two CA certificates have been concatenated and set as the 'ca.crt' value in the certificate secret. Transitioning from 'rootA+rootB' to 'rootB' means truncating the CA certificate string by removing the first certificate from the concatenation.
Detailed description of what the procedure looks like
For this procedure we need to introduce a new component that takes care of this stateful process. We should evaluate the following ideas:
Depending on the implementation details, we also need to find a proper way to monitor and test this process. In addition, once this solution has been implemented and tested, we should remove the temporary solution implemented with #1061 to shorten the reconciliation time again.
Reasons
Please have a look at the following issues:
Acceptance Criteria
Feature Testing
End-to-End tests
Testing approach
No response
Attachments
Implementation Hints
The Gardener project had a very similar problem and has written a proposal describing how they implemented it. Since we decided to go for a CA bundle solution as well, we can look at their proposal and implementation: https://github.com/gardener/gardener/blob/master/docs/proposals/18-shoot-CA-rotation.md