-
Notifications
You must be signed in to change notification settings - Fork 369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Automated cherry pick of #5955: Persist TLS certificate and key of antrea-controller (#5955) #6004: Fix race condition in pkg/apiserver/certificate unit tests #6080
Conversation
In clusters where upgrade is performed with rolling update of Nodes and images of new versions are only available on new Nodes, the deployment strategy of antrea-controller is set to RollingUpdate to prevent antrea-controller Pod from being deleted immediately when the deployment is updated, leading to a period in which no antrea-controller is running. However, it also causes two instances of antrea-controller to run simultaneously in a short time, making it possible that the old instance overrides the CA bundle stored in antrea-ca ConfigMap, APIServices, and Webhooks, while the new instance won't update them again. The commit makes two changes to fix the problem: 1. CACertController will periodically sync the CA cert to improve the robustness. 2. Self-signed TLS certificate and key of antrea-controller will be stored in a Secret and will be reused after restarting controller. This makes running multiple antrea-controller instances simultaneously possible and makes restart of antrea-controller smoother as antrea-agents don't need to retrieve a new CA bundle most of the time. Besides, the change is helpful for implementing high-availability of antrea-controller in the future. Signed-off-by: Quan Tian <[email protected]>
…#6004) There was some "interference" between TestSelfSignedCertProviderRotate and TestSelfSignedCertProviderRun. The root cause is that the certutil.GenerateSelfSignedCertKey does not support a custom clock implementation and always calls time.Now() to determine the current time. It then adds a year to the current time to set the expiration time of the certificate. This means that when rotateSelfSignedCertificate() is called as part of TestSelfSignedCertProviderRotate, the new certificate is already expired, and rotateSelfSignedCertificate() will be called immediately a second time. By this time however, TestSelfSignedCertProviderRotate has already exited, and we are already running the next test, TestSelfSignedCertProviderRun. This creates a race condition because the next test will overwrite generateSelfSignedCertKey with a mock version, right as it is called by the second call to rotateSelfSignedCertificate() from the previous test's provider. To avoid this race condition, we make generateSelfSignedCertKey a member of selfSignedCertProvider. Fixes antrea-io#5977 Signed-off-by: Antonin Bas <[email protected]> Co-authored-by: Quan Tian <[email protected]>
@antoninbas I backport this as it's requested by release-1.13 consumers. Given that the risky scenario only happens when users have changed deployment strategy (which shouldn't be common) and we will release a patch release for all minor releases >=1.13, I think it should be fine. Please let me know if you have any concern. |
@tnqn I know you mentioned before that "upgrading" from a patched controller to an unpatched one could create an issue (when using the RollingUpdate strategy), but I don't recall what the issue would be exactly. While I see that there could be a race condition between the old and new controllers, it doesn't seem any worse than the situation before your patch (for example, upgrading from 1.12.0 to 1.13.0). Or do we increase the likelihood of the old controller overriding the CA bundle because of the periodic sync? I am surprised that we are getting a request to backport. As you pointed out originally, the odds of the old controller overriding the new CA are quite low:
That being said, I am fine with backporting this patch, for the reason you mentioned. Let's just make sure we publish patch releases in the correct order: 1.15.x -> 1.14.x -> 1.13.x |
Yes, I think it increases the likelihood a bit. When it's upgraded from unpatched to unpatched, the issue should only happen when there is already something wrong causing the old controller to keep retrying overwriting the CA cert. When it's upgraded from patched to unpatched, even if everything works fine, there is a chance that old controller could overwrite it due to the periodical update.
Sure, I was thinking the same. |
/test-all |
Cherry pick of #5955 #6004 on release-1.13.
#5955: Persist TLS certificate and key of antrea-controller (#5955)
#6004: Fix race condition in pkg/apiserver/certificate unit tests
For details on the cherry pick process, see the cherry pick requests page.