Persist TLS certificate and key of antrea-controller #5955
d003e37 to a34f717
}
remainingDuration := certs[0].NotAfter.Sub(p.clock.Now())
if remainingDuration < p.caConfig.MinValidationDuration {
	klog.InfoS("The remaining duration of the TLS certificate and key is less than max rotate duration", "remaining", remainingDuration, "max", p.caConfig.MinValidationDuration)
is the log message still appropriate after replacing MaxRotateDuration with MinValidationDuration (same for the "max" log key)?
Thanks for the reminder, I forgot to update this one.
pkg/apiserver/certificate/config.go
ServiceName string
PairName    string
// MinValidationDuration is the minimal remaining validation duration for the self-signed certificate. It must be
// rotated once shorter than the duration.
Suggested change:
- // rotated once shorter than the duration.
+ // rotated once the time until the certificate expires becomes shorter than this duration.
updated
@@ -113,7 +113,7 @@ func getCaConfig(isLeader bool, controllerNs string) *certificate.CAConfig {
 	MutationWebhookSelector: getWebhookLabel(isLeader, controllerNs),
 	ValidatingWebhookSelector: getWebhookLabel(isLeader, controllerNs),
 	CertReadyTimeout: 2 * time.Minute,
-	MaxRotateDuration: time.Hour * (24 * 365),
+	MinValidationDuration: time.Hour * 24 * 90, // Rotate the certificate 90 days in advance.
nit: I feel like MinValidDuration would be a more "correct" name
agree, I was thinking valid but somehow wrote validation.
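To make the threshold semantics from this thread concrete, here is a minimal sketch of the rotation check: a certificate is rotated once the time remaining until it expires drops below the minimum valid duration. The helper names `shouldRotate` and `rotateWithDaysLeft` are hypothetical and are not taken from the PR; only the 90-day threshold comes from the diff above.

```go
package main

import (
	"fmt"
	"time"
)

// shouldRotate reports whether a certificate expiring at notAfter should be
// rotated at time now, that is, whether its remaining validity is shorter
// than minValidDuration. Hypothetical helper, not the PR's actual function.
func shouldRotate(notAfter, now time.Time, minValidDuration time.Duration) bool {
	return notAfter.Sub(now) < minValidDuration
}

// rotateWithDaysLeft is a demo wrapper that applies the 90-day threshold to a
// certificate with the given number of days left before expiry.
func rotateWithDaysLeft(daysLeft int) bool {
	now := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	notAfter := now.Add(time.Duration(daysLeft) * 24 * time.Hour)
	return shouldRotate(notAfter, now, 90*24*time.Hour)
}

func main() {
	fmt.Println(rotateWithDaysLeft(365)) // plenty of validity left
	fmt.Println(rotateWithDaysLeft(30))  // inside the 90-day window
}
```

With this reading, a "minimum valid duration" name describes the value directly, whereas a "max rotate duration" of one year described something different.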
secureServing.ServerCert.CertKey.CertFile = path.Join(caConfig.SelfSignedCertDir, caConfig.PairName+".crt")
secureServing.ServerCert.CertKey.KeyFile = path.Join(caConfig.SelfSignedCertDir, caConfig.PairName+".key")
maybe the path/filepath package should be used here instead of path, given that these are actual filepaths AFAIK
done
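For reference, a small sketch of the suggested change: `path` always joins with forward slashes and is intended for slash-separated paths such as URLs, while `path/filepath` uses the operating system's separator. The function name `certKeyPaths` and the directory used in `main` are assumptions made for illustration; the `.crt` and `.key` suffixes mirror the snippet under review.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// certKeyPaths returns the certificate and key file paths for pairName inside
// dir, joined with the OS-specific separator. Hypothetical helper.
func certKeyPaths(dir, pairName string) (certFile, keyFile string) {
	return filepath.Join(dir, pairName+".crt"), filepath.Join(dir, pairName+".key")
}

func main() {
	// The directory below is a made-up example, not the controller's real path.
	certFile, keyFile := certKeyPaths("/tmp/example-certs", "antrea-controller")
	fmt.Println(certFile)
	fmt.Println(keyFile)
}
```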
caConfig := &CAConfig{
	TLSSecretName: tlsSecretName,
	SelfSignedCertDir: t.TempDir(),
	MinValidationDuration: maxRotateDuration,
you're using the maxRotateDuration term here still, yet you use the parameter value to initialize MinValidationDuration. But the 2 mean different things right?
fixed, forgot to update after renaming it
provider.secretInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc: func(obj interface{}) { provider.enqueue() },
	UpdateFunc: func(_, _ interface{}) { provider.enqueue() },
	DeleteFunc: func(obj interface{}) { provider.enqueue() },
})
just to check my understanding: normally there is no reason for the secret to be modified and for these handlers to be invoked, as the Controller should be the only entity modifying the secret?
This is for the corner case in which the certificate needs rotation when upgrading. With this handling, regardless of which instance updates the secret first, the other one will switch to it and stop generating a new one.
In the future, when HA is implemented, we should only let the active instance rotate the certificate, and the standby instances should refresh their certificates immediately, so the event handlers are also needed for that case.
maybe a comment would be a good idea?
added
secretBeforeRestart, err := data.clientset.CoreV1().Secrets(tlsSecretNamespace).Get(context.TODO(), tlsSecretName, metav1.GetOptions{})
require.NoError(t, err)

testCert(t, data, string(secretBeforeRestart.Data[certificate.TLSCertFile]), true)
so in the self-signed certificate case, the published CA bundle (published to the ConfigMap) is equal to the certificate itself (stored in the Secret)?
Yes, the self-signed certificate util always includes both the CA cert and the signed cert in the server cert. We could choose to publish only the CA cert for server authentication, but never made that change. We discussed this before in #5135 (comment)
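As a simplified illustration of why a published bundle that equals the certificate works for server authentication, the sketch below creates a single self-signed certificate and verifies it using itself as the only trusted root. This is an assumption-laden simplification: per the reply above, the util in question bundles a CA cert and a signed cert, whereas this example uses one certificate. The function name `verifyAgainstItself` and the DNS names are hypothetical.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// verifyAgainstItself creates a self-signed certificate for certName and
// verifies it for verifyName with the certificate itself as the only root.
func verifyAgainstItself(certName, verifyName string) error {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return err
	}
	now := time.Now()
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: certName},
		NotBefore:             now.Add(-time.Minute),
		NotAfter:              now.Add(24 * time.Hour),
		KeyUsage:              x509.KeyUsageDigitalSignature | x509.KeyUsageCertSign,
		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		BasicConstraintsValid: true,
		IsCA:                  true,
		DNSNames:              []string{certName},
	}
	// Template and parent are the same, so the certificate signs itself.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return err
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		return err
	}
	// The "CA bundle" here is just the certificate itself.
	roots := x509.NewCertPool()
	roots.AddCert(cert)
	_, err = cert.Verify(x509.VerifyOptions{Roots: roots, DNSName: verifyName})
	return err
}

func main() {
	fmt.Println(verifyAgainstItself("antrea.kube-system.svc", "antrea.kube-system.svc") == nil)
}
```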
Should we backport this then? It seems that there is no automatic recovery in this case, until the Controller is manually restarted. Could this explain some issues we saw over the years, with APIServices / Webhooks not working because of TLS issues?
@@ -171,6 +169,7 @@ func (c *CACertController) syncConversionWebhooks(caCert []byte) error {
 		crdDef.Spec.Conversion.Webhook.ClientConfig.CABundle = caCert
 	}
 	if updated {
+		klog.InfoS("Syncing CA certificate with CRD that have conversion webhooks", "name", crdDef.Name)
Suggested change:
- klog.InfoS("Syncing CA certificate with CRD that have conversion webhooks", "name", crdDef.Name)
+ klog.InfoS("Syncing CA certificate with CRD that have conversion webhook", "name", crdDef.Name)
a CRD can have multiple conversion webhooks.
}

func (p *selfSignedCertProvider) enqueue() {
	// The key can be anything as we only have single item.
Suggested change:
- // The key can be anything as we only have single item.
+ // The key can be anything as we only have a single item.
done
var err error
var secret *corev1.Secret
// If Secret is specified, we should prioritize it.
if p.caConfig.TLSSecretName != "" {
Maybe define a new variable, hasTLSSecretName := p.caConfig.TLSSecretName != "", which can be used in the code that follows.
I feel they are equivalently readable and there are only two such checks. It seems a bit verbose to me to define a variable.
if p.shouldRotateCertificate(cert) {
	klog.InfoS("Generating self signed cert")
	if cert, key, err = generateSelfSignedCertKey(p.caConfig.ServiceName, loopbackAddresses, GetAntreaServerNames(p.caConfig.ServiceName)); err != nil {
		return fmt.Errorf("unable to generate self signed cert: %v", err)
I think "self-signed cert" is the more common spelling. Probably replace "self signed" with "self-signed".
updated
if bytes.Equal(cert, p.cert) && bytes.Equal(key, p.key) {
	return nil
}
klog.InfoS("Writing certificate and key to cert directory")
Suggested change:
- klog.InfoS("Writing certificate and key to cert directory")
+ klog.InfoS("Writing certificate and key to the cert directory")
done
}

func TestSelfSignedCertProviderRun(t *testing.T) {
	secretNamespace := env.GetAntreaNamespace()
I suppose we can simply set this as "kube-system" considering it's for testing?
Then the correctness of this test would be subject to the value of defaultAntreaNamespace.
I would prefer not to introduce an obscure dependency unnecessarily. The provider is implemented to not assume the Pod must run in "kube-system", and the test code shouldn't assume that either. To make it clearer, I have changed the code to set the env var at the beginning of the test.
a34f717 to c2bbf66
There was a time when the certificate was overridden by the deployment tool kapp when it was used. I don't know what the cause was for the other occurrences. Even for the latest report, we only know there was an old controller overwriting the CA bundle with its own one generated two days earlier, but this shouldn't happen, as overwriting the ConfigMap is a one-time job and an old instance will only do it again after 1/2 of the valid duration (half a year), or if it never succeeded in updating all resources' CA bundles in the last two days.
The change is compatible when upgrading from an old controller to a patched controller, as well as from a patched controller to a patched controller. However, it may cause an issue when upgrading from a patched controller to an unpatched controller. For example, if we release 1.14.3 with the patch, and users upgrade from 1.14.3 to 1.15.0, there would be an issue. I'm not sure we can assume users will always upgrade to the latest patch version of a minor release. But I'm fairly sure this only happens when the deployment strategy is changed to RollingUpdate.
c2bbf66 to af8b178
Thanks, I missed that earlier. Didn't realize it could only happen for custom installations.
LGTM
}

func newVerifyOptions(caBundle []byte) *x509.VerifyOptions {
	// We don't really use the CA bundle to verify clients, this is just to follow DynamicFileCAContent.
just to double-check: with this, you mean that the generated cert is never used for client-side auth, only server-side auth?
The generated cert is used for client-side server authentication. The server, antrea-controller, doesn't validate clients' TLS using this cert, and it can't, because none of the client certs is signed by it. It uses tokens to validate clients.
I added the comment because it sets the key usage to "ExtKeyUsageClientAuth", which might be confusing for people who wonder how this CA bundle can be used for server-side client authentication when we never generate any client cert. It's to follow DynamicFileCAContent's implementation. The verify options will never actually be used.
> only server-side auth
That was poor phrasing on my part. I indeed meant server authentication on the client-side...
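For readers unfamiliar with the pattern being discussed, here is a minimal standard-library sketch of building verify options from a PEM CA bundle with the client-auth key usage set. It is an approximation under stated assumptions: `verifyOptionsFromPEM`, `demoBundle`, and `clientAuthOnly` are hypothetical names, and the sketch does not reproduce the PR's `newVerifyOptions` or the DynamicFileCAContent implementation it follows.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// verifyOptionsFromPEM builds verify options whose roots are the certificates
// in caBundle and whose only key usage is client auth. Hypothetical helper.
func verifyOptionsFromPEM(caBundle []byte) (*x509.VerifyOptions, error) {
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(caBundle) {
		return nil, errors.New("no certificate found in CA bundle")
	}
	return &x509.VerifyOptions{
		Roots:     roots,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}, nil
}

// demoBundle returns a throwaway self-signed certificate in PEM form.
func demoBundle() ([]byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	now := time.Now()
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "demo"},
		NotBefore:             now.Add(-time.Minute),
		NotAfter:              now.Add(time.Hour),
		KeyUsage:              x509.KeyUsageDigitalSignature | x509.KeyUsageCertSign,
		BasicConstraintsValid: true,
		IsCA:                  true,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}

// clientAuthOnly reports whether options built from a demo bundle request
// exactly the client-auth key usage.
func clientAuthOnly() bool {
	bundle, err := demoBundle()
	if err != nil {
		return false
	}
	opts, err := verifyOptionsFromPEM(bundle)
	if err != nil {
		return false
	}
	return len(opts.KeyUsages) == 1 && opts.KeyUsages[0] == x509.ExtKeyUsageClientAuth
}

func main() {
	fmt.Println(clientAuthOnly())
}
```

As the reply above notes, options like these only matter if the server verifies client certificates, which antrea-controller does not do with this bundle.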
In clusters where the upgrade is performed with a rolling update of Nodes and images of new versions are only available on new Nodes, the deployment strategy of antrea-controller is set to RollingUpdate to prevent the antrea-controller Pod from being deleted immediately when the deployment is updated, which would otherwise lead to a period in which no antrea-controller is running. However, it also causes two instances of antrea-controller to run simultaneously for a short time, making it possible that the old instance overrides the CA bundle stored in the antrea-ca ConfigMap, APIServices, and Webhooks, while the new instance won't update them again.

The commit makes two changes to fix the problem:
1. CACertController will periodically sync the CA cert to improve robustness.
2. The self-signed TLS certificate and key of antrea-controller will be stored in a Secret and reused after restarting the controller. This makes running multiple antrea-controller instances simultaneously possible and makes restarts of antrea-controller smoother, as antrea-agents don't need to retrieve a new CA bundle most of the time.

Besides, the change is helpful for implementing high availability of antrea-controller in the future.

Signed-off-by: Quan Tian <[email protected]>
af8b178 to 94cd0cd
/test-all
In clusters where the upgrade is performed with a rolling update of Nodes and images of new versions are only available on new Nodes, the deployment strategy of antrea-controller is set to RollingUpdate to prevent the antrea-controller Pod from being deleted immediately when the deployment is updated, which would otherwise lead to a period in which no antrea-controller is running. However, it also causes two instances of antrea-controller to run simultaneously for a short time, making it possible that the old instance overrides the CA bundle stored in the antrea-ca ConfigMap, APIServices, and Webhooks, while the new instance won't update them again.
The commit makes two changes to fix the problem:
1. CACertController will periodically sync the CA cert to improve robustness.
2. The self-signed TLS certificate and key of antrea-controller will be stored in a Secret and reused after restarting the controller. This makes running multiple antrea-controller instances simultaneously possible and makes restarts of antrea-controller smoother, as antrea-agents don't need to retrieve a new CA bundle most of the time.
Besides, the change is helpful for implementing high availability of antrea-controller in the future.
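The second change can be pictured with a small in-memory model: an instance reuses the stored certificate while it is still valid for long enough, and only generates and persists a new one otherwise, so two instances started against the same store end up serving the same certificate. This is a sketch under assumptions, not the PR's code: `storedPair` stands in for the Secret, `ensureCert` and `twoInstancesAgree` are hypothetical names, and concurrent writers (which a real Secret update would have to handle) are ignored.

```go
package main

import (
	"fmt"
	"time"
)

// storedPair is an in-memory stand-in for the Secret holding the certificate.
type storedPair struct {
	cert     string
	notAfter time.Time
}

// ensureCert returns the certificate an instance should serve. It reuses the
// stored pair when it remains valid for at least minValid, and otherwise
// generates a new pair and persists it in the store.
func ensureCert(store *storedPair, now time.Time, minValid time.Duration, generate func() storedPair) string {
	if store.cert != "" && store.notAfter.Sub(now) >= minValid {
		return store.cert
	}
	*store = generate()
	return store.cert
}

// twoInstancesAgree simulates two controller instances starting one after the
// other against the same store and reports whether they serve the same
// certificate while generating it only once.
func twoInstancesAgree() bool {
	now := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	store := &storedPair{}
	generated := 0
	generate := func() storedPair {
		generated++
		return storedPair{
			cert:     fmt.Sprintf("cert-%d", generated),
			notAfter: now.Add(365 * 24 * time.Hour),
		}
	}
	first := ensureCert(store, now, 90*24*time.Hour, generate)
	second := ensureCert(store, now, 90*24*time.Hour, generate)
	return first == second && generated == 1
}

func main() {
	fmt.Println(twoInstancesAgree())
}
```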