Skip to content
This repository has been archived by the owner on Jun 25, 2024. It is now read-only.

Initial commit for EDPM certs #442

Closed
wants to merge 0 commits into from

Conversation

vakwetu
Copy link
Contributor

@vakwetu vakwetu commented Sep 28, 2023

This is completely untested(although it does compile!), but it gives an idea of how EDPM certs could be generated and mounted for each of the services that need them.

While I will test and refine, I'm hoping that this PR starts conversations and questions so that we can decide on an implementation.

The basic idea is as follows:

  1. After the DNS data is created in the openstackdataplanenodeset_controller, a call is made to GenerateTLSCerts to create all the certs for all the nodes in the deployment.
  2. GenerateTLSCerts right now only generates a single cert for each node in the nodeset using the DNSNames for that node. As the cert contains the DNSNames for all the node's interfaces, it should be usable on all interfaces. Its likely that most services will be able to use that cert (which has the name "nodeName-cert")
  3. You can easily see though how the code could be extended to create a cert with all the IPS for instance, or to use a different issuer, depending of the needs of the service.
  4. The secret containing the cert will need to have an ownership reference updated so that we can trigger a reconcile when the cert changes or is renewed.
  5. We create a secret per service (for those that need certs) that contains all the certs for that service on all the nodes. This secret has the name <nodeSet.Name>--certs. This secret will be mounted in the ansibleEE pod as an extraMount when the service is deployed. This secret contains keys of the form nodeName-tls.key, nodeName-tls.crt, nodeName-ca.crt.
  6. We create a new secret containing all the node certs because there could be hundreds of nodes. I doubt we can have hundred of secrets. Instead, all the certs will be visible at /var/lib/openstack/service.name/certs
  7. A new boolean is added to the Service spec (HasTLSCerts). The secret containing the certs will only be mounted if HasTLSCerts=True

@softwarefactory-project-zuul
Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/09928d8ccf7c4eb18f114acf04ed98b6

openstack-k8s-operators-content-provider FAILURE in 8m 43s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
✔️ dataplane-operator-docs-preview SUCCESS in 1m 52s

pkg/deployment/cert.go Outdated Show resolved Hide resolved
Copy link
Contributor

@fultonj fultonj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This general approach looks good to me.

api/v1beta1/openstackdataplaneservice_types.go Outdated Show resolved Hide resolved
api/v1beta1/openstackdataplaneservice_types.go Outdated Show resolved Hide resolved
controllers/openstackdataplanenodeset_controller.go Outdated Show resolved Hide resolved
pkg/deployment/cert.go Outdated Show resolved Hide resolved
pkg/deployment/cert.go Outdated Show resolved Hide resolved
pkg/deployment/cert.go Outdated Show resolved Hide resolved
pkg/deployment/const.go Outdated Show resolved Hide resolved
pkg/deployment/deployment.go Outdated Show resolved Hide resolved
pkg/deployment/deployment.go Outdated Show resolved Hide resolved
pkg/deployment/cert.go Outdated Show resolved Hide resolved
@softwarefactory-project-zuul
Copy link

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/dataplane-operator for 442,fe182e6e76b9c4327e959d4122f2ae73ebed7122

pkg/deployment/const.go Outdated Show resolved Hide resolved
Comment on lines 104 to 119
// create a secret to hold the certs for the service
serviceCertsSecret := &corev1.Secret{
ObjectMeta: metav1.ObjectMeta{
Name: GetServiceCertsSecretName(instance, serviceName),
Namespace: instance.Namespace,
},
Data: certsData,
}
_, _, err := secret.CreateOrPatchSecret(ctx, helper, instance, serviceCertsSecret)
if err != nil {
err = fmt.Errorf("Error creating certs secret for %s - %w", serviceName, err)
return err
}
Copy link
Contributor

@stuggi stuggi Oct 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering if we have to create a single secret with the certs, or just could mount certs from several cert secrets based on the label using the secretTamplate as in https://cert-manager.io/docs/usage/certificate/ . would just have to update https://github.com/openstack-k8s-operators/lib-common/blob/main/modules/certmanager/certificate.go#L167 for having the secretTemplate

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion when we pursue the second option, then it means that we can mount only the certs that are needed. Which is in general safer approach. The serviceCertsSecret aggregates all the individual node certificates into one secret, is this correct? Having one secret with all certificates for a service across all nodes is a point of potential issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I submitted openstack-k8s-operators/lib-common#358 so that annotations/labels can be passed for the secretTemplate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand the way things work, there is a single ansibleEE pod that is created for a service, which installs the service on every node in the nodeset. That means that if there are 1000 nodes in the nodeset, then there are 1000 secrets to be mounted. Is there a limit to the number of mounts that a pod can have? Does it make sense to have 1000 mounts? I created this single secret because I was concerned with not breaching that limit.

But, I also realize that there is a limit in the size of the secret (4M?) So, how to avoid that limit?

  • We can also create multiple secrets (with as many certs as can fit in a secret) - and then just mount all these secrets as volumes.

  • Another alternative is to create a single volume per service and copy all the certs and keys into that volume. The main concern there though is that this volume will contain a bunch of secret stuff - ie. keys. We can specify that this volume is ephemeral and should be deleted - but we don't have any control of when that happens.

So, I'm looking for guidance here. What is the expectation of the size of a nodeset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on some back of the envelope calculations, I figure we can end up with about 500 certs in a single secret.
That seems reasonable as an approach.

We do need to do some work where we create multiple secrets if we go over that amount -- well, really keep track of the secret size as we add certs - but I'd like to leave that to a new PR.

I will look into only removing the cacerts from the secret once the PRs with the cacert chains are merged.

Copy link

@bogdando bogdando Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we rotate a cert for a node, we should only touch its own secret (that also should reduce the numbers of reconciliation loops triggered by an unrelated certificate change in the common secret?). To avoid bind mounting housands of certs in large envs, let's use kolla to copy it in instead, or bind-mount a directory (so we could avoid encrypted volumes requirement) that contains all certificates collected by ansibleEE operator directly from secrets (based on secretTemplate/labels provided), instead of bind-mounts? This is how we could mitigate mounts count and secret size limits

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bogdando We're using a secret containing a bunch of other secrets here rather than a bind-mounted directory because we're concerned that the directory could stick around awhile after the ansible EE pod is completed, potentially exposing the secrets for all the compute nodes. Secrets have some built-in protections. They are stored encrypted in etcd, so they are safe at rest. They are only sent to the nodes that need them and they are mounted using ramfs - so there is no place where they are written to disk. The secret size limitation is because the data is stored in etcd.

We could potentially use a mounted volume if that volume were encrypted in some way, but as far as I can tell, there is no openstack-style encrypted cinder volume available in k8s, where a volume is mounted with an encryptor/decryptor mechanism. If you know of such a mechanism, I'd be happy to use it. I did find mention of something like that here -- https://docs.portworx.com/portworx-enterprise/operations/key-management/kubernetes-secrets/pvc-encryption-using-annotations -- but not in openshift.

For openshift, there appear to be a couple of options.
https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html/planning_your_deployment/security-considerations_rhocs (configuring encryption at the container storage level)

https://red-hat-storage.github.io/ocs-training/training/ocs4/ocs4-encryption.html#_overview (integration with vault as a kms)

All of this is way beyond the scope of this patch :)

For the reconciliation loop issue, recall that there will be maybe 500 certs in a secret, so the number of extra reconciliations will be somewhat small. Also, we can add a label to the secret of secrets and some code to tell the reconciliation loop to ignore it. We can handle that in a follow on patch when we test reconciations for cert changes.

I propose we go with this mechanism for now - we know it works - and it provides the needed security without too much extra complexity.

@softwarefactory-project-zuul
Copy link

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/dataplane-operator for 442,6b141c6a0036ecabaa60bfeaf491bf245382015c

Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/fff2fa5866a34c88996d27041521cb73

openstack-k8s-operators-content-provider FAILURE in 9m 19s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
✔️ dataplane-operator-docs-preview SUCCESS in 1m 50s

vakwetu added a commit to vakwetu/edpm-ansible that referenced this pull request Nov 9, 2023
The dataplane operator will provide certs and keys to the ansibleEE pod
in /var/lib/openstack/certs as well as a cacert file in
/var/lib/openstack/cacerts.  It will also pass an environment variable
tls_certs_enabled to indicate that the certs are present.

This PR copies these files to the appropriate place on the compute node
and then mounts these files in the nova-compute container.

We still need to confirm that these files are in the right location
in the container and add config to use these certs - as well as code
to trust the cacert.  Will need help from compute DFG for that.

Depends-On: openstack-k8s-operators/dataplane-operator#442
vakwetu added a commit to vakwetu/edpm-ansible that referenced this pull request Nov 9, 2023
The dataplane operator will provide certs and keys to the ansibleEE pod
in /var/lib/openstack/certs as well as a cacert file in
/var/lib/openstack/cacerts.  It will also pass an environment variable
tls_certs_enabled to indicate that the certs are present.

This PR copies these files to the appropriate place on the compute node
and then mounts these files in the nova-compute container.

We still need to confirm that these files are in the right location
in the container and add config to use these certs - as well as code
to trust the cacert.  Will need help from compute DFG for that.

Depends-On: openstack-k8s-operators/dataplane-operator#442
@vakwetu vakwetu changed the title WIP - Initial commit for EDPM certs Initial commit for EDPM certs Nov 9, 2023
Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/72c02d56d08f47188ddd22a458a3f754

openstack-k8s-operators-content-provider FAILURE in 9m 22s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
✔️ dataplane-operator-docs-preview SUCCESS in 1m 47s

vakwetu added a commit to vakwetu/edpm-ansible that referenced this pull request Nov 15, 2023
The dataplane operator will provide certs and keys to the ansibleEE pod
in /var/lib/openstack/certs as well as a cacert file in
/var/lib/openstack/cacerts.  It will also pass an environment variable
tls_certs_enabled to indicate that the certs are present.

This PR copies these files to the appropriate place on the compute node
and then mounts these files in the nova-compute container.

We still need to confirm that these files are in the right location
in the container and add config to use these certs - as well as code
to trust the cacert.  Will need help from compute DFG for that.

Depends-On: openstack-k8s-operators/dataplane-operator#442
vakwetu added a commit to vakwetu/edpm-ansible that referenced this pull request Nov 15, 2023
The dataplane operator will provide certs and keys to the ansibleEE pod
in /var/lib/openstack/certs as well as a cacert file in
/var/lib/openstack/cacerts.  It will also pass an environment variable
tls_certs_enabled to indicate that the certs are present.

This PR copies these files to the appropriate place on the compute node
and then mounts these files in the nova-compute container.

We still need to confirm that these files are in the right location
in the container and add config to use these certs - as well as code
to trust the cacert.  Will need help from compute DFG for that.

Depends-On: openstack-k8s-operators/dataplane-operator#442
vakwetu added a commit to vakwetu/edpm-ansible that referenced this pull request Nov 15, 2023
The dataplane operator will provide certs and keys to the ansibleEE pod
in /var/lib/openstack/certs as well as a cacert file in
/var/lib/openstack/cacerts.  It will also pass an environment variable
tls_certs_enabled to indicate that the certs are present.

This PR copies these files to the appropriate place on the compute node
and then mounts these files in the nova-compute container.

We still need to confirm that these files are in the right location
in the container and add config to use these certs - as well as code
to trust the cacert.  Will need help from compute DFG for that.

Depends-On: openstack-k8s-operators/dataplane-operator#442
vakwetu added a commit to vakwetu/edpm-ansible that referenced this pull request Nov 22, 2023
The dataplane operator will provide certs and keys to the ansibleEE pod
in /var/lib/openstack/certs as well as a cacert file in
/var/lib/openstack/cacerts.  It will also pass an environment variable
tls_certs_enabled to indicate that the certs are present.

This PR copies these files to the appropriate place on the compute node
and then mounts these files in the nova-compute container.

We still need to confirm that these files are in the right location
in the container and add config to use these certs - as well as code
to trust the cacert.  Will need help from compute DFG for that.

Depends-On: openstack-k8s-operators/dataplane-operator#442
Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/4cc08449e92944098ffadd557633e0a8

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 43m 25s
podified-multinode-edpm-deployment-crc FAILURE in 1h 21m 13s
cifmw-crc-podified-edpm-baremetal FAILURE in 1h 17m 43s
dataplane-operator-docs-preview NODE_FAILURE Node request 200-0006663238 failed in 0s

@vakwetu
Copy link
Contributor Author

vakwetu commented Nov 23, 2023

recheck

Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/590cc0244a8346899438cc3cbef60ca5

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 51m 40s
podified-multinode-edpm-deployment-crc FAILURE in 1h 23m 19s
cifmw-crc-podified-edpm-baremetal NODE_FAILURE Node request 199-0006664258 failed in 0s
dataplane-operator-docs-preview NODE_FAILURE Node request 200-0006664229 failed in 0s

vakwetu added a commit to vakwetu/edpm-ansible that referenced this pull request Nov 27, 2023
The dataplane operator will provide certs and keys to the ansibleEE pod
in /var/lib/openstack/certs as well as a cacert file in
/var/lib/openstack/cacerts.  It will also pass an environment variable
tls_certs_enabled to indicate that the certs are present.

This PR copies these files to the appropriate place on the compute node
and then mounts these files in the nova-compute container.

We still need to confirm that these files are in the right location
in the container and add config to use these certs - as well as code
to trust the cacert.  Will need help from compute DFG for that.

Depends-On: openstack-k8s-operators/dataplane-operator#442
Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/43129b7ea2a94a5ab60890e7e0dd864a

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 37m 32s
podified-multinode-edpm-deployment-crc FAILURE in 1h 18m 23s
cifmw-crc-podified-edpm-baremetal FAILURE in 14m 21s
✔️ dataplane-operator-docs-preview SUCCESS in 1m 55s

Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/55fc1925d0c9420d8890adf261550575

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 41m 41s
podified-multinode-edpm-deployment-crc FAILURE in 1h 21m 08s
cifmw-crc-podified-edpm-baremetal FAILURE in 1h 17m 38s
✔️ dataplane-operator-docs-preview SUCCESS in 2m 11s

@vakwetu vakwetu force-pushed the main branch 2 times, most recently from 4257550 to 3e4e97f Compare November 29, 2023 19:21
// HasTLSCerts - Whether the nodes have TLS certs
// +kubebuilder:validation:Optional
// +operator-sdk:csv:customresourcedefinitions:type=spec,xDescriptors={"urn:alm:descriptor:com.tectonic.ui:booleanSwitch"}
HasTLSCerts *bool `json:"hasTLSCerts,omitempty" yaml:"hasTLSCerts,omitempty"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to also use the field name TLSEnabled here as well? Or, just as a naming convention, call this one TLSCertsEnabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok - will use TLSCertsEnabled


// CACerts - Secret containing the CA certificate chain
// +kubebuilder:validation:Optional
CACerts string `json:"caCerts,omitempty" yaml:"caCerts,omitempty"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the Issuers and CACerts fields that need to be Service specific?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - we have been discussing creating a specific CA for ovn related services. This will allow OVN components to only accept communications from other ovn components.

ips = allIPs[nodeName]

switch service.Name {
case "nova", "libvirt":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to find a way to do this generically, instead of having special handling for individual services. Could we add something to the CRD that indicates what needs to be in the cert?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack - we could do something that ends up in the crd or in the service definition. I suggest we refactor this as a follow-on CR so that we have something working while we design this out.


// add TLS ansible variable
if instance.Spec.TLSEnabled != nil && *instance.Spec.TLSEnabled {
nodeSetGroup.Vars["tls_certs_enabled"] = "true"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be a variable that multiple roles from edpm-ansible would use right? Could we call this edpm_tls_certs_enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes - we can do that

@@ -209,6 +209,24 @@ func (r *OpenStackDataPlaneNodeSetReconciler) Reconcile(ctx context.Context, req
instance.Status.DNSClusterAddresses = dnsClusterAddresses
instance.Status.CtlplaneSearchDomain = ctlplaneSearchDomain

// Issue certs for TLS for services that need them
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably needs to be refactored out into a function that can also be called from https://github.com/openstack-k8s-operators/dataplane-operator/blob/main/controllers/openstackdataplanedeployment_controller.go as that can use an entirely different set of services if servicesOverride is set.

In fact...maybe this logic should just live entirely in openstackdataplanedeployment_controller.go. I'm thinking that might be the way to go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok - will refactor

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually -- I don't understand this comment entirely .. will sync with you,

Copy link
Contributor

openshift-ci bot commented Dec 1, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vakwetu

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants