Add SecurityProfile.EncryptionAtHost parameter to enable host-based VM encryption #1012

Merged Nov 2, 2020 (1 commit)


dkorzuno
Contributor

@dkorzuno dkorzuno commented Oct 27, 2020

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

The PR adds a parameter which enables encryption at host for virtual machines.

Which issue(s) this PR fixes

This PR addresses the second part of #982 started by @mjudeikis

Special notes for your reviewer:

I added what seemed reasonable to me, but I'm not entirely sure that I haven't missed anything. In particular, I have some doubts about the conversion part. I'll be happy to make whatever changes are required.

Also, I added a couple of unit tests in a somewhat "copy-pasty" manner. While they seem to follow suit and do cover the change, maybe I should clean them up a bit to exclude extra checks?

  • [+] squashed commits
  • [-] includes documentation
  • [+] adds unit tests

Release note:

1. Add SecurityProfile.EncryptionAtHost parameter to machine spec to enable host-based VM encryption.

@k8s-ci-robot added the release-note, kind/feature, and kind/api-change labels Oct 27, 2020
@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
  • If you have done the above and are still having issues with the CLA being reported as unsigned, please log a ticket with the Linux Foundation Helpdesk: https://support.linuxfoundation.org/
  • Should you encounter any issues with the Linux Foundation Helpdesk, send a message to the backup e-mail support address at: [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot added the cncf-cla: no label Oct 27, 2020
@k8s-ci-robot
Contributor

Welcome @dkorzuno!

It looks like this is your first PR to kubernetes-sigs/cluster-api-provider-azure 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-provider-azure has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @dkorzuno. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot added the needs-ok-to-test and size/L labels Oct 27, 2020
@k8s-ci-robot added the area/provider/azure and sig/cluster-lifecycle labels Oct 27, 2020
@dkorzuno
Contributor Author

dkorzuno commented Oct 27, 2020

Followed the CLA instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement.

@k8s-ci-robot added the cncf-cla: yes label and removed the cncf-cla: no label Oct 27, 2020
}

return &compute.SecurityProfile{
EncryptionAtHost: to.BoolPtr(*vmSpec.SecurityProfile.EncryptionAtHost),
Contributor

this isn't supported on all VM sizes, it requires a check against the capability (see accelerated networking for an example).

we should probably fail fast with error + no requeue in the reconciler if it's not possible to provision that for the user.

Contributor Author

I added the capability check.

}

return &compute.SecurityProfile{
EncryptionAtHost: to.BoolPtr(*scaleSpec.SecurityProfile.EncryptionAtHost),
Contributor

Contributor Author

Added the capability check to the function.

@k8s-ci-robot added and then removed the needs-rebase label Oct 27, 2020
Contributor

@devigned left a comment

I'm a little bummed we can't reject this in the webhook rather than in reconciliation.

Actually, I think returning an error on this is going to cause an endless reconcile loop. We will never get to a terminal state unless the user changes the VM SKU or disables EncryptionAtHost.

@alexeldeib wdyt?

@CecileRobertMichon
Contributor

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Oct 28, 2020
@alexeldeib
Contributor

alexeldeib commented Oct 28, 2020

I'm a little bummed we can't reject this in the webhook rather than in reconciliation.

Agreed 🙁

Actually, I think returning an error on this is going to cause an endless reconcile loop.

IMO, we need a way to bubble terminal vs non-terminal errors to the top. When you write one big reconcile loop, it's super easy -- log the error and either return if non-terminal or don't return the error if it's terminal. We could use a good way to bubble that up from inside the reconcilers (I smell custom error types).

for this PR, probably fine to let it loop sadly.

Contributor

@devigned left a comment

With #1014 logged, I'm good with merging this.

lgtm

I'm going to leave it open for a bit to allow others to review. Thank you for the PR!

},
}, nil)
s.GetBootstrapData(context.TODO()).Return("fake-bootstrap-data", nil)
m.CreateOrUpdate(context.TODO(), "my-rg", "my-vmss", gomockinternal.DiffEq(compute.VirtualMachineScaleSet{
Contributor

Instead of checking the entire struct here, since we only care about testing that the encryption at host functionality is enabled, it would make sense to do the same thing you did in the virtualmachines test: https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/1012/files#diff-f2a0391546f379219695118e52694fe6a8c5739619cb672941617a78d0c63336R770

Contributor Author

Updated the test... maybe not in the best possible way, though. I had to modify the test case interface a bit to pass through the *WithT pointer. Happy to change it if needed.

@dkorzuno
Contributor Author

dkorzuno commented Oct 28, 2020

Actually, I think returning an error on this is going to cause an endless reconcile loop.

IMO, we need a way to bubble terminal vs non-terminal errors to the top. When you write one big reconcile loop, it's super easy -- log the error and either return if non-terminal or don't return the error if it's terminal. We could use a good way to bubble that up from inside the reconcilers (I smell custom error types).

A random thought - at least for virtualmachine the error handling can be improved a bit:

  • instead of processing each vmSpec sequentially, start a goroutine per vmSpec which creates the machine independently
  • Reconcile will wait for all goroutines to finish and will log errors if any

This way an error in one VM specification won't affect others. On next Reconcile the successfully created machines will be skipped, but the failed ones will be retried.

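Something along these lines:

ch := make(chan error)
specs := s.Scope.VMSpecs()
for _, vmSpec := range specs {
	// Pass the spec as an argument so each goroutine works on its own copy.
	go func(spec azure.VMSpec) {
		ch <- ReconcileOne(ctx, spec)
	}(vmSpec)
}
// Receive exactly one result per goroutine so that none of them blocks forever.
for i := 0; i < len(specs); i++ {
	if err := <-ch; err != nil {
		// handle error
	}
}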

@CecileRobertMichon
Contributor

This smells fishy to me, that feels like a cluster-level or even machinepool level resource, rather than per-machine resource? I'm struggling to think of a scenario where you'd regularly want multiple VMs per AzureMachine. I was a bit surprised looking back at the code that VMSpecs is an array

Yeah, arguably it's not a good reason; at the time it seemed easier to be consistent everywhere. I think VM and vnet (and scale set, but that one is still not refactored) are probably the only ones that are exceptions to the rule? I'd be open to switching those to not be arrays. It's not an API change, so it shouldn't be breaking for users.

@mjudeikis
Contributor

I'm a little bummed we can't reject this in the webhook rather than in reconciliation.

I might be missing something. Is there something stopping us from doing this in the webhook? There were multiple comments about this. Is this a request, or just a rant that this is technically not possible? #wannalearn

@devigned
Contributor

devigned commented Oct 29, 2020

I'm a little bummed we can't reject this in the webhook rather than in reconciliation.

@mjudeikis, it's the start of a rant. Since the webhooks don't have an authenticated context and we don't make any service calls within the webhook (we don't call out to Azure), we are unable to check whether the resource SKU supports EncryptionAtHost for a given subscription and region. This leads to the controller accepting updates to a resource which should really be rejected by the webhook. Since it is not rejected, we then have to provide feedback to the user that their resource is not in a valid state and that they need to remediate. It's been a long-standing issue that we have punted on due to the amount of work needed to make that happen.

Great question!

@dkorzuno
Contributor Author

Is there anything else I can do in the context of this PR?

@CecileRobertMichon
Contributor

@dkorzuno to me the only actionable item for this PR is to persist the value in the controller (ie. if nil, set it to false) so that it doesn't get changed later on existing clusters if we change the default to "true if supported". @alexeldeib @devigned does that sound reasonable to you?

@dkorzuno
Contributor Author

@CecileRobertMichon let me clarify it as I got confused a bit.
Do you mean that the getSecurityProfile function should return an allocated SecurityProfile with EncryptionAtHost set to false, as in

func getSecurityProfile(vmssSpec azure.ScaleSetSpec, sku resourceskus.SKU) (*compute.SecurityProfile, error) {
	if vmssSpec.SecurityProfile == nil {
		return &compute.SecurityProfile{
			EncryptionAtHost: to.BoolPtr(false),
		}, nil
	}

	if !sku.HasCapability(resourceskus.EncryptionAtHost) {
		return nil, errors.Errorf("encryption at host is not supported for VM type %s", vmssSpec.Size)
	}

	return &compute.SecurityProfile{
		EncryptionAtHost: to.BoolPtr(*vmssSpec.SecurityProfile.EncryptionAtHost),
	}, nil
}

Or (which I'm leaning towards), if getSecurityProfile notices that SecurityProfile in the spec is nil, then it should write back the spec to Etcd with the SecurityProfile field pre-populated?

If it's the latter, what would be the best approach there - add UpdateMachineSpec method to ScaleSetScope interface and implement it for MachinePoolScope?

@alexeldeib
Contributor

I'm sure Cecile will comment too, but personally I think we should leave the field as an optional pointer but default it to false in the webhook.

Or (which I'm leaning towards), if getSecurityProfile notices that SecurityProfile in the spec is nil, then it should write back the spec to Etcd with the SecurityProfile field pre-populated

so basically this, but in the webhook for now should be okay I think?

in the future, we would want to be able to dynamically detect whether a size supports it without forcing the user to tell us. to do that would still require what you describe, and yes it would require some kind of setter on the scope to write back into the CRD object.
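
A minimal sketch of the "optional pointer, defaulted to false" suggestion above, assuming simplified stand-in types: SecurityProfile and MachineSpec here are cut-down illustrations, and setDefaultEncryptionAtHost is a hypothetical helper rather than the project's actual webhook code. Note that later in this thread the defaulting is reverted, because sending the field at all fails on subscriptions where the feature is not enabled.

```go
package main

import "fmt"

// SecurityProfile and MachineSpec are simplified stand-ins for the real API types.
type SecurityProfile struct {
	// EncryptionAtHost stays an optional pointer so "unset" and "false" are distinguishable.
	EncryptionAtHost *bool
}

type MachineSpec struct {
	SecurityProfile *SecurityProfile
}

// setDefaultEncryptionAtHost is a hypothetical defaulting helper: it persists
// an explicit false when the user left the field unset, and never overrides
// a value the user provided.
func setDefaultEncryptionAtHost(spec *MachineSpec) {
	if spec.SecurityProfile == nil {
		spec.SecurityProfile = &SecurityProfile{}
	}
	if spec.SecurityProfile.EncryptionAtHost == nil {
		disabled := false
		spec.SecurityProfile.EncryptionAtHost = &disabled
	}
}

func main() {
	spec := MachineSpec{}
	setDefaultEncryptionAtHost(&spec)
	fmt.Println(*spec.SecurityProfile.EncryptionAtHost) // prints "false"
}
```

Persisting the explicit value means a later change of the default (for example to "true if supported") would not silently alter existing objects.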

@k8s-ci-robot added the size/XL label and removed the size/L label Oct 30, 2020
@dkorzuno
Contributor Author

so basically this, but in the webhook for now should be okay I think?

Done.

@CecileRobertMichon
Contributor

E2E is failing with

E1030 04:05:18.264096       1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="failed to reconcile AzureMachine: failed to create virtual machine: failed to create VM capz-e2e-oo3dfl-control-plane-p6tmv in resource group capz-e2e-oo3dfl: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code=\"InvalidParameter\" Message=\"The property 'securityProfile.encryptionAtHost' is not valid because the 'Microsoft.Compute/EncryptionAtHost' feature is not enabled for this subscription.\" Target=\"securityProfile.encryptionAtHost\"" "controller"="azuremachine" "name"="capz-e2e-oo3dfl-control-plane-p6tmv" "namespace"="create-workload-cluster-ao4au3"

🤨

Is EncryptionAtHost a gated feature? Also strange that it's returning an error message when the value is false...

@CecileRobertMichon
Contributor

Confirmed: the docs mention you have to send an email to get it activated on your subscription (https://docs.microsoft.com/en-us/azure/virtual-machines/linux/disks-enable-host-based-encryption-cli#prerequisites), so we can't do the defaulting, otherwise it will break all users that haven't enabled it 😞

@dkorzuno let's revert the defaulting, sorry for making you do that extra work for nothing...

@k8s-ci-robot added the size/L label and removed the size/XL label Nov 1, 2020
@dkorzuno
Contributor Author

dkorzuno commented Nov 1, 2020

/retest

@k8s-ci-robot
Contributor

@dkorzuno: The /retest command does not accept any targets.
The following commands are available to trigger jobs:

  • /test pull-cluster-api-provider-azure-test
  • /test pull-cluster-api-provider-azure-build
  • /test pull-cluster-api-provider-azure-e2e
  • /test pull-cluster-api-provider-azure-e2e-full
  • /test pull-cluster-api-provider-azure-capi-e2e
  • /test pull-cluster-api-provider-azure-verify
  • /test pull-cluster-api-provider-azure-conformance-v1alpha3
  • /test pull-cluster-api-provider-azure-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-azure-apidiff
  • /test pull-cluster-api-provider-azure-coverage

Use /test all to run the following jobs:

  • pull-cluster-api-provider-azure-test
  • pull-cluster-api-provider-azure-build
  • pull-cluster-api-provider-azure-e2e
  • pull-cluster-api-provider-azure-verify
  • pull-cluster-api-provider-azure-apidiff
  • pull-cluster-api-provider-azure-coverage

In response to this:

/retest ? hmm... I believe the changes without the latest modifications have already passed the e2e test at least once.


@CecileRobertMichon
Contributor

/retest

Contributor

@CecileRobertMichon left a comment

/lgtm

@k8s-ci-robot added the lgtm label Nov 2, 2020
@CecileRobertMichon
Contributor

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Nov 2, 2020
@k8s-ci-robot merged commit c613c3b into kubernetes-sigs:master Nov 2, 2020
@k8s-ci-robot added this to the v0.4.10 milestone Nov 2, 2020
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.