Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Confidential VM images #1148

Merged
merged 1 commit into from
Jun 1, 2023

Conversation

mresvanis
Copy link
Contributor

@mresvanis mresvanis commented Apr 27, 2023

What this PR does / why we need it:

This PR adds support for Azure Confidential VM images. These images are necessary in order to enable the Cluster API Provider Azure support nodes hosted on Confidential VMs.

Which issue(s) this PR fixes (optional, in fixes #(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #1154

Additional context

Related CAPZ PR.

NOTE: Building a CAPI Ubuntu 22.04 CVM image from a VM instance with vTPM disabled fails due to this bug(?) in nullboot. At the same time, Azure (and thus also packer) does not support publishing trusted launch images to managed images (i.e. when vTpmEnabled: true, we can't publish a managed image, which is currently what we do now for all linux images). Because of this we are currently unable to build an CAPI Ubuntu 22.04 for CVM image.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 27, 2023
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 27, 2023
Copy link
Contributor

@jsturtevant jsturtevant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really cool! Couple thoughts:

  • Are there VHD targets for Confidential VMs?
  • We might need to add something like
    IFS=' ' read -r -a SIG_GEN2_CI_TARGETS <<< "${SIG_GEN2_CI_TARGETS}"
    # Append the "gen2" targets to the original SIG list
    for element in "${SIG_GEN2_CI_TARGETS[@]}"
    do
    SIG_CI_TARGETS+=("${element}-gen2")
    done
    to get CI running

@jsturtevant
Copy link
Contributor

/cc @nawazkh @mboersma

@k8s-ci-robot k8s-ci-robot requested a review from nawazkh April 27, 2023 16:09
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 28, 2023
@mresvanis
Copy link
Contributor Author

mresvanis commented Apr 28, 2023

@jsturtevant thank you for your time and review!

  • Are there VHD targets for Confidential VMs?

I tried to create a VHD from the CVM images with the current packer configuration, but I got the following error:

Build 'azure-arm.vhd-{{user `build_name`}}' errored after 13 minutes 2 seconds: Code="DeploymentFailed" Message="At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details." D
etails=[{"code":"Conflict","message":"{\r\n  \"error\": {\r\n    \"code\": \"OperationNotAllowed\",\r\n    \"message\": \"The specified platform image is not supported for creating a Virtual Machine with unmanaged disks. Please refer to the disallowed VM disk types at https://docs.mi
crosoft.com/en-us/rest/api/compute/virtualmachineimages/get#vmdisktypes. \"\r\n  }\r\n}"}]

Then I tried looking into it and I'm not quire sure we can get VHD from a managed disk via packer (although this is supported by Azure). Any ideas on what I'm missing or how we should move forward?

My two cents would be that since VHDs are slowly getting deprecated, we could skip adding such targets for CVMs. But I am probably missing CAPI's current policy on VHDs and their usecases.

  • We might need to add something like
    IFS=' ' read -r -a SIG_GEN2_CI_TARGETS <<< "${SIG_GEN2_CI_TARGETS}"
    # Append the "gen2" targets to the original SIG list
    for element in "${SIG_GEN2_CI_TARGETS[@]}"
    do
    SIG_CI_TARGETS+=("${element}-gen2")
    done

    to get CI running

Great suggestion, I missed the e2e CI script altogether. I have tried to suggest a solution for the CI CVM targets, keeping in mind that CVM images are only available in 4 Azure locations currently (i.e. eastus, westus, westeurope, northeurope). Do you think this change makes sense?

@mresvanis mresvanis force-pushed the add-cvm-images branch 2 times, most recently from 06a92e0 to 6be5267 Compare April 28, 2023 14:45
@jsturtevant
Copy link
Contributor

My two cents would be that since VHDs are slowly getting deprecated, we could skip adding such targets for CVMs. But I am probably missing CAPI's current policy on VHDs and their usecases.

Yes I generally think that is fine, though we don't have a good way to add an e2e test in capz with the sig images or maybe that has changed @CecileRobertMichon @mboersma?

looks like your ubuntu CMM image is failing (windows passed): https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_image-builder/1148/pull-azure-sigs/1651961055783424000/artifacts/azure-sigs/ubuntu-2204-cvm.log:

�[0;32m    azure-arm.sig-{{user `build_name`}}: TASK [setup : perform a dist-upgrade] ******************************************�[0m
�[0;32m    azure-arm.sig-{{user `build_name`}}: fatal: [default]: FAILED! => {"attempts": 5, "changed": false, "msg": "'/usr/bin/apt-get dist-upgrade ' failed: E: Sub-process /usr/bin/dpkg returned an error code (1)\n", "rc": 100, "stdout": "Reading package lists...\nBuilding dependency tree...\nReading state information...\nCalculating upgrade...\nThe following package was automatically installed and is no longer required:\n  libfreetype6\nUse 'sudo apt autoremove' to remove it.\nThe following packages have been kept back:\n  libglib2.0-0 libglib2.0-bin libglib2.0-data\n0 upgraded, 0 newly installed, 0 to remove and 3 not upgraded.\n1 not fully installed or removed.\nAfter this operation, 0 B of additional disk space will be used.\nSetting up nullboot (0.4.0-0ubuntu0.22.04.1) ...\r\n2023/04/28 14:53:28 cannot trust boot assets used for current boot: open /sys/kernel/security/tpm0/binary_bios_measurements: no such file or directory\r\ndpkg: error processing package nullboot (--configure):\r\n installed nullboot package post-installation script subprocess returned error exit status 1\r\nErrors were encountered while processing:\r\n nullboot\r\nneedrestart is being skipped since dpkg has failed\n", "stdout_lines": ["Reading package lists...", "Building dependency tree...", "Reading state information...", "Calculating upgrade...", "The following package was automatically installed and is no longer required:", "  libfreetype6", "Use 'sudo apt autoremove' to remove it.", "The following packages have been kept back:", "  libglib2.0-0 libglib2.0-bin libglib2.0-data", "0 upgraded, 0 newly installed, 0 to remove and 3 not upgraded.", "1 not fully installed or removed.", "After this operation, 0 B of additional disk space will be used.", "Setting up nullboot (0.4.0-0ubuntu0.22.04.1) ...", "2023/04/28 14:53:28 cannot trust boot assets used for current boot: open /sys/kernel/security/tpm0/binary_bios_measurements: no such file or directory", "dpkg: error processing package nullboot (--configure):", " installed nullboot package post-installation script subprocess returned error exit status 1", "Errors were encountered while processing:", " nullboot", "needrestart is being skipped since dpkg has failed"]}�[0m

@jsturtevant
Copy link
Contributor

Great suggestion, I missed the e2e CI script altogether. I have tried to suggest a solution for the CI CVM targets, keeping in mind that CVM images are only available in 4 Azure locations currently (i.e. eastus, westus, westeurope, northeurope). Do you think this change makes sense?

Yes generally looks like a good way to handle it 👍

@mresvanis
Copy link
Contributor Author

/test pull-azure-vhds

@mresvanis
Copy link
Contributor Author

mresvanis commented May 2, 2023

looks like your ubuntu CMM image is failing (windows passed): https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_image-builder/1148/pull-azure-sigs/1651961055783424000/artifacts/azure-sigs/ubuntu-2204-cvm.log:****

@jsturtevant it seems that dist-upgrade on Ubuntu 22.04 for CVM fails due to this bug in nullboot. In the change I proposed, I was trying to create the image via a non-confidential VM and the absence of TPM leads to the dist-upgrade failing. At the same time, I tried to use a VM size that supports setting its SecurityType to ConfidentialVM or TrustedLaunch, but packer currently does not seem to support setting a SecurityType for a VM.

Since this doesn't seem to be an issue with Ubuntu 20.04 for CVMs, I would drop support for Ubuntu 22.04 for CVMs until there is either a packer update that supports setting SecurityProfile.SecurityType on a VM or nullboot 's bug is resolved. What do you think?

@jsturtevant
Copy link
Contributor

Sounds reasonable to me. Could you create an issue in Packer to add that field and maybe an issue to track adding 2204 here?

@mresvanis
Copy link
Contributor Author

mresvanis commented May 4, 2023

Sounds reasonable to me. Could you create an issue in Packer to add that field and maybe an issue to track adding 2204 here?

@jsturtevant I updated this PR's description with a NOTE to reflect exactly what I managed to find and also asked here about a possible workaround, as it's not clear to me what/how we could resolve the issue in the proper way.

I also opened this issue to make sure we track it properly.

Do you maybe see any possible workarounds (or even the proper solution) that I'm probably missing altogether? Should we move forward without Ubuntu 22.04 LTS for CVMs for now, in order to enable this feature in CAPZ sooner?

@jsturtevant
Copy link
Contributor

I think we can move forward with this and revisit when 22.04 becomes more adopted. Hopefully the bug is resolved by then.

lgtm code wise. @mresvanis is it worth updating the book to point out CVM images and any caveats?

Yes I generally think that is fine, though we don't have a good way to add an e2e test in capz with the sig images or maybe that has changed @CecileRobertMichon @mboersma?

If we need e2e tests in capz we might need to figure this out @mboersma @nawazkh

/assign @mboersma

@mresvanis
Copy link
Contributor Author

lgtm code wise. @mresvanis is it worth updating the book to point out CVM images and any caveats?

@jsturtevant thanks for pointing me to the docs, I added a small section to the book regarding CVM images.

@mresvanis
Copy link
Contributor Author

mresvanis commented May 8, 2023

@jsturtevant the nullboot issue (i.e. trying to run nullbootctl without the vTPM enabled fails) hit us in Ubuntu 20.04 LTS for CVM as well. That's why I implemented a workaround here and added the Ubuntu 22.04 LTS for CVMs back to the Azure SIG CVM build targets.

This workaround is not great, but it provides users with CAPZ Ubuntu images for confidential VMs. Do you think this is an acceptable workaround?

@mresvanis
Copy link
Contributor Author

/test pull-azure-vhds

1 similar comment
@mresvanis
Copy link
Contributor Author

/test pull-azure-vhds

@mresvanis
Copy link
Contributor Author

/test pull-ova-all

@jsturtevant
Copy link
Contributor

This workaround is not great, but it provides users with CAPZ Ubuntu images for confidential VMs. Do you think this is an acceptable workaround?

I don't really know enough about that that is doing. Could you give a summary?

@mresvanis
Copy link
Contributor Author

mresvanis commented May 15, 2023

This workaround is not great, but it provides users with CAPZ Ubuntu images for confidential VMs. Do you think this is an acceptable workaround?

I don't really know enough about that that is doing. Could you give a summary?

Absolutely, the issue we face when building CAPZ images based on Ubuntu {20,22}.04 LTS for CVMs is due to nullboot's post-install script expecting vTPM to be enabled. This post-install script is run after apt-get dist-upgrade and fails because we cannot have vTPM enabled in the VMs we're building the images from. This leads to apt-get dist-upgrade failing, which in turn results into the image build failure.

To resolve the issue, i.e. to have a successful apt-get dist-upgrade with nullboot and vTPM disabled, I did the following:

  • hold the nullboot package, i.e. do not update it when apt-get dist-upgrade
  • temporarily adjust the nullboot post-install script to not fail when vTPM is disabled (this is accomplished by adding 2 additional fields in nullbootctl)
  • run apt-get dist-upgrade
  • change the nullboot post-install script back to its original
  • unhold the nullboot package, i.e. when apt-get update | dist-upgrade runs, it will be updated to its latest version

This is configured to only run for debian-based (i.e. Ubuntu) CVM images. What do you think?

@nawazkh
Copy link
Member

nawazkh commented May 17, 2023

@mresvanis Thank you for the explanation, makes sense to me. Though I am not so well versed with image-builder flow, your approach sounds good to me. Few questions:

  • What are the maintainable components in this PR we need to be mindful of as the OS offerings increase or change going ahead?
  • Is there a scope to add a test to validate the changes in being generated in the new image? I once added a goss test in one of the PRs I worked on add /etc/cloud/cloud.cfg.d/15_azure-vnet.cfg to all Azure VMs #1090 ; I am not sure, but can we add something similar ?

@mresvanis
Copy link
Contributor Author

mresvanis commented May 17, 2023

@nawazkh thank you for reviewing this change!

  • What are the maintainable components in this PR we need to be mindful of as the OS offerings increase or change going ahead?

If I understand your question correctly, ideally we would just run the regular image-builder flow on top of specific CVM images without any additional changes to them. For example, exactly what this change adds for the Windows CVM images.

Very good point! I tried to think about how we could test the changes of CVM images. But since we are not doing anything more other than base the CAPI images to the Ubuntu and Windows CVM images, without any CVM specific configuration needed, we should be good to go. If we were to test/check for specific CVM image configuration, already included in those CVM base images, I believe there is no way (at least not at the time of this writing) to consistently test that the CVM images we are basing upon the CAPI CVM images, include everything that is needed, for them to be successfully used in CVM instances. We can only trust that the Ubuntu and Windows base CVM images have everything we need.

IMHO the holding/unholding nullboot package workaround should be covered by the inclusion of the CVM image building in the e2e CI tests. I am not sure I covered your questions, but please feel to let me know for more details or questions, I'll be happy to try and help more.

@nawazkh
Copy link
Member

nawazkh commented May 18, 2023

^ Thank you for the detailed explanation and answer @mresvanis. I am happy with the changes, explanation makes sense to me, and I would love to "lgtm" this PR.

"looks good to me!"

Delegating to @jsturtevant and @mboersma for their thoughts :)

@mresvanis
Copy link
Contributor Author

/test pull-ova-all

@mresvanis
Copy link
Contributor Author

/test pull-azure-vhds

@jsturtevant
Copy link
Contributor

  • change the nullboot post-install script back to its original
  • unhold the nullboot package, i.e. when apt-get update | dist-upgrade runs, it will be updated to its latest version

This should only be needed for the 22.04 LTS images, right?

A few more questions about the workaround:

  • What happens next time a user runs an upgrade? will they run into this issue as well?
  • Do the CVM images need a TPM? If we can't publish the image due to the azure bug are we getting anything from building them?

I'm wondering if we can get away with just skipping 22.04 until the issues have been resolved upstream.

Thanks for your patience and help understanding all the nuances.

@mresvanis
Copy link
Contributor Author

mresvanis commented May 25, 2023

  • change the nullboot post-install script back to its original
  • unhold the nullboot package, i.e. when apt-get update | dist-upgrade runs, it will be updated to its latest version

This should only be needed for the 22.04 LTS images, right?

@jsturtevant unfortunately, we have the same issue with Ubuntu 20.04 LTS as well. We saw the issue appearing in both distributions gradually, as the latest version of the nullboot package was being gradually released.

A few more questions about the workaround:

  • What happens next time a user runs an upgrade? will they run into this issue as well?

Azure CVMs won't boot unless their OS image adheres to specific requirements (which are not yet published, but I'm confident that vTPM PCR measurements are a hard requirement). The Ubuntu CVM images are specific to the Azure CVMs, so users would use them only on those supported VM sizes. These VM sizes require vTPM to be enabled, otherwise Azure just refuses to create them (there is no point in using a CVM if it cannot be attested to). If a user tries to create a CVM with a non-compliant OS images, then the VM just won't boot.

When a user runs an apt-get dist-upgrade on a CVM using one of the Ubuntu CVM images, everything should work as expected, because vTPM will be enabled (as it's a hard requirement to be so).

  • Do the CVM images need a TPM? If we can't publish the image due to the azure bug are we getting anything from building them?

As described above, CVMs require vTPM to be enabled and these OS images are a hard requirement, in order to be able to boot those VMs (and after boot for users to be able to attest them).

I'm wondering if we can get away with just skipping 22.04 until the issues have been resolved upstream.

That was my initial thought as well, but since the nullboot package was updated also in Ubuntu 20.04 LTS, we would have to skip both Ubuntu images (which would leave us with only the Windows ones and I think having both Windows and Linux images is quite useful).

Thanks for your patience and help understanding all the nuances.

Absolutely no worries, we have to be sure we're adding reliable and useful changes. In addition, CVMs are a quite recent field :)

@jsturtevant
Copy link
Contributor

These VM sizes require vTPM to be enabled, otherwise Azure just refuses to create them

So the VM sizes we are using to build this images don't have the vTPM enabled? But we expect users to select VM sizes in Azure that do make sense when creating VMs from the images?

Would we have this issue if the VM size we are building from has a vTPM?

@mresvanis
Copy link
Contributor Author

@jsturtevant thank you for digging in, great questions and those are exactly what I was also trying to figure out. Please let me know if the following makes sense:

So the VM sizes we are using to build this images don't have the vTPM enabled?

Exactly. We can't have SecureBoot or vTPM enabled when creating a managed image.

But we expect users to select VM sizes in Azure that do make sense when creating VMs from the images?

That's correct. The CVM images target Azure Confidential VMs only, but currently we can't use such VMs to build the respective CAPZ managed images.

Would we have this issue if the VM size we are building from has a vTPM?

The VM size we are building from can have a vTPM, but not an enabled one (here is the error from the packer plugin).

It is the combination of the following two issues that lead me to this implementation:

  • nullboot by default requires vTPM to be enabled when being configured (which happens with apt-get dist-upgrade)
  • packer-plugin-azure (and also Azure) does not allow us to create a managed image with vTPM enabled

@jsturtevant
Copy link
Contributor

Thanks for unraveling that for me. I think I finally understand the various bugs :-)

/lgtm
/assign @mboersma

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 1, 2023
Copy link
Contributor

@mboersma mboersma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

I'm sorry I didn't review this earlier! I've finally caught up–thank you for the detailed explanation of the complexities here.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mboersma, mresvanis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 1, 2023
@k8s-ci-robot k8s-ci-robot merged commit 3779daf into kubernetes-sigs:master Jun 1, 2023
@mresvanis mresvanis deleted the add-cvm-images branch June 2, 2023 06:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support building Azure Confidential VM image based on Ubuntu 22.04 LTS for CVMs
5 participants