
Libvirt e2e CI for pull requests #1318

Merged

Conversation

wainersm
Member

@wainersm wainersm commented Aug 9, 2023

This is my second attempt at getting the libvirt e2e tests running for pull requests.

From the user's point of view, the workflow is executed by adding the test_e2e_libvirt label. In theory, if we add support for other providers it would just be a matter of adding a new label. However, that may not work well, as the common jobs (see below) will be triggered every time a test_e2e_* label is added to the PR. I prefer to defer that discussion until (and if) we add other e2e jobs.
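As a minimal sketch of that trigger (the event type, label name, and `if:` expression mirror this PR; the job layout is illustrative, not the final workflow):

```yaml
on:
  pull_request:
    types: [labeled]

jobs:
  libvirt-e2e:
    # run only when the 'test_e2e_libvirt' label is present on the PR
    if: contains(github.event.pull_request.labels.*.name, 'test_e2e_libvirt')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
```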

Still regarding how it is triggered, an alternative would be to use comments on the PR (e.g. /test_e2e_libvirt), but then we hit a bunch of limitations in GitHub Actions' design. To mention the most basic: on a comment, the workflow runs in the context of the main branch, meaning it doesn't get a reference to the pull request number (or other information like the source tree/branch). That's why I decided to go with the label approach.

As in the previous attempt, the workflow is split into two sets of jobs: a) common/shareable among cloud e2e, b) cloud-specific e2e.

The common jobs:

  • job1: podvm builder job
  • job2: image builder job

Job1 does not actually build the podvm, as that is a time- and resource-consuming task. Instead, it uses ./podvm/hack/download-image.sh to extract the qcow2 file from the already built quay.io/confidential-containers/podvm-* images. The qcow2 files are attached to the workflow as artifacts so that downstream jobs (i.e. the jobs that effectively run the e2e tests) can simply download and use the disk. Note: I am not sure about this approach of archiving the disk files as they are huge... we may hit storage issues.
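A rough sketch of job1 (the download-image.sh path and the quay.io repository come from the description above; the full image names, the script's arguments, and the artifact layout are assumptions):

```yaml
  podvm:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        os: [ubuntu, centos]
    steps:
      - uses: actions/checkout@v3
      - name: Extract the qcow2 from the pre-built podvm image
        # the exact download-image.sh arguments and the full image name are
        # illustrative; the script pulls the container image and extracts the qcow2
        run: |
          ./podvm/hack/download-image.sh \
            quay.io/confidential-containers/podvm-generic-${{ matrix.os }}-amd64:latest .
      - name: Attach the disk to the workflow so downstream jobs can download it
        uses: actions/upload-artifact@v3
        with:
          name: podvm-${{ matrix.os }}-amd64-qcow2
          path: ./*.qcow2
```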

Job2 builds and pushes the cloud-api-adaptor image to ghcr.io/confidential-containers/cloud-api-adaptor. The kustomization files (install dir) are changed to use that new image on installation. So, just as in job1, the entire install directory is attached to the workflow so that downstream jobs can simply use it (here the files are a couple of kB, so we won't have storage problems).
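A rough sketch of job2, assuming the `make image` entry point and the DEV_TAGS knob described in the commit messages further down (the tag scheme, artifact name, and registry handling are assumptions):

```yaml
  image:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Build and push cloud-api-adaptor
        # DEV_TAGS comes from the hack/build.sh change in this PR; tagging with the
        # PR number keeps the temporary ghcr.io images easy to clean up later.
        # Pushing requires a token with ghcr.io write access (see the discussion below).
        run: |
          make image DEV_TAGS="dev-${{ github.sha }},pr-${{ github.event.pull_request.number }}"
      - name: Archive the install directory for the downstream e2e jobs
        uses: actions/upload-artifact@v3
        with:
          name: install_directory
          path: install/
```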

One more comment on job2: as it creates temporary images on ghcr.io, I implemented a script that cleans up unused images. I will watch how this PR evolves before publishing its code.

Fixes #599

@wainersm wainersm added the 'test_e2e_libvirt' (Run Libvirt e2e tests) label on Aug 9, 2023
@wainersm
Member Author

wainersm commented Aug 9, 2023

Notice that I ran the e2e libvirt workflow by labeling this PR with test_e2e_libvirt.

The e2e tests / image / build and push (dev, linux/amd64) and e2e tests / image / build and push (release, linux/amd64,linux/s390x,linux/ppc64le) jobs failed to push the built CAA image, the reason being that the pull_request event doesn't give the workflow a GITHUB_TOKEN with admin privileges (for security reasons). To make it work, we will need to run the entire workflow as pull_request_target, or the publishing job as workflow_run (because those two events are granted a read/write token for ghcr.io).
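For reference, a hedged sketch of option a) (the job and step names are placeholders; only the event name, the permissions idea, and the ghcr.io login via GITHUB_TOKEN are taken from the discussion above):

```yaml
# Option a): run the whole workflow on pull_request_target so the default
# GITHUB_TOKEN is read/write even for PRs coming from forks.
on:
  pull_request_target:
    types: [labeled]

permissions:
  contents: read
  packages: write   # lets the job push to ghcr.io with the default token

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - name: Login to ghcr.io with the workflow token
        uses: docker/login-action@v2
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
# Option b) would instead move only the publishing job into a separate
# workflow triggered by workflow_run, which also gets a read/write token.
```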

Regardless, I would like to receive feedback on this version first. :)

@wainersm
Member Author

Updated with a fix to the 'Install tests dependencies' step in the libvirt job. I added a commit disabling the image build for the sake of testing the libvirt job; I will remove that commit before the merge.

@wainersm
Member Author

Updated the libvirt job to install a given Go version and the build-essential package.

@wainersm
Member Author

Updated the PR with:

  • now pinning the version of k8s installed (v1.26.7)
  • downloading the correct version of Ubuntu (20.04 instead of 22.04)

@wainersm
Member Author

Blocked by #1332

@wainersm
Member Author

Fix #1332 is merged, rebased and giving it a try again.

@stevenhorsman
Member

stevenhorsman commented Aug 22, 2023

Fix #1332 is merged, rebased and giving it a try again.

FYI the CAA image build failed for this: https://github.com/confidential-containers/cloud-api-adaptor/actions/runs/5939073597/job/16104756661 so it probably won't work yet. Re-queued it to see if that helps.

@wainersm wainersm removed the 'hold' and 'test_e2e_libvirt' (Run Libvirt e2e tests) labels on Aug 22, 2023
@wainersm wainersm force-pushed the workflows_for_libvirt_ci branch from 4d70490 to 4de5345 on August 24, 2023 14:09
@wainersm wainersm removed the 'test_e2e_libvirt' (Run Libvirt e2e tests) label on Aug 24, 2023
@wainersm wainersm force-pushed the workflows_for_libvirt_ci branch from 4de5345 to 56dce17 on August 24, 2023 17:02
@wainersm
Member Author

@bpradipt @stevenhorsman I finally made this work; however, one code path was not tested: the job that builds the CAA image (although I tested it on a fork), because it has to run on pull_request_target (meaning the workflow executed must be the one in the main branch). I had to disable two tests that are failing only on CI, but I will investigate the problem once this is merged.

I'd appreciate it if you could do a "loose" review, as I would like to have it merged asap in order to 1) check whether it really works and then send fixes; 2) start testing this project's PRs against the libvirt provider.

@stevenhorsman
Member

Do we have any issues like the Azure tests where we need to have an approval before the libvirt tests run to avoid exposing secrets/Fabiano using the nodes for Crypto mining?

@wainersm
Member Author

Do we have any issues like the Azure tests where we need to have an approval before the libvirt tests run to avoid exposing secrets/Fabiano using the nodes for Crypto mining?

Good question!

The workflow needs a privileged token to publish the CAA images on ghcr.io. As the workflow runs on pull_request_target, it is not possible for a pull request to run a malicious workflow; that's the first fence of protection. The second is the label: someone needs to label the PR 'test_e2e_libvirt', otherwise it won't run. Only committers should be allowed to put labels on PRs, so this is basically the approval mechanism implemented. Am I wrong, or can anyone label a PR?

Unlike some workflows in Kata, here we are not explicitly instantiating any resources on Azure, so we don't even need secrets configured on the project. The runners are auto-magically deployed by garm and handed over as self-hosted runners. This is the third security fence, a.k.a. the anti-Fabiano crypto-scam fence ;)

tree install
git diff
export TEST_PROVISION="yes"
export TEST_TEARDOWN="no"
Member


I might come to it in a later commit, but is Teardown set to "no" so we can get debug info, given that the test runner node should be cleaned up anyway at the end of the workflow?

Member Author


Exactly Steve!

Member

@stevenhorsman stevenhorsman left a comment


I've given it a non-extensive review and it looks good. I wonder if we can start to re-use pieces of this in the other (e.g. Azure, but maybe internal IBM) tests too, but that's something for when the code is more tested. Thanks for the great work Wainer

@wainersm
Member Author

I've given it a non-extensive review and it looks good. I wonder if we can start to re-use pieces of this in the other (e.g. Azure, but maybe internal IBM) tests too, but that's something for when the code is more tested. Thanks for the great work Wainer

Thanks for the review sir!

My aim was indeed to create a reusable workflow; for example, the podvm job (which should be improved) and the CAA build job are meant to be generic. But I didn't have time to look in depth at how the CI for the other providers is implemented today. I definitely want more shareable code among the providers' CI.

runs-on: ubuntu-latest
if: ${{ contains(github.event.pull_request.labels.*.name, 'test_e2e_libvirt') }}
steps:
- run: true
Collaborator


it's a dummy, but it should be `- run: "true"`, otherwise it's interpreted as a boolean
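i.e. with quotes the value stays a string that the shell runs as a command, instead of being parsed as the YAML boolean `true`:

```yaml
    steps:
      - run: "true"   # a no-op shell command, kept as a string by the quotes
```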

Member Author


hey @mkulke, first of all, thanks for the review. For this comment in particular, yeah, I hadn't tested that code path...good catch. Fixed.


- name: Update kustomization configuration
run: |
providers=(${{ env.PROVIDERS }})
Collaborator


Could providers be a matrix?

Member Author


Yes, they could, but I thought it would be overkill. My idea is to have a single install directory shared by all downstream jobs (i.e. provider-specific jobs). If I convert it to a GHA matrix it will create N jobs (N VMs instantiated) to produce the same artifact, which seems like a waste of resources.
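For example, something along these lines keeps it a single job (the overlay path and the use of `kustomize edit set image` are assumptions about how the install dir gets rewritten; the `providers=(${{ env.PROVIDERS }})` idiom is from the snippet above):

```yaml
      - name: Update kustomization configuration
        run: |
          providers=(${{ env.PROVIDERS }})
          for provider in "${providers[@]}"; do
            # hypothetical overlay layout: install/overlays/<provider>
            pushd "install/overlays/${provider}"
            kustomize edit set image \
              "cloud-api-adaptor=ghcr.io/confidential-containers/cloud-api-adaptor:dev-${GITHUB_SHA}"
            popd
          done
      - name: Archive the single install directory shared by all downstream jobs
        uses: actions/upload-artifact@v3
        with:
          name: install_directory
          path: install/
```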

Created a callable workflow from image.yaml to allow us to reuse the
code in other workflows while tweaking parameters. For example, in an e2e
testing workflow on a pull request we don't want to overwrite `latest` on
quay.io, so we may instead push a temporary image to ghcr.io.

A caller workflow will be able to override:
 * the registry. Defaults to 'quay.io/confidential-containers'
 * the dev build arches. Defaults to 'linux/amd64'
 * the release build arches. Defaults to 'linux/amd64,linux/s390x,linux/ppc64le'

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
Currently hack/build.sh builds the cloud-api-adaptor image in dev and
release modes. The image is tagged `dev-$sha1` and `latest` in dev mode,
and just `$sha1` for releases. However, we may want to tag the image
differently; for example, in a release context the image should carry the
git tag (e.g. `v0.7.0`). This changes the script to optionally read the
tags to use from the `DEV_TAGS` and `RELEASE_TAGS` environment variables.

For instance, `make image RELEASE_BUILD=true RELEASE_TAGS=v0.7.0` will push a
release image to quay.io/confidential-containers/cloud-api-adaptor:v0.7.0.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
Added just a skeleton for the execution of the libvirt e2e tests on pull
requests; it does nothing at this point. The workflow runs when the PR
is labeled 'test_e2e_libvirt'.

Fixes confidential-containers#599
Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
Added a job to build the podvm for various provider/OS/ARCH combinations;
for now it just downloads the qcow2 files and puts them into an archive
so that downstream jobs can simply fetch the files onto their runners.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
Changed the caa_build_and_push.yaml workflow to accept new inputs:

* dev_tags: Comma-separated list of tags for the dev-built image (e.g. latest,ci-dev). By default it uses the values from hack/build.sh
* release_tags: Likewise, but for the release-built image

This will allow us to publish cloud-api-adaptor images with tags different
from the defaults. Examples: a) for the sake of CI we don't need to publish
the release builds with the `latest` tag, and we may also record the pull
request number in the tag; b) a release workflow may want to tag the image
with a git tag.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
The added job builds cloud-api-adaptor and pushes it to the ghcr.io registry.
The kustomization files are modified to reference the new images, then the
install directory is archived so that downstream jobs can use those files.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
Fully implemented the libvirt e2e workflow to run the tests with the
Ubuntu podvm.

Fixes confidential-containers#599
Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
Run the libvirt e2e tests for CentOS podvm too.

Fixes confidential-containers#599
Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
A self-hosted runner labeled 'az-ubuntu-2204' was added on Azure via
Garm (**). Run the libvirt e2e workflow on that VM, which supports
nested virt.

(**) https://github.com/confidential-containers/infra/tree/main/github/azure-self-hosted-runners
Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
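In the workflow that simply means targeting the garm-provided runner label (sketch; only the label name comes from the commit message above):

```yaml
  libvirt:
    # self-hosted runner deployed by garm on Azure, with nested virtualization enabled
    runs-on: az-ubuntu-2204
```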
The libvirt/kcli_version.sh script will create a k8s cluster at version
1.26.7 unless the CLUSTER_VERSION variable is exported.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
The provisioner's Deploy() and Delete() should print the output of
`kubectl apply -k` before the function returns in case of failure.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
Print debug information at the end of the Libvirt e2e workflow if the
'run tests' step failed.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
The TestLibvirtCreatePodWithConfigMap and
TestLibvirtCreatePeerPodContainerWithExternalIPAccess tests are failing
on the VM runner on CI. Let's disable them temporarily to keep the
CI running while we figure out the problem.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
The libvirt e2e CI should run on pull_request_target so the jobs get
access to secrets. For example, the image job will publish to ghcr.io
with admin privileges. An 'authorize' job is added that is triggered
first when the PR is labeled, with the follow-up jobs depending on it.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
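A minimal sketch of that gating pattern, under the assumption that every provider job simply declares `needs: authorize` (job names are illustrative; the label check matches the snippet reviewed above):

```yaml
jobs:
  authorize:
    runs-on: ubuntu-latest
    # runs only once a committer has applied the label to the PR
    if: contains(github.event.pull_request.labels.*.name, 'test_e2e_libvirt')
    steps:
      - run: "true"

  podvm_builder:
    # nothing downstream runs unless 'authorize' ran (i.e. the PR was labeled)
    needs: authorize
    runs-on: ubuntu-latest
    steps:
      - run: "true"
```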
@wainersm wainersm force-pushed the workflows_for_libvirt_ci branch from 56dce17 to 98918bd on August 28, 2023 19:25
@wainersm
Member Author

Updated to address the comment from @mkulke regarding run: true. Also rebased onto main.

# This is the key used by cloud-api-adaptor to connect to libvirt
- name: Generate the SSH key
run: |
ssh-keygen -f ./id_rsa -N ""
Contributor


@wainersm Shouldn't this be a required step for Install kcli rather than handling ssh-keygen there as well?

Member Author


hi @ldoktor, I'm glad to see you here :)

This key is used by cloud-api-adaptor to connect to libvirt on the host. It could be the same key used by kcli to install the cluster, but I prefer to keep them as two separate keys.

@@ -14,6 +15,7 @@ func TestLibvirtCreateSimplePod(t *testing.T) {
}

func TestLibvirtCreatePodWithConfigMap(t *testing.T) {
t.Skip("Failing on CI")
Contributor


How about using an ENV value to skip it only when running in CI while allowing it otherwise?

Member Author


That's a good idea @ldoktor. Mind if I send it in another PR? I'd like to get this merged asap to start testing the project's code as well as the workflow's :)

@wainersm wainersm merged commit 739dcab into confidential-containers:main Aug 31, 2023
Successfully merging this pull request may close these issues.

Implement the e2e CI for libvirt