E2E: verify daemonset pods after machines #2950

Merged

Conversation

jackfrancis
Contributor

What type of PR is this?

/kind failing-test

What this PR does / why we need it:

This PR moves the "validate expected daemonset pods" check, which currently runs right after "verify control plane is ready". At that point in the cluster creation flow the machines are not actually online yet, so the daemonset pods are never going to be scheduled.

Instead, we move daemonset pod validation after ApplyClusterTemplateAndWait, which waits for both the control plane and the worker machines. At that point in the E2E flow all of the expected machines have been verified as Ready, so we can check for the expected daemonset pods.
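For illustration, a minimal sketch (not the actual diff in this PR) of what "check the daemonset pods only once machines are Ready" can look like with client-go; the helper name, polling intervals, and the calico-node/kube-system example are assumptions:

```go
package e2esketch

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDaemonSetReady blocks until the named DaemonSet reports all of its
// desired pods ready. The clientset is assumed to target the workload
// cluster; "calico-node" in kube-system is only an example target.
func waitForDaemonSetReady(ctx context.Context, cs kubernetes.Interface, namespace, name string, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		ds, err := cs.AppsV1().DaemonSets(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, nil // not created yet (or transient error): keep polling
		}
		// Before any machines are Ready, DesiredNumberScheduled stays at 0,
		// so a check at that point proves nothing.
		if ds.Status.DesiredNumberScheduled == 0 {
			return false, nil
		}
		fmt.Printf("DaemonSet %s/%s: %d/%d pods ready\n",
			namespace, name, ds.Status.NumberReady, ds.Status.DesiredNumberScheduled)
		return ds.Status.NumberReady == ds.Status.DesiredNumberScheduled, nil
	})
}
```

Run from a hook that fires after machine provisioning (e.g. the PostMachinesProvisioned func discussed below), this avoids the original problem of checking before any nodes exist.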

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2933

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 15, 2022
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 15, 2022
@jackfrancis jackfrancis force-pushed the e2e-check-daemonsets-later branch from cc1c839 to 986d711 on December 16, 2022 00:01
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 16, 2022
@jackfrancis
Contributor Author

test/e2e/azure_test.go (review comment thread; outdated, resolved)
@jackfrancis jackfrancis force-pushed the e2e-check-daemonsets-later branch from 986d711 to 76e94a6 on December 16, 2022 01:06
@jackfrancis
Contributor Author

@CecileRobertMichon do you prefer this updated use of the PostMachinesProvisioned func?

@marosset this doesn't really address your valid observation. I'm not entirely certain why we wrap these test cases in a func and then invoke the input vars "just in time", rather than just passing them into the func as values. I assume there is some async value mutation that we need to track properly and that these func closures enable. Maybe @nojnhuh has more context?
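For anyone following along, the deferral that closures buy here is just standard Go behavior; a generic sketch (not CAPZ code, names invented) of the difference between snapshotting a value at call time and reading it "just in time":

```go
package main

import "fmt"

func main() {
	clusterName := "pending"

	// Passed as a value: snapshots "pending" now, before setup has run.
	byValue := clusterName

	// Wrapped in a closure: reads clusterName only when invoked, so it
	// observes whatever the (possibly asynchronous) setup assigned later.
	byClosure := func() string { return clusterName }

	clusterName = "capz-e2e-abc123" // e.g. assigned during test setup

	fmt.Println(byValue)     // prints "pending"
	fmt.Println(byClosure()) // prints "capz-e2e-abc123"
}
```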

@CecileRobertMichon
Contributor

@CecileRobertMichon do you prefer this updated use of the PostMachinesProvisioned func?

Yes, 100%. Although I'm not sure why we need to pass in a func as a param; I commented on that.

@marosset has a good point; there is a lot of duplication. Perhaps we could refactor our uses of the clusterctl ApplyClusterTemplateAndWait with a wrapper helper func that adds all the common defaults and only takes the things that change as parameters. That's probably out of scope here, but it would be a nice code cleanup.
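To make the wrapper idea concrete, a rough sketch; the types and field names below are invented for illustration and are not the clusterctl framework's actual API:

```go
package e2esketch

import "time"

// applyOptions is a hypothetical stand-in for the clusterctl framework's
// ApplyClusterTemplateAndWait input, holding only what varies per spec.
type applyOptions struct {
	Flavor                  string
	WorkerMachineCount      int64
	WaitForMachines         time.Duration
	PostMachinesProvisioned func()
}

// withClusterDefaults wraps the underlying apply call so each spec supplies
// only what actually changes, while the shared defaults (wait intervals,
// common hooks) live in one place instead of being copy-pasted per test.
func withClusterDefaults(apply func(applyOptions)) func(applyOptions) {
	return func(opts applyOptions) {
		if opts.WaitForMachines == 0 {
			opts.WaitForMachines = 30 * time.Minute
		}
		if opts.PostMachinesProvisioned == nil {
			// Default hook, e.g. the "wait for all daemonsets" check
			// discussed later in this thread.
			opts.PostMachinesProvisioned = func() {}
		}
		apply(opts)
	}
}
```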

@jackfrancis
Contributor Author

Now that we're actually testing for the presence of the calico-node daemonset, it's failing three of the E2E test scenarios 🤯

@CecileRobertMichon
Contributor

Now that we're actually testing for the presence of the calico-node daemonset, it's failing three of the E2E test scenarios 🤯

@jackfrancis looking at the test output, it's not even finding the daemonset, as opposed to failing while waiting for its pods (that part was passing before). I think that's because you are using the bootstrap cluster proxy to look for the daemonsets instead of the workload cluster, where calico is actually installed: https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/2950/files#diff-c95013f3c197cb1edeb15be6787e403afd41244efd8ee819d5cefd24deca32dcR79
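The underlying distinction: calico-node lives in the workload cluster, so the check needs a client built from the workload cluster's kubeconfig rather than the bootstrap/management cluster proxy. A generic client-go sketch, assuming a kubeconfig path for the workload cluster is available:

```go
package e2esketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// listWorkloadDaemonSets builds a client from the *workload* cluster's
// kubeconfig (not the bootstrap cluster) and lists its DaemonSets, which is
// where calico-node and friends actually run.
func listWorkloadDaemonSets(ctx context.Context, workloadKubeconfig string) error {
	cfg, err := clientcmd.BuildConfigFromFlags("", workloadKubeconfig)
	if err != nil {
		return fmt.Errorf("loading workload kubeconfig: %w", err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	dsList, err := cs.AppsV1().DaemonSets(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, ds := range dsList.Items {
		fmt.Printf("%s/%s: %d/%d ready\n",
			ds.Namespace, ds.Name, ds.Status.NumberReady, ds.Status.DesiredNumberScheduled)
	}
	return nil
}
```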

@jackfrancis jackfrancis force-pushed the e2e-check-daemonsets-later branch from 76e94a6 to 3848b87 on January 5, 2023 18:34
@jackfrancis
Contributor Author

@jsturtevant @marosset see @CecileRobertMichon's comment above; I'm attempting to address that.

tl;dr 🤦, not 🤯.

@jackfrancis jackfrancis force-pushed the e2e-check-daemonsets-later branch 2 times, most recently from 64398c7 to 6e1f197 on January 5, 2023 21:15
@kubernetes-sigs kubernetes-sigs deleted a comment from k8s-ci-robot Jan 5, 2023
@jackfrancis
Contributor Author

@jackfrancis jackfrancis force-pushed the e2e-check-daemonsets-later branch from 6e1f197 to fc39439 on January 5, 2023 23:20
@CecileRobertMichon
Contributor

It definitely seems like our VMSS cluster template is failing to launch Windows csi-node-driver pods. See:

That seems like the same error I was trying to fix in #2947

@jackfrancis jackfrancis force-pushed the e2e-check-daemonsets-later branch from fc39439 to 474e237 on January 6, 2023 17:06
@jackfrancis
Contributor Author

@CecileRobertMichon indeed: #2992

@jackfrancis jackfrancis force-pushed the e2e-check-daemonsets-later branch from 474e237 to dd5de35 on January 9, 2023 18:33
@jackfrancis
Contributor Author

/retest

(cluster delete timeout flake)

@jackfrancis
Contributor Author

@CecileRobertMichon @marosset this is passing tests and ready for another review round

Contributor

@CecileRobertMichon CecileRobertMichon left a comment
This lgtm and is equivalent to the previous functionality (but better, because we're actually waiting for machines now). One thought: have you considered listing all daemonsets and waiting for all of them to be available, instead of hardcoding a select few we care about?

If we want to make sure the daemonsets for calico exist, we could potentially leave the existing code that waits for them to be available in the install-Calico func, and then do a general "wait for all daemonsets" check post node provisioning.
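A rough sketch of that generalized check, assuming a clientset that targets the workload cluster; it polls until every DaemonSet in the cluster reports all desired pods ready:

```go
package e2esketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForAllDaemonSetsReady polls until every DaemonSet in any namespace has
// NumberReady == DesiredNumberScheduled. It is only meaningful after the
// machines are provisioned; before that, DesiredNumberScheduled is 0
// everywhere and the check is vacuous.
func waitForAllDaemonSetsReady(ctx context.Context, cs kubernetes.Interface, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, 15*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		dsList, err := cs.AppsV1().DaemonSets(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
		if err != nil {
			return false, nil // transient API error: keep polling
		}
		for _, ds := range dsList.Items {
			if ds.Status.NumberReady != ds.Status.DesiredNumberScheduled {
				return false, nil
			}
		}
		// Require at least one DaemonSet so an empty list doesn't pass trivially.
		return len(dsList.Items) > 0, nil
	})
}
```

The generic check doesn't care which daemonsets exist, so the Calico-specific wait can stay in the install func as suggested.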

@jackfrancis
Contributor Author

@CecileRobertMichon this generalized approach appears to work fine; you can see the outcomes if you search for "Waiting for all DaemonSet Pods to be Running" in the E2E output.

@jackfrancis
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

Contributor

@mboersma mboersma left a comment
This looks good. Squash?

@jackfrancis
Contributor Author

@mboersma yeah just wanna get one more eyeball on this approach so I can revert if necessary

Contributor

@CecileRobertMichon CecileRobertMichon left a comment
lgtm

@jackfrancis jackfrancis force-pushed the e2e-check-daemonsets-later branch from 1174fe1 to 0dcd63c on January 12, 2023 20:10
@kubernetes-sigs kubernetes-sigs deleted a comment from k8s-ci-robot Jan 12, 2023
@jackfrancis
Contributor Author

/retest

@CecileRobertMichon
Contributor

/lgtm
/approve
/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 18, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 98c019c3a8441432643ad24f8f5031f4f3fd7669

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 18, 2023
@jackfrancis
Contributor Author

/retest

1 similar comment
@jackfrancis
Contributor Author

/retest

@kubernetes-sigs kubernetes-sigs deleted a comment from k8s-ci-robot Jan 18, 2023
@jackfrancis jackfrancis added this to the v1.8 milestone Jan 19, 2023
@jackfrancis
Contributor Author

/retest

@kubernetes-sigs kubernetes-sigs deleted a comment from k8s-ci-robot Jan 19, 2023
@jackfrancis
Contributor Author

/retest

@CecileRobertMichon
Contributor

what's up with all the flakes 👀 anything related to this change?

@jackfrancis
Contributor Author

I think that's it!

@jackfrancis
Contributor Author

@CecileRobertMichon sorry, not related to this change; I've seen every one of those flakes on other PRs :(

@k8s-ci-robot k8s-ci-robot merged commit 5e40d05 into kubernetes-sigs:main Jan 20, 2023
Development

Successfully merging this pull request may close these issues.

E2E tests aren't properly verifying the DaemonSets are running
6 participants