E2E: verify daemonset pods after machines #2950

Merged

Conversation

jackfrancis
Contributor

What type of PR is this?

/kind failing-test

What this PR does / why we need it:

This PR moves the "validate expected daemonset pods" check, which currently runs right after "verify control plane is ready". At that point in the cluster creation flow the machines are not actually online yet, so the daemonset pods are never going to be scheduled.

Instead, we move daemonset pod validation after ApplyClusterTemplateAndWait, which waits for both the control plane and the worker machines. At that point in the E2E flow all of the expected machines have been verified as Ready, so we can check for the expected daemonset pods.
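For illustration, a minimal sketch (not the actual diff in this PR) of what "check the daemonset pods only once machines are Ready" can look like with client-go; the helper name, polling intervals, and the calico-node/kube-system example are assumptions:

```go
package e2esketch

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDaemonSetReady blocks until the named DaemonSet reports all of its
// desired pods ready. The clientset is assumed to target the workload
// cluster; "calico-node" in kube-system is only an example target.
func waitForDaemonSetReady(ctx context.Context, cs kubernetes.Interface, namespace, name string, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		ds, err := cs.AppsV1().DaemonSets(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, nil // not created yet (or transient error): keep polling
		}
		// Before any machines are Ready, DesiredNumberScheduled stays at 0,
		// so a check at that point proves nothing.
		if ds.Status.DesiredNumberScheduled == 0 {
			return false, nil
		}
		fmt.Printf("DaemonSet %s/%s: %d/%d pods ready\n",
			namespace, name, ds.Status.NumberReady, ds.Status.DesiredNumberScheduled)
		return ds.Status.NumberReady == ds.Status.DesiredNumberScheduled, nil
	})
}
```

Run from a hook that fires after machine provisioning (e.g. the PostMachinesProvisioned func discussed below), this avoids the original problem of checking before any nodes exist.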

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2933

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 15, 2022
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 15, 2022
@jackfrancis jackfrancis force-pushed the e2e-check-daemonsets-later branch from cc1c839 to 986d711 on December 16, 2022 00:01
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 16, 2022
@jackfrancis
Contributor Author

test/e2e/azure_test.go (review comment thread; outdated, resolved)
@jackfrancis jackfrancis force-pushed the e2e-check-daemonsets-later branch from 986d711 to 76e94a6 on December 16, 2022 01:06
@jackfrancis
Contributor Author

@CecileRobertMichon do you prefer this updated use of the PostMachinesProvisioned func?

@marosset this doesn't really address your valid observation. I'm not entirely certain why we wrap these test cases in a func and then invoke the input vars "just in time", rather than just passing them into the func as values. I assume there is some async value mutation that we need to track properly and that these func closures enable. Maybe @nojnhuh has more context?
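For anyone following along, the deferral that closures buy here is just standard Go behavior; a generic sketch (not CAPZ code, names invented) of the difference between snapshotting a value at call time and reading it "just in time":

```go
package main

import "fmt"

func main() {
	clusterName := "pending"

	// Passed as a value: snapshots "pending" now, before setup has run.
	byValue := clusterName

	// Wrapped in a closure: reads clusterName only when invoked, so it
	// observes whatever the (possibly asynchronous) setup assigned later.
	byClosure := func() string { return clusterName }

	clusterName = "capz-e2e-abc123" // e.g. assigned during test setup

	fmt.Println(byValue)     // prints "pending"
	fmt.Println(byClosure()) // prints "capz-e2e-abc123"
}
```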

@CecileRobertMichon
Contributor

@CecileRobertMichon do you prefer this updated use of the PostMachinesProvisioned func?

Yes, 100%. Although I'm not sure why we need to pass in a func as a param; I commented on that.

@marosset has a good point; there is a lot of duplication. Perhaps we could refactor our uses of the clusterctl ApplyClusterTemplateAndWait with a wrapper helper func that adds all the common defaults and only takes the things that change as parameters. That's probably out of scope here, but it would be a nice code cleanup.
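To make the wrapper idea concrete, a rough sketch; the types and field names below are invented for illustration and are not the clusterctl framework's actual API:

```go
package e2esketch

import "time"

// applyOptions is a hypothetical stand-in for the clusterctl framework's
// ApplyClusterTemplateAndWait input, holding only what varies per spec.
type applyOptions struct {
	Flavor                  string
	WorkerMachineCount      int64
	WaitForMachines         time.Duration
	PostMachinesProvisioned func()
}

// withClusterDefaults wraps the underlying apply call so each spec supplies
// only what actually changes, while the shared defaults (wait intervals,
// common hooks) live in one place instead of being copy-pasted per test.
func withClusterDefaults(apply func(applyOptions)) func(applyOptions) {
	return func(opts applyOptions) {
		if opts.WaitForMachines == 0 {
			opts.WaitForMachines = 30 * time.Minute
		}
		if opts.PostMachinesProvisioned == nil {
			// Default hook, e.g. the "wait for all daemonsets" check
			// discussed later in this thread.
			opts.PostMachinesProvisioned = func() {}
		}
		apply(opts)
	}
}
```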

@jackfrancis
Contributor Author

Now that we're actually testing for the presence of the calico-node daemonset, it's failing three of the E2E test scenarios 🤯

@CecileRobertMichon
Contributor

Now that we're actually testing for the presence of the calico-node daemonset, it's failing three of the E2E test scenarios 🤯

@jackfrancis looking at the test output, it's not even finding the daemonset, as opposed to failing while waiting for its pods (that part was passing before). I think that's because you are using the bootstrap cluster proxy to look for the daemonsets instead of the workload cluster, where calico is actually installed: https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/2950/files#diff-c95013f3c197cb1edeb15be6787e403afd41244efd8ee819d5cefd24deca32dcR79
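The underlying distinction: calico-node lives in the workload cluster, so the check needs a client built from the workload cluster's kubeconfig rather than the bootstrap/management cluster proxy. A generic client-go sketch, assuming a kubeconfig path for the workload cluster is available:

```go
package e2esketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// listWorkloadDaemonSets builds a client from the *workload* cluster's
// kubeconfig (not the bootstrap cluster) and lists its DaemonSets, which is
// where calico-node and friends actually run.
func listWorkloadDaemonSets(ctx context.Context, workloadKubeconfig string) error {
	cfg, err := clientcmd.BuildConfigFromFlags("", workloadKubeconfig)
	if err != nil {
		return fmt.Errorf("loading workload kubeconfig: %w", err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	dsList, err := cs.AppsV1().DaemonSets(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, ds := range dsList.Items {
		fmt.Printf("%s/%s: %d/%d ready\n",
			ds.Namespace, ds.Name, ds.Status.NumberReady, ds.Status.DesiredNumberScheduled)
	}
	return nil
}
```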

@jackfrancis jackfrancis force-pushed the e2e-check-daemonsets-later branch from 76e94a6 to 3848b87 on January 5, 2023 18:34
@jackfrancis
Contributor Author

@jsturtevant @marosset see @CecileRobertMichon's comment above; I'm attempting to address that.

tl;dr 🤦, not 🤯.

@jackfrancis jackfrancis force-pushed the e2e-check-daemonsets-later branch 2 times, most recently from 64398c7 to 6e1f197 on January 5, 2023 21:15
@kubernetes-sigs kubernetes-sigs deleted a comment from k8s-ci-robot Jan 5, 2023
@jackfrancis
Contributor Author

@jackfrancis jackfrancis force-pushed the e2e-check-daemonsets-later branch from 6e1f197 to fc39439 on January 5, 2023 23:20
@CecileRobertMichon
Contributor

It definitely seems like our VMSS cluster template is failing to launch Windows csi-node-driver pods. See:

That seems like the same error I was trying to fix in #2947

@jackfrancis jackfrancis force-pushed the e2e-check-daemonsets-later branch from fc39439 to 474e237 on January 6, 2023 17:06
@jackfrancis
Contributor Author

@CecileRobertMichon indeed: #2992

@jackfrancis jackfrancis force-pushed the e2e-check-daemonsets-later branch from 474e237 to dd5de35 on January 9, 2023 18:33
@jackfrancis
Contributor Author

/retest

(cluster delete timeout flake)

@jackfrancis
Contributor Author

@CecileRobertMichon @marosset this is passing tests and ready for another review round

Contributor

@CecileRobertMichon CecileRobertMichon left a comment
This lgtm and is equivalent to the previous functionality (but better, because we're actually waiting for machines now). One thought: have you considered listing all daemonsets and waiting for all of them to be available, instead of hardcoding a select few we care about?

If we want to make sure the daemonsets for calico exist, we could potentially leave the existing code that waits for them to be available in the install-Calico func, and then do a general "wait for all daemonsets" check post node provisioning.
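A rough sketch of that generalized check, assuming a clientset that targets the workload cluster; it polls until every DaemonSet in the cluster reports all desired pods ready:

```go
package e2esketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForAllDaemonSetsReady polls until every DaemonSet in any namespace has
// NumberReady == DesiredNumberScheduled. It is only meaningful after the
// machines are provisioned; before that, DesiredNumberScheduled is 0
// everywhere and the check is vacuous.
func waitForAllDaemonSetsReady(ctx context.Context, cs kubernetes.Interface, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, 15*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		dsList, err := cs.AppsV1().DaemonSets(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
		if err != nil {
			return false, nil // transient API error: keep polling
		}
		for _, ds := range dsList.Items {
			if ds.Status.NumberReady != ds.Status.DesiredNumberScheduled {
				return false, nil
			}
		}
		// Require at least one DaemonSet so an empty list doesn't pass trivially.
		return len(dsList.Items) > 0, nil
	})
}
```

The generic check doesn't care which daemonsets exist, so the Calico-specific wait can stay in the install func as suggested.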

@jackfrancis
Contributor Author

@CecileRobertMichon this generalized approach appears to work fine; you can see the outcomes if you search for "Waiting for all DaemonSet Pods to be Running" in the E2E output.

@jackfrancis
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

Contributor

@mboersma mboersma left a comment
This looks good. Squash?

@jackfrancis
Contributor Author

@mboersma yeah just wanna get one more eyeball on this approach so I can revert if necessary

Contributor

@CecileRobertMichon CecileRobertMichon left a comment
lgtm

@jackfrancis jackfrancis force-pushed the e2e-check-daemonsets-later branch from 1174fe1 to 0dcd63c on January 12, 2023 20:10
@kubernetes-sigs kubernetes-sigs deleted a comment from k8s-ci-robot Jan 12, 2023
@jackfrancis
Contributor Author

/retest

@CecileRobertMichon
Contributor

/lgtm
/approve
/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 18, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 98c019c3a8441432643ad24f8f5031f4f3fd7669

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 18, 2023
@jackfrancis
Contributor Author

/retest

1 similar comment
@jackfrancis
Contributor Author

/retest

@kubernetes-sigs kubernetes-sigs deleted a comment from k8s-ci-robot Jan 18, 2023
@jackfrancis jackfrancis added this to the v1.8 milestone Jan 19, 2023
@jackfrancis
Contributor Author

/retest

@kubernetes-sigs kubernetes-sigs deleted a comment from k8s-ci-robot Jan 19, 2023
@jackfrancis
Contributor Author

/retest

@CecileRobertMichon
Contributor

what's up with all the flakes 👀 anything related to this change?

@jackfrancis
Contributor Author

I think that's it!

@jackfrancis
Contributor Author

@CecileRobertMichon sorry, not related to this change; I've seen every one of those flakes on other PRs :(

@k8s-ci-robot k8s-ci-robot merged commit 5e40d05 into kubernetes-sigs:main Jan 20, 2023
Development

Successfully merging this pull request may close these issues.

E2E tests aren't properly verifying the DaemonSets are running
6 participants