
generalize helm install during E2E testing #2264

Merged: 1 commit merged into kubernetes-sigs:main on May 24, 2022

Conversation

@jackfrancis (Contributor) commented Apr 25, 2022

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This PR follows up from #2209

Here we generalize some of the "install helm chart" foo so that it can be composed in different ways to accommodate different test scenarios.
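As a purely illustrative sketch of the idea (the names below, e.g. HelmInstallInput and InstallHelmChart, are hypothetical and not necessarily what this PR introduces), the chart-specific bits become input so each test scenario can compose its own install:

package e2e

import (
	"context"
	"fmt"
	"os/exec"
)

// HelmInstallInput is a hypothetical, generalized "install a helm chart" input
// that individual E2E scenarios can fill in with their own chart details.
type HelmInstallInput struct {
	KubeconfigPath string            // kubeconfig of the workload cluster
	RepoURL        string            // chart repository URL
	ChartName      string            // chart to install
	ReleaseName    string            // helm release name
	Namespace      string            // target namespace
	Values         map[string]string // --set style overrides
}

// InstallHelmChart shells out to the helm CLI with the supplied parameters,
// so each test only declares the chart it needs rather than duplicating the
// install plumbing.
func InstallHelmChart(ctx context.Context, input HelmInstallInput) error {
	args := []string{
		"install", input.ReleaseName, input.ChartName,
		"--repo", input.RepoURL,
		"--namespace", input.Namespace, "--create-namespace",
		"--kubeconfig", input.KubeconfigPath,
	}
	for k, v := range input.Values {
		args = append(args, "--set", fmt.Sprintf("%s=%s", k, v))
	}
	out, err := exec.CommandContext(ctx, "helm", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("helm install failed: %s: %w", string(out), err)
	}
	return nil
}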

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 25, 2022
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 25, 2022
@jackfrancis (Contributor Author)

/test pull-cluster-api-provider-azure-e2e-optional
/test pull-cluster-api-provider-azure-e2e-exp

@CecileRobertMichon (Contributor) left a comment

@jsturtevant @marosset I wonder if we could use this for metrics-server and other addons we're currently installing via ClusterResourceSet for windows e2e tests?

@jsturtevant (Contributor)

Looks like metrics-server has a chart. Some of the Windows-specific things like kube-proxy/calico don't have charts.

@jackfrancis (Contributor Author)

How much customization are we doing for metrics-server, kube-proxy, and calico? Depending upon the amount of "capz optimizations" we're performing in those yaml specs, it may be appropriate to host our own "capz-tested" charts for such components in this repo. And there's little work involved in maintaining a canonical chart repo the way we do here:

https://github.com/kubernetes-sigs/cloud-provider-azure/tree/master/helm

@jsturtevant (Contributor)

We can likely use the metrics-server chart. For Windows kube-proxy/calico, the yaml is from https://github.com/kubernetes-sigs/sig-windows-tools/tree/master/hostprocess. That would be the place where we could add those charts. The long-term goal is for sig-windows to not host the calico and kube-proxy images, so I am hesitant to put much effort into creating charts. If capz is moving away from CRS completely in the shorter term, then I guess we might need to, as it is still a ways out before we can get those images to proper owner repositories due to changes needed in the HostProcess containers implementation.

}, input.WaitForControlPlaneIntervals...).Should(Succeed())
}

func WaitForRunningNodes(ctx context.Context, input clusterctl.ApplyClusterTemplateAndWaitInput) {
Contributor

same here and for running pods below. As much as possible we should share e2e logic with the other providers in the CAPI framework instead of implementing our own logic.

Contributor Author

I was able to rely upon pre-existing cluster-api conveniences for part of this, but the "ensure n pods are Running" check is something I had to implement here.
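For context, a rough sketch of what that kind of check can look like (illustrative only; the helper in this PR may be shaped differently):

package e2e

import corev1 "k8s.io/api/core/v1"

// podListHasNumPods (sketch) returns a condition that is satisfied once a pod
// list contains exactly n pods and every one of them is in the Running phase.
func podListHasNumPods(n int) func(pods *corev1.PodList) bool {
	return func(pods *corev1.PodList) bool {
		if len(pods.Items) != n {
			return false
		}
		for _, pod := range pods.Items {
			if pod.Status.Phase != corev1.PodRunning {
				return false
			}
		}
		return true
	}
}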

@jackfrancis force-pushed the helm-general branch 2 times, most recently from 97224aa to b5b5528 on April 27, 2022 at 22:52
@jackfrancis (Contributor Author)

/test pull-cluster-api-provider-azure-e2e-optional
/test pull-cluster-api-provider-azure-e2e-exp

@jackfrancis (Contributor Author)

@CecileRobertMichon this should be ready for a review round

LabelSelector: cloudNodeManagerPodLabel,
Namespace: "kube-system",
},
Condition: podListHasNumPods(int(to.Int64(input.ConfigCluster.ControlPlaneMachineCount) + to.Int64(input.ConfigCluster.WorkerMachineCount))),
Contributor

does this count also need to include Windows worker nodes?

Contributor Author

I think what we're saying is that we need to multiply the input.ConfigCluster.WorkerMachineCount by the number of pools in the cluster.

The current OOT template only has a single node pool, which is why this is working.

I don't think we have a way to determine the number of pools from the input to this func, but because we enter into this after all the control plane nodes and machines are online with node references, we should be able to count the number of nodes in the cluster, right?

Contributor

Windows is its own variable, so we should be able to at least add that to the count (even if it's zero for now, that will make it more future-proof)

LabelSelector: cloudNodeManagerPodLabel,
Namespace: "kube-system",
},
Condition: podListHasNumPods(len(machineList.Items)),
Contributor Author

@CecileRobertMichon I think counting the machines in the cluster is probably the best approach for this
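For reference, counting the cluster's Machines with the controller-runtime client might look roughly like this (illustrative; not necessarily the exact code in this PR):

package e2e

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// countClusterMachines (sketch) lists the CAPI Machines that belong to the
// named cluster and returns how many there are, which can then be used as the
// expected number of cloud-node-manager pods.
func countClusterMachines(ctx context.Context, c client.Client, namespace, clusterName string) (int, error) {
	machineList := &clusterv1.MachineList{}
	if err := c.List(ctx, machineList,
		client.InNamespace(namespace),
		client.MatchingLabels{"cluster.x-k8s.io/cluster-name": clusterName},
	); err != nil {
		return 0, err
	}
	return len(machineList.Items), nil
}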

Contributor

what about machine pools? Should we count nodes instead?

Contributor Author

We should probably count both Machines and MachinePoolMachines, right? (a cluster can exist with both types, I assume)

Contributor

IMO doing it by number of nodes is more future-proof. There's no guarantee CAPI won't add a new type of Machine in the future.

Contributor Author

The problem with using nodes is that they are not the authoritative unit for calculating "expected number" outcomes. For example, a node may be offline, not yet joined, or in some other state that makes it less definitive for gathering a cluster count than the number of Machines (plus the number of MachinePool machines).

So, if we use nodes then we'd probably want to change the pod list criteria to >= to account for the fact that a new node may have joined after we calculated num nodes, but before we gathered a list of cloud-node-manager pods.

It would be nice to establish a resilient (hopefully simple to understand and maintain) pattern here, as I do think being able to both validate the helm install and validate expected outcomes of that helm install is valuable.

Contributor

I see. In that case, what are your thoughts on adding helper functions to the CAPI test framework that return the number of expected control plane nodes and the number of expected worker nodes? Control plane nodes can be the total number of replicas of the control plane object (or 0 if the control plane does not implement replicas, see https://cluster-api.sigs.k8s.io/developer/architecture/controllers/control-plane.html?highlight=spec#required-spec-fields-for-implementations-using-replicas). Worker nodes can be the total of Machine replicas + MachinePool replicas.

It's a bit more resilient to abstract away what a worker machine can be and make that assumption in one central place than to have to update every place in our CAPZ tests that makes that assumption if a new type of Machine is added later.

Also worth checking whether something like this already exists in CAPI; it's very possible it does.
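A sketch of the kind of helper being suggested (names and placement here are hypothetical):

package e2e

// expectedControlPlaneNodes (sketch) returns the control plane's replica
// count, or 0 if the control plane object does not implement .spec.replicas.
func expectedControlPlaneNodes(controlPlaneReplicas *int32) int {
	if controlPlaneReplicas == nil {
		return 0
	}
	return int(*controlPlaneReplicas)
}

// expectedWorkerNodes (sketch) sums MachineDeployment and MachinePool
// replicas so callers don't have to know which kinds of worker machine exist.
func expectedWorkerNodes(machineDeploymentReplicas, machinePoolReplicas []int32) int {
	total := 0
	for _, r := range machineDeploymentReplicas {
		total += int(r)
	}
	for _, r := range machinePoolReplicas {
		total += int(r)
	}
	return total
}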

Contributor Author

I checked, and CAPI E2E doesn't have a convenience for this (there are somewhat similar things, with lots of buried Expect() funcs that are in the process of being cleaned up, but nothing for this purpose explicitly). How about this as a way of getting this merged ASAP:

  1. We verify the existence of the cloud-node-manager daemonset
  2. We wait until all of its replicas are Ready

@jackfrancis (Contributor Author)

/test pull-cluster-api-provider-azure-e2e-optional

@jackfrancis (Contributor Author)

/test pull-cluster-api-provider-azure-e2e-optional

@jackfrancis (Contributor Author)

/test pull-cluster-api-provider-azure-e2e-optional

@jackfrancis (Contributor Author)

/test pull-cluster-api-provider-azure-e2e-optional

@netlify bot commented May 20, 2022

👷 Deploy request for kubernetes-sigs-cluster-api-provider-azure pending review.

Visit the deploys page to approve it

🔨 Latest commit 23666ad

}

// WaitForDaemonset retries during E2E until a daemonset's pods are all Running
func WaitForDaemonset(ctx context.Context, input clusterctl.ApplyClusterTemplateAndWaitInput, cl client.Client, name, namespace string) {
Contributor Author

@CecileRobertMichon this is the alternative solution to validate the cloud-node-manager pods. Rather than comparing against number of machines, let's just make sure the daemonset pods are all running. There are several reasons out of our control why the number of nodes running these pods may not be easily predictable. Rather than try to deal with all of those edge cases, we can simply rely upon the daemonset API to pick the appropriate schedulable nodes, and ensure that the pods reach a Running state.
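For readers following along, the core of that approach can be sketched roughly like this (illustrative; the actual WaitForDaemonset in this PR may differ in details such as inputs and intervals):

package e2e

import (
	"context"
	"time"

	. "github.com/onsi/gomega"
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForDaemonSetReady (sketch) polls the named DaemonSet until every desired
// pod is scheduled and ready, letting the DaemonSet controller decide which
// nodes are schedulable instead of counting machines ourselves.
func waitForDaemonSetReady(ctx context.Context, cl client.Client, name, namespace string) {
	Eventually(func(g Gomega) {
		ds := &appsv1.DaemonSet{}
		g.Expect(cl.Get(ctx, types.NamespacedName{Name: name, Namespace: namespace}, ds)).To(Succeed())
		g.Expect(ds.Status.DesiredNumberScheduled).To(BeNumerically(">", 0))
		g.Expect(ds.Status.NumberReady).To(Equal(ds.Status.DesiredNumberScheduled))
	}, 20*time.Minute, 30*time.Second).Should(Succeed())
}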

@jackfrancis (Contributor Author)

/test pull-cluster-api-provider-azure-e2e-optional

@CecileRobertMichon (Contributor) left a comment

Thanks @jackfrancis, looks great

/lgtm
/assign @Jont828

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 20, 2022
@CecileRobertMichon (Contributor)

@jackfrancis we should now be able to switch over Calico as well. I believe we just need a bit of kustomization in the values.yaml to make it work for Azure (mainly VXLAN encapsulation)

https://kubernetes.slack.com/archives/CEX9HENG7/p1652473459598119

@jackfrancis (Contributor Author)

/retest

1 similar comment
@jackfrancis (Contributor Author)

/retest

@Jont828 (Contributor) commented May 24, 2022

This looks great! Was useful for getting us off the ground with HelmChartProxy as well!

/lgtm

@CecileRobertMichon (Contributor)

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 24, 2022
@jackfrancis (Contributor Author)

/retest

@k8s-ci-robot k8s-ci-robot merged commit 7d5e541 into kubernetes-sigs:main May 24, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.4 milestone May 24, 2022
@jackfrancis jackfrancis deleted the helm-general branch December 9, 2022 22:47