generalize helm install during E2E testing #2264
Conversation
/test pull-cluster-api-provider-azure-e2e-optional
@jsturtevant @marosset I wonder if we could use this for metrics-server and other addons we're currently installing via ClusterResourceSet for Windows e2e tests?
Looks like metrics-server has a chart. Some of the Windows-specific things like kube-proxy/calico don't have charts.
How much customization are we doing for metrics-server, kube-proxy, and calico? Depending upon the amount of "capz optimizations" we're performing in these yaml specs, it may be appropriate to host our own "capz-tested" charts for such components in this repo. And it's little work to maintain a canonical repo like the one we have here: https://github.com/kubernetes-sigs/cloud-provider-azure/tree/master/helm
We can likely use the metrics-server chart. For Windows kube-proxy/calico the yaml is from https://github.com/kubernetes-sigs/sig-windows-tools/tree/master/hostprocess, which would be the place where we could add those charts. The long-term goal is for sig-windows not to host the calico and kube-proxy images, so I am hesitant to put much effort into creating charts. If capz is moving away from CRS completely in the shorter term, then I guess we might need to, as it is still a ways out before we can get those images to their proper owner repositories due to changes needed in the HostProcess containers implementation.
test/e2e/helpers.go (Outdated)
	}, input.WaitForControlPlaneIntervals...).Should(Succeed())
}

func WaitForRunningNodes(ctx context.Context, input clusterctl.ApplyClusterTemplateAndWaitInput) {
same here and for running pods below. As much as possible we should share e2e logic with the other providers in the CAPI framework instead of implementing our own logic.
I was able to rely upon pre-existing cluster-api conveniences for part of this, but there is the "ensure n pods are Running" check that I had to implement here.
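For reference, a minimal sketch of what such an "ensure n pods are Running" condition could look like, assuming corev1 types from k8s.io/api; the podListHasNumPods name mirrors the helper referenced later in this thread, but the body here is illustrative rather than the exact implementation:

```go
package e2e

import (
	corev1 "k8s.io/api/core/v1"
)

// podListHasNumPods returns a condition that is satisfied only when the pod
// list contains exactly n pods and every one of them is in the Running phase.
func podListHasNumPods(n int) func(pods *corev1.PodList) bool {
	return func(pods *corev1.PodList) bool {
		if len(pods.Items) != n {
			return false
		}
		for _, pod := range pods.Items {
			if pod.Status.Phase != corev1.PodRunning {
				return false
			}
		}
		return true
	}
}
```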
97224aa to b5b5528 (Compare)
/test pull-cluster-api-provider-azure-e2e-optional
@CecileRobertMichon this should be ready for a review round
test/e2e/cloud-provider-azure.go (Outdated)
		LabelSelector: cloudNodeManagerPodLabel,
		Namespace:     "kube-system",
	},
	Condition: podListHasNumPods(int(to.Int64(input.ConfigCluster.ControlPlaneMachineCount) + to.Int64(input.ConfigCluster.WorkerMachineCount))),
does this count also need to add Windows worker nodes?
I think what we're saying is that we need to multiply the input.ConfigCluster.WorkerMachineCount by the number of pools in the cluster.
The current OOT template only has a single node pool, which is why this is working.
I don't think we have a way to determine the number of pools from the input to this func, but because we enter into this after all the control plane nodes and machines are online with node references, we should be able to count the number of nodes in the cluster, right?
Windows is its own variable, so we should be able to at least add that to the count (even if it's zero for now, that will make it more future-proof).
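To illustrate the suggestion, a hedged sketch of folding a Windows worker count into the expected-pod calculation; expectedCloudNodeManagerPods and the windowsWorkerMachineCount parameter are hypothetical names, and to.Int64 is assumed to be the go-autorest helper that dereferences a *int64 (returning 0 for nil):

```go
import (
	"github.com/Azure/go-autorest/autorest/to"
	"sigs.k8s.io/cluster-api/test/framework/clusterctl"
)

// expectedCloudNodeManagerPods sums Linux control plane, Linux worker, and
// Windows worker machine counts to predict how many cloud-node-manager pods
// should eventually be Running.
func expectedCloudNodeManagerPods(input clusterctl.ApplyClusterTemplateAndWaitInput, windowsWorkerMachineCount int64) int {
	return int(to.Int64(input.ConfigCluster.ControlPlaneMachineCount) +
		to.Int64(input.ConfigCluster.WorkerMachineCount) +
		windowsWorkerMachineCount) // hypothetical: zero today, but future-proofs the count
}
```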
test/e2e/cloud-provider-azure.go (Outdated)
		LabelSelector: cloudNodeManagerPodLabel,
		Namespace:     "kube-system",
	},
	Condition: podListHasNumPods(len(machineList.Items)),
@CecileRobertMichon I think counting the machines in the cluster is probably the best approach for this
what about machine pools? Should we count nodes instead?
We should probably count both Machines and MachinePoolMachines, right? (a cluster can exist with both types, I assume)
IMO doing it by number of nodes is more future-proof. There's no guarantee CAPI won't add a new type of Machine in the future.
The problem with using nodes is that they are not the authoritative unit in terms of calculating "expected number" outcomes. For example, a node may be offline, or not yet joined, or other states that are not as definitive in terms of gathering a cluster count as compared to number of machines (plus number of machinepoolmachines).
So, if we use nodes then we'd probably want to change the pod list criteria to >= to account for the fact that a new node may have joined after we calculated num nodes, but before we gathered a list of cloud-node-manager pods.
It would be nice to establish a resilient (hopefully simple to understand and maintain) pattern here, as I do think being able to both validate the helm install and validate expected outcomes of that helm install is valuable.
I see. In that case thoughts about adding helper functions to the CAPI test framework that return the number of expected control plane nodes and the number of expected worker nodes? Control planes can be the total number of control plane replicas of the control plane object (or 0 if the control plane does not implement replicas https://cluster-api.sigs.k8s.io/developer/architecture/controllers/control-plane.html?highlight=spec#required-spec-fields-for-implementations-using-replicas). Worker nodes can be the total of machines + machine pool replicas.
It's a bit more resilient to abstract what a worker machine can be and have one central place to make that assumption than having to update every place in our CAPZ tests where we make that assumption later if a new type of Machine is added.
Also worth checking if something like this doesn't already exist in CAPI, it's very possible it does.
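If such helpers were added, a rough sketch of the worker-side calculation (assuming the CAPI v1beta1 and exp/v1beta1 API types; this is not an existing framework function) might look like:

```go
import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	expv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
)

// expectedWorkerNodes sums desired replicas across MachineDeployments and
// MachinePools, so "what counts as a worker machine" lives in one place.
func expectedWorkerNodes(machineDeployments []clusterv1.MachineDeployment, machinePools []expv1.MachinePool) int32 {
	var total int32
	for _, md := range machineDeployments {
		if md.Spec.Replicas != nil {
			total += *md.Spec.Replicas
		}
	}
	for _, mp := range machinePools {
		if mp.Spec.Replicas != nil {
			total += *mp.Spec.Replicas
		}
	}
	return total
}
```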
I checked and CAPI E2E doesn't have a convenience for this (there are things sort of similar, with lots of buried Expect() funcs that are in the process of being cleaned up, but nothing for this purpose explicitly). How about this as a way of getting this merged ASAP:
- We verify the existence of the cloud-node-manager daemonset
- We wait until all of its replicas are Ready
/test pull-cluster-api-provider-azure-e2e-optional
/test pull-cluster-api-provider-azure-e2e-optional
/test pull-cluster-api-provider-azure-e2e-optional
/test pull-cluster-api-provider-azure-e2e-optional
}

// WaitForDaemonset retries during E2E until a daemonset's pods are all Running
func WaitForDaemonset(ctx context.Context, input clusterctl.ApplyClusterTemplateAndWaitInput, cl client.Client, name, namespace string) {
@CecileRobertMichon this is the alternative solution to validate the cloud-node-manager pods. Rather than comparing against the number of machines, let's just make sure the daemonset pods are all running. There are several reasons out of our control why the number of nodes running these pods may not be easily predictable. Rather than try to deal with all of those edge cases, we can simply rely upon the daemonset API to pick the appropriate schedulable nodes, and ensure that the pods reach a Running state.
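As a rough sketch of that approach (assuming a controller-runtime client and Gomega's Eventually; the signature, intervals, and helper name are simplified relative to the actual WaitForDaemonset in this PR):

```go
import (
	"context"
	"time"

	. "github.com/onsi/gomega"
	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForDaemonSetReady polls until the daemonset reports every desired pod as
// Ready, letting the daemonset controller decide which nodes are schedulable.
func waitForDaemonSetReady(ctx context.Context, cl client.Client, name, namespace string) {
	ds := &appsv1.DaemonSet{}
	Eventually(func() bool {
		if err := cl.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, ds); err != nil {
			return false
		}
		return ds.Status.DesiredNumberScheduled > 0 &&
			ds.Status.DesiredNumberScheduled == ds.Status.NumberReady
	}, 20*time.Minute, 10*time.Second).Should(BeTrue(), "daemonset %s/%s never became fully Ready", namespace, name)
}
```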
/test pull-cluster-api-provider-azure-e2e-optional
Thanks @jackfrancis, looks great
/lgtm
/assign @Jont828
@jackfrancis we should now be able to switch over Calico as well. I believe we just need a bit of kustomization in the values.yaml to make it work for Azure (mainly VXLAN encapsulation) https://kubernetes.slack.com/archives/CEX9HENG7/p1652473459598119
/retest
/retest
This looks great! Was useful for getting us off the ground with HelmChartProxy as well!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: CecileRobertMichon
The full list of commands accepted by this bot can be found here. The pull request process is described here.
/retest
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This PR follows up from #2209
Here we generalize some of the "install helm chart" foo so that it can be composed in different ways to accommodate different test scenarios.
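As an illustration only, a minimal sketch of what a generalized "install a helm chart onto the workload cluster" helper could look like, assuming the helm v3 Go SDK (helm.sh/helm/v3); the InstallHelmChart name and its parameters are hypothetical, not this PR's actual API:

```go
import (
	"log"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/chart/loader"
	"helm.sh/helm/v3/pkg/cli"
)

// InstallHelmChart locates a chart in a repo, loads it, and installs it into
// the target namespace of the cluster referenced by kubeconfigPath.
func InstallHelmChart(kubeconfigPath, namespace, repoURL, chartName, releaseName string, values map[string]interface{}) error {
	settings := cli.New()
	settings.KubeConfig = kubeconfigPath

	actionConfig := new(action.Configuration)
	if err := actionConfig.Init(settings.RESTClientGetter(), namespace, "secret", log.Printf); err != nil {
		return err
	}

	install := action.NewInstall(actionConfig)
	install.ReleaseName = releaseName
	install.Namespace = namespace
	install.RepoURL = repoURL
	install.CreateNamespace = true

	chartPath, err := install.ChartPathOptions.LocateChart(chartName, settings)
	if err != nil {
		return err
	}
	chrt, err := loader.Load(chartPath)
	if err != nil {
		return err
	}
	_, err = install.Run(chrt, values)
	return err
}
```

Composing different charts (cloud-provider-azure, metrics-server, and so on) then becomes a matter of each test scenario passing its own repo, chart, and values arguments.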
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #
Special notes for your reviewer:
Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.
TODOs:
Release note: