
update autoscaler max nodes test #241

Merged
merged 4 commits into openshift:master from update-camaxnodestotal on Oct 27, 2022

Conversation

elmiko
Contributor

@elmiko elmiko commented Sep 27, 2022

this change makes the max nodes test observe only the ready nodes in the cluster when determining whether the maximum has been reached. the autoscaler only counts ready nodes when performing its calculations, so when that is combined with tests that might not clean up the cluster properly, or that leave nodes in an unschedulable state, the test needs to take only the ready nodes into account when calculating the maximum size.
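
as a rough illustration of the check being described, a ready-node filter over a client-go node list could look like the sketch below; the package and helper names are illustrative and not the actual test code.

package framework

import corev1 "k8s.io/api/core/v1"

// readyNodes returns only the nodes whose NodeReady condition is True,
// which is approximately the set the autoscaler counts toward its maximum.
func readyNodes(nodes []corev1.Node) []corev1.Node {
    var ready []corev1.Node
    for _, node := range nodes {
        for _, cond := range node.Status.Conditions {
            if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
                ready = append(ready, node)
                break
            }
        }
    }
    return ready
}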

this PR also changes the names of the workload jobs that are run in the autoscaler tests to ensure that they are unique from each other.

@elmiko
Contributor Author

elmiko commented Sep 27, 2022

openshift/kubernetes-autoscaler#241 will want to know when this merges

also, i am amused that these both have the same pull number

@elmiko
Contributor Author

elmiko commented Sep 27, 2022

/test e2e-aws-operator

@elmiko
Contributor Author

elmiko commented Sep 28, 2022

/retest-required

@elmiko elmiko force-pushed the update-camaxnodestotal branch from 5c4a785 to 3fc6fc2 on September 28, 2022 13:31
@elmiko
Contributor Author

elmiko commented Sep 28, 2022

updated to add a filter for unschedulable nodes as well
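
continuing the earlier ready-node sketch (same assumed corev1 import), a combined predicate that also skips unschedulable nodes might look roughly like this; illustrative only, not the patch itself.

// isCountable reports whether a node should count toward the maximum:
// it must be ready and must not be cordoned (spec.unschedulable set).
func isCountable(node corev1.Node) bool {
    if node.Spec.Unschedulable {
        return false
    }
    for _, cond := range node.Status.Conditions {
        if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
            return true
        }
    }
    return false
}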

@elmiko
Contributor Author

elmiko commented Sep 28, 2022

looking at the scale to/from zero tests, it appears that sometimes the autoscaler does not think it should remove the nodes even after the test workload has been deleted. my current hypothesis is that /something/ is being scheduled onto those nodes (could be a core operator or similar), which prevents the autoscaler from deleting them. if we look at a successful run of this test, we see this in the output:

STEP: Waiting for machineSet replicas to scale in. Current replicas are 3, expected 0.
I0725 15:27:08.616361   13940 autoscaler.go:224] cluster-autoscaler-status: Scale-down: removing empty node "ip-10-0-228-209.us-east-2.compute.internal"
I0725 15:27:08.629561   13940 autoscaler.go:224] ip-10-0-228-209.us-east-2.compute.internal: marked the node as toBeDeleted/unschedulable
I0725 15:27:08.631358   13940 autoscaler.go:224] cluster-autoscaler-status: Scale-down: removing empty node "ip-10-0-231-145.us-east-2.compute.internal"
I0725 15:27:08.650275   13940 autoscaler.go:224] dns-default-fhp8g: deleting pod for node scale down
I0725 15:27:08.658258   13940 autoscaler.go:224] ip-10-0-231-145.us-east-2.compute.internal: marked the node as toBeDeleted/unschedulable
I0725 15:27:08.669076   13940 autoscaler.go:224] cluster-autoscaler-status: Scale-down: removing empty node "ip-10-0-198-234.us-east-2.compute.internal"
I0725 15:27:08.678413   13940 autoscaler.go:224] dns-default-n2pbv: deleting pod for node scale down
I0725 15:27:08.686565   13940 autoscaler.go:224] ip-10-0-198-234.us-east-2.compute.internal: marked the node as toBeDeleted/unschedulable
I0725 15:27:08.701542   13940 autoscaler.go:224] dns-default-hcwdz: deleting pod for node scale down
I0725 15:27:09.821073   13940 autoscaler.go:224] ip-10-0-228-209.us-east-2.compute.internal: node removed by cluster autoscaler
I0725 15:27:09.833105   13940 autoscaler.go:224] cluster-autoscaler-status: Scale-down: empty node ip-10-0-228-209.us-east-2.compute.internal removed
STEP: Waiting for machineSet replicas to scale in. Current replicas are 2, expected 0.
I0725 15:27:10.628454   13940 autoscaler.go:224] ip-10-0-231-145.us-east-2.compute.internal: node removed by cluster autoscaler
I0725 15:27:10.628496   13940 autoscaler.go:224] cluster-autoscaler-status: Scale-down: empty node ip-10-0-231-145.us-east-2.compute.internal removed
I0725 15:27:11.227274   13940 autoscaler.go:224] ip-10-0-198-234.us-east-2.compute.internal: node removed by cluster autoscaler
I0725 15:27:11.233313   13940 autoscaler.go:224] cluster-autoscaler-status: Scale-down: empty node ip-10-0-198-234.us-east-2.compute.internal removed
STEP: Waiting for machineSet replicas to scale in. Current replicas are 0, expected 0.

this output shows that the autoscaler emitted events to remove all of the nodes. but if we look at a failure, we see this:

 STEP: Waiting for machineSet replicas to scale in. Current replicas are 3, expected 0.
I0928 14:50:10.567005   13834 autoscaler.go:224] cluster-autoscaler-status: Scale-down: removing empty node "ip-10-0-225-90.ec2.internal"
I0928 14:50:10.584090   13834 autoscaler.go:224] ip-10-0-225-90.ec2.internal: marked the node as toBeDeleted/unschedulable
I0928 14:50:10.588194   13834 autoscaler.go:224] dns-default-9wmsn: deleting pod for node scale down
STEP: Waiting for machineSet replicas to scale in. Current replicas are 3, expected 0.
I0928 14:50:11.769735   13834 autoscaler.go:224] ip-10-0-225-90.ec2.internal: node removed by cluster autoscaler
I0928 14:50:11.776553   13834 autoscaler.go:224] cluster-autoscaler-status: Scale-down: empty node ip-10-0-225-90.ec2.internal removed
STEP: Waiting for machineSet replicas to scale in. Current replicas are 2, expected 0.
STEP: Waiting for machineSet replicas to scale in. Current replicas are 2, expected 0.
STEP: Waiting for machineSet replicas to scale in. Current replicas are 2, expected 0.
STEP: Waiting for machineSet replicas to scale in. Current replicas are 2, expected 0.
I0928 14:50:23.583593   13834 autoscaler.go:224] oauth-openshift-fd8c57bd5-kmmn8: pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector
I0928 14:50:23.589126   13834 autoscaler.go:224] apiserver-5775898b65-djq2v: pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector
I0928 14:50:23.593609   13834 autoscaler.go:224] route-controller-manager-5bfc67fb97-qrrdq: pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector
I0928 14:50:23.598202   13834 autoscaler.go:224] apiserver-5b96744cdb-g6xbs: pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector
STEP: Waiting for machineSet replicas to scale in. Current replicas are 2, expected 0. 

we can see that the autoscaler only considers removing a single node, and it also thinks that these other pods should be scheduled somewhere. i'm not sure whether those other pods are part of the issue, as they might schedule onto the new machineset and prevent a scale down. we have had similar bugs related to OLM pods, but i thought we had solved those issues.

it also seems like this problem is intermittent as the test will pass sometimes.
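
for reference, the scale-down lines in the output above are cluster-autoscaler events; a minimal client-go sketch for dumping them during a failed run could look like the following (the namespace handling and the message filter are assumptions, not the test's actual implementation).

package framework

import (
    "context"
    "fmt"
    "strings"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// logScaleDownEvents prints events whose message mentions scale down,
// which is the kind of output shown in the logs above.
func logScaleDownEvents(ctx context.Context, client kubernetes.Interface, namespace string) error {
    events, err := client.CoreV1().Events(namespace).List(ctx, metav1.ListOptions{})
    if err != nil {
        return err
    }
    for _, ev := range events.Items {
        if strings.Contains(ev.Message, "Scale-down") || strings.Contains(ev.Message, "scale down") {
            fmt.Printf("%s: %s\n", ev.InvolvedObject.Name, ev.Message)
        }
    }
    return nil
}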

@elmiko
Contributor Author

elmiko commented Sep 28, 2022

/retest-required

@elmiko
Contributor Author

elmiko commented Sep 28, 2022

/retest

4 similar comments
@elmiko
Contributor Author

elmiko commented Sep 29, 2022

/retest

@elmiko
Contributor Author

elmiko commented Sep 29, 2022

/retest

@elmiko
Contributor Author

elmiko commented Sep 29, 2022

/retest

@elmiko
Contributor Author

elmiko commented Sep 29, 2022

/retest

@elmiko
Contributor Author

elmiko commented Sep 29, 2022

i'm starting to think we need to bump the k8s version here as well. seeing a bunch of failures around PDBs having the wrong version.

@elmiko
Contributor Author

elmiko commented Sep 29, 2022

updated the deps for k8s 1.25 and MAO/CAO, let's see how this works

@elmiko
Contributor Author

elmiko commented Sep 30, 2022

/retest

1 similar comment
@JoelSpeed
Contributor

/retest

@elmiko elmiko force-pushed the update-camaxnodestotal branch from 31a07e5 to 62b270e on September 30, 2022 16:57
@elmiko
Contributor Author

elmiko commented Sep 30, 2022

i would like to catch this in the vendor update, openshift/cluster-autoscaler-operator#252, so i am putting a hold here temporarily.

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 30, 2022
@elmiko elmiko force-pushed the update-camaxnodestotal branch from 62b270e to dc312c4 on September 30, 2022 23:46
@elmiko
Contributor Author

elmiko commented Sep 30, 2022

updated with rebase of cao

@elmiko
Contributor Author

elmiko commented Oct 1, 2022

/retest

1 similar comment
@elmiko
Contributor Author

elmiko commented Oct 1, 2022

/retest

@lobziik
Contributor

lobziik commented Oct 3, 2022

/test e2e-aws-operator

@elmiko
Contributor Author

elmiko commented Oct 3, 2022

/retest

2 similar comments
@elmiko
Contributor Author

elmiko commented Oct 4, 2022

/retest

@elmiko
Contributor Author

elmiko commented Oct 4, 2022

/retest

@JoelSpeed
Contributor

/retest-required

4 similar comments
@elmiko
Contributor Author

elmiko commented Oct 13, 2022

/retest-required

@JoelSpeed
Contributor

/retest-required

@elmiko
Contributor Author

elmiko commented Oct 14, 2022

/retest-required

@elmiko
Contributor Author

elmiko commented Oct 14, 2022

/retest-required

@elmiko
Contributor Author

elmiko commented Oct 14, 2022

the last couple of failures on e2e-aws-operator appear to have been because the autoscaler scale-to-zero test was waiting for a node to finish scaling in, but i never see a corresponding event from the autoscaler signaling that it will remove the node. this makes me think that either something is running on that node and preventing it from scaling down, or the workload hasn't finished yet (which would be odd).

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 477b16e and 1 for PR HEAD dc312c4 in total

@elmiko
Contributor Author

elmiko commented Oct 18, 2022

i'm hacking on a small change to enforce a taint on the machinesets we create, which the test workload can tolerate. i have it working locally but need to add a few more tests.
i guess i also need a rebase or something here.
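
a hedged sketch of that idea with the machine API types, where the taint key, the effect, and the field layout (machine.openshift.io/v1beta1) are assumptions rather than the exact code in this PR:

package framework

import (
    machinev1 "github.com/openshift/api/machine/v1beta1"
    corev1 "k8s.io/api/core/v1"
)

// exampleTestTaint is an illustrative taint key for test-owned machinesets.
const exampleTestTaint = "cluster-api-actuator-pkg-test"

// taintMachineSet adds the test taint to the node template so that other
// workloads are kept off the machines the test creates.
func taintMachineSet(ms *machinev1.MachineSet) {
    ms.Spec.Template.Spec.Taints = append(ms.Spec.Template.Spec.Taints, corev1.Taint{
        Key:    exampleTestTaint,
        Effect: corev1.TaintEffectNoSchedule,
    })
}

// tolerateTestTaint lets the test workload's pods run on the tainted nodes.
func tolerateTestTaint(pod *corev1.PodSpec) {
    pod.Tolerations = append(pod.Tolerations, corev1.Toleration{
        Key:      exampleTestTaint,
        Operator: corev1.TolerationOpExists,
        Effect:   corev1.TaintEffectNoSchedule,
    })
}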

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2022
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 24, 2022
@elmiko
Contributor Author

elmiko commented Oct 24, 2022

/hold
i am still fixing this up

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2022
@elmiko elmiko force-pushed the update-camaxnodestotal branch from 2bb6213 to 194d804 on October 26, 2022 14:49
@elmiko
Contributor Author

elmiko commented Oct 26, 2022

not sure what happened here, i tried to rebase my changes and now it's all messed up

@elmiko
Contributor Author

elmiko commented Oct 26, 2022

i see what i did, i'll fix it shortly

this change makes it so that the max nodes test will observe only the nodes that are ready and schedulable in the cluster when determining whether the maximum has been reached. the autoscaler only counts ready and schedulable nodes when performing its calculations, so when that is combined with tests that might not clean up the cluster properly, or that leave nodes in an unschedulable state, the test needs to take only the ready nodes into account when calculating the maximum size.

this change makes the job names reflect the test they are running and also makes them unique to each autoscaler test. although these workloads should not conflict with each other, they are all created in the same namespace with the same name; this change makes the names unique.

this change adds taints to the machinesets that are used by the autoscaler tests to ensure that no other workloads land on the new machines that are created.
@elmiko elmiko force-pushed the update-camaxnodestotal branch from 194d804 to 802d88a on October 26, 2022 15:04
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 26, 2022
Contributor

@JoelSpeed JoelSpeed left a comment

The taints and tolerations prevent other workloads from running on our hosts, but they don't guarantee that our workloads run on our hosts.
We should add a specific label to the Machine template (so it ends up on the node) and then add a nodeSelector to the workload's pods so that they only schedule on nodes with that label.
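
a hedged sketch of that suggestion, assuming labels set under the Machine template's metadata propagate to the resulting Node (as the comment implies); the label key is illustrative and this is not the repo's actual wiring:

package framework

import (
    machinev1 "github.com/openshift/api/machine/v1beta1"
    corev1 "k8s.io/api/core/v1"
)

// exampleNodeLabel is an illustrative label used to pin the test workload
// to the machines created by the test.
const exampleNodeLabel = "cluster-api-actuator-pkg-test-node"

// labelMachineSet sets the label on the Machine template so it ends up on
// the Node that the machine becomes.
func labelMachineSet(ms *machinev1.MachineSet, value string) {
    if ms.Spec.Template.Spec.ObjectMeta.Labels == nil {
        ms.Spec.Template.Spec.ObjectMeta.Labels = map[string]string{}
    }
    ms.Spec.Template.Spec.ObjectMeta.Labels[exampleNodeLabel] = value
}

// selectTestNodes makes the workload's pods schedule only onto nodes that
// carry the label above.
func selectTestNodes(pod *corev1.PodSpec, value string) {
    if pod.NodeSelector == nil {
        pod.NodeSelector = map[string]string{}
    }
    pod.NodeSelector[exampleNodeLabel] = value
}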

@@ -41,6 +41,10 @@ func NewWorkLoad(njobs int32, memoryRequest resource.Quantity, workloadJobName s
Key: "kubemark",
Operator: corev1.TolerationOpExists,
},
{
Key: ClusterAPIActuatorPkgTaint,
Effect: corev1.TaintEffectPreferNoSchedule,
Contributor

I think we want NoSchedule rather than PreferNoSchedule; prefer means we might still get random workloads on here, right? You said you had some issues with this IIRC? Maybe you can expand on those here.

Contributor Author

i tried NoSchedule and it did not work as i expected, which is why i backed off to PreferNoSchedule. i have a feeling something else is not properly tolerating NoSchedule, which causes the nodes to get deprovisioned. we should figure it out, but the current patch is an incremental improvement imo.
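
for context on the trade-off being discussed, a tiny sketch continuing the earlier taint example (same assumed package, imports, and illustrative key): NoSchedule is a hard constraint that keeps every non-tolerating pod off the node, while PreferNoSchedule is only a scheduling preference, so stray pods can still land there.

// testTaint returns the illustrative taint with either the hard or the
// soft effect; NoSchedule blocks non-tolerating pods outright, while
// PreferNoSchedule merely discourages them.
func testTaint(hard bool) corev1.Taint {
    effect := corev1.TaintEffectPreferNoSchedule
    if hard {
        effect = corev1.TaintEffectNoSchedule
    }
    return corev1.Taint{Key: exampleTestTaint, Effect: effect}
}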

@openshift-ci
Contributor

openshift-ci bot commented Oct 26, 2022

@elmiko: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@elmiko
Contributor Author

elmiko commented Oct 26, 2022

We should add a specific label to the Machine template (so it ends up on the node) and then add a nodeSelector to the workloads pods so that they only schedule on nodes with that label

we are doing that already, unless i am misunderstanding

https://github.com/openshift/cluster-api-actuator-pkg/blob/master/pkg/framework/jobs.go#L52

@elmiko
Contributor Author

elmiko commented Oct 26, 2022

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 26, 2022
@JoelSpeed
Contributor

We should add a specific label to the Machine template (so it ends up on the node) and then add a nodeSelector to the workloads pods so that they only schedule on nodes with that label

we are doing that already, unless i am misunderstanding

https://github.com/openshift/cluster-api-actuator-pkg/blob/master/pkg/framework/jobs.go#L52

Positional arguments are hard!

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 27, 2022
@openshift-merge-robot openshift-merge-robot merged commit 36287e9 into openshift:master Oct 27, 2022
@elmiko elmiko deleted the update-camaxnodestotal branch October 27, 2022 14:22
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm Indicates that a PR is ready to be merged.

6 participants