
update autoscaler max nodes test #241

Merged
merged 4 commits into openshift:master from update-camaxnodestotal on Oct 27, 2022

Conversation

elmiko
Contributor

@elmiko elmiko commented Sep 27, 2022

this change makes the max nodes test observe only the ready nodes in the cluster when determining whether the maximum has been reached. the autoscaler only counts ready nodes when performing its calculations, so when that is combined with tests that might not clean up the cluster properly, or that leave nodes in an unschedulable state, the test needs to take only the ready nodes into account when calculating the maximum size.
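
as a rough illustration of the check being described, a ready-node filter over a client-go node list could look like the sketch below; the package and helper names are illustrative and not the actual test code.

package framework

import corev1 "k8s.io/api/core/v1"

// readyNodes returns only the nodes whose NodeReady condition is True,
// which is approximately the set the autoscaler counts toward its maximum.
func readyNodes(nodes []corev1.Node) []corev1.Node {
    var ready []corev1.Node
    for _, node := range nodes {
        for _, cond := range node.Status.Conditions {
            if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
                ready = append(ready, node)
                break
            }
        }
    }
    return ready
}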

this PR also changes the names of the workload jobs that are run in the autoscaler tests to ensure that they are unique from each other.

@elmiko
Contributor Author

elmiko commented Sep 27, 2022

openshift/kubernetes-autoscaler#241 will want to know when this merges

also, i am amused that these both have the same pull number

@elmiko
Contributor Author

elmiko commented Sep 27, 2022

/test e2e-aws-operator

@elmiko
Contributor Author

elmiko commented Sep 28, 2022

/retest-required

@elmiko elmiko force-pushed the update-camaxnodestotal branch from 5c4a785 to 3fc6fc2 on September 28, 2022 13:31
@elmiko
Contributor Author

elmiko commented Sep 28, 2022

updated to add a filter for unschedulable nodes as well
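
continuing the earlier ready-node sketch (same assumed corev1 import), a combined predicate that also skips unschedulable nodes might look roughly like this; illustrative only, not the patch itself.

// isCountable reports whether a node should count toward the maximum:
// it must be ready and must not be cordoned (spec.unschedulable set).
func isCountable(node corev1.Node) bool {
    if node.Spec.Unschedulable {
        return false
    }
    for _, cond := range node.Status.Conditions {
        if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
            return true
        }
    }
    return false
}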

@elmiko
Contributor Author

elmiko commented Sep 28, 2022

looking at the scale to/from zero tests, it appears that sometimes the autoscaler does not think it should remove the nodes even after the test workload has been deleted. my current hypothesis is that /something/ is being scheduled onto those nodes (could be a core operator or similar), which prevents the autoscaler from deleting them. if we look at a successful run of this test, we see this in the output:

STEP: Waiting for machineSet replicas to scale in. Current replicas are 3, expected 0.
I0725 15:27:08.616361   13940 autoscaler.go:224] cluster-autoscaler-status: Scale-down: removing empty node "ip-10-0-228-209.us-east-2.compute.internal"
I0725 15:27:08.629561   13940 autoscaler.go:224] ip-10-0-228-209.us-east-2.compute.internal: marked the node as toBeDeleted/unschedulable
I0725 15:27:08.631358   13940 autoscaler.go:224] cluster-autoscaler-status: Scale-down: removing empty node "ip-10-0-231-145.us-east-2.compute.internal"
I0725 15:27:08.650275   13940 autoscaler.go:224] dns-default-fhp8g: deleting pod for node scale down
I0725 15:27:08.658258   13940 autoscaler.go:224] ip-10-0-231-145.us-east-2.compute.internal: marked the node as toBeDeleted/unschedulable
I0725 15:27:08.669076   13940 autoscaler.go:224] cluster-autoscaler-status: Scale-down: removing empty node "ip-10-0-198-234.us-east-2.compute.internal"
I0725 15:27:08.678413   13940 autoscaler.go:224] dns-default-n2pbv: deleting pod for node scale down
I0725 15:27:08.686565   13940 autoscaler.go:224] ip-10-0-198-234.us-east-2.compute.internal: marked the node as toBeDeleted/unschedulable
I0725 15:27:08.701542   13940 autoscaler.go:224] dns-default-hcwdz: deleting pod for node scale down
I0725 15:27:09.821073   13940 autoscaler.go:224] ip-10-0-228-209.us-east-2.compute.internal: node removed by cluster autoscaler
I0725 15:27:09.833105   13940 autoscaler.go:224] cluster-autoscaler-status: Scale-down: empty node ip-10-0-228-209.us-east-2.compute.internal removed
STEP: Waiting for machineSet replicas to scale in. Current replicas are 2, expected 0.
I0725 15:27:10.628454   13940 autoscaler.go:224] ip-10-0-231-145.us-east-2.compute.internal: node removed by cluster autoscaler
I0725 15:27:10.628496   13940 autoscaler.go:224] cluster-autoscaler-status: Scale-down: empty node ip-10-0-231-145.us-east-2.compute.internal removed
I0725 15:27:11.227274   13940 autoscaler.go:224] ip-10-0-198-234.us-east-2.compute.internal: node removed by cluster autoscaler
I0725 15:27:11.233313   13940 autoscaler.go:224] cluster-autoscaler-status: Scale-down: empty node ip-10-0-198-234.us-east-2.compute.internal removed
STEP: Waiting for machineSet replicas to scale in. Current replicas are 0, expected 0.

this output shows that the autoscaler emitted events to remove all of the nodes. but if we look at a failure, we see this:

 STEP: Waiting for machineSet replicas to scale in. Current replicas are 3, expected 0.
I0928 14:50:10.567005   13834 autoscaler.go:224] cluster-autoscaler-status: Scale-down: removing empty node "ip-10-0-225-90.ec2.internal"
I0928 14:50:10.584090   13834 autoscaler.go:224] ip-10-0-225-90.ec2.internal: marked the node as toBeDeleted/unschedulable
I0928 14:50:10.588194   13834 autoscaler.go:224] dns-default-9wmsn: deleting pod for node scale down
STEP: Waiting for machineSet replicas to scale in. Current replicas are 3, expected 0.
I0928 14:50:11.769735   13834 autoscaler.go:224] ip-10-0-225-90.ec2.internal: node removed by cluster autoscaler
I0928 14:50:11.776553   13834 autoscaler.go:224] cluster-autoscaler-status: Scale-down: empty node ip-10-0-225-90.ec2.internal removed
STEP: Waiting for machineSet replicas to scale in. Current replicas are 2, expected 0.
STEP: Waiting for machineSet replicas to scale in. Current replicas are 2, expected 0.
STEP: Waiting for machineSet replicas to scale in. Current replicas are 2, expected 0.
STEP: Waiting for machineSet replicas to scale in. Current replicas are 2, expected 0.
I0928 14:50:23.583593   13834 autoscaler.go:224] oauth-openshift-fd8c57bd5-kmmn8: pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector
I0928 14:50:23.589126   13834 autoscaler.go:224] apiserver-5775898b65-djq2v: pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector
I0928 14:50:23.593609   13834 autoscaler.go:224] route-controller-manager-5bfc67fb97-qrrdq: pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector
I0928 14:50:23.598202   13834 autoscaler.go:224] apiserver-5b96744cdb-g6xbs: pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector
STEP: Waiting for machineSet replicas to scale in. Current replicas are 2, expected 0. 

we can see that the autoscaler only considers removing a single node, and it also thinks that these other pods should be scheduled somewhere. i'm not sure whether those other pods are part of the issue, as they might schedule onto the new machineset and prevent a scale down. we have had similar bugs related to OLM pods, but i thought we had solved those issues.

it also seems like this problem is intermittent as the test will pass sometimes.
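
for reference, the scale-down lines in the output above are cluster-autoscaler events; a minimal client-go sketch for dumping them during a failed run could look like the following (the namespace handling and the message filter are assumptions, not the test's actual implementation).

package framework

import (
    "context"
    "fmt"
    "strings"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// logScaleDownEvents prints events whose message mentions scale down,
// which is the kind of output shown in the logs above.
func logScaleDownEvents(ctx context.Context, client kubernetes.Interface, namespace string) error {
    events, err := client.CoreV1().Events(namespace).List(ctx, metav1.ListOptions{})
    if err != nil {
        return err
    }
    for _, ev := range events.Items {
        if strings.Contains(ev.Message, "Scale-down") || strings.Contains(ev.Message, "scale down") {
            fmt.Printf("%s: %s\n", ev.InvolvedObject.Name, ev.Message)
        }
    }
    return nil
}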

@elmiko
Contributor Author

elmiko commented Sep 28, 2022

/retest-required

@elmiko
Contributor Author

elmiko commented Sep 28, 2022

/retest

4 similar comments
@elmiko
Contributor Author

elmiko commented Sep 29, 2022

/retest

@elmiko
Contributor Author

elmiko commented Sep 29, 2022

/retest

@elmiko
Contributor Author

elmiko commented Sep 29, 2022

/retest

@elmiko
Contributor Author

elmiko commented Sep 29, 2022

/retest

@elmiko
Contributor Author

elmiko commented Sep 29, 2022

i'm starting to think we need to bump the k8s version here as well. seeing a bunch of failures around PDBs having the wrong version.

@elmiko
Contributor Author

elmiko commented Sep 29, 2022

updated the deps for k8s 1.25 and MAO/CAO, let's see how this works

@elmiko
Contributor Author

elmiko commented Sep 30, 2022

/retest

1 similar comment
@JoelSpeed
Contributor

/retest

@elmiko elmiko force-pushed the update-camaxnodestotal branch from 31a07e5 to 62b270e on September 30, 2022 16:57
@elmiko
Contributor Author

elmiko commented Sep 30, 2022

i would like to catch this in the vendor update, openshift/cluster-autoscaler-operator#252, so i am putting a hold here temporarily.

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 30, 2022
@elmiko elmiko force-pushed the update-camaxnodestotal branch from 62b270e to dc312c4 on September 30, 2022 23:46
@elmiko
Contributor Author

elmiko commented Sep 30, 2022

updated with rebase of cao

@elmiko
Contributor Author

elmiko commented Oct 1, 2022

/retest

1 similar comment
@elmiko
Contributor Author

elmiko commented Oct 1, 2022

/retest

@lobziik
Contributor

lobziik commented Oct 3, 2022

/test e2e-aws-operator

@elmiko
Contributor Author

elmiko commented Oct 3, 2022

/retest

2 similar comments
@elmiko
Contributor Author

elmiko commented Oct 4, 2022

/retest

@elmiko
Contributor Author

elmiko commented Oct 4, 2022

/retest

@JoelSpeed
Contributor

/retest-required

4 similar comments
@elmiko
Contributor Author

elmiko commented Oct 13, 2022

/retest-required

@JoelSpeed
Contributor

/retest-required

@elmiko
Contributor Author

elmiko commented Oct 14, 2022

/retest-required

@elmiko
Contributor Author

elmiko commented Oct 14, 2022

/retest-required

@elmiko
Contributor Author

elmiko commented Oct 14, 2022

the last couple of failures on e2e-aws-operator appear to have been because the autoscaler scale-to-zero test was waiting for a node to finish scaling in, but i never see a corresponding event from the autoscaler signaling that it will remove the node. this makes me think that either something is running on that node and preventing it from scaling down, or the workload hasn't finished yet (which would be odd).

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 477b16e and 1 for PR HEAD dc312c4 in total

@elmiko
Contributor Author

elmiko commented Oct 18, 2022

i'm hacking on a small change to enforce a taint on the machinesets we create, which the test workload can tolerate. i have it working locally but need to add a few more tests.
i guess i also need a rebase or something here.
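
a hedged sketch of that idea with the machine API types, where the taint key, the effect, and the field layout (machine.openshift.io/v1beta1) are assumptions rather than the exact code in this PR:

package framework

import (
    machinev1 "github.com/openshift/api/machine/v1beta1"
    corev1 "k8s.io/api/core/v1"
)

// exampleTestTaint is an illustrative taint key for test-owned machinesets.
const exampleTestTaint = "cluster-api-actuator-pkg-test"

// taintMachineSet adds the test taint to the node template so that other
// workloads are kept off the machines the test creates.
func taintMachineSet(ms *machinev1.MachineSet) {
    ms.Spec.Template.Spec.Taints = append(ms.Spec.Template.Spec.Taints, corev1.Taint{
        Key:    exampleTestTaint,
        Effect: corev1.TaintEffectNoSchedule,
    })
}

// tolerateTestTaint lets the test workload's pods run on the tainted nodes.
func tolerateTestTaint(pod *corev1.PodSpec) {
    pod.Tolerations = append(pod.Tolerations, corev1.Toleration{
        Key:      exampleTestTaint,
        Operator: corev1.TolerationOpExists,
        Effect:   corev1.TaintEffectNoSchedule,
    })
}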

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2022
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 24, 2022
@elmiko
Contributor Author

elmiko commented Oct 24, 2022

/hold
i am still fixing this up

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2022
@elmiko elmiko force-pushed the update-camaxnodestotal branch from 2bb6213 to 194d804 on October 26, 2022 14:49
@elmiko
Contributor Author

elmiko commented Oct 26, 2022

not sure what happened here, i tried to rebase my changes and now it's all messed up

@elmiko
Contributor Author

elmiko commented Oct 26, 2022

i see what i did, i'll fix it shortly

this change makes it so that the max nodes test will observe only the nodes that are ready and schedulable in the cluster when determining whether the maximum has been reached. the autoscaler only counts ready and schedulable nodes when performing its calculations, so when that is combined with tests that might not clean up the cluster properly, or that leave nodes in an unschedulable state, the test needs to take only the ready nodes into account when calculating the maximum size.

this change makes the job names reflect the test they are running and also makes them unique to each autoscaler test. although these workloads should not conflict with each other, they are all created in the same namespace with the same name; this change makes the names unique.

this change adds taints to the machinesets that are used by the autoscaler tests to ensure that no other workloads land on the new machines that are created.
@elmiko elmiko force-pushed the update-camaxnodestotal branch from 194d804 to 802d88a on October 26, 2022 15:04
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 26, 2022
Contributor

@JoelSpeed JoelSpeed left a comment

The taints and tolerations prevent other workloads from running on our hosts, but they don't guarantee that our workloads run on our hosts.
We should add a specific label to the Machine template (so it ends up on the node) and then add a nodeSelector to the workload's pods so that they only schedule on nodes with that label.
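
a hedged sketch of that suggestion, assuming labels set under the Machine template's metadata propagate to the resulting Node (as the comment implies); the label key is illustrative and this is not the repo's actual wiring:

package framework

import (
    machinev1 "github.com/openshift/api/machine/v1beta1"
    corev1 "k8s.io/api/core/v1"
)

// exampleNodeLabel is an illustrative label used to pin the test workload
// to the machines created by the test.
const exampleNodeLabel = "cluster-api-actuator-pkg-test-node"

// labelMachineSet sets the label on the Machine template so it ends up on
// the Node that the machine becomes.
func labelMachineSet(ms *machinev1.MachineSet, value string) {
    if ms.Spec.Template.Spec.ObjectMeta.Labels == nil {
        ms.Spec.Template.Spec.ObjectMeta.Labels = map[string]string{}
    }
    ms.Spec.Template.Spec.ObjectMeta.Labels[exampleNodeLabel] = value
}

// selectTestNodes makes the workload's pods schedule only onto nodes that
// carry the label above.
func selectTestNodes(pod *corev1.PodSpec, value string) {
    if pod.NodeSelector == nil {
        pod.NodeSelector = map[string]string{}
    }
    pod.NodeSelector[exampleNodeLabel] = value
}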

@@ -41,6 +41,10 @@ func NewWorkLoad(njobs int32, memoryRequest resource.Quantity, workloadJobName s
Key: "kubemark",
Operator: corev1.TolerationOpExists,
},
{
Key: ClusterAPIActuatorPkgTaint,
Effect: corev1.TaintEffectPreferNoSchedule,
Contributor

I think we want NoSchedule rather than PreferNoSchedule; prefer means we might still get random workloads on here, right? You said you had some issues with this IIRC? Maybe you can expand on those here.

Contributor Author

i tried NoSchedule and it did not work as i expected, which is why i backed off to PreferNoSchedule. i have a feeling something else is not properly tolerating NoSchedule, which causes the nodes to get deprovisioned. we should figure it out, but the current patch is an incremental improvement imo.
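
for context on the trade-off being discussed, a tiny sketch continuing the earlier taint example (same assumed package, imports, and illustrative key): NoSchedule is a hard constraint that keeps every non-tolerating pod off the node, while PreferNoSchedule is only a scheduling preference, so stray pods can still land there.

// testTaint returns the illustrative taint with either the hard or the
// soft effect; NoSchedule blocks non-tolerating pods outright, while
// PreferNoSchedule merely discourages them.
func testTaint(hard bool) corev1.Taint {
    effect := corev1.TaintEffectPreferNoSchedule
    if hard {
        effect = corev1.TaintEffectNoSchedule
    }
    return corev1.Taint{Key: exampleTestTaint, Effect: effect}
}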

@openshift-ci
Contributor

openshift-ci bot commented Oct 26, 2022

@elmiko: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@elmiko
Contributor Author

elmiko commented Oct 26, 2022

We should add a specific label to the Machine template (so it ends up on the node) and then add a nodeSelector to the workloads pods so that they only schedule on nodes with that label

we are doing that already, unless i am misunderstanding

https://github.com/openshift/cluster-api-actuator-pkg/blob/master/pkg/framework/jobs.go#L52

@elmiko
Contributor Author

elmiko commented Oct 26, 2022

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 26, 2022
@JoelSpeed
Contributor

We should add a specific label to the Machine template (so it ends up on the node) and then add a nodeSelector to the workloads pods so that they only schedule on nodes with that label

we are doing that already, unless i am misunderstanding

https://github.com/openshift/cluster-api-actuator-pkg/blob/master/pkg/framework/jobs.go#L52

Positional arguments are hard!

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 27, 2022
@openshift-merge-robot openshift-merge-robot merged commit 36287e9 into openshift:master Oct 27, 2022
@elmiko elmiko deleted the update-camaxnodestotal branch October 27, 2022 14:22
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm Indicates that a PR is ready to be merged.

6 participants