🌱 Reenable 2 MHC unit tests #10906
Conversation
Skipping CI for Draft Pull Request.
i think the nodes just need some more time to come up? the flakes seem to occur when there's an additional unintentional remediation.

"when a Machine's Node has gone away" failure:
unexpected remediation:
expected remediation:

"when a Machine has no Node ref for longer than the NodeStartupTimeout" failure:
unexpected remediation:
expected remediation:
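To make the failure mode concrete, here is a minimal sketch (not the actual test code) of how a check for "exactly one Machine marked for remediation" could be expressed with Gomega against an envtest client. The helper name, the namespace argument, and the use of MachineOwnerRemediatedCondition as the signal for "marked for remediation" are assumptions for illustration, not the real assertions in machinehealthcheck_controller_test.go.

```go
package machinehealthcheck_test

import (
	"context"
	"testing"
	"time"

	. "github.com/onsi/gomega"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// assertSingleRemediation is a hypothetical helper: it asserts that exactly one
// Machine in the given namespace carries the OwnerRemediated condition for the
// whole observation window. An "unexpected remediation" like the one described
// above would show up here as a count greater than one.
func assertSingleRemediation(t *testing.T, c client.Client, namespace string) {
	g := NewWithT(t)
	ctx := context.Background()

	g.Consistently(func(g Gomega) {
		machines := &clusterv1.MachineList{}
		g.Expect(c.List(ctx, machines, client.InNamespace(namespace))).To(Succeed())

		remediated := 0
		for i := range machines.Items {
			if conditions.Has(&machines.Items[i], clusterv1.MachineOwnerRemediatedCondition) {
				remediated++
			}
		}
		g.Expect(remediated).To(Equal(1), "expected exactly one Machine marked for remediation")
	}, 5*time.Second, 100*time.Millisecond).Should(Succeed())
}
```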
Force-pushed from ba840ac to d86d8be
running each test individually with

➜ KUBEBUILDER_ASSETS="/Users/stephen.cahill/Library/Application Support/io.kubebuilder.envtest/k8s/1.30.0-darwin-arm64" go test -run ^TestMachineHealthCheck_Reconcile$/^when_a_Machine_has_no_Node_ref_for_longer_than_the_NodeStartupTimeout$ -count=50 -failfast
...
ok sigs.k8s.io/cluster-api/internal/controllers/machinehealthcheck 309.292s

➜ KUBEBUILDER_ASSETS="/Users/stephen.cahill/Library/Application Support/io.kubebuilder.envtest/k8s/1.30.0-darwin-arm64" go test -run ^TestMachineHealthCheck_Reconcile$/^when_a_Machine\'s_Node_has_gone_away\$ -count=50 -failfast
...
ok sigs.k8s.io/cluster-api/internal/controllers/machinehealthcheck 48.430s
Force-pushed from d86d8be to 96d1bee
Review thread on internal/controllers/machinehealthcheck/machinehealthcheck_controller_test.go (outdated, resolved)
@cahillsf Thx for looking into this. I'm generally fine with using 5s for these two tests. I would just prefer if we set it locally in those tests. It's really hard for me to assess the impact on all other tests using this func and whether this would introduce new flakes elsewhere.
sounds good, thanks for the review. will make this change.
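For context, a minimal sketch of what "set it locally in those tests" could look like, assuming the two tests build their MachineHealthCheck object directly. The fixture name and selector labels below are hypothetical; the field names follow the v1beta1 API, but the actual change uses the test file's own helpers.

```go
package machinehealthcheck_test

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// buildMHCWithRelaxedStartupTimeout is a hypothetical fixture: it raises
// NodeStartupTimeout to 5s only for the two flaky tests, leaving the default
// used by the rest of the suite untouched.
func buildMHCWithRelaxedStartupTimeout(namespace, clusterName string) *clusterv1.MachineHealthCheck {
	return &clusterv1.MachineHealthCheck{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: "test-mhc-",
			Namespace:    namespace,
		},
		Spec: clusterv1.MachineHealthCheckSpec{
			ClusterName: clusterName,
			Selector: metav1.LabelSelector{
				MatchLabels: map[string]string{"selector": "test-mhc"},
			},
			// Give envtest nodes more time to come up before the MHC
			// counts a missing Node as a startup failure.
			NodeStartupTimeout: &metav1.Duration{Duration: 5 * time.Second},
		},
	}
}
```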
Force-pushed from 96d1bee to fd89f58
Force-pushed from fd89f58 to 5050e3c
whoops 🙃 fixed. take another pass when you have a chance
Thank you!
LGTM label has been added. Git tree hash: 9b31fbda01e0a4f74b6012d80931e6aa153a98f7
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: sbueringer
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@cahillsf Looks like one of them failed again (https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-test-mink8s-main/1820212332773511168). Do you have some time to look into it?
Looks like this one as well: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-test-main/1819898507000025088
hey @sbueringer thanks for flagging. on first review it seems like the same issue
this one has 2 machines marked for remediation when it only expects 1. the unexpected remediation is due to:

I0804 00:54:55.194264 26712 machinehealthcheck_controller.go:435] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="test-mhc-vmdz9/test-mhc-8pj59" namespace="test-mhc-vmdz9" name="test-mhc-8pj59" reconcileID="9bc5a0b0-d629-4207-924f-52bbcf7e850a" Cluster="test-mhc-vmdz9/test-cluster-h87bk" target="test-mhc-vmdz9/test-mhc-8pj59/test-mhc-machine-s7n5t/" reason="NodeStartupTimeout" message="Node failed to report startup in 5s"

here is the expected one:

I0804 00:54:55.706437 26712 machinehealthcheck_controller.go:435] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="test-mhc-vmdz9/test-mhc-8pj59" namespace="test-mhc-vmdz9" name="test-mhc-8pj59" reconcileID="556769db-d779-4912-9c72-bba965028c85" Cluster="test-mhc-vmdz9/test-cluster-h87bk" target="test-mhc-vmdz9/test-mhc-8pj59/test-mhc-machine-twm26/" reason="NodeNotFound" message=""

may need some more time to look into the second. i'm happy to open a revert PR, or I can try increasing the timeout further for just these two tests? lmk what you think is best
I think we don't have to revert, it happens pretty infrequently. Fine to try to just increase the timeout further to see if that resolves it.
Which issue(s) this PR fixes (optional, in "fixes #<issue number>(, fixes #<issue_number>, ...)" format, will close the issue(s) when PR gets merged):
Fixes #9903
/area machinehealthcheck