🌱 Reenable 2 MHC unit tests #10906

cahillsf · 2024-07-18T23:45:25Z

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #9903

/area machinehealthcheck

k8s-ci-robot · 2024-07-18T23:45:27Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

cahillsf · 2024-07-18T23:46:50Z

i think the nodes just need some more time to come up? the flakes seem to occur when there's an additional unintentional remediation with message="Node failed to report startup in <1ms|1s>"

when a Machine's Node has gone away

failure:

--- FAIL: TestMachineHealthCheck_Reconcile (32.66s)
    --- FAIL: TestMachineHealthCheck_Reconcile/when_a_Machine's_Node_has_gone_away (32.65s)
        machinehealthcheck_controller_test.go:1069: 
            Timed out after 30.001s.
            Expected
                <int>: 2
            to equal
                <int>: 1
FAIL

unexpected remediation:

I0718 19:16:12.978262   82258 machinehealthcheck_controller.go:435] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="test-mhc-hh6v8/test-mhc-kjj7x" namespace="test-mhc-hh6v8" name="test-mhc-kjj7x" reconcileID="c787a48f-8a18-4cd2-b355-4f929cb1c6d6" Cluster="test-mhc-hh6v8/test-cluster-fdcf2" target="test-mhc-hh6v8/test-mhc-kjj7x/test-mhc-machine-8cmcv/" reason="NodeStartupTimeout" message="Node failed to report startup in 1ms"

expected remediation:

I0718 19:16:13.417869   82258 machinehealthcheck_controller.go:435] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="test-mhc-hh6v8/test-mhc-kjj7x" namespace="test-mhc-hh6v8" name="test-mhc-kjj7x" reconcileID="3b971619-b43f-4565-a9d0-41d435c91a05" Cluster="test-mhc-hh6v8/test-cluster-fdcf2" target="test-mhc-hh6v8/test-mhc-kjj7x/test-mhc-machine-g2m4k/" reason="NodeNotFound" message=""

when a Machine has no Node ref for longer than the NodeStartupTimeout

Failure:

--- FAIL: TestMachineHealthCheck_Reconcile (33.11s)
    --- FAIL: TestMachineHealthCheck_Reconcile/when_a_Machine_has_no_Node_ref_for_longer_than_the_NodeStartupTimeout (33.10s)
        machinehealthcheck_controller_test.go:975: 
            Timed out after 30.000s.
            Expected
                <int>: 2
            to equal
                <int>: 1

unexpected remediation:

I0718 19:08:06.039675   77374 machinehealthcheck_controller.go:435] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="test-mhc-8fzfd/test-mhc-swlqt" namespace="test-mhc-8fzfd" name="test-mhc-swlqt" reconcileID="38905778-daec-4868-914f-8cf25d893355" Cluster="test-mhc-8fzfd/test-cluster-twwzs" target="test-mhc-8fzfd/test-mhc-swlqt/test-mhc-machine-jvf4p/" reason="NodeStartupTimeout" message="Node failed to report startup in 1s"

expected remediation:

I0718 19:08:06.046092   77374 machinehealthcheck_controller.go:435] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="test-mhc-8fzfd/test-mhc-swlqt" namespace="test-mhc-8fzfd" name="test-mhc-swlqt" reconcileID="d1bd88fe-a14a-4ced-ad3f-c1ed4a617841" Cluster="test-mhc-8fzfd/test-cluster-twwzs" target="test-mhc-8fzfd/test-mhc-swlqt/test-mhc-machine-rch4n/" reason="NodeNotFound" message=""

cahillsf · 2024-07-19T00:19:20Z

running each test individually with -count=50 -failfast

➜ KUBEBUILDER_ASSETS="/Users/stephen.cahill/Library/Application Support/io.kubebuilder.envtest/k8s/1.30.0-darwin-arm64" go test -run ^TestMachineHealthCheck_Reconcile$/^when_a_Machine_has_no_Node_ref_for_longer_than_the_NodeStartupTimeout$ -count=50 -failfast
...
ok  	sigs.k8s.io/cluster-api/internal/controllers/machinehealthcheck	309.292s

KUBEBUILDER_ASSETS="/Users/stephen.cahill/Library/Application Support/io.kubebuilder.envtest/k8s/1.30.0-darwin-arm64" go test -run ^TestMachineHealthCheck_Reconcile$/^when_a_Machine\'s_Node_has_gone_away\$ -count=50 -failfast
...
ok  	sigs.k8s.io/cluster-api/internal/controllers/machinehealthcheck	48.430s

internal/controllers/machinehealthcheck/machinehealthcheck_controller_test.go

sbueringer · 2024-07-25T14:37:39Z

@cahillsf Thx for looking into this. I'm generally fine with using 5s for these two tests. I would just prefer if we set it locally in those tests. It's really hard for me to assess the impact on all other tests using this func and if this would introduce new flakes elsewhere

cahillsf · 2024-07-25T15:21:54Z

@cahillsf Thx for looking into this. I'm generally fine with using 5s for these two tests. I would just prefer if we set it locally in those tests. It's really hard for me to assess the impact on all other tests using this func and if this would introduce new flakes elsewhere

sounds good, thanks for the review. will make this change

sbueringer · 2024-07-26T15:01:20Z

@cahillsf #10906 (comment) :)

cahillsf · 2024-07-26T17:56:25Z

@cahillsf #10906 (comment) :)

whoops 🙃 fixed when you have a chance to take another pass

sbueringer · 2024-07-29T09:32:11Z

Thank you!
/lgtm
/approve

k8s-ci-robot · 2024-07-29T09:32:18Z

LGTM label has been added.

Git tree hash: 9b31fbda01e0a4f74b6012d80931e6aa153a98f7

k8s-ci-robot · 2024-07-29T09:32:20Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [sbueringer]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sbueringer · 2024-08-05T10:52:38Z

@cahillsf Looks like one of them failed again (https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-test-mink8s-main/1820212332773511168) Do you have some time to look into it?

sbueringer · 2024-08-05T11:17:15Z

Looks like this one as well: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-test-main/1819898507000025088

cahillsf · 2024-08-05T13:27:34Z

hey @sbueringer thanks for flagging. on first review it seems like the same issue

Looks like one of them failed again (https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-test-mink8s-main/1820212332773511168)

this one has 2 machines marked for remediation when it only expects 1, unexpected remediation is due to NodeStartupTimeout:

I0804 00:54:55.194264   26712 machinehealthcheck_controller.go:435] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="test-mhc-vmdz9/test-mhc-8pj59" namespace="test-mhc-vmdz9" name="test-mhc-8pj59" reconcileID="9bc5a0b0-d629-4207-924f-52bbcf7e850a" Cluster="test-mhc-vmdz9/test-cluster-h87bk" target="test-mhc-vmdz9/test-mhc-8pj59/test-mhc-machine-s7n5t/" reason="NodeStartupTimeout" message="Node failed to report startup in 5s"

here is the expected one:

I0804 00:54:55.706437   26712 machinehealthcheck_controller.go:435] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="test-mhc-vmdz9/test-mhc-8pj59" namespace="test-mhc-vmdz9" name="test-mhc-8pj59" reconcileID="556769db-d779-4912-9c72-bba965028c85" Cluster="test-mhc-vmdz9/test-cluster-h87bk" target="test-mhc-vmdz9/test-mhc-8pj59/test-mhc-machine-twm26/" reason="NodeNotFound" message=""

may need some more time to look into the second.

i'm happy to open a revert PR, or I can try increasing the timeout further for just these two tests? lmk what you think is best

sbueringer · 2024-08-05T13:29:58Z

I think we don't have to revert it happens pretty infrequently. Fine to try to just increase the timeout further to see if that resolves it

k8s-ci-robot requested review from chrischdi and killianmuldoon July 18, 2024 23:45

cahillsf force-pushed the reenable-mhc-unit-test branch from ba840ac to d86d8be Compare July 18, 2024 23:57

cahillsf marked this pull request as ready for review July 19, 2024 00:20

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 19, 2024

k8s-ci-robot requested review from fabriziopandini and vincepri July 19, 2024 00:20

cahillsf changed the title ~~🌱 reeable MHC unit tests~~ 🌱 Reenable MHC unit tests Jul 19, 2024

cahillsf marked this pull request as draft July 19, 2024 13:03

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 19, 2024

cahillsf force-pushed the reenable-mhc-unit-test branch from d86d8be to 96d1bee Compare July 19, 2024 13:55

cahillsf marked this pull request as ready for review July 19, 2024 16:28

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 19, 2024

k8s-ci-robot requested a review from richardcase July 19, 2024 16:28

cahillsf changed the title ~~🌱 Reenable MHC unit tests~~ 🌱 Reenable 2 MHC unit tests Jul 19, 2024

sbueringer reviewed Jul 25, 2024

View reviewed changes

internal/controllers/machinehealthcheck/machinehealthcheck_controller_test.go Outdated Show resolved Hide resolved

cahillsf force-pushed the reenable-mhc-unit-test branch from 96d1bee to fd89f58 Compare July 26, 2024 13:14

k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jul 26, 2024

reeable MHC unit tests

5050e3c

cahillsf force-pushed the reenable-mhc-unit-test branch from fd89f58 to 5050e3c Compare July 26, 2024 16:29

k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 26, 2024

k8s-ci-robot assigned sbueringer Jul 29, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 29, 2024

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 29, 2024

k8s-ci-robot merged commit ff2f73f into kubernetes-sigs:main Jul 29, 2024
21 checks passed

k8s-ci-robot added this to the v1.8 milestone Jul 29, 2024

cahillsf mentioned this pull request Aug 5, 2024

🌱 Increase timeout for 2 MHC unit tests #11019

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🌱 Reenable 2 MHC unit tests #10906

🌱 Reenable 2 MHC unit tests #10906

cahillsf commented Jul 18, 2024

k8s-ci-robot commented Jul 18, 2024

cahillsf commented Jul 18, 2024 •

edited

Loading

cahillsf commented Jul 19, 2024 •

edited

Loading

sbueringer commented Jul 25, 2024 •

edited

Loading

cahillsf commented Jul 25, 2024

sbueringer commented Jul 26, 2024

cahillsf commented Jul 26, 2024

sbueringer commented Jul 29, 2024

k8s-ci-robot commented Jul 29, 2024

k8s-ci-robot commented Jul 29, 2024

sbueringer commented Aug 5, 2024

sbueringer commented Aug 5, 2024

cahillsf commented Aug 5, 2024

sbueringer commented Aug 5, 2024 •

edited

Loading

🌱 Reenable 2 MHC unit tests #10906

🌱 Reenable 2 MHC unit tests #10906

Conversation

cahillsf commented Jul 18, 2024

k8s-ci-robot commented Jul 18, 2024

cahillsf commented Jul 18, 2024 • edited Loading

when a Machine's Node has gone away

when a Machine has no Node ref for longer than the NodeStartupTimeout

cahillsf commented Jul 19, 2024 • edited Loading

sbueringer commented Jul 25, 2024 • edited Loading

cahillsf commented Jul 25, 2024

sbueringer commented Jul 26, 2024

cahillsf commented Jul 26, 2024

sbueringer commented Jul 29, 2024

k8s-ci-robot commented Jul 29, 2024

k8s-ci-robot commented Jul 29, 2024

sbueringer commented Aug 5, 2024

sbueringer commented Aug 5, 2024

cahillsf commented Aug 5, 2024

sbueringer commented Aug 5, 2024 • edited Loading

cahillsf commented Jul 18, 2024 •

edited

Loading

cahillsf commented Jul 19, 2024 •

edited

Loading

sbueringer commented Jul 25, 2024 •

edited

Loading

sbueringer commented Aug 5, 2024 •

edited

Loading