Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🌱 Reenable 2 MHC unit tests #10906

Merged
merged 1 commit into from
Jul 29, 2024

Conversation

cahillsf
Copy link
Member

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #9903

/area machinehealthcheck

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added area/machinehealthcheck Issues or PRs related to machinehealthchecks do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jul 18, 2024
@cahillsf
Copy link
Member Author

cahillsf commented Jul 18, 2024

i think the nodes just need some more time to come up? the flakes seem to occur when there's an additional unintentional remediation with message="Node failed to report startup in <1ms|1s>"

when a Machine's Node has gone away

failure:

--- FAIL: TestMachineHealthCheck_Reconcile (32.66s)
    --- FAIL: TestMachineHealthCheck_Reconcile/when_a_Machine's_Node_has_gone_away (32.65s)
        machinehealthcheck_controller_test.go:1069: 
            Timed out after 30.001s.
            Expected
                <int>: 2
            to equal
                <int>: 1
FAIL

unexpected remediation:

I0718 19:16:12.978262   82258 machinehealthcheck_controller.go:435] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="test-mhc-hh6v8/test-mhc-kjj7x" namespace="test-mhc-hh6v8" name="test-mhc-kjj7x" reconcileID="c787a48f-8a18-4cd2-b355-4f929cb1c6d6" Cluster="test-mhc-hh6v8/test-cluster-fdcf2" target="test-mhc-hh6v8/test-mhc-kjj7x/test-mhc-machine-8cmcv/" reason="NodeStartupTimeout" message="Node failed to report startup in 1ms"

expected remediation:

I0718 19:16:13.417869   82258 machinehealthcheck_controller.go:435] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="test-mhc-hh6v8/test-mhc-kjj7x" namespace="test-mhc-hh6v8" name="test-mhc-kjj7x" reconcileID="3b971619-b43f-4565-a9d0-41d435c91a05" Cluster="test-mhc-hh6v8/test-cluster-fdcf2" target="test-mhc-hh6v8/test-mhc-kjj7x/test-mhc-machine-g2m4k/" reason="NodeNotFound" message=""

when a Machine has no Node ref for longer than the NodeStartupTimeout

Failure:

--- FAIL: TestMachineHealthCheck_Reconcile (33.11s)
    --- FAIL: TestMachineHealthCheck_Reconcile/when_a_Machine_has_no_Node_ref_for_longer_than_the_NodeStartupTimeout (33.10s)
        machinehealthcheck_controller_test.go:975: 
            Timed out after 30.000s.
            Expected
                <int>: 2
            to equal
                <int>: 1

unexpected remediation:

I0718 19:08:06.039675   77374 machinehealthcheck_controller.go:435] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="test-mhc-8fzfd/test-mhc-swlqt" namespace="test-mhc-8fzfd" name="test-mhc-swlqt" reconcileID="38905778-daec-4868-914f-8cf25d893355" Cluster="test-mhc-8fzfd/test-cluster-twwzs" target="test-mhc-8fzfd/test-mhc-swlqt/test-mhc-machine-jvf4p/" reason="NodeStartupTimeout" message="Node failed to report startup in 1s"

expected remediation:

I0718 19:08:06.046092   77374 machinehealthcheck_controller.go:435] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="test-mhc-8fzfd/test-mhc-swlqt" namespace="test-mhc-8fzfd" name="test-mhc-swlqt" reconcileID="d1bd88fe-a14a-4ced-ad3f-c1ed4a617841" Cluster="test-mhc-8fzfd/test-cluster-twwzs" target="test-mhc-8fzfd/test-mhc-swlqt/test-mhc-machine-rch4n/" reason="NodeNotFound" message=""

@cahillsf
Copy link
Member Author

cahillsf commented Jul 19, 2024

running each test individually with -count=50 -failfast

➜ KUBEBUILDER_ASSETS="/Users/stephen.cahill/Library/Application Support/io.kubebuilder.envtest/k8s/1.30.0-darwin-arm64" go test -run ^TestMachineHealthCheck_Reconcile$/^when_a_Machine_has_no_Node_ref_for_longer_than_the_NodeStartupTimeout$ -count=50 -failfast
...
ok  	sigs.k8s.io/cluster-api/internal/controllers/machinehealthcheck	309.292s
KUBEBUILDER_ASSETS="/Users/stephen.cahill/Library/Application Support/io.kubebuilder.envtest/k8s/1.30.0-darwin-arm64" go test -run ^TestMachineHealthCheck_Reconcile$/^when_a_Machine\'s_Node_has_gone_away\$ -count=50 -failfast
...
ok  	sigs.k8s.io/cluster-api/internal/controllers/machinehealthcheck	48.430s

@cahillsf cahillsf marked this pull request as ready for review July 19, 2024 00:20
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 19, 2024
@cahillsf cahillsf changed the title 🌱 reeable MHC unit tests 🌱 Reenable MHC unit tests Jul 19, 2024
@cahillsf cahillsf marked this pull request as draft July 19, 2024 13:03
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 19, 2024
@cahillsf cahillsf marked this pull request as ready for review July 19, 2024 16:28
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 19, 2024
@cahillsf cahillsf changed the title 🌱 Reenable MHC unit tests 🌱 Reenable 2 MHC unit tests Jul 19, 2024
@sbueringer
Copy link
Member

sbueringer commented Jul 25, 2024

@cahillsf Thx for looking into this. I'm generally fine with using 5s for these two tests. I would just prefer if we set it locally in those tests. It's really hard for me to assess the impact on all other tests using this func and if this would introduce new flakes elsewhere

@cahillsf
Copy link
Member Author

@cahillsf Thx for looking into this. I'm generally fine with using 5s for these two tests. I would just prefer if we set it locally in those tests. It's really hard for me to assess the impact on all other tests using this func and if this would introduce new flakes elsewhere

sounds good, thanks for the review. will make this change

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jul 26, 2024
@sbueringer
Copy link
Member

@cahillsf #10906 (comment) :)

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 26, 2024
@cahillsf
Copy link
Member Author

@cahillsf #10906 (comment) :)

whoops 🙃 fixed when you have a chance to take another pass

@sbueringer
Copy link
Member

Thank you!
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 29, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 9b31fbda01e0a4f74b6012d80931e6aa153a98f7

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 29, 2024
@k8s-ci-robot k8s-ci-robot merged commit ff2f73f into kubernetes-sigs:main Jul 29, 2024
21 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.8 milestone Jul 29, 2024
@sbueringer
Copy link
Member

@cahillsf Looks like one of them failed again (https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-test-mink8s-main/1820212332773511168) Do you have some time to look into it?

@sbueringer
Copy link
Member

@cahillsf
Copy link
Member Author

cahillsf commented Aug 5, 2024

hey @sbueringer thanks for flagging. on first review it seems like the same issue


Looks like one of them failed again (https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-test-mink8s-main/1820212332773511168)

this one has 2 machines marked for remediation when it only expects 1, unexpected remediation is due to NodeStartupTimeout:

I0804 00:54:55.194264   26712 machinehealthcheck_controller.go:435] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="test-mhc-vmdz9/test-mhc-8pj59" namespace="test-mhc-vmdz9" name="test-mhc-8pj59" reconcileID="9bc5a0b0-d629-4207-924f-52bbcf7e850a" Cluster="test-mhc-vmdz9/test-cluster-h87bk" target="test-mhc-vmdz9/test-mhc-8pj59/test-mhc-machine-s7n5t/" reason="NodeStartupTimeout" message="Node failed to report startup in 5s"

here is the expected one:

I0804 00:54:55.706437   26712 machinehealthcheck_controller.go:435] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="test-mhc-vmdz9/test-mhc-8pj59" namespace="test-mhc-vmdz9" name="test-mhc-8pj59" reconcileID="556769db-d779-4912-9c72-bba965028c85" Cluster="test-mhc-vmdz9/test-cluster-h87bk" target="test-mhc-vmdz9/test-mhc-8pj59/test-mhc-machine-twm26/" reason="NodeNotFound" message=""

may need some more time to look into the second.

i'm happy to open a revert PR, or I can try increasing the timeout further for just these two tests? lmk what you think is best

@sbueringer
Copy link
Member

sbueringer commented Aug 5, 2024

I think we don't have to revert it happens pretty infrequently. Fine to try to just increase the timeout further to see if that resolves it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/machinehealthcheck Issues or PRs related to machinehealthchecks cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove skip from Machine health check test
3 participants