Flakes in the machine pools unit tests #4068

Closed
fabriziopandini opened this issue Jan 13, 2021 · 9 comments · Fixed by #4086
Labels
area/testing Issues or PRs related to testing kind/bug Categorizes issue or PR as related to a bug.
Milestone

Comments

@fabriziopandini
Member

What steps did you take and what happened:
While investigating some unit test failures, I saw this error happen intermittently:


--- PASS: TestMachinePoolConditions (0.08s)
    --- PASS: TestMachinePoolConditions/all_conditions_true (0.01s)
    --- PASS: TestMachinePoolConditions/boostrap_not_ready (0.02s)
    --- PASS: TestMachinePoolConditions/bootstrap_not_ready_with_fallback_condition (0.02s)
    --- PASS: TestMachinePoolConditions/infrastructure_not_ready (0.00s)
    --- PASS: TestMachinePoolConditions/infrastructure_not_ready_with_fallback_condition (0.03s)
=== RUN   TestAPIs
Running Suite: Controller Suite
===============================
Random Seed: 1610553618
....
••E0113 16:00:39.205134   12454 certwatcher.go:144] controller-runtime/certwatcher "msg"="error re-watching file" "error"="no such file or directory"  
E0113 16:00:39.205196   12454 certwatcher.go:149] controller-runtime/certwatcher "msg"="error re-reading certificate" "error"="open /tmp/envtest-serving-certs-204656340/tls.crt: no such file or directory"  
E0113 16:00:39.205232   12454 certwatcher.go:144] controller-runtime/certwatcher "msg"="error re-watching file" "error"="no such file or directory"  
E0113 16:00:39.205261   12454 certwatcher.go:149] controller-runtime/certwatcher "msg"="error re-reading certificate" "error"="open /tmp/envtest-serving-certs-204656340/tls.crt: no such file or directory"  
I0113 16:00:39.205404   12454 server.go:231] controller-runtime/webhook "msg"="shutting down webhook server"  
STEP: tearing down the test environment
------------------------------
Failure [0.652 seconds]
[AfterSuite] AfterSuite 
/home/prow/go/src/sigs.k8s.io/cluster-api/exp/controllers/suite_test.go:66
  Expected success, but got an error:
      <*errors.StatusError | 0xc000348aa0>: {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
              Message: "Timeout: failed waiting for *v1alpha4.MachinePool Informer to sync",
              Reason: "Timeout",
              Details: {Name: "", Group: "", Kind: "", UID: "", Causes: nil, RetryAfterSeconds: 0},
              Code: 504,
          },
      }
      Timeout: failed waiting for *v1alpha4.MachinePool Informer to sync
  /home/prow/go/src/sigs.k8s.io/cluster-api/exp/controllers/suite_test.go:60
------------------------------
Ran 9 of 9 Specs in 20.685 seconds
FAIL! -- 9 Passed | 0 Failed | 0 Pending | 0 Skipped
--- FAIL: TestAPIs (20.69s)
FAIL
FAIL	sigs.k8s.io/cluster-api/exp/controllers	21.138s

The error seems to be in the tear-down sequence, not in the test itself 🤔

Environment:

  • Cluster-api version: Main

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 13, 2021
@fabriziopandini
Member Author

/area testing
/milestone v0.4.x

@k8s-ci-robot
Contributor

@fabriziopandini: The provided milestone is not valid for this repository. Milestones in this repository: [Next, v0.3.x, v0.4.0]

Use /milestone clear to clear the milestone.

In response to this:

/area testing
/milestone v0.4.x

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the area/testing Issues or PRs related to testing label Jan 13, 2021
@fabriziopandini fabriziopandini added this to the v0.4.0 milestone Jan 13, 2021
@fabriziopandini
Member Author

@CecileRobertMichon @detiber @vincepri I could use some help here.

"Timeout: failed waiting for *v1alpha4.MachinePool Informer to sync" happens only sometimes.

The MachinePool CRD is registered in testenv together with the other CAPI CRDs, so the fact that this error only happens for this kind leads me to assume the problem is in exp/api/v1alpha4.

However, the only slightly odd thing I can find in the MachinePool types is that they disable conversion generation, and I don't have the full context here:

// +k8s:conversion-gen=false

Does anything above ring a bell?
Do you think I'm going in the right direction while investigating this issue?

@vincepri
Member

Is the error coming from the StartManager function? From the output above it seems so, but it also looks like the tests did run 🤔

@fabriziopandini
Member Author

@vincepri, yes, the error is from StartManager, and the tests do run.

My current assumption is that the tests finish too quickly, and the test env is stopped before StartManager has completed (which can happen because most of the initialization runs in goroutines and is therefore non-blocking). However, I don't know yet how we can tackle this...
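
To illustrate the suspected race, here is a minimal Go sketch that uses only the standard library. It does not use controller-runtime or the CAPI test helpers: fakeManager, WaitForSync, and runSuite are hypothetical names, and the 50 ms delay is an assumed stand-in for the time an informer needs to sync. The sketch only shows that cancelling the context right after starting the manager in a goroutine can surface a sync error, while waiting for a readiness signal first avoids it.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fakeManager is a hypothetical stand-in for a controller manager whose
// informers need some time to sync after Start is called.
type fakeManager struct {
	syncDelay time.Duration
	synced    chan struct{} // closed once the simulated sync has finished
}

func newFakeManager(syncDelay time.Duration) *fakeManager {
	return &fakeManager{syncDelay: syncDelay, synced: make(chan struct{})}
}

// Start simulates the informer sync and then blocks until ctx is cancelled.
// If ctx is cancelled before the sync finishes, it returns an error.
func (m *fakeManager) Start(ctx context.Context) error {
	select {
	case <-time.After(m.syncDelay):
		close(m.synced)
	case <-ctx.Done():
		return errors.New("timeout: failed waiting for informer to sync")
	}
	<-ctx.Done()
	return nil
}

// WaitForSync reports whether the simulated sync finished within timeout.
func (m *fakeManager) WaitForSync(timeout time.Duration) bool {
	select {
	case <-m.synced:
		return true
	case <-time.After(timeout):
		return false
	}
}

// runSuite starts the manager in a goroutine, optionally waits for the sync,
// then tears everything down and returns whatever Start returned.
func runSuite(waitForSync bool) error {
	mgr := newFakeManager(50 * time.Millisecond)
	ctx, cancel := context.WithCancel(context.Background())
	errCh := make(chan error, 1)
	go func() { errCh <- mgr.Start(ctx) }()

	if waitForSync && !mgr.WaitForSync(2*time.Second) {
		cancel()
		<-errCh
		return errors.New("manager never became ready")
	}

	// Very fast specs would finish here, so teardown follows immediately.
	cancel()
	return <-errCh
}

func main() {
	fmt.Println("teardown without waiting:", runSuite(false))
	fmt.Println("teardown after waiting:  ", runSuite(true))
}
```

Gating teardown on a readiness signal is only one possible mitigation; whether it matches the change that eventually closed this issue (#4086) is not established in this thread.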

@faiq
Contributor

faiq commented Apr 1, 2021

/reopen

@k8s-ci-robot
Contributor

@faiq: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@faiq
Contributor

faiq commented Apr 1, 2021

Hey folks, just wanted to register that I'm currently experiencing this problem as well. To me it seems like this is occurring because the files are never installed.

E0401 12:45:18.942558  504688 certwatcher.go:143] controller-runtime/certwatcher "msg"="error re-watching file" "error"="no such file or directory"  
E0401 12:45:18.943325  504688 certwatcher.go:148] controller-runtime/certwatcher "msg"="error re-reading certificate" "error"="open /tmp/envtest-serving-certs-767561487/tls.crt: no such file or directory"  
I0401 12:45:18.942641  504688 controller.go:203] controller-runtime/controller "msg"="Stopping workers" "controller"="clusterresourcesetbinding"

I think this can be solved by using this method to write the files to disk:
https://github.com/kubernetes-sigs/controller-runtime/blob/10ae090c1d3ac0c560dfa1a29b2517eb8d74442b/pkg/envtest/webhook.go#L163

and calling it from this function in the envtest helper:

func initializeWebhookInEnvironment() {

@fabriziopandini
Member Author

@faiq would you mind moving your comment to a new issue, so we can track this problem without it being confused with the previous discussion thread?

4 participants