
Flaky e2e test: Pivot the bootstrap cluster to a self-hosted cluster #4426

Closed
sbueringer opened this issue Apr 2, 2021 · 4 comments · Fixed by #4453
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@sbueringer
Member

sbueringer commented Apr 2, 2021

What steps did you take and what happened:
[A clear and concise description on how to REPRODUCE the bug.]

According to testgrid, about 1 in 10 runs of the capi-e2e test "When testing Cluster API working on self-hosted clusters - Should pivot the bootstrap cluster to a self-hosted cluster" fails. As far as I know, the bug can only be reproduced by running the e2e tests a few times; sooner or later the issue occurs.

What did you expect to happen:

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Some details (all data is from this test run):

  • The test usually works like this:
    • create a bootstrap cluster
    • create a workload cluster via the bootstrap cluster
    • deploy providers on the workload cluster via clusterctl init
    • run clusterctl move
    • wait until everything is up on the workload cluster

In the failed test runs, clusterctl move is unable to unpause the Cluster in the workload cluster because the call to the mutating webhook capi-webhook-service.capi-system.svc:443 fails with connection refused from the API server in the workload cluster to the capi-controller-manager pod (with 10 retries over 40 sec, according to the docs).

There's nothing interesting in the capi-controller-manager logs, but notably the last log line is about 1m50s before the test fails, so there was not a single log line during the calls to unpause the Cluster. This indicates that the capi controller in the workload cluster was no longer running.

The problem seems to be that the capd manager of the bootstrap cluster deletes the control plane machine of the workload cluster (full logs) during the pivot.

I0402 01:58:14.205132       1 machine.go:406] controller-runtime/manager/controller/dockermachine "msg"="Deleting machine container" "name"="self-hosted-gltar1-md-0-q42ht" "namespace"="self-hosted-cxckdr" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" 

The unpause timed out at around 02:00.

I suspect there is something wrong in CAPD so that machines are deleted even though the corresponding Cluster resource is paused.
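
For context, clusterctl move pauses the source Cluster by setting its spec.paused field before any objects are moved, and provider controllers are expected to stop reconciling (and in particular deleting) resources that belong to that Cluster until it is unpaused again on the target side. A minimal sketch of that pause step, assuming a controller-runtime client and the clusterv1 Cluster type (an illustration, not the actual clusterctl code):

// pauseCluster illustrates what "pausing" means during clusterctl move:
// spec.paused is set on the source Cluster before objects are moved, and
// controllers that ignore this flag can race with the move, which is what
// seems to happen with CAPD here.
func pauseCluster(ctx context.Context, c client.Client, cluster *clusterv1.Cluster) error {
	patch := client.MergeFrom(cluster.DeepCopy())
	cluster.Spec.Paused = true
	return c.Patch(ctx, cluster, patch)
}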

Some additional suggestions for easier debugging (perhaps as separate help-wanted issues?):

  • add the test/cluster name to the filename of clusterctl-move.log
  • to make it easier to correlate controller and test logs:
    • add timestamps to the ginkgo logs (see the sketch after this list)
    • add timestamps to the clusterctl logs
  • also retrieve some basic deployment/pod/event ... resources from the namespaces where CAPI controllers are deployed
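
A minimal sketch of the ginkgo-timestamp idea, assuming a hypothetical Byf-style helper in the e2e test code (the helper name and its ginkgo/time/fmt imports are illustrative, not existing framework code):

// Byf logs a formatted ginkgo step prefixed with the current time, so test
// output can be correlated with controller logs by timestamp.
func Byf(format string, a ...interface{}) {
	ginkgo.By(time.Now().Format(time.RFC3339) + " " + fmt.Sprintf(format, a...))
}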

Environment:

  • Cluster-api version: main
  • Minikube/KIND version: standard e2e test setup
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

/kind bug
[One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels]

@k8s-ci-robot added the kind/bug label on Apr 2, 2021
@sbueringer
Member Author

@fabriziopandini

I'm not entirely sure how the pause logic should be implemented in the CAPD machine controller, but I suspect we should check whether the Cluster is paused somewhere in the Reconcile func. The watches look okay, but I think they don't filter out events on Machines when their Clusters are paused (but I could be completely wrong):

c, err := ctrl.NewControllerManagedBy(mgr).
	For(&infrav1.DockerMachine{}).
	WithOptions(options).
	WithEventFilter(predicates.ResourceNotPaused(ctrl.LoggerFrom(ctx))).
	Watches(
		&source.Kind{Type: &clusterv1.Machine{}},
		handler.EnqueueRequestsFromMapFunc(util.MachineToInfrastructureMapFunc(infrav1.GroupVersion.WithKind("DockerMachine"))),
	).
	Watches(
		&source.Kind{Type: &infrav1.DockerCluster{}},
		handler.EnqueueRequestsFromMapFunc(r.DockerClusterToDockerMachines),
	).
	Build(r)
if err != nil {
	return err
}
return c.Watch(
	&source.Kind{Type: &clusterv1.Cluster{}},
	handler.EnqueueRequestsFromMapFunc(clusterToDockerMachines),
	predicates.ClusterUnpausedAndInfrastructureReady(ctrl.LoggerFrom(ctx)),
)
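
Note that the ResourceNotPaused event filter above can only inspect the object that triggered the event; roughly speaking it behaves like the sketch below (a simplified approximation, not the actual predicate implementation). A Cluster that is paused via spec.paused, as clusterctl move does, therefore doesn't stop DockerMachine reconciles on its own:

// Simplified approximation of a ResourceNotPaused-style filter: it only sees
// the triggering object's own cluster.x-k8s.io/paused annotation; the owning
// Cluster's spec.paused is not visible at this point.
func resourceNotPaused(obj metav1.Object) bool {
	_, paused := obj.GetAnnotations()["cluster.x-k8s.io/paused"]
	return !paused
}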

@sbueringer
Member Author

@fabriziopandini If you have some time, can you please check whether what I wrote makes sense? Input on how to implement it would be very valuable. Thx :)

@fabriziopandini
Member

@sbueringer we are definitely missing something similar to

// Return early if the object or Cluster is paused.
if annotations.IsPaused(cluster, m) {
	log.Info("Reconciliation is paused for this object")
	return ctrl.Result{}, nil
}

in the machine reconcile loop.
Let's get this fixed, but TBH I don't see a direct connection with the webhooks not responding.
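
For reference, a rough sketch of how such a guard could sit near the top of the DockerMachine Reconcile func, assuming cluster-api's util and util/annotations helpers and the reconciler's existing Client field (an illustration of the idea, not necessarily the exact shape of the eventual fix):

// Sketch only: fetch the DockerMachine, resolve its owning Cluster via the
// cluster-name label, and return early if the Cluster (or the DockerMachine
// itself) is paused, so reconciliation can't race with clusterctl move.
func (r *DockerMachineReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := ctrl.LoggerFrom(ctx)

	dockerMachine := &infrav1.DockerMachine{}
	if err := r.Client.Get(ctx, req.NamespacedName, dockerMachine); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	cluster, err := util.GetClusterFromMetadata(ctx, r.Client, dockerMachine.ObjectMeta)
	if err != nil {
		return ctrl.Result{}, err
	}

	// Return early if the object or Cluster is paused.
	if annotations.IsPaused(cluster, dockerMachine) {
		log.Info("Reconciliation is paused for this object")
		return ctrl.Result{}, nil
	}

	// ... existing reconcile / delete logic continues here ...
	return ctrl.Result{}, nil
}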

@sbueringer
Member Author

sbueringer commented Apr 9, 2021

@fabriziopandini

The webhooks are not responding because they run in the self-hosted cluster, and the nodes they were running on no longer exist at that point (CAPD deleted them during clusterctl move, even though the Cluster is paused).
There are a few variants of this problem:

  • sometimes the apiserver still exists and the webhooks fail
  • sometimes the apiserver is already gone
