
Flaky e2e test: Pivot the bootstrap cluster to a self-hosted cluster #4426

Closed
sbueringer opened this issue Apr 2, 2021 · 4 comments · Fixed by #4453
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@sbueringer
Member

sbueringer commented Apr 2, 2021

What steps did you take and what happened:
[A clear and concise description on how to REPRODUCE the bug.]

According to testgrid, about 1 in 10 runs of the capi-e2e test "When testing Cluster API working on self-hosted clusters - Should pivot the bootstrap cluster to a self-hosted cluster" fails. As far as I know, the bug can only be reproduced by running the e2e tests a few times; sooner or later the issue occurs.

What did you expect to happen:

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Some details (all data is from this test run):

  • The test usually works like this:
    • create a bootstrap cluster
    • create a workload cluster via the bootstrap cluster
    • deploy providers on the workload cluster via clusterctl init
    • run clusterctl move
    • wait until everything is up on the workload cluster

In the failed test runs, clusterctl move is unable to unpause the Cluster in the workload cluster because the call to the mutating webhook capi-webhook-service.capi-system.svc:443 fails with connection refused from the API server in the workload cluster to the capi-controller-manager pod (with 10 retries over 40 sec, according to the docs).

There's nothing interesting in the capi-controller-manager logs, but notably the last log line is about 1m50s before the test fails, so there was not a single log line during the calls to unpause the Cluster. This indicates that the capi controller in the workload cluster was no longer running.

The problem seems to be that the capd manager of the bootstrap cluster deletes the control plane machine of the workload cluster (full logs) during the pivot.

I0402 01:58:14.205132       1 machine.go:406] controller-runtime/manager/controller/dockermachine "msg"="Deleting machine container" "name"="self-hosted-gltar1-md-0-q42ht" "namespace"="self-hosted-cxckdr" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" 

The unpause timed out at around 02:00.

I suspect there is something wrong in CAPD so that machines are deleted even though the corresponding Cluster resource is paused.
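
For context, clusterctl move pauses the source Cluster by setting its spec.paused field before any objects are moved, and provider controllers are expected to stop reconciling (and in particular deleting) resources that belong to that Cluster until it is unpaused again on the target side. A minimal sketch of that pause step, assuming a controller-runtime client and the clusterv1 Cluster type (an illustration, not the actual clusterctl code):

// pauseCluster illustrates what "pausing" means during clusterctl move:
// spec.paused is set on the source Cluster before objects are moved, and
// controllers that ignore this flag can race with the move, which is what
// seems to happen with CAPD here.
func pauseCluster(ctx context.Context, c client.Client, cluster *clusterv1.Cluster) error {
	patch := client.MergeFrom(cluster.DeepCopy())
	cluster.Spec.Paused = true
	return c.Patch(ctx, cluster, patch)
}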

Some additional suggestions for easier debugging (perhaps as separate help-wanted issues?):

  • add the test/cluster name to the filename of clusterctl-move.log
  • to make it easier to correlate controller and test logs:
    • add timestamps to the ginkgo logs (see the sketch after this list)
    • add timestamps to the clusterctl logs
  • also retrieve some basic deployment/pod/event ... resources from the namespaces where CAPI controllers are deployed
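
A minimal sketch of the ginkgo-timestamp idea, assuming a hypothetical Byf-style helper in the e2e test code (the helper name and its ginkgo/time/fmt imports are illustrative, not existing framework code):

// Byf logs a formatted ginkgo step prefixed with the current time, so test
// output can be correlated with controller logs by timestamp.
func Byf(format string, a ...interface{}) {
	ginkgo.By(time.Now().Format(time.RFC3339) + " " + fmt.Sprintf(format, a...))
}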

Environment:

  • Cluster-api version: main
  • Minikube/KIND version: standard e2e test setup
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

/kind bug
[One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels]

@k8s-ci-robot added the kind/bug label on Apr 2, 2021
@sbueringer
Member Author

@fabriziopandini

I'm not entirely sure how the pause logic should be implemented in the CAPD machine controller, but I suspect we should check whether the Cluster is paused somewhere in the Reconcile func. The watches look okay, but I think they don't filter out events on Machines when their Clusters are paused (but I could be completely wrong):

c, err := ctrl.NewControllerManagedBy(mgr).
	For(&infrav1.DockerMachine{}).
	WithOptions(options).
	WithEventFilter(predicates.ResourceNotPaused(ctrl.LoggerFrom(ctx))).
	Watches(
		&source.Kind{Type: &clusterv1.Machine{}},
		handler.EnqueueRequestsFromMapFunc(util.MachineToInfrastructureMapFunc(infrav1.GroupVersion.WithKind("DockerMachine"))),
	).
	Watches(
		&source.Kind{Type: &infrav1.DockerCluster{}},
		handler.EnqueueRequestsFromMapFunc(r.DockerClusterToDockerMachines),
	).
	Build(r)
if err != nil {
	return err
}
return c.Watch(
	&source.Kind{Type: &clusterv1.Cluster{}},
	handler.EnqueueRequestsFromMapFunc(clusterToDockerMachines),
	predicates.ClusterUnpausedAndInfrastructureReady(ctrl.LoggerFrom(ctx)),
)
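
Note that the ResourceNotPaused event filter above can only inspect the object that triggered the event; roughly speaking it behaves like the sketch below (a simplified approximation, not the actual predicate implementation). A Cluster that is paused via spec.paused, as clusterctl move does, therefore doesn't stop DockerMachine reconciles on its own:

// Simplified approximation of a ResourceNotPaused-style filter: it only sees
// the triggering object's own cluster.x-k8s.io/paused annotation; the owning
// Cluster's spec.paused is not visible at this point.
func resourceNotPaused(obj metav1.Object) bool {
	_, paused := obj.GetAnnotations()["cluster.x-k8s.io/paused"]
	return !paused
}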

@sbueringer
Member Author

@fabriziopandini If you have some time, can you please check whether what I wrote makes sense? Input on how to implement it would be very valuable. Thx :)

@fabriziopandini
Member

@sbueringer we are definitely missing something similar to

// Return early if the object or Cluster is paused.
if annotations.IsPaused(cluster, m) {
	log.Info("Reconciliation is paused for this object")
	return ctrl.Result{}, nil
}

in the machine reconcile loop.
Let's get this fixed, but TBH I don't see a direct connection with the webhooks not responding.
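
For reference, a rough sketch of how such a guard could sit near the top of the DockerMachine Reconcile func, assuming cluster-api's util and util/annotations helpers and the reconciler's existing Client field (an illustration of the idea, not necessarily the exact shape of the eventual fix):

// Sketch only: fetch the DockerMachine, resolve its owning Cluster via the
// cluster-name label, and return early if the Cluster (or the DockerMachine
// itself) is paused, so reconciliation can't race with clusterctl move.
func (r *DockerMachineReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := ctrl.LoggerFrom(ctx)

	dockerMachine := &infrav1.DockerMachine{}
	if err := r.Client.Get(ctx, req.NamespacedName, dockerMachine); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	cluster, err := util.GetClusterFromMetadata(ctx, r.Client, dockerMachine.ObjectMeta)
	if err != nil {
		return ctrl.Result{}, err
	}

	// Return early if the object or Cluster is paused.
	if annotations.IsPaused(cluster, dockerMachine) {
		log.Info("Reconciliation is paused for this object")
		return ctrl.Result{}, nil
	}

	// ... existing reconcile / delete logic continues here ...
	return ctrl.Result{}, nil
}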

@sbueringer
Member Author

sbueringer commented Apr 9, 2021

@fabriziopandini

The webhooks are not responding because they run in the self-hosted cluster, and the nodes they were running on no longer exist at that point (CAPD deleted them during clusterctl move, even though the Cluster is paused).
There are a few variants of this problem:

  • sometimes the apiserver still exists and the webhooks fail
  • sometimes the apiserver is already gone
