Flaky e2e test: Pivot the bootstrap cluster to a self-hosted cluster #4426
I'm not entirely sure how the pause logic should be implemented in the CAPD machine controller, but I suspect we should check whether the cluster is paused somewhere in cluster-api/test/infrastructure/docker/controllers/dockermachine_controller.go (lines 361 to 381 at 7478817).

@fabriziopandini If you have some time, can you please check whether what I wrote makes sense? Input on how to implement it would be very valuable. Thx :)
@sbueringer we are definitely missing something similar to cluster-api/controllers/machine_controller.go (lines 152 to 156 at 7478817) in the Machine reconcile loop.
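For reference, the check in machine_controller.go that this points to is a short early return built on the util/annotations helper. Below is a minimal sketch of what an equivalent guard in the DockerMachine reconciler could look like; the wrapper function, names, and the v1alpha4 import path are illustrative assumptions, not the actual code at 7478817:

```go
package controllers

import (
	"github.com/go-logr/logr"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha4" // API version is an assumption
	"sigs.k8s.io/cluster-api/util/annotations"
)

// shouldSkipReconcile reports whether reconciliation should return early,
// either because the owning Cluster is paused (Cluster.Spec.Paused, which
// clusterctl move sets during the pivot) or because the object itself
// carries the cluster.x-k8s.io/paused annotation.
func shouldSkipReconcile(log logr.Logger, cluster *clusterv1.Cluster, obj metav1.Object) bool {
	if annotations.IsPaused(cluster, obj) {
		log.Info("Reconciliation is paused for this object")
		return true
	}
	return false
}
```

The DockerMachine reconciler could call this right after fetching the owning Cluster and return ctrl.Result{}, nil when it reports true, which would also stop the delete reconciliation described below.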
The webhooks are not responding because they run in the self-hosted cluster, and the nodes they were running on no longer exist at this point: CAPD deleted them during clusterctl move, even though the Cluster is paused.
What steps did you take and what happened:
According to testgrid, about 1 in 10 test runs of
capi-e2e.When testing Cluster API working on self-hosted clusters Should pivot the bootstrap cluster to a self-hosted cluster
are failing. As far as I know, the bug can only be reproduced by running the e2e tests a few times; sooner or later the issue occurs.

What did you expect to happen:
Anything else you would like to add:
Some details (all data is from this test run, starting from clusterctl init):
In the failed test runs, clusterctl move is unable to unpause the cluster in the workload cluster because the mutating webhook
capi-webhook-service.capi-system.svc:443
fails with a connection refused from the API server in the workload cluster to the capi-controller-manager pod (with 10 retries over 40 sec, according to the doc).

There's nothing interesting in the capi controller manager logs, but notably the last log line is about 1m50s before the test fails. So there was not a single log line during the calls to unpause the cluster, which indicates that the capi controller in the workload cluster was no longer running.
The problem seems to be that the CAPD manager of the bootstrap cluster deletes the control plane machine of the workload cluster (full logs) during the pivot.
The unpause timed out at around 02:00.
I suspect there is something wrong in CAPD that causes Machines to be deleted even though the corresponding Cluster resource is paused.
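If the missing piece really is the pause handling in CAPD, a complementary option (in addition to the in-reconcile guard sketched above) would be to filter paused resources at the watch level. A sketch, assuming the ResourceNotPaused predicate from sigs.k8s.io/cluster-api/util/predicates; the reconciler stub and wiring are illustrative, not the actual CAPD setup:

```go
package controllers

import (
	"context"

	"github.com/go-logr/logr"
	infrav1 "sigs.k8s.io/cluster-api/test/infrastructure/docker/api/v1alpha4" // API version is an assumption
	"sigs.k8s.io/cluster-api/util/predicates"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// DockerMachineReconciler is stubbed here only to make the wiring compile;
// the real reconciler lives in dockermachine_controller.go.
type DockerMachineReconciler struct {
	client.Client
	Log logr.Logger
}

func (r *DockerMachineReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil // real logic omitted
}

// SetupWithManager drops events for DockerMachines that carry the
// cluster.x-k8s.io/paused annotation before they ever reach Reconcile.
func (r *DockerMachineReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&infrav1.DockerMachine{}).
		WithEventFilter(predicates.ResourceNotPaused(r.Log)).
		Complete(r)
}
```

Caveat: the predicate only inspects the object's own paused annotation, not Spec.Paused on the owning Cluster, so an annotations.IsPaused check inside Reconcile would still be needed to cover the clusterctl move case, where only the Cluster object is marked as paused.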
Some additional suggestions for easier debugging (as separate help-wanted issues?):
Environment:
Kubernetes version: (use kubectl version):
OS (e.g. from /etc/os-release):

/kind bug