
Failed first control plane node creates unrecoverable failure #2960

Closed
scottslowe opened this issue Apr 24, 2020 · 14 comments
Labels
area/control-plane Issues or PRs related to control-plane lifecycle management kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@scottslowe
Contributor

What steps did you take and what happened:

I created a workload cluster on AWS that would use existing infrastructure. The instance backing the first control plane node failed unexpectedly (some sort of AWS failure), generating errors like this in the CAPA controller manager log:

I0424 16:51:13.725055       1 awsmachine_controller.go:361] controllers/AWSMachine "msg"="Error state detected, skipping reconciliation" "awsCluster"="capi-etcd" "awsMachine"="capi-etcd-control-plane-4jpj4" "cluster"="capi-etcd" "machine"="capi-etcd-control-plane-hmx8d" "namespace"="default"

In talking with @randomvariable, he suggested deleting the Machine object. I did, and that removed the failed EC2 instance and created new Machine and AWSMachine objects, but further reconciliation never happened. Naadir indicated that this was "technically correct" behavior and that the KCP would never recover, but suggested raising this issue to possibly improve the docs or surface errors better.
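For readers who hit the same state, the workaround described above might look like this, using the object names from the log (a sketch only; your Machine name and namespace will differ, and it requires kubectl access to the management cluster):

```shell
# List Machines to find the one backed by the failed instance
kubectl get machines -n default

# Delete the failed Machine; CAPA removes the failed EC2 instance and
# KCP creates replacement Machine/AWSMachine objects
kubectl delete machine capi-etcd-control-plane-hmx8d -n default

# Note: as described in this issue (CAPI v0.3.x), the replacement node
# is never bootstrapped, so the cluster does not recover on its own.
```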

What did you expect to happen:

I expected the workload cluster to be created.

Anything else you would like to add:

N/A

Environment:

  • Cluster-api version: 0.3.3
  • Minikube/KIND version: N/A (tested on AWS)
  • Kubernetes version: (use kubectl version): v1.16.4 (management), v1.17.3 (workload)
  • OS (e.g. from /etc/os-release): Ubuntu 18.04

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 24, 2020
@detiber
Member

detiber commented Apr 24, 2020

There is a unique challenge here. If we can differentiate between "the control plane never initialized" and "the control plane existed but is no longer there", then I think we can safely retry initialization on failed creation.

I'm not sure we'd ever want to "re-initialize" a control plane that was present but is no longer there, since recreating the control plane despite the data loss incurred in the process would likely cause more confusion than it would resolve.

@scottslowe
Contributor Author

I'm completely out of my depth here, but wouldn't CABPK or KCP have knowledge as to whether the control plane ever initialized or not? I completely agree we'd want to avoid "re-initializing" a control plane.

@vincepri
Member

/priority awaiting-more-evidence
/milestone v0.3.x

@k8s-ci-robot k8s-ci-robot added this to the v0.3.x milestone Apr 29, 2020
@k8s-ci-robot k8s-ci-robot added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Apr 29, 2020
@randomvariable
Member

CABPK is very much in the fire and forget category.
KCP potentially has the ability to figure this out. Related to #2976
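One place this signal is surfaced is the KubeadmControlPlane status. Checking whether KCP ever considered the control plane initialized might look like this (a sketch; the object name capi-etcd-control-plane is taken from the log above and may differ in your cluster):

```shell
# status.initialized becomes true once KCP has seen the first control
# plane node complete bootstrap; for a failed first node it stays false
kubectl get kubeadmcontrolplane capi-etcd-control-plane -n default \
  -o jsonpath='{.status.initialized}'
```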

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 28, 2020
@vincepri
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 28, 2020
@ncdc ncdc added the area/control-plane Issues or PRs related to control-plane lifecycle management label Jul 31, 2020
@vincepri
Member

/milestone v0.4.0

We should probably tackle this as part of KCP in a very narrow scenario (initialized)

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.x, v0.4.0 Jul 31, 2020
@vincepri vincepri removed the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Jul 31, 2020
@ncdc ncdc removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jul 31, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 29, 2020
@fabriziopandini
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 30, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 28, 2021
@vincepri
Member

vincepri commented Feb 3, 2021

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 3, 2021
@vincepri
Member

/milestone v0.4.x

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.4.0, v0.4.x Feb 19, 2021
@CecileRobertMichon CecileRobertMichon modified the milestones: v0.4.x, v0.4 Mar 22, 2021
@vincepri vincepri modified the milestones: v0.4, v1.1 Oct 22, 2021
@fabriziopandini fabriziopandini modified the milestones: v1.1, v1.2 Feb 3, 2022
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the v1.2 milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini
Member

/close

As of today, the easiest path is to delete and re-create the cluster.
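A sketch of that delete-and-re-create path, assuming the cluster from this issue and a hypothetical manifest file cluster.yaml containing the original cluster definition:

```shell
# Delete the stuck Cluster object; Cluster API tears down the
# associated infrastructure resources
kubectl delete cluster capi-etcd -n default

# Re-apply the original manifests to create a fresh cluster
kubectl apply -f cluster.yaml
```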

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/close

As of today, the easiest path is to delete and re-create the cluster.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
