
Failed first control plane node creates unrecoverable failure #2960

Closed
scottslowe opened this issue Apr 24, 2020 · 14 comments
Labels
area/control-plane Issues or PRs related to control-plane lifecycle management kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@scottslowe
Contributor

What steps did you take and what happened:

I created a workload cluster on AWS that would use existing infrastructure. The instance backing the first control plane node failed unexpectedly (some sort of AWS failure), generating errors like this in the CAPA controller manager log:

I0424 16:51:13.725055       1 awsmachine_controller.go:361] controllers/AWSMachine "msg"="Error state detected, skipping reconciliation" "awsCluster"="capi-etcd" "awsMachine"="capi-etcd-control-plane-4jpj4" "cluster"="capi-etcd" "machine"="capi-etcd-control-plane-hmx8d" "namespace"="default"

In talking with @randomvariable, he suggested deleting the Machine object. I did, and that removed the failed EC2 instance and created new Machine and AWSMachine objects, but further reconciliation never happened. Naadir indicated that this was "technically correct" behavior and that the KCP would never recover, but suggested raising this issue to possibly improve the docs or surface errors better.
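For readers who hit the same state, the workaround described above might look like this, using the object names from the log (a sketch only; your Machine name and namespace will differ, and it requires kubectl access to the management cluster):

```shell
# List Machines to find the one backed by the failed instance
kubectl get machines -n default

# Delete the failed Machine; CAPA removes the failed EC2 instance and
# KCP creates replacement Machine/AWSMachine objects
kubectl delete machine capi-etcd-control-plane-hmx8d -n default

# Note: as described in this issue (CAPI v0.3.x), the replacement node
# is never bootstrapped, so the cluster does not recover on its own.
```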

What did you expect to happen:

I expected the workload cluster to be created.

Anything else you would like to add:

N/A

Environment:

  • Cluster-api version: 0.3.3
  • Minikube/KIND version: N/A (tested on AWS)
  • Kubernetes version: (use kubectl version): v1.16.4 (management), v1.17.3 (workload)
  • OS (e.g. from /etc/os-release): Ubuntu 18.04

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 24, 2020
@detiber
Member

detiber commented Apr 24, 2020

There is a unique challenge here. If we can differentiate between "the control plane never initialized" and "the control plane existed but is no longer there", then I think we can safely retry initialization on failed creation.

I'm not sure we'd ever want to "re-initialize" a control plane that was present but is no longer there, since recreating the control plane despite the data loss incurred in the process would likely cause more confusion than it would resolve.

@scottslowe
Contributor Author

I'm completely out of my depth here, but wouldn't CABPK or KCP have knowledge as to whether the control plane ever initialized or not? I completely agree we'd want to avoid "re-initializing" a control plane.

@vincepri
Member

/priority awaiting-more-evidence
/milestone v0.3.x

@k8s-ci-robot k8s-ci-robot added this to the v0.3.x milestone Apr 29, 2020
@k8s-ci-robot k8s-ci-robot added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Apr 29, 2020
@randomvariable
Member

CABPK is very much in the fire and forget category.
KCP potentially has the ability to figure this out. Related to #2976
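One place this signal is surfaced is the KubeadmControlPlane status. Checking whether KCP ever considered the control plane initialized might look like this (a sketch; the object name capi-etcd-control-plane is taken from the log above and may differ in your cluster):

```shell
# status.initialized becomes true once KCP has seen the first control
# plane node complete bootstrap; for a failed first node it stays false
kubectl get kubeadmcontrolplane capi-etcd-control-plane -n default \
  -o jsonpath='{.status.initialized}'
```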

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 28, 2020
@vincepri
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 28, 2020
@ncdc ncdc added the area/control-plane Issues or PRs related to control-plane lifecycle management label Jul 31, 2020
@vincepri
Member

/milestone v0.4.0

We should probably tackle this as part of KCP in a very narrow scenario (initialized)

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.x, v0.4.0 Jul 31, 2020
@vincepri vincepri removed the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Jul 31, 2020
@ncdc ncdc removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jul 31, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 29, 2020
@fabriziopandini
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 30, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 28, 2021
@vincepri
Member

vincepri commented Feb 3, 2021

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 3, 2021
@vincepri
Member

/milestone v0.4.x

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.4.0, v0.4.x Feb 19, 2021
@CecileRobertMichon CecileRobertMichon modified the milestones: v0.4.x, v0.4 Mar 22, 2021
@vincepri vincepri modified the milestones: v0.4, v1.1 Oct 22, 2021
@fabriziopandini fabriziopandini modified the milestones: v1.1, v1.2 Feb 3, 2022
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the v1.2 milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini
Member

/close

As of today, the easiest path is to delete and re-create the cluster.
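A sketch of that delete-and-re-create path, assuming the cluster from this issue and a hypothetical manifest file cluster.yaml containing the original cluster definition:

```shell
# Delete the stuck Cluster object; Cluster API tears down the
# associated infrastructure resources
kubectl delete cluster capi-etcd -n default

# Re-apply the original manifests to create a fresh cluster
kubectl apply -f cluster.yaml
```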

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/close

As of today, the easiest path is to delete and re-create the cluster.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
