
Add health resilience for managed cluster #2169

Closed
dtzar opened this issue Mar 14, 2022 · 6 comments
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@dtzar
Contributor

dtzar commented Mar 14, 2022

/kind feature

Describe the challenges
After "successfully" completing clusterctl init --infrastructure azure, I'm finding various scenarios where the management cluster is actually in an unhealthy state. When you then run kubectl apply -f myCapzWorkloadCluster.yaml, the workload cluster is permanently stuck in the Provisioning state.

Running kubectl describe clusters afterwards shows it can't provision because the control plane is not ready:

Status:
  Conditions:
    Last Transition Time:  2022-03-14T05:41:11Z
    Reason:                WaitingForControlPlane
    Severity:              Info
    Status:                False
    Type:                  Ready
    Last Transition Time:  2022-03-14T05:41:11Z
    Message:               Waiting for control plane provider to indicate the control plane has been initialized
    Reason:                WaitingForControlPlaneProviderInitialized
    Severity:              Info
    Status:                False
    Type:                  ControlPlaneInitialized
    Last Transition Time:  2022-03-14T05:41:11Z
    Reason:                WaitingForControlPlane
    Severity:              Info
    Status:                False
    Type:                  ControlPlaneReady
    Last Transition Time:  2022-03-14T05:41:20Z
    Status:                True
    Type:                  InfrastructureReady
  Infrastructure Ready:    true
  Observed Generation:     1
  Phase:                   Provisioning
Events:                    <none>

The two specific scenarios I've hit where the management cluster isn't healthy:

  1. Rancher Desktop or another local K8s cluster alternative fails the volume mounts that attach certificates to pods. The mounts never happen, and the management cluster never works.
Events:
  Type     Reason       Age                    From               Message
  ----     ------       ----                   ----               -------
  Normal   Scheduled    7m58s                  default-scheduler  Successfully assigned capz-system/capz-nmi-6fn2r to dtzardel9
  Warning  FailedMount  3m39s (x2 over 5m56s)  kubelet            Unable to attach or mount volumes: unmounted volumes=[kubelet-config], unattached volumes=[kubelet-config kube-api-access-s8d42 iptableslock]: timed out waiting for the condition
  Warning  FailedMount  107s (x11 over 7m58s)  kubelet            MountVolume.SetUp failed for volume "kubelet-config" : open /etc/default/kubelet: no such file or directory
  Warning  FailedMount  81s                    kubelet            Unable to attach or mount volumes: unmounted volumes=[kubelet-config], unattached volumes=[iptableslock kubelet-config kube-api-access-s8d42]: timed out waiting for the condition
  Type     Reason       Age    From               Message
  ----     ------       ----   ----               -------
  Normal   Scheduled    7m50s  default-scheduler  Successfully assigned capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager-86c4dcbc4c-zpd7b to docker-desktop
  Warning  FailedMount  7m50s  kubelet            MountVolume.SetUp failed for volume "cert" : secret "capi-kubeadm-control-plane-webhook-service-cert" not found
  Normal   Pulled       7m48s  kubelet            Container image "k8s.gcr.io/cluster-api/kubeadm-control-plane-controller:v1.1.3" already present on machine
  Normal   Created      7m48s  kubelet            Created container manager
  Normal   Started      7m48s  kubelet            Started container manager
  Warning  Unhealthy    7m47s  kubelet            Readiness probe failed: Get "http://10.1.0.166:9440/readyz": dial tcp 10.1.0.166:9440: connect: connection refused
  2. The CAPZ controller fails because the agentPoolProfile.count argument is invalid. This is not necessarily an unhealthy control plane, but rather invalid template arguments being deployed. The end user (without digging into logs) still sees the same result shown at the top: stuck in the Provisioning state.
E0117 23:41:26.224640       1 controller.go:317] controller/azuremanagedcontrolplane "msg"="Reconciler error" "error"="error creating AzureManagedControlPlane default/aksflavor: failed to reconcile managed cluster: failed to create managed cluster, failed to begin operation: containerservice.ManagedClustersClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code=\"InvalidParameter\" Message=\"The value of parameter agentPoolProfile.count is invalid. Please see https://aka.ms/aks-naming-rules for more details.\" Target=\"agentPoolProfile.count\"" "name"="aksflavor" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AzureManagedControlPlane" 
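For reference, the error above only surfaces in the controller logs. A minimal sketch for pulling it out, assuming the default capz-controller-manager deployment name:

```shell
# Sketch: grep reconcile errors out of the CAPZ controller logs instead of
# relying on the generic "WaitingForControlPlane" condition on the Cluster.
# The deployment name capz-controller-manager is the CAPZ default.
show_reconcile_errors() {
  kubectl logs -n capz-system deployment/capz-controller-manager | grep "Reconciler error"
}

# Usage: show_reconcile_errors
```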

Describe potential solutions you'd like to see
Any number of ways to ensure the management cluster is healthy before attempting to deploy a workload cluster, including (in no particular order):

  1. Doing a cluster health check after the initial install
  2. Not reporting pods as healthy until they actually are healthy
  3. A separate clusterctl command-line option to specifically check health
  4. Not allowing a new workload cluster to be provisioned until the management cluster is verified healthy
  5. Doing YAML template validation before attempting to deploy
  6. Passing unhandled invalid-template error messages back to the user in some meaningful way (i.e. surface the actual invalid-template log message from the controller, rather than the generic "waiting for control plane to be ready")
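As a rough sketch of options 1 and 4 above (this helper and the namespace list are assumptions, not an existing clusterctl feature), one could gate workload cluster creation on the core CAPI/CAPZ controller deployments becoming Available:

```shell
# Hypothetical pre-flight helper (not part of clusterctl): verify the CAPI/CAPZ
# controller deployments are Available before applying a workload cluster template.
check_mgmt_cluster_health() {
  for ns in capi-system capi-kubeadm-bootstrap-system \
            capi-kubeadm-control-plane-system capz-system; do
    # kubectl wait exits non-zero if the condition is not met within the timeout
    if ! kubectl wait --namespace "$ns" --for=condition=Available \
         deployment --all --timeout=120s >/dev/null; then
      echo "management cluster unhealthy: deployments in $ns not Available" >&2
      return 1
    fi
  done
  echo "management cluster healthy"
}

# Usage: check_mgmt_cluster_health && kubectl apply -f myCapzWorkloadCluster.yaml
```

This would have caught scenario 1 (the kubelet-config mount failure keeps the capz-nmi and control-plane controller pods from ever becoming Ready), though not scenario 2, which needs template validation.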

Environment:

  • cluster-api-provider-azure version: 1.1.3
  • Kubernetes version (use kubectl version): 1.23.1 client, 1.21.5 server
  • OS (e.g. from /etc/os-release): Ubuntu Focal on WSL2 (Windows 11)
@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 14, 2022
@CecileRobertMichon
Contributor

@dtzar thanks for raising this issue. For 1), this seems like a clusterctl / Cluster API feature request to me, not one specific to Azure. Would you mind creating an issue in the CAPI repo so we can brainstorm solutions with the other providers?

For 2) specifically, this seems like something that should have gotten caught in a webhook instead. I'm actually refactoring the managed clusters surface area (#2168) quite a bit and noticed this too -- there are a lot of validations being done in the managed cluster reconciler that should be done in webhooks instead. That would prevent the user from even creating the resource if it's invalid. I'll open a separate issue for this.
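A minimal sketch of how that would look from the user side (assuming the manifest name from above): server-side dry-run runs a manifest through validating webhooks without persisting anything, so once the count validation lives in a webhook, invalid values would fail fast here instead of during reconcile.

```shell
# Server-side dry-run sends the manifest through the API server's admission
# chain (including validating webhooks) without persisting the object, so a
# webhook-validated agentPoolProfile.count would be rejected here rather than
# at reconcile time.
validate_template() {
  kubectl apply --dry-run=server -f "$1"
}

# Usage: validate_template myCapzWorkloadCluster.yaml
```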

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 13, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 13, 2022
@jackfrancis
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 13, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 11, 2022
@jackfrancis
Contributor

I think we can close this. @dtzar @CecileRobertMichon please re-open if I'm wrong
