
Add health resilience for managed cluster #2169

Closed
dtzar opened this issue Mar 14, 2022 · 6 comments
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@dtzar
Contributor

dtzar commented Mar 14, 2022

/kind feature

Describe the challenges
After "successfully" completing clusterctl init --infrastructure azure, I'm finding various scenarios where the management cluster is actually in an unhealthy state. When you then run kubectl apply -f myCapzWorkloadCluster.yaml, the workload cluster is permanently stuck in the Provisioning state.

Running kubectl describe clusters afterwards shows it can't provision because the control plane is not ready:

Status:
  Conditions:
    Last Transition Time:  2022-03-14T05:41:11Z
    Reason:                WaitingForControlPlane
    Severity:              Info
    Status:                False
    Type:                  Ready
    Last Transition Time:  2022-03-14T05:41:11Z
    Message:               Waiting for control plane provider to indicate the control plane has been initialized
    Reason:                WaitingForControlPlaneProviderInitialized
    Severity:              Info
    Status:                False
    Type:                  ControlPlaneInitialized
    Last Transition Time:  2022-03-14T05:41:11Z
    Reason:                WaitingForControlPlane
    Severity:              Info
    Status:                False
    Type:                  ControlPlaneReady
    Last Transition Time:  2022-03-14T05:41:20Z
    Status:                True
    Type:                  InfrastructureReady
  Infrastructure Ready:    true
  Observed Generation:     1
  Phase:                   Provisioning
Events:                    <none>

The two specific scenarios I've hit where the management cluster isn't healthy:

  1. Rancher Desktop or another local K8s cluster alternative fails the volume mounts that attach certificates to pods. The mounts never happen, and the management cluster never works.
Events:
  Type     Reason       Age                    From               Message
  ----     ------       ----                   ----               -------
  Normal   Scheduled    7m58s                  default-scheduler  Successfully assigned capz-system/capz-nmi-6fn2r to dtzardel9
  Warning  FailedMount  3m39s (x2 over 5m56s)  kubelet            Unable to attach or mount volumes: unmounted volumes=[kubelet-config], unattached volumes=[kubelet-config kube-api-access-s8d42 iptableslock]: timed out waiting for the condition
  Warning  FailedMount  107s (x11 over 7m58s)  kubelet            MountVolume.SetUp failed for volume "kubelet-config" : open /etc/default/kubelet: no such file or directory
  Warning  FailedMount  81s                    kubelet            Unable to attach or mount volumes: unmounted volumes=[kubelet-config], unattached volumes=[iptableslock kubelet-config kube-api-access-s8d42]: timed out waiting for the condition
  Type     Reason       Age    From               Message
  ----     ------       ----   ----               -------
  Normal   Scheduled    7m50s  default-scheduler  Successfully assigned capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager-86c4dcbc4c-zpd7b to docker-desktop
  Warning  FailedMount  7m50s  kubelet            MountVolume.SetUp failed for volume "cert" : secret "capi-kubeadm-control-plane-webhook-service-cert" not found
  Normal   Pulled       7m48s  kubelet            Container image "k8s.gcr.io/cluster-api/kubeadm-control-plane-controller:v1.1.3" already present on machine
  Normal   Created      7m48s  kubelet            Created container manager
  Normal   Started      7m48s  kubelet            Started container manager
  Warning  Unhealthy    7m47s  kubelet            Readiness probe failed: Get "http://10.1.0.166:9440/readyz": dial tcp 10.1.0.166:9440: connect: connection refused
  2. The CAPZ controller fails because the agentPoolProfile.count argument is invalid. This is not necessarily an unhealthy control plane, but rather invalid template arguments being deployed. The end user (without digging into logs) still sees the same result shown at the top: stuck in the Provisioning state.
E0117 23:41:26.224640       1 controller.go:317] controller/azuremanagedcontrolplane "msg"="Reconciler error" "error"="error creating AzureManagedControlPlane default/aksflavor: failed to reconcile managed cluster: failed to create managed cluster, failed to begin operation: containerservice.ManagedClustersClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code=\"InvalidParameter\" Message=\"The value of parameter agentPoolProfile.count is invalid. Please see https://aka.ms/aks-naming-rules for more details.\" Target=\"agentPoolProfile.count\"" "name"="aksflavor" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AzureManagedControlPlane" 
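For reference, the error above only surfaces in the controller logs. A minimal sketch for pulling it out, assuming the default capz-controller-manager deployment name:

```shell
# Sketch: grep reconcile errors out of the CAPZ controller logs instead of
# relying on the generic "WaitingForControlPlane" condition on the Cluster.
# The deployment name capz-controller-manager is the CAPZ default.
show_reconcile_errors() {
  kubectl logs -n capz-system deployment/capz-controller-manager | grep "Reconciler error"
}

# Usage: show_reconcile_errors
```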

Describe potential solutions you'd like to see
Any number of ways to ensure the management cluster is healthy before attempting to deploy a workload cluster, including (in no particular order):

  1. Doing a cluster health check after the initial install
  2. Not reporting pods as healthy until they actually are healthy
  3. A separate clusterctl command-line option to specifically check health
  4. Not allowing a new workload cluster to be provisioned until the management cluster is verified healthy
  5. Doing YAML template validation before attempting to deploy
  6. Passing unhandled invalid-template error messages back to the user in some meaningful way (i.e. surface the actual invalid-template log message from the controller, rather than the generic "waiting for control plane to be ready")
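As a rough sketch of options 1 and 4 above (this helper and the namespace list are assumptions, not an existing clusterctl feature), one could gate workload cluster creation on the core CAPI/CAPZ controller deployments becoming Available:

```shell
# Hypothetical pre-flight helper (not part of clusterctl): verify the CAPI/CAPZ
# controller deployments are Available before applying a workload cluster template.
check_mgmt_cluster_health() {
  for ns in capi-system capi-kubeadm-bootstrap-system \
            capi-kubeadm-control-plane-system capz-system; do
    # kubectl wait exits non-zero if the condition is not met within the timeout
    if ! kubectl wait --namespace "$ns" --for=condition=Available \
         deployment --all --timeout=120s >/dev/null; then
      echo "management cluster unhealthy: deployments in $ns not Available" >&2
      return 1
    fi
  done
  echo "management cluster healthy"
}

# Usage: check_mgmt_cluster_health && kubectl apply -f myCapzWorkloadCluster.yaml
```

This would have caught scenario 1 (the kubelet-config mount failure keeps the capz-nmi and control-plane controller pods from ever becoming Ready), though not scenario 2, which needs template validation.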

Environment:

  • cluster-api-provider-azure version: 1.1.3
  • Kubernetes version (use kubectl version): 1.23.1 client, 1.21.5 server
  • OS (e.g. from /etc/os-release): Ubuntu Focal on WSL2 (Windows 11)
@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 14, 2022
@CecileRobertMichon
Contributor

@dtzar thanks for raising this issue. For 1), this seems like a clusterctl / Cluster API feature request to me, not one specific to Azure. Would you mind creating an issue in the CAPI repo so we can brainstorm solutions with the other providers?

For 2) specifically, this seems like something that should have gotten caught in a webhook instead. I'm actually refactoring the managed clusters surface area (#2168) quite a bit and noticed this too -- there are a lot of validations being done in the managed cluster reconciler that should be done in webhooks instead. That would prevent the user from even creating the resource if it's invalid. I'll open a separate issue for this.
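A minimal sketch of how that would look from the user side (assuming the manifest name from above): server-side dry-run runs a manifest through validating webhooks without persisting anything, so once the count validation lives in a webhook, invalid values would fail fast here instead of during reconcile.

```shell
# Server-side dry-run sends the manifest through the API server's admission
# chain (including validating webhooks) without persisting the object, so a
# webhook-validated agentPoolProfile.count would be rejected here rather than
# at reconcile time.
validate_template() {
  kubectl apply --dry-run=server -f "$1"
}

# Usage: validate_template myCapzWorkloadCluster.yaml
```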

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 13, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 13, 2022
@jackfrancis
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 13, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 11, 2022
@jackfrancis
Contributor

I think we can close this. @dtzar @CecileRobertMichon please re-open if I'm wrong
