KubeadmControlPlane stuck rolling out changes to apiServer extraArgs #4583

Closed
wcurry opened this issue May 7, 2021 · 12 comments
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • priority/awaiting-more-evidence: Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@wcurry

wcurry commented May 7, 2021

What steps did you take and what happened:

Deployed a 1 CP, 1 worker cluster. Realized my OIDC config was set for the wrong environment, so I redeployed the workload cluster with the updated apiServer extraArgs. The new control plane machine was created, but got stuck provisioning. I did not see any static pods in /etc/kubernetes/manifests.
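
Roughly, the change that triggered the rollout was of this shape (an illustrative sketch of only the relevant part of the KubeadmControlPlane, not the exact manifest; names and OIDC values are placeholders):

apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: KubeadmControlPlane
metadata:
  name: example-control-plane        # placeholder
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          # OIDC flags that pointed at the wrong environment; placeholder values
          oidc-issuer-url: https://oidc.example.com
          oidc-client-id: kubernetes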

What did you expect to happen:

CP roll finishes successfully.

Environment:

  • Cluster-api version: v0.3.13
  • CAPV version: v0.7.7
  • Kubernetes version: 1.17.11
  • OS (e.g. from /etc/os-release): Ubuntu 18.04

I no longer have access to this cluster.

I noticed that the non-goals for CAPI include the following:

To duplicate functionality that exists or is coming to other tooling, e.g., updating kubelet configuration (c.f. dynamic kubelet configuration), or updating apiserver, controller-manager, scheduler configuration (c.f. component-config effort) after the cluster is deployed.

Is this not supported behavior?

I noticed the kubeadm-join-config.yaml does not have a section for apiServer. I went to a 3 CP node cluster I have and found that the same was true there, yet the apiserver was running on the kubeadm-joined nodes. It's not clear to me how those static pods get created.

kubeadm-join-config.yaml from provisioning CP node
root@s0020-d7sts:~# more /tmp/kubeadm-join-config.yaml
apiVersion: kubeadm.k8s.io/v1beta1
controlPlane:
  localAPIEndpoint:
    advertiseAddress: ""
    bindPort: 443
discovery:
  bootstrapToken:
    apiServerEndpoint: k8s.domain.com:443
    caCertHashes:
    - sha256:xxx
    token: xxx
    unsafeSkipCAVerification: false
kind: JoinConfiguration
nodeRegistration:
  criSocket: /run/containerd/containerd.sock
  kubeletExtraArgs:
    anonymous-auth: "false"
    authentication-token-webhook: "true"
    cgroup-driver: systemd
    cgroups-per-qos: "true"
    cloud-provider: external
    cluster_domain: cluster.local
    cni-conf-dir: /etc/kubernetes/cni/net.d
    cpu-manager-policy: static
    enforce-node-allocatable: pods
    eviction-hard: memory.available<5%,nodefs.available<10%,nodefs.inodesFree<10%,imagefs.available<10%,imagefs.inodesFree<10%
    eviction-max-pod-grace-period: "300"
    eviction-minimum-reclaim: memory.available=0Mi,nodefs.available=5Gi,imagefs.available=5Gi
    eviction-soft: memory.available<10%,nodefs.available<20%,nodefs.inodesFree<20%,imagefs.available<20%,imagefs.inodesFree<20%
    eviction-soft-grace-period: memory.available=2m,nodefs.available=30s,nodefs.inodesFree=30s,imagefs.available=30s,imagefs.inodesFree=30s
    exit-on-lock-contention: "true"
    kube-reserved: cpu=1024m,memory=1000Mi
    kube-reserved-cgroup: /kubeletreserved.slice
    kubelet-cgroups: /kubeletreserved.slice
    lock-file: /var/run/lock/kubelet.lock
    network-plugin: cni
    node-labels: '"master=true","beta.nordstrom.net/node-pool=etcd-pool"'
    read-only-port: "0"
    system-reserved: cpu=200m,memory=1000Mi
  name: 's0020-d7sts'

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 7, 2021
@sbueringer
Member

Those static pod manifests are created by kubeadm, but without a kubeadm log (usually visible in cloud-init) it's basically impossible to debug what went wrong in your case ;)
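
As far as I know, when a control plane node joins, kubeadm reads the ClusterConfiguration (including the apiServer extraArgs) from the kubeadm-config ConfigMap in kube-system rather than from the join configuration, which is why kubeadm-join-config.yaml has no apiServer section. A rough way to check what kubeadm actually used on the new node (default paths, adjust as needed):

# ClusterConfiguration that kubeadm uses when joining control plane nodes
kubectl -n kube-system get configmap kubeadm-config -o yaml

# static pod manifests kubeadm should have written on the node
ls -l /etc/kubernetes/manifests

# kubelet logs, in case the manifests exist but the pods never started
journalctl -u kubelet --since "1 hour ago"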

@vincepri
Member

/triage support
/milestone Next
/priority awaiting-more-evidence

@k8s-ci-robot
Contributor

@vincepri: The label(s) triage/support cannot be applied, because the repository doesn't have them.

In response to this:

/triage support
/milestone Next
/priority awaiting-more-evidence

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added this to the Next milestone May 10, 2021
@k8s-ci-robot k8s-ci-robot added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label May 10, 2021
@vincepri
Member

@wcurry One thing that jumps out is that the Cluster API version is a bit behind, and we've had multiple bug fixes to KubeadmControlPlane since v0.3.13. Would you be able to update first?

@wcurry
Author

wcurry commented May 11, 2021

I failed to get a repro today on a dev cluster.

I did dig into the journald logs on the cluster with the failed roll before it was torn down and nothing stood out. I saw the util.py logs and the files being created by cloud-init, but I don't remember seeing anything obvious. I definitely didn't see anything in the journal logs for kubeadm aside from the creation of a yaml and a chmod. There was no join failure.

Working on updating CAPI this week. I'll see if I can trigger the bug in my downtime.

@sbueringer
Member

sbueringer commented May 11, 2021

@wcurry Just a hint: in case of an error, check the following logs:

less /var/log/cloud-init-output.log
journalctl -u cloud-init --since "10 hours ago"

If there is nothing there just take a look at:

journalctl --since "10 hours ago"

If you see that kubeadm times out waiting for the static Pods to come up, my best guess is to look at the kubelet / containerd unit logs to see whether the containers were started at all (crictl ps -a also helps) and, if the containers exist, check the container logs via crictl logs.

P.S. It could also be helpful to configure a higher kubeadm verbosity; there is a flag for that.
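
Roughly, the container-level checks look like this (a sketch assuming containerd with the CRI socket from the join config above; the container ID is a placeholder):

# list all containers, including exited ones, to see whether the
# control plane containers were created at all
crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a

# check the logs of a specific container (ID taken from the output above)
crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs <container-id>

The verbosity flag is kubeadm's --v (e.g. --v=5); with Cluster API it can usually be set via the verbosity field of the KubeadmConfigSpec, if your version supports it.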

@vincepri
Member

@sbueringer Kind of unrelated to this issue, but the above might be good troubleshooting steps to put in our book ^

@sbueringer
Member

@vincepri Yes. I might have some more. We're running about 200 test installations (cluster create + updates) every night internally. Only part of them use Cluster API (with CAPO) right now, but we've been using kubeadm there for 1-2 years. As we're trying to achieve a very high success rate, we have a lot of experience debugging these kinds of things.

I'll collect the troubleshooting hints that are relevant for CAPI and open a PR so we can discuss them in a bit more detail.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 9, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 8, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
