KubeadmControlPlane stuck rolling out changes to apiServer extraArgs #4583

Closed
wcurry opened this issue May 7, 2021 · 12 comments
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • priority/awaiting-more-evidence: Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@wcurry

wcurry commented May 7, 2021

What steps did you take and what happened:

Deployed a 1 CP, 1 worker cluster. Realized my OIDC config was set for the wrong environment, so I redeployed the workload cluster with the updated apiServer extraArgs. The new control plane machine was created, but got stuck provisioning. I did not see any static pods in /etc/kubernetes/manifests.
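
Roughly, the change that triggered the rollout was of this shape (an illustrative sketch of only the relevant part of the KubeadmControlPlane, not the exact manifest; names and OIDC values are placeholders):

apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: KubeadmControlPlane
metadata:
  name: example-control-plane        # placeholder
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          # OIDC flags that pointed at the wrong environment; placeholder values
          oidc-issuer-url: https://oidc.example.com
          oidc-client-id: kubernetes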

What did you expect to happen:

CP roll finishes successfully.

Environment:

  • Cluster-api version: v0.3.13
  • CAPV version: v0.7.7
  • Kubernetes version: 1.17.11
  • OS (e.g. from /etc/os-release): Ubuntu 18.04

I no longer have access to this cluster.

I noticed that the non-goals for CAPI include the following:

To duplicate functionality that exists or is coming to other tooling, e.g., updating kubelet configuration (c.f. dynamic kubelet configuration), or updating apiserver, controller-manager, scheduler configuration (c.f. component-config effort) after the cluster is deployed.

Is this not supported behavior?

I noticed the kubeadm-join-config.yaml does not have a section for apiServer. I went to a 3 CP node cluster I have and found that the same was true there, yet the apiserver was running on the kubeadm-joined nodes. It's not clear to me how those static pods get created.

kubeadm-join-config.yaml from provisioning CP node
root@s0020-d7sts:~# more /tmp/kubeadm-join-config.yaml
apiVersion: kubeadm.k8s.io/v1beta1
controlPlane:
  localAPIEndpoint:
    advertiseAddress: ""
    bindPort: 443
discovery:
  bootstrapToken:
    apiServerEndpoint: k8s.domain.com:443
    caCertHashes:
    - sha256:xxx
    token: xxx
    unsafeSkipCAVerification: false
kind: JoinConfiguration
nodeRegistration:
  criSocket: /run/containerd/containerd.sock
  kubeletExtraArgs:
    anonymous-auth: "false"
    authentication-token-webhook: "true"
    cgroup-driver: systemd
    cgroups-per-qos: "true"
    cloud-provider: external
    cluster_domain: cluster.local
    cni-conf-dir: /etc/kubernetes/cni/net.d
    cpu-manager-policy: static
    enforce-node-allocatable: pods
    eviction-hard: memory.available<5%,nodefs.available<10%,nodefs.inodesFree<10%,imagefs.available<10%,imagefs.inodesFree<10%
    eviction-max-pod-grace-period: "300"
    eviction-minimum-reclaim: memory.available=0Mi,nodefs.available=5Gi,imagefs.available=5Gi
    eviction-soft: memory.available<10%,nodefs.available<20%,nodefs.inodesFree<20%,imagefs.available<20%,imagefs.inodesFree<20%
    eviction-soft-grace-period: memory.available=2m,nodefs.available=30s,nodefs.inodesFree=30s,imagefs.available=30s,imagefs.inodesFree=30s
    exit-on-lock-contention: "true"
    kube-reserved: cpu=1024m,memory=1000Mi
    kube-reserved-cgroup: /kubeletreserved.slice
    kubelet-cgroups: /kubeletreserved.slice
    lock-file: /var/run/lock/kubelet.lock
    network-plugin: cni
    node-labels: '"master=true","beta.nordstrom.net/node-pool=etcd-pool"'
    read-only-port: "0"
    system-reserved: cpu=200m,memory=1000Mi
  name: 's0020-d7sts'

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 7, 2021
@sbueringer
Member

Those static pod manifests are created by kubeadm, but without a kubeadm log (usually visible in cloud-init) it's basically impossible to debug what went wrong in your case ;)
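
As far as I know, when a control plane node joins, kubeadm reads the ClusterConfiguration (including the apiServer extraArgs) from the kubeadm-config ConfigMap in kube-system rather than from the join configuration, which is why kubeadm-join-config.yaml has no apiServer section. A rough way to check what kubeadm actually used on the new node (default paths, adjust as needed):

# ClusterConfiguration that kubeadm uses when joining control plane nodes
kubectl -n kube-system get configmap kubeadm-config -o yaml

# static pod manifests kubeadm should have written on the node
ls -l /etc/kubernetes/manifests

# kubelet logs, in case the manifests exist but the pods never started
journalctl -u kubelet --since "1 hour ago"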

@vincepri
Member

/triage support
/milestone Next
/priority awaiting-more-evidence

@k8s-ci-robot
Contributor

@vincepri: The label(s) triage/support cannot be applied, because the repository doesn't have them.

In response to this:

/triage support
/milestone Next
/priority awaiting-more-evidence

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added this to the Next milestone May 10, 2021
@k8s-ci-robot k8s-ci-robot added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label May 10, 2021
@vincepri
Member

@wcurry One thing that jumps out is that the Cluster API version is a bit behind, and we've had multiple bug fixes to KubeadmControlPlane since v0.3.13. Would you be able to update first?

@wcurry
Author

wcurry commented May 11, 2021

I failed to get a repro today on a dev cluster.

I did dig into the journald logs on the cluster with the failed roll before it was torn down and nothing stood out. I saw the util.py logs and the files being created by cloud-init, but I don't remember seeing anything obvious. I definitely didn't see anything in the journal logs for kubeadm aside from the creation of a yaml and a chmod. There was no join failure.

Working on updating CAPI this week. I'll see if I can trigger the bug in my downtime.

@sbueringer
Member

sbueringer commented May 11, 2021

@wcurry Just a hint: in case of an error, check the following logs:

less /var/log/cloud-init-output.log
journalctl -u cloud-init --since "10 hours ago"

If there is nothing there just take a look at:

journalctl --since "10 hours ago"

If you see that kubeadm times out waiting for the static Pods to come up, my best guess is to look at the kubelet / containerd unit logs to see whether the containers were started at all (crictl ps -a also helps) and, if the containers exist, check the container logs via crictl logs.

P.S. It could also be helpful to configure a higher kubeadm verbosity; there is a flag for that.
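
Roughly, the container-level checks look like this (a sketch assuming containerd with the CRI socket from the join config above; the container ID is a placeholder):

# list all containers, including exited ones, to see whether the
# control plane containers were created at all
crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a

# check the logs of a specific container (ID taken from the output above)
crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs <container-id>

The verbosity flag is kubeadm's --v (e.g. --v=5); with Cluster API it can usually be set via the verbosity field of the KubeadmConfigSpec, if your version supports it.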

@vincepri
Member

@sbueringer Kind of unrelated to this issue, but the above might be good troubleshooting steps to put in our book ^

@sbueringer
Member

@vincepri Yes. I might have some more. We're running about 200 test installations (cluster create + updates) every night internally. Only part of them use Cluster API (with CAPO) right now, but we've been using kubeadm there for 1-2 years. As we're trying to achieve a very high success rate, we have a lot of experience debugging these kinds of things.

I'll collect the troubleshooting hints that are relevant for CAPI and open a PR so we can discuss them in a bit more detail.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 9, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 8, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
