K8s HA installation timed out on task "Join master to ControlPlane" #1075

Closed
przemyslavic opened this issue Mar 26, 2020 · 7 comments

@przemyslavic
Collaborator

przemyslavic commented Mar 26, 2020

Describe the bug
K8s HA installation fails randomly on task Join master to ControlPlane on Azure environments.

To Reproduce
Steps to reproduce the behavior:

  1. execute epicli apply -f test.yml

Expected behavior
The HA cluster is deployed successfully.

Config files
Configuration that should be included in the yaml file:

specification:
  components:
    kubernetes_master:
      count: 3
    kubernetes_node:
      count: 3
---
kind: configuration/shared-config
title: Shared configuration that will be visible to all roles
name: default
specification:
  use_ha_control_plane: true
provider: azure

Task where the problem appears:

- when: not kubernetes_common.master_already_joined
  block:
    - include_role:
        name: kubernetes_common
        tasks_from: ensure-token

    - block:
        - name: Ensure /etc/kubeadm/ directory
          file:
            path: /etc/kubeadm/
            state: directory
            owner: root
            group: root
            mode: u=rw,go=r

        - name: Render /etc/kubeadm/kubeadm-join-master.yml template
          template:
            src: kubeadm-join-master.yml.j2
            dest: /etc/kubeadm/kubeadm-join-master.yml
            owner: root
            group: root
            mode: u=rw,go=r

        - name: Join master to ControlPlane
          shell: |
            kubeadm join \
              --config /etc/kubeadm/kubeadm-join-master.yml
          args:
            executable: /bin/bash

        - name: Mark master as joined
          set_fact:
            kubernetes_common: >-
              {{ kubernetes_common | default({}) | combine(set_fact, recursive=true) }}
          vars:
            set_fact:
              master_already_joined: true

- name: Include kubelet configuration tasks
  include_role:
    name: kubernetes_common
    tasks_from: configure-kubelet
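
For reference, the /etc/kubeadm/kubeadm-join-master.yml rendered above is a kubeadm JoinConfiguration. A minimal sketch of what such a control-plane join config typically looks like (the endpoint, token, hash and address values are placeholders, and the actual kubeadm-join-master.yml.j2 template may differ):

apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
discovery:
  bootstrapToken:
    apiServerEndpoint: <control-plane-endpoint>:6443  # placeholder
    token: <bootstrap-token>                          # placeholder
    caCertHashes:
      - sha256:<ca-cert-hash>                         # placeholder
controlPlane:
  localAPIEndpoint:
    advertiseAddress: <this-node-ip>                  # placeholder
    bindPort: 6443

The controlPlane section is what makes kubeadm join the node as an additional master rather than as a worker.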

OS (please complete the following information):

  • OS: [RHEL, Ubuntu]

Cloud Environment (please complete the following information):

  • Cloud Provider [MS Azure]

Additional context
Log:

2020-07-27T13:02:47.6918829Z 13:02:47 INFO cli.engine.ansible.AnsibleCommand - TASK [kubernetes_master : Join master to ControlPlane] *************************
2020-07-27T13:03:56.4662918Z 13:03:56 INFO cli.engine.ansible.AnsibleCommand - fatal: [ci-06hatodevazrhcanal-kubernetes-master-vm-2]: FAILED! => {"changed": true, "cmd": "kubeadm join  --config /etc/kubeadm/kubeadm-join-master.yml\n", "delta": "0:01:07.997277", "end": "2020-07-27 13:03:56.339262", "msg": "non-zero return code", "rc": 1, "start": "2020-07-27 13:02:48.341985", "stderr": "W0727 13:02:48.379669   13265 join.go:346] [preflight] WARNING: JoinControlPane.controlPlane settings will be ignored when control-plane flag is not set.\n\t[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". Please follow the guide at https://kubernetes.io/docs/setup/cri/\n\t[WARNING Service-Kubelet]: kubelet service is not enabled, please run 'systemctl enable kubelet.service'\nW0727 13:03:20.332544   13265 manifests.go:214] the default kube-apiserver authorization-mode is \"Node,RBAC\"; using \"Node,RBAC\"\nW0727 13:03:20.339844   13265 manifests.go:214] the default kube-apiserver authorization-mode is \"Node,RBAC\"; using \"Node,RBAC\"\nW0727 13:03:20.340659   13265 manifests.go:214] the default kube-apiserver authorization-mode is \"Node,RBAC\"; using \"Node,RBAC\"\n{\"level\":\"warn\",\"ts\":\"2020-07-27T13:03:44.441Z\",\"caller\":\"clientv3/retry_interceptor.go:61\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"passthrough:///https://10.1.1.9:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}\nerror execution phase control-plane-join/update-status: error uploading configuration: etcdserver: leader changed\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["W0727 13:02:48.379669   13265 join.go:346] [preflight] WARNING: JoinControlPane.controlPlane settings will be ignored when control-plane flag is not set.", "\t[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". 
Please follow the guide at https://kubernetes.io/docs/setup/cri/", "\t[WARNING Service-Kubelet]: kubelet service is not enabled, please run 'systemctl enable kubelet.service'", "W0727 13:03:20.332544   13265 manifests.go:214] the default kube-apiserver authorization-mode is \"Node,RBAC\"; using \"Node,RBAC\"", "W0727 13:03:20.339844   13265 manifests.go:214] the default kube-apiserver authorization-mode is \"Node,RBAC\"; using \"Node,RBAC\"", "W0727 13:03:20.340659   13265 manifests.go:214] the default kube-apiserver authorization-mode is \"Node,RBAC\"; using \"Node,RBAC\"", "{\"level\":\"warn\",\"ts\":\"2020-07-27T13:03:44.441Z\",\"caller\":\"clientv3/retry_interceptor.go:61\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"passthrough:///https://10.1.1.9:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}", "error execution phase control-plane-join/update-status: error uploading configuration: etcdserver: leader changed", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[preflight] Running pre-flight checks\n[preflight] Reading configuration from the cluster...\n[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'\n[preflight] Running pre-flight checks before initializing the new control plane instance\n[preflight] Pulling images required for setting up a Kubernetes cluster\n[preflight] This might take a minute or two, depending on the speed of your internet connection\n[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'\n[certs] Using certificateDir folder \"/etc/kubernetes/pki\"\n[certs] Generating \"front-proxy-client\" certificate and key\n[certs] Generating \"etcd/server\" certificate and key\n[certs] etcd/server serving cert is signed for DNS names [ci-06hatodevazrhcanal-kubernetes-master-vm-2 localhost] and IPs [10.1.1.9 127.0.0.1 ::1]\n[certs] Generating \"etcd/peer\" certificate and key\n[certs] etcd/peer serving cert is signed for DNS names [ci-06hatodevazrhcanal-kubernetes-master-vm-2 localhost] and IPs [10.1.1.9 127.0.0.1 ::1]\n[certs] Generating \"etcd/healthcheck-client\" certificate and key\n[certs] Generating \"apiserver-etcd-client\" certificate and key\n[certs] Generating \"apiserver\" certificate and key\n[certs] apiserver serving cert is signed for DNS names [ci-06hatodevazrhcanal-kubernetes-master-vm-2 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost] and IPs [10.96.0.1 10.1.1.9]\n[certs] Generating \"apiserver-kubelet-client\" certificate and key\n[certs] Valid certificates and keys now exist in \"/etc/kubernetes/pki\"\n[certs] Using the existing \"sa\" key\n[kubeconfig] Generating kubeconfig files\n[kubeconfig] Using kubeconfig folder \"/etc/kubernetes\"\n[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address\n[kubeconfig] Writing \"admin.conf\" kubeconfig file\n[kubeconfig] Writing \"controller-manager.conf\" kubeconfig file\n[kubeconfig] Writing \"scheduler.conf\" kubeconfig file\n[control-plane] Using manifest folder \"/etc/kubernetes/manifests\"\n[control-plane] Creating static Pod manifest for \"kube-apiserver\"\n[control-plane] Creating static Pod manifest for \"kube-controller-manager\"\n[control-plane] Creating static Pod manifest for \"kube-scheduler\"\n[check-etcd] Checking that the etcd cluster is healthy\n[kubelet-start] Downloading configuration for the kubelet 
from the \"kubelet-config-1.17\" ConfigMap in the kube-system namespace\n[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"\n[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"\n[kubelet-start] Starting the kubelet\n[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...\n[etcd] Announced new etcd member joining to the existing etcd cluster\n[etcd] Creating static Pod manifest for \"etcd\"\n[etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s\n[upload-config] Storing the configuration used in ConfigMap \"kubeadm-config\" in the \"kube-system\" Namespace", "stdout_lines": ["[preflight] Running pre-flight checks", "[preflight] Reading configuration from the cluster...", "[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'", "[preflight] Running pre-flight checks before initializing the new control plane instance", "[preflight] Pulling images required for setting up a Kubernetes cluster", "[preflight] This might take a minute or two, depending on the speed of your internet connection", "[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'", "[certs] Using certificateDir folder \"/etc/kubernetes/pki\"", "[certs] Generating \"front-proxy-client\" certificate and key", "[certs] Generating \"etcd/server\" certificate and key", "[certs] etcd/server serving cert is signed for DNS names [ci-06hatodevazrhcanal-kubernetes-master-vm-2 localhost] and IPs [10.1.1.9 127.0.0.1 ::1]", "[certs] Generating \"etcd/peer\" certificate and key", "[certs] etcd/peer serving cert is signed for DNS names [ci-06hatodevazrhcanal-kubernetes-master-vm-2 localhost] and IPs [10.1.1.9 127.0.0.1 ::1]", "[certs] Generating \"etcd/healthcheck-client\" certificate and key", "[certs] Generating \"apiserver-etcd-client\" certificate and key", "[certs] Generating \"apiserver\" certificate and key", "[certs] apiserver serving cert is signed for DNS names [ci-06hatodevazrhcanal-kubernetes-master-vm-2 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost] and IPs [10.96.0.1 10.1.1.9]", "[certs] Generating \"apiserver-kubelet-client\" certificate and key", "[certs] Valid certificates and keys now exist in \"/etc/kubernetes/pki\"", "[certs] Using the existing \"sa\" key", "[kubeconfig] Generating kubeconfig files", "[kubeconfig] Using kubeconfig folder \"/etc/kubernetes\"", "[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address", "[kubeconfig] Writing \"admin.conf\" kubeconfig file", "[kubeconfig] Writing \"controller-manager.conf\" kubeconfig file", "[kubeconfig] Writing \"scheduler.conf\" kubeconfig file", "[control-plane] Using manifest folder \"/etc/kubernetes/manifests\"", "[control-plane] Creating static Pod manifest for \"kube-apiserver\"", "[control-plane] Creating static Pod manifest for \"kube-controller-manager\"", "[control-plane] Creating static Pod manifest for \"kube-scheduler\"", "[check-etcd] Checking that the etcd cluster is healthy", "[kubelet-start] Downloading configuration for the kubelet from the \"kubelet-config-1.17\" ConfigMap in the kube-system namespace", "[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"", "[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"", "[kubelet-start] Starting the 
kubelet", "[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...", "[etcd] Announced new etcd member joining to the existing etcd cluster", "[etcd] Creating static Pod manifest for \"etcd\"", "[etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s", "[upload-config] Storing the configuration used in ConfigMap \"kubeadm-config\" in the \"kube-system\" Namespace"]}

The failure happens randomly.
On average, one in two HA deployments on Azure fails because of this issue.
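
For reference, the join fails in kubeadm's upload-config phase with "etcdserver: leader changed", i.e. an etcd leader election happens while the new control-plane configuration is being uploaded. A diagnostic sketch for inspecting the etcd cluster at that point, assuming the standard kubeadm stacked-etcd layout (the pod name placeholder and the certificate paths below are kubeadm defaults, not values verified on this cluster):

kubectl -n kube-system exec etcd-<existing-master-hostname> -- etcdctl \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  endpoint status --cluster -w table

This lists all etcd members along with the current leader, which helps confirm whether the leader actually changed during the join.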

@rpudlowski93
Contributor

rpudlowski93 commented Jul 31, 2020

I would recommend checking once again whether the problem still appears, since the ticket was created a few months ago. Moreover, it might be worth adding a pause (60 sec) or a conditional somewhere in the Ansible code, because some tasks may run too fast or in the wrong order.
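
For illustration, that suggestion could be expressed as a retry loop with a delay around the join task; a rough sketch, with retries/delay values that are arbitrary rather than anything the project has agreed on:

- name: Join master to ControlPlane
  shell: |
    kubeadm join \
      --config /etc/kubeadm/kubeadm-join-master.yml
  args:
    executable: /bin/bash
  register: join_result
  until: join_result.rc == 0
  retries: 3   # arbitrary
  delay: 60    # 60-second pause between attempts, as suggested above

Note that a control-plane join that fails halfway usually leaves state behind, so in practice a bare retry like this may also need a kubeadm reset between attempts.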

@przemyslavic
Collaborator Author

przemyslavic commented Jul 31, 2020

This has now been tested with the latest develop code with Kubernetes 1.18.6, and I haven't been able to reproduce the problem yet.
More testing is still underway.
The problem was reproduced many times in version 0.6.

@sk4zuzu sk4zuzu self-assigned this Aug 3, 2020
@sk4zuzu
Contributor

sk4zuzu commented Aug 3, 2020

It seems that after upgrading to the newer Kubernetes 1.18 we don't really see that problem anymore. But it's still there in 0.6, which is impacting upgrade testing. I'll take a look at whether it's possible to fix it in 0.6. 👍

@mkyc mkyc modified the milestones: S20200813, S20200827 Aug 13, 2020
@sk4zuzu sk4zuzu removed their assignment Aug 19, 2020
@sk4zuzu
Contributor

sk4zuzu commented Aug 19, 2020

The problem still exists, no real progress here. It's reproducible in AWS / Azure when using Azure DevOps, and it takes a couple of retries to reproduce. Since deploying a cluster takes over an hour, it's really annoying to work on.

@mkyc mkyc modified the milestones: S20200827, S20200910 Aug 27, 2020
@mkyc mkyc modified the milestones: S20200910, S20200924, S20201008 Sep 10, 2020
@mkyc mkyc modified the milestones: S20201008, S20201022 Sep 24, 2020
@mkyc mkyc removed this from the S20201022 milestone Oct 6, 2020
@mkyc
Contributor

mkyc commented Mar 29, 2021

Is there a workaround? For example, if I run apply again, would it work?
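
For what it's worth, a possible manual workaround sketch (not an official procedure, and not verified against this exact failure) would be to clean up the master that failed to join and then re-run the deployment:

# on the master VM where "Join master to ControlPlane" failed (hypothetical recovery steps)
sudo kubeadm reset -f
# then re-run the deployment from the machine running epicli
epicli apply -f test.yml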

@mkyc mkyc added this to the S20210422 milestone Apr 8, 2021
@mkyc
Contributor

mkyc commented Apr 8, 2021

Add information to the changelog's known issues section.

@mkyc mkyc self-assigned this Apr 8, 2021
@mkyc
Contributor

mkyc commented Apr 8, 2021

Handled in this PR

@mkyc mkyc closed this as completed Apr 13, 2021