Failed to create Control Plane/ Failed to install CNI #979

Closed
jansmets opened this issue Oct 21, 2019 · 11 comments
Labels
kind/support Categorizes issue or PR as a support question.

Comments

@jansmets

Hi

I'd like to run kind in a GitLab CI pipeline. I have a bare-metal, on-prem Kubernetes cluster where GitLab launches job containers with a docker-in-docker service. There is no easy way to mount additional host volumes such as /lib/modules.

The kind-control-plane container has access to /sys/fs/cgroup and runs in privileged mode. The 'overlay' kernel module has been loaded. The container does not have a /lib/modules mount.
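
A quick way to confirm the overlay claim from inside a container, without needing /lib/modules, is to look at /proc/filesystems, which lists the filesystems the running kernel can mount right now. The sketch below is only an illustration; the function name `fs_supported` is made up for this example.

```shell
#!/bin/sh
# fs_supported NAME [FILE]
# Succeeds if filesystem NAME is listed in FILE (default: /proc/filesystems),
# i.e. the running kernel already knows it and no modprobe is needed.
fs_supported() {
    name=$1
    file=${2:-/proc/filesystems}
    # Each line is "[nodev]<TAB>fsname"; the name is always the last field.
    awk -v n="$name" '$NF == n { found = 1 } END { exit !found }' "$file"
}

if fs_supported overlay; then
    echo "overlay is listed; a failing modprobe for it should be harmless"
else
    echo "overlay is not listed; the module would have to be loaded first"
fi
```

Running it in the job container and again inside kind-control-plane shows whether both see the same kernel support.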

Cluster creation gives:

I1021 09:37:26.534328      89 round_trippers.go:438] POST https://172.17.0.2:6443/apis/rbac.authorization.k8s.io/v1/namespaces/kube-system/roles 201 Created in 14 milliseconds
I1021 09:37:26.560629      89 round_trippers.go:438] POST https://172.17.0.2:6443/apis/rbac.authorization.k8s.io/v1/namespaces/kube-system/rolebindings 201 Created in 26 milliseconds
[addons] Applied essential addon: kube-proxy
I1021 09:37:26.561935      89 loader.go:359] Config loaded from file:  /etc/kubernetes/admin.conf
I1021 09:37:26.563065      89 loader.go:359] Config loaded from file:  /etc/kubernetes/admin.conf

Your Kubernetes control-plane has initialized successfully!
....
kubeadm join 172.17.0.2:6443 --token <value withheld> \
    --discovery-token-ca-cert-hash sha256:8d3046f95a14115991f1c126ced379fe17e9d7a6fda8143e829cf36b6eaa0ff1  
DEBU[09:37:26] Running: /usr/local/bin/docker [docker inspect -f {{(index (index .NetworkSettings.Ports "6443/tcp") 0).HostPort}} kind-control-plane] 
DEBU[09:37:26] Running: /usr/local/bin/docker [docker exec --privileged kind-control-plane cat /etc/kubernetes/admin.conf] 
DEBU[09:37:26] Running: /usr/local/bin/docker [docker exec --privileged kind-control-plane kubectl --kubeconfig=/etc/kubernetes/admin.conf taint nodes --all node-role.kubernetes.io/master-] 
 ✗ Starting control-plane 🕹️ 
DEBU[09:37:29] Running: /usr/local/bin/docker [docker ps -q -a --no-trunc --filter label=io.k8s.sigs.kind.cluster --format {{.Names}}\t{{.Label "io.k8s.sigs.kind.cluster"}} --filter label=io.k8s.sigs.kind.cluster=kind] 
DEBU[09:37:29] Running: /usr/local/bin/docker [docker rm -f -v kind-control-plane] 
Error: failed to create cluster: failed to remove master taint: exit status 1

And sometimes it fails slightly differently: "Starting control-plane" reports success, but installing the CNI fails.

DEBU[09:41:25] Running: /usr/local/bin/docker [docker inspect -f {{(index (index .NetworkSettings.Ports "6443/tcp") 0).HostPort}} kind-control-plane]
DEBU[09:41:25] Running: /usr/local/bin/docker [docker exec --privileged kind-control-plane cat /etc/kubernetes/admin.conf]
DEBU[09:41:25] Running: /usr/local/bin/docker [docker exec --privileged kind-control-plane kubectl --kubeconfig=/etc/kubernetes/admin.conf taint nodes --all node-role.kubernetes.io/master-]
 ✓ Starting control-plane 🕹️
DEBU[09:41:27] Running: /usr/local/bin/docker [docker exec --privileged kind-control-plane cat /kind/manifests/default-cni.yaml]
DEBU[09:41:27] Running: /usr/local/bin/docker [docker exec --privileged -i kind-control-plane kubectl create --kubeconfig=/etc/kubernetes/admin.conf -f -]
 ✗ Installing CNI 🔌
Error: failed to create cluster: failed to apply overlay network: exit status 1
/ # kind export logs
Exported logs to: /tmp/519333875

containerd's pre-init fails to modprobe the overlay module, but that is harmless because the module is already loaded in the host kernel. containerd still starts, and then logs an error about the CNI plugin not being initialized.

Oct 21 09:40:48 kind-control-plane systemd[1]: Starting containerd container runtime...
Oct 21 09:40:48 kind-control-plane modprobe[43]: modprobe: ERROR: ../libkmod/libkmod.c:586 kmod_search_moddep() could not open moddep file '/lib/modules/5.3.0-1.el7.elrepo.x86_64/modules.dep.bin'
Oct 21 09:40:48 kind-control-plane modprobe[43]: modprobe: FATAL: Module overlay not found in directory /lib/modules/5.3.0-1.el7.elrepo.x86_64
...
Oct 21 09:41:25 kind-control-plane containerd[822]: time="2019-10-21T09:41:25.591282313Z" level=error msg="Failed to load cni configuration" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
...
Oct 21 09:43:15 kind-control-plane containerd[5651]: time="2019-10-21T09:43:15.970599949Z" level=info msg="Start cri plugin with config {PluginConfig:{ContainerdConfig:{Snapshotter:overlayfs DefaultRuntime:{Type:io.containerd.runtime.v1.linux Engine: Root: Options:<nil>} UntrustedWorkloadRuntime:{Type: Engine: Root: Options:<nil>} Runtimes:map[] NoPivot:false} CniConfig:{NetworkPluginBinDir:/opt/cni/bin NetworkPluginConfDir:/etc/cni/net.d NetworkPluginConfTemplate:} Registry:{Mirrors:map[docker.io:{Endpoints:[https://registry-1.docker.io]}] Auths:map[]} StreamServerAddress:127.0.0.1 StreamServerPort:0 EnableSelinux:false SandboxImage:k8s.gcr.io/pause:3.1 StatsCollectPeriod:10 SystemdCgroup:false EnableTLSStreaming:false X509KeyPairStreaming:{TLSCertFile: TLSKeyFile:} MaxContainerLogLineSize:16384} ContainerdRootDir:/var/lib/containerd ContainerdEndpoint:/run/containerd/containerd.sock RootDir:/var/lib/containerd/io.containerd.grpc.v1.cri StateDir:/run/containerd/io.containerd.grpc.v1.cri}"
Oct 21 09:43:15 kind-control-plane containerd[5651]: time="2019-10-21T09:43:15.970623167Z" level=info msg="Connect containerd service"
Oct 21 09:43:15 kind-control-plane containerd[5651]: time="2019-10-21T09:43:15.970732501Z" level=info msg="Get image filesystem path "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs""
Oct 21 09:43:15 kind-control-plane containerd[5651]: time="2019-10-21T09:43:15.970837471Z" level=error msg="Failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
Oct 21 09:43:15 kind-control-plane containerd[5651]: time="2019-10-21T09:43:15.971104261Z" level=info msg="loading plugin "io.containerd.grpc.v1.introspection"..." type=io.containerd.grpc.v1

The kubelet, however, seems to be stuck in limbo, continuously restarting because of the missing CNI.
As far as I understand, the CNI is set up by the kindnetd DaemonSet, but it doesn't seem to get a proper chance to do so.
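
To check whether a CNI config has been written yet, one can look for config files in the directory named in the containerd error (/etc/cni/net.d). This is a minimal sketch, assuming it is run inside the node container; `cni_config_files` is a hypothetical helper name, and the three file extensions are, to my knowledge, the ones the CNI library loads.

```shell
#!/bin/sh
# cni_config_files [DIR]
# Prints the CNI config files found in DIR (default: /etc/cni/net.d).
# Returns non-zero when there are none, which matches the state containerd
# reports as "no network config found".
cni_config_files() {
    dir=${1:-/etc/cni/net.d}
    found=1
    for f in "$dir"/*.conf "$dir"/*.conflist "$dir"/*.json; do
        # Unmatched globs stay literal, so check that the file really exists.
        if [ -f "$f" ]; then
            echo "$f"
            found=0
        fi
    done
    return "$found"
}

if cni_config_files; then
    echo "CNI config present"
else
    echo "no CNI config yet"
fi
```

An empty directory long after startup suggests the CNI DaemonSet never ran, rather than a problem in containerd itself.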

root@kind-control-plane:/# kubectl  --kubeconfig=/etc/kubernetes/admin.conf -n kube-system get pods --all-namespaces
The connection to the server 172.17.0.2:6443 was refused - did you specify the right host or port?
root@kind-control-plane:/# kubectl  --kubeconfig=/etc/kubernetes/admin.conf -n kube-system get pods --all-namespaces
NAMESPACE     NAME                                         READY   STATUS    RESTARTS   AGE
kube-system   etcd-kind-control-plane                      1/1     Running   55         17m
kube-system   kube-apiserver-kind-control-plane            1/1     Running   57         17m
kube-system   kube-controller-manager-kind-control-plane   1/1     Running   66         17m
kube-system   kube-scheduler-kind-control-plane            1/1     Running   66         17m
Name:               kind-control-plane
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=kind-control-plane
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /run/containerd/containerd.sock
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 21 Oct 2019 09:41:19 +0000
**Taints:             node.kubernetes.io/not-ready:NoSchedule**
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Mon, 21 Oct 2019 10:51:04 +0000   Mon, 21 Oct 2019 09:41:17 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 21 Oct 2019 10:51:04 +0000   Mon, 21 Oct 2019 09:41:17 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 21 Oct 2019 10:51:04 +0000   Mon, 21 Oct 2019 09:41:17 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Mon, 21 Oct 2019 10:51:04 +0000   Mon, 21 Oct 2019 09:41:17 +0000   KubeletNotReady              runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Addresses:
  InternalIP:  172.17.0.2
  Hostname:    kind-control-plane

When I manually apply default-cni.yaml (with the templated pod subnet set to 10.244.0.0/16), I don't get any further, because the kindnet DaemonSet isn't scheduled on NotReady nodes.

I wonder whether the CNI failure is just a side effect of something else.

For example, the kube-scheduler is reporting:

E1021 11:18:41.483587       1 reflector.go:125] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:226: Failed to list *v1.Pod: pods is forbidden: User "system:kube-scheduler" cannot list resource "pods" in API group "" at the cluster scope
E1021 11:18:41.487398       1 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:kube-scheduler" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope

Any pointers are welcome.
Thank you

@jansmets jansmets added the kind/support Categorizes issue or PR as a support question. label Oct 21, 2019
@aojea
Contributor

aojea commented Oct 21, 2019

Several users are running kind in GitLab; a working config is documented here:
https://github.com/kind-ci/examples/blob/master/.gitlab-ci.yml

You can also use GitHub search to look for closed issues that mention "gitlab", e.g. #620.

@BenTheElder
Member

Can you export the logs?

It looks to me like we can't talk to the API server reliably, but it's not possible to tell why.

@BenTheElder
Member

It is normal for kubelet / containerd to complain about CNI until the kindnetd DaemonSet runs.

Trouble applying the CNI manifest generally means that the API server, or the connection to it, is not healthy, which can be caused by various problems with the host.
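
Since the manifest can only be applied once the API server answers, a debugging script could poll its health endpoint before going further. A minimal sketch: `retry` is a hypothetical helper, and the container name and kubeconfig path in the commented example are the ones that appear in the logs earlier in this thread.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY CMD [ARGS...]
# Runs CMD until it succeeds, at most ATTEMPTS times, sleeping DELAY seconds
# between tries. Returns 0 on the first success, 1 if every attempt failed.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # No point sleeping after the final failed attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example (not executed here): wait up to about a minute for /healthz.
# retry 30 2 docker exec kind-control-plane \
#     kubectl --kubeconfig=/etc/kubernetes/admin.conf get --raw /healthz
```

If the endpoint keeps flapping between success and "connection refused", that points at the control plane restarting rather than at the CNI manifest.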

@jansmets
Author

Logs attached.
Thank you for looking.
logs.tar.gz
I've also had one occurrence where it successfully created/started, but the job terminated and I didn't get a chance to look around to see if it really all was up.

@aojea
Contributor

aojea commented Oct 21, 2019

@BenTheElder is it possible to run dind without mounting /lib/modules?

kind-control-plane has access to /sys/fs/cgroup. and runs in privileged mode. The 'overlay' kernel module has been loaded. It does not have a /lib/modules mount.

@BenTheElder
Member

@aojea it depends. Nothing we do explicitly has a hard dependency on /lib/modules; however, if you need to modprobe, you may want it (e.g. perhaps docker is setting up iptables rules and the iptables modules are not loaded yet).
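
Whether a modprobe would be needed at all can be estimated by checking which modules the kernel has already loaded. The sketch below reads /proc/modules; `module_loaded` is a hypothetical helper, and the module names in the loop are only illustrative guesses at what docker or iptables might want, not a list taken from kind.

```shell
#!/bin/sh
# module_loaded NAME [FILE]
# Succeeds if kernel module NAME appears in FILE (default: /proc/modules).
# Caveat: modules compiled into the kernel never show up in /proc/modules.
module_loaded() {
    mod=$1
    file=${2:-/proc/modules}
    # The module name is the first whitespace-separated field of each line.
    awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$file"
}

# Illustrative module names only (assumption, adjust to your setup).
for m in overlay ip_tables br_netfilter; do
    if module_loaded "$m"; then
        echo "$m: loaded"
    else
        echo "$m: not listed (missing, or built into the kernel)"
    fi
done
```

Anything reported as not listed is a candidate for loading on the host beforehand when /lib/modules cannot be mounted into the container.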

@jansmets
Author

It turns out that /sys/fs/cgroup must be a host-mounted volume in the container where the docker daemon runs. In CI environments there's no control over the podSpec of the launched pod, and therefore /sys/fs/cgroup cannot be mounted.

This podSpec mimics the job behavior of GitLab CI with the Kubernetes executor. It starts working when you enable the volumeMount in the 'dind' container (commented out below).

apiVersion: v1
kind: Pod
metadata:
  name: dind-k8s
  namespace: sr-build
spec:
  containers:
    - name: dind
      image: docker:18-dind
      securityContext:
        privileged: true
#      volumeMounts:
#        - mountPath: /sys/fs/cgroup
#          name: cgroup
    - name: build
      image: docker:18-dind
      securityContext:
        privileged: true
      command:
      - sh
      - -c
      - "apk add bash curl openssl ; wget https://github.com/kubernetes-sigs/kind/releases/download/v0.5.1/kind-linux-amd64 ; chmod +x kind-linux-amd64 ; mv kind-linux-amd64 /usr/local/bin/kind ; DOCKER_HOST=tcp://localhost:2375  kind create cluster --loglevel trace --retain ; while true; do DOCKER_HOST=tcp://localhost:2375  docker exec --privileged -i kind-control-plane kubectl --kubeconfig=/etc/kubernetes/admin.conf get nodes ; sleep 2; done "
  volumes:
  - name: cgroup
    hostPath:
      path: /sys/fs/cgroup
      type: Directory
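
To compare the two variants of the pod (with and without the volumeMount), it can help to list what the 'dind' container actually sees as cgroup mounts. A minimal sketch, assuming it is run inside that container; `cgroup_mounts` is a hypothetical helper that only reports mounts and does not by itself prove what kind needs.

```shell
#!/bin/sh
# cgroup_mounts [FILE]
# Prints "mountpoint fstype" for every cgroup (v1 or v2) mount in FILE
# (default: /proc/mounts). Returns non-zero when none are mounted.
cgroup_mounts() {
    file=${1:-/proc/mounts}
    # /proc/mounts fields: device mountpoint fstype options dump pass
    awk '$3 == "cgroup" || $3 == "cgroup2" { print $2, $3; n++ }
         END { exit (n == 0) }' "$file"
}

if ! cgroup_mounts; then
    echo "no cgroup filesystems mounted"
fi
```

Diffing the output between the working and the failing pod shows exactly which hierarchies the host mount adds.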

@aojea
Contributor

aojea commented Oct 22, 2019

@jansmets it seems you've found the problem. Would you mind retitling this issue so other users can find it easily?

@jansmets
Author

This works, though:

  • without the 'kubernetes' layer
  • also without the host volume mount of `/sys/fs/cgroup`
docker run --privileged --name dind1 -d --expose 2375  docker:18-dind
docker exec -it dind1 ip addr
 #     inet 172.17.0.3/16 brd 172.17.255.255 scope global eth0
# note: no host volume mounts!!!
docker run --rm -it --privileged docker:stable
  DOCKER_HOST=tcp://172.17.0.3:2375 docker info
  apk add bash curl openssl
  wget https://github.com/kubernetes-sigs/kind/releases/download/v0.5.1/kind-linux-amd64
  chmod +x kind-linux-amd64
  mv kind-linux-amd64 /usr/local/bin/kind
  DOCKER_HOST=tcp://172.17.0.3:2375 kind create cluster --loglevel trace --retain
  while true; do DOCKER_HOST=tcp://172.17.0.3:2375  docker exec --privileged -i kind-control-plane kubectl --kubeconfig=/etc/kubernetes/admin.conf get nodes ; sleep 2; done

More and more CI/CD systems run jobs on Kubernetes clusters, and these systems don't allow modifications to the podSpec of the job, so it's nearly impossible to mount host volumes like /sys/fs/cgroup.
(I'm also not sure what the implications are of host-mounting /sys/fs/cgroup in all CI jobs; I don't think any operator wants that.)

Thank you again for your insights.

@BenTheElder
Member

If you're running on Kubernetes, there's an existing issue discussing these requirements, and indeed I would recommend not running kind in Kubernetes if you can avoid it. There are many pitfalls compared with more traditional CI hosts.

That said, the Kubernetes project's own CI runs on Kubernetes, so see #303.

@BenTheElder
Member

Sorry, I didn't realize earlier that you were running in Kubernetes. This is a duplicate of #303.

As you can see, we are in fact hesitant to encourage people to replicate this rather than running on a VM platform like the CircleCI machine executor or GCB or ...
