kube-apiserver bootstrap is unstable #49
Comments
@schu on which infrastructure are you setting up your cluster? I haven't seen this behaviour on OpenStack.
@afritzler on AWS.
I have also applied this patch to kubify to get more logs and keep finished containers for debugging:
Thanks @schu! So in the end, the problem is that the `kubernetes` service in the `default` namespace has the following session affinity timeout:
Here is the corresponding default in the code: https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/types.go#L2970
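For anyone checking their own cluster, the affinity settings can be read straight off the service; a minimal sketch (field paths from the core/v1 Service API):

```sh
# Print the affinity type and timeout of the default/kubernetes service
kubectl get svc kubernetes -n default \
  -o jsonpath='{.spec.sessionAffinity} {.spec.sessionAffinityConfig.clientIP.timeoutSeconds}'
# On an affected cluster this prints: ClientIP 10800
```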
This issue is a known one and has been fixed with kubernetes/kubernetes#56690. I created two backport PRs for the 1.9 and 1.10 release branches in kubernetes.
@afritzler thanks 👍
Well, thank you guys for the findings!
OK, the cherry-pick for 1.10 has been merged now. The ones for 1.9 and 1.8 are coming up as well. I will close this issue.
landscape-setup-template users frequently hit an error during cluster setup or end up with an unhealthy cluster where only 2 out of 3 kube-apiserver pods are running. Currently, we know of the following symptoms:
1. Cluster setup fails early due to etcd operator errors (deploy_kubify.sh fails to deploy etcd #48):
2. Cluster is unhealthy due to kube-controller-manager continuously throwing errors (the pod stays running, though):
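A rough sketch of how to observe symptom 2 from the outside (pod names and grep patterns here are assumptions; they vary with the kubify deployment):

```sh
# Check how many kube-apiserver pods are actually running on the masters
kubectl -n kube-system get pods -o wide | grep kube-apiserver

# Tail the controller-manager logs for the connection errors quoted below
kubectl -n kube-system logs \
  "$(kubectl -n kube-system get pods -o name | grep kube-controller-manager | head -n 1)" \
  | tail -n 20
```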
The fact that the error (`dial tcp 10.241.0.1:443: getsockopt: connection refused`) is encountered for all requests looks like a routing error at first: 2 out of 3 apiserver instances are running and reachable after all, and we expect requests to the service IP to be distributed among the set of available pods (i.e., shouldn't 2 out of 3 requests succeed?).

This is most likely due to the (default) `sessionAffinity` setting for the `default/kubernetes` service:
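With `ClientIP` affinity and the default timeout, the relevant part of the spec looks roughly like this (illustrative excerpt; the cluster IP is taken from the error above):

```sh
kubectl get svc kubernetes -o yaml
# Relevant excerpt (illustrative):
#   spec:
#     clusterIP: 10.241.0.1
#     sessionAffinity: ClientIP
#     sessionAffinityConfig:
#       clientIP:
#         timeoutSeconds: 10800
```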
When a request from a source IP was routed to a `KUBE-SEP` chain once, it will be routed there for the next 3 hours (10800 seconds). E.g., if the leading controller-manager pod happens to be routed to the faulty node (without kube-apiserver running), all requests will end up there until the timeout is reached. The iptables rules for that look like:
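As a sketch of what kube-proxy programs for `ClientIP` affinity (it uses the iptables `recent` match; the chain hashes below are placeholders):

```sh
iptables-save | grep -E 'KUBE-(SVC|SEP)'
# Illustrative shape of the rules (hashes are placeholders):
#   -A KUBE-SVC-XXXXXXXXXXXXXXXX -m recent --name KUBE-SEP-YYYYYYYYYYYYYYYY \
#       --rcheck --seconds 10800 --reap -j KUBE-SEP-YYYYYYYYYYYYYYYY
#   -A KUBE-SEP-YYYYYYYYYYYYYYYY -m recent --name KUBE-SEP-YYYYYYYYYYYYYYYY --set \
#       -j DNAT --to-destination <node-ip>:443
```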
By removing the `sessionAffinity` setting from the `default/kubernetes` service (e.g. with `kubectl edit svc kubernetes`), the problem can be fixed for symptom 2 (as described above): controller-manager will eventually hit a healthy apiserver instance and be able to go on with its tasks. The missing kube-apiserver pod will be rescheduled shortly after.
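For a non-interactive variant of the same workaround (a sketch; this only papers over symptom 2, and the setting may be restored by the bootstrap tooling):

```sh
# Turn off ClientIP affinity on the default/kubernetes service
kubectl patch svc kubernetes -n default -p '{"spec": {"sessionAffinity": "None"}}'
```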
Noteworthy is that on the faulty master node where kube-apiserver is not running, the checkpoint is also missing (otherwise the pod should be running again shortly after it stopped): `find /etc/kubernetes/ -iname '*api*'` returns nothing. The checkpointer logs show the following:
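To pull those logs on the node, something along these lines should work (assuming a Docker runtime and bootkube's pod-checkpointer; both names are assumptions for this setup):

```sh
# On the faulty master node: find the checkpointer container and dump its logs
docker ps -a --format '{{.ID}}\t{{.Names}}' | grep -i checkpoint
docker logs <container-id>
```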
Current status:
I don't know yet why this happens, but the root cause seems to be a problem during kube-apiserver bootstrapping. I'll add more info as I find it.
Any ideas? :)