Mitigate impact of unregister_netdevice kernel race #20096

Closed
dchen1107 opened this issue Jan 25, 2016 · 87 comments
Labels
area/test-infra kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@dchen1107
Member

kubernetes-soak-continuous-e2e-gce-1.1 has been timing out since Jan 18:

11:49:43 Kubelet regular resource usage tracking 
11:49:43   over 30m0s with 0 pods per node.
11:49:43   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubelet_perf.go:129
11:49:43 [BeforeEach] Kubelet
11:49:43   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:51
11:49:43 STEP: Creating a kubernetes client
11:49:43 >>> testContext.KubeConfig: /var/lib/jenkins/jobs/kubernetes-soak-weekly-deploy-gce-1.1/workspace/.kube/config
11:49:43 STEP: Building a namespace api object
11:49:43 Jan 18 11:49:43.357: INFO: Waiting up to 2m0s for service account default to be provisioned in ns e2e-tests-kubelet-perf-v1ajy
11:49:43 Jan 18 11:49:43.360: INFO: Get service account default in ns e2e-tests-kubelet-perf-v1ajy failed, ignoring for 2s: serviceaccounts "default" not found
11:49:45 Jan 18 11:49:45.368: INFO: Service account default in ns e2e-tests-kubelet-perf-v1ajy with secrets found. (2.010351595s)
11:49:45 STEP: Waiting for a default service account to be provisioned in namespace
11:49:45 Jan 18 11:49:45.368: INFO: Waiting up to 2m0s for service account default to be provisioned in ns e2e-tests-kubelet-perf-v1ajy
11:49:45 Jan 18 11:49:45.370: INFO: Service account default in ns e2e-tests-kubelet-perf-v1ajy with secrets found. (2.358283ms)
11:49:45 [BeforeEach] Kubelet
11:49:45   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubelet_perf.go:115
11:49:45 [It] over 30m0s with 0 pods per node.
11:49:45   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubelet_perf.go:129
11:49:45 STEP: Creating a RC of 0 pods and wait until all pods of this RC are running
11:49:45 STEP: creating replication controller resource0-9fd1ca51-be1c-11e5-b569-42010af01555 in namespace e2e-tests-kubelet-perf-v1ajy
11:49:45 Jan 18 11:49:45.513: INFO: Error getting container info on "gce-soak-weekly-1-1-minion-v8hi", err: unexpected end of JSON input
11:49:45 Jan 18 11:49:45.522: INFO: Created replication controller with name: resource0-9fd1ca51-be1c-11e5-b569-42010af01555, namespace: e2e-tests-kubelet-perf-v1ajy, replica count: 0
11:49:45 Jan 18 11:49:45.522: INFO: Resource usage on node "gce-soak-weekly-1-1-minion-foyg" is not ready yet
11:49:45 Jan 18 11:49:45.522: INFO: Resource usage on node "gce-soak-weekly-1-1-minion-h26u" is not ready yet
11:49:45 Jan 18 11:49:45.522: INFO: Resource usage on node "gce-soak-weekly-1-1-minion-v8hi" is not ready yet
11:49:45 STEP: Start monitoring resource usage
11:49:45 Jan 18 11:49:45.522: INFO: Still running...29m59.999999254s left
11:49:46 Jan 18 11:49:46.517: INFO: Error getting container info on "gce-soak-weekly-1-1-minion-v8hi", err: unexpected end of JSON input
11:49:47 Jan 18 11:49:47.523: INFO: Error getting container info on "gce-soak-weekly-1-1-minion-v8hi", err: unexpected end of JSON input
11:49:48 Jan 18 11:49:48.528: INFO: Error getting container info on "gce-soak-weekly-1-1-minion-v8hi", err: unexpected end of JSON input
11:49:49 Jan 18 11:49:49.532: INFO: Error getting container info on "gce-soak-weekly-1-1-minion-v8hi", err: unexpected end of JSON input
11:49:50 Jan 18 11:49:50.537: INFO: Error getting container info on "gce-soak-weekly-1-1-minion-v8hi", err: unexpected end of JSON input
11:49:51 Jan 18 11:49:51.543: INFO: Error getting container info on "gce-soak-weekly-1-1-minion-v8hi", err: unexpected end of JSON input
11:49:52 Jan 18 11:49:52.548: INFO: Error getting container info on "gce-soak-weekly-1-1-minion-v8hi", err: unexpected end of JSON input
11:49:53 Jan 18 11:49:53.552: INFO: Error getting container info on "gce-soak-weekly-1-1-minion-v8hi", err: unexpected end of JSON input
11:49:54 Jan 18 11:49:54.558: INFO: Error getting container info on "gce-soak-weekly-1-1-minion-v8hi", err: unexpected end of JSON input
11:49:55 Jan 18 11:49:55.563: INFO: Error getting container info on "gce-soak-weekly-1-1-minion-v8hi", err: unexpected end of JSON input
11:49:56 Jan 18 11:49:56.568: INFO: Error getting container info on "gce-soak-weekly-1-1-minion-v8hi", err: unexpected end of JSON input
11:49:57 Jan 18 11:49:57.573: INFO: Error getting container info on "gce-soak-weekly-1-1-minion-v8hi", err: unexpected end of JSON input
11:49:58 Jan 18 11:49:58.579: INFO: Error getting container info on "gce-soak-weekly-1-1-minion-v8hi", err: unexpected end of JSON input
11:49:59 Jan 18 11:49:59.584: INFO: Error getting container info on "gce-soak-weekly-1-1-minion-v8hi", err: unexpected end of JSON input
11:50:00 Jan 18 11:50:00.590: INFO: Error getting container info on "gce-soak-weekly-1-1-minion-v8hi", err: unexpected end of JSON input
11:50:01 Jan 18 11:50:01.595: INFO: Error getting container info on "gce-soak-weekly-1-1-minion-v8hi", err: unexpected end of JSON input
11:50:02 Jan 18 11:50:02.599: INFO: Error getting container info on "gce-soak-weekly-1-1-minion-v8hi", err: unexpected end of JSON input
11:50:03 Jan 18 11:50:03.604: INFO: Error getting container info on "gce-soak-weekly-1-1-minion-v8hi", err: unexpected end of JSON input
@dchen1107 dchen1107 added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/node Categorizes an issue or PR as relevant to SIG Node. kind/flake Categorizes issue or PR as related to a flaky test. labels Jan 25, 2016
@yujuhong
Contributor

Similar to #19118, the node is in a bad state, potentially due to moby/moby#5618

Message from syslogd@gce-soak-weekly-1-1-minion-mg6o at Jan 25 18:05:38 ...
 kernel:[534024.040037] unregister_netdevice: waiting for lo to become free. Usage count = 2

@yujuhong
Contributor

Most tests failed while checking that all nodes are ready during test teardown.

@dchen1107
Member Author

@yujuhong Thanks for the quick triage on this one. We need to detect such failures and ignore them (I don't mean in the Kubelet itself). Re-assigning it back to me. cc/ @kubernetes/goog-gke @cjcullen

@dchen1107
Member Author

After a while, Kubelet nodes running the docker daemon repeatedly ran into a known kernel issue:

...
Message from syslogd@gce-soak-weekly-1-1-minion-ehmp at Jan 25 22:15:51 ...
 kernel:[549046.528155] unregister_netdevice: waiting for veth1942488 to become free. Usage count = 1
Message from syslogd@gce-soak-weekly-1-1-minion-ehmp at Jan 25 22:16:01 ...
 kernel:[549055.800042] unregister_netdevice: waiting for lo to become free. Usage count = 2
...

Once a node enters this state, docker hangs, as documented in moby/moby#5618. The Kubelet then reports NodeNotReady with the container runtime down. The only way to recover from this state is to reboot the node. Once we run into this situation and lose some nodes, all tests are expected to fail.

Instead of continuing our soak test in such a bad situation, we should:

  1. Detect the kernel issue. Again, I don't want to make the Kubelet itself a MachineDoctor or SoftwareDoctor, since there is no way it could do a good job across different kernel versions, OS distros, filesystems, etc. But for our test environment, I think we can run a remote script that checks whether a node is in this state due to the known kernel issue once the node is NotReady and the runtime is down (see the sketch after this list).

  2. Once the issue is detected, reboot the affected node(s) and continue the soak test.
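
A minimal sketch of what such a remote check could look like for the GCE soak nodes, assuming gcloud access to the test project; the node name, zone, and grep pattern below are illustrative placeholders, not part of any existing tooling:

#!/bin/bash
# Hypothetical soak-infra helper: given a NotReady node whose runtime is
# reported down, look for the known unregister_netdevice hang in the kernel
# log and reset the instance if it is present.
NODE="gce-soak-weekly-1-1-minion-mg6o"   # placeholder node name
ZONE="us-central1-b"                     # placeholder zone

if gcloud compute ssh "$NODE" --zone "$ZONE" --command \
    "dmesg | grep -q 'unregister_netdevice: waiting for'"; then
  echo "$NODE appears to have hit the unregister_netdevice hang; rebooting it"
  gcloud compute instances reset "$NODE" --zone "$ZONE"
fi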

Any other suggestion?

# kubectl get nodes
NAME                              LABELS                                                   STATUS     AGE
gce-soak-weekly-1-1-minion-5jo2   kubernetes.io/hostname=gce-soak-weekly-1-1-minion-5jo2   Ready      6d
gce-soak-weekly-1-1-minion-ehmp   kubernetes.io/hostname=gce-soak-weekly-1-1-minion-ehmp   NotReady   6d
gce-soak-weekly-1-1-minion-mg6o   kubernetes.io/hostname=gce-soak-weekly-1-1-minion-mg6o   NotReady   6d

# kubectl describe nodes gce-soak-weekly-1-1-minion-mg6o
Name:                   gce-soak-weekly-1-1-minion-mg6o
Labels:                 kubernetes.io/hostname=gce-soak-weekly-1-1-minion-mg6o
CreationTimestamp:      Tue, 19 Jan 2016 13:46:01 +0000
Phase:
Conditions:
  Type          Status  LastHeartbeatTime                       LastTransitionTime                      Reason                          Message
  ────          ──────  ─────────────────                       ──────────────────                      ──────                          ───────
  OutOfDisk     False   Mon, 25 Jan 2016 22:33:39 +0000         Tue, 19 Jan 2016 13:46:01 +0000         KubeletHasSufficientDisk        kubelet has sufficient disk space available
  Ready         False   Mon, 25 Jan 2016 22:33:39 +0000         Tue, 19 Jan 2016 16:24:24 +0000         KubeletNotReady                 container runtime is down
Addresses:      10.240.0.4,104.154.52.178
Capacity:
 cpu:           2
 memory:        7679608Ki
 pods:          100
System Info:
 Machine ID:
 System UUID:                   16E5EFCB-4E3D-A281-33DF-05305569F2E4
 Boot ID:                       a2e79bd1-69dc-4dd3-b53c-373519a9850a
 Kernel Version:                3.16.0-0.bpo.4-amd64
 OS Image:                      Debian GNU/Linux 7 (wheezy)
 Container Runtime Version:     docker://1.8.3
 Kubelet Version:               v1.1.5-beta.0.14+dc406b14911769
 Kube-Proxy Version:            v1.1.5-beta.0.14+dc406b14911769
PodCIDR:                        10.245.2.0/24
ExternalID:                     990883542697417546
Non-terminated Pods:            (0 in total)
  Namespace                     Name            CPU Requests    CPU Limits      Memory Requests Memory Limits
  ─────────                     ────            ────────────    ──────────      ─────────────── ─────────────
Allocated resources:
  (Total limits may be over 100%, i.e., overcommitted. More info: http://releases.k8s.io/HEAD/docs/user-guide/compute-resources.md)
  CPU Requests  CPU Limits      Memory Requests Memory Limits
  ────────────  ──────────      ─────────────── ─────────────
  0 (0%)        0 (0%)          0 (0%)          0 (0%)
No events.

cc/ @ihmccreery @ixdy

@ixdy I am done with the debugging; you can kick off the soak tests again now. Thanks!

@dchen1107 dchen1107 changed the title test flaky: kubernetes-soak-continuous-e2e-gce-1.1 timeout since Jan 18 Improve soak test infra (test flaky: kubernetes-soak-continuous-e2e-gce-1.1 timeout since Jan 18) Jan 25, 2016
@dchen1107
Member Author

@ixdy I rebooted all affected nodes, killed kubernetes-soak-continuous-e2e-gce-1.1/929/, and started build 930. Thanks!

@ghost

ghost commented Jan 26, 2016

@dchen1107 I think that it's a good idea to remotely monitor nodes and have a set of well-defined node states in which we remotely reboot the node to remedy the situation. But I don't think it's very useful to implement these for the soak tests only. On the contrary, the soak tests now give us a reasonable measure of how reliable our customers will find their nodes and clusters to be.

For GCE and AWS-hosted nodes, it should be fairly straightforward to implement these checks using the cloud providers' health check tools (#19446 provides an example for AWS using cloudwatch alarms). Further details at

For non-cloud / bare metal installs, people will need to implement some external monitoring and rebooting agent along the lines of the MachineDoctor and SoftwareDoctor that you mentioned (although I would suggest that these not be part of Kubernetes Core, but rather a separate system).
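
For illustration only, a hedged sketch of the cloud-provider route for AWS (CloudWatch alarms, in the spirit of #19446); the alarm name, instance ID, region, and thresholds are placeholders, and the built-in EC2 status checks may well not catch this particular kernel hang, so a custom metric published from the node might be needed in practice:

# Placeholder alarm: recover an instance whose system status check fails.
aws cloudwatch put-metric-alarm \
  --alarm-name soak-node-status-check \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_System \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:automate:us-east-1:ec2:recover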

I wrote a proposal/design doc on control plane resiliency in PR #19313 (still to be merged), which I could extend to cover node resiliency.

@ikehz
Contributor

ikehz commented Jan 26, 2016

Once we run into this situation and lose some nodes, all tests are expected to fail.

This seems dubious to me; Kubernetes is supposed to be a self-healing system, and tests should (kind of) mirror real-world production usage, especially on the "soak" side of things.

Instead of continuing our soak test in such a bad situation, we should:

  1. Detect the kernel issue. Again, I don't want to make the Kubelet itself a MachineDoctor or SoftwareDoctor, since there is no way it could do a good job across different kernel versions, OS distros, filesystems, etc. But for our test environment, I think we can run a remote script that checks whether a node is in this state due to the known kernel issue once the node is NotReady and the runtime is down.

  2. Once the issue is detected, reboot the affected node(s) and continue the soak test.

I agree that the Kubelet shouldn't be a MachineDoctor or SoftwareDoctor, but it's good for us to know that Kubernetes (our product) is relying on a system that isn't stable enough for us to have soak tests passing. Regardless of whether or not we're building the OS we rely on, we're shipping a product that relies on it, and its instability is something we should care about.

@ghost

ghost commented Jan 26, 2016

@ihmccreery I think that my proposal above addresses your (very valid) concerns?

@ikehz
Contributor

ikehz commented Jan 26, 2016

Yup, SGTM. I had written that before I saw your post; looks like we were thinking along similar lines.

@ikehz
Contributor

ikehz commented Jan 26, 2016

I'm removing this from the flake list and adding it as P0 for v1.2 per our discussion this morning.

@ikehz ikehz added this to the v1.2 milestone Jan 26, 2016
@ikehz ikehz added team/control-plane and removed kind/flake Categorizes issue or PR as related to a flaky test. labels Jan 26, 2016
@ghost

ghost commented Jan 26, 2016

@ihmccreery I wasn't in that discussion, but just to be clear, is the plan to improve the soak test infra (as per the title of this issue), or to improve remote monitoring and rebooting of nodes (as per my comment above)? If the latter, I'd suggest changing the title accordingly.

@dchen1107
Member Author

@quinton-hoole This morning I mentioned both 1) improving the soak test infra in the short term and 2) improving remote monitoring and remediation over the longer term. This issue is devoted to the test infra. We should talk more about 2), especially for GKE / GCE nodes.

@ikehz
Contributor

ikehz commented Jan 26, 2016

@dchen1107 What test infra changes do you want to see? I'm skeptical that putting engineering hours toward soak test infra is time well spent; I'd rather put that time into making sure the production system acts as we expect.

@dchen1107
Member Author

cc/ @timothysc

@jdef
Contributor

jdef commented Jan 27, 2016

@timothysc
Member

@dchen1107 looks like we may have fixed it a while back https://bugzilla.redhat.com/show_bug.cgi?id=880394#c7

/cc @jeremyeder

@ihmccreery - Also, how can we insert our stack into this picture for cross coverage? I know the ball was dropped a while back, but I'll weekend warrior it to get it done.

@j3ffml
Contributor

j3ffml commented Feb 1, 2016

Potentially related: GKE soak tests started timing out last Friday with node readiness problems

12:52:44 Feb  1 12:52:44.704: INFO: Condition Ready of node gke-jenkins-gke-soak-weekly-12f27965-node-54ba is false instead of true. Reason: KubeletNotReady, message: ConfigureCBR0 requested, but PodCIDR not set. Will not configure CBR0 right now

See kubernetes-soak-continuous-e2e-gke

@davidopp
Member

davidopp commented Feb 4, 2016

I'm going to temporarily stick a "flake" label on this even though it's not a flake, so that we will remember to discuss it more widely.

@aeneaswiener

It looks like PR #21326 (Put the container bridge in promiscuous mode.) is causing issue #25793 (Duplication of packets seen in bridge-promiscuous mode when accessing a pod on the same node through service ip.)

We are affected by this issue of duplicated UDP packets (running on GCE). Is there a workaround now, or will there be one in the future?

@bprashanth
Contributor

The kernel fix, as stated in the issue referenced. I can't comment on an ETA but we're working on it.
Or you can flip to hairpin mode and never try to access yourself through your own service.
Or you can use the userspace kube-proxy, which is a lot slower than the default iptables kube-proxy.

We're bound by the limitations of what we can realistically achieve without heading down a rabbit hole, because the bug is several layers below the orchestration framework.
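
For reference, a sketch of how the two workarounds above map onto flags; these are the standard kubelet and kube-proxy options, but verify the exact values against your release:

# Flip the kubelet's hairpin handling away from promiscuous-bridge
# (other kubelet flags unchanged):
kubelet --hairpin-mode=hairpin-veth

# Or fall back to the slower userspace proxier instead of iptables:
kube-proxy --proxy-mode=userspace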

@fxposter

fxposter commented Jun 1, 2016

@bprashanth are you talking about the original race condition bug (described in moby/moby#5618 and https://bugzilla.kernel.org/show_bug.cgi?id=81211) or about the "duplication of packets" bug? I.e., which one are you actually trying to fix?

@sethp-jive

sethp-jive commented Jun 1, 2016

@dcbw I don't know if you're still looking for a reproduction case, but I've had success (on the order of a coffee break between setup and failure) with the Kubernetes Job mentioned at the top of moby/moby#23175.

It's our own kubernetes cluster, but AFAICT the node is a relatively stock CoreOS 1053.2.0 running kernel version 4.6.0-coreos. I'm not sure what precisely about the Kubernetes Job use-pattern makes the issue easier to reproduce (maybe it's just good at creating new containers in response to the exit 2?), but I'm happy to share any information about that node you might find helpful.

Edit: I'm sorry, I forgot to mention that we're on Kubernetes 1.2.3. I'm not sure if this is the right place for it, but the kubelet's response is still pretty bad: it gets hung waiting for the never-to-return /containers/json. I've got a bunch of goroutine stacks from the stuck kubelet if that's helpful, but the short version is that it looks like the PLEG, ImageManager, Garbage Collector, and main sync loop are all hung, preventing the kubelet from picking up any of the work being assigned to it.
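
For anyone else looking to reproduce, here is a rough sketch of the kind of Job use-pattern described above. This is not the exact manifest from moby/moby#23175; the name, image, and parallelism are made up, and the point is simply to keep churning pods whose containers exit non-zero:

cat <<'EOF' | kubectl create -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: netdev-churn            # hypothetical name
spec:
  parallelism: 10
  template:
    spec:
      restartPolicy: Never      # failed pods get replaced, churning pod network namespaces
      containers:
      - name: churn
        image: busybox
        command: ["sh", "-c", "sleep 5; exit 2"]
EOF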

@bprashanth
Contributor

the kernel bug

@obeattie

obeattie commented Jun 2, 2016

Or you can flip to hairpin mode

@bprashanth Could you explain a bit more about what you mean here? Is there a hairpin mode that avoids this? It was my understanding that this bug occurs with both hairpin-veth and promiscuous-bridge. Am I wrong, or do you mean disabling the hairpin altogether fixes the issue?

@bprashanth
Contributor

This bug won't happen with a promiscuous-bridge instead of hairpin-veth, or at least, we ran into it consistently and then never again on the stock GCE debians.
The duplicate ack problem won't happen with hairpin-veth.
Neither will happen if you use neither (i.e. the userspace kube-proxy).

@guoshimin
Contributor

guoshimin commented Jun 6, 2016

If we are using flannel, my understanding is that setting hairpin-mode to promiscuous-bridge will do nothing since the docker bridge isn't managed by kubelet. What's our best bet then to avoid the hung docker problem?

@bprashanth
Contributor

Hairpin mode is required for services to function properly; the kubelet will set hairpin mode on bridges regardless of --configure-cbr0. It's a per-veth setting. You can check whether it's set as in #20475 (comment), for example:
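
A check along these lines on the node itself (1 means hairpin is on for that veth; the sysfs path may vary by setup):

for f in /sys/devices/virtual/net/veth*/brport/hairpin_mode; do
  echo "$f: $(cat "$f")"
done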

titanous added a commit to flynn/flynn that referenced this issue May 4, 2017
This is a workaround for the unregister_netdevice kernel hang that
can occur when starting containers.

See these issues for more details:

moby/moby#5618
kubernetes/kubernetes#20096

Signed-off-by: Jonathan Rudenberg <[email protected]>
titanous added a commit to flynn/flynn that referenced this issue May 5, 2017
This is a workaround for the unregister_netdevice kernel hang that
can occur when starting containers.

See these issues for more details:

moby/moby#5618
kubernetes/kubernetes#20096

Signed-off-by: Jonathan Rudenberg <[email protected]>
@databus23
Contributor

We are still seeing unregister_netdevice: waiting for lo to become free. Usage count = 1 kernel messages after switching our kubelets to --network-plugin=kubenet and --hairpin-mode=promiscuous-bridge.

I also confirmed that the veth devices have hairpin_mode set to 0 using

cat /sys/devices/virtual/net/*/brport/hairpin_mode

We are running on baremetal using CoreOS/Container Linux 1353.7.0 (kernel 4.9.24)

@nailgun

nailgun commented Nov 16, 2017

I have this issue too on GKE.

Node version 1.8.1-gke.1
Node image Container-Optimized OS (cos)

kubectl describe node shows this event:
Warning UnregisterNetDevice 3s (x14245 over 1d) kernel-monitor, gke-c1-cpu8-mem52-e7a9e3b3-3w22 unregister_netdevice: waiting for eth0 to become free. Usage count = 21

And I can't terminate any pods on this node. Just migrated to GKE and this instability terrifies me.

Please reopen the issue.

@Random-Liu
Member

Random-Liu commented Nov 16, 2017

/cc @yujuhong @dchen1107 /cc @kubernetes/sig-node-bugs

@nailgun The events are generated by NPD (the node-problem-detector) by parsing the kernel log.

unregister_netdevice: waiting for eth0 to become free. Usage count = 21 was a symptom of a kernel deadlock, but I've heard that it doesn't necessarily mean a kernel deadlock now. However, x14245 over 1d seems too frequent to me...

There is a KernelDeadlock node condition. Is that True or False for you?
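
For example, something like this should show that condition's status (node name taken from the report above; standard kubectl jsonpath filtering):

kubectl get node gke-c1-cpu8-mem52-e7a9e3b3-3w22 \
  -o jsonpath='{.status.conditions[?(@.type=="KernelDeadlock")].status}'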

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 16, 2017
@nailgun

nailgun commented Nov 16, 2017

@Random-Liu There is no kernel deadlock AFAIK.

These lines are from the node's systemd journal:

Nov 14 00:02:08 gke-c1-cpu8-mem52-e7a9e3b3-th0t kernel: unregister_netdevice: waiting for eth0 to become free. Usage count = 3
Nov 14 00:02:08 gke-c1-cpu8-mem52-e7a9e3b3-th0t node-problem-detector[1339]: I1114 00:02:08.563574    1339 log_monitor.go:123] New status generated: &{Source:kernel-monitor Events:[{Severity:warn Timestamp:2017-11-14 00:02:08.389022 +0000 UTC Reason:UnregisterNetDevice Message:unregister_netdevice: waiting for eth0 to become free. Usage count = 3}] Conditions:[{Type:KernelDeadlock Status:false Transition:2017-11-07 13:48:32.555190709 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}]}

But once this event first appears, the kubelet becomes completely unresponsive.
