Mitigate impact of unregister_netdevice kernel race #20096
Similar to #19118, the node is in a bad state, potentially due to moby/moby#5618
Most tests failed while checking whether all nodes were ready during test teardown.
After a while, nodes running the kubelet and docker daemon repeatedly hit a known kernel issue:
Once a node gets into this state, docker hangs, as documented in moby/moby#5618. The kubelet detects that the node is NotReady and that the container runtime is down. The only way to recover from this state is to reboot the node. Once we hit this situation and lose some nodes, all tests are expected to fail. Instead of continuing our soak tests in such a bad state, we should
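(A node stuck in this state is usually identifiable from the kernel log; below is a minimal detection sketch. The log pattern is the one reported in moby/moby#5618, and the reboot step is only illustrative, not part of any existing tooling.)

```bash
#!/usr/bin/env bash
# Minimal sketch: detect the unregister_netdevice hang from the kernel log.
# The pattern matches the message reported in moby/moby#5618.
if dmesg | grep -q 'unregister_netdevice: waiting for .* to become free'; then
  echo "Node appears stuck in the unregister_netdevice race; a reboot is the only known recovery."
  # e.g. reboot out of band, or: sudo systemctl reboot
fi
```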
Any other suggestions?
cc/ @ihmccreery @ixdy I'm done with the debugging; you can kick off the soak tests again now. Thanks!
@ixdy I rebooted all affected nodes, killed kubernetes-soak-continuous-e2e-gce-1.1/929/, and started build 930. Thanks!
@dchen1107 I think that it's a good idea to remotely monitor nodes and have a set of well-defined node states in which we remotely reboot the node to remedy the situation. But I don't think it's very useful to implement these for the soak tests only. On the contrary, the soak tests now give us a reasonable measure of how reliable our customers experience their nodes and clusters to be. For GCE and AWS-hosted nodes, it should be fairly straightforward to implement these checks using the cloud providers' health check tools (#19446 provides an example for AWS using CloudWatch alarms). Further details at
For non-cloud/bare metal installs, people will need to implement some external monitoring and rebooting agent along the lines of the MachineDoctor and SoftwareDoctor that you mentioned (although I would suggest that these not be part of Kubernetes Core, but rather a separate system). I wrote a proposal/design doc on control plane resiliency in PR #19313 (still to be merged), which I could extend to cover node resiliency.
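A minimal sketch of what such an external monitoring-and-reboot agent could look like for GCE, assuming kubectl access to the cluster, node names that match instance names, a configured default zone, and an arbitrary five-minute polling interval:

```bash
#!/usr/bin/env bash
# Hypothetical "node doctor" loop: reset any NotReady node via the cloud provider.
while true; do
  for node in $(kubectl get nodes --no-headers | awk '$2 == "NotReady" {print $1}'); do
    echo "Node ${node} is NotReady; resetting it via GCE."
    gcloud compute instances reset "${node}" --quiet   # zone comes from gcloud config
  done
  sleep 300
done
```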
This seems dubious to me; Kubernetes is supposed to be a self-healing system, and tests should (kind of) mirror real-world production usage, especially in the "soak" side of things.
I agree that the Kubelet shouldn't be a MachineDoctor or SoftwareDoctor, but it's good for us to know that Kubernetes (our product) is relying on a system that isn't stable enough for us to have soak tests passing. Regardless of whether or not we're building the OS we rely on, we're shipping a product that relies on it, and its instability is something we should care about.
@ihmccreery I think that my proposal above addresses your (very valid) concerns?
Yup, SGTM. I had written that before I saw your post; looks like we were thinking along similar lines.
I'm removing this from the flake list and adding it as P0 for v1.2 per our discussion this morning.
@ihmccreery I wasn't in that discussion, but just to be clear, is the plan to improve the soak test infra (as per the title of this issue), or to improve remote monitoring and rebooting of nodes (as per my comment above)? If the latter, I'd suggest changing the title accordingly.
@quinton-hoole This morning I mentioned both 1) improving the soak test infra in the short term and 2) improving remote monitoring and remediation over a longer term. This issue is devoted to the test infra. We should talk more about 2), especially for GKE / GCE nodes.
@dchen1107 What test infra changes do you want to see? I'm skeptical that putting engineering hours toward soak test infra is time well spent; I'd rather put that time into making sure the production system acts as we expect.
cc/ @timothysc
@dchen1107 Looks like we may have fixed it a while back: https://bugzilla.redhat.com/show_bug.cgi?id=880394#c7 /cc @jeremyeder @ihmccreery - Also, how can we insert our stack into this picture for cross coverage? I know the ball was dropped a while back, but I'll weekend-warrior it to get it done.
Potentially related: GKE soak tests started timing out last Friday with node readiness problems.
I'm going to temporarily stick a "flake" label on this even though it's not a flake, so that we will remember to discuss it more widely.
It looks like PR #21326 (Put the container bridge in promiscuous mode.) is causing issue #25793 (Duplication of packets seen in bridge-promiscuous mode when accessing a pod on the same node through a service IP). We are affected by this issue of duplicated UDP packets (running on GCE). Is there a workaround, or will there be one in the future?
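One way to confirm the duplicated-UDP-packet symptom on an affected node is to capture traffic on the bridge; a rough sketch, where the bridge name (cbr0) and pod IP are placeholders for your environment:

```bash
POD_IP="10.244.1.5"   # example pod IP, replace with one from your cluster
sudo tcpdump -ni cbr0 -c 20 "udp and host ${POD_IP}"
# Back-to-back identical datagrams (same IP id and payload) suggest the
# duplication described in #25793.
```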
The kernel fix, as stated in the referenced issue. I can't comment on an ETA but we're working on it. We're bound by the limitations of what we can realistically achieve without heading down a rabbit hole, because the bug is several layers below the orchestration framework.
@bprashanth Are you talking about the original race condition bug (described in moby/moby#5618 and https://bugzilla.kernel.org/show_bug.cgi?id=81211) or about the "duplication of packets" bug? I.e., which one are you actually trying to fix?
@dcbw I don't know if you're still looking for a reproduction case, but I've had success (on the order of a coffee break between setup and failure) with the Kubernetes Job mentioned at the top of moby/moby#23175. It's our own Kubernetes cluster, but AFAICT the node is a relatively stock CoreOS 1053.2.0 running kernel version 4.6.0-coreos. I'm not sure what precisely about the Kubernetes Job use pattern makes the issue easier to reproduce (maybe it's just good at creating new containers in response to the
Edit: I'm sorry, I forgot to mention that we're on Kubernetes 1.2.3. I'm not sure if this is the right place for it, but the kubelet's response is still pretty bad, getting hung waiting for the never-to-return
the kernel bug |
@bprashanth Could you explain a bit more about what you mean here? Is there a hairpin mode that avoids this? It was my understanding that this bug occurs with both
This bug won't happen with a promiscuous-bridge instead of hairpin, or at least, we ran into it consistently and then never again on the stock GCE Debians.
If we are using flannel, my understanding is that setting hairpin-mode to promiscuous-bridge will do nothing since the docker bridge isn't managed by the kubelet. What's our best bet then to avoid the hung docker problem?
Hairpin mode is required for services to function properly; the kubelet sets hairpin mode on bridges regardless of --configure-cbr0. It's a per-veth setting. You can check whether it's set as shown in #20475 (comment), and in the sketch below.
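A minimal sketch of those checks, assuming a standard Linux bridge named cbr0 (it may be docker0 or another bridge depending on the setup):

```bash
# Per-veth hairpin setting: 1 means hairpin is enabled on that bridge port.
for port in /sys/class/net/cbr0/brif/*; do
  echo "$(basename "${port}"): hairpin_mode=$(cat "${port}/hairpin_mode")"
done

# Alternatively, check whether the bridge itself is in promiscuous mode.
ip link show cbr0 | grep -o PROMISC
```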
This is a workaround for the unregister_netdevice kernel hang that can occur when starting containers. See these issues for more details: moby/moby#5618 kubernetes/kubernetes#20096 Signed-off-by: Jonathan Rudenberg <[email protected]>
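For context, the workaround referenced in the commit message above puts the container bridge into promiscuous mode. A rough sketch of applying it manually, where the bridge name cbr0 is an assumption (it may be docker0 or another bridge in your setup):

```bash
# Put the container bridge into promiscuous mode (the workaround described above).
sudo ip link set cbr0 promisc on

# Alternatively, kubelets of this era expose a --hairpin-mode flag whose
# promiscuous-bridge value has the kubelet manage this setting itself.
```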
We are still seeing this issue. I also confirmed that the
We are running on bare metal using CoreOS/Container Linux.
I have this issue too, on GKE. Node version: 1.8.1-gke.1
And I can't terminate any pods on this node. Just migrated to GKE, and this instability terrifies me. Please reopen the issue.
/cc @yujuhong @dchen1107 /cc @kubernetes/sig-node-bugs
@nailgun The events are generated by NPD (node-problem-detector) by parsing the kernel log. There is a
@Random-Liu There is no kernel deadlock AFAIK. These lines are from the node's systemd journal:
But when this event is first seen, the kubelet becomes completely unresponsive.
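For anyone hitting this, a rough sketch of how to inspect what NPD is matching on and what it reported, assuming SSH access to the node and a placeholder node name:

```bash
# On the affected node: kernel messages in the systemd journal that NPD's
# kernel monitor would match (pattern from moby/moby#5618).
journalctl -k | grep 'unregister_netdevice: waiting for'

# From a workstation: the node conditions/events NPD surfaced.
NODE="example-node-1"   # placeholder node name, replace with your own
kubectl describe node "${NODE}" | sed -n '/Conditions:/,/Addresses:/p'
```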
kubernetes-soak-continuous-e2e-gce-1.1 has been timing out since Jan 18: