arp_cache: neighbor table overflow! #4533
Using a hook, sure! See https://github.com/kubernetes/kops/blob/master/docs/cluster_spec.md for docs on hooks.
Thanks @chrislovecnm, but don't you think the default max ARP table size should be larger? My cluster isn't even that big: it has ~60 nodes and roughly 2000-3000 pods, I guess. What would be the downside of permitting a larger ARP table?
I would check with sig-node. Really a kernel question.
So I think we want to keep gc_thresh1 at 0: kubernetes/kubernetes#23395. I don't see a lot of problems in raising gc_thresh2 and gc_thresh3, but I'm not sure whether we should do this across the board automatically. I think it probably depends on your networking mode: modes that tunnel traffic will only use a single ARP entry per node, whereas modes that don't tunnel will use an ARP entry per pod. But I'm not really sure. Which network mode are you using @felipejfc?
@justinsb you must be right, I use Calico.
For future reference if someone needs it, I've used the following hook:
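The exact hook wasn't captured above, but a minimal sketch, assuming the systemd-unit hook form described in the kops cluster_spec docs linked earlier, with purely illustrative threshold values, would look something like this:

```yaml
spec:
  hooks:
  - name: increase-neigh-gc-thresh.service
    roles:
    - Node
    - Master
    manifest: |
      # Raise the kernel's neighbor-table GC thresholds before workloads start.
      # The values 28000/32000 are illustrative, not a recommendation.
      Type=oneshot
      ExecStart=/sbin/sysctl -w net.ipv4.neigh.default.gc_thresh2=28000
      ExecStart=/sbin/sysctl -w net.ipv4.neigh.default.gc_thresh3=32000
```

With Type=oneshot, systemd runs each ExecStart line in order, so both thresholds are raised before the unit completes.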
This was causing real damage to my cluster; I saw big performance improvements in several services after increasing gc_thresh3 and gc_thresh2.
@caseydavenport any comments?
Seems sensible to me. Using Calico, each node will have an ARP entry for each pod running on that node, so if you've got high pod density / pod churn, adjusting makes sense.
I think it's probably bridged vs. not bridged that makes the difference here. If all the pods are on a bridge, the host will only need a single ARP entry, but for routed pods the host will need an ARP entry for each.
@caseydavenport I guess every node will also have an ARP entry for each of the pods running on other nodes as well, right? At least the ones they communicate with?
No, it shouldn't have one for every pod, because the nodes themselves are the next hops for traffic, not individual pod IPs. Instead, you'll get an ARP entry for each node in the cluster. So a given node's ARP cache should roughly be (number of nodes in the cluster) + (number of pods running on that node).
@caseydavenport is it possible that Calico never cleans nodes that were deleted from the cluster out of the ARP table? I'm using AWS and this seems to be the case, take a look:
I use cluster autoscaler and nodes are being started and deleted all the time. This is the output from a machine that's only 4 hours old, and it has 1174 entries in its ARP table despite my cluster only having around 60 nodes... and there are these IPs that seem to belong to brokers that are no longer alive and stay in the INCOMPLETE state.
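The actual ARP table output wasn't preserved in this copy of the thread, but checks along these lines (assuming standard iproute2 and procps tooling on the node) surface the symptom being described: the total entry count and the number of unresolvable INCOMPLETE entries.

```bash
# Total IPv4 neighbor (ARP) table entries on this node
ip -4 neigh show | wc -l

# Entries stuck in INCOMPLETE state (next hops that no longer exist)
ip -4 neigh show | grep -c INCOMPLETE

# Current garbage-collection thresholds for the neighbor table
sysctl net.ipv4.neigh.default.gc_thresh1 \
       net.ipv4.neigh.default.gc_thresh2 \
       net.ipv4.neigh.default.gc_thresh3
```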
@felipejfc while Calico isn't responsible for modifying the ARP table directly, I suspect this is a result of the same root cause as this issue: #3224. Basically, Calico node configuration isn't getting cleaned up when nodes go away, so Calico will continue to try to reach those nodes and thus will create a bunch of ARP entries which it can't complete (since the nodes are no longer there). Adding the node controller to kops should fix this as well.
…CALICO_K8S_NODE_REF in calico-node, this commit fixes kubernetes#3224 and kubernetes#4533
The kernel should be expiring stale ARP entries. It seems like the bug referenced in https://forums.aws.amazon.com/thread.jspa?messageID=572171 wasn't actually forwarded on to the kernel devs? I think the ideal kernel behaviour would be to GC entries down to the minimum, but to GC beyond that minimum for stale entries.
Worth noting: if you're seeing the arp_cache: neighbor table overflow! message, perf issues abound when this happens because the neighbour table is locked while the synchronous GC is performed. As such, you'll definitely want to ensure that gc_thresh3 stays well above the number of neighbor entries your nodes actually need.
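For reference, a hypothetical /etc/sysctl.d drop-in, with illustrative values and the kernel's (roughly stated) semantics for each threshold as comments:

```bash
# /etc/sysctl.d/99-neigh.conf (hypothetical; values are illustrative)

# gc_thresh1: entries below this count are never garbage collected.
# Kept at 0, per kubernetes/kubernetes#23395 above, so stale entries
# can always be reclaimed.
net.ipv4.neigh.default.gc_thresh1 = 0

# gc_thresh2: soft maximum; above this, entries older than ~5 seconds
# become eligible for garbage collection.
net.ipv4.neigh.default.gc_thresh2 = 28000

# gc_thresh3: hard maximum; once the table is full, allocating a new
# entry forces the synchronous GC (and logs "neighbor table overflow!").
net.ipv4.neigh.default.gc_thresh3 = 32000
```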
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Can this be closed now? |
sure |
What kops version are you running? The command kops version will display this information.
1.8
What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
1.8.6
What cloud provider are you using?
aws
I'm seeing a lot of arp_cache table overflow messages in my production cluster. Reading this blog post about large clusters, https://blog.openai.com/scaling-kubernetes-to-2500-nodes/, they say the solution is to increase the maximum size of the ARP cache table. Can I configure these sysctl options:
net.ipv4.neigh.default.gc_thresh1
net.ipv4.neigh.default.gc_thresh2
net.ipv4.neigh.default.gc_thresh3
using kops?
thanks!