Missing routes with many nodes on vxlan #958
Comments
Looks like this might be related to #779.
We already tried the proposed solution, but it didn't work. IIUC, we have to change
Related to "get/set receive buffer size" in netlink: vishvananda/netlink@ef84ebb
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This bug is still happening, so I left this open. There is a workaround to avoid it. We'll update the docs with it until it's fixed.
We detected a similar issue. Some routes were missing after a network outage of a few hours. I would expect flannel to reconcile these routes. Am I right to expect this?
Which version of flannel are you using? Maybe your issue is not directly related to this. This issue was related to missing rules when flannel starts with multiple nodes. In your case it seems that the rules were somehow removed and weren't recreated.
I'm using v0.17.0 shipped with RKE1. I understand this is a rather old version and my issue may have been fixed in the meantime. It looks like Rancher is still shipping this version with the latest releases of RKE1.
Expected Behavior
When adding a new instance, it should always add all routes to all existing instances.
Current Behavior
When there are many nodes (> 30), there are occasionally missing routes between some instances. For each missing route we get this error:
A missing route means that one entry like this is missing from ip route:
Possible Solution
Besides fixing the underlying problem in the OS or network settings, it might be a good idea to retry such operations, or even to let flannel fail completely (see Context).
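To illustrate the retry idea, here is a minimal sketch in Python. It is not flannel's code (flannel is written in Go) and only shows the general pattern: retry a failing operation with exponential backoff, then fail loudly if it never succeeds. The helper name `retry`, its parameters, and the choice of `OSError` as the retryable error are all assumptions made for this example.

```python
import time


def retry(operation, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    The delay doubles after each failure (exponential backoff). The last
    exception is re-raised so the caller can fail completely instead of
    continuing with a partially configured network.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError:
            if attempt == attempts:
                raise  # give up and surface the failure to the caller
            sleep(delay)
            delay *= 2
```

A caller would wrap whatever step adds the route or FDB entry, for example `retry(lambda: add_fdb_entry(entry))`, where `add_fdb_entry` is a hypothetical function standing in for the real operation.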
Steps to Reproduce
Run journalctl -u flanneld | grep AddFDB on each instance and see some errors. There are around 4 missing routes at that scale.
systemd unit
logs
We tried to adjust some sysctl settings, but none of them worked:
Context
We are scaling our Kubernetes cluster inside an AWS ASG. When adding new nodes, we rely on a working network. Even a completely non-working network would be better than a network with some routes missing, because the cluster can behave flakily in rare cases and it is not evident where this comes from. For example, we had DNS problems: a very small subset of our applications had a high error rate when resolving domain names, and for a long time we didn't know why. Now we know that this was caused by a missing route between the instance running the faulty application and the instance running the DNS server.
Currently we need to manually grep the journal logs and replace broken instances, because it is hard to figure out automatically whether a route is missing.
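One way such a check could be automated is sketched below in Python. It compares a list of subnets that should be routed against the text printed by ip route and reports the ones that are absent. This is only a sketch under assumptions: the function names are made up for this example, and the caller must supply the expected subnets itself (for instance, from the list of subnets assigned to the other nodes).

```python
import ipaddress


def routed_subnets(ip_route_output):
    """Return the set of destination networks found in `ip route` output."""
    subnets = set()
    for line in ip_route_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        try:
            # The first field of a route line is normally the destination.
            # Lines whose first field is not a network (e.g. "default")
            # are skipped.
            subnets.add(ipaddress.ip_network(fields[0], strict=False))
        except ValueError:
            continue
    return subnets


def missing_routes(expected_subnets, ip_route_output):
    """Return the expected subnets (as strings) that have no route entry."""
    present = routed_subnets(ip_route_output)
    missing = [
        net
        for net in (ipaddress.ip_network(s) for s in expected_subnets)
        if net not in present
    ]
    return [str(net) for net in sorted(missing)]
```

On a node, the route text could be obtained with `subprocess.run(["ip", "route"], capture_output=True, text=True, check=True).stdout`. A non-empty result from `missing_routes` would then mark the instance as broken without grepping the journal by hand.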
Your Environment
Flannel version: v0.9.0 and v0.10.0
Backend used: vxlan (with and without DirectRouting)
Etcd version: 3.2.11
Kubernetes version: v1.8.5+coreos.0
Operating System and version: Container Linux by CoreOS 1576.5.0 (Ladybug)