felix connection to etcdv3 api gets stuck #1632
@r7vme thanks for raising. I think the root cause of this is in the etcdv3 client. I've got a WIP PR which should hopefully fix this here: projectcalico/libcalico-go#668. However, we need to wait until the etcd v3.3 client code is available, since that is the first release that includes gRPC keepalive support.
ack. thanks for the info.
@caseydavenport DYK if the server needs to be upgraded to get keep-alive support? If so, we might want to consider a workaround :-/ For example, reset a watchdog timer every time we get an event from the server and then pro-actively restart the connection if we don't see any events for 90s.
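For illustration, the watchdog workaround floated here might look something like the sketch below (in Go, which is what Felix and libcalico-go are written in). This is not Felix code: the event channel type, the restartWatch callback, and the 90s value are assumptions made for the example.

```go
package watchdog

import (
	"context"
	"log"
	"time"
)

// watchdogTimeout mirrors the 90s figure suggested above; it is an assumption
// for this sketch, not a value taken from Felix.
const watchdogTimeout = 90 * time.Second

// runWatchWithWatchdog consumes events from a watch channel and calls
// restartWatch if the channel goes silent for longer than watchdogTimeout.
// Both the event type and the restartWatch callback are illustrative placeholders.
func runWatchWithWatchdog(ctx context.Context, events <-chan interface{}, restartWatch func()) {
	timer := time.NewTimer(watchdogTimeout)
	defer timer.Stop()
	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return
			}
			// Any event from the server is proof the connection is still alive,
			// so reset the watchdog.
			log.Printf("got event: %v", ev)
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(watchdogTimeout)
		case <-timer.C:
			// No events for 90s: assume the TCP connection is wedged and
			// proactively re-establish the watch.
			log.Print("watchdog fired, restarting watch")
			restartWatch()
			timer.Reset(watchdogTimeout)
		case <-ctx.Done():
			return
		}
	}
}
```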
@fasaxc that's a good point, we very well might need an upgraded server as well. We'll need to investigate. If so, we'll need to decide:
As an addition to the 90s timeout workaround, would it be worth considering implementing a lightweight load generator that just keeps the watchers "active" so that we don't need to restart them - perhaps having a particular filtered-out instance of each resource type (so that we don't actually generate unnecessary churn in Felix)? Probably overkill - but throwing it out there.
@robbrockbank Do we know if they're all multiplexed onto one TCP session? If so, refreshing the clusterinfo or something every 10s would do it.
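As a rough sketch of that refresh idea (which, per the later comments, was not ultimately adopted), a periodic read on a single cheap key would keep traffic flowing on the connection shared by all watchers. The helper name, the 10s interval, and the key path below are assumptions for illustration, not actual Calico code or keys.

```go
package keepwarm

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// keepConnectionWarm periodically reads one key so that the multiplexed
// gRPC/TCP connection shared by all watchers carries regular traffic.
// The function name, interval, and key path are illustrative assumptions.
func keepConnectionWarm(ctx context.Context, cli *clientv3.Client) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			readCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
			// Any read exercises the same underlying connection as the watches.
			if _, err := cli.Get(readCtx, "/calico/resources/v3/projectcalico.org/clusterinformations/default"); err != nil {
				log.Printf("keepalive read failed: %v", err)
			}
			cancel()
		case <-ctx.Done():
			return
		}
	}
}
```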
JFYI: we are noticing this issue very frequently in a test environment that has lots of node (etcd) reboots/crashes. I would love to see a workaround that does not require etcd 3.3.
@fasaxc: My understanding is that they are multiplexed onto a single connection with etcdv3, so doing this keep alive on a single resource type should handle the dropped-connection issue. That said, there are a couple of reasons, though, why doing a keep alive for each resource type might be preferable:
If we do want to do a keep alive, perhaps refreshing a WEP resource would be better, since that is probably the one that would benefit most from not requiring a full resync.
I think we should avoid this - IIUC it either means writing a new component or making Felix write data to etcd, which changes its "read-only" behavior.
Ok, so I've managed to reproduce the issue where etcd connections remain in ESTABLISHED state after etcd has died (by doing a GCE instance reset). When the etcd instance comes back, the TCP connections remain in ESTABLISHED state.

By the way, this looks to be the same as this Kubernetes issue: kubernetes/kubernetes#46964. The fix for that is the same as what I had originally guessed the fix here would need to be, and was merged in this PR: kubernetes/kubernetes#58008. I've got the Calico equivalent of that here: projectcalico/libcalico-go#777.

Running a build with that patch, I see the connections become properly re-established after I reset my etcd instance, so I think that fixes it. I've also tried this with an etcd server version of v3.1.10 and it behaves the same, so it looks like this will work without requiring an etcd upgrade, which is great.

What I don't see is any indication of the failure in Calico's logging - it seems that the client handles the keepalive timeouts and reconnection under the covers without bubbling it up to the calling code. This is unfortunate, but pretty minor in comparison, so I think let's move ahead with getting the keepalive patch into a bugfix release.
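For reference, the kind of change described in those PRs boils down to enabling gRPC keepalives when the etcd v3 client is constructed, roughly as sketched below. The timeout values are illustrative, not necessarily the ones used in projectcalico/libcalico-go#777; the endpoints are taken from the reporter's configuration later in this thread, and TLS setup is omitted for brevity.

```go
package main

import (
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		// Endpoints as reported in this issue; TLS config omitted for brevity.
		Endpoints: []string{
			"https://10.0.5.101:2379",
			"https://10.0.5.102:2379",
			"https://10.0.5.103:2379",
		},
		DialTimeout: 10 * time.Second,
		// Send a gRPC keepalive ping once the connection has been idle this long...
		DialKeepAliveTime: 30 * time.Second,
		// ...and treat the connection as dead (forcing a re-dial) if the ping
		// isn't acknowledged within this timeout.
		DialKeepAliveTimeout: 10 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
}
```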
The fix for this is out in Calico v3.0.2 - please try it out and let me know if you hit any issues!
Big thanks, we'll let you know.
We faced the same situation today on 3.5.1 due to a firewall misconfiguration that blocked traffic from Calico to etcd.
We had to restart all Calico pods. @caseydavenport Do you have any suggestion about which version we should use to get rid of this completely?
We are running Calico 3.0.1 and experience the same issue from time to time. At some point one of the nodes becomes "broken" - new pods on it get no networking.
What happens:
Expected Behavior
Felix is healthy and does not delete pod routes.
Current Behavior
Felix deletes new pods' routes as unknown.
Possible Solution
As a workaround I just ran
kill <felix pid>
inside the broken calico-node. Beforehand I spent an hour double-checking BGP, bird, and IPIP tunnels, so I can say with 100% confidence that the issue is localized to the felix <-> etcd connection. As a solution, I think there could be some heartbeat check (above the TCP heartbeat) that makes sure the watch connection is still alive. This may even be a bug in the etcd client driver.
Steps to Reproduce (for bugs)
See above, but there is no reliable way to reproduce this bug; we have hit it at least 3 times and the workaround was to restart the broken node.
Context
We are using the following etcd endpoints in the Calico configuration:
https://10.0.5.101:2379,https://10.0.5.102:2379,https://10.0.5.103:2379
calico-node logs when a new pod is created and then its route is removed:
netstat -nptu | grep 2379
on the broken node.
BGP peers and IPIP tunnels were healthy.
Also, before the node became "broken", I saw this from bird (10.0.5.103 is the master/etcd node that was reinstalled):
Your Environment