
felix connection to etcdv3 api stucks #1632

Closed
r7vme opened this issue Jan 23, 2018 · 13 comments · Fixed by projectcalico/libcalico-go#777

@r7vme
Contributor

r7vme commented Jan 23, 2018

We are running Calico 3.0.1 and experience this issue from time to time. At some point one of the nodes becomes "broken" and all new pods on it have no networking.

What happens:

  1. felix's watch connection(s) to etcd get stuck (this time it was triggered by a hard reboot/reinstall of an etcd node)
  2. a user creates a new pod
  3. the CNI plugin adds the route and link
  4. felix sees a new, unknown route on the interface
  5. after the grace period, felix deletes the new route as unknown

Expected Behavior

Felix is healthy and does not delete pod routes.

Current Behavior

Felix deletes new pods' routes as unknown.

Possible Solution

As a workaround I just ran kill <felix pid> inside the broken calico-node. Beforehand I spent an hour double-checking BGP, bird, and the IPIP tunnels, so I can say with 100% confidence that the issue is localized to the felix <-> etcd connection.

As a solution, I think there could be some heartbeat check (above the TCP level) that makes sure the watch connection is still alive. This may even be a bug in the etcd client driver.

Steps to Reproduce (for bugs)

See above, but there is no reliable way to reproduce this bug; we have hit it at least 3 times, and each time the workaround was to restart the broken node.

Context

We are using the following etcd endpoints in the Calico configuration: https://10.0.5.101:2379,https://10.0.5.102:2379,https://10.0.5.103:2379.

calico-node logs when a new pod is created and its route is then removed:

2018-01-23 13:11:48.552 [INFO][102] ipsets.go 253: Resyncing ipsets with dataplane. family="inet"
2018-01-23 13:11:48.552 [INFO][102] route_table.go 442: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.189.183/32 ifaceName="cali5129b6bd77a" ipVersion=0x4
2018-01-23 13:11:48.552 [INFO][102] route_table.go 298: Interface in cleanup grace period, will retry after. ifaceName="cali5129b6bd77a" ipVersion=0x4
...
2018-01-23 13:11:59.503 [INFO][102] route_table.go 446: Syncing routes: removing old route. dest=192.168.189.183/32 ifaceName="cali5129b6bd77a" ipVersion=0x4
2018-01-23 13:11:59.503 [INFO][102] conntrack.go 78: Removing conntrack flows ip=192.168.189.183

netstat -nptu | grep 2379 on the broken node:

tcp        0      0 127.0.0.1:54408         127.0.0.1:2379          ESTABLISHED -
tcp        0      0 127.0.0.1:53158         127.0.0.1:2379          ESTABLISHED -
tcp        0      0 127.0.0.1:53496         127.0.0.1:2379          ESTABLISHED -
tcp        0      0 10.0.5.105:38372        10.0.5.103:2379         ESTABLISHED -
tcp        0      0 10.0.5.105:42572        10.0.5.101:2379         ESTABLISHED -
tcp        0      0 10.0.5.105:39964        10.0.5.101:2379         ESTABLISHED -
tcp        0      0 127.0.0.1:57264         127.0.0.1:2379          ESTABLISHED -
tcp        0      0 10.0.5.105:37464        10.0.5.101:2379         ESTABLISHED -
tcp        0      0 10.0.5.105:42420        10.0.5.103:2379         ESTABLISHED 102/calico-felix
tcp        0      0 10.0.5.105:40714        10.0.5.101:2379         ESTABLISHED 99/confd
tcp        0      0 127.0.0.1:53498         127.0.0.1:2379          ESTABLISHED -
tcp        0      0 10.0.5.105:51858        10.0.5.102:2379         ESTABLISHED -
tcp        0      0 10.0.5.105:42426        10.0.5.103:2379         ESTABLISHED 102/calico-felix
tcp        0      0 ::ffff:127.0.0.1:2379   ::ffff:127.0.0.1:57264  ESTABLISHED -
tcp        0      0 ::ffff:127.0.0.1:2379   ::ffff:127.0.0.1:54408  ESTABLISHED -
tcp        0      0 ::ffff:127.0.0.1:2379   ::ffff:127.0.0.1:53496  ESTABLISHED -
tcp        0      0 ::ffff:127.0.0.1:2379   ::ffff:127.0.0.1:53158  ESTABLISHED -
tcp        0      0 ::ffff:127.0.0.1:2379   ::ffff:127.0.0.1:53498  ESTABLISHED -

BGP peers and IPIP tunnels were healthy.

Also, before the node became "broken", I saw this from bird (10.0.5.103 is the master/etcd node that was reinstalled):

bird: Mesh_10_0_5_103: Error: Hold timer expired
bird: Mesh_10_0_5_103: State changed to stop
bird: Mesh_10_0_5_103: State changed to down
bird: Mesh_10_0_5_103: Starting
bird: Mesh_10_0_5_103: State changed to start

Your Environment

  • Calico version: 3.0.1
  • Kubernetes version: 1.7.11
  • Operating System and version: CoreOS 1576.4.0
@caseydavenport
Member

@r7vme thanks for raising. I think the root cause of this is in the etcdv3 client.

I've got a WIP PR which should hopefully fix this here: projectcalico/libcalico-go#668

However, we need to wait until the etcd v3.3 client code is available, since that is the first release that includes gRPC keepalive support.

@r7vme
Contributor Author

r7vme commented Jan 24, 2018

ack. thanks for the info.

@fasaxc
Member

fasaxc commented Jan 25, 2018

@caseydavenport Do you know if the server needs to be upgraded to get keep-alive support? If so, we might want to consider a workaround :-/

For example, reset a watchdog timer every time we get an event from the server and then pro-actively restart the connection if we don't see any events for 90s.
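
A minimal sketch of that watchdog idea in Go; the events channel and restartWatch callback here are hypothetical stand-ins, not Felix's actual API:

package watchdog

import "time"

const watchTimeout = 90 * time.Second

// RunWatchdog restarts the watch whenever no event has arrived within
// watchTimeout. events is whatever channel delivers watch updates;
// restartWatch tears down and re-establishes the etcd watch connection.
func RunWatchdog(events <-chan struct{}, restartWatch func()) {
    timer := time.NewTimer(watchTimeout)
    defer timer.Stop()
    for {
        select {
        case _, ok := <-events:
            if !ok {
                return // event channel closed; caller decides what to do
            }
            // Got an event, so the connection is alive: reset the timer.
            if !timer.Stop() {
                <-timer.C
            }
            timer.Reset(watchTimeout)
        case <-timer.C:
            // No events for 90s: assume the watch connection is stuck and
            // proactively re-establish it.
            restartWatch()
            timer.Reset(watchTimeout)
        }
    }
}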

@caseydavenport
Member

@fasaxc that's a good point, we very well might need an upgraded server as well. We'll need to investigate.

If so, we'll need to decide whether to:

  • write some workaround ourselves, or
  • notify users that etcd v3.3+ is required.

@robbrockbank
Contributor

As an addition to the 90s timeout workaround, would it be worth considering a lightweight load generator that just keeps the watchers "active" so that we don't need to restart them - perhaps a particular filtered-out instance of each resource type, so that we don't actually generate unnecessary churn in Felix? Probably overkill, but throwing it out there.

@fasaxc
Member

fasaxc commented Jan 30, 2018

@robbrockbank Do we know if they're all multiplexed onto one TCP session? If so, refreshing the clusterinfo or something every 10s would do it.

@r7vme
Contributor Author

r7vme commented Jan 31, 2018

FYI: we are seeing this issue very frequently in a test environment that has lots of node (etcd) reboots/crashes.

I would love to see a workaround that does not require etcd 3.3.

@robbrockbank
Contributor

@fasaxc: My understanding is that they are multiplexed onto a single connection with etcdv3, so doing this keepalive on a single resource type should handle the dropped-connection issue.

That said, there are a couple of reasons why doing a keepalive for each resource type might be preferable:

  • We'll have a recent revision for each resource type, which lets us restart a watcher from the revision of the last received update and so reduces the chance of needing a full resync for that resource type (sketched after this comment).
  • When using the Kubernetes API as the datastore, Kubernetes automatically closes a CRD watcher after 1 minute of inactivity, presumably to work around this problem. This would help keep all CRD watchers active.

If we do want to do a keepalive, perhaps refreshing a WEP (WorkloadEndpoint) resource would be better, since that is probably the one that would benefit most from not requiring a full resync.
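
For illustration, a rough sketch of resuming an etcdv3 watch from the last seen revision using the upstream clientv3 API; this is not libcalico-go's actual syncer code, and the prefix/revision bookkeeping is simplified:

package etcdwatch

import (
    "context"
    "log"

    "github.com/coreos/etcd/clientv3"
)

// WatchFromLastRev re-establishes a watch on prefix, resuming just after the
// revision of the last update we processed so that a full re-list/resync is
// usually unnecessary. It returns the latest revision seen so the caller can
// resume again if the watch drops.
func WatchFromLastRev(ctx context.Context, cli *clientv3.Client, prefix string, lastRev int64) int64 {
    wch := cli.Watch(ctx, prefix,
        clientv3.WithPrefix(),
        clientv3.WithRev(lastRev+1), // resume just after the last seen revision
    )
    for wresp := range wch {
        if err := wresp.Err(); err != nil {
            // e.g. the revision has been compacted away; at this point a
            // full resync of this resource type would be needed.
            log.Printf("watch error: %v", err)
            return lastRev
        }
        for _, ev := range wresp.Events {
            lastRev = ev.Kv.ModRevision
            log.Printf("event %s %q rev=%d", ev.Type, ev.Kv.Key, lastRev)
        }
    }
    return lastRev
}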

@caseydavenport
Member

would it be worth considering implementing a lightweight load generator that just keeps the watchers "active" so that we don't need to restart them

I think we should avoid this - IIUC it either means writing a new component or making Felix write data to etcd, which changes its "read-only" behavior.

@caseydavenport
Member

OK, so I've managed to reproduce the issue where etcd connections remain in the ESTABLISHED state after etcd has died (by doing a GCE instance reset): when the etcd instance comes back, the TCP connections are still ESTABLISHED.

By the way, this looks to be the same as this Kubernetes issue: kubernetes/kubernetes#46964

The fix for that is the same as what I had originally guessed the fix here would need to be, and was merged in this PR: kubernetes/kubernetes#58008

I've got the Calico equivalent of that here: projectcalico/libcalico-go#777

Running a build with that patch, I see the connections become properly re-established after I reset my etcd instance, so I think that fixes it. I've also tried it against etcd server v3.1.10 and it behaves the same, so it looks like this will work without requiring an etcd upgrade, which is great.

What I don't see is any indication of the failure in Calico's logging - it seems the client handles the keepalive timeouts and reconnection under the covers without bubbling anything up to the calling code. That's unfortunate, but pretty minor in comparison, so let's move ahead with getting the keepalive patch into a bugfix release.
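
For context, the shape of that fix is to enable gRPC keepalives when constructing the etcdv3 client, along the lines of the sketch below; the durations are illustrative rather than the exact values used in projectcalico/libcalico-go#777:

package etcdclient

import (
    "time"

    "github.com/coreos/etcd/clientv3"
)

// NewClient builds an etcdv3 client with gRPC keepalives enabled so that a
// silently dead TCP connection (e.g. after a hard reboot of the etcd node)
// is detected and re-dialled instead of staying ESTABLISHED forever.
func NewClient(endpoints []string) (*clientv3.Client, error) {
    return clientv3.New(clientv3.Config{
        Endpoints:   endpoints,
        DialTimeout: 10 * time.Second,
        // Ping the server every 30s; if no response arrives within 10s,
        // consider the connection dead and reconnect.
        DialKeepAliveTime:    30 * time.Second,
        DialKeepAliveTimeout: 10 * time.Second,
    })
}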

@caseydavenport
Member

The fix for this is out in Calico v3.0.2 - please try it out and let me know if you hit any issues!

@r7vme
Contributor Author

r7vme commented Feb 13, 2018

Big thanks, we'll let you know.

@tungdam

tungdam commented Dec 23, 2021

We faced the same situation today on 3.5.1, due to a firewall misconfiguration that blocked traffic from Calico to etcd:

2021-12-22 11:48:09.358 [INFO][67] route_table.go 446: Syncing routes: removing old route. dest=10.198.192.69/32 ifaceName="calic0df348bad5" ipVersion=0x4
2021-12-22 11:48:09.358 [INFO][67] conntrack.go 78: Removing conntrack flows ip=10.198.192.69
......
2021-12-22 11:50:18.669 [INFO][67] route_table.go 446: Syncing routes: removing old route. dest=10.198.192.87/32 ifaceName="calid1bd716e6e4" ipVersion=0x4
2021-12-22 11:50:18.669 [INFO][67] conntrack.go 78: Removing conntrack flows ip=10.198.192.87
....

We had to restart all the calico pods. @caseydavenport do you have any suggestions about the correct version we should use to get rid of this completely?
Thanks
