
felix connection to etcdv3 api stucks #1632

Closed
r7vme opened this issue Jan 23, 2018 · 13 comments · Fixed by projectcalico/libcalico-go#777

@r7vme
Contributor

r7vme commented Jan 23, 2018

We are running Calico 3.0.1 and experience this issue from time to time. At some point one of the nodes becomes "broken" and all new pods on it have no networking.

What happens:

  1. felix's watch connection(s) to etcd get stuck (this time it was triggered by a hard reboot/reinstall of an etcd node)
  2. a user creates a new pod
  3. the CNI plugin adds the route and link
  4. felix sees a new, unknown route on the interface
  5. after the grace period, felix deletes the new route as unknown

Expected Behavior

Felix is healthy and does not delete pod routes.

Current Behavior

Felix deletes new pods' routes as unknown.

Possible Solution

As a workaround I just ran kill <felix pid> inside the broken calico-node. Beforehand I spent an hour double-checking BGP, bird, and the IPIP tunnels, so I can say with 100% confidence that the issue is localized to the felix <-> etcd connection.

As a solution, I think there could be some heartbeat check (above the TCP level) that makes sure the watch connection is still alive. This may even be a bug in the etcd client driver.

Steps to Reproduce (for bugs)

See above, but there is no reliable way to reproduce this bug; we have hit it at least 3 times, and each time the workaround was to restart the broken node.

Context

We are using the following etcd endpoints in the Calico configuration: https://10.0.5.101:2379,https://10.0.5.102:2379,https://10.0.5.103:2379.

calico-node logs when a new pod is created and its route is then removed:

2018-01-23 13:11:48.552 [INFO][102] ipsets.go 253: Resyncing ipsets with dataplane. family="inet"
2018-01-23 13:11:48.552 [INFO][102] route_table.go 442: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.189.183/32 ifaceName="cali5129b6bd77a" ipVersion=0x4
2018-01-23 13:11:48.552 [INFO][102] route_table.go 298: Interface in cleanup grace period, will retry after. ifaceName="cali5129b6bd77a" ipVersion=0x4
...
2018-01-23 13:11:59.503 [INFO][102] route_table.go 446: Syncing routes: removing old route. dest=192.168.189.183/32 ifaceName="cali5129b6bd77a" ipVersion=0x4
2018-01-23 13:11:59.503 [INFO][102] conntrack.go 78: Removing conntrack flows ip=192.168.189.183

netstat -nptu | grep 2379 on the broken node:

tcp        0      0 127.0.0.1:54408         127.0.0.1:2379          ESTABLISHED -
tcp        0      0 127.0.0.1:53158         127.0.0.1:2379          ESTABLISHED -
tcp        0      0 127.0.0.1:53496         127.0.0.1:2379          ESTABLISHED -
tcp        0      0 10.0.5.105:38372        10.0.5.103:2379         ESTABLISHED -
tcp        0      0 10.0.5.105:42572        10.0.5.101:2379         ESTABLISHED -
tcp        0      0 10.0.5.105:39964        10.0.5.101:2379         ESTABLISHED -
tcp        0      0 127.0.0.1:57264         127.0.0.1:2379          ESTABLISHED -
tcp        0      0 10.0.5.105:37464        10.0.5.101:2379         ESTABLISHED -
tcp        0      0 10.0.5.105:42420        10.0.5.103:2379         ESTABLISHED 102/calico-felix
tcp        0      0 10.0.5.105:40714        10.0.5.101:2379         ESTABLISHED 99/confd
tcp        0      0 127.0.0.1:53498         127.0.0.1:2379          ESTABLISHED -
tcp        0      0 10.0.5.105:51858        10.0.5.102:2379         ESTABLISHED -
tcp        0      0 10.0.5.105:42426        10.0.5.103:2379         ESTABLISHED 102/calico-felix
tcp        0      0 ::ffff:127.0.0.1:2379   ::ffff:127.0.0.1:57264  ESTABLISHED -
tcp        0      0 ::ffff:127.0.0.1:2379   ::ffff:127.0.0.1:54408  ESTABLISHED -
tcp        0      0 ::ffff:127.0.0.1:2379   ::ffff:127.0.0.1:53496  ESTABLISHED -
tcp        0      0 ::ffff:127.0.0.1:2379   ::ffff:127.0.0.1:53158  ESTABLISHED -
tcp        0      0 ::ffff:127.0.0.1:2379   ::ffff:127.0.0.1:53498  ESTABLISHED -

BGP peers and IPIP tunnels were healthy.

Also, before the node became "broken", I saw this from bird (10.0.5.103 is the master/etcd node that was reinstalled):

bird: Mesh_10_0_5_103: Error: Hold timer expired
bird: Mesh_10_0_5_103: State changed to stop
bird: Mesh_10_0_5_103: State changed to down
bird: Mesh_10_0_5_103: Starting
bird: Mesh_10_0_5_103: State changed to start

Your Environment

  • Calico version: 3.0.1
  • Kubernetes version: 1.7.11
  • Operating System and version: CoreOS 1576.4.0
@caseydavenport
Member

@r7vme thanks for raising. I think the root cause of this is in the etcdv3 client.

I've got a WIP PR which should hopefully fix this here: projectcalico/libcalico-go#668

However, we need to wait until the etcd v3.3 client code is available, since that is the first release that includes gRPC keepalive support.

@r7vme
Contributor Author

r7vme commented Jan 24, 2018

ack. thanks for the info.

@fasaxc
Member

fasaxc commented Jan 25, 2018

@caseydavenport Do you know if the server needs to be upgraded to get keep-alive support? If so, we might want to consider a workaround :-/

For example, reset a watchdog timer every time we get an event from the server and then pro-actively restart the connection if we don't see any events for 90s.
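
A minimal sketch of that watchdog idea in Go; the events channel and restartWatch callback here are hypothetical stand-ins, not Felix's actual API:

package watchdog

import "time"

const watchTimeout = 90 * time.Second

// RunWatchdog restarts the watch whenever no event has arrived within
// watchTimeout. events is whatever channel delivers watch updates;
// restartWatch tears down and re-establishes the etcd watch connection.
func RunWatchdog(events <-chan struct{}, restartWatch func()) {
    timer := time.NewTimer(watchTimeout)
    defer timer.Stop()
    for {
        select {
        case _, ok := <-events:
            if !ok {
                return // event channel closed; caller decides what to do
            }
            // Got an event, so the connection is alive: reset the timer.
            if !timer.Stop() {
                <-timer.C
            }
            timer.Reset(watchTimeout)
        case <-timer.C:
            // No events for 90s: assume the watch connection is stuck and
            // proactively re-establish it.
            restartWatch()
            timer.Reset(watchTimeout)
        }
    }
}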

@caseydavenport
Member

@fasaxc that's a good point, we very well might need an upgraded server as well. We'll need to investigate.

If so, we'll need to decide whether to:

  • write some workaround ourselves, or
  • notify users that etcd v3.3+ is required.

@robbrockbank
Contributor

As an addition to the 90s timeout workaround, would it be worth considering a lightweight load generator that just keeps the watchers "active" so that we don't need to restart them - perhaps a particular filtered-out instance of each resource type, so that we don't actually generate unnecessary churn in Felix? Probably overkill, but throwing it out there.

@fasaxc
Member

fasaxc commented Jan 30, 2018

@robbrockbank Do we know if they're all multiplexed onto one TCP session? If so, refreshing the clusterinfo or something every 10s would do it.

@r7vme
Contributor Author

r7vme commented Jan 31, 2018

FYI: we are seeing this issue very frequently in a test environment that has lots of node (etcd) reboots/crashes.

I would love to see a workaround that does not require etcd 3.3.

@robbrockbank
Contributor

@fasaxc: My understanding is that they are multiplexed onto a single connection with etcdv3, so doing this keepalive on a single resource type should handle the dropped-connection issue.

That said, there are a couple of reasons why doing a keepalive for each resource type might be preferable:

  • We'll have a recent revision for each resource type, which lets us restart a watcher from the revision of the last received update and so reduces the chance of needing a full resync for that resource type (sketched after this comment).
  • When using the Kubernetes API as the datastore, Kubernetes automatically closes a CRD watcher after 1 minute of inactivity, presumably to work around this problem. This would help keep all CRD watchers active.

If we do want to do a keepalive, perhaps refreshing a WEP (WorkloadEndpoint) resource would be better, since that is probably the one that would benefit most from not requiring a full resync.
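
For illustration, a rough sketch of resuming an etcdv3 watch from the last seen revision using the upstream clientv3 API; this is not libcalico-go's actual syncer code, and the prefix/revision bookkeeping is simplified:

package etcdwatch

import (
    "context"
    "log"

    "github.com/coreos/etcd/clientv3"
)

// WatchFromLastRev re-establishes a watch on prefix, resuming just after the
// revision of the last update we processed so that a full re-list/resync is
// usually unnecessary. It returns the latest revision seen so the caller can
// resume again if the watch drops.
func WatchFromLastRev(ctx context.Context, cli *clientv3.Client, prefix string, lastRev int64) int64 {
    wch := cli.Watch(ctx, prefix,
        clientv3.WithPrefix(),
        clientv3.WithRev(lastRev+1), // resume just after the last seen revision
    )
    for wresp := range wch {
        if err := wresp.Err(); err != nil {
            // e.g. the revision has been compacted away; at this point a
            // full resync of this resource type would be needed.
            log.Printf("watch error: %v", err)
            return lastRev
        }
        for _, ev := range wresp.Events {
            lastRev = ev.Kv.ModRevision
            log.Printf("event %s %q rev=%d", ev.Type, ev.Kv.Key, lastRev)
        }
    }
    return lastRev
}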

@caseydavenport
Member

would it be worth considering implementing a lightweight load generator that just keeps the watchers "active" so that we don't need to restart them

I think we should avoid this - IIUC it either means writing a new component or making Felix write data to etcd, which changes its "read-only" behavior.

@caseydavenport
Member

OK, so I've managed to reproduce the issue where etcd connections remain in the ESTABLISHED state after etcd has died (by doing a GCE instance reset): when the etcd instance comes back, the TCP connections are still ESTABLISHED.

By the way, this looks to be the same as this Kubernetes issue: kubernetes/kubernetes#46964

The fix for that is the same as what I had originally guessed the fix here would need to be, and was merged in this PR: kubernetes/kubernetes#58008

I've got the Calico equivalent of that here: projectcalico/libcalico-go#777

Running a build with that patch, I see the connections become properly re-established after I reset my etcd instance, so I think that fixes it. I've also tried it against etcd server v3.1.10 and it behaves the same, so it looks like this will work without requiring an etcd upgrade, which is great.

What I don't see is any indication of the failure in Calico's logging - it seems the client handles the keepalive timeouts and reconnection under the covers without bubbling anything up to the calling code. That's unfortunate, but pretty minor in comparison, so let's move ahead with getting the keepalive patch into a bugfix release.
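
For context, the shape of that fix is to enable gRPC keepalives when constructing the etcdv3 client, along the lines of the sketch below; the durations are illustrative rather than the exact values used in projectcalico/libcalico-go#777:

package etcdclient

import (
    "time"

    "github.com/coreos/etcd/clientv3"
)

// NewClient builds an etcdv3 client with gRPC keepalives enabled so that a
// silently dead TCP connection (e.g. after a hard reboot of the etcd node)
// is detected and re-dialled instead of staying ESTABLISHED forever.
func NewClient(endpoints []string) (*clientv3.Client, error) {
    return clientv3.New(clientv3.Config{
        Endpoints:   endpoints,
        DialTimeout: 10 * time.Second,
        // Ping the server every 30s; if no response arrives within 10s,
        // consider the connection dead and reconnect.
        DialKeepAliveTime:    30 * time.Second,
        DialKeepAliveTimeout: 10 * time.Second,
    })
}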

@caseydavenport
Member

The fix for this is out in Calico v3.0.2 - please try it out and let me know if you hit any issues!

@r7vme
Contributor Author

r7vme commented Feb 13, 2018

Big thanks, we'll let you know.

@tungdam

tungdam commented Dec 23, 2021

We faced the same situation today on 3.5.1, due to a firewall misconfiguration that blocked traffic from Calico to etcd:

2021-12-22 11:48:09.358 [INFO][67] route_table.go 446: Syncing routes: removing old route. dest=10.198.192.69/32 ifaceName="calic0df348bad5" ipVersion=0x4
2021-12-22 11:48:09.358 [INFO][67] conntrack.go 78: Removing conntrack flows ip=10.198.192.69
......
2021-12-22 11:50:18.669 [INFO][67] route_table.go 446: Syncing routes: removing old route. dest=10.198.192.87/32 ifaceName="calid1bd716e6e4" ipVersion=0x4
2021-12-22 11:50:18.669 [INFO][67] conntrack.go 78: Removing conntrack flows ip=10.198.192.87
....

We had to restart all the calico pods. @caseydavenport do you have any suggestions about the correct version we should use to get rid of this completely?
Thanks
