Calico: Route Programming Delay with new nodes (symptoms: DNS resolution issues) #3224

Closed
ottoyiu opened this issue Aug 17, 2017 · 30 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@ottoyiu
Contributor

ottoyiu commented Aug 17, 2017

Reference: #2661

New nodes that are brought up into the cluster experience a delay in route programming, while already being marked as 'Ready' by Kubernetes (kubelet on the 'Node' sets this). New pods that are scheduled on these nodes before routes are programmed will not have any connectivity to the rest of the cluster until the routes are programmed several minutes later.

Symptoms of this are:

  • DNS resolution fails - the Pod cannot connect to kube-dns through the Service IP
  • Service or Pod IPs do not function as intended

This is the result of Calico/BIRD trying to peer with a node that no longer exists.

The BIRD documentation states (http://bird.network.cz/?get_doc&f=bird-6.html#ss6.3):

**graceful restart time number**
The restart time is announced in the BGP graceful restart capability and specifies how long the neighbor would wait for the BGP session to re-establish after a restart before deleting stale routes. Default: 120 seconds.

meaning that it will take 120s (since the default is not overridden) before the routes are programmed, due to the graceful-restart timer.

This is caused by entries remaining in the Calico etcd nodestore for nodes that no longer exist. Due to the ephemeral nature of AWS EC2 instances, new nodes are brought up with different hostnames, while nodes that are taken offline remain in the Calico nodestore. This is unlike most datacentre deployments, where the hostnames in a cluster are mostly static.
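
As a rough illustration (assuming calicoctl is pointed at the same etcd datastore Calico uses), the stale entries can be spotted by comparing what Calico and Kubernetes each think the node list is:

    # Nodes Calico has registered in its datastore:
    calicoctl get nodes -o wide

    # Nodes Kubernetes currently knows about:
    kubectl get nodes -o name

    # Any Calico node with no matching Kubernetes node is a stale entry
    # that BIRD will keep trying to peer with.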

To solve this, we must keep the nodes in sync between Kubernetes and Calico. To do this, we can write a node controller that watches for node removals and reflects them in Calico. We also need a periodic sync to make sure that missed events are accounted for.

The controller needs to be deployed with kops when Calico is set as the network provider.

There's a proof-of-concept controller, but it is not production ready:
https://github.com/caseydavenport/calico-node-controller

Caution: at the time of writing this issue, running this controller has led to all nodes being removed from Calico's datastore. It could be something that I've done rather than a problem with the node controller itself, but I'd say to run it with extreme caution in production.

EDIT: Though I have not tested this, I expect the move towards using the Kubernetes API as the datastore for Calico to solve this issue, as there will no longer be a need to sync a list of nodes.

@mikesplain

@ottoyiu changed the title from "Calico: Route Programming Delay with new nodes (DNS resolution issues)" to "Calico: Route Programming Delay with new nodes (symptoms: DNS resolution issues)" on Aug 17, 2017
@mikesplain
Contributor

Thanks @ottoyiu, I'm looking into this as well. I've been able to reproduce both the overall issue and the issue with the node controller.

@chrislovecnm
Contributor

cc: @blakebarnett

@ottoyiu
Contributor Author

ottoyiu commented Aug 19, 2017

@mikesplain The short hostnames stored in the Calico datastore as node names are checked against the Kubernetes API as a synchronization step in the controller. Because the shortened node names do not match the full node names in Kubernetes, the controller deletes the nodes from the datastore :(

You can trace that behaviour here in the code:
https://github.com/caseydavenport/calico-node-controller/blob/master/controller.go#L162

The problem really lies with how we're deploying Calico with kops without specifying NODENAME. Calico then defaults to using the HOSTNAME environment variable, which is in the form of a short hostname:
https://github.com/projectcalico/calico/blob/master/calico_node/startup/startup.go#L157

The code there recommends setting an explicit value, for good reason.

However, if we change it in kops and set NODENAME using the Downward API, we'll need some form of migration, as it'll lead to duplicate IPs being defined in the datastore and Calico will fail to start up. :( Maybe it's just best to wait for the Kubernetes datastore to reach feature parity with etcd, so we don't have to worry about this whole issue.
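
For illustration only (this is not the actual kops manifest, just a sketch of the Downward API approach), setting NODENAME explicitly in the calico-node container would look something like:

    # Sketch: give calico-node the full Kubernetes node name instead of
    # letting it fall back to the short HOSTNAME.
    - name: NODENAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName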

For now, in my controller, which is based off of caseydavenport's, I have removed the part of the syncing logic that checks the Kubernetes API against nodes that exist in the Calico datastore.
https://github.com/ottoyiu/kube-calico-controller
This means that you'll need to properly clean your datastore before having kube-calico-controller take over the janitorial work.

I'm not sure if it's a good idea to just allow a user to specify a domain so the controller can append it to the short name to make it a full name again. In that case, the controller can still do its syncing duties.

(EDIT: I added that as a flag, -domain, e.g. -domain us-west-2.compute.internal, and re-incorporated the syncing logic; I have tested it on two clusters with success so far, but please tread with caution, especially when running in production. Use -dryrun first to validate.)
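
A trial run would look roughly like this (the exact binary name and flag syntax are best confirmed against the repo's README; this is just a sketch):

    # Dry run: report which Calico nodes would be removed, using -domain
    # to turn short hostnames back into full node names, without deleting anything.
    kube-calico-controller -domain us-west-2.compute.internal -dryrun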

@mikesplain
Contributor

Great, thanks @ottoyiu, I'll give this a shot.

@chrislovecnm
Contributor

cc: @caseydavenport

Casey, let me know if you do not want me to keep you in the loop on these. The Calico community is very active in kops, as you probably know :)

@caseydavenport
Member

Yeah, that node controller was a quick experiment of mine and needs a lot of work and testing before I'd even consider putting it into kops.

Like you say, we should push for the Calico kubernetes datastore driver getting into kops - that's going to be the most robust solution to this issue.

We can head down the node controller route in parallel if folks think it would be worthwhile, but even then I think that's probably a longer way off.

@chrislovecnm I'm quite happy to be CC'd :)

@caseydavenport
Member

Quick update - we've implemented a productized node controller that has been merged to Calico master.

It's expected in an upcoming Calico release, at which point we should enable and test it in kops in order to fix this. Stay tuned!

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Feb 14, 2018
@max-rocket-internet

Any update on this?

@caseydavenport
Member

@max-rocket-internet as I mentioned above this should be fixed by running the Calico node controller, but as far as I can see that isn't yet enabled in kops.

You should be able to do it by adding node to ENABLED_CONTROLLERS in the kube-controllers deployment.

@chrislovecnm
Contributor

@caseydavenport can someone PR that?

@cdenneen

cdenneen commented Mar 1, 2018

When I updated from kops 1.7 to 1.8 this stopped being an issue with DNS. I thought it was fixed. Is this not the case?

@felipejfc
Contributor

@cdenneen that's not the case
@chrislovecnm it's not a simple PR. If I'm not mistaken, the node controller was added in v2 of calico-kube-controllers, and kops is currently using v1; updating to v2 would also mean upgrading Calico to v3, and for that we would also need to upgrade etcd to v3
https://docs.projectcalico.org/v3.0/getting-started/kubernetes/upgrade/

@caseydavenport
Member

FWIW Calico v2.6 does support the node controller as well, so we shouldn't need to do the v3.0 upgrade. I'm happy to review but don't have the cycles to submit/test a PR in the near future.

Still, we do want to get to v3.0 for other reasons anyway, and we have some ideas on how to make that upgrade much easier, but still a WIP. CC @tmjd.

@felipejfc
Contributor

felipejfc commented Mar 3, 2018

@caseydavenport do you mean I can use calico-kube-controllers v2.0.1 alongside calico v2.6 (and kubernetes 1.8)?

thanks

@caseydavenport
Member

Nope, I mean that calico/kube-controllers:v1.0.3, which is compatible with Calico v2.6.8, includes the same functionality to clean up old nodes, so upgrading to v2.6.8 and enabling the node controller should fix this.
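
In manifest terms that's roughly just pinning the kube-controllers image to the compatible tag (registry path assumed from the upstream Calico v2.6 manifests) and enabling the node controller as described above:

    # Sketch: kube-controllers image compatible with Calico v2.6.x
    image: quay.io/calico/kube-controllers:v1.0.3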

@felipejfc
Contributor

felipejfc commented Mar 3, 2018

@caseydavenport I see, thanks! I'll try to make a PR soon

Does turning on the node controller clean up old nodes that were deleted before it was turned on, or just the ones deleted after it's already running?

@felipejfc
Contributor

felipejfc commented Mar 3, 2018

@caseydavenport, I've turned on the node controller using kube-controllers 1.0.3 and Calico 2.6.6, and now new nodes are never able to resolve DNS; the new nodes are not having iptables synced with existing ones

EDIT

They eventually synced the routes (but it took 40 minutes, and I turned off the node controller). I guess the node controller is not working as expected.

@caseydavenport
Member

Does turning on the node controller clean up old nodes that were deleted before it was turned on, or just the ones deleted after it's already running?

It won't be able to clean up old data, but it should for all new nodes going forward.

I've turned on the node controller using kube-controllers 1.0.3 and Calico 2.6.6, and now new nodes are never able to resolve DNS; the new nodes are not having iptables synced with existing ones

Hm, doesn't sound like it's working as expected. If you can share your changes I'd be happy to review.

@felipejfc
Contributor

I have changed this:

    - name: ENABLED_CONTROLLERS
      value: policy,profile,workloadendpoint

to

    - name: ENABLED_CONTROLLERS
      value: policy,profile,workloadendpoint,node

I can see in the logs that the node controller started running:

2018-03-03 22:34:19.960 [INFO][1] main.go 66: Loaded configuration from environment config=&config.Config{LogLevel:"info", ReconcilerPeriod:"5m", EnabledControllers:"policy,profile,workloadendpoint,node", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:""}
2018-03-03 22:34:19.960 [INFO][1] client.go 202: Loading config from environment
2018-03-03 22:34:19.963 [INFO][1] node_controller.go 145: Starting Node controller
2018-03-03 22:34:19.963 [INFO][1] namespace_controller.go 147: Starting Namespace/Profile controller
2018-03-03 22:34:19.964 [INFO][1] pod_controller.go 193: Starting Pod/WorkloadEndpoint controller
2018-03-03 22:34:19.964 [INFO][1] policy_controller.go 189: Starting NetworkPolicy controller
2018-03-03 22:34:20.001 [INFO][1] policy_controller.go 205: NetworkPolicy controller is now running
2018-03-03 22:34:20.005 [INFO][1] namespace_controller.go 163: Namespace/Profile controller is now running
2018-03-03 22:34:21.317 [INFO][1] pod_controller.go 216: Pod/WorkloadEndpoint controller is now running
2018-03-03 22:35:58.486 [INFO][1] node_controller.go 168: Node controller is now running
2018-03-03 22:39:28.367 [WARNING][1] node_controller.go 227: No corresponding Node in cache, re-loading cache from datastore node=""

but regardless of that, new nodes are still taking several minutes to have routes to other nodes added...
Do I need to clean up old data for it to start working properly? If so, how do I do that?

@caseydavenport
Member

caseydavenport commented Mar 4, 2018

In order to make it work for new nodes, you'll also need to set this env variable in the calico-node container of the daemonset:

            # Set noderef for node controller.
            - name: CALICO_K8S_NODE_REF
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName

This will provide the correlation with the k8s node required for cleanup. From this reference manifest

Do I need to clean up old data for it to start working properly? If so, how do I do that?

Yeah, you'll unfortunately still need to clean up old nodes by hand since they won't have the correlation information which allows the node controller to do its job. ^

There are some steps on how to do that here in the decommissioning a node documentation.
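
Roughly, the per-node cleanup boils down to the following (a sketch; follow the linked documentation for the full procedure):

    # For each node that no longer exists in Kubernetes but is still
    # present in the Calico datastore:
    calicoctl delete node <old-node-name>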

@mikesplain
Contributor

Thanks for your help @caseydavenport, and great work @felipejfc! Let me know when you open that PR - I'd love to test and review as well, since we've been running an automated cleanup script since we started running into this a while ago.

@felipejfc
Contributor

hi @caseydavenport, I've added the env var you mentioned and decommissioned the non-existing nodes (I've checked twice), but upon bringing up a new node it still took about 5-7 minutes to start being able to resolve DNS. The ARP table was slowly constructed, and this is the final table:

admin@ip-172-20-155-231:~$ sudo arp -an
? (172.20.147.218) at 0e:f4:f5:7b:29:4a [ether] on eth0
? (172.20.149.29) at 0e:dc:52:6b:f4:02 [ether] on eth0
? (172.20.129.92) at 0e:df:fb:92:79:62 [ether] on eth0
? (172.20.154.244) at 0e:f6:53:4e:ff:9c [ether] on eth0
? (172.20.130.81) at 0e:c5:aa:d8:87:5e [ether] on eth0
? (172.20.143.199) at 0e:30:48:aa:1a:a6 [ether] on eth0
? (172.20.138.18) at 0e:1a:7e:d9:59:66 [ether] on eth0
? (100.100.93.14) at 96:ca:1c:32:4a:00 [ether] PERM on cali760622bdb5f
? (172.20.136.56) at 0e:03:da:60:36:ee [ether] on eth0
? (172.20.133.89) at 0e:18:9d:c3:37:44 [ether] on eth0
? (172.20.133.196) at 0e:4c:7a:06:de:6e [ether] on eth0
? (172.20.141.230) at 0e:93:63:db:ca:6a [ether] on eth0
? (100.100.93.2) at d6:6d:9f:bd:0f:c6 [ether] PERM on cali26efd1b5194
? (172.20.128.1) at 0e:4f:ac:3e:25:6b [ether] on eth0
? (172.20.144.165) at 0e:e5:30:7c:a2:a0 [ether] on eth0
? (100.100.93.12) at 3e:50:35:73:ac:22 [ether] PERM on calid9e6a52eb7b
? (172.20.147.188) at 0e:f8:41:c6:12:e8 [ether] on eth0
? (172.20.157.57) at 0e:9f:c8:b5:bc:10 [ether] on eth0
? (172.20.136.190) at 0e:e3:04:bf:ac:4e [ether] on eth0
? (172.20.134.101) at 0e:12:1d:c6:83:00 [ether] on eth0
? (172.20.152.253) at 0e:ac:ff:43:f1:68 [ether] on eth0
? (172.20.149.238) at 0e:f4:a6:3c:51:96 [ether] on eth0
? (172.20.130.138) at 0e:72:cb:09:bc:3a [ether] on eth0
? (172.20.138.81) at 0e:c3:e9:4d:dd:52 [ether] on eth0
? (172.20.145.15) at 0e:36:2d:89:1e:08 [ether] on eth0
? (172.20.129.138) at 0e:32:13:c5:5b:de [ether] on eth0
? (172.20.133.104) at 0e:9e:64:6f:4f:48 [ether] on eth0
? (172.20.152.124) at 0e:fb:69:52:fa:d2 [ether] on eth0
? (172.20.158.215) at 0e:9c:f8:79:28:aa [ether] on eth0
? (100.100.93.15) at 6a:15:77:60:dd:d1 [ether] PERM on calia93f2a7bc0c
? (172.20.145.222) at 0e:7e:1b:ed:bd:3c [ether] on eth0
? (172.20.135.235) at 0e:50:61:42:dc:1a [ether] on eth0
? (100.100.93.11) at 72:2c:0e:91:f9:83 [ether] PERM on calid85fc28a2c0
? (172.20.159.166) at 0e:1e:60:19:78:d6 [ether] on eth0
? (172.20.148.230) at 0e:b7:25:e1:b3:f4 [ether] on eth0
? (172.20.137.208) at 0e:f4:c8:c0:3b:d0 [ether] on eth0
? (172.20.140.127) at 0e:2d:ad:b0:9b:ca [ether] on eth0
? (100.100.93.13) at 16:2f:63:d1:8a:7c [ether] PERM on cali07a8bdce8ed
? (172.20.131.150) at 0e:43:b0:5b:fc:90 [ether] on eth0
? (100.100.93.1) at 8e:a4:bf:53:01:49 [ether] PERM on cali06b578b6f44
? (172.20.157.200) at 0e:27:b1:f6:aa:9a [ether] on eth0
? (172.20.148.132) at 0e:2f:e7:d2:f3:82 [ether] on eth0
? (172.20.157.183) at 0e:77:f7:2a:55:22 [ether] on eth0
? (172.20.130.15) at 0e:cf:2d:bb:67:68 [ether] on eth0
? (172.20.143.174) at 0e:7c:bb:a9:e2:2c [ether] on eth0
? (172.20.134.88) at 0e:ed:f7:f2:31:c8 [ether] on eth0
? (172.20.129.46) at 0e:2a:aa:22:f1:c6 [ether] on eth0
? (172.20.140.84) at 0e:8d:4e:6b:73:4e [ether] on eth0
? (172.20.140.254) at 0e:c2:79:5a:97:d2 [ether] on eth0
? (172.20.128.115) at 0e:16:27:77:b9:90 [ether] on eth0
? (172.20.133.77) at 0e:c4:88:17:6e:66 [ether] on eth0
? (172.20.159.229) at 0e:bd:dc:8b:23:2a [ether] on eth0
? (172.20.155.224) at 0e:cf:e3:b3:77:78 [ether] on eth0
? (172.20.142.96) at 0e:55:c6:52:49:76 [ether] on eth0
? (172.20.154.72) at 0e:b8:90:94:27:ba [ether] on eth0
? (172.20.149.127) at 0e:fe:85:c5:5f:4a [ether] on eth0
? (172.20.138.157) at 0e:59:1f:b3:25:56 [ether] on eth0
? (100.100.93.16) at 06:4d:63:39:da:b0 [ether] PERM on cali59c9ed8c700
? (172.20.155.37) at 0e:07:e2:6c:e1:0e [ether] on eth0
? (172.20.130.196) at 0e:65:30:7f:e5:a8 [ether] on eth0
? (172.20.137.5) at 0e:31:46:a5:ae:20 [ether] on eth0
? (172.20.149.47) at 0e:43:f3:5d:79:cc [ether] on eth0

@caseydavenport
Member

Hm, I'm not sure why it's taking so long to get DNS working then - it might be something else that's the cause. 5-7 minutes is longer than Calico's BGP graceful restart timer, so it doesn't feel like it's the same route distribution error the issue was originally about - if it was, you'd expect to see the routes appear and DNS start working after the BGP GR timeout of 120s.

I think the next thing we need to do is identify exactly why DNS resolution is taking so long now. To rule out Calico, I'd check the calico-node logs and monitor the routing table on the newly started node. You should expect to see one /26 route to each other node in your cluster immediately (within a second or two) of Calico starting on that node.
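
For example, something like this on the new node makes it easy to watch those routes arrive (routes installed by Calico/BIRD are tagged with "proto bird"):

    # Expect one /26 route per remote node within a second or two of
    # calico-node starting, e.g. 100.100.x.0/26 via <node-ip> dev eth0 proto bird
    watch -n1 "ip route | grep bird"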

Feel free to send the calico-node logs from a newly started node (ideally from startup until the point DNS starts working) and I can have a look to see if there's anything weird in there.

The ARP table looks much better though since there are no longer any <incomplete> entries, which means Calico isn't trying to peer with old dead nodes any more.

@felipejfc
Contributor

@caseydavenport Actually I guess I was wrong, or maybe I hadn't cleaned the nodes right - maybe because I hadn't restarted all the calico-node pods after editing the daemonset with the env var you mentioned. After restarting all calico-node pods and deleting one remaining invalid node, I've just brought two new nodes up successfully, and calico-node took about 10 seconds to properly set up the routes. I've also tested deleting some nodes and bringing others up, and it's working OK. I'll also send a PR to kops with the changes I needed: add the env var to calico-node and change the one in calico-kube-controllers.

regards

felipejfc added a commit to felipejfc/kops that referenced this issue Mar 6, 2018
…CALICO_K8S_NODE_REF in calico-node, this commit fixes kubernetes#3224
felipejfc added a commit to felipejfc/kops that referenced this issue Mar 6, 2018
felipejfc added a commit to felipejfc/kops that referenced this issue Mar 6, 2018
@thomasjungblut

How can we replicate this? Is it just patching this env variable? Any special Calico version required? (2.6.2 is the default right now, I believe)

@felipejfc
Contributor

@thomasjungblut see PR #4588

vendrov pushed a commit to vendrov/kops that referenced this issue Mar 21, 2018
rdrgmnzs pushed a commit to rdrgmnzs/kops that referenced this issue Apr 6, 2018
@john-delivuk

After making the changes, I noticed new nodes coming online get route updates quickly, which is great, because we scale frequently. While testing this issue, though, I've noticed that if your cluster is running at capacity (where your load matches your number of nodes) and you then terminate one of those working nodes, you won't see route changes until reconciliation. I'm using the included v1.0.3 of kube-controllers with calico-node v2.6.6. Not sure if there is something with delete events not being processed properly.

@ottoyiu
Contributor Author

ottoyiu commented Aug 31, 2018

From what we've seen in prod, the v1.0.3 kube-controllers has race condition issues in its node controller and a problem with handling deletion events - it would panic trying to type-convert a node after it was deleted.

@caseydavenport
Member

I believe that race issue has been fixed in Calico v3. We should try to get the PR that updates kops to v3 merged ASAP, for various reasons including this one.
