Calico: Route Programming Delay with new nodes (symptoms: DNS resolution issues) #3224
Comments
Thanks @ottoyiu, I'm looking into this as well. I've been able to duplicate the overall issue and the issue with the node controller.
cc: @blakebarnett
@mikesplain The short hostnames stored in the Calico datastore as node names are being checked against the Kubernetes API as a synchronization step in the controller. Because the shortened node names do not match the full node names in Kubernetes, it deletes the node from the datastore :( You can trace that behaviour here in the code: The problem really lies with how we're deploying Calico with kops without specifying the The code there recommends an explicit value to be set, for good reason. However, if we change it in kops and set the For now, in my controller that I based off of caseydavenport's, I have removed the part of the syncing logic that checks the Kubernetes API against nodes that exist in the Calico datastore. Not sure if it's a good idea to just allow a user to specify a domain and have it append the domain to the shortname to make it a long name again. In that case, the controller can do its syncing duties. (EDIT: I added that as a flag,
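For illustration only, here is a rough Go sketch of the comparison described above (hypothetical code, not the actual controller; the `nodeDomain` parameter stands in for the flag mentioned in the edit):

```go
package nodecleanup

import "strings"

// shouldDeleteCalicoNode sketches the sync step described above: a node name
// from the Calico datastore is checked against the set of node names known to
// the Kubernetes API, and scheduled for deletion when no match is found.
// Appending a domain to short hostnames (the workaround mentioned in the edit)
// makes the comparison line up again.
func shouldDeleteCalicoNode(calicoName, nodeDomain string, k8sNodes map[string]bool) bool {
	name := calicoName
	if nodeDomain != "" && !strings.Contains(name, ".") {
		// e.g. "ip-10-0-0-1" -> "ip-10-0-0-1.us-west-2.compute.internal"
		name = name + "." + nodeDomain
	}
	// Without the domain suffix, the short Calico name never matches the full
	// Kubernetes node name, so the node is wrongly considered orphaned.
	return !k8sNodes[name]
}
```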
Great, thanks @ottoyiu, I'll give this a shot.
cc: @caseydavenport Casey, let me know if you do not want me to keep you in the loop on these. The Calico community is very active in kops, as you probably know :)
Yeah, that node controller was a quick experiment of mine and needs a lot of work and testing before I'd even consider putting it into kops. Like you say, we should push for getting the Calico Kubernetes datastore driver into kops - that's going to be the most robust solution to this issue. We can head down the node controller route in parallel if folks think it would be worthwhile, but even then I think that's probably a longer way off. @chrislovecnm I'm quite happy to be CC'd :)
Quick update: we've implemented a productized node controller that's been merged to Calico master. It's expected in an upcoming Calico release, at which point we should enable and test it in kops in order to fix this. Stay tuned!
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Any update on this?
@max-rocket-internet as I mentioned above this should be fixed by running the Calico node controller, but as far as I can see that isn't yet enabled in kops. You should be able to do it by adding
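The exact setting is elided above. As an assumption based on how calico/kube-controllers is configured, the node controller is enabled through the `ENABLED_CONTROLLERS` environment variable on the kube-controllers deployment; expressed with the Kubernetes Go API types, the change might look like this (a sketch, not the actual kops manifest):

```go
package manifest

import corev1 "k8s.io/api/core/v1"

// enableNodeController is a hypothetical illustration of enabling the node
// controller in calico/kube-controllers by listing it in ENABLED_CONTROLLERS.
// The exact list of controllers depends on the kube-controllers version.
func enableNodeController() corev1.EnvVar {
	return corev1.EnvVar{
		Name:  "ENABLED_CONTROLLERS",
		Value: "policy,profile,workloadendpoint,node",
	}
}
```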
@caseydavenport can someone PR that?
When I updated from kops 1.7 to 1.8 this stopped being an issue with DNS. I thought it was fixed. Is this not the case?
@cdenneen that's not the case
FWIW Calico v2.6 does support the node controller as well, so we shouldn't need to do the v3.0 upgrade. I'm happy to review but don't have the cycles to submit / test a PR in the near future. Still, we do want to get to v3.0 for other reasons anyway, and we have some ideas on how to make that upgrade much easier, but it's still a WIP. CC @tmjd.
@caseydavenport do you mean I can use calico-kube-controllers v2.0.1 alongside calico v2.6 (and kubernetes 1.8)? thanks
Nope, I mean that
@caseydavenport I see, thanks! I'll try to make a PR soon. Does turning on the node controller clean up old nodes that were deleted before turning it on, or just the ones deleted after it's already running?
@caseydavenport, I've turned on the node controller using kube-controllers 1.0.3 and calico 2.6.6, and now new nodes are never able to resolve DNS; new nodes are not having iptables synced with existing ones. EDIT: they eventually synced the routes (but it took 40 minutes and I turned off the node controller). I guess the node controller is not working as expected.
It won't be able to clean up old data, but it should for all new nodes going forward.
Hm, doesn't sound like it's working as expected. If you can share your changes I'd be happy to review.
I have changed this:
to
I can see in the logs that the node controller started running:
but regardless of that, new nodes are still taking several minutes to have routes to other nodes added...
In order to make it work for new nodes, you'll also need to set this env variable in the calico-node container of the daemonset:
This will provide the correlation with the k8s node required for cleanup. From this reference manifest
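The variable name is elided above, but the follow-up commits in this thread name it CALICO_K8S_NODE_REF. Expressed with the Kubernetes Go API types rather than manifest YAML, the addition to the calico-node container looks roughly like this (a sketch based on the reference manifest mentioned above):

```go
package manifest

import corev1 "k8s.io/api/core/v1"

// calicoK8sNodeRef sketches the env var added to the calico-node container:
// it is populated with the Kubernetes node name via the downward API, giving
// kube-controllers the correlation it needs to clean up the matching Calico
// node when the Kubernetes node is deleted.
func calicoK8sNodeRef() corev1.EnvVar {
	return corev1.EnvVar{
		Name: "CALICO_K8S_NODE_REF",
		ValueFrom: &corev1.EnvVarSource{
			FieldRef: &corev1.ObjectFieldSelector{
				FieldPath: "spec.nodeName",
			},
		},
	}
}
```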
Yeah, you'll unfortunately still need to clean up old nodes by hand since they won't have the correlation information which allows the node controller to do its job. ^ There are some steps on how to do that here in the decommissioning a node documentation.
Thanks for your help @caseydavenport, great work @felipejfc! Let me know when you open that PR, I'd love to test and review as well, since we've been running an automated cleanup script since we started running into this a while ago.
Hi @caseydavenport, I've added the env var you mentioned and decommissioned the non-existent nodes (I've checked twice), but upon bringing up a new node it still took about 5~7 minutes to start being able to resolve DNS; the ARP table was slowly constructed and this is the final table:
Hm, I'm not sure why it's taking so long to get DNS working then - it might be something else that's the cause. 5-7 minutes is longer than Calico's BGP graceful restart timer, so it doesn't feel like it's the same route distribution error the issue was originally about - if it was, you'd expect to see the routes appear and DNS start working after the BGP GR timeout of 120s. I think the next thing we need to do is identify exactly why DNS resolution is taking so long now. To rule out Calico, I'd check the calico-node logs and monitor the routing table on the newly started node. You should expect to see one Feel free to send the calico-node logs from a newly started node (ideally from startup until the point DNS starts working) and I can have a look to see if there's anything weird in there. The ARP table looks much better though since there are no longer any
@caseydavenport Actually I guess I was wrong, or maybe I hadn't cleaned the nodes right, maybe because I hadn't restarted all calico-node pods after editing the daemonset with the env var you mentioned. After restarting all calico-node pods and deleting 1 remaining invalid node, I've just brought two new nodes up successfully and calico-node took about 10 seconds to properly set up the routes. I've also tested deleting some nodes and bringing others up, and it's working OK. I'll also send a PR to kops with the changes that I needed: add the env var to calico-node and change the one in calico-kube-controllers. Regards
…CALICO_K8S_NODE_REF in calico-node, this commit fixes kubernetes#3224
…CALICO_K8S_NODE_REF in calico-node, this commit fixes kubernetes#3224 and kubernetes#4533
How can we replicate this? Is it just patching this env variable? Any special Calico version required? (2.6.2 is the default right now, I believe.)
@thomasjungblut see PR #4588
After making the changes, I noticed new nodes coming online get route updates quickly, which is great, because we scale frequently. While testing this issue, though, I've noticed that if you have your cluster running at capacity, where your load matches your number of nodes, and you then terminate one of those working nodes, you won't see route changes until reconciliation. I'm using the included v1.0.3 of kube-controllers w/ calico-node v2.6.6. Not sure if there is something with delete events not being processed properly.
From what we've seen in prod, the v1.0.3 kube-controller has race condition issues in its node controller and a problem with handling deletion events - it would panic trying to type convert a node after it was deleted.
I believe that race issue has been fixed in Calico v3. We should try to get the PR to update kops to v3 merged ASAP for various reasons, including this one.
Reference: #2661
New nodes that are brought up into the cluster experience a delay in route programming while already being marked as 'Ready' by Kubernetes (the kubelet on the Node sets this). New pods that are scheduled on these nodes prior to routes being programmed will not have any connectivity to the rest of the cluster until the routes are programmed several minutes later.
Symptoms of this are:
This is the result of Calico/BIRD trying to peer with a Node that doesn't exist.
The BIRD documentation states (http://bird.network.cz/?get_doc&f=bird-6.html#ss6.3):
meaning that it will take 120s (since the default is not overridden) before the routes will be programmed, due to the graceful-restart timer.
This is caused by nodes in the Calico etcd nodestore no longer existing. Due to the ephemeral nature of AWS EC2 instances, new nodes are brought up with different hostnames, and nodes that are taken offline remain in the Calico nodestore. This is unlike most datacentre deployments where the hostnames are mostly static in a cluster.
To solve this, we must keep the nodes in sync between Kubernetes and Calico. To do so, we can write a node controller that watches for nodes that are removed and reflects that in Calico. We also need a periodic sync to make sure that missed events are accounted for.
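A minimal sketch of such a controller, assuming client-go and a placeholder for the Calico datastore call (illustrative only, not the proof-of-concept linked below or the productized kube-controllers):

```go
package main

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// deleteCalicoNode is a placeholder for the libcalico-go call that removes
// the node from the Calico nodestore.
func deleteCalicoNode(name string) {
	// ... Calico datastore client call goes here ...
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch Kubernetes nodes and mirror deletions into the Calico datastore.
	factory := informers.NewSharedInformerFactory(client, 5*time.Minute)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			// Deleted objects can arrive wrapped in a tombstone.
			if tomb, ok := obj.(cache.DeletedFinalStateUnknown); ok {
				obj = tomb.Obj
			}
			node, ok := obj.(*corev1.Node)
			if !ok {
				return // guard against the type-assertion panic mentioned earlier in the thread
			}
			deleteCalicoNode(node.Name)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	// A real controller would also periodically list nodes from both the
	// Kubernetes API and the Calico datastore and reconcile any differences,
	// so that missed delete events are eventually cleaned up; that loop is
	// omitted here.
	<-stop
}
```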
The controller needs to be deployed with kops when Calico is set as the network provider.
There's a proof-of-concept controller, but it is not production ready:
https://github.com/caseydavenport/calico-node-controller
Caution: At the time of writing this issue, running this controller has led to all nodes being removed from Calico's datastore. It could be something that I've done, and nothing to do with the node controller itself. However, I'd say to run it with extreme caution in production.
EDIT: Though I have not tested this, I expect the move towards using the Kubernetes API as the datastore for Calico to solve this issue, as there will no longer be a need to sync a list of nodes.
@mikesplain