Flannel stops working on EC2 #87
Comments
@eyakubovich Yes, I'm running in a VPC.
I see. However, those log statements appear nowhere earlier in the logs; they only start showing up around the time flannel stops working. Pinging eth0 on the remote machine works:
Pinging flannel0 (this is docker0, right?) on the local machine works for both machines:
Pinging flannel.1 on the local machine works for both machines:
Pinging either docker0 or flannel.1 on the remote machine does not work. Here is what happens in tcpdump. Sending machine:
Receiving machine (packets from machines other than the sending machine are received as well):
So it seems like packets are arriving on the overlay network...
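(For anyone retracing these checks, a minimal sketch of the commands involved. The interface names, port, and 10.10.x.x addresses come from this thread; the remote eth0 address is a placeholder.)

```sh
# Underlay reachability (this worked): ping the remote host's eth0 address.
ping -c 3 <remote-eth0-ip>

# Overlay reachability (this failed one-way): ping the remote host's
# flannel.1 and docker0 addresses, assumed here to be 10.10.1.0 / 10.10.1.1.
ping -c 3 10.10.1.0
ping -c 3 10.10.1.1

# Watch the encapsulated traffic; flannel's vxlan backend uses UDP port 8472.
tcpdump -ni eth0 udp port 8472

# Watch the decapsulated ICMP on the overlay device.
tcpdump -ni flannel.1 icmp
```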
Based on the above it seems like this is a one-way issue. If I switch the receiving and sending machines I get:
And no ICMP packets are being captured on port 8472.
@dennybritz And the tcpdump confirms that as well: we see the Echo Request make it from one host to the other, but no sign of an Echo Reply. Let's run a few tests on the box where nothing was captured. As a next step, let's see if the routes are still there. Keep the ping running while you check, and paste the output. I'm also going to set up a cluster on EC2 with a VPC and run it over the next few days.
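A sketch of what the route check might look like (the subnet and device names are assumptions based on this thread):

```sh
# With the vxlan backend there should be one route per remote flannel
# subnet pointing at the flannel.1 device, roughly of the form
#   10.10.1.0/24 via 10.10.1.0 dev flannel.1 onlink
ip route show

# Details of the vxlan device itself (VNI, local address, port).
ip -d link show flannel.1
```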
Please also do a quick check of the ARP table while pinging.
Here is the ARP table while running ping; there seems to be no mapping for the overlay address.
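(The table itself isn't reproduced here; a sketch of how it can be dumped while the ping runs:)

```sh
# A neighbor entry stuck without a MAC address ("incomplete") means ARP
# resolution over the overlay never finished.
arp -an
cat /proc/net/arp
```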
This is the tail of dmesg:
@dennybritz exactly. I'm assuming you were pinging 10.10.1.0 or 10.10.1.1 and we see ARP resolution in progress but not resolved:
That would be causing the problem. Now, can you look into the flannel logs? Around that time there should be something like:
ARP resolutions are forwarded up to flanneld and it inserts entries into the ARP table based on its knowledge.
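A hedged sketch of inspecting the entries flanneld is expected to maintain for the vxlan device (the exact mechanics may vary by flannel version):

```sh
# ARP entries: overlay IP -> MAC of the remote host's flannel.1 device.
ip neigh show dev flannel.1

# Forwarding database: remote flannel.1 MAC -> remote host's eth0 IP,
# used as the vxlan tunnel endpoint.
bridge fdb show dev flannel.1
```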
@eyakubovich Yep, that's right, though I don't see any such messages in the logs.
This is what my etcd looks like, in case that helps. Is PublicIP supposed to be 0.0.0.0?
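For reference, a sketch of dumping the leases (this assumes flannel's default /coreos.com/network etcd prefix; the subnet key name is illustrative):

```sh
# List the per-host subnet leases.
etcdctl ls /coreos.com/network/subnets

# Dump one lease; the value is JSON and should carry the host's eth0
# address in PublicIP, e.g. {"PublicIP":"172.31.5.10"} (hypothetical).
etcdctl get /coreos.com/network/subnets/10.10.1.0-24
```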
@dennybritz no, PublicIP needs to be the IP of your eth0. And it's 0.0.0.0 for all entries. But since flannel worked for a while, it had to be valid for some period of time. This is really odd.
flannel gets a lease for 24 hours (via etcd) and renews it with 1 hour to go. I wonder if something broke in the renewal logic so that it renews after 12 hours but sets PublicIP to 0.0.0.0. But then your TTL still shows over 10 hours remaining. I'm going to look into this.
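One way to observe the renewal behaviour described here is to poll the lease's TTL through etcd's v2 HTTP API (the host, port, and key are placeholders for the defaults of that era):

```sh
# The response JSON includes a "ttl" field; it should count down from
# 86400 (24h) and jump back up when flannel renews, about an hour
# before expiry.
while true; do
  curl -s http://127.0.0.1:4001/v2/keys/coreos.com/network/subnets/10.10.1.0-24 \
    | grep -o '"ttl":[0-9]*'
  sleep 600
done
```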
Thanks. I'm not sure if this is related, but my log is full of these messages (on all machines):
However, these were appearing when flannel was working as well.
@dennybritz If it does renew, it should print:
This happens if etcd advances too fast, but flannel has logic to recover (although I wonder if there could be a bug there). Is your cluster busy? In general, it shouldn't constantly fall behind. But this gives me some clues to investigate tomorrow.
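For background on "falling behind": the etcd v2 API keeps only a bounded window of recent events, so a watch resumed from a stale index is rejected and the watcher has to resync. A sketch of triggering that condition by hand (the index value is deliberately ancient):

```sh
# Watching from an index far older than etcd's event window returns an
# "event in requested index is outdated and cleared" error, which is the
# condition flannel has to recover from.
curl -s 'http://127.0.0.1:4001/v2/keys/coreos.com/network/subnets?wait=true&recursive=true&waitIndex=1'
```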
@eyakubovich Looking at the logs:
I don't think my cluster is busy; I've mainly been testing things. However, these are t2.small instances that can't handle much. But looking at CPU and memory utilization, everything seems fine.
@dennybritz The renewals seem to be happening no less than 23 hours apart. Not sure what could be happening after 12 hours, but something definitely messes up etcd. Will investigate and let you know.
@dennybritz I found and fixed the bug (introduced during a refactor) that was zeroing out PublicIP during the lease renewal. Can you check if it stays up now?
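A simple way to verify the fix over a full renewal cycle: log the lease periodically and confirm PublicIP never reverts to 0.0.0.0 (key name illustrative, as above):

```sh
# Sample the lease every 10 minutes across the ~23h renewal boundary.
while true; do
  date
  etcdctl get /coreos.com/network/subnets/10.10.1.0-24 | grep PublicIP
  sleep 600
done
```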
That's great news, thanks! I am trying it.
@eyakubovich It's been up for more than 24 hours now, so I think this is resolved!
Great. Will consider this resolved.
Original issue description (filed by @dennybritz):
First of all, sorry for the high-level error report, but I'm not sure what is going wrong or how best to debug it. I am running a ~8 node CoreOS cluster on EC2 (stable channel). My etcd is running on a separate node. I'm using flannel from the current master branch, built within a container.
Everything works fine for about 12 hours, but then routing stops working and packets are dropped. This is reproducible and happens every time I re-provision my cluster (restart flannel and docker). I was initially running the UDP backend and didn't find any error messages in the journal logs. I then switched to the vxlan backend and I am seeing the following:
Any ideas what could be causing this? Is it perhaps something specific to EC2?
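For anyone reproducing this, a hedged sketch of the re-provision step mentioned above. The unit names assume flannel and docker run under systemd on CoreOS; since flannel here is built and run inside a container, the exact commands may differ.

```sh
# Restart flannel first so docker picks up the (possibly new) subnet.
sudo systemctl restart flanneld
sudo systemctl restart docker

# Then follow flannel's log around the ~12 hour mark for the first errors.
journalctl -u flanneld -f
```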