Writeup of router kill issue #3320
My theory is that it exhausts or overloads the NAT table, which on some routers causes lockups. Possible solution: have a switch to limit the number of peers/connections. |
That sounds highly likely. A lot of those are dead connections too: if I open the webui which tries to ping them it quickly drops to 90 or so. |
Running 5 daemons on local network with a well-known hash (they were pinning dist) kills my Fritzbox. AFAIK everyone has high esteem for Fritzboxes as very good routers, and not some shitty hardware. Internet reports a NAT table size of around 7000. I find the problem is exacerbated when my nodes are pinning popular content (I suspect this not only consumes all the bandwidth but also increases the number of connections when other peers try to download these blocks?). |
So my idea of what happens is that the conntracker table fills up (it is small in cheapo routers, bigger in good ones) and it starts throwing out other connections. @hsanjuan can you repeat the test, kill the ipfs daemons and check if it comes back online? |
@Kubuxu yeah yeah things are back up immediately when I kill them. Only once I had the router reboot itself, which worried me more. |
So the other possibility is that cheapo routers have a bigger conntracker limit than their RAM can handle, and they kernel panic or lock up. Not sure how to check it. |
Does UDP eat up conntracker entries? We're moving quickly towards having support for QUIC. |
AFAIK, yes. At least from the time my services were DDoSed with UDP packets and they were much more destructive because of low conntracker limits. |
Is it possible that this problem got much worse in the last releases (i.e. >=0.4.5)? I used to be able to run 4 nodes without problems and now it seems I can't, even after cleaning their contents. |
I'm having issues, too. Maybe ipfs should keep two connection pools and migrate peer connections from a bad-quality pool to a good-quality pool by applying some heuristics to the peers. Peers with higher delays, lower bandwidth and short lives would live in the "bad pool" and be easily replaced by new peers if connection limits are hit. Better peers would migrate to the "good pool" and only be replaced by better peers if limits are hit. Having both pools gives slow peers a chance to be part of the network without being starved by higher-quality peers, which is important for a p2p distributed network.
BTW, UDP also needs connection tracking, so it wouldn't help here, and usually UDP tracking tables are much smaller and much more short-lived, which adds a lot of new problems. But UDP could probably lower the need for bandwidth as there's no implicit retransmission and no ACK. Of course, the protocol has to be designed to handle packet loss, and it must take into account that NAT gateways usually drop UDP connection table entries much faster. It doesn't make sense to deploy UDP and then reimplement retransmission and keep-alive, as this would replicate TCP with no benefit (it would probably even lower performance).
Also, ipfs should limit the number of outstanding packets, not the number of connections itself. If there are too many packets in flight, it should throttle further communication with peers, maybe prioritizing some over others. This way, it could also auto-tune to the available bandwidth, but I'm not sure. Looking at what BBR does for network queues, it may be better to throw away some requests instead of queuing up a huge backlog. This can improve overall network performance; bloated buffers are a performance killer.
I'd like to run ipfs 24/7 but if it increases my network latency, I simply cannot, which hurts widespread deployment. Maybe ipfs needs to measure latency and throw away slowly responding peers. For this to work properly, it needs to auto-adjust to the bandwidth, because once network queues fill, latency will spike up exponentially and the latency measurement mentioned above becomes useless. These big queues are also a problem with many routers, as they tend to use huge queues to increase total bandwidth for benchmarks, but that totally kills latency and thus stops important services like DNS from working properly.
I'm running a 400/25 Mbps asymmetric link here, and as soon as "ipfs stats bw" gets beyond a certain point, everything else chokes: browsers become unusable, waiting tens of seconds for websites or failing with DNS errors. Once a web request comes through in such a situation, the website appears almost immediately and completely (minus assets hosted on different hosts), so this is clearly an upstream issue with queues and buffers filled up and improper prioritizing (as ACKs still seem to pass early through the queues, otherwise download would be reduced, too).
I don't know if QUIC would really help here... It just reduces initial round-trip times (which HTTP/2 also does), which is not really an issue here as I consider ipfs a bulk-transfer tool, not a latency-sensitive one like web browsing. Does ipfs properly use TOS/QoS flags in IP packets?
PS: ipfs should not try to avoid TCP/IP's auto-tuning capabilities by moving to UDP. Instead it should be nice to competing traffic by keeping latency below a sane limit and let TCP do the bandwidth tuning. And it should be nice to edge-router equipment (which is most of the time cheap and cannot be avoided) by limiting outstanding requests and the total number of connections. I remember when Windows XP tried to fix this in the TCP/IP stack by limiting outstanding TCP handshakes to ten, blocking everything else globally. This was a silly idea but it was thinking in the right direction, I guess. |
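The two-pool idea in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Peer` fields, the scoring formula, the `good_threshold` cut-off and the `TwoPoolManager` class are all hypothetical and not part of ipfs or libp2p.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    peer_id: str
    latency_ms: float      # measured round-trip time
    bandwidth_kbps: float  # observed throughput
    age_s: float           # how long the connection has been alive

    def score(self) -> float:
        # Hypothetical heuristic: bandwidth and age raise the score, latency lowers it.
        return self.bandwidth_kbps + self.age_s - self.latency_ms


@dataclass
class TwoPoolManager:
    good_limit: int
    bad_limit: int
    good_threshold: float = 500.0  # assumed score cut-off between the pools
    good: list = field(default_factory=list)
    bad: list = field(default_factory=list)

    def add(self, peer: Peer) -> bool:
        """Place a peer in a pool; return False if it was rejected."""
        if peer.score() >= self.good_threshold:
            return self._insert(self.good, self.good_limit, peer, replace_any=False)
        return self._insert(self.bad, self.bad_limit, peer, replace_any=True)

    @staticmethod
    def _insert(pool, limit, peer, replace_any):
        if len(pool) < limit:
            pool.append(peer)
            return True
        worst = min(pool, key=Peer.score)
        # Bad pool: a newcomer always displaces the worst entry.
        # Good pool: only a strictly better peer may displace the worst entry.
        if replace_any or peer.score() > worst.score():
            pool.remove(worst)
            pool.append(peer)
            return True
        return False
```

The total number of connections is bounded by `good_limit + bad_limit`, while slow peers still get a slot in the bad pool instead of being starved.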
I think you might as well not do anything at all, since routers are getting consistently better at supporting higher numbers of connections. My 5-year-old router struggled with supporting 2 ipfs nodes (about 600 connections each) + torrent (500 connections). I've just got a cheap Chinese one, and it works like a charm. Most routers nowadays, even cheap ones, have hardware NAT. They don't much care how many connections you throw at them. |
@dsvi: I'd rather not have to pay hard cash just to use IPFS on the pretence that it's fine to be badly behaved because some other software can be misconfigured to crash routers. A lot of people don't even have the luxury of being allowed to connect to their ISP using their own hardware. And what a strawman you've picked — a Bittorrent client! A system that evolved its defaults based on fifteen years real world experience for precisely this reason! No thanks, just fix the code. |
@dsvi I wonder if they use their own routers because the page times out upon request... ;-) But please do not suggest that: Many people are stuck with what is delivered by their providers with no chance to swap that equipment for better stuff. Ipfs has not only to be nice to such equipment but to overall network traffic on that router, too: If it makes the rest of my traffic demands unusable, there's no chance for ipfs to evolve because nobody or only very few could run it 24/7. Ipfs won't reach its goal if it is started by people only on demand. |
Sorry guys, I should have expressed it better. I'll try this time from another direction ;)
And what about people who are stuck with relic hardware for whatever reason? Well, I feel sorry for some of them, but progress will go on with them or without. |
"Internet world is becoming decentralized in general. " Nope! It's becoming centralized. Almost the whole internet is served by a handful datacenter companies. At the beginning we used to have Usenet and IRC servers running on our computers at home. I don't see signs of any decentralization. But I see signs of further centralization. "And creating tons of connections is a natural part of such systems." Having too many simultaneous connections makes the system inefficient. Currently my IPFS daemon opens 2048 connections within several hours to peers then runs out of file descriptors and becomes useless. This should be fixed. |
I'm using a crappy TalkTalk router provided by the ISP and I've been unable to find a configuration where IPFS doesn't drag my internet connection to its knees. Using ifstat I usually see between 200kb/s and 1MB up and down whilst ipfs is connected to a couple of hundred peers. I'd like to try connecting to fewer peers, but even with:
ipfs still connects to hundreds. |
Perhaps this is a dumb question, but why don't you make it so that IPFS stops connecting to more peers once the high water mark is reached? |
We should implement a max connections limit, but high/low water are really designed to be target bounds. The libp2p team is currently refactoring the "dialer" system in a way that'll make it easy for us to configure a maximum number of outbound connections. Unfortunately, there's really nothing we can do about inbound connections except kill them as soon as we can. On the other hand, having too many connections usually comes from dialing. |
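To illustrate why high/low water act as target bounds and not as a hard cap, here is a toy model of one trim pass. It assumes trimming only starts once the count exceeds the high-water mark, aims for the low-water mark, and spares connections younger than the grace period. The `trim` function is hypothetical and simplified; it is not the libp2p connection manager's code, and which candidates get closed first is arbitrary here.

```python
def trim(conn_ages_s, low_water, high_water, grace_period_s):
    """Return the ages of the connections kept after one trim pass.

    conn_ages_s: ages (in seconds) of the currently open connections.
    """
    if len(conn_ages_s) <= high_water:
        return list(conn_ages_s)  # at or below the high-water mark: nothing is closed

    # Connections still inside the grace period are never closed.
    protected = [a for a in conn_ages_s if a < grace_period_s]
    candidates = sorted(a for a in conn_ages_s if a >= grace_period_s)

    # Keep only as many candidates as needed to reach low_water in total.
    keep_from_candidates = max(0, low_water - len(protected))
    return protected + candidates[:keep_from_candidates]
```

In this model, 300 connections that are all 5 seconds old survive a pass with `low_water=10`, `high_water=15` and a 30 second grace period, because every one of them is protected. Nothing stops new connections from being opened above the high-water mark in the first place.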
Note: there's actually another issue here. I'm not sure if limiting the max number of open connections will really fix this problem. I haven't tested this but I'm guessing that many routers have problems with connection velocity (the rate at which we (try to) establish connections) not simply having a bunch of connections. That's because routers often need to remember connections even after they've closed (for a period of time). @vyzo's work on NAT detection and autorelay should help quite a bit, unless I'm mistaken. |
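A back-of-the-envelope way to reason about connection velocity: if a router remembers every tracked connection for some time after it closes or fails, the steady-state number of table entries is roughly the dial rate times that retention time (Little's law). The sketch below uses the NAT table size of around 7000 quoted earlier in the thread; the 120 second retention time and the dial rate are assumptions for illustration, not measurements.

```python
def steady_state_entries(dials_per_second: float, entry_lifetime_s: float) -> float:
    """Little's law: average entries in the table = arrival rate * time each entry stays."""
    return dials_per_second * entry_lifetime_s


def max_dial_rate(table_size: int, entry_lifetime_s: float) -> float:
    """Highest sustained dial rate that keeps the table from overflowing."""
    return table_size / entry_lifetime_s


TABLE_SIZE = 7000           # figure quoted earlier in the thread
ENTRY_LIFETIME_S = 120.0    # assumed retention time per tracked connection

entries_at_50_per_s = steady_state_entries(50, ENTRY_LIFETIME_S)
sustainable_rate = max_dial_rate(TABLE_SIZE, ENTRY_LIFETIME_S)
```

Under these assumptions, 50 dials per second would hold about 6000 entries, and anything above roughly 58 dials per second would eventually overflow the table, even if very few of those connections stay open.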
A work-around could be to limit the number of opening connections (in contrast to opened connections) - thus reducing the number of connection attempts running at the same time. I think this could be much more important than limiting the number of total connections. If such a change propagated through the network, it should also reduce the amount of overwhelming incoming connection attempts - especially those with slow handshaking because the sending side is not that busy with opening many connections at the same time. |
We actually do that (mostly to avoid running out of file descriptors). We limit ourselves to opening at most 160 TCP connections at the same time. |
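Limiting connections that are being opened (as opposed to those already open) can be sketched with a semaphore. This is a generic illustration, not how libp2p implements it: `dial_all` and the `dial` callable are hypothetical, and only the limit of 160 comes from the comment above.

```python
import asyncio


async def dial_all(addrs, dial, max_in_flight=160):
    """Dial every address, but never more than max_in_flight at the same time."""
    sem = asyncio.Semaphore(max_in_flight)

    async def limited(addr):
        async with sem:  # waits here while max_in_flight dials are still pending
            return await dial(addr)

    # Results come back in the same order as addrs; failures are returned, not raised.
    return await asyncio.gather(*(limited(a) for a in addrs), return_exceptions=True)
```

The semaphore bounds only the handshakes in progress; once a dial completes, its slot is freed, so the number of established connections is not limited by this at all.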
@Stebalien Curious, since when? Because I noticed a while ago that running IPFS no longer chokes DNS resolution of my router... |
Hello in 2022, same as above with a CH7465LG-ZG. In my whole life I haven't run into issues like this with software like this one. Anyway, dropped it, waste of my time. |
The real problem is the large number of concurrent connections, typically from dht and bitswap; that's what needs to be fixed. They both have a tendency to create connection avalanches, which apparently overflow router queues and make them crash. Blanket disabling reuseport will throw the baby out with the bathwater as it is really necessary for hole punching. |
Here's a thing we need to experiment with to get a better idea of where the actual fault is. Could someone who has this very issue, and for whom disabling portreuse appears to fix it, try this one simple thing? Thus far only @urbenlegend mentions Windows as not having issues. I'd like someone else to confirm that. Nothing against @urbenlegend, but I just need to know if this is a pattern or an exception. I'm asking because I just remembered that all my testing (and that of 99% of the people in this thread) was on Linux. I don't even have a Windows installation to test this with. So can someone confirm whether this very same issue exists on Windows too? That would be very helpful! Note that this request might seem weird at first glance because it's the router that crashes. But it's your computer that is asking the router to do things it doesn't like! If it exists on Windows too then the bug can still be anywhere. |
👍 @markg85 I remember trying this with you, and we set the connection number to 60 and yet it still crashed (it took more time). It would be nice to see some evidence about connection number being high being a problem. |
In all fairness there, that was just the low/hi water setting adjustment. It's not a limit on the number of connections it makes. If I recall correctly it still had a gazillion connections over time. What might help, an option we didn't have back then, is using the Swarm.ResourceMgr to really limit things. Hypothetically, even if using that fixes it, you still won't know if you fixed the cause or just reduced the symptoms to be so rare that it doesn't appear to occur anymore. More research is needed! |
IPFS used to kill my router within 15-30 minutes when using IPv4, and twice the amount when using IPv6. I found 2 solutions to this (tested for a couple of weeks):
What did not work was just limiting the number of connections. It would only take longer to kill my router. Other than that, I had to limit the maximum number of open connections, because when it went above 500, the router was clogged (google.com opening in 5-10 seconds). |
You can try blackholing private subnet routes. I think the biggest issue is ipfs trying to connect to non-routed private subnets via TCP. Taking away that burden from the router should fix a lot of stability problems already:
ip route add blackhole 10.0.0.0/8
ip route add blackhole 172.16.0.0/12
ip route add blackhole 192.168.0.0/16
If you have actually reachable private subnets behind your router, you should add more specific routes (longer prefix) so they still get routed, or add the blackhole routes to the router instead. But for a single private subnet, these routes should just work. |
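To see which destinations the three blackhole routes above would catch, the same prefix logic can be reproduced with Python's standard `ipaddress` module. This is only a sketch: `would_be_blackholed` and its `local_subnets` parameter are hypothetical helpers that mimic longest-prefix matching, they do not touch the routing table.

```python
import ipaddress

# The three private IPv4 ranges covered by the blackhole routes above.
PRIVATE_V4 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def would_be_blackholed(addr: str, local_subnets=()) -> bool:
    """True if addr falls in a blackholed range and not in a more specific local subnet."""
    ip = ipaddress.ip_address(addr)
    if ip.version != 4:
        return False  # the routes above only cover IPv4
    if any(ip in ipaddress.ip_network(s) for s in local_subnets):
        return False  # a longer-prefix route to a reachable subnet wins over the blackhole
    return any(ip in net for net in PRIVATE_V4)
```

For example, with `local_subnets=["192.168.10.0/24"]`, an address such as 192.168.10.101 stays reachable while 192.168.20.5 or 172.17.0.1 would be dropped locally and never reach the router.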
Is it possible to have ipfs perform this sort of behavior automatically? For example, are there (userspace) network mapping techniques that we can use to understand which private networks are actually routable? Even without automatic mapping, users might prefer to apply address filtering within ipfs itself to avoid making doomed connection attempts altogether. |
Doesn't seem to work for me, my router seems to choke even when there are only ~60-80 connections.
LIBP2P_TCP_REUSEPORT=false ipfs daemon
Output:
Initializing daemon...
Kubo version: 0.17.0
Repo version: 12
System version: amd64/linux
Golang version: go1.19.1
2022/12/18 16:40:26 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/lucas-clemente/quic-go/wiki/UDP-Receive-Buffer-Size for details.
Swarm listening on /ip4/127.0.0.1/udp/4001/quic
Swarm listening on /ip4/172.17.0.1/udp/4001/quic
Swarm listening on /ip4/172.18.0.1/udp/4001/quic
Swarm listening on /ip4/172.19.0.1/udp/4001/quic
Swarm listening on /ip4/172.20.0.1/udp/4001/quic
Swarm listening on /ip4/192.168.10.101/udp/4001/quic
Swarm listening on /ip6/::1/udp/4001/quic
Swarm listening on /p2p-circuit
Swarm announcing /ip4/113.43.201.170/udp/4001/quic
Swarm announcing /ip4/127.0.0.1/udp/4001/quic
Swarm announcing /ip4/192.168.10.101/udp/4001/quic
Swarm announcing /ip6/::1/udp/4001/quic
API server listening on /ip4/127.0.0.1/tcp/5001
WebUI: http://127.0.0.1:5001/webui
Gateway (readonly) server listening on /ip4/127.0.0.1/tcp/8080
Daemon is ready
And settings:
|
I've disabled TCP:
"Transports": {
"Network": {
"TCP": false
},
"Security": {},
"Multiplexers": {}
}
QUIC as swarm:
"Swarm": [
"/ip4/0.0.0.0/udp/4001/quic",
"/ip6/::/udp/4001/quic"
]
And I still get the error with connections and streams capped:
"ConnMgr": {
"Type": "basic",
"LowWater": 10,
"HighWater": 15,
"GracePeriod": "30s"
},
"ResourceMgr": {
"Limits": {
"System": {
"Conns": 50,
"Streams": 50
}
}
}
Operating System:
% uname -a
Linux cryptsus 6.1.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 21 Dec 2022 22:27:55 +0000 x86_64 GNU/Linux
Router/Modem (it's from Telus and they use a homemade variation of OpenWrt):
|
Still not fixed? |
It took me a while to find this ticket but it seems like I am hitting the exact same bug. A combination of QUIC only + disabling TCP + configuring ConnMgr seems to make it stable so far. |
Reporting back after a year of running an IPFS node. The issue still occurs occasionally even with very conservative configurations. What makes it special is that my qbtorrent connections run smoothly without any trouble, but IPFS starts to fry routers. |
@TechTheAwesome can you please try one of the solutions proposed above and report whether it works (disabling reuseport or disabling TCP)? |
@Jorropo Unfortunately, I did try both solutions and neither seemed to work. The daemon runs for 1-2 minutes, gets up to 300 peers, and then my internet connection starts to get cut off. Is there any way I can export an IPFS log of some sort? System:
IPFS:
|
I've found that after some update, the connection limits of kubo were totally out of control. Putting the default settings into the
"ConnMgr": {
"GracePeriod": "20s",
"HighWater": 96,
"LowWater": 32,
"Type": "basic"
},
It now hovers around 20 to 30 connections. |
@kakra Below is my
"ConnMgr": {
"GracePeriod": "1m0s",
"HighWater": 40,
"LowWater": 20,
"Type": "basic"
},
And my entire config, including setting
{
"API": {
"HTTPHeaders": {
"Access-Control-Allow-Origin": [
"https://webui.ipfs.io",
"http://webui.ipfs.io.ipns.localhost:8080"
]
}
},
"Addresses": {
"API": "/ip4/127.0.0.1/tcp/5001",
"Announce": [],
"AppendAnnounce": [],
"Gateway": "/ip4/127.0.0.1/tcp/8080",
"NoAnnounce": [],
"Swarm": [
"/ip4/0.0.0.0/tcp/4001",
"/ip6/::/tcp/4001",
"/ip4/0.0.0.0/udp/4001/quic",
"/ip4/0.0.0.0/udp/4001/quic-v1",
"/ip4/0.0.0.0/udp/4001/quic-v1/webtransport",
"/ip6/::/udp/4001/quic",
"/ip6/::/udp/4001/quic-v1",
"/ip6/::/udp/4001/quic-v1/webtransport"
]
},
"AutoNAT": {},
"Bootstrap": [
"/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
"/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb",
"/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
"/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
"/ip4/104.131.131.82/udp/4001/quic/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
"/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
"/ip4/45.76.100.74/udp/4001/quic/p2p/12D3KooWK2DoikedHm2jQgbknGMhR2SSrGKuFoWN2Xj1EUpi1nYW",
"/ip4/45.76.244.78/udp/4001/quic/p2p/12D3KooWPgSN1V3PKhroXQ1LBTN9LCHmk5jqvxuYbr12BKuYxFYG"
],
"DNS": {
"Resolvers": {}
},
"Datastore": {
"BloomFilterSize": 0,
"GCPeriod": "1h",
"HashOnRead": false,
"Spec": {
"mounts": [
{
"child": {
"path": "blocks",
"shardFunc": "/repo/flatfs/shard/v1/next-to-last/2",
"sync": true,
"type": "flatfs"
},
"mountpoint": "/blocks",
"prefix": "flatfs.datastore",
"type": "measure"
},
{
"child": {
"compression": "none",
"path": "datastore",
"type": "levelds"
},
"mountpoint": "/",
"prefix": "leveldb.datastore",
"type": "measure"
}
],
"type": "mount"
},
"StorageGCWatermark": 90,
"StorageMax": "10GB"
},
"Discovery": {
"MDNS": {
"Enabled": true
}
},
"Experimental": {
"AcceleratedDHTClient": false,
"FilestoreEnabled": false,
"GraphsyncEnabled": false,
"Libp2pStreamMounting": false,
"P2pHttpProxy": false,
"StrategicProviding": false,
"UrlstoreEnabled": false
},
"Gateway": {
"APICommands": [],
"HTTPHeaders": {
"Access-Control-Allow-Headers": [
"X-Requested-With",
"Range",
"User-Agent"
],
"Access-Control-Allow-Methods": [
"GET"
],
"Access-Control-Allow-Origin": [
"*"
]
},
"NoDNSLink": false,
"NoFetch": false,
"PathPrefixes": [],
"PublicGateways": null,
"RootRedirect": "",
"Writable": false
},
"Identity": {
"PeerID": "12D3KooWKM6QU7jdvcf6M96RWGUNmCAGJ7aCRKU9odEbbs5ddJuX"
},
"Internal": {},
"Ipns": {
"RecordLifetime": "",
"RepublishPeriod": "",
"ResolveCacheSize": 128
},
"Migration": {
"DownloadSources": [],
"Keep": ""
},
"Mounts": {
"FuseAllowOther": false,
"IPFS": "/ipfs",
"IPNS": "/ipns"
},
"Peering": {
"Peers": [
{
"Addrs": [
"/ip4/35.78.51.148/udp/4001/quic"
],
"ID": "12D3KooWBsyKEDH1x4GhSjXUNwXGfb9HXbvTzeHBert2AevcyFnx"
}
]
},
"Pinning": {
"RemoteServices": {}
},
"Plugins": {
"Plugins": null
},
"Provider": {
"Strategy": ""
},
"Pubsub": {
"DisableSigning": false,
"Router": ""
},
"Reprovider": {},
"Routing": {
"Methods": null,
"Routers": null
},
"Swarm": {
"AddrFilters": null,
"ConnMgr": {
"GracePeriod": "1m0s",
"HighWater": 40,
"LowWater": 20,
"Type": "basic"
},
"DisableBandwidthMetrics": false,
"DisableNatPortMap": false,
"RelayClient": {},
"RelayService": {},
"ResourceMgr": {},
"Transports": {
"Multiplexers": {},
"Network": {
"TCP": false
},
"Security": {}
}
}
} |
@TechTheAwesome Maybe reduce your grace period: as far as I understand, it sets the minimum time a connection is kept regardless of the high water mark. Also try setting address filters: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmaddrfilters
"Swarm": {
"AddrFilters": [
"/ip4/10.0.0.0/ipcidr/8",
"/ip4/172.16.0.0/ipcidr/12",
"/ip4/192.168.0.0/ipcidr/16"
],
"... your remainder swarm config here": "..."
}
This will prevent your node from connecting to local machines, but it will also stop your node from trying to connect to seemingly local nodes, which would otherwise be sent via your router. The router will probably just forward them to its default gateway and create a useless NAT mapping, because these local networks are not routable on the WAN side of your router. IPv6 works better here because it knows the routing scopes of your addresses: it won't try to route site-scope addresses via the WAN interface. So no need to bother with filters for IPv6. After saving the changes, restart your node and maybe also your router (so you don't carry any artifacts over). |
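If you prefer to script the address filter change, a minimal sketch could merge the filters into an already parsed config like this. The `add_addr_filters` helper is hypothetical, and the filter strings use the `ip4` multiaddr prefix as in the Kubo config documentation linked above. Reading and writing the actual config file is left out on purpose, since its location depends on your setup.

```python
import json

PRIVATE_FILTERS = [
    "/ip4/10.0.0.0/ipcidr/8",
    "/ip4/172.16.0.0/ipcidr/12",
    "/ip4/192.168.0.0/ipcidr/16",
]


def add_addr_filters(config: dict, filters=PRIVATE_FILTERS) -> dict:
    """Return a copy of config with the filters merged into Swarm.AddrFilters."""
    updated = json.loads(json.dumps(config))  # cheap deep copy of JSON-compatible data
    swarm = updated.setdefault("Swarm", {})
    existing = swarm.get("AddrFilters") or []  # the key may be null in a default config
    swarm["AddrFilters"] = existing + [f for f in filters if f not in existing]
    return updated
```

The function leaves the rest of the `Swarm` section untouched and does not add a filter twice, so it is safe to run on a config that already contains some of the entries.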
Contrary to what is apparently popular belief, the low/high water marks mean nothing in this specific case. How it "roughly" works is that your node asks other peers for their peer list. Do take this with a grain of salt: I don't know the exact internals and might be off on the specifics here. But in general it does work this way. You can see this yourself. On Linux, install a package called "ttyplot" and run the following command: Or if you have no ttyplot, just do: And copy its output to some plot/chart tool of your choice. What you'll see is the actual connections your node has open. At the grace period you'll see a sharp decline (it killed connections above the highWater mark). Moral of the story: highWater/lowWater/grace have nothing to do with fixing this issue. They can, at best, make it occur less often. |
So I discovered my ipfs node thrashing over 5000 connections, which impacted the network so badly i couldn't even do a simple git pull. I noticed it because other devices on the network were also struggling to hit just 100kb/s. More or less default network settings for IPFS, i have a static IP and this is all dockerised, i haven't had any issues in the last few months with connectivity broadly. I've disabled TCP as per above suggestions to see if that helps, and have bumped from v0.19 to v0.20 but
Is it possible that there is some kind of regression/bug in a recent version of IPFS that is causing connections to thrash, or have I just been lucky for the last few months until now? |
This issue is very old, and some workarounds exist but are buried in the middle of the thread; it's also a collection of various un-actionable opinions. I've created a new issue to collect a table of remaining problems: #9998 |
Note that the dial prioritization logic we introduced in the v0.28 go-libp2p release (disabled by default) will dramatically reduce the number of spurious dial attempts (especially on TCP, which is probably what creates the most problems with routers). go-libp2p v0.29 will enable dial prioritization by default, and will be included in the next Kubo release. |
The current BT internet routers for VDSL in the UK are definitely susceptible to this, they freeze up, with the lights still indicating no problem but no response over wifi or LAN. Mine has done this probably 10 times in the last day. I suspect the problem is too many open connections - I'm experimenting with the Swarm settings now to see if I can narrow this down, but I did apply a bandwidth cap with wondershaper and it didn't seem to fix the problem, so now I'm experimenting with the High Water and Low Water settings in ipfs to see if this can fix the problem. To be specific, I have the BT Business Smart Hub 2. The consumer level hub is basically the same hardware, so that's most likely susceptible too. This is a brand-new router and very common in the UK - after a nice conversation with a helpful guy in their second line technical support, he basically said; "yeah, the routers are not good, I recommend you replace it". I've ordered a Draytek, I imagine that will fix the problem. |
I think one way to prevent such routers from locking up is to prevent them from routing private destinations to the WAN interface in the first place. Unless you can add blackhole routing entries in the router itself, you should instruct the PC running kubo not to route private destinations to your WAN router. You'd need to add 10/8, 172.16/12 and 192.168/16 as blackholes in the routing table. Those destinations won't be reachable on the internet anyway, but most routers don't care and just fill NAT tables with junk until they eventually lock up or kill valid running connections by invalidating connection tracking states early. |
So we know that ipfs can kill people's routers. We should do a quick write-up of what the causes are, which routers are normally affected, and maybe propose a couple of ideas for solutions.
@Kubuxu do you think you could handle doing this at some point?