Headscale stops accepting connections after ~500 nodes (likely 512) (0.23-alpha2) #1656
Comments
Update to this - I have 2 headscale instances behind my nginx proxy, and after splitting out the connection status, it looks like this is cutting off at 512 nodes. I'm not sure if that's helpful, but since it's a magic number of sorts, it might be.
@jwischka Have you made sure you are not hitting your file descriptor limits?
@TotoTheDragon Negative, unfortunately. Just tried raising the ulimit to 16384 on alpha3; the same issue persists. Almost immediately gets to: root@headscale:/home/user# headscale nodes list
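(For anyone checking the same thing, here is a rough sketch of how one might verify whether the process is actually near its file descriptor limit on a typical Linux host. These are standard Linux commands, not headscale-specific tooling:)

```sh
# Find the headscale PID, then compare its open file descriptors to its limit.
pid=$(pidof headscale)

# Effective limit of the running process ("Max open files").
grep "open files" /proc/"$pid"/limits

# Number of file descriptors the process currently has open.
ls /proc/"$pid"/fd | wc -l

# Soft limit of the current shell (only applies to processes started from it).
ulimit -n
```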
@jwischka would you be able to test this with the current version of main?
@TotoTheDragon Forgive my ignorance, but is there a snapshot build available? I don't have a build env set up, and won't be able to create one on short notice.
@jwischka Just released a new alpha4, please give that a go (https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha4) and report back.
After updating and fixing the database config section, I'm getting the following error: FTL Migration failed: LastInsertId is not supported by this driver error="LastInsertId is not supported by this driver" (postgres 12)
Yep, on it - it was a regression in an upstream dependency, see #1755
Could you please test if this is still the case with https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha5? The postgres issues should now be resolved.
@kradalby I think this may be resolved. I had some issues updating the config file, but rebuilt. Memory usage is way up, but overall processor usage is substantially lower than 0.22.2. It takes a while for a client to connect in, but clients do appear to be reliably able to connect even when there are a lot of them. I'll monitor and report back.
Thank you. Could you elaborate on how far up memory usage is? I would expect it to be up, since we now keep more state in memory, but it would still be nice to compare some numbers.
@kradalby I ran for months at about ~700MB of the 4GB of memory in my container instance; it would vary between 500-900MB. After installing alpha 5 I'm at 8GB, after bumping the container's memory to 8GB. It's actually consuming so much memory I can't log in. Disk IO also increased precipitously, probably (possibly) from trying to swap? Initial usage doesn't appear to be that bad, but something blows up at some point: it goes from using about 200MB to about 2.4GB after 2-3 minutes, then jumps another 500MB or so a couple of minutes later. Periodically I get massive CPU spikes (2500%) with an accompanying 6GB or so of usage, and every time this happens it seems to grow another 500MB or so.

As mentioned in another thread, I'm getting a ton of errors like the following in the logs:

Feb 22 00:53:04 headscale-server-name headscale[407]: 2024-02-22T00:53:04Z ERR update not sent, context cancelled error="context deadline exceeded" hostname=clientXXXX mkey=mkey:c691d490de423e2daccdd980f217e827b1431788c388ed0baf2c1c0c40413637 origin=poll-nodeupdate-onlinestatus

Also a lot of:

Feb 22 00:54:18 headscale-server-name headscale[407]: 2024-02-22T00:54:18Z ERR ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:56 > Could not write the map response error="client disconnected" node=dfa205016 node_key=[n+gDM] omitPeers=false readOnly=false stream=true

I may have to revert if this continues, since this is a semi-production machine.
@kradalby Further update - it looks like once the memory runs out there are a lot of other connection issues: I go from ~500 units connected to between 100-250. Let me know if there's something else I can do to help debug this for you.
@kradalby Sounds good - let me know when you think it's semi-ready and I can give it a go. I've got the ability to snapshot the container and roll back, but obviously I don't want to do that a ton if there's a chance of breaking things.
I'm looking to run a large cluster in about a year or so, and I can contribute dev time to this effort. Anything I can help with?
Could you please try the newest alpha (https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha6) and report back?
I can't find release binaries for alpha6, so I'm running alpha7. I'm already seeing reduced memory usage a couple of hours in. Anything in particular I should look for? I see new log lines talking about partial updates.
Sorry, 6 quickly got replaced with 7 because of an error. In principle I would say that "nodes stay connected over time" is the main goal - that none of them lose connection. The main change for performance and resource usage is a change in how updates are batched and sent to the clients.
@kradalby Is alpha 7 working with postgres? I'm getting errors. Also, it's showing that alpha 8 is out but not posted? Thanks
@kradalby I got alpha 8 to work. At least so far things look stable, but it looks like the issue (~500 nodes) still persists. When I try to join a client machine that requests the entire tailnet (i.e., a machine that should be able to connect to all 500 nodes), I get the following errors after things have settled down:
If I kill the process and the node is one of the first to try to join the tailnet, it joins quite quickly. You may understand these errors better than I do, but it looks to me like the client is timing out and requesting things again. These errors appear every few seconds. I'm also getting a lot of these messages (thousands) in the log files.

So it looks like memory usage, processor usage, etc. are better, but the connection issues may still remain. It looks like I've leveled out at 538 connections. I can try to add a few different clients here in a bit and see if that number goes up, or how long it takes to connect clients. Is there anything I can supply you that would be helpful?
There are two new, undocumented tuning options (please note they might change or be removed) which could help us figure out where the holdup is: https://github.com/juanfont/headscale/blob/main/hscontrol/types/config.go#L235-L236

The first is the maximum amount of time it waits for updates before bundling them and sending them to the node. The second is how many updates a node can receive before it will block. Can you experiment with these numbers? I would not set the wait higher than 1s, but lower will flush to clients more often; I would expect a lower wait to mean lower memory and higher CPU. For the batcher, you might see a tiny increase in memory, but I doubt it, as it's mostly just more ints allowed on a channel - you could try quite a few different large numbers and report back: 100, 300, 500, even 1000.

It would help if you could try different combinations of these and make a table with your observations; it's hard for me to simulate environments like yours. Thanks.
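(For reference, a hypothetical sketch of what that tuning section could look like in headscale's config.yaml. The key names and defaults here are assumptions based on the linked config.go and the option named in the next comment, so verify against the alpha actually in use:)

```yaml
# Hypothetical excerpt of the tuning section in config.yaml (0.23 alphas).
# Key names and defaults are assumptions; check hscontrol/types/config.go.
tuning:
  # Maximum time updates are buffered before being bundled and sent to a node.
  batch_change_delay: 800ms
  # How many queued updates a node's map session channel holds before it blocks.
  node_mapsession_buffered_chan_size: 30
```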
Some updates - As soon as I start to increase node_mapsession_buffered_chan_size, I immediately get a lot of crashes. Even a value of 35 will result in reliable crashes. The error message is longer than my console log, but it ends with:
Setting the value back down to 30 seems to calm things down. Interestingly, if I increase the number of postgres max_open_conns I can get similar errors, and I can also generate http related errors:
With the 0.23 alphas I've been leaving the node update check interval at 10, but I bumped that to 30 and things seem to behave better (I had it at 35s on 0.22). I'm not really having any CPU/memory issues at the moment, but the real problem is that when I try to connect with one of my "main" users that ought to be able to see all of the other nodes (via ACL policy), it's never able to join - it just hangs for minutes.

Also worth noting because it's probably relevant: ACLs seem to be broken on this instance (unless there was a change to the ACL behavior that I missed), and all of my 500+ nodes can see all of my other 500+ nodes, which is... well, kind of bad. Probably bad enough that I need to revert. I'm not exactly sure what the issue is, because I have another instance that I also updated to alpha8 and the ACLs work fine there. Both instances use a similar ACL structure, so it's somewhat odd that it works in one place and not the other. Thoughts?

I saved the config file this time, so it's pretty easy to try new stuff at this point.
I believe the fixes in https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha12 should resolve this issue; let me know if not and we will reopen it.
Bug description
I have a large headscale instance (~550 nodes). 0.22.3 has extreme CPU usage (i.e., 30 cores), but is able to handle all clients. 0.23-alpha2 has substantially better CPU performance, but stops accepting connections after about 500 nodes. CLI shows "context exceeded" for all queries (e.g. "headscale nodes/users/routes list") and new clients are unable to join.
CPU usage after connections stop is relatively modest (<50%), and connected clients appear to be able to access each other (e.g. ping/login) as expected.
Environment
Headscale: 0.23-alpha2
Tailscale: various, but mostly 1.54+, mostly Linux, some Mac and Windows.
OS: Headscale installed in privileged Proxmox LXC container (Ubuntu 20.04.6), reverse proxied behind nginx per official docs
Kernel: 6.2.16
Resources: 36 cores, 8GB RAM, 4GB swap (RAM usage is quite small)
DB: Using postgres backend (sqlite would die after about 100 clients on 0.22.X with similar symptoms)
Config: ACLs are in use based on users. Two users have access to all nodes. Most nodes have access only to a relatively small set of other nodes via ACL. Specifically, there are 7 nodes that have access to everything, but on almost all other nodes "tailscale status" will return only 8 devices.
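(For illustration only, a hypothetical HuJSON sketch of the policy shape described above. The user names are invented and this is not the actual policy from the affected instance:)

```
// Hypothetical policy shape: a couple of "admin" users reach every node,
// every other user only reaches its own small set of nodes.
{
  "acls": [
    { "action": "accept", "src": ["admin1", "admin2"], "dst": ["*:*"] },
    { "action": "accept", "src": ["site-a"], "dst": ["site-a:*"] }
    // ...one similar rule per restricted user
  ]
}
```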
To Reproduce
I suspect this is difficult to reproduce, but: have large numbers of clients connect in.
Given that the behavior is a direct change from 0.22.3 -> 0.23-alpha2, I don't think the container or reverse proxy has anything to do with it. I can, with some effort, bypass the reverse proxy and do a direct port-forward from a firewall instead, but running on bare metal will be substantially more difficult.
Because I'm in a container, I can easily snapshot/test/revert possible fixes.