connectd pegging 100% CPU #5301
First step is not to do dumb things, second step is to optimize dumb things :)

We scan for two reasons:

1. When a peer connects, we send them any gossip we made ourselves. We do this in a naive way, by scanning the entire store. Fix: put our own gossip in a separate store file, which is what @cdecker wants for sharing gossip files anyway. This adds some gossmap complexity, however, since gossmap now needs to handle two files.
2. When they send a gossip_timestamp_filter message, we scan the entire store to see if anything matches the filter they've given. But it's not unusual to send a dummy filter to say "I don't want anything from you": LND and CLN both use 0xFFFFFFFF for this, so I've optimized that case (see the sketch below).
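As a rough illustration of that short-circuit (this is not connectd's actual code; the struct, field, and function names here are hypothetical), the idea is simply to recognize the "matches nothing" filter before touching the store at all:

```c
/* Minimal sketch, not connectd's real implementation: skip the
 * gossip_store scan when the peer's filter can never match anything.
 * Type and function names are illustrative only. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct gossip_timestamp_filter {
	uint32_t first_timestamp;   /* start of requested time range */
	uint32_t timestamp_range;   /* length of requested time range */
};

/* LND and CLN both send first_timestamp = 0xFFFFFFFF as a "dummy"
 * filter meaning "send me nothing". */
static bool filter_matches_nothing(const struct gossip_timestamp_filter *f)
{
	return f->first_timestamp == 0xFFFFFFFFu;
}

int main(void)
{
	struct gossip_timestamp_filter dummy = { 0xFFFFFFFFu, 0xFFFFFFFFu };
	struct gossip_timestamp_filter real  = { 1650000000u, 0xFFFFFFFFu };

	/* Only the second filter would trigger a full store scan. */
	printf("dummy filter scans store: %s\n",
	       filter_matches_nothing(&dummy) ? "no" : "yes");
	printf("real filter scans store:  %s\n",
	       filter_matches_nothing(&real) ? "no" : "yes");
	return 0;
}
```

The point is just that the dummy-filter check is O(1), so peers that ask for nothing no longer cost a pass over the whole gossip_store.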
v0.11.2 might be ever so slightly better on this issue, but is it expected to have a constant parade of
and
in the log? I know I have a lot of peers, but the rate of link flapping still seems excessive. And how is "Reconnected" ever the explanation of a "Peer transient failure"? That seems fishy to me. Also, I have noticed that after some time my node stops accepting incoming connections entirely. I thought it was happening due to a file descriptor mixup while running under Heaptrack, but it happened again even with no Heaptrack in the mix.
These log messages are overzealous. I'm seeing the "Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died" mainly when peers don't respond to pings (which is a message you only get at DEBUG level, but should probably be INFO). With 21 connected peers, I am seeing 36 of these in 24 hours (yeah, one peer responsible for over half). The "Reconnected" message is when they reconnect to us and we have an already-live connection, so we abandon the old one in favor of the new; I've seen 3 of these in the same period. I haven't seen the failure to accept incoming connections! That's weird...
I think it was due to
fd leak? That should show up in
@rustyrussell: My
We should take the discussion of the potential FD leak over to #5353 and leave this issue for the CPU usage.
I'm going to leave this issue open. Though it's mitigated for 0.12 in #5342, I know that CPU usage can be further significantly reduced. @whitslack is most likely to see this (the CPU usage will be on first connect), so I'm leaving it open, earmarked for the next release.
With the pile of recent v23.02 optimizations, this should be better. Please reopen if I'm wrong! |
If this ethos is prevalent throughout the codebase, then it starts to make sense why CLN is so CPU-hungry. Maybe something similar is at play in
That's fascinating: can you give me a
(I usually use perf report in TUI mode, but the man page suggests it will do something useful if redirected!)
Well, kind of? Premature optimization is the root of all evil, but complexity isn't free either.
@rustyrussell: Assuming you meant
I let
I don't know what to make of the report, but here it is:
It looks like a lot of overhead in the
Why does
Issue and Steps to Reproduce
Running v0.11.1, lightning_connectd is utilizing 100% of a CPU core. Attaching strace to the process reveals that it is hammering pread64 calls on the gossip_store file. The reads are all very small, most exactly 12 bytes in size, and none in my sample even approached the page size. Why is connectd not memory-mapping this file for performance? Syscalls are to be minimized wherever possible.
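As a rough illustration of the memory-mapping suggestion (this is not connectd's code, and the 2-byte length-prefixed record format below is a hypothetical stand-in for the real gossip_store layout), mapping the store once replaces the per-record pread64 syscalls with plain pointer reads:

```c
/* Sketch only: scan a file of length-prefixed records through a single
 * mmap() instead of issuing one small pread64() per record. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <store-file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	struct stat st;
	if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 1; }

	/* One syscall maps the whole file; the kernel pages it in on demand,
	 * so the loop below does memory reads rather than syscalls. */
	uint8_t *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

	size_t off = 0, records = 0;
	while (off + 2 <= (size_t)st.st_size) {
		/* Hypothetical record header: 2-byte big-endian payload length. */
		uint16_t len = (uint16_t)((map[off] << 8) | map[off + 1]);
		if (off + 2 + len > (size_t)st.st_size)
			break;
		/* Real code would apply timestamp filters to map + off + 2 here. */
		off += 2 + len;
		records++;
	}
	printf("scanned %zu records with no per-record read syscalls\n", records);

	munmap(map, st.st_size);
	close(fd);
	return 0;
}
```

The design point is that the kernel's page cache then backs the scan directly, so repeated full-store scans cost cache-warm memory accesses instead of millions of tiny syscalls.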