
connectd pegging 100% CPU #5301

Open
whitslack opened this issue Jun 1, 2022 · 13 comments

@whitslack
Collaborator

Issue and Steps to Reproduce

Running v0.11.1, lightning_connectd is utilizing 100% of a CPU core. Attaching strace to the process reveals that it is hammering pread64 calls on the gossip_store file. The reads are all very small, most exactly 12 bytes in size, and none in my sample even approached the page size. Why is connectd not memory-mapping this file for performance? Syscalls are to be minimized wherever possible.
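
To illustrate the difference, here is a minimal sketch of scanning a length-prefixed record file through a single mmap(2) instead of issuing one pread64(2) per header. The record_hdr layout is hypothetical (the 12-byte reads suggest a fixed-size header), not the actual gossip_store format:

/* Hedged sketch: walk the store from a memory mapping instead of one
 * pread64() per 12-byte header.  The header layout below is hypothetical,
 * not CLN's actual on-disk gossip_store format. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct record_hdr {            /* hypothetical fixed-size record header */
        uint32_t len;          /* payload length in bytes */
        uint32_t crc;
        uint32_t timestamp;
};

static int scan_store_mmap(const char *path)
{
        int fd = open(path, O_RDONLY);
        if (fd < 0)
                return -1;

        struct stat st;
        if (fstat(fd, &st) < 0 || st.st_size == 0) {
                close(fd);
                return -1;
        }

        /* One syscall maps the whole file; the kernel pages it in lazily. */
        const uint8_t *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);
        if (base == MAP_FAILED)
                return -1;

        size_t off = 0, nrecords = 0;
        while (off + sizeof(struct record_hdr) <= (size_t)st.st_size) {
                struct record_hdr hdr;
                memcpy(&hdr, base + off, sizeof(hdr));  /* no syscall per record */
                off += sizeof(hdr) + hdr.len;
                nrecords++;
        }

        printf("scanned %zu records\n", nrecords);
        munmap((void *)base, st.st_size);
        return 0;
}

int main(int argc, char **argv)
{
        return argc > 1 ? scan_store_mmap(argv[1]) : 1;
}

Run against a copy of the store; strace should then show a single mmap call instead of a stream of tiny reads.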

@rustyrussell
Contributor

First step is not to do dumb things, second step is to optimize dumb things :)

We scan for two reasons: first, when they connect we send them any gossip we made ourselves. We do this in a naive way, by scanning the entire store. Fix: put our own gossip in a separate store file, which is what @cdecker wants anyway for sharing gossip files. This adds some gossmap complexity, however, since gossmap now needs to handle two files.

Second, when they send a gossip_timestamp_filter message, we scan the entire store to see if any match the filter they've given. But it's not unusual to send a dummy filter to say "I don't want anything from you": LND and CLN both use 0xFFFFFFFF for this, so I've optimized that.
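
A rough sketch of that fast path, assuming a struct shaped like BOLT 7's gossip_timestamp_filter (first_timestamp plus timestamp_range); the types and helpers are illustrative, not CLN's internals:

#include <stdbool.h>
#include <stdint.h>

struct timestamp_filter {          /* illustrative, mirrors the wire fields */
        uint32_t first_timestamp;
        uint32_t timestamp_range;
};

/* A record with timestamp t matches iff
 *   first_timestamp <= t < first_timestamp + timestamp_range. */
static bool filter_matches(const struct timestamp_filter *f, uint32_t t)
{
        return t >= f->first_timestamp &&
               (uint64_t)t < (uint64_t)f->first_timestamp + f->timestamp_range;
}

/* The "I don't want anything from you" dummy mentioned above: a
 * first_timestamp of 0xFFFFFFFF (or an empty range) can never match,
 * so skip the store scan entirely. */
static bool filter_is_dummy(const struct timestamp_filter *f)
{
        return f->first_timestamp == 0xFFFFFFFF || f->timestamp_range == 0;
}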

@whitslack
Collaborator Author

whitslack commented Jun 27, 2022

v0.11.2 might be ever so slightly better on this issue, but lightning_connectd is still hovering around 90% CPU utilization.

Is it expected to have a constant parade of

INFO    ...-channeld-chan#233342: Peer connection lost
INFO    ...-chan#233342: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)

and

INFO    ...-chan#153417: Peer transient failure in CHANNELD_NORMAL: Reconnected

in the log? I know I have a lot of peers, but the rate of link flapping still seems excessive. And how is "Reconnected" ever the explanation of a "Peer transient failure"? That seems fishy to me.

Also, I have noticed that after some time my node stops accepting incoming connections entirely. I thought it was happening due to a file descriptor mixup while running under Heaptrack, but it happened again even with no Heaptrack in the mix.

@rustyrussell
Contributor

These log messages are overzealous. I'm seeing the "Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died" message mainly when peers don't respond to pings (the ping timeout itself is only logged at DEBUG level, but should probably be INFO).

With 21 connected peers, I am seeing 36 of these in 24 hours (yeah, one peer responsible for over half).

The "Reconnected" message is when they reconnect to us and we have an already-live connection, so we abandon the old one in favor of the new; I've seen 3 of these in the same period.

I haven't seen the failure to accept incoming connections! That's weird...

@whitslack
Collaborator Author

I haven't seen the failure to accept incoming connections!

I think it was due to connectd exceeding the file descriptor limit, even though my limit is set at 4096 and I have fewer than 1000 open channels. (I've lost so many channels since upgrading to 0.11.x.)

@rustyrussell
Contributor

fd leak? That should show up in ls -l /proc/$(pidof lightning_connectd)/fd. Mine shows 59 after 5 days... I'll check with the debugger tomorrow to see if any are untracked.

@whitslack
Collaborator Author

@rustyrussell: My lightning_connectd currently has 2341 open FDs.

@whitslack
Collaborator Author

We should take the discussion of the potential FD leak over to #5353 and leave this issue for the CPU usage.

@rustyrussell
Contributor

I'm going to leave this issue open. Though it's mitigated for 0.12 in #5342, I know that CPU usage can be significantly reduced further. @whitslack is the most likely to see this (the CPU usage will be on first connect), so I'm leaving it open, earmarked for the next release.

rustyrussell added this to the v22.10 milestone Jul 9, 2022
cdecker modified the milestones: v22.11, v23.02 Dec 1, 2022
@rustyrussell
Contributor

With the pile of recent v23.02 optimizations, this should be better. Please reopen if I'm wrong!

@whitslack
Collaborator Author

Quoting @rustyrussell:

We do the dumbest possible thing here, which is to retry in commit_time_msec (10 by default).

If this ethos is prevalent throughout the codebase, then it starts to make sense why CLN is so CPU hungry. Maybe something similar is at play in lightning_connectd, which presently on my system has consumed 564 minutes of CPU time, far more than the 165 minutes for lightningd itself and 51 minutes for lightning_hsmd. My current daemon has only been running for about 23 hours of wall-clock time, so connectd is still averaging about 41% CPU utilization.

@rustyrussell
Contributor

That's fascinating: can you give me a perf report of what's happening?

$ perf record  --call-graph dwarf -p `pidof lightning_channeld`
<wait for a while then ^C>
$ perf report > perf.report

(I usually use perf report in tui mode, but the man page suggests it will do something useful if redirected!)

rustyrussell reopened this Mar 20, 2023
@rustyrussell
Contributor

Quoting @rustyrussell:

We do the dumbest possible thing here, which is to retry in commit_time_msec (10 by default).

If this ethos is prevalent throughout the codebase, then it starts to make sense why CLN is so CPU hungry.

Well, kind of? Premature optimization is the root of all evil, but complexity isn't free either.

@whitslack
Collaborator Author

whitslack commented Mar 21, 2023

$ perf record  --call-graph dwarf -p `pidof lightning_channeld`
<wait for a while then ^C>
$ perf report > perf.report

@rustyrussell: Assuming you meant pidof lightning_connectd.

I let perf collect data until its perf.data was 4 GB in size.

I don't know what to make of the report, but here it is: perf.report

It looks like a lot of overhead in the poll syscall. Are you really not using epoll(7)?
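
For reference, the epoll(7) pattern that question alludes to looks roughly like the sketch below: each fd is registered once and epoll_wait() returns only the ready ones, instead of the kernel scanning the whole pollfd array on every call. This is a generic illustration, not CLN's actual event loop:

#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

/* Generic accept-and-read loop using epoll; purely illustrative. */
int run_epoll_loop(int listen_fd)
{
        int epfd = epoll_create1(0);
        if (epfd < 0)
                return -1;

        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev) < 0)
                return -1;

        for (;;) {
                struct epoll_event events[MAX_EVENTS];
                /* Wakes with only the ready fds: O(ready), not O(total). */
                int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
                if (n < 0)
                        break;
                for (int i = 0; i < n; i++) {
                        int fd = events[i].data.fd;
                        if (fd == listen_fd) {
                                int conn = accept(listen_fd, NULL, NULL);
                                if (conn >= 0) {
                                        struct epoll_event cev = {
                                                .events = EPOLLIN,
                                                .data.fd = conn,
                                        };
                                        epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
                                }
                        } else {
                                char buf[4096];
                                ssize_t r = read(fd, buf, sizeof(buf));
                                if (r <= 0) {
                                        epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
                                        close(fd);
                                }
                                /* ...otherwise hand the r bytes to the protocol layer... */
                        }
                }
        }
        close(epfd);
        return 0;
}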


# command ls -Fl "/proc/$(pidof lightning_connectd)/fd" | grep -Fc socket:
1515

Why does connectd own so many sockets? That's more than the number of open channels I have, let alone actually connected channels.
