
connectd pegging 100% CPU #5301

Open
whitslack opened this issue Jun 1, 2022 · 13 comments

@whitslack
Collaborator

Issue and Steps to Reproduce

Running v0.11.1, lightning_connectd is utilizing 100% of a CPU core. Attaching strace to the process reveals that it is hammering pread64 calls on the gossip_store file. The reads are all very small, most exactly 12 bytes in size, and none in my sample even approached the page size. Why is connectd not memory-mapping this file for performance? Syscalls are to be minimized wherever possible.
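
To illustrate the difference, here is a minimal sketch of scanning a length-prefixed record file through a single mmap(2) instead of issuing one pread64(2) per header. The record_hdr layout is hypothetical (the 12-byte reads suggest a fixed-size header), not the actual gossip_store format:

/* Hedged sketch: walk the store from a memory mapping instead of one
 * pread64() per 12-byte header.  The header layout below is hypothetical,
 * not CLN's actual on-disk gossip_store format. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct record_hdr {            /* hypothetical fixed-size record header */
        uint32_t len;          /* payload length in bytes */
        uint32_t crc;
        uint32_t timestamp;
};

static int scan_store_mmap(const char *path)
{
        int fd = open(path, O_RDONLY);
        if (fd < 0)
                return -1;

        struct stat st;
        if (fstat(fd, &st) < 0 || st.st_size == 0) {
                close(fd);
                return -1;
        }

        /* One syscall maps the whole file; the kernel pages it in lazily. */
        const uint8_t *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);
        if (base == MAP_FAILED)
                return -1;

        size_t off = 0, nrecords = 0;
        while (off + sizeof(struct record_hdr) <= (size_t)st.st_size) {
                struct record_hdr hdr;
                memcpy(&hdr, base + off, sizeof(hdr));  /* no syscall per record */
                off += sizeof(hdr) + hdr.len;
                nrecords++;
        }

        printf("scanned %zu records\n", nrecords);
        munmap((void *)base, st.st_size);
        return 0;
}

int main(int argc, char **argv)
{
        return argc > 1 ? scan_store_mmap(argv[1]) : 1;
}

Run against a copy of the store; strace should then show a single mmap call instead of a stream of tiny reads.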

@rustyrussell
Contributor

First step is not to do dumb things, second step is to optimize dumb things :)

We scan for two reasons: first, when they connect we send them any gossip we made ourselves. We do this in a naive way, by scanning the entire store. Fix: put our own gossip in a separate store file, which is what @cdecker wants anyway for sharing gossip files. This adds some gossmap complexity, however, since gossmap now needs to handle two files.

Second, when they send a gossip_timestamp_filter message, we scan the entire store to see if any match the filter they've given. But it's not unusual to send a dummy filter to say "I don't want anything from you": LND and CLN both use 0xFFFFFFFF for this, so I've optimized that.
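
A rough sketch of that fast path, assuming a struct shaped like BOLT 7's gossip_timestamp_filter (first_timestamp plus timestamp_range); the types and helpers are illustrative, not CLN's internals:

#include <stdbool.h>
#include <stdint.h>

struct timestamp_filter {          /* illustrative, mirrors the wire fields */
        uint32_t first_timestamp;
        uint32_t timestamp_range;
};

/* A record with timestamp t matches iff
 *   first_timestamp <= t < first_timestamp + timestamp_range. */
static bool filter_matches(const struct timestamp_filter *f, uint32_t t)
{
        return t >= f->first_timestamp &&
               (uint64_t)t < (uint64_t)f->first_timestamp + f->timestamp_range;
}

/* The "I don't want anything from you" dummy mentioned above: a
 * first_timestamp of 0xFFFFFFFF (or an empty range) can never match,
 * so skip the store scan entirely. */
static bool filter_is_dummy(const struct timestamp_filter *f)
{
        return f->first_timestamp == 0xFFFFFFFF || f->timestamp_range == 0;
}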

@whitslack
Collaborator Author

whitslack commented Jun 27, 2022

v0.11.2 might be ever so slightly better on this issue, but lightning_connectd is still hovering around 90% CPU utilization.

Is it expected to have a constant parade of

INFO    ...-channeld-chan#233342: Peer connection lost
INFO    ...-chan#233342: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)

and

INFO    ...-chan#153417: Peer transient failure in CHANNELD_NORMAL: Reconnected

in the log? I know I have a lot of peers, but the rate of link flapping still seems excessive. And how is "Reconnected" ever the explanation of a "Peer transient failure"? That seems fishy to me.

Also, I have noticed that after some time my node stops accepting incoming connections entirely. I thought it was happening due to a file descriptor mixup while running under Heaptrack, but it happened again even with no Heaptrack in the mix.

@rustyrussell
Contributor

These log messages are overzealous. I'm seeing the "Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died" message mainly when peers don't respond to pings (the ping timeout itself is only logged at DEBUG level, but should probably be INFO).

With 21 connected peers, I am seeing 36 of these in 24 hours (yeah, one peer responsible for over half).

The "Reconnected" message is when they reconnect to us and we have an already-live connection, so we abandon the old one in favor of the new; I've seen 3 of these in the same period.

I haven't seen the failure to accept incoming connections! That's weird...

@whitslack
Collaborator Author

I haven't seen the failure to accept incoming connections!

I think it was due to connectd exceeding the file descriptor limit, even though my limit is set at 4096 and I have fewer than 1000 open channels. (I've lost so many channels since upgrading to 0.11.x.)

@rustyrussell
Contributor

fd leak? That should show up in ls -l /proc/$(pidof lightning_connectd)/fd. Mine shows 59 after 5 days... I'll check with the debugger tomorrow to see if any are untracked.

@whitslack
Collaborator Author

@rustyrussell: My lightning_connectd currently has 2341 open FDs.

@whitslack
Collaborator Author

We should take the discussion of the potential FD leak over to #5353 and leave this issue for the CPU usage.

@rustyrussell
Contributor

I'm going to leave this issue open. Though it's mitigated for 0.12 in #5342, I know that CPU usage can be significantly reduced further. @whitslack is the most likely to see this (the CPU usage will be on first connect), so I'm leaving it open, earmarked for the next release.

rustyrussell added this to the v22.10 milestone Jul 9, 2022
cdecker modified the milestones: v22.11, v23.02 Dec 1, 2022
@rustyrussell
Contributor

With the pile of recent v23.02 optimizations, this should be better. Please reopen if I'm wrong!

@whitslack
Collaborator Author

Quoting @rustyrussell:

We do the dumbest possible thing here, which is to retry in commit_time_msec (10 by default).

If this ethos is prevalent throughout the codebase, then it starts to make sense why CLN is so CPU hungry. Maybe something similar is at play in lightning_connectd, which presently on my system has consumed 564 minutes of CPU time, far more than the 165 minutes for lightningd itself and 51 minutes for lightning_hsmd. My current daemon has only been running for about 23 hours of wall-clock time, so connectd is still averaging about 41% CPU utilization.

@rustyrussell
Contributor

That's fascinating: can you give me a perf report of what's happening?

$ perf record  --call-graph dwarf -p `pidof lightning_channeld`
<wait for a while then ^C>
$ perf report > perf.report

(I usually use perf report in tui mode, but the man page suggests it will do something useful if redirected!)

rustyrussell reopened this Mar 20, 2023
@rustyrussell
Contributor

Quoting @rustyrussell:

We do the dumbest possible thing here, which is to retry in commit_time_msec (10 by default).

If this ethos is prevalent throughout the codebase, then it starts to make sense why CLN is so CPU hungry.

Well, kind of? Premature optimization is the root of all evil, but complexity isn't free either.

@whitslack
Collaborator Author

whitslack commented Mar 21, 2023

$ perf record  --call-graph dwarf -p `pidof lightning_channeld`
<wait for a while then ^C>
$ perf report > perf.report

@rustyrussell: Assuming you meant pidof lightning_connectd.

I let perf collect data until its perf.data was 4 GB in size.

I don't know what to make of the report, but here it is: perf.report

It looks like a lot of overhead in the poll syscall. Are you really not using epoll(7)?
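
For reference, the epoll(7) pattern that question alludes to looks roughly like the sketch below: each fd is registered once and epoll_wait() returns only the ready ones, instead of the kernel scanning the whole pollfd array on every call. This is a generic illustration, not CLN's actual event loop:

#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

/* Generic accept-and-read loop using epoll; purely illustrative. */
int run_epoll_loop(int listen_fd)
{
        int epfd = epoll_create1(0);
        if (epfd < 0)
                return -1;

        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev) < 0)
                return -1;

        for (;;) {
                struct epoll_event events[MAX_EVENTS];
                /* Wakes with only the ready fds: O(ready), not O(total). */
                int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
                if (n < 0)
                        break;
                for (int i = 0; i < n; i++) {
                        int fd = events[i].data.fd;
                        if (fd == listen_fd) {
                                int conn = accept(listen_fd, NULL, NULL);
                                if (conn >= 0) {
                                        struct epoll_event cev = {
                                                .events = EPOLLIN,
                                                .data.fd = conn,
                                        };
                                        epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
                                }
                        } else {
                                char buf[4096];
                                ssize_t r = read(fd, buf, sizeof(buf));
                                if (r <= 0) {
                                        epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
                                        close(fd);
                                }
                                /* ...otherwise hand the r bytes to the protocol layer... */
                        }
                }
        }
        close(epfd);
        return 0;
}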


# command ls -Fl "/proc/$(pidof lightning_connectd)/fd" | grep -Fc socket:
1515

Why does connectd own so many sockets? That's more than the number of open channels I have, let alone actually connected channels.
