
Writing to the gossip file descriptor in connectd #4119

Closed
evd0kim opened this issue Oct 9, 2020 · 11 comments

Labels
state::fixed These issues should have been addressed. Pending confirmation by OP, will close soon otherwise

Comments

@evd0kim
Contributor

evd0kim commented Oct 9, 2020

Issue and Steps to Reproduce

My node on a VPS just stopped working one day. All channels were offline, and I found this error in journalctl:

Oct 07 07:54:41 satoshis-box lightningd[23254]: lightning_connectd: Failed writing to gossipctl: Broken pipe (version v0.9.1)
Oct 07 07:54:41 satoshis-box lightningd[23254]: keysend: Lost connection to the RPC socket.
Oct 07 07:54:41 satoshis-box lightningd[23254]: autoclean: Lost connection to the RPC socket.
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28c4487 send_backtrace
Oct 07 07:54:41 satoshis-box lightningd[23254]:         common/daemon.c:38
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28c7934 status_failed
Oct 07 07:54:41 satoshis-box lightningd[23254]:         common/status.c:206
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28bca2b add_gossip_addrs
Oct 07 07:54:41 satoshis-box lightningd[23254]:         connectd/connectd.c:1421
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28bd41b try_connect_peer
Oct 07 07:54:41 satoshis-box lightningd[23254]:         connectd/connectd.c:1470
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28bd6aa connect_to_peer
Oct 07 07:54:41 satoshis-box lightningd[23254]:         connectd/connectd.c:1534
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28be551 recv_req
Oct 07 07:54:41 satoshis-box lightningd[23254]:         connectd/connectd.c:1603
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28c45ce handle_read
Oct 07 07:54:41 satoshis-box lightningd[23254]:         common/daemon_conn.c:31
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28e51f7 next_plan
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ccan/ccan/io/io.c:59
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28e56e6 do_plan
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ccan/ccan/io/io.c:407
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28e5713 io_ready
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ccan/ccan/io/io.c:417
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28e6fdf io_loop
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ccan/ccan/io/poll.c:445
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28bed4f main
Oct 07 07:54:41 satoshis-box lightningd[23254]:         connectd/connectd.c:1679
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x7fa6c3671b96 ???
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ???:0
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28ba589 ???
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ???:0
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0xffffffffffffffff ???
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ???:0
Oct 07 07:54:41 satoshis-box systemd[1]: lightningd.service: Main process exited, code=exited, status=1/FAILURE
Oct 07 07:54:41 satoshis-box systemd[1]: lightningd.service: Failed with result 'exit-code'.

Then I found one channel force-closed, for an unknown reason: https://ln.bigsun.xyz/channel/618125x2153x1

@cdecker
Member

cdecker commented Oct 11, 2020

Seems we fail to write to the gossip file descriptor here:

lightning/connectd/connectd.c

Lines 1418 to 1423 in ad049c3

/* For simplicity, we do this synchronous. */
msg = towire_gossipd_get_addrs(NULL, id);
if (!wire_sync_write(GOSSIPCTL_FD, take(msg)))
	status_failed(STATUS_FAIL_INTERNAL_ERROR,
		      "Failed writing to gossipctl: %s",
		      strerror(errno));

It seems the other side of the GOSSIPCTL_FD socketpair was closed. I don't think it can be gossipd, since then lightningd itself would complain about it disappearing.
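For reference, the "Broken pipe" in the log is exactly what a write to a socketpair looks like once the other end has gone away. Here is a minimal stand-alone sketch (not c-lightning code, just an illustration of the failure mode wire_sync_write runs into):

/* Illustration only: write() on a socketpair whose peer end has been
 * closed fails with EPIPE ("Broken pipe") once SIGPIPE is ignored,
 * which matches the error connectd reports above. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	int fds[2];

	signal(SIGPIPE, SIG_IGN);	/* report EPIPE instead of dying on SIGPIPE */
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
		perror("socketpair");
		return 1;
	}

	close(fds[1]);			/* simulate the other daemon disappearing */

	if (write(fds[0], "msg", 3) < 0)
		fprintf(stderr, "write: %s\n", strerror(errno));	/* "write: Broken pipe" */

	return 0;
}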

@evd0kim
Contributor Author

evd0kim commented Oct 12, 2020

@cdecker thank you for the response, but I'm afraid I can't help here since I don't understand what actually happens. I'm more of a lightningd user here. niftynei just suggested I post an issue.

@evd0kim
Contributor Author

evd0kim commented Oct 25, 2020

Hey, @cdecker, I have the same error again on my node.

  1. Full log:
Oct 24 19:47:52 satoshis-box lightningd[19246]: lightning_connectd: Failed writing to gossipctl: Broken pipe (version v0.9.1)
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a0a487 send_backtrace
Oct 24 19:47:52 satoshis-box lightningd[19246]:         common/daemon.c:38
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a0d934 status_failed
Oct 24 19:47:52 satoshis-box lightningd[19246]:         common/status.c:206
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a02a2b add_gossip_addrs
Oct 24 19:47:52 satoshis-box lightningd[19246]:         connectd/connectd.c:1421
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a0341b try_connect_peer
Oct 24 19:47:52 satoshis-box lightningd[19246]:         connectd/connectd.c:1470
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a036aa connect_to_peer
Oct 24 19:47:52 satoshis-box lightningd[19246]:         connectd/connectd.c:1534
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a04551 recv_req
Oct 24 19:47:52 satoshis-box lightningd[19246]:         connectd/connectd.c:1603
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a0a5ce handle_read
Oct 24 19:47:52 satoshis-box lightningd[19246]:         common/daemon_conn.c:31
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a2b1f7 next_plan
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ccan/ccan/io/io.c:59
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a2b6e6 do_plan
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ccan/ccan/io/io.c:407
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a2b713 io_ready
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ccan/ccan/io/io.c:417
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a2cfdf io_loop
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ccan/ccan/io/poll.c:445
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a04d4f main
Oct 24 19:47:52 satoshis-box lightningd[19246]:         connectd/connectd.c:1679
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x7fb4a5ffdb96 ???
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ???:0
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a00589 ???
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ???:0
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0xffffffffffffffff ???
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ???:0
Oct 24 19:47:52 satoshis-box lightningd[19246]: autoclean: Lost connection to the RPC socket.
Oct 24 19:47:52 satoshis-box lightningd[19246]: keysend: Lost connection to the RPC socket.
Oct 24 19:47:52 satoshis-box systemd[1]: lightningd.service: Main process exited, code=exited, status=1/FAILURE
Oct 24 19:47:52 satoshis-box systemd[1]: lightningd.service: Failed with result 'exit-code'.
Oct 24 19:48:02 satoshis-box systemd[1]: lightningd.service: Service hold-off time over, scheduling restart.
Oct 24 19:48:02 satoshis-box systemd[1]: lightningd.service: Scheduled restart job, restart counter is at 2.
  2. The node restarted, but 1 of 3 open channels couldn't come back online.

@jsarenik
Collaborator

@engenegr What version? Did you try to update to some later version and re-run?

@cdecker
Member

cdecker commented Oct 25, 2020

I was wondering if this could be related to something on the disk: is the disk that the lightning-dir is on low on capacity? Is it getting remounted in read-only mode? (dmesg can tell you a bit about this kind of error from the kernel.) Note that after the restart the gossip_store file, which backs the gossip daemon, is compressed, so restarting may free some disk space.

Are there any other logs, from lightning_gossipd particularly, around the time the node exits that may shed some more light? (lightning_gossipd doesn't log sensitive stuff, except the occasional payment destination, so feel free to upload journalctl -u lightningd -n 1000 |& grep gossipd somewhere and link it here).

As for the channel not being online: is it being closed, or just showing "connected": false? In the latter case the peer may just be unreachable, or we don't have an address for it and they'll need to reconnect from their end.
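To check the disk questions above (low capacity, read-only remount) programmatically, a hypothetical little helper along these lines would do; it is not part of c-lightning, and the default path is just a placeholder for your actual lightning-dir:

/* Hypothetical helper, not part of c-lightning: report free space and the
 * read-only flag for the filesystem holding the lightning-dir. */
#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
	/* Placeholder default; pass your actual lightning-dir as argv[1]. */
	const char *dir = argc > 1 ? argv[1] : "/home/user/.lightning";
	struct statvfs st;

	if (statvfs(dir, &st) != 0) {
		perror("statvfs");
		return 1;
	}

	printf("%s: %llu MB free, mounted read-only: %s\n", dir,
	       (unsigned long long)st.f_bavail * st.f_frsize / (1024 * 1024),
	       (st.f_flag & ST_RDONLY) ? "yes" : "no");
	return 0;
}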

@evd0kim
Contributor Author

evd0kim commented Oct 26, 2020

@jsarenik v0.9.1

@cdecker

  1. journalctl -u lightningd -n 1000 |& grep gossipd is empty
  2. I see there is enough space left on VPS
  3. The channels aren't closed. The peer is one of the reliable ones; it's the LightningTo.Me service.

I had some success with dmesg:

[9954162.495456] Out of memory: Kill process 19290 (lightningd) score 202 or sacrifice child
[9954162.497159] Killed process 19734 (lightning_gossi) total-vm:84512kB, anon-rss:66888kB, file-rss:0kB, shmem-rss:0kB
[9954162.516137] oom_reaper: reaped process 19734 (lightning_gossi), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Does it mean I need more RAM for the node?

@cdecker
Member

cdecker commented Oct 26, 2020

It does indeed. It looks like the ~66MB that lightning_gossipd uses makes it the highest-scored target to be killed when you run out of memory (OOM). That's a bit surprising, since the score is higher the more memory a process uses (that, and a couple of other things, like some processes being marked as non-OOM-killable). How much memory does the machine have, and what else is running on it?

We are working on further reducing the memory footprint by only loading the gossip map when it is needed, so it'll have short bursts of memory use but a far lower baseline. See #4093 for details.

An alternative to adding more RAM would be to add a swapfile, which can be used to externalize some of the memory that is seldom used.
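As a side note on how the victim is chosen: on Linux the kernel exposes its per-process badness score under /proc/<pid>/oom_score, and the process with the highest score is killed first. A tiny sketch for inspecting it (illustration only, assumes a Linux /proc filesystem):

/* Illustration only (assumes Linux /proc): print the kernel's OOM badness
 * score for a given PID, or for the calling process if none is given. */
#include <stdio.h>

int main(int argc, char **argv)
{
	const char *pid = argc > 1 ? argv[1] : "self";
	char path[64];
	int score;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/oom_score", pid);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%d", &score) != 1)
		score = -1;
	fclose(f);

	printf("oom_score for %s: %d\n", pid, score);
	return 0;
}

Running it against the gossip daemon's PID (e.g. while the node is under load) shows how it compares to the other processes on the box.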

@cdecker cdecker self-assigned this Oct 26, 2020
@evd0kim
Contributor Author

evd0kim commented Oct 26, 2020

I have a 1GB instance, and usually it has 760 out of 985MB in use.
I may move my node to another instance and try to provide more memory for lightningd, or at least enable swap, indeed.

@cdecker
Member

cdecker commented Oct 26, 2020

If it's just occasionally swamped, then swap can help; however, it's easy to end up in swap hell if the memory pressure persists, or if the other processes on the system have memory-access patterns that are not compatible with swap (random access to memory that was just swapped out, large strides across many VM pages, etc.).

@cdecker cdecker added the state::fixed These issues should have been addressed. Pending confirmation by OP, will close soon otherwise label Oct 26, 2020
@cdecker cdecker added this to the v0.9.2 milestone Oct 26, 2020
@evd0kim
Contributor Author

evd0kim commented Oct 26, 2020

I will check and report if it occurs again. Thank you very much for your support.

@evd0kim
Contributor Author

evd0kim commented Oct 30, 2020

I will check and report if it occurs again. Thank you very much for your support.

@evd0kim evd0kim closed this as completed Oct 30, 2020