
Writing to the gossip file descriptor in connectd #4119

Closed
evd0kim opened this issue Oct 9, 2020 · 11 comments

Labels
state::fixed These issues should have been addressed. Pending confirmation by OP, will close soon otherwise

Comments

@evd0kim
Contributor

evd0kim commented Oct 9, 2020

Issue and Steps to Reproduce

My node on a VPS just stopped working one day. All channels were offline, and I found this error in journalctl:

Oct 07 07:54:41 satoshis-box lightningd[23254]: lightning_connectd: Failed writing to gossipctl: Broken pipe (version v0.9.1)
Oct 07 07:54:41 satoshis-box lightningd[23254]: keysend: Lost connection to the RPC socket.
Oct 07 07:54:41 satoshis-box lightningd[23254]: autoclean: Lost connection to the RPC socket.
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28c4487 send_backtrace
Oct 07 07:54:41 satoshis-box lightningd[23254]:         common/daemon.c:38
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28c7934 status_failed
Oct 07 07:54:41 satoshis-box lightningd[23254]:         common/status.c:206
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28bca2b add_gossip_addrs
Oct 07 07:54:41 satoshis-box lightningd[23254]:         connectd/connectd.c:1421
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28bd41b try_connect_peer
Oct 07 07:54:41 satoshis-box lightningd[23254]:         connectd/connectd.c:1470
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28bd6aa connect_to_peer
Oct 07 07:54:41 satoshis-box lightningd[23254]:         connectd/connectd.c:1534
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28be551 recv_req
Oct 07 07:54:41 satoshis-box lightningd[23254]:         connectd/connectd.c:1603
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28c45ce handle_read
Oct 07 07:54:41 satoshis-box lightningd[23254]:         common/daemon_conn.c:31
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28e51f7 next_plan
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ccan/ccan/io/io.c:59
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28e56e6 do_plan
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ccan/ccan/io/io.c:407
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28e5713 io_ready
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ccan/ccan/io/io.c:417
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28e6fdf io_loop
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ccan/ccan/io/poll.c:445
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28bed4f main
Oct 07 07:54:41 satoshis-box lightningd[23254]:         connectd/connectd.c:1679
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x7fa6c3671b96 ???
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ???:0
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28ba589 ???
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ???:0
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0xffffffffffffffff ???
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ???:0
Oct 07 07:54:41 satoshis-box systemd[1]: lightningd.service: Main process exited, code=exited, status=1/FAILURE
Oct 07 07:54:41 satoshis-box systemd[1]: lightningd.service: Failed with result 'exit-code'.

Then I found one channel force-closed, for an unknown reason: https://ln.bigsun.xyz/channel/618125x2153x1

@cdecker
Member

cdecker commented Oct 11, 2020

Seems we fail to write to the gossip file descriptor here:

lightning/connectd/connectd.c

Lines 1418 to 1423 in ad049c3

/* For simplicity, we do this synchronous. */
msg = towire_gossipd_get_addrs(NULL, id);
if (!wire_sync_write(GOSSIPCTL_FD, take(msg)))
	status_failed(STATUS_FAIL_INTERNAL_ERROR,
		      "Failed writing to gossipctl: %s",
		      strerror(errno));

It seems the other side of the GOSSIPCTL_FD socketpair was closed. I don't think it can be gossipd, since then lightningd itself would complain about it disappearing.
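For reference, the "Broken pipe" in the log is exactly what a write to a socketpair looks like once the other end has gone away. Here is a minimal stand-alone sketch (not c-lightning code, just an illustration of the failure mode wire_sync_write runs into):

/* Illustration only: write() on a socketpair whose peer end has been
 * closed fails with EPIPE ("Broken pipe") once SIGPIPE is ignored,
 * which matches the error connectd reports above. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	int fds[2];

	signal(SIGPIPE, SIG_IGN);	/* report EPIPE instead of dying on SIGPIPE */
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
		perror("socketpair");
		return 1;
	}

	close(fds[1]);			/* simulate the other daemon disappearing */

	if (write(fds[0], "msg", 3) < 0)
		fprintf(stderr, "write: %s\n", strerror(errno));	/* "write: Broken pipe" */

	return 0;
}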

@evd0kim
Contributor Author

evd0kim commented Oct 12, 2020

@cdecker thank you for the response, but I'm afraid I can't help here since I don't understand what actually happens. I'm more of a lightningd user here. niftynei just suggested I post an issue.

@evd0kim
Contributor Author

evd0kim commented Oct 25, 2020

Hey, @cdecker, I have the same error again on my node.

  1. Full log:
Oct 24 19:47:52 satoshis-box lightningd[19246]: lightning_connectd: Failed writing to gossipctl: Broken pipe (version v0.9.1)
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a0a487 send_backtrace
Oct 24 19:47:52 satoshis-box lightningd[19246]:         common/daemon.c:38
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a0d934 status_failed
Oct 24 19:47:52 satoshis-box lightningd[19246]:         common/status.c:206
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a02a2b add_gossip_addrs
Oct 24 19:47:52 satoshis-box lightningd[19246]:         connectd/connectd.c:1421
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a0341b try_connect_peer
Oct 24 19:47:52 satoshis-box lightningd[19246]:         connectd/connectd.c:1470
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a036aa connect_to_peer
Oct 24 19:47:52 satoshis-box lightningd[19246]:         connectd/connectd.c:1534
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a04551 recv_req
Oct 24 19:47:52 satoshis-box lightningd[19246]:         connectd/connectd.c:1603
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a0a5ce handle_read
Oct 24 19:47:52 satoshis-box lightningd[19246]:         common/daemon_conn.c:31
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a2b1f7 next_plan
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ccan/ccan/io/io.c:59
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a2b6e6 do_plan
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ccan/ccan/io/io.c:407
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a2b713 io_ready
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ccan/ccan/io/io.c:417
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a2cfdf io_loop
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ccan/ccan/io/poll.c:445
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a04d4f main
Oct 24 19:47:52 satoshis-box lightningd[19246]:         connectd/connectd.c:1679
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x7fb4a5ffdb96 ???
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ???:0
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a00589 ???
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ???:0
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0xffffffffffffffff ???
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ???:0
Oct 24 19:47:52 satoshis-box lightningd[19246]: autoclean: Lost connection to the RPC socket.
Oct 24 19:47:52 satoshis-box lightningd[19246]: keysend: Lost connection to the RPC socket.
Oct 24 19:47:52 satoshis-box systemd[1]: lightningd.service: Main process exited, code=exited, status=1/FAILURE
Oct 24 19:47:52 satoshis-box systemd[1]: lightningd.service: Failed with result 'exit-code'.
Oct 24 19:48:02 satoshis-box systemd[1]: lightningd.service: Service hold-off time over, scheduling restart.
Oct 24 19:48:02 satoshis-box systemd[1]: lightningd.service: Scheduled restart job, restart counter is at 2.
  2. The node restarted, but 1 of 3 open channels couldn't come back online.

@jsarenik
Collaborator

@engenegr What version? Did you try to update to some later version and re-run?

@cdecker
Member

cdecker commented Oct 25, 2020

I was wondering if this could be related to something on the disk: is the disk that the lightning-dir is on low on capacity? Is it getting remounted in read-only mode? (dmesg can tell you a bit about this kind of error from the kernel.) Note that after the restart the gossip_store file, which backs the gossip daemon, is compressed, so restarting may free some disk space.

Are there any other logs, from lightning_gossipd particularly, around the time the node exits that may shed some more light? (lightning_gossipd doesn't log sensitive stuff, except the occasional payment destination, so feel free to upload journalctl -u lightningd -n 1000 |& grep gossipd somewhere and link it here).

As for the channel not being online: is it being closed, or just showing "connected": false? In the latter case the peer may just be unreachable, or we don't have an address for it and they'll need to reconnect from their end.
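To check the disk questions above (low capacity, read-only remount) programmatically, a hypothetical little helper along these lines would do; it is not part of c-lightning, and the default path is just a placeholder for your actual lightning-dir:

/* Hypothetical helper, not part of c-lightning: report free space and the
 * read-only flag for the filesystem holding the lightning-dir. */
#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
	/* Placeholder default; pass your actual lightning-dir as argv[1]. */
	const char *dir = argc > 1 ? argv[1] : "/home/user/.lightning";
	struct statvfs st;

	if (statvfs(dir, &st) != 0) {
		perror("statvfs");
		return 1;
	}

	printf("%s: %llu MB free, mounted read-only: %s\n", dir,
	       (unsigned long long)st.f_bavail * st.f_frsize / (1024 * 1024),
	       (st.f_flag & ST_RDONLY) ? "yes" : "no");
	return 0;
}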

@evd0kim
Contributor Author

evd0kim commented Oct 26, 2020

@jsarenik v0.9.1

@cdecker

  1. journalctl -u lightningd -n 1000 |& grep gossipd is empty
  2. I see there is enough space left on VPS
  3. The channels aren't closed. The peer is one of the reliable ones; it's the LightningTo.Me service.

I had some success with dmesg:

[9954162.495456] Out of memory: Kill process 19290 (lightningd) score 202 or sacrifice child
[9954162.497159] Killed process 19734 (lightning_gossi) total-vm:84512kB, anon-rss:66888kB, file-rss:0kB, shmem-rss:0kB
[9954162.516137] oom_reaper: reaped process 19734 (lightning_gossi), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Does it mean I need more RAM for the node?

@cdecker
Member

cdecker commented Oct 26, 2020

It does indeed. It looks like the ~66MB that lightning_gossipd uses makes it the highest-scored target to be killed when you run out of memory (OOM). That's a bit surprising, since the score is higher the more memory a process uses (that, and a couple of other things, like some processes being marked as non-OOM-killable). How much memory does the machine have, and what else is running on it?

We are working on further reducing the memory footprint by only loading the gossip map when it is needed, so it'll have short bursts of memory use but a far lower baseline. See #4093 for details.

An alternative to adding more RAM would be to add a swapfile, which can be used to externalize some of the memory that is seldom used.
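As a side note on how the victim is chosen: on Linux the kernel exposes its per-process badness score under /proc/<pid>/oom_score, and the process with the highest score is killed first. A tiny sketch for inspecting it (illustration only, assumes a Linux /proc filesystem):

/* Illustration only (assumes Linux /proc): print the kernel's OOM badness
 * score for a given PID, or for the calling process if none is given. */
#include <stdio.h>

int main(int argc, char **argv)
{
	const char *pid = argc > 1 ? argv[1] : "self";
	char path[64];
	int score;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/oom_score", pid);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%d", &score) != 1)
		score = -1;
	fclose(f);

	printf("oom_score for %s: %d\n", pid, score);
	return 0;
}

Running it against the gossip daemon's PID (e.g. while the node is under load) shows how it compares to the other processes on the box.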

@cdecker cdecker self-assigned this Oct 26, 2020
@evd0kim
Contributor Author

evd0kim commented Oct 26, 2020

I have a 1GB instance, and usually it has 760 out of 985MB in use.
I may move my node to another instance and try to provide more memory for lightningd, or at least enable swap, indeed.

@cdecker
Member

cdecker commented Oct 26, 2020

If it's just occasionally swamped, then swap can help; however, it's easy to end up in swap hell if the memory pressure persists, or if the other processes on the system have memory-access patterns that are not compatible with swap (random access to memory that was just swapped out, large strides across many VM pages, etc.).

@cdecker cdecker added the state::fixed These issues should have been addressed. Pending confirmation by OP, will close soon otherwise label Oct 26, 2020
@cdecker cdecker added this to the v0.9.2 milestone Oct 26, 2020
@evd0kim
Contributor Author

evd0kim commented Oct 26, 2020

I will check and report if it occurs again. Thank you very much for your support.

@evd0kim
Contributor Author

evd0kim commented Oct 30, 2020

I will check and report if it occurs again. Thank you very much for your support.

@evd0kim evd0kim closed this as completed Oct 30, 2020