
Improvement: fail fast if the Bitcoin backend fails #3571

Closed
darosior opened this issue Mar 5, 2020 · 23 comments

Comments

@darosior
Contributor

darosior commented Mar 5, 2020

#3565 introduces a timeout so that we don't wait forever for the Bitcoin backend's response to the init message.

As an improvement, it would be nice to fail immediately if the plugin dies.

This could be a good first issue.
If so, some hints:
The Bitcoin backend plugin will signal its readiness by completing the handshake with lightningd. This handshake is completed by responding to the init message.

If the plugin dies for whatever reason (for example bcli cannot connect to bitcoind), the connection will be dropped and plugin_conn_finish will be called:

/**
 * Finalizer for both stdin and stdout connections.
 *
 * Takes care of final cleanup, once the plugin is definitely dead.
 */
static void plugin_conn_finish(struct io_conn *conn, struct plugin *plugin)
{
	plugin->stdout_conn = NULL;
	tal_free(plugin);
}

It frees the plugin struct, so maybe an (untested) idea would be to attach a custom destructor to Bitcoin backend plugins?
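
For illustration, a rough (untested) sketch of that idea, assuming the ccan/tal destructor API and a fatal() helper like the one lightningd uses elsewhere; names here are illustrative, not an actual patch:

/* Hypothetical: abort lightningd if the Bitcoin backend plugin struct is
 * ever freed, i.e. as soon as the connection to the plugin is lost. */
static void destroy_bitcoin_plugin(struct plugin *plugin)
{
	fatal("Bitcoin backend plugin died, cannot continue");
}

/* ...when registering the backend plugin: */
tal_add_destructor(plugin, destroy_bitcoin_plugin);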

@ZmnSCPxj
Contributor

ZmnSCPxj commented Mar 24, 2020

Why not fold this into #3213? Make bcli (and all our other builtin plugins) a "crucial" plugin: if it is not loaded and initialized properly, we just die.

Then, to replace any builtin plugin, we would mark it non-crucial, suppress loading it, and add a new crucial plugin, e.g. custom-bcli: --non-crucial-plugin=bcli --plugin-suppress=bcli --crucial-plugin=custom-bcli. This would work for all plugins, e.g. pay, autoclean, ...

@whitslack
Collaborator

A couple of weeks ago, I discovered that my Lightning node had not been interacting with the blockchain for over a day because bcli had failed. The only way I noticed was that I had a channel that had been in CHANNELD_AWAITING_LOCKIN state for hours even though the peer was connected and the funding transaction was many blocks deep in the chain. When I examined the process tree, I noticed that bcli was in the defunct/zombie state. If this condition had continued for longer than my their_to_self_delay in any of my channels, my peers could have successfully cheated me. Obviously no good! So I appreciate the move toward making bcli a "crucial" plugin, without which lightningd will terminate.

@darosior
Contributor Author

Some logs would be very welcome, as we would die on the next request (every 30s at most) if the connection to bcli is dropped. So bcli was in a zombie state but still kept its connection open?

@whitslack
Collaborator

bcli had been in zombie state (process exited but exit status not reaped by parent) for over 24 hours. I don't know why lightningd didn't notice, but the net effect was that nothing in my Lightning node that needs blockchain interaction (e.g., opening and closing channels) was making any progress. I assume if a funding transaction were to have been spent during this time (i.e., peer initiated unilateral close), my node would not have noticed it.

If it ever happens again, I'll have logs to submit, as I'm now running lightningd with debug logging enabled.

@whitslack
Collaborator

It happened again. Here's an abbreviated process tree.

  PID USER     GROUP    START     TIME   RSS  NI NLWP COMMAND
    1 root     root     Apr07 00:00:23    20   0    1 init [3]
 2948 lightni+ lightni+ Apr07 11:23:10 613580  0    1   /usr/bin/lightningd --conf=/etc/lightning/lightningd.conf --lightning-dir=/var/lib/lightning --log-file=/var/log/lightningd.log --pid-file=/var/lib/lightning/lightningd.pid --rpc-file=/var/lib/lightning/lightningd.rpc --daemon
 2949 lightni+ lightni+ Apr07 00:00:00     0   0    1     /usr/bin/../libexec/c-lightning/plugins/pay
 2950 lightni+ lightni+ Apr07 00:17:22     0   0    1     [bcli] <defunct>
 2951 lightni+ lightni+ Apr07 00:00:00     0   0    1     /usr/bin/../libexec/c-lightning/plugins/autoclean
 2952 lightni+ lightni+ Apr07 00:00:00     0   0    1     /usr/bin/../libexec/c-lightning/plugins/fundchannel
 2966 lightni+ lightni+ Apr07 01:08:12 83132   0    1     /usr/libexec/c-lightning/lightning_hsmd
 2971 lightni+ lightni+ Apr07 00:20:30  1856   0    1     /usr/libexec/c-lightning/lightning_connectd
 2972 lightni+ lightni+ Apr07 12:14:52 143584  0    1     /usr/libexec/c-lightning/lightning_gossipd

(I have hundreds of lightning_channeld and lightning_openingd processes, which I filtered out of the process tree shown above.)

Of note is the bcli process, which has gone defunct. The last block that lightningd knows about is 627799, which was mined 12 hours ago. My log shows when lightningd learned of this block:

2020-04-27T05:01:46.444Z DEBUG lightningd: Adding block 627799: 0000000000000000000bf07b961efe8779ccdd87454f2eeb20ef1baad92a2072

That's the last Adding block message my log contains. I do have many subsequent messages like this one:

2020-04-27T06:13:30.925Z DEBUG 032c4b954f0f171b694b5e8e8323589e54196b48cf2efc27692513a360cb11d76f-gossipd: Ignoring future channel_announcment for 627803x1839x1 (current block 627799)

There are no messages containing bcli after Adding block 627799.

My bitcoind process has been running uninterrupted the entire time and is still responding to RPC requests as I write this report. My kernel log shows that none of C-Lightning's processes were killed by the OOM killer. The OOM killer did kill some Chrome processes starting at 01:26, but that was well after block 627800 was mined at 01:17, and bcli crashed before block 627800, so I think that's unrelated.

I do see exactly two groups of memory-related messages like this:

2020-04-27T06:32:38.176Z UNUSUAL lightningd: subd lightning_channeld failed: Cannot allocate memory
2020-04-27T06:32:38.177Z UNUSUAL 03b1be68b8f564fe53f5456cf4bec901ba968fb72ed7291279ab0c87e0d22f1f49-chan#41206: Could not subdaemon channel: Cannot allocate memory
2020-04-27T06:32:38.177Z INFO 03b1be68b8f564fe53f5456cf4bec901ba968fb72ed7291279ab0c87e0d22f1f49-chan#41206: Peer transient failure in CHANNELD_NORMAL: Failed to subdaemon channel

So clearly my system was running low on available memory this morning. (I was compiling Chromium overnight, which is what triggered the OOM killer on my chrome processes. The build completed successfully.)

Maybe it's possible that the bcli plugin process exited due to a failure to allocate memory but didn't log any message about it? In any case, lightningd seems completely unaware that bcli has exited.

I am going to restart my C-Lightning now, but I can grep my log for anything you want.

@darosior
Contributor Author

Could you grep for bcli please?

@darosior
Contributor Author

However this happened, I think we should register SIGCHLD and die if bcli dies. I will submit a patch.
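
A very rough sketch of that idea (hypothetical, not necessarily what the eventual patch will do): reap dead children in a SIGCHLD handler and bail out if the Bitcoin backend is among them. bcli_pid is an illustrative global holding the backend plugin's pid.

#include <signal.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t bcli_pid;

static void sigchld_handler(int sig)
{
	int status;
	pid_t pid;

	while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
		if (pid == bcli_pid)
			_exit(1);	/* async-signal-safe bail-out */
	}
}

static void setup_sigchld(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = sigchld_handler;
	sa.sa_flags = SA_NOCLDSTOP;
	sigaction(SIGCHLD, &sa, NULL);
}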

@whitslack
Collaborator

Could you grep for bcli please?

@darosior: Here's a grep of the log from midnight (well before the failure) until I restarted C-Lightning.

# grep -a '^2020-04-27T.*bcli' /var/log/lightningd.log | sed -n '0,/started/p'
2020-04-27T00:00:40.297Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627774 (23468 ms)
2020-04-27T00:17:00.390Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627775 (25481 ms)
2020-04-27T00:18:41.818Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627775 (11355 ms)
2020-04-27T00:23:13.474Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627776 (40261 ms)
2020-04-27T00:39:19.749Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627777 (15651 ms)
2020-04-27T00:41:41.165Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627778 (11148 ms)
2020-04-27T00:43:28.371Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627778 (17144 ms)
2020-04-27T00:44:21.677Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627778 (23288 ms)
2020-04-27T00:48:59.666Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627779 (14634 ms)
2020-04-27T00:57:55.708Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627779 (13870 ms)
2020-04-27T01:03:26.656Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627779 (22420 ms)
2020-04-27T01:10:37.584Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblock 0000000000000000000ca523cfa0b8690e06e78e9a02287c7910cdfdc4bda443 0 (13861 ms)
2020-04-27T01:44:23.676Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627783 (29478 ms)
2020-04-27T01:47:51.789Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627784 (14816 ms)
2020-04-27T01:47:55.669Z DEBUG plugin-bcli: Log pruned 241020 entries (mem 104877937 -> 44367821)
2020-04-27T03:23:18.143Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblock 000000000000000000020c4438cfdd411f6a68815047be31610ee063516fe755 0 (18891 ms)
2020-04-27T03:43:07.144Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627792 (29436 ms)
2020-04-27T04:31:45.853Z DEBUG plugin-bcli: Log pruned 250656 entries (mem 104910406 -> 49111736)
2020-04-27T04:35:21.728Z DEBUG plugin-bcli: sendrawtx exit 0 (bitcoin-cli -datadir=/var/lib/bitcoind sendrawtransaction ##redacted##)
2020-04-27T04:35:21.847Z DEBUG plugin-bcli: sendrawtx exit 0 (bitcoin-cli -datadir=/var/lib/bitcoind sendrawtransaction ##redacted##)
2020-04-27T04:35:22.016Z DEBUG plugin-bcli: sendrawtx exit 0 (bitcoin-cli -datadir=/var/lib/bitcoind sendrawtransaction ##redacted##)
2020-04-27T04:35:22.029Z DEBUG plugin-bcli: sendrawtx exit 0 (bitcoin-cli -datadir=/var/lib/bitcoind sendrawtransaction ##redacted##)
2020-04-27T04:35:22.040Z DEBUG plugin-bcli: sendrawtx exit 0 (bitcoin-cli -datadir=/var/lib/bitcoind sendrawtransaction ##redacted##)
2020-04-27T04:46:58.423Z DEBUG plugin-bcli: sendrawtx exit 0 (bitcoin-cli -datadir=/var/lib/bitcoind sendrawtransaction ##redacted##)
2020-04-27T18:08:58.657Z DEBUG plugin-manager: started(10206) /usr/bin/../libexec/c-lightning/plugins/bcli

@whitslack
Collaborator

However this happened, I think we should register SIGCHLD and die if bcli dies. I will submit a patch.

@darosior: How do you not get a SIGPIPE when you try to write to the bcli process which has already terminated?

@darosior
Contributor Author

Thanks for the logs. That's what I'm wondering: we should have tried to send a request to bcli and died long before (we poll the feerate every 30 seconds!).
Could you send me your whole logs? You can drop them to me by mail if you don't want to reveal them to the world. If you prefer not to, could you grep -i for feerate?

@whitslack
Collaborator

whitslack commented Apr 27, 2020

Could you send me your whole logs?

@darosior: Logs, compressed with xz, encrypted using GPG with a symmetric cipher (passphrase emailed to you), then encoded using base64 because GitHub doesn't allow attachments of arbitrary content type: lightningd-redacted.log.xz.gpg.base64.txt

Edit: Just realized I could probably have obtained your GPG public key and used that instead of a shared secret. Oops.

@darosior
Contributor Author

darosior commented Apr 28, 2020

OK, so we do indeed seem to have stopped polling for feerates just before bcli died.
My guess is we never restarted the polling timers because we never got a response from bcli.
I still fail to understand how all the commands could have been written to bcli successfully, and why bcli got killed (logs pruned).

Can you try with the latest release and log-level=io? (Sorry, I have no better recommendation...)
I think #3675 fixes this, if you can try it or wait for the upcoming rc (log-level=io would still help though if we need to re-debug why bcli is failing).

@darosior
Contributor Author

Why not fold this into #3213? Make bcli (and all our other builtin plugins) a "crucial" plugin: if it is not loaded and initialized properly, we just die.

This is already the case for bcli (and I thought the other way around was true too, that we would die if bcli died but apparently not!).

@whitslack
Collaborator

OK, so we do indeed seem to have stopped polling for feerates just before bcli died.
My guess is we never restarted the polling timers because we never got a response from bcli.
I still fail to understand how all the commands could have been written to bcli successfully, and why bcli got killed (logs pruned).

What would have happened if this sequence of events had occurred?

  1. lightningd writes feerates command to bcli.
  2. bcli reads command from lightningd.
  3. bcli attempts to fork a child process to exec bitcoin-cli, but fork fails due to resource exhaustion.

Does bcli die in this case? Probably it should return an error to lightningd instead.
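
For illustration only (this is not bcli's actual code), a self-contained toy model of the "return an error instead of dying" idea when fork fails under resource exhaustion:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 0 on success; on failure fills errbuf and returns -1 instead of exiting. */
static int spawn_bitcoin_cli(const char *arg, char *errbuf, size_t errlen)
{
	pid_t child = fork();

	if (child < 0) {
		/* EAGAIN/ENOMEM under resource exhaustion: report, don't die. */
		snprintf(errbuf, errlen, "fork failed: %s", strerror(errno));
		return -1;
	}
	if (child == 0) {
		execlp("bitcoin-cli", "bitcoin-cli", arg, (char *)NULL);
		_exit(127);	/* exec failed */
	}
	return 0;
}

int main(void)
{
	char err[128];

	if (spawn_bitcoin_cli("getblockcount", err, sizeof(err)) != 0)
		fprintf(stderr, "would report a JSON-RPC error to lightningd: %s\n", err);
	return 0;
}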

If lightningd never got a response to its feerates command, would the polling timer fire again, or is the timer only reset when a response is read?

Can you try with the latest release and log-level=io? (Sorry, I have no better recommendation...)
I think #3675 fixes this, if you can try it or wait for the upcoming rc (log-level=io would still help though if we need to re-debug why bcli is failing).

I'm not keen to restart my C-Lightning more often than necessary, as it is a high-volume production node and I'd prefer to keep service disruptions to a minimum. I also prefer not to run release candidates for the same reason. I am leery of increasing the log-level even further, as I am already generating gigabytes of logs in a short time at the debug level. Also, the failure mode is rare and hard to reproduce, as it seems to occur under resource exhaustion, so I might have to run for a very long time before it happens again. It might be more fruitful to examine the possible scenario I outlined above (failing fork).

@darosior
Contributor Author

What would have happened if this sequence of events had occurred?

It dies, and lightningd would too at the next request it tries to send to bcli. The thing is, the timers to send the next polling request (namely getblock or estimatefees) are set in the "success" callback of the previous call, so I presume you ran into a deadlock. #3675 fixes this by making lightningd crash as soon as the connection to bcli (or any other backend) is lost.
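
As a self-contained toy model of that deadlock (a simplification, not lightningd's actual timer code): the next poll is only scheduled from the success callback of the previous request, so if a request never completes, polling stops forever.

#include <stdbool.h>
#include <stdio.h>

static bool timer_armed;

/* The only place the next poll ever gets scheduled. */
static void poll_success_cb(void)
{
	printf("got response, re-arming poll timer\n");
	timer_armed = true;
}

static void poll_fire(bool backend_responds)
{
	timer_armed = false;	/* timer consumed */
	printf("polling backend...\n");
	if (backend_responds)
		poll_success_cb();
	/* else: no callback runs, so nothing re-arms the timer */
}

int main(void)
{
	poll_fire(true);	/* normal cycle: timer re-armed */
	poll_fire(false);	/* backend dies mid-request */
	printf("poll timer armed afterwards: %s\n", timer_armed ? "yes" : "no");
	return 0;
}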

I'm not keen to restart my C-Lightning more often than necessary, as it is a high-volume production node and I'd prefer to keep service disruptions to a minimum.

I understand. I hope you won't have another resource exhaustion before next week :-)

@whitslack
Collaborator

What would have happened if this sequence of events had occurred?

It dies, and lightningd would too at the next request it tries to send to bcli.

When would the next request occur if the polling timer has not been re-armed? Would it occur if lightningd needs to broadcast a transaction? How does lightningd learn of new blocks? Is that also performed by the same polling timer?

@whitslack
Collaborator

I'm not keen to restart my C-Lightning more often than necessary

By the way, it'd be really cool if the log level could be changed on the fly. 😉

@darosior
Contributor Author

When would the next request occur if the polling timer has not been re-armed?

Never..

Would it occur if lightningd needs to broadcast a transaction?

Ah, yes. I meant "never for the fees and blockchain updates".

How does lightningd learn of new blocks?

Polling with getblockcount

Is that also performed by the same polling timer?

One for the fees, one for the topology (chain)

By the way, it'd be really cool if the log level could be changed on the fly. 😉

Yeah! I've personally wished for this on my own nodes too :-)

@whitslack
Collaborator

One for the fees, one for the topology (chain)

So I think in my case, both polling timers must have been affected, presumably because bcli's calls to both estimatesmartfee and getblockcount failed to fork a bitcoin-cli child process.

Thanks for the insight.

@darosior
Contributor Author

Yeah, it's that strange perfect synchronicity that puzzled me. On top of that, it happened twice in a row!
Maybe we should stagger the timers' starting times, but still: I didn't expect them to remain synchronous after running for a while.

@whitslack
Collaborator

@darosior: Maybe you should log (at the DEBUG level) when the timers fire.

@darosior
Contributor Author

darosior commented Apr 28, 2020 via email

@darosior
Contributor Author

darosior commented Oct 17, 2020

I think we can close this; it was solved by #3675.
