
lightningd: make shutdown smoother and safer #4897

Conversation

@SimonVrouwe (Collaborator) commented Nov 1, 2021

Another proof of concept (should I make this a draft PR?), to address the issues raised in #4785, #4790 and #4883.

The point of shutdown is to wind down or reduce activity. Starting a new io_loop in order to write shutdown notifications to plugins and wait for them to terminate should not trigger any new activity other than logging and (maybe?) notifications.

So before restarting this io_loop we shut down the subdaemons and disable handling of all JSON RPC requests and responses that carry an "id" field. Hook callbacks already abandon, and without subdaemons or JSON RPC no new hooks can be called. All of that should prevent any db_write's, so we can safely start the io_loop and shut down all plugins.
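To make the filtering rule concrete, here is a minimal self-contained sketch (not lightningd's actual code; the helper names and the naive substring check are simplifications for illustration). The real implementation hooks into the plugin JSON reader, but the rule is the same: during shutdown, any message carrying an "id" field (a request or a response) is dropped, while id-less notifications and log messages still pass.

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical stand-in for lightningd's shutdown state. */
    static bool shutting_down = true;

    /* Naive illustration: a JSON-RPC request or response carries an "id"
     * member, notifications and log messages do not.  Real code would parse
     * the message properly instead of substring matching. */
    static bool json_has_id(const char *msg)
    {
        return strstr(msg, "\"id\"") != NULL;
    }

    /* During shutdown, drop anything with an "id" so no response callbacks
     * or new requests can reach code paths that might touch the db. */
    static bool keep_during_shutdown(const char *msg)
    {
        if (shutting_down && json_has_id(msg))
            return false;
        return true;
    }

    int main(void)
    {
        const char *req = "{\"jsonrpc\":\"2.0\",\"id\":7,\"method\":\"listpeers\"}";
        const char *notif = "{\"jsonrpc\":\"2.0\",\"method\":\"log\",\"params\":{}}";

        printf("request kept: %s\n", keep_during_shutdown(req) ? "yes" : "no");
        printf("notification kept: %s\n", keep_during_shutdown(notif) ? "yes" : "no");
        return 0;
    }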

Also cleans up some dead code related to the unused destroy_utxos destructor.

TODO:

  • documentation

@SimonVrouwe SimonVrouwe force-pushed the 2021-10-30_close_subd-plugins-rpc_ignore_id-json branch from 7b22790 to 1768fed Compare November 3, 2021 08:51
@SimonVrouwe (Collaborator, Author)

Rebased.

A little back and forth on what to do with RPC commands in shutdown: failing them results in some ** BROKEN ** logs from plugins that subscribed to shutdown (all built-in plugins with DEVELOPER=1) which happen to receive the result of a (previous) RPC command while waiting in the 30s shutdown loop. This could be solved by lightningd not returning any result (command_still_pending forever) when in shutdown, but that would make RPC calls made (maybe just by chance) during the 30s shutdown loop hang forever, which is not nice in my opinion.

So I chose to return error code -32601 with the message lightningd is shutting down, and will see if it's possible to not make our tests fail on that particular error.
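For illustration, a hand-rolled sketch of the kind of reply meant here (not the actual jsonrpc code, and the exact error code changed during review: the changelog further down uses -5). The point is simply to answer with an error rather than let the caller hang:

    #include <stddef.h>
    #include <stdio.h>

    /* Sketch only: format a JSON-RPC error for a command that arrives while
     * lightningd is shutting down, instead of leaving the call pending forever. */
    static int format_shutdown_error(char *buf, size_t buflen, const char *id)
    {
        return snprintf(buf, buflen,
                        "{\"jsonrpc\":\"2.0\",\"id\":%s,"
                        "\"error\":{\"code\":-32601,"
                        "\"message\":\"lightningd is shutting down\"}}",
                        id);
    }

    int main(void)
    {
        char buf[256];

        format_shutdown_error(buf, sizeof(buf), "42");
        printf("%s\n", buf);
        return 0;
    }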

@cdecker BTW: I see #4883 was closed, but that test (same as 9aeeae5 in this PR) about consistent hook-semantics-in-shutdown still fails on v0.10.2rc2

@cdecker (Member) commented Nov 3, 2021

@cdecker BTW: I see #4883 was closed, but that test (same as 9aeeae5 in this PR) about consistent hook-semantics-in-shutdown still fails on v0.10.2rc2

Sorry about that, I must've tagged the wrong issue in the PR.

@SimonVrouwe (Collaborator, Author)

Considering all these troubles and the fact that the shutdown notification is currently only used by built-in plugins to do a memleak_check:

lightning/plugins/libplugin.c

Lines 1265 to 1267 in 1268967

bool is_shutdown = streq(cmd->methodname, "shutdown");
if (is_shutdown)
memleak_check(plugin, cmd);

No intention to hold up the release, but if there is no time, maybe the shutdown notification can be made a developer-only option (for now), before plugin developers get the wrong impression of what they can or should do with it.

@SimonVrouwe SimonVrouwe force-pushed the 2021-10-30_close_subd-plugins-rpc_ignore_id-json branch from 470388d to d467937 Compare November 5, 2021 12:19
@ZmnSCPxj (Collaborator) commented Nov 8, 2021

No intention to hold up the release but if there is no time, maybe the shutdown notification can be made a developer-only option (for now)? Before plugin developers get the wrong impression what they can or should do with it.

I object! It should be a developer-only option forever. I think plugins should really just handle EOF on their input, as that handles the case where lightningd crashes unexpectedly.
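For example, a bare-bones plugin main loop (a generic sketch, not libplugin) that treats EOF on stdin as "lightningd is gone" and exits cleanly, whether the shutdown was orderly or a crash:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];

        for (;;) {
            ssize_t r = read(STDIN_FILENO, buf, sizeof(buf));
            if (r == 0) {
                /* EOF: lightningd closed our stdin (clean shutdown
                 * or crash).  Clean up and exit. */
                fprintf(stderr, "lightningd gone, exiting\n");
                exit(0);
            }
            if (r < 0) {
                perror("read");
                exit(1);
            }
            /* ...feed the r bytes to the JSON parser as usual... */
        }
    }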

@SimonVrouwe SimonVrouwe force-pushed the 2021-10-30_close_subd-plugins-rpc_ignore_id-json branch 2 times, most recently from e634855 to 2254ff5 Compare November 15, 2021 22:58
@rustyrussell (Contributor) left a comment

Probably drop final commit?

Review comments on lightningd/lightningd.c (outdated, resolved).
@ZmnSCPxj (Collaborator)

How about deprecating the shutdown notification and instead create a new clean-shutdown chainable hook?

Then the shutdown command first triggers the hook and waits for all registered plugins to handle the hook before it actually sets the internal flag.

Using a hook provides cleaner semantics: we know exactly when plugins have finished whatever clean-shutdown-related work they have, because then they would respond with {"result": "continue"}. No more phasing and no more heuristic-based waiting --- what if the node scales up and clean shutdown requires more than 30 seconds for a plugin to handle?

It also strongly implies that it only works for a clean shutdown, and not for all shutdown conditions. Plugins are still strongly encouraged not to break if lightningd dies out from under them.
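A toy sketch of the proposed semantics (entirely hypothetical, nothing like this exists in the tree): lightningd would fire the hook, wait until every registered plugin has answered {"result": "continue"}, and only then flip its internal shutdown flag, with no fixed timeout involved.

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_PLUGINS 3

    /* Hypothetical per-plugin state for a chainable clean-shutdown hook. */
    struct hook_sub {
        const char *name;
        bool replied_continue;
    };

    static bool all_replied(const struct hook_sub *subs, int n)
    {
        for (int i = 0; i < n; i++)
            if (!subs[i].replied_continue)
                return false;
        return true;
    }

    int main(void)
    {
        struct hook_sub subs[NUM_PLUGINS] = {
            { "backup", false }, { "rebalancer", false }, { "watchtower", false },
        };
        bool shutdown_flag = false;

        /* Simulate each plugin finishing its clean-shutdown work and
         * answering {"result": "continue"} in its own time. */
        for (int i = 0; i < NUM_PLUGINS; i++) {
            subs[i].replied_continue = true;
            printf("%s replied continue\n", subs[i].name);
        }

        if (all_replied(subs, NUM_PLUGINS))
            shutdown_flag = true;   /* only now start tearing things down */

        printf("shutdown_flag = %s\n", shutdown_flag ? "true" : "false");
        return 0;
    }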

@SimonVrouwe SimonVrouwe force-pushed the 2021-10-30_close_subd-plugins-rpc_ignore_id-json branch 2 times, most recently from 9e446de to 9dd7e22 Compare November 18, 2021 08:28
@SimonVrouwe (Collaborator, Author)

Still one failing test: test_hsmtool_secret_decryption.
I managed to reproduce the hang, which only happens occasionally, by setting TIMEOUT in pyln.testing.utils to a large value and then running:
while pytest tests/ -s -k 'test_hsmtool_secret_decryption' --maxfail=1; do echo "ok"; done
After a while one of the runs seems to hang, only printing lines like gossipd: seeker: no peers, waiting
(NOTE: I was also running a full pytest tests/ in another terminal, not sure if relevant.)

But it seems to hang on
tools/hsmtool decrypt /tmp/ltests-x5nqrv48/test_hsmtool_secret_decryption_1/lightning-1/regtest/hsm_secret

Connect gdb to that and show the backtrace:

0x00007f109fc1a461 in __GI___libc_read (fd=0, buf=0x559ab854d1e0, nbytes=1024) at ../sysdeps/unix/sysv/linux/read.c:26
26      ../sysdeps/unix/sysv/linux/read.c: No such file or directory.
(gdb) bt
#0  0x00007f109fc1a461 in __GI___libc_read (fd=0, buf=0x559ab854d1e0, nbytes=1024) at ../sysdeps/unix/sysv/linux/read.c:26
#1  0x00007f109fbac670 in _IO_new_file_underflow (fp=0x7f109fceba00 <_IO_2_1_stdin_>) at libioP.h:839
#2  0x00007f109fba0e38 in _IO_getdelim (lineptr=0x7fff42ec6d10, n=0x7fff42ec6d18, delimiter=10, fp=0x7f109fceba00 <_IO_2_1_stdin_>) at iogetdelim.c:73
#3  0x0000559ab66f317b in getline_stdin_pass (passwd=0x7fff42ec6d10, passwd_size=0x7fff42ec6d18) at common/hsm_encryption.c:83
#4  0x0000559ab66f3316 in read_stdin_pass (reason=0x7fff42ec6de0) at common/hsm_encryption.c:111
#5  0x0000559ab66c0340 in decrypt_hsm (hsm_secret_path=0x7fff42ec8192 "/tmp/ltests-x5nqrv48/test_hsmtool_secret_decryption_1/lightning-1/regtest/hsm_secret") at tools/hsmtool.c:172
#6  0x0000559ab66c191c in main (argc=3, argv=0x7fff42ec7008) at tools/hsmtool.c:599

It is somewhere here in our test:

lightning/tests/test_wallet.py

Lines 1083 to 1090 in efeb1bc

# We can't use a wrong password !
master_fd, slave_fd = os.openpty()
hsmtool = HsmTool("decrypt", hsm_path)
hsmtool.start(stdin=slave_fd,
              stdout=subprocess.PIPE, stderr=subprocess.PIPE)
hsmtool.wait_for_log(r"Enter hsm_secret password:")
write_all(master_fd, "A wrong pass\n\n".encode("utf-8"))
hsmtool.proc.wait(WAIT_TIMEOUT)

Well, at least we know where to look now.
Maybe the test framework can be tweaked so that when a TimeoutError is raised, it looks around for such hangs and shows backtraces in the log report?

@SimonVrouwe SimonVrouwe force-pushed the 2021-10-30_close_subd-plugins-rpc_ignore_id-json branch from 94ec11d to c1d766b Compare November 24, 2021 13:29
@SimonVrouwe (Collaborator, Author)

Rebased, modified commit messages but no code changes except for 2 suggestions made by rustyrussell,
and added a commit that fixes the hsmtool hang by using the TCSANOW flag (for the interested, have a look at this demo).

Added documentation.
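For the curious, the relevant termios pattern looks roughly like this (a generic sketch of disabling echo for a password prompt, not hsmtool's actual code): with TCSANOW the attribute change is applied immediately, instead of waiting for pending output the way TCSAFLUSH does.

    #include <stdio.h>
    #include <string.h>
    #include <termios.h>
    #include <unistd.h>

    /* Generic sketch: disable echo on fd before reading a password, applying
     * the change immediately with TCSANOW rather than TCSAFLUSH. */
    static int echo_off(int fd, struct termios *saved)
    {
        struct termios t;

        if (tcgetattr(fd, saved) != 0)
            return -1;
        t = *saved;
        t.c_lflag &= ~ECHO;
        return tcsetattr(fd, TCSANOW, &t);
    }

    static int echo_restore(int fd, const struct termios *saved)
    {
        return tcsetattr(fd, TCSANOW, saved);
    }

    int main(void)
    {
        struct termios saved;
        char pass[128];

        printf("Enter password: ");
        fflush(stdout);
        if (echo_off(STDIN_FILENO, &saved) != 0)
            perror("tcsetattr");
        if (fgets(pass, sizeof(pass), stdin))
            printf("\n(got %zu chars)\n", strlen(pass));
        echo_restore(STDIN_FILENO, &saved);
        return 0;
    }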

@SimonVrouwe (Collaborator, Author)

I think this can be merged.

How about deprecating the shutdown notification and instead create a new clean-shutdown chainable hook?

@ZmnSCPxj Not sure if it needs to be deprecated; this PR makes it safe I think, but it needs documentation.

I agree with a clean-shutdown hook, but how many use cases are there? Some of that functionality can also be accomplished with a method shutdown_clean such as in this example.

@SimonVrouwe SimonVrouwe force-pushed the 2021-10-30_close_subd-plugins-rpc_ignore_id-json branch from c1d766b to 191bb0b Compare November 25, 2021 10:23
plugins expect their hooks to work also in shutdown, see issue ElementsProject#4883
…ess to db

    because:
    - shutdown_subdaemons can trigger a db write, comments in that function say so at least
    - resurrecting the main event loop with subdaemons still running is counterproductive
      in shutting down activity (such as htlc's, hook_calls etc.)
    - custom behavior injected by plugins via hooks should be consistent, see the test
      in the previous commit

    IDEA:

    in shutdown_plugins, when starting new io_loop:

    - A plugin that is still running can return a jsonrpc_request response, this triggers
      response_cb, which cannot be handled because subdaemons are gone -> so any response_cb should be blocked/aborted

    - jsonrpc is still there, so users (such as plugins) can make new jsonrpc_request's which
      cannot be handled because subdaemons are gone -> so new rpc_request should also be blocked

    - But we do want to send/receive notifications and log messages (handled in jsonrpc as jsonrpc_notification),
      as these do not trigger subdaemon calls or db_write's.
      Log messages and notifications do not have an "id" field, whereas jsonrpc_request's *do* have an "id" field.

    PLAN (hypothesis):
    - hack into plugin_read_json_one OR plugin_response_handle to filter out json with
      an "id" field, this should
      block/abandon any jsonrpc_request responses (and new jsonrpc_requests for plugins?)

  Q. Can internal (so not via plugin) jsonrpc_requests called in the main io_loop return/revive in
     the shutdown io_loop?
  A. No. All code under lightningd/ returning command_still_pending depends on either a subdaemon, timer or
     plugin. In the shutdown loop the subdaemons are dead, the timer struct is cleared and plugins will be taken
     care of (in the next commits).

 fixup: we can only io_break the main io_loop once
…ite's anymore

Since PR ElementsProject#3867, utxos are unreserved by height; destroy_utxos and
related functions are not used anymore, so clean them up as well.

However free(ld->jsonrpc) still needs to happen before free(ld) because its
destructors need list_head pointers from ld
Not needed anymore, see previous commit
…code -5

and the two conditions in which plugins can receive shutdown notification
shutdown_subdaemons frees the channel and calls destroy_close_command_on_channel_destroy, see gdb:

    0  destroy_close_command_on_channel_destroy (_=0x55db6ca38e18, cc=0x55db6ca43338) at lightningd/closing_control.c:94
    1  0x000055db6a8181b5 in notify (ctx=0x55db6ca38df0, type=TAL_NOTIFY_FREE, info=0x55db6ca38e18, saved_errno=0) at ccan/ccan/tal/tal.c:237
    2  0x000055db6a8186bb in del_tree (t=0x55db6ca38df0, orig=0x55db6ca38e18, saved_errno=0) at ccan/ccan/tal/tal.c:402
    3  0x000055db6a818a47 in tal_free (ctx=0x55db6ca38e18) at ccan/ccan/tal/tal.c:486
    4  0x000055db6a73fffa in shutdown_subdaemons (ld=0x55db6c8b4ca8) at lightningd/lightningd.c:543
    5  0x000055db6a741098 in main (argc=21, argv=0x7ffffa3e8048) at lightningd/lightningd.c:1192

Before this PR, there was no io_loop after shutdown_subdaemons and the client side raised a
general `Connection to RPC server lost.`
Now we test the more specific `Channel forgotten before proper close.`, which is good!

BTW, this test was added recently in PR ElementsProject#4599.
It seems to be a timing issue that should be figured out; this makes the test pass.
for the case where the "listpeers" rpc returns an error, such as in shutdown
…s pipe

Setting SIGCHLD back to default (i.e. ignored) makes waitpid hang on an
old SIGCHLD that was still in the pipe?

This happens running test_important_plugin with developer=1:
(or with dev=0 and built-in plugins subscribed to "shutdown")

0  0x00007ff8336b6437 in __GI___waitpid (pid=-1, stat_loc=0x0, options=1) at ../sysdeps/unix/sysv/linux/waitpid.c:30
1  0x000055fb468f733a in sigchld_rfd_in (conn=0x55fb47c7cfc8, ld=0x55fb47bdce58) at lightningd/lightningd.c:785
2  0x000055fb469bcc6b in next_plan (conn=0x55fb47c7cfc8, plan=0x55fb47c7cfe8) at ccan/ccan/io/io.c:59
3  0x000055fb469bd80b in do_plan (conn=0x55fb47c7cfc8, plan=0x55fb47c7cfe8, idle_on_epipe=false) at ccan/ccan/io/io.c:407
4  0x000055fb469bd849 in io_ready (conn=0x55fb47c7cfc8, pollflags=1) at ccan/ccan/io/io.c:417
5  0x000055fb469bfa26 in io_loop (timers=0x55fb47c41198, expired=0x7ffdf4be9028) at ccan/ccan/io/poll.c:453
6  0x000055fb468f1be9 in io_loop_with_timers (ld=0x55fb47bdce58) at lightningd/io_loop_with_timers.c:21
7  0x000055fb46924817 in shutdown_plugins (ld=0x55fb47bdce58) at lightningd/plugin.c:2114
8  0x000055fb468f7c92 in main (argc=22, argv=0x7ffdf4be9228) at lightningd/lightningd.c:1195
… test_hsm*

No idea why TCSAFLUSH was used, could not find anything in PR comments.
Also cannot explain exactly what causes the problem, but the hang can be reproduced
*with* TCSAFLUSH and not with TCSANOW.

According to termios doc:
TCSANOW
    the change occurs immediately.
TCSAFLUSH
    the change occurs after all output written to the object referred by fd has been
    transmitted, and all input that has been received but not read will be discarded
    before the change is made.
Fixes: ElementsProject#4785
Fixes: ElementsProject#4883

Changelog-Changed: Plugins: `shutdown` notification is now sent when lightningd is almost completely shut down; RPC calls then fail with error code -5.
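For context on the SIGCHLD/waitpid commit above, the usual self-pipe pattern looks roughly like this generic sketch (not lightningd's actual sigchld_rfd_in code, and not an explanation of the observed hang): the signal handler only pokes a pipe, and the reader later drains the pipe and reaps exited children non-blockingly with WNOHANG.

    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int sigchld_pipe[2];

    /* Async-signal-safe handler: just poke the pipe. */
    static void on_sigchld(int sig)
    {
        char c = 1;
        (void)sig;
        if (write(sigchld_pipe[1], &c, 1) != 1) {
            /* Nothing useful we can do inside a signal handler. */
        }
    }

    /* Reader side: drain the (non-blocking) pipe, then reap every exited
     * child with WNOHANG so this never waits for a still-running child. */
    static void reap_children(void)
    {
        char c;

        while (read(sigchld_pipe[0], &c, 1) == 1)
            ;
        while (waitpid(-1, NULL, WNOHANG) > 0)
            ;
    }

    int main(void)
    {
        if (pipe(sigchld_pipe) != 0) {
            perror("pipe");
            return 1;
        }
        fcntl(sigchld_pipe[0], F_SETFL, O_NONBLOCK);
        signal(SIGCHLD, on_sigchld);

        if (fork() == 0)
            _exit(0);   /* child exits immediately */

        sleep(1);       /* give the SIGCHLD time to arrive */
        reap_children();
        printf("children reaped\n");
        return 0;
    }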
@SimonVrouwe SimonVrouwe force-pushed the 2021-10-30_close_subd-plugins-rpc_ignore_id-json branch from 191bb0b to ff0d591 Compare November 26, 2021 21:08
@SimonVrouwe (Collaborator, Author)

Bugs keep coming.

Rebased it again, mostly a change to fix the (previous) failure by adding checks around the io_break calls.

if (jcon->ld->stop_conn == conn && jcon->ld->state == LD_STATE_RUNNING) {
	/* Return us to toplevel lightningd.c */
	io_break(jcon->ld);

io_break(destroy_plugin);

io_break(plugin_shutdown_timeout);

lightning/lightningd/plugin.c

Lines 2119 to 2120 in ff0d591

void *ret = io_loop_with_timers(ld);
assert(ret == plugin_shutdown_timeout || ret == destroy_plugin);

Maybe something similar is useful for #4936?

@rustyrussell (Contributor) left a comment

Ack ff0d591

@@ -113,7 +113,7 @@ struct plugins {
 	/* Blacklist of plugins from --disable-plugin */
 	const char **blacklist;
 
-	/* Whether we are shutting down (`plugins_free` is called) */
+	/* Whether we are shutting down, blocks db write's */
 	bool shutdown;
A Contributor left a comment:

This would be far neater as an enum lightningd_state, but we can patch that afterwards.

time_from_sec(30),
plugin_shutdown_timeout, ld);
/* 30 seconds should do it, use a clean timers struct */
orig_timers = ld->timers;
A Contributor left a comment:

In theory this can break our debug tests in list_del_from() if a timer were to delete itself, but luckily we don't do that.

Review comment on tests/test_connection.py (resolved).
@rustyrussell (Contributor)

OK, I have a fix in my other PR (#4921) for the leak valgrind discovered, so I'll apply this after that one...

@rustyrussell (Contributor)

Actually, applying this now since it fixes the hsm encryption flake which is hitting our other PRs!
