lightningd: make shutdown smoother and safer #4897
7b22790
to
1768fed
Compare
Rebased. A little back and forth on what to do with RPC commands in shutdown: failing them results in some ** BROKEN ** logs by plugins that subscribed to So I chose to return error code @cdecker BTW: I see #4883 was closed, but that test (same as 9aeeae5 in this PR) about consistent hook-semantics-in-shutdown still fails on v0.10.2rc2
Considered all these troubles and the fact that Lines 1265 to 1267 in 1268967
No intention to hold up the release but if there is no time, maybe the |
470388d
to
d467937
Compare
I object! It should be a developer-only option forever. I think plugins should really just handle EOF on their input, as that handles the case where |
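A minimal sketch of the EOF handling suggested above, written in Python for illustration. It assumes a plugin that reads one JSON message per line from its input; the function name `serve_until_eof` and the `handle` callback are hypothetical and not part of any lightningd or pyln API.

```python
import io
import json


def serve_until_eof(stream, handle):
    """Read JSON messages line by line until the input is closed.

    `stream` is any text stream (a real plugin would pass sys.stdin) and
    `handle` is a hypothetical callback that receives each decoded message.
    Returns the number of messages handled.
    """
    handled = 0
    while True:
        line = stream.readline()
        if line == "":
            # readline() returns an empty string only at EOF: the parent
            # closed our input (clean shutdown or crash), so stop here.
            break
        line = line.strip()
        if not line:
            # Blank separator line between messages, nothing to decode.
            continue
        handle(json.loads(line))
        handled += 1
    return handled


if __name__ == "__main__":
    seen = []
    fake_stdin = io.StringIO('{"method": "shutdown"}\n\n{"method": "log"}\n')
    count = serve_until_eof(fake_stdin, seen.append)
    print(count, [m["method"] for m in seen])
```

Because the loop ends on EOF rather than on a specific notification, the same exit path covers both a clean shutdown and the case where the parent process disappears unexpectedly.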
e634855
to
2254ff5
Compare
Probably drop final commit?
How about deprecating the Then the Using a hook provides cleaner semantics: we know exactly when plugins have finished whatever clean-shutdown-related work they have, because then they would respond with It also strongly implies that it only works for a clean shutdown, and not for all shutdown conditions. Plugins are still strongly encouraged to not break if |
9e446de
to
9dd7e22
Compare
Still one failing test But it seems to hang on connect gdb to that, show backtrace:
is somewhere here in our test lightning/tests/test_wallet.py Lines 1083 to 1090 in efeb1bc
Well at least we know where to look now.
94ec11d
to
c1d766b
Compare
Rebased and modified commit messages, but no code changes except 2 suggestions made by rustyrussell. Added documentation.
I think this can be merged.
@ZmnSCPxj Not sure if it needs to be deprecated; this PR makes it safe, I think, but it needs documentation. I agree with a
c1d766b
to
191bb0b
Compare
plugins expect their hooks to work also in shutdown, see issue ElementsProject#4883
…ess to db because:
- shutdown_subdaemons can trigger db write, comments in that function say so at least
- resurrecting the main event loop with subdaemons still running is counterproductive in shutting down activity (such as htlc's, hook_calls etc.)
- custom behavior injected by plugins via hooks should be consistent, see test in previous commit
IDEA: in shutdown_plugins, when starting new io_loop:
- A plugin that is still running can return a jsonrpc_request response, this triggers response_cb, which cannot be handled because subdaemons are gone -> so any response_cb should be blocked/aborted
- jsonrpc is still there, so users (such as plugins) can make new jsonrpc_request's which cannot be handled because subdaemons are gone -> so new rpc_request should also be blocked
- But we do want to send/receive notifications and log messages (handled in jsonrpc as jsonrpc_notification) as these do not trigger subdaemon calls or db_write's
Log messages and notifications do not have an "id" field, whereas jsonrpc_request's *do* have an "id" field
PLAN (hypothesis):
- hack into plugin_read_json_one OR plugin_response_handle to filter out json with an "id" field, this should block/abandon any jsonrpc_request responses (and new jsonrpc_requests for plugins?)
Q. Can internal (so not via plugin) jsonrpc_requests called in the main io_loop return/revive in the shutdown io_loop?
A. No. All code under lightningd/ returning command_still_pending depends on either a subdaemon, timer or plugin. In shutdown loop the subdaemons are dead, timer struct cleared and plugins will be taken care of (in next commits).
fixup: we can only io_break the main io_loop once
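The "id"-based filter described in this commit message can be sketched as follows. This is a Python illustration of the idea only, not the C code in plugin_read_json_one or plugin_response_handle; the function name `allowed_in_shutdown` is hypothetical.

```python
import json


def allowed_in_shutdown(raw):
    """Return True if a JSON-RPC message may still be processed in shutdown.

    Requests and responses carry an "id" field and could lead to subdaemon
    calls or db writes, so they are dropped. Notifications and log messages
    have no "id" field and are let through.
    """
    msg = json.loads(raw)
    # Anything that is not a JSON object is rejected in this sketch.
    return isinstance(msg, dict) and "id" not in msg


if __name__ == "__main__":
    samples = [
        '{"jsonrpc": "2.0", "method": "log", "params": {"message": "bye"}}',
        '{"jsonrpc": "2.0", "id": 7, "method": "listpeers", "params": {}}',
        '{"jsonrpc": "2.0", "id": 3, "result": {}}',
    ]
    for raw in samples:
        print(allowed_in_shutdown(raw), raw)
```

Keying on the presence of "id" follows the JSON-RPC distinction between notifications and request/response pairs, so no per-method allow list is needed.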
…ite's anymore since PR ElementsProject#3867 utxos are unreserved by height, destroy_utxos and related functions are not used anymore so clean them up also However free(ld->jsonrpc) still needs to happen before free(ld) because its destructors need list_head pointers from ld
Not needed anymore, see previous commit
…code -5 and the two conditions in which plugins can receive shutdown notification
shutdown_subdaemons frees the channel and calls destroy_close_command_on_channel_destroy, see gdb:
0 destroy_close_command_on_channel_destroy (_=0x55db6ca38e18, cc=0x55db6ca43338) at lightningd/closing_control.c:94
1 0x000055db6a8181b5 in notify (ctx=0x55db6ca38df0, type=TAL_NOTIFY_FREE, info=0x55db6ca38e18, saved_errno=0) at ccan/ccan/tal/tal.c:237
2 0x000055db6a8186bb in del_tree (t=0x55db6ca38df0, orig=0x55db6ca38e18, saved_errno=0) at ccan/ccan/tal/tal.c:402
3 0x000055db6a818a47 in tal_free (ctx=0x55db6ca38e18) at ccan/ccan/tal/tal.c:486
4 0x000055db6a73fffa in shutdown_subdaemons (ld=0x55db6c8b4ca8) at lightningd/lightningd.c:543
5 0x000055db6a741098 in main (argc=21, argv=0x7ffffa3e8048) at lightningd/lightningd.c:1192
Before this PR, there was no io_loop after shutdown_subdaemons and client side raised a general `Connection to RPC server lost.` Now we test the more specific `Channel forgotten before proper close.`, which is good! BTW, this test was added recently in PR ElementsProject#4599.
Seems a timing issue that should be figured out; this makes the test pass.
for the case rpc "listpeers" returns an error, such as in shutdown
…s pipe
Setting SIGCHLD back to default (i.e. ignored) makes waitpid hang on an old SIGCHLD that was still in the pipe?
This happens running test_important_plugin with developer=1: (or with dev=0 and built-in plugins subscribed to "shutdown")
0 0x00007ff8336b6437 in __GI___waitpid (pid=-1, stat_loc=0x0, options=1) at ../sysdeps/unix/sysv/linux/waitpid.c:30
1 0x000055fb468f733a in sigchld_rfd_in (conn=0x55fb47c7cfc8, ld=0x55fb47bdce58) at lightningd/lightningd.c:785
2 0x000055fb469bcc6b in next_plan (conn=0x55fb47c7cfc8, plan=0x55fb47c7cfe8) at ccan/ccan/io/io.c:59
3 0x000055fb469bd80b in do_plan (conn=0x55fb47c7cfc8, plan=0x55fb47c7cfe8, idle_on_epipe=false) at ccan/ccan/io/io.c:407
4 0x000055fb469bd849 in io_ready (conn=0x55fb47c7cfc8, pollflags=1) at ccan/ccan/io/io.c:417
5 0x000055fb469bfa26 in io_loop (timers=0x55fb47c41198, expired=0x7ffdf4be9028) at ccan/ccan/io/poll.c:453
6 0x000055fb468f1be9 in io_loop_with_timers (ld=0x55fb47bdce58) at lightningd/io_loop_with_timers.c:21
7 0x000055fb46924817 in shutdown_plugins (ld=0x55fb47bdce58) at lightningd/plugin.c:2114
8 0x000055fb468f7c92 in main (argc=22, argv=0x7ffdf4be9228) at lightningd/lightningd.c:1195
… test_hsm*
No idea why TCSAFLUSH was used, could not find anything in PR comments. Also cannot explain exactly what causes the problem, but the hang can be reproduced *with* TCSAFLUSH and not with TCSANOW.
According to termios doc:
TCSANOW the change occurs immediately.
TCSAFLUSH the change occurs after all output written to the object referred by fd has been transmitted, and all input that has been received but not read will be discarded before the change is made.
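To make the TCSANOW / TCSAFLUSH distinction concrete, here is a small Python sketch using the standard `termios` module. It is not the lightningd code, which is C; the helper name `set_echo` is hypothetical, and the demo uses a pseudo-terminal from `pty.openpty()` only so that it can run without a real terminal.

```python
import os
import pty
import termios


def set_echo(fd, enabled, when=termios.TCSANOW):
    """Turn terminal echo on or off for `fd` and return the previous attributes.

    With TCSANOW the change applies immediately and unread input is kept.
    Passing termios.TCSAFLUSH instead waits for queued output to be sent and
    discards input that has been received but not yet read.
    """
    old = termios.tcgetattr(fd)
    new = termios.tcgetattr(fd)
    lflag = 3  # index of the local-modes field in the attribute list
    if enabled:
        new[lflag] |= termios.ECHO
    else:
        new[lflag] &= ~termios.ECHO
    termios.tcsetattr(fd, when, new)
    return old


if __name__ == "__main__":
    master, slave = pty.openpty()
    saved = set_echo(slave, False)  # e.g. before reading a passphrase
    echo_on = bool(termios.tcgetattr(slave)[3] & termios.ECHO)
    print("echo enabled:", echo_on)
    termios.tcsetattr(slave, termios.TCSANOW, saved)  # restore
    os.close(master)
    os.close(slave)
```

If a test harness writes the password before the echo setting is changed, a TCSAFLUSH call would throw that input away, which is one plausible way for a reader to end up waiting forever; the commit message itself does not claim to know the exact cause.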
Fixes: ElementsProject#4785
Fixes: ElementsProject#4883
Changelog-Changed: Plugins: `shutdown` notification is now sent when lightningd is almost completely shut down, RPC calls then fail with error code -5.
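From the plugin side, the changelog entry means an RPC call made during shutdown comes back as a JSON-RPC error with code -5 instead of a result. A minimal sketch of how a plugin might recognise that case, assuming it has the decoded response as a dict; the constant and function names here are made up for illustration, and the error message text in the demo is a placeholder.

```python
# Value taken from the changelog entry; the constant name is hypothetical.
SHUTDOWN_ERROR_CODE = -5


def is_shutdown_error(response):
    """Return True if a decoded JSON-RPC response is the shutdown error."""
    error = response.get("error")
    return isinstance(error, dict) and error.get("code") == SHUTDOWN_ERROR_CODE


if __name__ == "__main__":
    responses = [
        {"jsonrpc": "2.0", "id": 1, "result": {"peers": []}},
        {"jsonrpc": "2.0", "id": 2,
         "error": {"code": -5, "message": "placeholder message"}},
    ]
    for resp in responses:
        if is_shutdown_error(resp):
            print("node is shutting down, stop issuing RPC calls")
        else:
            print("normal response")
```

Treating this code as an expected condition rather than a failure avoids the kind of ** BROKEN ** log noise mentioned earlier in the conversation.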
191bb0b
to
ff0d591
Compare
Bugs keep coming. Rebased it again, mostly a change to fix the (previous) failure, by adding checks around lightning/lightningd/jsonrpc.c Lines 980 to 982 in ff0d591
Line 194 in ff0d591
Line 2089 in ff0d591
Lines 2119 to 2120 in ff0d591
Maybe something similar is useful for #4936?
Ack ff0d591
@@ -113,7 +113,7 @@ struct plugins {
 	/* Blacklist of plugins from --disable-plugin */
 	const char **blacklist;

-	/* Whether we are shutting down (`plugins_free` is called) */
+	/* Whether we are shutting down, blocks db write's */
 	bool shutdown;
This would be far neater as an enum lightningd_state, but we can patch that afterwards.
time_from_sec(30),
plugin_shutdown_timeout, ld);
/* 30 seconds should do it, use a clean timers struct */
orig_timers = ld->timers;
In theory this can break our debug tests in list_del_from() if a timer were to delete itself, but luckily we don't use that.
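The hunk above gives plugins a bounded time (30 seconds) to exit during shutdown. lightningd does this with a timer inside its own io_loop; the following Python sketch is only a rough analogy using child processes and a shared deadline, with a hypothetical helper name `wait_or_kill`.

```python
import subprocess
import sys
import time


def wait_or_kill(procs, timeout):
    """Wait up to `timeout` seconds in total for all processes to exit.

    Processes still running at the deadline are killed. Returns the list
    of processes that had to be killed.
    """
    deadline = time.monotonic() + timeout
    killed = []
    for proc in procs:
        # Every process shares the same deadline, not a fresh timeout each.
        remaining = max(0.0, deadline - time.monotonic())
        try:
            proc.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()  # reap the killed child
            killed.append(proc)
    return killed


if __name__ == "__main__":
    quick = subprocess.Popen([sys.executable, "-c", "pass"])
    slow = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    stuck = wait_or_kill([quick, slow], timeout=3)
    print("killed:", len(stuck))
```

Using one deadline for the whole set keeps the total shutdown time bounded regardless of how many plugins are running.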
OK, I have a fix for the leak valgrind discovered, in my other PR (#4921) so I'll apply this after that one...
Actually, applying this now since it fixes the hsm encryption flake which is hitting our other PRs!
Another proof of concept (should I make this a draft PR?), to address issues raised in #4785 #4790 and #4883
The point of shutdown is to break down or reduce activity. Starting a new `io_loop` in order to write `shutdown` notifications to plugins and wait for them to terminate should not trigger new activity other than logging and (maybe?) notifications.

So before restarting this `io_loop` we shut down subdaemons and disable handling of all JSON RPC requests and responses with an "id" field in them. Hook callbacks will already be abandoned, and without subdaemons or JSON RPC no new hooks can be called. All of that should prevent any `db_write`'s and we can safely start the `io_loop` and shut down all plugins.

Also cleans up some dead code related to the unused `destroy_utxos` destructor.

TODO: