
lightningd: make shutdown smoother and safer #4897

Conversation

@SimonVrouwe (Collaborator) commented Nov 1, 2021

Another proof of concept (should I make this a draft PR?), to address the issues raised in #4785, #4790 and #4883.

The point of shutdown is to wind down or reduce activity. Starting a new io_loop in order to write shutdown notifications to plugins and wait for them to terminate should not trigger any new activity other than logging and (maybe?) notifications.

So before restarting this io_loop we shut down the subdaemons and disable handling of all JSON RPC requests and responses that carry an "id" field. Hook callbacks already abandon, and without subdaemons or JSON RPC no new hooks can be called. All of that should prevent any db_write's, so we can safely start the io_loop and shut down all plugins.
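To make the filtering rule concrete, here is a minimal self-contained sketch (not lightningd's actual code; the helper names and the naive substring check are simplifications for illustration). The real implementation hooks into the plugin JSON reader, but the rule is the same: during shutdown, any message carrying an "id" field (a request or a response) is dropped, while id-less notifications and log messages still pass.

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical stand-in for lightningd's shutdown state. */
    static bool shutting_down = true;

    /* Naive illustration: a JSON-RPC request or response carries an "id"
     * member, notifications and log messages do not.  Real code would parse
     * the message properly instead of substring matching. */
    static bool json_has_id(const char *msg)
    {
        return strstr(msg, "\"id\"") != NULL;
    }

    /* During shutdown, drop anything with an "id" so no response callbacks
     * or new requests can reach code paths that might touch the db. */
    static bool keep_during_shutdown(const char *msg)
    {
        if (shutting_down && json_has_id(msg))
            return false;
        return true;
    }

    int main(void)
    {
        const char *req = "{\"jsonrpc\":\"2.0\",\"id\":7,\"method\":\"listpeers\"}";
        const char *notif = "{\"jsonrpc\":\"2.0\",\"method\":\"log\",\"params\":{}}";

        printf("request kept: %s\n", keep_during_shutdown(req) ? "yes" : "no");
        printf("notification kept: %s\n", keep_during_shutdown(notif) ? "yes" : "no");
        return 0;
    }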

Also cleans up some dead code related to the unused destroy_utxos destructor.

TODO:

  • documentation

@SimonVrouwe SimonVrouwe force-pushed the 2021-10-30_close_subd-plugins-rpc_ignore_id-json branch from 7b22790 to 1768fed Compare November 3, 2021 08:51
@SimonVrouwe (Collaborator, Author)

Rebased.

A little back and forth on what to do with RPC commands in shutdown: failing them results in some ** BROKEN ** logs from plugins that subscribed to shutdown (all built-in plugins with DEVELOPER=1) which happen to receive the result of a (previous) RPC command while waiting in the 30s shutdown loop. This could be solved by lightningd not returning any result (command_still_pending forever) when in shutdown, but that would make RPC calls made (maybe just by chance) during the 30s shutdown loop hang forever, which is not nice in my opinion.

So I chose to return error code -32601 with the message lightningd is shutting down, and will see if it's possible to not make our tests fail on that particular error.
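For illustration, a hand-rolled sketch of the kind of reply meant here (not the actual jsonrpc code, and the exact error code changed during review: the changelog further down uses -5). The point is simply to answer with an error rather than let the caller hang:

    #include <stddef.h>
    #include <stdio.h>

    /* Sketch only: format a JSON-RPC error for a command that arrives while
     * lightningd is shutting down, instead of leaving the call pending forever. */
    static int format_shutdown_error(char *buf, size_t buflen, const char *id)
    {
        return snprintf(buf, buflen,
                        "{\"jsonrpc\":\"2.0\",\"id\":%s,"
                        "\"error\":{\"code\":-32601,"
                        "\"message\":\"lightningd is shutting down\"}}",
                        id);
    }

    int main(void)
    {
        char buf[256];

        format_shutdown_error(buf, sizeof(buf), "42");
        printf("%s\n", buf);
        return 0;
    }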

@cdecker BTW: I see #4883 was closed, but that test (same as 9aeeae5 in this PR) about consistent hook-semantics-in-shutdown still fails on v0.10.2rc2

@cdecker (Member) commented Nov 3, 2021

@cdecker BTW: I see #4883 was closed, but that test (same as 9aeeae5 in this PR) about consistent hook-semantics-in-shutdown still fails on v0.10.2rc2

Sorry about that, I must've tagged the wrong issue in the PR.

@SimonVrouwe (Collaborator, Author)

Considering all these troubles and the fact that the shutdown notification is currently only used by built-in plugins to do a memleak_check:

lightning/plugins/libplugin.c

Lines 1265 to 1267 in 1268967

bool is_shutdown = streq(cmd->methodname, "shutdown");
if (is_shutdown)
memleak_check(plugin, cmd);

No intention to hold up the release, but if there is no time, maybe the shutdown notification can be made a developer-only option (for now), before plugin developers get the wrong impression of what they can or should do with it.

@SimonVrouwe SimonVrouwe force-pushed the 2021-10-30_close_subd-plugins-rpc_ignore_id-json branch from 470388d to d467937 Compare November 5, 2021 12:19
@ZmnSCPxj (Collaborator) commented Nov 8, 2021

No intention to hold up the release but if there is no time, maybe the shutdown notification can be made a developer-only option (for now)? Before plugin developers get the wrong impression what they can or should do with it.

I object! It should be a developer-only option forever. I think plugins should really just handle EOF on their input, as that handles the case where lightningd crashes unexpectedly.
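For example, a bare-bones plugin main loop (a generic sketch, not libplugin) that treats EOF on stdin as "lightningd is gone" and exits cleanly, whether the shutdown was orderly or a crash:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];

        for (;;) {
            ssize_t r = read(STDIN_FILENO, buf, sizeof(buf));
            if (r == 0) {
                /* EOF: lightningd closed our stdin (clean shutdown
                 * or crash).  Clean up and exit. */
                fprintf(stderr, "lightningd gone, exiting\n");
                exit(0);
            }
            if (r < 0) {
                perror("read");
                exit(1);
            }
            /* ...feed the r bytes to the JSON parser as usual... */
        }
    }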

@SimonVrouwe SimonVrouwe force-pushed the 2021-10-30_close_subd-plugins-rpc_ignore_id-json branch 2 times, most recently from e634855 to 2254ff5 Compare November 15, 2021 22:58
@rustyrussell (Contributor) left a comment

Probably drop final commit?

Review comments on lightningd/lightningd.c (outdated, resolved).
@ZmnSCPxj (Collaborator)

How about deprecating the shutdown notification and instead create a new clean-shutdown chainable hook?

Then the shutdown command first triggers the hook and waits for all registered plugins to handle the hook before it actually sets the internal flag.

Using a hook provides cleaner semantics: we know exactly when plugins have finished whatever clean-shutdown-related work they have, because then they would respond with {"result": "continue"}. No more phasing and no more heuristic-based waiting --- what if the node scales up and clean shutdown requires more than 30 seconds for a plugin to handle?

It also strongly implies that it only works for a clean shutdown, and not for all shutdown conditions. Plugins are still strongly encouraged not to break if lightningd dies out from under them.
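A toy sketch of the proposed semantics (entirely hypothetical, nothing like this exists in the tree): lightningd would fire the hook, wait until every registered plugin has answered {"result": "continue"}, and only then flip its internal shutdown flag, with no fixed timeout involved.

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_PLUGINS 3

    /* Hypothetical per-plugin state for a chainable clean-shutdown hook. */
    struct hook_sub {
        const char *name;
        bool replied_continue;
    };

    static bool all_replied(const struct hook_sub *subs, int n)
    {
        for (int i = 0; i < n; i++)
            if (!subs[i].replied_continue)
                return false;
        return true;
    }

    int main(void)
    {
        struct hook_sub subs[NUM_PLUGINS] = {
            { "backup", false }, { "rebalancer", false }, { "watchtower", false },
        };
        bool shutdown_flag = false;

        /* Simulate each plugin finishing its clean-shutdown work and
         * answering {"result": "continue"} in its own time. */
        for (int i = 0; i < NUM_PLUGINS; i++) {
            subs[i].replied_continue = true;
            printf("%s replied continue\n", subs[i].name);
        }

        if (all_replied(subs, NUM_PLUGINS))
            shutdown_flag = true;   /* only now start tearing things down */

        printf("shutdown_flag = %s\n", shutdown_flag ? "true" : "false");
        return 0;
    }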

@SimonVrouwe SimonVrouwe force-pushed the 2021-10-30_close_subd-plugins-rpc_ignore_id-json branch 2 times, most recently from 9e446de to 9dd7e22 Compare November 18, 2021 08:28
@SimonVrouwe (Collaborator, Author)

Still one failing test: test_hsmtool_secret_decryption.
I managed to reproduce the hang, which only happens occasionally, by setting TIMEOUT in pyln.testing.utils to a large value and then running:
while pytest tests/ -s -k 'test_hsmtool_secret_decryption' --maxfail=1; do echo "ok"; done
After a while one of the runs seems to hang, only printing lines like gossipd: seeker: no peers, waiting
(NOTE: I was also running a full pytest tests/ in another terminal, not sure if relevant.)

But it seems to hang on
tools/hsmtool decrypt /tmp/ltests-x5nqrv48/test_hsmtool_secret_decryption_1/lightning-1/regtest/hsm_secret

Connect gdb to that and show the backtrace:

0x00007f109fc1a461 in __GI___libc_read (fd=0, buf=0x559ab854d1e0, nbytes=1024) at ../sysdeps/unix/sysv/linux/read.c:26
26      ../sysdeps/unix/sysv/linux/read.c: No such file or directory.
(gdb) bt
#0  0x00007f109fc1a461 in __GI___libc_read (fd=0, buf=0x559ab854d1e0, nbytes=1024) at ../sysdeps/unix/sysv/linux/read.c:26
#1  0x00007f109fbac670 in _IO_new_file_underflow (fp=0x7f109fceba00 <_IO_2_1_stdin_>) at libioP.h:839
#2  0x00007f109fba0e38 in _IO_getdelim (lineptr=0x7fff42ec6d10, n=0x7fff42ec6d18, delimiter=10, fp=0x7f109fceba00 <_IO_2_1_stdin_>) at iogetdelim.c:73
#3  0x0000559ab66f317b in getline_stdin_pass (passwd=0x7fff42ec6d10, passwd_size=0x7fff42ec6d18) at common/hsm_encryption.c:83
#4  0x0000559ab66f3316 in read_stdin_pass (reason=0x7fff42ec6de0) at common/hsm_encryption.c:111
#5  0x0000559ab66c0340 in decrypt_hsm (hsm_secret_path=0x7fff42ec8192 "/tmp/ltests-x5nqrv48/test_hsmtool_secret_decryption_1/lightning-1/regtest/hsm_secret") at tools/hsmtool.c:172
#6  0x0000559ab66c191c in main (argc=3, argv=0x7fff42ec7008) at tools/hsmtool.c:599

It is somewhere here in our test:

lightning/tests/test_wallet.py

Lines 1083 to 1090 in efeb1bc

# We can't use a wrong password !
master_fd, slave_fd = os.openpty()
hsmtool = HsmTool("decrypt", hsm_path)
hsmtool.start(stdin=slave_fd,
              stdout=subprocess.PIPE, stderr=subprocess.PIPE)
hsmtool.wait_for_log(r"Enter hsm_secret password:")
write_all(master_fd, "A wrong pass\n\n".encode("utf-8"))
hsmtool.proc.wait(WAIT_TIMEOUT)

Well, at least we know where to look now.
Maybe the test framework can be tweaked so that when a TimeoutError is raised, it looks around for such hangs and shows backtraces in the log report?

@SimonVrouwe SimonVrouwe force-pushed the 2021-10-30_close_subd-plugins-rpc_ignore_id-json branch from 94ec11d to c1d766b Compare November 24, 2021 13:29
@SimonVrouwe (Collaborator, Author)

Rebased, modified commit messages but no code changes except for 2 suggestions made by rustyrussell,
and added a commit that fixes the hsmtool hang by using the TCSANOW flag (for the interested, have a look at this demo).

Added documentation.
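For the curious, the relevant termios pattern looks roughly like this (a generic sketch of disabling echo for a password prompt, not hsmtool's actual code): with TCSANOW the attribute change is applied immediately, instead of waiting for pending output the way TCSAFLUSH does.

    #include <stdio.h>
    #include <string.h>
    #include <termios.h>
    #include <unistd.h>

    /* Generic sketch: disable echo on fd before reading a password, applying
     * the change immediately with TCSANOW rather than TCSAFLUSH. */
    static int echo_off(int fd, struct termios *saved)
    {
        struct termios t;

        if (tcgetattr(fd, saved) != 0)
            return -1;
        t = *saved;
        t.c_lflag &= ~ECHO;
        return tcsetattr(fd, TCSANOW, &t);
    }

    static int echo_restore(int fd, const struct termios *saved)
    {
        return tcsetattr(fd, TCSANOW, saved);
    }

    int main(void)
    {
        struct termios saved;
        char pass[128];

        printf("Enter password: ");
        fflush(stdout);
        if (echo_off(STDIN_FILENO, &saved) != 0)
            perror("tcsetattr");
        if (fgets(pass, sizeof(pass), stdin))
            printf("\n(got %zu chars)\n", strlen(pass));
        echo_restore(STDIN_FILENO, &saved);
        return 0;
    }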

@SimonVrouwe (Collaborator, Author)

I think this can be merged.

How about deprecating the shutdown notification and instead create a new clean-shutdown chainable hook?

@ZmnSCPxj Not sure if it needs to be deprecated; this PR makes it safe I think, but it needs documentation.

I agree with a clean-shutdown hook, but how many use cases are there? Some of that functionality can also be accomplished with a method shutdown_clean such as in this example.

@SimonVrouwe SimonVrouwe force-pushed the 2021-10-30_close_subd-plugins-rpc_ignore_id-json branch from c1d766b to 191bb0b Compare November 25, 2021 10:23
plugins expect their hooks to work also in shutdown, see issue ElementsProject#4883
…ess to db

    because:
    - shutdown_subdaemons can trigger a db write, comments in that function say so at least
    - resurrecting the main event loop with subdaemons still running is counterproductive
      in shutting down activity (such as htlc's, hook_calls etc.)
    - custom behavior injected by plugins via hooks should be consistent, see the test
      in the previous commit

    IDEA:

    in shutdown_plugins, when starting new io_loop:

    - A plugin that is still running can return a jsonrpc_request response, this triggers
      response_cb, which cannot be handled because subdaemons are gone -> so any response_cb should be blocked/aborted

    - jsonrpc is still there, so users (such as plugins) can make new jsonrpc_request's which
      cannot be handled because subdaemons are gone -> so new rpc_request should also be blocked

    - But we do want to send/receive notifications and log messages (handled in jsonrpc as jsonrpc_notification),
      as these do not trigger subdaemon calls or db_write's.
      Log messages and notifications do not have an "id" field, whereas jsonrpc_request's *do* have an "id" field.

    PLAN (hypothesis):
    - hack into plugin_read_json_one OR plugin_response_handle to filter out json with
      an "id" field, this should
      block/abandon any jsonrpc_request responses (and new jsonrpc_requests for plugins?)

  Q. Can internal (so not via plugin) jsonrpc_requests called in the main io_loop return/revive in
     the shutdown io_loop?
  A. No. All code under lightningd/ returning command_still_pending depends on either a subdaemon, timer or
     plugin. In the shutdown loop the subdaemons are dead, the timer struct is cleared and plugins will be taken
     care of (in the next commits).

 fixup: we can only io_break the main io_loop once
…ite's anymore

Since PR ElementsProject#3867, utxos are unreserved by height; destroy_utxos and
related functions are not used anymore, so clean them up as well.

However free(ld->jsonrpc) still needs to happen before free(ld) because its
destructors need list_head pointers from ld
Not needed anymore, see previous commit
…code -5

and the two conditions in which plugins can receive shutdown notification
shutdown_subdaemons frees the channel and calls destroy_close_command_on_channel_destroy, see gdb:

    0  destroy_close_command_on_channel_destroy (_=0x55db6ca38e18, cc=0x55db6ca43338) at lightningd/closing_control.c:94
    1  0x000055db6a8181b5 in notify (ctx=0x55db6ca38df0, type=TAL_NOTIFY_FREE, info=0x55db6ca38e18, saved_errno=0) at ccan/ccan/tal/tal.c:237
    2  0x000055db6a8186bb in del_tree (t=0x55db6ca38df0, orig=0x55db6ca38e18, saved_errno=0) at ccan/ccan/tal/tal.c:402
    3  0x000055db6a818a47 in tal_free (ctx=0x55db6ca38e18) at ccan/ccan/tal/tal.c:486
    4  0x000055db6a73fffa in shutdown_subdaemons (ld=0x55db6c8b4ca8) at lightningd/lightningd.c:543
    5  0x000055db6a741098 in main (argc=21, argv=0x7ffffa3e8048) at lightningd/lightningd.c:1192

Before this PR, there was no io_loop after shutdown_subdaemons and the client side raised a
general `Connection to RPC server lost.`
Now we test the more specific `Channel forgotten before proper close.`, which is good!

BTW, this test was added recently in PR ElementsProject#4599.
It seems to be a timing issue that should be figured out; this makes the test pass.
for the case where the "listpeers" rpc returns an error, such as in shutdown
…s pipe

Setting SIGCHLD back to default (i.e. ignored) makes waitpid hang on an
old SIGCHLD that was still in the pipe?

This happens running test_important_plugin with developer=1:
(or with dev=0 and built-in plugins subscribed to "shutdown")

0  0x00007ff8336b6437 in __GI___waitpid (pid=-1, stat_loc=0x0, options=1) at ../sysdeps/unix/sysv/linux/waitpid.c:30
1  0x000055fb468f733a in sigchld_rfd_in (conn=0x55fb47c7cfc8, ld=0x55fb47bdce58) at lightningd/lightningd.c:785
2  0x000055fb469bcc6b in next_plan (conn=0x55fb47c7cfc8, plan=0x55fb47c7cfe8) at ccan/ccan/io/io.c:59
3  0x000055fb469bd80b in do_plan (conn=0x55fb47c7cfc8, plan=0x55fb47c7cfe8, idle_on_epipe=false) at ccan/ccan/io/io.c:407
4  0x000055fb469bd849 in io_ready (conn=0x55fb47c7cfc8, pollflags=1) at ccan/ccan/io/io.c:417
5  0x000055fb469bfa26 in io_loop (timers=0x55fb47c41198, expired=0x7ffdf4be9028) at ccan/ccan/io/poll.c:453
6  0x000055fb468f1be9 in io_loop_with_timers (ld=0x55fb47bdce58) at lightningd/io_loop_with_timers.c:21
7  0x000055fb46924817 in shutdown_plugins (ld=0x55fb47bdce58) at lightningd/plugin.c:2114
8  0x000055fb468f7c92 in main (argc=22, argv=0x7ffdf4be9228) at lightningd/lightningd.c:1195
… test_hsm*

No idea why TCSAFLUSH was used, could not find anything in PR comments.
Also cannot explain exactly what causes the problem, but the hang can be reproduced
*with* TCSAFLUSH and not with TCSANOW.

According to termios doc:
TCSANOW
    the change occurs immediately.
TCSAFLUSH
    the change occurs after all output written to the object referred by fd has been
    transmitted, and all input that has been received but not read will be discarded
    before the change is made.
Fixes: ElementsProject#4785
Fixes: ElementsProject#4883

Changelog-Changed: Plugins: `shutdown` notification is now sent when lightningd is almost completely shut down; RPC calls then fail with error code -5.
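For context on the SIGCHLD/waitpid commit above, the usual self-pipe pattern looks roughly like this generic sketch (not lightningd's actual sigchld_rfd_in code, and not an explanation of the observed hang): the signal handler only pokes a pipe, and the reader later drains the pipe and reaps exited children non-blockingly with WNOHANG.

    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int sigchld_pipe[2];

    /* Async-signal-safe handler: just poke the pipe. */
    static void on_sigchld(int sig)
    {
        char c = 1;
        (void)sig;
        if (write(sigchld_pipe[1], &c, 1) != 1) {
            /* Nothing useful we can do inside a signal handler. */
        }
    }

    /* Reader side: drain the (non-blocking) pipe, then reap every exited
     * child with WNOHANG so this never waits for a still-running child. */
    static void reap_children(void)
    {
        char c;

        while (read(sigchld_pipe[0], &c, 1) == 1)
            ;
        while (waitpid(-1, NULL, WNOHANG) > 0)
            ;
    }

    int main(void)
    {
        if (pipe(sigchld_pipe) != 0) {
            perror("pipe");
            return 1;
        }
        fcntl(sigchld_pipe[0], F_SETFL, O_NONBLOCK);
        signal(SIGCHLD, on_sigchld);

        if (fork() == 0)
            _exit(0);   /* child exits immediately */

        sleep(1);       /* give the SIGCHLD time to arrive */
        reap_children();
        printf("children reaped\n");
        return 0;
    }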
@SimonVrouwe SimonVrouwe force-pushed the 2021-10-30_close_subd-plugins-rpc_ignore_id-json branch from 191bb0b to ff0d591 Compare November 26, 2021 21:08
@SimonVrouwe (Collaborator, Author)

Bugs keep coming.

Rebased it again, mostly a change to fix the (previous) failure by adding checks around the io_break calls.

if (jcon->ld->stop_conn == conn && jcon->ld->state == LD_STATE_RUNNING) {
	/* Return us to toplevel lightningd.c */
	io_break(jcon->ld);

io_break(destroy_plugin);

io_break(plugin_shutdown_timeout);

lightning/lightningd/plugin.c

Lines 2119 to 2120 in ff0d591

void *ret = io_loop_with_timers(ld);
assert(ret == plugin_shutdown_timeout || ret == destroy_plugin);

Maybe something similar is useful for #4936?

@rustyrussell (Contributor) left a comment

Ack ff0d591

@@ -113,7 +113,7 @@ struct plugins {
 	/* Blacklist of plugins from --disable-plugin */
 	const char **blacklist;
 
-	/* Whether we are shutting down (`plugins_free` is called) */
+	/* Whether we are shutting down, blocks db write's */
 	bool shutdown;
A Contributor left a comment:

This would be far neater as an enum lightningd_state, but we can patch that afterwards.

time_from_sec(30),
plugin_shutdown_timeout, ld);
/* 30 seconds should do it, use a clean timers struct */
orig_timers = ld->timers;
A Contributor left a comment:

In theory this can break our debug tests in list_del_from() if a timer were to delete itself, but luckily we don't do that.

Review comment on tests/test_connection.py (resolved).
@rustyrussell (Contributor)

OK, I have a fix in my other PR (#4921) for the leak valgrind discovered, so I'll apply this after that one...

@rustyrussell (Contributor)

Actually, applying this now since it fixes the hsm encryption flake which is hitting our other PRs!
