Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No next lookup via gossipd #3547

Merged

Conversation

rustyrussell
Copy link
Contributor

(Based on #3546, so ignore first three commits).

Every time we need to forward, we ask gossipd for where we should forward it. That's unnecessary, and slow and weird. But that also gives us a chance to get the latest channel_update, in case we need to send an error.

This does a local lookup, and if we need an update it starts with a temporary_node_failure then asks gossipd so it can make the real error. This is robust against crashing and other corner cases (such as channeld manipulating the failed HTLC) since there's never an HTLC in limbo.

@rustyrussell rustyrussell added this to the 0.8.2 milestone Feb 25, 2020
= tal_dup_talarr(cbdata, u8, failmsg_needs_update);
subd_req(cbdata, hin->key.channel->peer->ld->gossip,
take(towire_gossip_get_stripped_cupdate(NULL, failmsg_scid)),
-1, 0, failmsg_update_reply, cbdata);
Copy link
Contributor

@niftynei niftynei Feb 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like your gossip daemon doesnt exist in all cases? failing test_restart_many_payments

EBUG:root:lightningd-1: 2020-02-25T06:08:23.095Z DEBUG 0382ce59ebf18be7d84677c2e35f23294b9992ceca95491fcf8a56c6cb2d9de199-chan#3: HTLC in 1 RCVD_ADD_ACK_REVOCATION->SENT_REMOVE_HTLC
DEBUG:root:lightningd-1: 2020-02-25T06:08:23.108Z **BROKEN** lightningd: FATAL SIGNAL 11 (version v0.8.1-36-g03045e1-modded)
DEBUG:root:lightningd-1: 2020-02-25T06:08:23.108Z **BROKEN** lightningd: backtrace: common/daemon.c:44 (send_backtrace) 0x5634d3e4c1b0
DEBUG:root:lightningd-1: 2020-02-25T06:08:23.108Z **BROKEN** lightningd: backtrace: common/daemon.c:52 (crashdump) 0x5634d3e4c200
DEBUG:root:lightningd-1: 2020-02-25T06:08:23.108Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x7fb9d3ca4f1f
DEBUG:root:lightningd-1: 2020-02-25T06:08:23.108Z **BROKEN** lightningd: backtrace: lightningd/subd.c:750 (subd_send_msg) 0x5634d3e43132
DEBUG:root:lightningd-1: 2020-02-25T06:08:23.108Z **BROKEN** lightningd: backtrace: lightningd/subd.c:770 (subd_req_) 0x5634d3e431f6
DEBUG:root:lightningd-1: 2020-02-25T06:08:23.108Z **BROKEN** lightningd: backtrace: lightningd/peer_htlcs.c:268 (local_fail_in_htlc_needs_update) 0x5634d3e35438
DEBUG:root:lightningd-1: 2020-02-25T06:08:23.108Z **BROKEN** lightningd: backtrace: lightningd/peer_htlcs.c:296 (fail_out_htlc) 0x5634d3e3558d
DEBUG:root:lightningd-1: 2020-02-25T06:08:23.108Z **BROKEN** lightningd: backtrace: lightningd/peer_htlcs.c:516 (destroy_hout_subd_died) 0x5634d3e35c86
DEBUG:root:lightningd-1: 2020-02-25T06:08:23.108Z **BROKEN** lightningd: backtrace: ccan/ccan/tal/tal.c:240 (notify) 0x5634d3eb2472
DEBUG:root:lightningd-1: 2020-02-25T06:08:23.108Z **BROKEN** lightningd: backtrace: ccan/ccan/tal/tal.c:402 (del_tree) 0x5634d3eb2961
DEBUG:root:lightningd-1: 2020-02-25T06:08:23.108Z **BROKEN** lightningd: backtrace: ccan/ccan/tal/tal.c:486 (tal_free) 0x5634d3eb2ced
DEBUG:root:lightningd-1: 2020-02-25T06:08:23.108Z **BROKEN** lightningd: backtrace: lightningd/peer_htlcs.c:2107 (free_htlcs) 0x5634d3e3a85a
DEBUG:root:lightningd-1: 2020-02-25T06:08:23.108Z **BROKEN** lightningd: backtrace: lightningd/lightningd.c:520 (shutdown_subdaemons) 0x5634d3e1a7a1
DEBUG:root:lightningd-1: 2020-02-25T06:08:23.108Z **BROKEN** lightningd: backtrace: lightningd/lightningd.c:940 (main) 0x5634d3e1b217
DEBUG:root:lightningd-1: 2020-02-25T06:08:23.108Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x7fb9d3c87b96
DEBUG:root:lightningd-1: 2020-02-25T06:08:23.108Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x5634d3e01b09
DEBUG:root:lightningd-1: 2020-02-25T06:08:23.108Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0xffffffffffffffff
DEBUG:root:Received response for stop call: {'error': 'Connection to RPC server lost.'}
----------------------------- Captured stderr call -----------------------------
lightningd: FATAL SIGNAL 11 (version v0.8.1-36-g03045e1-modded)
0x5634d3e4c15a send_backtrace
	common/daemon.c:39
0x5634d3e4c200 crashdump
	common/daemon.c:52
0x7fb9d3ca4f1f ???
	???:0
0x5634d3e43132 subd_send_msg
	lightningd/subd.c:750
0x5634d3e431f6 subd_req_
	lightningd/subd.c:770
0x5634d3e35438 local_fail_in_htlc_needs_update
	lightningd/peer_htlcs.c:268
0x5634d3e3558d fail_out_htlc
	lightningd/peer_htlcs.c:296
0x5634d3e35c86 destroy_hout_subd_died
	lightningd/peer_htlcs.c:516
0x5634d3eb2472 notify
	ccan/ccan/tal/tal.c:240
0x5634d3eb2961 del_tree
	ccan/ccan/tal/tal.c:402
0x5634d3eb2ced tal_free
	ccan/ccan/tal/tal.c:486
0x5634d3e3a85a free_htlcs
	lightningd/peer_htlcs.c:2107
0x5634d3e1a7a1 shutdown_subdaemons
	lightningd/lightningd.c:520
0x5634d3e1b217 main
	lightningd/lightningd.c:940
0x7fb9d3c87b96 ???
	???:0
0x5634d3e01b09 ???
	???:0
0xffffffffffffffff ???
	???:0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think this is best addressed by freeing HTLCs and peer daemons first, then gossipd. Thanks!

}

if (!topology_synced(out->peer->ld->topology)) {
log_info(out->log, "Attempt to send HTLC but still syncing"
" with bitcoin network");
return towire_temporary_channel_failure(ctx,
out->stripped_update);
return towire_temporary_node_failure(ctx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this change is causing test_lightningd_still_loading to fail with the wrong error message?

<class 'AssertionError'>
	Pattern 'TEMPORARY_NODE_FAILURE' not found in "RPC call failed: method: sendpay, payload: {'route': [{'msatoshi': 1000, 'id': '022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59', 'delay': 5, 'channel': '1x1x1'}], 'payment_hash': '55a61a634cd0dbfdad2461676acae93774a0576ff4e5d61cb6728b3078744af1'}, error: {'code': 204, 'message': 'failed: WIRE_TEMPORARY_CHANNEL_FAILURE (First peer not ready)', 'data': {'erring_index': 0, 'failcode': 4103, 'failcodename': 'WIRE_TEMPORARY_CHANNEL_FAILURE', 'erring_node': '0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518', 'erring_channel': '1x1x1', 'erring_direction': 1}}"
	[<TracebackEntry /home/travis/build/ElementsProject/lightning/tests/test_misc.py:189>]
test_lightningd_still_loading failed; it passed 0 out of the required 1 times.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, no. That error is complaining that it's expecting a temp_node_fail, but it's getting a temp_chan_fail. And it doesn't happen for me here...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, it does under VALGRIND. Turns out test is flaky, fixing...

Copy link
Contributor

@niftynei niftynei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of test failures; otherwise concept ACK

@rustyrussell rustyrussell force-pushed the no-next-lookup-via-gossipd branch from 55b75cf to 7439faf Compare February 27, 2020 02:19
…use it.

The idea is that gossipd can give us the cupdate we need for an error, and
we wire things up so that we ask for it (async) just before we send the
error to the subdaemon.

I tried many other things, but they were all too high-risk.

1. We need to ask gossipd every time, since it produces these lazily
   (in particular, it doesn't actually generate an offline update unless
   the channel is used).
2. We can't do async calls in random places, since we'll end up with
   an HTLC in limbo.  What if another path tries to fail it at the same time?
3. This allows us to use a temporary_node_failure error, and upgrade it
   when gossipd replies.  This doesn't change any existing assumptions.

Signed-off-by: Rusty Russell <[email protected]>
Instead of saving a stripped_update, we use the new
local_fail_in_htlc_needs_update.

One minor change: we return the more correct
towire_temporary_channel_failure when the node is still syncing.

Signed-off-by: Rusty Russell <[email protected]>
@rustyrussell rustyrussell force-pushed the no-next-lookup-via-gossipd branch from 7439faf to 46eaef9 Compare February 27, 2020 04:15
@rustyrussell
Copy link
Contributor Author

Rebased, addressed @niftynei feedback.

Even without optimization, it's faster to walk all the channels than
ping another daemon and wait for the response.

Changelog-Changed: Forwarding messages is now much faster (less inter-daemon traffic)
Signed-off-by: Rusty Russell <[email protected]>
@rustyrussell rustyrussell force-pushed the no-next-lookup-via-gossipd branch from 46eaef9 to b17cd23 Compare February 27, 2020 04:27
@cdecker
Copy link
Member

cdecker commented Feb 27, 2020

Restarted due to flaky test_reconnect_no_update, seems unrelated.

Copy link
Member

@cdecker cdecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor clarification, otherwise LGTM

ACK b17cd23

* but it's v. unlikely */
if (!fromwire_gossip_get_stripped_cupdate_reply(msg, msg,
&stripped_update)
|| !tal_count(stripped_update)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't we using tal_bytelen for u8*?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a line-ball, IMHO. Especially here, where it will actually (due to our explicit fromwire_ generation policy) be NULL instead of empty.


subd_send_msg(hin->key.channel->owner,
take(towire_channel_fail_htlc(NULL, failed_htlc)));
tell_channeld_htlc_failed(hin, failed_htlc);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice simplification 👍

@rustyrussell rustyrussell merged commit f8a21f1 into ElementsProject:master Feb 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants