
Connectd does demux part 1 #4984

Merged

Conversation

@rustyrussell (Contributor) commented Dec 20, 2021

(Builds on #4972; rebased on master.)

This puts connectd in the middle of every peer connection: it does the decryption/encryption and streams the gossip_store.

Changelog-None
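
For context on what "does the decryption/encryption" means per message: this is the BOLT #8 transport framing. Below is a minimal sketch of the read side, assuming hypothetical decrypt_header()/decrypt_body() wrappers around the Noise receive key; the real code (see common/cryptomsg.c) is non-blocking, unlike this blocking illustration.

```c
/* Illustrative sketch of BOLT #8 message framing on the read side.
 * decrypt_header/decrypt_body are hypothetical stand-ins for the real
 * cryptomsg calls.  Each message is two ciphertexts: a 2-byte
 * big-endian length with a 16-byte MAC, then the payload with its own
 * 16-byte MAC. */
#define MAC_LEN 16

static u8 *read_peer_msg(const tal_t *ctx, int fd, struct crypto_state *cs)
{
	u8 hdr[2 + MAC_LEN];
	u8 *ct, *msg;
	u16 len;

	if (!read_all(fd, hdr, sizeof(hdr)))
		return NULL;
	if (!decrypt_header(cs, hdr, &len))	/* hypothetical wrapper */
		return NULL;

	ct = tal_arr(ctx, u8, (size_t)len + MAC_LEN);
	if (!read_all(fd, ct, tal_bytelen(ct)))
		return tal_free(ct);		/* tal_free returns NULL */
	msg = decrypt_body(ctx, cs, ct, len);	/* NULL on bad MAC */
	tal_free(ct);
	return msg;				/* caller disconnects on NULL */
}
```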

@rustyrussell force-pushed the connectd-demux-part-1 branch 2 times, most recently from 3d93be6 to d802834 on December 20, 2021 05:16
@rustyrussell force-pushed the connectd-demux-part-1 branch 3 times, most recently from 4f8a03f to 75e9e58 on December 21, 2021 05:49
@rustyrussell force-pushed the connectd-demux-part-1 branch 2 times, most recently from 9002c58 to ef04f77 on December 30, 2021 22:46
@cdecker (Member) left a comment

Preliminary comments I had collected before the end of year.

Review threads (since resolved) on: connectd/multiplex.c (×5), connectd/multiplex.h (×2), common/peer_io.c (×2), gossipd/gossipd.c, connectd/connectd.c, common/gossip_store.c
@rustyrussell (Contributor, Author):

OK, the last three commits are new, as is 264ae92, which is a fixup.

@cdecker (Member) commented Jan 4, 2022

Excellent changes, just needs squashing :-)

```
git range-diff $(git merge-base ef04f77735a159e042e5e05cbf4f836388f02a40 origin/master)..ef04f77735a159e042e5e05cbf4f836388f02a40 $(git merge-base 502fe1064228317c87979533e77338f102217a12 origin/master)..502fe1064228317c87979533e77338f102217a12
 1:  663b3f983 =  1:  663b3f983 update mocks
 2:  cb19d2924 =  2:  cb19d2924 connectd: keep timeout timer around so we can disable it.
 3:  66bc45a68 =  3:  66bc45a68 connectd: maintain connection with peer, shuffle data.
 4:  c19d2596e =  4:  c19d2596e gossipwith: create our own internal sync_crypto functions.
 5:  90d467ab3 !  5:  44a498920 lightningd: make sure gossipd knows before we update blockheight.
    @@ gossipd/gossipd.c: static void new_blockheight(struct daemon *daemon, const u8 *
      }

      #if DEVELOPER
    -@@ gossipd/gossipd.c: static void handle_txout_reply(struct daemon *daemon, const u8 *msg)
    -   good = handle_pending_cannouncement(daemon, daemon->rstate,
    -                                       &scid, sat, outscript);
    -
    -+
    -   /* If we looking specifically for this, we no longer are. */
    -   remove_unknown_scid(daemon->seeker, &scid, good);
    -
     @@ gossipd/gossipd.c: static struct io_plan *recv_req(struct io_conn *conn,
        case WIRE_GOSSIPD_DEV_COMPACT_STORE_REPLY:
        case WIRE_GOSSIPD_GOT_ONIONMSG_TO_US:
 6:  947e054d0 =  6:  904426100 connectd: do decryption for peers.
 7:  56a53d736 =  7:  c73e2efd0 peer_io: replace crypto_sync in daemons, use normal wire messages.
 8:  8bdba3c96 =  8:  04927f198 per_peer_state: remove struct crypto_state
 9:  dedcef40e =  9:  70d52263d connectd: do dev_disconnect logic.
10:  18942d63c = 10:  c570252f6 connectd: do nagle by packet type.
 -:  --------- > 11:  264ae921f fixup! connectd: do nagle by packet type.
11:  217b63aae = 12:  479ed13a7 common: add routine for absolute timeouts (vs. relative).
12:  7ff1a3653 ! 13:  f354c9a22 connectd: serve gossip_store file for the peer.
    @@ connectd/multiplex.c: static struct io_plan *write_to_peer(struct io_conn *peer_
                                                after_final_msg);
                }

    --          /* Reset urgent flag, if set. */
    --          set_urgent_flag(peer, false);
    +-          /* Tell them to read again, */
    +-          io_wake(&peer->subd_in);
     +          /* If they want us to send gossip, do so now. */
     +          msg = maybe_from_gossip_store(NULL, peer);
     +          if (!msg) {
    -+                  /* Reset urgent flag, if set. */
    -+                  set_urgent_flag(peer, false);
    -
    --          /* Tell them to read again, */
    --          io_wake(&peer->subd_in);
     +                  /* Tell them to read again, */
     +                  io_wake(&peer->subd_in);

13:  a93ea224f = 14:  95727c0f3 subdaemons: don't stream gossip_store at all.
14:  73b996976 = 15:  644245d27 lightningd: remove per_peer_state struct.
15:  4eac6867e = 16:  6e6ae9395 connectd: remove per_peer_state in favor of keeping gossip_fd directly.
16:  e7ddcc32d = 17:  f2d66807b connectd: put more stuff into struct gossip_state.
17:  ef04f7773 = 18:  2a1317d16 connectd: get addresses from lightningd, not gossipd.
 -:  --------- > 19:  dc94ea569 various: minor cleanups from Christian's review.
 -:  --------- > 20:  a8ef64b04 connectd: drop support (unused) for @ during handshake.
 -:  --------- > 21:  502fe1064 connectd: implement @ correctly.
```
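
To make the hunk at 12 → 13 easier to follow, here is the resulting control flow of write_to_peer() in condensed form. This is a paraphrase of the diff, not verbatim source: the final-message branch visible in the diff context is elided, and encrypt_and_send is assumed from context.

```c
/* Paraphrase (not verbatim) of the patched write_to_peer() in
 * connectd/multiplex.c. */
static struct io_plan *write_to_peer(struct io_conn *peer_conn,
				     struct peer *peer)
{
	const u8 *msg = msg_dequeue(peer->peer_outq);

	if (!msg) {
		/* If they want us to send gossip, do so now. */
		msg = maybe_from_gossip_store(NULL, peer);
		if (!msg) {
			/* Reset urgent flag, if set. */
			set_urgent_flag(peer, false);
			/* Tell them to read again, */
			io_wake(&peer->subd_in);
			/* ...and sleep until something is queued for us. */
			return msg_queue_wait(peer_conn, peer->peer_outq,
					      write_to_peer, peer);
		}
	}
	/* Encrypt, send, then come back here for the next message. */
	return encrypt_and_send(peer, take(msg), write_to_peer);
}
```

The point of the ordering: gossip is only streamed when the peer's outgoing queue is otherwise empty, so it never delays real traffic.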

ACK 502fe10

@rustyrussell (Contributor, Author) commented Jan 4, 2022

Fixup folded, and the dumb non-dev inversion fixed.

Ack 4bb4ace

@cdecker (Member) commented Jan 5, 2022

Shepherding CI, otherwise GTG

ACK 4bb4ace

@vincenzopalazzo (Collaborator):
I think there is only one change missing that the GitHub view hides: at line 4056 of channeld.c we are missing the API upgrade to msg_queue_new(peer, true).

We need at least one more CI run.

@cdecker (Member) commented Jan 7, 2022

Changes look good, just a couple of flaky tests:

  • test_closing_specified_destination gets stuck on l1.rpc.call('close', {'id': chan13, 'destination': addr}). close doesn't seem to want to return a result.
  • test_htlc_rexmit_while_closing and test_penalty_outhtlc seem to be killing openingd instead of reporting an Unacceptable upfront_shutdown_script. The outcome is the same (no channel) but the error message was a bit more informative.

@rustyrussell (Contributor, Author):

OK, I gave in and included many of the cleanups I had hoped to defer to a followup PR. Let's see if they help or hinder the test failures.

@rustyrussell (Contributor, Author):

That... did not help :(

Since I cannot reproduce this locally, I'll be throwing debugging patches around...

connectd will be keeping the conn open, so it needs to free this
"conn_timeout" timer.  Pass it through, so we can do that.

Signed-off-by: Rusty Russell <[email protected]>
This test started mostly failing (in non-DEVELOPER mode) after the
next patch, due to timing issues.

Handle both cases for now, and we'll add more enhancements later to
things we should be handling more consistently.

Signed-off-by: Rusty Russell <[email protected]>
Instead of passing the incoming socket to lightningd for the
subdaemon, create a new one and simply shuffle data between them,
keeping connectd in the loop.

For the moment, we don't decrypt at all, just shuffle.  This means our
buffer code is kind of a hack, but that goes away once we start
actually decrypting and understanding message boundaries.

This implementation is naive: it closes the socket to the local daemon
as soon as the peer closes the socket to us.  This is fixed in a
subsequent patch series (along with many other similar issues).

Signed-off-by: Rusty Russell <[email protected]>
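
As a rough illustration of this "just shuffle" stage (blocking POSIX calls for brevity; the real connectd uses non-blocking ccan/io plans, and these helper names are made up):

```c
/* Illustrative only: blocking POSIX version of "shuffle bytes between
 * the peer fd and the subdaemon fd". */
#include <stdbool.h>
#include <sys/socket.h>
#include <unistd.h>

/* Make the socketpair: one end is handed to lightningd for the
 * subdaemon, connectd keeps the other and relays to/from the peer fd. */
static int make_subd_fd(int *for_lightningd)
{
	int fds[2];
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
		return -1;
	*for_lightningd = fds[1];
	return fds[0];
}

/* Relay one chunk in one direction; false means that side closed.
 * Note the naivety the commit message mentions: first EOF ends both. */
static bool shuffle_once(int from, int to)
{
	char buf[4096];
	ssize_t r = read(from, buf, sizeof(buf));
	if (r <= 0)
		return false;
	return write(to, buf, r) == r;
}
```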
This avoids changes to crypto_sync which are coming in the next patch.

Signed-off-by: Rusty Russell <[email protected]>
Without this, we can get spurious failures from lnprototest, which
is waiting for lightningd to acknowledge the blockheight in getinfo:

```
2021-12-21T00:56:21.460Z DEBUG   lightningd: Adding block 109: 57a7bd3ade3a88a899e5b442075e9722ccec07e0205f9a913b304a3e2e038e26
2021-12-21T00:56:21.470Z DEBUG   lightningd: Adding block 110: 11a280eb76f588e92e20c39999be9d2baff016c3c6bac1837b649a270570b7dd
2021-12-21T00:56:21.479Z DEBUG   lightningd: Adding block 111: 02977fc9529b2ab4e0a805c4bc1bcfbff5a4e6577a8b31266341d22e204a1d27
2021-12-21T00:56:21.487Z DEBUG   lightningd: Adding block 112: 2402f31c5ddfc9e847e8bbfb7df084d29a5d5d936a4358c109f2f4cf9ea8d828
2021-12-21T00:56:21.496Z DEBUG   lightningd: Adding block 113: 5a561fe9423b4df33f004fc09985ee3ef38364d692a56a8b27ecbc6098a16d39
2021-12-21T00:56:21.505Z DEBUG   lightningd: Adding block 114: 4502f5ec23c89177872846848848322e8fa6c3fb6f5eb361194e4cd47596dfe9
2021-12-21T00:56:21.511Z DEBUG   02f9308a019258c31049344f85f89d5229b531c845836f99b08601f113bce036f9-gossipd: Ignoring future channel_announcment for 109x1x0 (current block 108)
```

Signed-off-by: Rusty Russell <[email protected]>
msg_queue was originally designed for inter-daemon comms, and so it has
a special mechanism to mark that we're trying to send an fd.  Unfortunately,
a peer could also send such a message, confusing us!

Signed-off-by: Rusty Russell <[email protected]>
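
A hedged sketch of the queue API this implies, matching the msg_queue_new(peer, true) call-site mentioned in the review above; the parameter name and the peer fields here are my guesses, not the exact source:

```c
/* fd-passing becomes opt-in per queue, so a queue fed by a remote peer
 * can never misread a peer message as the internal "fd follows" marker.
 * Signature shape inferred from the call-site; fields illustrative. */
struct msg_queue *msg_queue_new(const tal_t *ctx, bool fd_passing);

static void setup_queues(struct peer *peer)
{
	/* Inter-daemon queue (the channeld.c case above): fds allowed. */
	peer->from_master = msg_queue_new(peer, true);
	/* Peer-facing queue: raw wire messages only, no fd markers. */
	peer->peer_outq = msg_queue_new(peer, false);
}
```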
We would lose packets sometimes due to this previously, but it
doesn't happen over localhost so our tests didn't notice.  However,
now that connectd is the sole thing talking to peers, we can do
a more elegant shutdown, which should fix closing.

Signed-off-by: Rusty Russell <[email protected]>
Changelog-Fixed: Protocol: Always flush sockets to increase the chance that final messages get to the peer (esp. error packets).
This is critical in the common case where a peer sends an error and
hangs up: without flushing, we almost never get to relay the error to
the subd in time.

This also applies in the other direction: we need to flush the queue
to the peer when the subd closes.  Note we only free the actual peer
struct when lightningd reaps us with connectd_peer_disconnected().

Signed-off-by: Rusty Russell <[email protected]>
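
A minimal sketch of the flush-then-close idea (queue_is_empty() and the field names are hypothetical; in the real code the draining happens inside the write path):

```c
#include <sys/socket.h>

/* Sketch: when the subd side closes, keep the peer connection until
 * its out-queue drains, then half-close so the final error/warning is
 * actually transmitted before teardown. */
static void maybe_finish_drain(struct peer *peer)
{
	if (!peer->subd_closed || !queue_is_empty(peer->peer_outq))
		return;	/* the write path re-calls this as it drains */

	/* Half-close: the peer sees a clean EOF after our last message. */
	shutdown(peer->to_peer_fd, SHUT_WR);
	/* The peer struct itself survives until lightningd reaps us
	 * via connectd_peer_disconnected(). */
}
```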
l1 might slip in a commitment_signed before it notices the disconnect, making this test fail:

```
        for i in range(0, len(disconnects)):
            with pytest.raises(RpcError):
                l1.rpc.sendpay(route, rhash, payment_secret=inv['payment_secret'])
>               l1.rpc.waitsendpay(rhash)
E               Failed: DID NOT RAISE <class 'pyln.client.lightning.RpcError'>
```

Signed-off-by: Rusty Russell <[email protected]>
In the case where the peer sends an error (and hangs up) immediately
after init, connectd *doesn't actually read the error* (even after all
the previous fixes ensuring it actually receives the error!).

This is because it first tried to write WIRE_CHANNEL_REESTABLISH, and
that fails, so it never tries to read.  Generally, we should ignore
write failures; we'll find out that the socket is closed when we read
nothing.

Signed-off-by: Rusty Russell <[email protected]>
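
The principle, sketched (hypothetical helper name; assumes SIGPIPE is ignored, as long-running daemons typically arrange):

```c
#include <unistd.h>

/* Sketch of "ignore write failures": a failed write is not fatal,
 * because the interesting bytes (the peer's error) may still be
 * waiting in our receive buffer.  Closure is detected on the read
 * side instead, when read() returns 0. */
static void write_ignoring_errors(int fd, const void *msg, size_t len)
{
	if (write(fd, msg, len) < 0) {
		/* e.g. EPIPE: peer already hung up.  Keep going; the
		 * read path will pick up their final error, then EOF. */
	}
}
```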
Here's the "Normal Test Config clang-fuzzing" setup where it fails:

```
CC=clang
CONFIGURATOR_CC=clang
CWARNFLAGS=-Wall -Wundef -Wmissing-prototypes -Wmissing-declarations -Wstrict-prototypes -Wold-style-definition -Werror
CDEBUGFLAGS=-std=gnu11 -g -fstack-protector-strong
COPTFLAGS=
SQLITE3_CFLAGS=
SQLITE3_LDLIBS=-lsqlite3
POSTGRES_INCLUDE=-I/usr/include/postgresql
POSTGRES_LDLIBS=-L/usr/lib/x86_64-linux-gnu -lpq
VALGRIND=0
DEVELOPER=1
EXPERIMENTAL_FEATURES=0
COMPAT=1
```

Here's the truncated test output:

```
        if anchor_expected():
            expected_1['B'].append(('external', ['anchor'], None, None))
            expected_2['B'].append(('external', ['anchor'], None, None))
            expected_1['B'].append(('wallet', ['anchor'], None, None))
            expected_2['B'].append(('wallet', ['anchor'], None, None))
    
        tags = check_utxos_channel(l1, [channel_id], expected_1)
>       check_utxos_channel(l2, [channel_id], expected_2, tags)

tests/test_closing.py:2662: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/utils.py:321: in check_utxos_channel
    txid = matchup_events(u_set, evs, chans, tag_list)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

u_set = [[{'account_id': 'external', 'blockheight': 104, 'coin_type': 'bcrt', 'credit': '0msat', ...}, {'account_id': 'external', 'blockheight': 110, 'coin_type': 'bcrt', 'credit': '0msat', ...}]]
evs = [('external', ['to_them'], None, None), ('external', ['htlc_timeout'], None, None)]
chans = ['8ede62cea34c5196467c68175f70b8915f0edda421c5087a99584d4197cfb6c4']
tag_list = {'A': 'c5b6cf97414d58997a08c521a4dd0e5f91b8705f17687c4696514ca3ce62de8e', 'B': 'bb09b25d6653aeeec188961347ff80e90dca6f4a29cc017856f6585adb8cb468'}

    def matchup_events(u_set, evs, chans, tag_list):
        assert len(u_set) == len(evs) and len(u_set) > 0
    
        txid = u_set[0][0]['utxo_txid']
        for ev in evs:
            found = False
            for u in u_set:
                # We use 'cid' as a placeholder for the channel id, since it's
                # dyanmic, but we need to sub it in. 'chans' is a list of cids,
                # which are mapped to `cid` tags' suffixes. eg. 'cid1' is the
                # first cid in the chans list
                if ev[0][:3] == 'cid':
                    idx = int(ev[0][3:])
                    acct = chans[idx - 1]
                else:
                    acct = ev[0]
    
                if u[0]['account_id'] != acct or u[0]['tags'] != ev[1]:
                    continue
    
                if ev[2] is None:
>                   assert u[1] is None
E                   AssertionError
```
We seem to hit a race between manual reconnect (with address hint) and an automatic
reconnection attempt which fails:

```
 >       l4.rpc.connect(l3.info['id'], 'localhost', l3.port)
...
 E           pyln.client.lightning.RpcError: RPC call failed: method: connect, payload: {'id': '035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d', 'host': 'localhost', 'port': 41285}, error: {'code': 401, 'message': 'All addresses failed: 127.0.0.1:36678: Connection establishment: Connection refused. '}
```

See how it didn't even try the given address?

Signed-off-by: Rusty Russell <[email protected]>
@rustyrussell force-pushed the connectd-demux-part-1 branch 2 times, most recently from 9824715 to 37d040e on January 11, 2022 06:52
test_gossip_no_empty_announcements relies on this!

Signed-off-by: Rusty Russell <[email protected]>
We used to shut down peers atomically, but now that we flush the
connections there's a delay.  If we are asked to connect in that time,
we ignore it, since we are already connected, but that's wrong: we need
to remember that we were told to connect, and reconnect.

This should solve a few weird test failures where "connect" would hang
indefinitely.

Signed-off-by: Rusty Russell <[email protected]>
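
A sketch of the remembered-connect fix (struct, field, and function names are illustrative, not the actual connectd ones):

```c
#include <stdbool.h>

/* Sketch: while an old connection is still draining, record a connect
 * request instead of ignoring it, and act on it once the drain ends. */
struct connecting_state {
	bool draining;		/* flushing queues before close */
	bool reconnect_wanted;	/* told to connect while draining */
};

static void on_connect_request(struct connecting_state *c)
{
	if (c->draining) {
		c->reconnect_wanted = true;	/* don't silently drop it */
		return;
	}
	start_connecting(c);	/* hypothetical */
}

static void on_drain_complete(struct connecting_state *c)
{
	c->draining = false;
	if (c->reconnect_wanted) {
		c->reconnect_wanted = false;
		start_connecting(c);	/* honour the remembered request */
	}
}
```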
There's actually a bug in our closing tx size estimation; I'll do
a separate patch for this, though.

Seems this used to be flaky; now that we always flush queues, it's
caught more reliably.

Signed-off-by: Rusty Russell <[email protected]>
@rustyrussell force-pushed the connectd-demux-part-1 branch 3 times, most recently from b13a486 to f20bd71 on January 12, 2022 03:29
@rustyrussell mentioned this pull request Jan 12, 2022
This seems to trigger now, especially on PostgreSQL (maybe it's faster
at processing blocks?).

e.g. test_closing_simple() hangs in close(): the close is unilateral
because the HTLC timed out, so it's waiting for a block (other lines removed):

```
lightningd-1: 2022-01-12T00:33:46.258Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-channeld-chan#1: peer_out WIRE_COMMITMENT_SIGNED
lightningd-1: 2022-01-12T00:33:46.278Z DEBUG   lightningd: close_command: timeout = 172800
2022-01-12T01:03:36.9757201Z lightningd-2: 2022-01-12T00:33:46.384Z DEBUG   lightningd: Adding block 104: 73ffa19d27d048613b2731e1682b4efff0dc226807d8cc99d724523c2ea58204
2022-01-12T01:03:36.9759053Z lightningd-2: 2022-01-12T00:33:46.396Z DEBUG   lightningd: Adding block 105: 44fd06ed053a0d0594abcfefcfa69089351fc89080826799fb4b278a68fe5c20
2022-01-12T01:03:36.9760865Z lightningd-2: 2022-01-12T00:33:46.406Z DEBUG   lightningd: Adding block 106: 0fee2dcbd1376249832642079131275e195bba4fb49cc9968df3a899010bba0f
2022-01-12T01:03:36.9762632Z lightningd-2: 2022-01-12T00:33:46.418Z DEBUG   lightningd: Adding block 107: 7f24f2d7d3e83fe3256298bd661e57cdf92b058440738fd4d7e1c8ef4a4ca073
2022-01-12T01:03:36.9773411Z lightningd-2: 2022-01-12T00:33:46.429Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-channeld-chan#1: peer_in WIRE_REVOKE_AND_ACK
2022-01-12T01:03:36.9794707Z lightningd-2: 2022-01-12T00:33:46.437Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-channeld-chan#1: Commits outstanding after recv revoke_and_ack
2022-01-12T01:03:36.9788197Z lightningd-2: 2022-01-12T00:33:46.433Z DEBUG   lightningd: Adding block 108: 283b371fb5d1ef42980ea10ab9f5965a179af8e91ddf31c8176e79820e1ec54d
2022-01-12T01:03:36.9799347Z lightningd-2: 2022-01-12T00:33:46.439Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-channeld-chan#1: HTLC 0[REMOTE] => RCVD_REMOVE_REVOCATION
2022-01-12T01:03:36.9808057Z lightningd-2: 2022-01-12T00:33:46.447Z UNUSUAL 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-chan#1: Peer permanent failure in CHANNELD_NORMAL: Fulfilled HTLC 0 RCVD_REMOVE_REVOCATION cltv 109 hit deadline
```

This is because `pay` returns from l1 when it has the preimage, not
when the HTLC is fully resolved.  Add a helper for this, and call it
at the end of the pay test helper.  We might need this elsewhere
though!

Signed-off-by: Rusty Russell <[email protected]>
@vincenzopalazzo (Collaborator) left a comment
ack 886bede from me, because I think the following error is unrelated; it's just a slow run here:

```
=================================== FAILURES ===================================
_________________________ test_upgrade_statickey_fail __________________________
[gw5] linux -- Python 3.6.15 /opt/hostedtoolcache/Python/3.6.15/x64/bin/python3

node_factory = <pyln.testing.utils.NodeFactory object at 0x7f5be9cfd1d0>
executor = <concurrent.futures.thread.ThreadPoolExecutor object at 0x7f5be9cfd518>
bitcoind = <pyln.testing.utils.BitcoinD object at 0x7f5bea5e4eb8>

    @unittest.skipIf(not EXPERIMENTAL_FEATURES, "upgrade protocol not available")
    @pytest.mark.developer("dev-force-features, dev-disconnect required")
    def test_upgrade_statickey_fail(node_factory, executor, bitcoind):
        """We reconnect at all points during retransmit, and we won't upgrade."""
        l1_disconnects = ['-WIRE_COMMITMENT_SIGNED',
                          '-WIRE_REVOKE_AND_ACK']
        l2_disconnects = ['-WIRE_REVOKE_AND_ACK',
                          '-WIRE_COMMITMENT_SIGNED']
    
        l1, l2 = node_factory.line_graph(2, opts=[{'may_reconnect': True,
                                                   'dev-no-reconnect': None,
                                                   'disconnect': l1_disconnects,
                                                   'dev-force-features': ["-13", "-21"],
                                                   # Don't have feerate changes!
                                                   'feerates': (7500, 7500, 7500, 7500)},
                                                  {'may_reconnect': True,
                                                   'dev-no-reconnect': None,
                                                   'disconnect': l2_disconnects,
                                                   'dev-disable-commit-after': 1}])
    
        # This HTLC will fail
        l1.rpc.sendpay([{'msatoshi': 1000, 'id': l2.info['id'], 'delay': 5, 'channel': '1x1x1'}], '00' * 32, payment_secret='00' * 32)
    
        # Each one should cause one disconnection, no upgrade.
        for d in l1_disconnects + l2_disconnects:
            l1.daemon.wait_for_log('Peer connection lost')
            l2.daemon.wait_for_log('Peer connection lost')
            assert not l1.daemon.is_in_log('option_static_remotekey enabled')
            assert not l2.daemon.is_in_log('option_static_remotekey enabled')
            l1.rpc.connect(l2.info['id'], 'localhost', l2.port)
    
        # On the last reconnect, it retransmitted revoke_and_ack.
        l1.daemon.wait_for_log('No upgrade: we retransmitted')
        l2.daemon.wait_for_log('No upgrade: pending changes')
    
        # Now when we reconnect, despite having an HTLC, we're quiescent.
        l1.rpc.disconnect(l2.info['id'], force=True)
        l1.rpc.connect(l2.info['id'], 'localhost', l2.port)
    
>       l1.daemon.wait_for_log('option_static_remotekey enabled at 2/2')

tests/test_connection.py:3653: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
contrib/pyln-testing/pyln/testing/utils.py:310: in wait_for_log
    return self.wait_for_logs([regex], timeout)
```

@rustyrussell (Contributor, Author):
Ack 886bede

@rustyrussell merged commit 96ff874 into ElementsProject:master on Jan 20, 2022