Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

connectd: don't insist on ping replies when other traffic is flowing. #5347

Merged
merged 2 commits into from
Jul 9, 2022

Conversation

rustyrussell
Copy link
Contributor

Got complaints about us hanging up on some nodes because they don't respond
to pings in a timely manner (e.g. ACINQ?). This only matters to us if no
other traffic is flowing, so mitigate that.

This may solve the problem because we've previously seen implementations
badly prioritize gossip traffic, and thus important messages can get queued
behind gossip dumps!

@wtogami
Copy link
Contributor

wtogami commented Jun 26, 2022

Tried both 94e7cac and e3a5fa9 applied against v0.11.2rc3. ACINQ's node is still reconnecting at intervals of less than a minute.

2022-06-26T07:31:29.084Z INFO    lightningd: Peer says it sees our address as: 172.105.77.251:55946
2022-06-26T07:32:19.711Z INFO    03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f-channeld-chan#158: Peer connection lost
2022-06-26T07:32:19.711Z INFO    03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f-chan#158: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-06-26T07:32:21.203Z INFO    lightningd: Peer says it sees our address as: 172.105.77.251:56058
2022-06-26T07:33:11.835Z INFO    03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f-channeld-chan#158: Peer connection lost
2022-06-26T07:33:11.836Z INFO    03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f-chan#158: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-06-26T07:33:13.451Z INFO    lightningd: Peer says it sees our address as: 172.105.77.251:56172
2022-06-26T07:34:04.064Z INFO    03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f-channeld-chan#158: Peer connection lost
2022-06-26T07:34:04.065Z INFO    03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f-chan#158: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-06-26T07:34:05.835Z INFO    lightningd: Peer says it sees our address as: 172.105.77.251:56282
2022-06-26T07:34:34.917Z INFO    0285c280641305f1641a9e08718d5f4b1ea3bde8003beec4acfbefb5dde207ae4d-channeld-chan#5497: Peer connection lost
2022-06-26T07:34:34.917Z INFO    0285c280641305f1641a9e08718d5f4b1ea3bde8003beec4acfbefb5dde207ae4d-chan#5497: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)

Trying full debug next ...

@wtogami
Copy link
Contributor

wtogami commented Jun 26, 2022

2022-06-26T05:56:07.469Z DEBUG 03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f-connectd: Last ping unreturned: hanging up

Debug logs no longer doing any of these unreturned ping hangups.

2022-06-26T07:46:03.589Z DEBUG 03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f-connectd: peer_out INVALID 42085

Immediately after I receive these it disconnects every minute.

Prior to the less-ping patch it did disconnect in this way sometimes according to earlier debug logs.

@pm47
Copy link
Contributor

pm47 commented Jun 26, 2022

Here is what we have on our side:

2022-06-26 10:42:31,701 ERROR f.a.e.c.TransportHandler CON n:02d0e037 - cannot deserialize ByteVector(59 bytes, 0xa4657b22): message: cannot acquire 252176 bits from a vector that contains 440 bits

It seems we receive a weird message 0xa4657b22 from cln that we can't parse.

2022-06-26T07:46:03.589Z DEBUG 03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f-connectd: peer_out INVALID 42085

This is the decimal representation of 0xa465 which is the type of the unknown message sent by cln. Am I understanding correctly (peer_out) that cln doesn't know the type of the message it is sending itself?

edit: this message type is apparently related to peerswap : https://github.com/ElementsProject/peerswap/blob/24ad6a6d51e64a8aa800de90f8f15eede9020ad4/docs/peer-protocol.md

pm47 added a commit to ACINQ/eclair that referenced this pull request Jun 26, 2022
If the current chunk of data read from the TCP connection only contains
unknown messages (in particular, could be only one isolated unknown
message on an otherwise idle connection), we never resumed reading on
the connection.

This means all subsequent messages, including pings/pongs, won't be read,
which is why the most visible effect is disconnecting due to no response
to ping.

Related to ElementsProject/lightning#5347.
@pm47
Copy link
Contributor

pm47 commented Jun 26, 2022

So, the root cause is a bug in eclair ACINQ/eclair#2332, triggered by the unknown message. The ping issue is just a consequence.

@wtogami
Copy link
Contributor

wtogami commented Jun 26, 2022

Confirmed the Eclair fix stopped the disconnects.

Regarding this PR, while this made it easier to spot what was going on I suspect it might be more correct to keep the previous behavior as that would be more robust in attempting to recover from bad Tor circuits?

Thank you to everyone involved. Sorry for my own confusion. I learned a lot in the process and hopefully can be more useful in the future.

@pm47
Copy link
Contributor

pm47 commented Jun 26, 2022

Thank you for reporting the issue @wtogami!

@wtogami
Copy link
Contributor

wtogami commented Jun 26, 2022

I suspect we're better off closing this PR? CLN is already doing the right thing.

@rustyrussell
Copy link
Contributor Author

I still like the PR: I know LND can ahve issues if it's streaming lots of gossip to us. But it's not urgent, so I'll change milestone, and also will rework description.

Copy link
Member

@cdecker cdecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK dc7cb01

@wtogami
Copy link
Contributor

wtogami commented Jun 29, 2022

rustyrussell force-pushed the ping-less branch from dc7cb01 to 469d08a 1 hour ago

The Compare button is confusing when the Github UI doesn't tell you the base branch was changed. I was sitting here wondering "Why chainparam changes? Maybe I'm just not smart enough."

pm47 added a commit to ACINQ/eclair that referenced this pull request Jun 29, 2022
If the current chunk of data read from the TCP connection only contains unknown messages (in particular, could be only one isolated unknown message on an otherwise idle connection), we never resumed reading on the connection.

This means all subsequent messages, including pings/pongs, won't be read, which is why the most visible effect is disconnecting due to no response to ping.

Related to ElementsProject/lightning#5347.

Reported by @wtogami.
Got complaints about us hanging up on some nodes because they don't respond
to pings in a timely manner (e.g. ACINQ?), but that turned out to be something
else.

Nonetheless, we've had reports in the past of LND badly prioritizing gossip
traffic, and thus important messages can get queued behind gossip dumps!

Signed-off-by: Rusty Russell <[email protected]>
Changelog-Changed: connectd: give busy peers more time to respond to pings.
Fixes 2569999 (pytest: fix flake in test_node_reannounce).

Converting ->set->list does not give deterministic order, so we can
end up with the two lists differing due to order:

```
 May send its own announcement *twice*, since it always spams us.
        msgs2 = list(set(msgs2))
>       assert msgs == msgs2
E       AssertionError: assert ['01012ff5580...000000000000'] == ['01014973d81...000000000000']
E         At index 0 diff: '01012ff55800f5b9492021372d74df4d6547bb0d32aec8d4c932a8c3b044e4bd983c429154e73091b0a2aff1cf9bbf16b37e6e9dd10ce4c2d949217366472acd341b0007800000080269a262bbd1750266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c035180266e453454e494f524245414d000000000000000000000000000000000000000000000000' != '01014973d8160dd8fc28e8fb25c40b9d5c68aed8dfb36af9fc13e4d2040fb3718553051a188ce98239c0bed138e1f8713a64acc7de98c183c9597fa58bf37f0b89bb0007800000080269a262bbd16c022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59022d2253494c454e544152544953542d336333626132392d6d6f6464656400000000000000'
E         Full diff:
E           [
E         +  '01012ff55800f5b9492021372d74df4d6547bb0d32aec8d4c932a8c3b044e4bd983c429154e73091b0a2aff1cf9bbf16b37e6e9dd10ce4c2d949217366472acd341b0007800000080269a262bbd1750266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c035180266e453454e494f524245414d000000000000000000000000000000000000000000000000',
E            '01014973d8160dd8fc28e8fb25c40b9d5c68aed8dfb36af9fc13e4d2040fb3718553051a188ce98239c0bed138e1f8713a64acc7de98c183c9597fa58bf37f0b89bb0007800000080269a262bbd16c022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59022d2253494c454e544152544953542d336333626132392d6d6f6464656400000000000000',
E         -  '01012ff55800f5b9492021372d74df4d6547bb0d32aec8d4c932a8c3b044e4bd983c429154e73091b0a2aff1cf9bbf16b37e6e9dd10ce4c2d949217366472acd341b0007800000080269a262bbd1750266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c035180266e453454e494f524245414d000000000000000000000000000000000000000000000000',
E           ]
``

Signed-off-by: Rusty Russell <[email protected]>
Copy link
Collaborator

@vincenzopalazzo vincenzopalazzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 30c1346

Very useful have this so I can catch why a ping is failing in lnmetrics

@rustyrussell rustyrussell merged commit 9d1b5a6 into ElementsProject:master Jul 9, 2022
evd0kim pushed a commit to standardsats/eclair that referenced this pull request Jul 17, 2023
If the current chunk of data read from the TCP connection only contains unknown messages (in particular, could be only one isolated unknown message on an otherwise idle connection), we never resumed reading on the connection.

This means all subsequent messages, including pings/pongs, won't be read, which is why the most visible effect is disconnecting due to no response to ping.

Related to ElementsProject/lightning#5347.

Reported by @wtogami.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants