Process messages from different peers in parallel in PeerManager. #1023
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1023 +/- ##
==========================================
- Coverage 90.47% 90.40% -0.07%
==========================================
Files 69 70 +1
Lines 37137 37228 +91
==========================================
+ Hits 33600 33657 +57
- Misses 3537 3571 +34
Continue to review full report at Codecov.
Force-pushed from b12ca74 to 3ed74ed.
Otherwise SGTM
/// Only add to this set when noise completes.
/// Locked *after* peers. When an item is removed, it must be removed with the `peers` write
/// lock held. Entries may be added with only the `peers` read lock held (though the
/// `Descriptor` value must already exist in `peers`).
Could add a comment about what this structure is holding: "Map of node pubkeys to their connection handler. One node cannot have more than one descriptor."
Also, the inter-related lock handling with `peers` could be `debug_assert()`ed by relying on `try_write().is_err()` to observe that the `RwLock` is effectively held.
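A minimal sketch of what that suggestion could look like, with simplified stand-in types (`PeerHolder`, `Descriptor`, and `remove_node_id` are illustrative here, not the actual `PeerManager` fields):

```rust
use std::collections::HashMap;
use std::sync::{Mutex, RwLock};

// Hypothetical, simplified stand-ins for the real types.
type Descriptor = u64;
struct Peer;

struct PeerHolder {
    peers: RwLock<HashMap<Descriptor, Mutex<Peer>>>,
    // Locked *after* `peers`; removal requires the `peers` write lock.
    node_id_to_descriptor: Mutex<HashMap<[u8; 33], Descriptor>>,
}

impl PeerHolder {
    fn remove_node_id(&self, node_id: &[u8; 33]) {
        // `try_write()` fails whenever the lock is held (read or write), so
        // this assert is only an approximation of "write lock held", but it
        // does catch callers that hold no lock at all.
        debug_assert!(self.peers.try_write().is_err());
        self.node_id_to_descriptor.lock().unwrap().remove(node_id);
    }
}
```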
Converting to draft until I can rebase on #1043, which should get the second commit basically for free.
Force-pushed from 3ed74ed to 26e8df2.
Rebased, much cleaner now - it's a single commit.
Force-pushed from fb043e8 to 587145b.
Rebased and pushed a few more commits which significantly reduce contention, plus one commit that's important for the sample.
Force-pushed from c137255 to 69c66f4.
I finally got the locking here borderline where I want it, but then realized that, at least on Linux, RwLock read locks will trivially starve writers, completely breaking the approach here. I'm not sure if there's a way to fix the starvation, which feels broken at an OS level, or if we'll have to rethink this approach entirely.
Force-pushed from 157abfb to 039f05b.
This is looking quite good in testing now, I think, but I do want to give it a bit more time to bake on my test node before we move it forward here. I think there's room to optimize somewhat to avoid hitting the read-pause too often.
Force-pushed from 3d64059 to 4fad4ef.
Okay! This seems to be working swimmingly, at least on my public node. There are some other optimizations that are nice but not really worth including here, I think.
Force-pushed from 1d8cde2 to 930ee2a.
Just a few last comments but otherwise looks good!
Force-pushed from 930ee2a to 1a95917.
Good to squash!
This adds the required locking to process messages from different peers simultaneously in `PeerManager`. Note that channel messages are still processed under a global lock in `ChannelManager`, and most work is still processed under a global lock in gossip message handling, but parallelizing message deserialization and message decryption is somewhat helpful.
Users are required to only ever call `read_event` serially per-peer, thus we actually don't need any locks while we're processing messages - we can only be processing messages in one thread per-peer. That said, we do need to ensure that another thread doesn't disconnect the peer we're processing messages for, as that could result in a peer_disconnected call while we're processing a message for the same peer - somewhat nonsensical. This significantly improves parallelism especially during gossip processing as it avoids waiting on the entire set of individual peer locks to forward a gossip message while several other threads are validating gossip messages with their individual peer locks held.
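A rough sketch of the locking shape this commit describes, using hypothetical, heavily simplified types (the real `PeerManager` tracks considerably more state per peer):

```rust
use std::collections::HashMap;
use std::sync::{Mutex, RwLock};

// Hypothetical stand-ins for the real types.
type Descriptor = u64;
struct Peer { pending_read: Vec<u8> }

struct PeerManager {
    // The read lock suffices to process messages: `read_event` is only ever
    // called serially per-peer, so the per-peer Mutex is uncontended. The
    // write lock is only needed to mutate the map itself, i.e. to connect
    // or disconnect peers.
    peers: RwLock<HashMap<Descriptor, Mutex<Peer>>>,
}

impl PeerManager {
    fn read_event(&self, descriptor: Descriptor, data: &[u8]) {
        // Holding the read lock keeps another thread from disconnecting this
        // peer (removing it from the map) while we process its messages.
        let peers = self.peers.read().unwrap();
        if let Some(peer) = peers.get(&descriptor) {
            let mut peer = peer.lock().unwrap();
            peer.pending_read.extend_from_slice(data);
            // ... decrypt and deserialize messages here ...
        }
    }

    fn disconnect_event(&self, descriptor: Descriptor) {
        // Disconnection takes the global write lock, so it cannot race with
        // message processing for the same peer.
        self.peers.write().unwrap().remove(&descriptor);
    }
}
```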
Unlike very ancient versions of lightning-net-tokio, this does not rely on a single global process_events future, but instead has one per connection. This could still cause significant contention, so we'll ensure only two process_events calls can exist at once in the next few commits.
Because the peers write lock "blocks the world", and happens after each read event, always taking the write lock has pretty severe impacts on parallelism. Instead, here, we only take the global write lock if we have to disconnect a peer.
Similar to the previous commit, this avoids "blocking the world" on every timer tick unless we need to disconnect peers.
Only one instance of PeerManager::process_events can run at a time, and each run always finishes all available work before returning. Thus, having several threads blocked on the process_events lock doesn't accomplish anything but blocking more threads. Here we limit the number of blocked calls on process_events to two - one processing events and one blocked at the top which will process all available events after the first completes.
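One way to picture the "at most two callers" limit, as a hedged sketch rather than the actual implementation (the atomic-counter scheme and names are illustrative):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

// One caller runs, at most one more waits, and everyone else returns
// immediately since the waiting caller will pick up their work anyway.
struct EventProcessor {
    blocked_count: AtomicUsize,
    process_lock: Mutex<()>,
}

impl EventProcessor {
    fn new() -> Self {
        EventProcessor { blocked_count: AtomicUsize::new(0), process_lock: Mutex::new(()) }
    }

    fn process_events(&self) {
        // If two calls are already accounted for (one running, one waiting),
        // this call adds nothing: the waiting call will process all events
        // that exist by the time it runs.
        if self.blocked_count.fetch_add(1, Ordering::AcqRel) >= 2 {
            self.blocked_count.fetch_sub(1, Ordering::AcqRel);
            return;
        }
        let _guard = self.process_lock.lock().unwrap();
        // ... drain and handle all pending events here ...
        self.blocked_count.fetch_sub(1, Ordering::AcqRel);
    }
}
```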
This avoids any extra calls to `read_event` after a write fails to flush the write buffer fully, as is required by the PeerManager API (though it isn't critical).
Because we handle messages (which can take some time, persisting things to disk or validating cryptographic signatures) with the top-level read lock, but require the top-level write lock to connect new peers or handle disconnection, we are particularly sensitive to writer starvation issues. Rust's libstd RwLock does not provide any fairness guarantees, using whatever the OS provides as-is. On Linux, pthreads defaults to starving writers, which Rust's RwLock exposes to us (without any configurability). Here we work around that issue by blocking readers if there are pending writers, optimizing for readable code over perfectly-optimized blocking.
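A minimal sketch of one possible "block readers while a writer waits" wrapper, assuming a simple waiting-writer counter; the type and method names here are illustrative, not the actual rust-lightning API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard};

struct FairRwLock<T> {
    lock: RwLock<T>,
    waiting_writers: AtomicUsize,
}

impl<T> FairRwLock<T> {
    fn new(t: T) -> Self {
        FairRwLock { lock: RwLock::new(t), waiting_writers: AtomicUsize::new(0) }
    }

    fn write(&self) -> RwLockWriteGuard<'_, T> {
        // Announce the pending writer so new readers back off, then take the
        // lock; readers that arrived earlier still finish first.
        self.waiting_writers.fetch_add(1, Ordering::AcqRel);
        let guard = self.lock.write().unwrap();
        self.waiting_writers.fetch_sub(1, Ordering::AcqRel);
        guard
    }

    fn read(&self) -> RwLockReadGuard<'_, T> {
        // Yield while a writer is waiting, so the OS-level RwLock never sees
        // an unbroken stream of readers starving the writer. This is a rough
        // approximation of fairness, favoring readable code over a
        // perfectly-optimized blocking scheme.
        while self.waiting_writers.load(Ordering::Acquire) != 0 {
            std::thread::yield_now();
        }
        self.lock.read().unwrap()
    }
}
```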
This avoids repeatedly deallocating-allocating a Vec for the peer read buffer after every message/header.
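Illustratively (names hypothetical), the buffer can simply be resized in place for the next header or message rather than replaced with a fresh allocation:

```rust
// Keep one Vec per peer and resize it for the next read target instead of
// allocating a new buffer for every message/header.
struct ReadState { read_buffer: Vec<u8> }

impl ReadState {
    fn expect_next(&mut self, len: usize) {
        // `resize` reuses the existing allocation when capacity allows,
        // avoiding a dealloc/alloc pair per message.
        self.read_buffer.resize(len, 0);
    }
}
```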
This reduces instances of disconnect peers after single timer intervals somewhat, at least on Tokio 1.14.
...and implement wire::Type for `()` for `feature = "_test_utils"`.
These increase coverage and caught previous lockorder inversions.
Squashed without further changes.
Force-pushed from 1a95917 to 46009a5.
This adds the required locking to process messages from different peers simultaneously in `PeerManager`. Note that channel messages are still processed under a global lock in `ChannelManager`, but the vast, vast majority of our message processing time is in gossip messages anyway, so this should improve things there.
This is pretty low priority, but it's super low-hanging fruit and makes me feel good.