Process messages from different peers in parallel in PeerManager. #1023
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1023 +/- ##
==========================================
- Coverage 90.47% 90.40% -0.07%
==========================================
Files 69 70 +1
Lines 37137 37228 +91
==========================================
+ Hits 33600 33657 +57
- Misses 3537 3571 +34
Continue to review full report at Codecov.
Force-pushed from b12ca74 to 3ed74ed.
Otherwise SGTM
/// Only add to this set when noise completes.
/// Locked *after* peers. When an item is removed, it must be removed with the `peers` write
/// lock held. Entries may be added with only the `peers` read lock held (though the
/// `Descriptor` value must already exist in `peers`).
Could add a comment about what this structure is holding: "Map of node pubkeys to their connection handler. One node cannot have more than one descriptor."
Also, the inter-related lock handling with `peers` could be `debug_assert()`ed by relying on `try_write().is_err()` to observe that the `RwLock` is effectively held.
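A minimal sketch of what that suggestion could look like, with simplified stand-in types (`PeerHolder`, `Descriptor`, and `remove_node_id` are illustrative here, not the actual `PeerManager` fields):

```rust
use std::collections::HashMap;
use std::sync::{Mutex, RwLock};

// Hypothetical, simplified stand-ins for the real types.
type Descriptor = u64;
struct Peer;

struct PeerHolder {
    peers: RwLock<HashMap<Descriptor, Mutex<Peer>>>,
    // Locked *after* `peers`; removal requires the `peers` write lock.
    node_id_to_descriptor: Mutex<HashMap<[u8; 33], Descriptor>>,
}

impl PeerHolder {
    fn remove_node_id(&self, node_id: &[u8; 33]) {
        // `try_write()` fails whenever the lock is held (read or write), so
        // this assert is only an approximation of "write lock held", but it
        // does catch callers that hold no lock at all.
        debug_assert!(self.peers.try_write().is_err());
        self.node_id_to_descriptor.lock().unwrap().remove(node_id);
    }
}
```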
Converting to draft until I can rebase on #1043, which should get the second commit basically for free.
Force-pushed from 3ed74ed to 26e8df2.
Rebased, much cleaner now - it's a single commit.
Force-pushed from fb043e8 to 587145b.
Rebased and pushed a few more commits which significantly reduce contention, plus one commit that's important for the sample.
Force-pushed from c137255 to 69c66f4.
I finally got the locking here borderline where I want it, but then realized that, at least on Linux, RwLock read locks will trivially starve writers, completely breaking the approach here. I'm not sure if there's a way to fix the starvation, which feels broken at an OS level, or if we'll have to rethink this approach entirely.
Force-pushed from 157abfb to 039f05b.
This is looking quite good in testing now, I think, but I do want to give it a bit more time to bake on my test node before we move it forward here. I think there's room to optimize somewhat to avoid hitting the read-pause too often.
Force-pushed from 3d64059 to 4fad4ef.
Okay! This seems to be working swimmingly, at least on my public node. There are some other optimizations that are nice but not really worth including here, I think.
Force-pushed from 1d8cde2 to 930ee2a.
Just a few last comments but otherwise looks good!
Force-pushed from 930ee2a to 1a95917.
Good to squash!
This adds the required locking to process messages from different peers simultaneously in `PeerManager`. Note that channel messages are still processed under a global lock in `ChannelManager`, and most work is still processed under a global lock in gossip message handling, but parallelizing message deserialization and message decryption is somewhat helpful.
Users are required to only ever call `read_event` serially per-peer, thus we actually don't need any locks while we're processing messages - we can only be processing messages in one thread per-peer. That said, we do need to ensure that another thread doesn't disconnect the peer we're processing messages for, as that could result in a peer_disconnected call while we're processing a message for the same peer - somewhat nonsensical. This significantly improves parallelism especially during gossip processing as it avoids waiting on the entire set of individual peer locks to forward a gossip message while several other threads are validating gossip messages with their individual peer locks held.
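A rough sketch of the locking shape this commit describes, using hypothetical, heavily simplified types (the real `PeerManager` tracks considerably more state per peer):

```rust
use std::collections::HashMap;
use std::sync::{Mutex, RwLock};

// Hypothetical stand-ins for the real types.
type Descriptor = u64;
struct Peer { pending_read: Vec<u8> }

struct PeerManager {
    // The read lock suffices to process messages: `read_event` is only ever
    // called serially per-peer, so the per-peer Mutex is uncontended. The
    // write lock is only needed to mutate the map itself, i.e. to connect
    // or disconnect peers.
    peers: RwLock<HashMap<Descriptor, Mutex<Peer>>>,
}

impl PeerManager {
    fn read_event(&self, descriptor: Descriptor, data: &[u8]) {
        // Holding the read lock keeps another thread from disconnecting this
        // peer (removing it from the map) while we process its messages.
        let peers = self.peers.read().unwrap();
        if let Some(peer) = peers.get(&descriptor) {
            let mut peer = peer.lock().unwrap();
            peer.pending_read.extend_from_slice(data);
            // ... decrypt and deserialize messages here ...
        }
    }

    fn disconnect_event(&self, descriptor: Descriptor) {
        // Disconnection takes the global write lock, so it cannot race with
        // message processing for the same peer.
        self.peers.write().unwrap().remove(&descriptor);
    }
}
```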
Unlike very ancient versions of lightning-net-tokio, this does not rely on a single global process_events future, but instead has one per connection. This could still cause significant contention, so we'll ensure only two process_events calls can exist at once in the next few commits.
Because the peers write lock "blocks the world", and happens after each read event, always taking the write lock has pretty severe impacts on parallelism. Instead, here, we only take the global write lock if we have to disconnect a peer.
Similar to the previous commit, this avoids "blocking the world" on every timer tick unless we need to disconnect peers.
Only one instance of PeerManager::process_events can run at a time, and each run always finishes all available work before returning. Thus, having several threads blocked on the process_events lock doesn't accomplish anything but blocking more threads. Here we limit the number of blocked calls on process_events to two - one processing events and one blocked at the top which will process all available events after the first completes.
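One way to picture the "at most two callers" limit, as a hedged sketch rather than the actual implementation (the atomic-counter scheme and names are illustrative):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

// One caller runs, at most one more waits, and everyone else returns
// immediately since the waiting caller will pick up their work anyway.
struct EventProcessor {
    blocked_count: AtomicUsize,
    process_lock: Mutex<()>,
}

impl EventProcessor {
    fn new() -> Self {
        EventProcessor { blocked_count: AtomicUsize::new(0), process_lock: Mutex::new(()) }
    }

    fn process_events(&self) {
        // If two calls are already accounted for (one running, one waiting),
        // this call adds nothing: the waiting call will process all events
        // that exist by the time it runs.
        if self.blocked_count.fetch_add(1, Ordering::AcqRel) >= 2 {
            self.blocked_count.fetch_sub(1, Ordering::AcqRel);
            return;
        }
        let _guard = self.process_lock.lock().unwrap();
        // ... drain and handle all pending events here ...
        self.blocked_count.fetch_sub(1, Ordering::AcqRel);
    }
}
```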
This avoids any extra calls to `read_event` after a write fails to flush the write buffer fully, as is required by the PeerManager API (though it isn't critical).
Because we handle messages (which can take some time, persisting things to disk or validating cryptographic signatures) with the top-level read lock, but require the top-level write lock to connect new peers or handle disconnection, we are particularly sensitive to writer starvation issues. Rust's libstd RwLock does not provide any fairness guarantees, using whatever the OS provides as-is. On Linux, pthreads defaults to starving writers, which Rust's RwLock exposes to us (without any configurability). Here we work around that issue by blocking readers if there are pending writers, optimizing for readable code over perfectly-optimized blocking.
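A minimal sketch of one possible "block readers while a writer waits" wrapper, assuming a simple waiting-writer counter; the type and method names here are illustrative, not the actual rust-lightning API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard};

struct FairRwLock<T> {
    lock: RwLock<T>,
    waiting_writers: AtomicUsize,
}

impl<T> FairRwLock<T> {
    fn new(t: T) -> Self {
        FairRwLock { lock: RwLock::new(t), waiting_writers: AtomicUsize::new(0) }
    }

    fn write(&self) -> RwLockWriteGuard<'_, T> {
        // Announce the pending writer so new readers back off, then take the
        // lock; readers that arrived earlier still finish first.
        self.waiting_writers.fetch_add(1, Ordering::AcqRel);
        let guard = self.lock.write().unwrap();
        self.waiting_writers.fetch_sub(1, Ordering::AcqRel);
        guard
    }

    fn read(&self) -> RwLockReadGuard<'_, T> {
        // Yield while a writer is waiting, so the OS-level RwLock never sees
        // an unbroken stream of readers starving the writer. This is a rough
        // approximation of fairness, favoring readable code over a
        // perfectly-optimized blocking scheme.
        while self.waiting_writers.load(Ordering::Acquire) != 0 {
            std::thread::yield_now();
        }
        self.lock.read().unwrap()
    }
}
```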
This avoids repeatedly deallocating-allocating a Vec for the peer read buffer after every message/header.
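Illustratively (names hypothetical), the buffer can simply be resized in place for the next header or message rather than replaced with a fresh allocation:

```rust
// Keep one Vec per peer and resize it for the next read target instead of
// allocating a new buffer for every message/header.
struct ReadState { read_buffer: Vec<u8> }

impl ReadState {
    fn expect_next(&mut self, len: usize) {
        // `resize` reuses the existing allocation when capacity allows,
        // avoiding a dealloc/alloc pair per message.
        self.read_buffer.resize(len, 0);
    }
}
```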
This reduces instances of disconnect peers after single timer intervals somewhat, at least on Tokio 1.14.
...and implement wire::Type for `()` for `feature = "_test_utils"`.
These increase coverage and caught previous lockorder inversions.
Squashed without further changes.
Force-pushed from 1a95917 to 46009a5.
This adds the required locking to process messages from different peers simultaneously in `PeerManager`. Note that channel messages are still processed under a global lock in `ChannelManager`, but the vast, vast majority of our message processing time is in gossip messages anyway, so this should improve things there.
This is pretty low priority, but it's super low-hanging fruit and makes me feel good.