Process messages from different peers in parallel in PeerManager. #1023

Merged


TheBlueMatt
Collaborator

This adds the required locking to process messages from different
peers simultaneously in PeerManager. Note that channel messages
are still processed under a global lock in ChannelManager, but
the vast, vast majority of our message processing time is in
gossip messages anyway, so this should improve things there.

This is pretty low priority, but it's super low-hanging fruit and makes me feel good.

@codecov

codecov bot commented Jul 30, 2021

Codecov Report

Merging #1023 (876d690) into main (ef86a3e) will decrease coverage by 0.06%.
The diff coverage is 60.91%.


@@            Coverage Diff             @@
##             main    #1023      +/-   ##
==========================================
- Coverage   90.47%   90.40%   -0.07%     
==========================================
  Files          69       70       +1     
  Lines       37137    37228      +91     
==========================================
+ Hits        33600    33657      +57     
- Misses       3537     3571      +34     
| Impacted Files | Coverage Δ |
|---|---|
| lightning/src/ln/msgs.rs | 86.34% <ø> (ø) |
| lightning/src/ln/peer_handler.rs | 50.88% <56.52%> (+0.55%) ⬆️ |
| lightning/src/routing/network_graph.rs | 91.29% <80.00%> (-0.06%) ⬇️ |
| lightning/src/util/fairrwlock.rs | 85.71% <85.71%> (ø) |
| lightning-background-processor/src/lib.rs | 94.71% <88.88%> (-0.29%) ⬇️ |
| lightning/src/ln/functional_tests.rs | 97.32% <88.88%> (-0.10%) ⬇️ |
| lightning-net-tokio/src/lib.rs | 77.45% <100.00%> (+0.76%) ⬆️ |
| lightning/src/ln/channel.rs | 89.04% <100.00%> (+<0.01%) ⬆️ |
| lightning/src/ln/wire.rs | 61.77% <0.00%> (-0.39%) ⬇️ |

... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ef86a3e...876d690.

@TheBlueMatt TheBlueMatt force-pushed the 2021-07-par-gossip-processing branch 2 times, most recently from b12ca74 to 3ed74ed Compare August 1, 2021 01:43

@ariard ariard left a comment


Otherwise SGTM

/// Only add to this set when noise completes.
/// Locked *after* peers. When an item is removed, it must be removed with the `peers` write
/// lock held. Entries may be added with only the `peers` read lock held (though the
/// `Descriptor` value must already exist in `peers`).

Could add a comment about what this structure is holding: "Map of node pubkeys to their connection handler. One node cannot have more than one descriptor".

Also, the inter-related lock handling with `peers` could be checked with a `debug_assert!()`, relying on `try_write().is_err()` to observe that the `RwLock` is effectively held.
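A minimal sketch of what such an assertion could look like, using only `std` types. `Holder`, `PeerState`, `add_node_id`, and the `u64`/`String` key types are hypothetical stand-ins for illustration, not the types used in this PR.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, RwLock, RwLockReadGuard};

/// Hypothetical stand-in for the per-connection state stored in `peers`.
pub struct PeerState;

/// Hypothetical holder with the lock order described above.
pub struct Holder {
    /// Locked first.
    pub peers: RwLock<HashMap<u64, PeerState>>,
    /// Locked *after* `peers`.
    pub node_id_to_descriptor: Mutex<HashMap<String, u64>>,
}

impl Holder {
    /// Adds an entry while the caller holds (at least) the `peers` read lock.
    pub fn add_node_id(
        &self,
        peers: &RwLockReadGuard<'_, HashMap<u64, PeerState>>,
        node_id: String,
        descriptor: u64,
    ) {
        // If `peers` is held (read or write), a non-blocking write attempt must fail.
        // This is a heuristic: it also fails when another thread holds the lock.
        debug_assert!(self.peers.try_write().is_err());
        // The descriptor must already exist in `peers`.
        debug_assert!(peers.contains_key(&descriptor));
        self.node_id_to_descriptor.lock().unwrap().insert(node_id, descriptor);
    }
}
```

Because `try_write()` never blocks, the check cannot deadlock, and `debug_assert!` compiles it out of release builds.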

@TheBlueMatt TheBlueMatt marked this pull request as draft September 10, 2021 04:13
@TheBlueMatt
Collaborator Author

Converting to draft until I can rebase on #1043 which should get the second commit basically for free.

@TheBlueMatt TheBlueMatt force-pushed the 2021-07-par-gossip-processing branch from 3ed74ed to 26e8df2 Compare September 15, 2021 18:39
@TheBlueMatt TheBlueMatt marked this pull request as ready for review September 15, 2021 18:39
@TheBlueMatt
Collaborator Author

Rebased, much cleaner now - it's a single commit

@TheBlueMatt TheBlueMatt force-pushed the 2021-07-par-gossip-processing branch 3 times, most recently from fb043e8 to 587145b Compare September 26, 2021 00:15
@TheBlueMatt
Collaborator Author

Rebased and pushed a few more commits which significantly reduce contention plus one commit that's important for the sample.

@TheBlueMatt TheBlueMatt force-pushed the 2021-07-par-gossip-processing branch 3 times, most recently from c137255 to 69c66f4 Compare October 6, 2021 06:59
@TheBlueMatt
Collaborator Author

I finally got the locking here borderline where I want it, but then realized that, at least on Linux, `RwLock` read locks will trivially starve writers, completely breaking the approach here. I'm not sure if there's a way to fix the starvation, which feels broken at an OS level, or if we'll have to rethink this approach entirely.

@TheBlueMatt TheBlueMatt force-pushed the 2021-07-par-gossip-processing branch 7 times, most recently from 157abfb to 039f05b Compare October 6, 2021 19:02
@TheBlueMatt TheBlueMatt marked this pull request as draft October 6, 2021 20:02
@TheBlueMatt
Collaborator Author

This is looking quite good in testing now, I think, but I do want to give it a bit more time to bake on my test node before we move it forward here. I think there's room to optimize somewhat to avoid hitting the read-pause too often.

@TheBlueMatt TheBlueMatt force-pushed the 2021-07-par-gossip-processing branch 2 times, most recently from 3d64059 to 4fad4ef Compare October 11, 2021 05:04
@TheBlueMatt
Collaborator Author

Okay! This seems to be working swimmingly, at least on my public node. There are some other optimizations that are nice but not really worth including here, I think.

@TheBlueMatt TheBlueMatt marked this pull request as ready for review October 11, 2021 05:04
@TheBlueMatt TheBlueMatt force-pushed the 2021-07-par-gossip-processing branch 2 times, most recently from 1d8cde2 to 930ee2a Compare May 6, 2022 03:14
arik-so previously approved these changes May 6, 2022
Contributor

@jkczyz jkczyz left a comment


Just a few last comments but otherwise looks good!

lightning/src/util/test_utils.rs (resolved)
lightning/src/ln/peer_handler.rs (outdated, resolved)
lightning/src/ln/peer_handler.rs (resolved)
jkczyz previously approved these changes May 10, 2022
Contributor

@jkczyz jkczyz left a comment


Good to squash!

This adds the required locking to process messages from different
peers simultaneously in `PeerManager`. Note that channel messages
are still processed under a global lock in `ChannelManager`, and
most work is still processed under a global lock in gossip message
handling, but parallelizing message deserialization and message
decryption is somewhat helpful.

Users are required to only ever call `read_event` serially
per-peer, thus we actually don't need any locks while we're
processing messages - we can only be processing messages in one
thread per-peer.

That said, we do need to ensure that another thread doesn't
disconnect the peer we're processing messages for, as that could
result in a peer_disconnected call while we're processing a
message for the same peer - somewhat nonsensical.

This significantly improves parallelism especially during gossip
processing as it avoids waiting on the entire set of individual
peer locks to forward a gossip message while several other threads
are validating gossip messages with their individual peer locks
held.
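A rough sketch of the locking shape described above, under simplifying assumptions: message handling runs under the top-level read lock, and only disconnection needs the write lock. `Manager`, `Peer`, the `u64` descriptor, and the counters are hypothetical and far simpler than the real `PeerManager`.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, RwLock};

/// Hypothetical per-peer state; the real peer state holds far more.
#[derive(Default)]
pub struct Peer {
    pub msgs_handled: usize,
    pub bytes_handled: usize,
}

/// Hypothetical, simplified manager keyed by a `u64` descriptor.
#[derive(Default)]
pub struct Manager {
    peers: RwLock<HashMap<u64, Mutex<Peer>>>,
}

impl Manager {
    pub fn connect(&self, descriptor: u64) {
        // Connecting a peer needs the top-level write lock.
        self.peers.write().unwrap().insert(descriptor, Mutex::new(Peer::default()));
    }

    /// Handles messages for one peer. Returns `false` if the peer is unknown.
    /// Callers are assumed to invoke this serially per peer.
    pub fn read_event(&self, descriptor: u64, msgs: &[&[u8]]) -> bool {
        // The read lock keeps the peer from being removed while we work,
        // but lets other threads handle other peers at the same time.
        let peers = self.peers.read().unwrap();
        let peer = match peers.get(&descriptor) {
            Some(peer) => peer,
            None => return false,
        };
        for msg in msgs {
            // Stand-in for slow work (decryption, validation) done with
            // no per-peer lock held.
            let len = msg.len();
            // Take the per-peer lock only briefly to update state.
            let mut state = peer.lock().unwrap();
            state.msgs_handled += 1;
            state.bytes_handled += len;
        }
        true
    }

    /// Disconnecting needs the write lock, so it cannot race with `read_event`.
    pub fn disconnect(&self, descriptor: u64) -> Option<usize> {
        let removed = self.peers.write().unwrap().remove(&descriptor);
        removed.map(|peer| peer.into_inner().unwrap().msgs_handled)
    }
}
```

Since `disconnect` has to wait for every outstanding read guard, it can never run while a message for that peer is still being handled.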
Unlike very ancient versions of lightning-net-tokio, this does not
rely on a single global process_events future, but instead has one
per connection. This could still cause significant contention, so
we'll ensure only two process_events calls can exist at once in
the next few commits.

Because the peers write lock "blocks the world", and happens after
each read event, always taking the write lock has pretty severe
impacts on parallelism. Instead, here, we only take the global
write lock if we have to disconnect a peer.
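One way to picture this, as a sketch only: do the per-peer pass under the read lock, remember which peers need to go, and take the write lock only when that list is non-empty. `Manager`, `Peer`, and the `write_failed` flag are hypothetical names for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, RwLock};

/// Hypothetical per-peer state with a flag marking a dead connection.
pub struct Peer {
    pub write_failed: bool,
}

/// Hypothetical, simplified manager keyed by a `u64` descriptor.
pub struct Manager {
    pub peers: RwLock<HashMap<u64, Mutex<Peer>>>,
}

impl Manager {
    /// Returns how many peers were disconnected by this pass.
    pub fn process_events(&self) -> usize {
        let mut to_disconnect = Vec::new();
        {
            // Common case: only the read lock, so other threads keep working.
            let peers = self.peers.read().unwrap();
            for (descriptor, peer) in peers.iter() {
                if peer.lock().unwrap().write_failed {
                    to_disconnect.push(*descriptor);
                }
            }
        } // Read lock released here, before any write lock is requested.

        if to_disconnect.is_empty() {
            return 0;
        }

        // Rare case: "block the world" just long enough to remove peers.
        let mut peers = self.peers.write().unwrap();
        to_disconnect
            .into_iter()
            .filter(|descriptor| peers.remove(descriptor).is_some())
            .count()
    }
}
```

The read guard has to be dropped before asking for the write lock, otherwise the thread would deadlock against itself.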
Similar to the previous commit, this avoids "blocking the world" on
every timer tick unless we need to disconnect peers.

Only one instance of PeerManager::process_events can run at a time,
and each run always finishes all available work before returning.
Thus, having several threads blocked on the process_events lock
doesn't accomplish anything but blocking more threads.

Here we limit the number of blocked calls on process_events to two:
one processing events and one blocked at the top, which will
process all available events after the first completes.
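A sketch of how the "at most two" limit could be built from a `Mutex` plus an `AtomicBool`. The struct and field names (`EventProcessor`, `processing_lock`, `waiter_blocked`, `runs`) are made up for this example, and the real implementation may differ.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical event processor; fields are public only to keep the sketch testable.
#[derive(Default)]
pub struct EventProcessor {
    /// Held by the single thread that is currently processing events.
    pub processing_lock: Mutex<()>,
    /// Set while one extra caller is blocked waiting for `processing_lock`.
    pub waiter_blocked: AtomicBool,
    /// Counts completed processing passes.
    pub runs: AtomicUsize,
}

impl EventProcessor {
    /// Returns `true` if this call ran a processing pass, `false` if it
    /// returned early because a waiter was already queued.
    pub fn process_events(&self) -> bool {
        let _guard = match self.processing_lock.try_lock() {
            Ok(guard) => guard,
            Err(_) => {
                // Someone is already processing. Try to become the one
                // allowed waiter; if that slot is taken, the existing waiter
                // will pick up our work, so just return.
                if self
                    .waiter_blocked
                    .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
                    .is_err()
                {
                    return false;
                }
                let guard = self.processing_lock.lock().unwrap();
                // We now hold the lock, so free the waiter slot for the next caller.
                self.waiter_blocked.store(false, Ordering::Release);
                guard
            }
        };
        // Stand-in for "process all available events".
        self.runs.fetch_add(1, Ordering::Relaxed);
        true
    }
}
```

Returning early is safe only because the queued waiter runs a full pass after the current one finishes, so work that arrives in the meantime still gets handled.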
This avoids any extra calls to `read_event` after a write fails to
flush the write buffer fully, as is required by the PeerManager
API (though it isn't critical).

Because we handle messages (which can take some time, persisting
things to disk or validating cryptographic signatures) with the
top-level read lock, but require the top-level write lock to
connect new peers or handle disconnection, we are particularly
sensitive to writer starvation issues.

Rust's libstd RwLock does not provide any fairness guarantees,
using whatever the OS provides as-is. On Linux, pthreads defaults
to starving writers, which Rust's RwLock exposes to us (without
any configurability).

Here we work around that issue by blocking readers if there are
pending writers, optimizing for readable code over
perfectly-optimized blocking.
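For illustration, a small wrapper in that spirit might look like the following. The type name mirrors the new `fairrwlock.rs` file, but the body is an assumption: the `waiting_writers` counter and method shapes are a sketch, not a copy of the merged code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{LockResult, RwLock, RwLockReadGuard, RwLockWriteGuard};

/// Sketch of an `RwLock` wrapper that keeps readers from starving writers.
pub struct FairRwLock<T> {
    lock: RwLock<T>,
    /// Number of threads currently waiting for (or acquiring) the write lock.
    waiting_writers: AtomicUsize,
}

impl<T> FairRwLock<T> {
    pub fn new(value: T) -> Self {
        Self { lock: RwLock::new(value), waiting_writers: AtomicUsize::new(0) }
    }

    pub fn write(&self) -> LockResult<RwLockWriteGuard<'_, T>> {
        // Announce ourselves so that new readers back off.
        self.waiting_writers.fetch_add(1, Ordering::AcqRel);
        let res = self.lock.write();
        self.waiting_writers.fetch_sub(1, Ordering::AcqRel);
        res
    }

    pub fn read(&self) -> LockResult<RwLockReadGuard<'_, T>> {
        if self.waiting_writers.load(Ordering::Acquire) != 0 {
            // A writer is pending: queue behind it by briefly taking the
            // write lock ourselves, then drop it straight away.
            let _write_queue_lock = self.lock.write();
        }
        self.lock.read()
    }
}
```

A reader that sees a pending writer pays for a full write-lock acquisition, which is the readable-but-not-optimal trade-off the message mentions.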
This avoids repeatedly deallocating-allocating a Vec for the peer
read buffer after every message/header.
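As a sketch of the idea (not the actual change): keep one `Vec` per peer and resize it in place for each header or message, rather than building a fresh `Vec` every time. `ReadState` and `prepare` are hypothetical names.

```rust
/// Hypothetical per-peer read state that owns one reusable buffer.
pub struct ReadState {
    pub buf: Vec<u8>,
    /// How many bytes of `buf` have been filled so far.
    pub pos: usize,
}

impl ReadState {
    pub fn new() -> Self {
        Self { buf: Vec::new(), pos: 0 }
    }

    /// Gets the buffer ready for the next read of `len` bytes. `clear` plus
    /// `resize` zeroes the contents but keeps the existing allocation
    /// whenever `len` fits in the current capacity.
    pub fn prepare(&mut self, len: usize) {
        self.buf.clear();
        self.buf.resize(len, 0);
        self.pos = 0;
    }
}
```

A `Vec` never shrinks its capacity on its own, so after one large message the buffer stays large unless it is shrunk explicitly.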
This somewhat reduces instances of peers being disconnected after a
single timer interval, at least on Tokio 1.14.

...and implement wire::Type for `()` for `feature = "_test_utils"`.

These increase coverage and caught previous lockorder inversions.
@TheBlueMatt
Collaborator Author

Squashed without further changes.

@TheBlueMatt TheBlueMatt merged commit b5a6307 into lightningdevkit:main May 11, 2022
Labels
None yet
Projects
None yet
7 participants