Refuse recursive read locks #2006
Conversation
Codecov Report: Base 87.25% // Head 87.22% // decreases project coverage by -0.04%.
```diff
@@            Coverage Diff             @@
##             main    #2006      +/-   ##
==========================================
- Coverage   87.25%   87.22%   -0.04%
==========================================
  Files         100      100
  Lines       44480    44505      +25
  Branches    44480    44505      +25
==========================================
+ Hits        38810    38818       +8
- Misses       5670     5687      +17
```
Force-pushed 3adef72 to 46428a7.
Looks good, but I'll make sure to review it again after it's rebased, once the conflicting things land :)
lightning/src/sync/debug_sync.rs (outdated):

```rust
self.inner.read().map(|guard| RwLockReadGuard { lock: self, guard, first_lock }).map_err(|_| ())
// Note that while we could be taking a recursive read lock here, Rust's `RwLock` may
// deadlock trying to take a second read lock if another thread is waiting on the write
// lock. Its platform dependent (but our in-tree `FairRwLock` guarantees this behavior).
```
very tiny nit: s/Its/It's/
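To make the hazard in the quoted comment concrete, here is a minimal standalone sketch (my illustration, not code from this PR). On writer-preferring `RwLock` implementations, and on the in-tree `FairRwLock` on every platform, the second `read()` blocks behind the queued writer, which itself waits on the first read guard, so this program can hang:

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

fn main() {
    let lock = Arc::new(RwLock::new(0u32));

    let reader = {
        let lock = Arc::clone(&lock);
        thread::spawn(move || {
            let first = lock.read().unwrap(); // first read lock held...
            thread::sleep(Duration::from_millis(50)); // ...while the writer below queues up
            // On writer-preferring implementations this recursive read blocks
            // behind the pending writer, which waits on `first`: deadlock.
            let second = lock.read().unwrap();
            let _ = (*first, *second);
        })
    };

    let writer = {
        let lock = Arc::clone(&lock);
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(10));
            *lock.write().unwrap() += 1; // queued while `first` is held
        })
    };

    reader.join().unwrap();
    writer.join().unwrap();
}
```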
Needs a rebase.
Converting to draft until at least #1897 lands, because this will need to be adapted for it, I think.
Force-pushed 46428a7 to a1bb7ca.
Rebased, fixed some issues, and added another check, which turned up one more bug (see the last commit).
Force-pushed a1bb7ca to 267f0a6.
Oops, forgot to update the no-std lock file.
Force-pushed 267f0a6 to ba66a76.
CI is still failing with some lifetime errors.
Force-pushed ba66a76 to 1877714.
LGTM after squash
Force-pushed 1877714 to 18b1e86.
Squashed without changes, but pushed one additional commit to `export RUST_BACKTRACE=1` in CI, because Windows is failing and I can't really debug it locally otherwise.
Force-pushed d459767 to a91f5e6.
Fixed a silent rebase conflict; see the last commit.
Hmm, the one-mutex-per-line restriction can be a little surprising to contributors. I wonder if there's an easy way to enforce this/notify an author.
I also see `test transaction sync clients on 1.59 rustc`, but now we get:
Yea, looks like…
Force-pushed a91f5e6 to b8e36a2.
Good point! I pushed a commit to assert this on non-Windows platforms (it will still generate false positives on Windows, but at least on other platforms you'll get a helpful assertion failure message).
Found one more case, and also modified the log to include the line number:

```diff
diff --git a/lightning/src/chain/channelmonitor.rs b/lightning/src/chain/channelmonitor.rs
index 44390496..64a458d2 100644
--- a/lightning/src/chain/channelmonitor.rs
+++ b/lightning/src/chain/channelmonitor.rs
@@ -4074,7 +4074,10 @@ mod tests {
 	fn test_prune_preimages() {
 		let secp_ctx = Secp256k1::new();
 		let logger = Arc::new(TestLogger::new());
-		let broadcaster = Arc::new(TestBroadcaster{txn_broadcasted: Mutex::new(Vec::new()), blocks: Arc::new(Mutex::new(Vec::new()))});
+		let broadcaster = Arc::new(TestBroadcaster{
+			txn_broadcasted: Mutex::new(Vec::new()),
+			blocks: Arc::new(Mutex::new(Vec::new()))
+		});
 		let fee_estimator = TestFeeEstimator { sat_per_kw: Mutex::new(253) };
 		let dummy_key = PublicKey::from_secret_key(&secp_ctx, &SecretKey::from_slice(&[42; 32]).unwrap());
diff --git a/lightning/src/sync/debug_sync.rs b/lightning/src/sync/debug_sync.rs
index 76b139f7..76a7bb41 100644
--- a/lightning/src/sync/debug_sync.rs
+++ b/lightning/src/sync/debug_sync.rs
@@ -110,11 +110,11 @@ impl LockMetadata {
 		let lock_constr_location = get_construction_location(&res._lock_construction_bt);
 		LOCKS_INIT.call_once(|| { unsafe { LOCKS = Some(StdMutex::new(HashMap::new())); } });
 		let mut locks = unsafe { LOCKS.as_ref() }.unwrap().lock().unwrap();
-		match locks.entry(lock_constr_location.0) {
+		match locks.entry(lock_constr_location.0.clone()) {
 			hash_map::Entry::Occupied(e) => {
 				assert_eq!(lock_constr_location.1,
 					get_construction_location(&e.get()._lock_construction_bt).1,
-					"Because Windows doesn't support column number results in backtraces, we cannot construct two mutexes on the same line or we risk lockorder detection false positives.");
+					"Because Windows doesn't support column number results in backtraces, we cannot construct two mutexes on the same line or we risk lockorder detection false positives: {}", lock_constr_location.0);
 				return Arc::clone(e.get())
 			},
 			hash_map::Entry::Vacant(e) => { e.insert(Arc::clone(&res)); },
```
Thanks.
Force-pushed b8e36a2 to 3ac12e6.
Basically LGTM, feel free to squash everything on your next push
```rust
hash_map::Entry::Occupied(e) => {
	assert_eq!(lock_constr_location.1,
		get_construction_location(&e.get()._lock_construction_bt).1,
		"Because Windows doesn't support column number results in backtraces, we cannot construct two mutexes on the same line or we risk lockorder detection false positives.");
```
While we do have to clone it, I still think it's useful to include the actual line where this happens, even if you can simply go through all of your changes to find it.
The backtrace in the panic message should have it?
We previously avoided holding the `total_consistency_lock` while doing crypto operations to build onions. However, now that we've abstracted the outbound payment logic out into a utility module, keeping the state consistent at all times is abstracted away from code authors and reviewers, making it likely to break. Further, because we now call `send_payment_along_path` both with and without the `total_consistency_lock`, and because recursive read locks may deadlock, it would now be quite difficult to figure out which paths through `outbound_payment` need the lock and which don't. While it may slow writes somewhat, it's not really worth trying to untangle this mess; instead we just hold the `total_consistency_lock` before going into `outbound_payment` functions.
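As a rough sketch of the convention this commit adopts (the names below are invented for illustration, not LDK's actual API): the public entry point takes the lock once, and helper-module code assumes it is held rather than re-locking.

```rust
use std::sync::RwLock;

struct Manager {
    total_consistency_lock: RwLock<()>,
}

impl Manager {
    // Public entry point: the only place the lock is acquired.
    fn send_payment(&self) {
        let _read_guard = self.total_consistency_lock.read().unwrap();
        // Everything in the `outbound_payment`-style helpers runs with the
        // lock already held, so no path can recursively re-lock it.
        self.send_payment_internal();
    }

    // Internal helper: documented to require the lock; never locks itself.
    fn send_payment_internal(&self) {
        // ... build onion, update payment state ...
    }
}

fn main() {
    let mgr = Manager { total_consistency_lock: RwLock::new(()) };
    mgr.send_payment();
}
```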
Our existing lockorder tests assume that a read lock on a thread that is already holding the same read lock is totally fine. This isn't at all true. The `std` `RwLock` behavior is platform-dependent - on most platforms readers can starve writers, as readers will never block for a pending writer. However, on platforms where this is not the case, one thread trying to take a write lock may deadlock with another thread that both already has, and is attempting to take again, a read lock. Worse, our in-tree `FairRwLock` exhibits this behavior explicitly on all platforms to avoid the starvation issue. Sadly, a user ended up hitting this deadlock in production in the form of a call to `get_and_clear_pending_msg_events`, which holds the `ChannelManager::total_consistency_lock` before calling `process_pending_monitor_events` and eventually `channel_monitor_updated`, which tries to take the same read lock again. Luckily, the fix is trivial: simply remove the redundant read lock in `channel_monitor_updated`. Fixes lightningdevkit#2000
When handling a `ChannelMonitor` update via the new `handle_new_monitor_update` macro, we always call the macro with the `per_peer_state` read lock held and have the macro drop the per-peer state lock. Then, when handling the resulting updates, we may take the `per_peer_state` read lock again in another function. In a coming commit, recursive read locks will be disallowed, so we have to drop the `per_peer_state` read lock before calling additional functions in `handle_new_monitor_update`, which we do here.
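The shape of that fix, sketched abstractly (the names here are invented for illustration, not the macro itself): explicitly drop the read guard before calling back into functions that may take the same lock.

```rust
use std::sync::RwLock;

fn handle_new_monitor_update(per_peer_state: &RwLock<Vec<u32>>) {
    let peer_state = per_peer_state.read().unwrap();
    let needs_followup = !peer_state.is_empty();
    // Release the read lock *before* calling helpers that re-take it,
    // since recursive read locks are about to be disallowed.
    drop(peer_state);
    if needs_followup {
        handle_resulting_updates(per_peer_state);
    }
}

fn handle_resulting_updates(per_peer_state: &RwLock<Vec<u32>>) {
    let _peer_state = per_peer_state.read().unwrap(); // safe: not held recursively
}

fn main() {
    let per_peer_state = RwLock::new(vec![1, 2, 3]);
    handle_new_monitor_update(&per_peer_state);
}
```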
Our existing lockorder tests assume that a read lock on a thread that is already holding the same read lock is totally fine. This isn't at all true. The `std` `RwLock` behavior is platform-dependent - on most platforms readers can starve writers as readers will never block for a pending writer. However, on platforms where this is not the case, one thread trying to take a write lock may deadlock with another thread that both already has, and is attempting to take again, a read lock. Worse, our in-tree `FairRwLock` exhibits this behavior explicitly on all platforms to avoid the starvation issue. Thus, we shouldn't have any special handling for allowing recursive read locks, so we simply remove it here.
Taking two instances of the same mutex may be totally fine, but it requires a total lockorder that we cannot (trivially) check. Thus, it's generally unsafe to do if we can avoid it. To discourage doing this, here we default to panicking on such locks in our lockorder tests, with a separate lock function added that is clearly labeled "unsafe" to allow doing so when we can guarantee a total lockorder. This requires adapting a number of sites to the new API, including fixing a bug this turned up in `ChannelMonitor`'s `PartialEq`, where no lockorder was guaranteed.
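For illustration, here is one standard way to guarantee a total lockorder when two instances of the same mutex type must be locked together (e.g. in a `PartialEq` impl): always lock the instance at the lower address first. This is a sketch with invented names, not the PR's actual `unsafe`-labeled API.

```rust
use std::sync::{Mutex, MutexGuard};

// Lock `a` and `b` in a globally consistent order (by address), so two
// threads locking the same pair in opposite argument order cannot deadlock.
fn lock_both<'a, T>(a: &'a Mutex<T>, b: &'a Mutex<T>) -> (MutexGuard<'a, T>, MutexGuard<'a, T>) {
    assert!(!std::ptr::eq(a, b), "cannot lock the same mutex twice");
    if (a as *const Mutex<T>) < (b as *const Mutex<T>) {
        let guard_a = a.lock().unwrap();
        let guard_b = b.lock().unwrap();
        (guard_a, guard_b)
    } else {
        let guard_b = b.lock().unwrap();
        let guard_a = a.lock().unwrap();
        (guard_a, guard_b)
    }
}

fn main() {
    let m1 = Mutex::new(1);
    let m2 = Mutex::new(2);
    // Safe regardless of argument order; the guards come back as (a, b).
    let (g1, g2) = lock_both(&m2, &m1);
    println!("{} {}", *g1, *g2);
}
```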
as this test often fails on Windows, which is hard to debug locally for most contributors.
Our lockdep logic (on Windows) identifies a mutex based on the line on which it was constructed. Thus, if we have two mutexes constructed on the same line, it will generate false positives.
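A simplified sketch of why, keyed on `std::panic::Location` rather than the backtraces the real `debug_sync` code parses (so this is an analogy, not the PR's implementation): if lock identity is `(file, line)` only, two locks built on one line collide, and only the column, which Windows backtraces omit, can tell them apart.

```rust
use std::collections::HashMap;
use std::panic::Location;
use std::sync::Mutex;

// Maps construction site (file, line) -> column of the first lock built there.
static SITES: Mutex<Option<HashMap<(String, u32), u32>>> = Mutex::new(None);

struct TrackedLock;

impl TrackedLock {
    #[track_caller]
    fn new() -> TrackedLock {
        let loc = Location::caller();
        let mut sites = SITES.lock().unwrap();
        let map = sites.get_or_insert_with(HashMap::new);
        let col = map.entry((loc.file().to_owned(), loc.line())).or_insert(loc.column());
        // Mirrors the PR's assertion: a second lock on the same line (only the
        // column differs) would share a lockdep identity on Windows.
        assert_eq!(*col, loc.column(),
            "two locks constructed on the same line would be conflated");
        TrackedLock
    }
}

fn main() {
    let _a = TrackedLock::new();
    let _b = TrackedLock::new(); // fine: a different line
    // let (_c, _d) = (TrackedLock::new(), TrackedLock::new()); // would panic
}
```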
Force-pushed 3ac12e6 to 065dc6e.
Fixed the comments and squashed:

```diff
$ git diff-tree -U1 3ac12e65 065dc6e6
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index 6f654645c..1045e77ad 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -830,3 +830,3 @@ where
 	/// Notifier the lock contains sends out a notification when the lock is released.
-	pub(crate) total_consistency_lock: RwLock<()>,
+	total_consistency_lock: RwLock<()>,
@@ -7878,6 +7878,3 @@ mod tests {
 		let session_privs = nodes[0].node.test_add_new_pending_payment(our_payment_hash, Some(payment_secret), payment_id, &mpp_route).unwrap();
-		{
-			let _lck = nodes[0].node.total_consistency_lock.read().unwrap();
-			nodes[0].node.send_payment_along_path(&mpp_route.paths[0], &route.payment_params, &our_payment_hash, &Some(payment_secret), 200_000, cur_height, payment_id, &None, session_privs[0]).unwrap();
-		}
+		nodes[0].node.test_send_payment_along_path(&mpp_route.paths[0], &route.payment_params, &our_payment_hash, &Some(payment_secret), 200_000, cur_height, payment_id, &None, session_privs[0]).unwrap();
 		check_added_monitors!(nodes[0], 1);
@@ -7911,6 +7908,3 @@ mod tests {
 		// Send the second half of the original MPP payment.
-		{
-			let _lck = nodes[0].node.total_consistency_lock.read().unwrap();
-			nodes[0].node.send_payment_along_path(&mpp_route.paths[1], &route.payment_params, &our_payment_hash, &Some(payment_secret), 200_000, cur_height, payment_id, &None, session_privs[1]).unwrap();
-		}
+		nodes[0].node.test_send_payment_along_path(&mpp_route.paths[1], &route.payment_params, &our_payment_hash, &Some(payment_secret), 200_000, cur_height, payment_id, &None, session_privs[1]).unwrap();
 		check_added_monitors!(nodes[0], 1);
```
LGTM
0.0.114 - Mar 3, 2023 - "Faster Async BOLT12 Retries"

API Updates
===========
 * `InvoicePayer` has been removed and its features moved directly into `ChannelManager`. As such it now requires a simplified `Router` and supports `send_payment_with_retry` (and friends). `ChannelManager::retry_payment` was removed in favor of the automated retries. Invoice payment utilities in `lightning-invoice` now call the new code (lightningdevkit#1812, lightningdevkit#1916, lightningdevkit#1929, lightningdevkit#2007, etc).
 * `Sign`/`BaseSign` has been renamed `ChannelSigner`, with `EcdsaChannelSigner` split out in anticipation of future schnorr/taproot support (lightningdevkit#1967).
 * The catch-all `KeysInterface` was split into `EntropySource`, `NodeSigner`, and `SignerProvider`. `KeysManager` implements all three (lightningdevkit#1910, lightningdevkit#1930).
 * `KeysInterface::get_node_secret` is now `KeysManager::get_node_secret_key` and is no longer required for external signers (lightningdevkit#1951, lightningdevkit#2070).
 * A `lightning-transaction-sync` crate has been added which implements keeping LDK in sync with the chain via an esplora server (lightningdevkit#1870). Note that it can only be used on nodes that *never* ran a previous version of LDK.
 * `Score` is updated in `BackgroundProcessor` instead of via `Router` (lightningdevkit#1996).
 * `ChainAccess::get_utxo` (now `UtxoAccess`) can now be resolved async (lightningdevkit#1980).
 * BOLT12 `Offer`, `InvoiceRequest`, `Invoice` and `Refund` structs as well as associated builders have been added. Such invoices cannot yet be paid due to missing support for blinded path payments (lightningdevkit#1927, lightningdevkit#1908, lightningdevkit#1926).
 * A `lightning-custom-message` crate has been added to make combining multiple custom messages into one enum/handler easier (lightningdevkit#1832).
 * `Event::PaymentPathFailure` is now generated for failure to send an HTLC over the first hop on our local channel (lightningdevkit#2014, lightningdevkit#2043).
 * `lightning-net-tokio` no longer requires an `Arc` on `PeerManager` (lightningdevkit#1968).
 * `ChannelManager::list_recent_payments` was added (lightningdevkit#1873).
 * `lightning-background-processor` `std` is now optional in async mode (lightningdevkit#1962).
 * `create_phantom_invoice` can now be used in `no-std` (lightningdevkit#1985).
 * The required final CLTV delta on inbound payments is now configurable (lightningdevkit#1878).
 * bitcoind RPC error code and message are now surfaced in `block-sync` (lightningdevkit#2057).
 * Get `historical_estimated_channel_liquidity_probabilities` was added (lightningdevkit#1961).
 * `ChannelManager::fail_htlc_backwards_with_reason` was added (lightningdevkit#1948).
 * Macros which implement serialization using TLVs or straight writing of struct fields are now public (lightningdevkit#1823, lightningdevkit#1976, lightningdevkit#1977).

Backwards Compatibility
=======================
 * Any inbound payments with a custom final CLTV delta will be rejected by LDK if you downgrade prior to receipt (lightningdevkit#1878).
 * `Event::PaymentPathFailed::network_update` will always be `None` if an 0.0.114-generated event is read by a prior version of LDK (lightningdevkit#2043).
 * `Event::PaymentPathFailed::all_paths_removed` will always be false if an 0.0.114-generated event is read by a prior version of LDK. Users who rely on it to determine payment retries should migrate to `Event::PaymentFailed`, in a separate release prior to upgrading to LDK 0.0.114, if downgrading is supported (lightningdevkit#2043).

Performance Improvements
========================
 * Channel data is now stored per-peer and channel updates across multiple peers can be operated on simultaneously (lightningdevkit#1507).
 * Routefinding is roughly 1.5x faster (lightningdevkit#1799).
 * Deserializing a `NetworkGraph` is roughly 6x faster (lightningdevkit#2016).
 * Memory usage for a `NetworkGraph` has been reduced substantially (lightningdevkit#2040).
 * `KeysInterface::get_secure_random_bytes` is roughly 200x faster (lightningdevkit#1974).

Bug Fixes
=========
 * Fixed a bug where a delay in processing a `PaymentSent` event longer than the time taken to persist a `ChannelMonitor` update, when occurring immediately prior to a crash, may result in the `PaymentSent` event being lost (lightningdevkit#2048).
 * Fixed spurious rejections of rapid gossip sync data when the graph has been updated by other means between gossip syncs (lightningdevkit#2046).
 * Fixed a panic in `KeysManager` when the high bit of `starting_time_nanos` is set (lightningdevkit#1935).
 * Resolved an issue where the `ChannelManager::get_persistable_update_future` future would fail to wake until a second notification occurs (lightningdevkit#2064).
 * Resolved a memory leak when using `ChannelManager::send_probe` (lightningdevkit#2037).
 * Fixed a deadlock on some platforms at least when using async `ChannelMonitor` updating (lightningdevkit#2006).
 * Removed debug-only assertions which were reachable in threaded code (lightningdevkit#1964).
 * In some cases when payment sending fails on our local channel, retries no longer take the same path and thus never succeed (lightningdevkit#2014).
 * Retries for spontaneous payments have been fixed (lightningdevkit#2002).
 * Return an `Err` if `lightning-persister` fails to read the directory listing rather than panicking (lightningdevkit#1943).
 * `peer_disconnected` will now never be called without `peer_connected` (lightningdevkit#2035).

Security
========
0.0.114 fixes several denial-of-service vulnerabilities which are reachable from untrusted input from channel counterparties or in deployments accepting inbound connections or channels. It also fixes a denial-of-service vulnerability in rare cases in the route finding logic.
 * The number of pending un-funded channels as well as peers without funded channels is now limited to avoid denial of service (lightningdevkit#1988).
 * A second `channel_ready` message received immediately after the first could lead to a spurious panic (lightningdevkit#2071). This issue was introduced with 0conf support in LDK 0.0.107.
 * A division-by-zero issue was fixed in the `ProbabilisticScorer` if the amount being sent (including previous-hop fees) is equal to a channel's capacity while walking the graph (lightningdevkit#2072). The division-by-zero was introduced with historical data tracking in LDK 0.0.112.

In total, this release features 130 files changed, 21457 insertions, 10113 deletions in 343 commits from 18 authors, in alphabetical order:
 * Alec Chen
 * Allan Douglas R. de Oliveira
 * Andrei
 * Arik Sosman
 * Daniel Granhão
 * Duncan Dean
 * Elias Rohrer
 * Jeffrey Czyz
 * John Cantrell
 * Kurtsley
 * Matt Corallo
 * Max Fang
 * Omer Yacine
 * Valentine Wallace
 * Viktor Tigerström
 * Wilmer Paulino
 * benthecarman
 * jurvis
May want to hold this one a little bit, because some pending PRs may break/conflict and it's easier to fix here than there.