Expand lockorder testing to look at mutexes, not specific instances #1420

Merged 4 commits into lightningdevkit:main on Jul 21, 2022

Conversation

TheBlueMatt
Collaborator

Our existing lockorder inversion checks look at specific instances
of mutexes rather than the general mutex itself. This changes that
behavior to look at the instruction pointer at which a mutex was
created and treat all mutexes which were created at the same
location as equivalent.

This allows us to detect lockorder inversions which occur across
tests, though it does substantially reduce parallelism during test
runs.
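
To illustrate the approach, here is a minimal sketch with hypothetical names (not the actual debug_sync code): lock metadata lives in a global registry keyed by the construction location, so every mutex created at the same call site resolves to the same metadata entry and participates in the same lockorder graph.

use std::collections::HashMap;
use std::sync::{Arc, Mutex as StdMutex};

// Hypothetical metadata shared by every mutex constructed at the same location.
struct LockMetadata {
	lock_idx: u64,
	construction_site: String,
}

// Global-registry lookup: the key is the construction site (e.g. "file:line:col"),
// so two mutexes created on the same line share one LockMetadata and are treated
// as the same lock when checking for lockorder inversions.
fn get_or_create_metadata(
	registry: &StdMutex<HashMap<String, Arc<LockMetadata>>>,
	construction_site: String,
) -> Arc<LockMetadata> {
	let mut map = registry.lock().unwrap();
	let next_idx = map.len() as u64;
	Arc::clone(map.entry(construction_site.clone()).or_insert_with(||
		Arc::new(LockMetadata { lock_idx: next_idx, construction_site })))
}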

@codecov-commenter

codecov-commenter commented Apr 12, 2022

Codecov Report

Merging #1420 (2b2863e) into main (2a3bf03) will increase coverage by 1.02%.
The diff coverage is 95.83%.

❗ Current head 2b2863e differs from pull request most recent head ff20203. Consider uploading reports for the commit ff20203 to get more accurate results

@@            Coverage Diff             @@
##             main    #1420      +/-   ##
==========================================
+ Coverage   90.87%   91.89%   +1.02%     
==========================================
  Files          80       80              
  Lines       44569    50132    +5563     
  Branches    44569    50132    +5563     
==========================================
+ Hits        40500    46069    +5569     
+ Misses       4069     4063       -6     
Impacted Files Coverage Δ
lightning/src/debug_sync.rs 95.58% <93.18%> (+0.78%) ⬆️
lightning/src/chain/channelmonitor.rs 90.96% <97.01%> (+<0.01%) ⬆️
lightning/src/ln/functional_test_utils.rs 96.81% <100.00%> (+1.57%) ⬆️
lightning/src/ln/functional_tests.rs 98.20% <100.00%> (+1.09%) ⬆️
lightning/src/ln/peer_handler.rs 57.25% <100.00%> (+0.18%) ⬆️
lightning/src/ln/reorg_tests.rs 100.00% <100.00%> (ø)
lightning/src/ln/channel.rs 88.78% <0.00%> (+0.02%) ⬆️
lightning-background-processor/src/lib.rs 95.81% <0.00%> (+0.61%) ⬆️
lightning/src/routing/scoring.rs 96.98% <0.00%> (+0.88%) ⬆️
lightning/src/chain/onchaintx.rs 94.90% <0.00%> (+0.92%) ⬆️
... and 7 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@TheBlueMatt
Collaborator Author

Hmm, may need to be smarter about how we determine which stack frame is the "mutex construction" one.

@TheBlueMatt marked this pull request as draft on April 12, 2022 21:21
In the next commit we add lockorder testing based on the line each
mutex was created on rather than the particular mutex instance.
This causes some additional test failures because of lockorder
inversions for the same mutex across different tests, which is
fixed here.
@TheBlueMatt force-pushed the 2022-04-moar-lockorder branch 2 times, most recently from 457f4d6 to 52aa726 on July 13, 2022 20:04
@TheBlueMatt force-pushed the 2022-04-moar-lockorder branch 12 times, most recently from d8562f0 to 0104080 on July 14, 2022 01:01
@TheBlueMatt marked this pull request as ready for review on July 14, 2022 01:01
@TheBlueMatt
Collaborator Author

OOOOOOoooooooooooo-KKKKKKkkkkkkkkkk, so this actually works on not-my-computer platforms now, would be great to land it!

@TheBlueMatt force-pushed the 2022-04-moar-lockorder branch 2 times, most recently from bca44d0 to 1d0b473 on July 15, 2022 19:55
@TheBlueMatt
Collaborator Author

Had to rewrite a bit to address issues that turned up once I fixed the issue where I wasn't even inserting the lock metadata into the map.

@TheBlueMatt force-pushed the 2022-04-moar-lockorder branch 2 times, most recently from 520a00a to 919c11a on July 16, 2022 20:17
@TheBlueMatt force-pushed the 2022-04-moar-lockorder branch 6 times, most recently from 1668072 to e45bb7b on July 16, 2022 22:14
@wpaulino (Contributor) left a comment

Looks good, feel free to squash.

@TheBlueMatt force-pushed the 2022-04-moar-lockorder branch from e3338f7 to f14a324 on July 18, 2022 21:49
@TheBlueMatt
Collaborator Author

Squashed + added one more fixup. Needs a second reviewer still.

Comment on lines 79 to 72
impl LockDep {
	/// Note that `Backtrace::new()` is rather expensive so we rely on the caller to fill in the
	/// `lockdep_trace` field after ensuring we need it.
	fn new_without_bt(lock: &Arc<LockMetadata>) -> Self {
		Self { lock: Arc::clone(lock), lockdep_trace: None }
	}
}
impl PartialEq for LockDep {
	fn eq(&self, o: &LockDep) -> bool { self.lock.lock_idx == o.lock.lock_idx }
}
impl Eq for LockDep {}
impl std::hash::Hash for LockDep {
	fn hash<H: std::hash::Hasher>(&self, hasher: &mut H) { hasher.write_u64(self.lock.lock_idx); }
}
Contributor

Could we avoid the need for all this by using a HashMap instead of a HashSet for locked_before?

Collaborator Author

I pushed a commit to do this, but I'm not convinced it's the right design. Ultimately we do have a set here - it happens to be unique by some index rather than a default PartialEq, but so what. We don't really have a map, but we can mangle the set into a map if we do this.

Contributor

Ah, I guess the thought was you could get rid of new_without_bt, too, since you don't need to create a LockDep until you are ready to insert it. Well, the Option within LockDep, at least, as you'd still want a constructor.

Collaborator Author

Ah, good point, yea, I guess that's worth it.
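
For what it's worth, a rough sketch of the HashMap direction discussed above (type and field names assumed for illustration, not taken from the final PR): keying locked_before by lock_idx removes the need for the manual PartialEq/Eq/Hash impls, and because the LockDep is only built when the entry is actually inserted, the backtrace no longer needs to be an Option.

use std::collections::HashMap;
use std::sync::Arc;
use backtrace::Backtrace;

// Assumed, simplified shapes for illustration only.
struct LockMetadata { lock_idx: u64 }
struct LockDep {
	lock: Arc<LockMetadata>,
	// Backtrace::new() is expensive, so it is only captured when the dependency
	// edge is genuinely new and the LockDep is actually constructed.
	lockdep_trace: Backtrace,
}

fn insert_dependency(locked_before: &mut HashMap<u64, LockDep>, other: &Arc<LockMetadata>) {
	// The map key plays the role the manual Hash/PartialEq impls played for the
	// HashSet version: uniqueness by lock_idx.
	locked_before.entry(other.lock_idx).or_insert_with(|| LockDep {
		lock: Arc::clone(other),
		lockdep_trace: Backtrace::new(),
	});
}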

Comment on lines +105 to +93
if found_debug_sync {
	if let Some(col) = symbol.colno() {
		return format!("{}:{}:{}", symbol.filename().unwrap().display(), symbol.lineno().unwrap(), col);
	} else {
		// Windows debug symbols don't support column numbers, so fall back to
		// line numbers only if no `colno` is available
		return format!("{}:{}", symbol.filename().unwrap().display(), symbol.lineno().unwrap());
	}
}
Contributor

Could this be moved outside of and before the enclosing if, which could then be eliminated? We shouldn't need the regex match to fail first once found_debug_sync is true, right?

if found_debug_sync {
    // ...
}

let symbol_name = symbol.name().unwrap().as_str().unwrap();
found_debug_sync = sync_mutex_constr_regex.is_match(symbol_name);

Contributor

I don't think so. If a call to LockMetadata::new doesn't get inlined (as done in the debug_sync tests), we'll have multiple frames that can match found_debug_sync, i.e., one frame for Mutex::new and another for its underlying LockMetadata::new. We want to go up the call stack until we find the last match, rather than the first one.

Contributor

I don't think so. If a call to LockMetadata::new doesn't get inlined (as done in the debug_sync tests), we'll have multiple frames that can match found_debug_sync, i.e., one frame for Mutex::new and another for its underlying LockMetadata::new.

Ok, I read the backtrace docs and have a better understanding of how this works now. I feel like there may be a more succinct way of doing this with Iterator::position that might be more readable, but then again maybe not lol.

We want to go up the call stack until we find the last match, rather than the first one.

Hmm... if I read the docs correctly, the first item is the top of the call stack so iterating through in order would be moving down the call stack?

Collaborator Author

Heh, yea... so I guess "top" and "bottom" depend on what architecture you're on and whether your stack grows down or up... sorry for the lack of clarity here.

Contributor

I always think of the call stack as a logical data structure (a stack) regardless of how the architecture represents it. So a function call is added to the top of the stack, since that is the only way to add an item to a stack.

Contributor

Heh yeah I think about them the other way around but that's mostly due to previous experience with disassemblers.
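
For readers following along, a simplified reconstruction of the loop being debated, pieced together from the snippets quoted in this thread (the exact upstream code may differ slightly, and the colno handling shown above is omitted for brevity): the walk starts at the innermost frame, frames inside the backtrace crate are skipped because found_debug_sync is still false, then one or more debug_sync constructor frames (Mutex::new and its LockMetadata::new) set the flag, and the first subsequent non-matching frame is the caller, i.e. the mutex construction site.

fn get_construction_location(backtrace: &backtrace::Backtrace) -> String {
	let sync_mutex_constr_regex = regex::Regex::new(r"lightning.*debug_sync.*new").unwrap();
	let mut found_debug_sync = false;
	for frame in backtrace.frames() {
		for symbol in frame.symbols() {
			let symbol_name = symbol.name().unwrap().as_str().unwrap();
			if !sync_mutex_constr_regex.is_match(symbol_name) {
				if found_debug_sync {
					// The previous frame(s) were the debug_sync constructors; this one
					// is whoever called them, i.e. the mutex construction site.
					return format!("{}:{}",
						symbol.filename().unwrap().display(), symbol.lineno().unwrap());
				}
			} else { found_debug_sync = true; }
		}
	}
	panic!("Couldn't find the mutex construction site in the backtrace");
}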

@TheBlueMatt force-pushed the 2022-04-moar-lockorder branch 2 times, most recently from 25c3dba to 6d54ebb on July 19, 2022 20:54

Comment on lines 97 to 98
// that as the mutex construction site. Note that the first few frames may be in
// `backtrace`, so we have to ignore those.
Contributor

I don't quite follow the last sentence. What is "first few frames" referring to? When wouldn't the first few frames be in backtrace?

Collaborator Author

Dunno, but on my machine locally there are no reported frames in backtrace, the first one is directly the site that calls Backtrace::new, so maybe they're filtering those on some platforms.

Contributor

Ok, I think the confusion is that I read "in backtrace" as "some BacktraceFrame returned by Backtrace::frames", whereas you mean "frames corresponding to calls on the backtrace object". Maybe just s/in/on?

Collaborator Author

Hmm, but it's not on a backtrace "object", it's "in the backtrace crate"? I said that instead.

@TheBlueMatt force-pushed the 2022-04-moar-lockorder branch from 6d54ebb to 2b2863e on July 20, 2022 19:04
jkczyz previously approved these changes Jul 20, 2022
@wpaulino (Contributor) left a comment

Ready to land once squashed.

@TheBlueMatt
Collaborator Author

Added one more fixup commit with a wording tweak, can squash after that.

jkczyz previously approved these changes Jul 20, 2022
When we add lockorder detection based on mutex construction site
rather than mutex instance in the next commit, ChannelMonitor's
PartialEq implementation causes spurious failures. This is caused
by the lockorder detection logic considering the ChannelMonitor
inner mutex to be two distinct mutexes - one when monitors are
deserialized and one when monitors are created fresh. Instead, we
attempt to tell the lockorder detection logic that they are the
same by ensuring they're constructed in the same place - in this
case a util method.
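
A hedged sketch of the pattern this commit describes (the names below are simplified stand-ins, not the real channelmonitor API): both construction paths funnel the inner Mutex creation through one helper, so construction-site-based lockorder tracking sees a single lock.

use std::sync::Mutex;

// Hypothetical, simplified stand-ins for the real types.
struct ChannelMonitorImpl;
struct ChannelMonitor { inner: Mutex<ChannelMonitorImpl> }

impl ChannelMonitor {
	// The one and only place the inner mutex is constructed. Both fresh creation
	// and deserialization go through here, so construction-site-based lockorder
	// tracking treats every ChannelMonitor's inner lock as the same mutex.
	fn from_impl(imp: ChannelMonitorImpl) -> Self {
		ChannelMonitor { inner: Mutex::new(imp) }
	}

	fn new() -> Self { Self::from_impl(ChannelMonitorImpl) }

	fn read_from_disk() -> Self {
		// Deserialization path (hypothetical name) also funnels through from_impl.
		Self::from_impl(ChannelMonitorImpl)
	}
}
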
Our existing lockorder inversion checks look at specific instances
of mutexes rather than the general mutex itself. This changes that
behavior to look at the instruction pointer at which a mutex was
created and treat all mutexes which were created at the same
location as equivalent.

This allows us to detect lockorder inversions which occur across
tests, though it does substantially reduce parallelism during test
runs.
@TheBlueMatt
Collaborator Author

Went ahead and squashed:

$ git diff-tree -U1 2b2863e ff202037
diff --git a/lightning/src/debug_sync.rs b/lightning/src/debug_sync.rs
index 81e508909..daec309e9 100644
--- a/lightning/src/debug_sync.rs
+++ b/lightning/src/debug_sync.rs
@@ -77,3 +77,3 @@ fn get_construction_location(backtrace: &Backtrace) -> String {
 	// that as the mutex construction site. Note that the first few frames may be in
-	// `backtrace`, so we have to ignore those.
+	// the `backtrace` crate, so we have to ignore those.
 	let sync_mutex_constr_regex = regex::Regex::new(r"lightning.*debug_sync.*new").unwrap();

@TheBlueMatt merged commit 834fe63 into lightningdevkit:main on Jul 21, 2022