Expand lockorder testing to look at mutexes, not specific instances #1420

Merged 4 commits into lightningdevkit:main on Jul 21, 2022

Conversation

TheBlueMatt
Collaborator

Our existing lockorder inversion checks look at specific instances
of mutexes rather than the general mutex itself. This changes that
behavior to look at the instruction pointer at which a mutex was
created and treat all mutexes which were created at the same
location as equivalent.

This allows us to detect lockorder inversions which occur across
tests, though it does substantially reduce parallelism during test
runs.
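
To illustrate the approach, here is a minimal sketch with hypothetical names (not the actual debug_sync code): lock metadata lives in a global registry keyed by the construction location, so every mutex created at the same call site resolves to the same metadata entry and participates in the same lockorder graph.

use std::collections::HashMap;
use std::sync::{Arc, Mutex as StdMutex};

// Hypothetical metadata shared by every mutex constructed at the same location.
struct LockMetadata {
	lock_idx: u64,
	construction_site: String,
}

// Global-registry lookup: the key is the construction site (e.g. "file:line:col"),
// so two mutexes created on the same line share one LockMetadata and are treated
// as the same lock when checking for lockorder inversions.
fn get_or_create_metadata(
	registry: &StdMutex<HashMap<String, Arc<LockMetadata>>>,
	construction_site: String,
) -> Arc<LockMetadata> {
	let mut map = registry.lock().unwrap();
	let next_idx = map.len() as u64;
	Arc::clone(map.entry(construction_site.clone()).or_insert_with(||
		Arc::new(LockMetadata { lock_idx: next_idx, construction_site })))
}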

@codecov-commenter

codecov-commenter commented Apr 12, 2022

Codecov Report

Merging #1420 (2b2863e) into main (2a3bf03) will increase coverage by 1.02%.
The diff coverage is 95.83%.

❗ Current head 2b2863e differs from pull request most recent head ff20203. Consider uploading reports for the commit ff20203 to get more accurate results

@@            Coverage Diff             @@
##             main    #1420      +/-   ##
==========================================
+ Coverage   90.87%   91.89%   +1.02%     
==========================================
  Files          80       80              
  Lines       44569    50132    +5563     
  Branches    44569    50132    +5563     
==========================================
+ Hits        40500    46069    +5569     
+ Misses       4069     4063       -6     
Impacted Files Coverage Δ
lightning/src/debug_sync.rs 95.58% <93.18%> (+0.78%) ⬆️
lightning/src/chain/channelmonitor.rs 90.96% <97.01%> (+<0.01%) ⬆️
lightning/src/ln/functional_test_utils.rs 96.81% <100.00%> (+1.57%) ⬆️
lightning/src/ln/functional_tests.rs 98.20% <100.00%> (+1.09%) ⬆️
lightning/src/ln/peer_handler.rs 57.25% <100.00%> (+0.18%) ⬆️
lightning/src/ln/reorg_tests.rs 100.00% <100.00%> (ø)
lightning/src/ln/channel.rs 88.78% <0.00%> (+0.02%) ⬆️
lightning-background-processor/src/lib.rs 95.81% <0.00%> (+0.61%) ⬆️
lightning/src/routing/scoring.rs 96.98% <0.00%> (+0.88%) ⬆️
lightning/src/chain/onchaintx.rs 94.90% <0.00%> (+0.92%) ⬆️
... and 7 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@TheBlueMatt
Collaborator Author

Hmm, may need to be smarter about how we determine which stack frame is the "mutex construction" one.

@TheBlueMatt marked this pull request as draft on April 12, 2022 21:21
In the next commit we add lockorder testing based on the line each
mutex was created on rather than the particular mutex instance.
This causes some additional test failures because of lockorder
inversions for the same mutex across different tests, which is
fixed here.
@TheBlueMatt force-pushed the 2022-04-moar-lockorder branch 2 times, most recently from 457f4d6 to 52aa726 on July 13, 2022 20:04
@TheBlueMatt force-pushed the 2022-04-moar-lockorder branch 12 times, most recently from d8562f0 to 0104080 on July 14, 2022 01:01
@TheBlueMatt marked this pull request as ready for review on July 14, 2022 01:01
@TheBlueMatt
Collaborator Author

OOOOOOoooooooooooo-KKKKKKkkkkkkkkkk, so this actually works on not-my-computer platforms now, would be great to land it!

@TheBlueMatt force-pushed the 2022-04-moar-lockorder branch 2 times, most recently from bca44d0 to 1d0b473 on July 15, 2022 19:55
@TheBlueMatt
Collaborator Author

Had to rewrite a bit to address issues that turned up once I fixed the issue where I wasn't even inserting the lock metadata into the map.

@TheBlueMatt force-pushed the 2022-04-moar-lockorder branch 2 times, most recently from 520a00a to 919c11a on July 16, 2022 20:17
@TheBlueMatt force-pushed the 2022-04-moar-lockorder branch 6 times, most recently from 1668072 to e45bb7b on July 16, 2022 22:14
@wpaulino (Contributor) left a comment

Looks good, feel free to squash.

@TheBlueMatt force-pushed the 2022-04-moar-lockorder branch from e3338f7 to f14a324 on July 18, 2022 21:49
@TheBlueMatt
Collaborator Author

Squashed + added one more fixup. Needs a second reviewer still.

Comment on lines 79 to 72
impl LockDep {
	/// Note that `Backtrace::new()` is rather expensive so we rely on the caller to fill in the
	/// `lockdep_trace` field after ensuring we need it.
	fn new_without_bt(lock: &Arc<LockMetadata>) -> Self {
		Self { lock: Arc::clone(lock), lockdep_trace: None }
	}
}
impl PartialEq for LockDep {
	fn eq(&self, o: &LockDep) -> bool { self.lock.lock_idx == o.lock.lock_idx }
}
impl Eq for LockDep {}
impl std::hash::Hash for LockDep {
	fn hash<H: std::hash::Hasher>(&self, hasher: &mut H) { hasher.write_u64(self.lock.lock_idx); }
}
Contributor

Could we avoid the need for all this by using a HashMap instead of a HashSet for locked_before?

Collaborator Author

I pushed a commit to do this, but I'm not convinced it's the right design. Ultimately we do have a set here - it happens to be unique by some index rather than a default PartialEq, but so what. We don't really have a map, but we can mangle the set into a map if we do this.

Contributor

Ah, I guess the thought was you could get rid of new_without_bt, too, since you don't need to create a LockDep until you are ready to insert it. Well, the Option within LockDep, at least, as you'd still want a constructor.

Collaborator Author

Ah, good point, yea, I guess that's worth it.
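
For what it's worth, a rough sketch of the HashMap direction discussed above (type and field names assumed for illustration, not taken from the final PR): keying locked_before by lock_idx removes the need for the manual PartialEq/Eq/Hash impls, and because the LockDep is only built when the entry is actually inserted, the backtrace no longer needs to be an Option.

use std::collections::HashMap;
use std::sync::Arc;
use backtrace::Backtrace;

// Assumed, simplified shapes for illustration only.
struct LockMetadata { lock_idx: u64 }
struct LockDep {
	lock: Arc<LockMetadata>,
	// Backtrace::new() is expensive, so it is only captured when the dependency
	// edge is genuinely new and the LockDep is actually constructed.
	lockdep_trace: Backtrace,
}

fn insert_dependency(locked_before: &mut HashMap<u64, LockDep>, other: &Arc<LockMetadata>) {
	// The map key plays the role the manual Hash/PartialEq impls played for the
	// HashSet version: uniqueness by lock_idx.
	locked_before.entry(other.lock_idx).or_insert_with(|| LockDep {
		lock: Arc::clone(other),
		lockdep_trace: Backtrace::new(),
	});
}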

Comment on lines +105 to +93
if found_debug_sync {
	if let Some(col) = symbol.colno() {
		return format!("{}:{}:{}", symbol.filename().unwrap().display(), symbol.lineno().unwrap(), col);
	} else {
		// Windows debug symbols don't support column numbers, so fall back to
		// line numbers only if no `colno` is available
		return format!("{}:{}", symbol.filename().unwrap().display(), symbol.lineno().unwrap());
	}
}
Contributor

Could this be moved outside of and before the enclosing if, which could then be eliminated? We shouldn't need the regex match to fail first once found_debug_sync is true, right?

if found_debug_sync {
    // ...
}

let symbol_name = symbol.name().unwrap().as_str().unwrap();
found_debug_sync = sync_mutex_constr_regex.is_match(symbol_name);

Contributor

I don't think so. If a call to LockMetadata::new doesn't get inlined (as done in the debug_sync tests), we'll have multiple frames that can match found_debug_sync, i.e., one frame for Mutex::new and another for its underlying LockMetadata::new. We want to go up the call stack until we find the last match, rather than the first one.

Contributor

I don't think so. If a call to LockMetadata::new doesn't get inlined (as done in the debug_sync tests), we'll have multiple frames that can match found_debug_sync, i.e., one frame for Mutex::new and another for its underlying LockMetadata::new.

Ok, I read the backtrace docs and have a better understanding of how this works now. I feel like there may be a more succinct way of doing this with Iterator::position that might be more readable, but then again maybe not lol.

We want to go up the call stack until we find the last match, rather than the first one.

Hmm... if I read the docs correctly, the first item is the top of the call stack so iterating through in order would be moving down the call stack?

Collaborator Author

Heh, yea... so I guess "top" and "bottom" depend on what architecture you're on and whether your stack grows down or up... sorry for the lack of clarity here.

Contributor

I always think of the call stack as a logical data structure (a stack) regardless of how the architecture represents it. So a function call is added to the top of the stack, since that is the only way to add an item to a stack.

Contributor

Heh yeah I think about them the other way around but that's mostly due to previous experience with disassemblers.
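
For readers following along, a simplified reconstruction of the loop being debated, pieced together from the snippets quoted in this thread (the exact upstream code may differ slightly, and the colno handling shown above is omitted for brevity): the walk starts at the innermost frame, frames inside the backtrace crate are skipped because found_debug_sync is still false, then one or more debug_sync constructor frames (Mutex::new and its LockMetadata::new) set the flag, and the first subsequent non-matching frame is the caller, i.e. the mutex construction site.

fn get_construction_location(backtrace: &backtrace::Backtrace) -> String {
	let sync_mutex_constr_regex = regex::Regex::new(r"lightning.*debug_sync.*new").unwrap();
	let mut found_debug_sync = false;
	for frame in backtrace.frames() {
		for symbol in frame.symbols() {
			let symbol_name = symbol.name().unwrap().as_str().unwrap();
			if !sync_mutex_constr_regex.is_match(symbol_name) {
				if found_debug_sync {
					// The previous frame(s) were the debug_sync constructors; this one
					// is whoever called them, i.e. the mutex construction site.
					return format!("{}:{}",
						symbol.filename().unwrap().display(), symbol.lineno().unwrap());
				}
			} else { found_debug_sync = true; }
		}
	}
	panic!("Couldn't find the mutex construction site in the backtrace");
}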

@TheBlueMatt force-pushed the 2022-04-moar-lockorder branch 2 times, most recently from 25c3dba to 6d54ebb on July 19, 2022 20:54

Comment on lines 97 to 98
// that as the mutex construction site. Note that the first few frames may be in
// `backtrace`, so we have to ignore those.
Contributor

I don't quite follow the last sentence. What is "first few frames" referring to? When wouldn't the first few frames be in backtrace?

Collaborator Author

Dunno, but on my machine locally there are no reported frames in backtrace, the first one is directly the site that calls Backtrace::new, so maybe they're filtering those on some platforms.

Contributor

Ok, I think the confusion is that I read "in backtrace" as "some BacktraceFrame returned by Backtrace::frames", whereas you mean "frames corresponding to calls on the backtrace object". Maybe just s/in/on?

Collaborator Author

Hmm, but it's not on a backtrace "object", it's "in the backtrace crate"? I said that instead.

@TheBlueMatt force-pushed the 2022-04-moar-lockorder branch from 6d54ebb to 2b2863e on July 20, 2022 19:04
jkczyz previously approved these changes Jul 20, 2022
@wpaulino (Contributor) left a comment

Ready to land once squashed.

@TheBlueMatt
Collaborator Author

Added one more fixup commit with a wording tweak, can squash after that.

jkczyz previously approved these changes Jul 20, 2022
When we add lockorder detection based on mutex construction site
rather than mutex instance in the next commit, ChannelMonitor's
PartialEq implementation causes spurious failures. This is caused
by the lockorder detection logic considering the ChannelMonitor
inner mutex to be two distinct mutexes - one when monitors are
deserialized and one when monitors are created fresh. Instead, we
attempt to tell the lockorder detection logic that they are the
same by ensuring they're constructed in the same place - in this
case a util method.
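
A hedged sketch of the pattern this commit describes (the names below are simplified stand-ins, not the real channelmonitor API): both construction paths funnel the inner Mutex creation through one helper, so construction-site-based lockorder tracking sees a single lock.

use std::sync::Mutex;

// Hypothetical, simplified stand-ins for the real types.
struct ChannelMonitorImpl;
struct ChannelMonitor { inner: Mutex<ChannelMonitorImpl> }

impl ChannelMonitor {
	// The one and only place the inner mutex is constructed. Both fresh creation
	// and deserialization go through here, so construction-site-based lockorder
	// tracking treats every ChannelMonitor's inner lock as the same mutex.
	fn from_impl(imp: ChannelMonitorImpl) -> Self {
		ChannelMonitor { inner: Mutex::new(imp) }
	}

	fn new() -> Self { Self::from_impl(ChannelMonitorImpl) }

	fn read_from_disk() -> Self {
		// Deserialization path (hypothetical name) also funnels through from_impl.
		Self::from_impl(ChannelMonitorImpl)
	}
}
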
Our existing lockorder inversion checks look at specific instances
of mutexes rather than the general mutex itself. This changes that
behavior to look at the instruction pointer at which a mutex was
created and treat all mutexes which were created at the same
location as equivalent.

This allows us to detect lockorder inversions which occur across
tests, though it does substantially reduce parallelism during test
runs.
@TheBlueMatt
Collaborator Author

Went ahead and squashed:

$ git diff-tree -U1 2b2863e ff202037
diff --git a/lightning/src/debug_sync.rs b/lightning/src/debug_sync.rs
index 81e508909..daec309e9 100644
--- a/lightning/src/debug_sync.rs
+++ b/lightning/src/debug_sync.rs
@@ -77,3 +77,3 @@ fn get_construction_location(backtrace: &Backtrace) -> String {
 	// that as the mutex construction site. Note that the first few frames may be in
-	// `backtrace`, so we have to ignore those.
+	// the `backtrace` crate, so we have to ignore those.
 	let sync_mutex_constr_regex = regex::Regex::new(r"lightning.*debug_sync.*new").unwrap();

@TheBlueMatt merged commit 834fe63 into lightningdevkit:main on Jul 21, 2022