-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stabilize some banking stage tests #6251
Stabilize some banking stage tests #6251
Conversation
@@ -1128,18 +1140,20 @@ mod tests { | |||
let bank = Bank::new(&genesis_block); | |||
bank.process_transaction(&fund_tx).unwrap(); | |||
//receive entries + ticks | |||
for _ in 0..10 { | |||
loop { | |||
let entries: Vec<Entry> = entry_receiver | |||
.iter() | |||
.map(|(_bank, (entry, _tick_height))| entry) | |||
.collect(); | |||
|
|||
assert!(entries.verify(&blockhash)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this assertion should be moved inside the newly-added empty guard if clause.
@@ -1031,7 +1038,6 @@ mod tests { | |||
vote_receiver, | |||
); | |||
trace!("sending bank"); | |||
sleep(Duration::from_millis(600)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because poh_config
's life time is controlled by the target_tick_count
, this is no longer needed.
bank.process_transactions(&entry.transactions) | ||
.iter() | ||
.for_each(|x| assert_eq!(*x, Ok(()))); | ||
if !entries.is_empty() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under high load, I observed empty entries are returned.
core/src/poh_service.rs
Outdated
if poh_config.target_tick_count.is_none() { | ||
Self::sleepy_tick_producer(poh_recorder, &poh_config, &poh_exit_); | ||
} else { | ||
Self::short_lived_tick_producer(poh_recorder, &poh_config, &poh_exit_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming: or short_lived_sleepy_tick_producer
.. (too long?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like descriptive names 👍
core/src/banking_stage.rs
Outdated
) -> ( | ||
Arc<AtomicBool>, | ||
Arc<Mutex<PohRecorder>>, | ||
PohService, | ||
Receiver<WorkingBankEntry>, | ||
) { | ||
let exit = Arc::new(AtomicBool::new(false)); | ||
let poh_config = Arc::new(PohConfig::default()); | ||
let poh_config = if let Some(poh_config) = poh_config { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm sure there is better way to write in more idiomatic Rust way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe let poh_config = Arc::new(poh_config.unwrap_or(PohConfig::default()))
😄
@ryoqun, thanks for taking this on! In the |
@carllin Thanks for checking this out! Sure thing, I'll elaborate it! I'll work on to finish this up tomorrow's work-time in JST. :) |
@carllin Sorry for being late... I added the details of race conditions. Hope that clarifies what I wanted to fix... Also updated this PR with your comments! |
core/src/banking_stage.rs
Outdated
@@ -1020,8 +1021,10 @@ mod tests { | |||
let blocktree = Arc::new( | |||
Blocktree::open(&ledger_path).expect("Expected to be able to open database ledger"), | |||
); | |||
let mut poh_config = PohConfig::default(); | |||
poh_config.target_tick_count = Some(6); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of this test is to check that the number of ticks output is genesis_block.ticks_per_slot
, no matter how many extra ticks are sent by the PohService. For clarity, instead of the magic number 6, can we write this as genesis_block.ticks_per_slot + num_extra_ticks
for some number num_extra_ticks
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure thing! done at a9138d1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You were right 😄 for the same reasons as you found in the other cases listed below, this should also use bank.max_tick_height()
instead of genesis_block.ticks_per_slot
for clarity and consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! done cef4770
core/src/banking_stage.rs
Outdated
@@ -1212,8 +1220,11 @@ mod tests { | |||
Blocktree::open(&ledger_path) | |||
.expect("Expected to be able to open database ledger"), | |||
); | |||
let mut poh_config = PohConfig::default(); | |||
// limit the tick to 1 to prevent clearing working_bank at PohRecord then PohRecorderError(MaxHeightReached) at BankingStage | |||
poh_config.target_tick_count = Some(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, quite a subtle bug, especially for your first time through the code base :)
Instead of the magic number 1, can we use something like genesis_block.ticks_per_slot - 1
, to make it clear that there's nothing special about this number, it just needs to be < genesis_block.ticks_per_slot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure thing! done at a9138d1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I couldn't directly use genesis_block.ticks_per_slot because subtracting -1 from it still causes the race condition, so I chose different one: b07bf58
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ryoqun, ah yes, it's because of a special case where slot 0 generates one less tick than all the other slots. For instance, if genesis_block.ticks_per_slot
is equal to 10, then the first slot will have bank.max_tick_height
of 9
, so PoH will, starting at 0, generate a tick for tick_heights 0->1, 1->2, 2->3...8->9
, which is only 9 ticks. The next slot, slot 1, will then generate 10 ticks because it's max tick height is 19, and it will start generating a tick for tick heights 9->10, 10->11, ... 18 -> 19
. This is all because in the past we wanted to keep the first tick in a slot as 0-indexed.
This open PR here actually will address this issue: #6263, if you are curious and want to take a look :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for very detailed explanation! I'll look at it. :D
core/src/banking_stage.rs
Outdated
@@ -1069,8 +1071,11 @@ mod tests { | |||
let blocktree = Arc::new( | |||
Blocktree::open(&ledger_path).expect("Expected to be able to open database ledger"), | |||
); | |||
let mut poh_config = PohConfig::default(); | |||
// limit the tick to 1 to prevent clearing working_bank at PohRecord then PohRecorderError(MaxHeightReached) at BankingStage | |||
poh_config.target_tick_count = Some(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of the magic number 1, can we use something like genesis_block.ticks_per_slot - 1, to make it clear that there's nothing special about this number, it just needs to be < genesis_block.ticks_per_slot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure thing! done at a9138d1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I couldn't directly use genesis_block.ticks_per_slot because subtracting -1 from it still causes the race condition, so I chose different one: b07bf58
@ryoqun, looking good! Just a few nits, and a little more plumbing to make CI pass :) |
} | ||
Blocktree::destroy(&ledger_path).unwrap(); | ||
} | ||
|
||
#[test] | ||
#[ignore] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay!
Codecov Report
@@ Coverage Diff @@
## master #6251 +/- ##
=========================================
- Coverage 77.9% 64.4% -13.6%
=========================================
Files 216 216
Lines 40601 49205 +8604
=========================================
+ Hits 31654 31717 +63
- Misses 8947 17488 +8541 |
@carllin - does this look good to land? It's green |
Problem
Various banking stage tests are fragile.
Summary of Changes
Fixed several race conditions in tests.
This fixes 3 unstable unit tests. For all of them, the root cause of unstability is a race condition between solana-poh-service-tick_producer thread and solana-banking-stage-tx thread. So, to precisely control the timings of solana-poh-service-tick_producer thread, I introduced new PohConfig field just for testing purpose.
The requested explanations for each tests follow (in simpler-hardest order):
test_banking_stage_tick
test_banking_stage_entryfication
(originally mentioned by the issue)test_banking_stage_entries_only
test_banking_stage_entryfication
. There are additional issues. First, assertion may be run before banking stage finishes. Second, the wall time resulted fromfor _ in 0..10
+sleep(Duration::from_millis(200));
may not be enough.banking_stage.join().unwrap();
up in the test code and replace the 10-time repetition with unbounded looping also with guard for empty entries fromentry_receiver
(First starting from #5660, I ended up with spotting other similar race conditions in neighbor tests, so fixed them as well.)
Fixes #5660