Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stabilize some banking stage tests #6251

Merged
merged 8 commits into from
Oct 16, 2019

Conversation

ryoqun
Copy link
Member

@ryoqun ryoqun commented Oct 7, 2019

Problem

Various banking stage tests are fragile.

Summary of Changes

Fixed several race conditions in tests.

This fixes 3 unstable unit tests. For all of them, the root cause of unstability is a race condition between solana-poh-service-tick_producer thread and solana-banking-stage-tx thread. So, to precisely control the timings of solana-poh-service-tick_producer thread, I introduced new PohConfig field just for testing purpose.

The requested explanations for each tests follow (in simpler-hardest order):

  • test_banking_stage_tick
    • Problem: Even if we waited for 600ms, it's possible for ticks not to be produced enough when the ticker thread is stalled under heavy system load.
    • Solution: Always ensure enough ticks are produced (in this pull request, produce 6 ticks)
  • test_banking_stage_entryfication (originally mentioned by the issue)
    • Problem: Right in the middle of solana-banking-stage-tx's processing of received packets, the thread may be paused and the solana-poh-service-tick_producer thread starts kicking in. If that is the case and new tick entry unfortunately knocks the maximum tick height, it interrupts the in-flight processing of the packets. Depending on the exact timing, the banking stage interrupt results in MaxHeightReached error from PohRecorder when banking stage tries to record a transaction entry or simply enqueuing code path of (unprocessed_packets=>buffered_packets). Then, at test code, this results in an infinite loop at HERE (btw, this loop is itself a bit off..., should fix it too?), and eventual timeout on the CI, vomitting an odd RocksDB error.
    • Solution: Prevent slots from becoming full merely with tick PoH entries.
  • test_banking_stage_entries_only
    • Problem: In addition to the problem and solution of test_banking_stage_entryfication. There are additional issues. First, assertion may be run before banking stage finishes. Second, the wall time resulted from for _ in 0..10 + sleep(Duration::from_millis(200)); may not be enough.
    • For the two additional issue, do simple fix: reorder banking_stage.join().unwrap(); up in the test code and replace the 10-time repetition with unbounded looping also with guard for empty entries from entry_receiver

(First starting from #5660, I ended up with spotting other similar race conditions in neighbor tests, so fixed them as well.)

Fixes #5660

@@ -1128,18 +1140,20 @@ mod tests {
let bank = Bank::new(&genesis_block);
bank.process_transaction(&fund_tx).unwrap();
//receive entries + ticks
for _ in 0..10 {
loop {
let entries: Vec<Entry> = entry_receiver
.iter()
.map(|(_bank, (entry, _tick_height))| entry)
.collect();

assert!(entries.verify(&blockhash));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this assertion should be moved inside the newly-added empty guard if clause.

@@ -1031,7 +1038,6 @@ mod tests {
vote_receiver,
);
trace!("sending bank");
sleep(Duration::from_millis(600));
Copy link
Member Author

@ryoqun ryoqun Oct 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because poh_config's life time is controlled by the target_tick_count, this is no longer needed.

bank.process_transactions(&entry.transactions)
.iter()
.for_each(|x| assert_eq!(*x, Ok(())));
if !entries.is_empty() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under high load, I observed empty entries are returned.

if poh_config.target_tick_count.is_none() {
Self::sleepy_tick_producer(poh_recorder, &poh_config, &poh_exit_);
} else {
Self::short_lived_tick_producer(poh_recorder, &poh_config, &poh_exit_);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming: or short_lived_sleepy_tick_producer.. (too long?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like descriptive names 👍

@mvines mvines requested a review from carllin October 7, 2019 21:10
@mvines mvines added the CI Pull Request is ready to enter CI label Oct 7, 2019
@solana-grimes solana-grimes removed the CI Pull Request is ready to enter CI label Oct 7, 2019
) -> (
Arc<AtomicBool>,
Arc<Mutex<PohRecorder>>,
PohService,
Receiver<WorkingBankEntry>,
) {
let exit = Arc::new(AtomicBool::new(false));
let poh_config = Arc::new(PohConfig::default());
let poh_config = if let Some(poh_config) = poh_config {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm sure there is better way to write in more idiomatic Rust way.

Copy link
Contributor

@carllin carllin Oct 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe let poh_config = Arc::new(poh_config.unwrap_or(PohConfig::default())) 😄

@carllin
Copy link
Contributor

carllin commented Oct 8, 2019

@ryoqun, thanks for taking this on! In the Problem description of this PR, could you describe in greater detail the race condition that is making the test fragile? It would better frame the discussion for this change.

@ryoqun
Copy link
Member Author

ryoqun commented Oct 9, 2019

@carllin Thanks for checking this out! Sure thing, I'll elaborate it! I'll work on to finish this up tomorrow's work-time in JST. :)

@ryoqun ryoqun marked this pull request as ready for review October 10, 2019 21:43
@ryoqun
Copy link
Member Author

ryoqun commented Oct 10, 2019

@carllin Sorry for being late... I added the details of race conditions. Hope that clarifies what I wanted to fix... Also updated this PR with your comments!

@carllin carllin added the CI Pull Request is ready to enter CI label Oct 11, 2019
@solana-grimes solana-grimes removed the CI Pull Request is ready to enter CI label Oct 11, 2019
@@ -1020,8 +1021,10 @@ mod tests {
let blocktree = Arc::new(
Blocktree::open(&ledger_path).expect("Expected to be able to open database ledger"),
);
let mut poh_config = PohConfig::default();
poh_config.target_tick_count = Some(6);
Copy link
Contributor

@carllin carllin Oct 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of this test is to check that the number of ticks output is genesis_block.ticks_per_slot, no matter how many extra ticks are sent by the PohService. For clarity, instead of the magic number 6, can we write this as genesis_block.ticks_per_slot + num_extra_ticks for some number num_extra_ticks?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure thing! done at a9138d1

Copy link
Contributor

@carllin carllin Oct 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were right 😄 for the same reasons as you found in the other cases listed below, this should also use bank.max_tick_height() instead of genesis_block.ticks_per_slot for clarity and consistency

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! done cef4770

@@ -1212,8 +1220,11 @@ mod tests {
Blocktree::open(&ledger_path)
.expect("Expected to be able to open database ledger"),
);
let mut poh_config = PohConfig::default();
// limit the tick to 1 to prevent clearing working_bank at PohRecord then PohRecorderError(MaxHeightReached) at BankingStage
poh_config.target_tick_count = Some(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, quite a subtle bug, especially for your first time through the code base :)

Instead of the magic number 1, can we use something like genesis_block.ticks_per_slot - 1, to make it clear that there's nothing special about this number, it just needs to be < genesis_block.ticks_per_slot

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure thing! done at a9138d1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I couldn't directly use genesis_block.ticks_per_slot because subtracting -1 from it still causes the race condition, so I chose different one: b07bf58

Copy link
Contributor

@carllin carllin Oct 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryoqun, ah yes, it's because of a special case where slot 0 generates one less tick than all the other slots. For instance, if genesis_block.ticks_per_slot is equal to 10, then the first slot will have bank.max_tick_height of 9, so PoH will, starting at 0, generate a tick for tick_heights 0->1, 1->2, 2->3...8->9, which is only 9 ticks. The next slot, slot 1, will then generate 10 ticks because it's max tick height is 19, and it will start generating a tick for tick heights 9->10, 10->11, ... 18 -> 19. This is all because in the past we wanted to keep the first tick in a slot as 0-indexed.

This open PR here actually will address this issue: #6263, if you are curious and want to take a look :D

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for very detailed explanation! I'll look at it. :D

@@ -1069,8 +1071,11 @@ mod tests {
let blocktree = Arc::new(
Blocktree::open(&ledger_path).expect("Expected to be able to open database ledger"),
);
let mut poh_config = PohConfig::default();
// limit the tick to 1 to prevent clearing working_bank at PohRecord then PohRecorderError(MaxHeightReached) at BankingStage
poh_config.target_tick_count = Some(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of the magic number 1, can we use something like genesis_block.ticks_per_slot - 1, to make it clear that there's nothing special about this number, it just needs to be < genesis_block.ticks_per_slot

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure thing! done at a9138d1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I couldn't directly use genesis_block.ticks_per_slot because subtracting -1 from it still causes the race condition, so I chose different one: b07bf58

@carllin
Copy link
Contributor

carllin commented Oct 11, 2019

@ryoqun, looking good! Just a few nits, and a little more plumbing to make CI pass :)

}
Blocktree::destroy(&ledger_path).unwrap();
}

#[test]
#[ignore]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay!

@ryoqun ryoqun requested a review from carllin October 12, 2019 08:12
@carllin carllin added the CI Pull Request is ready to enter CI label Oct 12, 2019
@solana-grimes solana-grimes removed the CI Pull Request is ready to enter CI label Oct 12, 2019
@codecov
Copy link

codecov bot commented Oct 13, 2019

Codecov Report

Merging #6251 into master will decrease coverage by 13.5%.
The diff coverage is 85.1%.

@@            Coverage Diff            @@
##           master   #6251      +/-   ##
=========================================
- Coverage    77.9%   64.4%   -13.6%     
=========================================
  Files         216     216              
  Lines       40601   49205    +8604     
=========================================
+ Hits        31654   31717      +63     
- Misses       8947   17488    +8541

@mvines mvines added the CI Pull Request is ready to enter CI label Oct 14, 2019
@solana-grimes solana-grimes removed the CI Pull Request is ready to enter CI label Oct 14, 2019
@mvines
Copy link
Member

mvines commented Oct 14, 2019

@carllin - does this look good to land? It's green

@mvines mvines added the CI Pull Request is ready to enter CI label Oct 16, 2019
@solana-grimes solana-grimes removed the CI Pull Request is ready to enter CI label Oct 16, 2019
@carllin carllin merged commit e267dfa into solana-labs:master Oct 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unignore flaky test_banking_stage_entryfication
4 participants