Loop over confirm_slot until no more entries or slot completed #28535
Conversation
    prioritization_fee_cache,
)?;

let mut more_entries_to_process = true;
We could check bank.is_complete here, but completion should've been picked up after-the-fact on the previous iteration, and the slot not marked for replay.

Do you have any stats on how much this improves replay in prod?

We wouldn't loop here, but what is the current behavior if the validator is slowly fed shreds?
Will load onto one of our mainnet machines now and respond shortly.

It would definitely help to avoid a whole other iteration of replay stage logic (why run another iteration of replay stage logic if you can stay here and continue processing to finish earlier?), but the change you made should help too. I think they're separate but related.
Two voting validators running on the same hardware. Charts attached for:
- replay slot complete timestamps
- entry poh verification elapsed
- replay compute time (blue = new algorithm)
- replay elapsed (red = new algorithm; this is where @offerm's PR would come in the most handy, I think)
- tx verify time (light pink = new algorithm)
- fetch entry time (green = new algorithm)
- entry poh verification over a long time period (green = new algo); kinda surprised how much of a difference it makes here
// more entries may have been received while replaying this slot.
// looping over this ensures that slots will be processed as fast as possible with the
// lowest latency.
while more_entries_to_process {
I think we might want some kind of exit out of here based on timing as a safeguard. Otherwise we're potentially not voting, or not starting a leader slot, for a long time, which wouldn't be great.
@carllin wdyt?
I agree that'd be the safest thing to do; could limit to 400-ish ms. Worst case, execution takes longer than expected and we just hit another iteration of replay before getting back here.
Sleep may not be necessary; worst case we just hit one more iteration of reading from blockstore before returning here, right:
solana/ledger/src/blockstore_processor.rs
Lines 1151 to 1153 in 0145447
if slot_entries_load_result.0.is_empty() {
    return Ok(false);
}
i don't think sleep is the right word, more like:
let start = Instant::now();
while more_entries_to_process && start.elapsed() < Duration::from_millis(TIMEOUT) {
    confirm_slot(...);
}
Hmm, yeah, that works; should prevent a bad leader from DOSing us by continually streaming new shreds.
Could probably even warn if it hits that timeout.