
Loop over confirm_slot until no more entries or slot completed #28535

Closed
wants to merge 2 commits

Conversation

@buffalu (Contributor) commented Oct 21, 2022

Problem

  • while confirm_slot is running, more shreds may have been received and inserted into blockstore.
  • we should process those immediately

Summary of Changes

  • loop over confirm_slot until end of slot or no more shreds
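
For illustration, a minimal self-contained Rust sketch of the intended control flow; confirm_slot_once is a simplified, hypothetical stand-in for the real confirm_slot signature, not the actual replay_stage code:

// Stand-in for confirm_slot (hypothetical simplification): replays whatever
// entries are currently in blockstore for this slot and returns true only if
// it found entries and the slot is still incomplete.
fn confirm_slot_once() -> bool {
    false // pretend blockstore had nothing new / the slot is finished
}

fn replay_active_slot() {
    // more entries may have been received while replaying this slot; keep
    // calling confirm_slot until the slot is complete or blockstore has
    // nothing new, instead of returning to the outer replay-stage loop
    // between batches.
    let mut more_entries_to_process = true;
    while more_entries_to_process {
        more_entries_to_process = confirm_slot_once();
    }
}

fn main() {
    replay_active_slot();
}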

Risks

  • what if a leader slowly feeds shreds to validators and keeps them stuck in confirm_slot? there doesn't seem to be any concept of time in here, so perhaps we should consider adding a timeout for how long we're willing to run this method.

@mergify bot added the 'community' (Community contribution) label Oct 21, 2022
@mergify bot requested a review from a team Oct 21, 2022 15:32
prioritization_fee_cache,
)?;

let mut more_entries_to_process = true;
Contributor Author:

could do bank.is_complete() here, but a completed slot should've been picked up after the fact on the previous iteration and not marked for replay

@apfitzge (Contributor) commented Oct 21, 2022

Do you have any stats on how much this improves replay by in prod?

what if leader slowly feeds shreds to validators and keeps them stuck in confirm_slot?

We wouldn't loop here, but what is the current behavior if the validator is slowly fed shreds?

@buffalu (Contributor, Author) commented Oct 21, 2022

Do you have any stats on how much this improves replay by in prod?

will load onto one of our mainnet machines now and respond shortly

@offerm commented Oct 21, 2022

I believe this is highly related to #27729 and the suggested fix #28069.
I'm not sure the loop provides any value once #28069 is merged.

@buffalu (Contributor, Author) commented Oct 21, 2022

I believe this is highly related to #27729 and the suggested fix #28069. I'm not sure the loop provides any value once #28069 is merged.

it would definitely help to avoid a whole other iteration of replay stage logic (why run another pass through replay stage if you can stay here and continue processing to finish earlier?), but the change you made should help too. i think they're separate but related

@buffalu (Contributor, Author) commented Oct 22, 2022

two voting validators running on the same hardware.

replay slot complete timestamps (green = new algorithm)
[screenshot]

entry poh verification elapsed (green = new algorithm)
[screenshot]

replay compute time (blue = new algorithm)
[screenshot]

replay elapsed (red = new algorithm; this is where @offerm's pr would come in the most handy i think)
[screenshot]

tx verify time (light pink = new algorithm)
[screenshot]

fetch entry time (green = new algorithm)
[screenshot]

entry poh verification over a long time period (green = new algo); kinda surprised how much of a difference it makes here
[screenshot]

// more entries may have been received while replaying this slot.
// looping over this ensures that slots will be processed as fast as possible with the
// lowest latency.
while more_entries_to_process {
Member:

I think we might want some kind of exit out of here based on timing as a safeguard. Otherwise we are potentially not voting or not starting a leader slot for a long time, which wouldn't be great.

@carllin wdyt?

Contributor Author:

i agree that'd be the safest thing to do, could limit to 400-ish ms. worst case exec takes longer than expected and we just hit another iteration of replay before hitting this again

Contributor:

Sleep may not be necessary, worst case we just hit one more iteration of reading from blockstore before returning here right:

if slot_entries_load_result.0.is_empty() {
    return Ok(false);
}

Contributor Author:

i don't think sleep is the right word, more like:

let start = Instant::now();
while more_entries_to_process && start.elapsed() < Duration::from_millis(TIMEOUT) {
    confirm_slot(...);
}

Contributor:

hmmm yeah that works, should prevent a bad leader from DoSing us by continually streaming new shreds

Contributor:

Could even probably warn if it hits that timeout
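
Putting the timeout and the warning together, a minimal self-contained sketch of what that safeguard could look like; the 400 ms bound, confirm_slot_once, and the plain eprintln-based warning are stand-ins, not the actual replay_stage API:

use std::time::{Duration, Instant};

// Hypothetical upper bound on how long one pass is allowed to keep looping
// (roughly one slot, per the 400-ish ms suggestion above).
const CONFIRM_SLOT_LOOP_TIMEOUT: Duration = Duration::from_millis(400);

// Stand-in for confirm_slot: returns true if entries were found and replayed.
fn confirm_slot_once() -> bool {
    false
}

fn replay_active_slot_with_timeout() {
    let start = Instant::now();
    let mut more_entries_to_process = true;
    // Loop while there is work, but bail out (and warn) once the timeout
    // elapses so a leader trickling shreds can't keep us from voting or
    // starting our own leader slot.
    while more_entries_to_process {
        if start.elapsed() >= CONFIRM_SLOT_LOOP_TIMEOUT {
            eprintln!(
                "warn: confirm_slot loop exceeded {:?}, yielding back to replay stage",
                CONFIRM_SLOT_LOOP_TIMEOUT
            );
            break;
        }
        more_entries_to_process = confirm_slot_once();
    }
}

fn main() {
    replay_active_slot_with_timeout();
}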

@github-actions bot added the 'stale' label ([bot only] Added to stale content; results in auto-close after a week) Dec 29, 2022
@github-actions bot closed this Jan 6, 2023