
slot_has_updates should not check if slot is full (and connected) #27786

Closed
wants to merge 2 commits

Conversation

offerm

@offerm offerm commented Sep 14, 2022

This allows the replay loop to process shreds earlier. The replay loop no longer needs to wait 100ms between passes over shreds and can process them as soon as they are available.

Expected speedup of the replay loop: 100-200 ms.

Direct impact on all RPC and validator nodes.

No risk

Problem

See issue #27729

Summary of Changes

Fixes #27729

This allows the replay loop to process shreds earlier.
The replay loop no longer needs to wait 100ms between passes over shreds and can process them as soon as they are available.

Expected speedup of the replay loop: 100-200 ms.

No risk
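
For context on the 100ms figure, below is a simplified sketch of the signal-plus-timeout pattern described above. It is illustrative only (the channel names, structure, and `main` wrapper are stand-ins, not the actual replay_stage code), assuming the crossbeam_channel crate carries the "new shreds" signal:

    use crossbeam_channel::{unbounded, RecvTimeoutError};
    use std::time::Duration;

    // Simplified sketch of the signal-plus-timeout wait discussed in this PR; not the
    // actual replay_stage code. The loop blocks on a "new shreds" signal for up to 100ms,
    // so if the signal is never sent (e.g. slot_has_updates keeps returning false), each
    // pass over newly arrived shreds is delayed by up to the full timeout.
    fn main() {
        let (new_shreds_sender, new_shreds_receiver) = unbounded::<()>();

        // Shred insertion would send this signal when slot_has_updates() returns true.
        new_shreds_sender.send(()).unwrap();

        for _ in 0..3 {
            // ... replay whatever shreds are currently available ...

            match new_shreds_receiver.recv_timeout(Duration::from_millis(100)) {
                // Woken immediately: new shreds can be processed right away.
                Ok(()) => continue,
                // No signal arrived: the full 100ms elapsed before the next pass.
                Err(RecvTimeoutError::Timeout) => continue,
                Err(RecvTimeoutError::Disconnected) => break,
            }
        }
    }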
@mergify mergify bot added the community Community contribution label Sep 14, 2022
@mergify mergify bot requested a review from a team September 14, 2022 20:05
@jstarry jstarry requested a review from carllin September 14, 2022 20:19
Contributor

@buffalu buffalu left a comment


niceee

@jstarry
Member

jstarry commented Sep 14, 2022

@carllin I personally think that the change in this PR improves replay stage because it just sends the signal more often. I think that the real bug here is that we always decide to wait for shreds when did_complete_bank is true whether or not we already have the next block's shreds. Thoughts?

@t-nelson
Contributor

t-nelson commented Sep 14, 2022

chatted with @carllin offline and it seems like the more correct change would be to fix the is_connected signal. the working theory is that it can be stuck at false if a validator is restarted from a remote snapshot with a dirty ledger directory. this leaves a hole in the ledger and breaks the is_connected linkage back to genesis. the solution would then be to treat snapshot slots as is_connected, which is necessarily true given that they are rooted.

accordingly, it's unlikely that this affects the cluster broadly, as it should only show up where a node has been restarted from a remote snapshot with a dirty ledger directory, which any seasoned validator operator would be loath to attempt

@steviez
Contributor

steviez commented Sep 14, 2022

@offerm - Also, can you please run ./cargo +nightly fmt (from the root directory of the repo) for the sake of getting a CI run? The first run failed because of this:
https://buildkite.com/solana-labs/solana/builds/81859#01833d9b-0d74-4df0-9731-e86eb9e8045f

[Cargo.lock]: cargo nightly fmt --all -- --check (5s)

++ dirname Cargo.lock
+ cd .
+ /solana/cargo nightly fmt --all -- --check
+ exec cargo +nightly-2022-08-12 fmt --all -- --check
Diff in /solana/ledger/src/blockstore.rs at line 3822:
// from block 0, which is true iff:
// 1) The block with index prev_block_index is itself part of the trunk of consecutive blocks
// starting from block 0,
-        // AND either:
-        // 1) The slot didn't exist in the database before, and now we have a consecutive
-        // block for that slot
-        (slot_meta_backup.is_none() && slot_meta.consumed != 0) ||
+    // AND either:
+    // 1) The slot didn't exist in the database before, and now we have a consecutive
+    // block for that slot
+    (slot_meta_backup.is_none() && slot_meta.consumed != 0) ||
// OR
// 2) The slot did exist, but now we have a new consecutive block for that slot
(slot_meta_backup.is_some() && slot_meta_backup.as_ref().unwrap().consumed != slot_meta.consumed)

@offerm
Author

offerm commented Sep 15, 2022

I can confirm that in my setup, is_connected is always false. This causes slot_has_updates to always be false, and as a result no signal is sent to the replay_stage, so the code waits 100ms before attempting to process available shreds. This explains #27729.
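
For reference, here is a paraphrased sketch of the check being discussed, reconstructed from the expressions visible in the CI diff earlier in this thread; the SlotMeta stand-in is trimmed to the fields used here, and the exact code in blockstore.rs may differ:

    // Paraphrased sketch of slot_has_updates as discussed in this thread; reconstructed
    // from the CI diff above, so the exact signature and comments may differ.
    struct SlotMeta {
        consumed: u64,      // number of consecutive shreds received, starting from index 0
        is_connected: bool, // per the field comment below: slot is full and all parents connected
    }

    fn slot_has_updates(slot_meta: &SlotMeta, slot_meta_backup: &Option<SlotMeta>) -> bool {
        // Requiring is_connected here is the check this PR removes: when it is stuck at
        // false, no signal is ever sent and replay falls back to the 100ms timeout.
        slot_meta.is_connected
            // AND either:
            // 1) the slot didn't exist in the database before, and now has consecutive
            //    shreds starting from index 0,
            && ((slot_meta_backup.is_none() && slot_meta.consumed != 0)
                // OR 2) the slot did exist, but now has more consecutive shreds than before.
                || (slot_meta_backup.is_some()
                    && slot_meta_backup.as_ref().unwrap().consumed != slot_meta.consumed))
    }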

Now, assuming you fix is_connected and keep its semantics, replay_stage will still not be optimal.

    // True if this slot is full (consumed == last_index + 1) and if every
    // slot that is a parent of this slot is also connected.
    pub is_connected: bool,

is_connected should be true only when the slot is full (i.e., we have received all of its shreds).

This means that in many (if not most) cases there will be work to be done, but replay_stage will not pick it up before the 100ms timeout expires. This still leads to unneeded wait time, a delay in slot-replay completion, a delay for the RPC, and a delay in starting replay of descendant slots.
Only once the slot becomes full would replay_stage avoid the 100ms wait, so the improvement would be very partial.
This delay happens regardless of how the node was started.

My conclusion:

  • slot_has_updates should not depend on is_connected

@steviez
Contributor

steviez commented Sep 15, 2022

  • slot_has_updates should not depend on is_connected

As of now, I agree with your conclusion here. As you called out, we can still do work before (a properly functioning) is_connected is true. It seems as if instead of checking slot_meta.is_connected, we want something more like parent_slot_meta.is_connected to avoid signaling early when there is no work to do; the failing unit test from your CI run (test_new_shreds_signal) actually highlights this scenario.

However, getting the parent_slot_meta would require some plumbing and an extra blockstore read, so I want to mull on this a little more. The only downside I currently see of not doing this check is that we might return early when there is no work that can be done; I think we'd be fine functionally though.

Unless we decide to do the parent_slot_meta check, I'm inclined to agree with you on this.

Going to chat / mull this over a little more with someone, will reply back after that

@t-nelson
Contributor

right. the gist is that there are two issues here

  1. the is_connected bug. higher priority, lower risk, as it corrects intended behavior
  2. improve work signalling logic. lower priority, higher risk as it requires more careful consideration of Nth-order effects

i'd propose closing this PR, which uses a blunt tool for what needs to be a precision job, in favor of two PRs breaking the problem along risk lines

@offerm
Author

offerm commented Sep 15, 2022

  • slot_has_updates should not depend on is_connected

As of now, I agree with your conclusion here. As you called out, we can still do work before (a properly functioning) is_connected is true. It seems as if instead of checking slot_meta.is_connected, we want something more like parent_slot_meta.is_connected to avoid signaling early when there is no work to do; the failing unit test from your CI run (test_new_shreds_signal) actually highlights this scenario.

However, getting the parent_slot_meta would require some plumbing and an extra blockstore read, so I want to mull on this a little more. The only downside I currently see of not doing this check is that we might return early when there is no work that can be done; I think we'd be fine functionally though.

Unless we decide to do the parent_slot_meta check, I'm inclined to agree with you on this.

Going to chat / mull this over a little more with someone, will reply back after that

Agree.
Note that an extra cycle without actual work can already happen today, as a result of the 100ms timeout or of an extra iteration when did_complete_bank is true, so I'm not sure this downside has a real negative impact.

@offerm
Author

offerm commented Sep 15, 2022

right. the gist is that there are two issues here

  1. the is_connected bug. higher priority, lower risk, as it corrects intended behavior
  2. improve work signalling logic. lower priority, higher risk as it requires more careful consideration of Nth-order effects

i'd propose closing this PR, which uses a blunt tool for what needs to be a precision job, in favor of two PRs breaking the problem along risk lines

I completely disagree with your conclusion on #2.
The risk is minimal: the only downside is what @steviez mentioned above, which can easily happen with the current code and would only add negligible execution time.
The priority is high due to the impact this has on nodes and the network. For example, I think it may help leaders miss fewer slots.

Just to be sure we understand the impact, here is the result of a comparison between my machine running 1.10.38+fix and the nearest public Solana RPC node (185.209.178.55).
As can be seen, my Solana node published 100% of the updates before the public RPC node, with an average lead of 168ms per update. For traders, this is huge. (I have seen examples of 250ms and even 400ms average.)

[image: update-timing comparison between the patched node and the public RPC node]

@yhchiang-sol
Contributor

I can confirm that in my setup, is_connected is always false.

Can you confirm whether your validator has ever been restarted from a snapshot on top of existing local ledger data, such that it created gaps that keep is_connected always false? Or do you always restart either from a clean snapshot without local ledger data, or from local ledger data alone, but never both?

@yhchiang-sol
Contributor

yhchiang-sol commented Sep 15, 2022

I discussed this offline with @steviez (@steviez: feel free to add anything I might have missed):

the is_connected bug. higher priority, lower risk, as it corrects intended behavior

We think fixing the bug is high priority, and we should replace the is_connected check by checking whether:

  • the current slot has a parent slot, and
  • there are consecutive shreds starting from index 0 in the current slot, and
  • the parent slot is full, and
  • the parent slot is connected all the way back to the snapshot.

If all of the above are true, then we are guaranteed to be processing complete (i.e., connected) data up to the current shred.
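
A hypothetical sketch of that condition, assuming the parent's SlotMeta can be looked up at the call site; the struct is trimmed to the fields mentioned in this thread, and the function and parameter names are illustrative rather than the actual blockstore API:

    // Hypothetical sketch, not the actual blockstore API.
    struct SlotMeta {
        consumed: u64,     // number of consecutive shreds received, starting from index 0
        last_index: u64,   // index of the last shred in the slot, once known
        is_connected: bool,
    }

    fn connected_up_to_current_shred(slot_meta: &SlotMeta, parent_meta: Option<&SlotMeta>) -> bool {
        match parent_meta {
            // The current slot must have a parent slot whose meta we can load.
            Some(parent) => {
                // Consecutive shreds starting from index 0 exist in the current slot,
                slot_meta.consumed != 0
                    // the parent slot is full (consumed == last_index + 1), and
                    && parent.consumed == parent.last_index + 1
                    // the parent slot is connected all the way back to the snapshot.
                    && parent.is_connected
            }
            None => false,
        }
    }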

improve work signaling logic. lower priority, higher risk as it requires more careful consideration of Nth-order effects

We discussed the possible consequences of not waiting 100ms. If we skip the 100ms wait, there's a higher chance that some validators have not yet received part of the shreds in the parent slot but already have consecutive shreds of the current slot available to continue replay (i.e., the correct version of is_connected is false), while other validators have received the complete data. While removing the 100ms wait will make the validator run faster, it might have some impact on voting and consensus.

@yhchiang-sol yhchiang-sol self-requested a review September 15, 2022 23:26
Contributor

@yhchiang-sol yhchiang-sol left a comment


To sum up, we have two different issues:

  1. Correct the is_connected flag --- once this is fixed, the 100ms wait should only happen rarely, when there are missing or delayed shreds.
  2. Improve the signaling logic

And I think item 1 should be done before item 2, because without a correct is_connected it is unsafe to change any signaling logic, since the consequences are hard to measure.

As we need to focus on item 1 first to give item 2 a clearer picture, we will close this PR for now but keep issue #27729 open; we can continue the discussion over #27729 while we work on item 1.

Thanks to @offerm for reporting and discussing the issue.

@offerm
Author

offerm commented Sep 16, 2022

@yhchiang-sol

Your suggested changes to is_connected should provide the same impact as my suggested change.

Two comments:

i) You are changing the semantics of is_connected, which currently means that the slot is full.

// True if this slot is full (consumed == last_index + 1) and if every
// slot that is a parent of this slot is also connected.
pub is_connected: bool,

This has an impact on slot chaining.

ii)

We discussed the possible consequences of not waiting 100ms. If we skip the 100ms wait, there's a higher chance that some validators have not yet received part of the shreds in the parent slot but already have consecutive shreds of the current slot available to continue replay (i.e., the correct version of is_connected is false), while other validators have received the complete data. While removing the 100ms wait will make the validator run faster, it might have some impact on voting and consensus.

The above is not very clear.
If there is a risk as described above, should you allow replay once the 100ms times out?

Also, replay_stage replays active bank forks. The list of active bank forks is generated by generate_new_bank_forks, which selects forks of frozen_banks. As such, to my limited understanding, there is no possible negative impact from triggering the signal even in the situation described above.

Anyway, it looks like you intend to provide a different solution with high priority, so I'm happy about that.

Many thanks.

@yhchiang-sol
Contributor

yhchiang-sol commented Sep 16, 2022

i) You are changing the semantics of is_connected, which currently means that the slot is full.

Yep, so is_connected should mean everything is connected from the snapshot to the current shred.

So if slot N is connected but not full, then slot N-1 must be connected and full (if they are both after the snapshot).

If there is a risk as described above, should you allow replay once the 100ms times out?

First of all, I don't like the 100ms design either, but changing it also has some impact on consensus, and we don't yet have a clear picture of it. To make things correct in a safer way, we should fix is_connected first (which will make is_connected mostly true) and then improve the signaling logic.

And to your question: yes, because a validator might never receive the missing shred data that would make is_connected true, so validators still need to proceed. In that case, validators that have waited 100ms have still not received the missing shreds and will choose a different fork than the leader. Once is_connected is fixed, this situation should only happen rarely instead of constantly.

However, if we remove the 100ms wait, we give no grace period for out-of-order data; in that case, I am afraid that validators will fork too early, since they might receive the data later.
