
slot_has_updates should not check if slot is full (and connected) #27786

Closed
wants to merge 2 commits

Conversation

offerm

@offerm offerm commented Sep 14, 2022

This allows the replay loop to process shreds earlier. The replay loop no longer needs to wait 100ms between passes over shreds and can process them as soon as they are available.

Expected speedup of the replay loop: 100-200 ms.

Direct impact on all RPC and validator nodes.

No risk

Problem

See issue #27729

Summary of Changes

Fixes #27729

This allows the replay loop to process shreds earlier.
The replay loop no longer needs to wait 100ms between passes over shreds and can process them as soon as they are available.

Expected speedup of the replay loop: 100-200 ms.

No risk
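
For context on the 100ms figure, below is a simplified sketch of the signal-plus-timeout pattern described above. It is illustrative only (the channel names, structure, and `main` wrapper are stand-ins, not the actual replay_stage code), assuming the crossbeam_channel crate carries the "new shreds" signal:

    use crossbeam_channel::{unbounded, RecvTimeoutError};
    use std::time::Duration;

    // Simplified sketch of the signal-plus-timeout wait discussed in this PR; not the
    // actual replay_stage code. The loop blocks on a "new shreds" signal for up to 100ms,
    // so if the signal is never sent (e.g. slot_has_updates keeps returning false), each
    // pass over newly arrived shreds is delayed by up to the full timeout.
    fn main() {
        let (new_shreds_sender, new_shreds_receiver) = unbounded::<()>();

        // Shred insertion would send this signal when slot_has_updates() returns true.
        new_shreds_sender.send(()).unwrap();

        for _ in 0..3 {
            // ... replay whatever shreds are currently available ...

            match new_shreds_receiver.recv_timeout(Duration::from_millis(100)) {
                // Woken immediately: new shreds can be processed right away.
                Ok(()) => continue,
                // No signal arrived: the full 100ms elapsed before the next pass.
                Err(RecvTimeoutError::Timeout) => continue,
                Err(RecvTimeoutError::Disconnected) => break,
            }
        }
    }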
@mergify mergify bot added the community Community contribution label Sep 14, 2022
@mergify mergify bot requested a review from a team September 14, 2022 20:05
@jstarry jstarry requested a review from carllin September 14, 2022 20:19
Contributor

@buffalu buffalu left a comment


niceee

@jstarry
Member

jstarry commented Sep 14, 2022

@carllin I personally think that the change in this PR improves replay stage because it just sends the signal more often. I think that the real bug here is that we always decide to wait for shreds when did_complete_bank is true whether or not we already have the next block's shreds. Thoughts?

@t-nelson
Contributor

t-nelson commented Sep 14, 2022

chatted with @carllin offline and it seems like the more correct change would be to fix the is_connected signal. the working theory is that it can be stuck at false if a validator is restarted from a remote snapshot with a dirty ledger directory. this leaves a hole in the ledger and breaks the is_connected linkage back to genesis. the solution would then be to treat snapshot slots as is_connected, which is necessarily true given that they are rooted.

accordingly, it's unlikely that this affects the cluster broadly, as it should only show up where a node has been restarted from a remote snapshot with a dirty ledger directory, which any seasoned validator operator would be loath to attempt

@steviez
Contributor

steviez commented Sep 14, 2022

@offerm - Also, can you please run ./cargo +nightly fmt (from the root directory of the repo) for the sake of getting a CI run? The first run failed because of this:
https://buildkite.com/solana-labs/solana/builds/81859#01833d9b-0d74-4df0-9731-e86eb9e8045f

[Cargo.lock]: cargo nightly fmt --all -- --check (5s)

++ dirname Cargo.lock
+ cd .
+ /solana/cargo nightly fmt --all -- --check
+ exec cargo +nightly-2022-08-12 fmt --all -- --check
Diff in /solana/ledger/src/blockstore.rs at line 3822:
// from block 0, which is true iff:
// 1) The block with index prev_block_index is itself part of the trunk of consecutive blocks
// starting from block 0,
-        // AND either:
-        // 1) The slot didn't exist in the database before, and now we have a consecutive
-        // block for that slot
-        (slot_meta_backup.is_none() && slot_meta.consumed != 0) ||
+    // AND either:
+    // 1) The slot didn't exist in the database before, and now we have a consecutive
+    // block for that slot
+    (slot_meta_backup.is_none() && slot_meta.consumed != 0) ||
// OR
// 2) The slot did exist, but now we have a new consecutive block for that slot
(slot_meta_backup.is_some() && slot_meta_backup.as_ref().unwrap().consumed != slot_meta.consumed)

@offerm
Author

offerm commented Sep 15, 2022

I can confirm that in my setup, is_connected is always false. This causes slot_has_updates to always be false, and as a result no signal is sent to the replay_stage, so the code waits 100ms before attempting to process available shreds. This explains #27729.
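
For reference, here is a paraphrased sketch of the check being discussed, reconstructed from the expressions visible in the CI diff earlier in this thread; the SlotMeta stand-in is trimmed to the fields used here, and the exact code in blockstore.rs may differ:

    // Paraphrased sketch of slot_has_updates as discussed in this thread; reconstructed
    // from the CI diff above, so the exact signature and comments may differ.
    struct SlotMeta {
        consumed: u64,      // number of consecutive shreds received, starting from index 0
        is_connected: bool, // per the field comment below: slot is full and all parents connected
    }

    fn slot_has_updates(slot_meta: &SlotMeta, slot_meta_backup: &Option<SlotMeta>) -> bool {
        // Requiring is_connected here is the check this PR removes: when it is stuck at
        // false, no signal is ever sent and replay falls back to the 100ms timeout.
        slot_meta.is_connected
            // AND either:
            // 1) the slot didn't exist in the database before, and now has consecutive
            //    shreds starting from index 0,
            && ((slot_meta_backup.is_none() && slot_meta.consumed != 0)
                // OR 2) the slot did exist, but now has more consecutive shreds than before.
                || (slot_meta_backup.is_some()
                    && slot_meta_backup.as_ref().unwrap().consumed != slot_meta.consumed))
    }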

Now, assuming you fix is_connected and keep its semantics, replay_stage will still not be optimal.

    // True if this slot is full (consumed == last_index + 1) and if every
    // slot that is a parent of this slot is also connected.
    pub is_connected: bool,

is_connected should be true only when the slot is full (i.e., we have received all of its shreds).

This means that in many (if not most) cases there will be work to be done, but replay_stage will not pick it up before the 100ms timeout expires. This still leads to unneeded wait time, a delay in slot-replay completion, a delay for the RPC, and a delay in starting replay of descendant slots.
Only once the slot becomes full would replay_stage avoid the 100ms wait, so the improvement would be very partial.
This delay happens regardless of how the node was started.

My conclusion:

  • slot_has_updates should not depend on is_connected

@steviez
Contributor

steviez commented Sep 15, 2022

  • slot_has_updates should not depend on is_connected

As of now, I agree with your conclusion here. As you called out, we can still do work before (a properly functioning) is_connected is true. It seems as if instead of checking slot_meta.is_connected, we want something more like parent_slot_meta.is_connected to avoid signaling early when there is no work to do; the failing unit test from your CI run (test_new_shreds_signal) actually highlights this scenario.

However, getting the parent_slot_meta would require some plumbing and an extra blockstore read, so I want to mull on this a little more. The only downside I currently see of not doing this check is that we might return early when there is no work that can be done; I think we'd be fine functionally though.

Unless we decide to do the parent_slot_meta check, I'm inclined to agree with you on this.

Going to chat / mull this over a little more with someone, will reply back after that

@t-nelson
Contributor

right. the gist is that there are two issues here

  1. the is_connected bug. higher priority, lower risk, as it corrects intended behavior
  2. improve work signalling logic. lower priority, higher risk as it requires more careful consideration of Nth-order effects

i'd propose closing this PR, which uses a blunt tool for what needs to be a precision job, in favor of two PRs breaking the problem along risk lines

@offerm
Author

offerm commented Sep 15, 2022

  • slot_has_updates should not depend on is_connected

As of now, I agree with your conclusion here. As you called out, we can still do work before (a properly functioning) is_connected is true. It seems as if instead of checking slot_meta.is_connected, we want something more like parent_slot_meta.is_connected to avoid signaling early when there is no work to do; the failing unit test from your CI run (test_new_shreds_signal) actually highlights this scenario.

However, getting the parent_slot_meta would require some plumbing and an extra blockstore read, so I want to mull on this a little more. The only downside I currently see of not doing this check is that we might return early when there is no work that can be done; I think we'd be fine functionally though.

Unless we decide to do the parent_slot_meta check, I'm inclined to agree with you on this.

Going to chat / mull this over a little more with someone, will reply back after that

Agree.
Note that an extra cycle without actual work can already happen today, as a result of the 100ms timeout or of an extra iteration when did_complete_bank is true, so I'm not sure this downside has a real negative impact.

@offerm
Author

offerm commented Sep 15, 2022

right. the gist is that there are two issues here

  1. the is_connected bug. higher priority, lower risk, as it corrects intended behavior
  2. improve work signalling logic. lower priority, higher risk as it requires more careful consideration of Nth-order effects

i'd propose closing this PR, which uses a blunt tool for what needs to be a precision job, in favor of two PRs breaking the problem along risk lines

I completely disagree with your conclusion on #2.
The risk is minimal: the only downside is what @steviez mentioned above, which can easily happen with the current code and would only add negligible execution time.
The priority is high due to the impact this has on nodes and the network. For example, I think it may help leaders miss fewer slots.

Just to be sure we understand the impact, here is the result of a comparison between my machine running 1.10.38+fix and the nearest public Solana RPC node (185.209.178.55).
As can be seen, my Solana node published 100% of the updates before the public RPC node, with an average lead of 168ms per update. For traders, this is huge. (I have seen examples of 250ms and even 400ms average.)

[image: update-timing comparison between the patched node and the public RPC node]

@yhchiang-sol
Contributor

I can confirm that in my setup, is_connected is always false.

Can you confirm whether your validator has ever been restarted from a snapshot on top of existing local ledger data, such that it created gaps that keep is_connected always false? Or do you always restart either from a clean snapshot without local ledger data, or from local ledger data alone, but never both?

@yhchiang-sol
Contributor

yhchiang-sol commented Sep 15, 2022

I discussed this offline with @steviez (@steviez: feel free to add anything I might have missed):

the is_connected bug. higher priority, lower risk, as it corrects intended behavior

We think fixing the bug is high priority, and we should replace the is_connected check by checking whether:

  • the current slot has a parent slot, and
  • there are consecutive shreds starting from index 0 in the current slot, and
  • the parent slot is full, and
  • the parent slot is connected all the way back to the snapshot.

If all of the above are true, then we are guaranteed to be processing complete (i.e., connected) data up to the current shred.
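
A hypothetical sketch of that condition, assuming the parent's SlotMeta can be looked up at the call site; the struct is trimmed to the fields mentioned in this thread, and the function and parameter names are illustrative rather than the actual blockstore API:

    // Hypothetical sketch, not the actual blockstore API.
    struct SlotMeta {
        consumed: u64,     // number of consecutive shreds received, starting from index 0
        last_index: u64,   // index of the last shred in the slot, once known
        is_connected: bool,
    }

    fn connected_up_to_current_shred(slot_meta: &SlotMeta, parent_meta: Option<&SlotMeta>) -> bool {
        match parent_meta {
            // The current slot must have a parent slot whose meta we can load.
            Some(parent) => {
                // Consecutive shreds starting from index 0 exist in the current slot,
                slot_meta.consumed != 0
                    // the parent slot is full (consumed == last_index + 1), and
                    && parent.consumed == parent.last_index + 1
                    // the parent slot is connected all the way back to the snapshot.
                    && parent.is_connected
            }
            None => false,
        }
    }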

improve work signaling logic. lower priority, higher risk as it requires more careful consideration of Nth-order effects

We discussed the possible consequences of not waiting 100ms. If we skip the 100ms wait, there's a higher chance that some validators have not yet received part of the shreds in the parent slot but already have consecutive shreds of the current slot available to continue replay (i.e., the correct version of is_connected is false), while other validators have received the complete data. While removing the 100ms wait will make the validator run faster, it might have some impact on voting and consensus.

@yhchiang-sol yhchiang-sol self-requested a review September 15, 2022 23:26
Contributor

@yhchiang-sol yhchiang-sol left a comment


To sum up, we have two different issues:

  1. Correct the is_connected flag --- once this is fixed, the 100ms wait should only happen rarely, when there are missing or delayed shreds.
  2. Improve the signaling logic

And I think item 1 should be done before item 2, because without a correct is_connected it is unsafe to change any signaling logic, since the consequences are hard to measure.

As we need to focus on item 1 first to give item 2 a clearer picture, we will close this PR for now but keep issue #27729 open; we can continue the discussion over #27729 while we work on item 1.

Thanks to @offerm for reporting and discussing the issue.

@offerm
Author

offerm commented Sep 16, 2022

@yhchiang-sol

Your suggested changes to is_connected should provide the same impact as my suggested change.

Two comments:

i) You are changing the semantics of is_connected, which currently means that the slot is full.

// True if this slot is full (consumed == last_index + 1) and if every
// slot that is a parent of this slot is also connected.
pub is_connected: bool,

This has an impact on slot chaining.

ii)

We discussed the possible consequences of not waiting 100ms. If we skip the 100ms wait, there's a higher chance that some validators have not yet received part of the shreds in the parent slot but already have consecutive shreds of the current slot available to continue replay (i.e., the correct version of is_connected is false), while other validators have received the complete data. While removing the 100ms wait will make the validator run faster, it might have some impact on voting and consensus.

The above is not very clear.
If there is a risk as described above, should you allow replay once the 100ms times out?

Also, replay_stage replays active bank forks. The list of active bank forks is generated by generate_new_bank_forks, which selects forks of frozen_banks. As such, to my limited understanding, there is no possible negative impact from triggering the signal even in the situation described above.

Anyway, it looks like you intend to provide a different solution with high priority, so I'm happy about that.

Many thanks.

@yhchiang-sol
Contributor

yhchiang-sol commented Sep 16, 2022

i) You are changing the semantics of is_connected, which currently means that the slot is full.

Yep, so is_connected should mean everything is connected from the snapshot to the current shred.

So if slot N is connected but not full, then slot N-1 must be connected and full (if they are both after the snapshot).

If there is a risk as described above, should you allow replay once the 100ms times out?

First of all, I don't like the 100ms design either, but changing it also has some impact on consensus, and we don't yet have a clear picture of it. To make things correct in a safer way, we should fix is_connected first (which will make is_connected mostly true) and then improve the signaling logic.

And to your question: yes, because a validator might never receive the missing shred data that would make is_connected true, so validators still need to proceed. In that case, validators that have waited 100ms have still not received the missing shreds and will choose a different fork than the leader. Once is_connected is fixed, this situation should only happen rarely instead of constantly.

However, if we remove the 100ms wait, we give no grace period for out-of-order data; in that case, I am afraid that validators will fork too early, since they might receive the data later.
