slot_has_updates should not check if slot is full (and connected) #27786
Conversation
This allows the replay loop to process shreds earlier. The replay loop does not need to wait 100ms between rounds of shred processing and can process shreds as soon as they are available. Expected speedup of the replay loop: 100-200 ms. Direct impact on all RPC and validator nodes. No risk.

Problem
See issue #27729

Summary of Changes
Fixes #27729
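For context, a minimal Rust sketch of the check this PR wants to relax. The types and names below are simplified stand-ins for the blockstore's SlotMeta and slot_has_updates, not the exact upstream code:

```rust
// Simplified sketch only: names mirror the blockstore's SlotMeta/slot_has_updates,
// but fields and signatures are reduced for illustration.
struct SlotMetaLite {
    consumed: u64,      // number of consecutive shreds received from the start of the slot
    is_connected: bool, // slot chains back to the root/snapshot with no gaps
}

/// Current behavior: only signal the replay loop for connected slots.
fn slot_has_updates(meta: &SlotMetaLite, old: Option<&SlotMetaLite>) -> bool {
    meta.is_connected
        && match old {
            None => meta.consumed != 0,
            Some(old) => old.consumed != meta.consumed,
        }
}

/// This PR's proposal: signal on any growth in consecutive shreds,
/// without gating on is_connected (or the slot being full).
fn slot_has_updates_proposed(meta: &SlotMetaLite, old: Option<&SlotMetaLite>) -> bool {
    match old {
        None => meta.consumed != 0,
        Some(old) => old.consumed != meta.consumed,
    }
}

fn main() {
    // A slot that gained shreds but is not (yet) connected: today no signal is
    // sent, so the replay loop waits out the 100ms timeout before looking for work.
    let meta = SlotMetaLite { consumed: 32, is_connected: false };
    let old = SlotMetaLite { consumed: 10, is_connected: false };
    assert!(!slot_has_updates(&meta, Some(&old)));
    assert!(slot_has_updates_proposed(&meta, Some(&old)));
}
```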
niceee
@carllin I personally think that the change in this PR improves the replay stage because it just sends the signal more often. I think that the real bug here is that we always decide to wait for shreds even when there is already work that could be done, because is_connected never becomes true in this setup.
Chatted with @carllin offline and it seems like the more correct change would be to fix the is_connected flag itself; accordingly, it's unlikely that this affects the cluster broadly, as it should only show up where a node has been restarted from a remote snapshot with a dirty ledger directory, which any seasoned validator operator would be loath to attempt.
@offerm - Also, can you please run …
I can confirm that in my setup, is_connected is always false, so the signal is never sent and the replay loop always waits the full 100ms.

Now, assuming you fix is_connected, the signal would still only be sent for slots that are already connected. This means that in many [most] cases, there will be work to be done but the signal will not be sent, and the replay loop will still wait 100ms before doing it.

My conclusion: slot_has_updates should signal whenever new consecutive shreds arrive, without requiring the slot to be full or connected.
As of now, I agree with your conclusion here. As you called out, we can still do work before (a properly functioning) is_connected becomes true for a slot; one alternative would be to check the parent slot's meta instead. However, getting the parent_slot_meta would require some plumbing and an extra blockstore read, so I want to mull on this a little more. The only downside I currently see of not doing this check is that we might return early when there is no work that can be done; I think we'd be fine functionally, though.

Unless we decide to do the parent_slot_meta check, that is the trade-off we'd be accepting. Going to chat / mull this over a little more with someone; will reply back after that.
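For illustration, a rough sketch of the parent-slot variant mentioned above. The parent_slot_meta lookup and all names here are hypothetical; the real change would need the extra plumbing and blockstore read noted in the comment:

```rust
// Hypothetical sketch only; `parent_meta` would come from an extra blockstore
// read (the "plumbing" mentioned above), and none of these names are real APIs.
struct SlotMetaLite {
    consumed: u64,      // consecutive shreds received so far in this slot
    is_full: bool,      // all shreds for the slot have been received
    is_connected: bool, // slot chains back to the root/snapshot with no gaps
}

/// Signal when this slot gained consecutive shreds AND its parent is already
/// full and connected, i.e. replay could actually make progress on this slot
/// even though the slot itself is not yet full or connected.
fn slot_has_updates_via_parent(
    meta: &SlotMetaLite,
    parent_meta: Option<&SlotMetaLite>,
    old_consumed: u64,
) -> bool {
    let gained_shreds = meta.consumed != old_consumed;
    let parent_ready = parent_meta.map_or(false, |p| p.is_full && p.is_connected);
    gained_shreds && parent_ready
}

fn main() {
    let parent = SlotMetaLite { consumed: 64, is_full: true, is_connected: true };
    let child = SlotMetaLite { consumed: 12, is_full: false, is_connected: false };
    assert!(slot_has_updates_via_parent(&child, Some(&parent), 0));
}
```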
Right. The gist is that there are two issues here:
- the is_connected flag is not being set correctly after a restart from a snapshot onto a dirty ledger, and
- the signaling logic is more conservative than it needs to be.

I'd propose closing this PR, which uses a blunt tool for what needs to be a precision job, in favor of two PRs breaking the problem along risk lines.
Agree.
Completely disagree with your conclusion on #2. Just to be sure we understand the impact, here is the result of a comparison between my machine running 1.10.38+fix and the nearest Solana public RPC node (185.209.178.55).
Can I ask whether your validator has ever been restarted from a snapshot with some local ledger data present, such that it creates gaps that make is_connected always false? Or do you always restart either from a clean snapshot without local ledger data, or with only local ledger data, but never both?
Discussed this offline with @steviez (@steviez: feel free to add anything I might have missed):
We think fixing the bug is high priority, and we should replace the current meaning of is_connected (the slot is full and chains back to the root) with one where a slot counts as connected when:
- its parent slot is connected and full (or is the snapshot/root slot), and
- the shreds received so far in the slot are consecutive from the start of the slot.
If all the above statements are true, then we are guaranteed to process the complete (i.e., connected) data up to the current shred.
We discussed the possible consequence of not waiting 100ms. If we skip the 100ms wait, there's a higher chance that some validators have not yet received part of the shreds in the parent slot but already have consecutive shreds of the current slot available to continue replay (i.e., the correct version of is_connected is false), while other validators have received the complete data. While removing the 100ms wait will make the validator run faster, it might have some impact on voting and consensus.
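For reference, the 100ms wait under discussion follows this general pattern (illustrative only; the channel and variable names are assumptions, not the actual replay-stage code, and it uses the crossbeam-channel crate that Solana depends on):

```rust
// Illustrative sketch of the signal/wait pattern, not the real replay-stage code.
use crossbeam_channel::{unbounded, RecvTimeoutError};
use std::time::Duration;

fn main() {
    // The blockstore holds the sender and fires it when slot_has_updates() says
    // there is new consecutive data; the replay loop blocks on the receiver.
    let (new_shreds_signal, ledger_signal_receiver) = unbounded::<bool>();

    // Pretend shred insertion decided to signal.
    new_shreds_signal.send(true).unwrap();

    for _ in 0..2 {
        // If the signal never arrives (e.g. is_connected is stuck at false),
        // every iteration pays the full 100ms before replay looks for work anyway.
        match ledger_signal_receiver.recv_timeout(Duration::from_millis(100)) {
            Ok(_) => println!("woken early: replay new shreds now"),
            Err(RecvTimeoutError::Timeout) => println!("waited 100ms: replay whatever is available"),
            Err(RecvTimeoutError::Disconnected) => break,
        }
    }
}
```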
To sum up, we have two different issues:
- Correct the is_connected flag. Once this is fixed, the 100ms wait should only happen rarely, when there are missing or delayed shreds.
- Improve the signaling logic.

And I think item 1 should be done before item 2, as without a correct is_connected it is unsafe to change any signaling logic, because the consequence is hard to measure.
As we need to focus on item 1 first to give item 2 a clear picture, we will close this PR for now but keep issue #27729 open, and we can continue the discussion there while we are working on item 1.
Thanks @offerm for reporting and discussing the issue.
Your suggested changes to is_connected should provide the same impact as my suggested change. Two comments:
i) You are changing the semantics of is_connected, which currently means that the slot is full (and connected).
ii) It is not clear from the above what happens when is_connected never becomes true for a slot: do validators still proceed after the 100ms wait?

Anyway, it looks like you intend to provide a different solution with high priority, so I'm happy about it. Many thanks.
Yep, so is_connected should mean everything is connected from the snapshot to the current shred. So if slot N is connected but not full, then slot N-1 must be connected and full (if they are both after the snapshot).
So first of all, I don't like the 100ms design either, but changing it also has some impact on consensus, and we don't yet have a clear picture of that. To make things correct in a safer way, we should fix is_connected first (which will make is_connected mostly true) and then the signaling logic.

And to your question: yes, because the validator might never receive the missing shred data that would make is_connected true, validators still need to proceed. In that case, validators that have waited 100ms and still have not received the missing shreds will choose a different fork than the leader. Once is_connected is fixed, this situation should only happen rarely instead of constantly. However, if we remove the 100ms wait, then we give no grace period for out-of-order data, and I am afraid that validators would fork too early, as they might receive the data shortly afterwards.
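To make the proposed semantics concrete, a minimal sketch of the propagation rule (simplified types and names, not the actual SlotMeta code): slot N can be connected while not yet full, provided its parent is connected and full.

```rust
// Simplified sketch of the rule: a slot becomes connected once its parent is
// connected AND full, so a slot can be connected before it is itself full.
struct SlotMetaLite {
    consumed: u64,           // consecutive shreds received from the start of the slot
    last_index: Option<u64>, // index of the slot's final shred, once known
    is_connected: bool,
}

fn is_full(meta: &SlotMetaLite) -> bool {
    // Full = the consecutive run covers every shred in the slot.
    matches!(meta.last_index, Some(last) if meta.consumed == last + 1)
}

/// Proposed propagation: mark the child connected as soon as its parent chain
/// back to the snapshot/root is connected and full.
fn propagate_connected(parent: &SlotMetaLite, child: &mut SlotMetaLite) {
    if parent.is_connected && is_full(parent) {
        child.is_connected = true;
    }
}

fn main() {
    let parent = SlotMetaLite { consumed: 64, last_index: Some(63), is_connected: true };
    // The child has only a prefix of its shreds, yet becomes connected:
    let mut child = SlotMetaLite { consumed: 10, last_index: None, is_connected: false };
    propagate_connected(&parent, &mut child);
    assert!(child.is_connected && !is_full(&child)); // connected but not full
}
```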