bitfield-distribution: subsystem queue seems to get full #5657
Comments
I investigated why bitfield-distribution sometimes gets clogged, and all the data indicates that the average throughput of the subsystem is more than sufficient. This is backed by multiple data sources like:
This leads me to think that these rare occasions of the subsystem being clogged are just bursts of messages, which happen because all validators decide to send their bitfields at the same time. Doing some math on the total number of messages, it looks like this:
Clogging of the subsystem queue, even briefly, is really bad because it blocks the sender. In this case the sender is network-bridge-rx, which dispatches communication for all the other subsystems, so we want to avoid clogging entirely or at least minimize it. For this we have 2 low-hanging fruits that we should do:
Proposed fix: #5787
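To illustrate why a full queue is a problem for the sender, here is a minimal sketch using a bounded channel from the Rust standard library. It is only an analogy: the real subsystem channels are not `std::sync::mpsc`, and the function name `fill_until_full` and the capacities are made up for this example. Once the queue holds `capacity` unread messages, a non-blocking send is rejected, and a blocking send would stall the caller until the consumer catches up.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Tries to push `attempts` messages into a bounded channel that nobody reads
/// from, and returns how many were accepted before the channel reported full.
/// Hypothetical helper, not part of any subsystem code.
fn fill_until_full(capacity: usize, attempts: usize) -> usize {
    // `_rx` is kept alive so the channel stays connected but is never drained.
    let (tx, _rx) = sync_channel::<u32>(capacity);
    let mut accepted = 0;
    for i in 0..attempts {
        match tx.try_send(i as u32) {
            Ok(()) => accepted += 1,
            // Queue is full: a blocking `send` would stall the sender here.
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}

fn main() {
    // Small queue: only the first 2 of 5 messages fit.
    println!("capacity 2: accepted {}", fill_until_full(2, 5));
    // Larger queue: the whole burst of 5 is absorbed.
    println!("capacity 8: accepted {}", fill_until_full(8, 5));
}
```

With a slow or busy consumer, the only things that keep the sender from stalling are a queue large enough for the burst or fewer messages arriving at once.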
…5787) ## Description Details and rationale explained here: #5657 (comment) Fixes: #5657 --------- Signed-off-by: Alexandru Gheorghe <[email protected]>
On Kusama we can observe that the channel gets full every once in a while, leading to a brief stall of the network bridge. This started happening as we increased the number of validators, which increased the number of bitfield messages in the network.
The bitfield gossip is bursty because all nodes set up a 1.5s timer when they import the block. When the timer expires, they all send out their bitfields to the other validators.
We need to investigate this further and see if it is a potential problem when we scale up to 1k validators. We might want to optimize this a bit, or maybe a larger subsystem channel size is enough to absorb these bursts.
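A rough way to see why bursts matter even when average throughput is fine is a toy queue model like the one below. All numbers are assumptions picked for illustration: 1,000 messages per block (one per validator, ignoring gossip duplicates), 50 ticks per block, and a consumer that drains 100 messages per tick. None of them are measurements from Kusama.

```rust
/// Returns the largest backlog seen in a queue, given how many messages arrive
/// in each tick and how many the consumer can drain per tick.
fn peak_backlog(arrivals_per_tick: &[u64], drained_per_tick: u64) -> u64 {
    let mut backlog = 0u64;
    let mut peak = 0u64;
    for &arrived in arrivals_per_tick {
        backlog += arrived;
        peak = peak.max(backlog);
        backlog = backlog.saturating_sub(drained_per_tick);
    }
    peak
}

/// The same total number of messages spread evenly over all ticks.
fn even_arrivals(total: u64, ticks: usize) -> Vec<u64> {
    vec![total / ticks as u64; ticks]
}

/// All messages arriving in a single tick, as when every timer fires together.
fn burst_arrivals(total: u64, ticks: usize, at_tick: usize) -> Vec<u64> {
    let mut arrivals = vec![0u64; ticks];
    arrivals[at_tick] = total;
    arrivals
}

fn main() {
    // Assumed values, for illustration only.
    let messages = 1_000;
    let ticks = 50;
    let drained_per_tick = 100;

    let even = peak_backlog(&even_arrivals(messages, ticks), drained_per_tick);
    let burst = peak_backlog(&burst_arrivals(messages, ticks, 12), drained_per_tick);
    println!("peak backlog, evenly spread: {even}");
    println!("peak backlog, single burst:  {burst}");
}
```

In this model the consumer is five times faster than it needs to be on average, yet the single burst still needs room for the full 1,000 messages, while the evenly spread case never holds more than 20. That is the trade-off between a larger channel and spreading out the sends.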