bitfield-distribution: subsystem queue seems to get full #5657

Closed
sandreim opened this issue Sep 10, 2024 · 1 comment · Fixed by #5787
Labels
T0-node This PR/Issue is related to the topic “node”.

Comments

@sandreim
Contributor

On Kusama we can observe that the channel gets full every once in a while, leading to a brief stall of the network bridge. This has been happening since we increased the number of validators, which increased the number of bitfield messages in the network.

The bitfield gossip is bursty because all nodes set up a 1.5s timer when they import the block. When the timer expires, they all send out their bitfield to other validators.

We need to investigate this further and see whether it is a potential problem when we scale up to 1k validators. We might want to optimize this a bit, or maybe a larger subsystem channel size is enough to absorb these bursts.

[Screenshot 2024-09-10 at 11 12 10: image not included]
@sandreim sandreim added the T0-node This PR/Issue is related to the topic “node”. label Sep 10, 2024
@alexggh alexggh self-assigned this Sep 19, 2024
@alexggh
Contributor

alexggh commented Sep 20, 2024

I did some investigation into bitfield-distribution sometimes being clogged, and all the data points to the average throughput of the system being more than sufficient. This is backed by multiple data sources:

  • Subsystem benchmarks show that processing all bitfields for 500 validators takes around 50ms of CPU time on the reference hardware.
  • Looking at CPU usage on Kusama nodes, it does not go over 4%.

This leads me to think that these rare occasions of the subsystem being clogged are just bursts of messages that happen because all validators decide to send their bitfields at the same time.

Doing some math on the total number of messages, it looks like this:

  1. We have 500 validators, so there are 500 unique bitfield messages.
  2. Each unique message can be received by a node 6 times (2 times because of the X and Y neighbours, and 4 times because any message is also gossiped randomly to 4 peers).
  3. Hence a node can receive 3000 (500 * 6) bitfield messages per relay chain block.
  4. Now, the clogging seems to be correlated with relay-chain forks. I see cases on Kusama where we have 3- or 4-way forks; in that case a node receives up to 3000 * 4 = 12_000 bitfield messages, all arriving around the same time. The messages get processed really fast: we don't see them accumulating, and the time of flight for all messages seems to be almost always below 100ms, with most of them below 100 microseconds.
  5. Bitfield distribution uses the default message_capacity=2048, so I think that's why the queue gets full when we have these bursts of messages caused by relay-chain forks. It is important to note that this happens very rarely, around 4 times a day.
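
The arithmetic above can be written down as a small sketch. The constant and function names are made up for illustration and are not taken from the codebase; the numbers are the ones quoted in the list. The result is an upper bound on the burst, not a prediction of queue depth, since the subsystem drains messages while they arrive.

```rust
// Back-of-the-envelope burst size versus queue capacity.
// All names here are illustrative assumptions; only the numbers come from the comment above.
const VALIDATORS: u32 = 500;
// 2 copies via the X and Y grid neighbours + 4 copies via random gossip peers.
const COPIES_PER_MESSAGE: u32 = 2 + 4;
// Default subsystem queue size mentioned above.
const MESSAGE_CAPACITY: u32 = 2048;

/// Upper bound on bitfield messages a node can receive for `forks` competing blocks.
fn burst_size(forks: u32) -> u32 {
    VALIDATORS * COPIES_PER_MESSAGE * forks
}

fn main() {
    for forks in 1..=4 {
        let burst = burst_size(forks);
        println!(
            "{forks}-way: up to {burst} messages, {:.1}x the queue capacity",
            burst as f64 / MESSAGE_CAPACITY as f64
        );
    }
}
```

Even without forks the worst-case burst (3000) is larger than the default capacity, which is consistent with the queue only filling up when several forks land together and the subsystem momentarily falls behind.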

Clogging of the subsystem queue, even briefly, is really bad because it blocks the sender. In this case the sender is network-bridge-rx, which dispatches communication for all the other subsystems, so we want to avoid it entirely or minimize it. For this we have 2 low-hanging fruits that we should pick:

  1. Increase the message_capacity. I propose setting it to 8192. The only downside is a slightly larger memory footprint when the queue is at max capacity: our messages are around 1KiB each, so we would go from a theoretical max of 2MiB for this subsystem queue to 8MiB. I think that trade-off is perfectly acceptable because production nodes are supposed to be running with at least 32GiB of memory, so this is really negligible.

  2. Make the subsystem run on a blocking task. This would have two benefits. First, it should make the subsystem quicker to react because it gets its own thread rather than sharing the task pool with everyone else. Second, the subsystem does some signature checking here:

    let signed_availability = match bitfield.try_into_checked(&signing_context, &validator) {
    This is a CPU-intensive task, and running on the blocking pool is the recommended behaviour to reduce the impact on the other tasks in the tokio pool.
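
To make the two proposals a bit more concrete, here is a minimal sketch that uses only the Rust standard library. It is not how the node wires its subsystems: `std::sync::mpsc::sync_channel` stands in for the bounded subsystem queue, a plain `std::thread` stands in for the dedicated blocking task, and `max_queue_bytes`, `run_burst` and the 1KiB message size are assumptions made for illustration.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;

// Assumed average message size, taken from the "around 1KiB" estimate above.
const APPROX_MESSAGE_BYTES: usize = 1024;

/// Theoretical memory held by a completely full queue of `capacity` messages.
fn max_queue_bytes(capacity: usize) -> usize {
    capacity * APPROX_MESSAGE_BYTES
}

/// Pushes `burst` messages into a bounded queue drained by a dedicated thread.
/// Returns (messages processed, times the sender found the queue full and blocked).
fn run_burst(capacity: usize, burst: u32) -> (usize, usize) {
    let (tx, rx) = sync_channel::<u32>(capacity);
    // Dedicated consumer thread: runs until every sender has been dropped.
    let consumer = thread::spawn(move || rx.iter().count());

    let mut stalls = 0;
    for id in 0..burst {
        match tx.try_send(id) {
            Ok(()) => {}
            // Queue full: this is the case where the sender would stall.
            Err(TrySendError::Full(msg)) => {
                stalls += 1;
                tx.send(msg).expect("consumer is still running");
            }
            Err(TrySendError::Disconnected(_)) => unreachable!("receiver outlives the sender"),
        }
    }
    drop(tx); // close the channel so the consumer loop ends
    (consumer.join().expect("consumer thread panicked"), stalls)
}

fn main() {
    for capacity in [2048, 8192] {
        let (processed, stalls) = run_burst(capacity, 12_000);
        println!(
            "capacity {capacity}: max {} KiB, processed {processed}, sender stalled {stalls} time(s)",
            max_queue_bytes(capacity) / 1024
        );
    }
}
```

The stall count depends on thread scheduling, so it varies from run to run. The only guaranteed property is that a burst no larger than the capacity never stalls the sender, which is the point of raising message_capacity.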

Proposed fix: #5787

github-merge-queue bot pushed a commit that referenced this issue Sep 23, 2024
…5787)

## Description

Details and rationale explained here:
#5657 (comment)

Fixes: #5657

---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
@github-project-automation github-project-automation bot moved this from Backlog to Completed in parachains team board Sep 23, 2024