Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix order of resending messages after restart #6729

Merged
merged 7 commits into from
Dec 10, 2024

Conversation

alexggh
Copy link
Contributor

@alexggh alexggh commented Dec 2, 2024

The way we build the messages we need to send to approval-distribution can result in a situation where is we have multiple assignments covered by a coalesced approval, the messages are sent in this order:

ASSIGNMENT1, APPROVAL, ASSIGNMENT2, because we iterate over each candidate and add to the queue of messages both the assignment and the approval for that candidate, and when the approval reaches the approval-distribution subsystem it won't be imported and gossiped because one of the assignment for it is not known.

So in a network where a lot of nodes are restarting at the same time we could end up in a situation where a set of the nodes correctly received the assignments and approvals before the restart and approve their blocks and don't trigger their assignments. The other set of nodes should receive the assignments and approvals after the restart, but because the approvals never get broacasted anymore because of this bug, the only way they could approve is if other nodes start broadcasting their assignments.

I think this bug contribute to the reason the network did not recovered on 25-11-25 15:55:40 after the restarts.

Tested this scenario with a zombienet where nodes are finalising blocks because of aggression and all nodes are restarted at once and confirmed the network lags and doesn't recover before and it does after the fix

The way we build the messages we need to send to approval-distribution
can result in a situation where is we have multiple assignments covered
by a coalesced approval, the messages are sent in this order:

ASSIGNMENT1, APPROVAL, ASSIGNMENT2, because we iterate over each
candidate and add to the queue of messages both the assignment and the
approval for that candidate, and when the approval reaches the
approval-distribution subsystem it won't be imported and gossiped
because one of the assignment for it is not known.

So in a network where a lot of nodes are restarting at the same time we
could end up in a situation where a set of the nodes correctly received
the assignments and approvals before the restart and approve their
blocks and don't trigger their assignments. The other set of nodes
should receive the assignments and approvals after the restart, but
because the approvals never get broacasted anymore because of this bug,
the only way they could approve is if other nodes start broadcasting
their assignments.

I think this bug contribute to the reason the network did not recovered
on `25-11-25 15:55:40` after the restarts.

Signed-off-by: Alexandru Gheorghe <[email protected]>
Signed-off-by: Alexandru Gheorghe <[email protected]>
@alexggh alexggh marked this pull request as ready for review December 9, 2024 08:31
@paritytech-workflow-stopper
Copy link

All GitHub workflows were cancelled due to failure one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/12235807472
Failed job name: fmt

Copy link
Contributor

@sandreim sandreim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

Signed-off-by: Alexandru Gheorghe <[email protected]>
Signed-off-by: Alexandru Gheorghe <[email protected]>
Copy link
Contributor

@AndreiEres AndreiEres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@alexggh alexggh added T8-polkadot This PR/Issue is related to/affects the Polkadot network. A3-backport Pull request is already reviewed well in another branch. A4-needs-backport Pull request must be backported to all maintained releases. and removed A3-backport Pull request is already reviewed well in another branch. labels Dec 10, 2024
@alexggh alexggh enabled auto-merge December 10, 2024 10:03
@alexggh alexggh added this pull request to the merge queue Dec 10, 2024
Merged via the queue into master with commit 65a4e5e Dec 10, 2024
197 of 199 checks passed
@alexggh alexggh deleted the alexggh/fix_resending_after_restart branch December 10, 2024 13:28
@paritytech-cmd-bot-polkadot-sdk

Created backport PR for stable2407:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-6729-to-stable2407
git worktree add --checkout .worktree/backport-6729-to-stable2407 backport-6729-to-stable2407
cd .worktree/backport-6729-to-stable2407
git reset --hard HEAD^
git cherry-pick -x 65a4e5ee06b11844d536730379d4e1cab337beb4
git push --force-with-lease

@paritytech-cmd-bot-polkadot-sdk

Created backport PR for stable2409:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-6729-to-stable2409
git worktree add --checkout .worktree/backport-6729-to-stable2409 backport-6729-to-stable2409
cd .worktree/backport-6729-to-stable2409
git reset --hard HEAD^
git cherry-pick -x 65a4e5ee06b11844d536730379d4e1cab337beb4
git push --force-with-lease

github-actions bot pushed a commit that referenced this pull request Dec 10, 2024
The way we build the messages we need to send to approval-distribution
can result in a situation where is we have multiple assignments covered
by a coalesced approval, the messages are sent in this order:

ASSIGNMENT1, APPROVAL, ASSIGNMENT2, because we iterate over each
candidate and add to the queue of messages both the assignment and the
approval for that candidate, and when the approval reaches the
approval-distribution subsystem it won't be imported and gossiped
because one of the assignment for it is not known.

So in a network where a lot of nodes are restarting at the same time we
could end up in a situation where a set of the nodes correctly received
the assignments and approvals before the restart and approve their
blocks and don't trigger their assignments. The other set of nodes
should receive the assignments and approvals after the restart, but
because the approvals never get broacasted anymore because of this bug,
the only way they could approve is if other nodes start broadcasting
their assignments.

I think this bug contribute to the reason the network did not recovered
on `25-11-25 15:55:40` after the restarts.

Tested this scenario with a `zombienet` where `nodes` are finalising
blocks because of aggression and all nodes are restarted at once and
confirmed the network lags and doesn't recover before and it does after
the fix

---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
(cherry picked from commit 65a4e5e)
@paritytech-cmd-bot-polkadot-sdk

Successfully created backport PR for stable2412:

@Polkadot-Forum
Copy link

This pull request has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/2025-11-25-kusama-parachains-spammening-aftermath/11108/1

EgorPopelyaev pushed a commit that referenced this pull request Dec 11, 2024
Backport #6729 into `stable2407` from alexggh.

See the
[documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md)
on how to use this bot.

<!--
  # To be used by other automation, do not modify:
  original-pr-number: #${pull_number}
-->

---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Alexandru Gheorghe <[email protected]>
EgorPopelyaev pushed a commit that referenced this pull request Dec 11, 2024
Backport #6729 into `stable2409` from alexggh.

See the
[documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md)
on how to use this bot.

<!--
  # To be used by other automation, do not modify:
  original-pr-number: #${pull_number}
-->

---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Alexandru Gheorghe <[email protected]>
EgorPopelyaev pushed a commit that referenced this pull request Dec 11, 2024
Backport #6729 into `stable2412` from alexggh.

See the
[documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md)
on how to use this bot.

<!--
  # To be used by other automation, do not modify:
  original-pr-number: #${pull_number}
-->

Co-authored-by: Alexandru Gheorghe <[email protected]>
Ank4n pushed a commit that referenced this pull request Dec 15, 2024
The way we build the messages we need to send to approval-distribution
can result in a situation where is we have multiple assignments covered
by a coalesced approval, the messages are sent in this order:

ASSIGNMENT1, APPROVAL, ASSIGNMENT2, because we iterate over each
candidate and add to the queue of messages both the assignment and the
approval for that candidate, and when the approval reaches the
approval-distribution subsystem it won't be imported and gossiped
because one of the assignment for it is not known.

So in a network where a lot of nodes are restarting at the same time we
could end up in a situation where a set of the nodes correctly received
the assignments and approvals before the restart and approve their
blocks and don't trigger their assignments. The other set of nodes
should receive the assignments and approvals after the restart, but
because the approvals never get broacasted anymore because of this bug,
the only way they could approve is if other nodes start broadcasting
their assignments.

I think this bug contribute to the reason the network did not recovered
on `25-11-25 15:55:40` after the restarts.

Tested this scenario with a `zombienet` where `nodes` are finalising
blocks because of aggression and all nodes are restarted at once and
confirmed the network lags and doesn't recover before and it does after
the fix

---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
dudo50 pushed a commit to paraspell-research/polkadot-sdk that referenced this pull request Jan 4, 2025
The way we build the messages we need to send to approval-distribution
can result in a situation where is we have multiple assignments covered
by a coalesced approval, the messages are sent in this order:

ASSIGNMENT1, APPROVAL, ASSIGNMENT2, because we iterate over each
candidate and add to the queue of messages both the assignment and the
approval for that candidate, and when the approval reaches the
approval-distribution subsystem it won't be imported and gossiped
because one of the assignment for it is not known.

So in a network where a lot of nodes are restarting at the same time we
could end up in a situation where a set of the nodes correctly received
the assignments and approvals before the restart and approve their
blocks and don't trigger their assignments. The other set of nodes
should receive the assignments and approvals after the restart, but
because the approvals never get broacasted anymore because of this bug,
the only way they could approve is if other nodes start broadcasting
their assignments.

I think this bug contribute to the reason the network did not recovered
on `25-11-25 15:55:40` after the restarts.

Tested this scenario with a `zombienet` where `nodes` are finalising
blocks because of aggression and all nodes are restarted at once and
confirmed the network lags and doesn't recover before and it does after
the fix

---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A4-needs-backport Pull request must be backported to all maintained releases. T8-polkadot This PR/Issue is related to/affects the Polkadot network.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants