Spamming Westend causes grandpa-voter to fail #6507

Open
tugytur opened this issue Nov 16, 2024 · 5 comments
Labels
I2-bug The node fails to follow expected behavior.

Comments

@tugytur
Contributor

tugytur commented Nov 16, 2024

In preparation for spamming the relay chains, we were testing the script on Westend.

After a couple of seconds of running, the block time went above 1 minute.
image

I don't have access to the validator logs/metrics, but thankfully @lovelaced checked.
image

This error showed up.
image

@lovelaced and I both suspect it could be caused by the low compute specs of the Westend validators.

I would appreciate feedback on whether we should continue on Kusama or whether Parity wants to check some things on Westend first.
For Kusama, we would first do a 1-3 minute run before doing the full 30 minutes a couple of days later.

If any more logs are required, please reach out to @lovelaced, and just let me know if you want me to run it on Westend again.

@lovelaced

I realized the first graph is for Westend AH (my fault) but the story is the same on the validators.

Screenshot_20241116-104355.png

@bkchr
Member

bkchr commented Nov 16, 2024

We need more logs. @lovelaced please enable grandpa=trace.

After that is done we need the script again @tugytur

@BulatSaif
Contributor

> We need more logs. @lovelaced please enable grandpa=trace.
>
> After that is done we need the script again @tugytur

grandpa=trace added to westend-validators: https://grafana.teleport.parity.io/goto/TXBj7i7HR?orgId=1

@lexnv
Contributor

lexnv commented Nov 20, 2024

The gossip engine exits when either the notification event stream closes or the sync event stream closes:

// The network event stream closed. Do the same for [`GossipValidator`].
Poll::Ready(None) => {
    self.is_terminated = true;
    return Poll::Ready(())
},

// The sync event stream closed. Do the same for [`GossipValidator`].
Poll::Ready(None) => {
    self.is_terminated = true;
    return Poll::Ready(())
},
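
A minimal sketch of that termination behavior, as a toy model rather than the actual `GossipEngine` code. `MockStream` and `ToyEngine` are hypothetical names introduced only for illustration, and the real engine polls async streams with a task context instead of these in-memory queues:

```rust
use std::collections::VecDeque;
use std::task::Poll;

/// Hypothetical stand-in for an event stream: yields queued items, then
/// `Ready(None)` once it is closed and drained, otherwise `Pending`.
struct MockStream<T> {
    items: VecDeque<T>,
    closed: bool,
}

impl<T> MockStream<T> {
    fn new(items: Vec<T>, closed: bool) -> Self {
        Self { items: items.into(), closed }
    }

    fn poll_next(&mut self) -> Poll<Option<T>> {
        match self.items.pop_front() {
            Some(item) => Poll::Ready(Some(item)),
            None if self.closed => Poll::Ready(None),
            None => Poll::Pending,
        }
    }
}

/// Toy engine (an assumption for illustration, not the real type).
struct ToyEngine {
    notifications: MockStream<&'static str>,
    sync_events: MockStream<&'static str>,
    is_terminated: bool,
}

impl ToyEngine {
    /// One poll round. Returns `Ready(())` as soon as either stream closes.
    fn poll(&mut self) -> Poll<()> {
        match self.notifications.poll_next() {
            // The notification stream closed: terminate.
            Poll::Ready(None) => {
                self.is_terminated = true;
                return Poll::Ready(());
            }
            Poll::Ready(Some(_event)) => { /* handle the notification */ }
            Poll::Pending => {}
        }

        match self.sync_events.poll_next() {
            // The sync stream closed: terminate as well.
            Poll::Ready(None) => {
                self.is_terminated = true;
                return Poll::Ready(());
            }
            Poll::Ready(Some(_event)) => { /* handle the sync event */ }
            Poll::Pending => {}
        }

        Poll::Pending
    }
}
```

In this model, a closed sync stream terminates the engine even while the notification stream still has data, which is the kind of exit that would stop gossip for the voter.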

I have detected a mismatch in the poll implementation, although I don't expect it to be the root cause of this issue.

We'll probably need to collect more logs.

@tugytur Would it be possible to make the script public? Would like to run this in versi-net to check litep2p as well 🙏

davidk-pt pushed a commit that referenced this issue Nov 21, 2024
…6553)

The `GossipEngine::poll_next` implementation polls both the
`notification_service` and the `sync_event_stream`.

If both polls produce valid data to be processed
(`Poll::Ready(Some(..))`), then the sync event is ignored when we
receive `NotificationEvent::NotificationStreamOpened` and the role
cannot be deduced.

This PR ensures both events are processed gracefully. While at it, I
have added a warning to the sync engine related to
`notification_service` producing `Poll::Ready(None)`.

This effectively ensures that `SyncEvents` propagate to the network
potentially fixing any state mismatch.


For more context: #6507

cc @paritytech/sdk-node

---------

Signed-off-by: Alexandru Vasile <[email protected]>
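
The scenario described in the commit message above can be sketched with a toy model. This is an illustration under assumptions, not the code from the PR: the `Notification` and `SyncEvent` enums, the `role_known` flag, and both functions are hypothetical, and plain queues stand in for the polled streams. It shows one way an event that was already taken from a stream can be lost when the loop bails out early, and how handling each source independently avoids that:

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
enum Notification {
    /// A peer opened a stream; `role_known` stands in for role deduction.
    StreamOpened { role_known: bool },
    Message(&'static str),
}

#[derive(Debug, Clone, PartialEq)]
enum SyncEvent {
    PeerConnected(u32),
}

/// Lossy shape: both sources are polled up front, and an early `continue`
/// on an undeducible role discards the sync event taken in the same round.
fn process_lossy(
    mut notifications: VecDeque<Notification>,
    mut sync: VecDeque<SyncEvent>,
) -> Vec<SyncEvent> {
    let mut handled = Vec::new();
    loop {
        let notification = notifications.pop_front();
        let sync_event = sync.pop_front();
        if notification.is_none() && sync_event.is_none() {
            break;
        }
        if let Some(Notification::StreamOpened { role_known: false }) = notification {
            // `sync_event` is dropped here without being handled.
            continue;
        }
        if let Some(event) = sync_event {
            handled.push(event);
        }
    }
    handled
}

/// Graceful shape: skipping a notification never skips the sync event.
fn process_both(
    mut notifications: VecDeque<Notification>,
    mut sync: VecDeque<SyncEvent>,
) -> Vec<SyncEvent> {
    let mut handled = Vec::new();
    loop {
        let notification = notifications.pop_front();
        let sync_event = sync.pop_front();
        if notification.is_none() && sync_event.is_none() {
            break;
        }
        match notification {
            Some(Notification::StreamOpened { role_known: false }) => {
                // Ignore only this notification and fall through.
            }
            Some(_other) => { /* handle the notification */ }
            None => {}
        }
        if let Some(event) = sync_event {
            handled.push(event);
        }
    }
    handled
}
```

With one undeducible `StreamOpened` arriving in the same round as `PeerConnected(1)`, the lossy variant handles only the later sync events, while the second variant handles all of them.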
Krayt78 pushed a commit to Krayt78/polkadot-sdk that referenced this issue Dec 18, 2024
…aritytech#6553)

@bkchr
Member

bkchr commented Dec 24, 2024

@lexnv https://github.com/amforc/spammening but it was hard to get it running. We probably need some coordinated effort together with @tugytur to run it again on westend.

bkchr added the I2-bug (The node fails to follow expected behavior.) label Dec 24, 2024
bkchr added this to SDK Node Dec 24, 2024
github-project-automation bot moved this to backlog in SDK Node Dec 24, 2024
Projects
Status: backlog
Development

No branches or pull requests

5 participants