Collators sometimes miss blocks #5349

Closed
2 tasks done
JayPavlina opened this issue Aug 13, 2024 · 5 comments · Fixed by #5352
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.

Comments

@JayPavlina

JayPavlina commented Aug 13, 2024

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

There is a bug introduced in #3308 that causes collators to sometimes miss blocks, which results in longer block times and triggers a reorg. You can reproduce the issue by building the parachain template and polkadot, and then running them with zombienet. If you look at the latency screen, you will see something like this:

[Screenshot of the latency screen, 2024-08-12 at 3:39:24 PM]

Undoing everything in #3308 fixes the problem and the collators will no longer periodically miss blocks. It seems to happen whether or not async backing is enabled, but I mostly tested on older versions without it.

In my testing, the bug only occurs if both the relaychain and parachain are using binaries that include that commit. If either one was built from a version before that commit, the collator performs normally.

We experienced this bug on our testnet when upgrading Enjin Blockchain to polkadot sdk v1.9.0. We worked backwards to find the first version that worked correctly. We solved the issue by forking the sdk and undoing the changes in the mentioned PR.

Steps to reproduce

  1. Build the parachain template
  2. Build polkadot
  3. Run them with zombienet
  4. Check the latency and notice long block times periodically
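
For step 4, the long block times can also be checked programmatically rather than by eye. The following is a minimal sketch, assuming the block timestamps (in milliseconds) have already been collected from the collator by some other means; the function name `long_block_gaps` and the threshold parameter are hypothetical and not part of polkadot-sdk or zombienet.

```rust
/// Returns `(block_index, gap_ms)` for every block whose gap to its
/// predecessor is larger than `max_gap_ms`.
///
/// `timestamps_ms` is assumed to hold one timestamp per block, in block order.
/// Hypothetical helper for illustration only.
fn long_block_gaps(timestamps_ms: &[u64], max_gap_ms: u64) -> Vec<(usize, u64)> {
    timestamps_ms
        .windows(2)
        .enumerate()
        .filter_map(|(i, pair)| {
            // Gap between block i and block i + 1; saturate in case of bad input.
            let gap = pair[1].saturating_sub(pair[0]);
            if gap > max_gap_ms {
                Some((i + 1, gap))
            } else {
                None
            }
        })
        .collect()
}
```

With a 12-second target block time, a threshold of around 18000 ms would flag a block that arrives one slot late while ignoring normal jitter; the right threshold depends on the chain's configured block time.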

Here is the zombienet config I used:

[settings]
timeout = 1000

[relaychain]
default_command = "path/to/polkadot"
chain = "rococo-local"

    [[relaychain.nodes]]
    name = "Alice"
    validator = true

    [[relaychain.nodes]]
    name = "Bob"
    validator = true

[[parachains]]
id = 1000
name = "parachain-template-node"
cumulus_based = true
add_to_genesis = true
register_para = true

    [[parachains.collators]]
    name = "Alice"
    command = "path/to/template-node"
    args = ["-ldebug"]

    [[parachains.collators]]
    name = "Bob"
    command = "path/to/template-node"
    args = ["-ldebug"]

JayPavlina added the labels I10-unconfirmed and I2-bug on Aug 13, 2024

@bkchr
Member

bkchr commented Aug 14, 2024

@JayPavlina which version of the template are you using?

@JayPavlina
Author

JayPavlina commented Aug 14, 2024

I used the most recent version and v1.9.0. For v1.8.0 and lower I used this template. I ran at least 10 different versions while narrowing down which commit caused it. It doesn't happen on v1.7.3, but it happens on every version above that.

@bkchr
Member

bkchr commented Aug 14, 2024

> I used the most recent

The most recent template uses the lookahead collator, which was not affected by the problem and wasn't touched by the PR you mentioned above. For rococo-local, be aware that the session length is 1 minute and parachain candidates do not make it across session boundaries, which leads to skipped blocks.
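
To tell these expected skips apart from the bug, it can help to check whether a missed block lines up with a session change. The following is a rough sketch only: it assumes a 6-second relay chain block time and sessions that start at block 0 and are aligned to block numbers, which is a simplification of how sessions are actually scheduled. All names are hypothetical.

```rust
/// Assumed relay chain block time in seconds (assumption, not from the thread).
const RELAY_BLOCK_SECS: u64 = 6;
/// Session length on rococo-local in seconds, as stated in the comment above.
const SESSION_SECS: u64 = 60;

/// Number of relay chain blocks per session under the assumptions above.
fn blocks_per_session() -> u64 {
    SESSION_SECS / RELAY_BLOCK_SECS
}

/// Simplified session index: assumes sessions are aligned to block numbers.
fn session_index(relay_block: u64) -> u64 {
    relay_block / blocks_per_session()
}

/// True if a candidate backed at `backed_at` would only be included at
/// `included_at` in a different session, i.e. it would be dropped.
fn crosses_session_boundary(backed_at: u64, included_at: u64) -> bool {
    session_index(backed_at) != session_index(included_at)
}
```

Under these assumptions, roughly one skipped parachain block per minute is expected on rococo-local; skips that occur more often, or away from session changes, would point at something else.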

@JayPavlina
Author

Undoing the PR fixed it for v1.9.0. That's what we are using on our testnet.

@alexggh
Contributor

alexggh commented Aug 14, 2024

> Undoing the PR fixed it for v1.9.0. That's what we are using on our testnet.

What @bkchr is suggesting is that, starting with 16d8205, which was first included in v1.12.0, the template uses the lookahead collator (needed for async backing), which is not affected by PR #3308.

github-merge-queue bot pushed a commit that referenced this issue Aug 15, 2024
We only want to build one block per slot for Aura on parachains.
However, we still need to build on each relay chain fork, which is using
the same slot.

Closes: #5349

---------

Co-authored-by: Davide Galassi
Co-authored-by: Sebastian Kunert
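
The rule described in the commit message (one block per slot, but still one per relay chain fork within that slot) can be illustrated with a small sketch. This is not the code from the fix; `SlotClaimTracker`, `should_build`, and the use of a 32-byte array for the relay parent hash are all assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical tracker that remembers which relay parents have already
/// been built on during the current slot. Not the actual polkadot-sdk type.
#[derive(Default)]
struct SlotClaimTracker {
    current_slot: Option<u64>,
    seen_relay_parents: HashSet<[u8; 32]>,
}

impl SlotClaimTracker {
    /// Returns `true` if the collator should build a block for this
    /// `(slot, relay_parent)` pair, and records the pair as used.
    fn should_build(&mut self, slot: u64, relay_parent: [u8; 32]) -> bool {
        // A different slot resets the set of relay parents seen so far.
        if self.current_slot != Some(slot) {
            self.current_slot = Some(slot);
            self.seen_relay_parents.clear();
        }
        // `insert` returns false if this relay parent was already used in
        // this slot, which prevents building twice on the same fork.
        self.seen_relay_parents.insert(relay_parent)
    }
}
```

A tracker that remembered only the last slot number would refuse the second relay chain fork in the same slot, which is the kind of missed block the issue describes; keying on the relay parent as well keeps the one-block-per-slot guarantee per fork.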