Slot based collations #3168

Open · Tracked by #1829
bkchr opened this issue Feb 1, 2024 · 2 comments

bkchr (Member) commented Feb 1, 2024

The current collator implementations depend on relay chain block import as their "clock": for every imported relay chain block, they check the relay chain state to see whether the parachain is allowed to build a block and, if so, build one. The relay chain itself, however, has a task that fires every 6 seconds and then builds a block. Parachains of course still need to get their blocks included in the relay chain, but with async backing parachains get more freedom when it comes to block production. So, we should split the current collator implementation into two tasks:

  1. Block production
  2. Collation generation

Block production would run the same way as on the relay chain: we have a fixed slot that fires every X seconds to build a block. This mechanism needs to be implemented in a flexible way to support the following cases (see the sketch after this list):

  • One parachain block every X seconds, where X >= 6 seconds (the relay chain slot time).
  • Build a parachain block on demand on some trigger. The trigger could be, for example, that the transaction pool is full enough and we have bought some on-demand slot.
  • One parachain block every X seconds, where X < 6 seconds, but the block time is less than 2 seconds. This should be useful for parachains that want to run at a higher pace than the relay chain to give faster feedback on transaction inclusion.
  • One parachain block every X seconds, where X < 6 seconds, but the block time is 2 seconds. This should be useful for parachains that want to build 3 blocks every 6 seconds because they have 3 cores constantly, to achieve a higher throughput.
  • Future: Scale block production depending on the demand/special triggers.
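
Below is a minimal, std-only sketch of such a slot-driven trigger loop; `SLOT_DURATION`, the polling sleep, and the `println!` placeholder are illustrative assumptions, not the actual cumulus implementation:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Hypothetical parachain slot duration; per the list above it could be
// >= 6s, < 6s, or replaced entirely by an on-demand trigger.
const SLOT_DURATION: Duration = Duration::from_secs(6);

/// Compute the current slot number from wall-clock time.
fn current_slot() -> u64 {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system time is after the unix epoch");
    (now.as_millis() / SLOT_DURATION.as_millis()) as u64
}

/// Wait (with a short polling sleep) until a new slot starts.
fn wait_for_next_slot(last_slot: u64) -> u64 {
    loop {
        let slot = current_slot();
        if slot > last_slot {
            return slot;
        }
        std::thread::sleep(Duration::from_millis(100));
    }
}

fn main() {
    let mut last_slot = current_slot();
    loop {
        last_slot = wait_for_next_slot(last_slot);
        // Here the actual block production (e.g. Aura) would check whether this
        // collator owns the slot and, if so, build and sign a parachain block
        // using the current relay chain best block as context.
        println!("slot {last_slot}: check authorship and maybe build a block");
    }
}
```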

This block production logic should be implemented in a fairly generic way, meaning that the actual block production is hidden behind some generic type; the logic described here is really just the trigger for building a block. The actual block production is then something like Aura, which gets the key, builds and signs the block, and returns it. We then almost have a normal chain whose block production is separated out. However, we should consider slowing down block production if our parachain is running too far ahead of what is already enacted in the relay chain; this means that the collation task tells the block production to slow down. As each parachain block is built on a certain relay chain block that provides some context, we would use the best block of the relay chain at the time block production starts.
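
To make that separation more concrete, here is a rough sketch of keeping the trigger generic over the actual block production; the `BlockProducer` trait and `SlotTrigger` type are invented names for illustration, not the actual cumulus API:

```rust
/// Hypothetical abstraction over the actual block production (e.g. Aura):
/// it fetches the key, builds and signs the block, and returns it.
trait BlockProducer {
    type Block;
    /// `relay_parent` is the relay chain best block that provides the context.
    fn produce(&mut self, relay_parent: [u8; 32]) -> Option<Self::Block>;
}

/// The trigger side only decides *when* to build and with which context.
/// The collation task can tell it to slow down if the parachain runs too far
/// ahead of what is already enacted on the relay chain.
struct SlotTrigger<P: BlockProducer> {
    producer: P,
    throttled: bool,
}

impl<P: BlockProducer> SlotTrigger<P> {
    fn on_slot(&mut self, relay_best_block: [u8; 32]) -> Option<P::Block> {
        if self.throttled {
            // Too many blocks are not yet included on the relay chain: skip this slot.
            return None;
        }
        self.producer.produce(relay_best_block)
    }
}
```

A timer-driven slot, an on-demand trigger, or a future demand-based scaler could all drive the same `BlockProducer` this way.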

The collation task would work the way the current collator task works: it listens for relay chain block imports, checks if the parachain has a slot, and if so creates the collation. After #3167 is finished, it would also be possible to build the collation early enough to send it to the relay chain on time. We could probably think about certain kinds of optimizations, like keeping the storage proof of an imported/built block around so that we do not have to re-execute the block to create the collation.
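
A minimal sketch of this reactive collation side, assuming a channel of relay chain import notifications; `RelayBlockImported` and `para_has_core_scheduled` are hypothetical stand-ins for the actual relay chain client and runtime APIs:

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical notification for an imported relay chain block.
struct RelayBlockImported {
    hash: [u8; 32],
}

/// Placeholder for the actual relay chain state query ("do we have a slot/core?").
fn para_has_core_scheduled(_relay_hash: &[u8; 32]) -> bool {
    true
}

/// The collation task stays purely reactive to relay chain block imports.
fn collation_task(imports: Receiver<RelayBlockImported>) {
    for relay_block in imports {
        if !para_has_core_scheduled(&relay_block.hash) {
            continue;
        }
        // Take a block previously produced by the block-production task (ideally
        // together with a cached storage proof, so we do not re-execute it) and
        // turn it into a collation for this relay chain block.
        println!("would submit a collation at relay block {:02x?}", &relay_block.hash[..4]);
    }
}
```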

Parachain blocks are built in the context of a certain relay chain block, and with async backing the relay chain allows this context block to lag behind the relay chain block at which the parachain block is validated. However, the difference between the context and the point of validation has a limit. Collation and block production need to ensure that blocks stay valid, or we may need to build a new block with a new context. This is basically the main reason for the slow-down of block production mentioned above.

Forks on the relay chain also need to be considered. A parachain block cannot be validated if its context is on a different fork. There are forks with BABE on the relay chain, but they are not that long. One simple way to improve the situation is to always build parachain blocks on a relay chain block that is at least one block behind the relay chain block they will be validated at.

Another thing we should think about is some kind of slot offset. Let's assume we have a parachain running at a 6-second block time. We want to ensure that we are able to include a block in the relay chain every 6 seconds. We could, for example, run with a slot offset of 2 seconds to always have the block produced before the relay chain block. However, maybe that doesn't make much sense and we should just always run behind in the relay chain block context; this should achieve the same thing. That means we build on context X, get the block validated at X + 1 and included at X + 2.

sandreim (Contributor) commented Feb 1, 2024

> However, we should consider slowing down block production if our parachain is running too far ahead of what is already enacted in the relay chain.

This is exactly the same problem as on the relay chain, where we use client-side back-off versus the proper solution of using finality proofs and letting the runtime decide when to back off. Do we want to implement something similar for the short term, or do we just go for the proper solution in cumulus?

> The collation task would work the way the current collator task works: it listens for relay chain block imports, checks if the parachain has a slot, and if so creates the collation. After #3167 is finished, it would also be possible to build the collation early enough to send it to the relay chain on time. We could probably think about certain kinds of optimizations, like keeping the storage proof of an imported/built block around so that we do not have to re-execute the block to create the collation.

This makes a lot of sense, even more so since advertising the collations is not tied to block production. While there is block authorship consensus, it is not clear whether only specific collators or all collators should advertise collations.

> Parachain blocks are built in the context of a certain relay chain block, and with async backing the relay chain allows this context block to lag behind the relay chain block at which the parachain block is validated. However, the difference between the context and the point of validation has a limit. Collation and block production need to ensure that blocks stay valid, or we may need to build a new block with a new context. This is basically the main reason for the slow-down of block production mentioned above.

Also, the slow down is mandatory to avoid OOM.

> Forks on the relay chain also need to be considered. A parachain block cannot be validated if its context is on a different fork. There are forks with BABE on the relay chain, but they are not that long. One simple way to improve the situation is to always build parachain blocks on a relay chain block that is at least one block behind the relay chain block they will be validated at.

Could collators prefer to use relay parents that do not have any siblings rather than the latest head? This should help assuming the forks are not long, but we need somewhat more lenient parameters around the depth of the relay parent.

bkchr (Member, Author) commented Feb 1, 2024

> This is exactly the same problem as on the relay chain, where we use client-side back-off versus the proper solution of using finality proofs and letting the runtime decide when to back off. Do we want to implement something similar for the short term, or do we just go for the proper solution in cumulus?

I mean, this isn't entirely related. The parachain doesn't "care" about finality. The problem is actually already solved in the parachain runtime:

```rust
// Current block validity check: ensure there is space in the unincluded segment.
//
// If this fails, the parachain needs to wait for ancestors to be included before
// a new block is allowed.
assert!(new_len < capacity.get(), "no space left for the block in the unincluded segment");
```

However, the problem here is also different from the one on the relay chain. Here, if a malicious node continued to produce blocks, it wouldn't get anything out of it, because the relay chain would reject these parachain blocks if the context is too old. So a malicious node would not really gain a bigger share of block production.

> While there is block authorship consensus, it is not clear whether only specific collators or all collators should advertise collations.

Yeah a good point. I actually thought about this. I think for the beginning, we should just let the author send the collation to the relay chain. However, you could also "offload" this process to some random node in the network (depending on your needs and whatever).

> Also, the slow down is mandatory to avoid OOM.

Not sure what should OOM. Yes, maybe the thing that caches the storage proofs, but any proper solution would work with a bounded cache anyway.

> Could collators prefer to use relay parents that do not have any siblings rather than the latest head? This should help assuming the forks are not long, but we need somewhat more lenient parameters around the depth of the relay parent.

I mean, there is no guarantee that there isn't a sibling that you just have not seen yet. Picking, for example, the block with the primary BABE slot would also be a way to ensure you are on the best chain. But yeah, multiple ways are possible here. Someone should do some calculations on the average fork length and then set up the lenient parameters around this average fork length to ensure parachains do not run into the fork problem.

@skunert skunert moved this from Backlog to In Progress in parachains team board Feb 2, 2024
@skunert skunert moved this from backlog to in progress in SDK Node Feb 2, 2024
github-merge-queue bot pushed a commit that referenced this issue Apr 9, 2024
Cumulus test-parachain node and test runtime were still using relay
chain consensus and 12s block times. With async backing around the corner
on the major chains, we should switch our tests too.

Also needed to nicely test the changes coming to collators in #3168.

### Changes Overview
- Followed the [migration
guide](https://wiki.polkadot.network/docs/maintain-guides-async-backing)
for async backing for the cumulus-test-runtime
- Adjusted the cumulus-test-service to use the correct import-queue,
lookahead collator etc.
- The block validation function now uses the Aura Ext Executor so that
the seal of the block is validated
- The previous point requires that we seal the block before calling into
`validate_block`; I introduced a helper function for that
- Test client adjusted to provide a slot to the relay chain proof and
the aura pre-digest
github-merge-queue bot pushed a commit that referenced this issue Jul 5, 2024
Part of #3168 
On top of #3568

### Changes Overview
- Introduces a new collator variant in
`cumulus/client/consensus/aura/src/collators/slot_based/mod.rs`
- Two tasks are part of that module, one for block building and one for
collation building and submission.
- Introduces a new variant of `cumulus-test-runtime` which has 2s slot
duration, used for zombienet testing
- Zombienet tests for the new collator

**Note:** This collator is considered experimental and should only be
used for testing and exploration for now.

### Comparison with `lookahead` collator
- The new variant is slot based, meaning it waits for the next slot of
the parachain, then starts authoring
- The search for potential parents remains mostly unchanged from
lookahead
- As anchor, we use the current best relay parent
- In general, the new collator tends to be anchored to one relay parent
earlier. `lookahead` generally waits for a new relay block to arrive
before it attempts to build a block. This means the actual timing of
parachain blocks depends on when the relay block has been authored and
imported. With the slot-triggered approach we are authoring directly on
the slot boundary, where a new relay chain block has probably not yet
arrived.

### Limitations
- Overall, the current implementation focuses on the "happy path"
- We assume that we want to collate close to the tip of the relay chain.
It would, however, be useful to have some kind of configurable drift so
that we could lag behind a bit. #3965
- The collation task is pretty dumb currently. It checks if we have
cores scheduled and, if so, submits all the messages we have received
from the block builder until we have something submitted for every core.
Ideally we should do some extra checks, e.g. we do not need to submit if
the built block is already too old (built on an out-of-range relay
parent) or was authored with a relay parent that is not an ancestor of
the relay block we are submitting at. #3966
- There is no throttling; we assume that we can submit _velocity_ blocks
every relay chain block. There should be communication between the
collator task and the block-builder task.
- The parent search and ConsensusHook are not yet properly adjusted. The
parent search makes assumptions about the pending candidate which no
longer hold. #3967
- Custom triggers for block building are not implemented yet.

---------

Co-authored-by: Davide Galassi <[email protected]>
Co-authored-by: Andrei Sandu <[email protected]>
Co-authored-by: Bastian Köcher <[email protected]>
Co-authored-by: Javier Viola <[email protected]>
Co-authored-by: command-bot <>
TomaszWaszczyk pushed a commit to TomaszWaszczyk/polkadot-sdk that referenced this issue Jul 7, 2024
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this issue Aug 2, 2024