Slot based collations #3168

Open · Tracked by #1829
bkchr opened this issue Feb 1, 2024 · 2 comments

bkchr (Member) commented Feb 1, 2024

The current collator implementations depend on relay chain block import as their "clock": for every imported relay chain block, they check the relay chain state to see whether the parachain is allowed to build a block and, if so, build one. The relay chain itself, however, has a task that fires every 6 seconds and then builds a block. Parachains of course still need to get their blocks included in the relay chain, but with async backing parachains get more freedom when it comes to block production. So, we should split the current collator implementation into two tasks:

  1. Block production
  2. Collation generation

Block production would run the same way as on the relay chain: we have a fixed slot that fires every X seconds to build a block. This mechanism needs to be implemented in a flexible way to support the following cases (see the sketch after this list):

  • One parachain block every X seconds, where X >= 6 seconds (the relay chain slot time).
  • Build a parachain block on demand on some trigger. The trigger could be, for example, that the transaction pool is full enough and we have bought some on-demand slot.
  • One parachain block every X seconds, where X < 6 seconds, but the block time is less than 2 seconds. This should be useful for parachains that want to run at a higher pace than the relay chain to give faster feedback on transaction inclusion.
  • One parachain block every X seconds, where X < 6 seconds, but the block time is 2 seconds. This should be useful for parachains that want to build 3 blocks every 6 seconds because they have 3 cores constantly, to achieve a higher throughput.
  • Future: Scale block production depending on the demand/special triggers.
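
Below is a minimal, std-only sketch of such a slot-driven trigger loop; `SLOT_DURATION`, the polling sleep, and the `println!` placeholder are illustrative assumptions, not the actual cumulus implementation:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Hypothetical parachain slot duration; per the list above it could be
// >= 6s, < 6s, or replaced entirely by an on-demand trigger.
const SLOT_DURATION: Duration = Duration::from_secs(6);

/// Compute the current slot number from wall-clock time.
fn current_slot() -> u64 {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system time is after the unix epoch");
    (now.as_millis() / SLOT_DURATION.as_millis()) as u64
}

/// Wait (with a short polling sleep) until a new slot starts.
fn wait_for_next_slot(last_slot: u64) -> u64 {
    loop {
        let slot = current_slot();
        if slot > last_slot {
            return slot;
        }
        std::thread::sleep(Duration::from_millis(100));
    }
}

fn main() {
    let mut last_slot = current_slot();
    loop {
        last_slot = wait_for_next_slot(last_slot);
        // Here the actual block production (e.g. Aura) would check whether this
        // collator owns the slot and, if so, build and sign a parachain block
        // using the current relay chain best block as context.
        println!("slot {last_slot}: check authorship and maybe build a block");
    }
}
```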

This block production logic should be implemented in a fairly generic way, meaning that the actual block production is hidden behind some generic type; the logic described here is really just the trigger for building a block. The actual block production is then something like Aura, which gets the key, builds and signs the block, and returns it. We then almost have a normal chain whose block production is separated out. However, we should consider slowing down block production if our parachain is running too far ahead of what is already enacted in the relay chain; this means that the collation task tells the block production to slow down. As each parachain block is built on a certain relay chain block that provides some context, we would use the best block of the relay chain at the time block production starts.
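
To make that separation more concrete, here is a rough sketch of keeping the trigger generic over the actual block production; the `BlockProducer` trait and `SlotTrigger` type are invented names for illustration, not the actual cumulus API:

```rust
/// Hypothetical abstraction over the actual block production (e.g. Aura):
/// it fetches the key, builds and signs the block, and returns it.
trait BlockProducer {
    type Block;
    /// `relay_parent` is the relay chain best block that provides the context.
    fn produce(&mut self, relay_parent: [u8; 32]) -> Option<Self::Block>;
}

/// The trigger side only decides *when* to build and with which context.
/// The collation task can tell it to slow down if the parachain runs too far
/// ahead of what is already enacted on the relay chain.
struct SlotTrigger<P: BlockProducer> {
    producer: P,
    throttled: bool,
}

impl<P: BlockProducer> SlotTrigger<P> {
    fn on_slot(&mut self, relay_best_block: [u8; 32]) -> Option<P::Block> {
        if self.throttled {
            // Too many blocks are not yet included on the relay chain: skip this slot.
            return None;
        }
        self.producer.produce(relay_best_block)
    }
}
```

A timer-driven slot, an on-demand trigger, or a future demand-based scaler could all drive the same `BlockProducer` this way.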

The collation task would work the way the current collator task works: it listens for relay chain block imports, checks if the parachain has a slot, and if so creates the collation. After #3167 is finished, it would also be possible to build the collation early enough to send it to the relay chain on time. We could probably think about certain kinds of optimizations, like keeping the storage proof of an imported/built block around so that we do not have to re-execute the block to create the collation.
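
A minimal sketch of this reactive collation side, assuming a channel of relay chain import notifications; `RelayBlockImported` and `para_has_core_scheduled` are hypothetical stand-ins for the actual relay chain client and runtime APIs:

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical notification for an imported relay chain block.
struct RelayBlockImported {
    hash: [u8; 32],
}

/// Placeholder for the actual relay chain state query ("do we have a slot/core?").
fn para_has_core_scheduled(_relay_hash: &[u8; 32]) -> bool {
    true
}

/// The collation task stays purely reactive to relay chain block imports.
fn collation_task(imports: Receiver<RelayBlockImported>) {
    for relay_block in imports {
        if !para_has_core_scheduled(&relay_block.hash) {
            continue;
        }
        // Take a block previously produced by the block-production task (ideally
        // together with a cached storage proof, so we do not re-execute it) and
        // turn it into a collation for this relay chain block.
        println!("would submit a collation at relay block {:02x?}", &relay_block.hash[..4]);
    }
}
```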

Parachain blocks are built in the context of a certain relay chain block, and with async backing the relay chain allows this context block to lag behind the relay chain block at which the parachain block is validated. However, the difference between the context and the point of validation has a limit. Collation and block production need to ensure that blocks stay valid, or we may need to build a new block with a new context. This is basically the main reason for the slow-down of block production mentioned above.

Forks on the relay chain also need to be considered. A parachain block cannot be validated if its context is on a different fork. There are forks with BABE on the relay chain, but they are not that long. One simple way to improve the situation is to always build parachain blocks on a relay chain block that is at least one block behind the relay chain block they will be validated at.

Another thing we should think about is some kind of slot offset. Let's assume we have a parachain running at a 6-second block time. We want to ensure that we are able to include a block in the relay chain every 6 seconds. We could, for example, run with a slot offset of 2 seconds to always have the block produced before the relay chain block. However, maybe that doesn't make much sense and we should just always run behind in the relay chain block context; this should achieve the same thing. That means we build on context X, get the block validated at X + 1 and included at X + 2.

sandreim (Contributor) commented Feb 1, 2024

> However, we should consider slowing down block production if our parachain is running too far ahead of what is already enacted in the relay chain.

This is exactly the same problem as on the relay chain, where we use client-side back-off versus the proper solution of using finality proofs and letting the runtime decide when to back off. Do we want to implement something similar for the short term, or do we just go for the proper solution in cumulus?

> The collation task would work the way the current collator task works: it listens for relay chain block imports, checks if the parachain has a slot, and if so creates the collation. After #3167 is finished, it would also be possible to build the collation early enough to send it to the relay chain on time. We could probably think about certain kinds of optimizations, like keeping the storage proof of an imported/built block around so that we do not have to re-execute the block to create the collation.

This makes a lot of sense, even more so since advertising the collations is not tied to block production. While there is block authorship consensus, it is not clear whether only specific collators or all collators should advertise collations.

> Parachain blocks are built in the context of a certain relay chain block, and with async backing the relay chain allows this context block to lag behind the relay chain block at which the parachain block is validated. However, the difference between the context and the point of validation has a limit. Collation and block production need to ensure that blocks stay valid, or we may need to build a new block with a new context. This is basically the main reason for the slow-down of block production mentioned above.

Also, the slow down is mandatory to avoid OOM.

> Forks on the relay chain also need to be considered. A parachain block cannot be validated if its context is on a different fork. There are forks with BABE on the relay chain, but they are not that long. One simple way to improve the situation is to always build parachain blocks on a relay chain block that is at least one block behind the relay chain block they will be validated at.

Could collators prefer to use relay parents that do not have any siblings rather than the latest head? This should help assuming the forks are not long, but we need somewhat more lenient parameters around the depth of the relay parent.

bkchr (Member, Author) commented Feb 1, 2024

> This is exactly the same problem as on the relay chain, where we use client-side back-off versus the proper solution of using finality proofs and letting the runtime decide when to back off. Do we want to implement something similar for the short term, or do we just go for the proper solution in cumulus?

I mean, this isn't entirely related. The parachain doesn't "care" about finality. The problem is actually already solved in the parachain runtime:

```rust
// Current block validity check: ensure there is space in the unincluded segment.
//
// If this fails, the parachain needs to wait for ancestors to be included before
// a new block is allowed.
assert!(new_len < capacity.get(), "no space left for the block in the unincluded segment");
```

However, the problem here is also different from the one on the relay chain. Here, if a malicious node continued to produce blocks, it wouldn't get anything out of it, because the relay chain would reject these parachain blocks if the context is too old. So a malicious node would not really gain a bigger share of block production.

> While there is block authorship consensus, it is not clear whether only specific collators or all collators should advertise collations.

Yeah a good point. I actually thought about this. I think for the beginning, we should just let the author send the collation to the relay chain. However, you could also "offload" this process to some random node in the network (depending on your needs and whatever).

> Also, the slow down is mandatory to avoid OOM.

Not sure what should OOM. Yes, maybe the thing that caches the storage proofs, but any proper solution would work with a bounded cache anyway.

> Could collators prefer to use relay parents that do not have any siblings rather than the latest head? This should help assuming the forks are not long, but we need somewhat more lenient parameters around the depth of the relay parent.

I mean, there is no guarantee that there isn't a sibling that you just have not seen yet. Picking, for example, the block with the primary BABE slot would also be a way to ensure you are on the best chain. But yeah, multiple ways are possible here. Someone should do some calculations on the average fork length and then set up the lenient parameters around this average fork length to ensure parachains do not run into the fork problem.

@skunert skunert moved this from Backlog to In Progress in parachains team board Feb 2, 2024
@skunert skunert moved this from backlog to in progress in SDK Node Feb 2, 2024
github-merge-queue bot pushed a commit that referenced this issue Apr 9, 2024
Cumulus test-parachain node and test runtime were still using relay
chain consensus and 12s block times. With async backing around the corner
on the major chains, we should switch our tests too.

Also needed to nicely test the changes coming to collators in #3168.

### Changes Overview
- Followed the [migration
guide](https://wiki.polkadot.network/docs/maintain-guides-async-backing)
for async backing for the cumulus-test-runtime
- Adjusted the cumulus-test-service to use the correct import-queue,
lookahead collator etc.
- The block validation function now uses the Aura Ext Executor so that
the seal of the block is validated
- The previous point requires that we seal the block before calling into
`validate_block`; I introduced a helper function for that
- Test client adjusted to provide a slot to the relay chain proof and
the aura pre-digest
github-merge-queue bot pushed a commit that referenced this issue Jul 5, 2024
Part of #3168 
On top of #3568

### Changes Overview
- Introduces a new collator variant in
`cumulus/client/consensus/aura/src/collators/slot_based/mod.rs`
- Two tasks are part of that module, one for block building and one for
collation building and submission.
- Introduces a new variant of `cumulus-test-runtime` which has 2s slot
duration, used for zombienet testing
- Zombienet tests for the new collator

**Note:** This collator is considered experimental and should only be
used for testing and exploration for now.

### Comparison with `lookahead` collator
- The new variant is slot based, meaning it waits for the next slot of
the parachain, then starts authoring
- The search for potential parents remains mostly unchanged from
lookahead
- As anchor, we use the current best relay parent
- In general, the new collator tends to be anchored to one relay parent
earlier. `lookahead` generally waits for a new relay block to arrive
before it attempts to build a block. This means the actual timing of
parachain blocks depends on when the relay block has been authored and
imported. With the slot-triggered approach we are authoring directly on
the slot boundary, where a new relay chain block has probably not yet
arrived.

### Limitations
- Overall, the current implementation focuses on the "happy path"
- We assume that we want to collate close to the tip of the relay chain.
It would, however, be useful to have some kind of configurable drift so
that we could lag behind a bit. #3965
- The collation task is pretty dumb currently. It checks if we have
cores scheduled and, if so, submits all the messages we have received
from the block builder until we have something submitted for every core.
Ideally we should do some extra checks, e.g. we do not need to submit if
the built block is already too old (built on an out-of-range relay
parent) or was authored with a relay parent that is not an ancestor of
the relay block we are submitting at. #3966
- There is no throttling; we assume that we can submit _velocity_ blocks
every relay chain block. There should be communication between the
collator task and the block-builder task.
- The parent search and ConsensusHook are not yet properly adjusted. The
parent search makes assumptions about the pending candidate which no
longer hold. #3967
- Custom triggers for block building are not implemented yet.

---------

Co-authored-by: Davide Galassi <[email protected]>
Co-authored-by: Andrei Sandu <[email protected]>
Co-authored-by: Bastian Köcher <[email protected]>
Co-authored-by: Javier Viola <[email protected]>
Co-authored-by: command-bot <>
TomaszWaszczyk pushed a commit to TomaszWaszczyk/polkadot-sdk that referenced this issue Jul 7, 2024
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this issue Aug 2, 2024