`text/0103-introduce-core-index-commitment.md` (from PR "Constrain parachain block validity on a specific core" #103, merged Sep 9, 2024)

# RFC-0103: Introduce a `CoreIndex` commitment and a `SessionIndex` field in candidate receipts

| | |
| --------------- | ------------------------------------------------------------------------------------------- |
| **Start Date** | 15 July 2024 |
| **Description** | Constrain parachain block validity to a specific core and session |
| **Authors** | Andrei Sandu |

## Summary

Elastic scaling is not resilient against griefing attacks without a way for a PoV (Proof of Validity)
to commit to the particular core index it was intended for. This RFC proposes a way to include
core index information in the candidate commitments and the `CandidateDescriptor` data structure
in a backwards compatible way. Additionally, it proposes adding a `SessionIndex` field to
the `CandidateDescriptor` to make dispute resolution more secure and robust.


## Motivation

This RFC proposes a way to solve two different problems:

1. For Elastic Scaling, it prevents anyone who has acquired a valid collation from DoSing the
   parachain by providing the same collation to all backing groups assigned to the parachain.
   This can happen before the next valid parachain block is authored and would prevent the
   chain of candidates from being formed, reducing the throughput of the parachain to that of
   a single core.
2. The dispute protocol relies on validators trusting the session index provided by other
   validators when initiating and participating in disputes. It is used to look up validator
   keys and check dispute vote signatures. By adding a `SessionIndex` to the
   `CandidateDescriptor`, validators no longer have to trust the `SessionIndex` provided by
   the validator raising a dispute. It can happen that the dispute concerns a relay chain
   block not yet imported by a validator. In this case validators can safely assume the
   session index refers to the session the candidate appeared in; otherwise the chain would
   have rejected the candidate.

## Stakeholders

- Polkadot core developers.
- Cumulus node developers.
- Tooling, block explorer developers.

This approach and alternatives have been considered and discussed in [this issue](https://github.com/polkadot-fellows/RFCs/issues/92).

## Explanation

The approach proposed below was chosen primarily because it minimizes the number of breaking
changes and the overall complexity, and takes less implementation and testing time. The proposal
is to change the existing primitives while keeping binary compatibility with the older versions.
We repurpose unused fields to introduce core index and session index information in the
`CandidateDescriptor` and extend the UMP to transport non-XCM messages.

### Reclaiming unused space in the descriptor

The `CandidateDescriptor` currently includes `collator` and `signature` fields. The collator
includes a signature on the following descriptor fields: parachain id, relay parent, validation
data hash, validation code hash and the PoV hash.

However, in practice, having a collator signature in the receipt on the relay chain does not
provide any benefits as there is no mechanism to punish or reward collators that have provided
bad parachain blocks.

This proposal aims to remove the collator signature and all the logic that checks the collator
signatures of candidate receipts. We use the first 7 reclaimed bytes to represent the version,
the core index and the session index, and fill the rest with zeroes. So, there is no change in
the layout and length of the receipt. The new primitive is binary compatible with the old one.

### UMP transport

[CandidateCommitments](https://github.com/paritytech/polkadot-sdk/blob/b5029eb4fd6c7ffd8164b2fe12b71bad0c59c9f2/polkadot/primitives/src/v7/mod.rs#L682)
remains unchanged, as we will store SCALE-encoded `UMPSignal` messages directly in the parachain
UMP queue by outputting them in [upward_messages](https://github.com/paritytech/polkadot-sdk/blob/b5029eb4fd6c7ffd8164b2fe12b71bad0c59c9f2/polkadot/primitives/src/v7/mod.rs#L684).

The UMP queue layout is changed to allow the relay chain to receive both XCM messages and
`UMPSignal` messages. An empty message (an empty `Vec<u8>`) is used to mark the end of the XCM
messages and the start of the `UMPSignal` messages.

This way of representing the new messages was chosen over introducing an enum wrapper in order
to minimize breaking changes to XCM message decoding in tools such as Subscan.

Example:
```rust
[ XCM message1, XCM message2, ..., EMPTY message, UMPSignal::SelectCore ]
```
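The split at the separator can be sketched as follows. This is a hypothetical consumer-side helper, not part of the SDK; `split_upward_messages` is an illustrative name.

```rust
// Hypothetical sketch (not actual SDK code) of how a consumer of the UMP queue
// could split it at the empty-message separator: XCM messages first, then
// SCALE-encoded `UMPSignal` messages.

type UpwardMessage = Vec<u8>;

/// Splits the queue at the first empty message. Everything before it is XCM;
/// everything after it is `UMPSignal` data. With no separator present, all
/// messages are treated as XCM.
fn split_upward_messages(messages: &[UpwardMessage]) -> (&[UpwardMessage], &[UpwardMessage]) {
    match messages.iter().position(|m| m.is_empty()) {
        Some(sep) => (&messages[..sep], &messages[sep + 1..]),
        None => (messages, &[]),
    }
}

fn main() {
    // [ XCM message1, XCM message2, EMPTY message, UMPSignal bytes ]
    let queue = vec![vec![1u8, 2], vec![3u8], vec![], vec![0u8, 7]];
    let (xcm, signals) = split_upward_messages(&queue);
    assert_eq!(xcm.len(), 2);
    assert_eq!(signals.len(), 1);
}
```

Note that a tool which simply stops decoding at the first empty message keeps working unchanged, which is the point of choosing a separator over an enum wrapper.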

#### `UMPSignal` messages

```rust
/// A `u8` wrap-around sequence number. Typically this would be the least significant byte of the
/// parachain block number.
pub struct CoreSelector(pub u8);

/// An offset in the relay chain claim queue.
pub struct ClaimQueueOffset(pub u8);

/// Signals sent by a parachain to the relay chain.
pub enum UMPSignal {
    /// A message sent by a parachain to select the core the candidate is committed to.
    /// Relay chain validators, in particular backers, use the `CoreSelector` and `ClaimQueueOffset`
    /// to compute the index of the core the candidate has committed to.
    SelectCore(CoreSelector, ClaimQueueOffset),
}
```

The parachain runtime is not concerned with the actual `CoreIndex` the candidate is intended for,
but must provide enough information for the validators and collators to compute it. `CoreSelector`
and `ClaimQueueOffset` use only 2 bytes and fulfill the requirement.

**Example:**

`cq_offset = 1` and `core_selector = 3`

The table below represents a snapshot of the claim queue:

| | offset = 0 | offset = 1 | offset = 2 |
| :--: | :--: | :--: | :--: |
| Core 1 | **Para A** | **Para A** | **Para A** |
| Core 2 | **Para A** | Para B | **Para A** |
| Core 3 | Para B | **Para A** | **Para A** |

The purpose of `ClaimQueueOffset` is to select the column from the above table.

For `cq_offset = 1` we get `[ Para A, Para B, Para A ]`, and from it build a sorted vec of
the cores Para A is assigned to: `[ Core 1, Core 3 ]`, called `para_assigned_cores`.
We use `core_selector` to determine that the committed core index is `Core 3`, like this:

```rust
let committed_core_index = para_assigned_cores[core_selector % para_assigned_cores.len()];
```
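The whole example can be sketched end to end. This is an illustrative stand-alone computation, not the actual runtime API; `committed_core_index` and the claim queue representation below are hypothetical.

```rust
// Illustrative sketch of the core index computation described above, using the
// claim queue snapshot from the table. Names are hypothetical, not SDK APIs.

/// `claim_queue`: for each core (in ascending core order), the para assigned
/// at each claim queue offset.
fn committed_core_index(
    claim_queue: &[(u16, Vec<&str>)],
    para: &str,
    cq_offset: usize,
    core_selector: u8,
) -> Option<u16> {
    // Collect, in ascending core order, the cores assigned to `para`
    // at the requested claim queue offset.
    let para_assigned_cores: Vec<u16> = claim_queue
        .iter()
        .filter(|(_, assignments)| assignments.get(cq_offset) == Some(&para))
        .map(|(core, _)| *core)
        .collect();
    if para_assigned_cores.is_empty() {
        return None;
    }
    Some(para_assigned_cores[core_selector as usize % para_assigned_cores.len()])
}

fn main() {
    // The snapshot from the table: cores 1..=3, offsets 0..=2.
    let claim_queue = vec![
        (1u16, vec!["A", "A", "A"]),
        (2u16, vec!["A", "B", "A"]),
        (3u16, vec!["B", "A", "A"]),
    ];
    // cq_offset = 1, core_selector = 3: assigned cores are [1, 3], 3 % 2 == 1 -> Core 3.
    assert_eq!(committed_core_index(&claim_queue, "A", 1, 3), Some(3));
}
```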

Parachains should prefer a static `ClaimQueueOffset` value that makes sense for their use case,
which can be changed by governance at some future point. Changing the value dynamically can be
a friction point. Decreasing the value to build closer to the present works out fine, but if
the value is increased to build further into the future, a relay chain block will be skipped.

### Polkadot Primitive changes

#### New [CandidateDescriptor](https://github.com/paritytech/polkadot-sdk/blob/b5029eb4fd6c7ffd8164b2fe12b71bad0c59c9f2/polkadot/primitives/src/v7/mod.rs#L512)

- reclaim 32 bytes from `collator: CollatorId` and 64 bytes from `signature: CollatorSignature`
  and rename them to `reserved1` and `reserved2` fields.
- take 1 byte from `reserved1` for a new `version: u8` field.
- take 2 bytes from `reserved1` for a new `core_index: u16` field.
- take 4 bytes from `reserved1` for a new `session_index: u32` field.
- the remaining `reserved1` and `reserved2` bytes are zeroed.
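As a sanity check on the byte accounting above (illustrative constants, assuming the field sizes listed in the bullets):

```rust
// The reclaimed space must add up exactly, so the descriptor keeps the same
// layout and length as the version 1 descriptor.

fn main() {
    const COLLATOR_ID_LEN: usize = 32; // old `collator: CollatorId`
    const COLLATOR_SIGNATURE_LEN: usize = 64; // old `signature: CollatorSignature`

    const VERSION_LEN: usize = 1;
    const CORE_INDEX_LEN: usize = 2;
    const SESSION_INDEX_LEN: usize = 4;
    const RESERVED1_LEN: usize = 25;
    const RESERVED2_LEN: usize = 64;

    // version + core_index + session_index + reserved1 == old collator id.
    assert_eq!(
        VERSION_LEN + CORE_INDEX_LEN + SESSION_INDEX_LEN + RESERVED1_LEN,
        COLLATOR_ID_LEN
    );
    // reserved2 == old collator signature.
    assert_eq!(RESERVED2_LEN, COLLATOR_SIGNATURE_LEN);
}
```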

The new primitive will look like this:

```rust
pub struct CandidateDescriptorV2<H = Hash> {
    /// The ID of the para this is a candidate for.
    para_id: ParaId,
    /// The hash of the relay-chain block this is executed in the context of.
    relay_parent: H,
    /// Version field. The raw value here is not exposed, instead it is used
    /// to determine the `CandidateDescriptorVersion`.
    version: InternalVersion,
    /// The core index where the candidate is backed.
    core_index: u16,
    /// The session index of the candidate relay parent.
    session_index: SessionIndex,
    /// Reserved bytes.
    reserved1: [u8; 25],
    /// The blake2-256 hash of the persisted validation data. This is extra data derived from
    /// relay-chain state which may vary based on bitfields included before the candidate.
    /// Thus it cannot be derived entirely from the relay-parent.
    persisted_validation_data_hash: Hash,
    /// The blake2-256 hash of the PoV.
    pov_hash: Hash,
    /// The root of a block's erasure encoding Merkle tree.
    erasure_root: Hash,
    /// Reserved bytes.
    reserved2: [u8; 64],
    /// Hash of the para header that is being generated by this candidate.
    para_head: Hash,
    /// The blake2-256 hash of the validation code bytes.
    validation_code_hash: ValidationCodeHash,
}
```

In future format versions, parts of the `reserved1` and `reserved2` bytes can be used to include
additional information in the descriptor.

### Parachain block validation

If the candidate descriptor is version 1, there are no changes.

Backers must check the validity of `core_index` and `session_index` fields.
A candidate must not be backed if any of the following are true:
- the `core_index` in the descriptor does not match the core the backer is assigned to
- the `session_index` is not equal to the session of the `relay_parent` in the descriptor
- the `core_index` in the descriptor does not match the one determined by the
`UMPSignal::SelectCore` message


### On-chain backing

If the candidate descriptor is version 1, there are no changes.

For version 2 descriptors, the runtime will determine the `core_index` similarly to backers, but
will always ignore the committed claim queue offset and use `0`, as it only cares about which
candidates can be backed at the current block.

As the chain advances, the claims also advance towards the top of the queue. The runtime will
only back a candidate if the claim selected by its claim queue offset has reached the top of the
queue at the current relay chain block: `current_block_num - relay_parent_num - 1 == claim_queue_offset`.
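The timing rule can be sketched as a small predicate (illustrative, not the actual runtime code; `can_back_now` is a hypothetical name):

```rust
// Sketch of the rule: a candidate built with a given claim queue offset can
// only be backed once its claim reaches the top of the queue. Assumes
// `current_block_num >= relay_parent_num`.

fn can_back_now(current_block_num: u32, relay_parent_num: u32, claim_queue_offset: u32) -> bool {
    // Blocks elapsed since the relay parent, minus one; `None` means the
    // candidate cannot be backed in the relay parent's own block.
    match (current_block_num - relay_parent_num).checked_sub(1) {
        Some(elapsed) => elapsed == claim_queue_offset,
        None => false,
    }
}

fn main() {
    // A candidate with offset 1 built at relay parent 100 is backable only at block 102.
    assert!(!can_back_now(101, 100, 1));
    assert!(can_back_now(102, 100, 1));
    assert!(!can_back_now(103, 100, 1));
}
```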

The impact of this change is that candidates built for the future (`claim_queue_offset > 0`)
can no longer be backed earlier, even if the core is free and assigned to the parachain.

Introducing this additional check is required to ensure the runtime computes the core index
deterministically. For example, suppose a collator has missed its slot and the core is now used
to back a candidate with a higher claim queue offset. The number of assigned cores can be
different at these two queue offsets, so the committed core indices would be different.


### Backwards compatibility

There are two flavors of candidate receipts that are used in network protocols, runtime and node
implementation:
- `CommittedCandidateReceipt`, which includes the `CandidateDescriptor` and the `CandidateCommitments`
- `CandidateReceipt`, which includes the `CandidateDescriptor` and just a hash of the commitments

We want to support both the old and new versions in the runtime and node. The implementation must
be able to detect the version of a given candidate receipt.

The version of the descriptor is detected by checking whether the `version` field is `0` and the
reserved fields are zeroed. If so, the descriptor is version 2; otherwise we consider it
version 1.
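A minimal sketch of this detection rule, using an illustrative trimmed-down struct with only the fields involved in the check (`descriptor_version` and the plain `u8` version field are assumptions for the example):

```rust
// Sketch: a descriptor is treated as version 2 only when the version byte is 0
// *and* all reserved bytes are zeroed; anything else decodes as a version 1
// descriptor, which carries a collator id and signature in those positions.

struct CandidateDescriptorV2 {
    version: u8,
    reserved1: [u8; 25],
    reserved2: [u8; 64],
}

#[derive(Debug, PartialEq)]
enum CandidateDescriptorVersion {
    V1,
    V2,
}

fn descriptor_version(d: &CandidateDescriptorV2) -> CandidateDescriptorVersion {
    let reserved_zeroed = d.reserved1.iter().chain(d.reserved2.iter()).all(|b| *b == 0);
    if d.version == 0 && reserved_zeroed {
        CandidateDescriptorVersion::V2
    } else {
        CandidateDescriptorVersion::V1
    }
}

fn main() {
    let v2 = CandidateDescriptorV2 { version: 0, reserved1: [0; 25], reserved2: [0; 64] };
    assert_eq!(descriptor_version(&v2), CandidateDescriptorVersion::V2);

    // Any non-zero byte in the old collator/signature positions means version 1.
    let mut v1 = CandidateDescriptorV2 { version: 0, reserved1: [0; 25], reserved2: [0; 64] };
    v1.reserved2[0] = 1;
    assert_eq!(descriptor_version(&v1), CandidateDescriptorVersion::V1);
}
```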


## Drawbacks

The only drawback is that further additions to the descriptor are limited to the amount of
remaining unused space.

## Testing, Security, and Privacy

Standard testing (unit tests, CI zombienet tests) for functionality and mandatory security audit
to ensure the implementation does not introduce any new security issues.

Backwards compatibility of the implementation will be tested on testnets (Versi and Westend).

There is no impact on privacy.

## Performance

The expected performance impact of sending and processing the new UMP message is negligible,
both in the runtime and on the node side.

The `ClaimQueueOffset`, along with the relay parent choice, allows parachains to optimize their
block production for either throughput or latency. A value of `0` with the newest relay parent
provides the best latency, while picking older relay parents avoids re-orgs.


## Ergonomics

It is mandatory for elastic parachains to switch to the new receipt format. It is optional but
desirable for all parachains to switch to the new receipts, to provide the session index for
disputes.

The implementation of this RFC itself must not introduce any breaking changes for the parachain
runtime or collator nodes.

## Compatibility

The proposed changes are backwards compatible in general, but additional care must be taken by
waiting for at least `2/3 + 1` of the validators to upgrade. Validators that have not upgraded
will not back candidates using the new descriptor format and will also initiate disputes against
such candidates.

### Relay chain runtime

The first step is to remove the collator signature checking logic in the runtime, but keep the
node-side collator signature checks.

The runtime must be upgraded to the new primitives before any collator or node is allowed to use
the new candidate receipts format.

### Validators

To ensure a smooth launch, a new node feature is required.
The feature acts as a signal for supporting the new candidate receipts on the node side and can
only be safely enabled if at least `2/3 + 1` of the validators are upgraded.

Once enabled, the validators will skip checking the collator signature when processing the
candidate receipts and verify the `CoreIndex` and `SessionIndex` fields if present in the receipt.

No new implementation of networking protocol versions for collation and validation is required.

### Tooling

Any tooling that decodes UMP XCM messages needs an update to support or ignore the new UMP
messages, but should still be able to decode the regular XCM messages that come before the
separator.

## Prior Art and References

Forum discussion about a new `CandidateReceipt` format:
https://forum.polkadot.network/t/pre-rfc-discussion-candidate-receipt-format-v2/3738

## Unresolved Questions

N/A

## Future Directions and Related Material

The implementation is extensible and future proof to some extent. With minimal or no breaking
changes, additional fields can be added to the candidate descriptor until the reserved space is
exhausted.

At this point there is a simple way to determine the version of the receipt: testing for zeroed
reserved bytes in the descriptor. Future versions of the receipt can be implemented and
identified by using the `version` field of the descriptor introduced in this RFC.