provisioner: allow multiple cores assigned to the same para #3233
Conversation
Force-pushed the …aling-provisioner branch from fa0df72 to 1d60af2.
I don't think I am getting what you are saying here. What successors are we talking about? In my mind nothing is special in this scenario. We have free cores and we will fill them with a connected chain of blocks.
Say we have all 5 cores for the same paraid. The runtime will next evict C, D and E and will have 3 scheduled cores for this paraid. It'll only back one candidate for now, because that's what the provisioner offered. This is a compromise I made to not overcomplicate the code, considering timeouts should not happen often.
I think we should also time out all dependents with the dependency. So if C times out, we should also time out D and E as a consequence, freeing their cores as well.
Yes, that's what I described. The runtime should evict those cores.
I was talking about what the provisioner is doing. If we want to make sure that we're as optimal as possible in the provisioner (to never leave any free cores after a timeout), we'd have to request backable candidates from prospective-parachains at once for all paraids. And prospective-parachains would have to first perform all the logic of detecting the dependencies of timed out candidates before doing all the things this PR is doing (because a timed out candidate can have a different para id as its `next_up_on_timeout`). Review the PR please and if you think it's not complex enough already let me know :D
Looks mostly good and makes sense. We just backfill the timed-out core + any depending cores (which are already available) with any existing candidates that are not already pending availability (if there are any). This is the best we can do. So in the case that the dependents are already available, we will already backfill - because we think they are available, while in fact they get freed because of their dependency timing out. The result is the same though. Only in the case that they are not yet available would we need another block to backfill them. We can fix that once we have proper dependencies (based on CandidateHash).
I think the cycle code can be simplified, and we need to reconsider our treatment of cores.
@@ -1546,6 +1659,481 @@ mod tests {
assert_eq!(tree.nodes[1].parent, NodePointer::Storage(0));
Let's make this tests module a proper file.
I agree, but it'd be harder to review (to see which tests were added). I'd do that in its own PR
}
} else if occupied_core.time_out_at > block_number {
// Not timed out and not available.
ancestors
It seems like passing these Ancestor states into prospective parachains is exposing an implementation detail and also does not seem necessary:
- Working with cycles can be done by prospective parachains alone. In particular the implementation seems also wrong: by using a HashMap from `CandidateHash` to `AncestorState` you are not handling the cycles prospective parachains cares about, because those cycles would not just be one cycle, but multiple, by exploiting the fact that the candidates can differ, even if they depend on the same head data. Thus `count` will be 1 even if there are such cycles. It would be better to just append `CandidateHash`es to a vec or insert them in a `HashSet`. We should be able to assume that `CandidateHash`es are unique.
- Instead of setting timeout to true, we could just not add the timedout candidate to the ancestry to begin with. (voiding my argument above that we can lift the adding to ancestry up.) Then in `fragment_tree`, there would be a hole in the chain - which is fine, we just need to stop once we find no matching child - ignoring any remaining elements in ancestors. (Should no longer be an error.)
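For illustration, a minimal Rust sketch of the shape suggested above, using invented, simplified types (not the actual prospective-parachains code): ancestors become a plain `HashSet<CandidateHash>`, and the tree walk stops at the first hole instead of erroring.

```rust
use std::collections::HashSet;

// Illustrative stand-ins, not the real primitives.
type CandidateHash = [u8; 32];

struct Node {
    candidate_hash: CandidateHash,
    children: Vec<Node>,
}

/// Walk the tree along the given ancestors. Membership checks and removals on
/// the `HashSet` are O(1). When no child matches a remaining ancestor (a
/// "hole", e.g. a timed-out candidate that was never added to the ancestry),
/// stop and return the node reached so far instead of returning an error.
fn find_path_from_ancestors<'a>(
    root: &'a Node,
    ancestors: &mut HashSet<CandidateHash>,
) -> &'a Node {
    let mut current = root;
    loop {
        match current
            .children
            .iter()
            .find(|child| ancestors.remove(&child.candidate_hash))
        {
            Some(child) => current = child,
            None => return current,
        }
    }
}
```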
Let's take the points one by one.
Working with cycles can be done by prospective parachains alone. In particular the implementation seems also wrong, by using a HashMap from CandidateHash to AncestorState you are not handling the cycles prospective parachains cares about, because those cycles would not just be one cycle, but multiple, by exploiting the fact that the candidates can differ, even if they depend on the same head data. Thus count will be 1 even if there are such cycles
You're mostly right.
I think there are two cases here:
- different candidate hashes occupying the cores lead to cycles in the head_data state. Example: We have two head data states A and B. Candidate X transitions the parent_head_data from A to output head data B. Candidate Y transitions the parent_head_data B to the output head data A. So the cores will be occupied by X and Y and the output state of X and Y would be A. The next backable candidate would be X (or another, non-cyclic candidate if it exists in the fragment tree).
- same candidate hashes occupying the cores lead to cycles in the head_data state. Simplest example is a candidate X which has parent head data A and output head data A (a cycle of size 0). Another example is the one given at the previous bullet point, if we continue cycling X and Y (so X Y X for example).
In both cases, under the hood, prospective-parachains will build the fragment trees taking the head_data cycles into account.
So the trees will look like:
- X Y ...
- X X ...
So, to summarise, you're right saying "Thus count will be 1 even if there are such cycles". This was done by design.
A count larger than 1 is not strictly equivalent to a cycle. If the count is larger than 1, we can be sure there's a cycle, but there can be cycles even if the count is 1.
So here in the provisioner, we're not detecting cycles. We're just building a record of currently occupied cores. We use the HashMap for efficiency. We could use a Vec to express the ancestors, but it's more efficient as a HashMap because in `find_path_from_ancestors` we do a lot of existence checks and removals, which are O(1) compared to O(n) for a vector.
Prospective-parachains is detecting the cycles when building the fragment tree, and in `select_children` it only traverses this tree.
We should be able to assume that CandidateHashes are unique.
As described above, I don't think we can. Having a cycle could mean having an identical candidate on multiple cores or having different candidates that are linked by head_data.
Point 2:
Instead of setting timeout to true, we could just not add the timedout candidate to the ancestry to begin with. (voiding my argument above that we can lift the adding to ancestry up.) Then in fragment_tree, there would be a hole in the chain - which is fine, we just need to stop once we find no matching child - ignoring any remaining elements in ancestors. (Should no longer be an error.)
Yes, we could do that. I didn't do this because I figured we'd want to distinguish between a hole in the chain caused by a timeout and a hole in the chain caused by a runtime error (backing candidates which don't form a chain, or incorrectly freeing the cores).
So if the cores were occupied by A->B->C and the runtime freed B's core, we'd end up treating B as timed out and suggest B again as a new backable candidate. If we can distinguish between these two situations, we won't supply any more candidates to the runtime until it recovers from the error.
So if the cores were occupied by A->B->C and the runtime freed B's core, we'd end up treating B as timed out and suggest B again as a new backable candidate. If we can distinguish between these two situations, we won't supply
The runtime can not free B without also freeing A ... so there would not be a hole in that case. I am not against defensive programming, but we should not complicate node code to enhance performance in case the runtime has a bug.
If the count is larger than 1, we can be sure there's a cycle,
Ok. In theory this is really possible. The candidate descriptor contains the relay parent, so identical candidates would be limited to being anchored to the same relay parent, but with asynchronous backing and also elastic scaling there is nothing preventing a parachain from not being a chain and producing the same candidate twice (which is the same as copying it) and then later moving on. This can only be prevented by adding the parent candidate hash to the descriptor itself. But for all intents and purposes those candidates are already exactly the same. They take the same parent head data and they produce the exact same output with the exact same "block" data. So things like this could exist:
A <- B <- A
now if we keep building, we could get:
A <- B <- A <- C
but given that the second A arrives at exactly the same output and it can not have consumed or sent any messages (otherwise they would not be the same), from Polkadot's perspective this is the same as:
A <- C
So if the cycle A <- B <- A exists within a single block (elastic scaling), we can just drop all but one A in the runtime. In fact the first A must also depend on B, or more precisely the head data produced by B.
I would say for now, we should really just drop duplicates and in the next iteration we absolutely should have the parent candidate hash in the descriptor to stop this madness altogether.
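A rough sketch of the "drop duplicates" idea, again with made-up stand-in types rather than the actual runtime code:

```rust
use std::collections::HashSet;

// Illustrative stand-ins only.
type CandidateHash = [u8; 32];

struct BackedCandidate {
    hash: CandidateHash,
    // ... descriptor, commitments, validity votes, etc.
}

/// Keep only the first occurrence of each candidate hash, so a chain like
/// A <- B <- A is reduced to A <- B before anything is backed.
fn drop_duplicates(candidates: Vec<BackedCandidate>) -> Vec<BackedCandidate> {
    let mut seen = HashSet::new();
    candidates
        .into_iter()
        .filter(|candidate| seen.insert(candidate.hash))
        .collect()
}
```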
I would say for now, we should really just drop duplicates and in the next iteration we absolutely should have the parent candidate hash in the descriptor to stop this madness altogether.
Sounds like a good idea
the runtime can not free B without also freeing A ... so there would not be a hole in that case. I am not against defensive programming, but we should not complicate node code to enhance performance in case the runtime has a bug.
right, I agree.
So, I'll remove the `AncestorState` and just use a `HashSet` to express the ancestors. Should be much simpler. Great suggestion
done, PTAL
continue
}

for (core_index, candidate) in cores.into_iter().zip(response.into_iter()) {
This core mapping via zip does not seem correct. I think we need Andrei's changes for this.
Actually, re-reading my own notes, I think we can get away without having prospective parachains provide us only unique core candidates.
Basically we need to have two distinct ways to treat cores, which don't necessarily map to each other:
- We need cores to determine the right backing group (and to check signatures of those backing validators). These cores will be determined by the relay parent of the candidate.
- We have cores which have the `ParaId` in their claim queue (including scheduled) at the time of block production.

For (2) it does not really matter in the context of elastic scaling whether or not the candidate was intended for that core: as long as the core has the para id in its claim queue (or is scheduled), it is fine. We check signatures based on the assigned group, which is based on the relay parent, but the availability core we actually put the candidate on can differ.
So the core in the candidate commitments later on would also not necessarily be equal to the core that actually gets occupied in the end, but would merely specify the correct backing group.
I think having this flexibility to use a different core in (2) is worthwhile as it simplifies things - and might also have performance advantages - but to stay sane I think we should make this a different concept. For (1) we actually don't care about the core, we care about the backing group that is responsible... maybe instead of having the core index in the commitments, we should have the group index in the commitments.
For the MVP (this PR), all we have to do is use the provided core index (or should we also directly encode the group index already?) for signature checking, but then we are free to put it on any availability core that has that para assigned.
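To illustrate the idea (a hedged sketch with invented types, not the actual scheduler or claim-queue code): signature checks stay keyed on the provided core/group, while for availability the candidate can be placed on any free core that has the para assigned.

```rust
use std::collections::{HashMap, VecDeque};

// Illustrative stand-ins only.
type ParaId = u32;
type CoreIndex = u32;

struct BackedCandidate {
    para_id: ParaId,
    // ... descriptor, commitments, validity votes, etc.
}

/// Assign each backed candidate of a para to *some* free availability core
/// that has this para scheduled/claimed, regardless of which core the
/// candidate was originally provided for.
fn assign_to_cores(
    candidates: Vec<BackedCandidate>,
    mut free_cores_per_para: HashMap<ParaId, VecDeque<CoreIndex>>,
) -> Vec<(BackedCandidate, CoreIndex)> {
    candidates
        .into_iter()
        .filter_map(|candidate| {
            free_cores_per_para
                .get_mut(&candidate.para_id)
                .and_then(|cores| cores.pop_front())
                .map(|core| (candidate, core))
        })
        .collect()
}
```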
I am only using the core mapping here because the runtime expects the vector of BackedCandidates to be sorted by CoreIndex, that's it. So we're not breaking anything with these changes.
As discussed, we may want to have this sorted in dependency order instead. But this will come in the future, when making runtime changes.
I agree with your larger point that it makes sense to decouple the backing validator group from the availability core index in the future
I am only using the core mapping here because the runtime expects the vector of BackedCandidates to be sorted by CoreIndex, that's it. So we're not breaking anything with these changes.
but it is not matching the actual core right? We are just picking candidates from prospective parachains that match the para, we don't even know on which core they have been provided.
Correct
I removed this. We actually don't care whether or not the backed candidates are sorted in order of core_index, because they're sorted in the runtime:
backed_candidates_with_core.sort_by(|(_x, core_x), (_y, core_y)| core_x.cmp(&core_y));
Yes, correct
Good job @alindima and big thanks for answering my questions offline.
…aling-provisioner
Will need to be rebased on top of #3231
Rebased. Waiting on a final approval from @eskimor 🙏🏻
* master:
  Finish documenting `#[pallet::xxx]` macros (#2638)
  Remove `as frame_system::DefaultConfig` from the required syntax in `derive_impl` (#3505)
  provisioner: allow multiple cores assigned to the same para (#3233)
  subsystem-bench: add regression tests for availability read and write (#3311)
  make SelfParaId a metadata constant (#3517)
  Fix crash of synced parachain node run with `--sync=warp` (#3523)
  [Backport] Node version and spec_version bumps and ordering of the prdoc files from 1.8.0 (#3508)
  Add `claim_assets` extrinsic to `pallet-xcm` (#3403)
…data * ao-collator-parent-head-data:
  add a comment (review)
  Finish documenting `#[pallet::xxx]` macros (#2638)
  Remove `as frame_system::DefaultConfig` from the required syntax in `derive_impl` (#3505)
  provisioner: allow multiple cores assigned to the same para (#3233)
  subsystem-bench: add regression tests for availability read and write (#3311)
  make SelfParaId a metadata constant (#3517)
  Fix crash of synced parachain node run with `--sync=warp` (#3523)
  [Backport] Node version and spec_version bumps and ordering of the prdoc files from 1.8.0 (#3508)
  Add `claim_assets` extrinsic to `pallet-xcm` (#3403)
…ch#3233)

paritytech#3130

builds on top of paritytech#3160

Processes the availability cores and builds a record of how many candidates it should request from prospective-parachains and their predecessors.

Tries to supply as many candidates as the runtime can back. Note that the runtime changes to back multiple candidates per para are not yet done, but this paves the way for it.

The following backing/inclusion policy is assumed:
1. the runtime will never back candidates of the same para which don't form a chain with the already backed candidates. Even if the others are still pending availability. We're optimistic that they won't time out and we don't want to back parachain forks (as the complexity would be huge).
2. if a candidate is timed out of the core before being included, all of its successors occupying a core will be evicted.
3. only the candidates which are made available and form a chain starting from the on-chain para head may be included/enacted and cleared from the cores. In other words, if para head is at A and the cores are occupied by B->C->D, and B and D are made available, only B will be included and its core cleared. C and D will remain on the cores awaiting for C to be made available or timed out. As point (2) above already says, if C is timed out, D will also be dropped.
4. The runtime will deduplicate candidates which form a cycle. For example if the provisioner supplies candidates A->B->A, the runtime will only back A (as the state output will be the same)

Note that if a candidate is timed out, we don't guarantee that in the next relay chain block the block author will be able to fill all of the timed out cores of the para. That increases complexity by a lot. Instead, the provisioner will supply N candidates where N is the number of candidates timed out, but doesn't include their successors which will be also deleted by the runtime. This'll be backfilled in the next relay chain block.

Adjacent changes:
- Also fixes: paritytech#3141
- For non prospective-parachains, don't supply multiple candidates per para (we can't have elastic scaling without prospective parachains enabled). paras_inherent should already sanitise this input but it's more efficient this way.

Note: all of these changes are backwards-compatible with the non-elastic-scaling scenario (one core per para).
Changes needed to implement the runtime part of elastic scaling: #3131, #3132, #3202
Also fixes #3675

TODOs:
- [x] storage migration
- [x] optimise process_candidates from O(N^2)
- [x] drop backable candidates which form cycles
- [x] fix unit tests
- [x] add more unit tests
- [x] check the runtime APIs which use the pending availability storage. We need to expose all of them, see #3576
- [x] optimise the candidate selection. we're currently picking randomly until we satisfy the weight limit. we need to be smart about not breaking candidate chains while being fair to all paras - #3573

Relies on the changes made in #3233 in terms of the inclusion policy and the candidate ordering

---------

Signed-off-by: alindima <[email protected]>
Co-authored-by: command-bot <>
Co-authored-by: eskimor <[email protected]>
Will make #4035 easier to review (the mentioned PR already does this move so the diff will be clearer). Also called out as part of: #3233 (comment)
#3130
builds on top of #3160
Processes the availability cores and builds a record of how many candidates it should request from prospective-parachains and their predecessors.
Tries to supply as many candidates as the runtime can back. Note that the runtime changes to back multiple candidates per para are not yet done, but this paves the way for it.
The following backing/inclusion policy is assumed:
1. The runtime will never back candidates of the same para which don't form a chain with the already backed candidates. Even if the others are still pending availability. We're optimistic that they won't time out and we don't want to back parachain forks (as the complexity would be huge).
2. If a candidate is timed out of the core before being included, all of its successors occupying a core will be evicted.
3. Only the candidates which are made available and form a chain starting from the on-chain para head may be included/enacted and cleared from the cores. In other words, if the para head is at A and the cores are occupied by B->C->D, and B and D are made available, only B will be included and its core cleared. C and D will remain on the cores awaiting for C to be made available or timed out. As point (2) above already says, if C is timed out, D will also be dropped.
4. The runtime will deduplicate candidates which form a cycle. For example, if the provisioner supplies candidates A->B->A, the runtime will only back A (as the state output will be the same).
Note that if a candidate is timed out, we don't guarantee that in the next relay chain block the block author will be able to fill all of the timed out cores of the para. That increases complexity by a lot. Instead, the provisioner will supply N candidates where N is the number of candidates timed out, but doesn't include their successors which will be also deleted by the runtime. This'll be backfilled in the next relay chain block.
Adjacent changes:
- Also fixes: #3141
- For non prospective-parachains, don't supply multiple candidates per para (we can't have elastic scaling without prospective parachains enabled). paras_inherent should already sanitise this input but it's more efficient this way.
Note: all of these changes are backwards-compatible with the non-elastic-scaling scenario (one core per para).
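As a worked illustration of inclusion policy point (3) above (a sketch with made-up types, not the runtime implementation): only the longest available prefix of the pending chain can be included.

```rust
struct PendingCandidate {
    id: char,        // e.g. 'B', 'C', 'D'
    available: bool, // whether availability has been reached
}

/// Only the longest *available* prefix of the chain starting at the para head
/// can be included; everything at and after the first unavailable candidate
/// stays on its core.
fn includable_prefix(pending: &[PendingCandidate]) -> Vec<char> {
    pending.iter().take_while(|c| c.available).map(|c| c.id).collect()
}

fn main() {
    // Para head at A; cores occupied by B -> C -> D, with B and D available.
    let pending = vec![
        PendingCandidate { id: 'B', available: true },
        PendingCandidate { id: 'C', available: false },
        PendingCandidate { id: 'D', available: true },
    ];
    // Only B is included; C and D remain on their cores.
    assert_eq!(includable_prefix(&pending), vec!['B']);
}
```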