[2 / 5] Make approval-distribution logic runnable on a separate thread #4845
Merged
alexggh
changed the title
[WIP] Make approval-distribution logic runnable on a separate thread
Make approval-distribution logic runnable on a separate thread
Jun 20, 2024
6 tasks
alexggh
changed the title
Make approval-distribution logic runnable on a separate thread
[1 / 5] Make approval-distribution logic runnable on a separate thread
Jun 20, 2024
alexggh
force-pushed
the
alexaggh/approval-voting-parallel-1-5
branch
from
July 2, 2024 11:40
381e0b9
to
88e15da
Compare
alexggh
changed the title
[1 / 5] Make approval-distribution logic runnable on a separate thread
[3 / 5] Make approval-distribution logic runnable on a separate thread
Jul 2, 2024
alexggh
changed the base branch from
master
to
alexaggh/approval-voting-parallel-3-5
July 2, 2024 11:46
alexggh
changed the base branch from
alexaggh/approval-voting-parallel-3-5
to
master
July 2, 2024 11:50
alexggh
force-pushed
the
alexaggh/approval-voting-parallel-1-5
branch
from
July 2, 2024 11:54
88e15da
to
4b3f489
Compare
alexggh
changed the title
[3 / 5] Make approval-distribution logic runnable on a separate thread
[2 / 5] Make approval-distribution logic runnable on a separate thread
Jul 2, 2024
alexggh
changed the base branch from
master
to
alexaggh/approval-voting-parallel-4-5
July 2, 2024 12:03
The CI pipeline was cancelled due to the failure of one of the required jobs.
alexggh
force-pushed
the
alexaggh/approval-voting-parallel-4-5
branch
from
July 2, 2024 12:29
4b3f489
to
713aef4
Compare
alexggh
force-pushed
the
alexaggh/approval-voting-parallel-1-5
branch
2 times, most recently
from
July 5, 2024 11:36
406d11c
to
2281330
Compare
Signed-off-by: Alexandru Gheorghe <[email protected]>
alexggh
force-pushed
the
alexaggh/approval-voting-parallel-1-5
branch
from
July 16, 2024 12:39
2281330
to
8dbb088
Compare
sandreim
approved these changes
Jul 25, 2024
alindima
approved these changes
Jul 26, 2024
Signed-off-by: Alexandru Gheorghe <[email protected]>
ordian
approved these changes
Jul 26, 2024
missing prdoc |
ordian
added
the
T8-polkadot
This PR/Issue is related to/affects the Polkadot network.
label
Jul 26, 2024
Signed-off-by: Alexandru Gheorghe <[email protected]>
AndreiEres
approved these changes
Jul 29, 2024
TarekkMA
pushed a commit
to moonbeam-foundation/polkadot-sdk
that referenced
this pull request
Aug 2, 2024
paritytech#4845)

This is part of the work to further optimize the approval subsystems. For the full context, start by reading paritytech#4849 (comment).

# Description

This PR contains changes that make it possible to run multiple instances of approval-distribution, so that we can parallelise the work. It contains no functional changes; it only decouples the subsystem from the subsystem Context and introduces more specific trait dependencies for each function, instead of all of them requiring a context.

It does not depend on any of the follow-up PRs, so it can be merged independently of them.

---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
github-merge-queue bot
pushed a commit
that referenced
this pull request
Sep 26, 2024
This is the implementation of the approach described here: #1617 (comment) & #1617 (comment) & #1617 (comment).

## Description of changes

The end goal is an architecture with a single subsystem (`approval-voting-parallel`) and multiple worker types that fulfil the work currently done by the `approval-distribution` and `approval-voting` subsystems. The main loop of the new subsystem only distributes work to the workers.

The new subsystem will have:

- N approval-distribution workers: these do the work currently done by the approval-distribution subsystem and, in addition, perform the crypto checks that an assignment is valid and that a vote is correctly signed. Work is assigned via the following formula: `worker_index = msg.validator % WORKER_COUNT`. This guarantees that all assignments and approvals from the same validator reach the same worker.
- 1 approval-voting worker: this receives an already validated message and does everything approval-voting currently does, except the crypto checking, which has already been moved to the approval-distribution workers.

On the hot path of processing messages, **no** synchronisation or waiting is needed between the approval-distribution and approval-voting workers.

<img width="1431" alt="Screenshot 2024-06-07 at 11 28 08" src="https://github.com/paritytech/polkadot-sdk/assets/49718502/a196199b-b705-4140-87d4-c6900ba8595e">

## Guidelines for reading

The full implementation is broken into 5 PRs. All of them are self-contained and improve things incrementally, even without the parallelisation being implemented or enabled. This approach was taken instead of a big-bang PR to make things easier to review and to reduce the risk of breaking these critical subsystems.

After reading the full description of this PR, the changes should be read in the following order:

1. #4848, some other micro-optimizations for networks with a high number of validators. This change gives us a speed-up by itself, without any other changes.
2. #4845, which contains only interface changes to decouple the subsystem from the `Context` and make it possible to run multiple instances of the subsystem on different threads. **No functional changes**
3. #4928, which moves the crypto checks from approval-voting into approval-distribution, so that approval-distribution no longer has any reason to wait on approval-voting. This change gives us a speed-up by itself, without any other changes.
4. #4846, interface changes to make approval-voting runnable on a separate thread. **No functional changes**
5. This PR, where we instantiate an `approval-voting-parallel` subsystem that runs the logic currently in `approval-distribution` and `approval-voting` on different workers.
6. The next step, after these changes get merged and deployed, would be to bring all the files from approval-distribution, approval-voting and approval-voting-parallel into a single Rust crate, to make the structure easier to maintain and understand.

## Results

Running subsystem-benchmarks with 1000 validators and 100 fully occupied cores, and triggering all assignments and approvals for all tranches.

#### Approval does not lag behind

Master

```
Chain selection approved after 72500 ms hash=0x0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a
```

With this PoC

```
Chain selection approved after 3500 ms hash=0x0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a
```

#### Gathering enough assignments

Enough assignments are gathered in less than 500 ms, which gives us a guarantee that unnecessary work does not get triggered. On master, on the same benchmark, that number goes above 32 seconds because the subsystems fall behind on work.

<img width="2240" alt="Screenshot 2024-06-20 at 15 48 22" src="https://github.com/paritytech/polkadot-sdk/assets/49718502/d2f2b29c-5ff6-44b4-a245-5b37ab8e58bc">

#### CPU usage

Master

```
CPU usage, seconds                     total   per block
approval-distribution                96.9436      9.6944
approval-voting                     117.4676     11.7468
test-environment                     44.0092      4.4009
```

With this PoC

```
CPU usage, seconds                     total   per block
approval-distribution                 0.0014      0.0001 --- unused
approval-voting                       0.0437      0.0044 --- unused
approval-voting-parallel              5.9560      0.5956
approval-voting-parallel-0           22.9073      2.2907
approval-voting-parallel-1           23.0417      2.3042
approval-voting-parallel-2           22.0445      2.2045
approval-voting-parallel-3           22.7234      2.2723
approval-voting-parallel-4           21.9788      2.1979
approval-voting-parallel-5           23.0601      2.3060
approval-voting-parallel-6           22.4805      2.2481
approval-voting-parallel-7           21.8330      2.1833
approval-voting-parallel-db          37.1954      3.7195 --- the approval-voting thread.
```

# Enablement strategy

Only some trivial plumbing is needed in approval-distribution and approval-voting to be able to run things in parallel, and these subsystems play a critical part in the system. This PR therefore proposes that we keep both ways of running the approval work: as separate subsystems, and as a single subsystem (`approval-voting-parallel`) that has multiple workers for the distribution work and one worker for the approval-voting work. We switch between them with a command-line flag.

The benefit of this is twofold:

1. With the same polkadot binary we can easily switch just a few validators to the parallel approach and gradually make it the default way of running, if no issues arise.
2. In the worst-case scenario, where it becomes the default way of running things but we discover critical issues with it, we have a path to quickly disable it by asking validators to adjust their command-line flags.

# Next steps

- [x] Make sure through various testing that we are not missing anything.
- [x] Polish the implementations to make them production ready.
- [x] Add unit tests for approval-voting-parallel.
- [x] Define and implement the strategy for rolling out this change, so that the blast radius is minimal (a single validator) in case there are problems with the implementation.
- [x] Versi long-running tests.
- [x] Add relevant metrics.

@ordian @eskimor @sandreim @AndreiEres, let me know what you think.

---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
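The routing rule quoted in the commit message above (`worker_index = msg.validator % WORKER_COUNT`) can be sketched roughly as follows. This is a minimal illustration using standard-library channels, not the code from the PR: `Msg`, `make_workers` and `dispatch` are hypothetical names, and the real subsystem uses its own message types and channels.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical message type: only the originating validator index matters here.
#[derive(Debug, Clone, PartialEq)]
pub struct Msg {
    pub validator: u32,
    pub payload: &'static str,
}

/// The formula from the description: `worker_index = msg.validator % WORKER_COUNT`.
pub fn worker_index(validator: u32, worker_count: usize) -> usize {
    assert!(worker_count > 0, "at least one worker is required");
    validator as usize % worker_count
}

/// Creates one channel per worker and returns the sending and receiving halves.
pub fn make_workers(worker_count: usize) -> (Vec<Sender<Msg>>, Vec<Receiver<Msg>>) {
    (0..worker_count).map(|_| channel()).unzip()
}

/// One step of a main loop: forward the message to the worker that owns its
/// validator and return the chosen worker index.
pub fn dispatch(senders: &[Sender<Msg>], msg: Msg) -> usize {
    let idx = worker_index(msg.validator, senders.len());
    senders[idx].send(msg).expect("worker receiver was dropped");
    idx
}
```

Because the index depends only on the validator, an assignment and a later approval from the same validator always land in the same worker's queue, so that worker can process them in order without coordinating with the other workers.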
Labels
T0-node
This PR/Issue is related to the topic “node”.
T8-polkadot
This PR/Issue is related to/affects the Polkadot network.
This is part of the work to further optimize the approval subsystems. For the full context, start by reading #4849 (comment).
Description
This PR contains changes that make it possible to run multiple instances of approval-distribution, so that we can parallelise the work. It contains no functional changes; it only decouples the subsystem from the subsystem Context and introduces more specific trait dependencies for each function, instead of all of them requiring a context.
It does not depend on any of the follow-up PRs, so it can be merged independently of them.
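To illustrate the kind of decoupling described above, here is a minimal sketch of functions that ask for narrow capabilities rather than a full context. All names (`NetworkSender`, `ReputationReporter`, `circulate`, `punish_invalid`, `Recorder`) are hypothetical and chosen for illustration only; they are not the traits or signatures used in polkadot-sdk.

```rust
/// Hypothetical narrow capability: sending bytes to a peer.
pub trait NetworkSender {
    fn send_to_peer(&mut self, peer: u32, bytes: Vec<u8>);
}

/// Hypothetical narrow capability: adjusting a peer's reputation.
pub trait ReputationReporter {
    fn report(&mut self, peer: u32, change: i32);
}

/// Needs only the ability to send, so it does not require a whole context.
pub fn circulate<N: NetworkSender>(sender: &mut N, peers: &[u32], msg: &[u8]) {
    for &peer in peers {
        sender.send_to_peer(peer, msg.to_vec());
    }
}

/// Needs only the ability to report; the penalty value is an arbitrary example.
pub fn punish_invalid<R: ReputationReporter>(reporter: &mut R, peer: u32) {
    reporter.report(peer, -10);
}

/// In-memory implementation that a worker thread or a test could own directly.
#[derive(Default)]
pub struct Recorder {
    pub sent: Vec<(u32, Vec<u8>)>,
    pub reports: Vec<(u32, i32)>,
}

impl NetworkSender for Recorder {
    fn send_to_peer(&mut self, peer: u32, bytes: Vec<u8>) {
        self.sent.push((peer, bytes));
    }
}

impl ReputationReporter for Recorder {
    fn report(&mut self, peer: u32, change: i32) {
        self.reports.push((peer, change));
    }
}
```

With signatures like these, any value that implements the required trait can drive the logic, which is what allows the same functions to be called from several independently owned instances on different threads.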