[Executor] Merge sequential & parallel execution flow #4683

Merged: 2 commits, Nov 29, 2022

Conversation

@gelash (Contributor) commented Sep 30, 2022

De-spaghettify the aptos-vm execution flow.

  • Remove the unused status (it was only used in harness tests; fix those tests to match production behavior).
  • Merge the sequential executor into the parallel executor crate, re-using code while keeping the sequential algorithm for redundancy, testing, and fallback (Block-STM could also run sequentially with one thread, probably without much overhead).

Benefits:

  • Cleaner flow.
  • Can hook into the same executor regardless of whether we are executing sequentially or in parallel (a sketch follows this list).
  • Can use the powerful testing framework we have for the parallel executor for sequential execution as well.
  • Can re-use the executor's rayon threadpool for block signature checks.
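
To make the merged flow concrete, here is a minimal sketch of the dispatch it enables; apart from `concurrency_level` (which comes up in review below), all names and types are illustrative stand-ins:

```rust
// Sketch only: `concurrency_level` matches the field discussed later in this
// thread; the remaining names and types are illustrative stand-ins.
struct Txn;
struct Output;
#[derive(Debug)]
struct VmError;

struct BlockExecutor {
    concurrency_level: usize,
}

impl BlockExecutor {
    // Single entry point: Block-STM when more than one thread is configured,
    // otherwise the sequential fallback that reuses the same executor code.
    fn execute_block(&self, txns: Vec<Txn>) -> Result<Vec<Output>, VmError> {
        if self.concurrency_level > 1 {
            self.execute_parallel(txns)
        } else {
            self.execute_sequential(txns)
        }
    }

    fn execute_parallel(&self, _txns: Vec<Txn>) -> Result<Vec<Output>, VmError> {
        unimplemented!("Block-STM across self.concurrency_level worker threads")
    }

    fn execute_sequential(&self, _txns: Vec<Txn>) -> Result<Vec<Output>, VmError> {
        unimplemented!("one-by-one execution kept for redundancy and fallback")
    }
}
```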

EDIT: These are coming in a different diff.
Additional improvements to tests:
- Test over/underflows based on storage values (close to 0 or close to u128::MAX).
- Test different numbers of cores.
- Refactors: ModulePath enum, isolate BaselineState for generating baselines with different configurations (e.g. aggregator values materialized or not), check delta sequences, etc.
- Make sure StorageError takes precedence over DeltaApplicationFailure in speculative executions when an aggregator is deleted but deltas on top also fail (a sketch follows below). Not relevant for the current use-case (no under/overflows and no deletes), and the proper algorithm for handling the general case may not have this issue anyway.
After the testing PR, @perryjrandall, we should start running on different platforms too.
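
A minimal sketch of the precedence rule from the list above; the enum and function here are illustrative, not the crate's actual API:

```rust
// Hedged sketch of the intended precedence; types and names are illustrative.
#[derive(Debug, PartialEq)]
enum SpeculativeError {
    StorageError,            // base value missing, e.g. the aggregator was deleted
    DeltaApplicationFailure, // delta under/overflowed the u128 bounds
}

fn apply_deltas(base: Option<u128>, deltas: &[i128]) -> Result<u128, SpeculativeError> {
    // StorageError must win: if the aggregator was deleted, report that even
    // when applying the deltas on top would also have failed.
    let mut value = base.ok_or(SpeculativeError::StorageError)?;
    for d in deltas {
        value = if *d >= 0 {
            value.checked_add(*d as u128)
        } else {
            value.checked_sub(d.unsigned_abs())
        }
        .ok_or(SpeculativeError::DeltaApplicationFailure)?;
    }
    Ok(value)
}
```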

@dariorussi (Contributor) commented Oct 1, 2022

Just a quick comment as I am reviewing this; it will be slow for me, as it touches a lot of core stuff I have not seen before. Though it is a wonderful exercise, thanks!
I am not sure when we cut the branch (if we have not already) or what all the mainnet implications are, but as far as I can tell this touches a bunch of core stuff, and we definitely want it in after we are done with mainnet and all of that, right?
Can you share your thoughts?

Also, please review the errors, which look legit.

@gelash (Contributor, Author) commented Oct 4, 2022

@dariorussi - yes, if we like it, we should roll this in after mainnet. There should be nothing breaking: more tests, a common flow, and it should facilitate lots of future improvements (even just by virtue of having a single executor object for callbacks and the like).

Will definitely fix all the linter and related errors. And (TODO to self): will also experiment with DashSet/DashMap configuration settings (mainly the number of shards); we currently use the defaults, and maybe there is some performance to squeeze out.
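
For context on that TODO, a hedged sketch of the kind of knob meant here, assuming the dashmap crate's `with_shard_amount` constructor (the key/value types and the sizing heuristic are made up):

```rust
use dashmap::DashMap;

// Illustrative tuning only: dashmap derives its default shard count from
// available parallelism; with_shard_amount overrides it and panics unless
// given a power of two.
fn tuned_versioned_map(num_threads: usize) -> DashMap<u64, u64> {
    let shards = (num_threads * 4).next_power_of_two();
    DashMap::with_shard_amount(shards)
}
```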

@gelash requested a review from danielxiangzl on October 27, 2022 00:33
@gelash force-pushed the seqinpar branch 2 times, most recently from e7ebf45 to 1511c1e on November 7, 2022 01:46
@zekun000 (Contributor) left a comment:

Early comments; need to do another pass.

Review threads (outdated, resolved):
  • aptos-move/e2e-tests/src/executor.rs
  • aptos-move/aptos-vm/src/parallel_executor/vm_wrapper.rs
  • aptos-move/aptos-vm/src/data_cache.rs
  • aptos-move/aptos-vm/src/parallel_executor/mod.rs
@gelash requested a review from movekevin as a code owner on November 17, 2022 00:55
@gelash force-pushed the seqinpar branch 2 times, most recently from f597fd3 to c53ccd5 on November 17, 2022 07:20
@dariorussi (Contributor) left a comment:

I think you are splitting this up into multiple PRs, which is a very good idea, so I love it; maybe you want to mark it somehow.
Just commenting since I have a small comment I think you may enjoy.

Review thread (outdated, resolved): aptos-move/aptos-vm/src/parallel_executor/mod.rs
@gelash (Contributor, Author) commented Nov 18, 2022

> I think you are splitting this up into multiple PRs, which is a very good idea, so I love it; maybe you want to mark it somehow. Just commenting since I have a small comment I think you may enjoy.

Will do ASAP. To document here, the plan is to do proptest changes in a separate diff.

I will make sure to stack the proptest PR on top though and not land the first PR without ensuring the second one (with new and stronger tests) passes.

@gelash force-pushed the seqinpar branch 2 times, most recently from dc6df84 to e3cddb9 on November 19, 2022 22:03
@gelash (Contributor, Author) commented Nov 19, 2022

Split the PR; this is the first one, refactoring the aptos-vm execution flow, removing the unused status, and merging sequential & parallel execution. Renamed parallel_executor -> block_executor per @dariorussi's suggestion, but since that touches a lot of lines, I made it a separate commit for ease of review.

@runtian-zhou can you have a look now?

@gelash (Contributor, Author) commented Nov 19, 2022

@zekun000 @sasha8 ping ping

@gelash changed the title from "[Executor] Merge sequential & parallel execution flow, refactor, test" to "[Executor] Merge sequential & parallel execution flow" on Nov 19, 2022
@gelash added the CICD:run-e2e-tests label (when this label is present, github actions will run all land-blocking e2e tests from the PR) on Nov 21, 2022

@wrwg requested a review from vgao1996 on November 26, 2022 08:06
@wrwg (Contributor) commented Nov 26, 2022

Let's give some more time for review from Move team members. (I know you filed this a while ago, so sorry for the lack of earlier attention.)

@gelash (Contributor, Author) commented Nov 26, 2022

> Let's give some more time for review from Move team members. (I know you filed this a while ago, so sorry for the lack of earlier attention.)

I was never going to land this without @runtian-zhou 's approval, but appreciate more eyes.

For context:
I already have 3 follow-ups implemented (2 drafts are linked in the comments, also visible here: main...seqparoutput; the third adds proptests, which I separated from here for clarity). They are intended to improve and simplify the whole executor <-> aptos-vm integration (plus sequential now), delta resolution, etc.

But it all starts here; once we have all the executor flow nicely in block_executor, it opens up the extension and gas hooks that we need for various outstanding tasks (I can explain offline).

@wrwg (Contributor) left a comment:

This generally seems to go in the right direction (that is, unifying sequential/parallel block execution). We need a larger refactoring of the adapter (see comment below), but this PR can be an incremental step toward it. Leaving detailed review to folks more acquainted with PE.

}

// Wrapper to avoid orphan rule
pub(crate) struct AptosTransactionOutput(TransactionOutputExt);
@wrwg (Contributor):

Not a big fan of avoiding the orphan rule, which is there for a reason (and not a technical one). It perhaps makes notation a bit more convenient, but it generally makes the code harder to understand. Not saying this needs to be done differently; just an opinion.

@gelash (Contributor, Author):

I really dislike this as well and tried to get rid of it. I can solve some of the issues, but the problem I couldn't easily overcome is a potential circular dependency between aptos-aggregator (where TransactionOutputExt is defined) and block-executor (which defines the traits and also uses aggregators). Hopefully we will get to a place where we restructure this wrapper, or more generally restructure the crates and factor out the common types (like we currently do at the aptos-core level; we could do something similar at the aptos-vm or aptos-move level).
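
To make the constraint concrete, a minimal sketch of why the wrapper exists; modules stand in for the separate crates here, and all names are simplified stand-ins:

```rust
// Minimal illustration of the orphan-rule constraint (modules stand in for
// separate crates; within one real crate the rule would not fire).
mod block_executor {
    // Trait defined by the block-executor crate.
    pub trait TransactionOutput {
        fn write_count(&self) -> usize;
    }
}

mod aptos_aggregator {
    // Type defined by the aptos-aggregator crate.
    pub struct TransactionOutputExt {
        pub writes: usize,
    }
}

// In aptos-vm, both the trait and the type are foreign, so across real crate
// boundaries the impl is only legal on a locally defined newtype wrapper.
pub struct AptosTransactionOutput(pub aptos_aggregator::TransactionOutputExt);

impl block_executor::TransactionOutput for AptosTransactionOutput {
    fn write_count(&self) -> usize {
        self.0.writes
    }
}
```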


pub(crate) struct AptosVMWrapper<'a, S> {
vm: AptosVM,
base_view: &'a S,
@wrwg (Contributor):

This S is always a StateView, right? Then this is really strange, and the need for this wrapper type just demonstrates how broken the AptosVM is (it requires a larger overhaul). If you look at AptosVM, you see that it wraps AptosVMImpl, which in turn is created from a state view. That state view is then hidden inside the storage adapter. The need to get hold of an already-created StateView has led to acrobatic code in other places.

This is probably another set of PRs after this one, but I really wish we could drastically simplify the architecture here:

  • Only one AptosVM (no VMImpl, BlockVM, MoveVmExt, and other complications)
  • That AptosVM implements all the parallel and sequential execution logic. Multiple impl AptosVM blocks split over files can help to tame the complexity (a sketch follows this list).

Because parallel execution is built into our Move adapter, there is really no need to maintain a code layer underneath without PE.
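
A hedged sketch of what that proposal might look like; this is entirely hypothetical and does not reflect the current code:

```rust
// Hypothetical shape: a single AptosVM holding the state view directly, with
// execution modes as plain methods. The impl blocks could live in separate files.
trait StateView {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
}

struct AptosVM<'a, S: StateView> {
    base_view: &'a S,
}

// e.g. in transactions.rs: single-transaction logic.
impl<'a, S: StateView> AptosVM<'a, S> {
    fn execute_user_transaction(&self) {
        // ... validate, run the Move session, produce an output ...
    }
}

// e.g. in block.rs: block-level logic, both sequential and parallel.
impl<'a, S: StateView> AptosVM<'a, S> {
    fn execute_block_parallel(&self) {
        // ... Block-STM over the transactions ...
    }
    fn execute_block_sequential(&self) {
        // ... same flow with a single worker ...
    }
}
```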

@gelash (Contributor, Author):

I think that would be a great place to get to.

The next PR will help a bit with the StateView wrapping business, but it won't fully solve it; it will rather unify what we currently use (StateViewCache and VersionedView) to represent the storage state view plus some writes from the block, and isolate that within block_executor as an implementation detail (together with making delta resolution an implementation detail) as a better temporary place. But I fully subscribe to revamping these boundaries as soon as possible and will try to make that more feasible with the current queue of changes.
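
A rough sketch of that unification idea; all names are hypothetical, and the real version would also handle deltas and versioning:

```rust
use std::collections::HashMap;

// Hedged sketch: a single view that layers the block's writes-so-far over the
// base storage state view, kept private to the block executor.
trait TStateView {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
}

struct LatestView<'a, S: TStateView> {
    base_view: &'a S,
    block_writes: HashMap<String, Vec<u8>>,
}

impl<'a, S: TStateView> TStateView for LatestView<'a, S> {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        // Prefer a value written earlier in the block; fall back to storage.
        self.block_writes
            .get(key)
            .cloned()
            .or_else(|| self.base_view.get(key))
    }
}
```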

Review thread (outdated, resolved): aptos-move/aptos-vm/src/block_executor/vm_wrapper.rs
@gelash (Contributor, Author) commented Nov 28, 2022

> This generally seems to go in the right direction (that is, unifying sequential/parallel block execution). We need a larger refactoring of the adapter (see comment below), but this PR can be an incremental step toward it. Leaving detailed review to folks more acquainted with PE.

Totally agreed, that's precisely the intention. There are some specific fixes, but regarding the overall refactoring aspect, my rationale at the moment is to make some incremental local simplifications and improvements while hopefully moving in a good global direction (allowing a single "executor" that can be cleanly connected to other parts of aptos-vm, re-used across blocks, etc.). Hopefully with some iteration (and there are indeed 2-3 PRs coming right after), we will also start seeing the big picture better. One heuristic for me for now is to push pieces of logic to the block_executor side (e.g., in the next PR, the StateView wrapper business currently done in different ways in aptos-vm), since that code is newer and should have a more fixed structure and flow.

However, we should then absolutely look into precisely the question of how the block executor is incorporated into aptos_vm. For some context, I believe the current state of affairs is due to two things (@runtian-zhou can confirm or deny as the author and the ultimate expert on the wrapper flow):
(a) making the parallel executor a standalone crate with generic parameters, so it can be tested without the Move VM;
(b) a clear separation of abstractions to facilitate development at the time.
The testing (and prop-testing) framework is probably the best thing we got out of building it this way. But all these layers in aptos-vm / aptos_vm_impl / adapter / wrapper now need to eventually be simplified as well, especially since we have a lot of use-cases where we'd need hooks to and from other parts of aptos-vm.


@runtian-zhou (Contributor) commented:

The way the parallel executor is structured is exactly what @gelash suggests. The whole intention of the parallel_executor crate abstraction is to:

  1. Help test the core scheduling logic without worrying about the AptosVM implementation (see the sketch below).
  2. Be able to run benchmarks that are independent of the AptosVM.

The testing aspect is probably more important in this context, because it's generally quite hard to test parallel code like this.
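
A sketch of the generic seam being described; `Transaction` mirrors the associated Key/Value types visible in the diff, while the rest is illustrative:

```rust
// Sketch of the generic boundary that lets the scheduler be tested without
// the AptosVM; method and struct names beyond Transaction are illustrative.
pub trait Transaction {
    type Key;
    type Value;
}

pub trait ExecutorTask {
    type Txn: Transaction;
    type Output;
    type Error;

    fn execute_transaction(&self, txn: &Self::Txn) -> Result<Self::Output, Self::Error>;
}

// The scheduler is generic over the task, so tests and benchmarks can plug in
// a toy ExecutorTask (e.g. counters with artificial conflicts) instead of the
// full AptosVM.
pub struct BlockExecutor<E: ExecutorTask> {
    task: E,
}
```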

@runtian-zhou (Contributor) left a comment:

Almost looks good to me! Agreed with @wrwg that we need more refactoring here, but this is a great step forward!

@@ -3,9 +3,6 @@

#[derive(Debug, PartialEq, Eq)]
pub enum Error<E> {
/// Invariant violation that happens internally inside of scheduler, usually an indication of
/// implementation error.
InvariantViolation,
@runtian-zhou (Contributor):

Why is this error removed?

@gelash (Contributor, Author):

It was never used; happy to bring it back whenever needed.

Review thread (outdated, resolved): aptos-move/block-executor/src/executor.rs
) -> Result<
(
Vec<E::Output>,
OutputDeltaResolver<<T as Transaction>::Key, <T as Transaction>::Value>,
),
E::Error,
> {
assert!(self.concurrency_level > 1, "Must use sequential execution");
@runtian-zhou (Contributor):

I was wondering whether these assertions could fail in production, as they would be a good vector for a network availability attack. Would it be better to return an error when this condition is violated?

@gelash (Contributor, Author):

Currently this must be true due to
https://github.com/aptos-labs/aptos-core/blob/e197a64f990b349b888eec4624c1adf945d0ef67/aptos-move/aptos-vm/src/block_executor/mod.rs#L150:

let mut ret = if self.concurrency_level > 1 {

The second follow-up PR moves this dispatching inside the block_executor crate, and execute_parallel becomes pub(crate), so we could also just delete the assert at that point. In fact, I like the idea of deleting just this assert then.

If you really think we should return an error here, let me know what kind of error and what should happen in that case; do we skip the whole block? It would complicate some places, and we should worry about determinism; it's probably not worth it for an invariant violation that trivially should never happen.
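
For reference, a minimal sketch of the error-returning alternative under discussion; the error type and its handling are hypothetical, and the thread above leaves the actual decision open:

```rust
// Hedged sketch: surface a deterministic error instead of panicking, leaving
// it to the caller to fall back to sequential execution or reject the block.
#[derive(Debug, PartialEq)]
enum BlockExecutionError {
    UnexpectedConcurrencyLevel(usize),
}

fn ensure_parallel(concurrency_level: usize) -> Result<(), BlockExecutionError> {
    if concurrency_level <= 1 {
        return Err(BlockExecutionError::UnexpectedConcurrencyLevel(
            concurrency_level,
        ));
    }
    Ok(())
}
```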


@github-actions commented:

✅ Forge suite compat success on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> c419c2153ba3336fac166f9b1184ec28869179d5

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> c419c2153ba3336fac166f9b1184ec28869179d5 (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
   compatibility::simple-validator-upgrade::liveness-check: 7405 TPS, 5244 ms latency, 7000 ms p99 latency, no expired txns
2. Upgrading first validator to new version: c419c2153ba3336fac166f9b1184ec28869179d5
   compatibility::simple-validator-upgrade::single-validator-upgrade: 4806 TPS, 8401 ms latency, 12200 ms p99 latency, no expired txns
3. Upgrading rest of first batch to new version: c419c2153ba3336fac166f9b1184ec28869179d5
   compatibility::simple-validator-upgrade::half-validator-upgrade: 4763 TPS, 8435 ms latency, 11000 ms p99 latency, no expired txns
4. Upgrading second batch to new version: c419c2153ba3336fac166f9b1184ec28869179d5
   compatibility::simple-validator-upgrade::rest-validator-upgrade: 6904 TPS, 5806 ms latency, 11100 ms p99 latency, no expired txns
5. Check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> c419c2153ba3336fac166f9b1184ec28869179d5 passed
Test Ok
Test Ok

@github-actions commented:

✅ Forge suite land_blocking success on c419c2153ba3336fac166f9b1184ec28869179d5

performance benchmark with full nodes: 6943 TPS, 5706 ms latency, 8700 ms p99 latency, (!) expired 540 out of 2965300 txns
Test Ok

@runtian-zhou merged commit feec33f into main on Nov 29, 2022
@runtian-zhou deleted the seqinpar branch on November 29, 2022 19:57
@Markuze mentioned this pull request on Dec 5, 2022
areshand pushed a commit to areshand/aptos-core-1 that referenced this pull request on Dec 18, 2022
* Merge sequential and parallel flows

* rename parallel to block
@Markuze mentioned this pull request on Dec 26, 2022