compat expansion; forge refactor #13302

brianolson · 2024-05-16T20:16:15Z

Description

Expand "compat" forge test to simultaneously do traffic generation, gather stats, and run a gradual upgrade of validator nodes. Ensure that TPS stays high enough during upgrade.

Lots of refactor to get there, replacing lots of &Foo and &mut Foo with Arc<Mutex<Foo>> to make things multithread capable.

Type of Change

Which Components or Systems Does This Change Impact?

How Has This Been Tested?

This is tests. This PR running tests in forge clusters is the test.

Key Areas to Review

Is this idiomatic Rust? Is there a better way?

Checklist

I have read and followed the CONTRIBUTING doc
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I identified and added all stakeholders and component owners affected by this change as reviewers
I tested both happy and unhappy path of the functionality
I have made corresponding changes to the documentation

lots of `&foo` and `&mut foo` replaced with Arc<Mutex<Foo>>

cleanup

trunk-io · 2024-05-16T20:16:19Z

⏱️ 4h 34m total CI duration on this PR

Job	Cumulative Duration	Recent Runs
forge-compat-test / forge	54m	🟥
rust-images-performance / rust-all	33m	🟥 🟩
execution-performance / single-node-performance	31m	🟥 🟥 🟥 🟥
rust-images / rust-all	22m	🟥 🟩
forge-e2e-test / forge	20m	🟩
rust-smoke-tests	20m	🟥 🟥
rust-targeted-unit-tests	14m	🟥 🟥
execution-performance / test-target-determinator	13m	🟩 🟩 🟩 🟩
rust-lints	9m	🟥 🟥
run-tests-main-branch	8m	🟩 🟩
check	8m	🟩 🟩
rust-build-cached-packages	8m	🟩 🟩
test-target-determinator	6m	🟩 🟩
rust-move-tests	6m	🟩 🟩
cli-e2e-tests / run-cli-tests	6m	🟩
check-dynamic-deps	5m	🟩 🟩 🟩 🟩
general-lints	5m	🟩 🟩
indexer-grpc-e2e-tests / test-indexer-grpc-docker-compose	2m	🟩
semgrep/ci	2m	🟩 🟩 🟩 🟩
node-api-compatibility-tests / node-api-compatibility-tests	49s	🟩
file_change_determinator	23s	🟩 🟩
file_change_determinator	19s	🟩 🟩
file_change_determinator	18s	🟩 🟩
permission-check	8s	🟩 🟩
permission-check	8s	🟩 🟩 🟩
permission-check	6s	🟩 🟩
permission-check	4s	🟩 🟩
determine-docker-build-metadata	4s	🟩 🟩
permission-check	4s	🟩 🟩

🚨 4 jobs on the last run were significantly faster/slower than expected

Job	Duration	vs 7d avg
forge-compat-test / forge	54m	14m
rust-images / rust-all	18m	13m
forge-e2e-test / forge	20m	15m
rust-move-tests	3m	11m

_{settings ⋅ feedback ⋅ docs ⋅ learn more about trunk.io}

Arc<std::sync::Mutex<LocalAccount>> -> Arc<LocalAccount> because it contains an atomic counter which hides mutability and is safe, no additional mutex needed or desired Some back to just &LocalAccount Add tokio Handle to NetworkContextSynchronizer and use it for async-ness inside NetworkTest run() implementations. NetworkContextSynchronizer Arc<Mutex<NetworkContext<'t>>> -> Arc<tokio::sync::Mutex<NetworkContext<'t>>> because tokio contaminates and enblobifiies all

brianolson · 2024-06-24T17:29:53Z

The most likely failure mode (which I encountered several times while getting the tests I have run to pass) is that some function declares a new tokio Runtime and calls .block_on() inside some function that is now newly under an async function anywhere in the call stack above it. This will panic and crash. It's annoying but 'not hard' to fix. The usual fix is to make the function with the Runtime to instead itself be async and then block_on(foo()) -> foo().await.

I have probably missed some cases where this will happen, and there will probably be test failures that then take 30-60 minutes each of developer time to fix. I know I caught like a dozen sites where this would have happened; I probably missed some. If there's a tag we can apply to this PR to "run all forge the tests", we should do that?

This change creates a better signal from the compat test. I think being able to measure continued sustained TPS during a rolling restart is a pretty good thing! We could apply that to other tests if they are currently serialized phases of "measure TPS", "do change", "measure TPS".

reference vs Arc? Yeah, there will probably be more of that. This much is just what I needed to get the compat change I wanted. It looks like a big change because it touched some interfaces that were propagated to many parts of the code. If you pretend that changing the signature of a function that's implemented 18 times is 'one change' then it's smaller? :-D

igor-aptos

thanks for changing all these, having all these be async is so much cleaner

igor-aptos · 2024-06-24T17:05:49Z

crates/transaction-generator-lib/src/accounts_pool_wrapper.rs

+        // Wrap LocalAccount in Arc+Mutex
+        // let account_arcs : Vec<Arc<LocalAccount>> = accounts_to_use.into_iter().map(Arc::new).collect();
+        // get txns
        let txns = accounts_to_use
-            .iter_mut()
+            .iter()
            .flat_map(|account| self.generator.generate_transactions(account, 1))
            .collect();
+        // let txns = accounts_to_use
+        //     .iter_mut()
+        //     .flat_map(|account| {
+        //
+        //         self.generator.generate_transactions(account, 1)
+        //     })
+        //     .collect();

+        // back to plain LocalAccount, add to accounts
+        // let accounts_to_use = account_arcs.into_iter().map(|account| {
+        //     Arc::into_inner(account).unwrap()
+        // }).collect();


what is all the commented code here for ? is it needed, or should it be deleted?

igor-aptos · 2024-06-24T17:07:07Z

testsuite/forge-cli/src/main.rs

+    let duration = if args.suite == "compat" {
+        // TODO: if this needs to be more perminent than hacking into this branch, edit
+        // .github/workflows/docker-build-test.yaml
+        Duration::from_secs(30 * 60)
+    } else {
+        Duration::from_secs(args.duration_secs as u64)
+    };


I don't understand why do we need to do this, doesn't compat have it's own duration specified?

@rustielin

igor-aptos · 2024-06-24T17:46:00Z

testsuite/forge/src/interface/network.rs

-    pub fn swarm(&mut self) -> &mut dyn Swarm {
-        self.swarm
-    }
+    // pub fn swarm(&mut self) -> &mut dyn Swarm {


igor-aptos · 2024-06-24T17:47:48Z

testsuite/forge/src/test_utils/consensus_utils.rs

+    // let validator_clients = {
+    //     swarm.read().await.get_validator_clients_with_names()
+    // };


igor-aptos · 2024-06-24T17:48:06Z

testsuite/forge/src/test_utils/consensus_utils.rs

@@ -142,7 +146,8 @@ pub async fn test_consensus_fault_tolerance(
        }

        if new_epoch_on_cycle {
-            swarm.aptos_public_info().reconfig().await;
+            // swarm.read().await.aptos_public_info().reconfig().await;


igor-aptos · 2024-06-24T17:49:38Z

testsuite/testcases/src/consensus_reliability_tests.rs

        // because we are doing failure testing, we should be sending
        // traffic to nodes that are alive.
-        if ctx.swarm().full_nodes().count() > 0 {
+        let full_nodes_count = { ctx.swarm.read().await.full_nodes().count() };


why do you have braces here?

to make sure I get exactly the scope I want on the read-lock object

brianolson · 2024-06-24T18:56:09Z

@igor-aptos found a bunch of things my brain had clearly glossed over after reading the diff too many times. Fixed those and some more cleanups I found from doing another fresh read through.

Changing the compat duration might be significant. I'll be watching the current run with interest.

igor-aptos · 2024-06-24T19:02:21Z

for the compat, for testing you can do it here, but before landing - you should be changing it in the compat config file, not hardcode in the code, correct?

github-actions · 2024-06-27T14:19:44Z

✅ Forge suite `compat` success on `f648076a280621dbfd4e73b1ca83e3a3f52878ed` ==> `e138cfb4234c663a7bfd76398ef1a0ecc4c7ed15`

Compatibility test results for f648076a280621dbfd4e73b1ca83e3a3f52878ed ==> e138cfb4234c663a7bfd76398ef1a0ecc4c7ed15 (PR)
1. Check liveness of validators at old version: f648076a280621dbfd4e73b1ca83e3a3f52878ed
compatibility::simple-validator-upgrade::liveness-check : committed: 8711.337789176801 txn/s, latency: 3724.978431617785 ms, (p50: 2700 ms, p90: 6300 ms, p99: 27200 ms), latency samples: 319820
2. Upgrading first Validator to new version: e138cfb4234c663a7bfd76398ef1a0ecc4c7ed15
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 3022.142600786418 txn/s, latency: 8434.328949545079 ms, (p50: 9100 ms, p90: 12400 ms, p99: 12700 ms), latency samples: 72540
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 3220.7644115341172 txn/s, latency: 9603.694095513749 ms, (p50: 9500 ms, p90: 14700 ms, p99: 15000 ms), latency samples: 138200
3. Upgrading rest of first batch to new version: e138cfb4234c663a7bfd76398ef1a0ecc4c7ed15
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 2965.2197167968156 txn/s, latency: 8827.207835509864 ms, (p50: 9300 ms, p90: 11500 ms, p99: 11900 ms), latency samples: 71980
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 3228.181070086963 txn/s, latency: 9598.419116587405 ms, (p50: 9500 ms, p90: 14700 ms, p99: 15100 ms), latency samples: 138780
4. upgrading second batch to new version: e138cfb4234c663a7bfd76398ef1a0ecc4c7ed15
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 3702.635754811265 txn/s, latency: 6844.2262286324785 ms, (p50: 6400 ms, p90: 13700 ms, p99: 15700 ms), latency samples: 93600
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 6024.896805013498 txn/s, latency: 5344.613418251614 ms, (p50: 5000 ms, p90: 9300 ms, p99: 10500 ms), latency samples: 229240
5. check swarm health
Compatibility test for f648076a280621dbfd4e73b1ca83e3a3f52878ed ==> e138cfb4234c663a7bfd76398ef1a0ecc4c7ed15 passed
Test Ok

github-actions · 2024-06-27T14:19:52Z

✅ Forge suite `realistic_env_max_load` success on `e138cfb4234c663a7bfd76398ef1a0ecc4c7ed15`

two traffics test: inner traffic : committed: 8606.631772933848 txn/s, latency: 4540.769147145537 ms, (p50: 4500 ms, p90: 5400 ms, p99: 9600 ms), latency samples: 3731000
two traffics test : committed: 100.0049681329498 txn/s, latency: 2083.022988505747 ms, (p50: 2000 ms, p90: 2300 ms, p99: 5500 ms), latency samples: 1740
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.213, avg: 0.208", "QsPosToProposal: max: 0.327, avg: 0.249", "ConsensusProposalToOrdered: max: 0.311, avg: 0.286", "ConsensusOrderedToCommit: max: 0.368, avg: 0.356", "ConsensusProposalToCommit: max: 0.652, avg: 0.642"]
Max round gap was 1 [limit 4] at version 1858522. Max no progress secs was 4.873959 [limit 15] at version 1858522.
Test Ok

Expand "compat" forge test to simultaneously do traffic generation, gather stats, and run a gradual upgrade of validator nodes. Lots of refactor to get there, replacing lots of `&Foo` and `&mut Foo` with `Arc<Mutex<Foo>>` and hidden internal mutability to make things multithread capable.

brianolson added 3 commits May 16, 2024 12:36

batch_update_gradually

ff39282

big refactor towards multi-threading

cd60397

lots of `&foo` and `&mut foo` replaced with Arc<Mutex<Foo>>

upgrade_and_gather_stats() on all phases

633dca2

cleanup

brianolson added CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR CICD:build-performance-images build performance docker image variants labels May 16, 2024

brianolson self-assigned this May 16, 2024

lock reference fix

e5d5d5e

This comment has been minimized.

Sign in to view

brianolson requested review from msmouse, lightmark and grao1991 as code owners May 31, 2024 20:29

This comment has been minimized.

Sign in to view

try not to create a tokio runtime within a tokio runtime

5980849

This comment has been minimized.

Sign in to view

async fn generate_traffic()

835735b

This comment has been minimized.

Sign in to view

perryjrandall requested review from sitalkedia, rustielin, igor-aptos and a team June 24, 2024 17:01

igor-aptos reviewed Jun 24, 2024

View reviewed changes

brianolson added 3 commits June 24, 2024 14:40

PR cleanup

fc46210

Merge remote-tracking branch 'origin/main' into compat2-big-refactor

95e73dd

cleanup

a6bc708

This comment has been minimized.

Sign in to view

igor-aptos approved these changes Jun 24, 2024

View reviewed changes

brianolson added 2 commits June 27, 2024 06:43

Merge remote-tracking branch 'origin/main' into compat2-big-refactor

ebb0a48

fix

e138cfb

This comment has been minimized.

Sign in to view

brianolson enabled auto-merge (squash) June 27, 2024 13:36

This comment has been minimized.

Sign in to view

brianolson merged commit de7be81 into aptos-labs:main Jun 27, 2024
88 of 89 checks passed

bchocho mentioned this pull request Jun 28, 2024

brian/test graceful overload #13856

Closed

21 tasks

compat expansion; forge refactor #13302

compat expansion; forge refactor #13302

Conversation

brianolson commented May 16, 2024

Description

Type of Change

Which Components or Systems Does This Change Impact?

How Has This Been Tested?

Key Areas to Review

Checklist

trunk-io bot commented May 16, 2024 • edited Loading

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

brianolson commented Jun 24, 2024

igor-aptos left a comment

Choose a reason for hiding this comment

igor-aptos Jun 24, 2024

Choose a reason for hiding this comment

igor-aptos Jun 24, 2024

Choose a reason for hiding this comment

igor-aptos Jun 24, 2024

Choose a reason for hiding this comment

igor-aptos Jun 24, 2024

Choose a reason for hiding this comment

igor-aptos Jun 24, 2024

Choose a reason for hiding this comment

igor-aptos Jun 24, 2024

Choose a reason for hiding this comment

brianolson Jun 24, 2024

Choose a reason for hiding this comment

brianolson commented Jun 24, 2024

igor-aptos commented Jun 24, 2024

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

github-actions bot commented Jun 27, 2024

✅ Forge suite compat success on f648076a280621dbfd4e73b1ca83e3a3f52878ed ==> e138cfb4234c663a7bfd76398ef1a0ecc4c7ed15

github-actions bot commented Jun 27, 2024

✅ Forge suite realistic_env_max_load success on e138cfb4234c663a7bfd76398ef1a0ecc4c7ed15

trunk-io bot commented May 16, 2024 •

edited

Loading

✅ Forge suite `compat` success on `f648076a280621dbfd4e73b1ca83e3a3f52878ed` ==> `e138cfb4234c663a7bfd76398ef1a0ecc4c7ed15`

✅ Forge suite `realistic_env_max_load` success on `e138cfb4234c663a7bfd76398ef1a0ecc4c7ed15`