Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compat expansion; forge refactor #13302

Merged
merged 35 commits into from
Jun 27, 2024

Conversation

brianolson
Copy link
Contributor

Description

Expand "compat" forge test to simultaneously do traffic generation, gather stats, and run a gradual upgrade of validator nodes. Ensure that TPS stays high enough during upgrade.

Lots of refactor to get there, replacing lots of &Foo and &mut Foo with Arc<Mutex<Foo>> to make things multithread capable.

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Other (specify)

How Has This Been Tested?

This is tests. This PR running tests in forge clusters is the test.

Key Areas to Review

Is this idiomatic Rust? Is there a better way?

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation

@brianolson brianolson added CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR CICD:build-performance-images build performance docker image variants labels May 16, 2024
@brianolson brianolson self-assigned this May 16, 2024
Copy link

trunk-io bot commented May 16, 2024

⏱️ 4h 34m total CI duration on this PR
Job Cumulative Duration Recent Runs
forge-compat-test / forge 54m 🟥
rust-images-performance / rust-all 33m 🟥🟩
execution-performance / single-node-performance 31m 🟥🟥🟥🟥
rust-images / rust-all 22m 🟥🟩
forge-e2e-test / forge 20m 🟩
rust-smoke-tests 20m 🟥🟥
rust-targeted-unit-tests 14m 🟥🟥
execution-performance / test-target-determinator 13m 🟩🟩🟩🟩
rust-lints 9m 🟥🟥
run-tests-main-branch 8m 🟩🟩
check 8m 🟩🟩
rust-build-cached-packages 8m 🟩🟩
test-target-determinator 6m 🟩🟩
rust-move-tests 6m 🟩🟩
cli-e2e-tests / run-cli-tests 6m 🟩
check-dynamic-deps 5m 🟩🟩🟩🟩
general-lints 5m 🟩🟩
indexer-grpc-e2e-tests / test-indexer-grpc-docker-compose 2m 🟩
semgrep/ci 2m 🟩🟩🟩🟩
node-api-compatibility-tests / node-api-compatibility-tests 49s 🟩
file_change_determinator 23s 🟩🟩
file_change_determinator 19s 🟩🟩
file_change_determinator 18s 🟩🟩
permission-check 8s 🟩🟩
permission-check 8s 🟩🟩🟩
permission-check 6s 🟩🟩
permission-check 4s 🟩🟩
determine-docker-build-metadata 4s 🟩🟩
permission-check 4s 🟩🟩

🚨 4 jobs on the last run were significantly faster/slower than expected

Job Duration vs 7d avg Delta
forge-compat-test / forge 54m 14m +287%
rust-images / rust-all 18m 13m +40%
forge-e2e-test / forge 20m 15m +33%
rust-move-tests 3m 11m -74%

settingsfeedbackdocs ⋅ learn more about trunk.io

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

Arc<std::sync::Mutex<LocalAccount>> -> Arc<LocalAccount>
  because it contains an atomic counter which hides mutability and is safe, no additional mutex needed or desired
Some back to just &LocalAccount
Add tokio Handle to NetworkContextSynchronizer and use it for async-ness inside NetworkTest run() implementations.
NetworkContextSynchronizer Arc<Mutex<NetworkContext<'t>>> -> Arc<tokio::sync::Mutex<NetworkContext<'t>>>
  because tokio contaminates and enblobifiies all

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

@perryjrandall perryjrandall requested review from sitalkedia, rustielin, igor-aptos and a team June 24, 2024 17:01
@brianolson
Copy link
Contributor Author

The most likely failure mode (which I encountered several times while getting the tests I have run to pass) is that some function declares a new tokio Runtime and calls .block_on() inside some function that is now newly under an async function anywhere in the call stack above it. This will panic and crash. It's annoying but 'not hard' to fix. The usual fix is to make the function with the Runtime to instead itself be async and then block_on(foo()) -> foo().await.

I have probably missed some cases where this will happen, and there will probably be test failures that then take 30-60 minutes each of developer time to fix. I know I caught like a dozen sites where this would have happened; I probably missed some. If there's a tag we can apply to this PR to "run all forge the tests", we should do that?

This change creates a better signal from the compat test. I think being able to measure continued sustained TPS during a rolling restart is a pretty good thing! We could apply that to other tests if they are currently serialized phases of "measure TPS", "do change", "measure TPS".

reference vs Arc? Yeah, there will probably be more of that. This much is just what I needed to get the compat change I wanted. It looks like a big change because it touched some interfaces that were propagated to many parts of the code. If you pretend that changing the signature of a function that's implemented 18 times is 'one change' then it's smaller? :-D

Copy link
Contributor

@igor-aptos igor-aptos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for changing all these, having all these be async is so much cleaner

Comment on lines 49 to 67
// Wrap LocalAccount in Arc+Mutex
// let account_arcs : Vec<Arc<LocalAccount>> = accounts_to_use.into_iter().map(Arc::new).collect();
// get txns
let txns = accounts_to_use
.iter_mut()
.iter()
.flat_map(|account| self.generator.generate_transactions(account, 1))
.collect();
// let txns = accounts_to_use
// .iter_mut()
// .flat_map(|account| {
//
// self.generator.generate_transactions(account, 1)
// })
// .collect();

// back to plain LocalAccount, add to accounts
// let accounts_to_use = account_arcs.into_iter().map(|account| {
// Arc::into_inner(account).unwrap()
// }).collect();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is all the commented code here for ? is it needed, or should it be deleted?

Comment on lines 282 to 288
let duration = if args.suite == "compat" {
// TODO: if this needs to be more perminent than hacking into this branch, edit
// .github/workflows/docker-build-test.yaml
Duration::from_secs(30 * 60)
} else {
Duration::from_secs(args.duration_secs as u64)
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why do we need to do this, doesn't compat have it's own duration specified?

@rustielin

pub fn swarm(&mut self) -> &mut dyn Swarm {
self.swarm
}
// pub fn swarm(&mut self) -> &mut dyn Swarm {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Comment on lines 73 to 75
// let validator_clients = {
// swarm.read().await.get_validator_clients_with_names()
// };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove?

@@ -142,7 +146,8 @@ pub async fn test_consensus_fault_tolerance(
}

if new_epoch_on_cycle {
swarm.aptos_public_info().reconfig().await;
// swarm.read().await.aptos_public_info().reconfig().await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove?

// because we are doing failure testing, we should be sending
// traffic to nodes that are alive.
if ctx.swarm().full_nodes().count() > 0 {
let full_nodes_count = { ctx.swarm.read().await.full_nodes().count() };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you have braces here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to make sure I get exactly the scope I want on the read-lock object

@brianolson
Copy link
Contributor Author

@igor-aptos found a bunch of things my brain had clearly glossed over after reading the diff too many times. Fixed those and some more cleanups I found from doing another fresh read through.

Changing the compat duration might be significant. I'll be watching the current run with interest.

@igor-aptos
Copy link
Contributor

for the compat, for testing you can do it here, but before landing - you should be changing it in the compat config file, not hardcode in the code, correct?

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

@brianolson brianolson enabled auto-merge (squash) June 27, 2024 13:36

This comment has been minimized.

This comment has been minimized.

Copy link
Contributor

✅ Forge suite compat success on f648076a280621dbfd4e73b1ca83e3a3f52878ed ==> e138cfb4234c663a7bfd76398ef1a0ecc4c7ed15

Compatibility test results for f648076a280621dbfd4e73b1ca83e3a3f52878ed ==> e138cfb4234c663a7bfd76398ef1a0ecc4c7ed15 (PR)
1. Check liveness of validators at old version: f648076a280621dbfd4e73b1ca83e3a3f52878ed
compatibility::simple-validator-upgrade::liveness-check : committed: 8711.337789176801 txn/s, latency: 3724.978431617785 ms, (p50: 2700 ms, p90: 6300 ms, p99: 27200 ms), latency samples: 319820
2. Upgrading first Validator to new version: e138cfb4234c663a7bfd76398ef1a0ecc4c7ed15
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 3022.142600786418 txn/s, latency: 8434.328949545079 ms, (p50: 9100 ms, p90: 12400 ms, p99: 12700 ms), latency samples: 72540
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 3220.7644115341172 txn/s, latency: 9603.694095513749 ms, (p50: 9500 ms, p90: 14700 ms, p99: 15000 ms), latency samples: 138200
3. Upgrading rest of first batch to new version: e138cfb4234c663a7bfd76398ef1a0ecc4c7ed15
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 2965.2197167968156 txn/s, latency: 8827.207835509864 ms, (p50: 9300 ms, p90: 11500 ms, p99: 11900 ms), latency samples: 71980
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 3228.181070086963 txn/s, latency: 9598.419116587405 ms, (p50: 9500 ms, p90: 14700 ms, p99: 15100 ms), latency samples: 138780
4. upgrading second batch to new version: e138cfb4234c663a7bfd76398ef1a0ecc4c7ed15
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 3702.635754811265 txn/s, latency: 6844.2262286324785 ms, (p50: 6400 ms, p90: 13700 ms, p99: 15700 ms), latency samples: 93600
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 6024.896805013498 txn/s, latency: 5344.613418251614 ms, (p50: 5000 ms, p90: 9300 ms, p99: 10500 ms), latency samples: 229240
5. check swarm health
Compatibility test for f648076a280621dbfd4e73b1ca83e3a3f52878ed ==> e138cfb4234c663a7bfd76398ef1a0ecc4c7ed15 passed
Test Ok

Copy link
Contributor

✅ Forge suite realistic_env_max_load success on e138cfb4234c663a7bfd76398ef1a0ecc4c7ed15

two traffics test: inner traffic : committed: 8606.631772933848 txn/s, latency: 4540.769147145537 ms, (p50: 4500 ms, p90: 5400 ms, p99: 9600 ms), latency samples: 3731000
two traffics test : committed: 100.0049681329498 txn/s, latency: 2083.022988505747 ms, (p50: 2000 ms, p90: 2300 ms, p99: 5500 ms), latency samples: 1740
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.213, avg: 0.208", "QsPosToProposal: max: 0.327, avg: 0.249", "ConsensusProposalToOrdered: max: 0.311, avg: 0.286", "ConsensusOrderedToCommit: max: 0.368, avg: 0.356", "ConsensusProposalToCommit: max: 0.652, avg: 0.642"]
Max round gap was 1 [limit 4] at version 1858522. Max no progress secs was 4.873959 [limit 15] at version 1858522.
Test Ok

@brianolson brianolson merged commit de7be81 into aptos-labs:main Jun 27, 2024
88 of 89 checks passed
areshand pushed a commit that referenced this pull request Jun 27, 2024
Expand "compat" forge test to simultaneously do traffic generation, gather stats, and run a gradual upgrade of validator nodes.

Lots of refactor to get there, replacing lots of `&Foo` and `&mut Foo` with `Arc<Mutex<Foo>>` and hidden internal mutability to make things multithread capable.
@bchocho bchocho mentioned this pull request Jun 28, 2024
21 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CICD:build-performance-images build performance docker image variants CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants