Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[forge] fix chaos-mesh injection direction #13476

Merged
merged 1 commit into from
May 31, 2024
Merged

Conversation

ibalajiarun
Copy link
Contributor

@ibalajiarun ibalajiarun commented May 30, 2024

Description

It turns out, network chaos is only applied based on the source and destination pod IPs, however, we use k8s services in Forge cluster, which resolve to service IPs that are different from pod IPs. In order to support them, this PR extends the targets to include the service IPs. To do so, it leverages the externalTargets field which can either be domains or IP address and populates them with the service domains of the stateful sets.

externalTargets are only supported with to direction, so it splits the chaos with both direction into two each with to direction.

Also, fixes a bug where we applied the RTT delay in each direction instead of dividing it.

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Other (specify) Forge

How Has This Been Tested?

On Forge

Key Areas to Review

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation

Copy link

trunk-io bot commented May 30, 2024

⏱️ 20h 5m total CI duration on this PR
Job Cumulative Duration Recent Runs
forge-e2e-test / forge 7h 20m 🟥🟩🟩🟥🟩 (+22 more)
rust-images / rust-all 2h 54m 🟩🟩 (+11 more)
rust-targeted-unit-tests 1h 59m 🟩🟩🟩🟩 (+10 more)
rust-lints 1h 24m 🟥🟥🟥🟩 (+10 more)
run-tests-main-branch 1h 9m 🟩🟩🟩🟩🟥 (+11 more)
rust-move-tests 59m 🟩🟩🟩 (+12 more)
test-target-determinator 51m 🟩🟩🟩 (+11 more)
windows-build 43m 🟩
rust-smoke-tests 35m 🟩
check-dynamic-deps 29m 🟩🟩🟩🟩🟩 (+12 more)
general-lints 28m 🟩🟩🟩🟩🟩 (+10 more)
forge-framework-upgrade-test / forge 17m 🟩
forge-compat-test / forge 14m 🟩
rust-build-cached-packages 9m 🟩
semgrep/ci 7m 🟩🟩🟩🟩🟩 (+12 more)
cli-e2e-tests / run-cli-tests 6m 🟩
check 4m 🟩
file_change_determinator 3m 🟩🟩🟩🟩 (+12 more)
execution-performance / test-target-determinator 3m 🟩
file_change_determinator 3m 🟩🟩🟩🟩🟩 (+12 more)
file_change_determinator 3m 🟩🟩🟩 (+11 more)
node-api-compatibility-tests / node-api-compatibility-tests 1m 🟩
permission-check 55s 🟩🟩🟩🟩🟩 (+11 more)
permission-check 52s 🟩🟩🟩🟩🟩 (+12 more)
permission-check 51s 🟩🟩🟩🟩🟩 (+12 more)
permission-check 47s 🟩🟩🟩🟩🟩 (+12 more)
permission-check 40s 🟩🟩🟩🟩🟩 (+12 more)
determine-docker-build-metadata 38s 🟩🟩🟩🟩🟩 (+11 more)
execution-performance / single-node-performance 8s 🟩
check-terraform-modifications 1s 🟥🟥🟥🟥🟥 (+2 more)

🚨 4 jobs on the last run were significantly faster/slower than expected

Job Duration vs 7d avg Delta
rust-build-cached-packages 9m 5m +86%
rust-targeted-unit-tests 8m 19m -56%
rust-move-tests 3m 10m -68%
execution-performance / single-node-performance 8s 20m -99%

settingsfeedbackdocs ⋅ learn more about trunk.io

@ibalajiarun ibalajiarun added the CICD:run-forge-e2e-perf Run the e2e perf forge only label May 30, 2024

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

@ibalajiarun ibalajiarun force-pushed the balaji/fix-chaos-mesh branch 2 times, most recently from 591ee0c to a7dafbc Compare May 31, 2024 17:56
@ibalajiarun ibalajiarun marked this pull request as ready for review May 31, 2024 18:01

This comment has been minimized.

pub(crate) fn get_service_targets(&self, target_nodes: &[AccountAddress]) -> String {
target_nodes
.iter()
.filter_map(|node| self.get_service_name(node))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be returning the ip addresses instead of the service name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Services names get resolved via DNS. so still works.

This comment has been minimized.

@ibalajiarun ibalajiarun force-pushed the balaji/fix-chaos-mesh branch from a7dafbc to 0a151b1 Compare May 31, 2024 20:41
@ibalajiarun ibalajiarun enabled auto-merge (squash) May 31, 2024 20:41

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

Copy link
Contributor

✅ Forge suite compat success on a68e71c05caebf01504d4499110f3fba213fb53d ==> 0a151b173366cbb648f7b04857563e0c09ef5f36

Compatibility test results for a68e71c05caebf01504d4499110f3fba213fb53d ==> 0a151b173366cbb648f7b04857563e0c09ef5f36 (PR)
1. Check liveness of validators at old version: a68e71c05caebf01504d4499110f3fba213fb53d
compatibility::simple-validator-upgrade::liveness-check : committed: 9619.94679660746 txn/s, latency: 3108.9743953450957 ms, (p50: 2500 ms, p90: 3600 ms, p99: 14700 ms), latency samples: 372940
2. Upgrading first Validator to new version: 0a151b173366cbb648f7b04857563e0c09ef5f36
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 3363.847284117603 txn/s, latency: 9285.737432254284 ms, (p50: 9500 ms, p90: 13800 ms, p99: 14200 ms), latency samples: 136540
3. Upgrading rest of first batch to new version: 0a151b173366cbb648f7b04857563e0c09ef5f36
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 2982.8218366425076 txn/s, latency: 10423.194853298752 ms, (p50: 10400 ms, p90: 13900 ms, p99: 14100 ms), latency samples: 123380
4. upgrading second batch to new version: 0a151b173366cbb648f7b04857563e0c09ef5f36
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 6430.614139546484 txn/s, latency: 5112.025314315442 ms, (p50: 4800 ms, p90: 8100 ms, p99: 9200 ms), latency samples: 230660
5. check swarm health
Compatibility test for a68e71c05caebf01504d4499110f3fba213fb53d ==> 0a151b173366cbb648f7b04857563e0c09ef5f36 passed
Test Ok

This comment has been minimized.

Copy link
Contributor

✅ Forge suite framework_upgrade success on a68e71c05caebf01504d4499110f3fba213fb53d ==> 0a151b173366cbb648f7b04857563e0c09ef5f36

Compatibility test results for a68e71c05caebf01504d4499110f3fba213fb53d ==> 0a151b173366cbb648f7b04857563e0c09ef5f36 (PR)
Upgrade the nodes to version: 0a151b173366cbb648f7b04857563e0c09ef5f36
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1223.5985358083437 txn/s, submitted: 1226.6912203184206 txn/s, failed submission: 3.0926845100770555 txn/s, expired: 3.0926845100770555 txn/s, latency: 2516.5109676836973 ms, (p50: 2100 ms, p90: 4400 ms, p99: 8100 ms), latency samples: 110780
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1071.551929928697 txn/s, submitted: 1074.6066493379826 txn/s, failed submission: 3.054719409285636 txn/s, expired: 3.054719409285636 txn/s, latency: 2701.9883424964364 ms, (p50: 2100 ms, p90: 4900 ms, p99: 7400 ms), latency samples: 98220
5. check swarm health
Compatibility test for a68e71c05caebf01504d4499110f3fba213fb53d ==> 0a151b173366cbb648f7b04857563e0c09ef5f36 passed
Upgrade the remaining nodes to version: 0a151b173366cbb648f7b04857563e0c09ef5f36
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1244.4389125687742 txn/s, submitted: 1247.0680088629338 txn/s, failed submission: 2.6290962941593823 txn/s, expired: 2.6290962941593823 txn/s, latency: 2487.3680897887325 ms, (p50: 2100 ms, p90: 3900 ms, p99: 7500 ms), latency samples: 113600
Test Ok

This comment has been minimized.

Copy link
Contributor

✅ Forge suite realistic_env_max_load success on 0a151b173366cbb648f7b04857563e0c09ef5f36

two traffics test: inner traffic : committed: 8518.733544427805 txn/s, latency: 4585.699779479094 ms, (p50: 4500 ms, p90: 5400 ms, p99: 9600 ms), latency samples: 3691260
two traffics test : committed: 100.03216205214551 txn/s, latency: 2061 ms, (p50: 2000 ms, p90: 2300 ms, p99: 6000 ms), latency samples: 1760
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.215, avg: 0.206", "QsPosToProposal: max: 0.220, avg: 0.210", "ConsensusProposalToOrdered: max: 0.420, avg: 0.377", "ConsensusOrderedToCommit: max: 0.376, avg: 0.363", "ConsensusProposalToCommit: max: 0.748, avg: 0.739"]
Max round gap was 1 [limit 4] at version 1790881. Max no progress secs was 4.99684 [limit 15] at version 1790881.
Test Ok

@ibalajiarun ibalajiarun merged commit a27fe1c into main May 31, 2024
67 of 86 checks passed
@ibalajiarun ibalajiarun deleted the balaji/fix-chaos-mesh branch May 31, 2024 21:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CICD:run-forge-e2e-perf Run the e2e perf forge only
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants