[forge] LatencyBreakdown counters and use them in success criteria #9393

Merged
igor-aptos merged 3 commits into aptos-labs:main from igor/forge_fetch_counters
Aug 3, 2023

Conversation

@igor-aptos igor-aptos commented Jul 31, 2023

  • Refactor SystemMetrics to be cleaner, and build the LatencyBreakdown metrics in the same flow.

  • Have Swarm expose a generic API for Prometheus range queries, and move validation and the choice of what to fetch into SuccessCriteria.

  • Move retries from the system-metrics fetch alone into the Prometheus calls themselves.

  • Rename system_metrics.rs to prometheus_metrics.rs, and fetch both system and latency metrics there.

  • Move the threshold logic into SuccessCriteria.

  • Fetch the QS and consensus latency breakdown, and add a way to assert that they pass.
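
As a rough illustration of two of the points above (retries living at the level of the Prometheus call, and threshold validation living with the success criteria), here is a minimal, dependency-free sketch. Every name in it (`query_with_retries`, `MetricsThreshold`, `check_threshold`) is hypothetical and not taken from the Forge code; the real code presumably queries Prometheus asynchronously, whereas this sketch accepts any fallible closure.

```rust
use std::fmt::Debug;

/// Hypothetical helper: run a fallible query up to `max_attempts` times,
/// returning the first success or the last error.
fn query_with_retries<T, E: Debug>(
    max_attempts: usize,
    mut query: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    assert!(max_attempts > 0, "need at least one attempt");
    let mut attempt = 1;
    loop {
        match query() {
            Ok(value) => return Ok(value),
            // Retry only while attempts remain.
            Err(err) if attempt < max_attempts => {
                eprintln!("attempt {attempt} failed: {err:?}; retrying");
                attempt += 1;
            }
            // Out of attempts: surface the last error to the caller.
            Err(err) => return Err(err),
        }
    }
}

/// Hypothetical threshold: at most `max_breach_pct` percent of the samples
/// may exceed `max`.
struct MetricsThreshold {
    max: f64,
    max_breach_pct: f64,
}

/// Validation kept separate from fetching, so the caller decides what to
/// fetch and how strictly to judge it.
fn check_threshold(samples: &[f64], threshold: &MetricsThreshold) -> Result<(), String> {
    if samples.is_empty() {
        return Err("no samples to validate".to_string());
    }
    let breaches = samples.iter().filter(|&&s| s > threshold.max).count();
    let breach_pct = 100.0 * breaches as f64 / samples.len() as f64;
    if breach_pct > threshold.max_breach_pct {
        Err(format!(
            "{breach_pct:.1}% of samples exceeded {} (allowed: {}%)",
            threshold.max, threshold.max_breach_pct
        ))
    } else {
        Ok(())
    }
}
```

With this split, every query gets the same retry behavior regardless of which metric it fetches, and the pass/fail decision stays in one place.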

@igor-aptos igor-aptos added the CICD:run-e2e-tests label (when this label is present, GitHub Actions will run all land-blocking e2e tests from the PR) Jul 31, 2023
@igor-aptos igor-aptos force-pushed the igor/forge_fetch_counters branch 3 times, most recently from c91b8ab to fee3b95 Compare August 1, 2023 03:27
@igor-aptos igor-aptos force-pushed the igor/forge_fetch_counters branch from fee3b95 to 461bd1d Compare August 1, 2023 04:27
@igor-aptos igor-aptos force-pushed the igor/forge_fetch_counters branch from 461bd1d to 3e0acf5 Compare August 1, 2023 14:12
@igor-aptos igor-aptos removed the CICD:run-e2e-tests label (when this label is present, GitHub Actions will run all land-blocking e2e tests from the PR) Aug 2, 2023
@igor-aptos igor-aptos force-pushed the igor/forge_fetch_counters branch from 3df527e to 21b4764 Compare August 2, 2023 06:48

@sitalkedia sitalkedia left a comment

LGTM % minor nits

Ok(range
    .first()
    .ok_or_else(|| {
        anyhow!(
You can probably just unwrap() here as you ensure the length is 1 above.
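
The trade-off raised in this comment can be sketched as follows. This is a dependency-free illustration, not the code from the PR: the function names are hypothetical, and a `String` error stands in for the `anyhow!` error used in the snippet.

```rust
/// Hypothetical helper: return the only element of `range`, or an error
/// if it does not contain exactly one element.
fn single_value(range: &[f64]) -> Result<f64, String> {
    if range.len() != 1 {
        return Err(format!("expected exactly one value, got {}", range.len()));
    }
    // The length check above guarantees that `first()` is `Some`, so
    // `unwrap()` cannot panic here.
    Ok(*range.first().unwrap())
}

/// Alternative: a slice pattern performs the length check and the
/// extraction in one step, avoiding both `unwrap()` and `ok_or_else`.
fn single_value_by_pattern(range: &[f64]) -> Result<f64, String> {
    match range {
        [value] => Ok(*value),
        other => Err(format!("expected exactly one value, got {}", other.len())),
    }
}
```

Both variants behave the same; the second keeps the invariant and the access in a single expression, so a later edit cannot separate them by accident.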

@@ -216,6 +218,24 @@ struct Resize {
    enable_haproxy: bool,
}

// common metrics thresholds:
static SYSTEM_12_CORES_5GB_THRESHOLD: Lazy<SystemMetricsThreshold> = Lazy::new(|| {
nit: Better to move it to system_metrics module

igor-aptos (Contributor, Author) replied:

There is no system_metrics module any more :)

The threshold structs are defined in SuccessCriteria, so I don't think there is a better place. They are only used in this file, so it seems fine to keep the constants here.
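
For readers unfamiliar with the pattern under discussion, a file-local, lazily initialized threshold constant might look roughly like this. The diff uses `Lazy`; this sketch uses the standard library's `OnceLock` instead so that it has no dependencies. The struct fields and the values 12.0 and 5.0 are assumptions inferred only from the constant's name.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for the threshold struct; the field names are
/// assumptions, not the real definition.
#[derive(Debug, Clone, PartialEq)]
struct SystemMetricsThreshold {
    max_cpu_cores: f64,
    max_memory_gb: f64,
}

/// Lazily built, shared threshold, similar in spirit to the
/// `SYSTEM_12_CORES_5GB_THRESHOLD` static shown in the diff.
fn system_12_cores_5gb_threshold() -> &'static SystemMetricsThreshold {
    static THRESHOLD: OnceLock<SystemMetricsThreshold> = OnceLock::new();
    // The closure runs at most once; later calls return the same instance.
    THRESHOLD.get_or_init(|| SystemMetricsThreshold {
        max_cpu_cores: 12.0,
        max_memory_gb: 5.0,
    })
}
```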

.add_latency_threshold(3.4, LatencyType::P50)
.add_latency_threshold(4.5, LatencyType::P90)
.add_latency_breakdown_threshold(LatencyBreakdownThreshold::new_strict(
    0.3, 0.25, 0.8, 0.6,
Can you change this to a builder pattern, so that it's clear what these thresholds are and we can easily extend this further?
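
One possible shape for such a builder is sketched below. It is only an illustration of the request, not the API that was eventually merged. The PR does not say what the four positional values stand for, so the `LatencySlice` variants are placeholders, and the mapping of 0.3, 0.25, 0.8, and 0.6 onto them is arbitrary.

```rust
use std::collections::BTreeMap;

/// Placeholder latency slices; the real slice names are not given in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum LatencySlice {
    StageA,
    StageB,
    StageC,
    StageD,
}

/// Hypothetical threshold: a maximum latency in seconds per slice.
#[derive(Debug)]
struct LatencyBreakdownThreshold {
    max_seconds: BTreeMap<LatencySlice, f64>,
}

/// Hypothetical builder: each limit is named at the call site.
#[derive(Debug, Default)]
struct LatencyBreakdownThresholdBuilder {
    max_seconds: BTreeMap<LatencySlice, f64>,
}

impl LatencyBreakdownThresholdBuilder {
    fn new() -> Self {
        Self::default()
    }

    /// Set the maximum allowed latency for one slice.
    fn max(mut self, slice: LatencySlice, seconds: f64) -> Self {
        self.max_seconds.insert(slice, seconds);
        self
    }

    fn build(self) -> LatencyBreakdownThreshold {
        LatencyBreakdownThreshold { max_seconds: self.max_seconds }
    }
}

impl LatencyBreakdownThreshold {
    /// Slices whose observed latency strictly exceeds the configured maximum.
    /// Slices without an observation are not reported.
    fn violations(&self, observed: &BTreeMap<LatencySlice, f64>) -> Vec<LatencySlice> {
        self.max_seconds
            .iter()
            .filter(|(slice, max)| observed.get(*slice).map_or(false, |v| v > *max))
            .map(|(slice, _)| *slice)
            .collect()
    }
}

/// Call site: compare with the positional `new_strict(0.3, 0.25, 0.8, 0.6)`.
fn example_threshold() -> LatencyBreakdownThreshold {
    LatencyBreakdownThresholdBuilder::new()
        .max(LatencySlice::StageA, 0.3)
        .max(LatencySlice::StageB, 0.25)
        .max(LatencySlice::StageC, 0.8)
        .max(LatencySlice::StageD, 0.6)
        .build()
}
```

Adding a fifth slice later would only require a new enum variant and one more `.max(...)` call, rather than changing a positional constructor and every caller.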

@igor-aptos igor-aptos force-pushed the igor/forge_fetch_counters branch from 21b4764 to 00f6842 Compare August 2, 2023 15:57
@igor-aptos igor-aptos enabled auto-merge (squash) August 2, 2023 15:59
@igor-aptos igor-aptos force-pushed the igor/forge_fetch_counters branch from 00f6842 to bad609e Compare August 2, 2023 20:15
@igor-aptos igor-aptos force-pushed the igor/forge_fetch_counters branch from bad609e to f6322d5 Compare August 3, 2023 05:06

github-actions bot commented Aug 3, 2023

✅ Forge suite compat success on aptos-node-v1.5.1 ==> f6322d562e034006cf20de3de1a45a5bd8ea8bce

Compatibility test results for aptos-node-v1.5.1 ==> f6322d562e034006cf20de3de1a45a5bd8ea8bce (PR)
1. Check liveness of validators at old version: aptos-node-v1.5.1
compatibility::simple-validator-upgrade::liveness-check : committed: 4727 txn/s, latency: 6490 ms, (p50: 6700 ms, p90: 9000 ms, p99: 9900 ms), latency samples: 184380
2. Upgrading first Validator to new version: f6322d562e034006cf20de3de1a45a5bd8ea8bce
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 1770 txn/s, latency: 15842 ms, (p50: 19200 ms, p90: 21900 ms, p99: 22400 ms), latency samples: 92060
3. Upgrading rest of first batch to new version: f6322d562e034006cf20de3de1a45a5bd8ea8bce
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 1771 txn/s, latency: 15874 ms, (p50: 19100 ms, p90: 22000 ms, p99: 22600 ms), latency samples: 92140
4. upgrading second batch to new version: f6322d562e034006cf20de3de1a45a5bd8ea8bce
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 3371 txn/s, latency: 9354 ms, (p50: 10100 ms, p90: 13500 ms, p99: 13900 ms), latency samples: 141600
5. check swarm health
Compatibility test for aptos-node-v1.5.1 ==> f6322d562e034006cf20de3de1a45a5bd8ea8bce passed
Test Ok


github-actions bot commented Aug 3, 2023

✅ Forge suite realistic_env_max_load success on f6322d562e034006cf20de3de1a45a5bd8ea8bce

two traffics test: inner traffic : committed: 6231 txn/s, latency: 6292 ms, (p50: 6000 ms, p90: 8100 ms, p99: 11700 ms), latency samples: 2698240
two traffics test : committed: 100 txn/s, latency: 3015 ms, (p50: 2900 ms, p90: 3500 ms, p99: 7200 ms), latency samples: 1820
Max round gap was 1 [limit 4] at version 1294434. Max no progress secs was 3.7926931 [limit 10] at version 1294434.
Test Ok


github-actions bot commented Aug 3, 2023

✅ Forge suite framework_upgrade success on aptos-node-v1.5.1 ==> f6322d562e034006cf20de3de1a45a5bd8ea8bce

Compatibility test results for aptos-node-v1.5.1 ==> f6322d562e034006cf20de3de1a45a5bd8ea8bce (PR)
Upgrade the nodes to version: f6322d562e034006cf20de3de1a45a5bd8ea8bce
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 4433 txn/s, latency: 7325 ms, (p50: 7800 ms, p90: 10500 ms, p99: 12300 ms), latency samples: 164040
5. check swarm health
Compatibility test for aptos-node-v1.5.1 ==> f6322d562e034006cf20de3de1a45a5bd8ea8bce passed
Test Ok

@igor-aptos igor-aptos merged commit 1063755 into aptos-labs:main Aug 3, 2023
xbtmatt pushed a commit that referenced this pull request Aug 13, 2023
…9393)

* refactor SystemMetrics to be cleaner, and build LatencyBreakdown metrics in the same flow
* have Swarm surface API for generic querying range from prometheus, and move validation and what needs to be fetched to SuccessCriteria
* move retries from system metrics alone, to prometheus calls themselves
* rename system_metrics.rs to prometheus_metrics.rs - and have fetching of system and latency metrics there
* move threshold logic to SuccessCriteria
* add fetching QS and consensus latency breakdown, and having a way to assert they pass.
* fixing construct_query_with_extra_labels, and updating land_blocking checks
3 participants