[consensus] enable round timeout message #14914

Merged: 3 commits on Oct 16, 2024

@ibalajiarun ibalajiarun commented Oct 9, 2024

Description

  • Enables the RoundTimeoutMsg on main. The last release should already contain all the handling logic, so the message can be turned on for the next release cut.
  • Adds counters to track the round timeout reason. If the timeout is due to a missing OptQS payload (OptQS is still disabled), the missing authors are tracked as well.
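
The rollout order in the first bullet (handlers ship first, sending is enabled one release later) can be sketched roughly as follows. This is only an illustration under stated assumptions: `NodeConfig`, `Outbound`, `on_local_timeout`, and `can_handle` are hypothetical names, and the idea that the older path carries the timeout inside a vote-style message is an assumption, not something this PR description confirms.

```rust
// Hypothetical sketch: every name here is illustrative, not the project's API.

/// What a node broadcasts when its local round timer fires.
#[derive(Debug, PartialEq)]
enum Outbound {
    /// Assumed older path: the timeout travels inside a vote-style message.
    LegacyTimeoutVote { round: u64 },
    /// Newer path: a dedicated round timeout message.
    RoundTimeout { round: u64 },
}

/// Hypothetical local config holding the flag that this kind of change flips.
struct NodeConfig {
    enable_round_timeout_msg: bool,
}

/// Picks the outbound message based on the flag.
fn on_local_timeout(cfg: &NodeConfig, round: u64) -> Outbound {
    if cfg.enable_round_timeout_msg {
        Outbound::RoundTimeout { round }
    } else {
        Outbound::LegacyTimeoutVote { round }
    }
}

/// A receiver can always process the legacy message, but it can process the
/// dedicated message only if its release already contains the handler.
fn can_handle(has_round_timeout_handler: bool, msg: &Outbound) -> bool {
    match msg {
        Outbound::LegacyTimeoutVote { .. } => true,
        Outbound::RoundTimeout { .. } => has_round_timeout_handler,
    }
}

fn main() {
    let sender = NodeConfig { enable_round_timeout_msg: true };
    let msg = on_local_timeout(&sender, 7);
    println!("{:?}, handled by an up-to-date peer: {}", msg, can_handle(true, &msg));
}
```

Under these assumptions, turning the flag on is safe only once every peer runs a release that includes the handler, which is why the enablement trails the handling logic by one release.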

trunk-io bot commented Oct 9, 2024

⏱️ 7h 35m total CI duration on this PR

Slowest 15 Jobs | Cumulative Duration | Recent Runs
execution-performance / single-node-performance | 3h | 🟩🟩🟩🟩🟩 (+3 more)
execution-performance / test-target-determinator | 35m | 🟩🟩🟩🟩🟩 (+3 more)
test-target-determinator | 34m | 🟩🟩🟩🟩 (+4 more)
check | 26m | 🟩🟩🟩🟩 (+3 more)
dispatch_event | 15m | 🟥
forge-compat-test / forge | 13m | 🟩
rust-cargo-deny | 13m | 🟩🟩🟩🟩 (+3 more)
fetch-last-released-docker-image-tag | 13m | 🟩🟩🟩🟩 (+4 more)
rust-move-tests | 10m | 🟩
rust-move-tests | 10m | 🟩
rust-move-tests | 10m | 🟩
rust-move-tests | 10m | 🟩
rust-move-tests | 9m | 🟩
rust-move-tests | 9m | 🟩
rust-move-tests | 9m | 🟩

@ibalajiarun added the CICD:run-e2e-tests label (when this label is present, GitHub Actions will run all land-blocking e2e tests from the PR) on Oct 9, 2024

@ibalajiarun force-pushed the balaji/optqs-heuristics branch 4 times, most recently from ba1dc89 to 04a1dc8, on October 11, 2024 04:06
Base automatically changed from balaji/optqs-heuristics to main on October 11, 2024 04:36
@ibalajiarun force-pushed the balaji/round-timeout-enable branch 2 times, most recently from fcbc77a to f03ddb0, on October 11, 2024 04:43

@ibalajiarun marked this pull request as ready for review on October 15, 2024 22:18

counters::TIMEOUT_ROUNDS_COUNT.inc();
counters::ROUND_TIMEOUT_REASON
    .with_label_values(&[&reason.to_string(), &is_valid_proposer.to_string()])

Contributor

Why is this useful? Is this the previous round's timeout?

Contributor Author

Yes, this is the reason as aggregated by the node. The second label, a boolean, lets us compare what the next leader thinks the reason is with what the other nodes think. Ideally everyone aligns on the reason, but let's see how much it varies.
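
As a rough illustration of the two-label scheme described in this reply, here is a stdlib-only stand-in for a labeled counter. It is a sketch, not the metrics library used in the PR: `RoundTimeoutReasonCounter` and the reason strings are hypothetical, and a `HashMap` keyed by `(reason, is_proposer)` stands in for the real counter vector.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a counter vector with two labels:
/// the aggregated timeout reason and whether the recording node is the proposer.
#[derive(Default)]
struct RoundTimeoutReasonCounter {
    series: HashMap<(String, String), u64>,
}

impl RoundTimeoutReasonCounter {
    /// Increments the series identified by the two label values.
    fn inc(&mut self, reason: &str, is_proposer: bool) {
        let key = (reason.to_string(), is_proposer.to_string());
        *self.series.entry(key).or_insert(0) += 1;
    }

    /// Reads one series; a series that was never incremented reads as 0.
    fn get(&self, reason: &str, is_proposer: bool) -> u64 {
        let key = (reason.to_string(), is_proposer.to_string());
        self.series.get(&key).copied().unwrap_or(0)
    }
}

fn main() {
    let mut counter = RoundTimeoutReasonCounter::default();
    // Reason names are made up for this example.
    counter.inc("payload_unavailable", true); // recorded by the proposer
    counter.inc("payload_unavailable", false); // recorded by another node
    counter.inc("proposal_not_received", false);

    println!(
        "proposer view: {}, other nodes: {}",
        counter.get("payload_unavailable", true),
        counter.get("payload_unavailable", false)
    );
}
```

Keeping the boolean as a label rather than a separate metric means the proposer's view and everyone else's view of the same reason sit side by side and can be compared directly.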

for idx in missing_authors.iter_ones() {
    if let Some(author) = ordered_peers.get(idx) {
        counters::ROUND_TIMEOUT_REASON_MISSING_AUTHORS
            .with_label_values(&[author.short_str().as_str()])

Contributor

Is this the aggregated set of missing authors? Do we want to monitor the raw metrics for individual missing authors? We should be able to aggregate those in Grafana easily.

Contributor Author

We can collect the raw metrics as well, but we will need to monitor this aggregation anyway to be precise. This aggregation looks at a specific quorum, which is not possible with Grafana. I also wanted to make sure it's recorded only by the leader so we don't explode cardinality.
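
To make the quorum-level aggregation and the leader-only recording concrete, here is a minimal stdlib-only sketch. The aggregation rule is an assumption for illustration (an author counts as missing when at least `threshold` timeout messages in one quorum report it); the PR thread does not state the actual rule. `aggregate_missing_authors` and `record_missing_authors` are hypothetical names, and `Vec<bool>` stands in for the bit vector in the diff.

```rust
use std::collections::HashMap;

/// Returns the indices of authors reported missing by at least `threshold`
/// of the timeout messages in one quorum. Each report is a bitmask indexed
/// like the ordered validator list. The threshold rule is an assumption.
fn aggregate_missing_authors(reports: &[Vec<bool>], threshold: usize) -> Vec<usize> {
    let width = reports.iter().map(|r| r.len()).max().unwrap_or(0);
    let mut counts = vec![0usize; width];
    for report in reports {
        for (idx, missing) in report.iter().enumerate() {
            if *missing {
                counts[idx] += 1;
            }
        }
    }
    counts
        .iter()
        .enumerate()
        .filter(|(_, count)| **count >= threshold)
        .map(|(idx, _)| idx)
        .collect()
}

/// Increments one series per aggregated missing author, but only on the leader,
/// so non-leader nodes do not emit per-author series for this round.
fn record_missing_authors(
    is_leader: bool,
    aggregated: &[usize],
    ordered_peers: &[&str],
    counter: &mut HashMap<String, u64>,
) {
    if !is_leader {
        return;
    }
    for &idx in aggregated {
        // Skip indices that fall outside the validator list.
        if let Some(author) = ordered_peers.get(idx) {
            *counter.entry(author.to_string()).or_insert(0) += 1;
        }
    }
}

fn main() {
    let ordered_peers = ["a1", "b2", "c3"];
    let reports = vec![
        vec![true, false, true],
        vec![true, false, false],
        vec![false, false, true],
    ];
    let aggregated = aggregate_missing_authors(&reports, 2);
    let mut counter = HashMap::new();
    record_missing_authors(true, &aggregated, &ordered_peers, &mut counter);
    println!("aggregated indices: {:?}, series: {:?}", aggregated, counter);
}
```

Because the result depends on which reports ended up in one particular quorum, it cannot be rebuilt afterwards by summing each node's raw per-author counters in a dashboard.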

@ibalajiarun force-pushed the balaji/round-timeout-enable branch from 8e58924 to dd21d74 on October 15, 2024 22:52
@@ -354,14 +355,34 @@ impl RoundManager {
        &mut self,
        new_round_event: NewRoundEvent,
    ) -> anyhow::Result<()> {
        let is_valid_proposer = self

Contributor

nit: this should probably be "is_current_proposer".

@ibalajiarun force-pushed the balaji/round-timeout-enable branch from dd21d74 to eff64df on October 15, 2024 23:11

@ibalajiarun force-pushed the balaji/round-timeout-enable branch from eff64df to 17f1c72 on October 16, 2024 00:45
@ibalajiarun enabled auto-merge (squash) on October 16, 2024 00:45

@ibalajiarun force-pushed the balaji/round-timeout-enable branch from 17f1c72 to 453ee9f on October 16, 2024 03:22

Contributor

✅ Forge suite realistic_env_max_load success on 453ee9f50eea267f5fc7d6814f7f49725c135d2e

two traffics test: inner traffic : committed: 13566.31 txn/s, latency: 2933.46 ms, (p50: 2700 ms, p70: 3000, p90: 3300 ms, p99: 3600 ms), latency samples: 5158200
two traffics test : committed: 100.02 txn/s, latency: 2664.17 ms, (p50: 2500 ms, p70: 2700, p90: 3000 ms, p99: 8800 ms), latency samples: 1720
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.231, avg: 0.217", "QsPosToProposal: max: 0.288, avg: 0.247", "ConsensusProposalToOrdered: max: 0.307, avg: 0.299", "ConsensusOrderedToCommit: max: 0.474, avg: 0.458", "ConsensusProposalToCommit: max: 0.775, avg: 0.756"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.56s no progress at version 5828614 (avg 0.21s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 8.21s no progress at version 2597840 (avg 8.21s) [limit 15].
Test Ok

Contributor

✅ Forge suite framework_upgrade success on 7eeba4cd15892717741a614add1afde004c7855f ==> 453ee9f50eea267f5fc7d6814f7f49725c135d2e

Compatibility test results for 7eeba4cd15892717741a614add1afde004c7855f ==> 453ee9f50eea267f5fc7d6814f7f49725c135d2e (PR)
Upgrade the nodes to version: 453ee9f50eea267f5fc7d6814f7f49725c135d2e
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1196.20 txn/s, submitted: 1198.40 txn/s, failed submission: 2.21 txn/s, expired: 2.21 txn/s, latency: 2498.13 ms, (p50: 2400 ms, p70: 2700, p90: 3600 ms, p99: 5900 ms), latency samples: 108440
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1086.31 txn/s, submitted: 1088.59 txn/s, failed submission: 2.28 txn/s, expired: 2.28 txn/s, latency: 2773.45 ms, (p50: 2400 ms, p70: 2900, p90: 4700 ms, p99: 6900 ms), latency samples: 95360
5. check swarm health
Compatibility test for 7eeba4cd15892717741a614add1afde004c7855f ==> 453ee9f50eea267f5fc7d6814f7f49725c135d2e passed
Upgrade the remaining nodes to version: 453ee9f50eea267f5fc7d6814f7f49725c135d2e
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1058.99 txn/s, submitted: 1061.73 txn/s, failed submission: 2.74 txn/s, expired: 2.74 txn/s, latency: 2861.96 ms, (p50: 2400 ms, p70: 3000, p90: 4800 ms, p99: 6800 ms), latency samples: 92780
Test Ok

Contributor

✅ Forge suite compat success on 7eeba4cd15892717741a614add1afde004c7855f ==> 453ee9f50eea267f5fc7d6814f7f49725c135d2e

Compatibility test results for 7eeba4cd15892717741a614add1afde004c7855f ==> 453ee9f50eea267f5fc7d6814f7f49725c135d2e (PR)
1. Check liveness of validators at old version: 7eeba4cd15892717741a614add1afde004c7855f
compatibility::simple-validator-upgrade::liveness-check : committed: 12641.62 txn/s, latency: 2708.37 ms, (p50: 1900 ms, p70: 2100, p90: 3300 ms, p99: 22700 ms), latency samples: 467740
2. Upgrading first Validator to new version: 453ee9f50eea267f5fc7d6814f7f49725c135d2e
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 6755.79 txn/s, latency: 4157.08 ms, (p50: 4500 ms, p70: 4700, p90: 5800 ms, p99: 5900 ms), latency samples: 133220
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6945.99 txn/s, latency: 4625.55 ms, (p50: 4700 ms, p70: 5100, p90: 6900 ms, p99: 7300 ms), latency samples: 233180
3. Upgrading rest of first batch to new version: 453ee9f50eea267f5fc7d6814f7f49725c135d2e
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 6195.70 txn/s, latency: 4665.60 ms, (p50: 5300 ms, p70: 5600, p90: 5800 ms, p99: 5900 ms), latency samples: 112620
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 5828.23 txn/s, latency: 5644.28 ms, (p50: 5700 ms, p70: 5900, p90: 7100 ms, p99: 7800 ms), latency samples: 217060
4. upgrading second batch to new version: 453ee9f50eea267f5fc7d6814f7f49725c135d2e
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 10034.59 txn/s, latency: 2639.94 ms, (p50: 2800 ms, p70: 3000, p90: 3400 ms, p99: 3800 ms), latency samples: 184080
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 9508.69 txn/s, latency: 3260.40 ms, (p50: 2800 ms, p70: 3000, p90: 6800 ms, p99: 8600 ms), latency samples: 337480
5. check swarm health
Compatibility test for 7eeba4cd15892717741a614add1afde004c7855f ==> 453ee9f50eea267f5fc7d6814f7f49725c135d2e passed
Test Ok

@ibalajiarun merged commit 09427b2 into main on Oct 16, 2024
45 of 48 checks passed
@ibalajiarun deleted the balaji/round-timeout-enable branch on October 16, 2024 04:37

3 participants