
wait for all futs to clear from ExecutionPipeline before dropping lifetime_guard #14224

Merged

merged 1 commit into main from 0806-alden-wait-for-all on Aug 6, 2024

Conversation

msmouse
Contributor

@msmouse msmouse commented Aug 6, 2024

wait for all futs to clear from ExecutionPipeline before dropping lifetime_guard

It's in theory safer this way and it avoids error logs before

Description

Type of Change

  • Bug fix

Which Components or Systems Does This Change Impact?

  • Validator Node

How Has This Been Tested?

existing coverage, forge

-    for (block, fut) in itertools::zip_eq(ordered_blocks, futs) {
+    // wait for all futs so that lifetime_guard is guaranteed to be dropped only
+    // after all executor calls are over
+    for (block, fut) in itertools::zip_eq(&ordered_blocks, futs) {
Contributor

How about putting these futs in a FuturesOrdered and polling them all together?

Contributor Author

doesn't matter, they are already spawned and running in parallel?

Contributor

They run in parallel, but I notice there is a future here where we do some post-processing with some .await. And it's possible that, down the road, this future could pick up other work.

On the other hand, polling one by one would keep the logs sequential. Otherwise, we risk reading unordered logs when debugging. Let's leave it as-is then.

Contributor Author

Also not sure if things in the post-processing want to be sequential... For example, does the mempool tolerate seeing the notifications out of order?
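The ordering property under discussion can be sketched without the actual async pipeline. Below is a minimal, hypothetical thread-based analogy (none of these names are the real ExecutionPipeline code): the work is all spawned up front and runs in parallel, the handles are joined one by one in block order, and the guard is dropped only after every join completes.

```rust
use std::thread;

// Hypothetical stand-in for the PR's lifetime_guard: while it is alive,
// the executor resource must not be torn down.
struct LifetimeGuard;

fn run() -> Vec<usize> {
    let guard = LifetimeGuard;

    // Spawn everything up front -- analogous to the already-spawned futs;
    // all of it runs concurrently regardless of how we join below.
    let handles: Vec<_> = (0..4)
        .map(|i| thread::spawn(move || i * 10)) // simulated executor call
        .collect();

    // Join in order, like awaiting the futs one by one: results (and any
    // per-block post-processing) come back in block order even though the
    // work itself ran in parallel.
    let results: Vec<usize> = handles.into_iter().map(|h| h.join().unwrap()).collect();

    // Only after every handle has finished is the guard dropped, so the
    // "executor" is guaranteed to outlive all of the calls into it.
    drop(guard);
    results
}

fn main() {
    assert_eq!(run(), vec![0, 10, 20, 30]);
}
```

This is why awaiting the futures sequentially does not serialize the execution itself, while still keeping the per-block post-processing deterministic.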

Contributor

@igor-aptos igor-aptos left a comment

this fixes the lifetime issue, so it is probably necessary. But I am a bit worried: if we have multiple blocks and the first one fails, having to wait on the execution of all the others just to discard them, and then retrying them all together, seems like something that could cause issues down the line.

@msmouse
Contributor Author

msmouse commented Aug 6, 2024

@igor-aptos It seems that if an earlier block fails, the later ones will almost always fail immediately anyway. If the VM or DB returns an error, it's probably not recoverable anyway (even switching to state sync probably won't get around the underlying issue, like a full disk); and if block fetching times out and a later one times out as well, it should fail at about the same time, right?

Anyway, let's see how it works out. And maybe re-evaluate whether we should bundle several block ids together in the first place, as you suggested on Slack.

@msmouse msmouse enabled auto-merge (squash) August 6, 2024 21:08


Contributor

github-actions bot commented Aug 6, 2024

✅ Forge suite realistic_env_max_load success on 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f

two traffics test: inner traffic : committed: 12665.27 txn/s, latency: 3143.43 ms, (p50: 3000 ms, p90: 3600 ms, p99: 4500 ms), latency samples: 4815620
two traffics test : committed: 100.08 txn/s, latency: 2807.05 ms, (p50: 2700 ms, p90: 3200 ms, p99: 10500 ms), latency samples: 1760
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.232, avg: 0.219", "QsPosToProposal: max: 0.344, avg: 0.302", "ConsensusProposalToOrdered: max: 0.318, avg: 0.309", "ConsensusOrderedToCommit: max: 0.616, avg: 0.584", "ConsensusProposalToCommit: max: 0.927, avg: 0.892"]
Max round gap was 1 [limit 4] at version 2643444. Max no progress secs was 7.912502 [limit 15] at version 2643444.
Test Ok

Contributor

github-actions bot commented Aug 6, 2024

✅ Forge suite compat success on 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f

Compatibility test results for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f (PR)
1. Check liveness of validators at old version: 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5
compatibility::simple-validator-upgrade::liveness-check : committed: 7403.45 txn/s, latency: 3951.61 ms, (p50: 3200 ms, p90: 7000 ms, p99: 16700 ms), latency samples: 293140
2. Upgrading first Validator to new version: 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 7054.20 txn/s, latency: 3807.69 ms, (p50: 4200 ms, p90: 4500 ms, p99: 4700 ms), latency samples: 135880
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6365.95 txn/s, latency: 4520.55 ms, (p50: 4500 ms, p90: 4700 ms, p99: 7200 ms), latency samples: 248220
3. Upgrading rest of first batch to new version: 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 6590.54 txn/s, latency: 4136.66 ms, (p50: 4500 ms, p90: 5100 ms, p99: 5300 ms), latency samples: 124420
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6480.10 txn/s, latency: 4758.93 ms, (p50: 4900 ms, p90: 7100 ms, p99: 7400 ms), latency samples: 222360
4. upgrading second batch to new version: 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 11104.58 txn/s, latency: 2531.61 ms, (p50: 2800 ms, p90: 3000 ms, p99: 3200 ms), latency samples: 200780
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 10465.18 txn/s, latency: 3189.03 ms, (p50: 3000 ms, p90: 3800 ms, p99: 6500 ms), latency samples: 341760
5. check swarm health
Compatibility test for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f passed
Test Ok

Contributor

github-actions bot commented Aug 6, 2024

✅ Forge suite framework_upgrade success on 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f

Compatibility test results for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f (PR)
Upgrade the nodes to version: 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1071.96 txn/s, submitted: 1073.57 txn/s, failed submission: 1.61 txn/s, expired: 1.61 txn/s, latency: 3039.94 ms, (p50: 2400 ms, p90: 6000 ms, p99: 9600 ms), latency samples: 93120
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 670.12 txn/s, submitted: 671.89 txn/s, failed submission: 1.77 txn/s, expired: 1.77 txn/s, latency: 4492.03 ms, (p50: 2700 ms, p90: 10900 ms, p99: 15700 ms), latency samples: 60740
5. check swarm health
Compatibility test for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f passed
Upgrade the remaining nodes to version: 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1129.85 txn/s, submitted: 1132.98 txn/s, failed submission: 3.14 txn/s, expired: 3.14 txn/s, latency: 2718.96 ms, (p50: 2400 ms, p90: 4500 ms, p99: 6600 ms), latency samples: 100860
Test Ok

@msmouse msmouse merged commit ef343b7 into main Aug 6, 2024
137 of 138 checks passed
@msmouse msmouse deleted the 0806-alden-wait-for-all branch August 6, 2024 21:42
github-actions bot pushed a commit that referenced this pull request Aug 6, 2024
Contributor

github-actions bot commented Aug 6, 2024

💚 All backports created successfully

Status Branch Result
aptos-release-v1.18

Questions?

Please refer to the Backport tool documentation and see the GitHub Actions logs for details.
