
wait for all futs to clear from ExecutionPipeline before dropping lifetime_guard #14224

Merged

merged 1 commit into main from 0806-alden-wait-for-all on Aug 6, 2024

Conversation

msmouse
Contributor

@msmouse msmouse commented Aug 6, 2024

wait for all futs to clear from ExecutionPipeline before dropping lifetime_guard

It's in theory safer this way and it avoids error logs before

Description

Type of Change

  • Bug fix

Which Components or Systems Does This Change Impact?

  • Validator Node

How Has This Been Tested?

existing coverage, forge

-    for (block, fut) in itertools::zip_eq(ordered_blocks, futs) {
+    // wait for all futs so that lifetime_guard is guaranteed to be dropped only
+    // after all executor calls are over
+    for (block, fut) in itertools::zip_eq(&ordered_blocks, futs) {
Contributor

How about putting these futs in a FuturesOrdered and polling them all together?

Contributor Author

doesn't matter, they are already spawned and running in parallel?

Contributor

They run in parallel, but I notice there is a future here where we do some post-processing with some .await. And it's possible that, down the road, this future could pick up other work.

On the other hand, polling one by one would keep the logs sequential. Otherwise, we risk reading unordered logs when debugging. Let's leave it as-is then.

Contributor Author

Also not sure if things in the post-processing want to be sequential... For example, does the mempool tolerate seeing the notifications out of order?
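The ordering property under discussion can be sketched without the actual async pipeline. Below is a minimal, hypothetical thread-based analogy (none of these names are the real ExecutionPipeline code): the work is all spawned up front and runs in parallel, the handles are joined one by one in block order, and the guard is dropped only after every join completes.

```rust
use std::thread;

// Hypothetical stand-in for the PR's lifetime_guard: while it is alive,
// the executor resource must not be torn down.
struct LifetimeGuard;

fn run() -> Vec<usize> {
    let guard = LifetimeGuard;

    // Spawn everything up front -- analogous to the already-spawned futs;
    // all of it runs concurrently regardless of how we join below.
    let handles: Vec<_> = (0..4)
        .map(|i| thread::spawn(move || i * 10)) // simulated executor call
        .collect();

    // Join in order, like awaiting the futs one by one: results (and any
    // per-block post-processing) come back in block order even though the
    // work itself ran in parallel.
    let results: Vec<usize> = handles.into_iter().map(|h| h.join().unwrap()).collect();

    // Only after every handle has finished is the guard dropped, so the
    // "executor" is guaranteed to outlive all of the calls into it.
    drop(guard);
    results
}

fn main() {
    assert_eq!(run(), vec![0, 10, 20, 30]);
}
```

This is why awaiting the futures sequentially does not serialize the execution itself, while still keeping the per-block post-processing deterministic.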

Contributor

@igor-aptos igor-aptos left a comment

this fixes the lifetime issue, so it is probably necessary. But I am a bit worried: if we have multiple blocks and the first one fails, having to wait on the execution of all the others just to discard them, and then retrying them all together, seems like something that could cause issues down the line.

@msmouse
Contributor Author

msmouse commented Aug 6, 2024

@igor-aptos It seems that if an earlier block fails, the later ones will almost always fail immediately anyway. If the VM or DB returns an error, it's probably not recoverable anyway (even switching to state sync probably won't get around the underlying issue, like a full disk); and if block fetching times out and a later one times out as well, it should fail at about the same time, right?

Anyway, let's see how it works out. And maybe re-evaluate whether we should bundle several block ids together in the first place, as you suggested on Slack.

@msmouse msmouse enabled auto-merge (squash) August 6, 2024 21:08


Contributor

github-actions bot commented Aug 6, 2024

✅ Forge suite realistic_env_max_load success on 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f

two traffics test: inner traffic : committed: 12665.27 txn/s, latency: 3143.43 ms, (p50: 3000 ms, p90: 3600 ms, p99: 4500 ms), latency samples: 4815620
two traffics test : committed: 100.08 txn/s, latency: 2807.05 ms, (p50: 2700 ms, p90: 3200 ms, p99: 10500 ms), latency samples: 1760
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.232, avg: 0.219", "QsPosToProposal: max: 0.344, avg: 0.302", "ConsensusProposalToOrdered: max: 0.318, avg: 0.309", "ConsensusOrderedToCommit: max: 0.616, avg: 0.584", "ConsensusProposalToCommit: max: 0.927, avg: 0.892"]
Max round gap was 1 [limit 4] at version 2643444. Max no progress secs was 7.912502 [limit 15] at version 2643444.
Test Ok

Contributor

github-actions bot commented Aug 6, 2024

✅ Forge suite compat success on 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f

Compatibility test results for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f (PR)
1. Check liveness of validators at old version: 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5
compatibility::simple-validator-upgrade::liveness-check : committed: 7403.45 txn/s, latency: 3951.61 ms, (p50: 3200 ms, p90: 7000 ms, p99: 16700 ms), latency samples: 293140
2. Upgrading first Validator to new version: 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 7054.20 txn/s, latency: 3807.69 ms, (p50: 4200 ms, p90: 4500 ms, p99: 4700 ms), latency samples: 135880
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6365.95 txn/s, latency: 4520.55 ms, (p50: 4500 ms, p90: 4700 ms, p99: 7200 ms), latency samples: 248220
3. Upgrading rest of first batch to new version: 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 6590.54 txn/s, latency: 4136.66 ms, (p50: 4500 ms, p90: 5100 ms, p99: 5300 ms), latency samples: 124420
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6480.10 txn/s, latency: 4758.93 ms, (p50: 4900 ms, p90: 7100 ms, p99: 7400 ms), latency samples: 222360
4. upgrading second batch to new version: 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 11104.58 txn/s, latency: 2531.61 ms, (p50: 2800 ms, p90: 3000 ms, p99: 3200 ms), latency samples: 200780
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 10465.18 txn/s, latency: 3189.03 ms, (p50: 3000 ms, p90: 3800 ms, p99: 6500 ms), latency samples: 341760
5. check swarm health
Compatibility test for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f passed
Test Ok

Contributor

github-actions bot commented Aug 6, 2024

✅ Forge suite framework_upgrade success on 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f

Compatibility test results for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f (PR)
Upgrade the nodes to version: 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1071.96 txn/s, submitted: 1073.57 txn/s, failed submission: 1.61 txn/s, expired: 1.61 txn/s, latency: 3039.94 ms, (p50: 2400 ms, p90: 6000 ms, p99: 9600 ms), latency samples: 93120
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 670.12 txn/s, submitted: 671.89 txn/s, failed submission: 1.77 txn/s, expired: 1.77 txn/s, latency: 4492.03 ms, (p50: 2700 ms, p90: 10900 ms, p99: 15700 ms), latency samples: 60740
5. check swarm health
Compatibility test for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f passed
Upgrade the remaining nodes to version: 40a1a0f8af853c3c6013bbd7f8809a424b29ce3f
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1129.85 txn/s, submitted: 1132.98 txn/s, failed submission: 3.14 txn/s, expired: 3.14 txn/s, latency: 2718.96 ms, (p50: 2400 ms, p90: 4500 ms, p99: 6600 ms), latency samples: 100860
Test Ok

@msmouse msmouse merged commit ef343b7 into main Aug 6, 2024
137 of 138 checks passed
@msmouse msmouse deleted the 0806-alden-wait-for-all branch August 6, 2024 21:42
github-actions bot pushed a commit that referenced this pull request Aug 6, 2024
Contributor

github-actions bot commented Aug 6, 2024

💚 All backports created successfully

Status Branch Result
aptos-release-v1.18

Questions?

Please refer to the Backport tool documentation and see the GitHub Actions logs for details.
