Re-apply qs backpressure increase with buffer latency increase #13961

Merged: igor-aptos merged 3 commits into main from igor/re_revert_qs_backpressure on Jul 10, 2024

Conversation

@igor-aptos (Contributor) commented Jul 9, 2024

Description

As Quorum Store batches are bucketed, and we are looking to increase block limits, now is the time to reduce Quorum Store backpressure.

We now allow 36K transactions outstanding. At 12K TPS, this is approximately 3 seconds' worth of batches.

For forge tests, a lot of the queuing shifts from mempool to POS-to-Proposal, so the limits need to be adjusted accordingly.

Additionally, this increases the time a batch can take to be included in a block, so the eager expiration buffers are increased as well.
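For context, a back-of-the-envelope sketch of the 3-second figure above (`backlog_seconds` is a hypothetical helper, not code from this PR):

```rust
// Sanity check of the numbers in the description: 36K outstanding
// transactions at a sustained 12K TPS is about 3 seconds of batches.
fn backlog_seconds(outstanding_txns: u64, tps: u64) -> f64 {
    outstanding_txns as f64 / tps as f64
}

fn main() {
    // Prints "3 s" for the limits discussed in this PR.
    println!("{} s", backlog_seconds(36_000, 12_000));
}
```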

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Other (specify)

How Has This Been Tested?

Key Areas to Review

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation

bchocho and others added 2 commits July 9, 2024 15:50
(The commit message repeats the PR description above.)

trunk-io bot commented Jul 9, 2024

⏱️ 6h 12m total CI duration on this PR
| Job | Cumulative Duration | Recent Runs |
| --- | --- | --- |
| test-fuzzers | 2h 26m | 🟩🟩🟩🟩 |
| adhoc-forge-test / forge | 58m | 🟥🟩 |
| rust-images / rust-all | 35m | 🟩🟩🟩 |
| execution-performance / single-node-performance | 31m | 🟩 |
| forge-framework-upgrade-test / forge | 17m | 🟩 |
| forge-e2e-test / forge | 14m | 🟩 |
| forge-compat-test / forge | 14m | 🟩 |
| test-target-determinator | 13m | 🟩🟩🟩 |
| general-lints | 7m | 🟩🟩🟩🟩 |
| rust-cargo-deny | 7m | 🟩🟩🟩🟩 |
| rust-move-tests | 6m | 🟩 |
| rust-move-tests | 5m | 🟩 |
| execution-performance / test-target-determinator | 4m | 🟩 |
| check | 4m | 🟩 |
| check-dynamic-deps | 3m | 🟩🟩🟩🟩 |
| rust-move-tests | 3m | 🟩 |
| rust-move-tests | 2m | 🟩 |
| semgrep/ci | 2m | 🟩🟩🟩🟩 |
| file_change_determinator | 46s | 🟩🟩🟩🟩 |
| file_change_determinator | 44s | 🟩🟩🟩🟩 |
| file_change_determinator | 26s | 🟩🟩🟩 |
| permission-check | 13s | 🟩🟩🟩🟩 |
| permission-check | 11s | 🟩🟩🟩🟩 |
| permission-check | 10s | 🟩🟩🟩🟩 |
| permission-check | 9s | 🟩🟩🟩🟩 |
| determine-docker-build-metadata | 8s | 🟩🟩🟩 |
| permission-check | 7s | 🟩🟩🟩 |
| determine-forge-run-metadata | 3s | 🟩🟩 |

🚨 2 jobs on the last run were significantly faster/slower than expected

| Job | Duration | 7d avg | Delta |
| --- | --- | --- | --- |
| execution-performance / single-node-performance | 31m | 21m | +44% |
| execution-performance / test-target-determinator | 4m | 5m | -23% |


@vusirikala added the CICD:build-images label (when this label is present, GitHub Actions will build and push Rust images from the PR) on Jul 9, 2024.
@vusirikala (Contributor) left a comment:


As the graceful overload test failed last time, I ran it here; it seems to be working:
https://github.com/aptos-labs/aptos-core/actions/runs/9865912053

```diff
@@ -85,8 +85,8 @@ impl Default for MempoolConfig {
             system_transaction_timeout_secs: 600,
             system_transaction_gc_interval_ms: 60_000,
             broadcast_buckets: DEFAULT_BUCKETS.to_vec(),
-            eager_expire_threshold_ms: Some(10_000),
-            eager_expire_time_ms: 3_000,
+            eager_expire_threshold_ms: Some(15_000),
+            eager_expire_time_ms: 6_000,
```
A reviewer (Contributor) asked:

Can you explain the downside of increasing this?

@igor-aptos (Author) replied:

The new constants make the logic work like this:

If any transaction in the mempool (that never entered the parking lot) has been there longer than eager_expire_threshold_ms (increased here from 10s to 15s), that means we have a significant backlog.

If that is the case, we pull into batches only transactions that have at least eager_expire_time_ms of lifetime left before expiration (increased from 3s to 6s).

We've increased both because, with reduced QS backpressure, we see it taking more time from batch creation to the batch being included in a block.

So the side effect of this change is that, during a backlog, transactions submitted with an expiration under 6s will be ignored (previously under 3s).
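A minimal sketch of that rule, with illustrative names (this is not the actual aptos-core mempool code, just the logic described above):

```rust
// Values from this PR's diff; names mirror the MempoolConfig fields.
const EAGER_EXPIRE_THRESHOLD_MS: u64 = 15_000; // backlog threshold (was 10_000)
const EAGER_EXPIRE_TIME_MS: u64 = 6_000; // minimum remaining lifetime (was 3_000)

/// Returns true if a transaction may be pulled into a batch.
/// `oldest_nonparked_age_ms` is the age of the oldest mempool transaction
/// that never entered the parking lot; `expiration_ms` is the transaction's
/// expiration timestamp on the same clock as `now_ms`.
fn can_pull_into_batch(now_ms: u64, oldest_nonparked_age_ms: u64, expiration_ms: u64) -> bool {
    if oldest_nonparked_age_ms > EAGER_EXPIRE_THRESHOLD_MS {
        // Significant backlog: only pull transactions with at least
        // EAGER_EXPIRE_TIME_MS of life left, so batches are unlikely to
        // expire before they can be included in a block.
        expiration_ms > now_ms + EAGER_EXPIRE_TIME_MS
    } else {
        expiration_ms > now_ms
    }
}

fn main() {
    // During a backlog (oldest txn waited 16s), a txn expiring in 5s is skipped...
    assert!(!can_pull_into_batch(100_000, 16_000, 105_000));
    // ...but the same txn is pulled when there is no backlog.
    assert!(can_pull_into_batch(100_000, 2_000, 105_000));
}
```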

The reviewer (Contributor) followed up:

Sounds good to me. Slightly orthogonal, but do we really need this logic after @vusirikala's change to exclude expired transactions from the QS backpressure calculation in https://github.com/aptos-labs/aptos-core/pull/13850/files? cc @bchocho

@igor-aptos (Author) replied:

Creating useless batches, and then requiring everyone to fetch them just to throw them out, probably has a negative effect on our overall throughput.

@vusirikala (Contributor) commented Jul 10, 2024:

@sitalkedia This is the run for the graceful overload test with my changes, increasing the QS backpressure limits without changing the eager_expire_threshold_ms config:
#13964 (comment)

@sitalkedia (Contributor) left a comment:

Let's revert the forge changes before landing.

@igor-aptos enabled auto-merge (squash) on July 10, 2024 at 18:54


✅ Forge suite compat success on 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> f32ed12d1c637013562a77c5eb4b3787eb964830

Compatibility test results for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> f32ed12d1c637013562a77c5eb4b3787eb964830 (PR)
1. Check liveness of validators at old version: 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5
compatibility::simple-validator-upgrade::liveness-check : committed: 8581.789631436144 txn/s, latency: 3844.26015848353 ms, (p50: 2700 ms, p90: 6900 ms, p99: 27800 ms), latency samples: 321800
2. Upgrading first Validator to new version: f32ed12d1c637013562a77c5eb4b3787eb964830
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 7244.589460339935 txn/s, latency: 3683.4140929755094 ms, (p50: 4100 ms, p90: 4400 ms, p99: 4600 ms), latency samples: 136380
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 7097.102335386834 txn/s, latency: 4594.588239413412 ms, (p50: 4600 ms, p90: 5500 ms, p99: 6100 ms), latency samples: 242760
3. Upgrading rest of first batch to new version: f32ed12d1c637013562a77c5eb4b3787eb964830
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 7255.482725864893 txn/s, latency: 3629.5158147882266 ms, (p50: 4100 ms, p90: 4300 ms, p99: 4500 ms), latency samples: 139300
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6371.6037286143 txn/s, latency: 4691.899807013222 ms, (p50: 4600 ms, p90: 5400 ms, p99: 6600 ms), latency samples: 243540
4. upgrading second batch to new version: f32ed12d1c637013562a77c5eb4b3787eb964830
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 8347.754639218138 txn/s, latency: 3236.902757690232 ms, (p50: 2900 ms, p90: 4800 ms, p99: 8000 ms), latency samples: 185300
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 8916.222014651585 txn/s, latency: 3707.7197677577074 ms, (p50: 3200 ms, p90: 7400 ms, p99: 10600 ms), latency samples: 294520
5. check swarm health
Compatibility test for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> f32ed12d1c637013562a77c5eb4b3787eb964830 passed
Test Ok

✅ Forge suite realistic_env_max_load success on f32ed12d1c637013562a77c5eb4b3787eb964830

two traffics test: inner traffic : committed: 9325.15520925622 txn/s, latency: 4273.548640865965 ms, (p50: 4200 ms, p90: 4500 ms, p99: 10800 ms), latency samples: 3545640
two traffics test : committed: 99.93183569997724 txn/s, latency: 2070.8837209302324 ms, (p50: 2000 ms, p90: 2200 ms, p99: 2400 ms), latency samples: 1720
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.244, avg: 0.223", "QsPosToProposal: max: 1.937, avg: 1.866", "ConsensusProposalToOrdered: max: 0.310, avg: 0.295", "ConsensusOrderedToCommit: max: 0.414, avg: 0.404", "ConsensusProposalToCommit: max: 0.712, avg: 0.699"]
Max round gap was 1 [limit 4] at version 1865582. Max no progress secs was 5.940661 [limit 15] at version 1865582.
Test Ok

✅ Forge suite framework_upgrade success on 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> f32ed12d1c637013562a77c5eb4b3787eb964830

Compatibility test results for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> f32ed12d1c637013562a77c5eb4b3787eb964830 (PR)
Upgrade the nodes to version: f32ed12d1c637013562a77c5eb4b3787eb964830
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1056.4637829490234 txn/s, submitted: 1058.721184194641 txn/s, failed submission: 2.2574012456175714 txn/s, expired: 2.2574012456175714 txn/s, latency: 2912.2025854700855 ms, (p50: 2100 ms, p90: 5100 ms, p99: 11100 ms), latency samples: 93600
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1114.1735858734867 txn/s, submitted: 1117.1044350471725 txn/s, failed submission: 2.930849173685821 txn/s, expired: 2.930849173685821 txn/s, latency: 2670.952994738972 ms, (p50: 2100 ms, p90: 4800 ms, p99: 9900 ms), latency samples: 98840
5. check swarm health
Compatibility test for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> f32ed12d1c637013562a77c5eb4b3787eb964830 passed
Upgrade the remaining nodes to version: f32ed12d1c637013562a77c5eb4b3787eb964830
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1046.56533881327 txn/s, submitted: 1049.6299518112592 txn/s, failed submission: 3.0646129979890775 txn/s, expired: 3.0646129979890775 txn/s, latency: 2911.131185944363 ms, (p50: 2100 ms, p90: 5400 ms, p99: 11700 ms), latency samples: 95620
Test Ok

@igor-aptos merged commit 8d78743 into main on Jul 10, 2024
92 checks passed
@igor-aptos deleted the igor/re_revert_qs_backpressure branch on July 10, 2024 at 19:29
zi0Black pushed a commit that referenced this pull request Jul 12, 2024
* [quorum store] reduce backpressure significantly for more TPS (#13558)

As Quorum Store batches are bucketed, and we are looking to increase block limits, now is the time to reduce Quorum Store backpressure.

We now allow 36K transactions outstanding. At 12K TPS, this is approximately 3 seconds worth of batches.

For forge tests, a lot of the queuing shifts from mempool to POS-to-Proposal, so the limits need to be adjusted accordingly.

* increase buffer for expiration in batch creation

* adding buffers on inner traffic as well

---------

Co-authored-by: Brian (Sunghoon) Cho <[email protected]>
igor-aptos added a commit that referenced this pull request Jul 12, 2024 (same commit message as above).
igor-aptos added a commit that referenced this pull request Jul 15, 2024: … (#13997) (same commit message as above).
Labels
CICD:build-images (when this label is present, GitHub Actions will build and push Rust images from the PR)
4 participants