
[Mempool] optimize fullnode broadcast hops for latency #9309

Merged

bchocho merged 4 commits into main from brian/mempool-submit-latency on Aug 2, 2023

Conversation

Contributor

@bchocho commented Jul 25, 2023

Description

  • Increase single fullnode max throughput from 4K TPS to 6K TPS (max batch size 200 -> 300, scheduled every 50 ms)
  • Increase throughput when the broadcast RTT is large by increasing the number of outstanding requests. For example, an RTT of 500 ms with only 2 outstanding requests previously meant at most 2 batches could be in flight per 500 ms round trip, capping throughput well below the scheduler limit (see the sketch below).
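
A back-of-the-envelope model of the two bounds described above (illustration only, not the actual mempool code; the batch sizes, tick interval, and RTTs come from this description, while the 20-outstanding figure used for the new config is purely illustrative):

```rust
// Rough model of a fullnode's broadcast throughput. Two bounds apply:
//  - the scheduler sends at most one batch per tick, and
//  - at most `outstanding` un-acked broadcasts can be in flight per round trip.
fn max_tps(batch_size: f64, tick_ms: f64, outstanding: f64, rtt_ms: f64) -> f64 {
    let scheduler_tps = batch_size * 1000.0 / tick_ms; // batch-per-tick bound
    let window_tps = outstanding * batch_size * 1000.0 / rtt_ms; // outstanding-request bound
    scheduler_tps.min(window_tps)
}

fn main() {
    // Old config: batch size 200, scheduled every 50 ms, 2 outstanding broadcasts.
    println!("old, 50 ms RTT:  {:.0} TPS", max_tps(200.0, 50.0, 2.0, 50.0)); // 4000 (scheduler-bound)
    println!("old, 500 ms RTT: {:.0} TPS", max_tps(200.0, 50.0, 2.0, 500.0)); // 800 (window-bound)
    // New config: batch size 300, scheduled every 50 ms, more outstanding broadcasts.
    println!("new, 50 ms RTT:  {:.0} TPS", max_tps(300.0, 50.0, 20.0, 50.0)); // 6000
    println!("new, 500 ms RTT: {:.0} TPS", max_tps(300.0, 50.0, 20.0, 500.0)); // 6000
}
```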

Test Plan

Run forge with PFNs and network emulation, and observe that `Avg Insertion-to-Broadcast-Batched` on PFNs drops significantly, from 1-2 s for some PFNs to < 200 ms.

@bchocho added the CICD:run-e2e-tests label (when this label is present, GitHub Actions will run all land-blocking e2e tests from the PR) Jul 25, 2023

@bchocho bchocho force-pushed the brian/mempool-submit-latency branch from e1f66ba to da6a924 Compare August 1, 2023 20:46

@bchocho bchocho marked this pull request as ready for review August 1, 2023 21:30

@bchocho bchocho requested a review from brianolson August 1, 2023 22:04
Contributor

@JoshLind left a comment

Config changes are the best changes 😄

@@ -500,7 +500,7 @@ fn single_test_suite(
     let single_test_suite = match test_name {
         // Land-blocking tests to be run on every PR:
         "land_blocking" => land_blocking_test_suite(duration), // to remove land_blocking, superseeded by the below
-        "realistic_env_max_load" => realistic_env_max_load_test(duration, test_cmd, 7, 5),
+        "realistic_env_max_load" => pfn_performance(duration, true, true),
Contributor

Does this need to be rolled back? 😄

@@ -103,19 +103,25 @@ impl ConfigOptimizer for MempoolConfig {

// Change the default configs for VFNs
let mut modified_config = false;
if node_type.is_validator() {
// Set the max_broadcasts_per_peer to 2 (default is 20)
if local_mempool_config_yaml["max_broadcasts_per_peer"].is_null() {
Contributor

Can you explain why we need to override this for the validator?

Contributor Author

This isn't used anymore because there is no mempool broadcast between validators with Quorum Store. For the time being, I want to preserve behavior as much as possible, just in case we roll back Quorum Store.

Contributor

Can you add a comment to clarify this?
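
For context on the `max_broadcasts_per_peer` hunk above, here is a minimal standalone sketch of the optimizer pattern under discussion: a node-type-specific default is applied only when the operator has not pinned the field in their local YAML override. The struct and function names below are stand-ins rather than the real aptos-core types, and the sketch assumes the `serde_yaml` crate for YAML access.

```rust
use serde_yaml::Value;

/// Stand-in for the mempool section of the node config (not the real aptos-core type).
#[derive(Debug)]
struct MempoolishConfig {
    max_broadcasts_per_peer: usize,
}

/// Apply a validator-specific default only if the operator left the field unset in
/// their local YAML override; returns true if the config was modified.
fn optimize_validator_mempool(config: &mut MempoolishConfig, local_yaml: &Value) -> bool {
    let mut modified_config = false;
    if local_yaml["max_broadcasts_per_peer"].is_null() {
        // Preserve pre-Quorum-Store validator behavior (2 instead of the higher default).
        config.max_broadcasts_per_peer = 2;
        modified_config = true;
    }
    modified_config
}

fn main() {
    let mut config = MempoolishConfig { max_broadcasts_per_peer: 20 };
    // The operator set nothing locally, so the optimizer default applies.
    let local_yaml: Value = serde_yaml::from_str("{}").unwrap();
    let modified = optimize_validator_mempool(&mut config, &local_yaml);
    println!("modified: {}, config: {:?}", modified, config);
}
```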

@@ -500,7 +500,7 @@ fn single_test_suite(
     let single_test_suite = match test_name {
         // Land-blocking tests to be run on every PR:
         "land_blocking" => land_blocking_test_suite(duration), // to remove land_blocking, superseeded by the below
-        "realistic_env_max_load" => realistic_env_max_load_test(duration, test_cmd, 7, 5),
+        "realistic_env_max_load" => pfn_const_tps(duration, true, true),
Contributor

You probably want to revert this change before landing?

Contributor Author

Done!

Contributor Author

Oops, now really reverted.

Contributor

@sitalkedia left a comment

LGTM

@bchocho bchocho enabled auto-merge (squash) August 2, 2023 21:06

github-actions bot commented Aug 2, 2023

✅ Forge suite compat success on aptos-node-v1.5.1 ==> 0de5ec55aa401ab734fdc3235475a80b0877ac7c

Compatibility test results for aptos-node-v1.5.1 ==> 0de5ec55aa401ab734fdc3235475a80b0877ac7c (PR)
1. Check liveness of validators at old version: aptos-node-v1.5.1
compatibility::simple-validator-upgrade::liveness-check : committed: 4485 txn/s, latency: 7266 ms, (p50: 7800 ms, p90: 9900 ms, p99: 11800 ms), latency samples: 165960
2. Upgrading first Validator to new version: 0de5ec55aa401ab734fdc3235475a80b0877ac7c
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 1762 txn/s, latency: 16715 ms, (p50: 19100 ms, p90: 22400 ms, p99: 22600 ms), latency samples: 91660
3. Upgrading rest of first batch to new version: 0de5ec55aa401ab734fdc3235475a80b0877ac7c
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 1887 txn/s, latency: 15574 ms, (p50: 19600 ms, p90: 22000 ms, p99: 22200 ms), latency samples: 92480
4. upgrading second batch to new version: 0de5ec55aa401ab734fdc3235475a80b0877ac7c
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 3077 txn/s, latency: 9858 ms, (p50: 10200 ms, p90: 12600 ms, p99: 13300 ms), latency samples: 132320
5. check swarm health
Compatibility test for aptos-node-v1.5.1 ==> 0de5ec55aa401ab734fdc3235475a80b0877ac7c passed
Test Ok

github-actions bot commented Aug 2, 2023

✅ Forge suite realistic_env_max_load success on 0de5ec55aa401ab734fdc3235475a80b0877ac7c

two traffics test: inner traffic : committed: 6467 txn/s, latency: 6055 ms, (p50: 5700 ms, p90: 7800 ms, p99: 10700 ms), latency samples: 2800580
two traffics test : committed: 100 txn/s, latency: 2953 ms, (p50: 2900 ms, p90: 3400 ms, p99: 4200 ms), latency samples: 1840
Max round gap was 1 [limit 4] at version 1392416. Max no progress secs was 3.76145 [limit 10] at version 1392416.
Test Ok

github-actions bot commented Aug 2, 2023

✅ Forge suite framework_upgrade success on aptos-node-v1.5.1 ==> 0de5ec55aa401ab734fdc3235475a80b0877ac7c

Compatibility test results for aptos-node-v1.5.1 ==> 0de5ec55aa401ab734fdc3235475a80b0877ac7c (PR)
Upgrade the nodes to version: 0de5ec55aa401ab734fdc3235475a80b0877ac7c
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2895 txn/s, latency: 7590 ms, (p50: 7600 ms, p90: 10500 ms, p99: 16300 ms), latency samples: 162160
5. check swarm health
Compatibility test for aptos-node-v1.5.1 ==> 0de5ec55aa401ab734fdc3235475a80b0877ac7c passed
Test Ok

@bchocho bchocho merged commit 429f4dd into main Aug 2, 2023
@bchocho bchocho deleted the brian/mempool-submit-latency branch August 2, 2023 21:51
xbtmatt pushed a commit that referenced this pull request Aug 13, 2023
bchocho added a commit that referenced this pull request Aug 25, 2023
### Description

Reduce the noise added by #9309. Make expected "errors" trace, and sample regardless.