
[Mempool] optimize fullnode broadcast hops for latency #9309

Merged

bchocho merged 4 commits into main from brian/mempool-submit-latency on Aug 2, 2023

Conversation

Contributor

@bchocho commented Jul 25, 2023

Description

  • Increase single fullnode max throughput from 4K TPS to 6K TPS (max batch size 200 -> 300, scheduled every 50 ms)
  • Increase throughput when the broadcast RTT is large by increasing the number of outstanding requests. For example, an RTT of 500 ms with only 2 outstanding requests previously meant at most 2 batches could be in flight per 500 ms round trip, capping throughput well below the scheduler limit (see the sketch below).
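
A back-of-the-envelope model of the two bounds described above (illustration only, not the actual mempool code; the batch sizes, tick interval, and RTTs come from this description, while the 20-outstanding figure used for the new config is purely illustrative):

```rust
// Rough model of a fullnode's broadcast throughput. Two bounds apply:
//  - the scheduler sends at most one batch per tick, and
//  - at most `outstanding` un-acked broadcasts can be in flight per round trip.
fn max_tps(batch_size: f64, tick_ms: f64, outstanding: f64, rtt_ms: f64) -> f64 {
    let scheduler_tps = batch_size * 1000.0 / tick_ms; // batch-per-tick bound
    let window_tps = outstanding * batch_size * 1000.0 / rtt_ms; // outstanding-request bound
    scheduler_tps.min(window_tps)
}

fn main() {
    // Old config: batch size 200, scheduled every 50 ms, 2 outstanding broadcasts.
    println!("old, 50 ms RTT:  {:.0} TPS", max_tps(200.0, 50.0, 2.0, 50.0)); // 4000 (scheduler-bound)
    println!("old, 500 ms RTT: {:.0} TPS", max_tps(200.0, 50.0, 2.0, 500.0)); // 800 (window-bound)
    // New config: batch size 300, scheduled every 50 ms, more outstanding broadcasts.
    println!("new, 50 ms RTT:  {:.0} TPS", max_tps(300.0, 50.0, 20.0, 50.0)); // 6000
    println!("new, 500 ms RTT: {:.0} TPS", max_tps(300.0, 50.0, 20.0, 500.0)); // 6000
}
```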

Test Plan

Run forge with PFNs and network emulation, and observe that `Avg Insertion-to-Broadcast-Batched` on PFNs drops significantly, from 1-2 s for some PFNs to < 200 ms.

@bchocho added the CICD:run-e2e-tests label (when this label is present, GitHub Actions will run all land-blocking e2e tests from the PR) Jul 25, 2023

@bchocho bchocho force-pushed the brian/mempool-submit-latency branch from e1f66ba to da6a924 Compare August 1, 2023 20:46

@bchocho bchocho marked this pull request as ready for review August 1, 2023 21:30

@bchocho bchocho requested a review from brianolson August 1, 2023 22:04
Contributor

@JoshLind left a comment

Config changes are the best changes 😄

@@ -500,7 +500,7 @@ fn single_test_suite(
     let single_test_suite = match test_name {
         // Land-blocking tests to be run on every PR:
         "land_blocking" => land_blocking_test_suite(duration), // to remove land_blocking, superseeded by the below
-        "realistic_env_max_load" => realistic_env_max_load_test(duration, test_cmd, 7, 5),
+        "realistic_env_max_load" => pfn_performance(duration, true, true),
Contributor

Does this need to be rolled back? 😄

@@ -103,19 +103,25 @@ impl ConfigOptimizer for MempoolConfig {

// Change the default configs for VFNs
let mut modified_config = false;
if node_type.is_validator() {
// Set the max_broadcasts_per_peer to 2 (default is 20)
if local_mempool_config_yaml["max_broadcasts_per_peer"].is_null() {
Contributor

Can you explain why we need to override this for the validator?

Contributor Author

This isn't used anymore because there is no mempool broadcast between validators with Quorum Store. For the time being, I want to preserve behavior as much as possible, just in case we roll back Quorum Store.

Contributor

Can you add a comment to clarify this?
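
For context on the `max_broadcasts_per_peer` hunk above, here is a minimal standalone sketch of the optimizer pattern under discussion: a node-type-specific default is applied only when the operator has not pinned the field in their local YAML override. The struct and function names below are stand-ins rather than the real aptos-core types, and the sketch assumes the `serde_yaml` crate for YAML access.

```rust
use serde_yaml::Value;

/// Stand-in for the mempool section of the node config (not the real aptos-core type).
#[derive(Debug)]
struct MempoolishConfig {
    max_broadcasts_per_peer: usize,
}

/// Apply a validator-specific default only if the operator left the field unset in
/// their local YAML override; returns true if the config was modified.
fn optimize_validator_mempool(config: &mut MempoolishConfig, local_yaml: &Value) -> bool {
    let mut modified_config = false;
    if local_yaml["max_broadcasts_per_peer"].is_null() {
        // Preserve pre-Quorum-Store validator behavior (2 instead of the higher default).
        config.max_broadcasts_per_peer = 2;
        modified_config = true;
    }
    modified_config
}

fn main() {
    let mut config = MempoolishConfig { max_broadcasts_per_peer: 20 };
    // The operator set nothing locally, so the optimizer default applies.
    let local_yaml: Value = serde_yaml::from_str("{}").unwrap();
    let modified = optimize_validator_mempool(&mut config, &local_yaml);
    println!("modified: {}, config: {:?}", modified, config);
}
```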

@@ -500,7 +500,7 @@ fn single_test_suite(
     let single_test_suite = match test_name {
         // Land-blocking tests to be run on every PR:
         "land_blocking" => land_blocking_test_suite(duration), // to remove land_blocking, superseeded by the below
-        "realistic_env_max_load" => realistic_env_max_load_test(duration, test_cmd, 7, 5),
+        "realistic_env_max_load" => pfn_const_tps(duration, true, true),
Contributor

You probably want to revert this change before landing?

Contributor Author

Done!

Contributor Author

Oops, now really reverted.

Contributor

@sitalkedia left a comment

LGTM

@bchocho bchocho enabled auto-merge (squash) August 2, 2023 21:06

github-actions bot commented Aug 2, 2023

✅ Forge suite compat success on aptos-node-v1.5.1 ==> 0de5ec55aa401ab734fdc3235475a80b0877ac7c

Compatibility test results for aptos-node-v1.5.1 ==> 0de5ec55aa401ab734fdc3235475a80b0877ac7c (PR)
1. Check liveness of validators at old version: aptos-node-v1.5.1
compatibility::simple-validator-upgrade::liveness-check : committed: 4485 txn/s, latency: 7266 ms, (p50: 7800 ms, p90: 9900 ms, p99: 11800 ms), latency samples: 165960
2. Upgrading first Validator to new version: 0de5ec55aa401ab734fdc3235475a80b0877ac7c
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 1762 txn/s, latency: 16715 ms, (p50: 19100 ms, p90: 22400 ms, p99: 22600 ms), latency samples: 91660
3. Upgrading rest of first batch to new version: 0de5ec55aa401ab734fdc3235475a80b0877ac7c
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 1887 txn/s, latency: 15574 ms, (p50: 19600 ms, p90: 22000 ms, p99: 22200 ms), latency samples: 92480
4. upgrading second batch to new version: 0de5ec55aa401ab734fdc3235475a80b0877ac7c
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 3077 txn/s, latency: 9858 ms, (p50: 10200 ms, p90: 12600 ms, p99: 13300 ms), latency samples: 132320
5. check swarm health
Compatibility test for aptos-node-v1.5.1 ==> 0de5ec55aa401ab734fdc3235475a80b0877ac7c passed
Test Ok

github-actions bot commented Aug 2, 2023

✅ Forge suite realistic_env_max_load success on 0de5ec55aa401ab734fdc3235475a80b0877ac7c

two traffics test: inner traffic : committed: 6467 txn/s, latency: 6055 ms, (p50: 5700 ms, p90: 7800 ms, p99: 10700 ms), latency samples: 2800580
two traffics test : committed: 100 txn/s, latency: 2953 ms, (p50: 2900 ms, p90: 3400 ms, p99: 4200 ms), latency samples: 1840
Max round gap was 1 [limit 4] at version 1392416. Max no progress secs was 3.76145 [limit 10] at version 1392416.
Test Ok

github-actions bot commented Aug 2, 2023

✅ Forge suite framework_upgrade success on aptos-node-v1.5.1 ==> 0de5ec55aa401ab734fdc3235475a80b0877ac7c

Compatibility test results for aptos-node-v1.5.1 ==> 0de5ec55aa401ab734fdc3235475a80b0877ac7c (PR)
Upgrade the nodes to version: 0de5ec55aa401ab734fdc3235475a80b0877ac7c
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2895 txn/s, latency: 7590 ms, (p50: 7600 ms, p90: 10500 ms, p99: 16300 ms), latency samples: 162160
5. check swarm health
Compatibility test for aptos-node-v1.5.1 ==> 0de5ec55aa401ab734fdc3235475a80b0877ac7c passed
Test Ok

@bchocho bchocho merged commit 429f4dd into main Aug 2, 2023
@bchocho bchocho deleted the brian/mempool-submit-latency branch August 2, 2023 21:51
xbtmatt pushed a commit that referenced this pull request Aug 13, 2023
bchocho added a commit that referenced this pull request Aug 25, 2023
### Description

Reduce the noise added by #9309. Make expected "errors" trace, and sample regardless.