Squash network inbound queues #13956

Merged 37 commits on Jul 19, 2024

@brianolson brianolson commented Jul 9, 2024

Description

Remove several layers of queue indirection so that received messages move directly from a peer's reader thread to application code.

Add an inbound queue delay metric, aptos_network_inbound_queue_time.
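
As a rough illustration of the idea in the description, the sketch below stamps a message when the reader receives it and measures the wait when the application dequeues it. It is not the PR's code: `Received`, `deliver`, and `dequeue` are hypothetical names, a standard-library channel stands in for the real inbound path, and only the field name `rx_at` and the helper name `unix_micros` come from the review thread.

```rust
use std::sync::mpsc;
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical stand-in for a received network message.
pub struct Received {
    pub payload: Vec<u8>,
    /// Wall-clock receive time in microseconds since the Unix epoch.
    pub rx_at: u64,
}

/// Current wall-clock time in microseconds since the Unix epoch
/// (0 if the system clock reads earlier than the epoch).
pub fn unix_micros() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_micros() as u64)
        .unwrap_or(0)
}

/// Reader side: stamp the message as it comes off the wire and hand it
/// straight to the application's channel, with no intermediate queue.
pub fn deliver(
    tx: &mpsc::Sender<Received>,
    payload: Vec<u8>,
) -> Result<(), mpsc::SendError<Received>> {
    tx.send(Received { payload, rx_at: unix_micros() })
}

/// Application side: dequeue one message and report how long it waited,
/// in seconds. A metrics histogram would observe the second value.
pub fn dequeue(rx: &mpsc::Receiver<Received>) -> Option<(Received, f64)> {
    let msg = rx.recv().ok()?;
    let waited_micros = unix_micros().saturating_sub(msg.rx_at);
    Some((msg, waited_micros as f64 / 1_000_000.0))
}
```

With a single channel between reader and application, the measured delay covers exactly one queue, which is what makes a single queue-time metric meaningful.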

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Other (specify)

How Has This Been Tested?

Local cluster tests, unit tests, and forge cluster tests.

Key Areas to Review

Some unit tests in mempool were lost due to relying on layers of the network that are being phased out.

Some parts of the inbound path may still be present but unused; further code cleanup is possible.

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation

trunk-io bot commented Jul 9, 2024

⏱️ 12h 13m total CI duration on this PR
Job Cumulative Duration Recent Runs
test-fuzzers 9h 43m 🟩🟩🟩🟩🟩 (+11 more)
general-lints 28m 🟩🟩🟩🟩 (+11 more)
rust-cargo-deny 27m 🟩🟩🟩🟩 (+11 more)
check-dynamic-deps 15m 🟩🟩🟩🟩🟩 (+11 more)
rust-move-tests 7m 🟩
rust-move-tests 7m 🟩
rust-move-tests 7m 🟩
rust-move-tests 7m 🟩
semgrep/ci 6m 🟩🟩🟩🟩🟩 (+11 more)
rust-move-tests 6m 🟩
rust-move-tests 6m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
file_change_determinator 3m 🟩🟩🟩🟩🟩 (+11 more)
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
file_change_determinator 3m 🟩🟩🟩🟩🟩 (+11 more)
rust-move-tests 2m 🟩
rust-move-tests 2m
rust-move-tests 2m 🟩
rust-move-tests 2m 🟩
permission-check 52s 🟩🟩🟩🟩🟩 (+10 more)
permission-check 48s 🟩🟩🟩🟩🟩 (+11 more)
permission-check 46s 🟩🟩🟩🟩🟩 (+11 more)
permission-check 41s 🟩🟩🟩🟩🟩 (+11 more)

@brianolson brianolson marked this pull request as ready for review July 10, 2024 14:08

@JoshLind JoshLind left a comment

Quick pass. Will return to look at the internals 😄

consensus/src/network_tests.rs (outdated)
mempool/src/tests/node.rs (outdated)

#[cfg(wrong_network_abstraction)]

@bchocho, any thoughts on updating or removing all these tests? Should that be done before this PR?

(My feeling is that these should be fixed/removed first, e.g., to ensure we don't accidentally miss something. Feels odd just ignoring them all.)

network/framework/src/counters.rs (outdated)
data: data.into(),
res_tx,
let notification = ReceivedMessage {
message: NetworkMessage::RpcRequest(RpcRequest {

nit: it might be worth exposing a test-only method to create these for tests (e.g., to avoid us having to assign 0's to request_id, priority and rx_at everywhere)?
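
One way the suggested test-only constructor could look is sketched below. The type definitions are simplified, hypothetical stand-ins (the real `ReceivedMessage`, `NetworkMessage`, and `RpcRequest` carry more fields and variants), and `new_for_testing` is an assumed name, not a function confirmed to exist in the codebase.

```rust
/// Simplified, hypothetical stand-ins for the types named in the review.
#[derive(Debug, PartialEq)]
pub struct RpcRequest {
    pub request_id: u32,
    pub priority: u8,
    pub raw_request: Vec<u8>,
}

#[derive(Debug, PartialEq)]
pub enum NetworkMessage {
    RpcRequest(RpcRequest),
}

#[derive(Debug, PartialEq)]
pub struct ReceivedMessage {
    pub message: NetworkMessage,
    /// Receive timestamp in microseconds; irrelevant to most tests.
    pub rx_at: u64,
}

impl ReceivedMessage {
    /// Builds an RPC request message with the bookkeeping fields zeroed,
    /// so individual tests only spell out the payload they care about.
    /// In a real crate this would likely be gated to test builds.
    pub fn new_for_testing(raw_request: Vec<u8>) -> Self {
        ReceivedMessage {
            message: NetworkMessage::RpcRequest(RpcRequest {
                request_id: 0,
                priority: 0,
                raw_request,
            }),
            rx_at: 0,
        }
    }
}
```

Centralizing the zeroed fields also means a future field added to the message only has to be defaulted in one place rather than in every test.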

network/framework/src/protocols/network/mod.rs (outdated)
network/framework/src/protocols/health_checker/test.rs (outdated)
@@ -26,6 +26,7 @@ pub enum PeerManagerRequest {
}

/// Notifications sent by PeerManager to upstream actors.
/// TODO: PeerManagerNotification now only exists in test code and should be deleted; probably use `ReceivedMessage`

Nice 😄

After the mempool-side changes, this can now be removed completely?

@JoshLind JoshLind left a comment

Looks reasonable to me. 😄 Will wait for @bchocho to comment.

network/framework/src/protocols/health_checker/test.rs (outdated)
network/framework/src/peer/test.rs (outdated)
let mut connection = MultiplexMessageSink::new(connection, MAX_FRAME_SIZE);
for _ in 0..30 {
// The client should then send the network message.
connection.send(&send_msg).await.unwrap();
}
info!("client sent");

nit: do we want to commit these, or were these for local debugging?

Contributor Author

test code logging is pretty cheap? it gets thrown away if the test passes, right?

network/framework/src/peer/test.rs (outdated)
network/framework/src/protocols/network/mod.rs (outdated)
network/framework/src/peer/test.rs (outdated)
network/framework/src/peer/test.rs (outdated)
);
match self.upstream_handlers.get(&direct.protocol_id) {
None => {
// TODO: better label than "declined"? more like "garbage-in"

Makes sense to prioritize following up on the TODOs in a future PR 😄

not_recognized?
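
To make the labeling question concrete, here is a minimal sketch of dispatching by protocol id and counting messages that have no registered handler under their own label. Everything in it is hypothetical except the `upstream_handlers` lookup shape and the `"received"` label string shown in the diffs: the `Dispatcher` type, the `Vec` inboxes standing in for handler channels, the in-memory counter map standing in for a metrics registry, and the `NOT_RECOGNIZED_LABEL` constant.

```rust
use std::collections::HashMap;

pub const RECEIVED_LABEL: &str = "received";
/// Hypothetical label for messages whose protocol has no handler.
pub const NOT_RECOGNIZED_LABEL: &str = "not_recognized";

pub struct Dispatcher {
    /// Protocol id -> inbox; a Vec stands in for a channel sender.
    upstream_handlers: HashMap<u8, Vec<Vec<u8>>>,
    /// Label -> count; stands in for a labeled metrics counter.
    counts: HashMap<&'static str, u64>,
}

impl Dispatcher {
    pub fn new(protocol_ids: &[u8]) -> Self {
        Dispatcher {
            upstream_handlers: protocol_ids.iter().map(|id| (*id, Vec::new())).collect(),
            counts: HashMap::new(),
        }
    }

    /// Routes a payload to its handler and returns the label it was counted under.
    pub fn dispatch(&mut self, protocol_id: u8, payload: Vec<u8>) -> &'static str {
        let label = match self.upstream_handlers.get_mut(&protocol_id) {
            Some(inbox) => {
                inbox.push(payload);
                RECEIVED_LABEL
            }
            // No handler registered: drop the payload but keep it visible in metrics.
            None => NOT_RECOGNIZED_LABEL,
        };
        *self.counts.entry(label).or_insert(0) += 1;
        label
    }

    pub fn count(&self, label: &str) -> u64 {
        self.counts.get(label).copied().unwrap_or(0)
    }
}
```

A dedicated label keeps unknown-protocol traffic separate from messages a handler actively refused, which is the distinction the TODO is asking about.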

@brianolson

I took a stab at mempool fixing and it wasn't as bad as the full network2 branch with changing both inbound and outbound paths. Merged the fix in here. aptos-mempool tests enabled and passing again.

@bchocho bchocho left a comment

mempool changes look great! I'll continue looking at the rest of the PR

network/framework/src/peer_manager/mod.rs (outdated)
network/framework/src/peer/mod.rs (outdated)
@@ -214,31 +262,52 @@ impl<TMessage> Stream for NetworkEvents<TMessage> {
}
}

fn unix_micros() -> u64 {

Any reason not to use time service?

fn now_unix_time(&self) -> Duration;

Contributor Author

the usage sites are pretty awkward to plumb that through to
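
The trade-off in this exchange can be sketched as follows: a free function that reads the system clock needs no plumbing, while an injected clock has to be passed to every call site but can be replaced with a fixed one in tests. The `Clock` trait, `SystemClock`, and `FixedClock` below are hypothetical; only the method signature `now_unix_time(&self) -> Duration` is taken from the comment above, and this is not the project's time service.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical minimal clock abstraction.
pub trait Clock {
    fn now_unix_time(&self) -> Duration;
}

/// Reads the real system clock.
pub struct SystemClock;

impl Clock for SystemClock {
    fn now_unix_time(&self) -> Duration {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap_or(Duration::ZERO)
    }
}

/// Always reports the same instant; useful for deterministic tests.
pub struct FixedClock(pub Duration);

impl Clock for FixedClock {
    fn now_unix_time(&self) -> Duration {
        self.0
    }
}

/// Microseconds since the Unix epoch according to the supplied clock.
/// The extra parameter is the plumbing cost the reply refers to.
pub fn unix_micros(clock: &dyn Clock) -> u64 {
    clock.now_unix_time().as_micros() as u64
}
```

Every caller of `unix_micros` now needs access to a clock, so whether the testability is worth it depends on how deep in the call stack the timestamps are taken.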

@brianolson brianolson requested a review from gregnazario as a code owner July 19, 2024 00:51

@bchocho bchocho left a comment

LGTM!

@JoshLind JoshLind left a comment

Thanks @brianolson 😄

@@ -26,6 +26,7 @@ pub const RECEIVED_LABEL: &str = "received";
pub const SENT_LABEL: &str = "sent";
pub const SUCCEEDED_LABEL: &str = "succeeded";
pub const FAILED_LABEL: &str = "failed";
pub const UNKNOWN_LABEL: &str = "unk";

nit: can we just make this unknown? 😄

@@ -124,7 +122,7 @@ fn build_test_peer_manager(
peer_manager,
peer_manager_request_tx,
connection_reqs_tx,
hello_rx,
// hello_rx,

nit: remove?

}
},
NetworkMessage::RpcResponse(response) => {
NetworkMessage::RpcResponse(_) => {
// non-reference cast identical to this match case

What's the reason we can't just match this directly?

Contributor Author

the other branches work better with a match over a reference, and *response didn't work.
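
A small sketch of the pattern being described, under the assumption that most arms only need to borrow the message while one arm needs to own its payload. The enum and function here are hypothetical and much smaller than the real `NetworkMessage`. Inside a match over `&message`, a binding such as `response` is a reference, so dereferencing it would be an attempt to move out of a shared borrow; re-matching the owned value in that arm avoids the problem.

```rust
/// Hypothetical two-variant message type.
#[derive(Debug, PartialEq)]
pub enum NetworkMessage {
    DirectSend(Vec<u8>),
    RpcResponse(Vec<u8>),
}

#[derive(Debug, PartialEq)]
pub enum Handled {
    /// A borrowing arm only looked at the message.
    Inspected(usize),
    /// The owning arm took the payload by value.
    Response(Vec<u8>),
}

pub fn handle(message: NetworkMessage) -> Handled {
    match &message {
        // Borrowing arm: `data` is a `&Vec<u8>`, which is all it needs.
        NetworkMessage::DirectSend(data) => Handled::Inspected(data.len()),
        // No binding here, so nothing keeps `message` borrowed and the
        // inner match is free to consume it.
        NetworkMessage::RpcResponse(_) => match message {
            NetworkMessage::RpcResponse(response) => Handled::Response(response),
            _ => unreachable!("outer arm already matched RpcResponse"),
        },
    }
}
```

The inner match repeats the outer pattern, which is what the "non-reference cast identical to this match case" comment in the diff appears to describe.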

} = message;
let dequeue_at = unix_micros();
let dt_micros = dequeue_at - rx_at;
let dt_seconds = (dt_micros as f64) / 1000000.0;

nit: I'd move this to a helper:

let dequeue_delta_secs = calculate_dequeue_delta(rx_at);
...
fn calculate_dequeue_delta(rx_at: u64) -> f64 {
    let dequeue_at = unix_micros();
    let dt_micros = dequeue_at - rx_at;
    (dt_micros as f64) / 1000000.0
}
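
One detail the helper may want to handle: both timestamps are wall-clock readings, so the dequeue time can come out smaller than the receive time if the clock is adjusted, and a plain `u64` subtraction would then panic in debug builds or wrap in release builds. The variant below is a sketch under that assumption; `dequeue_delta_secs` is a hypothetical name, and it takes both timestamps as arguments so it can be tested without touching the clock.

```rust
use std::time::Duration;

/// Seconds between receive and dequeue, given both timestamps in
/// microseconds since the Unix epoch. Clamps to zero if the clock
/// appears to have gone backwards.
pub fn dequeue_delta_secs(rx_at_micros: u64, dequeue_at_micros: u64) -> f64 {
    let dt_micros = dequeue_at_micros.saturating_sub(rx_at_micros);
    Duration::from_micros(dt_micros).as_secs_f64()
}
```

A caller would pass the stored receive timestamp and a fresh clock reading, then feed the result to the queue-time metric.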

@brianolson brianolson enabled auto-merge (squash) July 19, 2024 20:03

✅ Forge suite realistic_env_max_load success on 5634764423ef39264cf65a5c12ddc4edf8f0c9c9

two traffics test: inner traffic : committed: 9280.194716204971 txn/s, submitted: 9491.019767062267 txn/s, failed submission: 0.2971959923512299 txn/s, expired: 210.8250508572972 txn/s, latency: 2993.366245621394 ms, (p50: 3000 ms, p90: 3600 ms, p99: 4200 ms), latency samples: 3528520
two traffics test : committed: 100.04058287308666 txn/s, latency: 1976.982 ms, (p50: 2000 ms, p90: 2200 ms, p99: 2600 ms), latency samples: 2000
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.240, avg: 0.216", "QsPosToProposal: max: 1.313, avg: 0.795", "ConsensusProposalToOrdered: max: 0.329, avg: 0.295", "ConsensusOrderedToCommit: max: 0.411, avg: 0.394", "ConsensusProposalToCommit: max: 0.703, avg: 0.690"]
Max round gap was 1 [limit 4] at version 1967174. Max no progress secs was 5.752185 [limit 15] at version 1967174.
Test Ok

✅ Forge suite compat success on 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 5634764423ef39264cf65a5c12ddc4edf8f0c9c9

Compatibility test results for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 5634764423ef39264cf65a5c12ddc4edf8f0c9c9 (PR)
1. Check liveness of validators at old version: 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5
compatibility::simple-validator-upgrade::liveness-check : committed: 9496.047319405721 txn/s, latency: 3757.8464577071704 ms, (p50: 2700 ms, p90: 6300 ms, p99: 23800 ms), latency samples: 326060
2. Upgrading first Validator to new version: 5634764423ef39264cf65a5c12ddc4edf8f0c9c9
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 7222.512499504325 txn/s, latency: 3714.4086574206463 ms, (p50: 4200 ms, p90: 4500 ms, p99: 4500 ms), latency samples: 139880
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 7138.663241654682 txn/s, latency: 4588.8342717258265 ms, (p50: 4600 ms, p90: 5400 ms, p99: 6100 ms), latency samples: 245100
3. Upgrading rest of first batch to new version: 5634764423ef39264cf65a5c12ddc4edf8f0c9c9
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 7433.628176226802 txn/s, latency: 3599.1301285714285 ms, (p50: 4000 ms, p90: 4300 ms, p99: 4500 ms), latency samples: 140000
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6364.982880699781 txn/s, latency: 4707.852137353989 ms, (p50: 4600 ms, p90: 5500 ms, p99: 6600 ms), latency samples: 241420
4. upgrading second batch to new version: 5634764423ef39264cf65a5c12ddc4edf8f0c9c9
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 10890.826786657743 txn/s, latency: 2677.19926221336 ms, (p50: 2800 ms, p90: 3500 ms, p99: 4500 ms), latency samples: 200600
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 9728.574388508257 txn/s, latency: 3698.33191514861 ms, (p50: 3300 ms, p90: 5500 ms, p99: 9500 ms), latency samples: 333760
5. check swarm health
Compatibility test for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 5634764423ef39264cf65a5c12ddc4edf8f0c9c9 passed
Test Ok

✅ Forge suite framework_upgrade success on 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 5634764423ef39264cf65a5c12ddc4edf8f0c9c9

Compatibility test results for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 5634764423ef39264cf65a5c12ddc4edf8f0c9c9 (PR)
Upgrade the nodes to version: 5634764423ef39264cf65a5c12ddc4edf8f0c9c9
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1129.7474142338024 txn/s, submitted: 1130.8558791221794 txn/s, failed submission: 1.1084648883769648 txn/s, expired: 1.1084648883769648 txn/s, latency: 2875.405749607535 ms, (p50: 2100 ms, p90: 4800 ms, p99: 12600 ms), latency samples: 101920
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1109.6204707045051 txn/s, submitted: 1112.1211519537878 txn/s, failed submission: 2.500681249282843 txn/s, expired: 2.500681249282843 txn/s, latency: 2855.685218192993 ms, (p50: 2100 ms, p90: 4800 ms, p99: 12000 ms), latency samples: 97620
5. check swarm health
Compatibility test for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 5634764423ef39264cf65a5c12ddc4edf8f0c9c9 passed
Upgrade the remaining nodes to version: 5634764423ef39264cf65a5c12ddc4edf8f0c9c9
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1074.2108306818761 txn/s, submitted: 1076.6287392428544 txn/s, failed submission: 2.4179085609782356 txn/s, expired: 2.4179085609782356 txn/s, latency: 2857.8949866994067 ms, (p50: 2100 ms, p90: 5300 ms, p99: 12100 ms), latency samples: 97740
Test Ok

@brianolson brianolson merged commit 4d7d8d3 into aptos-labs:main Jul 19, 2024
87 of 90 checks passed
@brianolson brianolson deleted the squash-inbound branch July 22, 2024 16:55