test: improve comms limits to combat network timeouts #3224

Merged
8 changes: 7 additions & 1 deletion base_layer/wallet/tests/output_manager_service/service.rs
@@ -84,6 +84,7 @@ use tari_wallet::{
types::ValidationRetryStrategy,
};

use tari_comms::protocol::rpc::RpcClientConfig;
use tari_wallet::output_manager_service::storage::models::OutputStatus;
use tokio::{
runtime::Runtime,
@@ -1816,7 +1817,12 @@ fn test_txo_validation_rpc_timeout() {
.unwrap();

runtime.block_on(async {
- let mut delay = delay_for(Duration::from_secs(100)).fuse();
+ let mut delay = delay_for(
+     RpcClientConfig::default().deadline.unwrap() +
+         RpcClientConfig::default().deadline_grace_period +
+         Duration::from_secs(30),
+ )
+ .fuse();
let mut failed = 0;
loop {
futures::select! {
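As an aside on the updated timeout: under the new RpcClientConfig defaults introduced later in this PR (deadline 120 s, grace period 60 s), the expression above works out to 210 s, up from the previous fixed 100 s. A small self-contained check of that arithmetic, with the values copied from the diff rather than read from the crate:

```rust
use std::time::Duration;

fn main() {
    // Values from the new RpcClientConfig::default() in comms/src/protocol/rpc/client.rs.
    let deadline = Duration::from_secs(120);
    let grace_period = Duration::from_secs(60);
    let margin = Duration::from_secs(30);

    // deadline + grace period + 30 s margin, as used in test_txo_validation_rpc_timeout.
    assert_eq!(deadline + grace_period + margin, Duration::from_secs(210));
}
```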
2 changes: 1 addition & 1 deletion comms/src/connection_manager/dialer.rs
@@ -514,7 +514,7 @@ where
.map_err(|_| ConnectionManagerError::WireFormatSendFailed)?;

let noise_socket = time::timeout(
- Duration::from_secs(30),
+ Duration::from_secs(40),
sdbondi marked this conversation as resolved.
noise_config.upgrade_socket(socket, ConnectionDirection::Outbound),
)
.await
39 changes: 36 additions & 3 deletions comms/src/connection_manager/listener.rs
@@ -42,6 +42,7 @@ use crate::{
protocol::ProtocolId,
runtime,
transports::Transport,
types::CommsPublicKey,
utils::multiaddr::multiaddr_to_socketaddr,
PeerManager,
};
@@ -67,6 +68,7 @@ use std::{
},
time::Duration,
};
use tari_crypto::tari_utilities::hex::Hex;
use tari_shutdown::ShutdownSignal;
use tokio::time;

@@ -249,7 +251,7 @@ where
let result = Self::perform_socket_upgrade_procedure(
node_identity,
peer_manager,
- noise_config,
+ noise_config.clone(),
conn_man_notifier.clone(),
socket,
peer_addr,
@@ -286,10 +288,15 @@
}
},
Ok(WireMode::Comms(byte)) => {
// TODO: This call is expensive and only added for the benefit of improved logging and may lead to
// TODO: DoS attacks. Remove later when not needed anymore or make it optional with a config file
// TODO: setting.
let public_key = Self::remote_public_key_from_socket(socket, noise_config).await;
@sdbondi (Member) commented on Aug 21, 2021:

I don't understand the rationale insofar as it relates to this PR "improving comms limits".

  1. All we know at this point is that some connection from somewhere (a browser, telnet, literally any TCP client) sent a byte other than 0x46.
  2. In the happy/non-shenanigan case, you are performing the noise socket protocol with peers that are configured for different networks, just so that your logs can include that node's public key.
  3. The "connection is closed" logs I've seen on seed nodes are all related to RPC substreams (2+ days ago; none in the past day, though I have only checked one node). This part of the code executes at the inception of a PeerConnection, before any RPC. Are we actually seeing this branch hit (unless a peer is not configured for weatherwax, which is the case it handles by denying the connection, i.e. at the peer connection level, connection closed == connection denied)?

@hansieodendaal (Contributor, Author) commented on Aug 23, 2021:

Yes, I have seen plenty of these messages coming through; see some sampled extracts below:

2021-08-20 19:58:07.777564100 [comms::connection_manager::listener] WARN  Peer at address '/ip4/127.0.0.1/tcp/1170' (be4d48ce68e4c7a05593159e9a11e75c0d646199b44115909e7bcf2faf7ef265) sent invalid wire format byte. Expected 23 got: 6 
2021-08-20 19:58:35.094426800 [comms::connection_manager::listener] WARN  Peer at address '/ip4/127.0.0.1/tcp/1845' (be4d48ce68e4c7a05593159e9a11e75c0d646199b44115909e7bcf2faf7ef265) sent invalid wire format byte. Expected 23 got: 6 
2021-08-20 19:59:07.773171900 [comms::connection_manager::listener] WARN  Peer at address '/ip4/127.0.0.1/tcp/23093' (be4d48ce68e4c7a05593159e9a11e75c0d646199b44115909e7bcf2faf7ef265) sent invalid wire format byte. Expected 23 got: 6 
2021-08-20 19:59:34.088223100 [comms::connection_manager::listener] WARN  Peer at address '/ip4/127.0.0.1/tcp/1193' (be4d48ce68e4c7a05593159e9a11e75c0d646199b44115909e7bcf2faf7ef265) sent invalid wire format byte. Expected 23 got: 6 
2021-08-20 20:08:22.288730400 [comms::connection_manager::listener] WARN  Peer at address '/ip4/127.0.0.1/tcp/1243' (421557737326d3b2d355383094b5b9916dc62513bebacaddeb8afdb9877ae541) sent invalid wire format byte. Expected 23 got: 6 
2021-08-20 20:10:47.750326800 [comms::connection_manager::listener] WARN  Peer at address '/ip4/127.0.0.1/tcp/22301' (421557737326d3b2d355383094b5b9916dc62513bebacaddeb8afdb9877ae541) sent invalid wire format byte. Expected 23 got: 6 
2021-08-20 20:13:17.745209600 [comms::connection_manager::listener] WARN  Peer at address '/ip4/127.0.0.1/tcp/1241' (421557737326d3b2d355383094b5b9916dc62513bebacaddeb8afdb9877ae541) sent invalid wire format byte. Expected 23 got: 6 
2021-08-20 20:17:47.710094300 [comms::connection_manager::listener] WARN  Peer at address '/ip4/127.0.0.1/tcp/1944' (421557737326d3b2d355383094b5b9916dc62513bebacaddeb8afdb9877ae541) sent invalid wire format byte. Expected 23 got: 6 
2021-08-20 20:21:17.915394000 [comms::connection_manager::listener] WARN  Peer at address '/ip4/127.0.0.1/tcp/1218' (f860e6654457bfcfd73f1a70c233b94dc1d139a0d1b2812c9d7e833ec409a568) sent invalid wire format byte. Expected 23 got: 6 
2021-08-20 21:11:52.915466300 [comms::connection_manager::listener] WARN  Peer at address '/ip4/127.0.0.1/tcp/1033' (4e148b8edf987a108d2dc67ff379e5587a73bf09dc780ba4d21e7a8802d9d867) sent invalid wire format byte. Expected 23 got: 6 
2021-08-20 21:20:27.314824700 [comms::connection_manager::listener] WARN  Peer at address '/ip4/127.0.0.1/tcp/1803' (a8f95d9997d962fda575e2a93713a887fe2d44af2b2756d87ffaf696eeeb374e) sent invalid wire format byte. Expected 23 got: 6 
2021-08-20 21:22:58.413287200 [comms::connection_manager::listener] WARN  Peer at address '/ip4/127.0.0.1/tcp/23306' (a0b24a255230cb9307e3fc21510e278ef03f99339bacc632770be1590d909f0a) sent invalid wire format byte. Expected 23 got: 6 

@hansieodendaal (Contributor, Author) commented:

> I don't understand the rationale insofar as it relates to this PR "improving comms limits".

Adding the public key to the wire mode logging was not the major focus of the PR, just a side issue. The important change was that the wait for the first byte to arrive was increased from 7 s to 45 s, as the former proved to be inadequate under load conditions.

@sdbondi (Member) commented on Aug 23, 2021:

OK, so why are these messages coming through? Are these nodes configured for the correct network? Briefly looking at the git history, I think byte 6 is stibbons.

warn!(
target: LOG_TARGET,
"Peer at address '{}' sent invalid wire format byte. Expected {:x?} got: {:x?} ",
"Peer at address '{}' ({}) sent invalid wire format byte. Expected {:x?} got: {:x?} ",
peer_addr,
public_key,
config.network_info.network_byte,
byte,
);
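For context on the warning above: the listener waits up to `time_to_first_byte` (raised from 7 s to 45 s in this PR) for a single wire-mode byte and compares it against the configured network byte. A minimal, illustrative sketch of that pattern, not the crate's actual code, assuming tokio's `AsyncReadExt` and `time::timeout`:

```rust
use std::time::Duration;
use tokio::{io::AsyncReadExt, time};

// Illustrative only: read one wire-mode byte within the configured window.
async fn read_wire_mode_byte<S>(
    socket: &mut S,
    time_to_first_byte: Duration, // ConnectionManagerConfig default is now 45 s
) -> Result<u8, &'static str>
where
    S: tokio::io::AsyncRead + Unpin,
{
    let mut buf = [0u8; 1];
    time::timeout(time_to_first_byte, socket.read_exact(&mut buf))
        .await
        .map_err(|_| "timed out waiting for the first byte")?
        .map_err(|_| "socket closed before the first byte arrived")?;
    Ok(buf[0])
}
```

A received byte that does not match the configured network byte ends up in the warning logged above.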
@@ -315,7 +322,7 @@ where
Err(err) => {
warn!(
target: LOG_TARGET,
"Peer at address '{}' failed to send wire format. Expected network byte {:x?} or liveness \
"Peer at address '{}' failed to send its wire format. Expected network byte {:x?} or liveness \
byte {:x?} not received. Error: {}",
peer_addr,
config.network_info.network_byte,
@@ -331,6 +338,32 @@
self.bounded_executor.spawn(inbound_fut).await;
}

async fn remote_public_key_from_socket(socket: TTransport::Output, noise_config: NoiseConfig) -> String {
let public_key: Option<CommsPublicKey> = match time::timeout(
Duration::from_secs(30),
noise_config.upgrade_socket(socket, ConnectionDirection::Inbound),
)
.await
.map_err(|_| ConnectionManagerError::NoiseProtocolTimeout)
{
Ok(Ok(noise_socket)) => {
match noise_socket
.get_remote_public_key()
.ok_or(ConnectionManagerError::InvalidStaticPublicKey)
{
Ok(pk) => Some(pk),
_ => None,
}
},
_ => None,
};

match public_key {
None => "public key not known".to_string(),
Some(pk) => pk.to_hex(),
}
}
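On the TODO above about making this extra noise handshake optional: a purely hypothetical sketch of what a config switch could look like. The struct and field name are invented for illustration and are not part of this PR or of tari_comms:

```rust
// Hypothetical sketch only; no such type exists in tari_comms.
#[derive(Clone, Debug)]
pub struct WireModeLoggingConfig {
    /// When false, skip the noise upgrade that is performed purely to log the remote
    /// public key on an invalid wire-format byte (avoiding the DoS surface noted in the TODO).
    pub log_remote_identity_on_invalid_byte: bool,
}

impl Default for WireModeLoggingConfig {
    fn default() -> Self {
        Self {
            // Off by default, so the expensive lookup is opt-in.
            log_remote_identity_on_invalid_byte: false,
        }
    }
}
```

The call site would then wrap `remote_public_key_from_socket` in a check of that flag and fall back to the "public key not known" string when it is disabled.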

#[allow(clippy::too_many_arguments)]
async fn perform_socket_upgrade_procedure(
node_identity: Arc<NodeIdentity>,
8 changes: 4 additions & 4 deletions comms/src/connection_manager/manager.rs
@@ -94,14 +94,14 @@ pub struct ConnectionManagerConfig {
/// The number of dial attempts to make before giving up. Default: 3
pub max_dial_attempts: usize,
/// The maximum number of connection tasks that will be spawned at the same time. Once this limit is reached, peers
- /// attempting to connect will have to wait for another connection attempt to complete. Default: 20
+ /// attempting to connect will have to wait for another connection attempt to complete. Default: 100
pub max_simultaneous_inbound_connects: usize,
/// Set to true to allow peers to send loopback, local-link and other addresses normally not considered valid for
/// peer-to-peer comms. Default: false
pub allow_test_addresses: bool,
/// Version information for this node
pub network_info: NodeNetworkInfo,
- /// The maximum time to wait for the first byte before closing the connection. Default: 7s
+ /// The maximum time to wait for the first byte before closing the connection. Default: 45s
pub time_to_first_byte: Duration,
/// The number of liveness check sessions to allow. Default: 0
pub liveness_max_sessions: usize,
@@ -122,15 +122,15 @@ impl Default for ConnectionManagerConfig {
#[cfg(test)]
listener_address: "/memory/0".parse().unwrap(),
max_dial_attempts: 3,
- max_simultaneous_inbound_connects: 20,
+ max_simultaneous_inbound_connects: 100,
network_info: Default::default(),
#[cfg(not(test))]
allow_test_addresses: false,
// This must always be true for internal crate tests
#[cfg(test)]
allow_test_addresses: true,
liveness_max_sessions: 0,
- time_to_first_byte: Duration::from_secs(7),
+ time_to_first_byte: Duration::from_secs(45),
liveness_cidr_allowlist: vec![cidr::AnyIpCidr::V4("127.0.0.1/32".parse().unwrap())],
auxilary_tcp_listener_address: None,
}
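Since these are only defaults, deployments that want tighter limits can still override individual fields with struct-update syntax. A minimal sketch, assuming only the fields shown in this diff (the import path is an assumption, not taken from the crate docs):

```rust
use std::time::Duration;

// Path assumed for illustration; adjust to wherever ConnectionManagerConfig is exported.
use tari_comms::connection_manager::ConnectionManagerConfig;

fn constrained_config() -> ConnectionManagerConfig {
    ConnectionManagerConfig {
        // Tighter first-byte window and fewer simultaneous inbound upgrades than the
        // new defaults (45 s and 100 respectively), keeping everything else from Default.
        time_to_first_byte: Duration::from_secs(10),
        max_simultaneous_inbound_connects: 20,
        ..Default::default()
    }
}
```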
6 changes: 3 additions & 3 deletions comms/src/protocol/rpc/client.rs
@@ -250,9 +250,9 @@ impl RpcClientConfig {
impl Default for RpcClientConfig {
fn default() -> Self {
Self {
- deadline: Some(Duration::from_secs(30)),
- deadline_grace_period: Duration::from_secs(30),
- handshake_timeout: Duration::from_secs(30),
+ deadline: Some(Duration::from_secs(120)),
+ deadline_grace_period: Duration::from_secs(60),
+ handshake_timeout: Duration::from_secs(90),
}
}
}
7 changes: 5 additions & 2 deletions integration_tests/features/Sync.feature
@@ -116,8 +116,11 @@ Feature: Block Sync
Then node PNODE2 is at height 40
When I start base node NODE1
# We need for node to boot up and supply node 2 with blocks
- And I connect node NODE2 to node NODE1 and wait 60 seconds
- Then all nodes are at height 40
+ And I connect node NODE2 to node NODE1 and wait 5 seconds
+ # NODE2 may initially try to sync from PNODE1 and PNODE2, then eventually try to sync from NODE1; mining blocks
+ # on NODE1 will make this test less flaky and force NODE2 to sync from NODE1 much quicker
+ When I mine 10 blocks on NODE1
+ Then all nodes are at height 50

Scenario Outline: Syncing node while also mining before tip sync
Given I have a seed node SEED