Skip to content

Commit

Permalink
Security: Spawn a separate task for each initial handshake
Browse files Browse the repository at this point in the history
This fix prevents hangs and deadlocks during initialization, particularly
when there are a small number of valid peers in the initial peer config
(or from the DNS seeders).

Security: Correctly handle the minimum peer connection interval

Previously, if we hadn't had a connection for a while, we'd allow a lot
of connections all at once, until we'd caught up.

Security: sleep MIN_PEER_CONNECTION_INTERVAL between initial handshakes

This prevents denial of service if the local network is constrained, and
the seeders return a large number of peers.

Only wait for ready handshakes

Drain all waiting handshakes when enough have succeeded

Refactor MetaAddr to enable security fixes

Track multiple last used times for each peer:
- Add separate untrusted_last_seen, attempt, success, and failed time fields (#1868, #1876, #1848)
- Add the new fields to the peer states, so they only appear in states where they are valid
- Insert initial seed peers in the AddressBook in the correct states

Create a new MetaAddrChange type for AddressBook changes:
- Ignore invalid state changes
    - Ignore updates to the untrusted last seen time (but update the services field)
    - If we get a gossiped or alternate change for a seed peer, use the last seen and services info
    - Once a peer has responded, don't go back to the NeverResponded... states
- Update the address book metrics

- Optimise getting the next connection address from the address book

Do an extra crawl for each handshake on startup

And whenever there aren't many recently live peers.

Remove duplicate initial crawl code

This change uses the candidate set for initial seed peers,
gossiped peers, and alternate peers.

It significantly reduces the complexity of the initialization code.
(By about 200 lines.)

Apply readiness timeout to each fanout

Also get the fanout limit from the number of recently live peers.

Launch each CandidateSet fanout in its own task

Spawn each `CandidateSet::update` in its own task

Move `CandidateSet::next` into the handshake task

Move all crawler awaits and threaded locks into spawned tasks

In this commit:
- Move sending PeerSet changes into a spawned task
- Move the locking in `CandidateSet::report_failed` into a spawned task

Increase the peer set buffer size for concurrent fanouts

Launch sync fanouts concurrently, with peer set readiness timeouts

Wait for seed peers before the first crawl

WIP: Add a timeout to crawl addr requests

This is a workaround for a zcashd response rate-limit.

Move AddressBook::lock() onto a blocking thread

Process all ready timestamp changes each time the task runs

Wait for the initial crawl before launching the syncer

Security: Limit unverified blocks to avoid memory DoS

Also document the security implications of changing these limits.

Drop early inbound requests to avoid load shedding during network setup

Stop closing connections when the inbound service is overloaded

SECURITY: Make buffer sizes dynamically depend on the config

This change significantly increases the inbound buffer size, increasing memory
denial of service risks. However, users can reduce the buffer size using
existing related config options.

These risks are documented under the relevant configs.

Treat `TryRecvError::Closed` in `Inbound::poll_ready` as a fatal error

Also:
- handle errors in service readiness the same as errors in requests

Closes #1655
  • Loading branch information
teor2345 committed May 25, 2021
1 parent 559271a commit 7511e2b
Show file tree
Hide file tree
Showing 25 changed files with 2,345 additions and 875 deletions.
390 changes: 255 additions & 135 deletions zebra-network/src/address_book.rs

Large diffs are not rendered by default.

13 changes: 13 additions & 0 deletions zebra-network/src/config.rs
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,19 @@ pub struct Config {
/// If you have a slow network connection, and Zebra is having trouble
/// syncing, try reducing the peer set size. You can also reduce the peer
/// set size to reduce Zebra's bandwidth usage.
///
/// Note: changing this config can make Zebra slower or less reliabile.
/// (See ticket #2193.)
///
/// # SECURITY
///
/// This config controls the inbound download buffer size.
/// Large values are a memory denial of service risk, because the
/// inbound buffer is before semantic verification, which checks
/// proof of work.
///
/// This buffer is high risk, because it accepts partially-validated blocks
/// from any peer.
pub peerset_initial_target_size: usize,

/// How frequently we attempt to crawl the network to discover new peer
Expand Down
154 changes: 123 additions & 31 deletions zebra-network/src/constants.rs
Original file line number Diff line number Diff line change
Expand Up @@ -10,28 +10,113 @@ use crate::protocol::external::types::*;

use zebra_chain::parameters::NetworkUpgrade;

/// The maximum number of [`GetAddr]` requests sent when crawling for new peers.
/// This is also the default fanout limit.
///
/// ## SECURITY
///
/// The fanout should be greater than 2, so that Zebra avoids getting a majority
/// of its initial address book entries from a single peer.
///
/// Zebra regularly crawls for new peers, initiating a new crawl every
/// [`crawl_new_peer_interval`].
///
/// TODO: limit the number of addresses that Zebra uses from a single peer
/// response (#1869)
///
/// [`GetAddr`]: [`crate::protocol::external::message::Message::GetAddr`]
/// [`crawl_new_peer_interval`](crate::config::Config.crawl_new_peer_interval).
///
/// Note: Zebra is currently very sensitive to fanout changes (#2193)
pub const MAX_GET_ADDR_FANOUT: usize = 3;

/// The fraction of live peers used for [`GetAddr`] fanouts.
///
/// `zcashd` rate-limits [`Addr`] responses. To avoid choking the peer set, we
/// only want send requests to a small fraction of peers.
///
/// [`GetAddr`]: [`crate::protocol::external::message::Message::GetAddr`]
/// [`Addr`]: [`crate::protocol::external::message::Message::Addr`]
///
/// Note: Zebra is currently very sensitive to fanout changes (#2193)
pub const GET_ADDR_FANOUT_LIVE_PEERS_DIVISOR: usize = 10;

/// Controls the number of peers used for each ObtainTips and ExtendTips request.
///
/// Note: Zebra is currently very sensitive to fanout changes (#2193)
pub const SYNC_FANOUT: usize = 4;

/// The buffer size for the peer set.
///
/// This should be greater than 1 to avoid sender contention, but also reasonably
/// small, to avoid queueing too many in-flight block downloads. (A large queue
/// of in-flight block downloads can choke a constrained local network
/// connection, or a small peer set on testnet.)
///
/// This should also be greater than [`MAX_GET_ADDR_FANOUT`], to avoid contention
/// during [`CandidateSet::update`].
///
/// We assume that Zebra nodes have at least 10 Mbps bandwidth. Therefore, a
/// maximum-sized block can take up to 2 seconds to download. So the peer set
/// buffer adds up to 6 seconds worth of blocks to the queue.
pub const PEERSET_BUFFER_SIZE: usize = 3;
/// buffer adds up to `PEERSET_BUFFER_SIZE*2` seconds worth of blocks to the
/// queue.
///
/// We want enough buffer to support concurrent fanouts, a sync download,
/// an inbound download, and some extra slots to avoid contention.
///
/// Note: Zebra is currently very sensitive to buffer size changes (#2193)
pub const PEERSET_BUFFER_SIZE: usize = MAX_GET_ADDR_FANOUT + SYNC_FANOUT + 3;

/// The timeout for requests made to a remote peer.
pub const REQUEST_TIMEOUT: Duration = Duration::from_secs(20);
/// The timeout for DNS lookups.
///
/// [6.1.3.3 Efficient Resource Usage] from [RFC 1123: Requirements for Internet Hosts]
/// suggest no less than 5 seconds for resolving timeout.
///
/// [RFC 1123: Requirements for Internet Hosts]: https://tools.ietf.org/rfcmarkup?doc=1123
/// [6.1.3.3 Efficient Resource Usage]: https://tools.ietf.org/rfcmarkup?doc=1123#page-77
pub const DNS_LOOKUP_TIMEOUT: Duration = Duration::from_secs(5);

/// The minimum time between connections to initial or candidate peers.
///
/// ## Security
///
/// Zebra resists distributed denial of service attacks by making sure that new peer connections
/// are initiated at least `MIN_PEER_CONNECTION_INTERVAL` apart.
pub const MIN_PEER_CONNECTION_INTERVAL: Duration = Duration::from_millis(100);

/// The timeout for handshakes when connecting to new peers.
///
/// This timeout should remain small, because it helps stop slow peers getting
/// into the peer set. This is particularly important for network-constrained
/// nodes, and on testnet.
/// into the peer set. There is a tradeoff for network-constrained nodes and on
/// testnet:
/// - we want to avoid very slow nodes, but
/// - we want to have as many nodes as possible.
///
/// Note: Zebra is currently very sensitive to timing changes (#2193)
pub const HANDSHAKE_TIMEOUT: Duration = Duration::from_secs(4);

/// The timeout for crawler [`GetAddr`] requests.
///
/// This timeout is a tradeoff between:
/// - ignored late responses because the timeout is too low, and
/// - peer set congestion from [`GetAddr`] requests dropped by peers.
///
/// Test changes to this timeout against `zcashd` peers on [`Testnet`], or
/// another small network.
///
/// [`GetAddr`]: [`crate::protocol::external::message::Message::GetAddr`]
/// [`Testnet`]: [`zebra_chain::parameters::Network::Testnet`]
///
/// Note: Zebra is currently very sensitive to timing changes (#2193)
pub const GET_ADDR_TIMEOUT: Duration = Duration::from_secs(8);

/// The maximum timeout for requests made to a remote peer.
///
/// [`PeerSet`] callers can set lower timeouts using [`tower::timeout`].
///
/// Note: Zebra is currently very sensitive to timing changes (#2193)
pub const REQUEST_TIMEOUT: Duration = Duration::from_secs(20);

/// We expect to receive a message from a live peer at least once in this time duration.
///
/// This is the sum of:
Expand All @@ -49,20 +134,6 @@ pub const LIVE_PEER_DURATION: Duration = Duration::from_secs(60 + 20 + 20 + 20);
/// connected peer.
pub const HEARTBEAT_INTERVAL: Duration = Duration::from_secs(60);

/// The number of GetAddr requests sent when crawling for new peers.
///
/// ## SECURITY
///
/// The fanout should be greater than 2, so that Zebra avoids getting a majority
/// of its initial address book entries from a single peer.
///
/// Zebra regularly crawls for new peers, initiating a new crawl every
/// [`crawl_new_peer_interval`](crate::config::Config.crawl_new_peer_interval).
///
/// TODO: limit the number of addresses that Zebra uses from a single peer
/// response (#1869)
pub const GET_ADDR_FANOUT: usize = 3;

/// Truncate timestamps in outbound address messages to this time interval.
///
/// ## SECURITY
Expand All @@ -71,6 +142,14 @@ pub const GET_ADDR_FANOUT: usize = 3;
/// messages from each of our peers.
pub const TIMESTAMP_TRUNCATION_SECONDS: i64 = 30 * 60;

/// The maximum queue length for the timestamp collector. Further changes will
/// block until the task processes them.
///
/// This parameter provides a performance/memory tradeoff. It doesn't have much
/// impact on latency, because all queued changes are processed each time the
/// collector runs.
pub const TIMESTAMP_WORKER_BUFFER_SIZE: usize = 100;

/// The User-Agent string provided by the node.
///
/// This must be a valid [BIP 14] user agent.
Expand Down Expand Up @@ -123,15 +202,6 @@ lazy_static! {
}.expect("regex is valid");
}

/// The timeout for DNS lookups.
///
/// [6.1.3.3 Efficient Resource Usage] from [RFC 1123: Requirements for Internet Hosts]
/// suggest no less than 5 seconds for resolving timeout.
///
/// [RFC 1123: Requirements for Internet Hosts] https://tools.ietf.org/rfcmarkup?doc=1123
/// [6.1.3.3 Efficient Resource Usage] https://tools.ietf.org/rfcmarkup?doc=1123#page-77
pub const DNS_LOOKUP_TIMEOUT: Duration = Duration::from_secs(5);

/// Magic numbers used to identify different Zcash networks.
pub mod magics {
use super::*;
Expand All @@ -145,6 +215,18 @@ pub mod magics {
mod tests {

use super::*;
use std::cmp::min;

/// Make sure that the fanout and buffer sizes are consistent with each other.
#[test]
fn ensure_buffers_consistent() {
zebra_test::init();

assert!(
PEERSET_BUFFER_SIZE >= min(MAX_GET_ADDR_FANOUT, SYNC_FANOUT),
"Zebra's peerset buffer should hold the smallest fanout"
);
}

/// This assures that the `Duration` value we are computing for
/// LIVE_PEER_DURATION actually matches the other const values it
Expand All @@ -164,9 +246,19 @@ mod tests {
fn ensure_timeouts_consistent() {
zebra_test::init();

// Specific requests can't have timeouts longer than the maximum timeout
assert!(HANDSHAKE_TIMEOUT <= REQUEST_TIMEOUT,
"Handshakes are requests, so the handshake timeout can't be longer than the timeout for all requests.");
// This check is particularly important on testnet, which has a small
"Handshakes are requests, so their timeout can't be longer than the timeout for all requests.");
assert!(GET_ADDR_TIMEOUT <= REQUEST_TIMEOUT,
"GetAddrs are requests, so their timeout can't be longer than the timeout for all requests.");

// Other requests shouldn't have timeouts shorter than the handshake timeout
assert!(GET_ADDR_TIMEOUT >= HANDSHAKE_TIMEOUT,
"GetAddrs require a successful handshake, so their timeout shouldn't be shorter than the handshake timeout.");

// Basic EWMA checks

// The RTT check is particularly important on testnet, which has a small
// number of peers, which are often slow.
assert!(EWMA_DEFAULT_RTT > REQUEST_TIMEOUT,
"The default EWMA RTT should be higher than the request timeout, so new peers are required to prove they are fast, before we prefer them to other peers.");
Expand Down
2 changes: 1 addition & 1 deletion zebra-network/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -80,7 +80,7 @@ pub use crate::{
config::Config,
isolated::connect_isolated,
meta_addr::PeerAddrState,
peer_set::init,
peer_set::{init, spawn_fanout},
policies::{RetryErrors, RetryLimit},
protocol::internal::{Request, Response},
};
Expand Down
Loading

0 comments on commit 7511e2b

Please sign in to comment.