Security Issue #38: Stop disconnecting all peers when the inbound service is overloaded #6596
Labels
A-concurrency
Area: Async code, needs extra work to make it work properly.
A-network
Area: Network protocol updates or fixes
C-bug
Category: This is a bug
C-security
Category: Security issues
I-hang
A Zebra component stops responding to requests
I-remote-trigger
Remote nodes can make Zebra do something bad
Motivation
When Zebra's inbound service is overloaded, it disconnects every peer that makes a request to Zebra, until the inbound service processes at least one request.
This allows a single peer to disconnect all other peers by repeatedly making inbound connections, and sending lots of inbound requests on those connections.
Goals
Replace the Overloaded error during shutdown with a new error type, so that connections shut down properly after this fix:
zebra/zebra-network/src/peer/connection.rs
Lines 1256 to 1261 in 185d138
Complex Code or Requirements
All requests concurrently received from peers go into a single queue, where they are processed one by one.
The load shedding is configured here, in response to the buffer reaching its limit:
zebra/zebrad/src/commands/start.rs
Lines 133 to 139 in 166526a
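To make the failure mode concrete, here is a minimal std-only sketch of a single bounded queue that sheds requests once it is full. It is a stand-in for the buffer and load shed behaviour described above, not Zebra's actual service stack; `Request`, `InboundError`, `inbound_queue` and `submit` are hypothetical names used only for illustration.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical stand-in for an inbound peer request.
#[derive(Debug, PartialEq)]
pub struct Request(pub u32);

/// Hypothetical errors handed back to the peer connection.
#[derive(Debug, PartialEq)]
pub enum InboundError {
    /// The shared queue is full, so the request was shed.
    Overloaded,
    /// The service side has gone away, for example during shutdown.
    ServiceClosed,
}

/// Creates one shared queue that holds at most `limit` requests.
pub fn inbound_queue(limit: usize) -> (SyncSender<Request>, Receiver<Request>) {
    sync_channel(limit)
}

/// Queues a request without waiting, shedding it if the queue is full.
pub fn submit(queue: &SyncSender<Request>, req: Request) -> Result<(), InboundError> {
    match queue.try_send(req) {
        Ok(()) => Ok(()),
        Err(TrySendError::Full(_)) => Err(InboundError::Overloaded),
        Err(TrySendError::Disconnected(_)) => Err(InboundError::ServiceClosed),
    }
}
```

Because every peer shares the same queue, one peer that keeps it full causes every other peer's `submit` call to fail with `Overloaded` until the service takes at least one request off the queue. The sketch also uses a separate error for a closed service, in the spirit of the shutdown goal above.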
Possible Solutions
Rate-limit connections
INBOUND_USAGE_LIMIT and INBOUND_USAGE_LIMIT_REFRESH_INTERVAL constants
AtomicUsize that is reset to zero at the interval
Randomly keep some overloaded connections
When the inbound service is overloaded, randomly choose to keep or drop connections that send inbound requests. If a short time has elapsed since the last overload, increase this probability. Otherwise, reset this probability.
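A rough sketch of this keep-or-drop decision follows. It assumes that "this probability" means the probability of dropping the connection. `OverloadTracker`, `should_drop` and all the constants are hypothetical, and their values are picked only for illustration. The caller supplies the current time and a uniform random roll in 0..100, which keeps the sketch free of dependencies.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical tuning values, chosen only for illustration.
const MIN_DROP_PERCENT: u64 = 5;
const MAX_DROP_PERCENT: u64 = 95;
const DROP_PERCENT_STEP: u64 = 15;
const RECENT_OVERLOAD_MS: u64 = 1_000;
/// Sentinel meaning "no overload has been seen yet".
const NEVER: u64 = u64::MAX;

pub struct OverloadTracker {
    /// Time of the last overload, in milliseconds since an arbitrary start.
    last_overload_ms: AtomicU64,
    /// Current chance (0..=100) of dropping an overloaded connection.
    drop_percent: AtomicU64,
}

impl OverloadTracker {
    pub fn new() -> Self {
        Self {
            last_overload_ms: AtomicU64::new(NEVER),
            drop_percent: AtomicU64::new(MIN_DROP_PERCENT),
        }
    }

    /// Records an overload at `now_ms` and decides whether to drop the
    /// connection. `roll` must be a uniform random number in 0..100.
    pub fn should_drop(&self, now_ms: u64, roll: u64) -> bool {
        let last = self.last_overload_ms.swap(now_ms, Ordering::Relaxed);
        let recent = last != NEVER && now_ms.saturating_sub(last) <= RECENT_OVERLOAD_MS;
        let percent = if recent {
            // Overloads are arriving close together: drop more often.
            let current = self.drop_percent.load(Ordering::Relaxed);
            (current + DROP_PERCENT_STEP).min(MAX_DROP_PERCENT)
        } else {
            // The last overload was long ago (or never): start again.
            MIN_DROP_PERCENT
        };
        self.drop_percent.store(percent, Ordering::Relaxed);
        roll < percent
    }
}
```

The two atomics are updated separately, so concurrent callers can briefly see a slightly stale probability. That seems acceptable for a heuristic like this one.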
This can be done without locking by storing the last overload time in an AtomicU64. But a Mutex would probably be fine, too.
Add a PendingRequests load measurer to the service stack, above the buffer:
https://docs.rs/tower/latest/tower/load/pending_requests/struct.PendingRequests.html
If the service isn't fully overloaded, but the load is past a lower limit, randomly return an Overloaded error for some requests:
zebra/zebra-network/src/peer/connection.rs
Lines 1242 to 1261 in 166526a
Make the overloaded errors more likely as the load gets closer to the load shed limit.
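One possible shape for this ramp is a linear probability between a lower limit and the load shed limit, sketched below. The function names and the `soft_limit` and `hard_limit` parameters are hypothetical. In a real service stack the pending count would come from a load measurer such as the PendingRequests wrapper linked above, and the roll from a random number generator. Here both are plain arguments.

```rust
/// Probability (0.0..=1.0) of returning an Overloaded error for a request,
/// given the number of pending requests and two hypothetical limits.
pub fn overload_probability(pending: usize, soft_limit: usize, hard_limit: usize) -> f64 {
    if pending >= hard_limit {
        // Fully overloaded: always shed.
        return 1.0;
    }
    if pending <= soft_limit {
        // Lightly loaded: never shed.
        return 0.0;
    }
    // Here soft_limit < pending < hard_limit, so the divisor is non-zero.
    (pending - soft_limit) as f64 / (hard_limit - soft_limit) as f64
}

/// Decides whether to shed one request. `roll` must be uniform in 0.0..1.0.
pub fn should_shed(pending: usize, soft_limit: usize, hard_limit: usize, roll: f64) -> bool {
    roll < overload_probability(pending, soft_limit, hard_limit)
}
```

With a lower limit of 10 and a load shed limit of 20, a load of 15 pending requests sheds about half of the incoming requests, and a load of 20 or more sheds all of them.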
Apply a layer to the inbound service in zebra-network to buffer requests and reserve service capacity for each peer connection.
Call the underlying service with one queued request from each peer before calling it with a second queued request from any given peer.
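The per-peer reservation and ordering described above could be sketched as a round-robin queue. This is only an illustration of the scheduling idea, not a tower layer. `FairQueue`, `PeerId` and `per_peer_limit` are hypothetical names.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical peer identifier.
pub type PeerId = u32;

pub struct FairQueue<R> {
    /// Capacity reserved for each peer.
    per_peer_limit: usize,
    /// Queued requests, grouped by the peer that sent them.
    queues: HashMap<PeerId, VecDeque<R>>,
    /// Peers with at least one queued request, in service order.
    ready: VecDeque<PeerId>,
}

impl<R> FairQueue<R> {
    pub fn new(per_peer_limit: usize) -> Self {
        Self { per_peer_limit, queues: HashMap::new(), ready: VecDeque::new() }
    }

    /// Queues a request, or hands it back if this peer's reserved
    /// capacity is already used up. Other peers are not affected.
    pub fn push(&mut self, peer: PeerId, req: R) -> Result<(), R> {
        let queue = self.queues.entry(peer).or_default();
        if queue.len() >= self.per_peer_limit {
            return Err(req);
        }
        if queue.is_empty() {
            self.ready.push_back(peer);
        }
        queue.push_back(req);
        Ok(())
    }

    /// Takes the next request, visiting every waiting peer once before
    /// returning to a peer that has more than one request queued.
    pub fn pop(&mut self) -> Option<(PeerId, R)> {
        let peer = self.ready.pop_front()?;
        let queue = self.queues.get_mut(&peer)?;
        let req = queue.pop_front()?;
        if queue.is_empty() {
            self.queues.remove(&peer);
        } else {
            self.ready.push_back(peer);
        }
        Some((peer, req))
    }
}
```

With this ordering, a peer that floods the queue can only fill its own slots. Its later requests wait behind one request from each other waiting peer.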
Testing
Deliberately cause a partial overload, and make sure only some connections are dropped.
Make sure all connections that make an inbound request are dropped during a full overload.
Related Work
We've seen all connections dropped in some of our tests, see: