Security Issue #38: Stop disconnecting all peers when the inbound service is overloaded #6596

Closed · Tracked by #6390
teor2345 opened this issue May 1, 2023 · 1 comment · Fixed by #6790
Labels: A-concurrency (Area: Async code, needs extra work to make it work properly) · A-network (Area: Network protocol updates or fixes) · C-bug (Category: This is a bug) · C-security (Category: Security issues) · I-hang (A Zebra component stops responding to requests) · I-remote-trigger (Remote nodes can make Zebra do something bad)


teor2345 commented May 1, 2023

Motivation

When Zebra's inbound service is overloaded, it disconnects every peer that makes a request to Zebra, until the inbound service processes at least one request.

This lets a single peer disconnect all other peers by repeatedly opening inbound connections and sending large numbers of requests on them.

Goals

  • Make sure a single peer can't make Zebra drop connections from other peers
  • Replace the Overloaded error during shutdown with a new error type, so that connections shut down cleanly after this fix (see the error sketch after the snippet below)

Zebra currently treats every service readiness error as Overloaded:

if self.svc.ready().await.is_err() {
    // Treat all service readiness errors as Overloaded
    // TODO: treat `TryRecvError::Closed` in `Inbound::poll_ready` as a fatal error (#1655)
    self.fail_with(PeerError::Overloaded);
    return;
}
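
For the second goal, a minimal sketch of how the errors could be split; Overloaded is an existing PeerError variant, while ServiceShutdown is a hypothetical name for the new shutdown error:

use thiserror::Error;

/// A subset of Zebra's peer errors, reduced for this sketch.
#[derive(Debug, Error)]
enum PeerError {
    /// The inbound service has no spare capacity: the connection
    /// may be dropped to shed load.
    #[error("peer connection handler is overloaded")]
    Overloaded,

    /// Hypothetical new variant: Zebra is shutting down, which is not
    /// a load problem, so the connection can close cleanly instead of
    /// being failed as overloaded.
    #[error("internal services are shutting down")]
    ServiceShutdown,
}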

Complex Code or Requirements

All requests received concurrently from peers go into a single queue, where they are processed one at a time.

Load shedding is configured here, and it activates when the buffer reaches its limit:

let inbound = ServiceBuilder::new()
    .load_shed()
    .buffer(inbound::downloads::MAX_INBOUND_CONCURRENCY)
    .service(Inbound::new(
        config.sync.full_verify_concurrency_limit,
        setup_rx,
    ));

Possible Solutions

Rate-limit connections

  • Define INBOUND_USAGE_LIMIT and INBOUND_USAGE_LIMIT_REFRESH_INTERVAL constants
  • Keep a count of the requests each connection makes during an interval in an AtomicUsize that is reset to zero at the start of each interval (see the sketch after this list)
  • Delay requests until the next usage interval if a connection has already reached its limit
  • Start the next interval when the inbound service buffer is emptied
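
A minimal sketch of that per-connection counter, assuming placeholder values for the two constants and a hypothetical ConnectionUsage type; resetting the interval when the inbound service buffer empties is omitted here:

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::time::Duration;

use tokio::time::{sleep, Instant};

// Constants named in this issue; the values here are placeholders.
const INBOUND_USAGE_LIMIT: usize = 50;
const INBOUND_USAGE_LIMIT_REFRESH_INTERVAL: Duration = Duration::from_secs(10);

/// Hypothetical per-connection request counter.
struct ConnectionUsage {
    /// Requests made by this connection during the current interval.
    requests: AtomicUsize,
    /// When the current interval started.
    interval_start: Mutex<Instant>,
}

impl ConnectionUsage {
    fn new() -> Self {
        Self {
            requests: AtomicUsize::new(0),
            interval_start: Mutex::new(Instant::now()),
        }
    }

    /// Waits until this connection is allowed to make another inbound request.
    async fn acquire(&self) {
        loop {
            // Start a new interval if the current one has elapsed.
            {
                let mut start = self.interval_start.lock().unwrap();
                if start.elapsed() >= INBOUND_USAGE_LIMIT_REFRESH_INTERVAL {
                    *start = Instant::now();
                    self.requests.store(0, Ordering::Release);
                }
            }

            // Take a slot if the connection is under its limit.
            if self.requests.fetch_add(1, Ordering::AcqRel) < INBOUND_USAGE_LIMIT {
                return;
            }

            // Over the limit: undo the increment and delay the request
            // until the next usage interval.
            self.requests.fetch_sub(1, Ordering::AcqRel);
            let wait = {
                let start = self.interval_start.lock().unwrap();
                INBOUND_USAGE_LIMIT_REFRESH_INTERVAL.saturating_sub(start.elapsed())
            };
            sleep(wait).await;
        }
    }
}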

Randomly keep some overloaded connections

When the inbound service is overloaded, randomly choose whether to keep or drop each connection that sends an inbound request. If only a short time has elapsed since the last overload, increase the drop probability; otherwise, reset it to the minimum.

This can be done without locking by storing the last overload time in an AtomicU64. But a Mutex would probably be fine, too.
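
A minimal sketch of that approach, storing the last overload time as seconds since the Unix epoch in an AtomicU64; the constants and the linear probability scale are assumptions:

use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

use rand::Rng;

// Placeholder tuning constants for this sketch.
const MIN_DROP_PROBABILITY: f32 = 0.05;
const MAX_DROP_PROBABILITY: f32 = 0.95;
/// Overloads closer together than this raise the drop probability.
const OVERLOAD_DECAY_SECS: u64 = 30;

/// Seconds since the Unix epoch of the last overload, shared by all connections.
static LAST_OVERLOAD_SECS: AtomicU64 = AtomicU64::new(0);

/// Returns true if this connection should be dropped in response to an overload.
fn should_drop_overloaded_connection() -> bool {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is after the Unix epoch")
        .as_secs();
    let last = LAST_OVERLOAD_SECS.swap(now, Ordering::AcqRel);

    // Recent overloads raise the drop probability; if the last overload
    // was a while ago, reset it to the minimum.
    let elapsed = now.saturating_sub(last);
    let drop_probability = if elapsed >= OVERLOAD_DECAY_SECS {
        MIN_DROP_PROBABILITY
    } else {
        // Scale linearly from MAX (immediately repeated overload)
        // down to MIN (an overload OVERLOAD_DECAY_SECS ago).
        let fraction = elapsed as f32 / OVERLOAD_DECAY_SECS as f32;
        MAX_DROP_PROBABILITY - fraction * (MAX_DROP_PROBABILITY - MIN_DROP_PROBABILITY)
    };

    rand::thread_rng().gen_bool(f64::from(drop_probability))
}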


Shed load probabilistically below the limit

Add a PendingRequests load measurer to the service stack, above the buffer:

https://docs.rs/tower/latest/tower/load/pending_requests/struct.PendingRequests.html

If the service isn't fully overloaded, but the load is past a lower limit, randomly return an Overloaded error for some requests:

let rsp = match self.svc.call(req.clone()).await {
    Err(e) => {
        if e.is::<Overloaded>() {
            tracing::info!("inbound service is overloaded, closing connection");
            metrics::counter!("pool.closed.loadshed", 1);
            self.fail_with(PeerError::Overloaded);
        } else {
            // We could send a reject to the remote peer, but that might cause
            // them to disconnect, and we might be using them to sync blocks.
            // For similar reasons, we don't want to fail_with() here - we
            // only close the connection if the peer is doing something wrong.
            info!(%e,
                connection_state = ?self.state,
                client_receiver = ?self.client_rx,
                "error processing peer request");
        }
        return;
    }
    Ok(rsp) => rsp,
};

Make the Overloaded errors more likely as the load gets closer to the load-shed limit.
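
A minimal sketch of that scaling, assuming the pending-request count is visible as a plain number (for example via a shared counter incremented around each call; tower's PendingRequests reports its metric as an opaque Count). Both limits are placeholders:

use rand::Rng;

// Placeholder limits for this sketch: shedding starts at the lower limit
// and is certain at the full load-shed limit.
const LOWER_LOAD_LIMIT: usize = 50;
const LOAD_SHED_LIMIT: usize = 100;

/// Returns true if a request should get an `Overloaded` error, given the
/// number of requests currently pending in the inbound service.
fn should_shed(pending_requests: usize) -> bool {
    if pending_requests < LOWER_LOAD_LIMIT {
        return false;
    }
    if pending_requests >= LOAD_SHED_LIMIT {
        return true;
    }

    // Scale the rejection probability linearly between the two limits.
    let over = (pending_requests - LOWER_LOAD_LIMIT) as f64;
    let range = (LOAD_SHED_LIMIT - LOWER_LOAD_LIMIT) as f64;
    rand::thread_rng().gen_bool(over / range)
}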


Reserve service capacity for each peer

Apply a layer to the inbound service in zebra-network to buffer requests and reserve service capacity for each peer connection.

Call the underlying service with one queued request from each peer before calling it with a second queued request from any given peer.
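
A minimal sketch of that dispatch order, using a hypothetical FairQueue that round-robins over per-peer request queues (the request type is a placeholder):

use std::collections::{HashMap, VecDeque};
use std::net::SocketAddr;

/// Placeholder request type for this sketch.
type Request = String;

/// A fair queue that interleaves requests from different peers: one queued
/// request from each peer is dispatched before any peer's second request.
#[derive(Default)]
struct FairQueue {
    /// Pending requests, one queue per peer connection.
    queues: HashMap<SocketAddr, VecDeque<Request>>,
    /// Round-robin order over peers that have pending requests.
    order: VecDeque<SocketAddr>,
}

impl FairQueue {
    fn push(&mut self, peer: SocketAddr, request: Request) {
        let queue = self.queues.entry(peer).or_default();
        if queue.is_empty() {
            self.order.push_back(peer);
        }
        queue.push_back(request);
    }

    /// Pops the next request, cycling through peers in round-robin order.
    fn pop(&mut self) -> Option<(SocketAddr, Request)> {
        let peer = self.order.pop_front()?;
        let queue = self.queues.get_mut(&peer)?;
        let request = queue.pop_front()?;
        if queue.is_empty() {
            self.queues.remove(&peer);
        } else {
            // This peer still has requests: move it to the back of the order.
            self.order.push_back(peer);
        }
        Some((peer, request))
    }
}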

Testing

  • Deliberately cause a partial overload, and make sure only some connections are dropped.
  • Make sure all connections that make an inbound request are dropped during a full overload.

Related Work

We've seen all connections dropped in some of our tests.

mpguerra commented

@arya2 can you please set an estimate for this one?
