fix(security): Randomly drop connections when inbound service is overloaded #6790

Merged (14 commits) on May 31, 2023

@teor2345 commented May 30, 2023

Motivation

Zebra should avoid dropping all peer connections when the inbound service is overloaded by a subset of peers.

This PR randomly drops connections that get an overloaded error, with a high probability.

Closes #6596.

Solution

  • Randomly drop connections that receive overloaded errors, with greater likelihood when a second overloaded error is seen soon after the first.
  • Add a ServiceShutdown variant to PeerError for when the inbound service has failed or shut down.
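
As a rough illustration of the two changes above, here is a minimal sketch, not the code from this PR. Only the names `PeerError` and `ServiceShutdown` come from the description; the `Overloaded` variant name, `InboundOutcome`, both functions, and all constants are hypothetical. The random sample is passed in as a parameter so the sketch needs no external crates.

```rust
use std::time::Duration;

// Hypothetical constants, chosen only for illustration.
const MIN_DROP_PROBABILITY: f32 = 0.05;
const MAX_DROP_PROBABILITY: f32 = 0.95;
const OVERLOAD_WINDOW: Duration = Duration::from_secs(5);

/// Simplified stand-in for the peer error type.
/// `ServiceShutdown` is named in the PR; `Overloaded` is an assumed name.
#[derive(Debug, PartialEq)]
enum PeerError {
    Overloaded,
    ServiceShutdown,
}

/// Hypothetical summary of what the inbound service reported.
enum InboundOutcome {
    Overloaded,
    Failed,
}

/// Drop probability for a connection that just received an overloaded error.
/// `since_last_overload` is `None` if this is the first overload on the connection.
/// The probability is highest right after a previous overload and decays
/// linearly to the minimum over `OVERLOAD_WINDOW`.
fn overload_drop_probability(since_last_overload: Option<Duration>) -> f32 {
    match since_last_overload {
        Some(elapsed) if elapsed < OVERLOAD_WINDOW => {
            let fraction = elapsed.as_secs_f32() / OVERLOAD_WINDOW.as_secs_f32();
            MAX_DROP_PROBABILITY - (MAX_DROP_PROBABILITY - MIN_DROP_PROBABILITY) * fraction
        }
        _ => MIN_DROP_PROBABILITY,
    }
}

/// Returns the error that should close the connection, or `None` to keep it open.
/// `sample` is a uniform random number in `[0, 1)` supplied by the caller.
fn connection_error(
    outcome: InboundOutcome,
    since_last_overload: Option<Duration>,
    sample: f32,
) -> Option<PeerError> {
    match outcome {
        // A failed or shut down service always closes the connection.
        InboundOutcome::Failed => Some(PeerError::ServiceShutdown),
        // An overload only closes the connection some of the time.
        InboundOutcome::Overloaded => {
            if sample < overload_drop_probability(since_last_overload) {
                Some(PeerError::Overloaded)
            } else {
                None
            }
        }
    }
}
```

Taking the sample as an argument keeps the decision deterministic for a given input, which makes this kind of logic straightforward to unit test.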

Testing:

  • more overloads mean a higher probability of being dropped
  • fewer overloads mean a lower probability
  • frequent overloads significantly increase the probability
  • when we make multiple connections, some are dropped and some aren't
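
The properties in this list can be checked with a small simulation. The sketch below is not the PR's test code: it assumes each overloaded error drops the connection independently with a fixed probability `p`, and it uses a tiny deterministic xorshift generator so it needs no external crates. All names are hypothetical.

```rust
/// Minimal xorshift64 generator, used only to keep the sketch dependency-free.
/// The seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    /// Returns a pseudo-random value in `[0, 1)`.
    fn next_f64(&mut self) -> f64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        // Use the top 53 bits so the result fits exactly in an f64 mantissa.
        (x >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Probability that a connection has been dropped after `overloads` errors,
/// assuming each error independently drops it with probability `p`.
fn cumulative_drop_probability(p: f64, overloads: u32) -> f64 {
    1.0 - (1.0 - p).powi(overloads as i32)
}

/// Simulates `connections` peers that each receive `overloads` overloaded errors,
/// and returns how many of them were dropped.
fn simulate_dropped(connections: usize, overloads: u32, p: f64, rng: &mut XorShift64) -> usize {
    let mut dropped = 0;
    for _ in 0..connections {
        for _ in 0..overloads {
            if rng.next_f64() < p {
                dropped += 1;
                break; // a dropped connection receives no further errors
            }
        }
    }
    dropped
}
```

With `p = 0.05`, for example, `cumulative_drop_probability` is 0.05 after one overload and roughly 0.64 after twenty, so a simulated batch with more overloads per connection should lose noticeably more connections while still keeping some.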

Related fixes:

  • add some test utility methods and fix docs

Review

This is a routine fix.

The production code has already been reviewed privately; the tests are new.

Reviewer Checklist

  • Will the PR name make sense to users?
    • Does it need extra CHANGELOG info? (new features, breaking changes, large changes)
  • Are the PR labels correct?
  • Does the code do what the ticket and PR say?
    • Does it change concurrent code, unsafe code, or consensus rules?
  • How do you know it works? Does it have tests?

Follow Up Work

  • Use a skiplist to keep an ordered buffer queue

@teor2345 added the labels C-bug (Category: This is a bug), P-Medium ⚡, C-security (Category: Security issues), I-hang (A Zebra component stops responding to requests), A-network (Area: Network protocol updates or fixes), A-concurrency (Area: Async code, needs extra work to make it work properly), and I-remote-trigger (Remote nodes can make Zebra do something bad) on May 30, 2023
@teor2345 requested review from a team as code owners May 30, 2023 03:18
@teor2345 self-assigned this May 30, 2023
@teor2345 requested review from upbqdn and removed request for a team May 30, 2023 03:18
@github-actions bot added the C-trivial label (Category: A trivial change that is not worth mentioning in the CHANGELOG) on May 30, 2023
@teor2345 requested review from arya2 and oxarbitrage and removed request for a team and upbqdn May 30, 2023 03:21

codecov bot commented May 30, 2023

Codecov Report

Merging #6790 (ca23d02) into main (6f8c981) will increase coverage by 0.17%.
The diff coverage is 78.68%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6790      +/-   ##
==========================================
+ Coverage   77.99%   78.17%   +0.17%     
==========================================
  Files         308      308              
  Lines       40975    41023      +48     
==========================================
+ Hits        31960    32071     +111     
+ Misses       9015     8952      -63     

@teor2345 force-pushed the randomly-keep-overloaded-conns branch from ae836a6 to ca23d02 May 31, 2023 00:47
@teor2345 (author) commented

Sorry for the doctest compile errors, I didn't catch them locally.

(Next time I'll fix the test API separately.)

mergify bot added a commit that referenced this pull request May 31, 2023
@mergify bot merged commit 6eaf83b into main May 31, 2023
@mergify bot deleted the randomly-keep-overloaded-conns branch May 31, 2023 19:04
@teor2345 mentioned this pull request Jun 7, 2023
Development

Successfully merging this pull request may close these issues.

Security Issue #38: Stop disconnecting all peers when the inbound service is overloaded