Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit fanouts to the number of active peers, Credit: Equilibrium #2214

Closed
13 tasks
Tracked by #3247 ...
teor2345 opened this issue May 27, 2021 · 2 comments
Closed
13 tasks
Tracked by #3247 ...

Limit fanouts to the number of active peers, Credit: Equilibrium #2214

teor2345 opened this issue May 27, 2021 · 2 comments
Labels
A-network Area: Network protocol updates or fixes A-rust Area: Updates to Rust code C-bug Category: This is a bug C-cleanup Category: This is a cleanup C-security Category: Security issues I-remote-node-overload Zebra can overload other nodes on the network S-blocked Status: Blocked on other tasks S-needs-design Status: Needs a design decision

Comments

@teor2345
Copy link
Contributor

teor2345 commented May 27, 2021

Motivation

  1. Zebra fans out getaddr, getdata, and mempool messages to 3 peers. But this is useless when we only have 1-3 ready peers. This mainly happens on testnet, but it can also happen on busy mainnet nodes.

  2. When Zebra sends a request, it stops the peer connection being used for other things, until a response or request timeout. This can really slow down the node, because those connections can't be used for syncing. We can observe these timeouts using waiting for the peer service to become ready logs.

Scheduling

We should try to fix the underlying PeerSet bugs first, and then see if Zebra still has hangs.

This is a complex refactor, so we should avoid doing it unless we are sure it is needed. Instead, we should try to find easier fixes for the underlying PeerSet issues.

We should re-evaluate the risk as part of the lightwalletd work.

Solution

Refactoring

This depends on the refactoring done as part of #3230.

  • Add a PeerSet::route_fanout method, that routes a request to multiple peers. The fanout number should be limited to the number of ready peer services (should it be ready and unready peer services?)
  • Refactor existing fanouts in different places to not perform a manual fanout, and make the PeerSet decide which requests to fanout

Existing fanouts

  • GetAddr fanout in the candidate set
    • Restore this fanout to 3 as part of this change
  • GetData fanouts in the syncer
    • obtain tips
    • extend tips
  • MempoolTransactionIds fanout in the mempool crawler

Tests

  • proptest to see if fanouts go to the correct number of peers
  • proptest to see if the PeerService recovers from a fanned out request (i.e., the peers handling the fanned out requests become ready again. We might need to fanout multiple times until we know we have sent at least one request to each peer, and then check if the PeerSet becomes ready after a while)
  • proptests to see if the PeerService handles errors during fanout requests
    • returns an error if all responses to fanned out request are errors
    • returns at least one response if at least one fanned out request succeeds

Alternatives

The foundation could run some Zebra or zcashd nodes on testnet, to increase the number of available testnet peers.

Put a watch channel in the peer set that is updated with the total number of connected peers (ready and unready), pass that channel to the crawler and syncer, and use it to limit their fanouts.

@teor2345 teor2345 added C-bug Category: This is a bug A-rust Area: Updates to Rust code S-needs-triage Status: A bug report needs triage P-Low C-security Category: Security issues I-remote-node-overload Zebra can overload other nodes on the network A-network Area: Network protocol updates or fixes labels May 27, 2021
@teor2345 teor2345 changed the title Limit fanouts to the number of active peers Limit fanouts to the number of active peers, Credit: Equilibrium May 27, 2021
@mpguerra mpguerra removed the S-needs-triage Status: A bug report needs triage label May 31, 2021
@teor2345 teor2345 added P-Medium and removed P-Low labels Nov 29, 2021
@teor2345 teor2345 added the S-needs-design Status: Needs a design decision label Dec 8, 2021
@teor2345
Copy link
Contributor Author

teor2345 commented Dec 8, 2021

Before starting this ticket, we need to decide on the best design for this ticket, because the suggested design is a lot of work.

@teor2345 teor2345 added the Epic Zenhub Label. Denotes a theme of work under which related issues will be grouped label Jan 6, 2022
@mpguerra mpguerra added C-cleanup Category: This is a cleanup and removed Epic Zenhub Label. Denotes a theme of work under which related issues will be grouped labels Jan 25, 2022
@mpguerra mpguerra mentioned this issue Jan 27, 2022
40 tasks
@teor2345
Copy link
Contributor Author

teor2345 commented Jun 2, 2022

Zebra has a large number of peers connected on mainnet, so it is unlikely we will send any of the 3 fanout requests to the same peer. We also made changes that make sure a broadcast only goes to half the peers, which means there are always peers available.

(This issue really only happens on testnet.)

@teor2345 teor2345 closed this as not planned Won't fix, can't repro, duplicate, stale Jun 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-network Area: Network protocol updates or fixes A-rust Area: Updates to Rust code C-bug Category: This is a bug C-cleanup Category: This is a cleanup C-security Category: Security issues I-remote-node-overload Zebra can overload other nodes on the network S-blocked Status: Blocked on other tasks S-needs-design Status: Needs a design decision
Projects
None yet
Development

No branches or pull requests

4 participants