Improve load-balancer #3439

Closed
jvff opened this issue Jan 30, 2022 · 4 comments
Labels
A-network Area: Network protocol updates or fixes C-cleanup Category: This is a cleanup C-enhancement Category: This is an improvement I-slow Problems with performance or responsiveness

Comments

@jvff
Contributor

jvff commented Jan 30, 2022

Motivation

When selecting a peer to route a request to, Zebra uses a load balancer that calculates an approximate peer load from previous request latencies. However, each latency covers the time taken to receive the whole response, so the value is skewed in favour of peers that recently served small responses. As a result, the load balancer may choose a slow peer over a fast peer just because of the size of the requests those peers recently handled.

We should find a better way to calculate load, perhaps by estimating each peer's connection speed instead.
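As a rough illustration of the connection-speed idea, the sketch below keeps a smoothed bytes-per-second estimate per peer. `ThroughputEstimate`, its methods, and the `alpha` smoothing weight are hypothetical names for this example; they are not part of Zebra or of its current load balancer.

```rust
use std::time::Duration;

/// Hypothetical per-peer throughput estimator (illustration only).
#[derive(Debug, Clone)]
pub struct ThroughputEstimate {
    /// Smoothed bytes per second, or `None` before the first sample.
    bytes_per_sec: Option<f64>,
    /// Weight given to each new sample, in (0, 1].
    alpha: f64,
}

impl ThroughputEstimate {
    pub fn new(alpha: f64) -> Self {
        assert!(alpha > 0.0 && alpha <= 1.0, "alpha must be in (0, 1]");
        Self { bytes_per_sec: None, alpha }
    }

    /// Record a completed response of `bytes` that took `elapsed` to arrive.
    pub fn record(&mut self, bytes: u64, elapsed: Duration) {
        let secs = elapsed.as_secs_f64();
        if secs <= 0.0 {
            // A zero-length measurement carries no rate information.
            return;
        }
        let sample = bytes as f64 / secs;
        self.bytes_per_sec = Some(match self.bytes_per_sec {
            // The first sample seeds the estimate.
            None => sample,
            // Later samples are blended in with an exponential moving average.
            Some(prev) => self.alpha * sample + (1.0 - self.alpha) * prev,
        });
    }

    /// Current estimate; `None` means the peer has not been measured yet.
    pub fn bytes_per_sec(&self) -> Option<f64> {
        self.bytes_per_sec
    }
}
```

With this kind of estimate, a peer that served 2 MB in 2 seconds ranks above a peer that served 1 kB in 100 ms, even though the second peer has the lower latency. How unmeasured peers should be ranked is left open here.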

Hedge layer

The block synchronizer also uses a Hedge layer to re-send requests that are taking too long to complete to other peers. While this helps with performance, the way it chooses which requests to hedge is currently sub-optimal. It may prioritize hedging large blocks unnecessarily, and its only heuristic for spawning a hedge request is response time.

One idea to improve this is to keep track of the block heights that have been requested. When the number of active downloads falls a bit because of the chain height limit, hedge the requests for the blocks with the lowest heights. The number of active downloads falls when too many blocks are queued for verification, which means that a request for an earlier block is taking too long. That signal can be used as a new hedging heuristic.
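A minimal sketch of the height-based heuristic, assuming the syncer aims to keep a fixed number of downloads in flight. `HedgeTracker`, `target_downloads`, and the method names are made up for this example and do not exist in Zebra.

```rust
use std::collections::BTreeSet;

/// Hypothetical tracker for in-flight block downloads (illustration only).
pub struct HedgeTracker {
    /// Heights of block requests that have not completed yet.
    /// A `BTreeSet` iterates in ascending order, so the lowest heights come first.
    in_flight: BTreeSet<u32>,
    /// Assumed number of concurrent downloads the syncer normally maintains.
    target_downloads: usize,
}

impl HedgeTracker {
    pub fn new(target_downloads: usize) -> Self {
        Self { in_flight: BTreeSet::new(), target_downloads }
    }

    pub fn request_started(&mut self, height: u32) {
        self.in_flight.insert(height);
    }

    pub fn request_finished(&mut self, height: u32) {
        self.in_flight.remove(&height);
    }

    /// Heights to hedge right now.
    ///
    /// When fewer downloads are active than the target, the spare slots are
    /// used to hedge the lowest-height requests that are still in flight.
    pub fn heights_to_hedge(&self) -> Vec<u32> {
        let spare = self.target_downloads.saturating_sub(self.in_flight.len());
        self.in_flight.iter().copied().take(spare).collect()
    }
}
```

While the download queue is full, `heights_to_hedge` returns nothing. Once completed downloads are no longer replaced, the remaining low-height requests are the ones blocking verification, so they are returned first.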

The height-based heuristic can be further improved by tracking the block download percent (#3440), so that the hedge layer can estimate which block is going to take the longest to finish and factor that into the decision.
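One possible way to use the download percent is a linear extrapolation: if a fraction `f` of a block arrived in `elapsed` seconds, the rest should take about `elapsed * (1 - f) / f`. The sketch below assumes that model; `DownloadProgress` and both functions are hypothetical, and #3440 may expose progress in a different form.

```rust
use std::time::Duration;

/// Hypothetical progress record for one block download (illustration only).
pub struct DownloadProgress {
    pub height: u32,
    /// Fraction of the block received so far, from 0.0 to 1.0.
    pub fraction_done: f64,
    /// Time since the request was sent.
    pub elapsed: Duration,
}

/// Estimated seconds until the download finishes, assuming a constant rate.
/// A download with no progress yet is treated as taking infinitely long.
pub fn estimated_remaining_secs(p: &DownloadProgress) -> f64 {
    if p.fraction_done <= 0.0 {
        return f64::INFINITY;
    }
    // Clamp so that a fraction above 1.0 cannot produce a negative estimate.
    let f = p.fraction_done.min(1.0);
    p.elapsed.as_secs_f64() * (1.0 - f) / f
}

/// Height of the download expected to finish last, if there are any downloads.
pub fn slowest_download(downloads: &[DownloadProgress]) -> Option<u32> {
    downloads
        .iter()
        .max_by(|a, b| estimated_remaining_secs(a).total_cmp(&estimated_remaining_secs(b)))
        .map(|p| p.height)
}
```

A hedge layer could combine this with the block height, for example by only hedging the slowest download among the lowest few heights.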

Open Security Problems

This design is more complicated than the current load-balancer. Complex designs are easier for peers to exploit, harder to test, and harder to reason about.

It might be safer to keep the current hedge, retry, and load-balancing, because they have stronger fairness guarantees.

Open Design Questions

How do we get detailed load information into the PeerSet for routing purposes?
(The Hedge and Retry are currently in the syncer.)

Related Work

This may improve #3438, because the extend tip requests may be timing out due to a sub-optimal choice of peers to fan those requests out to.

@jvff jvff added C-enhancement Category: This is an improvement C-cleanup Category: This is a cleanup S-needs-triage Status: A bug report needs triage I-slow Problems with performance or responsiveness P-Optional ✨ A-network Area: Network protocol updates or fixes labels Jan 30, 2022
@ftm1000

ftm1000 commented Feb 21, 2022

@teor2345
Contributor

I'd like to resolve the security risks in this design before we schedule this ticket.

@gustavovalverde
Member

I'll just add this related info so I don't forget to talk about this subject later https://bparli.medium.com/adventures-in-rust-and-load-balancers-73a0bc61a192

@ftm1000 ftm1000 removed the S-needs-triage Status: A bug report needs triage label Mar 10, 2022
@jvff
Contributor Author

jvff commented Mar 11, 2022

We can revisit this in the future if needed.

@jvff jvff closed this as completed Mar 11, 2022