Improve load-balancer #3439
Labels
A-network
Area: Network protocol updates or fixes
C-cleanup
Category: This is a cleanup
C-enhancement
Category: This is an improvement
I-slow
Problems with performance or responsiveness
Motivation
When selecting a peer to route a request to, Zebra uses a load balancer that approximates each peer's load from its previous request latencies. However, each latency measurement covers the time for the whole response to be received. This skews the estimate in favor of peers that recently served small responses, so the load balancer may choose a slow peer over a fast peer purely because of the size of the requests those peers recently handled.
It would be better to find a more accurate way to calculate load, perhaps by estimating each peer's connection speed instead.
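As a rough illustration of the connection-speed idea, here is a minimal sketch that tracks a moving average of bytes per second instead of raw latency. The `ThroughputEstimate` type, its methods, and the `alpha`, `nominal_bytes`, and `default_load` parameters are all hypothetical names made up for this example. They are not part of Zebra or `tower`, and the sketch ignores fixed round-trip overhead, which a real estimator would probably need to account for.

```rust
use std::time::Duration;

/// Hypothetical per-peer throughput estimate (not an existing Zebra type).
#[derive(Debug, Clone, Copy)]
pub struct ThroughputEstimate {
    /// Exponentially weighted moving average, in bytes per second.
    /// `None` until the first response has been recorded.
    bytes_per_sec: Option<f64>,
    /// Weight given to the newest sample, in the range (0, 1].
    alpha: f64,
}

impl ThroughputEstimate {
    pub fn new(alpha: f64) -> Self {
        assert!(alpha > 0.0 && alpha <= 1.0, "alpha must be in (0, 1]");
        Self { bytes_per_sec: None, alpha }
    }

    /// Record one completed response: its size and how long it took to arrive.
    pub fn record(&mut self, response_bytes: u64, elapsed: Duration) {
        let secs = elapsed.as_secs_f64();
        if secs <= 0.0 {
            // A zero-length measurement carries no speed information.
            return;
        }
        let sample = response_bytes as f64 / secs;
        self.bytes_per_sec = Some(match self.bytes_per_sec {
            None => sample,
            Some(prev) => self.alpha * sample + (1.0 - self.alpha) * prev,
        });
    }

    /// Load metric: estimated seconds to transfer `nominal_bytes`.
    /// Lower is better. Peers with no usable measurement get `default_load`.
    pub fn load(&self, nominal_bytes: u64, default_load: f64) -> f64 {
        match self.bytes_per_sec {
            Some(bps) if bps > 0.0 => nominal_bytes as f64 / bps,
            _ => default_load,
        }
    }
}
```

With this metric, a peer that took 1 second to deliver a 2 MB block ranks ahead of a peer that took 100 ms to deliver a 1 kB response, which is the opposite of what a pure latency comparison would conclude.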
Hedge layer
The block synchronizer also uses a Hedge layer, which re-sends requests that are taking too long to complete to other peers. While this helps with performance, it is currently sub-optimal in how it chooses which requests to hedge. It may prioritize hedging large blocks unnecessarily, and it always spawns hedge requests based on a response-time heuristic.
One idea to improve this is to keep track of the block heights that have been requested. When the number of active downloads falls a bit due to the limit of the chain height, hedge the requests for the blocks with the lowest heights. The number of active downloads will fall if too many blocks are in the queue for verification, which means that a request for an earlier block is taking too long. That signal can be used as a new hedging heuristic.
This can be further improved by tracking the block download percent (#3440) so that the hedge layer can calculate which block is going to take the longest to finish and factor that into the decision.
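One possible reading of the height-based heuristic, combined with download progress, is sketched below. Everything here is an assumption made for illustration: the `ActiveDownloads` type, the `low_watermark` threshold used to detect that active downloads have fallen, and the `max_progress` cutoff are hypothetical and do not exist in Zebra. The sketch only picks a candidate height; it does not issue any requests.

```rust
use std::collections::BTreeMap;

/// Hypothetical tracker for in-flight block downloads (not a Zebra type).
pub struct ActiveDownloads {
    /// Block height -> fraction of the block downloaded so far, in [0, 1].
    /// A `BTreeMap` keeps heights sorted, so the lowest height comes first.
    in_flight: BTreeMap<u32, f64>,
    /// Assumed threshold: fewer active downloads than this suggests the
    /// syncer is waiting on an earlier block.
    low_watermark: usize,
}

impl ActiveDownloads {
    pub fn new(low_watermark: usize) -> Self {
        Self { in_flight: BTreeMap::new(), low_watermark }
    }

    /// Start tracking a download for `height`.
    pub fn start(&mut self, height: u32) {
        self.in_flight.insert(height, 0.0);
    }

    /// Update the downloaded fraction for `height`, if it is being tracked.
    pub fn progress(&mut self, height: u32, fraction: f64) {
        if let Some(p) = self.in_flight.get_mut(&height) {
            *p = fraction.clamp(0.0, 1.0);
        }
    }

    /// Stop tracking `height` once its download completes or is cancelled.
    pub fn finish(&mut self, height: u32) {
        self.in_flight.remove(&height);
    }

    /// Pick the lowest in-flight height worth hedging, if any.
    ///
    /// Returns `None` while the number of active downloads is at or above
    /// the watermark. Otherwise returns the lowest height whose download is
    /// less than `max_progress` complete, skipping blocks that are nearly done.
    pub fn hedge_candidate(&self, max_progress: f64) -> Option<u32> {
        if self.in_flight.len() >= self.low_watermark {
            return None;
        }
        self.in_flight
            .iter()
            .find(|(_, progress)| **progress < max_progress)
            .map(|(height, _)| *height)
    }
}
```

Skipping nearly complete downloads is only one way to use the progress information. An estimate of remaining time per block, as the issue suggests, would need the peer's transfer rate as well.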
Open Security Problems
This design is more complicated than the current load-balancer. Complex designs are easier for peers to exploit, harder to test, and harder to reason about.
It might be safer to keep the current hedge, retry, and load-balancing, because they have stronger fairness guarantees.
Open Design Questions
How do we get detailed load information into the PeerSet for routing purposes? (The Hedge and Retry are currently in the syncer.)
Related Work
This may improve #3438, because the extend tip requests may be timing out due to a sub-optimal choice of peers to fan those requests out to.