load balancing: Check each host at most once in LeastRequests LB #11006

Closed
wants to merge 12 commits

Conversation

euroelessar
Contributor

Commit Message: Improve the LeastRequests LB by picking a unique host endpoint at each iteration of the unweighted pick algorithm.
Additional Description: This change adds one more map to the LeastRequestLoadBalancer object with a copy of the hosts vector. This vector is partially shuffled at every pick attempt.
Risk Level: medium (change in an actively used LB policy)
Testing: updated unit tests
Docs Changes: n/a
Release Notes: updated
Fixes #11004
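
A minimal, self-contained sketch of the pick strategy described above, with illustrative type names rather than the actual Envoy classes: each of the choice_count samples is swapped to the tail of a shrinking candidate range, so no host can be compared against itself.

#include <cstdint>
#include <memory>
#include <random>
#include <utility>
#include <vector>

// Stand-ins for Envoy's Host/HostSharedPtr types (illustrative only).
struct Host { uint64_t active_requests = 0; };
using HostPtr = std::shared_ptr<Host>;

// Least-requests pick over `choice_count` *distinct* hosts: a partial
// Fisher-Yates shuffle of the per-thread host copy guarantees uniqueness.
HostPtr uniquePick(std::vector<HostPtr>& hosts, uint32_t choice_count, std::mt19937_64& rng) {
  size_t size = hosts.size();
  HostPtr best;
  for (uint32_t i = 0; i < choice_count && size > 0; ++i) {
    const size_t idx = rng() % size;
    HostPtr candidate = hosts[idx];
    std::swap(hosts[idx], hosts[--size]); // exclude this host from later draws
    if (best == nullptr || candidate->active_requests < best->active_requests) {
      best = candidate;
    }
  }
  return best;
}

The work per pick stays proportional to choice_count, which is consistent with the before/after benchmark numbers below staying close.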

@euroelessar
Contributor Author

Benchmark before:

BM_LeastRequestLoadBalancerChooseHost/100/1/1000000          186 ms        186 ms          4 mean_hits=10k relative_stddev_hits=0.0109756 stddev_hits=109.756
BM_LeastRequestLoadBalancerChooseHost/100/2/1000000          216 ms        216 ms          3 mean_hits=10k relative_stddev_hits=9.08017m stddev_hits=90.8017
BM_LeastRequestLoadBalancerChooseHost/100/3/1000000          243 ms        243 ms          3 mean_hits=10k relative_stddev_hits=9.44174m stddev_hits=94.4174
BM_LeastRequestLoadBalancerChooseHost/100/10/1000000         442 ms        442 ms          2 mean_hits=10k relative_stddev_hits=9.81851m stddev_hits=98.1851
BM_LeastRequestLoadBalancerChooseHost/100/50/1000000        1547 ms       1547 ms          1 mean_hits=10k relative_stddev_hits=9.03159m stddev_hits=90.3159
BM_LeastRequestLoadBalancerChooseHost/100/100/1000000       2928 ms       2928 ms          1 mean_hits=10k relative_stddev_hits=9.8252m stddev_hits=98.252

After:

BM_LeastRequestLoadBalancerChooseHost/100/1/1000000          199 ms        199 ms          4 mean_hits=10k relative_stddev_hits=0.0106399 stddev_hits=106.399
BM_LeastRequestLoadBalancerChooseHost/100/2/1000000          227 ms        227 ms          3 mean_hits=10k relative_stddev_hits=9.26935m stddev_hits=92.6935
BM_LeastRequestLoadBalancerChooseHost/100/3/1000000          255 ms        255 ms          3 mean_hits=10k relative_stddev_hits=9.63982m stddev_hits=96.3982
BM_LeastRequestLoadBalancerChooseHost/100/10/1000000         449 ms        449 ms          2 mean_hits=10k relative_stddev_hits=9.85124m stddev_hits=98.5124
BM_LeastRequestLoadBalancerChooseHost/100/50/1000000        1571 ms       1571 ms          1 mean_hits=10k relative_stddev_hits=9.47864m stddev_hits=94.7864
BM_LeastRequestLoadBalancerChooseHost/100/100/1000000       2933 ms       2933 ms          1 mean_hits=10k relative_stddev_hits=0.0106751 stddev_hits=106.751

Ruslan Nigmatullin added 2 commits April 29, 2020 19:14
Signed-off-by: Ruslan Nigmatullin <[email protected]>
Signed-off-by: Ruslan Nigmatullin <[email protected]>
Contributor

@snowp left a comment

Thanks for working on this! The algorithm seems like an improvement; just some open questions around resource-cost trade-offs for the implementation.

@@ -470,6 +476,9 @@ class LeastRequestLoadBalancer : public EdfLoadBalancerBase {
HostConstSharedPtr unweightedHostPick(const HostVector& hosts_to_use,
const HostsSource& source) override;
const uint32_t choice_count_;

// List of hosts per HostsSource for fair random sampling.
std::unordered_map<HostsSource, HostVector, HostsSourceHash> unweighted_hosts_;
Contributor

I wonder if there is a way to avoid having to allocate a second set of host lists - this change means a memory increase of O(hosts * worker threads) for any cluster using LR, which I imagine would be noticeable for deployments with a large number of endpoints.

Feels like we either need to make this not require this much additional memory (can we shuffle the host lists maintained on the HostSet instead? Not clear if we want to make that mutable) or make it possible to fall back to an algorithm that doesn't require tracking this additional state.

Another option would be to allocate a HostVector (or a std::vector<size_t> if you're just trying to shuffle the indices around) during load balancing - this would trade additional allocations + vector construction in the LB path for not having to maintain all these lists everywhere.

@jmarantz @mattklein123 for thoughts

Contributor

One idea: what if we use the old implementation/algorithm for large numbers of Hosts, and this for small numbers of hosts? (with a configurable threshold perhaps). IIUC, these changes address issues mainly for smaller numbers of hosts. And with smaller numbers, the extra memory is negligible.

Contributor Author

HostVector is a typedef for std::vector<HostSharedPtr>, so ultimately its memory footprint is equivalent to that of a std::vector<size_t>, i.e. 8 extra bytes per host per worker on the amd64 architecture.
Allocating this vector on every pick attempt (especially assuming a large number of endpoints) is likely not viable from a CPU usage point of view.

Contributor

Yeah, my suggestion to use a size_t vector was based on my thinking that copying std::shared_ptrs would be more expensive than generating a sequence of numbers, not on reducing the memory footprint.

Gating the behavior on the number of hosts might give us the best of both worlds, though that of course raises the question of what the threshold should be.

Member

Yeah, I agree that copying this on every worker and on every update is not optimal.

sizeof(shared_ptr) is typically bigger than 8. I think it's usually 16 (a pointer plus a pointer to the control block), but I'm not sure about the current stdlib implementation. Either way, not huge.

Per @snowp I think it would be possible to allocate a vector<uint32_t>, and initialize it with indexes 1..n, and then do the same swapping stuff. This might be slightly more compact but require some indirection back into the primary vector.

Per @ggreenway I agree it would be nice to potentially just turn on this optimization when hosts are below some size N?
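
A sketch of the vector<uint32_t> variant suggested here (hypothetical helper names; the updated diff further down does switch to an unweighted_host_indexes_ map): the per-worker state shrinks to 4 bytes per host, at the cost of one indirection back into the shared host vector.

#include <cstdint>
#include <memory>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

struct Host { uint64_t active_requests = 0; };
using HostPtr = std::shared_ptr<Host>;

// Per-worker index list: 0..n-1 into the (unmodified) shared host vector.
std::vector<uint32_t> makeIndexList(size_t n) {
  std::vector<uint32_t> indexes(n);
  std::iota(indexes.begin(), indexes.end(), 0);
  return indexes;
}

// Same partial shuffle as before, but applied to 4-byte indexes instead of
// shared_ptrs; the host vector itself stays untouched.
HostPtr uniquePickByIndex(const std::vector<HostPtr>& hosts, std::vector<uint32_t>& indexes,
                          uint32_t choice_count, std::mt19937_64& rng) {
  size_t size = indexes.size();
  HostPtr best;
  for (uint32_t i = 0; i < choice_count && size > 0; ++i) {
    const size_t r = rng() % size;
    const HostPtr& candidate = hosts[indexes[r]];
    std::swap(indexes[r], indexes[--size]); // drop this index from the candidate range
    if (best == nullptr || candidate->active_requests < best->active_requests) {
      best = candidate;
    }
  }
  return best;
}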

Contributor Author

Is there a simulator written anywhere in the codebase? There is DISABLED_SimulationTest, but it looks like it hard-codes the random load balancer.
Should I just use it with different policies, or are there better options?

Member

I think we had a benchmark that was working, but maybe the DISABLED one is broken. @tonya11en do you remember the history here? @antoniovicente has also been fixing some of the benchmarks and can help.

Member

I recall the simulations always being disabled, but one of them either wouldn't compile or suffered from a runtime error.

Contributor

The number of attempts is usually small. An alternate approach could be to track indexes picked so far in an array or set and try different random numbers until N different backend indexes are returned.

Another thing to consider is that hitting duplicates may be beneficial. Take the degenerate case of having only 2 backends and 2 pick attempts. This will result in the backend with the least connections being picked 3/4 of the time. If we have a situation where the backend with least requests is in that state because it is actively load shedding or behaving differently than the backends with higher connection counts, picking the backend with the least requests more often results in more requests failing.

Contributor

+1 for the first part of this comment: it would seem simpler to just keep a local set of considered indexes and pick a new random number if you wind up comparing the same ones. I think that would be easier to reason about from a memory perspective. You would have to handle the degenerate case of NumHosts <= LeastRequestCardinality, maybe by just setting a local LeastRequestCardinality to NumHosts-1 or similar.

@antoniovicente's second paragraph describes a scenario where IMO a user should avoid using LeastRequests in any case. I guess his argument is that such a choice would be less catastrophic with the current implementation than it would be with the new one.

Also FWIW, would recommend using absl::flat_hash_set rather than std::unordered_set for perf and size.
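
A sketch of the re-roll idea from the two comments above, assuming absl::flat_hash_set as suggested and clamping the effective choice count when there are few hosts (illustrative only, not the PR's code):

#include <algorithm>
#include <cstdint>
#include <memory>
#include <random>
#include <vector>

#include "absl/container/flat_hash_set.h"

struct Host { uint64_t active_requests = 0; };
using HostPtr = std::shared_ptr<Host>;

// Keep a small local set of already-considered indexes and re-roll on
// duplicates; no extra per-worker host list is needed.
HostPtr rerollPick(const std::vector<HostPtr>& hosts, uint32_t choice_count, std::mt19937_64& rng) {
  // Degenerate case: we cannot draw more distinct hosts than exist.
  const size_t effective_choices = std::min<size_t>(choice_count, hosts.size());
  absl::flat_hash_set<size_t> seen;
  HostPtr best;
  while (seen.size() < effective_choices) {
    const size_t idx = rng() % hosts.size();
    if (!seen.insert(idx).second) {
      continue; // duplicate sample, try another random index
    }
    if (best == nullptr || hosts[idx]->active_requests < best->active_requests) {
      best = hosts[idx];
    }
  }
  return best;
}

The expected number of re-rolls stays small as long as choice_count is much smaller than the number of hosts, which is the common configuration.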

@mattklein123 self-assigned this Apr 30, 2020
for (uint32_t choice_idx = 0; choice_idx < choice_count_; ++choice_idx) {
  const int rand_idx = random_.random() % hosts_to_use.size();
  HostSharedPtr sampled_host = hosts_to_use[rand_idx];
  uint64_t candidate_active_rq = 0;
Member

Assuming we keep this implementation can you add more comments about what is going on in this function? It would help readability.

for (uint32_t choice_idx = 0; choice_idx < choice_count; ++choice_idx) {
  const int rand_idx = random_.random() % size;
  HostSharedPtr sampled_host = hosts_to_use[rand_idx];
  std::swap(hosts_to_use[rand_idx], hosts_to_use[--size]);
Member

Above this line can you ASSERT(size >= 1)?


Ruslan Nigmatullin added 2 commits May 4, 2020 16:38
cr
Signed-off-by: Ruslan Nigmatullin <[email protected]>
Signed-off-by: Ruslan Nigmatullin <[email protected]>
@tonya11en
Member

tonya11en commented May 12, 2020

edit: I found the original issue, sorry I missed it. The point still stands though: I'm not entirely sure what this is helping with, given the probabilities below.

@euroelessar I think I'm missing something and need a bit of context. Can you clarify the problem this patch is solving? The reason I'm squinting here is that the Azar et al. paper that introduced P2C was explicit about the choices being independent (in addition to being uniformly random).

The benefit I see in your patch is that it slightly skews the distribution towards hosts with fewer outstanding requests. Looking at an extreme example of 3 hosts (a, b, c) with outstanding requests ordered (a < b < c), the probabilities for picking each one look like:

Independent selections

Host (by outstanding requests)   Selection probability
a                                ~50%
b                                ~33%
c                                ~16%

Unique choices

Host (by outstanding requests)   Selection probability
a                                ~66%
b                                ~33%
c                                0%

This doesn't seem to really affect selection probabilities for larger (>5) numbers of hosts in any appreciable way. I ran a quick simulation with 10000 selections below.

Forcing unique selections:

0: 1.000000
1: 0.000000
=====
0: 0.674100
1: 0.325900
2: 0.000000
=====
0: 0.504700
1: 0.331700
2: 0.163600
3: 0.000000
=====
0: 0.399500
1: 0.301200
2: 0.203000
3: 0.096300
4: 0.000000
=====
0: 0.328900
1: 0.266500
2: 0.204400
3: 0.136300
4: 0.063900
5: 0.000000

Independent selections (the original way):

0: 0.751100
1: 0.248900
=====
0: 0.557400
1: 0.336600
2: 0.106000
=====
0: 0.434700
1: 0.307100
2: 0.190800
3: 0.067400
=====
0: 0.367100
1: 0.278900
2: 0.198400
3: 0.115800
4: 0.039800
=====
0: 0.308700
1: 0.245100
2: 0.192900
3: 0.138700
4: 0.086400
5: 0.028200

If you do decide to go through with this patch and toggle the unique selections for lower numbers, the simulation results above might help.
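
For reference, a rough C++ reconstruction of the kind of Monte Carlo experiment behind the numbers above (the original was a quick Python script; the fixed ordering of outstanding requests is the same static assumption):

#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

// Compare P2C selection probabilities with independent vs. forced-unique
// choices. Host index doubles as its rank by outstanding requests, so the
// smaller index always wins a comparison.
int main() {
  std::mt19937_64 rng(42);
  constexpr int kTrials = 100000;
  for (int n = 2; n <= 6; ++n) {
    std::vector<int> indep_wins(n, 0), unique_wins(n, 0);
    for (int t = 0; t < kTrials; ++t) {
      const int a = static_cast<int>(rng() % n);
      const int b = static_cast<int>(rng() % n); // may collide with a
      ++indep_wins[std::min(a, b)];
      const int c = static_cast<int>(rng() % n);
      int d;
      do {
        d = static_cast<int>(rng() % n); // re-draw until the second pick differs
      } while (d == c);
      ++unique_wins[std::min(c, d)];
    }
    std::printf("n=%d (independent / unique):\n", n);
    for (int i = 0; i < n; ++i) {
      std::printf("  %d: %.3f / %.3f\n", i, indep_wins[i] / static_cast<double>(kTrials),
                  unique_wins[i] / static_cast<double>(kTrials));
    }
  }
  return 0;
}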

@antoniovicente
Contributor

/assign @antoniovicente

@SaveTheRbtz
Contributor

@tonya11en nice simulation! Couple of notes below.

  1. Seems like your model is a bit too static: in real life, once you assign a request to a backend it will have one more active request and will be less likely to be picked by P2C on the next iteration.

  2. The variable we want to optimize is not the backend load itself but some metric around request processing time. It is totally fine to have an imbalanced load on backends that have different performance characteristics.

The more advanced model would probably have:

  • backends with multimodal performance distributions (e.g. backends with different hardware.) Note that these should have bounded concurrency to reflect their finite performance.
  • requests arriving at independent random intervals (in other words requests do not synchronize.)
  • requests have log-normally distributed execution time (e.g. backend supports multiple types of APIs, most are very quick, some are quite slow though.) Note that execution time also depends on the chosen backend's performance.
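
A small sketch of how those ingredients could be generated with the standard <random> distributions; all parameters below are made-up illustrations, not measurements:

#include <cstdio>
#include <random>
#include <vector>

int main() {
  std::mt19937_64 rng(1);

  // Multimodal backend performance: e.g. a 50/50 mix of "fast" and "slow"
  // hardware, expressed as a per-backend service-time multiplier.
  std::bernoulli_distribution is_fast(0.5);
  std::vector<double> speed_factor;
  for (int i = 0; i < 4; ++i) {
    speed_factor.push_back(is_fast(rng) ? 1.0 : 2.5);
  }

  // Independent arrivals: exponential inter-arrival gaps (Poisson process).
  std::exponential_distribution<double> interarrival(/*rate per second=*/100.0);
  // Log-normally distributed base execution time: mostly quick, occasionally slow.
  std::lognormal_distribution<double> base_cost(/*log-mean=*/-4.0, /*log-stddev=*/1.0);

  double now = 0.0;
  for (int req = 0; req < 5; ++req) {
    now += interarrival(rng);
    const size_t backend = rng() % speed_factor.size();
    const double service_time = base_cost(rng) * speed_factor[backend];
    std::printf("t=%.4fs -> backend %zu, service time %.4fs\n", now, backend, service_time);
  }
  return 0;
}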

@jmarantz
Contributor

jmarantz commented May 15, 2020

This is very cool. Is the simulation in https://blog.envoyproxy.io/examining-load-balancing-algorithms-with-envoy-1be643ea121c in OSS? This question is for @tonya11en of course, not the author of this PR :)

@tonya11en
Member

@jmarantz for the blog post, I made a few changes to bufferbloater for the heterogeneous upstream tests and just reused some code from my old thesis for the selection histograms.

My previous comment was just something I cooked up in a Python shell.

@tonya11en
Member

@SaveTheRbtz I should have clarified that the point of the simulations was to come up with numbers that represent the probability that a request will be sent to a particular node; latency isn't considered or mentioned. You're right in your comments though, and they echo a lot of the points made in the blog post @jmarantz is referencing.

@euroelessar
Contributor Author

euroelessar commented May 17, 2020

I have written a small script for simulating different load balancing techniques:
https://github.com/euroelessar/load_balancing/blob/master/load_balancer_simulation.py

It doesn't take "overloaded backends" into account, but it does consider that different backends may have different performance. The simulation measures the time needed to finish serving all requests, assuming that total concurrency is bounded.

Example of a scenario where ensuring unique choices works noticeably better and is on par with a full scan:

$ python3 load_balancer_simulation.py --backends="1,2" --concurrency=2
Name                              Duration    Max Inflight Requests
------------------------------  ----------  -----------------------
least_requests_full_scan              6668                        1
least_requests_unique_2pc             6668                        1
least_requests_unique_4pc             6668                        1
least_requests_independent_2pc        7194                        2
least_requests_independent_4pc        7194                        2
least_requests_1_random               7496                        2
round_robin                           7500                        2
random                                7507                        2
$ python3 load_balancer_simulation.py --backends="1,2" --concurrency=16
Name                              Duration    Max Inflight Requests
------------------------------  ----------  -----------------------
least_requests_full_scan               834                        8
least_requests_unique_2pc              834                        8
least_requests_unique_4pc              834                        8
least_requests_independent_2pc         860                       13
least_requests_independent_4pc         860                       13
least_requests_1_random                938                       16
random                                 939                       16
round_robin                            939                       12

@stale

stale bot commented May 24, 2020

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label May 24, 2020
@@ -470,6 +480,9 @@ class LeastRequestLoadBalancer : public EdfLoadBalancerBase {
HostConstSharedPtr unweightedHostPick(const HostVector& hosts_to_use,
const HostsSource& source) override;
const uint32_t choice_count_;

// List of host indexes per HostsSource for fair random sampling.
std::unordered_map<HostsSource, std::vector<uint32_t>, HostsSourceHash> unweighted_host_indexes_;
Contributor

I would recommend enhancing test/common/upstream/load_balancer_benchmark.cc so that memory and memory_per_host are tracked by a LeastLoaded Build benchmark similar to BM_RoundRobinLoadBalancerBuild
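
A hedged sketch of what such a benchmark could look like using google/benchmark custom counters; buildLeastRequestLb() and currentlyAllocatedBytes() below are placeholder stubs standing in for whatever construction helper and memory probe the real load_balancer_benchmark.cc uses:

#include <cstdint>

#include "benchmark/benchmark.h"

// Placeholder stubs -- the real benchmark would construct a LeastRequestLoadBalancer
// over `num_hosts` hosts and read Envoy's memory-accounting test utilities.
static uint64_t currentlyAllocatedBytes() { return 0; }
static void buildLeastRequestLb(uint64_t num_hosts) { benchmark::DoNotOptimize(num_hosts); }

static void BM_LeastRequestLoadBalancerBuild(benchmark::State& state) {
  const uint64_t num_hosts = static_cast<uint64_t>(state.range(0));
  for (auto _ : state) {
    const uint64_t before = currentlyAllocatedBytes();
    buildLeastRequestLb(num_hosts);
    const uint64_t used = currentlyAllocatedBytes() - before;
    // Report memory next to timing, mirroring the per-host counters mentioned above.
    state.counters["memory"] = static_cast<double>(used);
    state.counters["memory_per_host"] = static_cast<double>(used) / num_hosts;
  }
}
BENCHMARK(BM_LeastRequestLoadBalancerBuild)->Arg(500)->Arg(5000)->Arg(50000);

BENCHMARK_MAIN();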

@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label May 26, 2020
@stale

stale bot commented Jun 2, 2020

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Jun 2, 2020
@jmarantz
Contributor

jmarantz commented Jun 6, 2020

What's the status of this? Are we going to push this forward?

@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Jun 6, 2020
@tonya11en
Member

> Is fully draining C in this case desirable? If the reason why backend A has less connections is because it is actively load shedding while backends B and C are operating normally, A will end up with an even larger share of the incoming requests when using the modified version of the least-requests load balancing algorithm.
>
> The behavior of the old algorithm which results in C getting a few new requests while A getting 3x as many as C and B getting 2x as many as C should help the 3 backends to end up with even load.
>
> For the 3 backend case, I wonder if there's a way to reduce the difference in rate of new requests going to A vs C. Something like having least-requests apply only 50% of the time, which would shift the request ratio when a<b<c to:
> A: 42% B: 33% C: 25%
> instead of:
> A: 50% B: 33% C: 16%
> or
> A: 67% B: 33% C: 0%

It's worth noting that those probabilities would be for only a single host selection. Over many host selections, the actual selection probabilities would look much different since the overwhelmed hosts will tend to have more outstanding requests. I don't think we should worry too much about the request ratio for a single host selection (I might be misunderstanding your comment, though).

@antoniovicente
Contributor

> Is fully draining C in this case desirable? If the reason why backend A has less connections is because it is actively load shedding while backends B and C are operating normally, A will end up with an even larger share of the incoming requests when using the modified version of the least-requests load balancing algorithm.
>
> The behavior of the old algorithm which results in C getting a few new requests while A getting 3x as many as C and B getting 2x as many as C should help the 3 backends to end up with even load.
>
> For the 3 backend case, I wonder if there's a way to reduce the difference in rate of new requests going to A vs C. Something like having least-requests apply only 50% of the time, which would shift the request ratio when a<b<c to:
> A: 42% B: 33% C: 25%
> instead of:
> A: 50% B: 33% C: 16%
> or
> A: 67% B: 33% C: 0%
>
> It's worth noting that those probabilities would be for only a single host selection. Over many host selections, the actual selection probabilities would look much different since the overwhelmed hosts will tend to have more outstanding requests. I don't think we should worry too much about the request ratio for a single host selection (I might be misunderstanding your comment, though).

I think the premise behind this change is that the selection probabilities matter. I'm pointing out that other users may have different preferences regarding least-requests backend selection behavior.

In any case, the biggest issues here are likely that we are not measuring the memory usage of least-request load balancers in benchmarks, and the concern about increased memory usage in cases where many hosts are associated with the cluster. When the number of hosts is significantly larger than the number of pick attempts, the auxiliary data structures are not necessary, since the outputs of the new and old algorithms end up being indistinguishable.

@stale

stale bot commented Jul 18, 2020

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Jul 18, 2020
Ruslan Nigmatullin added 5 commits July 22, 2020 21:50
Signed-off-by: Ruslan Nigmatullin <[email protected]>
Signed-off-by: Ruslan Nigmatullin <[email protected]>
Signed-off-by: Ruslan Nigmatullin <[email protected]>
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Jul 25, 2020
@mattklein123
Member

@euroelessar where are we at with this change? Is it ready for re-review? If so @snowp or @antoniovicente can one of you potentially drive review of this? I want to make sure we have a clear owner to drive this forward. Thank you!

@euroelessar
Contributor Author

@mattklein123 Please mark it as "waiting"; I'm implementing an optimization for low choice counts to avoid the memory overhead and adding memory benchmarks.
I'll let you know once it's ready for review.

@euroelessar
Contributor Author

euroelessar commented Aug 7, 2020

Unfortunately, supporting a pick count of 3-4 with constant performance leads to a complicated implementation, so I'm not sure about the trade-off anymore.

What do you think about me implementing a factory for the load balancing policy and providing this implementation as an explicit alternative option (assuming there is interest from upstream in supporting it in the future)?

LoadBalancingPolicy load_balancing_policy = 41;

@mattklein123
Member

@euroelessar see #5598. Would definitely appreciate this being worked on.

@stale

stale bot commented Aug 16, 2020

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Aug 16, 2020
@stale

stale bot commented Aug 23, 2020

This pull request has been automatically closed because it has not had activity in the last 14 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot closed this Aug 23, 2020
barroca added a commit to barroca/envoy that referenced this pull request Dec 2, 2023
Fixed a bug (envoyproxy#11006) that caused the Least Request load balancer policy to choose
the first host of the list when the number of requests are the same during a full scan. Start the selection from a random
index instead of 0.

Signed-off-by: Leonardo da Mata <[email protected]>
kyessenov added a commit that referenced this pull request Dec 9, 2023
Commit Message: Reverts #29873 and #30794

Multiple concerns about the effect of a full scan on LEAST_REQUEST have been raised.
See previous discussions in #11004 and #11006.

Additional Description:
Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
jkirschner-hashicorp pushed a commit to jkirschner-hashicorp/envoy that referenced this pull request Dec 21, 2023
Fixed a bug (envoyproxy#11006) that caused the Least Request load balancer policy to choose
the first host of the list when the number of requests are the same during a full scan. Start the selection from a random
index instead of 0.

Signed-off-by: Leonardo da Mata <[email protected]>
Labels
stale (stalebot believes this issue/PR has not been touched recently), waiting

Successfully merging this pull request may close these issues:

Least requests LB is not fair at edge cases

9 participants