load balancing: Check each host at most once in LeastRequests LB #11006
Conversation
Signed-off-by: Ruslan Nigmatullin <[email protected]>
Benchmark before:
After:
Signed-off-by: Ruslan Nigmatullin <[email protected]>
Signed-off-by: Ruslan Nigmatullin <[email protected]>
Thanks for working on this! The algorithm seems like an improvement, just some open questions around resource cost trade-offs for the implementation.
@@ -470,6 +476,9 @@ class LeastRequestLoadBalancer : public EdfLoadBalancerBase {
  HostConstSharedPtr unweightedHostPick(const HostVector& hosts_to_use,
                                        const HostsSource& source) override;
  const uint32_t choice_count_;

  // List of hosts per HostsSource for fair random sampling.
  std::unordered_map<HostsSource, HostVector, HostsSourceHash> unweighted_hosts_;
I wonder if there is a way to avoid having to allocate a second set of host lists - this change means a memory increase of O(hosts * worker threads) for any cluster using LR, which I imagine would be noticeable for deployments with a large number of endpoints.

Feels like we either need to make this not require this much additional memory (can we shuffle the host lists maintained on the HostSet instead? Not clear if we want to make that mutable) or make it possible to fall back to an algorithm that doesn't require tracking this additional state.

Another option would be to allocate a HostVector (or a std::vector<size_t> if you're just trying to shuffle the indices around) during load balancing - this would trade additional allocations + vector construction in the LB path for not having to maintain all these lists everywhere.
@jmarantz @mattklein123 for thoughts
One idea: what if we use the old implementation/algorithm for large numbers of Hosts, and this for small numbers of hosts? (with a configurable threshold perhaps). IIUC, these changes address issues mainly for smaller numbers of hosts. And with smaller numbers, the extra memory is negligible.
HostVector is a typedef for std::vector<HostSharedPtr>, so ultimately its memory footprint is equivalent to std::vector<size_t>, which is 8 extra bytes per host per worker on the amd64 architecture.

Allocating this vector on every pick attempt (especially assuming a large number of endpoints) is likely not viable from a CPU-usage point of view.
Yeah, my suggestion to use a size_t vector was based on me thinking copying std::shared_ptrs would be more expensive than generating a sequence of numbers, not as a way to reduce the memory footprint.

Gating the behavior on the number of hosts might give us the best of both worlds, though that of course begs the question of what the threshold should be?
Yeah, I agree that both copying this on every worker and every update is not optimal.

sizeof(shared_ptr) is typically bigger than 8. I think it's usually 16 (pointer + pointer to control block), but I'm not sure about the current stdlib implementation. Either way, not huge.

Per @snowp I think it would be possible to allocate a vector<uint32_t>, initialize it with indexes 1..n, and then do the same swapping stuff. This might be slightly more compact but requires some indirection back into the primary vector.

Per @ggreenway I agree it would be nice to potentially just turn on this optimization when hosts are below some size N?
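To make the vector<uint32_t> idea concrete, here is a rough sketch (not part of the patch; the function name and shape are made up for illustration). A persistent per-HostsSource index vector is partially shuffled in place, Fisher-Yates style, so each pick attempt sees distinct indexes without re-copying the host list:

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Rough sketch only: picks up to `choice_count` distinct indexes by partially
// shuffling a persistent per-HostsSource index vector (Fisher-Yates style).
// `indexes` is assumed to hold a permutation of 0..n-1 and remains a valid
// permutation afterwards, so the next pick can reuse it directly.
std::vector<uint32_t> pickDistinctIndexes(std::vector<uint32_t>& indexes, uint32_t choice_count,
                                          std::mt19937& rng) {
  const auto n = static_cast<uint32_t>(indexes.size());
  const uint32_t picks = std::min(choice_count, n);
  std::vector<uint32_t> out;
  out.reserve(picks);
  for (uint32_t i = 0; i < picks; ++i) {
    std::uniform_int_distribution<uint32_t> dist(i, n - 1);
    std::swap(indexes[i], indexes[dist(rng)]);
    out.push_back(indexes[i]); // indirection back into the primary HostVector
  }
  return out;
}
```

Each picked index would then be dereferenced against the primary host vector to compare active request counts, the same way the current patch does with host pointers.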
Is there a simulator written anywhere in the codebase? There is DISABLED_SimulationTest, and it looks like it hard-codes the random load balancer. Should I just use it with different policies, or are there better options?
I think we had a benchmark that was working, but maybe the DISABLED one is broken. @tonya11en do you remember the history here? @antoniovicente has also been fixing some of the benchmarks and can help.
I recall the simulations always being disabled, but one of them either wouldn't compile or suffered from a runtime error.
The number of attempts is usually small. An alternate approach could be to track indexes picked so far in an array or set and try different random numbers until N different backend indexes are returned.
Another thing to consider is that hitting duplicates may be beneficial. Take the degenerate case of having only 2 backends and 2 pick attempts. This will result in the backend with the least connections being picked 3/4 of the time (the more loaded backend wins only when both independent picks land on it, which happens 1/4 of the time). If we have a situation where the backend with the least requests is in that state because it is actively load shedding or behaving differently than the backends with higher connection counts, picking the backend with the least requests more often results in more requests failing.
+1 for the first part of this comment: it would seem simpler to just keep a local set of considered indexes and pick a new random number if you wind up comparing the same ones. I think that would be easier to reason about from a memory perspective. You would have to handle the degenerate case of NumHosts <= LeastRequestCardinality, maybe by just setting a local LeastRequestCardinality to NumHosts-1 or similar.

@antoniovicente's second paragraph describes a scenario where IMO a user should avoid using LeastRequests in any case. I guess the argument is that such a choice would be less catastrophic in the current implementation than it would be in the new one.

Also FWIW, I would recommend using absl::flat_hash_set rather than std::unordered_set for perf and size.
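For concreteness, a rough sketch of that retry-on-duplicate idea using absl::flat_hash_set (the function name and visitor callback are illustrative, not the real LB interface):

```cpp
#include <algorithm>
#include <cstdint>
#include <random>

#include "absl/container/flat_hash_set.h"

// Rough sketch only: draws up to `choice_count` distinct host indexes by
// re-rolling on duplicates, tracking indexes seen so far in a small set.
// The effective choice count is clamped to the host count so the loop always
// terminates (handles the NumHosts <= choice_count degenerate case).
template <class Visitor>
void visitDistinctRandomIndexes(uint64_t num_hosts, uint32_t choice_count, std::mt19937_64& rng,
                                Visitor visit) {
  if (num_hosts == 0) {
    return;
  }
  const uint64_t picks = std::min<uint64_t>(choice_count, num_hosts);
  absl::flat_hash_set<uint64_t> seen;
  seen.reserve(picks);
  while (seen.size() < picks) {
    const uint64_t idx = rng() % num_hosts;
    if (seen.insert(idx).second) {
      visit(idx); // e.g. compare active request counts and keep the best host so far
    }
  }
}
```

The expected number of re-rolls stays small as long as the choice count is much smaller than the host count, which matches the common choice_count = 2 configuration.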
for (uint32_t choice_idx = 0; choice_idx < choice_count_; ++choice_idx) {
  const int rand_idx = random_.random() % hosts_to_use.size();
  HostSharedPtr sampled_host = hosts_to_use[rand_idx];
  uint64_t candidate_active_rq = 0;
Assuming we keep this implementation can you add more comments about what is going on in this function? It would help readability.
for (uint32_t choice_idx = 0; choice_idx < choice_count; ++choice_idx) {
  const int rand_idx = random_.random() % size;
  HostSharedPtr sampled_host = hosts_to_use[rand_idx];
  std::swap(hosts_to_use[rand_idx], hosts_to_use[--size]);
Above this line can you ASSERT(size >= 1)?
…requests Signed-off-by: Ruslan Nigmatullin <[email protected]>
Signed-off-by: Ruslan Nigmatullin <[email protected]>
Signed-off-by: Ruslan Nigmatullin <[email protected]>
edit: I found the original issue, sorry I missed it. The point still stands though - I'm not entirely sure what this is helping with given the probabilities below.

@euroelessar I think I'm missing something and need a bit of context. Can you clarify the problem this patch is solving? The reason I'm squinting here is that the Azar et al. paper that introduces P2C was explicit about the choices being independent (in addition to uniformly at random).

The benefit I see here in your patch is that it slightly skews the distribution towards hosts with fewer outstanding requests. Looking at an extreme example of 3 hosts (a, b, c) with outstanding requests (a < b < c), the probabilities for picking each one look like:

Independent selections

Unique choices

This doesn't seem to really affect selection probabilities for larger (>5) numbers of hosts in any appreciable way. I ran a quick simulation with 10000 selections below.

Forcing unique selections:

Independent selections (the original way):

If you do decide to go through with this patch and toggle the unique selections for lower host counts, the simulation results above might help.
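For anyone who wants to reproduce that kind of tally, a minimal self-contained sketch (this is not the Python used above; the request counts and trial count are invented for illustration):

```cpp
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

// Counts how often each host wins a two-choice least-request pick over `trials`
// rounds. `distinct` toggles whether the two sampled indexes must differ.
std::vector<uint32_t> tallyWins(const std::vector<uint32_t>& active_rq, bool distinct,
                                uint32_t trials, std::mt19937& rng) {
  std::uniform_int_distribution<size_t> dist(0, active_rq.size() - 1);
  std::vector<uint32_t> wins(active_rq.size(), 0);
  for (uint32_t t = 0; t < trials; ++t) {
    const size_t first = dist(rng);
    size_t second = dist(rng);
    while (distinct && second == first) {
      second = dist(rng); // re-roll until the second choice is unique
    }
    // Ties go to the first choice; with strictly ordered counts this never matters.
    wins[active_rq[first] <= active_rq[second] ? first : second]++;
  }
  return wins;
}

int main() {
  std::mt19937 rng(42);
  const std::vector<uint32_t> active_rq = {1, 2, 3}; // hosts a < b < c (illustrative values)
  for (const bool distinct : {false, true}) {
    const auto wins = tallyWins(active_rq, distinct, 10000, rng);
    std::cout << (distinct ? "unique choices:      " : "independent choices: ");
    for (const uint32_t w : wins) {
      std::cout << w << " ";
    }
    std::cout << "\n";
  }
  return 0;
}
```

Analytically, for this 3-host case the expected split is 5/9 : 3/9 : 1/9 with independent choices and 2/3 : 1/3 : 0 with unique choices, which the tally above should approximate.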
/assign @antoniovicente
@tonya11en nice simulation! A couple of notes below.

The more advanced model would probably have:
This is very cool. Is the simulation in https://blog.envoyproxy.io/examining-load-balancing-algorithms-with-envoy-1be643ea121c in OSS? This question is for @tonya11en of course, not the author of this PR :)
@jmarantz for the blog post, I made a few changes to bufferbloater for the heterogeneous upstream tests and just reused some code from my old thesis for the selection histograms. My previous comment was just something I cooked up in a Python shell.

@SaveTheRbtz I should have clarified that the point in the simulations was to come up with numbers that represent the probability that a request will be sent to a particular node - latency isn't considered or mentioned. You're right in your comments though, and you're echoing a lot of the points made in the blog post @jmarantz is referencing.
I have written a small script for simulating different load balancing techniques:

It doesn't take "overloaded backends" into account, but it does consider that different backends may have different performance. The simulation measures the time needed to complete serving all requests, assuming that total concurrency is bounded.

An example of a scenario where ensuring unique choices works noticeably better and is on par with a full scan:
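The script and its results are not reproduced here, but a minimal sketch of that kind of model (fixed per-backend service times, a global concurrency cap, total drain time as the metric; every constant below is invented for illustration) could look like:

```cpp
#include <cstdint>
#include <iostream>
#include <queue>
#include <random>
#include <vector>

// Rough sketch of a drain-time simulation: backends have fixed per-request
// service times, total in-flight requests are capped, and we measure how long
// it takes to complete a fixed number of requests under a given picker.
struct Event {
  double finish;
  size_t backend;
};
struct Later {
  bool operator()(const Event& a, const Event& b) const { return a.finish > b.finish; }
};

template <class Picker>
double simulateDrainTime(const std::vector<double>& service_time, uint32_t total_requests,
                         uint32_t concurrency, Picker pick) {
  std::vector<uint32_t> outstanding(service_time.size(), 0);
  std::priority_queue<Event, std::vector<Event>, Later> inflight;
  double now = 0.0;
  uint32_t dispatched = 0;
  while (dispatched < total_requests || !inflight.empty()) {
    // Fill free concurrency slots at the current time.
    while (dispatched < total_requests && inflight.size() < concurrency) {
      const size_t b = pick(outstanding);
      outstanding[b]++;
      inflight.push({now + service_time[b], b});
      dispatched++;
    }
    // Advance to the next completion (inflight is non-empty here as long as concurrency >= 1).
    const Event e = inflight.top();
    inflight.pop();
    now = e.finish;
    outstanding[e.backend]--;
  }
  return now;
}

int main() {
  std::mt19937 rng(1);
  const std::vector<double> service_time = {1.0, 1.0, 4.0}; // one slow backend, invented numbers
  // Two-choice least-request picker; `distinct` forces the two choices to be unique.
  auto makePicker = [&rng](bool distinct) {
    return [&rng, distinct](const std::vector<uint32_t>& outstanding) {
      std::uniform_int_distribution<size_t> dist(0, outstanding.size() - 1);
      const size_t first = dist(rng);
      size_t second = dist(rng);
      while (distinct && second == first) {
        second = dist(rng);
      }
      return outstanding[first] <= outstanding[second] ? first : second;
    };
  };
  std::cout << "independent choices: " << simulateDrainTime(service_time, 1000, 3, makePicker(false)) << "\n";
  std::cout << "unique choices:      " << simulateDrainTime(service_time, 1000, 3, makePicker(true)) << "\n";
  return 0;
}
```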
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
@@ -470,6 +480,9 @@ class LeastRequestLoadBalancer : public EdfLoadBalancerBase {
  HostConstSharedPtr unweightedHostPick(const HostVector& hosts_to_use,
                                        const HostsSource& source) override;
  const uint32_t choice_count_;

  // List of host indexes per HostsSource for fair random sampling.
  std::unordered_map<HostsSource, std::vector<uint32_t>, HostsSourceHash> unweighted_host_indexes_;
I would recommend enhancing test/common/upstream/load_balancer_benchmark.cc so that memory and memory_per_host are tracked by a LeastLoaded Build benchmark similar to BM_RoundRobinLoadBalancerBuild
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
What's the status of this? Are we going to push this forward?
It's worth noting that those probabilities would be for only a single host selection. Over many host selections, the actual selection probabilities would look much different since the overwhelmed hosts will tend to have more outstanding requests. I don't think we should worry too much about the request ratio for a single host selection (I might be misunderstanding your comment, though).
I think the premise behind this change is that the selection probabilities matter. I'm pointing out that other users may have different preferences regarding least-requests backend selection behavior. In any case, the biggest issues here are likely that we are not measuring memory usage of least-request load balancers in benchmarks, and the concern about increased memory usage in cases where many hosts are associated with the cluster. When the number of hosts is significantly larger than the number of pick attempts, use of the auxiliary data structures is not necessary since the outputs of the new and old algorithms end up being indistinguishable.
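As a rough way to quantify "indistinguishable" (a standard birthday-style estimate, not something from the PR itself): with $n$ hosts and $k$ independent uniform picks, the probability that any two picks collide is

$$1 - \prod_{i=1}^{k-1}\left(1 - \frac{i}{n}\right) \approx \frac{k(k-1)}{2n},$$

so for the default $k = 2$ and, say, $n = 200$ hosts, only about 0.5% of pick attempts hit a duplicate, and the two algorithms behave identically on the rest.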
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
…requests Signed-off-by: Ruslan Nigmatullin <[email protected]>
…requests Signed-off-by: Ruslan Nigmatullin <[email protected]>
Signed-off-by: Ruslan Nigmatullin <[email protected]>
Signed-off-by: Ruslan Nigmatullin <[email protected]>
Signed-off-by: Ruslan Nigmatullin <[email protected]>
@euroelessar where are we at with this change? Is it ready for re-review? If so @snowp or @antoniovicente can one of you potentially drive review of this? I want to make sure we have a clear owner to drive this forward. Thank you!
@mattklein123 Please mark it as "waiting"; I'm implementing an optimization for low choice counts to avoid the memory overhead & adding memory benchmarks.
Unfortunately, supporting a pick count of 3-4 with constant performance leads to a complicated implementation, so I'm not sure about the tradeoff anymore. What do you think about me implementing a factory for load balancing policies & providing this implementation as an alternative explicit option (assuming there is interest from upstream in supporting it in the future)?
@euroelessar see #5598. Would definitely appreciate this being worked on.
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
This pull request has been automatically closed because it has not had activity in the last 14 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
Fixed a bug (envoyproxy#11006) that caused the Least Request load balancer policy to choose the first host in the list when the number of requests is the same during a full scan. Start the selection from a random index instead of 0. Signed-off-by: Leonardo da Mata <[email protected]>
Commit Message: Improve LeastRequests LB by picking a unique host endpoint at each iteration of the unweighted pick algorithm.
Additional Description: This change adds one more map to the LeastRequestLoadBalancer object with a copy of the hosts vector. This vector is partially shuffled at every pick attempt (sketched below).
Risk Level: medium (change in an actively used LB policy)
Testing: updated unit tests
Docs Changes: n/a
Release Notes: updated
Fixes #11004
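Not the actual diff, just a rough self-contained sketch of the pick loop the description refers to (the Host struct, type aliases, and function name are stand-ins for illustration, not Envoy's real types):

```cpp
#include <algorithm>
#include <cstdint>
#include <memory>
#include <random>
#include <vector>

// Toy stand-ins for Envoy types, for illustration only.
struct Host {
  uint64_t active_requests = 0;
};
using HostSharedPtr = std::shared_ptr<Host>;
using HostVector = std::vector<HostSharedPtr>;

// Sketch of the described approach: keep a per-source copy of the host list
// and partially shuffle it on each pick so every sampled host is distinct.
// In the LB, `shuffled` would live in a map keyed by HostsSource.
HostSharedPtr leastRequestPick(HostVector& shuffled, uint32_t choice_count, std::mt19937& rng) {
  if (shuffled.empty()) {
    return nullptr;
  }
  size_t size = shuffled.size();
  HostSharedPtr best;
  for (uint32_t i = 0; i < choice_count && size > 0; ++i) {
    const size_t rand_idx = std::uniform_int_distribution<size_t>(0, size - 1)(rng);
    HostSharedPtr sampled = shuffled[rand_idx];
    std::swap(shuffled[rand_idx], shuffled[--size]); // move the sampled host out of the pool
    if (best == nullptr || sampled->active_requests < best->active_requests) {
      best = sampled;
    }
  }
  return best;
}
```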