Start Full Scan from a random index for Least Request LB. #31146
Fixed a bug (envoyproxy#11006) that caused the Least Request load balancer policy to choose the first host of the list when the number of requests is the same during a full scan. Start the selection from a random index instead of 0. Signed-off-by: Leonardo da Mata <[email protected]>
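As a rough illustration of the change described above, here is a minimal standalone sketch of a full scan that starts at a given index and wraps around. It is not Envoy's implementation: the function name `fullScanFrom`, the plain vector of active request counts, and passing the start index as a parameter (so the caller supplies the random value) are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of Envoy. Scans every host exactly once,
// starting at start_idx and wrapping around, and returns the index of the
// first host encountered with the fewest active requests.
std::size_t fullScanFrom(const std::vector<uint32_t>& active_requests,
                         std::size_t start_idx) {
  assert(!active_requests.empty());
  const std::size_t n = active_requests.size();
  const std::size_t start = start_idx % n;
  std::size_t best = start;
  for (std::size_t i = 1; i < n; ++i) {
    const std::size_t idx = (start + i) % n;
    // Strict comparison: on a tie, the host seen earlier in the scan wins.
    if (active_requests[idx] < active_requests[best]) {
      best = idx;
    }
  }
  return best;
}
```

When every host has the same number of active requests, this returns the start index itself, so a uniformly random start spreads tied picks across all hosts instead of always returning index 0.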
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
Hi @barroca ! Can you elaborate on how starting from a random index fixes the problem in #11004 and what #11006 tried to address? What effect is this change trying to have on the selection probabilities?
We've had many discussions on this (see #11006 (comment)) that led to us leaving the P2C algorithm alone. Attempting to force a full scan, while guaranteeing selection of the host with the lowest active requests, opens the system up to herding behavior. While there are some applications where this would be useful, one of the key insights of the Azar et al. paper is that there is little additional benefit to performing more than two choices. We would also creep closer to herding behavior with each additional selection we perform, so Envoy’s default configuration of the least request load balancer performs two choices.
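For readers unfamiliar with P2C (power of two choices), a minimal sketch may help. It assumes the two samples are drawn uniformly and with replacement, and it uses a hypothetical `Host` struct and `pickP2C` function that are not Envoy's types or API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical stand-in for a host; only the active request count matters.
struct Host {
  uint32_t active_requests;
};

// Power of two choices: sample two hosts uniformly at random (with
// replacement) and return the index of the one with fewer active requests.
// On a tie, the first sample is kept.
std::size_t pickP2C(const std::vector<Host>& hosts, std::mt19937& rng) {
  assert(!hosts.empty());
  std::uniform_int_distribution<std::size_t> dist(0, hosts.size() - 1);
  const std::size_t a = dist(rng);
  const std::size_t b = dist(rng);
  return hosts[b].active_requests < hosts[a].active_requests ? b : a;
}
```

Because only two hosts are compared per pick, the least-loaded host is not guaranteed to win, which is what limits herding. Under the with-replacement assumption, both samples land on the same host with probability 1/n for n hosts.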
#11004 is a feature, not a bug with the LB algorithm.
Can you please update the description with what exactly the problem is with the selection probabilities and what effect this patch has on those selection probabilities? When it comes to modifying the behavior of the load balancing algorithms, there are many landmines we can step on, so the changes need more rigorous analysis.
It seems this PR is waiting for @barroca's response. Thanks! /wait
Thanks everyone for the reviews and discussion so far. The development started as an idea for solving this issue #11004 where there is a large probability of choosing the same host on a p2c algorithm when the number of hosts is small. Adding a full scan would prevent a random choice of the same host. The ideas behind the changes were:
The first point makes sense because the expected behaviour would be choosing the one with least requests. I need more time to read the paper, but I'm sure it has more information that I haven't considered. Perhaps we have an opportunity here to have only points 2 and 3, allowing an explicit full scan only, starting from a random index, that can be useful for a small number of hosts?
If the desired behavior is to unconditionally select the host with the least requests, it's fine to add a config parameter to use a full scan. This should not be the default (even for small host sets) or change the current behavior of any LBs. |
@tonya11en In the current implementation, I think we have a config option for this new feature.
@barroca I have to say sorry first. Now the enable full scan is reverted completely (except the API). Could you merge the All your work is super appreciated, thanks 🙏 |
Just to chime in with another use case for
I'm trying to enable a scenario where each backend only accepts 1 long-lived (websocket) connection at a time, so we want Envoy to route connections to backends with 0 connections. The This is all to say: I look forward to @barroca resubmitting the PR (combining the original and subsequent patches), and am very appreciative of all the discussion and review from maintainers! |
@wbpcode, @tonya11en : By when would you need a resubmitted, combined PR for there to be a reasonable chance for this to land in Envoy 1.29.0, assuming review goes the way you expect? (I also understand if we've already passed that window.) Thanks! |
Just to document the intended contents of the combined PR:
My understanding is that 2 (#30794) was only a problem because the first host was always selected if cluster stats are disabled. However, with the introduction of 3 (#31146), the host selected will be random if cluster stats are disabled. It seems like the advantage of including just 1 and 3 without 2 is that you'll consider each host only once even if choice count > num hosts. I defer to the maintainers on this point though (whether to include 1+3 or 1+2+3). |
@jkirschner-hashicorp I think now you can re-submit the combined PR. And @tonya11en has implemented a simulator in #30818 to validate the problem and solution.
Don't worry about the time window. We can backport this PR even if it has passed the window.
I have some bandwidth today to put up a PR that makes the selection method configurable (P2C vs. FULL_SCAN vs. ...). Let me know if you're already working on it and I'll just stand by to review when it's ready. |
I haven't started anything else yet. |
@barroca : In case you weren't in a position to move this forward at this time, I took a first pass yesterday at combining PRs 1 and 3 (omitting 2, because it seems unnecessary with the inclusion of 3): jkirschner-hashicorp#1. I also made some small changes to the docs/comments. Let me know how you'd like to proceed, happy to have you carry it forward as the original contributor! |
Happy for you to take over and merge the changes with the combined PRs :) It is OSS after all and I've made my contributions already. I can focus on something else once I have time.
@jkirschner-hashicorp if you've started this I'll leave it up to you, then. The only thing I want to make sure we do is to configure the full scan with an enum representing the selection method (P2C vs. FULL_SCAN) instead of a boolean as found in the original PRs. |
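To make the enum-versus-boolean point concrete, here is a small sketch of what an enum-based selection method could look like. The names `SelectionMethod`, `LeastRequestConfig`, and `hostsExamined` are hypothetical and chosen for illustration; they are not taken from Envoy's API or from the successor PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical enum: unlike a bool, it leaves room for further methods later.
enum class SelectionMethod { NChoices, FullScan };

// Hypothetical config; the defaults keep the existing two-choice behavior.
struct LeastRequestConfig {
  SelectionMethod selection_method = SelectionMethod::NChoices;
  uint32_t choice_count = 2;
};

// Number of host lookups a single pick performs under each method.
std::size_t hostsExamined(const LeastRequestConfig& cfg, std::size_t num_hosts) {
  assert(num_hosts > 0);
  switch (cfg.selection_method) {
  case SelectionMethod::FullScan:
    return num_hosts; // every host is visited once
  case SelectionMethod::NChoices:
    return cfg.choice_count; // random samples, possibly repeating a host
  }
  return 0; // unreachable for valid enum values
}
```

With a boolean, adding a third method would mean another flag and an ambiguity about which flag wins; with an enum, it is one more enumerator and one more `case`.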
@tonya11en : I started, though initially with the expectation that I was just repackaging the existing PRs/commits. I've never worked with Envoy's source code before and am not familiar with some of the constraints (e.g., whether the Either way, what are your thoughts on automatically using "full scan" if the number of choices configured is greater than the number of hosts? I was thinking of preserving that behavior, since the original motivation for stripping it out was that the first host was always selected if cluster stats were disabled (creating hotspots). That's no longer the case, now that the starting index of the full scan is random. It's more efficient if the choice count configured is >= the number of hosts. And, if an external control plane is integrating with a version of That said, I realize you might have downsides in mind that override the above. |
I'll make a pass at converting the Bool to an enum. I now have a local build environment and got the least request load balancer tests passing. |
Submitted a successor PR that uses an Enum rather than a Bool to specify the selection method (power of N choices or full scan): #31507 |
This should probably be closed in favor of #31507, which is based off of this and related patches.
// Choose a random index to start from preventing always picking the first host in the list.
const int rand_idx = random_.random() % hosts_to_use.size();
for (unsigned long i = 0; i < hosts_to_use.size(); i++) {
const HostSharedPtr& sampled_host = hosts_to_use[(rand_idx + i) % hosts_to_use.size()];
This is still going to be problematic as far as selection probabilities go. Consider some host vector with the following weights:
[9, 9, 1, 1, 1]
Choosing a random index to start the scan from would still choose the first host 80% of the time. The host at index 2 will only be picked 20% of the time, which seems unintuitive.
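The skew described in this comment can be checked by enumerating every possible start index. The sketch below rests on two assumptions that are not stated in the thread: the scan keeps the first host it encounters with the highest weight, and "the host at index 2" means the second host in the list. The function name `winsPerHost` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper. For each host, counts how many of the n possible
// start indices end up selecting it, assuming the scan wraps around and
// keeps the first host it meets with the highest weight.
std::vector<std::size_t> winsPerHost(const std::vector<uint32_t>& weights) {
  assert(!weights.empty());
  const std::size_t n = weights.size();
  std::vector<std::size_t> wins(n, 0);
  for (std::size_t start = 0; start < n; ++start) {
    std::size_t best = start;
    for (std::size_t i = 1; i < n; ++i) {
      const std::size_t idx = (start + i) % n;
      // Strict comparison: ties go to the host seen earlier in the scan.
      if (weights[idx] > weights[best]) {
        best = idx;
      }
    }
    ++wins[best];
  }
  return wins;
}
```

For `{9, 9, 1, 1, 1}` this gives 4 of 5 start positions for the first host and 1 of 5 for the second, matching the 80% and 20% figures. If the values are instead read as active request counts where the lowest wins, the same enumeration picks the three hosts with count 1 in a 3:1:1 ratio, so the host that follows a run of worse hosts is favored under either reading.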
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
closing in favour of #31507 |