Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Start Full Scan from a random index for Least Request LB. #31146

Conversation

barroca
Copy link
Contributor

@barroca barroca commented Dec 2, 2023

Fixed a bug (#11006) that caused the Least Request load balancer policy to choose the first host of the list when the number of requests are the same during a full scan. Start the selection from a random index instead of 0.

Fixed a bug (envoyproxy#11006) that caused the Least Request load balancer policy to choose
the first host of the list when the number of requests are the same during a full scan. Start the selection from a random
index instead of 0.

Signed-off-by: Leonardo da Mata <[email protected]>
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #31146 was opened by barroca.

see: more, trace.

Copy link
Member

@tonya11en tonya11en left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @barroca ! Can you elaborate on how starting from a random index fixes the problem in #11004 and what #11006 tried to address? What effect is this change trying to have on the selection probabilities?

We've had many discussions on this (see #11006 (comment)) that lead to us leaving the P2C algorithm alone. Attempting to force a full scan, while guaranteeing selection of the host with the lowest active requests, opens the system up to herding behavior. While there are some applications where this would be useful, one of the key insights of the Azar et. al. paper is that there is little additional benefit to performing more than two choices. We would also creep closer to herding behavior with each additional selection we perform, so Envoy’s default configuration of the least request load balancer performs two choices.

#11004 is a feature, not a bug with the LB algorithm.

Can you please update the description with what exactly the problem is with the selection probabilities and what effect this patch has on those selection probabilities? When it comes to modifying the behavior of the load balancing algorithms, there are many landmines we can step on, so the changes need more rigorous analysis.

@soulxu
Copy link
Member

soulxu commented Dec 8, 2023

it seems this PR is waiting for @barroca response. thanks!

/wait

@barroca
Copy link
Contributor Author

barroca commented Dec 8, 2023

Thanks everyone for the reviews and discussion so far. The development started as an idea for solving this issue #11004 where there is a large probability of choosing the same host on a p2c algorithm when the number of hosts is small. Adding a full scan would prevent a random choice of the same host.

The ideas behind the changes were:

  • Do a full scan when the number of choices is larger than equal the number of hosts in the list.
  • Allow a configuration to always use full scan of hosts for a Least Request LB.
  • and the latest patch was to start the full scan from a random index to avoid selecting the same host (which would be the first) when the number of requests per host is the same.

The first point makes sense because the expected behaviour would be choosing the one with least requests
The second point gives the choice of always using full scan which can make sense for small number of hosts.
The third point reduces a possible unfairness of the algorithm.

I need more time to read the paper, but I'm sure it has more information that I haven't considered. Perhaps we have an opportunity here to have only points 2 and 3, allowing a explicit full scan only starting from a random index that can be useful for small number of hosts ?

@tonya11en
Copy link
Member

If the desired behavior is to unconditionally select the host with the least requests, it's fine to add a config parameter to use a full scan. This should not be the default (even for small host sets) or change the current behavior of any LBs.

@wbpcode
Copy link
Member

wbpcode commented Dec 9, 2023

@tonya11en In current implementation, I think we have a config option to this new feature. enable_full_scan. The full scaning is not default behaviour.

@wbpcode
Copy link
Member

wbpcode commented Dec 9, 2023

@barroca I have to say sorry first. Now the enable full scan is reverted completely (except the API).

Could you merge the enable_full_scan implementation and new random index starting to this single PR as a complete new feature then the @tonya11en and me will re-review it.

All your work is super appreciated, thanks 🙏

@jkirschner-hashicorp
Copy link
Contributor

Just to chime in with another use case for enable_full_scan that is mentioned in a thread linked by @tonya11en :

the fact that your backends are basically only capable of effectively handling a single request at full capacity and are CPU bound would lend itself to being more problematic. It seems totally reasonable to do full-scan as an alternative LB for least-loaded.

I'm trying to enable a scenario where each backend only accepts 1 long-lived (websocket) connection at a time, so we want Envoy to route connections to backends with 0 connections. The enable_full_scan mode submitted by @barroca could be used to enable that scenario (randomly selecting a backend with 0 connections since they would all have the "least requests").

This is all to say: I look forward to @barroca resubmitting the PR (combining the original and subsequent patches), and am very appreciative of all the discussion and review from maintainers!

@jkirschner-hashicorp
Copy link
Contributor

@wbpcode, @tonya11en : By when would you need a resubmitted, combined PR for there to be a reasonable chance for this to land in Envoy 1.29.0, assuming review goes the way you expect? (I also understand if we've already passed that window.)

Thanks!

@jkirschner-hashicorp
Copy link
Contributor

Just to document the intended contents of the combined PR:

  1. Fix least request lb not fair #29873: Original PR that introduced the “enable_full_scan” option.
  2. fix: only enable full scan when enable_full_scan is set explicitly forleast request lb #30794: Don’t automatically use full scan mode even if ChoiceCount > number of hosts.
  3. Start Full Scan from a random index for Least Request LB. #31146: Starts the scan at a random index in the host array to prevent hotspotting (in Zoom’s case: on the first host with 0 active connections).

My understanding is that 2 (#30794) was only a problem because the first host was always selected if cluster stats are disabled. However, with the introduction of 3 (#31146), the host selected will be random if cluster stats are disabled.

It seems like the advantage of including just 1 and 3 without 2 is that you'll consider each host only once even if choice count > num hosts.

I defer to the maintainers on this point though (whether to include 1+3 or 1+2+3).

@wbpcode
Copy link
Member

wbpcode commented Dec 21, 2023

@jkirschner-hashicorp I think now you can re-submit the combined PR. And @tonya11en have implement a simulator in the #30818 to validate the problem and solution.

By when would you need a resubmitted, combined PR for there to be a reasonable chance for this to land in Envoy 1.29.0, assuming review goes the way you expect? (I also understand if we've already passed that window.)

Don't worry the time window. We can backport this PR even if it passed the window.

@tonya11en
Copy link
Member

I have some bandwidth today to put up a PR that makes the selection method configurable (P2C vs. FULL_SCAN vs. ...). Let me know if you're already working on it and I'll just stand by to review when it's ready.

@barroca
Copy link
Contributor Author

barroca commented Dec 21, 2023

I haven't started anything else yet.

@jkirschner-hashicorp
Copy link
Contributor

@barroca : In case you weren't in a position to move this forward at this time, I took a first pass yesterday at combining PRs 1 and 3 (omitting 2, because it seems unnecessary with the inclusion of 3): jkirschner-hashicorp#1. I also made some small changes to the docs/comments.

Let me know how you'd like to proceed, happy to have you carry it forward as the original contributor!

@barroca
Copy link
Contributor Author

barroca commented Dec 21, 2023

Happy for take over and merge the changes with the combined PRs :) It is OSS after all and I've me my contributions already. I can focus on something else once I have time.

@tonya11en
Copy link
Member

@jkirschner-hashicorp if you've started this I'll leave it up to you, then. The only thing I want to make sure we do is to configure the full scan with an enum representing the selection method (P2C vs. FULL_SCAN) instead of a boolean as found in the original PRs.

@jkirschner-hashicorp
Copy link
Contributor

jkirschner-hashicorp commented Dec 21, 2023

@tonya11en : I started, though initially with the expectation that I was just repackaging the existing PRs/commits. I've never worked with Envoy's source code before and am not familiar with some of the constraints (e.g., whether the enable_full_scan protobuf field needs to be kept, even though it was never in a released version, if we're switching to an Enum instead of a Bool). It may be more efficient for you to pick up from what's here rather than guide me in PR comments.


Either way, what are your thoughts on automatically using "full scan" if the number of choices configured is greater than the number of hosts? I was thinking of preserving that behavior, since the original motivation for stripping it out was that the first host was always selected if cluster stats were disabled (creating hotspots). That's no longer the case, now that the starting index of the full scan is random.

It's more efficient if the choice count configured is >= the number of hosts. And, if an external control plane is integrating with a version of go-control-plane that doesn't know about this new field yet, it could still take advantage of full scan mode by setting the choice count arbitrarily high.

That said, I realize you might have downsides in mind that override the above.

@jkirschner-hashicorp
Copy link
Contributor

I'll make a pass at converting the Bool to an enum. I now have a local build environment and got the least request load balancer tests passing.

@jkirschner-hashicorp
Copy link
Contributor

Submitted a successor PR that uses an Enum rather than a Bool to specify the selection method (power of N choices or full scan): #31507

Copy link
Member

@tonya11en tonya11en left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be closed in favor of #31507, which is based off of this and related patches.

Comment on lines +1304 to +1307
// Choose a random index to start from preventing always picking the first host in the list.
const int rand_idx = random_.random() % hosts_to_use.size();
for (unsigned long i = 0; i < hosts_to_use.size(); i++) {
const HostSharedPtr& sampled_host = hosts_to_use[(rand_idx + i) % hosts_to_use.size()];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still going to be problematic as far as selection probabilities go. Consider some host vector with the following weights:

[9, 9, 1, 1, 1]

Choosing a random index to start the scan from would still choose the first host 80% of the time. The host at index 2 will only be picked 20% of the time, which seems unintuitive.

Copy link

github-actions bot commented Feb 1, 2024

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added stale stalebot believes this issue/PR has not been touched recently and removed stale stalebot believes this issue/PR has not been touched recently labels Feb 1, 2024
@barroca
Copy link
Contributor Author

barroca commented Feb 7, 2024

closing in favour of #31507

@barroca barroca closed this Feb 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants