Start Full Scan from a random index for Least Request LB. #31146

barroca · 2023-12-02T16:01:37Z

Fixed a bug (#11006) that caused the Least Request load balancer policy to choose the first host in the list when the number of requests is the same across hosts during a full scan. The selection now starts from a random index instead of 0.

Signed-off-by: Leonardo da Mata <[email protected]>

repokitteh-read-only · 2023-12-02T16:01:47Z

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #31146 was opened by barroca.


tonya11en

Hi @barroca ! Can you elaborate on how starting from a random index fixes the problem in #11004 and what #11006 tried to address? What effect is this change trying to have on the selection probabilities?

We've had many discussions on this (see #11006 (comment)) that led to us leaving the P2C algorithm alone. Attempting to force a full scan, while guaranteeing selection of the host with the lowest active requests, opens the system up to herding behavior. While there are some applications where this would be useful, one of the key insights of the Azar et al. paper is that there is little additional benefit to performing more than two choices. We would also creep closer to herding behavior with each additional selection we perform, so Envoy’s default configuration of the least request load balancer performs two choices.

#11004 is a feature, not a bug with the LB algorithm.

Can you please update the description with what exactly the problem is with the selection probabilities and what effect this patch has on those selection probabilities? When it comes to modifying the behavior of the load balancing algorithms, there are many landmines we can step on, so the changes need more rigorous analysis.
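To make the distinction in this comment concrete, here is a minimal sketch of the two selection strategies being compared. The type alias and function names (`ActiveRequests`, `pickP2C`, `pickFullScan`) are hypothetical stand-ins rather than Envoy's actual code, and the sketch assumes the two P2C samples are drawn independently with replacement.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical stand-in for per-host active request counts; not Envoy's types.
using ActiveRequests = std::vector<uint64_t>;

// Power of two choices: sample two indices uniformly (with replacement, an
// assumption of this sketch) and keep the less loaded of the two.
std::size_t pickP2C(const ActiveRequests& rq, std::mt19937_64& rng) {
  assert(!rq.empty());
  std::uniform_int_distribution<std::size_t> dist(0, rq.size() - 1);
  const std::size_t a = dist(rng);
  const std::size_t b = dist(rng);
  return rq[b] < rq[a] ? b : a;
}

// Full scan: visit every host and return the first index with the minimum load.
// Every caller that sees the same counts picks the same host, which is where
// the herding concern comes from.
std::size_t pickFullScan(const ActiveRequests& rq) {
  assert(!rq.empty());
  std::size_t best = 0;
  for (std::size_t i = 1; i < rq.size(); ++i) {
    if (rq[i] < rq[best]) {
      best = i;
    }
  }
  return best;
}
```

With counts `{5, 3, 0, 7}`, `pickFullScan` always returns index 2, while `pickP2C` returns whichever of its two samples is less loaded.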

soulxu · 2023-12-08T05:06:00Z

It seems this PR is waiting for @barroca's response. Thanks!

/wait

barroca · 2023-12-08T13:58:50Z

Thanks everyone for the reviews and discussion so far. The development started as an idea for solving issue #11004, where there is a large probability of choosing the same host with the P2C algorithm when the number of hosts is small. Adding a full scan would prevent a random choice of the same host.
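As a rough illustration of why small host sets are affected, the sketch below enumerates every pair of draws and counts how often both land on the same host. It assumes the two choices are drawn independently and uniformly with replacement; `sameHostProbability` is a hypothetical helper, not part of Envoy.

```cpp
#include <cassert>
#include <cstddef>

// Exact probability that two independent uniform draws (with replacement) over
// `num_hosts` hosts land on the same host, computed by enumerating all pairs.
double sameHostProbability(std::size_t num_hosts) {
  assert(num_hosts > 0);
  std::size_t same = 0;
  for (std::size_t a = 0; a < num_hosts; ++a) {
    for (std::size_t b = 0; b < num_hosts; ++b) {
      if (a == b) {
        ++same;
      }
    }
  }
  return static_cast<double>(same) / static_cast<double>(num_hosts * num_hosts);
}
```

Under that assumption the result is 1/n: one half for two hosts and one quarter for four, so the comparison degenerates to a single random pick fairly often when n is small.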

The ideas behind the changes were:

1. Do a full scan when the number of choices is greater than or equal to the number of hosts in the list.
2. Allow a configuration option to always use a full scan of hosts for the Least Request LB.
3. The latest patch: start the full scan from a random index to avoid selecting the same host (which would be the first) when the number of requests per host is the same.

The first point makes sense because the expected behaviour would be to choose the host with the fewest requests.
The second point gives the choice of always using a full scan, which can make sense for a small number of hosts.
The third point reduces a possible unfairness of the algorithm.

I need more time to read the paper, but I'm sure it has more information that I haven't considered. Perhaps we have an opportunity here to have only points 2 and 3, allowing an explicit full scan that starts from a random index, which can be useful for a small number of hosts?
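The third idea can be sketched as a wrap-around scan that takes its starting position as a parameter. This is an illustration only: `fullScanFrom` and the `active_rq` vector are hypothetical, and the sketch assumes ties are resolved in favour of the first host visited.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scans every host exactly once, beginning at `start` and wrapping around.
// Returns the index of the first host visited that has the fewest active
// requests. `active_rq` is a hypothetical stand-in for per-host counts.
std::size_t fullScanFrom(const std::vector<uint64_t>& active_rq, std::size_t start) {
  assert(!active_rq.empty());
  const std::size_t n = active_rq.size();
  const std::size_t first = start % n;
  std::size_t best = first;
  for (std::size_t i = 1; i < n; ++i) {
    const std::size_t idx = (first + i) % n;
    if (active_rq[idx] < active_rq[best]) {
      best = idx;
    }
  }
  return best;
}
```

When all hosts are tied, the host at `start` wins, so a randomly drawn `start` spreads the picks instead of always returning index 0. When one host is strictly least loaded, the starting position makes no difference.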

tonya11en · 2023-12-08T18:38:01Z

If the desired behavior is to unconditionally select the host with the least requests, it's fine to add a config parameter to use a full scan. This should not be the default (even for small host sets) or change the current behavior of any LBs.

wbpcode · 2023-12-09T01:24:13Z

@tonya11en In the current implementation, I think we have a config option for this new feature: enable_full_scan. Full scanning is not the default behaviour.

wbpcode · 2023-12-09T01:39:59Z

@barroca I have to say sorry first. The enable_full_scan change has now been reverted completely (except the API).

Could you merge the enable_full_scan implementation and the new random starting index into this single PR as one complete new feature? Then @tonya11en and I will re-review it.

All your work is super appreciated, thanks 🙏

jkirschner-hashicorp · 2023-12-13T22:44:45Z

Just to chime in with another use case for enable_full_scan that is mentioned in a thread linked by @tonya11en :

the fact that your backends are basically only capable of effectively handling a single request at full capacity and are CPU bound would lend itself to being more problematic. It seems totally reasonable to do full-scan as an alternative LB for least-loaded.

I'm trying to enable a scenario where each backend only accepts 1 long-lived (websocket) connection at a time, so we want Envoy to route connections to backends with 0 connections. The enable_full_scan mode submitted by @barroca could be used to enable that scenario (randomly selecting a backend with 0 connections since they would all have the "least requests").

This is all to say: I look forward to @barroca resubmitting the PR (combining the original and subsequent patches), and am very appreciative of all the discussion and review from maintainers!

jkirschner-hashicorp · 2023-12-19T17:12:06Z

@wbpcode, @tonya11en : By when would you need a resubmitted, combined PR for there to be a reasonable chance for this to land in Envoy 1.29.0, assuming review goes the way you expect? (I also understand if we've already passed that window.)

Thanks!

jkirschner-hashicorp · 2023-12-20T23:14:17Z

Just to document the intended contents of the combined PR:

1. Fix least request lb not fair #29873: Original PR that introduced the “enable_full_scan” option.
2. fix: only enable full scan when enable_full_scan is set explicitly forleast request lb #30794: Don’t automatically use full scan mode even if ChoiceCount > number of hosts.
3. Start Full Scan from a random index for Least Request LB. #31146: Starts the scan at a random index in the host array to prevent hotspotting (in Zoom’s case: on the first host with 0 active connections).

My understanding is that 2 (#30794) was only a problem because the first host was always selected if cluster stats are disabled. However, with the introduction of 3 (#31146), the host selected will be random if cluster stats are disabled.

It seems like the advantage of including just 1 and 3 without 2 is that you'll consider each host only once even if choice count > num hosts.

I defer to the maintainers on this point though (whether to include 1+3 or 1+2+3).

wbpcode · 2023-12-21T02:00:37Z

@jkirschner-hashicorp I think you can re-submit the combined PR now. Also, @tonya11en has implemented a simulator in #30818 to validate the problem and solution.

By when would you need a resubmitted, combined PR for there to be a reasonable chance for this to land in Envoy 1.29.0, assuming review goes the way you expect? (I also understand if we've already passed that window.)

Don't worry about the time window. We can backport this PR even if it misses the window.

tonya11en · 2023-12-21T19:17:20Z

I have some bandwidth today to put up a PR that makes the selection method configurable (P2C vs. FULL_SCAN vs. ...). Let me know if you're already working on it and I'll just stand by to review when it's ready.

barroca · 2023-12-21T19:56:29Z

I haven't started anything else yet.

jkirschner-hashicorp · 2023-12-21T20:40:24Z

@barroca : In case you weren't in a position to move this forward at this time, I took a first pass yesterday at combining PRs 1 and 3 (omitting 2, because it seems unnecessary with the inclusion of 3): jkirschner-hashicorp#1. I also made some small changes to the docs/comments.

Let me know how you'd like to proceed, happy to have you carry it forward as the original contributor!

barroca · 2023-12-21T20:45:45Z

Happy for you to take over and merge the changes with the combined PRs :) It is OSS after all and I've made my contributions already. I can focus on something else once I have time.

tonya11en · 2023-12-21T21:07:34Z

@jkirschner-hashicorp if you've started this I'll leave it up to you, then. The only thing I want to make sure we do is to configure the full scan with an enum representing the selection method (P2C vs. FULL_SCAN) instead of a boolean as found in the original PRs.
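One way to picture the enum-based configuration requested here is the sketch below. All names (`SelectionMethod`, `LeastRequestSettings`, `hostsExamined`) are hypothetical and chosen for illustration; they are not taken from Envoy's API or source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration; not taken from Envoy's API.
enum class SelectionMethod { NChoices, FullScan };

struct LeastRequestSettings {
  // The default keeps the existing N-choices behaviour.
  SelectionMethod selection_method = SelectionMethod::NChoices;
  uint32_t choice_count = 2;
};

// Returns how many hosts a single pick examines under the given settings.
uint32_t hostsExamined(const LeastRequestSettings& settings, uint32_t num_hosts) {
  assert(num_hosts > 0);
  switch (settings.selection_method) {
  case SelectionMethod::FullScan:
    return num_hosts;
  case SelectionMethod::NChoices:
    return settings.choice_count;
  }
  return settings.choice_count; // Not reached; keeps compilers quiet.
}
```

Compared with a boolean, an enum leaves room to add further selection methods later without introducing another field, which fits the "P2C vs. FULL_SCAN vs. ..." framing used earlier in the thread.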

jkirschner-hashicorp · 2023-12-21T23:23:14Z

@tonya11en : I started, though initially with the expectation that I was just repackaging the existing PRs/commits. I've never worked with Envoy's source code before and am not familiar with some of the constraints (e.g., whether the enable_full_scan protobuf field needs to be kept, even though it was never in a released version, if we're switching to an Enum instead of a Bool). It may be more efficient for you to pick up from what's here rather than guide me in PR comments.

Either way, what are your thoughts on automatically using "full scan" if the number of choices configured is greater than the number of hosts? I was thinking of preserving that behavior, since the original motivation for stripping it out was that the first host was always selected if cluster stats were disabled (creating hotspots). That's no longer the case, now that the starting index of the full scan is random.

It's more efficient if the choice count configured is >= the number of hosts. And, if an external control plane is integrating with a version of go-control-plane that doesn't know about this new field yet, it could still take advantage of full scan mode by setting the choice count arbitrarily high.

That said, I realize you might have downsides in mind that override the above.
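The proposal in this comment amounts to a small decision rule, sketched below. `shouldFullScan` and its parameters are hypothetical, the `>=` threshold follows the wording a paragraph earlier, and nothing here indicates whether this behaviour was ultimately adopted.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: decide whether a pick should scan every host.
// `full_scan_requested` models an explicit configuration setting.
bool shouldFullScan(bool full_scan_requested, uint32_t choice_count, uint32_t num_hosts) {
  assert(num_hosts > 0);
  // Sampling at least as many times as there are hosts cannot examine more
  // distinct hosts than a scan that visits each host once.
  return full_scan_requested || choice_count >= num_hosts;
}
```

Under this rule a control plane that does not know about the new field could still trigger a full scan by setting the choice count to a value at least as large as the host count.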

jkirschner-hashicorp · 2023-12-22T18:42:28Z

I'll make a pass at converting the Bool to an enum. I now have a local build environment and got the least request load balancer tests passing.

jkirschner-hashicorp · 2023-12-23T22:04:35Z

Submitted a successor PR that uses an Enum rather than a Bool to specify the selection method (power of N choices or full scan): #31507

tonya11en

This should probably be closed in favor of #31507, which is based off of this and related patches.

tonya11en · 2024-01-02T19:37:48Z

source/common/upstream/load_balancer_impl.cc

+    // Choose a random index to start from preventing always picking the first host in the list.
+    const int rand_idx = random_.random() % hosts_to_use.size();
+    for (unsigned long i = 0; i < hosts_to_use.size(); i++) {
+      const HostSharedPtr& sampled_host = hosts_to_use[(rand_idx + i) % hosts_to_use.size()];


This is still going to be problematic as far as selection probabilities go. Consider some host vector with the following weights:

[9, 9, 1, 1, 1]

Choosing a random index to start the scan from would still choose the first host 80% of the time. The host at index 2 will only be picked 20% of the time, which seems unintuitive.
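The skew can be checked by enumerating every possible starting index. The sketch below reads the vector as active request counts (an assumption, since the comment calls them weights) and assumes a tie goes to the first host visited; `winsPerHost` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// For each possible start index, run a wrap-around scan that keeps the first
// host seen with the strictly smallest value, and count how often each host
// ends up selected.
std::vector<std::size_t> winsPerHost(const std::vector<uint64_t>& active_rq) {
  assert(!active_rq.empty());
  const std::size_t n = active_rq.size();
  std::vector<std::size_t> wins(n, 0);
  for (std::size_t start = 0; start < n; ++start) {
    std::size_t best = start;
    for (std::size_t i = 1; i < n; ++i) {
      const std::size_t idx = (start + i) % n;
      if (active_rq[idx] < active_rq[best]) {
        best = idx;
      }
    }
    ++wins[best];
  }
  return wins;
}
```

For `{9, 9, 1, 1, 1}` this yields `{0, 0, 3, 1, 1}`: under these assumptions the host at index 2 is selected for three of the five starting positions, and the hosts at indices 3 and 4 for one each. The exact split depends on the tie-breaking assumption, but the three tied hosts are not chosen uniformly.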

github-actions · 2024-02-01T20:01:06Z

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

barroca · 2024-02-07T16:12:47Z

closing in favour of #31507

repokitteh-read-only bot added the api label Dec 2, 2023

repokitteh-read-only bot assigned markdroth Dec 2, 2023

This was referenced Dec 2, 2023

Revert enable_full_scan #30812

Merged

Fix least request lb not fair #29873

Merged

tomwans approved these changes Dec 2, 2023


tonya11en suggested changes Dec 4, 2023


tonya11en mentioned this pull request Dec 4, 2023

Add tonya11en to CODEOWNERS for LR, random, and common LB #31171

Merged

repokitteh-read-only bot added the waiting label Dec 8, 2023

jkirschner-hashicorp mentioned this pull request Dec 23, 2023

Add FULL_SCAN selection mode to least request LB #31507

Merged

tonya11en suggested changes Jan 2, 2024


github-actions bot added and removed the stale label Feb 1, 2024

barroca closed this Feb 7, 2024
