
Endpoints weight are not respected from the start #31378

Closed · StupidScience opened this issue Dec 14, 2023 · 7 comments

Comments

StupidScience (Contributor) commented Dec 14, 2023

Title: Endpoints weight are not respected from the start

Description:

When there are multiple endpoints with different weights (100 and 10 in our case), the endpoints with lower weight are not called at all at first. In local testing with these weights, at least the first 50 requests all go to the endpoint with the higher weight. In our environment it sometimes takes many more requests before the lower-weighted endpoints are reached, which means that endpoints added to low-traffic services may get no traffic at all for hours.

Repro steps:
I tested locally with this Envoy config:

node:
  cluster: cluster-1
  id: envoy-instance-1
admin:
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 9901
  access_log:
  - name: envoy.access_loggers.file
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
static_resources:
  clusters:
  - name: cluster_0
    connect_timeout: 5s
    type: STATIC
    load_assignment:
      cluster_name: cluster_0
      endpoints:
      - lb_endpoints:
        - load_balancing_weight: 100
          endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 5050
      - lb_endpoints:
        - load_balancing_weight: 10
          endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 5051
  listeners:
  - name: listener_0
    address:
      socket_address:
        address: 127.0.0.1
        port_value: 10000
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: listener_http
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          route_config:
            virtual_hosts:
            - name: route_0
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: cluster_0
  - name: primary
    address:
      socket_address:
        address: 127.0.0.1
        port_value: 5050
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: primary
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          route_config:
            virtual_hosts:
            - name: primary
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                direct_response:
                  status: 200
                  body:
                    inline_string: primary
  - name: canary
    address:
      socket_address:
        address: 127.0.0.1
        port_value: 5051
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: canary
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          route_config:
            virtual_hosts:
            - name: canary
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                direct_response:
                  status: 200
                  body:
                    inline_string: canary
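
Assuming the config above is saved as envoy.yaml, Envoy can be started locally with something like:

envoy -c envoy.yaml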

And I was just curling in a loop like this:

for i in {1..10}; do echo "Iteration ${i}:" && for j in {1..50}; do curl -s localhost:10000 && echo ""; done | sort | uniq -c; done

It almost always returns 50 primary for the first iteration. Example output:

Iteration 1:
  50 primary
Iteration 2:
   6 canary
  44 primary
Iteration 3:
   4 canary
  46 primary
Iteration 4:
   5 canary
  45 primary
Iteration 5:
   5 canary
  45 primary
Iteration 6:
   3 canary
  47 primary
Iteration 7:
   7 canary
  43 primary
Iteration 8:
   3 canary
  47 primary
Iteration 9:
   4 canary
  46 primary
Iteration 10:
   5 canary
  45 primary

Let me know if you need any other debug information; I just do not think it is relevant here.
Tested on Envoy versions 1.24.2 and 1.28.0; I was able to reproduce on both.

StupidScience added the bug and triage (Issue requires triage) labels Dec 14, 2023
wbpcode added the area/load balancing label and removed the triage (Issue requires triage) label Dec 27, 2023
wbpcode (Member) commented Dec 27, 2023

Hmmm, from my perspective this distribution looks reasonable, because from your configuration only 1 in 11 requests should be routed to canary.
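
(Worked out from the numbers above: with weights 100 and 10, the expected canary share is 10 / (100 + 10) ≈ 9%, i.e. roughly 4-5 of every 50 requests. That matches iterations 2-10 in the output above, but not iteration 1, where canary receives 0 of 50.)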

wbpcode (Member) commented Dec 27, 2023

cc @tonya11en, who is more familiar with this area than I am.

StupidScience (Contributor, Author) commented:

@wbpcode the overall distribution is okay; the problem is the first N requests, which always hit the higher-weighted endpoints. With more endpoints, even more of the initial requests would hit only the higher-weighted endpoints.

tonya11en (Member) commented:

This is a known problem with the EDF scheduler we use in the Least Request LB. EDF is essentially a min-heap of hosts keyed by a deadline, where each host's deadline advances by 1/weight on every pick. This means that the first picks go to the higher-weighted hosts, since their deadlines are smaller. Selection probabilities will not be accurate until an entire "cycle" is completed. We call this the "first-pick problem".

If you'd like to understand what's going on a bit better, take a look at the issue opened to implement an alternative scheduler and the various comments linked in the description. The reason we haven't switched out the EDF scheduler is that it is resilient to host weights changing. Alternative scheduling disciplines avoid this first-pick problem by rebuilding some data structure after each change in host weights, which for the least request LB is after every host selection. The alternative schedulers would be fine for the weighted round-robin LB, though.
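
As an illustration (a minimal standalone simulation in Python, not Envoy's actual C++ scheduler), a deadline-based EDF pick loop with the host names and weights from the repro config behaves like this:

import heapq
from fractions import Fraction

def edf_picks(weights, n):
    """Toy weighted-EDF pick loop: each host's deadline advances by 1/weight,
    and the host with the earliest deadline is picked next (min-heap)."""
    # Heap entries: (deadline, tie-break index, host name).
    heap = [(Fraction(1, w), i, name) for i, (name, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    picks = []
    for _ in range(n):
        deadline, idx, name = heapq.heappop(heap)
        picks.append(name)
        heapq.heappush(heap, (deadline + Fraction(1, weights[name]), idx, name))
    return picks

picks = edf_picks({"primary": 100, "canary": 10}, 110)
print(picks.index("canary"))                           # 10: the first 10 picks all go to primary
print(picks.count("primary"), picks.count("canary"))   # 100 10: weights only hold over a full cycle

In this toy model the canary is not chosen until the 11th pick; the larger the weight ratio and the more higher-weighted hosts there are, the longer the canary-free prefix becomes.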

I think @adisuissa was looking at picking this work up a few weeks ago, but I'm not sure what the status is now.

StupidScience (Contributor, Author) commented:

@tonya11en thanks for the insights, really helpful. While there is no solution to this particular problem yet, are you aware of any possible workarounds?

tonya11en (Member) commented:

The only workaround I can think of is to avoid using the EDF scheduler. I don't know the specifics of your situation, but the only sane option that comes to mind involves removing the endpoint weights.

Use the LEAST_REQUEST load balancer with no endpoint weights set. This would rely on the load balancer scaling the endpoint weight based on the number of outstanding requests. The EDF scheduler isn't used in this case.
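
For illustration, applied to the repro config above, a LEAST_REQUEST cluster without endpoint weights might look like this minimal sketch:

clusters:
- name: cluster_0
  connect_timeout: 5s
  type: STATIC
  lb_policy: LEAST_REQUEST   # default is ROUND_ROBIN
  load_assignment:
    cluster_name: cluster_0
    endpoints:
    - lb_endpoints:
      - endpoint:            # per the comment above: with no load_balancing_weight set,
          address:           # the EDF scheduler isn't used for this LB
            socket_address:
              address: 127.0.0.1
              port_value: 5050
      - endpoint:
          address:
            socket_address:
              address: 127.0.0.1
              port_value: 5051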

StupidScience (Contributor, Author) commented:

Thanks. I'm closing this one since a more specific issue exists.
