
Endpoints weight are not respected from the start #31378

Closed · StupidScience opened this issue Dec 14, 2023 · 7 comments

Comments

StupidScience (Contributor) commented Dec 14, 2023

Title: Endpoints weight are not respected from the start

Description:

When there are multiple endpoints with different weights (100 and 10 in our case), the endpoints with lower weight are not called at all at first. In local testing with these weights, at least the first 50 requests all go to the endpoint with the higher weight. In our environment it sometimes takes many more requests before the lower-weighted endpoints are reached, which means that endpoints added to low-traffic services may get no traffic at all for hours.

Repro steps:
I tested locally with this Envoy config:

node:
  cluster: cluster-1
  id: envoy-instance-1
admin:
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 9901
  access_log:
  - name: envoy.access_loggers.file
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
static_resources:
  clusters:
  - name: cluster_0
    connect_timeout: 5s
    type: STATIC
    load_assignment:
      cluster_name: cluster_0
      endpoints:
      - lb_endpoints:
        - load_balancing_weight: 100
          endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 5050
      - lb_endpoints:
        - load_balancing_weight: 10
          endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 5051
  listeners:
  - name: listener_0
    address:
      socket_address:
        address: 127.0.0.1
        port_value: 10000
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: listener_http
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          route_config:
            virtual_hosts:
            - name: route_0
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: cluster_0
  - name: primary
    address:
      socket_address:
        address: 127.0.0.1
        port_value: 5050
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: primary
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          route_config:
            virtual_hosts:
            - name: primary
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                direct_response:
                  status: 200
                  body:
                    inline_string: primary
  - name: canary
    address:
      socket_address:
        address: 127.0.0.1
        port_value: 5051
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: canary
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          route_config:
            virtual_hosts:
            - name: canary
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                direct_response:
                  status: 200
                  body:
                    inline_string: canary
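
Assuming the config above is saved as envoy.yaml, Envoy can be started locally with something like:

envoy -c envoy.yaml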

And I was just curling in a loop like this:

for i in {1..10}; do echo "Iteration ${i}:" && for j in {1..50}; do curl -s localhost:10000 && echo ""; done | sort | uniq -c; done

It almost always returns 50 primary for the first iteration. Example output:

Iteration 1:
  50 primary
Iteration 2:
   6 canary
  44 primary
Iteration 3:
   4 canary
  46 primary
Iteration 4:
   5 canary
  45 primary
Iteration 5:
   5 canary
  45 primary
Iteration 6:
   3 canary
  47 primary
Iteration 7:
   7 canary
  43 primary
Iteration 8:
   3 canary
  47 primary
Iteration 9:
   4 canary
  46 primary
Iteration 10:
   5 canary
  45 primary

Let me know if you need any other debug information; I just do not think it is relevant here.
Tested on Envoy versions 1.24.2 and 1.28.0; I was able to reproduce on both.

StupidScience added the bug and triage (Issue requires triage) labels Dec 14, 2023
wbpcode added the area/load balancing label and removed the triage (Issue requires triage) label Dec 27, 2023
wbpcode (Member) commented Dec 27, 2023

Hmmm, from my perspective this distribution looks reasonable, because from your configuration only 1 in 11 requests should be routed to canary.
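
(Worked out from the numbers above: with weights 100 and 10, the expected canary share is 10 / (100 + 10) ≈ 9%, i.e. roughly 4-5 of every 50 requests. That matches iterations 2-10 in the output above, but not iteration 1, where canary receives 0 of 50.)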

wbpcode (Member) commented Dec 27, 2023

cc @tonya11en, who is more familiar with this area than I am.

StupidScience (Contributor, Author) commented:

@wbpcode the overall distribution is okay; the problem is the first N requests, which always hit the higher-weighted endpoints. With more endpoints, even more of the initial requests would hit only the higher-weighted endpoints.

tonya11en (Member) commented:

This is a known problem with the EDF scheduler we use in the Least Request LB. EDF is essentially a min-heap of hosts keyed by a deadline, where each host's deadline advances by 1/weight on every pick. This means that the first picks go to the higher-weighted hosts, since their deadlines are smaller. Selection probabilities will not be accurate until an entire "cycle" is completed. We call this the "first-pick problem".

If you'd like to understand what's going on a bit better, take a look at the issue opened to implement an alternative scheduler and the various comments linked in the description. The reason we haven't switched out the EDF scheduler is that it is resilient to host weights changing. Alternative scheduling disciplines avoid this first-pick problem by rebuilding some data structure after each change in host weights, which for the least request LB is after every host selection. The alternative schedulers would be fine for the weighted round-robin LB, though.
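
As an illustration (a minimal standalone simulation in Python, not Envoy's actual C++ scheduler), a deadline-based EDF pick loop with the host names and weights from the repro config behaves like this:

import heapq
from fractions import Fraction

def edf_picks(weights, n):
    """Toy weighted-EDF pick loop: each host's deadline advances by 1/weight,
    and the host with the earliest deadline is picked next (min-heap)."""
    # Heap entries: (deadline, tie-break index, host name).
    heap = [(Fraction(1, w), i, name) for i, (name, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    picks = []
    for _ in range(n):
        deadline, idx, name = heapq.heappop(heap)
        picks.append(name)
        heapq.heappush(heap, (deadline + Fraction(1, weights[name]), idx, name))
    return picks

picks = edf_picks({"primary": 100, "canary": 10}, 110)
print(picks.index("canary"))                           # 10: the first 10 picks all go to primary
print(picks.count("primary"), picks.count("canary"))   # 100 10: weights only hold over a full cycle

In this toy model the canary is not chosen until the 11th pick; the larger the weight ratio and the more higher-weighted hosts there are, the longer the canary-free prefix becomes.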

I think @adisuissa was looking at picking this work up a few weeks ago, but I'm not sure what the status is now.

StupidScience (Contributor, Author) commented:

@tonya11en thanks for the insights, really helpful. While there is no solution to this particular problem yet, are you aware of any possible workarounds?

tonya11en (Member) commented:

The only workaround I can think of is to avoid using the EDF scheduler. I don't know the specifics of your situation, but the only sane option that comes to mind involves removing the endpoint weights.

Use the LEAST_REQUEST load balancer with no endpoint weights set. This would rely on the load balancer scaling the endpoint weight based on the number of outstanding requests. The EDF scheduler isn't used in this case.
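
For illustration, applied to the repro config above, a LEAST_REQUEST cluster without endpoint weights might look like this minimal sketch:

clusters:
- name: cluster_0
  connect_timeout: 5s
  type: STATIC
  lb_policy: LEAST_REQUEST   # default is ROUND_ROBIN
  load_assignment:
    cluster_name: cluster_0
    endpoints:
    - lb_endpoints:
      - endpoint:            # per the comment above: with no load_balancing_weight set,
          address:           # the EDF scheduler isn't used for this LB
            socket_address:
              address: 127.0.0.1
              port_value: 5050
      - endpoint:
          address:
            socket_address:
              address: 127.0.0.1
              port_value: 5051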

StupidScience (Contributor, Author) commented:

Thanks. I'm closing this one since a more specific issue exists.
