Load balance long-lived connections upon auto scale-out or scale-in of Envoy #15283

Open
sha-rath opened this issue Mar 3, 2021 · 13 comments

@sha-rath

sha-rath commented Mar 3, 2021

Consider a scenario where Envoy is auto scaled out. In such a situation, I would like to drain a few connections from an old instance and transfer that load to the newly scaled-out instance. Currently, I cannot find a way to drain only a certain number of the connections an Envoy instance handles.

Would it be possible to have some sort of configuration where we could set the percentage of connections to drain, based on parameters like the number of connections per Envoy instance or the total number of connections?

@sha-rath sha-rath added enhancement Feature requests. Not bugs or questions. triage Issue requires triage labels Mar 3, 2021
@lambdai
Contributor

lambdai commented Mar 3, 2021

+1 for the admin interface. A listener or network filter probably doesn't know how many connections should go away. However, I feel it is reasonable to expose this as a one-time action and let some external tool trigger the rebalance.

@alyssawilk alyssawilk removed the triage Issue requires triage label Mar 3, 2021
@mattklein123
Member

See https://www.envoyproxy.io/docs/envoy/latest/operations/admin#post--drain_listeners. I think it would be reasonable to allow this API to take parameters that affect the drain speed. I think we originally discussed it when it was added. cc @auni53

@mattklein123 mattklein123 added area/admin help wanted Needs help! and removed enhancement Feature requests. Not bugs or questions. labels Mar 3, 2021
@twghu
Contributor

twghu commented Mar 16, 2021

I've been reviewing this.

@mattklein123 are there notes/proposals relating to "I think we originally discussed it when it was added"?

@auni53
Contributor

auni53 commented Mar 16, 2021

The suggestions in this thread sound reasonable to me. I'm not clear on where you're seeing drain-speed; drain-time refers to the duration of the drain period. See my original design doc, which might be a useful illustration.

The original draining behaviour, i.e., drain-strategy=immediate, uses an algorithm where the probability of a request being drained scales with progress into the drain period. I left a comment describing this. That would be the simplest place to add a new API mode or parameters.
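
(For illustration only, a minimal sketch of that behaviour with simplified, hypothetical names, not the actual Envoy source:)

#include <chrono>
#include <random>

// Sketch of the "immediate" drain strategy: the further we are into the drain
// period, the more likely a given request is told to drain.
class DrainDecisionSketch {
public:
  explicit DrainDecisionSketch(std::chrono::milliseconds drain_time)
      : drain_time_(drain_time), start_(std::chrono::steady_clock::now()) {}

  // Returns true with probability elapsed / drain_time, so early in the drain
  // window few requests are drained and by the end nearly all are.
  bool drainClose() {
    const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start_);
    if (elapsed >= drain_time_) {
      return true;
    }
    std::uniform_int_distribution<int64_t> dist(0, drain_time_.count() - 1);
    return dist(rng_) < elapsed.count();
  }

private:
  const std::chrono::milliseconds drain_time_;
  const std::chrono::steady_clock::time_point start_;
  std::mt19937_64 rng_{std::random_device{}()};
};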

@lambdai
Contributor

lambdai commented Mar 17, 2021

IIRC the current DrainManager controls the drain hint, but it ends up with a single action: close all connections.

A partial drain requires a controllable, deterministic hint; e.g. it is not ideal if it answers "should drain" and "should not drain" for the same connection. Also, the action should be to close only the preselected connections.
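
(Purely as an illustration of that requirement, and not an existing Envoy API: a hypothetical sketch that picks the drained subset from a stable property such as the connection ID, so asking about the same connection always yields the same answer.)

#include <cstdint>
#include <functional>

// Hypothetical: the same connection_id always falls in the same bucket, so a
// connection is either preselected for draining or not, never both.
bool shouldDrainConnection(uint64_t connection_id, uint32_t drain_percent) {
  const uint64_t bucket = std::hash<uint64_t>{}(connection_id) % 100;
  return bucket < drain_percent;
}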

@twghu
Contributor

twghu commented Mar 22, 2021

@auni53 If we are going to drain a percentage of connections, ideally we should be able to know, at the time of the drain call, how many connections there are. There are two problems that I see with this:

  1. DrainManagerImpl::drainClose() is called frequently in header processing, so latency is important, and
  2. determining whether a connection should be drained relies on knowledge of the total number of active connections.

There is numConnections() on ListenerManager, to wit:

uint64_t ListenerManagerImpl::numConnections() const {
  uint64_t num_connections = 0;
  for (const auto& worker : workers_) {
    num_connections += worker->numConnections();
  }

  return num_connections;
}

This cycles over the workers, which, when draining is active, could lead to unacceptable overhead.

I wonder, is there a preferred method of obtaining (with the least overhead) the number of active downstream connections that is not going to impact performance?

@mattklein123
Member

I wonder, is there a preferred method of obtaining (with the least overhead) the number of active downstream connections that is not going to impact performance?

We don't have this today, but we could add it. A couple of quick ideas: 1) Give access to connections on the current worker only (simple, might be good enough). 2) Do some periodic calculation that puts total connections into an atomic variable that can be read, potentially specific to the listener, server, etc. We already basically do this here:

server_stats_->total_connections_.set(listener_manager_->numConnections() +
                                      parent_stats.parent_connections_);

So we could add that information on the stats flush interval or some other interval. I doubt your use case requires it to be exact?
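
(A rough sketch of idea 2, with hypothetical names; the real implementation would presumably hang off the existing stats flush path:)

#include <atomic>
#include <cstdint>
#include <utility>
#include <vector>

struct WorkerSketch {
  std::atomic<uint64_t> num_connections{0};
};

class ConnectionCountCache {
public:
  explicit ConnectionCountCache(std::vector<WorkerSketch*> workers)
      : workers_(std::move(workers)) {}

  // Called on a flush interval (e.g. the stats flush timer), not per request.
  void refresh() {
    uint64_t total = 0;
    for (const WorkerSketch* worker : workers_) {
      total += worker->num_connections.load(std::memory_order_relaxed);
    }
    cached_total_.store(total, std::memory_order_relaxed);
  }

  // Cheap read from any thread; slightly stale, which is acceptable for
  // making drain decisions.
  uint64_t totalConnections() const {
    return cached_total_.load(std::memory_order_relaxed);
  }

private:
  std::vector<WorkerSketch*> workers_;
  std::atomic<uint64_t> cached_total_{0};
};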

@twghu
Contributor

twghu commented Mar 25, 2021

Hey @mattklein123 I like the "So could add that information on the stats flush interval or some other interval. I doubt your use case requires it to be exact?". Thanks.

@hobbytp

hobbytp commented May 19, 2021

@twghu @lambdai @mattklein123 will the solution discussed here impact both TCP and HTTP connections, or only HTTP connections?

@lambdai
Contributor

lambdai commented May 19, 2021

@twghu @lambdai @mattklein123 will the solution discussed here impact both TCP and HTTP connections, or only HTTP connections?

  1. The connection shutdown should impact both TCP and HTTP.
  2. Upon connection shutdown, only HTTP connections support a "graceful shutdown".

@sha-rath
Author

The new feature (max_requests_per_connection for downstream connections) to be introduced in v1.20 will hopefully solve my issue of rebalancing long-lived connections. When this parameter is set, once the limit on the maximum number of requests per connection is reached, the connection will be closed and the external load balancer will be forced to open a new connection.
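
(Conceptually, and with hypothetical names rather than the actual Envoy implementation, the mechanism looks something like this:)

#include <cstdint>

// Sketch of the downstream max_requests_per_connection idea: once a connection
// has served N requests, signal that it should be drained (e.g. HTTP/2 GOAWAY
// or "Connection: close"), forcing the external load balancer to reconnect and
// be re-balanced onto the current set of Envoy instances.
class ConnectionRequestLimiterSketch {
public:
  explicit ConnectionRequestLimiterSketch(uint64_t max_requests)
      : max_requests_(max_requests) {}

  // Called for each new request on the connection; returns true when the
  // connection should be closed after this request completes.
  bool onNewRequest() { return ++requests_served_ >= max_requests_; }

private:
  const uint64_t max_requests_;
  uint64_t requests_served_{0};
};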

Any other alternatives to be discussed here? If not, I can close this issue.

@hobbytp

hobbytp commented Jun 8, 2022

sha-rath, I recommend keeping it open, because the TCP part is still not fixed, and some of the discussions here still have no conclusion.

@Winbobob
Member

Winbobob commented Jul 18, 2024

To revive this conversation and add my two cents:

  • I think the drain_listeners admin interface will not meet the original requirement (drain a few connections) or rebalance the long-lived connections, since this admin interface eventually drains all listeners and their connections, no matter whether the drain is graceful or immediate.

  • max_requests_per_connection is helpful but has limitations. What if the limit is not reached, but we still want to rebalance the connections (consider a situation where handling a certain type of request causes high CPU or memory usage, and we want to rebalance the connections/requests)? If we set the number too low, it may be inefficient.

  • The closest config I found is max_connection_duration + drain_timeout. However, it is not flexible enough. Assume one client establishes many connections at almost the same time: once the connection duration limit is reached and there are no active streams, all of those connections will be force-drained within a short period, which may cause high latency.

Proposal

#35391

Please let me know your thoughts! Thanks!
