ipvs: avoid drop first packet by reusing conntrack
Since commit f719e37 ("ipvs: drop first packet to
redirect conntrack"), when a new TCP connection meets
the conditions that require rescheduling, the first SYN packet
is dropped. The client then has to retransmit the SYN, which adds
one second of latency to the new connection. This problem has
been discussed widely, for example:

1) One second connection delay in masque
https://marc.info/?t=151683118100004&r=1&w=2

2) IPVS low throughput #70747
kubernetes/kubernetes#70747

3) Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

4) Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

5) kube-proxy ipvs conn_reuse_mode setting causes errors
with high load from single client
kubernetes/kubernetes#81775

The root cause is that when the old session expires, the
conntrack entry related to the session is dropped by
ip_vs_conn_drop_conntrack. The code is as follows:
```
static void ip_vs_conn_expire(struct timer_list *t)
{
...

     if ((cp->flags & IP_VS_CONN_F_NFCT) &&
         !(cp->flags & IP_VS_CONN_F_ONE_PACKET)) {
             /* Do not access conntracks during subsys cleanup
              * because nf_conntrack_find_get can not be used after
              * conntrack cleanup for the net.
              */
             smp_rmb();
             if (ipvs->enable)
                     ip_vs_conn_drop_conntrack(cp);
     }
...
}
```
As the code shows, ip_vs_conn_drop_conntrack is called only when
the condition (cp->flags & IP_VS_CONN_F_NFCT) is true.
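The gate can be illustrated with a small standalone model. This is only a sketch, not kernel code: `struct expire_conn`, `should_drop_conntrack` and the `MODEL_F_*` flag values are hypothetical stand-ins chosen for illustration; only the shape of the condition follows the excerpt above.

```c
#include <stdbool.h>

/* Stand-in flag bits for illustration only; these are NOT the kernel's
 * actual IP_VS_CONN_F_* values. */
#define MODEL_F_ONE_PACKET 0x1u
#define MODEL_F_NFCT       0x2u

/* Hypothetical, heavily simplified stand-in for struct ip_vs_conn. */
struct expire_conn {
	unsigned int flags;
};

/* Mirrors the condition in the ip_vs_conn_expire() excerpt: the conntrack
 * entry is dropped only for a conntrack-using, non-one-packet connection
 * while IPVS is enabled. */
static bool should_drop_conntrack(const struct expire_conn *cp,
				  bool ipvs_enabled)
{
	return (cp->flags & MODEL_F_NFCT) &&
	       !(cp->flags & MODEL_F_ONE_PACKET) &&
	       ipvs_enabled;
}
```

In this model, clearing `MODEL_F_NFCT` before the connection expires makes `should_drop_conntrack` return false, which is the property the optimization relies on.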

So we optimize this with the following steps (administrators
can enable the optimization by setting
net.ipv4.vs.conn_reuse_old_conntrack=1):
1) clear the IP_VS_CONN_F_NFCT flag (this is safe because
   no packets will use the old session)
2) call ip_vs_conn_expire_now to release the old session;
   the related conntrack is then not dropped
3) ipvs no longer needs to drop the first SYN packet; it
   simply passes the SYN packet on to the next processing step
   and creates a new ipvs session. The new session is associated
   with the old conntrack (which conntrack reopens as a new
   one), and everything afterwards proceeds as if the old
   session had never existed.
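The three steps can be sketched as a standalone model of the reschedule branch. This is an illustrative sketch under stated assumptions, not the patch itself: `struct resched_conn`, `handle_resched`, `enum verdict` and the `MODEL_F_NFCT` value are hypothetical names, and the `expired` field merely stands in for the effect of ip_vs_conn_expire_now.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_F_NFCT 0x2u	/* stand-in value, not the kernel's */

enum verdict { VERDICT_CONTINUE, VERDICT_DROP };

/* Hypothetical, simplified stand-in for struct ip_vs_conn. */
struct resched_conn {
	unsigned int flags;
	int n_control;			/* connections controlled by this one */
	struct resched_conn *control;	/* connection controlling this one */
	bool expired;			/* models ip_vs_conn_expire_now() */
};

/* Models what happens once a new SYN hits an old connection that must be
 * rescheduled. reuse_old_conntrack models the proposed sysctl. */
static enum verdict handle_resched(struct resched_conn *cp, bool uses_ct,
				   bool reuse_old_conntrack)
{
	bool drop = false;

	if (uses_ct) {
		if (cp->n_control == 0 && cp->control == NULL &&
		    reuse_old_conntrack)
			cp->flags &= ~MODEL_F_NFCT;	/* step 1 */
		else
			drop = true;			/* old behaviour */
	}
	if (cp->n_control == 0)
		cp->expired = true;			/* step 2 */

	/* step 3: without a drop, the SYN continues and a new session
	 * is scheduled by the caller. */
	return drop ? VERDICT_DROP : VERDICT_CONTINUE;
}
```

With the option enabled and no control relationship, the model returns `VERDICT_CONTINUE` and clears the flag; in every other conntrack case it returns `VERDICT_DROP`, matching the behaviour before the patch.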

The above processing has no problems except for passive FTP.
For the passive FTP case, ipvs can detect it from the
conditions (atomic_read(&cp->n_control)) and (cp->control).
So for all other cases (that is, not FTP), ipvs should let users
choose: they can select the higher-performance processing
logic by setting net.ipv4.vs.conn_reuse_old_conntrack=1. This is necessary
because most business scenarios (such as Kubernetes) are very sensitive
to the latency of short-lived TCP connections.
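For illustration, the option could also be set programmatically rather than through the sysctl tool. The sketch below assumes the usual /proc/sys mapping of the sysctl name; `write_sysctl_int` and the `CONN_REUSE_OLD_CONNTRACK_PATH` macro are hypothetical names, and the file only exists on a kernel carrying this patch with the ipvs module loaded.

```c
#include <stdio.h>

/* Assumed path, derived from net.ipv4.vs.conn_reuse_old_conntrack by the
 * usual sysctl-to-/proc/sys mapping. */
#define CONN_REUSE_OLD_CONNTRACK_PATH \
	"/proc/sys/net/ipv4/vs/conn_reuse_old_conntrack"

/* Writes an integer value to a sysctl-style file.
 * Returns 0 on success and -1 on any failure. */
static int write_sysctl_int(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller with sufficient privileges would invoke `write_sysctl_int(CONN_REUSE_OLD_CONNTRACK_PATH, 1)` and check the return value.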

This patch has been verified on thousands of our Kubernetes
node servers at Tencent Inc.

Signed-off-by: YangYuxi <[email protected]>
yyx authored and intel-lab-lkp committed Jun 16, 2020
1 parent c92cbae commit a970445
Showing 4 changed files with 45 additions and 2 deletions.
23 changes: 23 additions & 0 deletions Documentation/networking/ipvs-sysctl.rst
@@ -50,6 +50,29 @@ conn_reuse_mode - INTEGER
 	balancer in Direct Routing mode. This bit helps on adding new
 	real servers to a very busy cluster.

+conn_reuse_old_conntrack - BOOLEAN
+	- 0 - disabled
+	- not 0 - enabled (default)
+
+	If set, when a new TCP SYN packet hits an old ipvs connection
+	table entry and needs to be rescheduled to a new dest, and if
+	1) the packet uses conntrack,
+	2) the old ipvs connection is not a master control
+	   connection (e.g. the command connection of passive FTP), and
+	3) the old ipvs connection is not controlled by any other
+	   connection (e.g. the data connection of passive FTP),
+	ipvs will not release the old conntrack; it lets conntrack
+	reopen the old session as if it were a new one. This is an
+	optimization selectable by the system administrator.
+
+	If not set, when a new TCP SYN packet hits an old ipvs connection
+	table entry and needs to be rescheduled to a new dest, and if
+	1) the packet uses conntrack,
+	ipvs just drops this SYN packet and expires the old connection by timer.
+	This causes the client to retransmit the TCP SYN.
+
+	Only has effect when conn_reuse_mode is not 0.
+
 conntrack - BOOLEAN
 	- 0 - disabled (default)
 	- not 0 - enabled
11 changes: 11 additions & 0 deletions include/net/ip_vs.h
@@ -928,6 +928,7 @@ struct netns_ipvs {
 	int sysctl_pmtu_disc;
 	int sysctl_backup_only;
 	int sysctl_conn_reuse_mode;
+	int sysctl_conn_reuse_old_conntrack;
 	int sysctl_schedule_icmp;
 	int sysctl_ignore_tunneled;

@@ -1049,6 +1050,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_conn_reuse_mode;
 }

+static inline int sysctl_conn_reuse_old_conntrack(struct netns_ipvs *ipvs)
+{
+	return ipvs->sysctl_conn_reuse_old_conntrack;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return ipvs->sysctl_schedule_icmp;
@@ -1136,6 +1142,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return 1;
 }

+static inline int sysctl_conn_reuse_old_conntrack(struct netns_ipvs *ipvs)
+{
+	return 1;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return 0;
11 changes: 9 additions & 2 deletions net/netfilter/ipvs/ip_vs_core.c
@@ -2066,7 +2066,7 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int

 	conn_reuse_mode = sysctl_conn_reuse_mode(ipvs);
 	if (conn_reuse_mode && !iph.fragoffs && is_new_conn(skb, &iph) && cp) {
-		bool uses_ct = false, resched = false;
+		bool uses_ct = false, resched = false, drop = false;

 		if (unlikely(sysctl_expire_nodest_conn(ipvs)) && cp->dest &&
 		    unlikely(!atomic_read(&cp->dest->weight))) {
@@ -2086,10 +2086,17 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
 		}

 		if (resched) {
+			if (uses_ct) {
+				if (unlikely(!atomic_read(&cp->n_control) && !cp->control) &&
+				    likely(sysctl_conn_reuse_old_conntrack(ipvs)))
+					cp->flags &= ~IP_VS_CONN_F_NFCT;
+				else
+					drop = true;
+			}
 			if (!atomic_read(&cp->n_control))
 				ip_vs_conn_expire_now(cp);
 			__ip_vs_conn_put(cp);
-			if (uses_ct)
+			if (drop)
 				return NF_DROP;
 			cp = NULL;
 		}
2 changes: 2 additions & 0 deletions net/netfilter/ipvs/ip_vs_ctl.c
@@ -4049,7 +4049,9 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	tbl[idx++].data = &ipvs->sysctl_pmtu_disc;
 	tbl[idx++].data = &ipvs->sysctl_backup_only;
 	ipvs->sysctl_conn_reuse_mode = 1;
+	ipvs->sysctl_conn_reuse_old_conntrack = 1;
 	tbl[idx++].data = &ipvs->sysctl_conn_reuse_mode;
+	tbl[idx++].data = &ipvs->sysctl_conn_reuse_old_conntrack;
 	tbl[idx++].data = &ipvs->sysctl_schedule_icmp;
 	tbl[idx++].data = &ipvs->sysctl_ignore_tunneled;

