Apache Bench can fill up ipvs service proxy in seconds #544

Closed

neeseius opened this issue Sep 28, 2018 · 26 comments

@neeseius

neeseius commented Sep 28, 2018

I am not sure if I have something configured wrong, but here is my CentOS 7 physical node and kube-router agent setup:

[ipvsadm package]
$ rpm -q ipvsadm
ipvsadm-1.27-7.el7.x86_64

[kube router process and options]
$ ps -ocommand= -C kube-router
/usr/local/bin/kube-router --run-router=true --run-firewall=true --run-service-proxy=true --kubeconfig=/etc/kubernetes/kube-router.kubeconfig --hostname-override=node6 --enable-overlay=true

[service]
$ kubectl get svc
NAME       TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
test-svc   NodePort   172.30.176.114   <none>        80:30530/TCP   7h

[ipvs]
$ ipvsadm -ln | head -n 1
IP Virtual Server version 1.2.1 (size=4096)

[ipvs service]
$ ipvsadm -ln | grep -A1 30530
TCP 10.200.1.146:30530 rr
-> 172.32.9.68:80 Masq 1 0 0

If I use Apache Bench with TCP keep-alive, everything is swell and absurdly fast, posting over 10,000 requests per second, and ipvsadm will show stats like the ones below during such a test:
$ ipvsadm -ln | grep -A1 30530
TCP 10.200.1.146:30530 rr
-> 172.32.9.68:80 Masq 1 0 757
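
(For reference, the keep-alive run was presumably something like the following; ab's -k flag enables HTTP KeepAlive, though the exact invocation isn't shown in this report.)

ab -k -c 100 -n 20000 http://node6:30530/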

However, if I run the same test without keep-alive, then "InActConn" jumps up to 14000 within a few seconds. Up to that point things are very fast, but after that the virtual server completely hangs and stops responding to requests until "InActConn" drops back below 14000. This happens whether I run Apache Bench on the node itself and hit the cluster IP, or run it from another server and hit the NodePort.
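
A simple way to watch that counter climb during a run (using the same port shown in the outputs below):

$ watch -n1 'ipvsadm -ln | grep -A1 30530'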

---ipvs
$ ipvsadm -ln | grep -A1 30530
TCP 10.200.1.146:30530 rr
-> 172.32.9.68:80 Masq 1 0 14115

--- apache bench output
ab -c 100 -n 20000 http://node6:30530/
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking node6 (be patient)
Completed 2000 requests
Completed 4000 requests
Completed 6000 requests
Completed 8000 requests
Completed 10000 requests
Completed 12000 requests
Completed 14000 requests
Completed 16000 requests
Completed 18000 requests
Completed 20000 requests
Finished 20000 requests

Server Software: Apache/2.4.34
Server Hostname: node6
Server Port: 30530

Document Path: /
Document Length: 2512 bytes

Concurrency Level: 100
Time taken for tests: 63.914 seconds
Complete requests: 20000
Failed requests: 0
Total transferred: 55860000 bytes
HTML transferred: 50240000 bytes
Requests per second: 312.92 [#/sec] (mean)
Time per request: 319.569 [ms] (mean)
Time per request: 3.196 [ms] (mean, across all concurrent requests)
Transfer rate: 853.50 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 1 311 456.6 10 1005
Processing: 1 7 4.5 8 36
Waiting: 0 7 4.5 8 36
Total: 2 318 453.1 19 1010

Percentage of the requests served within a certain time (ms)
50% 19
66% 21
75% 1003
80% 1004
90% 1004
95% 1005
98% 1005
99% 1006
100% 1010 (longest request)

@uablrek
Contributor

uablrek commented Sep 28, 2018

Check with netstat -putan whether you have zillions of sockets in TIME_WAIT when it stalls.
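
For example, a quick count might look like this (ss being the modern replacement for netstat):

$ ss -tan state time-wait | wc -l
$ netstat -putan | grep -c TIME_WAIT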

@neeseius
Author

I actually don't see any TIME_WAIT sockets on the physical host, but I do see a ton in the containers. I made 6 replicas this time, all still on the same host, node6. Below is another snapshot of where the numbers are when the requests stop being answered again.

It seems the limit is still 14000, just divided among the containers now.
[ipvsadm]
$ ipvsadm -ln | grep -A6 30530
TCP 10.200.1.146:30530 rr
-> 172.32.9.69:80 Masq 1 0 2336
-> 172.32.9.71:80 Masq 1 0 2336
-> 172.32.9.72:80 Masq 1 0 2336
-> 172.32.9.73:80 Masq 1 0 2336
-> 172.32.9.74:80 Masq 1 0 2336
-> 172.32.9.75:80 Masq 1 0 2336

Below is the number of TIME_WAIT sockets retrieved from each container.
2453
2467
2471
2475
2480
2482

@neeseius
Author

I just wanted to point out that I am not using DSR, which makes me wonder why there is an accumulation of TIME_WAIT connections; in my case, shouldn't LVS be able to see all packets sent in both directions?

@uablrek
Contributor

uablrek commented Oct 15, 2018

TIME_WAIT is part of the TCP standard. The state should linger for 2 minutes (depending on how the connection was shut down), and IPVS also keeps the state so it can forward stray packets.
But you can make Linux reuse sockets in TIME_WAIT with sysctls. I don't remember which ones, so you will have to search.
But it can be other things. Your symptom, fast connects followed by an almost dead stop, points to some resource being exhausted along the way. It can be ports, but it can also be entries in IPVS or (more likely) in "conntrack". I know kube-proxy increases the conntrack table sizes; perhaps kube-router doesn't, I don't know.
This is a hard problem since you must investigate the whole path.

@neeseius
Author

I've done some research on what is going on and it turns out there is a legitimate problem with IPVS.
moby/moby#31746
moby/moby#35082

IPVS is not reusing ports the way it is supposed to, and thus the ephemeral ports are exhausted depending on the ephemeral port range (net.ipv4.ip_local_port_range). Setting net.ipv4.vs.conntrack=0 via sysctl somehow solves the reuse problem, but it breaks NodePort (and probably other things), so I don't believe that is the solution.

I don't know whether it's just CentOS 7 that is affected or whether this is a broader problem, but I imagine many other engineering teams using IPVS as a service proxy will eventually encounter this limitation.
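
For reference, the knobs mentioned above can be inspected and adjusted like this (a sketch; widening the port range only postpones exhaustion, it does not fix the reuse problem):

$ sysctl net.ipv4.ip_local_port_range
$ sysctl net.ipv4.vs.conntrack
$ sysctl -w net.ipv4.ip_local_port_range="1024 65535"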

@xnaveira

We have been investigating the problem and have come to the following conclusions:

  • After a connection to a service is finished by the server, it ends up in the TIME_WAIT state. This state is kept for 2 minutes, and then the connection is removed from the conntrack table. If during those 2 minutes the client tries to reuse the same port, then upon receiving the SYN the connection is removed from the conntrack table, but the SYN is neither forwarded to the backend server nor ACKed back to the client, which forces a retransmission from the client after one second. From the client's perspective, it looks as though the server took 1 second to respond.

  • Disabling conntrack for IPVS (https://www.kernel.org/doc/Documentation/networking/ipvs-sysctl.txt) solves the problem, since there are no entries to remove, but in our setup it created another problem. If the node hit by the client query wasn't running the pod locally, IPVS forwarded the packet to a pod (finding the address in its IPVS rules), but disabling conntrack somehow also disabled the masquerading on that forwarded packet, so it reached the pod with the client address as the source. The pod then tried to answer the query by sending the packet directly to the client. Since the client had opened a connection to the service IP rather than the pod IP, it sent a reset back to the pod and the connection was never established.

  • In our setup both pod IPs and service IPs are /32 addresses that are reachable from the clients. What we did is run the services with kube-router.io/service.local=true, which announces the service IP only from the hosts that are running one or more pods belonging to that service (see the sketch just below). This way IPVS never needs to send packets outside the box, so no conntrack or masquerading is needed, and no conntrack means no 1-second delay when reusing a port too quickly. Since we are using ECMP in our BGP setup, the load is shared equally by all the hosts announcing the IP, and then again internally by IPVS round robin.
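
A minimal sketch of applying that annotation to the test service from this issue (the annotation key is the one named above; test-svc is the service from the original report):

$ kubectl annotate svc test-svc kube-router.io/service.local=true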

@neeseius
Author

neeseius commented Oct 19, 2018

Thank you for looking into this.

We aren't utilizing BGP or ECMP yet so a load balancer will add all nodes regardless.
However, it sounds like we can set net.ipv4.vs.conntrack=0 and as long as we don't use NodePort we should be good?

For example, disabling conntrack won't affect a pod hitting a service IP to reach another pod on a different node? And will session affinity like clientip still work?

EDIT:
I see, based on the link provided, that this will break network policy (iptables).
Hmm, it doesn't seem like I can use LVS as a service proxy then.

EDIT2:
Based on the link you sent me, I found something called conn_reuse_mode.

setting:
net.ipv4.vs.conntrack=1 (back to the default so iptables will work)
net.ipv4.vs.conn_reuse_mode=0

appears to solve everything, even NodePort!
I am not sure if this breaks anything else, but so far it seems okay to me.

@neeseius
Author

Special sauce for me seems to be:

net.ipv4.vs.conntrack=1
net.ipv4.vs.conn_reuse_mode=0
net.ipv4.vs.expire_nodest_conn=1
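
To make those settings survive a reboot, one option (a sketch, assuming a standard sysctl.d layout) is a drop-in file:

$ cat <<'EOF' > /etc/sysctl.d/90-ipvs.conf
net.ipv4.vs.conntrack = 1
net.ipv4.vs.conn_reuse_mode = 0
net.ipv4.vs.expire_nodest_conn = 1
EOF
$ sysctl --system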

@xnaveira

Our tests showed that disabling reuse with 'net.ipv4.vs.conn_reuse_mode=0' will interfere with scaling. When adding more pods in a high-traffic scenario, the traffic will stick to the old and overloaded pods, and when scaling down, traffic will be sent to nonexistent pods.

@uablrek
Contributor

uablrek commented Oct 20, 2018

Please read this excellent comment on a referenced issue: moby/moby#35082 (comment)

Be aware that a stream of connects from a single source may not be the common case in real life. It is more likely that you have few connections, but from very many sources.

You may be tuning your system to handle a case that only exists in your lab. While doing so, you tweak parameters that are standard and are there for a reason. The result may be that your app becomes more unstable in real life, where the network is less reliable, while performing excellently in your lab, which is probably a LAN.

@m1093782566

@xnaveira

Our tests showed that disabling reuse with 'net.ipv4.vs.conn_reuse_mode=0' will interfere with scaling. When adding more pods in a high-traffic scenario, the traffic will stick to the old and overloaded pods, and when scaling down, traffic will be sent to nonexistent pods.

Have you tried setting net.ipv4.vs.expire_nodest_conn=1?

@linecolumn

One of the suggestions was to set --notrack on the host:

# iptables -t raw -A PREROUTING -p tcp -d VIP --dport VPORT -j CT --notrack

This causes issues with non-local pod communication, AFAIK.
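
A hedged way to confirm the rule is having an effect (VIP and VPORT are the same placeholders as in the rule above): with --notrack in place, no new conntrack entries should appear for connections to the virtual service.

# conntrack -L -d VIP | grep VPORT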

Also, for reference, here is the "one second delay" article, which explains the issue and provides some solutions: https://marc.info/?l=linux-virtual-server&m=151743061027765&w=2

@m1093782566

I have the same confusion: why does IPVS drop a SYN packet that hits an IPVS connection in the TIME_WAIT state if such a connection uses Netfilter connection tracking (conntrack=1)?

@roffe
Collaborator

roffe commented Nov 21, 2018

@neeseius we have set conn_reuse_mode to 0 in the latest build; could you test whether you are experiencing the same problem with cloudnativelabs/kube-router-git@sha256:93c843ce19a7d98e8d07849143cc612359cd97db10aba8dca46e98fa114cca79?

@xnaveira

@roffe I tried your image in our setup and it seems to solve the problem! When running with latest, I do the following test:
curl -s http://$SERVICE --local-port 2348 -w "%{time_total}\n"
This command outputs the total time for the request and forces the local port to be the same across several tries. If run several times in a short interval, it gives a time on the order of tens of milliseconds the first time, but 1 second on the following tries because of the "IPVS dropping SYN" issue.
Doing the same with that image gives tens of milliseconds consistently; no more IPVS delays.
We had tried disabling net.ipv4.vs.conn_reuse_mode on the hosts, but then the problem was that traffic from the same port was redirected to the same pod, even for 2 minutes after that pod had been deleted. Have you done something else besides disabling net.ipv4.vs.conn_reuse_mode?
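
A sketch of that test as a loop (SERVICE is assumed to hold the service IP, for example the cluster IP 172.30.176.114 from earlier in this issue; fixing the local port forces the reuse case):

$ SERVICE=172.30.176.114
$ for i in $(seq 1 5); do curl -s -o /dev/null http://$SERVICE --local-port 2348 -w "%{time_total}\n"; sleep 0.2; done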

@roffe
Collaborator

roffe commented Nov 21, 2018

no, that was the only change

@neeseius
Author

This does appear to solve the problem in my testing, even when scaling up and down.

I know we toyed with these parameters before, but it interfered with scaling.
net.ipv4.vs.conn_reuse_mode=0
net.ipv4.vs.expire_nodest_conn=1

However, I noticed this is new:
net.ipv4.vs.expire_quiescent_template = 1

Is that what made the difference?
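
A quick way to check what a node ends up with after deploying the new image (a sketch; sysctl accepts multiple keys at once):

$ sysctl net.ipv4.vs.conn_reuse_mode net.ipv4.vs.expire_nodest_conn net.ipv4.vs.expire_quiescent_template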

@xnaveira

Could you link to the commit, @roffe? I am also curious, and it seems I am in the same situation as @neeseius.

@roffe
Collaborator

roffe commented Nov 21, 2018

#577
#579

@roffe
Collaborator

roffe commented Nov 21, 2018

This does appear to solve the problem in my testing, even when scaling up and down.

I know we toyed with these parameters before, but it interfered with scaling.
net.ipv4.vs.conn_reuse_mode=0
net.ipv4.vs.expire_nodest_conn=1

However, I noticed this is new:
net.ipv4.vs.expire_quiescent_template = 1

https://github.com/cloudnativelabs/kube-router/blame/master/pkg/controllers/proxy/network_services_controller.go#L285-L295

Is that what made the difference?

@roffe
Collaborator

roffe commented Nov 22, 2018

v0.2.3 released with IPVS throughput fixes

@roffe roffe closed this as completed Nov 22, 2018
@igoratencompass

Special sauce for me seems to be:

net.ipv4.vs.conntrack=1
net.ipv4.vs.conn_reuse_mode=0
net.ipv4.vs.expire_nodest_conn=1

I don't understand how the last two can be used at the same time when the kernel docs about conn_reuse_mode clearly say:

       0: disable any special handling on port reuse. The new
	connection will be delivered to the same real server that was
	servicing the previous connection. **This will effectively
	disable expire_nodest_conn**

so by setting net.ipv4.vs.conn_reuse_mode=0 you disable net.ipv4.vs.expire_nodest_conn.

@igoratencompass

igoratencompass commented Nov 22, 2018

We had tried disabling net.ipv4.vs.conn_reuse_mode on the hosts, but then the problem was that traffic from the same port was redirected to the same pod, even for 2 minutes after that pod had been deleted. Have you done something else besides disabling net.ipv4.vs.conn_reuse_mode?

And this is the main problem I see with this, since setting it to zero basically disables net.ipv4.vs.expire_nodest_conn. Or is it just me?

@roffe
Collaborator

roffe commented Nov 22, 2018

It must be a typo in the docs; the kernel does not seem to check whether conn_reuse_mode is 0 when expiring nodest connections: https://github.com/torvalds/linux/blob/master/net/netfilter/ipvs/ip_vs_core.c#L1982

krazey pushed a commit to krazey/android_kernel_motorola_exynos9610 that referenced this issue May 6, 2022
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e37 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
TenSeventy7 pushed a commit to FreshROMs/android_kernel_samsung_exynos9610_mint that referenced this issue May 8, 2022
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e3754ee2 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: John Vincent <[email protected]>
Signed-off-by: John Vincent <[email protected]>
Itsyadavishal pushed a commit to Itsyadavishal/sergoops_kernel_realme_sm6150 that referenced this issue Aug 9, 2022
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e37 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
ShevT pushed a commit to crdroidandroid/android_kernel_oneplus_sm8150 that referenced this issue Aug 25, 2022
[ Upstream commit f0a5e4d ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e37 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
amackpro pushed a commit to amackpro/xiaomi_kernel_vayu that referenced this issue Sep 9, 2022
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e37 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
johnt1989 pushed a commit to johnt1989/android_kernel_samsung_sm8150 that referenced this issue Feb 13, 2023
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e3754ee2 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
bggRGjQaUbCoE pushed a commit to bggRGjQaUbCoE/android_kernel_samsung_sm8250-mohammad92 that referenced this issue Apr 5, 2023
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e37 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Itsyadavishal pushed a commit to Itsyadavishal/kernel_realme_sm6150 that referenced this issue Apr 5, 2023
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e37 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Rem01Gaming pushed a commit to Rem01Gaming/viviz_kernel_even that referenced this issue May 23, 2023
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e37 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Rem01Gaming pushed a commit to Rem01Gaming/kernel_oplus_even that referenced this issue Jun 3, 2023
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e37 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
AbzRaider pushed a commit to AbzRaider/kernel_xiaomi_pissarro that referenced this issue Jun 22, 2023
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e37 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
ahnet-69 pushed a commit to ahnet-69/android_kernel_samsung_a32 that referenced this issue Jul 15, 2023
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e3754ee2 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
nayem8854 pushed a commit to nayem8854/kernel_realme_RMX1931_Arno that referenced this issue Jul 18, 2023
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e37 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
HoangLong-Lumi pushed a commit to HoangLong-Lumi/android_kernel_samsung_mt6768 that referenced this issue Aug 5, 2023
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e37 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
HoangLong-Lumi pushed a commit to HoangLong-Lumi/android_kernel_samsung_mt6768 that referenced this issue Aug 5, 2023
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e37 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
HoangLong-Lumi pushed a commit to HoangLong-Lumi/android_kernel_samsung_mt6768 that referenced this issue Aug 5, 2023
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e37 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
RjTangpos pushed a commit to RjTangpos/kernel_realme_X2-rui2 that referenced this issue Aug 26, 2023
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e37 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
ratatouille100 pushed a commit to ratatouille100/kernel_samsung_universal9611 that referenced this issue Dec 2, 2023
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e37 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Shas45558 pushed a commit to Shas45558/shas-dream-oc-mt6768 that referenced this issue Dec 27, 2023
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e37 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
herokuapp511 pushed a commit to herokuapp511/android_kernel_realme_sm8150 that referenced this issue Dec 31, 2023
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e37 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
rrsetofamuris pushed a commit to rrsetofamuris/codespaces that referenced this issue Jan 11, 2024
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e37 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Bakoubak pushed a commit to Bakoubak/old-android_kernel_lenovo_amar that referenced this issue Jan 23, 2024
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e3754ee2 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
theshoqanebi pushed a commit to theshoqanebi/android_samsung_a12_kernel that referenced this issue Apr 1, 2024
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e3754ee2 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
theshoqanebi pushed a commit to theshoqanebi/android_samsung_m12_kernel that referenced this issue Apr 4, 2024
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e3754ee2 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
AndroidHQ254 pushed a commit to A325F/kernel_samsung_a32-old that referenced this issue Apr 7, 2024
[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
kubernetes/kubernetes#70747

- Apache Bench can fill up ipvs service proxy in seconds #544
cloudnativelabs/kube-router#544

- Additional 1s latency in `host -> service IP -> pod`
kubernetes/kubernetes#90854

Fixes: f719e3754ee2 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <[email protected]>
Signed-off-by: YangYuxi <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
rsuntk pushed a commit to rsuntk/android_kernel_samsung_a10s-r that referenced this issue Jun 3, 2024
yazzXx pushed a commit to yazzXx/android_kernel_selene_blueberry that referenced this issue Aug 4, 2024
cumaRull pushed a commit to cumaRull/kernel_realme_RMX3191 that referenced this issue Aug 10, 2024
noticesax pushed a commit to noticesax/android_kernel_xiaomi_mt6768 that referenced this issue Nov 7, 2024