
Tune haproxy configuration #740

Closed
wants to merge 1 commit

Conversation


@aojea aojea commented Jul 27, 2019

The current haproxy configuration declares the session closed after 50 seconds if no packet is received from the client or from the server.
This causes issues, for example, when trying to tail logs with kubectl logs -f:

  ...
  127.0.0.1 - - [26/Jul/2019:21:20:28 +0000] "GET /favicon.ico HTTP/1.1" 404 555 "http://localhost:8000/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36" "-"
  F0726 22:21:48.325924    1750 helpers.go:114] error: unexpected EOF

For long-lived connections the user has to define values appropriate to their environment. In kind we can safely use 1h; it is unlikely that anyone will DoS the server with the number of open connections.

  timeout client 1h
  timeout server 1h

However, in other environments, and specifically in kind, there are more things that can force a disconnection. Keep in mind that in kind, kubectl connects to the exposed port on localhost, so our TCP connections can be closed by the conntrack timeouts of the host. We can set the option tcpka, enabling socket-level TCP keepalives, which makes the system regularly send packets to the other end of the connection, keeping it active. The configuration parameters of the TCP keepalives depend on the operating system, and these keepalives do not prevent the haproxy session from expiring.
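
Putting it together, a minimal sketch of the tuned defaults section (illustrative values; haproxy accepts explicit time units such as s, m and h):

  defaults
      log global
      mode tcp
      # socket-level TCP keepalives towards both the client and the server
      option tcpka
      timeout connect 5s
      # expire the session only after 1h without traffic in either direction
      timeout client 1h
      timeout server 1h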

Ref: kubernetes/kubeadm#1685

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 27, 2019
@k8s-ci-robot k8s-ci-robot requested review from amwat and munnerz July 27, 2019 21:38
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jul 27, 2019
@@ -33,18 +33,21 @@ type ConfigData struct {
// DefaultConfigTemplate is the loadbalancer config template
const DefaultConfigTemplate = `# generated by kind
global
stats socket /var/run/haproxy.sock mode 600 level admin
Contributor Author

this is useful for debugging; installing socat or netcat in the haproxy container, you can check the session status, for example:

```
/ # echo "show sess" | socat /var/run/haproxy.sock stdio
0x5655011ce8a0: proto=tcpv4 src=172.17.0.1:35988 fe=control-plane be=kube-apiservers srv=ha-control-plane ts=00 age=14m54s calls=3 rate=0 cpu=0 lat=0 rq[f=848202h,i=0,an=00h,rx=45m11s,wx=,ax=] rp[f=80048202h,i=0,an=00h,rx=45m11s,wx=,ax=] s0=[8,200008h,fd=32,ex=] s1=[8,200018h,fd=33,ex=] exp=45m5s
0x5655011cd880: proto=unix_stream src=unix:1 fe=GLOBAL be=<NONE> srv=<none> ts=00 age=0s calls=2 rate=2 cpu=0 lat=0 rq[f=c4c220h,i=0,an=00h,rx=,wx=,ax=] rp[f=80008002h,i=0,an=00h,rx=,wx=,ax=] s0=[8,280008h,fd=34,ex=] s1=[8,204018h,fd=-1,ex=] exp=2m
```
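
The same admin socket accepts other runtime commands; for instance, to dump general process statistics (assuming socat is installed as above):

```
/ # echo "show info" | socat /var/run/haproxy.sock stdio
```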

Member

SGTM


aojea commented Jul 27, 2019

cc @ereslibre

@ereslibre ereslibre left a comment

I think this change is not addressing the issue but just making the timeout longer (kubectl logs -f or kubectl exec -it would still die if the inactivity lasts 1 hour instead of 50 seconds).

Looks like the tcpka option by itself is not helping.

log /dev/log local0
log /dev/log local1 notice
daemon

defaults
log global
mode tcp
option tcpka


I'm still seeing the same behavior with clitcpka and srvtcpka (equivalent to tcpka), without tweaking timeouts.
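
For reference, a sketch of the per-side options being tested here (clitcpka enables keepalives on the client-facing connection, srvtcpka on the server-facing one; illustrative snippet):

  defaults
      mode tcp
      # TCP keepalives towards the connecting client (kubectl)
      option clitcpka
      # TCP keepalives towards the backend (the apiserver)
      option srvtcpka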

Contributor Author

the tcpka option prevents intermediate systems from closing the connection (in our case iptables, since we are doing docker port forwarding).
The kubectl commands will always die after the inactivity timeout we configure in haproxy if there are no packets between the server and the client.
To avoid this, kubectl should keep the connection open.

@aojea aojea Jul 27, 2019

🤔 seems that TCP keepalive doesn't kick in until the connection has been idle for 2 hours by default

🤷‍♂

cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
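
For local testing, the kernel defaults can be lowered so that keepalives fire well before any proxy or conntrack timeout (illustrative values; these are standard Linux sysctls):

  # first probe after 60s idle, then every 10s, give up after 5 failed probes
  sysctl -w net.ipv4.tcp_keepalive_time=60
  sysctl -w net.ipv4.tcp_keepalive_intvl=10
  sysctl -w net.ipv4.tcp_keepalive_probes=5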


aojea commented Jul 27, 2019

/assign @BenTheElder

option dontlognull
# TODO: tune these
timeout connect 5000
timeout client 50000
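
For reference, haproxy timeouts without a unit suffix are interpreted as milliseconds, so these values are 5 seconds and 50 seconds; with explicit units the same lines would read:

  timeout connect 5s
  timeout client 50s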
@BenTheElder BenTheElder Jul 30, 2019

what unit were these in?

Member

So 50s to an hour.

Seems arbitrarily large, how do other setups compare (eg kubespray)?

Contributor Author

it is to be safe, maybe exaggerated, but I don't think that anybody is going to exhaust the host with open TCP connections with kind :)

Member

fair I suppose; I'm not sure this is the correct fix, but I'm OK to try it, it mostly seems harmless

Contributor Author

it is hard to say if it is a bug or a limitation; the software does what it's supposed to do, it just doesn't implement keepalives to prevent intermediate devices from closing the connection. Any of the involved components can be blamed; it is a common TCP problem with long-lived connections 🤷‍♂️

@ereslibre

@BenTheElder I am going to finish this patch and open a PR. It covers kubectl exec -it, kubectl logs -f and kubectl get -w. The idea is to ping the other end on a regular basis.

Member

thanks @ereslibre! + 💯

@ereslibre

The patch is almost ready. I will review it tomorrow, checking that I am not breaking anything and rereading for minor changes, and then I should be good to open a PR and ping all stakeholders.

@ereslibre

Side note: when testing manually I go to the extreme and use this patch in kind: master...ereslibre:low-timeout; it's much more convenient for forcing the issue and ensuring that pings are working as intended.

@BenTheElder
Member

[I've just caught up on the linked issue]

@BenTheElder
Member

The current haproxy configuration declares the session closed after 50 seconds if no packet is received from the client or from the server.

when is this happening? what client? what use case?


aojea commented Aug 6, 2019

when is this happening? what client? what use case?
Using those commands: #740 (review)

haproxy opens 2 sockets and forwards traffic between them; if haproxy doesn't see any traffic it expires the session. The OS TCP keepalives are useless here because they work at the kernel level (terminating at each of the sockets opened by haproxy) and haproxy doesn't see them, hence the connection timeout is not refreshed.
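
One way to observe this from inside the haproxy container is to watch the per-socket timers (a sketch assuming iproute2's ss is available in the container):

  # -o prints the timer column: keepalive timers tick on each kernel socket,
  # while haproxy's own session timer only resets on real payload traffic
  / # ss -tno state established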

@BenTheElder
Member

haproxy opens 2 sockets and forwards traffic between them; if haproxy doesn't see any traffic it expires the session. The OS TCP keepalives are useless here because they work at the kernel level (terminating at each of the sockets opened by haproxy) and haproxy doesn't see them, hence the connection timeout is not refreshed.

that doesn't really answer the questions:

  • which client?
    • answer: kubectl, any kubernetes client ...
  • what use case?
    • answer: kubectl logs, apparently when there's no output there are no packets

#740 (comment) sounds like the right solution to me, this is an upstream issue that should be fixed there.

we wouldn't want people using this fix on a real cluster, and there's nothing special about kind in this respect.


aojea commented Aug 6, 2019

any other tool or application that uses haproxy and doesn't send traffic within 50 seconds will have the same problem; however, I agree with you that it should be solved at the app level

@BenTheElder
Member

imho that's the kind of bug you want to find testing your app :^)

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aojea
To complete the pull request process, please assign bentheelder
You can assign the PR to them by writing /assign @bentheelder in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@BenTheElder BenTheElder added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Aug 14, 2019

aojea commented Sep 6, 2019

we can reopen if necessary; it seems it is not needed

@aojea aojea closed this Sep 6, 2019
@aojea aojea deleted the haproxy_tuning branch August 9, 2020 08:44
Labels
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
priority/backlog - Higher priority than priority/awaiting-more-evidence.
size/XS - Denotes a PR that changes 0-9 lines, ignoring generated files.