Connection Issues with kubectl exec|logs #4

Open
marwinski opened this issue Jul 14, 2021 · 2 comments
Labels
kind/bug Bug lifecycle/rotten Nobody worked on this for 12 months (final aging stage)

Comments

@marwinski
Collaborator

marwinski commented Jul 14, 2021

What happened:

We sporadically see two issues that might or might not be related:

  1. kubectl exec|logs sessions are randomly reset. Sometimes connections stay open all day long and sometimes connections are dropped every couple of minutes.

  2. Sometimes the kubectl exec command fails with errors like the following although all the pods are running and the VPN is established:

Error from server: error dialing backend: proxy error from vpn-seed-server:9443 while dialing 10.250.0.14:10250, code 503: 503 Service Unavailable

In addition, we see a third issue which makes the above issues more critical:

  3. We also noticed that openvpn connection establishment between the seed and shoot clusters takes roughly 5 seconds, compared to 1 second in a local setup with an identical configuration. We suspect that the vpn2 pod does not have sufficient CPU, especially during connection establishment (a rough way to check for CPU throttling is sketched below).
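
As a rough way to check the CPU suspicion (namespace, label selector, and container names below are assumptions and may differ per landscape), one could look at the throttling counters of the vpn-seed-server container while a tunnel is being established:

# Requires metrics-server; the label selector is an assumption.
kubectl -n shoot--foo--bar top pod -l app=vpn-seed-server

# cgroup v1 throttling counters (nr_throttled / throttled_time) inside the container;
# on cgroup v2 hosts the file is /sys/fs/cgroup/cpu.stat instead.
kubectl -n shoot--foo--bar exec deploy/vpn-seed-server -c vpn-seed-server -- \
  cat /sys/fs/cgroup/cpu/cpu.stat

A steadily increasing nr_throttled value during connection establishment would support the suspicion that the pod is CPU-starved at that point.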

What you expected to happen:

None of the above. The VPN tunnel should be established in about one second.

How to reproduce it (as minimally and precisely as possible):

This happens on all clusters and is not a new issue (except maybe issue 2). It appears to happen much more often in larger environments such as our canary environment.

Anything else we need to know:

We sometimes see that the openvpn tunnel is reset and restarted. From past experience we suspect that this is caused by the cloud provider load balancers in between. There might be little that can be done about this; however, our experiments indicate that it is not a significant problem (a rough reproduction of such a reset is sketched after the list):

  • Upon termination of the vpn tunnel the tunnel is re-established with an identical configuration.
  • Existing connections (kubectl logs, kubectl exec, kubectl port-forward) remain open and merely appear to hang for 5 seconds (see the 5 second problem above; this could be reduced to 1 second)
  • New kubectl exec attempts fail while the vpn is down (possibly because of a 1 second connect timeout)
  • New kubectl logs attempts hang until the connection has been re-established
  • We did not test webhooks but assume they are also not affected as they have a connect timeout of 10 seconds by default. Once the TCP connection has been established the timeout will be much longer.
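
A minimal sketch of the tunnel-reset experiment (pod, container, and namespace names are assumptions; the exact way to restart openvpn may differ):

# Terminal 1 (shoot kubeconfig): keep a session open against any running pod.
kubectl exec -it some-pod -- sh

# Terminal 2 (seed kubeconfig): restart the openvpn process inside the vpn-seed-server pod.
# Assumption: openvpn runs as PID 1 of the container and handles SIGTERM, so the container
# restarts and the tunnel is re-established with an identical configuration.
kubectl -n shoot--foo--bar exec deploy/vpn-seed-server -c vpn-seed-server -- kill 1

# Back in terminal 1: the already-open session should only hang for a few seconds
# (the "5 second problem" above), while a new kubectl exec started during the outage fails.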

From our investigation we strongly suspect that the reset or termination of the vpn tunnel is not a real issue (even if it does happen every couple of minutes). Existing connections hang, and new ones can be established as long as the connect timeout is not limited to one second or so (otherwise a retry will do the trick). The short connect timeout appears to affect kubectl exec but not kubectl logs.

In this context, connections will only stay alive when openvpn restarts or recovers in the same pod. Due to NAT, connections will be terminated if, for example, the vpn-shoot pod restarts (as the stateful conntrack table is not kept).
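
For illustration only (namespace and workload name are assumptions, and the conntrack tool may not be present in the image or may need additional capabilities), the NAT state that is lost on a vpn-shoot restart can be inspected like this:

# List the connection tracking entries inside the vpn-shoot pod of the shoot cluster.
kubectl -n kube-system exec deploy/vpn-shoot -- conntrack -L | head
# Fallback via procfs if the conntrack binary is missing:
kubectl -n kube-system exec deploy/vpn-shoot -- cat /proc/net/nf_conntrack | head
# A replacement pod starts with an empty table, so packets of previously NATed
# connections can no longer be translated and those connections effectively die.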

It might be useful to investigate what happens to existing connections in this case. They should be actively terminated, as the TCP timeout is quite long and hanging connections would cause applications and/or infrastructure to hang.

Issue (2) appears to be related to the envoy configuration. It can be reproduced as follows (a consolidated command sketch follows the list):

  • Create a shoot cluster with one node
  • Log on to the node and reboot it, e.g. shutdown -r now
  • See the node restart; once all pods are running again, you will still see the following error message for kubectl exec for a couple of minutes:
Error from server: error dialing backend: proxy error from vpn-seed-server:9443 while dialing 10.250.0.14:10250, code 503: 503 Service Unavailable
  • Exec into the envoy sidecar container in the vpn-seed pod. Verify that you can indeed connect to the kubelet, e.g. with nc -vz 10.250.0.14 10250
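
The same reproduction, condensed into commands (the deployment and container names are assumptions, and the kubelet IP is the placeholder from the error message above):

# 1. On the single shoot node, trigger a reboot.
shutdown -r now

# 2. Once all pods are Running again, kubectl exec against the shoot keeps failing for a while:
kubectl exec -it some-pod -- sh
#    Error from server: error dialing backend: proxy error from vpn-seed-server:9443
#    while dialing 10.250.0.14:10250, code 503: 503 Service Unavailable

# 3. Meanwhile the kubelet is reachable from the envoy sidecar in the seed, which points
#    at stale envoy state rather than missing connectivity:
kubectl -n shoot--foo--bar exec deploy/vpn-seed-server -c envoy -- nc -vz 10.250.0.14 10250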

As for issue (1), we have already seen this with the "old" openvpn solution as well as with the early ssh tunnel. We used to believe the root cause was that the vpn tunnel was being re-established, but this investigation has now shown that this cannot be the main reason; the actual root cause remains unknown.

Environment:

Any shoot cluster presumably on any infrastructure.

@marwinski marwinski added the kind/bug Bug label Jul 14, 2021
@marwinski
Collaborator Author

It might be useful to investigate what happens to existing connections in this case. They should be actively terminated, as the TCP timeout is quite long and hanging connections would cause applications and/or infrastructure to hang.

It appears this one is not an issue: we have seen that a kubectl exec connection is terminated when the new vpn-shoot pod is started (or has already been started).

@vlerenc
Member

vlerenc commented Jul 23, 2021

@mvladev Could you please update this ticket?

Also, @marwinski opened #4 which seems to describe the same problems as gardener/gardener#4381 and gardener/gardener#4382, opened by @ScheererJ. Which ones do we keep?

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jan 20, 2022
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jul 19, 2022