Connection Issues with kubectl exec|logs #4

Open
marwinski opened this issue Jul 14, 2021 · 2 comments
Labels
kind/bug Bug lifecycle/rotten Nobody worked on this for 12 months (final aging stage)

Comments

@marwinski
Collaborator

marwinski commented Jul 14, 2021

What happened:

We sporadically see two issues that might or might not be related:

  1. kubectl exec|logs sessions are randomly reset. Sometimes connections stay open all day long and sometimes connections are dropped every couple of minutes.

  2. Sometimes the kubectl exec command fails with errors like the following although all the pods are running and the VPN is established:

Error from server: error dialing backend: proxy error from vpn-seed-server:9443 while dialing 10.250.0.14:10250, code 503: 503 Service Unavailable

In addition, we see a third issue which makes the above issues more critical:

  3. We also noticed that openvpn connection establishment between the seed and shoot clusters takes roughly 5 seconds, compared to 1 second in a local setup with an identical configuration. We suspect that the vpn2 pod does not have sufficient CPU, especially during connection establishment (a rough way to check for CPU throttling is sketched below).
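
As a rough way to check the CPU suspicion (namespace, label selector, and container names below are assumptions and may differ per landscape), one could look at the throttling counters of the vpn-seed-server container while a tunnel is being established:

# Requires metrics-server; the label selector is an assumption.
kubectl -n shoot--foo--bar top pod -l app=vpn-seed-server

# cgroup v1 throttling counters (nr_throttled / throttled_time) inside the container;
# on cgroup v2 hosts the file is /sys/fs/cgroup/cpu.stat instead.
kubectl -n shoot--foo--bar exec deploy/vpn-seed-server -c vpn-seed-server -- \
  cat /sys/fs/cgroup/cpu/cpu.stat

A steadily increasing nr_throttled value during connection establishment would support the suspicion that the pod is CPU-starved at that point.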

What you expected to happen:

None of the above. The VPN tunnel should be established in about one second.

How to reproduce it (as minimally and precisely as possible):

This happens on all clusters and is not a new issue (except maybe issue 2). It appears to happen much more often in larger environments such as our canary environment.

Anything else we need to know:

We sometimes see that the openvpn tunnel is reset and restarted. From past experience we suspect that this is caused by the cloud provider load balancers in between. There might be little that can be done about this; however, our experiments indicate that it is not a significant problem (a rough reproduction of such a reset is sketched after the list):

  • Upon termination of the vpn tunnel the tunnel is re-established with an identical configuration.
  • Existing connections (kubectl logs, kubectl exec, kubectl port-forward) remain open and merely appear to hang for 5 seconds (see the 5 second problem above; this could be reduced to 1 second)
  • New kubectl exec attempts fail while the vpn is down (possibly because of a 1 second connect timeout)
  • New kubectl logs attempts hang until the connection has been re-established
  • We did not test webhooks but assume they are also not affected as they have a connect timeout of 10 seconds by default. Once the TCP connection has been established the timeout will be much longer.
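
A minimal sketch of the tunnel-reset experiment (pod, container, and namespace names are assumptions; the exact way to restart openvpn may differ):

# Terminal 1 (shoot kubeconfig): keep a session open against any running pod.
kubectl exec -it some-pod -- sh

# Terminal 2 (seed kubeconfig): restart the openvpn process inside the vpn-seed-server pod.
# Assumption: openvpn runs as PID 1 of the container and handles SIGTERM, so the container
# restarts and the tunnel is re-established with an identical configuration.
kubectl -n shoot--foo--bar exec deploy/vpn-seed-server -c vpn-seed-server -- kill 1

# Back in terminal 1: the already-open session should only hang for a few seconds
# (the "5 second problem" above), while a new kubectl exec started during the outage fails.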

From our investigation we strongly suspect that the reset or termination of the vpn tunnel is not a real issue (even if it does happen every couple of minutes). Existing connections hang, and new ones can be established as long as the connect timeout is not limited to one second or so (otherwise a retry will do the trick). The short connect timeout appears to affect kubectl exec but not kubectl logs.

In this context, connections will only stay alive when openvpn restarts or recovers in the same pod. Due to NAT, connections will be terminated if, for example, the vpn-shoot pod restarts (as the stateful conntrack table is not kept).
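
For illustration only (namespace and workload name are assumptions, and the conntrack tool may not be present in the image or may need additional capabilities), the NAT state that is lost on a vpn-shoot restart can be inspected like this:

# List the connection tracking entries inside the vpn-shoot pod of the shoot cluster.
kubectl -n kube-system exec deploy/vpn-shoot -- conntrack -L | head
# Fallback via procfs if the conntrack binary is missing:
kubectl -n kube-system exec deploy/vpn-shoot -- cat /proc/net/nf_conntrack | head
# A replacement pod starts with an empty table, so packets of previously NATed
# connections can no longer be translated and those connections effectively die.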

It might be useful to investigate what happens to existing connections in this case. They should be actively terminated, as the TCP timeout is quite long and hanging connections would cause applications and/or infrastructure to hang.

Issue (2) appears to be related to the envoy configuration. It can be reproduced as follows (a consolidated command sketch follows the list):

  • Create a shoot cluster with one node
  • Log on to the node and reboot it, e.g. shutdown -r now
  • See the node restart; once all pods are running again, you will still see the following error message for kubectl exec for a couple of minutes:
Error from server: error dialing backend: proxy error from vpn-seed-server:9443 while dialing 10.250.0.14:10250, code 503: 503 Service Unavailable
  • Exec into the envoy sidecar container in the vpn-seed pod. Verify that you can indeed connect to the kubelet, e.g. with nc -vz 10.250.0.14 10250
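
The same reproduction, condensed into commands (the deployment and container names are assumptions, and the kubelet IP is the placeholder from the error message above):

# 1. On the single shoot node, trigger a reboot.
shutdown -r now

# 2. Once all pods are Running again, kubectl exec against the shoot keeps failing for a while:
kubectl exec -it some-pod -- sh
#    Error from server: error dialing backend: proxy error from vpn-seed-server:9443
#    while dialing 10.250.0.14:10250, code 503: 503 Service Unavailable

# 3. Meanwhile the kubelet is reachable from the envoy sidecar in the seed, which points
#    at stale envoy state rather than missing connectivity:
kubectl -n shoot--foo--bar exec deploy/vpn-seed-server -c envoy -- nc -vz 10.250.0.14 10250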

As for issue (1), we have already seen this with the "old" openvpn solution as well as with the early ssh tunnel. We used to believe the root cause was that the vpn tunnel was being re-established, but this investigation has now shown that this cannot be the main reason; the actual root cause remains unknown.

Environment:

Any shoot cluster presumably on any infrastructure.

@marwinski marwinski added the kind/bug Bug label Jul 14, 2021
@marwinski
Collaborator Author

It might be useful to investigate what happens to existing connections in this case. They should be actively terminated, as the TCP timeout is quite long and hanging connections would cause applications and/or infrastructure to hang.

It appears this one is not an issue: we have seen that a kubectl exec connection is terminated when the new vpn-shoot pod is started (or has already been started).

@vlerenc
Member

vlerenc commented Jul 23, 2021

@mvladev Could you please update this ticket?

Also, @marwinski opened #4 which seems to describe the same problems as gardener/gardener#4381 and gardener/gardener#4382, opened by @ScheererJ. Which ones do we keep?

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jan 20, 2022
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jul 19, 2022