What happened:
We sporadically see two issues that might or might not be related:
1. kubectl exec|logs sessions are randomly reset. Sometimes connections stay open all day long and sometimes connections are dropped every couple of minutes.
2. Sometimes the kubectl exec command fails with errors like the following, although all pods are running and the VPN is established:
   Error from server: error dialing backend: proxy error from vpn-seed-server:9443 while dialing 10.250.0.14:10250, code 503: 503 Service Unavailable
In addition, we see another issue which makes the two issues above more critical: openvpn connection establishment between the seed and shoot clusters takes roughly 5 seconds, compared to about 1 second in a local setup with an identical configuration. We suspect that the vpn2 pod does not get sufficient CPU, especially during connection establishment (a quick way to check this is sketched below).
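The following rough checks on the seed side are only a sketch; the namespace, label and container names are placeholders and may differ per landscape and Gardener version:

# control-plane namespace of the affected shoot (placeholder)
NS=shoot--my-project--my-shoot
# configured CPU requests/limits of the vpn-seed-server pod
kubectl -n "$NS" get pods -l app=vpn-seed-server -o jsonpath='{.items[*].spec.containers[*].resources}'
# actual CPU usage, ideally sampled while the tunnel is being (re-)established
kubectl -n "$NS" top pods -l app=vpn-seed-server --containers
# openvpn log timestamps give an estimate of how long connection establishment takes
kubectl -n "$NS" logs --timestamps -l app=vpn-seed-server -c vpn-seed-server | grep -iE 'peer connection initiated|initialization sequence completed'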
What you expected to happen:
None of the above. The VPN tunnel should be established within about one second.
How to reproduce it (as minimally and precisely as possible):
This happens on all clusters and is not a new issue (except maybe issue (2)). It appears to happen much more often in larger environments such as our canary environment.
Anything else we need to know:
We sometimes see that the openvpn tunnel is reset and restarted. From past experience, we suspect that this is caused by the cloud provider load balancers in between. There might be little that can be done about this; however, our experiments (sketched after the list below) indicate that this is not a significant problem:
- Upon termination of the vpn tunnel, the tunnel is re-established with an identical configuration.
- Existing connections (kubectl logs, kubectl exec, kubectl port-forward) remain open and just appear to hang for about 5 seconds (see above for the 5-second problem; this could be reduced to 1 second).
- New kubectl exec attempts fail while the vpn is down (possibly because of a 1-second connect timeout).
- New kubectl logs attempts hang until the connection has been re-established.
- We did not test webhooks, but assume they are also not affected as they have a connect timeout of 10 seconds by default. Once the TCP connection has been established, the timeout will be much longer.
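The hang of existing connections can be roughly measured as sketched below; the kubeconfig paths, namespaces and labels are placeholders:

# terminal 1: stream logs (with timestamps) from some chatty shoot pod and watch for gaps
kubectl --kubeconfig shoot.kubeconfig -n kube-system logs -f --timestamps some-chatty-pod
# terminal 2: watch the seed-side tunnel endpoint for resets and re-establishment
kubectl --kubeconfig seed.kubeconfig -n shoot--my-project--my-shoot logs -f --timestamps -l app=vpn-seed-server | grep -iE 'restarting|initialization sequence completed'
# the gap in terminal 1 around a reset seen in terminal 2 approximates how long
# existing connections appear to hang (roughly 5 seconds today)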
From our investigation we strongly suspect that the reset or termination of the vpn tunnel is not a real issue (even if it does happen every couple of minutes). Existing connections hang, and new ones can be established as long as the connect timeout is not limited to one second or so (otherwise a retry will do the trick). This short connect timeout appears to apply to kubectl exec but not to kubectl logs.
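For new connections, a simple retry loop is usually enough as a workaround when only the short connect timeout is the problem; the pod name and command below are placeholders:

for i in 1 2 3 4 5; do
  kubectl exec -n default my-pod -- date && break
  echo "kubectl exec attempt $i failed, retrying in 2s" >&2
  sleep 2
done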
In this context, connections will only stay alive when openvpn restarts or recovers in the same pod. Due to NAT, connections will be terminated if, for example, the vpn-shoot pod restarts (as the stateful conntrack table is not kept).
It might be useful in this case to investigate what happens to existing connections if this happens (a sketch of such a check follows below). Those should be actively terminated, as the TCP timeout is quite long and letting them linger would cause applications and/or infrastructure to hang.
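A possible way to check this, assuming the vpn-shoot pod lives in kube-system of the shoot and carries an app=vpn-shoot label (both assumptions):

# terminal 1: open a long-running exec session against some shoot pod
kubectl exec -it -n default my-pod -- sleep 3600
# terminal 2: force a vpn-shoot restart and watch whether the session in terminal 1
# is terminated right away or hangs until the TCP timeout kicks in
kubectl -n kube-system delete pod -l app=vpn-shoot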
Issue (2) appears to be related to the envoy configuration. It can be reproduced as follows; a consolidated sketch of the verification commands follows the steps:
1. Create a shoot cluster with one node.
2. Log on to the node and reboot it, e.g. shutdown -r now.
3. Watch the node restart. Once all pods are running again, kubectl exec will still return the following error for a couple of minutes:
   Error from server: error dialing backend: proxy error from vpn-seed-server:9443 while dialing 10.250.0.14:10250, code 503: 503 Service Unavailable
4. Exec into the envoy sidecar container in the vpn-seed pod and verify that you can indeed connect to the kubelet, e.g. with nc -vz 10.250.0.14 10250.
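A consolidated sketch of steps 3 and 4; the node IP, namespaces, workload and container names are assumptions and may need to be adapted:

NODE_IP=10.250.0.14                     # internal IP of the rebooted node
SEED_NS=shoot--my-project--my-shoot     # control-plane namespace in the seed
# step 3: kubectl exec against the shoot keeps failing for a couple of minutes
kubectl exec -n kube-system -it some-pod -- date
# step 4: yet the kubelet is reachable from the (assumed) envoy sidecar in the seed
kubectl -n "$SEED_NS" exec -it deploy/vpn-seed-server -c envoy-proxy -- nc -vz "$NODE_IP" 10250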
As for issue (1), we have already seen this with the "old" openvpn solution as well as with the early ssh tunnel. We used to believe that the root cause was the re-establishment of the vpn tunnel. This investigation has now shown that this cannot be the main reason; the actual root cause is still unknown.
Environment:
Any shoot cluster, presumably on any infrastructure.
Regarding the point above that existing connections should be actively terminated when the vpn-shoot pod restarts: it appears this one is not an issue. We have seen that a kubectl exec connection is terminated when the new vpn-shoot pod is (or has been) started.