Server tries to route traffic through "hard" disconnected agents #152
A couple of quick clarifying questions. What does your setup look like? A single API server, a Konnectivity server, and three nodes, each running an agent? There should be a line like "AgentID set to d700cd66-6ae4-416f-89be-4523adda93c3." in one of the agent logs. Can we get its log around W1015 11:51:13.078907?
So the problem certainly looks like a failure to clean up a backend tunnel to an agent which has gone away. There were some known issues which sound similar in v0.0.11, fixed by @Jefftree, but I believe those fixes are in v0.0.12. I've tried reproducing this issue at head using two agents and a server, but any attempt to kill/restart an agent is seen by the server. I may try to get another box involved so I can see whether having a box go away reproduces the issue. Quick question: do you see any references to d700cd66-6ae4-416f-89be-4523adda93c3 in the server log, other than the attempts to pick it as a backend? E.g. do we know when it connected? Do we see any indication that the server might have seen a broken connection but failed to do the cleanup? Do we know if (and when last) it was successfully used as a backend?
@cheftako, I'm sorry, but I don't have any log collection there and I lost the logs after the pod rotation. But I replicated the issue!
First, we see the new backend connection - that's OK.
Awesome work! That seems to confirm what I was suspecting. The problem would seem to occur when the connection is "silently" broken. On a single machine running both agents and a server, even if I "kill -9" an agent, the operating system will detect that the connection has been broken and inform the server. I think that by disabling the node you're simulating a network failure. At that point no actual end-of-connection message has ever been received, so the connection appears to still be alive; we're just not getting a response to anything we send, and anything we do send will time out on the transport. We clearly want to at a minimum mark the backend as degraded/broken. I would like to think about whether we should immediately do a full cleanup on it.
Thanks for the detailed explanation! At the moment, only the agent is responsible for checking and disconnecting unhealthy connections, so a "silently" broken connection, coupled with the agent being unable to reach the server, seems to contribute to the connection not being properly cleaned up on the server. @caesarxuchao did originally bring up that we may need to perform health checks on the server side too, but we hadn't encountered a situation where that was required before. This issue seems to provide a compelling reason for further investigation on that front. I agree that on the server side it would be difficult to determine whether the network disconnect is transient, so we may not have enough information to decide whether to perform a full cleanup. Marking the connection and asking the server not to send traffic over a potentially broken connection does seem like a good first step though.
I tested how konnectivity behaves after a connection loss - the agent reconnects and everything works fine:
@cheftako, @Jefftree, can't we simply send "ping"/NULL packets at some interval? |
I will defer to @cheftako on the performance implications, but TCP keepalive seems like it should be able to solve the problem. As for the code change, it should be a very small change to pass in additional KeepaliveParams when starting the proxy server (https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/master/cmd/server/main.go#L545). Something like replacing that line with a grpc.NewServer call that also sets grpc.KeepaliveParams should do it.
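A rough sketch of what that could look like; the wrapper function, the existingOpts placeholder, and the 30-second value are illustrative assumptions rather than the exact change:

```go
package example

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newProxyGRPCServer stands in for the grpc.NewServer call in cmd/server/main.go.
// existingOpts is a placeholder for whatever options (credentials, etc.) the
// proxy server already passes in; only the KeepaliveParams option is new here.
func newProxyGRPCServer(existingOpts ...grpc.ServerOption) *grpc.Server {
	opts := append([]grpc.ServerOption{
		grpc.KeepaliveParams(keepalive.ServerParameters{
			// Ping idle server -> agent connections after 30s instead of
			// relying on the 2-hour gRPC default.
			Time: 30 * time.Second,
		}),
	}, existingOpts...)
	return grpc.NewServer(opts...)
}
```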
That alone might already be enough. The difficult part is probably fine-tuning the parameter values (the default keepalive time of 2 hours seems a bit long for the server -> agent connections) and verifying whether the addition actually solves this bug. It would be awesome if you'd like to contribute to the codebase; we're all here and in the Slack channel if you have any questions :).
Thank you for the description and your support! It's very kind :) I will wait for @cheftako's opinion and I will dig into it over the weekend. That's a great opportunity for me!
So a packet every two hours is unlikely to cause any performance implications; however, having a dead connection for up to 2 hours, while better than the current situation (forever?), will have performance implications. I see no reason not to merge such a fix right now, as it's strictly better than where we are. However, I still think we should be looking to mark the connection as unhealthy. That is more work, especially if we want to be able to detect that it has become healthy again.
What do you think about setting the gRPC keepalive parameters on the server side?
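Something along these lines; the concrete durations below are placeholders for illustration, not necessarily the values I tested with:

```go
package example

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// keepaliveOption builds the server option discussed above. Both durations
// are illustrative placeholders, not tested recommendations.
func keepaliveOption() grpc.ServerOption {
	return grpc.KeepaliveParams(keepalive.ServerParameters{
		// Ping an agent connection that has been idle for this long.
		Time: 30 * time.Second,
		// Drop the connection if the ping is not acknowledged within this window.
		Timeout: 10 * time.Second,
	})
}
```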
And I believe that it solved the issue:
And reconnecting:
I'm not sure if adding
Great findings, @Avatat! I have some questions on implementing the health check. Is grpc.KeepaliveParams the same as TCP keepalive, or is it at the gRPC level? It would be problematic if it's TCP keepalive, because TCP keepalive won't kick in until the TCP max retransmission count has been reached (see [1]). Is there a way to check the gRPC connection state on the server side, like what we do on the client side?
It looks like gRPC keepalive is something different from "classic" TCP keepalive:
I don't know the answer to the second question, but I will dig into it :)
Nice, the gRPC keepalive looks promising. The godoc of
Thanks. Let's find out the canonical way of performing health checks on the server side in gRPC.
I believe we don't have access to the underlying connection on the server side (grpc/grpc-go#2456) so health checks would have to be performed differently than on the client side. Interestingly enough, that issue in the grpc repo also points to using KeepaliveParams for closing problematic connections.
That's an interesting case we will need to verify.
+1. The array might need to be unfolded with
On the client side, we use grpc.ClientConn.GetState, which works for the client only, and I didn't find anything similar for a server.
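For reference, a minimal sketch of that client-side pattern, assuming a hypothetical watch loop and callback (neither is code from this repo):

```go
package example

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// watchConnState polls the client connection state and invokes onBroken once
// the connection leaves a healthy state. grpc.Server exposes no equivalent
// API, which is why the server side needs a different approach.
func watchConnState(ctx context.Context, conn *grpc.ClientConn, onBroken func()) {
	for {
		state := conn.GetState()
		if state == connectivity.TransientFailure || state == connectivity.Shutdown {
			onBroken()
			return
		}
		// Block until the state changes or the context is cancelled.
		if !conn.WaitForStateChange(ctx, state) {
			return
		}
	}
}
```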
Maybe it's too early, but I've created a PR: #154 |
So I've been able to reproduce the problem using "sudo ifconfig down". This further confirms our understanding of the problem. It also means I can implement a mechanism to mark a backend as unhealthy.
I just looked into the gRPC code: "activity" is defined by read activity on the connection, so the client needs to send data to the server for it to count as activity. Server -> client transmit attempts should not be considered read activity, so I think this keepalive option does actually solve our issue.
@Jefftree: Closing this issue.
Hello,
I have a Kubernetes 1.18 cluster where Konnectivity v0.0.12 is running.
Most of the time everything works perfectly, but after ~20 days of konnectivity-server uptime, some DIAL_REQ messages get nowhere.
For example,
kubectl logs -n kube-system konnectivity-agent-h6bt9 -f
only worked after the third execution.
Server log:
Agent log:
Sometimes, when I want to view logs from the second agent, I'm getting these warnings:
Server log:
Agent logs nothing.
I believe that restarting the konnectivity-server (+ kube-apiserver) pod would help (it helped in the past), but I don't want to do it yet, in case you need some more tests.