Fix cleanup of timeout on dial request. #253
Conversation
We can timeout on a dial request. That is, a timeout without having received a dial response. No dial response means we have no connection ID. That means we were sending a close request with a zero connID, which does nothing. Now we send a new proto message, DIAL_CLS, with the random ID. This allows the konnectivity server to clean up the pending dial.
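For illustration, a minimal sketch of the client-side timeout path this change describes, assuming a DIAL_CLS packet carrying the random dial ID. The types and field names below are simplified stand-ins, not the actual generated proto types in the repository.

```go
package sketch

import (
	"errors"
	"fmt"
	"time"
)

// Simplified, hypothetical stand-ins for the konnectivity proto messages.
type PacketType int

const (
	DIAL_RSP PacketType = iota
	DIAL_CLS
	CLOSE_REQ
)

type Packet struct {
	Type   PacketType
	Random int64 // random ID the client chose when sending DIAL_REQ
	ConnID int64 // connection ID, only known after a DIAL_RSP arrives
}

type sendStream interface {
	Send(*Packet) error
}

// waitForDial waits for the DIAL_RSP matching an earlier DIAL_REQ. On
// timeout there is no connection ID, so instead of a CLOSE_REQ with
// connID=0 (which the server cannot act on) it sends DIAL_CLS keyed by
// the random ID so the server can drop the pending dial.
func waitForDial(s sendStream, random int64, resCh <-chan *Packet, timeout time.Duration) (int64, error) {
	select {
	case res := <-resCh:
		return res.ConnID, nil
	case <-time.After(timeout):
		if err := s.Send(&Packet{Type: DIAL_CLS, Random: random}); err != nil {
			return 0, fmt.Errorf("dial timed out and DIAL_CLS send failed: %v", err)
		}
		return 0, errors.New("dial timed out waiting for DIAL_RSP")
	}
}
```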
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheftako

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
In the normal case you would upgrade KAS and ANP Server together. In this case there is no issue.
LGTM. From the client side it looks good. On the server side, there might be a small race: when the CLOSE_DIAL request comes, the connection has already been established and moved out of the pendingDial table, so it won't close the connection. Is that acceptable?
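To make the race above concrete, here is a hypothetical, simplified view of the server-side state; the struct and field names are illustrative only and do not match the actual ANP server code.

```go
package sketch

import "sync"

// proxyServer is an illustrative stand-in for the server-side bookkeeping.
type proxyServer struct {
	mu           sync.Mutex
	pendingDials map[int64]struct{} // keyed by the client's random dial ID
	established  map[int64]struct{} // keyed by connection ID after DIAL_RSP
}

// handleDialClose is what a DIAL_CLS handler would do: forget the pending
// dial. If the DIAL_RSP already arrived and the entry moved from
// pendingDials to established, this delete is a no-op and the established
// connection stays open, which is the small race window noted above.
func (s *proxyServer) handleDialClose(random int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.pendingDials, random)
}
```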
The dial timeout is usually ~10 seconds. I think the more likely case is that we "lose" the dial response, to generate the case you're talking about. I believe it should be fairly rare, but I'm not happy with it. We are planning on adding metrics for the number of pending dials and connections. That might reveal whether we are realistically hitting either of these cases. Frankly, though, I'm not sure how to handle this edge case. The client can't usefully send a CLOSE_REQ as it doesn't know the connection ID. If we've hit the dial timeout, then a late-arriving dial response won't have a listener to handle it. So I'm not sure there is much we can do.
Maybe keep the random ID in the established connection?
I seem to remember that we decided it wasn't safe to use the random ID as the connection ID. I believe it had to do with having multiple agents. At this point I don't think it's safe to make that level of change without a lot of thought. No particular objection to keeping the random ID in the established connection, though I'm not sure how it helps. I will say that the identified race condition exists today, and while the PR does not eliminate the window, it does significantly reduce its size.
LGTM for the code changes. I agree that sending connID=0 is a problem, but I'm having a bit of trouble understanding the desired behavior with the new packet. When the proxy server sends the DIAL_REQ to the agent, the agent blocks until the dial is resolved. A response should be guaranteed (regardless of success) and the pending dial is removed in both cases. (https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/master/pkg/server/server.go#L686-L691) I guess the only case where the PendingDial isn't cleaned up is if the network connection is broken during this process. Is that what you're targeting, or is there another specific case that will trigger this?
Great question. While investigating the https://storage.googleapis.com/k8s-gubernator/triage/index.html?text=action%5C%20gather%5C%20failed%5C%20for%5C%20SystemPodMetrics%5C%20measurement%3A%5C%20restart%5C%20counts%5C%20violation%3A%5C%20RestartCount%5C(konnectivity%5C-agent&job=gce-scal tests, we see a few instances of connID==0 in the logs. The only way I can see us getting the client (KAS) to send a close request with a conn ID of 0 is if it never got a dial response. It's hard to pinpoint the cause from the logs. However, I would assume this means either the endpoint being talked to (Kubelet/pod/...) failed to respond to the TCP connection request (dead process, ...) or the dial response message was lost during transmission. In any of these cases it is pointless to send a close request with a conn ID of 0 because it cannot be acted upon. In the failed connection request case (or the lost response message case, if it was lost before getting to the ANP server) there will be an orphaned entry in the pending dial list. So at a minimum it would be good to remove that entry.
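A sketch of the pending-dial lifecycle being discussed, under the assumption that the server keys pending dials by the client's random ID; the names here are illustrative, not the real server API. It shows how an entry becomes orphaned when the DIAL_RSP never arrives, and how a DIAL_CLS keyed by the random ID can still remove it.

```go
package sketch

import "sync"

// dialTable is hypothetical bookkeeping for dials that are still in flight.
type dialTable struct {
	mu      sync.Mutex
	pending map[int64]chan struct{} // random ID -> waiter for DIAL_RSP
}

// onDialReq records the pending dial when forwarding DIAL_REQ to an agent.
func (t *dialTable) onDialReq(random int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pending[random] = make(chan struct{})
}

// onDialRsp is the normal cleanup path. If the dial response is lost in
// transit (or the endpoint never answers and no response is produced),
// this is never called and the entry is orphaned.
func (t *dialTable) onDialRsp(random int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.pending, random)
}

// onDialCls is the new cleanup path: when the client times out and sends
// DIAL_CLS with the random ID, the orphaned entry can still be removed.
func (t *dialTable) onDialCls(random int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.pending, random)
}
```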
Thanks for the explanation. I agree that connID==0 is not ideal and we should send a more useful error. It's bizarre that a Close was called without the DIAL_RSP going through, since we do block for the dial to finish (or time out) before returning the connection (https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/master/konnectivity-client/pkg/client/client.go#L216-L230), and a missing DIAL_RSP should in theory prevent the net.Conn from being returned. It's hard to pinpoint how this is caused, so for now +1 on this mitigation. /lgtm