
Fix cleanup of timeout on dial request. #253

Merged
k8s-ci-robot merged 1 commit into kubernetes-sigs:master on Aug 4, 2021

Conversation

@cheftako (Contributor) commented Aug 3, 2021

We can time out on a dial request,
that is, time out without having received a dial response.
No dial response means we have no connection ID,
which means we were sending a close request with a zero connID.
That does nothing.
Now we send a new proto message, DIAL_CLS, with the random dial ID.
This allows the konnectivity server to clean up the pending dial.
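For illustration, a minimal sketch of the client-side path this describes. The `DIAL_CLS` packet type and `CloseDial` payload follow the konnectivity-client proto as I understand it; the stream interface, channel, and function names here are assumptions, not the actual konnectivity-client code.

```go
package sketch

import (
	"errors"
	"net"
	"time"

	"sigs.k8s.io/apiserver-network-proxy/konnectivity-client/proto/client"
)

// packetSender is an illustrative stand-in for the client's gRPC stream.
type packetSender interface {
	Send(*client.Packet) error
}

// waitForDialResponse sketches the fixed behavior: if no DIAL_RSP arrives
// before the timeout, there is no connection ID to put in a CLOSE_REQ, so
// the client sends DIAL_CLS carrying the random dial ID instead.
func waitForDialResponse(stream packetSender, random int64, connIDCh <-chan int64, timeout time.Duration) (net.Conn, error) {
	select {
	case connID := <-connIDCh:
		_ = connID // the real client builds a net.Conn around this ID
		return nil, nil
	case <-time.After(timeout):
		// No connection ID exists yet; send DIAL_CLS with the random
		// dial ID so the server can drop its pending-dial entry.
		if err := stream.Send(&client.Packet{
			Type: client.PacketType_DIAL_CLS,
			Payload: &client.Packet_CloseDial{
				CloseDial: &client.CloseDial{Random: random},
			},
		}); err != nil {
			return nil, err
		}
		return nil, errors.New("dial timed out; sent DIAL_CLS")
	}
}
```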

@cheftako cheftako requested review from anfernee and Jefftree August 3, 2021 05:57
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 3, 2021
@k8s-ci-robot commented
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheftako

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from dberkov August 3, 2021 05:57
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 3, 2021
@cheftako (Author) commented Aug 3, 2021

In the normal case you would upgrade the KAS and ANP server together; in that case there is no issue.
If you upgrade the ANP server prior to the KAS, there is also no issue, as the new message is only sent by the KAS.
If you upgrade the KAS prior to the ANP server, the KAS will start sending DIAL_CLS messages.
The ANP server does not need to respond and will ignore the message, so it should not cause any new issues. A sketch of what this assumes on the server side follows.
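The following is a rough sketch of the compatibility argument above, not the actual apiserver-network-proxy code; the `ProxyServer` fields, map keys, and method are illustrative assumptions.

```go
package sketch

import (
	"net"
	"sync"

	"sigs.k8s.io/apiserver-network-proxy/konnectivity-client/proto/client"
)

// ProxyServer is an illustrative stand-in for the ANP server.
type ProxyServer struct {
	mu          sync.Mutex
	pendingDial map[int64]struct{} // keyed by the random dial ID
	established map[int64]net.Conn // keyed by the server-assigned connID
}

func (s *ProxyServer) handlePacket(pkt *client.Packet) {
	switch pkt.Type {
	case client.PacketType_DIAL_CLS:
		// A new server drops the orphaned pending dial by its random ID.
		s.mu.Lock()
		delete(s.pendingDial, pkt.GetCloseDial().GetRandom())
		s.mu.Unlock()
	default:
		// An old server falls through here: unhandled packet types are
		// ignored, which is why a newer KAS sending DIAL_CLS to an older
		// ANP server causes no new issues.
	}
}
```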

@cheftako (Author) commented Aug 3, 2021

/assign @Jefftree @anfernee

@anfernee (Member) left a comment


LGTM. From the client it looks good. On the server side, there might be a small race: when the DIAL_CLS request comes in, the connection has already been established and moved out of the pendingDial table, so it won't close the connection. Is that acceptable?
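In sketch form, the promotion step that opens this window (reusing the illustrative `ProxyServer` above; names are assumptions): once DIAL_RSP promotes the dial, a late DIAL_CLS keyed by the random ID no longer matches anything.

```go
// onDialResponse sketches why a late DIAL_CLS becomes a no-op: the entry
// leaves pendingDial and is re-keyed by connID, while DIAL_CLS only
// carries the random dial ID.
func (s *ProxyServer) onDialResponse(random, connID int64, conn net.Conn) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.pendingDial, random) // DIAL_CLS can no longer find the dial...
	s.established[connID] = conn  // ...and the live conn is keyed by connID only
}
```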

@cheftako (Author) commented Aug 3, 2021

> LGTM. From the client it looks good. On the server side, there might be a small race: when the DIAL_CLS request comes in, the connection has already been established and moved out of the pendingDial table, so it won't close the connection. Is that acceptable?

The dial timeout is usually ~10 seconds. I think the more likely way to generate the case you're talking about is that we "lose" the dial response. I believe it should be fairly rare, but I'm not happy with it. We are planning on adding metrics for the number of pending dials and connections; that might reveal whether we are realistically hitting either of these cases. Frankly, though, I'm not sure how to handle this edge case. The client can't usefully send a CLOSE_REQ, as it doesn't know the connection ID. If we've hit the dial timeout, then a late-arriving dial response won't have a listener to handle it. So I'm not sure there is much we can do.

@anfernee (Member) commented Aug 3, 2021

Maybe keep the random seed in the established connection as well, or simply use it as the connection ID so that it spans the whole lifecycle of pending and live connections. What do you guys think?

@cheftako (Author) commented Aug 3, 2021

> Maybe keep the random seed in the established connection as well, or simply use it as the connection ID so that it spans the whole lifecycle of pending and live connections. What do you guys think?

I seem to remember that we decided it wasn't safe to use the random ID as the connection ID; I believe it had to do with having multiple agents. At this point I don't think it's safe to make that level of change without a lot of thought. No particular objection to keeping the random ID in the established connection, though I'm not sure how it helps. I will say that the identified race condition exists today, and while the PR does not eliminate the window, it does significantly reduce its size.
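Purely to make the first half of that suggestion concrete (hypothetical; not what this PR or the codebase does), keeping the random dial ID on the established-connection record, extending the illustrative sketch above, might look like:

```go
// Hypothetical: if the established connection retained the random dial
// ID, a late DIAL_CLS (which only carries the random ID) could still
// match it and close the connection, shrinking the race window further.
type establishedConn struct {
	connID int64 // server-assigned; used by CLOSE_REQ and DATA
	random int64 // dial-time random ID; would let DIAL_CLS match
	conn   net.Conn
}
```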

@Jefftree (Member) commented Aug 4, 2021

LGTM for the code changes. I agree that sending connID=0 is a problem, but I'm having a bit of trouble understanding the desired behavior with the new packet.

When the proxy server sends the DIAL_REQ to the agent, the agent blocks until the dial is resolved. A response should be guaranteed (regardless of success), and the pending dial is removed in both cases. (https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/master/pkg/server/server.go#L686-L691) I guess the only case where the PendingDial isn't cleaned up is if the network connection is broken during this process. Is that what you're targeting, or is there another specific case that will trigger this?

@cheftako (Author) commented Aug 4, 2021

> LGTM for the code changes. I agree that sending connID=0 is a problem, but I'm having a bit of trouble understanding the desired behavior with the new packet.
>
> When the proxy server sends the DIAL_REQ to the agent, the agent blocks until the dial is resolved. A response should be guaranteed (regardless of success), and the pending dial is removed in both cases. (https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/master/pkg/server/server.go#L686-L691) I guess the only case where the PendingDial isn't cleaned up is if the network connection is broken during this process. Is that what you're targeting, or is there another specific case that will trigger this?

Great question. While investigating the https://storage.googleapis.com/k8s-gubernator/triage/index.html?text=action%5C%20gather%5C%20failed%5C%20for%5C%20SystemPodMetrics%5C%20measurement%3A%5C%20restart%5C%20counts%5C%20violation%3A%5C%20RestartCount%5C(konnectivity%5C-agent&job=gce-scal tests, we see a few instances of connID==0 in the logs. The only way I can see that we would get the client (KAS) sending a close request with a conn ID of 0 is if it never got a dial response. It's hard to pinpoint the cause from the logs. However, I would assume this means either the endpoint being talked to (Kubelet/pod/...) failed to respond to the TCP connection request (dead process, ...) or the dial response message was lost during transmission. In any of these cases it is pointless to send a close request with a conn ID of 0, because it cannot be acted upon. In the failed-connection-request case (or the lost-response case, if the message was lost before reaching the ANP server) there will be an orphaned entry in the pending dial list, so at a minimum it would be good to remove that entry.

@Jefftree (Member) commented Aug 4, 2021

Thanks for the explanation. I agree that connID==0 is not ideal and we should send a more useful error. It's bizarre that a close was called without the DIAL_RSP going through, since we do block for the dial to finish (or time out) before returning the connection (https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/master/konnectivity-client/pkg/client/client.go#L216-L230), and omitting the DIAL_RSP should in theory prevent the net.Conn from being returned. It's hard to pinpoint how this is caused, so for now +1 on this mitigation.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 4, 2021
@k8s-ci-robot k8s-ci-robot merged commit 36d83bc into kubernetes-sigs:master Aug 4, 2021