
Fix cleanup of timeout on dial request. #253

Merged
k8s-ci-robot merged 1 commit into kubernetes-sigs:master on Aug 4, 2021

Conversation

@cheftako (Contributor) commented Aug 3, 2021

We can time out on a dial request,
that is, time out without having received a dial response.
No dial response means we have no connection ID,
which means we were sending a close request with a zero connID.
That does nothing.
Now we send a new proto message, DIAL_CLS, with the random dial ID.
This allows the konnectivity server to clean up the pending dial.
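For illustration, a minimal sketch of the client-side path this describes. The `DIAL_CLS` packet type and `CloseDial` payload follow the konnectivity-client proto as I understand it; the stream interface, channel, and function names here are assumptions, not the actual konnectivity-client code.

```go
package sketch

import (
	"errors"
	"net"
	"time"

	"sigs.k8s.io/apiserver-network-proxy/konnectivity-client/proto/client"
)

// packetSender is an illustrative stand-in for the client's gRPC stream.
type packetSender interface {
	Send(*client.Packet) error
}

// waitForDialResponse sketches the fixed behavior: if no DIAL_RSP arrives
// before the timeout, there is no connection ID to put in a CLOSE_REQ, so
// the client sends DIAL_CLS carrying the random dial ID instead.
func waitForDialResponse(stream packetSender, random int64, connIDCh <-chan int64, timeout time.Duration) (net.Conn, error) {
	select {
	case connID := <-connIDCh:
		_ = connID // the real client builds a net.Conn around this ID
		return nil, nil
	case <-time.After(timeout):
		// No connection ID exists yet; send DIAL_CLS with the random
		// dial ID so the server can drop its pending-dial entry.
		if err := stream.Send(&client.Packet{
			Type: client.PacketType_DIAL_CLS,
			Payload: &client.Packet_CloseDial{
				CloseDial: &client.CloseDial{Random: random},
			},
		}); err != nil {
			return nil, err
		}
		return nil, errors.New("dial timed out; sent DIAL_CLS")
	}
}
```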

@cheftako cheftako requested review from anfernee and Jefftree August 3, 2021 05:57
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 3, 2021
@k8s-ci-robot commented
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheftako

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from dberkov August 3, 2021 05:57
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 3, 2021
@cheftako (Author) commented Aug 3, 2021

In the normal case you would upgrade the KAS and ANP server together; in that case there is no issue.
If you upgrade the ANP server prior to the KAS, there is also no issue, as the new message is only sent by the KAS.
If you upgrade the KAS prior to the ANP server, the KAS will start sending DIAL_CLS messages.
The ANP server does not need to respond and will ignore the message, so it should not cause any new issues. A sketch of what this assumes on the server side follows.
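The following is a rough sketch of the compatibility argument above, not the actual apiserver-network-proxy code; the `ProxyServer` fields, map keys, and method are illustrative assumptions.

```go
package sketch

import (
	"net"
	"sync"

	"sigs.k8s.io/apiserver-network-proxy/konnectivity-client/proto/client"
)

// ProxyServer is an illustrative stand-in for the ANP server.
type ProxyServer struct {
	mu          sync.Mutex
	pendingDial map[int64]struct{} // keyed by the random dial ID
	established map[int64]net.Conn // keyed by the server-assigned connID
}

func (s *ProxyServer) handlePacket(pkt *client.Packet) {
	switch pkt.Type {
	case client.PacketType_DIAL_CLS:
		// A new server drops the orphaned pending dial by its random ID.
		s.mu.Lock()
		delete(s.pendingDial, pkt.GetCloseDial().GetRandom())
		s.mu.Unlock()
	default:
		// An old server falls through here: unhandled packet types are
		// ignored, which is why a newer KAS sending DIAL_CLS to an older
		// ANP server causes no new issues.
	}
}
```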

@cheftako (Author) commented Aug 3, 2021

/assign @Jefftree @anfernee

@anfernee (Member) left a comment


LGTM. From the client it looks good. On the server side, there might be a small race: when the DIAL_CLS request comes in, the connection has already been established and moved out of the pendingDial table, so it won't close the connection. Is that acceptable?
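In sketch form, the promotion step that opens this window (reusing the illustrative `ProxyServer` above; names are assumptions): once DIAL_RSP promotes the dial, a late DIAL_CLS keyed by the random ID no longer matches anything.

```go
// onDialResponse sketches why a late DIAL_CLS becomes a no-op: the entry
// leaves pendingDial and is re-keyed by connID, while DIAL_CLS only
// carries the random dial ID.
func (s *ProxyServer) onDialResponse(random, connID int64, conn net.Conn) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.pendingDial, random) // DIAL_CLS can no longer find the dial...
	s.established[connID] = conn  // ...and the live conn is keyed by connID only
}
```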

@cheftako (Author) commented Aug 3, 2021

> LGTM. From the client it looks good. On the server side, there might be a small race: when the DIAL_CLS request comes in, the connection has already been established and moved out of the pendingDial table, so it won't close the connection. Is that acceptable?

The dial timeout is usually ~10 seconds. I think the more likely way to generate the case you're talking about is that we "lose" the dial response. I believe it should be fairly rare, but I'm not happy with it. We are planning on adding metrics for the number of pending dials and connections; that might reveal whether we are realistically hitting either of these cases. Frankly, though, I'm not sure how to handle this edge case. The client can't usefully send a CLOSE_REQ, as it doesn't know the connection ID. If we've hit the dial timeout, then a late-arriving dial response won't have a listener to handle it. So I'm not sure there is much we can do.

@anfernee (Member) commented Aug 3, 2021

Maybe keep the random seed in the established connection as well, or simply use it as the connection ID so that it spans the whole lifecycle of pending and live connections. What do you guys think?

@cheftako (Author) commented Aug 3, 2021

> Maybe keep the random seed in the established connection as well, or simply use it as the connection ID so that it spans the whole lifecycle of pending and live connections. What do you guys think?

I seem to remember that we decided it wasn't safe to use the random ID as the connection ID; I believe it had to do with having multiple agents. At this point I don't think it's safe to make that level of change without a lot of thought. No particular objection to keeping the random ID in the established connection, though I'm not sure how it helps. I will say that the identified race condition exists today, and while the PR does not eliminate the window, it does significantly reduce its size.
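Purely to make the first half of that suggestion concrete (hypothetical; not what this PR or the codebase does), keeping the random dial ID on the established-connection record, extending the illustrative sketch above, might look like:

```go
// Hypothetical: if the established connection retained the random dial
// ID, a late DIAL_CLS (which only carries the random ID) could still
// match it and close the connection, shrinking the race window further.
type establishedConn struct {
	connID int64 // server-assigned; used by CLOSE_REQ and DATA
	random int64 // dial-time random ID; would let DIAL_CLS match
	conn   net.Conn
}
```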

@Jefftree (Member) commented Aug 4, 2021

LGTM for the code changes. I agree that sending connID=0 is a problem, but I'm having a bit of trouble understanding the desired behavior with the new packet.

When the proxy server sends the DIAL_REQ to the agent, the agent blocks until the dial is resolved. A response should be guaranteed (regardless of success), and the pending dial is removed in both cases. (https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/master/pkg/server/server.go#L686-L691) I guess the only case where the PendingDial isn't cleaned up is if the network connection is broken during this process. Is that what you're targeting, or is there another specific case that will trigger this?

@cheftako (Author) commented Aug 4, 2021

> LGTM for the code changes. I agree that sending connID=0 is a problem, but I'm having a bit of trouble understanding the desired behavior with the new packet.
>
> When the proxy server sends the DIAL_REQ to the agent, the agent blocks until the dial is resolved. A response should be guaranteed (regardless of success), and the pending dial is removed in both cases. (https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/master/pkg/server/server.go#L686-L691) I guess the only case where the PendingDial isn't cleaned up is if the network connection is broken during this process. Is that what you're targeting, or is there another specific case that will trigger this?

Great question. While investigating the https://storage.googleapis.com/k8s-gubernator/triage/index.html?text=action%5C%20gather%5C%20failed%5C%20for%5C%20SystemPodMetrics%5C%20measurement%3A%5C%20restart%5C%20counts%5C%20violation%3A%5C%20RestartCount%5C(konnectivity%5C-agent&job=gce-scal tests, we see a few instances of connID==0 in the logs. The only way I can see that we would get the client (KAS) sending a close request with a conn ID of 0 is if it never got a dial response. It's hard to pinpoint the cause from the logs. However, I would assume this means either the endpoint being talked to (Kubelet/pod/...) failed to respond to the TCP connection request (dead process, ...) or the dial response message was lost during transmission. In any of these cases it is pointless to send a close request with a conn ID of 0, because it cannot be acted upon. In the failed-connection-request case (or the lost-response case, if the message was lost before reaching the ANP server) there will be an orphaned entry in the pending dial list, so at a minimum it would be good to remove that entry.

@Jefftree (Member) commented Aug 4, 2021

Thanks for the explanation. I agree that connID==0 is not ideal and we should send a more useful error. It's bizarre that a close was called without the DIAL_RSP going through, since we do block for the dial to finish (or time out) before returning the connection (https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/master/konnectivity-client/pkg/client/client.go#L216-L230), and omitting the DIAL_RSP should in theory prevent the net.Conn from being returned. It's hard to pinpoint how this is caused, so for now +1 on this mitigation.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 4, 2021
@k8s-ci-robot k8s-ci-robot merged commit 36d83bc into kubernetes-sigs:master Aug 4, 2021