
Use a custom kubernetes client to handle network failures #9009

Closed
justinsb wants to merge 2 commits from the rolling_update_timeout branch

Conversation

justinsb
Member

By setting a low HTTP timeout and resetting connections on error, we
should be able to work around the http2 connection issue:
kubernetes/client-go#374

I observed this bug specifically when rolling an HA control plane
without a load balancer.
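
For context, the mechanism looks roughly like this (a minimal sketch; the package name comes from pkg/k8sclient/client.go, but the k8s field and the exact plumbing are assumptions — the hunks quoted in the review below show the real diff):

package k8sclient

import (
	"net/http"
	"os"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/klog"
)

// client wraps the real clientset (field name assumed for illustration).
type client struct {
	k8s kubernetes.Interface
}

// handleError sketches the recovery path: when a request times out, close
// idle connections on the underlying transport so the next request dials a
// fresh connection instead of reusing a dead http2 connection.
func (c *client) handleError(err error) {
	if err == nil || !os.IsTimeout(err) {
		return
	}
	// Reach down to the http.Transport behind the clientset; the names
	// mirror the quoted hunks, but this plumbing is an assumption.
	restClient, ok := c.k8s.CoreV1().RESTClient().(*rest.RESTClient)
	if !ok {
		return
	}
	httpTransport, ok := restClient.Client.Transport.(*http.Transport)
	if !ok {
		klog.Warningf("client timed out, but http transport was not of expected type, was %T", restClient.Client.Transport)
		return
	}
	httpTransport.CloseIdleConnections()
}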

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 28, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: justinsb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/rolling-update approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Apr 28, 2020
@justinsb justinsb force-pushed the rolling_update_timeout branch from ed38e12 to 744b555 on April 28, 2020 01:55
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 28, 2020
Member

@johngmyers johngmyers left a comment


Keep in mind the exit strategy. When client-go is fixed we'll want to unravel this interface.

An alternative approach would be to have client implement (only) kubernetes.Interface, wrapping some or all of the methods with the error handling. This would allow error handling for the calls that the drainer uses, for instance. Admittedly kubernetes.Interface is a voluminous interface to write a delegating implementation for, but if writing a reflection-based code generator is too much then one could embed the inner kubernetes.Interface and only override the key methods.
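
For illustration, the embedding variant could look like this (hypothetical names; the delegating CoreV1 wrapper is elided):

// wrappedClientset embeds the real clientset, so every method of
// kubernetes.Interface is promoted and delegates automatically; only the
// group accessors whose calls need the enhanced error handling get shadowed.
type wrappedClientset struct {
	kubernetes.Interface
	onError func(error) // hypothetical hook for the reset-on-timeout logic
}

// CoreV1 shadows the promoted method; a full version would return a
// delegating corev1.CoreV1Interface that invokes onError after each call
// (elided here for brevity).
func (w *wrappedClientset) CoreV1() corev1.CoreV1Interface {
	return w.Interface.CoreV1()
}

var _ kubernetes.Interface = &wrappedClientset{}

(kubernetes is k8s.io/client-go/kubernetes; corev1 is k8s.io/client-go/kubernetes/typed/core/v1.)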


c, err := kubernetes.NewForConfig(config)
if err != nil {
	return nil, fmt.Errorf("error creating kubernetes client: %v", err)
}
Member


Perhaps just return err? Wrapping it doesn't add any value.

Member Author


Fair - done. It's mostly habit from calling the AWS SDK, where some functions return very terse messages, so I tend to wrap by default. Hopefully this all becomes easier when we start using %w, which we can do as of go 1.13 (https://blog.golang.org/go1.13-errors) .. I'm just a little wary about backports in general.
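
For reference, the %w idiom being referred to, applied to the hunk above (illustrative only, not code from this PR):

c, err := kubernetes.NewForConfig(config)
if err != nil {
	// %w (Go 1.13+) wraps the error: callers keep the added context but
	// can still match the underlying cause with errors.Is / errors.As.
	return nil, fmt.Errorf("error creating kubernetes client: %w", err)
}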

patchBytes, err := strategicpatch.CreateTwoWayMergePatch(oldData, newData, node)
if err != nil {
	return err
}
Member


Pushing the patch creation logic down into here strikes me as being a mixing of concerns. This layer should limit itself to timeouts and error recovery.

Member Author


So the problem is that we need to call handleError. I do agree that there's more logic in here than there should be, so I moved it to be a package-level function (which takes a k8sclient.Interface).

Member


I believe the caller should create patchBytes and this layer should simply concern itself with calling Patch() and handleError(), just like ListNodes() and DeleteNode() do.
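
i.e. roughly this shape (a hypothetical sketch following the described ListNodes()/DeleteNode() pattern; PatchNode and c.k8s are assumed names):

// The caller computes patchBytes; this layer only wraps the API call
// with the timeout and error-recovery handling.
func (c *client) PatchNode(ctx context.Context, name string, patchBytes []byte) (*v1.Node, error) {
	node, err := c.k8s.CoreV1().Nodes().Patch(ctx, name, types.StrategicMergePatchType, patchBytes, metav1.PatchOptions{})
	c.handleError(err)
	return node, err
}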

type Interface interface {
	// RawClient returns the current kubernetes.Interface; it should not be cached.
	// Using wrapper methods is preferable because we can do richer retry logic and error handling.
	RawClient() kubernetes.Interface
Member


It would be simpler to just embed kubernetes.Interface in both Interface and client.
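
i.e. (sketch only; ListNodes is one of the wrapper methods discussed in this review, its exact signature is an assumption):

// Both the interface and the concrete client embed kubernetes.Interface:
// callers get the full client-go surface for free, plus the enhanced
// wrapper methods on top.
type Interface interface {
	kubernetes.Interface
	ListNodes(ctx context.Context) (*v1.NodeList, error) // enhanced wrapper
}

type client struct {
	kubernetes.Interface // the real clientset, embedded
}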

Member Author


My strategy is to gradually remove RawClient() - anything that uses RawClient() won't get our enhanced error handling. I added a comment that this should be considered Deprecated.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 5, 2020
@justinsb
Member Author

justinsb commented Jun 7, 2020

You raise an excellent point on client-go wrapping and the exit strategy: I think our exit strategy is actually to use less and less of client-go. It's hard to keep track, but I believe the plan for client-go is to favor the dynamic interface.

(I also marked RawClient deprecated to make it doubly clear that ideally we get to a point where we don't use it)
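
For context, the dynamic interface mentioned looks roughly like this (illustrative only):

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// The dynamic client works with unstructured objects keyed by
// GroupVersionResource, so it needs far less generated typed surface
// than kubernetes.Interface.
func listNodesDynamically(ctx context.Context, config *rest.Config) error {
	dyn, err := dynamic.NewForConfig(config)
	if err != nil {
		return err
	}
	nodes := schema.GroupVersionResource{Version: "v1", Resource: "nodes"}
	_, err = dyn.Resource(nodes).List(ctx, metav1.ListOptions{})
	return err
}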

@justinsb justinsb force-pushed the rolling_update_timeout branch from 744b555 to ac05cad on June 7, 2020 16:24
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 7, 2020
@justinsb justinsb force-pushed the rolling_update_timeout branch 2 times, most recently from eb8348f to 06af22f on June 9, 2020 17:09
@justinsb justinsb force-pushed the rolling_update_timeout branch from 06af22f to bc685c0 on June 9, 2020 17:11
pkg/k8sclient/client.go (outdated review thread, resolved)
Co-authored-by: Peter Rifel <[email protected]>
Member

@johngmyers johngmyers left a comment


This layering approach has me quite concerned: I fear it's going to turn into a significant maintenance burden. I understand the need for expediency, but I don't think we should plan to have such a thing long term.

@@ -511,7 +487,7 @@ func (c *RollingUpdateCluster) drainNode(u *cloudinstances.CloudInstanceGroupMem
 	}

 	helper := &drain.Helper{
-		Client: c.K8sClient,
+		Client: c.K8sClient.RawClient(),
Member


We're always going to have to call into imported code such as drain.Helper and those things are going to need a kubernetes.Interface. So those won't get the enhanced error handling, we'll have to fork and maintain the imported code ourselves, or we'll have to wrap kubernetes.Interface.

// trigger a timeout error and then to recover we need to
// reset any existing connections
if config.Timeout < maxTimeout {
	config.Timeout = maxTimeout
Member


Do we really want to modify the caller's rest.Config? Perhaps we should make a copy?
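
e.g. (rest.CopyConfig is client-go's helper for exactly this):

// Copy before mutating so the caller's rest.Config is left untouched.
config = rest.CopyConfig(config)
if config.Timeout < maxTimeout {
	config.Timeout = maxTimeout
}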

patchBytes, err := strategicpatch.CreateTwoWayMergePatch(oldData, newData, node)
if err != nil {
	return err
}
Member


I believe the caller should create patchBytes and this layer should simply concern itself with calling Patch() and handleError(), just like ListNodes() and DeleteNode() do.

}

_, err = k8sClient.RawClient().CoreV1().Nodes().Patch(ctx, node.Name, types.StrategicMergePatchType, patchBytes, metav1.PatchOptions{})
if c, ok := k8sClient.(*client); ok {
Member


This is definitely a code smell. Assuming mocking of Interface is useful, this should be a method of Interface so it can also be mocked.
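
i.e. something like this, with HandleError as a hypothetical method name:

// Hoisting the hook onto the interface lets fakes implement (and tests
// mock) it, instead of type-asserting down to the concrete *client.
type Interface interface {
	// RawClient returns the underlying kubernetes.Interface. Deprecated.
	RawClient() kubernetes.Interface
	// HandleError reports an error from a RawClient() call so the wrapper
	// can reset connections if needed. (Hypothetical.)
	HandleError(err error)
}

The call site then becomes k8sClient.HandleError(err), with no type assertion.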

"k8s.io/kops/pkg/k8sclient"
)

type Fake struct {
Member


Why does this type exist? Why not use a real client wrapping a fake.Clientset?
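
i.e. something like this (fake is k8s.io/client-go/kubernetes/fake; NewWithClient is a hypothetical constructor on the wrapper):

// Wrap client-go's fake clientset in the real wrapper so tests exercise
// the same code path as production, instead of a parallel Fake type.
fakeK8s := fake.NewSimpleClientset()
c := k8sclient.NewWithClient(fakeK8s)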

@hakman
Member

hakman commented Aug 17, 2020

Stepping back a little, my understanding is that Go 1.15 was the blocker for fixing this on the kubernetes/client-go side.
Is there any way to make this happen now, or does the Go patch at least open some other possibilities?
kubernetes/client-go#374 (comment)

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 17, 2020
@k8s-ci-robot
Contributor

@justinsb: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

	klog.Warningf("client timed out, but http transport was not of expected type, was %T", restClient.Client.Transport)
	return
}
httpTransport.CloseIdleConnections()


@justinsb How would this ensure that the connection over which the request just timed out is idle, and hence gets closed? Is there a possibility that the connection is picked up by another request and is in use while this code executes? // @johngmyers

Member


If the connection got picked up by another request, that's fine. When the connection does go idle the last request to complete will close it.


@johngmyers Sure, this may work in this particular case. Would the same approach work in general for other high-throughput clients suffering from connection staleness? Sorry to hijack this thread; my issue is similar but in a different system, and I was wondering whether the same approach could be applied there.

@hakman
Member

hakman commented Dec 14, 2020

@justinsb I think this should be fixed now by kubernetes/kubernetes#96778 via #10385. I believe it can be closed.

@k8s-ci-robot
Contributor

@justinsb: The following tests failed, say /retest to rerun all failed tests:

Test name                        Commit   Rerun command
pull-kops-verify-cloudformation  88b332d  /test pull-kops-verify-cloudformation
pull-kops-verify-hashes          88b332d  /test pull-kops-verify-hashes
pull-kops-e2e-k8s-containerd     88b332d  /test pull-kops-e2e-k8s-containerd

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 11, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 11, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
