
Use a custom kubernetes client to handle network failures #9009

Closed
justinsb wants to merge 2 commits from the rolling_update_timeout branch

Conversation

justinsb
Member

By setting a low HTTP timeout and resetting connections on error, we
should be able to work around the http2 connection issue:
kubernetes/client-go#374

I observed this bug specifically when rolling an HA control plane
without a load balancer.
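
For context, the mechanism looks roughly like this (a minimal sketch; the package name comes from pkg/k8sclient/client.go, but the k8s field and the exact plumbing are assumptions — the hunks quoted in the review below show the real diff):

package k8sclient

import (
	"net/http"
	"os"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/klog"
)

// client wraps the real clientset (field name assumed for illustration).
type client struct {
	k8s kubernetes.Interface
}

// handleError sketches the recovery path: when a request times out, close
// idle connections on the underlying transport so the next request dials a
// fresh connection instead of reusing a dead http2 connection.
func (c *client) handleError(err error) {
	if err == nil || !os.IsTimeout(err) {
		return
	}
	// Reach down to the http.Transport behind the clientset; the names
	// mirror the quoted hunks, but this plumbing is an assumption.
	restClient, ok := c.k8s.CoreV1().RESTClient().(*rest.RESTClient)
	if !ok {
		return
	}
	httpTransport, ok := restClient.Client.Transport.(*http.Transport)
	if !ok {
		klog.Warningf("client timed out, but http transport was not of expected type, was %T", restClient.Client.Transport)
		return
	}
	httpTransport.CloseIdleConnections()
}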

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 28, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: justinsb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/rolling-update approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Apr 28, 2020
@justinsb justinsb force-pushed the rolling_update_timeout branch from ed38e12 to 744b555 on April 28, 2020 01:55
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 28, 2020
Member

@johngmyers johngmyers left a comment


Keep in mind the exit strategy. When client-go is fixed we'll want to unravel this interface.

An alternative approach would be to have client implement (only) kubernetes.Interface, wrapping some or all of the methods with the error handling. This would allow error handling for the calls that the drainer uses, for instance. Admittedly kubernetes.Interface is a voluminous interface to write a delegating implementation for, but if writing a reflection-based code generator is too much then one could embed the inner kubernetes.Interface and only override the key methods.
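
For illustration, the embedding variant could look like this (hypothetical names; the delegating CoreV1 wrapper is elided):

// wrappedClientset embeds the real clientset, so every method of
// kubernetes.Interface is promoted and delegates automatically; only the
// group accessors whose calls need the enhanced error handling get shadowed.
type wrappedClientset struct {
	kubernetes.Interface
	onError func(error) // hypothetical hook for the reset-on-timeout logic
}

// CoreV1 shadows the promoted method; a full version would return a
// delegating corev1.CoreV1Interface that invokes onError after each call
// (elided here for brevity).
func (w *wrappedClientset) CoreV1() corev1.CoreV1Interface {
	return w.Interface.CoreV1()
}

var _ kubernetes.Interface = &wrappedClientset{}

(kubernetes is k8s.io/client-go/kubernetes; corev1 is k8s.io/client-go/kubernetes/typed/core/v1.)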


c, err := kubernetes.NewForConfig(config)
if err != nil {
	return nil, fmt.Errorf("error creating kubernetes client: %v", err)
}
Member


Perhaps just return err? Wrapping it doesn't add any value.

Member Author


Fair - done. It's mostly habit from calling the AWS SDK, where some functions return very terse messages, so I tend to wrap by default. Hopefully this all becomes easier when we start using %w, which we can do as of go 1.13 (https://blog.golang.org/go1.13-errors) .. I'm just a little wary about backports in general.
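
For reference, the %w idiom being referred to, applied to the hunk above (illustrative only, not code from this PR):

c, err := kubernetes.NewForConfig(config)
if err != nil {
	// %w (Go 1.13+) wraps the error: callers keep the added context but
	// can still match the underlying cause with errors.Is / errors.As.
	return nil, fmt.Errorf("error creating kubernetes client: %w", err)
}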

patchBytes, err := strategicpatch.CreateTwoWayMergePatch(oldData, newData, node)
if err != nil {
	return err
}
Member


Pushing the patch creation logic down into here strikes me as being a mixing of concerns. This layer should limit itself to timeouts and error recovery.

Member Author


So the problem is that we need to call handleError. I do agree that there's more logic in here than there should be, so I moved it to be a package-level function (which takes a k8sclient.Interface).

Member


I believe the caller should create patchBytes and this layer should simply concern itself with calling Patch() and handleError(), just like ListNodes() and DeleteNode() do.
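
i.e. roughly this shape (a hypothetical sketch following the described ListNodes()/DeleteNode() pattern; PatchNode and c.k8s are assumed names):

// The caller computes patchBytes; this layer only wraps the API call
// with the timeout and error-recovery handling.
func (c *client) PatchNode(ctx context.Context, name string, patchBytes []byte) (*v1.Node, error) {
	node, err := c.k8s.CoreV1().Nodes().Patch(ctx, name, types.StrategicMergePatchType, patchBytes, metav1.PatchOptions{})
	c.handleError(err)
	return node, err
}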

type Interface interface {
	// RawClient returns the current kubernetes.Interface; it should not be cached.
	// Using wrapper methods is preferable because we can do richer retry logic and error handling.
	RawClient() kubernetes.Interface
Member


It would be simpler to just embed kubernetes.Interface in both Interface and client.
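
i.e. (sketch only; ListNodes is one of the wrapper methods discussed in this review, its exact signature is an assumption):

// Both the interface and the concrete client embed kubernetes.Interface:
// callers get the full client-go surface for free, plus the enhanced
// wrapper methods on top.
type Interface interface {
	kubernetes.Interface
	ListNodes(ctx context.Context) (*v1.NodeList, error) // enhanced wrapper
}

type client struct {
	kubernetes.Interface // the real clientset, embedded
}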

Member Author


My strategy is to gradually remove RawClient() - anything that uses RawClient() won't get our enhanced error handling. I added a comment that this should be considered Deprecated.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 5, 2020
@justinsb
Member Author

justinsb commented Jun 7, 2020

You raise an excellent point on client-go wrapping and the exit strategy: I think our exit strategy is actually to use less and less of client-go. It's hard to keep track, but I believe the plan for client-go is to favor the dynamic interface.

(I also marked RawClient deprecated to make it doubly clear that ideally we get to a point where we don't use it)
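
For context, the dynamic interface mentioned looks roughly like this (illustrative only):

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// The dynamic client works with unstructured objects keyed by
// GroupVersionResource, so it needs far less generated typed surface
// than kubernetes.Interface.
func listNodesDynamically(ctx context.Context, config *rest.Config) error {
	dyn, err := dynamic.NewForConfig(config)
	if err != nil {
		return err
	}
	nodes := schema.GroupVersionResource{Version: "v1", Resource: "nodes"}
	_, err = dyn.Resource(nodes).List(ctx, metav1.ListOptions{})
	return err
}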

@justinsb justinsb force-pushed the rolling_update_timeout branch from 744b555 to ac05cad on June 7, 2020 16:24
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 7, 2020
@justinsb justinsb force-pushed the rolling_update_timeout branch 2 times, most recently from eb8348f to 06af22f on June 9, 2020 17:09
@justinsb justinsb force-pushed the rolling_update_timeout branch from 06af22f to bc685c0 on June 9, 2020 17:11
pkg/k8sclient/client.go (outdated review thread, resolved)
Co-authored-by: Peter Rifel <[email protected]>
Member

@johngmyers johngmyers left a comment


This layering approach has me quite concerned: I fear it's going to turn into a significant maintenance burden. I understand the need for expediency, but I don't think we should plan to have such a thing long term.

@@ -511,7 +487,7 @@ func (c *RollingUpdateCluster) drainNode(u *cloudinstances.CloudInstanceGroupMem
 	}

 	helper := &drain.Helper{
-		Client: c.K8sClient,
+		Client: c.K8sClient.RawClient(),
Member


We're always going to have to call into imported code such as drain.Helper and those things are going to need a kubernetes.Interface. So those won't get the enhanced error handling, we'll have to fork and maintain the imported code ourselves, or we'll have to wrap kubernetes.Interface.

// trigger a timeout error and then to recover we need to
// reset any existing connections
if config.Timeout < maxTimeout {
	config.Timeout = maxTimeout
Member


Do we really want to modify the caller's rest.Config? Perhaps we should make a copy?
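
e.g. (rest.CopyConfig is client-go's helper for exactly this):

// Copy before mutating so the caller's rest.Config is left untouched.
config = rest.CopyConfig(config)
if config.Timeout < maxTimeout {
	config.Timeout = maxTimeout
}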

patchBytes, err := strategicpatch.CreateTwoWayMergePatch(oldData, newData, node)
if err != nil {
	return err
}
Member


I believe the caller should create patchBytes and this layer should simply concern itself with calling Patch() and handleError(), just like ListNodes() and DeleteNode() do.

}

_, err = k8sClient.RawClient().CoreV1().Nodes().Patch(ctx, node.Name, types.StrategicMergePatchType, patchBytes, metav1.PatchOptions{})
if c, ok := k8sClient.(*client); ok {
Member


This is definitely a code smell. Assuming mocking of Interface is useful, this should be a method of Interface so it can also be mocked.
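
i.e. something like this, with HandleError as a hypothetical method name:

// Hoisting the hook onto the interface lets fakes implement (and tests
// mock) it, instead of type-asserting down to the concrete *client.
type Interface interface {
	// RawClient returns the underlying kubernetes.Interface. Deprecated.
	RawClient() kubernetes.Interface
	// HandleError reports an error from a RawClient() call so the wrapper
	// can reset connections if needed. (Hypothetical.)
	HandleError(err error)
}

The call site then becomes k8sClient.HandleError(err), with no type assertion.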

"k8s.io/kops/pkg/k8sclient"
)

type Fake struct {
Member


Why does this type exist? Why not use a real client wrapping a fake.Clientset?
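
i.e. something like this (fake is k8s.io/client-go/kubernetes/fake; NewWithClient is a hypothetical constructor on the wrapper):

// Wrap client-go's fake clientset in the real wrapper so tests exercise
// the same code path as production, instead of a parallel Fake type.
fakeK8s := fake.NewSimpleClientset()
c := k8sclient.NewWithClient(fakeK8s)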

@hakman
Member

hakman commented Aug 17, 2020

Stepping back a little, my understanding is that Go 1.15 was the blocker for fixing this on the kubernetes/client-go side.
Is there any way to make this happen now, or does the Go patch at least open some other possibilities?
kubernetes/client-go#374 (comment)

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 17, 2020
@k8s-ci-robot
Contributor

@justinsb: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

	klog.Warningf("client timed out, but http transport was not of expected type, was %T", restClient.Client.Transport)
	return
}
httpTransport.CloseIdleConnections()


@justinsb How would this ensure that the connection over which the request just timed out is idle, and hence gets closed? Is there a possibility that the connection is picked up by another request and is in use while this code executes? // @johngmyers

Member


If the connection got picked up by another request, that's fine. When the connection does go idle the last request to complete will close it.


@johngmyers Sure, this may work in this particular case. Would the same approach work in general for other high-throughput clients suffering from connection staleness? Sorry to hijack this thread; my issue is similar but in a different system, and I was wondering whether the same approach could be applied there.

@hakman
Member

hakman commented Dec 14, 2020

@justinsb I think this should be fixed now by kubernetes/kubernetes#96778 via #10385. I believe it can be closed.

@k8s-ci-robot
Contributor

@justinsb: The following tests failed, say /retest to rerun all failed tests:

Test name                        Commit   Rerun command
pull-kops-verify-cloudformation  88b332d  /test pull-kops-verify-cloudformation
pull-kops-verify-hashes          88b332d  /test pull-kops-verify-hashes
pull-kops-e2e-k8s-containerd     88b332d  /test pull-kops-e2e-k8s-containerd

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 11, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 11, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
