Use a custom kubernetes client, to handle network failures #9009
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: justinsb. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Force-pushed from ed38e12 to 744b555.
Keep in mind the exit strategy. When client-go is fixed we'll want to unravel this interface. An alternative approach would be to have `client` implement (only) `kubernetes.Interface`, wrapping some or all of the methods with the error handling. This would allow error handling for the calls that the drainer uses, for instance. Admittedly, `kubernetes.Interface` is a voluminous interface to write a delegating implementation for, but if writing a reflection-based code generator is too much then one could embed the inner `kubernetes.Interface` and only override the key methods.
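For illustration, here is a minimal sketch (not code from this PR) of that embed-and-override approach: the wrapper satisfies `kubernetes.Interface` by embedding the inner client and only overrides the methods whose errors we want to intercept. The names `errorHandlingClient` and `handleError` are hypothetical.

```go
package k8sclient

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
)

// errorHandlingClient delegates everything to the embedded kubernetes.Interface,
// except the accessors we explicitly override below.
type errorHandlingClient struct {
	kubernetes.Interface
	handleError func(error)
}

// CoreV1 returns a wrapped CoreV1Interface so that calls like Nodes().List go
// through our error handling.
func (c *errorHandlingClient) CoreV1() typedcorev1.CoreV1Interface {
	return &errorHandlingCoreV1{CoreV1Interface: c.Interface.CoreV1(), handleError: c.handleError}
}

type errorHandlingCoreV1 struct {
	typedcorev1.CoreV1Interface
	handleError func(error)
}

func (c *errorHandlingCoreV1) Nodes() typedcorev1.NodeInterface {
	return &errorHandlingNodes{NodeInterface: c.CoreV1Interface.Nodes(), handleError: c.handleError}
}

type errorHandlingNodes struct {
	typedcorev1.NodeInterface
	handleError func(error)
}

// List is one of the "key methods" being overridden; everything else is
// inherited from the embedded NodeInterface.
func (n *errorHandlingNodes) List(ctx context.Context, opts metav1.ListOptions) (*corev1.NodeList, error) {
	list, err := n.NodeInterface.List(ctx, opts)
	if err != nil {
		n.handleError(err)
	}
	return list, err
}
```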
pkg/k8sclient/client.go
c, err := kubernetes.NewForConfig(config)
if err != nil {
	return nil, fmt.Errorf("error creating kubernetes client: %v", err)
Perhaps just `return err`? Wrapping it doesn't add any value.
Fair - done. It's mostly habit from calling the AWS SDK, where some of the functions return very terse messages, so I tend to wrap by default. Hopefully this all becomes easier when we start using %w, which we can start doing with Go 1.13 (https://blog.golang.org/go1.13-errors). I'm just a little wary about backports in general.
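For reference, a minimal sketch of what `%w` wrapping buys (Go 1.13+); the names here are illustrative, not from kops:

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("node not found")

func findNode(name string) error {
	// %w wraps the underlying error so callers can still inspect it.
	return fmt.Errorf("error looking up node %q: %w", name, errNotFound)
}

func main() {
	err := findNode("node-1")
	fmt.Println(err)                         // error looking up node "node-1": node not found
	fmt.Println(errors.Is(err, errNotFound)) // true: errors.Is walks the wrap chain
}
```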
patchBytes, err := strategicpatch.CreateTwoWayMergePatch(oldData, newData, node)
if err != nil {
	return err
}
Pushing the patch creation logic down into here strikes me as a mixing of concerns. This layer should limit itself to timeouts and error recovery.
So the problem is that we need to call handleError. I do agree that there's more logic in here than there should be, so I moved it to be a package-level function (which takes a k8sclient.Interface).
I believe the caller should create `patchBytes` and this layer should simply concern itself with calling `Patch()` and `handleError()`, just like `ListNodes()` and `DeleteNode()` do.
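A minimal sketch of the split being suggested, assuming a hypothetical `PatchNode` wrapper and `handleError` helper (the exact names in this PR may differ): the caller computes `patchBytes`, and the client layer only calls `Patch()` and `handleError()`.

```go
package k8sclient

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/strategicpatch"
	"k8s.io/client-go/kubernetes"
)

type client struct {
	k8s kubernetes.Interface
}

func (c *client) handleError(err error) {
	// e.g. reset connections on timeout; omitted in this sketch.
}

// PatchNode only calls Patch and handleError; it does not build the patch.
func (c *client) PatchNode(ctx context.Context, name string, patchBytes []byte) (*corev1.Node, error) {
	node, err := c.k8s.CoreV1().Nodes().Patch(ctx, name, types.StrategicMergePatchType, patchBytes, metav1.PatchOptions{})
	if err != nil {
		c.handleError(err)
	}
	return node, err
}

// BuildNodePatch shows the caller-side half: computing patchBytes before
// handing them to PatchNode.
func BuildNodePatch(oldData, newData []byte, node *corev1.Node) ([]byte, error) {
	return strategicpatch.CreateTwoWayMergePatch(oldData, newData, node)
}
```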
type Interface interface {
	// RawClient returns the current kubernetes.Interface; it should not be cached.
	// Using wrapper methods is preferable because we can do richer retry logic and error handling.
	RawClient() kubernetes.Interface
It would be simpler to just embed `kubernetes.Interface` in both `Interface` and `client`.
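A minimal sketch of that embedding (illustrative only, not the PR's code):

```go
package k8sclient

import "k8s.io/client-go/kubernetes"

// Interface exposes every client-go method via embedding, plus any kops-specific
// wrapper methods with retry/error handling that would be added here.
type Interface interface {
	kubernetes.Interface
}

// client embeds the real clientset, so no RawClient() accessor is needed.
type client struct {
	kubernetes.Interface
}

var _ Interface = &client{}
```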
My strategy is to gradually remove RawClient() - anything that uses RawClient() won't get our enhanced error handling. I added a comment that this should be considered Deprecated.
You raise an excellent point on client-go wrapping and the exit strategy: I think our exit strategy is actually to use less and less of client-go. It's hard to keep track, but I think the plan for client-go is to favor the dynamic interface. (I also marked RawClient deprecated to make it doubly clear that ideally we get to a point where we don't use it.)
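For context on "favoring the dynamic interface", here is a minimal sketch of reading a Node through client-go's dynamic client rather than the typed clientset; it is illustrative and not code from this PR:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	dyn, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Nodes are cluster-scoped, so no Namespace() call is needed.
	nodesGVR := schema.GroupVersionResource{Group: "", Version: "v1", Resource: "nodes"}
	node, err := dyn.Resource(nodesGVR).Get(context.TODO(), "node-1", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println(node.GetName())
}
```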
Force-pushed from 744b555 to ac05cad.
Force-pushed from eb8348f to 06af22f.
By setting a low HTTP timeout, and resetting connections on error, we should be able to work around the http2 connection issues: kubernetes/client-go#374. I observed this bug specifically when rolling an HA control plane without a load balancer.
Force-pushed from 06af22f to bc685c0.
Co-authored-by: Peter Rifel <[email protected]>
This layering approach has me quite concerned: I fear it's going to turn into a significant maintenance burden. I understand the need for expediency, but I don't think we should plan to have such a thing long term.
@@ -511,7 +487,7 @@ func (c *RollingUpdateCluster) drainNode(u *cloudinstances.CloudInstanceGroupMem
 	}

 	helper := &drain.Helper{
-		Client: c.K8sClient,
+		Client: c.K8sClient.RawClient(),
We're always going to have to call into imported code such as `drain.Helper`, and those things are going to need a `kubernetes.Interface`. So either those won't get the enhanced error handling, or we'll have to fork and maintain the imported code ourselves, or we'll have to wrap `kubernetes.Interface`.
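A minimal sketch of that point: if the kops client itself satisfied `kubernetes.Interface`, imported code such as `drain.Helper` could consume it directly and still get the enhanced error handling. `wrappedClient` is a hypothetical value implementing `kubernetes.Interface`:

```go
package drainer

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// newDrainHelper accepts any kubernetes.Interface, so a wrapped client with
// enhanced error handling can be passed straight through to drain.Helper.
func newDrainHelper(ctx context.Context, wrappedClient kubernetes.Interface) *drain.Helper {
	return &drain.Helper{
		Ctx:                 ctx,
		Client:              wrappedClient, // no RawClient() unwrapping needed
		Force:               true,
		IgnoreAllDaemonSets: true,
		Timeout:             5 * time.Minute,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
}
```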
// trigger a timeout error and then to recover we need to
// reset any existing connections
if config.Timeout < maxTimeout {
	config.Timeout = maxTimeout
Do we really want to modify the caller's `rest.Config`? Perhaps we should make a copy?
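A minimal sketch of the copy being suggested, using `rest.CopyConfig` so the caller's config is left untouched; `maxTimeout` is an assumed constant and the timeout condition mirrors the excerpt above:

```go
package k8sclient

import (
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const maxTimeout = 15 * time.Second // assumed value

func newClient(config *rest.Config) (kubernetes.Interface, error) {
	// Copy first so the caller's rest.Config is not mutated.
	config = rest.CopyConfig(config)

	// Mirroring the excerpt above: raise any Timeout below maxTimeout
	// (including the zero value, which means "no timeout") to maxTimeout,
	// so stale connections surface as timeout errors we can recover from.
	if config.Timeout < maxTimeout {
		config.Timeout = maxTimeout
	}

	c, err := kubernetes.NewForConfig(config)
	if err != nil {
		return nil, err
	}
	return c, nil
}
```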
}

_, err = k8sClient.RawClient().CoreV1().Nodes().Patch(ctx, node.Name, types.StrategicMergePatchType, patchBytes, metav1.PatchOptions{})
if c, ok := k8sClient.(*client); ok {
This is definitely a code smell. Assuming mocking of `Interface` is useful, this should be a method of `Interface` so it can also be mocked.
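A minimal sketch of the suggestion: put the node-patching operation on `Interface` itself so callers never need the `k8sClient.(*client)` type assertion, and a fake can implement the same method (`PatchNode` is a hypothetical name):

```go
package k8sclient

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

type Interface interface {
	RawClient() kubernetes.Interface

	// PatchNode lives on the interface, so callers never need a type assertion
	// and mocks can provide their own implementation.
	PatchNode(ctx context.Context, name string, patchBytes []byte) (*corev1.Node, error)
}

// Callers then write:
//
//	_, err = k8sClient.PatchNode(ctx, node.Name, patchBytes)
//
// and both the real client and any fake satisfy Interface.
```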
"k8s.io/kops/pkg/k8sclient" | ||
) | ||
|
||
type Fake struct { |
Why does this type exist? Why not use a real `client` wrapping a `fake.Clientset`?
Stepping back a little, my understanding is that Go 1.15 was the blocker for fixing this on the kubernetes/client-go side.
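A minimal sketch of the alternative raised above: testing against client-go's `fake.Clientset` rather than a hand-written `Fake`. How the real `client` wraps an injected `kubernetes.Interface` depends on this PR's constructors, so that line is left as a placeholder:

```go
package k8sclient_test

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func TestListNodesAgainstFake(t *testing.T) {
	// fake.NewSimpleClientset provides a full kubernetes.Interface backed by
	// in-memory objects.
	fakeClientset := fake.NewSimpleClientset(&corev1.Node{
		ObjectMeta: metav1.ObjectMeta{Name: "node-1"},
	})

	// A real client wrapping this fake would be constructed here; the
	// constructor name depends on the PR's code, so it is omitted.

	nodes, err := fakeClientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		t.Fatal(err)
	}
	if len(nodes.Items) != 1 {
		t.Fatalf("expected 1 node, got %d", len(nodes.Items))
	}
}
```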
@justinsb: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
klog.Warningf("client timed out, but http transport was not of expected type, was %T", restClient.Client.Transport)
return
}
httpTransport.CloseIdleConnections()
@justinsb How would this make sure that the connection over which the request just timed out is idle and hence would be closed? Is there a possibility that the connection is picked up by another request and is in use while this is getting executed? // @johngmyers
If the connection got picked up by another request, that's fine. When the connection does go idle the last request to complete will close it.
@johngmyers Sure, this may work in this particular case. Would the same approach work in general for other high-throughput clients suffering from connection staleness? Sorry to hijack this thread, but my issue is similar, just related to a different system, and I was wondering if the same approach can be applied there.
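As a general-purpose illustration of the pattern under discussion (not kops code): set a low client timeout so stale connections fail fast, and call `CloseIdleConnections()` on the transport when a timeout is detected:

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"os"
	"time"
)

func main() {
	transport := &http.Transport{}
	client := &http.Client{
		Transport: transport,
		Timeout:   5 * time.Second, // low timeout so stale connections fail fast
	}

	_, err := client.Get("https://example.com/healthz")
	if err != nil {
		var netErr net.Error
		if errors.As(err, &netErr) && netErr.Timeout() {
			// Drop pooled connections; connections still in use by other
			// requests are closed once they go idle.
			transport.CloseIdleConnections()
		}
		fmt.Fprintln(os.Stderr, err)
	}
}
```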
@justinsb I think this should be fixed now by kubernetes/kubernetes#96778 via #10385. I believe it can be closed.
@justinsb: The following tests failed. Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-contributor-experience at kubernetes/community.
Stale issues rot after 30d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-contributor-experience at kubernetes/community.
Rotten issues close after 30d of inactivity. Send feedback to sig-contributor-experience at kubernetes/community.
@fejta-bot: Closed this PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.