✨ Add Health Check logic to MachineHealthCheck Reconciler #2250
Conversation
/cc @ncdc (for continuity)
Rebased to resolve conflicts
/milestone v0.3.0-rc.1
/milestone v0.3.0 bumping this from today's release
/milestone v0.3.0-rc.2
Reviewing this now
@ncdc did you want to take another look?
@vincepri yes I have a review in progress
if m.Spec.NodeStartupTimeout != nil && m.Spec.NodeStartupTimeout.Seconds() < 30 {
	allErrs = append(
		allErrs,
		field.Invalid(field.NewPath("spec", "nodeStartupTimeout"), m.Spec.NodeStartupTimeout, "must be greater at least 30s"),
close 😄 - remove "greater"
Doh 🤦♂ 😂
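For reference, a hedged sketch of how the corrected validation might read once "greater" is dropped from the message; the webhook plumbing around it is assumed from the snippet above and this is not the exact code from the PR:

// Sketch only: the same check as above with the message typo fixed.
// Assumes allErrs is a field.ErrorList and m.Spec.NodeStartupTimeout is a *metav1.Duration.
if m.Spec.NodeStartupTimeout != nil && m.Spec.NodeStartupTimeout.Seconds() < 30 {
	allErrs = append(
		allErrs,
		field.Invalid(field.NewPath("spec", "nodeStartupTimeout"), m.Spec.NodeStartupTimeout, "must be at least 30s"),
	)
}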
controller           controller.Controller
recorder             record.EventRecorder
scheme               *runtime.Scheme
clusterNodeInformers *sync.Map
I don't believe this needs to be a pointer
mhcList := &clusterv1.MachineHealthCheckList{}
if err := r.Client.List(
	context.Background(),
	mhcList,
	&client.ListOptions{Namespace: machine.Namespace},
	client.MatchingFields{mhcClusterNameIndex: machine.Spec.ClusterName},
); err != nil {
	r.Log.Error(err, "Unable to list MachineHealthChecks", "node", node.Name, "machine", machine.Name, "namespace", machine.Namespace)
	return nil
}

var requests []reconcile.Request
for k := range mhcList.Items {
	mhc := &mhcList.Items[k]
	if hasMatchingLabels(mhc.Spec.Selector, machine.Labels) {
		key := util.ObjectKey(mhc)
		requests = append(requests, reconcile.Request{NamespacedName: key})
	}
}
return requests
Would you want to extract this to a function for reuse here & in machineToMachineHealthCheck?
I've changed this method to call machineToMachineHealthCheck once it has the machine; I think that makes sense to do.
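Roughly, that delegation could look like the sketch below. The signature follows the controller-runtime handler.MapObject mapping style of that era; getMachineFromNode is a hypothetical helper standing in for however the reconciler resolves the owning Machine, and the real names may differ from the PR:

// nodeToMachineHealthCheck maps a Node event to requests for the MachineHealthChecks
// that select the Node's owning Machine, by resolving the Machine first and then
// reusing machineToMachineHealthCheck instead of duplicating the list/label-match logic.
func (r *MachineHealthCheckReconciler) nodeToMachineHealthCheck(o handler.MapObject) []reconcile.Request {
	node, ok := o.Object.(*corev1.Node)
	if !ok {
		return nil
	}

	// Hypothetical lookup of the Machine backing this Node.
	machine, err := r.getMachineFromNode(node)
	if err != nil || machine == nil {
		return nil
	}

	return r.machineToMachineHealthCheck(handler.MapObject{Meta: machine, Object: machine})
}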
@ncdc Thanks for the review again, I've addressed all of your feedback and left a few comments where appropriate
}

// durations should all be less than 1 Hour
minDuration := time.Hour
Yeah, that's a good idea. I then started the range from durations[1:]; not sure if that's good or not from a readability perspective.
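As an illustration of the shape being discussed, seeding the minimum from the first element and then ranging over durations[1:] removes the need for an artificial upper bound like time.Hour (the function name and signature here are assumptions, not the exact code in the PR):

// minOfDurations returns the smallest duration in the slice.
// Seeding minDuration from durations[0] avoids a sentinel such as time.Hour;
// ranging over durations[1:] simply skips re-comparing the seed.
func minOfDurations(durations []time.Duration) time.Duration {
	if len(durations) == 0 {
		return time.Duration(0)
	}
	minDuration := durations[0]
	for _, d := range durations[1:] {
		if d < minDuration {
			minDuration = d
		}
	}
	return minDuration
}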
// Ensure that concurrent reconciles don't clash when setting up watches

key := util.ObjectKey(cluster)
if _, ok := r.loadClusterNodeInformer(key); ok {
I've gone back to a map and RWMutex; the logic is now:
- Check if the informer exists in the map under RLock
- If it doesn't, attempt to acquire the write Lock
- Once the Lock is acquired, double-check no one else updated the map in the meantime
- Still under lock, set up the informer and add it to the map

I broke this into a couple of smaller methods so the locks could be scoped nicely (see the sketch below); let me know what you think.
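A minimal sketch of that read-then-write pattern, assuming the clusterNodeInformers map and clusterNodeInformersLock fields shown later in this thread; the method names follow the snippet above, but the bodies are illustrative only:

// loadClusterNodeInformer checks for an existing informer under the read lock.
func (r *MachineHealthCheckReconciler) loadClusterNodeInformer(key client.ObjectKey) (cache.Informer, bool) {
	r.clusterNodeInformersLock.RLock()
	defer r.clusterNodeInformersLock.RUnlock()
	informer, ok := r.clusterNodeInformers[key]
	return informer, ok
}

// storeClusterNodeInformer sets up and registers an informer under the write lock,
// double-checking that a concurrent reconcile didn't register one while we waited.
func (r *MachineHealthCheckReconciler) storeClusterNodeInformer(key client.ObjectKey, newInformer func() (cache.Informer, error)) error {
	r.clusterNodeInformersLock.Lock()
	defer r.clusterNodeInformersLock.Unlock()

	// Double-check under the write lock before doing any work.
	if _, ok := r.clusterNodeInformers[key]; ok {
		return nil
	}

	informer, err := newInformer()
	if err != nil {
		return err
	}
	r.clusterNodeInformers[key] = informer
	return nil
}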
// a node with only a name represents a
// not found node in the target
This is subtle, particularly when combined with L100. Do you think it would be clearer if we added a nodeMissing field to healthCheckTarget?
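For illustration, that suggestion might look something like the struct below; the fields other than nodeMissing are assumptions about what healthCheckTarget already carries, not the actual definition:

// healthCheckTarget pairs a Machine with its Node for health checking.
// An explicit nodeMissing flag replaces the convention of a Node object
// that carries only a name to signal "not found" (sketch; other fields assumed).
type healthCheckTarget struct {
	MHC         *clusterv1.MachineHealthCheck
	Machine     *clusterv1.Machine
	Node        *corev1.Node
	nodeMissing bool
}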
recorder                 record.EventRecorder
scheme                   *runtime.Scheme
clusterNodeInformers     map[client.ObjectKey]cache.Informer
clusterNodeInformersLock *sync.RWMutex
Suggested change:
-	clusterNodeInformersLock *sync.RWMutex
+	clusterNodeInformersLock sync.RWMutex
@JoelSpeed: The following test failed. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
LGTM
r.controller = controller
r.recorder = mgr.GetEventRecorderFor("machinehealthcheck-controller")
r.scheme = mgr.GetScheme()
r.clusterNodeInformers = make(map[client.ObjectKey]cache.Informer)
r.clusterNodeInformersLock = sync.RWMutex{}
nit (non-blocking): the zero value of a RWMutex is ready for use, so this is not strictly necessary
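A small illustration of that point; informerRegistry is a made-up type whose fields mirror the reconciler's, and only the map needs explicit initialization because the zero value of sync.RWMutex is already usable:

// informerRegistry is a hypothetical holder used only to illustrate the nit above.
type informerRegistry struct {
	lock      sync.RWMutex // zero value is ready for use; no sync.RWMutex{} assignment required
	informers map[client.ObjectKey]cache.Informer
}

func newInformerRegistry() *informerRegistry {
	return &informerRegistry{
		informers: make(map[client.ObjectKey]cache.Informer),
	}
}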
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: JoelSpeed, ncdc
}

var requests []reconcile.Request
for k := range mhcList.Items {
Given that we're not paginating, the list might not have all the results we need. Although this seems unlikely for now, I'd add at least a TODO, wdyt?
This is coming from the cache. I'm fairly certain we only need to paginate if we're doing live reads. Right?
Not 100% sure; what if there are more items? We should be able to test this one easily.
Follow-up from https://kubernetes.slack.com/archives/C0EG7JC6T/p1582925668071800: the apiserver does not force paging on clients; clients must request paging by setting options.Limit > 0. And the controller-runtime informer-based caching client we get from the Manager does a full List() without pagination.
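For context, opting in to paging against a live (uncached) reader looks roughly like the sketch below. This is only an illustration of the API-server behaviour being discussed, not something the informer-backed client in this PR does, and the helper name is made up:

// listAllMachineHealthChecks pages through MachineHealthChecks explicitly by setting
// a Limit and following the continue token. Sketch only: meaningful for a live/uncached
// client, since the caching client returns everything from its local store in one List.
func listAllMachineHealthChecks(ctx context.Context, c client.Client, namespace string) ([]clusterv1.MachineHealthCheck, error) {
	var items []clusterv1.MachineHealthCheck
	continueToken := ""
	for {
		list := &clusterv1.MachineHealthCheckList{}
		opts := []client.ListOption{client.InNamespace(namespace), client.Limit(500)}
		if continueToken != "" {
			opts = append(opts, client.Continue(continueToken))
		}
		if err := c.List(ctx, list, opts...); err != nil {
			return nil, err
		}
		items = append(items, list.Items...)
		continueToken = list.Continue
		if continueToken == "" {
			return items, nil
		}
	}
}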
What this PR does / why we need it:
This PR adds logic to fetch targets from MachineHealthChecks and perform the health check on them to determine whether or not they should be remediated. It does not yet remediate them; I will follow up with that logic in a separate PR once I have had time to work on it.
I ran this on a cluster and verified that it was indeed reacting to events from Nodes/Machines correctly and that, if a node was unhealthy, the MachineHealthChecker followed the correct paths and logged as I was expecting.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Machine health check targeting logic from #1990