-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🌱 Log when and why a machine is marked for remediation #3385
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
Tangential question about whether we should add helper methods to the Condition type to help with logging conditions, definitely not a blocker.
condition := conditions.Get(t.Machine, clusterv1.MachineHealthCheckSuccededCondition) | ||
logger.Info("Target has failed health check, marking for remediation", "target", t.string(), "reason", condition.Reason, "message", condition.Message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not directly related to this PR, but I'm wondering if it would be good to add helper methods to the Condition type to "stringify" a condition for purposes such as this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I looked for this and was surprised we didn't already have one. String()
on the condition and maybe a conditions.String(c)
method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that is along the lines of what I was thinking, but not sure if we can completely generalize it or should have a few different helpers for different purposes (failure vs success, brief vs verbose, etc). It might have the potential to be a bit of a rabbit hole, so likely best handled as a separate issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @benmoss. Is there an example msg when this happened?
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ncdc The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@@ -250,7 +250,8 @@ func (r *MachineHealthCheckReconciler) reconcile(ctx context.Context, logger log | |||
// mark for remediation | |||
errList := []error{} | |||
for _, t := range unhealthy { | |||
logger.V(3).Info("Target meets unhealthy criteria, triggers remediation", "target", t.string()) | |||
condition := conditions.Get(t.Machine, clusterv1.MachineHealthCheckSuccededCondition) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this guaranteed to be non-nil? I'd assume so, just double checking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything in unhealthy
should be getting its condition set in needsRemediation()
/milestone v0.3.8 |
What this PR does / why we need it:
Right now MHC logs almost nothing on the
-v=0
log level. This adds a new message when a machine fails a health check and we are marking it for remediation. We store information about why the health check fails on the condition, so we can use that to surface the failure now.Also removes the "patching machine" log line from healthy machines, this is almost always useless in my experience.