Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Resiliency metrics #224

Closed

Conversation

Reham77
Copy link

@Reham77 Reham77 commented Jul 29, 2024

Description:
The HyperPod team will tag each node on the Kubernetes level with a label which describes it’s health status, this PR goal is to add a feature to extract these labels value and emit a metrics to CW.

Testing: Deployed a custom agent on a testing cluster

image

Documentation: N/A

@Reham77 Reham77 requested review from movence and straussb July 29, 2024 14:13
@Reham77 Reham77 requested a review from mxiamxia as a code owner July 29, 2024 14:13
Allocatable v1.ResourceList
ProviderId string
InstanceType string
HyperPodLabels map[Label]k8sutil.HyperPodConditionType
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something generic like Labels?


const (
SageMakerNodeHealthStatus Label = "sagemaker.amazonaws.com/node-health-status"
SageMakerNodeHealthStatusSC Label = "n"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add a comment saying this is a single char to reduce memory consumption? This name doesnt matter, it could be a single byte as well.

}
fields[ci.MetricName(ci.TypeHyperPodNode, ci.ConditionToMetricName[k8sutil.Unschedulable.String()])] = isUnschedulable
fields[ci.MetricName(ci.TypeHyperPodNode, ci.ConditionToMetricName[k8sutil.Unknown.String()])] = isLabelUnknown(labels, k8sclient.SageMakerNodeHealthStatusSC)
attributes[ci.InstanceID] = strings.TrimLeft(nodeName, "hyperod-")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo here. hyperpod-

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SA1024: cutset contains duplicate characters (staticcheck)
A cutset is treated as a set of characters to remove from a
string

https://pkg.go.dev/honnef.co/go/tools/staticcheck

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I think it removes all chars which might mess with the instance id itself, will replace it with TrimPrefix

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this again, we shouldn't trim this, lets find out the rule that breaks without a valid instance-id and fix that instead.

}
}
fields[ci.MetricName(ci.TypeHyperPodNode, ci.ConditionToMetricName[k8sutil.Unschedulable.String()])] = isUnschedulable
fields[ci.MetricName(ci.TypeHyperPodNode, ci.ConditionToMetricName[k8sutil.Unknown.String()])] = isLabelUnknown(labels, k8sclient.SageMakerNodeHealthStatusSC)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a continuous metric? I think this is fine to be sparse.

@Reham77 Reham77 force-pushed the sagemaker-leader branch from a3ebe70 to 0411607 Compare July 30, 2024 11:49
SchedulableMetric = "schedulable"
SchedulablePreferredMetric = "schedulable_preferred"
UnschedulableMetric = "unschedulable"
Unknown = "unknown"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Align this?

@Reham77 Reham77 force-pushed the sagemaker-leader branch 2 times, most recently from 5c1c7a7 to f2eb07a Compare July 31, 2024 17:06
@@ -464,10 +464,8 @@ func TestK8sAPIServer_GetMetrics(t *testing.T) {
assert.Equal(t, "i-abcdef123456789", getStringAttrVal(metric, ci.InstanceID))
assertMetricValueEqual(t, metric, "hyper_pod_node_health_status_unschedulable_pending_reboot", int64(0))
assertMetricValueEqual(t, metric, "hyper_pod_node_health_status_schedulable", int64(1))
assertMetricValueEqual(t, metric, "hyper_pod_node_health_status_schedulable_preferred", int64(0))
assertMetricValueEqual(t, metric, "hyper_pod_node_health_status_unschedulable", int64(0))
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be removed also?

Copy link

@spanaik spanaik Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unschedulable will be a label value going forward and no longer a synthetic metric, thus i added it in line 556

@@ -131,8 +131,8 @@ func parseDeploymentFromReplicaSet(name string) string {
return name[:lastDash]
}

func isHyperPodNode(nodeName string) bool {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we change to use instance type instead?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would have been easier to do the IntegTest with a mocked HyperPod cluster with this change, but i think we can revert this we will use an actual HyperPod Cluster for the Integ Tests.

@Reham77 Reham77 force-pushed the sagemaker-leader branch 2 times, most recently from 0a8e0ff to 0cf92f6 Compare August 14, 2024 15:41
remove unknown and schedulable_preferred metrics

Changing HyperPodNode check

Merge conflicts
@Reham77 Reham77 closed this Aug 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants