Add Resiliency metrics #224
Allocatable v1.ResourceList
ProviderId string
InstanceType string
HyperPodLabels map[Label]k8sutil.HyperPodConditionType
Something generic like Labels?
const (
	SageMakerNodeHealthStatus Label = "sagemaker.amazonaws.com/node-health-status"
	SageMakerNodeHealthStatusSC Label = "n"
Let's add a comment saying this is a single character to reduce memory consumption. The name doesn't matter; it could be a single byte as well.
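A minimal sketch of what such a comment could look like, and of how the short key might be used. The `Label` type and both constants come from the diff above; `compactLabels` is a hypothetical helper, and the map value is a plain `string` here rather than `k8sutil.HyperPodConditionType` so the snippet stays self-contained.

```go
package main

import "fmt"

// Label is a Kubernetes label key, or a compact stand-in for one.
type Label string

const (
	// SageMakerNodeHealthStatus is the full label key set on each node.
	SageMakerNodeHealthStatus Label = "sagemaker.amazonaws.com/node-health-status"
	// SageMakerNodeHealthStatusSC is a single-character stand-in for the key
	// above. It is only used as an in-memory map key, so the exact value is
	// arbitrary; it is one character long to reduce memory consumption.
	SageMakerNodeHealthStatusSC Label = "n"
)

// compactLabels (hypothetical) keeps only the health-status label from a
// node's labels and stores it under the short key.
func compactLabels(nodeLabels map[string]string) map[Label]string {
	out := make(map[Label]string, 1)
	if v, ok := nodeLabels[string(SageMakerNodeHealthStatus)]; ok {
		out[SageMakerNodeHealthStatusSC] = v
	}
	return out
}

func main() {
	nodeLabels := map[string]string{
		"sagemaker.amazonaws.com/node-health-status": "Schedulable",
		"kubernetes.io/os":                           "linux",
	}
	fmt.Println(compactLabels(nodeLabels))
}
```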
}
fields[ci.MetricName(ci.TypeHyperPodNode, ci.ConditionToMetricName[k8sutil.Unschedulable.String()])] = isUnschedulable
fields[ci.MetricName(ci.TypeHyperPodNode, ci.ConditionToMetricName[k8sutil.Unknown.String()])] = isLabelUnknown(labels, k8sclient.SageMakerNodeHealthStatusSC)
attributes[ci.InstanceID] = strings.TrimLeft(nodeName, "hyperod-")
Typo here: this should be hyperpod-
SA1024: cutset contains duplicate characters (staticcheck)
A cutset is treated as a set of characters to remove from a string.
Ah, I think it removes every leading character that appears in the cutset, which might mess with the instance ID itself. I will replace it with TrimPrefix.
Thinking about this again, we shouldn't trim this. Let's find the rule that breaks without a valid instance ID and fix that instead.
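For reference, a small sketch of the difference discussed in this thread. `strings.TrimLeft` treats its second argument as a set of characters, while `strings.TrimPrefix` removes one exact leading string. The node names below are made up for illustration; the instance ID is the one used in the test further down. Note that for a name of the form hyperpod-i-..., both calls happen to agree, because `i` is not in the cutset; the over-trimming only shows up when the remainder starts with a character from the set.

```go
package main

import (
	"fmt"
	"strings"
)

// trimWithCutset strips every leading character that occurs in "hyperpod-",
// regardless of order, and stops at the first character outside that set.
func trimWithCutset(nodeName string) string {
	return strings.TrimLeft(nodeName, "hyperpod-")
}

// trimWithPrefix strips the literal prefix "hyperpod-" once, if present.
func trimWithPrefix(nodeName string) string {
	return strings.TrimPrefix(nodeName, "hyperpod-")
}

func main() {
	// Hypothetical node names, for illustration only.
	for _, name := range []string{"hyperpod-i-abcdef123456789", "hyperpod-dev-node"} {
		fmt.Printf("%-28s TrimLeft=%q TrimPrefix=%q\n",
			name, trimWithCutset(name), trimWithPrefix(name))
	}
}
```

With "hyperpod-dev-node", the cutset version also eats the leading "de" of "dev", since `d` and `e` are both in the set.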
}
}
fields[ci.MetricName(ci.TypeHyperPodNode, ci.ConditionToMetricName[k8sutil.Unschedulable.String()])] = isUnschedulable
fields[ci.MetricName(ci.TypeHyperPodNode, ci.ConditionToMetricName[k8sutil.Unknown.String()])] = isLabelUnknown(labels, k8sclient.SageMakerNodeHealthStatusSC)
Should this be a continuous metric? I think this is fine to be sparse.
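To make the two options concrete, here is a rough sketch of the difference, assuming a continuous metric is written on every collection cycle (0 or 1) and a sparse one is written only when the condition holds. The helper names and the metric name are hypothetical and are not taken from this PR.

```go
package main

import "fmt"

// emitContinuous always writes the field: 1 when the condition holds, else 0.
func emitContinuous(fields map[string]int, name string, cond bool) {
	if cond {
		fields[name] = 1
	} else {
		fields[name] = 0
	}
}

// emitSparse writes the field only when the condition holds; otherwise the
// metric is simply absent for that collection cycle.
func emitSparse(fields map[string]int, name string, cond bool) {
	if cond {
		fields[name] = 1
	}
}

func main() {
	const name = "hyper_pod_node_health_status_unknown" // hypothetical metric name

	continuous := map[string]int{}
	sparse := map[string]int{}
	emitContinuous(continuous, name, false)
	emitSparse(sparse, name, false)

	fmt.Println("continuous fields:", len(continuous))
	fmt.Println("sparse fields:", len(sparse))
}
```

The sparse form produces fewer datapoints, at the cost that "no data" and "condition is false" look the same downstream.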
a3ebe70 to 0411607
SchedulableMetric = "schedulable"
SchedulablePreferredMetric = "schedulable_preferred"
UnschedulableMetric = "unschedulable"
Unknown = "unknown"
Align this?
5c1c7a7 to f2eb07a
@@ -464,10 +464,8 @@ func TestK8sAPIServer_GetMetrics(t *testing.T) {
assert.Equal(t, "i-abcdef123456789", getStringAttrVal(metric, ci.InstanceID))
assertMetricValueEqual(t, metric, "hyper_pod_node_health_status_unschedulable_pending_reboot", int64(0))
assertMetricValueEqual(t, metric, "hyper_pod_node_health_status_schedulable", int64(1))
assertMetricValueEqual(t, metric, "hyper_pod_node_health_status_schedulable_preferred", int64(0))
assertMetricValueEqual(t, metric, "hyper_pod_node_health_status_unschedulable", int64(0))
Shouldn't this be removed also?
unschedulable will be a label value going forward and no longer a synthetic metric, so I added it at line 556.
@@ -131,8 +131,8 @@ func parseDeploymentFromReplicaSet(name string) string {
return name[:lastDash]
}

func isHyperPodNode(nodeName string) bool {
Why did we change to use instance type instead?
It would have been easier to run the integration test against a mocked HyperPod cluster with this change, but I think we can revert it; we will use an actual HyperPod cluster for the integration tests.
0a8e0ff to 0cf92f6
remove unknown and schedulable_preferred metrics Changing HyperPodNode check Merge conflicts
0cf92f6 to 2711bbc
Description:
The HyperPod team will tag each node at the Kubernetes level with a label that describes its health status. The goal of this PR is to extract these label values and emit them as metrics to CW.
Testing: Deployed a custom agent on a testing cluster
Documentation: N/A