Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Resiliency Metrics #226

Merged
merged 1 commit into from
Aug 19, 2024

Conversation

Reham77
Copy link

@Reham77 Reham77 commented Aug 14, 2024

Description:
The HyperPod team will tag each node on the Kubernetes level with a label which describes it’s health status, this PR goal is to add a feature to extract these labels value and emit a metrics to CW.

Testing: Deployed a custom agent on a testing cluster

image

Documentation: N/A

@Reham77 Reham77 requested review from movence and straussb August 14, 2024 19:18
@Reham77 Reham77 requested a review from mxiamxia as a code owner August 14, 2024 19:18
@Reham77 Reham77 force-pushed the leader-node-dev branch 2 times, most recently from 500f9a4 to 16f89bc Compare August 15, 2024 12:59
internal/aws/containerinsight/const.go Outdated Show resolved Hide resolved
internal/aws/containerinsight/utils.go Show resolved Hide resolved
internal/aws/k8s/k8sclient/clientset.go Show resolved Hide resolved
internal/aws/k8s/k8sclient/node.go Show resolved Hide resolved
k8sClient: k8sclient.Get(logger),
logger: logger,
k8sClient: k8sclient.Get(logger,
k8sclient.CaptureOnlyNodeLabelsInfo(true),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this will be always true and not contorlled by a flag or something?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also add unit test for this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be empty if the node is not hyperpod (instance type doesn't start with ml)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have direct access to captureOnlyNodeLabelInfo from leaderElection. also i don't see the CaptureNodeLevelInfo getting tested. any suggestions?

Copy link

@spanaik spanaik Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this will be always true and not contorlled by a flag or something?

Yes this will run only on the leader agent to collect the labels for all nodes for now, we might end up collecting other labels for the node local agent as well, so we will need it for that as well.

internal/aws/k8s/k8sutil/util.go Show resolved Hide resolved
internal/aws/containerinsight/const.go Show resolved Hide resolved
if sageMakerHealthStatus, ok := node.Labels[SageMakerNodeHealthStatus.String()]; ok {
info.Labels = make(map[Label]int8)
if condition, ok := k8sutil.ParseString(sageMakerHealthStatus); ok {
info.Labels[SageMakerNodeHealthStatus] = condition

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this map will have a max of one label for the health status and not have any other labels.
Does this even have to be a map in that case? Cant this just be a string called sageMakerNodeHealthStatus as part of NodeInfo and itll be empty for non hyperpod nodes. Callers can check for empty string or ret pointer and check for nil.

You can probably remove the captureOnlyNodeLabelInfo as well at that point?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling it something generic like NodeToLabelsMap gives readers an expectation that all the labels set on the node are accessible via this map.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The map are part of an allowlist that uses an ENUM instead of strings to store the label keys/values, any additional labels to be stored will have to be made into an ENUM, thus we will continue to use this pattern for now.

internal/aws/containerinsight/const.go Outdated Show resolved Hide resolved
internal/aws/containerinsight/utils.go Show resolved Hide resolved
movence
movence previously approved these changes Aug 19, 2024
sky333999
sky333999 previously approved these changes Aug 19, 2024
@sky333999 sky333999 merged commit d503b27 into amazon-contributing:aws-cwa-dev Aug 19, 2024
146 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants