Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds TESTING node label during invasive checks #42

Merged
merged 1 commit into from
Sep 12, 2024

Conversation

jimcadden
Copy link
Collaborator

Fixes #41

Copy link
Collaborator

@cmisale cmisale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a feature request

{
"metadata": {
"labels": {
"autopilot.ibm.com/gpuhealth": ""
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmisale should a failed dcgm job results in gpuhealth=ERR ? Here I've cleared the label if it errrors (not sure if that's the right thing to do in any case)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a little more complicated..
No, a fail in the creation of a Job for the dcgm should not result into an error in the GPUs.
In this code, I am not sure how to interpret this result. Because again, a periodic check that reports all good, is going to overwrite that label with PASS. And there it's clearing out that label for no real reason... meaning the reason is not related to the GPU failings or any other test.

If we want to update that value with PASS or ERR based on the dcgm r3 result, it has to happen somewhere else.

Copy link
Collaborator Author

@jimcadden jimcadden Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to update that value with PASS or ERR based on the dcgm r3 result, it has to happen somewhere else.

Yes, it should. My assumption in this PR was that TESTING label would only persist until the dcgm result finished, when it was overwritten with PASS or ERR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does get overwritten here:

"autopilot.ibm.com/gpuhealth": general_health}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. So that should close the loop! We only need to make sure that the testing value goes away if something happens and can't remove it. This is quite critical because it would prevent jobs from running even tho the node might be perfectly fine

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If CreateJob(...) fails then the TESTING label is cleared. If gpu-dcgm/entrypoint.py starts it should eventually overwrite TESTING with a PASS or EVICT label.

I suppose gpu-dcgm/entrypoint.py could crash before updating the label, and then we would be in trouble. However, we don't want to remove TESTING before dcgm finishes... Hmm. any ideas?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a solution for this..
I think the place where it can go wrong, is in the patch api call, we should be managing the dcgm fails/errors in the try_dcgm function

@jimcadden jimcadden force-pushed the temp_lbl branch 3 times, most recently from 16a7287 to cc0a5c0 Compare August 30, 2024 15:07
func InvasiveCheckHandler() http.Handler {
fn := func(w http.ResponseWriter, r *http.Request) {
w.Write([]byte("Launching invasive health checks. Results added to 'gpuhealth' node label"))
InvasiveCheckTimer()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should change this function name.. it seems like we're starting a timer, while instead we're just calling the test

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good observation

Signed-off-by: Jim Cadden <jcadden@ibm.com>
Copy link
Collaborator

@cmisale cmisale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jimcadden jimcadden merged commit 24c0d06 into IBM:main Sep 12, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New node label for reserving nodes
2 participants