Adds TESTING node label during invasive checks #42
Conversation
Left a feature request
{
  "metadata": {
    "labels": {
      "autopilot.ibm.com/gpuhealth": ""
    }
  }
}
@cmisale should a failed dcgm job result in gpuhealth=ERR? Here I've cleared the label if it errors (not sure that's the right thing to do in any case).
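For context, clearing a node label via the API could look like the following minimal sketch using the Python kubernetes client (the helper name and in-cluster config are assumptions, not this PR's actual code). Note that in a merge patch a null value deletes the label, while "" keeps the label present with an empty value, as in the diff above:

# Minimal sketch, not the PR's code: clearing the gpuhealth node label
# with a JSON merge patch via the official Python kubernetes client.
from kubernetes import client, config

def clear_gpuhealth_label(node_name):
    config.load_incluster_config()  # assumption: running inside the cluster
    # None deletes the label entirely; "" would keep it with an empty value.
    body = {"metadata": {"labels": {"autopilot.ibm.com/gpuhealth": None}}}
    client.CoreV1Api().patch_node(node_name, body)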
That is a little more complicated. No, a failure to create the Job for dcgm should not result in an error on the GPUs.
In this code, I am not sure how to interpret this result, because a periodic check that reports all good is going to overwrite that label with PASS anyway. And here it's clearing out that label for no real reason, meaning the reason is not related to the GPUs failing or any other test.
If we want to update that value with PASS or ERR based on the dcgm r3 result, it has to happen somewhere else.
If we want to update that value with PASS or ERR based on the dcgm r3 result, it has to happen somewhere else.
Yes, it should. My assumption in this PR was that the TESTING label would only persist until the dcgm run finished, at which point it would be overwritten with PASS or ERR.
It does get overwritten here:
"autopilot.ibm.com/gpuhealth": general_health} |
Good point. So that should close the loop! We only need to make sure that the TESTING value goes away if something goes wrong and the normal path can't remove it. This is quite critical, because a stale label would prevent jobs from running even though the node might be perfectly fine.
If CreateJob(...) fails then the TESTING label is cleared. If gpu-dcgm/entrypoint.py starts, it should eventually overwrite TESTING with a PASS or EVICT label.
I suppose gpu-dcgm/entrypoint.py could crash before updating the label, and then we would be in trouble. However, we don't want to remove TESTING before dcgm finishes... Hmm, any ideas?
I don't have a solution for this. I think the place where it can go wrong is the patch API call; we should be managing the dcgm failures/errors in the try_dcgm function.
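As a rough illustration of that suggestion (hypothetical names, and assuming try_dcgm updates the node through the Python kubernetes client), the patch call could be wrapped so an API failure is handled inside the check rather than being treated as a GPU failure:

# Hedged sketch of handling errors around the label patch in try_dcgm;
# the function and variable names are illustrative, not the repo's code.
from kubernetes import client
from kubernetes.client.rest import ApiException

def patch_gpuhealth(node_name, value):
    # Assumes kube config was already loaded by the surrounding code.
    body = {"metadata": {"labels": {"autopilot.ibm.com/gpuhealth": value}}}
    try:
        client.CoreV1Api().patch_node(node_name, body)
    except ApiException as e:
        # A failed patch is an API problem, not a GPU problem: log and retry
        # (or surface the error) instead of marking the GPUs unhealthy.
        print(f"gpuhealth label patch failed for {node_name}: {e}")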
Force-pushed from 16a7287 to cc0a5c0
func InvasiveCheckHandler() http.Handler {
	fn := func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("Launching invasive health checks. Results added to 'gpuhealth' node label"))
		InvasiveCheckTimer()
Maybe we should change this function name; it sounds like we're starting a timer, when in fact we're just calling the test.
Yes, good observation
Signed-off-by: Jim Cadden <jcadden@ibm.com>
LGTM
Fixes #41