Adds TESTING node label during invasive checks #42
Conversation
Left a feature request
{
  "metadata": {
    "labels": {
      "autopilot.ibm.com/gpuhealth": ""
    }
  }
}
@cmisale should a failed dcgm job result in gpuhealth=ERR? Here I've cleared the label if it errors (not sure that's the right thing to do in any case).
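For context, clearing a node label via the API could look like the following minimal sketch using the Python kubernetes client (the helper name and in-cluster config are assumptions, not this PR's actual code). Note that in a merge patch a null value deletes the label, while "" keeps the label present with an empty value, as in the diff above:

# Minimal sketch, not the PR's code: clearing the gpuhealth node label
# with a JSON merge patch via the official Python kubernetes client.
from kubernetes import client, config

def clear_gpuhealth_label(node_name):
    config.load_incluster_config()  # assumption: running inside the cluster
    # None deletes the label entirely; "" would keep it with an empty value.
    body = {"metadata": {"labels": {"autopilot.ibm.com/gpuhealth": None}}}
    client.CoreV1Api().patch_node(node_name, body)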
That is a little more complicated. No, a failure to create the Job for dcgm should not result in an error on the GPUs.
In this code, I am not sure how to interpret this result, because a periodic check that reports all good is going to overwrite that label with PASS anyway. And here it's clearing out that label for no real reason, meaning the reason is not related to the GPUs failing or any other test.
If we want to update that value with PASS or ERR based on the dcgm r3 result, it has to happen somewhere else.
If we want to update that value with PASS or ERR based on the dcgm r3 result, it has to happen somewhere else.
Yes, it should. My assumption in this PR was that the TESTING label would only persist until the dcgm run finished, at which point it would be overwritten with PASS or ERR.
It does get overwritten here:
"autopilot.ibm.com/gpuhealth": general_health} |
Good point. So that should close the loop! We only need to make sure that the TESTING value goes away if something goes wrong and the normal path can't remove it. This is quite critical, because a stale label would prevent jobs from running even though the node might be perfectly fine.
If CreateJob(...) fails then the TESTING label is cleared. If gpu-dcgm/entrypoint.py starts, it should eventually overwrite TESTING with a PASS or EVICT label.
I suppose gpu-dcgm/entrypoint.py could crash before updating the label, and then we would be in trouble. However, we don't want to remove TESTING before dcgm finishes... Hmm, any ideas?
I don't have a solution for this. I think the place where it can go wrong is the patch API call; we should be managing the dcgm failures/errors in the try_dcgm function.
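As a rough illustration of that suggestion (hypothetical names, and assuming try_dcgm updates the node through the Python kubernetes client), the patch call could be wrapped so an API failure is handled inside the check rather than being treated as a GPU failure:

# Hedged sketch of handling errors around the label patch in try_dcgm;
# the function and variable names are illustrative, not the repo's code.
from kubernetes import client
from kubernetes.client.rest import ApiException

def patch_gpuhealth(node_name, value):
    # Assumes kube config was already loaded by the surrounding code.
    body = {"metadata": {"labels": {"autopilot.ibm.com/gpuhealth": value}}}
    try:
        client.CoreV1Api().patch_node(node_name, body)
    except ApiException as e:
        # A failed patch is an API problem, not a GPU problem: log and retry
        # (or surface the error) instead of marking the GPUs unhealthy.
        print(f"gpuhealth label patch failed for {node_name}: {e}")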
Force-pushed from 16a7287 to cc0a5c0
func InvasiveCheckHandler() http.Handler {
	fn := func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("Launching invasive health checks. Results added to 'gpuhealth' node label"))
		InvasiveCheckTimer()
Maybe we should change this function name; it sounds like we're starting a timer, when in fact we're just calling the test.
Yes, good observation
Signed-off-by: Jim Cadden <jcadden@ibm.com>
LGTM
Fixes #41