Add a set-status RPC for marking vertices up or down #1110

jameshcorbett · 2023-12-02T06:01:10Z

The motivation for this PR comes from this flux-coral2 issue. Rabbits can (and very often do) go down, and Fluxion needs to avoid scheduling rabbit jobs on dead rabbits. Rabbit status is monitored by HPE software and reported in Kubernetes objects, which flux-coral2 software checks. The flux-coral2 software needs a way to report to Fluxion that a rabbit has died (or come back up).

This PR provides a way to do that. Flux-coral2 can now send an RPC like sched-fluxion-resource.set_status /path/to/rabbit down to mark a rabbit as down, and a similar one to mark it up.

However, I did notice that marking all the nodes or racks in the resource graph as down didn't prevent jobs from running, hence the test failures. @milroy do you know whether this is expected? Is marking a resource as 'down' purely informational, with no effect on actual resource choices?

As an aside, one tricky aspect is that with this PR, flux-coral2 will need to know the resource graph layout. It will know the hostname of the rabbit that has gone down, but it won't necessarily know that the rabbit 'rzvernal201' is '/rzvernal/rack3/rzvernal201' in the resource graph. I think I can work around this though: the resource graph is already written out to a file (because it needs to contain information about rabbit locality), so the flux-coral2 software can read in that graph and figure things out.
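The hostname-to-path lookup described above could be sketched roughly as follows. This is only an illustration: it assumes the vertex paths have already been extracted from the graph file, and that the last path component is the hostname (as in the rzvernal201 example). The function name `find_vertex_path` is hypothetical and is not part of flux-coral2 or Fluxion.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: return the first vertex path whose final component
// equals the given hostname, or an empty string if nothing matches.
// Assumes paths look like "/cluster/rack/hostname".
std::string find_vertex_path (const std::vector<std::string> &paths,
                              const std::string &hostname)
{
    for (const std::string &path : paths) {
        std::string::size_type slash = path.rfind ('/');
        // If there is no slash, treat the whole string as the leaf name.
        std::string leaf = (slash == std::string::npos)
                               ? path
                               : path.substr (slash + 1);
        if (leaf == hostname)
            return path;
    }
    return "";
}
```

With the paths "/rzvernal", "/rzvernal/rack3", and "/rzvernal/rack3/rzvernal201", looking up "rzvernal201" yields the third path, which is the string a set_status request would need.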

jameshcorbett · 2023-12-05T15:37:56Z

@grondo can you describe what the core-sched interactions are when a node goes down?

grondo · 2023-12-05T16:14:13Z

When a node goes down the flux-core resource module sends a new resource.acquire response as detailed in RFC 28 with the down rank added to the down idset.
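As a rough sketch of how a scheduler might track these updates, the code below keeps a set of down ranks and applies idset strings to it. It assumes each update carries optional "up" and "down" idset strings and handles only the simple "0-3,7" idset form; the names `parse_idset` and `apply_update` are hypothetical and are not flux-core APIs.

```cpp
#include <set>
#include <sstream>
#include <string>

// Expand a simple idset string such as "0-3,7" into a set of ranks.
// Simplification: no bracket or whitespace handling.
std::set<int> parse_idset (const std::string &s)
{
    std::set<int> ids;
    std::stringstream ss (s);
    std::string tok;
    while (std::getline (ss, tok, ',')) {
        if (tok.empty ())
            continue;
        std::string::size_type dash = tok.find ('-');
        int lo = std::stoi (tok.substr (0, dash));
        int hi = (dash == std::string::npos)
                     ? lo
                     : std::stoi (tok.substr (dash + 1));
        for (int i = lo; i <= hi; i++)
            ids.insert (i);
    }
    return ids;
}

// Apply one update: ranks in up_idset leave the down set,
// ranks in down_idset join it. Either string may be empty.
void apply_update (std::set<int> &down,
                   const std::string &up_idset,
                   const std::string &down_idset)
{
    for (int rank : parse_idset (up_idset))
        down.erase (rank);
    for (int rank : parse_idset (down_idset))
        down.insert (rank);
}
```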

jameshcorbett · 2023-12-05T17:44:51Z

Thanks @grondo, that actually gave me the information I needed to fix the tests! I think this PR is a pretty harmless addition now.

zekemorton

I looked through all of the commits, and the code looks clean and all of the changes make sense! I couldn't find anything to complain about.

I think it's still a good idea for @milroy to look it over.

milroy · 2023-12-12T09:04:01Z

I'll add my review today.

milroy

This PR is almost ready to merge. I'm requesting one quick change to remove duplicated code; after that it will be ready.

resource/modules/resource_match.cpp

milroy · 2023-12-12T17:47:30Z

+        if (flux_respond_error (h, msg, EINVAL, "malformed RPC") < 0)
+            flux_log_error (h, "%s: flux_respond_error", __FUNCTION__);
+        return;


You could save the repetition of the flux_respond_error and flux_log_error by setting errmsg and using goto error like this block:

flux-sched/resource/modules/resource_match.cpp, line 1863 in 0b70113:

error:

Suggested change

-        if (flux_respond_error (h, msg, EINVAL, "malformed RPC") < 0)
-            flux_log_error (h, "%s: flux_respond_error", __FUNCTION__);
-        return;
+        errmsg = "malformed RPC";
+        goto error;

That would apply to the following blocks, too.

jameshcorbett · 2023-12-12

I originally wrote this function that way, but the compiler complained that I was jumping over the initialization of some variables. Do you have any recommendations for avoiding that? Should I move all declarations and initializations to the top of the function? That also seemed clunky...

milroy · 2023-12-12

I've run into that a lot, too. You'll need to declare and initialize the variables before the if blocks. Sometimes that isn't feasible; if that's true in this case then we can merge this PR.

jameshcorbett · 2023-12-12

Done! I also addressed your spacing comments.
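For readers unfamiliar with the compiler complaint discussed in this thread: C++ rejects a goto that jumps past the initialization of a variable still in scope at the label. A minimal sketch of the workaround follows, with every local declared before the first goto. The `reply_t` type and `set_status_sketch` function are hypothetical stand-ins and are not the code merged in this PR; the real handler responds through flux_respond_error.

```cpp
#include <cerrno>
#include <string>

// Hypothetical stand-in for an RPC reply, used only for illustration.
struct reply_t {
    int errnum = 0;
    std::string errmsg;
};

// Simplified handler sketch: validates a path and a status string and
// funnels every failure through one shared error label.
reply_t set_status_sketch (const char *path, const char *status)
{
    // All locals are declared and initialized before the first goto,
    // so no jump crosses an initialization.
    reply_t reply;
    int errnum = 0;
    const char *errmsg = "";
    std::string st;

    if (!path || !status) {
        errnum = EINVAL;
        errmsg = "malformed RPC";
        goto error;
    }
    st = status;  // plain assignment, not an initialization
    if (st != "up" && st != "down") {
        errnum = EINVAL;
        errmsg = "unrecognized status";
        goto error;
    }
    return reply;  // success: errnum stays 0

error:
    // Single place where the error reply is built.
    reply.errnum = errnum;
    reply.errmsg = errmsg;
    return reply;
}
```

Declaring `std::string st` inside the second check instead, after the first `goto error`, would trigger the "jump bypasses variable initialization" error that was described.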

milroy · 2023-12-12T21:40:12Z

LGTM! Approved.

Once you rebase against the current master you can set MWP.

Problem: there is no way for a service external to Fluxion to mark a vertex as down. Rabbit nodes are monitored by an external service, and Fluxion needs to be informed of their status. Add a 'sched-fluxion-resource.set_status' RPC for marking a vertex as up/down.


jameshcorbett · 2023-12-12T21:42:07Z

Done, setting MWP. Thanks @milroy !

codecov · 2023-12-12T21:51:05Z

Codecov Report

Merging #1110 (1658f88) into master (0b70113) will decrease coverage by 0.1%.
The diff coverage is 70.9%.

Additional details and impacted files

@@           Coverage Diff            @@
##           master   #1110     +/-   ##
========================================
- Coverage    71.8%   71.8%   -0.1%     
========================================
  Files          88      88             
  Lines       11531   11562     +31     
========================================
+ Hits         8287    8309     +22     
- Misses       3244    3253      +9

Files	Coverage Δ
resource/modules/resource_match.cpp	`67.8% <70.9%> (+<0.1%)`	⬆️

jameshcorbett requested a review from milroy December 2, 2023 06:01

jameshcorbett force-pushed the set-status-rpc branch 2 times, most recently from 87c94c6 to 0a283bd Compare December 2, 2023 06:18

jameshcorbett force-pushed the set-status-rpc branch from 0a283bd to d309907 Compare December 5, 2023 17:43

jameshcorbett requested a review from zekemorton December 5, 2023 17:44

jameshcorbett mentioned this pull request Dec 6, 2023

Draining rabbits flux-framework/flux-coral2#117

Merged

zekemorton approved these changes Dec 6, 2023

milroy requested changes Dec 12, 2023

jameshcorbett force-pushed the set-status-rpc branch 2 times, most recently from c3e5893 to eb0bc8a Compare December 12, 2023 21:36

milroy approved these changes Dec 12, 2023

jameshcorbett added 4 commits December 12, 2023 13:41

test: add set-status command to flux ion-resource

579c193

Problem: there is no convenient testing interface to the set-status resource service. Add a command to the flux-ion-resource script for sending a set-status RPC.

test: remove redundant python __future__ imports

46d6d62

Problem: there are some __future__ imports left over from the days of Python 2. Remove them, and fix a typo.

test: add tests for resource.set-status rpc

1658f88

Problem: there are no tests to ensure that the "sched-fluxion-resource.set_status" service works properly. Add tests.

jameshcorbett force-pushed the set-status-rpc branch from eb0bc8a to 1658f88 Compare December 12, 2023 21:41

jameshcorbett added the merge-when-passing label (mergify.io - merge PR automatically once CI passes) Dec 12, 2023

mergify bot merged commit 0278c61 into flux-framework:master Dec 12, 2023
22 of 23 checks passed

jameshcorbett deleted the set-status-rpc branch December 13, 2023 01:26

jameshcorbett mentioned this pull request Dec 15, 2023

Expand functionality of sched-fluxion-resource.set_status service. #1119

Open
