-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a set-status RPC for marking vertices up or down #1110
Conversation
87c94c6
to
0a283bd
Compare
@grondo can you describe what the core-sched interactions are when a node goes down? |
When a node goes down the flux-core |
0a283bd
to
d309907
Compare
Thanks @grondo , that actually gave me the information I needed to fix the tests! I think this PR is a pretty harmless addition now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked through all of the commits, and the code looks clean and all of the changes make sense! I couldn't find anything to complain about.
I think it's still a good idea for @milroy to give it a look over
I'll add my review today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is almost ready to merge. I'm requesting a quick change to remove duplicate code and it will be ready to merge.
resource/modules/resource_match.cpp
Outdated
if (flux_respond_error (h, msg, EINVAL, "malformed RPC") < 0) | ||
flux_log_error (h, "%s: flux_respond_error", __FUNCTION__); | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could save the repetition of the flux_respond_error
and flux_log_error
by setting errmsg
and using goto error
like this block:
flux-sched/resource/modules/resource_match.cpp
Line 1863 in 0b70113
error: |
if (flux_respond_error (h, msg, EINVAL, "malformed RPC") < 0) | |
flux_log_error (h, "%s: flux_respond_error", __FUNCTION__); | |
return; | |
errmsg = "malformed RPC"; | |
goto error; |
That would apply to the following blocks, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally wrote this function that way, but the compiler complained that I was jumping over the initialization of some variables. Do you have any recommendations to avoid that? Move all declarations and initializations to the top of the function? That also seemed clunky...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've run into that a lot, too. You'll need to declare and initialize the variable before the if
blocks. Sometimes that isn't feasible; if that's true in this case then we can merge this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! And your spacing comments addressed.
c3e5893
to
eb0bc8a
Compare
LGTM! Approved. Once you rebase against the current |
Problem: there is no way for a service external to Fluxion to mark a vertex as down. Rabbit nodes are monitored by an external service, and Fluxion needs to be informed of their status. Add a 'sched-fluxion-resource.set_status' RPC for marking a vertex as up/down.
Problem: there is no convenient testing interface to the set-status resource service. Add a command to the flux-ion-resource script for sending a set-status RPC.
Problem: there are some __future__ imports left over from the days of Python 2. Remove them, and fix a typo.
Problem: there are no tests to ensure that the "sched-fluxion-resource.set_status" service works properly. Add tests.
eb0bc8a
to
1658f88
Compare
Done, setting MWP. Thanks @milroy ! |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #1110 +/- ##
========================================
- Coverage 71.8% 71.8% -0.1%
========================================
Files 88 88
Lines 11531 11562 +31
========================================
+ Hits 8287 8309 +22
- Misses 3244 3253 +9
|
The motivation for this PR comes from this flux-coral2 issue. Rabbits can (and very often do) go down, and fluxion needs to avoid scheduling rabbit jobs on dead rabbits. Rabbit status is monitored by HPE software and reported in kubernetes objects, which flux-coral2 software checks. The flux-coral2 software needs a way to report to Fluxion that a rabbit has died (or resurrected).
This PR provides a way to do that. Flux-coral2 can now send a RPC like
sched-fluxion-resource.set_status /path/to/rabbit down
to mark a rabbit as down, and similarly to mark it up.However, I did notice that marking all the nodes or racks in the resource graph didn't prevent jobs from running, hence the test failures. @milroy do you know whether this is expected? Is marking a resource as 'down' purely informational--does it have no effect on actual resource choices?
As an aside, one tricky aspect is that with this PR, flux-coral2 will need to know the resource graph layout. It will know the hostname of the rabbit that has gone down, but it won't necessarily know that the rabbit 'rzvernal201' is '/rzvernal/rack3/rzvernal201` in the resource graph. I think I can work around this though--the resource graph is already written out to a file (because it needs to contain information about rabbit locality) so the flux-coral2 software can read in that graph and figure things out.