Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a set-status RPC for marking vertices up or down #1110

Merged
merged 4 commits into from
Dec 12, 2023

Conversation

jameshcorbett
Copy link
Member

The motivation for this PR comes from this flux-coral2 issue. Rabbits can (and very often do) go down, and fluxion needs to avoid scheduling rabbit jobs on dead rabbits. Rabbit status is monitored by HPE software and reported in kubernetes objects, which flux-coral2 software checks. The flux-coral2 software needs a way to report to Fluxion that a rabbit has died (or resurrected).

This PR provides a way to do that. Flux-coral2 can now send a RPC like sched-fluxion-resource.set_status /path/to/rabbit down to mark a rabbit as down, and similarly to mark it up.

However, I did notice that marking all the nodes or racks in the resource graph didn't prevent jobs from running, hence the test failures. @milroy do you know whether this is expected? Is marking a resource as 'down' purely informational--does it have no effect on actual resource choices?

As an aside, one tricky aspect is that with this PR, flux-coral2 will need to know the resource graph layout. It will know the hostname of the rabbit that has gone down, but it won't necessarily know that the rabbit 'rzvernal201' is '/rzvernal/rack3/rzvernal201` in the resource graph. I think I can work around this though--the resource graph is already written out to a file (because it needs to contain information about rabbit locality) so the flux-coral2 software can read in that graph and figure things out.

@jameshcorbett jameshcorbett requested a review from milroy December 2, 2023 06:01
@jameshcorbett jameshcorbett force-pushed the set-status-rpc branch 2 times, most recently from 87c94c6 to 0a283bd Compare December 2, 2023 06:18
@jameshcorbett
Copy link
Member Author

@grondo can you describe what the core-sched interactions are when a node goes down?

@grondo
Copy link
Contributor

grondo commented Dec 5, 2023

When a node goes down the flux-core resource module sends a new resource.acquire response as detailed in RFC 28 with the down rank added to the down idset.

@jameshcorbett
Copy link
Member Author

Thanks @grondo , that actually gave me the information I needed to fix the tests! I think this PR is a pretty harmless addition now.

Copy link
Collaborator

@zekemorton zekemorton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked through all of the commits, and the code looks clean and all of the changes make sense! I couldn't find anything to complain about.

I think it's still a good idea for @milroy to give it a look over

@milroy
Copy link
Member

milroy commented Dec 12, 2023

I'll add my review today.

Copy link
Member

@milroy milroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is almost ready to merge. I'm requesting a quick change to remove duplicate code and it will be ready to merge.

resource/modules/resource_match.cpp Outdated Show resolved Hide resolved
resource/modules/resource_match.cpp Outdated Show resolved Hide resolved
Comment on lines 2655 to 2657
if (flux_respond_error (h, msg, EINVAL, "malformed RPC") < 0)
flux_log_error (h, "%s: flux_respond_error", __FUNCTION__);
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could save the repetition of the flux_respond_error and flux_log_error by setting errmsg and using goto error like this block:

Suggested change
if (flux_respond_error (h, msg, EINVAL, "malformed RPC") < 0)
flux_log_error (h, "%s: flux_respond_error", __FUNCTION__);
return;
errmsg = "malformed RPC";
goto error;

That would apply to the following blocks, too.

Copy link
Member Author

@jameshcorbett jameshcorbett Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally wrote this function that way, but the compiler complained that I was jumping over the initialization of some variables. Do you have any recommendations to avoid that? Move all declarations and initializations to the top of the function? That also seemed clunky...

Copy link
Member

@milroy milroy Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've run into that a lot, too. You'll need to declare and initialize the variable before the if blocks. Sometimes that isn't feasible; if that's true in this case then we can merge this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! And your spacing comments addressed.

@jameshcorbett jameshcorbett force-pushed the set-status-rpc branch 2 times, most recently from c3e5893 to eb0bc8a Compare December 12, 2023 21:36
@milroy
Copy link
Member

milroy commented Dec 12, 2023

LGTM! Approved.

Once you rebase against the current master you can set MWP.

Problem: there is no way for a service external to Fluxion to mark
a vertex as down.

Rabbit nodes are monitored by an external service, and Fluxion needs
to be informed of their status.

Add a 'sched-fluxion-resource.set_status' RPC for marking a vertex
as up/down.
Problem: there is no convenient testing interface to the set-status
resource service.

Add a command to the flux-ion-resource script for sending a
set-status RPC.
Problem: there are some __future__ imports left over from the days
of Python 2.

Remove them, and fix a typo.
Problem: there are no tests to ensure that the
"sched-fluxion-resource.set_status" service works properly.

Add tests.
@jameshcorbett
Copy link
Member Author

Done, setting MWP. Thanks @milroy !

@jameshcorbett jameshcorbett added the merge-when-passing mergify.io - merge PR automatically once CI passes label Dec 12, 2023
Copy link

codecov bot commented Dec 12, 2023

Codecov Report

Merging #1110 (1658f88) into master (0b70113) will decrease coverage by 0.1%.
The diff coverage is 70.9%.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #1110     +/-   ##
========================================
- Coverage    71.8%   71.8%   -0.1%     
========================================
  Files          88      88             
  Lines       11531   11562     +31     
========================================
+ Hits         8287    8309     +22     
- Misses       3244    3253      +9     
Files Coverage Δ
resource/modules/resource_match.cpp 67.8% <70.9%> (+<0.1%) ⬆️

@mergify mergify bot merged commit 0278c61 into flux-framework:master Dec 12, 2023
22 of 23 checks passed
@jameshcorbett jameshcorbett deleted the set-status-rpc branch December 13, 2023 01:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-when-passing mergify.io - merge PR automatically once CI passes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants