Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfc27: document job feasibility service #425

Merged
merged 3 commits into from
Aug 20, 2024

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Aug 19, 2024

This is a first cut at documenting the feasibilty service proposed in #384.

Fixes #384

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some quick comments

spec_27.rst Outdated
@@ -99,6 +99,8 @@ Design Criteria

- Allow the expiration time of a resource allocation to be adjusted.

- Allow out of band detection of unsatisfiable job requests.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be more to the point to say "Detect unsatisfiable job requests at submission time"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. How about "Allow the detection of unsatisfiable job requests at submission time" since the feasibility service and corresponding validation plugin are optional? If you'd prefer the shorter version and don't think it implies the detection of unsatisfiable jobs is required by the RFC, then I'm fine with that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure that works for me.

spec_27.rst Outdated

A scheduler or other entity MAY register a generic ``feasibility``
service name through which the feasibility or satisfiability of jobs may
be determined out of band, for example during job submission.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like my previous comment, maybe "may be determined during job submission"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Note that the reason I left this statement more open is that there is nothing preventing use of the feasibility service and validator plugin outside of job submission (e.g. via flux job-validator --jobspec-only), so I was being a bit wary of specifying only at job submission in the RFC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whatever you think is best. My first thought was that it's clearer if we get directly to the primary intended purpose but you make a good point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just took your suggestions verbatim. I think shorter is better at this point, and if gets confusing in the future (which it probably won't) then it is easy to update the RFC.

spec_27.rst Outdated
Comment on lines 670 to 674
If the job could ever be satisfied, then success SHALL be indicated via
a response with the following REQUIRED key:

errnum
(integer) An integer error number zero (0).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh weird, I'm guessing the existing protocol requires errnum in the success response payload? I wonder why we did that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was copying the requirements for the validator workers. Shall we change it for the feasibilty.check service? (Also if you have another suggestion for the name of the service, feel free to share it. I was just throwing something up to get this moving)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eh, looking at the feasibility validator plugin, I don't think the errnum is actually required. It just happens to be in the sched-simple implementation for some reason.

https://github.com/flux-framework/flux-core/blob/dc2ef05918c8df0a9324d7fa36d5d6c85da98088/src/modules/sched-simple/sched.c#L703-L705

I'll remove it here since it doesn't seem necessary, and open a PR to remove it in sched-simple's implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup sounds good.

@grondo
Copy link
Contributor Author

grondo commented Aug 19, 2024

Ok, I've pushed changes that I believe address @garlick's comments (also had to add a URL to linkcheck ignore). This could use another pass.

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I did have one thought but I'm not sure how to resolve it. You mentioned that feasibility might run on more than one rank for scalability but since the feasibility client isn't discussed it's left open ended how that would get used. Would it make sense to just tack on an explanatory sentence that mentions that feasibility is normally called by the ingest validator which runs on all ranks and makes the RPC with FLUX_NODEID_ANY so it would work whether loaded on all ranks or just rank 0? Or maybe as a boxed note?

Or maybe not at all! It just seemed a little open ended to me. I leave it to you to decide.

@grondo
Copy link
Contributor Author

grondo commented Aug 19, 2024

I was wondering about that too. I wasn't sure if it was appropriate to go into that much detail in this RFC. I like your suggestion.

@grondo
Copy link
Contributor Author

grondo commented Aug 19, 2024

so it would work whether loaded on all ranks or just rank 0?

Side note: in the case of the Fluxion feasibility service, there will be a non-trivial memory hit for running this service module, so I think the intent is not to run it on every rank, but possibly rank 0 and the login nodes? I wonder if the sentence "all ranks or just rank 0" should be modified to state something along the lines of

...the service SHALL be loaded on one or more ranks that are upstream of any rank where job submission is expected.

Or something?

@garlick
Copy link
Member

garlick commented Aug 19, 2024

Works for me.

@grondo
Copy link
Contributor Author

grondo commented Aug 19, 2024

Uhh, I took a stab at it, but not too happy with the result. Feel free to suggest editing.

@garlick
Copy link
Member

garlick commented Aug 19, 2024

, therefore the feasibility service SHALL be loaded on one or more ranks upstream of any rank where job submission is expected.

maybe instead

. The request is routed to a local feasibility service if available, or an upstream one per RFC 3.

grondo added 3 commits August 19, 2024 22:35
Problem: The feasibility service, through which the satisifiability
of submitted jobs may be determined out of band, is not documented.

Add a Feasibility section to RFC 27 describing an optional feasibility
service.
Problem: A linked mpich commit gets "406 Client not acceptable for URL"
during linkcheck.

Add this URL to the ignore list.
Problem: On Ubuntu 22.04 "unsatisfiable" is not in the spelling
dictionary.

Add it to spell.en.pws.
@grondo
Copy link
Contributor Author

grondo commented Aug 19, 2024

Thanks, I took your suggestion verbatim (sorry, having trouble with my prose today) and force-pushed the result.

@grondo
Copy link
Contributor Author

grondo commented Aug 20, 2024

I guess I'll set MWP here since this was approved. We can always update with future PRs.

@mergify mergify bot merged commit 7eb2664 into flux-framework:master Aug 20, 2024
7 checks passed
@grondo grondo deleted the issue#384 branch August 20, 2024 02:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RFC 27: define sched.feasibility RPC
2 participants