rfc15: describe IMP signal handling + minor updates #429

garlick · 2024-10-17T23:14:04Z

Problem: RFC 15 only describes flux imp kill but we need the IMP to forward signals.

Take a stab at describing this behavior, admittedly without a lot of detail as yet - perhaps we can fix that by coming to some agreement on those details. But this is a start anyway.

Problem: the RFC states that the IMP takes its input on stdin to avoid placing sensitive data on the command line, but stdin is no longer used for this. The IMP now obtains its input by running a helper program provided by the instance rather than by reading stdin. The helper is run from the unprivileged part of the IMP. For now, just drop the incorrect detail, which wasn't necessary in that part of the text anyway. See also: flux-framework/flux-security#163

Problem: it seems like the intent was to render R_local as R with a subscript, but this was not quite achieved. Use :math: which makes this simple.

Problem: the spec says that the IMP exits after starting the job shell but this is no longer the case. The IMP must linger in order to finalize the PAM session. Reword the post-verification execution section to reflect this behavior. See also: flux-framework/flux-security#150

grondo

Thanks!! Just added one comment and a typo I happened to notice.

grondo · 2024-10-17T23:17:14Z

spec_15.rst

@@ -372,13 +372,25 @@ A multi-user instance of Flux not only requires the ability to execute
 work as a guest user, but it must also have privilege to monitor and
 kill these processes as part of normal resource manager operation.

-Signaling and terminating jobs in a multi-user instance


commit message typo: forard

grondo · 2024-10-17T23:29:30Z

spec_15.rst

+The mechanism by which processes are identified to receive SIGKILL is
+outside the scope of this document.


So is the consensus that we no longer need flux-imp kill? That would be kind of nice because it is a bit of code to maintain and has to read cgroup files to determine if the calling user has permission to signal a process.

It feels like we will need to add details here because there are two modes (at least) in which the IMP operates:

sdexec: The IMP should be the first process in a cgroup created by the flux user systemd instance for the specific job. In this case the IMP can use cgroup.procs to gather the list of PIDs to which it forwards signals.

without sdexec: The IMP is in a cgroup shared with the parent broker (system instance broker) and all other concurrently running jobs. It should not use cgroup.procs to gather the list of PIDs to signal, since that would include everything Flux is running on the node including the broker.

If we can codify in the RFC how to differentiate these two environments, then the implementation of the IMP signal forwarding would be much simpler (and documented). For instance, to follow current practice we could state something like "The IMP shall get the basename of the current cgroup directory at startup. If the directory begins with imp-shell, then the IMP SHALL forward SIGKILL to all PIDs listed in cgroup.procs. Otherwise, the IMP SHALL forward SIGKILL only to its direct child and optionally MAY include descendants."
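As a rough illustration of the rule proposed above, here is a minimal Python sketch. It is not the flux-security implementation (the IMP is not written in Python), and the function name `kill_targets`, its parameters, and the option to skip one PID are hypothetical names chosen for this example. The cgroup directory is passed in rather than discovered, so the sketch can run against a temporary directory.

```python
import tempfile
from pathlib import Path


def kill_targets(cgroup_dir, child_pid, skip_pid=None):
    """Return the PIDs that would receive SIGKILL under the proposed rule.

    cgroup_dir: directory of the cgroup the caller was started in.
    child_pid:  PID of the caller's direct child (the job shell).
    skip_pid:   optional PID to leave out, e.g. the caller itself
                (an assumption of this sketch, not stated in the thread).
    """
    cgroup_dir = Path(cgroup_dir)
    if cgroup_dir.name.startswith("imp-shell"):
        # Dedicated per-job cgroup: every PID listed belongs to the job.
        text = (cgroup_dir / "cgroup.procs").read_text()
        pids = [int(tok) for tok in text.split()]
        return [pid for pid in pids if pid != skip_pid]
    # Shared cgroup: cgroup.procs would include the broker and other jobs,
    # so fall back to the direct child only.
    return [child_pid]


if __name__ == "__main__":
    # Simulate both environments with fake cgroup directories.
    with tempfile.TemporaryDirectory() as tmp:
        job = Path(tmp) / "imp-shell-example"
        job.mkdir()
        (job / "cgroup.procs").write_text("100\n101\n102\n")
        print(kill_targets(job, child_pid=101, skip_pid=100))

        shared = Path(tmp) / "shared-example"
        shared.mkdir()
        (shared / "cgroup.procs").write_text("1\n2\n3\n")
        print(kill_targets(shared, child_pid=2))
```

The optional "MAY include descendants" part of the rule is left out here, since walking the process tree is a separate problem from choosing between the two modes.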

Ignore me if you already had this in mind (and it is what you meant when you said details could be added later).

BTW, your succinct description of how this should work is excellent. Thanks!

I was hoping you would swoop in with those details and you did! Let me put that in and we'll see how it looks.

Yes, I concur that flux imp kill isn't needed. At least I can't think why we need it. Dropping it will also simplify bulk-exec since the regular subprocess kill should work.

Oh wait a sec, when flux imp run is used, we might need flux imp kill unless we change that to work the same way.

Oh, good catch. 😞

Problem: the IMP kill subcommand is briefly mentioned as the way to signal guest processes, but this is inadequate in practice. Now that the IMP lingers, just have it forward signals to the job shell. In addition, describe a surrogate signal that tells the IMP to do its best to clean up the entire job container. See also: flux-framework/flux-core#6011

Problem: the method for delivering SIGKILL to all members of the job's container is not described. Describe the mechanism.
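The two commit messages above can be pictured with a small Python sketch of a lingering parent that forwards signals to its child. SIGKILL cannot be caught, which is why a catchable surrogate is needed at all. Which signal serves as the surrogate is not stated in this thread, so `SIGUSR1` below is an assumption, and `plan_delivery` and `forward_to_child` are hypothetical names for illustration only.

```python
import signal
import subprocess
import sys

# Assumption: the thread does not name the surrogate signal.
SURROGATE_SIGNAL = signal.SIGUSR1


def plan_delivery(signum, surrogate=SURROGATE_SIGNAL):
    """Decide what a lingering parent does with a signal it receives.

    Returns ("container", SIGKILL) for the surrogate signal, meaning
    "kill everything in the job's container", and ("child", signum)
    for any other signal, meaning "forward it to the job shell".
    """
    if signum == surrogate:
        return ("container", signal.SIGKILL)
    return ("child", signal.Signals(signum))


def forward_to_child(child, signum):
    """Forward an ordinary signal to the child process.

    The container-wide kill is deliberately left out of this sketch;
    for the surrogate signal the function only reports the decision.
    """
    target, sig = plan_delivery(signum)
    if target == "child":
        child.send_signal(sig)
    return target, sig


if __name__ == "__main__":
    # Stand-in for the job shell: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    print(forward_to_child(child, signal.SIGTERM))
    print(child.wait())
```

In a real parent these functions would be called from signal handlers installed with `signal.signal`; the sketch calls them directly to keep the control flow easy to follow.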

garlick · 2024-10-17T23:58:02Z

Pushed those changes, thanks!

grondo

This LGTM. Once this is merged we can open an issue to implement the RFC in flux-security.

garlick · 2024-10-18T00:16:13Z

I guess a question is should we address flux imp run? We could have it wait for its children and forward signals too. Or if we aren't going to do that we probably should add back flux imp kill to the RFC.

Actually, let's deal with that in a follow-on PR and get this merged, since it's a somewhat independent issue. I'll set MWP here.

garlick added 3 commits October 17, 2024 15:27

rfc15: use :math: to render J, R, R_local

35e4f9b

Problem: it seems like the intent was to render R_local as R with a subscript, but this was not quite achieved. Use :math: which makes this simple.

grondo reviewed Oct 17, 2024


garlick added 2 commits October 17, 2024 16:46

rfc15: describe method for killing container

53757f4

Problem: the method for delivering SIGKILL to all members of the job's container is not described. Describe the mechanism.

garlick force-pushed the imp_changes branch from ca5d3a2 to 53757f4 Compare October 17, 2024 23:57

grondo approved these changes Oct 18, 2024


garlick added the merge-when-passing label Oct 18, 2024

mergify bot merged commit 205c74c into flux-framework:master Oct 18, 2024
7 checks passed

garlick deleted the imp_changes branch October 18, 2024 00:19
