Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

configure system instance to require every job be a new instance #4214

Closed
grondo opened this issue Mar 13, 2022 · 5 comments
Closed

configure system instance to require every job be a new instance #4214

grondo opened this issue Mar 13, 2022 · 5 comments

Comments

@grondo
Copy link
Contributor

grondo commented Mar 13, 2022

(I thought we had captured this requirement somewhere, but I'm unable to find it at the moment)

We've talked about somehow enforcing that all jobs submitted to a system instance are new instances of Flux, i.e. they are submitted with jobspecs equivalent to flux mini batch or flux mini alloc. Some of the reasons for this requirement include:

  • Launching a new instance for every job directly launched from the system instance offloads much of the KVS and message traffic (e.g. all output is typically directed to a file or interactive pty, though this can always be overridden on the command line or jobspec, PMI for MPI jobs within the new instance are handled by subinstance broker wireup)
  • I think at one point we were thinking that sub-instances would be easier to recover from a system instance restart than individual MPI or other jobs, but I'm not sure that is the case anymore.
  • I actually thought this matches how we configure our Slurm clusters, but that doesn't seem to be the case (you can run an interactive srun in the batch partition just fine if you are patient enough)

Given that the second two bullets are not really true anymore, I wonder if this is still a requirement. It does seem like the first bullet has enough benefits that we might want to at least consider implementing a solution.

On the other hand, I'm not sure there is a practical way to determine that a submitted jobspec definitely results in a new instance of Flux. So this needs a little more thought.

@dongahn
Copy link
Member

dongahn commented Mar 17, 2022

It does seem like the first bullet has enough benefits that we might want to at least consider implementing a solution.

Yes, if I remember right, reducing the load at the system instance was the main motivation.

On the other hand, I'm not sure there is a practical way to determine that a submitted jobspect definitely results in a new instance of Flux. So this needs a little more thought.

Can we add a key to attributes section of the jobspec if the flux mini batch or alloc is used and have a job validator plugin for our system instance to validate it only this key exists? I don't even know if it is possible to configure a different validator for the system instance (e.g., TOML. configurable). So take it just as a food for thought.

@grondo
Copy link
Contributor Author

grondo commented Mar 17, 2022

Can we add a key to attributes section of the jobspec if the flux mini batch or alloc is used and have a job validator plugin for our system instance to validate it only this key exists?

We could, though nothing prevents a user from setting this key themselves, e.g. flux mini submit --cc=1-1024 --setattr=system.mini-batch=true ... so it seems a little kludgy...

@dongahn
Copy link
Member

dongahn commented Mar 17, 2022

Or maybe the validator can look at the task section and check if that invokes flux broker or has attributes.batch

@grondo
Copy link
Contributor Author

grondo commented Mar 17, 2022

Ok, so a system instance validator plugin could employ the following couple of checks: If any check is true then the job is validated for submission to the instance:

  • if attributes.system.batch is set with a valid script
  • if the basename of tasks[0].command[0] is flux, and tasks[0].command[1] is broker or start

I think you are right this should cover 99% of cases. Possibly we should distribute the validator plugin outside of flux-core in case there are some corner cases the plugin could be updated on-the-fly.

Here's a proof of concept plugin:

"""Only allow jobs that launch a new instance of Flux
"""

import errno
from os.path import basename
from flux.job.validator import ValidatorPlugin


class Validator(ValidatorPlugin):
    def validate(self, args):
        if "batch" in args.jobspec["attributes"]["system"]:
            return
        command = args.jobspec["tasks"][0]["command"]
        arg0 = basename(command[0])
        if arg0 == "flux" and command[1] in ["broker", "start"]:
            return
        return (
            errno.EINVAL,
            "Direct job submission is disabled for this instance."
            + " Please use the batch or alloc subcommands of flux-mini(1)",
        )
$ flux mini run hostname
flux-mini: ERROR: [Errno 22] Direct job submission is disabled for this instance. Please use the batch or alloc subcommands of flux-mini(1)
$ flux mini alloc -n1 hostname
fluke108
[detached: session exiting]

@dongahn
Copy link
Member

dongahn commented Mar 17, 2022

So fast!

grondo added a commit to grondo/flux-core that referenced this issue Mar 26, 2022
Problem: It might be useful for a system instance to reject jobs which
are not themselves new instances of Flux, i.e. either batch jobs or
jobs that run `flux start` or `flux broker`. However, Flux does not
provide a way to do this.

Add a very simple `require-instance` job validator which attempts to
reject all jobs that do not create a new instance of Flux. It does
this by looking for either the "batch" system attribute in jobspec
(which will cause the job shell to start a new instance), or a command
that starts with "flux start" or "flux broker".

If necessary, this plugin could be copied and modified in environments
that want to be slightly more permissive, or allow other command
arguments that might result in a new instance. However, at least Flux
will provide an example from which to base new validators.

Fixes flux-framework#4214
grondo added a commit to grondo/flux-core that referenced this issue Mar 26, 2022
Problem: It might be useful for a system instance to reject jobs which
are not themselves new instances of Flux, i.e. either batch jobs or
jobs that run `flux start` or `flux broker`. However, Flux does not
provide a way to do this.

Add a very simple `require-instance` job validator which attempts to
reject all jobs that do not create a new instance of Flux. It does
this by looking for either the "batch" system attribute in jobspec
(which will cause the job shell to start a new instance), or a command
that starts with "flux start" or "flux broker".

If necessary, this plugin could be copied and modified in environments
that want to be slightly more permissive, or allow other command
arguments that might result in a new instance. However, at least Flux
will provide an example from which to base new validators.

Fixes flux-framework#4214
@mergify mergify bot closed this as completed in cf12bf7 Mar 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants