
create program execution service (wreck replacement) #333

Closed
garlick opened this issue Aug 18, 2015 · 13 comments

Comments

@garlick
Member

garlick commented Aug 18, 2015

Implement a new service that executes distributed "programs" on top of cmb.exec. It depends on the KVS and creates bare-bones KVS entries for programs (both historical and current).

Tools such as 'ps' and 'top' would show currently executing programs.

Possibly flux-exec could be modified to (optionally) talk to this service and launch stuff in parallel?

The "program context" would be an opaque area in the program KVS space that would be left up to the "shell" (see #334 ) or the application to fill in.

This service can be (mostly) resource agnostic though it would need to know which broker ranks are to execute tasks. For the most part the mapping of resources to tasks could be a function of the "shell".

This service could assign a monotonic Flux program id (fpid?) to programs as they are executed. Since this number is only unique within the local Flux instance, job submissions should be assigned a uuid so they can be uniquely referenced in system wide logs or a global provenance database. Then each time the submission is executed or re-executed, it is associated with the fpid.

Programs would be submitted using the program specification introduced in RFC 8.
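
To make the fpid/uuid idea above a bit more concrete, here is a hypothetical sketch in Lua (the language the existing wreck prototype already uses for plugins) of what one program's entry might hold. None of the field names, key layout, or values are defined anywhere; they are assumptions for illustration only.

```lua
-- Hypothetical sketch only (not an actual Flux KVS schema): one way a
-- program entry under this service might be organized. Field names and
-- values are illustrative assumptions.
local program = {
    fpid    = 3,                                      -- monotonic id, unique within this instance
    uuid    = "c0ffee00-1234-4abc-9def-000000000001", -- stable across re-executions of the submission
    spec    = { command = { "hostname" }, ntasks = 4 }, -- RFC 8 style program specification
    state   = "running",                              -- e.g. submitted, running, complete
    ranks   = { 0, 1 },                               -- broker ranks executing tasks
    context = {},                                     -- opaque area left to the "shell" or application
}

-- A 'ps'-like tool would render one line per current program from such entries:
print (string.format ("fpid=%d uuid=%s state=%s", program.fpid, program.uuid, program.state))
```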

@lipari
Contributor

lipari commented Aug 19, 2015

This new tool seems similar to flux-exec. For posterity, would you mind contrasting this new tool with flux-exec? Would flux-exec eventually go away? What would this tool provide that we don't already (or will never) have with flux-exec?

@grondo
Contributor

grondo commented Aug 19, 2015

flux-exec uses the point-to-point cmb.exec protocol. This issue describes a distributed service built on top of the low-level cmb.exec service. flux-exec doesn't have any real support for launching parallel work, tracking distributed processes in groups, signalling these processes as groups, etc.

@garlick
Member Author

garlick commented Aug 19, 2015

I guess you could say this is analogous to the wrexec service, except implemented atop cmb.exec.

The three related issues that were entered here were the result of brainstorming a wreck redesign yesterday. The ideas (and descriptions) are a little rough still.

@grondo
Contributor

grondo commented Aug 19, 2015

This service could assign a monotonic Flux program id (fpid?) to programs as they are executed. Since this number is only unique within the local Flux instance, job _submissions_ should be assigned a uuid so they can be uniquely referenced in system wide logs or a global provenance database.

I feel like we're getting close to a nice abstraction for job/program delineation here. Should we (perhaps in another issue or RFC) define something like: a job submission/program specification creates a program description, which is saved with a uuid in the provenance database. Once a program has started, it is associated with an fpid, which is a container for a running instance of a program and has associated with it all the typical data for running programs (task placement, io, current state, exit statuses if exited, etc.). The fpid data would point to the program specification that started it, and the program specification could have a list of all fpids it started...

In this model the program specification is like an executable and the fpid is the task_struct data (extended because data for fpids is saved for provenance). At times new programs could be relaunched from the same program specification? (Maybe this is getting a bit too pie in the sky; you'd have to run with the same resource request, working directory, etc.)
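
A small sketch of that executable/task_struct analogy, using plain Lua tables; the field names and the launch helper below are invented for illustration, not a proposed schema.

```lua
-- Illustrative sketch only: one program description, stored once under a
-- uuid, fanning out to several fpids, each a container for one running
-- instance. All names here are assumptions, not a defined schema.
local description = {
    uuid  = "de1e7ab1-0000-4000-8000-000000000042",
    spec  = { command = { "./simulate" }, ntasks = 128 },
    fpids = {},                                  -- every instance launched from this description
}

local function launch (desc, fpid)
    local instance = {
        fpid      = fpid,
        uuid      = desc.uuid,                   -- back-pointer to the description that started it
        placement = {},                          -- task placement, io, current state, exit statuses, ...
        state     = "starting",
    }
    table.insert (desc.fpids, fpid)
    return instance
end

-- Re-launching from the same description yields a new fpid each time:
local run1 = launch (description, 1)
local run2 = launch (description, 2)
print (#description.fpids)                       --> 2
```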

@dongahn
Member

dongahn commented Aug 20, 2015

@grondo @garlick are you guys going to consider elasticity support as part of this? With Suraj joining the team and his project, I suspect we can have really good synergy and be able to knock this mechanism/requirement out.

@garlick
Member Author

garlick commented Aug 20, 2015

I think @grondo and I were focused mainly on incremental improvement of the existing static execution here, but if we can support elasticity at some level without sacrificing our near-term milestones then maybe! It would be good, at least for me, to develop more of an understanding of what that should look like so our design can accommodate it.

At our last meeting @surajpkn briefly touched on emerging "standard" runtime interfaces for this, e.g. via a PMI-like capability. It would be good to get an issue open referencing any descriptions of these interfaces so we can be thinking about how to provide them to programs?

@dongahn
Member

dongahn commented Aug 20, 2015

Great. I will turn to @surajpkn to provide some information. His dynamic scheduling runtime interface to Torque is also something to look at.

@surajpkn, @SteVwonder and I had a nice discussion today, and we now have a reasonably good idea about what aspects of dynamic scheduling under hierarchical scheduling we will work on. I've asked him to give us a short presentation about this next Monday so that we can give him support when he needs it.

My guess is his work can mostly be done by adding support to flux-sched under emulation, but later we will want to hook this into the actual flux-core runtime in accordance with RFC 8. So it's better to start discussing this early.

@surajpkn

The type of adaptive application and the scheduling method we want to use will have some impact on the job submission API. For example: if a job is predictably evolving, meaning that it knows when it needs extra nodes and when it will release some (like a workflow application), then we might want to include that information in the job submission. Similarly, malleable jobs may require a (min, max) range, and moldable jobs a list of alternatives like ((nodecount1, walltime1), (nodecount2, walltime2), ...). It would be nice if there is some "room", with some forethought, in the job submission API for easy addition of such information later, so that it doesn't delay the current milestones. Adding elasticity-related parameters to a job submission API that was designed purely for static use could turn out to be very ugly (and TORQUE is a perfect example; I can show that on Monday).

The standard API effort is trying to address the runtime API for expand/shrink requests and orders, and will kick off only in November at the BoF. So some initial thoughts on a Flux-local API for job submission and elasticity can actually influence the standard API. We need more clarity on our requirements (from motivating applications) and on how we will do adaptive scheduling (application level). In the meantime, I will try to get hold of PMIx so that we can have a look.
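
One way to picture the "room" being asked for, as a hedged sketch only: an optional, clearly delimited sub-table in the submission that a purely static scheduler can ignore. The "elastic" block and all of its field names are invented for this example and are not part of RFC 8.

```lua
-- Sketch of what "leaving room" in the submission format could look like;
-- the "elastic" sub-table and its field names are hypothetical.
-- A static submission would simply omit it.
local submission = {
    command = { "./app" },
    ntasks  = 64,
    -- optional block for adaptive jobs; a purely static scheduler ignores it
    elastic = {
        kind    = "malleable",                     -- or "evolving", "moldable"
        min     = 32,                              -- malleable: acceptable node range
        max     = 128,
        choices = { { 64, 3600 }, { 128, 1800 } }, -- moldable: (nodecount, walltime) alternatives
    },
}

print (submission.elastic and submission.elastic.kind or "static")
```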

@dongahn
Member

dongahn commented Aug 20, 2015

FYI -- @lipari recently had to deal with PMIx as part of his CORAL involvement as well.

@trws
Member

trws commented Aug 23, 2015

Notes on things I'm finding would be really useful features in our distributed execution system:

  • pre/post-run commands: this is partly so users can be explicit about their per-task setup and teardown code (context: a discussion with Dave Richards); semi-automated restart on node failure becomes much easier with this
  • "wrapper" commands or hooks for containment/binding setup
  • hooks for state detection, to support dead-daemon restart, failover, etc.; not directly, but as a hook to override the action on task/program exit
  • IO indirection support: this isn't a solid thing right now, but I have a sneaking suspicion we'll want to be able to make IO handling independent of the KVS at some point, so maybe implement to an abstract "open/append" interface that uses the KVS on the back end for the initial implementation? (See the sketch after this list.)
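
A rough sketch of that abstract "open/append" idea: program code sees only open() and append(), and the backend can be swapped. The backend below is an in-memory Lua table standing in for an initial KVS-backed implementation; nothing here is an existing Flux API, and the key naming is invented.

```lua
-- Abstract open/append interface with a pluggable backend (in-memory here,
-- KVS-backed in a real initial implementation). Illustrative only.
local iostore = { data = {} }

function iostore:open (key)
    self.data[key] = self.data[key] or {}
    return {
        append = function (_, chunk)
            table.insert (self.data[key], chunk) -- a KVS put/append would go here
        end,
    }
end

-- Per-task stdout stream, keyed the way a KVS-backed version might key it:
local stream = iostore:open ("program.3.task.0.stdout")
stream:append ("hello from task 0\n")
print (table.concat (iostore.data["program.3.task.0.stdout"]))
```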

@grondo
Contributor

grondo commented Aug 24, 2015

pre/post-run commands: this is partly so users can be explicit about their per-task setup and teardown code (context: a discussion with Dave Richards); semi-automated restart on node failure becomes much easier with this

Are you talking about the equivalent of a job prolog/epilog? I'm actually not following how this could impact per-task setup, so I need more detail, sorry. For "teardown code", shouldn't that be implemented more flexibly as dependent program(s)?

"wrapper" commands or hooks for containment/binding setup

I'd like to avoid wrappers as much as possible. Otherwise you have wrappers calling wrappers and the issues become intractable. As we've said, we plan on doing something like spank plugins.

IO indirection support: this isn't a solid thing right now, but I have a sneaking suspicion we'll want to be able to make IO handling independent of the KVS at some point.

cmb.exec doesn't use the KVS. I think we planned for the distributed program framework described in this issue to use reduction for IO by default, possibly with stdio optionally collected in the KVS.

@grondo
Contributor

grondo commented Aug 24, 2015

Hooks already exist in the wrexec prototype if you feel the need to do binding now (we can't do real containment until we have support for privileged operations). In fact, environment propagation uses these hooks, though I admit they aren't very well designed or implemented.

Right now the hooks take the form of a set of Lua plugins that can be dropped into the wreck/lua.d/ directory. The hooks are:

  • rexecd_init -- called as wrexecd starts up
  • rexecd_task_init -- called per task before exec
  • rexecd_task_exit -- called as each task exits

See lua.d/01-env.lua for an example.
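
For orientation, a minimal structural sketch of such a plugin, assuming the hooks are plain global functions discovered by wrexecd; the wreck-provided context objects that the real plugins receive are deliberately omitted here, so refer to the actual lua.d/01-env.lua for a working example.

```lua
-- Minimal structural sketch of a wreck/lua.d plugin (hook bodies omitted).
function rexecd_init ()
    -- runs once as wrexecd starts up, before any tasks are created
end

function rexecd_task_init ()
    -- runs once per task, just before exec; e.g. adjust that task's environment or binding
end

function rexecd_task_exit ()
    -- runs as each task exits; e.g. record exit status or clean up per-task state
end
```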

@grondo
Contributor

grondo commented Feb 25, 2020

Closing outdated "wreck replacement" issues.

@grondo grondo closed this as completed Feb 25, 2020