programmable sync points between jsc and rexec #249

Closed
dongahn opened this issue Jul 6, 2015 · 4 comments
@dongahn
Member

dongahn commented Jul 6, 2015

Moving from #206 to capture the use cases and discussions on fine-grained programmable sync points support between jsc and rexec:

  1. For the notify-control pattern, we talked about fine-grained, programmable sync point support within rexec. That is, users of jsc could attach some behavior (e.g., stop) to a programmable set of events for a particular job. We at least agreed that using these sync points for all jobs would incur unnecessarily high overhead, so if we do this, we want to do it in a fine-grained manner.

If this is combined with the new eventing scheme proposed in #206, the user will probably have to send a "state control mask" (the set of states at which the user wants rexec to synchronize) as part of a new job creation request. I talked further with @SteVwonder, and it seems premature to decide how the sync point ideas should materialize. Once we have more concrete use cases from runtime tools and the dynamic scheduling implementation, we will know more about what is needed.
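
To make the "state control mask" idea a bit more tangible, here is a minimal sketch of what such a request could look like. Everything in it is an assumption for illustration only: the `JobState` bit values, the `make_job_request` and `must_sync` helpers, and the request layout are hypothetical and are not part of jsc, rexec, or any existing Flux API.

```python
from enum import IntFlag


class JobState(IntFlag):
    """Hypothetical job states, one bit each (values are illustrative)."""
    STARTING = 1
    SYNC = 2
    RUNNING = 4
    ABORTING = 8  # assumed future state, listed only for illustration


def make_job_request(cmdline, sync_mask=JobState(0)):
    """Build a job creation request that carries a state control mask.

    An empty mask (the default) means the job never pauses at a sync
    point, so jobs that do not ask for synchronization pay no cost.
    """
    return {"cmdline": list(cmdline), "sync_mask": int(sync_mask)}


def must_sync(request, state):
    """Return True if the executor should pause when the job reaches `state`."""
    return bool(JobState(request["sync_mask"]) & state)


# A tool that wants the job held at 'starting' and at 'aborting'.
req = make_job_request(["./app"], JobState.STARTING | JobState.ABORTING)
```

With a request like this, the executor would only consult the mask of the job it is launching, which keeps the synchronization per job rather than global.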

@garlick
Member

garlick commented Jul 7, 2015

How is this like/unlike Slurm SPANK plugins?

@dongahn
Member Author

dongahn commented Jul 11, 2015

@garlick Thanks, I will take a look at it.

@dongahn
Member Author

dongahn commented Jul 11, 2015

Just to further the discussion a bit:

I took a look at the wreck services to understand where and how those states are emitted.

  • next-id is incremented by the job module.
  • starting is generated by wrexecd right before the wrexecd session fork/execs the target program processes.
  • after the fork/exec, either sync or running is generated: sync is debugger support, which spawns the target program processes and leaves them in a stopped state (using ptrace); running, of course, simply lets these processes loose.

The state set will be richer when we get to the real rexec service, so this seems like a good time to think a bit more concretely about what synchronization support we need to provide to the users of these states (e.g., jsc), and how. BTW, this was discussed a bit in #206: in general, I like the idea of making things "asynchronous" to the services and "synchronous" to the job as much as possible.

Some use cases:

  • pre-staging tool wanting to stage its program before the target program is spawned (e.g., an I/O forwarding layer?). CEA's IO-forwarding requirement might give us a concrete idea of what the sync point needs of their tool are.
  • debugging tools: to control the program execution from the beginning, the sync point can be a bit later than for the pre-staging tools. It can wait until the target program is fork/exec'ed and stopped, and the current sync support should be sufficient. However, if users want to apply those tools on other events such as aborting, we need sync support on that event: some mechanism to keep the aborting processes from exiting so that the debugger can be attached, or some mechanism to hook in programmable core file generation. This would be similar to CRAY ATP support. Further, we will need to determine the necessary sync operations for grow/shrink when we start supporting dynamic scheduling. The debugger will need to be sync'ed with the new set of processes that grow fork/execs and be notified of the set of processes that are killed by shrink.
  • performance tools: their sync needs will be similar to those of debugging tools. But they might not have a ptrace-like capability to sync with the job... (a bit vague at this point though.)
  • dynamic scheduling: the exact requirements will only be known once we lay the foundation later this summer.

Using the current wrexec implementation as a simple model for rexec, the above sync points translate to:

  • pre-staging tool: sync right after wrexecd emits starting. You want to make sure wrexecd won't go ahead and fork/exec the program. Whether this is a good sync point really depends on the concrete use cases.
  • debugging tools: sync is already provided between starting and running with ptrace, so the initial case is covered. The abort state isn't available yet, so we will need to defer that discussion. How wrexecd will be sync'ed with the target program would be interesting in and of itself. (If MPI, we can have a hook in MPI_Abort to create a sync point with wrexecd.) Sync needs for grow and shrink are also something we need to discuss when dynamic scheduling comes.
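
The pre-staging case above can be sketched as a toy model, where the executor holds right after emitting starting until the tool releases it. This is only an illustration under stated assumptions: `SyncPoint`, `launch`, and `demo` are hypothetical names, threads stand in for wrexecd and the tool, and the "staged" entry is just a marker for the tool's work, not a real wreck state.

```python
import threading


class SyncPoint:
    """A hold the executor parks on until a tool releases it (hypothetical)."""

    def __init__(self):
        self._arrived = threading.Event()
        self._released = threading.Event()

    def hold(self):
        """Called by the executor: announce arrival, then block until released."""
        self._arrived.set()
        self._released.wait()

    def wait_until_held(self, timeout=None):
        """Called by the tool: block until the executor has reached the hold."""
        return self._arrived.wait(timeout)

    def release(self):
        """Called by the tool: let the executor continue."""
        self._released.set()


def launch(events, hold_at_starting=None):
    """Toy stand-in for the executor: emit starting, optionally hold, then run."""
    events.append("starting")
    if hold_at_starting is not None:
        hold_at_starting.hold()
    # A real executor would fork/exec the target program processes here.
    events.append("running")


def demo():
    """Run one job with a pre-staging tool synchronized at 'starting'."""
    events = []
    sync_point = SyncPoint()
    worker = threading.Thread(target=launch, args=(events, sync_point))
    worker.start()
    sync_point.wait_until_held(timeout=5)
    events.append("staged")  # the tool stages its program while the job is held
    sync_point.release()
    worker.join(timeout=5)
    return events
```

Calling `launch(events)` without a sync point runs straight through, which matches the idea that only jobs requesting a sync point should pay for it.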

grondo added a commit to grondo/flux-core that referenced this issue Feb 5, 2019
The wreck exec system is worthless, remove it along with associated
commands, tests, and support code.

Since libjsc doesn't work without wreck, it is removed as well.

Fixes flux-framework#1984
Closes flux-framework#1947
Closes flux-framework#1618
Closes flux-framework#1595
Closes flux-framework#1593
Closes flux-framework#1468
Closes flux-framework#1438
Closes flux-framework#1419
Closes flux-framework#1410
Closes flux-framework#915
Closes flux-framework#894
Closes flux-framework#866
Closes flux-framework#833
Closes flux-framework#774
Closes flux-framework#772
Closes flux-framework#335
Closes flux-framework#249
grondo added a commit to grondo/flux-core that referenced this issue Feb 5, 2019
The wreck exec system is worthless, remove it along with associated
commands, tests, and support code.

Since libjsc doesn't work without wreck, it is removed as well.

Fixes flux-framework#1984

Closes flux-framework#1947
Closes flux-framework#1618
Closes flux-framework#1595
Closes flux-framework#1593
Closes flux-framework#1534
Closes flux-framework#1468
Closes flux-framework#1443
Closes flux-framework#1438
Closes flux-framework#1419
Closes flux-framework#1410
Closes flux-framework#1407
Closes flux-framework#1393
Closes flux-framework#915
Closes flux-framework#894
Closes flux-framework#866
Closes flux-framework#833
Closes flux-framework#774
Closes flux-framework#772
Closes flux-framework#335
Closes flux-framework#249
grondo added a commit to grondo/flux-core that referenced this issue Feb 9, 2019
The wreck exec system is worthless, remove it along with associated
commands, tests, and support code.

Since libjsc doesn't work without wreck, it is removed as well.

Fixes flux-framework#1984

Closes flux-framework#1947
Closes flux-framework#1618
Closes flux-framework#1595
Closes flux-framework#1593
Closes flux-framework#1534
Closes flux-framework#1468
Closes flux-framework#1443
Closes flux-framework#1438
Closes flux-framework#1419
Closes flux-framework#1410
Closes flux-framework#1407
Closes flux-framework#1393
Closes flux-framework#915
Closes flux-framework#894
Closes flux-framework#866
Closes flux-framework#833
Closes flux-framework#774
Closes flux-framework#772
Closes flux-framework#335
Closes flux-framework#249
@grondo
Contributor

grondo commented Feb 13, 2019

closed by #1988

@grondo grondo closed this as completed Feb 13, 2019
chu11 pushed a commit to chu11/flux-core that referenced this issue Feb 13, 2019
The wreck exec system is worthless, remove it along with associated
commands, tests, and support code.

Since libjsc doesn't work without wreck, it is removed as well.

Fixes flux-framework#1984
Closes flux-framework#1947
Closes flux-framework#1618
Closes flux-framework#1595
Closes flux-framework#1593
Closes flux-framework#1468
Closes flux-framework#1438
Closes flux-framework#1419
Closes flux-framework#1410
Closes flux-framework#915
Closes flux-framework#894
Closes flux-framework#866
Closes flux-framework#833
Closes flux-framework#774
Closes flux-framework#772
Closes flux-framework#335
Closes flux-framework#249