simulator: design new architecture for new job exec system #1566
In a recent meeting, having a good scheduling simulator was identified as extremely valuable for a large HPC site to be able to test the effectiveness of scheduling, and I believe your emulator has good potential to serve that role in the future. If we only move the emulator into flux-core, leaving the rest of flux-sched behind, I worry the emulator won't serve that role as well. At the same time, at this point I'm a bit reluctant to create more framework projects because we have already suffered from having flux-sched and -core in separate repos (unless there really is a compelling reason). I would like to understand the use cases in core a bit better. Maybe some components of the emulator can go in there to support core functionality while other bits are kept in sched so that we can grow it to serve that purpose.

As we discussed, @SteVwonder, we also need to refactor the emulator code. I think it makes sense keeping
@SteVwonder, I think the idea we were proposing was to utilize the engineering and development of the new exec system to "build in" emulation support from the beginning. This work could be valuable from the outset, as it would let us test the job ingest, management, and scheduling interfaces without actually executing jobs. The idea would be to build an "execution emulator" module that would simulate the execution of jobs instead of actually running them, which would then allow emulating greater resource sets than a test instance actually has available, yadda, yadda, yadda -- I'm sure you get the idea.

Since we are completely redeveloping the execution system, including the KVS layout for jobs, etc., there may be major effort on the simulator front anyway. We should definitely chat about what we need to do here for the best design of the simulator. I'm sure I don't understand all the intricacies yet.
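To make that slightly more concrete, here is a rough, self-contained sketch (no real flux-core APIs; every name here is invented) of the kind of bookkeeping an execution emulator might do: admit jobs against a fake resource count that can be much larger than the test instance, and "complete" them at simulated times without ever forking a process.

```python
import heapq

class ExecEmulator:
    """Toy execution emulator: tracks allocations against a fake
    resource pool and completes jobs at simulated times. Purely
    illustrative; not the real exec system interface."""

    def __init__(self, fake_total_cores):
        self.free_cores = fake_total_cores   # may exceed real hardware
        self.completions = []                # heap of (finish_time, jobid, cores)

    def start(self, jobid, cores, runtime, now):
        """'Run' a job: no fork/exec, just record a future completion."""
        if cores > self.free_cores:
            raise RuntimeError("insufficient emulated resources")
        self.free_cores -= cores
        heapq.heappush(self.completions, (now + runtime, jobid, cores))

    def next_event_time(self):
        return self.completions[0][0] if self.completions else None

    def advance_to(self, now):
        """Complete every job whose simulated finish time has passed."""
        done = []
        while self.completions and self.completions[0][0] <= now:
            _, jobid, cores = heapq.heappop(self.completions)
            self.free_cores += cores
            done.append(jobid)
        return done

# Emulate 4096 cores on a laptop-sized test instance:
emu = ExecEmulator(fake_total_cores=4096)
emu.start("job1", cores=1024, runtime=100.0, now=0.0)
print(emu.advance_to(150.0))   # -> ['job1']
```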
Yes, having ways to test our systems at large scale while using only a small resource set is an excellent idea.
@dongahn, I guess I was thinking of only moving the

The motivation here being that if we move over just the
@grondo, I think the part I'm confused about is that the simulator (as a whole) already supports everything you mentioned. So if I'm interpreting what you are saying correctly, the value proposition here is not adding new features, but instead baking the features into the real exec system (rather than relying on the
For scalable tools testing (e.g., STAT), building some testing logic into the real system helped a lot. With such support, we could launch one back-end daemon per core instead of one per node, which then allowed us to pin down some nasty scalability bugs and bottlenecks using a small fraction of the resources.
@SteVwonder, yes it could just be a misunderstanding of what the existing simulator does and how it does it. We kind of assumed there would be enough breakage of the existing simulator that it might be easier to just develop a new execution simulator module alongside the new execution system, but if the existing code can be reused in part or whole then that makes a lot of sense too.
No. I think you are on the right track. Replacing

Sorry for the misunderstanding/confusion there. I think at some point we should have a chat in front of a whiteboard to diagram this out a bit, especially if the simulator is going to be split across flux-core & flux-sched. (I may be rushing to a conclusion there, though; maybe there is a way to keep everything together.)
Also, the current emulator is designed for testing the effectiveness of scheduling policies, etc. For scalability/performance testing, I think the design point would be different. For that one, I believe it makes sense to have it in flux-core.
Great idea! Good first target for discussion in our new collaboration area.
So @grondo and @garlick brought in the rolling whiteboard to 451, and we had a good brainstorming session. A summary of a few of the ideas that came out of the discussion:
Just wanted to get this thought down before continuing this discussion face to face. Apologies if I've gone off the rails and need another course correction!

For the new exec system, we proposed a "state log" in RFC 14. Each job would have a log in the KVS consisting of timestamps and job states. I was going to propose an API function to go with this that would internally use KVS watch (a streamlined version) to allow one to wait until a job enters a state that matches a regex. So for example, one could call something like:

```c
flux_future_t *f;
f = flux_job_wait (f, jobid, "(started|finished)");
```

and the future would be fulfilled once jobid enters either the started or finished states. The fulfilled future would contain the full state log with timestamps.

We had discussed implementing the simulator as a python script running as the initial program of an instance. If we have batches of jobs that have to be submitted at specific wallclock intervals, it should be possible for that script to call

If in addition to that we provided
then it seems like the front end script would have all it needs to drive the simulated instance?

The script would be driven by a queue of time steps. Some time steps would be created initially based on the wallclock intervals at which jobs are to be submitted. As jobs are submitted and the scheduler runs them, the front end script takes in the start time and simulated run time for each job and calculates new time steps to add to its queue. Possibly time steps that are very near each other are combined. Upon the scheduler signaling that it is blocked, the simulator steps time to the next discrete value, which causes jobs to complete, may trigger new jobs being submitted, etc. Repeat until the workload is complete. Ingest the normal job log (TBD) to create simulator results.
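For discussion, here is a bare-bones sketch of that front end loop. Everything it calls (submit, finish_due_jobs, running_job_end_times, wait_for_sched_blocked) is a hypothetical stand-in for whatever real interfaces we end up with; this is just to pin down the shape of the driver, not a real implementation.

```python
import heapq

def drive(workload, submit, finish_due_jobs, running_job_end_times,
          wait_for_sched_blocked):
    """Toy discrete-event driver for the front end script. `workload`
    is a list of (submit_time, jobspec) pairs; the four callables are
    hypothetical hooks into the simulated instance."""
    steps = sorted({t for t, _ in workload})          # initial time steps
    pending = sorted(workload, key=lambda x: x[0])    # not yet submitted
    now = 0.0

    while steps:
        now = heapq.heappop(steps)               # advance to next discrete time
        finish_due_jobs(now)                     # jobs with end time <= now complete
        while pending and pending[0][0] <= now:  # submit jobs due at this time
            submit(pending.pop(0)[1])
        wait_for_sched_blocked()                 # scheduler has reacted to the above
        for t_end in running_job_end_times():    # new starts create new time steps
            if t_end > now and t_end not in steps:
                heapq.heappush(steps, t_end)
    return now
```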
Yeah. I think that sounds about right. The only thing that I could use some more clarification on is the exec system hook and faking execution. If the front end script is the one taking control of the heartbeat and stepping instance time, it needs to communicate with the exec system to know when the next job will complete.

Side-note: at first I thought the scheduler event signaling its idleness/blocking would just be noise in the non-sim case, but now I'm wondering how it could be leveraged in a production use-case. Maybe high-rate submission tools like capacitor could use it to dynamically throttle their submissions (e.g., dump a batch of jobs on the system and then wait for the scheduler to signal that it's idle before dumping more jobs).
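Roughly something like this toy loop, where wait_for_event("sched.idle") is a made-up stand-in for whatever idle/blocked notification the scheduler ends up publishing:

```python
def throttled_submit(jobspecs, submit, wait_for_event, batch_size=1000):
    """Hypothetical capacitor-style throttling: dump a batch of jobs,
    then wait for the scheduler to report it is idle before dumping
    the next batch. Event name and callables are all invented."""
    for i in range(0, len(jobspecs), batch_size):
        for jobspec in jobspecs[i:i + batch_size]:
            submit(jobspec)
        wait_for_event("sched.idle")   # made-up event name
```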
Following up on offline conversation:

We have a possible future need to control simulated job behavior based on evolving simulation state, e.g. for simulating congestion or random failures. In the above scheme, a job runs until "instance time" >= start time + pre-set runtime. An alternative might be to have the exec system leave the job running until the front end script tells it to change state. That would allow the front end script to control how and when the simulated job terminates, or perhaps provide other directives to the simulated job.

Another good point mentioned was that the single-job

Finally, another point is that "scheduler idle" is insufficient information to advance to the next time step. The simulator also needs to know that the scheduler has run its scheduling loop after all the simulator-generated state changes. For example, idle is only meaningful if we know the scheduler has considered all of the jobs that were just submitted, or that just terminated. I think we said something about having a job state that indicates that the scheduler is aware of the job (pending?)
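To illustrate the alternative scheme, the front end's per-job directive could be as simple as the following. The payload fields and the RPC topic in the comment are entirely made up; this is only meant to show the shape of the idea.

```python
import json

# Hypothetical directive the front end script could send to the
# (simulated) exec system to end or perturb a fake job.
directive = {
    "jobid": 1234,
    "event": "finish",        # or "exception", "signal", ...
    "status": 0,              # simulated exit status
    "note": "congestion model: slowed job by 10%",
}
payload = json.dumps(directive)
# e.g. delivered via an RPC like rpc(h, "sim-exec.directive", payload)
```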
I'm not following. What is the exact state of the scheduler you need to know?

BTW, in addition to the emulation machinery, don't you need a way to

What is the plan for these in the FIFO scheduler? flux-sched has 1. But if 2 is added into the new execution system, more testing can be done, I think.
The state we need to know is that the scheduler cannot schedule new jobs, and there are no events that would change that "in the pipeline". It's a bit of a synchronization problem involving the overall state of the system, not just the scheduler.
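In pseudo-Python, the condition the simulator needs before stepping time looks something like the sketch below. The state names are placeholders, just to make the "nothing in the pipeline" idea concrete.

```python
# States in which the scheduler has definitely seen a job and acted on
# it, or the job is no longer the scheduler's concern (placeholders).
SCHED_AWARE = {"pending", "running", "complete"}

def safe_to_step(sched_idle, job_states):
    """True only if the scheduler reports it is blocked AND every job
    the simulator has touched is in a scheduler-acknowledged state,
    i.e. no simulator-generated state changes are still in flight."""
    return sched_idle and all(s in SCHED_AWARE for s in job_states.values())
```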
2 is the simulator support we're proposing to build into the new exec system. 1 is a general problem for Flux, right? We have R_lite, but we need something more expressive that will allow resources to be passed to sub-instances. Presumably that covers this use case too? We were hoping to make the scheduler unaware that it is in a simulation. (@grondo says "like us", ha ha).
Ah, makes sense. It's a bit of a synchronization problem, but if the simulator has overall control of job and resource event emission, this would be more tractable.
Maybe we should break this problem down into sub-problems.

B. Populating the resource model with

C. How does the exec system deal with the case where the scheduler has much larger resources? If we can make it so that we are required to launch a larger number of brokers per node to match the scheduler's resources, this should still be useful for testing scalability.
This will be the best! Now I do not know whether I was then a man dreaming I was a butterfly, or whether I am now a butterfly, dreaming I am a man.
Our thought was that the exec system should first be able to launch fake jobs with arbitrary resource assignment without needing any extra broker ranks (e.g. it would not launch any processes or job shells but instead simply wait until the simulator says the job should terminate, and then update state as though procs and shells had terminated).

We talked a little bit this morning about using the simulator to exercise some of the scalability of the system, but decided that might be unnecessarily conflating the two goals. It may be easier to test scalability without the additional synchronization required by the simulator.
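In other words, per job, something shaped roughly like this (states and method names are invented for illustration, not a proposed interface):

```python
class FakeJobExec:
    """Fake execution of a single job: no job shell, no processes.
    The job simply sits in 'running' until the simulator directs it
    to terminate, at which point state is updated as though the
    shells/procs had actually exited."""

    def __init__(self, jobid, R):
        self.jobid = jobid
        self.R = R                 # assigned (possibly fake) resources
        self.state = "allocated"

    def start(self):
        self.state = "running"     # a real exec system would fork job shells here

    def terminate(self, status=0):
        assert self.state == "running"
        self.status = status       # pretend the procs exited with this status
        self.state = "complete"    # release R, log the state transition, etc.
```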
OK. Then the level of emulation is pretty similar to the emulator we built in flux-sched. The difference would be directly building something into the exec system, with better separation of concerns, so that the emulator can be layered across multiple components.
OK. This works for me. But ultimately it would be really nice to bake something in so that we can run many brokers per node and test scalability that way. |
Exactly!
We can do that now - were you thinking about baking in something about how the resources are carved up? Because right now although we can run multiple brokers per node, they each think they have all the node's resources, and so resources are oversubscribed... |
I was thinking of doing this with fake hwloc XML files, each fed to a broker, a capability that exists already. But then when the exec system actually executes the target programs, this can quickly lead to "resource limitations" and other issues because we will heavily overcommit brokers onto each node. So a thought would be to make it such that the exec engine doesn't actually execute the program (no fork/exec) while still being able to satisfy the requirements of other systems (e.g., fake data in the KVS job schema). In the case of STAT, we baked fake stack traces into our tool daemons. This could be emulation level 2.
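For example, the kind of fake per-job record a non-forking exec engine might still write so that other services see a plausible job. Every key and value below is invented; the real job schema is TBD.

```python
# Hypothetical "level 2" emulation data for one job in the KVS.
fake_job_entry = {
    "state": "complete",
    "R": {"rank0": {"cores": "0-35"}},   # fake resource assignment
    "exec": {
        "pids": [10001, 10002],          # processes that never actually existed
        "exit_status": 0,
        "starttime": 100.0,              # simulated instance time
        "endtime": 400.0,
    },
}
```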
Some other requirements that were referenced in now-closed flux-sched issues:
I spoke with @chu11 before the break, and he had some great ideas inspired by his work on the infiniband simulator. In particular, it would be great if users could provide a list of "breakpoints": times at which the simulator should stop and run the command specified by the user. This could be used to provide second-class simulator support for job cancellations and nodes going up/down/drained. It could also be used by flux devs and sysadmins to script the deterministic reproduction of hard-to-hit bugs (e.g., to reproduce the bug, you must mark a node as drained on a saturated system and then reload the scheduler). Finally, it could be used to script the testing of edge cases in scheduling policies (e.g., saturate the system, hit the user breakpoint, run their script which calculates the largest job that will still get backfilled, resume the simulation, and verify that the job was in fact backfilled). Bonus points if the simulator can stop at a specified "breakpoint" and drop the user into an interactive shell, so they can poke around at the system state.
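A quick sketch of how the front end loop might honor such breakpoints. The breakpoint format and the step_to hook are invented, and the commands are just placeholder user scripts.

```python
import subprocess

def step_with_breakpoints(next_step, step_to, breakpoints):
    """Advance simulated time to `next_step`, pausing at any user
    breakpoint that falls at or before it to run the user's command.
    `breakpoints` is a sorted list of (sim_time, command) pairs and is
    consumed as breakpoints are hit; `step_to` stands in for the
    existing time-stepping logic."""
    while breakpoints and breakpoints[0][0] <= next_step:
        t, cmd = breakpoints.pop(0)
        step_to(t)                          # stop the simulation here
        subprocess.run(cmd, shell=True)     # e.g. drain a node, cancel a job
    step_to(next_step)

# Example breakpoint list (commands are illustrative user scripts):
breakpoints = [
    (3600.0, "./drain_node.sh node42"),
    (7200.0, "./measure_backfill.sh"),
]
```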
Closing this one as it seems the simulator is going to be on hold for a while. |
When @garlick, @grondo, and I spoke on Monday, the simulator came up. It was mentioned that having access to the simulator from flux-core would be nice for testing the FIFO scheduler as well as other components in core. As part of that, what do others think the right move is? Moving the main simulator module into flux-core? I guess the alternative is making a separate project under flux-framework, but that seems like overkill to me.

This would also be a good chance to iron out some of the simulator interfaces (and to finally decide whether we are calling it an emulator or a simulator).