
simulator: design new architecture for new job exec system #1566

Closed
SteVwonder opened this issue Jul 5, 2018 · 24 comments
Labels: design (don't expect this to ever be closed...)

@SteVwonder (Member) commented Jul 5, 2018

When @garlick, @grondo, and I spoke on Monday, the simulator came up. It was mentioned that having access to the simulator from flux-core would be nice for testing the FIFO scheduler as well as other components in core. Given that, what do others think the right move is? Moving the main simulator module into flux-core? I guess the alternative is making a separate project under flux-framework, but that seems like overkill to me.

This would also be a good chance to iron out some of the simulator interfaces (and finally decide on if we are calling it an emulator or a simulator).

@dongahn (Member) commented Jul 5, 2018

In a recent meeting, a good scheduling simulator was identified as extremely valuable for large HPC sites to test the effectiveness of scheduling policies, and I believe your emulator has good potential to serve that role in the future.

If we move only the emulator into flux-core, leaving the rest of flux-sched behind, I worry the emulator won't serve that role as well. At the same time, I'm reluctant at this point to create more framework projects, because we have already suffered from having flux-sched and -core in separate repos (unless there really is a compelling reason).

I would like to understand the use cases in core a bit better. Maybe some components of the emulator can go in there to support core functionality while other bits are kept in sched, so that we can grow yours to serve that purpose.

As we discussed, @SteVwonder, we also need to refactor the emulator code. I think it makes sense to keep sched.c lean and mean, not including emulator bits (or including only a minimal set), even if this means the emulator will have to include sched.c.

@grondo (Contributor) commented Jul 5, 2018

@SteVwonder, I think the idea we were proposing was to utilize the engineering and development of the new exec system to "build in" emulation support from the beginning. This work could be valuable from the outset as it would let us test the job ingest, management, and scheduling interfaces without actually executing jobs.

The idea would be to build an "execution emulator" module that would simulate the execution of jobs instead of actually running them, which would then allow emulating greater resource sets than a test instance actually has available, yadda, yadda, yadda -- I'm sure you get the idea.

Since we are completely redeveloping the execution system including kvs layout for jobs, etc, there may be major effort on the simulator front anyway.

We should definitely chat about what we need to do here for the best design of the simulator. I'm sure I don't understand all the intricacies yet.

@dongahn (Member) commented Jul 5, 2018

Yes, having ways to test our systems for large scale while using only a small resource set is an excellent idea.

@SteVwonder (Member Author) commented Jul 5, 2018

@dongahn, I guess I was thinking of moving only the simulator module (which includes the major interfaces necessary for running a simulation) while leaving in flux-sched all of the other functionality (e.g., the submit module) that makes a full end-to-end simulation of a job trace possible. I was also thinking that the simulator module would be installed with flux-core's make install, enabling components of flux-sched (e.g., sched.c) to compile against the simulator.

The motivation here being that if we move over just the simulator module (not the whole thing), and the new exec system supports simulation, then we can add some simulation to flux-core's make check.

@SteVwonder (Member Author) commented:

@grondo, I think the part I'm confused about is that the simulator (as a whole) already supports everything you mentioned. So if I'm interpreting you correctly, the value proposition here is not adding new features but baking the existing features into the real exec system (rather than relying on the simexec module)?

@dongahn (Member) commented Jul 5, 2018

For scalable tools testing (e.g., STAT), building some testing logic into the real system helped a lot. With such support, we could launch as many back-end daemons per node as there are cores, instead of one per node, which allowed us to pin down some nasty scalability bugs and bottlenecks using a small fraction of the resources.

@grondo (Contributor) commented Jul 5, 2018

@SteVwonder, yes it could just be a misunderstanding of what the existing simulator does and how it does it. We kind of assumed there would be enough breakage of the existing simulator that it might be easier to just develop a new execution simulator module alongside the new execution system, but if the existing code can be reused in part or whole then that makes a lot of sense too.

@SteVwonder (Member Author) commented:

No. I think you are on the right track. Replacing simexec with the new exec system seems like the way to go. Much better to test actual interfaces than to test a clone of them.

Sorry for the misunderstanding/confusion there. I think at some point we should have a chat in front of a whiteboard to diagram this out a bit, especially if the simulator is going to be split across flux-core & flux-sched. (I may be rushing to a conclusion there though, maybe there is a way to keep everything together).

@dongahn (Member) commented Jul 6, 2018

Also the current emulator is designed for testing the effectiveness of scheduling policies etc. For scalability/performance testing, I think the design point would be different. For that one, I believe it makes sense to have it in flux-core.

@grondo (Contributor) commented Jul 6, 2018

I think at some point we should have a chat in front of a whiteboard to diagram this out a bit,

Great idea! Good first target for discussion in our new collaboration area.

@SteVwonder (Member Author) commented:

So @grondo and @garlick brought in the rolling whiteboard to 451, and we had a good brainstorming session. A summary of a few of the ideas that came out of the discussion:

  • The sim module itself can be removed. In its place, each module participating in the simulation can enter a barrier. Once all modules have hit the barrier, they exchange events/times, and whoever has the next occurring event proceeds.
  • The 'sched' module can be "sim-agnostic" as long as one other module can determine when the sched module has gone idle (and block on its behalf). For example, the submit module could submit N jobs and then wait until all N have reached the pending state before entering the barrier. This would ensure that sched has completely ingested all of the jobs (although we need to double-check that it would also ensure sched had completed its schedule_jobs loop). There is the potential that we need to add some sort of sched.idle notification/event to make this work. The main benefit here is the removal of all simulator logic from the scheduler, enabling any Flux scheduler to be tested with the simulator.
  • The submit module does not need to be a module. It could be an initial program instead. Python was suggested here as the language of choice to enable easier parsing of the job traces.
  • The functionality of exec could be merged into submit (or the submit initial program equivalent), but this would require thought as to how to support extension by researchers that want to do more advanced simulations (e.g. contention modeling, node failures). One suggestion was to provide configurable hooks/callbacks or use class inheritance/method overriding (if in an OOP language).
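To make the barrier idea above concrete, here is a minimal sketch in plain Python with no Flux dependency. The `next_event_time`/`advance` interface is purely hypothetical, standing in for whatever protocol the participating modules would actually use to exchange events/times at the barrier:

```python
def run_barrier_simulation(modules):
    """Drive a set of simulation participants in lockstep.

    Each 'module' is an object with next_event_time() and advance(now).
    At each synchronization point (the "barrier"), every module reports
    the time of its next pending event; the module holding the earliest
    event proceeds, and simulated time jumps directly to that event.
    Returns the final simulated time.
    """
    now = 0.0
    while True:
        # All modules have "entered the barrier": collect next-event times.
        pending = [(m.next_event_time(), m) for m in modules]
        pending = [(t, m) for t, m in pending if t is not None]
        if not pending:
            break  # no module has work left; the simulation is done
        t, m = min(pending, key=lambda p: p[0])
        now = t         # jump simulated time forward to the earliest event
        m.advance(now)  # only the owner of the earliest event proceeds
    return now
```

Because every module blocks at the barrier between events, no wall-clock timing matters; only the relative ordering of the exchanged event times does.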

@garlick (Member) commented Jul 9, 2018

Just wanted to get this thought down before continuing this discussion face to face. Apologies if I've gone off the rails and need another course correction!

For the new exec system, we proposed a "state log" in RFC 14. Each job would have a log in the KVS consisting of timestamps and job states. I was going to propose an API function to go with this that would internally use KVS watch (a streamlined version) to allow one to wait until a job enters a state that matches a regex. So for example, one could call something like

```c
flux_t *h;           /* open broker handle (e.g., from flux_open()) */
flux_future_t *f;
f = flux_job_wait (h, jobid, "(started|finished)");
```

and the future would be fulfilled once jobid enters either the started or finished states. The fulfilled future would contain the full state log with timestamps.

We had discussed implementing the simulator as a python script running as the initial program of an instance. If we have batches of jobs that have to be submitted at specific wallclock intervals, it should be possible for that script to call flux_job_wait() on all the jobs it submits and then be informed of each job's start time and end time.

If in addition to that we provided

  • A heartbeat-driven "instance" time
  • A way for front end script to take control of heartbeat and use it to step instance time in discrete jumps
  • Exec system hook to allow execution to be faked. Job submitted with runtime=5m blocks until instance time >= start time + 5m
  • Way for scheduler to signal when it is blocked (cannot schedule more resources)

then it seems like the front end script would have all it needs to drive the simulated instance?

The script would be driven by a queue of time steps. Some time steps would be created initially based on the wallclock intervals at which jobs are to be submitted. As jobs are submitted and the scheduler runs them, the front end script takes in the start time and simulated run time for each job and calculates new time steps to add to its queue. Time steps that are very near each other could possibly be combined. Upon the scheduler signaling that it is blocked, the simulator steps time to the next discrete value, which causes jobs to complete, may trigger new jobs being submitted, etc. Repeat until the workload is complete. Ingest the normal job log (TBD) to create the simulator results.
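The time-step queue described above can be sketched as a small discrete-event loop. Everything here is illustrative, not actual Flux code: the `run_job` callback is a hypothetical stand-in for submitting a job and learning its start time and simulated runtime from the scheduler.

```python
import heapq

def drive_simulation(submit_schedule, run_job, merge_eps=0.0):
    """Sketch of the front end driver loop described above.

    submit_schedule: list of (submit_time, job) pairs; the submit times
        seed the initial time steps.
    run_job(job, now): hypothetical callback standing in for submitting
        the job and being told its (start_time, runtime), from which a
        new completion-time step is derived.
    merge_eps: time steps within this distance of each other are combined.
    Returns the ordered list of time steps actually taken.
    """
    queue = [t for t, _ in submit_schedule]
    heapq.heapify(queue)
    jobs = sorted(submit_schedule)
    steps = []
    while queue:
        now = heapq.heappop(queue)
        # Combine time steps that are very near each other.
        while queue and queue[0] - now <= merge_eps:
            heapq.heappop(queue)
        steps.append(now)
        # Submit jobs due at this step; the scheduler reports start time
        # and runtime, which yields a new completion-time step.
        while jobs and jobs[0][0] <= now:
            _, job = jobs.pop(0)
            start, runtime = run_job(job, now)
            heapq.heappush(queue, start + runtime)
    return steps
```

In the real design, popping a time step would only happen after the "scheduler is blocked" signal, which this sketch glosses over.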

@SteVwonder (Member Author) commented:

Yeah. I think that sounds about right. The only thing that I could use some more clarification on is the exec system hook and faking execution. If the front end script is the one taking control of the heartbeat and stepping instance time, it needs to communicate with the exec system to know when the next job will complete.

Side note: at first I thought the scheduler event signalling its idleness/blocking would just be noise in the non-sim case, but now I'm wondering how it could be leveraged in production. Maybe high-rate submission tools like capacitor could use it to dynamically throttle their submissions (e.g., dump a batch of jobs on the system and then wait for the scheduler to signal it's idle before dumping more).

@garlick (Member) commented Jul 9, 2018

Following up on offline conversation:

We have a possible future need to control simulated job behavior based on evolving simulation state, e.g. for simulating congestion or random failures. In the above scheme, a job runs until "instance time" >= start time + pre-set runtime. An alternative might be to have the exec system leave the job running until the front end script tells it to change state. That would allow the front end script to control how and when the simulated job terminates, or perhaps provide other directives to the simulated job.

Another good point mentioned was that the single-job flux_job_wait() won't scale for millions of jobs so we'll probably need an alternate bulk job monitoring scheme.

Finally, another point is that "scheduler idle" is insufficient information to advance to the next time step. The simulator also needs to know that the scheduler has run its scheduling loop after all the simulator-generated state changes. For example, idle is only meaningful if we know the scheduler has considered all of the jobs that were just submitted, or that just terminated. I think we said something about having a job state that indicates that the scheduler is aware of the job (pending?)

@dongahn (Member) commented Jul 9, 2018

Finally, another point is that "scheduler idle" is insufficient information to advance to the next time step. The simulator also needs to know that the scheduler has run its scheduling loop after all the simulator-generated state changes. For example, idle is only meaningful if we know the scheduler has considered all of the jobs that were just submitted, or that just terminated. I think we said something about having a job state that indicates that the scheduler is aware of the job (pending?)

I'm not following. What is the exact state of the scheduler you need to know?

BTW, in addition to the emulation machinery, don't you need a way to

  1. feed the scheduler large-scale resource info to emulate large-scale systems?
  2. fake-execute tasks without actually executing them?

What is the plan for these in the FIFO scheduler?

flux-sched has 1. But if 2 is added to the new execution system, more testing can be done, I think.
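For item 1, a driver could synthesize a large fake resource description to feed the scheduler. The JSON-like shape below is purely hypothetical; the real input would be hwloc XML or whatever R format Flux settles on:

```python
def fake_resources(nnodes, cores_per_node):
    """Generate a fake large-scale resource description.

    The schema here (a flat list of node objects) is hypothetical and
    only meant to show that emulating a big machine is cheap: nothing
    about the description requires the hardware to exist.
    """
    return {
        "nodes": [
            {"name": "node%d" % i, "cores": cores_per_node}
            for i in range(nnodes)
        ]
    }

# A 1000-node, 36-core-per-node "machine" built on a laptop:
big = fake_resources(1000, 36)
```

The point is only that the resource feed and the fake execution (item 2) are independent: the scheduler can be handed an arbitrarily large description as long as the exec system never tries to run real processes on it.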

@garlick (Member) commented Jul 9, 2018

The state we need to know is that the scheduler cannot schedule new jobs, and there are no events that would change that "in the pipeline". It's a bit of a synchronization problem involving the overall state of the system, not just the scheduler.

BTW, in addition to the emulation machinery, don't you need a way to

  1. feed the scheduler large-scale resource info to emulate large-scale systems?
  2. fake-execute tasks without actually executing them?

2 is the simulator support we're proposing to build into the new exec system.

1 is a general problem for Flux, right? We have R_lite, but we need something more expressive that will allow resources to be passed to sub-instances. Presumably that covers this use case too?

We were hoping to make the scheduler unaware that it is in a simulation. (@grondo says "like us", ha ha).

@dongahn (Member) commented Jul 9, 2018

The state we need to know is that the scheduler cannot schedule new jobs, and there are no events that would change that "in the pipeline". It's a bit of a synchronization problem involving the overall state of the system, not just the scheduler.

Ah, makes sense. It's a bit of a synchronization problem, but if the simulator has overall control of job and resource event emission, this should be more tractable.

1 is a general problem for Flux right? We have R_lite but we need something more expressive that will allow resources to be passed to sub-instances. Presumably that covers this use case too?

Maybe we should break this problem down into subproblems.

A. The scheduler needs to read in a config or hwloc XML files and populate its resource model. (We have this in flux-sched.)

B. Populating the resource model with R from the parent instance would just require one more reader. (The exact R format is still in the co-design area.)

C. How does the exec system deal with the case where the scheduler has much larger resources? If we can make it so that we launch a larger number of brokers per node to match the scheduler's resources, this should still be useful for testing scalability.

We were hoping to make the scheduler unaware that it is in a simulation. (@grondo says "like us", ha ha).

This will be the best!

Now I do not know whether I was then a man dreaming I was a butterfly, or whether I am now a butterfly, dreaming I am a man.

@garlick (Member) commented Jul 9, 2018

How does the exec system deal with the case where the scheduler has much larger resources? If we can make it so that we launch a larger number of brokers per node to match the scheduler's resources, this should still be useful for testing scalability.

Our thought was that the exec system should first be able to launch fake jobs with arbitrary resource assignment without needing any extra broker ranks (e.g. it would not launch any process or job shells but instead simply wait until simulator says the job should terminate and then update state as though procs and shells had terminated).

We talked a little bit this morning about using the simulator to exercise some of the scalability of the system, but decided that might unnecessarily conflate the two goals. It may be easier to test scalability without the additional synchronization required by the simulator.

@dongahn (Member) commented Jul 9, 2018

Our thought was that the exec system should first be able to launch fake jobs with arbitrary resource assignment without needing any extra broker ranks (e.g. it would not launch any process or job shells but instead simply wait until simulator says the job should terminate and then update state as though procs and shells had terminated).

OK. Then the level of emulation is pretty similar to the emulator we built in flux-sched. The difference would be building something directly into the exec system, with better separation of concerns so that the emulator can be layered across multiple components.

We talked a little bit this morning about using the simulator to exercise some of the scalability of the system, but decided that might unnecessarily conflate the two goals. It may be easier to test scalability without the additional synchronization required by the simulator.

OK. This works for me. But ultimately it would be really nice to bake something in so that we can run many brokers per node and test scalability that way.

@garlick (Member) commented Jul 9, 2018

The difference would be building something directly into the exec system, with better separation of concerns so that the emulator can be layered across multiple components.

Exactly!

But ultimately it would be really nice to bake something in so that we can run many brokers per node and test scalability that way.

We can do that now - were you thinking about baking in something about how the resources are carved up? Because right now although we can run multiple brokers per node, they each think they have all the node's resources, and so resources are oversubscribed...

@dongahn (Member) commented Jul 9, 2018

We can do that now - were you thinking about baking in something about how the resources are carved up? Because right now although we can run multiple brokers per node, they each think they have all the node's resources, and so resources are oversubscribed...

I was thinking of doing this with fake hwloc XML files, each fed to a broker, a capability that already exists.

But then when the exec system actually executes the target programs, this can quickly lead to resource limitations and other issues, because we will heavily overcommit brokers onto each node.

So a thought would be to make it such that the exec engine doesn't actually execute the program (no fork/exec) while still satisfying the requirements of the other systems (e.g., fake data in the KVS job schema).

In the case of STAT, we baked in fake stack traces into our tool daemons.

This can be emulation level 2.

@SteVwonder SteVwonder self-assigned this Feb 14, 2019
@SteVwonder (Member Author) commented:

Some other requirements that were referenced in now closed flux-sched issues:

  • We should support timed cancelling of jobs
  • We should support "native SLURM" traces (csv dumps from sacct)
  • We should support node-level and core-level scheduling
    • node-level is necessary to simulate SLURM on LC systems
    • core-level is necessary to simulate multi-tenancy and Pilot2-style workflows
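For the "native SLURM" trace requirement above, a reader along these lines could turn an sacct dump into simulator job records. The field names (JobID, Submit, Elapsed, NNodes) are common sacct format fields, but they are assumptions here, since the exact columns and delimiter depend on how sacct was invoked:

```python
import csv
import io

def parse_sacct_csv(text, delimiter="|"):
    """Parse an sacct dump into simulator job records.

    Assumes a header row naming the columns; sacct's parsable output is
    pipe-delimited by default, but the delimiter is configurable to match
    whatever the dump actually uses.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    jobs = []
    for row in reader:
        jobs.append({
            "jobid": row["JobID"],
            "submit": row["Submit"],       # submission timestamp
            "elapsed": row["Elapsed"],     # runtime, HH:MM:SS
            "nnodes": int(row["NNodes"]),  # node count (node-level scheduling)
        })
    return jobs
```

The submit timestamps would seed the simulator's initial time steps, and Elapsed would become each job's simulated runtime.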

@garlick garlick changed the title move main simulator module into flux-core? simulator: design new architecture for new job exec system Feb 25, 2019
@SteVwonder SteVwonder added the design don't expect this to ever be closed... label Feb 25, 2019
@SteVwonder (Member Author) commented:

I spoke with @chu11 before the break, and he had some great ideas inspired by his work on the infiniband simulator. In particular, it would be great if users could provide a list of "breakpoints": times at which the simulator should stop and run a command specified by the user. This could be used to provide second-class simulator support for job cancellations and nodes going up/down/drained. It could also be used by flux devs and sysadmins to script the deterministic reproduction of hard-to-hit bugs (e.g., to reproduce the bug, you must mark a node as drained on a saturated system and then reload the scheduler). Finally, it could be used to script the testing of edge cases in scheduling policies (e.g., saturate the system, hit the user breakpoint, run their script which calculates the largest job that will still get backfilled, resume the simulation, and verify that the job was in fact backfilled). Bonus points if the simulator can stop at a specified breakpoint and drop the user into an interactive shell so they can poke around at the system state.
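The breakpoint idea could be prototyped as a simple interleaving of simulator events with user-supplied commands. This sketch is schematic: a "command" is just a Python callable here, rather than a shell command or an interactive session.

```python
def run_with_breakpoints(events, breakpoints):
    """Interleave simulator events with user 'breakpoints'.

    events: list of (time, payload) pairs the simulator would process.
    breakpoints: dict mapping a simulated time to a callable to run when
        the simulator reaches (or passes) that time.
    Returns a log of ("event", time, payload) and
    ("breakpoint", time, result) entries in simulated-time order.
    """
    pending = sorted(breakpoints.items())
    log = []
    for t, payload in sorted(events):
        # Fire any breakpoints due at or before this event's time.
        while pending and pending[0][0] <= t:
            bp_time, command = pending.pop(0)
            log.append(("breakpoint", bp_time, command()))
        log.append(("event", t, payload))
    # Breakpoints past the last event still run at end of simulation.
    for bp_time, command in pending:
        log.append(("breakpoint", bp_time, command()))
    return log
```

Because the whole run is driven by simulated time, replaying the same events and breakpoints reproduces the same interleaving deterministically, which is exactly what the bug-reproduction use case needs.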

@garlick (Member) commented Aug 17, 2022

Closing this one as it seems the simulator is going to be on hold for a while.

@garlick garlick closed this as completed Aug 17, 2022