
[WIP] Flux simulator #2561

Closed
wants to merge 13 commits

Conversation

@SteVwonder
Member

SteVwonder commented Nov 27, 2019

Initial support for the new simulator design within flux-core. It is a CLI tool that takes output files from sacct and re-executes the job trace through Flux using a simulated set of resources. Most of the logic is contained within flux-simulator.py, but there is some added logic to the job-manager and scheduler for determining "quiescence" (i.e., in the absence of new events/requests, the system will make no further changes - such as allocating or freeing jobs).

I can peel the python bindings changes out into a separate PR if that is desirable (a few of the commits can be removed too once we close #2549).

Related: #1566

@garlick
Member

garlick commented Nov 27, 2019

Yay! Take a victory lap!

A few initial thoughts/questions:

Let's peel off the python refcounting fix(es) to a standalone PR and get that in ASAP, even if it's not the final fix, so we don't need to carry it here and elsewhere.

Could we improve on job.convert_id() by creating a JobID class with a factory interface as proposed for JobSpec (or maybe that would be overkill when the native format is just an int?). Maybe the public C api should provide an interface for the conversions so python doesn't need to use libutil/fluid.h directly.
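For concreteness, such a JobID class might look roughly like the sketch below (hypothetical only: the class name, constructors, and properties are assumptions, and the KVS-path conversion is omitted because it would need to come from libutil/fluid.h or a public C wrapper):

```python
# Hypothetical sketch of a JobID wrapper with factory-style constructors;
# not an existing flux-core API.
class JobID(int):
    @classmethod
    def from_dec(cls, s):
        return cls(int(s, 10))

    @classmethod
    def from_hex(cls, s):
        return cls(int(s, 16))

    @property
    def dec(self):
        return str(int(self))

    @property
    def hex(self):
        return "{:#x}".format(int(self))


jobid = JobID.from_hex("0x2a")
print(jobid.dec, jobid.hex)  # -> 42 0x2a
```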

Rather than calling out the job record as "sacct format", should we define a flux format that is either compatible or that has a straightforward conversion path? The job-info module could perhaps produce traces of Flux workloads in this format.

Is the strategy to have simulator.py unload the exec module and register a handler to replace it? Maybe we could find a better way to do that and then avoid the need to do module management from python. Maybe using the testexec interface or a simulator specific struct exec_implementation? I think the C interface for module management is pretty rough and I hate to give it more traction here. Another angle would be to clean that up in C and provide a python API as a separate PR.

The big ticket item of course: should we revisit, now that we've had some experience building on our original job manager design, whether there are alternatives to the "quiescent" interface? It would be nice if any new synchronization mechanisms we introduce have some general utility beyond this use case. It would also be nice to have less intrusion into the scheduler. This is hard as I recall, so I'm not sure we'll get anywhere, but I'd feel better if we spent a bit more time thinking about it before committing to this approach.

IMHO, peeling off some of the bits mentioned above into standalone PRs would help move this forward.

Anyway, nice job getting this all wrapped up :-)

SteVwonder and others added 13 commits December 19, 2019 12:35
converts flux ids from/to hex, dec, and kvs
add optional callbacks to notify schedutil users when there are no
longer any outstanding futures/messages in the schedutil context (i.e.,
idle) and when the schedutil context goes from idle to busy (i.e., now
has an outstanding future/message)

useful for simulations where the scheduler needs to accurately respond
to a `quiescent` request from the job-manager
The simulator can now send a `job-manager.quiescent` request, which will
only be responded to when the entire system has quiesced (i.e., in the
absence of new events/requests, the system will make no further changes
- such as allocating or freeing jobs).  For the simple scheduler, this
simply means that the schedutil library is idle.

The job-manager then sends its own `quiescent` request to the scheduler
along with every alloc request. It will only respond to the simulator's
request after its own request to the scheduler is responded to.  In the
future, this protocol will be expanded to include the exec and depend
modules.
after receiving an alloc response from the scheduler, the job-manager
emits an event, which triggers a `start` request to be sent to the exec
system.  The re-entrance into the reactor loop between the reception of
the alloc response and sending the start request means that the
job-manager has a chance to "prematurely" process the quiescent
response from the scheduler.  This ultimately leads to the simulator
receiving an erroneous 'quiescent' response from the job-manager.  A
similar problem exists for outstanding start requests.

To solve these problems, ensure that every alloc response has a
corresponding start response before sending a quiescent request.  Track
the number of outstanding requests in the simulator context of the job
manager, which is also the piece responsible for responding to the
quiescent request.
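
To illustrate how these pieces fit together from the simulator's side, here is a rough sketch of an event loop that waits on the `job-manager.quiescent` request between simulated events. This is not the actual flux-simulator.py: the Simulation class and its method names are assumptions, while flux.Flux() and rpc() are standard flux-core Python bindings.

```python
import heapq

import flux


class Simulation:
    """Illustrative sketch only, not this PR's implementation."""

    def __init__(self, handle):
        self.h = handle
        self.current_time = 0.0
        self._seq = 0          # tie-breaker so heapq never compares callbacks
        self.event_queue = []

    def add_event(self, time, callback):
        heapq.heappush(self.event_queue, (time, self._seq, callback))
        self._seq += 1

    def run(self):
        while self.event_queue:
            time, _, callback = heapq.heappop(self.event_queue)
            self.current_time = time
            callback()
            # Wait until the system has quiesced before advancing the clock:
            # the job-manager only answers once no further allocations/frees
            # will happen without new events (topic name from this PR).
            self.h.rpc("job-manager.quiescent").get()


sim = Simulation(flux.Flux())
```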
@SteVwonder
Member Author

Let's peel off the python refcounting fix(es) to a standalone PR and get that in ASAP, even if it's not the final fix, so we don't need to carry it here and elsewhere.

👍 Done.

Maybe the public C api should provide an interface for the conversions so python doesn't need to use libutil/fluid.h directly.

Yeah, that makes sense. I was waffling between the two solutions and went the Python route b/c it was expedient at the time, but exporting it from C seems cleaner.

Rather than calling out the job record as "sacct format", should we define a flux format that is either compatible or that has a straightforward conversion path? The job-info module could perhaps produce traces of Flux workloads in this format.

Using a format other than "sacct" seems like a good idea to me. One option is the Parallel Workloads Archive's "Standard Workload Format" (SWF). That is the closest thing to a common standard in the literature, although it is a bit outdated at this point. Another option would be what you suggest: put together our own format, maybe one that natively supports Jobspec. That way it is easy to run simulations involving resources beyond nodes and cores. I think I'm leaning towards the latter since we plan on doing BB simulations in the short to medium term as part of an L2 milestone. Our conversion script could accept both SWF and sacct as inputs.
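
To make that concrete, a single record in a hypothetical Flux-native trace format might look something like the following; the field names are invented for illustration and are not a proposed standard:

```python
# Hypothetical trace record for illustration only. The embedded jobspec is
# what would let simulations cover resources beyond nodes and cores
# (e.g., burst buffers).
trace_record = {
    "id": 1234,
    "submit_time": 100.0,   # seconds since the start of the trace
    "elapsed": 3600.0,      # runtime observed in the source trace
    "jobspec": {
        "resources": [
            {"type": "node", "count": 4,
             "with": [{"type": "core", "count": 32}]},
        ],
        "attributes": {"system": {"duration": 7200}},
    },
}
```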

Is the strategy to have simulator.py unload the exec module and register a handler to replace it? Maybe we could find a better way to do that and then avoid the need to do module management from python. Maybe using the testexec interface or a simulator specific struct exec_implementation? I think the C interface for module management is pretty rough and I hate to give it more traction here. Another angle would be to clean that up in C and provide a python API as a separate PR.

Yeah, that is the current strategy. I'm definitely open to changing it. One easy tweak to the current strategy could be to remove module loading/unloading from the Python side and just have simulator-specific RC scripts that don't load the exec system.

One of the benefits IMO of doing it from Python is that all of the simulator-specific information and logic (including the simulated clock) is localized to a single file. IIUC, a simulator-specific struct exec_implementation would require some form of side channel between the simulator and the exec system to communicate:

  • The actual runtime of the job (assuming it is less than the requested walltime)
  • The current simulated time so that the exec system knows when to emit a "job exited" event/msg

Maybe we can discuss in more detail at coffee time.

The big ticket item of course: should we revisit, now that we've had some experience building on our original job manager design, whether there are alternatives to the "quiescent" interface? It would be nice if any new synchronization mechanisms we introduce have some general utility beyond this use case. It would also be nice to have less intrusion into the scheduler. This is hard as I recall, so I'm not sure we'll get anywhere, but I'd feel better if we spent a bit more time thinking about it before committing to this approach.

Yeah, I agree that this solution isn't the most appealing from a conceptual level. As we discussed face-to-face, let's move forward with the quiescent interface for now, and we can revisit later on once we have some more discussions and better ideas. For now, I think the big benefits of the quiescent interface are that it:

  • Is highly localized and is minimally invasive to the broader codebase. Almost all of the code is tucked away in the simulator component of the job-manager and the flux-simulator.py script. The only other piece impacted currently is the scheduler, which now has an idle_cb, a busy_cb and a quiescent_cb, totalling ~65 lines. I expect a similar number of lines will need to be added to the depend module. To be honest, I'm not sure what the impact will be on the flux-sched scheduler. IIUC, it isn't doing any idle loops; it is still entirely event-driven. So it should be a similarly small number of lines of code.
  • Has almost zero runtime implications when running normally (i.e., not a simulation). There are three function calls added in job-manager that increment/decrement a variable, run a simple check for a NULL value and then immediately return when the quiescent interface is not being used. The idle and busy callbacks in the scheduler just flip a boolean value, do a similar NULL value check, and then immediately return when not in a simulation. I plan on doing a simple job throughput test with submitbench to validate this claim.

IMHO, peeling off some of the bits mentioned above into standalone PRs would help move this forward.

👍 I'll start work on that now.

@lgtm-com

lgtm-com bot commented Jan 14, 2020

This pull request introduces 5 alerts when merging ce68813 into ce510d3 - view on LGTM.com

new alerts:

  • 2 for Unused local variable
  • 1 for Unnecessary pass
  • 1 for Module is imported more than once
  • 1 for Unused argument in a formatting call

@SteVwonder
Member Author

Per a face-to-face discussion with @garlick:

  • Ok to push forward with the current module load/unload strategy. The exec plugin infrastructure is changing soon, so it may not be the best time to build off of that.
  • Whatever job trace format we go with, we should create a tool that can build that format from a live instance of Flux.

Comment on lines +133 to +136
def insert_apriori_events(self, simulation):
# TODO: add priority to `add_event` so that all submits for a given time
# can happen consecutively, followed by the waits for the jobids
simulation.add_event(self.submit_time, lambda: simulation.submit_job(self))
Member Author

Astute observation from @mrwyattii: this logic should be contained in the simulator along with the other job event additions.

I originally planned to have the Job add all of its own events so that the Simulation could remain agnostic of the job's lifecycle (submit -> run -> complete). That would make adding new job states like depend, grow, and shrink only require modifying the Job class. It would also make adding new entities like a Resource (e.g., node, filesystem) more modular; they would each handle their own event adding and the Simulation could remain ignorant of their lifecycles. But that is probably left for another day and a different PR.
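
As a sketch of that alternative design (hypothetical; the method and attribute names below are made up for illustration), the Job would enqueue its entire submit -> run -> complete lifecycle itself:

```python
# Hypothetical sketch of the "Job owns its own lifecycle" alternative;
# names are invented and do not match this PR's code.
class Job:
    def __init__(self, jobid, submit_time, run_time):
        self.jobid = jobid
        self.submit_time = submit_time
        self.run_time = run_time
        self.start_time = None

    def insert_apriori_events(self, simulation):
        simulation.add_event(self.submit_time,
                             lambda: simulation.submit_job(self))

    def on_start(self, simulation):
        # Called when the scheduler allocates the job; the Job schedules
        # its own completion, so the Simulation stays lifecycle-agnostic.
        self.start_time = simulation.current_time
        simulation.add_event(self.start_time + self.run_time,
                             lambda: simulation.complete_job(self))
```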

@SteVwonder
Member Author

SteVwonder commented Apr 18, 2020

Note from @mrwyattii's current research investigation. The current cancel method just raises the cancel exception. The simulator acting as the exec system does not actually process the cancel exception properly, so the job never makes it to the inactive state. On the plus side, the post-simulation auditing of job states worked properly!

@SteVwonder
Member Author

SteVwonder commented Aug 29, 2020

EDIT: I just force pushed the commit (ce68813) that I had previously overwritten with an older commit.

@adfaure

adfaure commented Jun 15, 2021

Hello, will this PR be accepted?
If I want to use Flux simulation, do I need to use this branch, or is simulation also possible in the master branch?

Thank you.

@grondo
Contributor

grondo commented Jun 15, 2021

Hello, will this PR be accepted?

@adfaure, this PR is quite outdated, so it won't be accepted in its current form, though I think the plan is to eventually update and merge this work.

If I want to use Flux simulation, do I need to use this branch, or is simulation also possible in the master branch?

It depends on what you mean by simulation. What are you looking to do? For example, the mainline version of flux-core can simulate job execution when the attributes.system.exec.test.run_duration attribute is set.
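
For example, something like the following should exercise that path (a hedged sketch using the flux-core Python bindings as I understand them; the exact setattr key handling may differ between versions):

```python
# Sketch: submit a job whose execution is faked by the test exec
# implementation via the run_duration attribute mentioned above.
import flux
from flux.job import JobspecV1, submit

h = flux.Flux()
jobspec = JobspecV1.from_command(["true"])
jobspec.setattr("system.exec.test.run_duration", "10s")  # simulated runtime
print(submit(h, jobspec))
```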

@adfaure

adfaure commented Jun 15, 2021

I am interested in understanding the simulation capabilities of Flux to get a global picture of what it offers, especially regarding scheduling simulation.
Specifically:

  • What is the simulation model for the jobs?
  • What is the simulation model of the platform?
  • How can I write a new scheduling algorithm?

Some time ago, I managed to get the simulator from this PR working; I will try to do the same with the current master branch.

Thank you for your quick answer.

@grondo
Contributor

grondo commented Jun 15, 2021

@adfaure, I will let @SteVwonder answer some of your specific questions.

How can I write a new scheduling algorithm?

The scheduler in Flux is an independent module. To develop a new scheduling algorithm, you can either write a new scheduler module (perhaps using the extremely simple included scheduler as a starting point) or develop new planner or matching plugins for the Fluxion graph-based scheduler.

We should perhaps move the last few comments here to our Discussions forum.

Edit: Done. See #3718

@grondo
Contributor

grondo commented Jul 7, 2022

Placing the simulator in deep freeze. To be resurrected at some future date when civilization has evolved to a higher level of consciousness.


Successfully merging this pull request may close these issues.

Hang when submitting many jobs via Python
4 participants