
[WIP] Flux simulator #2561

Closed
wants to merge 13 commits

Conversation

@SteVwonder
Member

SteVwonder commented Nov 27, 2019

Initial support for the new simulator design within flux-core. It is a CLI tool that takes output files from sacct and re-executes the job trace through Flux using a simulated set of resources. Most of the logic is contained within flux-simulator.py, but there is some added logic to the job-manager and scheduler for determining "quiescence" (i.e., in the absence of new events/requests, the system will make no further changes - such as allocating or freeing jobs).

I can peel the python bindings changes out into a separate PR if that is desirable (a few of the commits can be removed too once we close #2549).

Related: #1566

@garlick
Member

garlick commented Nov 27, 2019

Yay! Take a victory lap!

A few initial thoughts/questions:

Let's peel off the python refcounting fix(es) to a standalone PR and get that in ASAP, even if it's not the final fix, so we don't need to carry it here and elsewhere.

Could we improve on job.convert_id() by creating a JobID class with a factory interface as proposed for JobSpec (or maybe that would be overkill when the native format is just an int?). Maybe the public C api should provide an interface for the conversions so python doesn't need to use libutil/fluid.h directly.
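For concreteness, such a JobID class might look roughly like the sketch below (hypothetical only: the class name, constructors, and properties are assumptions, and the KVS-path conversion is omitted because it would need to come from libutil/fluid.h or a public C wrapper):

```python
# Hypothetical sketch of a JobID wrapper with factory-style constructors;
# not an existing flux-core API.
class JobID(int):
    @classmethod
    def from_dec(cls, s):
        return cls(int(s, 10))

    @classmethod
    def from_hex(cls, s):
        return cls(int(s, 16))

    @property
    def dec(self):
        return str(int(self))

    @property
    def hex(self):
        return "{:#x}".format(int(self))


jobid = JobID.from_hex("0x2a")
print(jobid.dec, jobid.hex)  # -> 42 0x2a
```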

Rather than calling out the job record as "sacct format", should we define a flux format that is either compatible or that has a straightforward conversion path? The job-info module could perhaps produce traces of Flux workloads in this format.

Is the strategy to have simulator.py unload the exec module and register a handler to replace it? Maybe we could find a better way to do that and then avoid the need to do module management from python. Maybe using the testexec interface or a simulator specific struct exec_implementation? I think the C interface for module management is pretty rough and I hate to give it more traction here. Another angle would be to clean that up in C and provide a python API as a separate PR.

The big ticket item of course: should we revisit, now that we've had some experience building on our original job manager design, whether there are alternatives to the "quiescent" interface? It would be nice if any new synchronization mechanisms we introduce have some general utility beyond this use case. It would also be nice to have less intrusion into the scheduler. This is hard as I recall, so I'm not sure we'll get anywhere, but I'd feel better if we spent a bit more time thinking about it before committing to this approach.

IMHO, peeling off some of the bits mentioned above into standalone PRs would help move this forward.

Anyway, nice job getting this all wrapped up :-)

SteVwonder and others added 13 commits December 19, 2019 12:35
converts flux ids from/to hex, dec, and kvs
add optional callbacks to notify schedutil users when there are no
longer any outstanding futures/messages in the schedutil context (i.e.,
idle) and when the schedutil context goes from idle to busy (i.e., now
has an outstanding future/message)

useful for simulations where the scheduler needs to accurately respond
to a `quiescent` request from the job-manager
The simulator can now send a `job-manager.quiescent` request, which will
only be responded to when the entire system has quiesced (i.e., in the
absence of new events/requests, the system will make no further changes
- such as allocating or freeing jobs).  For the simple scheduler, this
simply means that the schedutil library is idle.

The job-manager then sends its own `quiescent` request to the scheduler
along with every alloc request. It will only respond to the simulator's
request after its own request to the scheduler is responded to.  In the
future, this protocol will be expanded to include the exec and depend
modules.
after receiving an alloc response from the scheduler, the job-manager
emits an event, which triggers a `start` request to be sent to the exec
system.  The re-entrance into the reactor loop between the reception of
the alloc response and sending the start request means that the
job-manager has a chance to "prematurely" process the quiescent
response from the scheduler.  This ultimately leads to the simulator
receiving an erroneous 'quiescent' response from the job-manager.  A
similar problem exists for outstanding start requests.

To solve these problems, ensure that every alloc response has a
corresponding start response before sending a quiescent request.  Track
the number of outstanding requests in the simulator context of the job
manager, which is also the piece responsible for responding to the
quiescent request.
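
To illustrate how these pieces fit together from the simulator's side, here is a rough sketch of an event loop that waits on the `job-manager.quiescent` request between simulated events. This is not the actual flux-simulator.py: the Simulation class and its method names are assumptions, while flux.Flux() and rpc() are standard flux-core Python bindings.

```python
import heapq

import flux


class Simulation:
    """Illustrative sketch only, not this PR's implementation."""

    def __init__(self, handle):
        self.h = handle
        self.current_time = 0.0
        self._seq = 0          # tie-breaker so heapq never compares callbacks
        self.event_queue = []

    def add_event(self, time, callback):
        heapq.heappush(self.event_queue, (time, self._seq, callback))
        self._seq += 1

    def run(self):
        while self.event_queue:
            time, _, callback = heapq.heappop(self.event_queue)
            self.current_time = time
            callback()
            # Wait until the system has quiesced before advancing the clock:
            # the job-manager only answers once no further allocations/frees
            # will happen without new events (topic name from this PR).
            self.h.rpc("job-manager.quiescent").get()


sim = Simulation(flux.Flux())
```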
@SteVwonder
Member Author

Let's peel off the python refcounting fix(es) to a standalone PR and get that in ASAP, even if it's not the final fix, so we don't need to carry it here and elsewhere.

👍 Done.

Maybe the public C api should provide an interface for the conversions so python doesn't need to use libutil/fluid.h directly.

Yeah, that makes sense. I was waffling between the two solutions and went the Python route b/c it was expedient at the time, but exporting it from C seems cleaner.

Rather than calling out the job record as "sacct format", should we define a flux format that is either compatible or that has a straightforward conversion path? The job-info module could perhaps produce traces of Flux workloads in this format.

Using a format other than "sacct" seems like a good idea to me. One option is the Parallel Workloads Archive's "Standard Workload Format" (SWF). That is the closest thing to a common standard in the literature, although it is a bit outdated at this point. Another option would be what you suggest: put together our own format, maybe one that natively supports Jobspec. That way it is easy to run simulations involving resources beyond nodes and cores. I think I'm leaning towards the latter since we plan on doing BB simulations in the short to medium term as part of an L2 milestone. Our conversion script could accept both SWF and sacct as inputs.
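
To make that concrete, a single record in a hypothetical Flux-native trace format might look something like the following; the field names are invented for illustration and are not a proposed standard:

```python
# Hypothetical trace record for illustration only. The embedded jobspec is
# what would let simulations cover resources beyond nodes and cores
# (e.g., burst buffers).
trace_record = {
    "id": 1234,
    "submit_time": 100.0,   # seconds since the start of the trace
    "elapsed": 3600.0,      # runtime observed in the source trace
    "jobspec": {
        "resources": [
            {"type": "node", "count": 4,
             "with": [{"type": "core", "count": 32}]},
        ],
        "attributes": {"system": {"duration": 7200}},
    },
}
```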

Is the strategy to have simulator.py unload the exec module and register a handler to replace it? Maybe we could find a better way to do that and then avoid the need to do module management from python. Maybe using the testexec interface or a simulator specific struct exec_implementation? I think the C interface for module management is pretty rough and I hate to give it more traction here. Another angle would be to clean that up in C and provide a python API as a separate PR.

Yeah, that is the current strategy. I'm definitely open to changing it. One easy tweak to the current strategy could be to remove module loading/unloading from the Python side and just have simulator-specific RC scripts that don't load the exec system.

One of the benefits IMO of doing it from Python is that all of the simulator-specific information and logic (including the simulated clock) is localized to a single file. IIUC, a simulator-specific struct exec_implementation would require some form of side channel between the simulator and the exec system to communicate:

  • The actual runtime of the job (assuming it is less than the requested walltime)
  • The current simulated time so that the exec system knows when to emit a "job exited" event/msg

Maybe we can discuss in more detail at coffee time.

The big ticket item of course: should we revisit, now that we've had some experience building on our original job manager design, whether there are alternatives to the "quiescent" interface? It would be nice if any new synchronization mechanisms we introduce have some general utility beyond this use case. It would also be nice to have less intrusion into the scheduler. This is hard as I recall, so I'm not sure we'll get anywhere, but I'd feel better if we spent a bit more time thinking about it before committing to this approach.

Yeah, I agree that this solution isn't the most appealing from a conceptual level. As we discussed face-to-face, let's move forward with the quiescent interface for now, and we can revisit later on once we have some more discussions and better ideas. For now, I think the big benefits of the quiescent interface are that it:

  • Is highly localized and is minimally invasive to the broader codebase. Almost all of the code is tucked away in the simulator component of the job-manager and the flux-simulator.py script. The only other piece impacted currently is the scheduler, which now has an idle_cb, a busy_cb and a quiescent_cb, totalling ~65 lines. I expect a similar number of lines will need to be added to the depend module. To be honest, I'm not sure what the impact will be on the flux-sched scheduler. IIUC, it isn't doing any idle loops; it is still entirely event-driven. So it should be a similarly small number of lines of code.
  • Has almost zero runtime implications when running normally (i.e., not a simulation). There are three function calls added in job-manager that increment/decrement a variable, run a simple check for a NULL value and then immediately return when the quiescent interface is not being used. The idle and busy callbacks in the scheduler just flip a boolean value, do a similar NULL value check, and then immediately return when not in a simulation. I plan on doing a simple job throughput test with submitbench to validate this claim.

IMHO, peeling off some of the bits mentioned above into standalone PRs would help move this forward.

👍 I'll start work on that now.

@lgtm-com

lgtm-com bot commented Jan 14, 2020

This pull request introduces 5 alerts when merging ce68813 into ce510d3 - view on LGTM.com

new alerts:

  • 2 for Unused local variable
  • 1 for Unnecessary pass
  • 1 for Module is imported more than once
  • 1 for Unused argument in a formatting call

@SteVwonder
Member Author

Per a face-to-face discussion with @garlick:

  • Ok to push forward with the current module load/unload strategy. The exec plugin infrastructure is changing soon, so it may not be the best time to build off of that.
  • Whatever job trace format we go with, we should create a tool that can build that format from a live instance of Flux.

Comment on lines +133 to +136
def insert_apriori_events(self, simulation):
# TODO: add priority to `add_event` so that all submits for a given time
# can happen consecutively, followed by the waits for the jobids
simulation.add_event(self.submit_time, lambda: simulation.submit_job(self))
Member Author

Astute observation from @mrwyattii: this logic should be contained in the simulator along with the other job event additions.

I originally planned to have the Job add all of its own events so that the Simulation could remain agnostic of the job's lifecycle (submit -> run -> complete). That would make adding new job states like depend, grow, and shrink only require modifying the Job class. It would also make adding new entities like a Resource (e.g., node, filesystem) more modular; they would each handle their own event adding and the Simulation could remain ignorant of their lifecycles. But that is probably left for another day and a different PR.
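
As a sketch of that alternative design (hypothetical; the method and attribute names below are made up for illustration), the Job would enqueue its entire submit -> run -> complete lifecycle itself:

```python
# Hypothetical sketch of the "Job owns its own lifecycle" alternative;
# names are invented and do not match this PR's code.
class Job:
    def __init__(self, jobid, submit_time, run_time):
        self.jobid = jobid
        self.submit_time = submit_time
        self.run_time = run_time
        self.start_time = None

    def insert_apriori_events(self, simulation):
        simulation.add_event(self.submit_time,
                             lambda: simulation.submit_job(self))

    def on_start(self, simulation):
        # Called when the scheduler allocates the job; the Job schedules
        # its own completion, so the Simulation stays lifecycle-agnostic.
        self.start_time = simulation.current_time
        simulation.add_event(self.start_time + self.run_time,
                             lambda: simulation.complete_job(self))
```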

@SteVwonder
Member Author

SteVwonder commented Apr 18, 2020

Note from @mrwyattii's current research investigation. The current cancel method just raises the cancel exception. The simulator acting as the exec system does not actually process the cancel exception properly, so the job never makes it to the inactive state. On the plus side, the post-simulation auditing of job states worked properly!

@SteVwonder
Member Author

SteVwonder commented Aug 29, 2020

EDIT: I just force pushed the commit (ce68813) that I had previously overwritten with an older commit.

@adfaure

adfaure commented Jun 15, 2021

Hello, will this PR be accepted?
If I want to use Flux simulation, do I need to use this branch, or is simulation also possible in the master branch?

Thank you.

@grondo
Contributor

grondo commented Jun 15, 2021

Hello, will this PR be accepted?

@adfaure, this PR is quite outdated, so it won't be accepted in its current form, though I think the plan is to eventually update and merge this work.

If I want to use Flux simulation, do I need to use this branch, or is simulation also possible in the master branch?

It depends on what you mean by simulation. What are you looking to do? For example, the mainline version of flux-core can simulate job execution when the attributes.system.exec.test.run_duration attribute is set.
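
For example, something like the following should exercise that path (a hedged sketch using the flux-core Python bindings as I understand them; the exact setattr key handling may differ between versions):

```python
# Sketch: submit a job whose execution is faked by the test exec
# implementation via the run_duration attribute mentioned above.
import flux
from flux.job import JobspecV1, submit

h = flux.Flux()
jobspec = JobspecV1.from_command(["true"])
jobspec.setattr("system.exec.test.run_duration", "10s")  # simulated runtime
print(submit(h, jobspec))
```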

@adfaure

adfaure commented Jun 15, 2021

I am interested in understanding the simulation capabilities of Flux to get a global picture of what it offers, especially regarding scheduling simulation.
Specifically:

  • What is the simulation model for the jobs?
  • What is the simulation model of the platform?
  • How can I write a new scheduling algorithm?

Some time ago, I managed to get the simulator from this PR working; I will try to do the same with the current master branch.

Thank you for your quick answer.

@grondo
Contributor

grondo commented Jun 15, 2021

@adfaure, I will let @SteVwonder answer some of your specific questions.

How can I write a new scheduling algorithm?

The scheduler in Flux is an independent module. To develop a new scheduling algorithm, you can either write a new scheduler module (perhaps using the extremely simple included scheduler as a starting point) or develop new planner or matching plugins for the Fluxion graph-based scheduler.

We should perhaps move the last few comments here to our Discussions forum.

Edit: Done. See #3718

@grondo
Contributor

grondo commented Jul 7, 2022

Placing the simulator in deep freeze. To be resurrected at some future date when civilization has evolved to a higher level of consciousness.


Successfully merging this pull request may close these issues.

Hang when submitting many jobs via Python
4 participants