-
I think this design could be very interesting and promising: use compile-time analysis to exploit the program structure for maximizing parallelism and data locality, and then use runtime scheduling for efficient and fast resource allocation. As some relevant prior work shows, such a combined offline-online scheduling approach can achieve significant improvements (e.g., https://www.cs.ucr.edu/~dtrip003/publication/Website_WireFrame_Micro2017.pdf). A few high-level comments:
-
If I understood correctly, Edward already mentioned that the data structure used in …, but there is a risk that a more static scheduler would assign …
-
@zitaofang proposed a small instruction set for the workers, so that each static schedule can be represented in the form of a program. The instruction set contains 3 operations:
Consider the original example.
The representation for this schedule would be
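The three operations themselves do not appear above. Purely as a hypothetical illustration of the general idea (this is not @zitaofang's actual instruction set), a static schedule could be encoded as one small program per worker:

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical 3-operation worker "instruction set"; illustrative only, not the
# actual proposal referenced above.
@dataclass
class Exec:        # run a specific reaction
    reaction: str

@dataclass
class WaitFor:     # block until another worker has finished a reaction
    reaction: str

@dataclass
class AdvanceTag:  # this worker is done with the current logical instant
    pass

Instruction = Union[Exec, WaitFor, AdvanceTag]

# One program per worker; together they encode one static schedule.
worker_programs: List[List[Instruction]] = [
    [Exec("CF1"), Exec("C1"), AdvanceTag()],       # worker 0
    [WaitFor("CF1"), Exec("CF2"), AdvanceTag()],   # worker 1
]
```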
-
Uploading a preliminary project report here:
-
This proposal aims to explore a quasi-static scheduler for LF.
A quasi-static scheduler performs scheduling both at compile time and at runtime.
Potential advantages (to be verified)
Approach
The approach is broken down into two parts: a compile-time algorithm and a runtime algorithm.
Running example: `SleepingBarber.lf`
For simplicity, let's assume that there is only 1 customer instead of 20.
Compile-time algorithm
Step 0: Construct a precedence graph, which the compiler already does.
The precedence graph tells us the order in which reactions are processed when they are enabled at a particular logical instant.
Step 1: Construct a counterfactual causality (CC) graph from the LF program, with nodes being reactions and edges being counterfactual causal relations. We also label the edges with logical delays.
For example, CF1 has an arrow back to itself because of the logical action `next` it can schedule. The presence of the logical action `next` establishes a counterfactual causal relation because, without a previous invocation of CF1 scheduling `next`, the next invocation of CF1 would not occur. Besides actions, the same counterfactual reasoning can be applied to reactions related by connections and timers as well.
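As an illustrative sketch (not part of the original post), such a delay-labeled CC graph could be represented as an adjacency list; the reaction names and the 1-second delay on CF1's self-loop are assumptions based on the running example:

```python
from collections import defaultdict

# Minimal sketch of a counterfactual-causality (CC) graph: nodes are reactions,
# and each edge carries the logical delay introduced by the action, connection,
# or timer that relates them. Names and delays below are assumptions, not the
# actual SleepingBarber graph.
class CCGraph:
    def __init__(self):
        self.edges = defaultdict(list)  # src reaction -> list of (dst reaction, delay in ns)

    def add_edge(self, src, dst, delay_ns=0):
        self.edges[src].append((dst, delay_ns))

    def zero_delay_successors(self, node):
        """Reactions that can be triggered within the same logical instant."""
        return [dst for dst, delay in self.edges[node] if delay == 0]

cc = CCGraph()
cc.add_edge("CF1", "CF1", delay_ns=1_000_000_000)  # CF1 schedules `next` with a 1 s delay
cc.add_edge("CF1", "C1")                           # zero-delay connection (assumed)
cc.add_edge("C1", "B1")                            # zero-delay connection (assumed)

print(cc.zero_delay_successors("CF1"))  # ['C1']
```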
Step 2: From the CC graph (not the precedence graph), generate a mapping from schedulable events to subgraphs of the CC graph, with each subgraph being a group of reactions that could get triggered at the same instant w.r.t. the order of counterfactual causality (denoted by the arrows).
Step 3a: Augment the above subgraphs with arrows from the precedence graph (from Step 0).
Step 3b: For any two adjacent nodes, if there is a longer path between them, remove the direct edge (i.e., take the transitive reduction of the graph).
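A small sketch of Step 3b (my illustration, assuming the augmented subgraph is a DAG stored as an adjacency map):

```python
# Step 3b sketch: drop a direct edge u->v if v is still reachable from u through a
# longer path (transitive reduction of a DAG). The graph is a dict mapping each node
# to the set of its direct successors.
def transitive_reduction(graph):
    def reachable(src, dst, skip_edge):
        # DFS that ignores the edge currently being tested.
        stack, seen = [src], set()
        while stack:
            n = stack.pop()
            for m in graph[n]:
                if (n, m) == skip_edge:
                    continue
                if m == dst:
                    return True
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        return False

    reduced = {n: set(succs) for n, succs in graph.items()}
    for u in graph:
        for v in list(graph[u]):
            if reachable(u, v, skip_edge=(u, v)):
                reduced[u].discard(v)
    return reduced

# Hypothetical augmented subgraph: A->B, B->C, plus a redundant direct edge A->C.
g = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
print(transitive_reduction(g))  # {'A': {'B'}, 'B': {'C'}, 'C': set()}
```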
Step 4: For each mapping, generate an N-worker schedule table, where N is the number of worker threads used.
Here, let's assume that we have 2 worker threads.
Note that each schedule here is a schedule for a logical instant.
Note that the structure of the LF program naturally presents an opportunity for a certain degree of parallelism.
The degree to which we can parallelize is limited, however.
Notice that here, even if we have 10 worker threads, at most 2 of them will run in parallel.
It would be very useful to have a systematic way to calculate, for a specific LF program, the optimal number of workers that can maximize parallelism.
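On that last point, one simple (and admittedly rough) estimate, offered here as my own sketch rather than part of the proposal, is the peak width of the per-instant reaction DAG under an ASAP schedule:

```python
from collections import defaultdict

# Sketch: estimate how many workers an ASAP schedule of one logical instant can keep
# busy at once, by computing the width of each "level" of the per-instant reaction DAG
# (level = longest path from a source; assumes a DAG). Each level is an antichain, so
# the peak level width is a lower bound on the DAG's true width.
def peak_parallelism(dag):  # dag: node -> set of direct successors
    indegree = defaultdict(int)
    for u, succs in dag.items():
        for v in succs:
            indegree[v] += 1
    level = {u: 0 for u in dag if indegree[u] == 0}
    frontier = list(level)
    while frontier:
        nxt = []
        for u in frontier:
            for v in dag[u]:
                level[v] = max(level.get(v, 0), level[u] + 1)
                indegree[v] -= 1
                if indegree[v] == 0:
                    nxt.append(v)
        frontier = nxt
    width = defaultdict(int)
    for lvl in level.values():
        width[lvl] += 1
    return max(width.values())

# Hypothetical per-instant DAG with a 2-wide middle layer.
dag = {"CF1": {"C1", "CF2"}, "C1": {"B1"}, "CF2": {"B1"}, "B1": set()}
print(peak_parallelism(dag))  # 2: a third worker would sit idle at this instant
```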
Now we are ready to proceed to the runtime algorithm.
Runtime algorithm
Step 1: Create a struct variable called `current_schedule` and a priority queue called `pending_events`, with each element being a list of events to be present at some future time t.
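A minimal sketch of these two pieces of state (the shapes are my assumption, not the actual runtime structs), with `pending_events` kept as a min-heap ordered by time:

```python
import heapq

# pending_events: a priority queue ordered by logical time, where each entry holds the
# list of events that become present at that time. current_schedule: the static schedule
# chosen for the instant being executed. Both shapes are assumptions for illustration.
pending_events = []      # heap of (time_ns, [event names])
current_schedule = None  # filled in by the table lookup in Step 2

def schedule_event(time_ns, event):
    """Add an event at time_ns, reusing the entry for that time if one exists.
    (A real implementation would index entries by time instead of scanning.)"""
    for t, events in pending_events:
        if t == time_ns:
            events.append(event)
            return
    heapq.heappush(pending_events, (time_ns, [event]))

schedule_event(0, "startup")
print(pending_events)  # [(0, ['startup'])]
```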
Step 2: Remove all events present at time t from `pending_events`, perform a lookup on the N-worker schedule table, and update `current_schedule`.

In this case, we pop the list of signals for t = 0 off `pending_events`, observe that it only contains `startup`, look for the static schedule corresponding to `startup` being present, and set `current_schedule` to that static schedule.
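A sketch of Step 2 under the same assumptions, with a hypothetical schedule table keyed by the set of present events:

```python
import heapq

# Hypothetical N-worker schedule table produced at compile time: the key is the set of
# events present at a logical instant, the value is one reaction list per worker.
# The entries below are assumptions for illustration only.
schedule_table = {
    frozenset({"startup"}): [["CF1", "C1", "B1"], ["CF2"]],
    frozenset({"CustomerFactory.next"}): [["CF1"], []],
}

def advance_to_next_instant(pending_events):
    """Step 2: pop everything at the earliest time and look up its static schedule."""
    time_ns, present = heapq.heappop(pending_events)
    return time_ns, schedule_table[frozenset(present)]

pending_events = [(0, ["startup"])]
t, current_schedule = advance_to_next_instant(pending_events)
print(t, current_schedule)  # 0 [['CF1', 'C1', 'B1'], ['CF2']]
```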
Step 3: Execute a schedule for a particular logical instant. Skip a reaction invocation from the schedule if its upstream reactions do not produce outputs. Append the generated events to the corresponding linked lists in `pending_events`.

In this example, we proceed with executing the static schedule for t = 0.

Assume that `CF1` has just finished and `CustomerFactory.next` is scheduled 1 second later. The state of the data structures at this moment looks like this:

Assume that C1 does not produce an output that triggers CF3. As soon as this is known, CF3 in the static schedule will be marked as "to be skipped."

Assume that, at this point, B2 finishes and schedules `Barber.done` at t = 500 msec. The state of the data structure becomes:

The multi-threaded runtime finishes the current schedule and proceeds to the next step.
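A sketch of the skip rule in Step 3 (my illustration; the upstream table, reaction names, and event names are all assumptions):

```python
# Step 3 sketch: a worker walks its slice of current_schedule, skips a reaction whose
# upstream reaction (within the same instant) produced no output, and pushes any newly
# scheduled events into pending_events. All names and tables below are assumptions.
upstream = {"C1": "CF1", "CF3": "C1", "B1": "C1"}  # hypothetical intra-instant triggers

def run_worker(worker_schedule, produced_output, invoke, now_ns, schedule_event):
    for reaction in worker_schedule:
        trigger = upstream.get(reaction)
        if trigger is not None and not produced_output.get(trigger, False):
            continue  # this invocation is "to be skipped"
        made_output, new_events = invoke(reaction)
        produced_output[reaction] = made_output
        for delay_ns, event in new_events:
            schedule_event(now_ns + delay_ns, event)

def invoke(reaction):
    # Fake reaction bodies: CF1 schedules `next` 1 s later and produces an output;
    # C1 produces no output, so CF3 will be skipped.
    if reaction == "CF1":
        return True, [(1_000_000_000, "CustomerFactory.next")]
    return (False, []) if reaction == "C1" else (True, [])

pending = []
produced = {}
run_worker(["CF1", "C1", "CF3"], produced, invoke,
           now_ns=0, schedule_event=lambda t, e: pending.append((t, e)))
print(pending)   # [(1000000000, 'CustomerFactory.next')]
print(produced)  # {'CF1': True, 'C1': False}; CF3 is absent because it was skipped
```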
Step 4: Repeat Step 2 until there are no more events to process.
Since we have two pending events from executing the previous schedule, we pop off all events at t = 500 msec, advance time to t = 500 msec, and repeat Step 2.
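Putting Steps 2 through 4 together, the top-level loop might look roughly like this (reusing the shapes assumed in the sketches above):

```python
import heapq

# Sketch of the overall runtime loop (Steps 2-4). It runs until pending_events is empty;
# at that point a real runtime would schedule and handle shutdown.
def run(pending_events, schedule_table, execute_instant):
    while pending_events:
        time_ns, present = heapq.heappop(pending_events)            # Step 2: pop earliest time
        current_schedule = schedule_table[frozenset(present)]       # Step 2: table lookup
        execute_instant(time_ns, current_schedule, pending_events)  # Step 3: run / skip / append
        # Step 4: loop back to Step 2 until no events remain.

run([(0, ["startup"])],
    {frozenset({"startup"}): [["CF1", "C1", "B1"], ["CF2"]]},
    execute_instant=lambda t, sched, pending: print("executing", sched, "at t =", t))
```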
This runtime algorithm proceeds until all `pending_events` are handled. At this point, the `shutdown` event is scheduled, properly shutting down the execution.