
Roadmap, Design and implementation choices #9

mratsim commented Jan 16, 2022

Here is an overview of items and features considered for future improvement, with sketches of solutions:

User-facing features

Receiving tasks from any thread

The task queues are SPMC (single-producer, multi-consumer) queues, hence threads other than the owning worker can't enqueue new jobs.

See #6

Solutions:

  1. A global MPMC queue (which risks recreating contention)
  2. Modify the Chase-Lev deque to allow pushing from other threads (front or back), or change the data structure
  3. Give each worker an additional MPSC queue to receive tasks (which risks added latency, as those tasks are unstealable). A sketch follows this list.
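
A minimal sketch of option 3, using a locked std/deques as a stand-in for a real lock-free MPSC queue; Inbox, push and tryPop are illustrative names, not part of the current API:

import std/[deques, locks]

type
  Task = proc () {.gcsafe.}       # stand-in for the runtime's task type

  Inbox = object
    ## Locked stand-in for a lock-free MPSC queue, one per worker.
    lock: Lock
    queue: Deque[Task]

proc init(ib: var Inbox) =
  initLock(ib.lock)

proc push(ib: var Inbox, task: Task) =
  ## Called from any thread, including non-worker threads.
  withLock ib.lock:
    ib.queue.addLast(task)

proc tryPop(ib: var Inbox, task: var Task): bool =
  ## Called only by the owning worker, which then moves the task into
  ## its own Chase-Lev deque (making it stealable) or runs it directly.
  withLock ib.lock:
    if ib.queue.len == 0:
      return false
    task = ib.queue.popFirst()
    result = true

The key property is that external submission never touches the Chase-Lev deque itself, so the owner's push/pop fast path is unchanged.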

Scheduling, Load balancing and task priority

Users might want access to task priorities.
In particular, for Proof-of-Stake blockchains, validation and block proposal messages have higher priority than anything else.

The Chase-Lev deques, which are LIFO, could be replaced by multiqueues, which are relaxed FIFO priority queues (a sketch follows the notes below).

Note: supporting priorities has a heavy impact on the runtime:

  • LIFO scheduling has better throughput, while FIFO offers more fairness and lower latency.
    A computational job often spawns many tasks; LIFO maximizes the chance that their data stays hot in cache while being processed.
    Also, with a deque the owner competes with potential thieves only for the last task, whereas with a FIFO queue everyone competes at the same end.
  • A priority queue is probably incompatible with batch stealing ("stealHalf"), which significantly reduces overhead.

See also: mratsim/weave#22
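
Below is a minimal multiqueue sketch, assuming a plain lock per heap rather than the try-lock fast path a production version would use; all names are illustrative:

import std/[heapqueue, locks, random]

type
  PrioTask = object
    priority: int               # lower value = more urgent
    run: proc () {.gcsafe.}

  MultiQueue = object
    ## Several sequential binary heaps, each behind its own lock.
    heaps: seq[HeapQueue[PrioTask]]
    locks: seq[Lock]

proc `<`(a, b: PrioTask): bool = a.priority < b.priority

proc initMultiQueue(n: int): MultiQueue =
  result.heaps = newSeq[HeapQueue[PrioTask]](n)
  result.locks = newSeq[Lock](n)
  for i in 0 ..< n:
    initLock(result.locks[i])

proc push(mq: var MultiQueue, t: PrioTask) =
  # Insert into a random heap: concurrent producers rarely collide.
  let i = rand(mq.heaps.len - 1)
  withLock mq.locks[i]:
    mq.heaps[i].push(t)

proc tryPop(mq: var MultiQueue, t: var PrioTask): bool =
  ## "Power of two choices": inspect two random heaps and pop from the
  ## one with the more urgent minimum. The result is only approximately
  ## the global minimum; that relaxation is what makes it scalable.
  var i = rand(mq.heaps.len - 1)
  var j = rand(mq.heaps.len - 1)
  if i > j: swap(i, j)          # consistent lock order avoids deadlock
  acquire(mq.locks[i])
  if i != j: acquire(mq.locks[j])
  try:
    let iOk = mq.heaps[i].len > 0
    let jOk = mq.heaps[j].len > 0
    if iOk and (not jOk or mq.heaps[i][0] < mq.heaps[j][0]):
      t = mq.heaps[i].pop()
      result = true
    elif jOk:
      t = mq.heaps[j].pop()
      result = true
  finally:
    if i != j: release(mq.locks[j])
    release(mq.locks[i])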

Structured Parallelism

See spec nim-lang/RFCs#347 (comment)

Structured parallelism [OPTIONAL]

Structured parallelism ensures that all tasks and their descendants created within a scope are completed before exiting that scope.

template syncScope(ex: MyExecutor, scope: untyped): untyped
template syncScope(scope: untyped): untyped

References:

  • X10 Parallel Programming Language (2004), the Async-Finish construct
  • Habanero multithreading runtime for Java and Scala, the Async-Finish construct
  • Habanero C, the Async-Finish construct

More recently, concurrency frameworks have also converged on similar "structured concurrency" (https://en.wikipedia.org/wiki/Structured_concurrency).

This would allow the following:

tp.syncScope:
  tp.spawn fn0()
  tp.spawn fn1()

let a = ...

and ensures that all tasks are finished before exiting the scope, without waiting for tasks outside this scope (as syncAll would). Also, any thread can call syncScope, unlike syncAll. It would also avoid having to wrap procs coming from dependencies so that they return a dummy bool just so we can await them.

Implementation is a straightforward Atomic[int] plus the same code as forceFuture, except that we wait until we have no descendant tasks left.
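
A minimal sketch of that approach, where ScopeCounter and execOwnOrStolenTask are illustrative names (the latter standing in for the help-while-waiting loop that forceFuture already contains):

import std/atomics

type ScopeCounter = object
  pending: Atomic[int]      # live tasks (and their descendants) in the scope

proc taskSpawned(sc: var ScopeCounter) =
  ## Called for every spawn inside the scope, transitively.
  discard sc.pending.fetchAdd(1, moRelaxed)

proc taskCompleted(sc: var ScopeCounter) =
  ## Called whenever such a task finishes.
  discard sc.pending.fetchSub(1, moAcquireRelease)

template syncScope(tp: Taskpool, body: untyped) =
  var scope {.inject.}: ScopeCounter
  body
  # Same loop as forceFuture, but the exit condition is "no descendants
  # left" instead of "this Flowvar is resolved":
  while scope.pending.load(moAcquire) > 0:
    tp.execOwnOrStolenTask()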

Event system / Dataflow parallelism

This allows tasks to be scheduled only when an event is triggered (aka task graphs with precise in/out dependencies).

Implementation is standalone and does not touch the internals of taskpools.
https://github.com/mratsim/weave/blob/71dc2d7/weave/cross_thread_com/flow_events.nim

Design and competing designs have been heavily discussed here: mratsim/weave#92 (comment)
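
A hypothetical usage sketch in the style of Weave's flow events, to illustrate the intended shape; FlowEvent, newFlowEvent, trigger and spawnOnEvent are all illustrative names, not an existing taskpools API:

proc consumer() =
  echo "scheduled only after the event fired"

proc producer(ev: FlowEvent) =
  echo "producing the input..."
  ev.trigger()                     # consumer now becomes schedulable

proc main(tp: Taskpool) =
  let ev = newFlowEvent()
  tp.spawnOnEvent ev, consumer()   # deferred, not enqueued yet
  tp.spawn producer(ev)
  tp.syncAll()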

Dealing with blocked threads

Assume a 4-core CPU like a Raspberry Pi. A user might want to spawn blocking operations, like reading from stdin, on a threadpool because that's a common pattern. A long-running computation can also be considered blocking from the point of view of a user interface.
It is then possible for all cores to be blocked so that no progress is made, even though there is work to do.

Solutions:

  1. Document that users can create dedicated blocking threadpools and a main one for non-IO tasks.
  2. Have a dedicated proc spawnBlocking for IO that can block, and potentially a numBlockingThreads parameter in Taskpool.new(...) for dedicated blocking threads (sketched after the reference below). This requires users to know their workload.
  3. Have detection logic similar to the Go scheduler's to detect long-running or blocked threads and create a new thread.

See Go presentation: https://assets.ctfassets.net/oxjq45e8ilak/48lwQdnyDJr2O64KUsUB5V/5d8343da0119045c4b26eb65a83e786f/100545_516729073_DMITRII_VIUKOV_Go_scheduler_Implementing_language_with_lightweight_concurrency.pdf
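
A usage sketch of option 2; numBlockingThreads and spawnBlocking are hypothetical additions, and fibonacci / processStdin stand in for a CPU-bound and a blocking workload:

var tp = Taskpool.new(numThreads = 4, numBlockingThreads = 2)

# CPU-bound work goes to the regular workers:
let fut = tp.spawn fibonacci(30)

# Known-blocking IO goes to the dedicated blocking workers, so the
# 4 compute threads can never all end up stuck inside a read:
tp.spawnBlocking processStdin()

echo sync(fut)
tp.shutdown()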

Implementation features

Reducing allocations

Currently, spawning causes the following allocations:

  • 1 in Task for storing the function call arguments. This is inherent to std/tasks.
  • 1 for the TaskNode
  • 1 for the Flowvar / inner Channel (needs an UncheckedArray)

Solution:

  • For the upstream problem, we can patch upstream to have an inline buffer (std/tasks "small functions" optimization, nim-lang/RFCs#443).
    The inner fields take 24 bytes; a 40-byte buffer would cover many use cases without allocation.
    For example, a function with 2 openArrays requires at minimum 32 bytes, which leaves 8 bytes for
    a pointer to a result or for a callback. A sketch follows this list.
  • The TaskNode can be made intrusive to the Flowvar.
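
A minimal sketch of the inline-buffer idea, assuming an illustrative SmallTask layout (the real std/tasks.Task differs):

const InlineSize = 40             # the 40-byte buffer suggested above

type
  SmallTask = object
    ## Illustrative layout only; the real std/tasks.Task differs.
    callback: proc (env: pointer) {.nimcall, gcsafe.}
    envSize: int
    onHeap: bool
    buf: array[InlineSize, byte]  # inline storage: the common case
    heapEnv: pointer              # fallback when the arguments don't fit

proc setEnv(t: var SmallTask, args: pointer, size: int) =
  t.envSize = size
  t.onHeap = size > InlineSize
  if t.onHeap:
    t.heapEnv = alloc(size)       # today's behaviour: always allocate
    copyMem(t.heapEnv, args, size)
  else:
    copyMem(addr t.buf[0], args, size)  # no allocation

proc envAddr(t: var SmallTask): pointer =
  if t.onHeap: result = t.heapEnv
  else: result = addr t.buf[0]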

Overhead with the event notifier / putting threads to sleep and waking them up

See #5

Memory reclamation

Currently, when the task deques grow, the old buffers aren't collected. We might want to consider a memory reclamation scheme or a "garbage collector" just for the runtime, for example Epoch-Based Reclamation (sketched below).
Alternatively, if we decide to use a priority queue (a list-based data structure), there is no memory to reclaim anymore.
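
A heavily simplified Epoch-Based Reclamation sketch, assuming a single reclaiming thread (the retire lists below are unsynchronized; a real implementation needs per-thread retire lists or locking):

import std/atomics

const NumWorkers = 4

var
  globalEpoch: Atomic[uint]
  workerEpoch: array[NumWorkers, Atomic[uint]]
  retired: array[3, seq[pointer]]   # retired[e mod 3]: freed 2 epochs later

proc enterEpoch(id: int) =
  ## A worker announces the epoch it operates in before touching
  ## shared deque buffers.
  workerEpoch[id].store(globalEpoch.load(moAcquire), moRelease)

proc retire(p: pointer) =
  ## An outgrown deque buffer can't be freed immediately:
  ## a thief may still be reading from it.
  retired[int(globalEpoch.load(moAcquire) mod 3)].add(p)

proc tryAdvanceEpoch() =
  ## Once every worker has observed epoch e, no one can still hold a
  ## reference to buffers retired in epoch e-2, so those are freed.
  let e = globalEpoch.load(moAcquire)
  for id in 0 ..< NumWorkers:
    if workerEpoch[id].load(moAcquire) != e:
      return
  let slot = int((e + 1) mod 3)     # same slot as (e - 2) mod 3
  for p in retired[slot]: dealloc(p)
  retired[slot].setLen(0)
  globalEpoch.store(e + 1, moRelease)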

Allow awaiting threads to sleep

When awaiting, threads don't sleep, as there is no way to wake up a specific thread when the task it was waiting for is finished:

# 2.2 No task to steal
if tp.eventNotifier.getParked() == tp.numThreads - 1:
  # 2.2.1 all threads besides the current are parked
  debugTermination:
    log("Worker %2d: syncAll 2.2.1 - termination, all other threads sleeping\n", ctx.id)
  foreignThreadsParked = true
else:
  # 2.2.2 We don't park as there is no notif for task completion
  cpuRelax()

Busy loops waste energy.
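
One possible shape is a per-Flowvar notification primitive, sketched below with a condition variable; Completion, signalDone and waitDone are illustrative names, and a futex on the Flowvar's ready flag could avoid the extra state:

import std/locks

type
  Completion = object
    ## Per-Flowvar notification so an awaiting worker can park instead
    ## of spinning in cpuRelax().
    lock: Lock
    cond: Cond
    done: bool

proc init(c: var Completion) =
  initLock(c.lock)
  initCond(c.cond)

proc signalDone(c: var Completion) =
  ## Called by whichever worker completes the awaited task.
  withLock c.lock:
    c.done = true
  c.cond.signal()

proc waitDone(c: var Completion) =
  ## Called by the awaiting worker once it has run out of work to steal.
  withLock c.lock:
    while not c.done:
      c.cond.wait(c.lock)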

About not processing all tasks in our queue when awaiting.

Note: astute readers will remark that, when awaiting, a worker will skip its own unrelated tasks (i.e. non-descendants of the awaited task) and instead directly steal from other workers:

if taskNode.parent != ctx.currentTask:
  debug: log("Worker %2d: sync 1 - skipping non-direct descendant task 0x%.08x (parent 0x%.08x, current 0x%.08x)\n", ctx.id, taskNode, taskNode.parent, ctx.currentTask)
  ctx.schedule(taskNode)
  break

Why?

  1. If we're awaiting a future and have no work to do, that future depends on computation in other workers' queues. (A dependency has to be spawned after our own task, so if it isn't enqueued at the tail of our deque, it's in another worker's queue.)
  2. This decreases latency for the awaited task (at the expense of throughput, since a worker with work and bandwidth stays idle), and once the awaited Flowvar is resolved, the unblocked thread can enqueue work earlier, avoiding starvation for other threads.
  3. The worker might otherwise process a task that is guaranteed not to help the awaited task make progress and that takes an indefinite amount of time to complete.
    The technique is called leapfrogging, see https://cseweb.ucsd.edu/~calder/papers/PPoPP-93.pdf

Note: regarding "Scheduling, load balancing and task priority" above, leapfrogging is "unfair" since user-awaited tasks get priority.

Backoff strategies

Alternatively, parking threads may prove inadequate:

  • sleep/wake is very expensive.
  • Condition variables and futexes don't allow precise control over which thread to wake when multiple are waiting, so we would need multiple CVs or futexes, 1 per thread. And assuming we want the number of threads to vary when some become blocked (see the "Dealing with blocked threads" paragraph), we need a thread-safe way to keep track of all active threads' CVs or futexes, with insertion and removal on thread creation and destruction.

We can consider backoff solutions, see https://github.com/mratsim/weave/blob/71dc2d7/weave/cross_thread_com/event_notifiers_and_backoff.md. This is an active area of research, in particular for WiFi and Bluetooth Low Energy: when a station is backed off it can't receive messages, so there are advanced techniques to ensure a station is awake when a message arrives, reducing latency and retries.

However, does Windows have an equivalent to usleep with microsecond resolution?
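
Regardless of the timer granularity available, a backoff policy could look like the following sketch; the thresholds and delays are illustrative, and cpuRelax is the same spin hint used in the snippets above:

import std/os

proc backoff(attempt: int) =
  ## Spin first, then yield, then sleep with a small growing delay.
  if attempt < 10:
    cpuRelax()                # spin hint, cheap but burns CPU
  elif attempt < 20:
    sleep(0)                  # give up the timeslice
  else:
    sleep(1 shl min(attempt - 20, 3))  # 1, 2, 4, then 8 ms

With millisecond-only sleep, the last tier is coarse; a sub-millisecond timer would allow a smoother ramp between yielding and sleeping.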
