
Roadmap, Design and implementation choices #9

mratsim commented Jan 16, 2022

Here is an overview of items and features considered for future improvement, with sketches of solutions:

User-facing features

Receiving tasks from any thread

The task queues are SPMC (single-producer, multi-consumer) queues, hence threads other than the owning worker can't enqueue new jobs.

See #6

Solutions:

  1. A global MPMC queue (which risks recreating contention)
  2. Modify the Chase-Lev deque to allow pushing from other threads (front or back), or change the data structure
  3. Give each worker an additional MPSC queue to receive tasks (which risks added latency, as those tasks are unstealable). A sketch follows this list.
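
A minimal sketch of option 3, using a locked std/deques as a stand-in for a real lock-free MPSC queue; Inbox, push and tryPop are illustrative names, not part of the current API:

import std/[deques, locks]

type
  Task = proc () {.gcsafe.}       # stand-in for the runtime's task type

  Inbox = object
    ## Locked stand-in for a lock-free MPSC queue, one per worker.
    lock: Lock
    queue: Deque[Task]

proc init(ib: var Inbox) =
  initLock(ib.lock)

proc push(ib: var Inbox, task: Task) =
  ## Called from any thread, including non-worker threads.
  withLock ib.lock:
    ib.queue.addLast(task)

proc tryPop(ib: var Inbox, task: var Task): bool =
  ## Called only by the owning worker, which then moves the task into
  ## its own Chase-Lev deque (making it stealable) or runs it directly.
  withLock ib.lock:
    if ib.queue.len == 0:
      return false
    task = ib.queue.popFirst()
    result = true

The key property is that external submission never touches the Chase-Lev deque itself, so the owner's push/pop fast path is unchanged.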

Scheduling, Load balancing and task priority

Users might want access to task priorities.
In particular, for Proof-of-Stake blockchains, validation and block proposal messages have higher priority than anything else.

The Chase-Lev deques, which are LIFO, could be replaced by multiqueues, which are relaxed FIFO priority queues (a sketch follows the notes below).

Note: supporting priorities has a heavy impact on the runtime:

  • LIFO scheduling has better throughput, while FIFO offers more fairness and lower latency.
    A computational job often spawns many tasks; LIFO maximizes the chance that their data stays hot in cache while being processed.
    Also, with a deque the owner competes with potential thieves only for the last task, whereas with a FIFO queue everyone competes at the same end.
  • A priority queue is probably incompatible with batch stealing ("stealHalf"), which significantly reduces overhead.

See also: mratsim/weave#22
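
Below is a minimal multiqueue sketch, assuming a plain lock per heap rather than the try-lock fast path a production version would use; all names are illustrative:

import std/[heapqueue, locks, random]

type
  PrioTask = object
    priority: int               # lower value = more urgent
    run: proc () {.gcsafe.}

  MultiQueue = object
    ## Several sequential binary heaps, each behind its own lock.
    heaps: seq[HeapQueue[PrioTask]]
    locks: seq[Lock]

proc `<`(a, b: PrioTask): bool = a.priority < b.priority

proc initMultiQueue(n: int): MultiQueue =
  result.heaps = newSeq[HeapQueue[PrioTask]](n)
  result.locks = newSeq[Lock](n)
  for i in 0 ..< n:
    initLock(result.locks[i])

proc push(mq: var MultiQueue, t: PrioTask) =
  # Insert into a random heap: concurrent producers rarely collide.
  let i = rand(mq.heaps.len - 1)
  withLock mq.locks[i]:
    mq.heaps[i].push(t)

proc tryPop(mq: var MultiQueue, t: var PrioTask): bool =
  ## "Power of two choices": inspect two random heaps and pop from the
  ## one with the more urgent minimum. The result is only approximately
  ## the global minimum; that relaxation is what makes it scalable.
  var i = rand(mq.heaps.len - 1)
  var j = rand(mq.heaps.len - 1)
  if i > j: swap(i, j)          # consistent lock order avoids deadlock
  acquire(mq.locks[i])
  if i != j: acquire(mq.locks[j])
  try:
    let iOk = mq.heaps[i].len > 0
    let jOk = mq.heaps[j].len > 0
    if iOk and (not jOk or mq.heaps[i][0] < mq.heaps[j][0]):
      t = mq.heaps[i].pop()
      result = true
    elif jOk:
      t = mq.heaps[j].pop()
      result = true
  finally:
    if i != j: release(mq.locks[j])
    release(mq.locks[i])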

Structured Parallelism

See spec nim-lang/RFCs#347 (comment)

Structured parallelism [OPTIONAL]

Structured parallelism ensures that all tasks and their descendants created within a scope are completed before exiting that scope.

template syncScope(ex: MyExecutor, scope: untyped): untyped
template syncScope(scope: untyped): untyped

References:

  • X10 Parallel Programming Language (2004), the Async-Finish construct
  • Habanero multithreading runtime for Java and Scala, the Async-Finish construct
  • Habanero C, the Async-Finish construct

More recently, concurrency frameworks have also converged on similar "structured concurrency" (https://en.wikipedia.org/wiki/Structured_concurrency).

This would allow the following:

tp.syncScope:
  tp.spawn fn0()
  tp.spawn fn1()

let a = ...

and ensures that all tasks are finished before exiting the scope, without waiting for tasks outside this scope (as syncAll would). Also, any thread can call syncScope, unlike syncAll. It would also avoid having to wrap procs coming from dependencies so that they return a dummy bool just so we can await them.

Implementation is a straightforward Atomic[int] plus the same code as forceFuture, except that we wait until we have no descendant tasks left.
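
A minimal sketch of that approach, where ScopeCounter and execOwnOrStolenTask are illustrative names (the latter standing in for the help-while-waiting loop that forceFuture already contains):

import std/atomics

type ScopeCounter = object
  pending: Atomic[int]      # live tasks (and their descendants) in the scope

proc taskSpawned(sc: var ScopeCounter) =
  ## Called for every spawn inside the scope, transitively.
  discard sc.pending.fetchAdd(1, moRelaxed)

proc taskCompleted(sc: var ScopeCounter) =
  ## Called whenever such a task finishes.
  discard sc.pending.fetchSub(1, moAcquireRelease)

template syncScope(tp: Taskpool, body: untyped) =
  var scope {.inject.}: ScopeCounter
  body
  # Same loop as forceFuture, but the exit condition is "no descendants
  # left" instead of "this Flowvar is resolved":
  while scope.pending.load(moAcquire) > 0:
    tp.execOwnOrStolenTask()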

Event system / Dataflow parallelism

This allows tasks to be scheduled only when an event is triggered (aka task graphs with precise in/out dependencies).

Implementation is standalone and does not touch the internals of taskpools.
https://github.com/mratsim/weave/blob/71dc2d7/weave/cross_thread_com/flow_events.nim

Design and competing designs have been heavily discussed here: mratsim/weave#92 (comment)
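
A hypothetical usage sketch in the style of Weave's flow events, to illustrate the intended shape; FlowEvent, newFlowEvent, trigger and spawnOnEvent are all illustrative names, not an existing taskpools API:

proc consumer() =
  echo "scheduled only after the event fired"

proc producer(ev: FlowEvent) =
  echo "producing the input..."
  ev.trigger()                     # consumer now becomes schedulable

proc main(tp: Taskpool) =
  let ev = newFlowEvent()
  tp.spawnOnEvent ev, consumer()   # deferred, not enqueued yet
  tp.spawn producer(ev)
  tp.syncAll()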

Dealing with blocked threads

Assume a 4-core CPU like a Raspberry Pi. A user might want to spawn blocking operations, like reading from stdin, on a threadpool because that's a common pattern. A long-running computation can also be considered blocking from the point of view of a user interface.
It is then possible for all cores to be blocked so that no progress is made, even though there is work to do.

Solutions:

  1. Document that users can create dedicated blocking threadpools and a main one for non-IO tasks.
  2. Have a dedicated proc spawnBlocking for IO that can block, and potentially a numBlockingThreads parameter in Taskpool.new(...) for dedicated blocking threads (sketched after the reference below). This requires users to know their workload.
  3. Have detection logic similar to the Go scheduler's to detect long-running or blocked threads and create a new thread.

See Go presentation: https://assets.ctfassets.net/oxjq45e8ilak/48lwQdnyDJr2O64KUsUB5V/5d8343da0119045c4b26eb65a83e786f/100545_516729073_DMITRII_VIUKOV_Go_scheduler_Implementing_language_with_lightweight_concurrency.pdf
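
A usage sketch of option 2; numBlockingThreads and spawnBlocking are hypothetical additions, and fibonacci / processStdin stand in for a CPU-bound and a blocking workload:

var tp = Taskpool.new(numThreads = 4, numBlockingThreads = 2)

# CPU-bound work goes to the regular workers:
let fut = tp.spawn fibonacci(30)

# Known-blocking IO goes to the dedicated blocking workers, so the
# 4 compute threads can never all end up stuck inside a read:
tp.spawnBlocking processStdin()

echo sync(fut)
tp.shutdown()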

Implementation features

Reducing allocations

Currently, spawning causes the following allocations:

  • 1 in Task for storing the function call arguments. This is inherent to std/tasks.
  • 1 for the TaskNode
  • 1 for the Flowvar / inner Channel (needs an UncheckedArray)

Solution:

  • For the upstream problem, we can patch upstream to have an inline buffer (std/tasks "small functions" optimization, nim-lang/RFCs#443).
    The inner fields take 24 bytes; a 40-byte buffer would cover many use cases without allocation.
    For example, a function with 2 openArrays requires at minimum 32 bytes, which leaves 8 bytes for
    a pointer to a result or for a callback. A sketch follows this list.
  • The TaskNode can be made intrusive to the Flowvar.
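
A minimal sketch of the inline-buffer idea, assuming an illustrative SmallTask layout (the real std/tasks.Task differs):

const InlineSize = 40             # the 40-byte buffer suggested above

type
  SmallTask = object
    ## Illustrative layout only; the real std/tasks.Task differs.
    callback: proc (env: pointer) {.nimcall, gcsafe.}
    envSize: int
    onHeap: bool
    buf: array[InlineSize, byte]  # inline storage: the common case
    heapEnv: pointer              # fallback when the arguments don't fit

proc setEnv(t: var SmallTask, args: pointer, size: int) =
  t.envSize = size
  t.onHeap = size > InlineSize
  if t.onHeap:
    t.heapEnv = alloc(size)       # today's behaviour: always allocate
    copyMem(t.heapEnv, args, size)
  else:
    copyMem(addr t.buf[0], args, size)  # no allocation

proc envAddr(t: var SmallTask): pointer =
  if t.onHeap: result = t.heapEnv
  else: result = addr t.buf[0]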

Overhead with the event notifier / putting threads to sleep and waking them up

See #5

Memory reclamation

Currently, when the task deques grow, the old buffers aren't collected. We might want to consider a memory reclamation scheme or a "garbage collector" just for the runtime, for example Epoch-Based Reclamation (sketched below).
Alternatively, if we decide to use a priority queue (a list-based data structure), there is no memory to reclaim anymore.
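
A heavily simplified Epoch-Based Reclamation sketch, assuming a single reclaiming thread (the retire lists below are unsynchronized; a real implementation needs per-thread retire lists or locking):

import std/atomics

const NumWorkers = 4

var
  globalEpoch: Atomic[uint]
  workerEpoch: array[NumWorkers, Atomic[uint]]
  retired: array[3, seq[pointer]]   # retired[e mod 3]: freed 2 epochs later

proc enterEpoch(id: int) =
  ## A worker announces the epoch it operates in before touching
  ## shared deque buffers.
  workerEpoch[id].store(globalEpoch.load(moAcquire), moRelease)

proc retire(p: pointer) =
  ## An outgrown deque buffer can't be freed immediately:
  ## a thief may still be reading from it.
  retired[int(globalEpoch.load(moAcquire) mod 3)].add(p)

proc tryAdvanceEpoch() =
  ## Once every worker has observed epoch e, no one can still hold a
  ## reference to buffers retired in epoch e-2, so those are freed.
  let e = globalEpoch.load(moAcquire)
  for id in 0 ..< NumWorkers:
    if workerEpoch[id].load(moAcquire) != e:
      return
  let slot = int((e + 1) mod 3)     # same slot as (e - 2) mod 3
  for p in retired[slot]: dealloc(p)
  retired[slot].setLen(0)
  globalEpoch.store(e + 1, moRelease)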

Allow awaiting threads to sleep

When awaiting, threads don't sleep, as there is no way to wake up a specific thread when the task it was waiting for is finished:

# 2.2 No task to steal
if tp.eventNotifier.getParked() == tp.numThreads - 1:
  # 2.2.1 all threads besides the current are parked
  debugTermination:
    log("Worker %2d: syncAll 2.2.1 - termination, all other threads sleeping\n", ctx.id)
  foreignThreadsParked = true
else:
  # 2.2.2 We don't park as there is no notif for task completion
  cpuRelax()

Busy loops waste energy.
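
One possible shape is a per-Flowvar notification primitive, sketched below with a condition variable; Completion, signalDone and waitDone are illustrative names, and a futex on the Flowvar's ready flag could avoid the extra state:

import std/locks

type
  Completion = object
    ## Per-Flowvar notification so an awaiting worker can park instead
    ## of spinning in cpuRelax().
    lock: Lock
    cond: Cond
    done: bool

proc init(c: var Completion) =
  initLock(c.lock)
  initCond(c.cond)

proc signalDone(c: var Completion) =
  ## Called by whichever worker completes the awaited task.
  withLock c.lock:
    c.done = true
  c.cond.signal()

proc waitDone(c: var Completion) =
  ## Called by the awaiting worker once it has run out of work to steal.
  withLock c.lock:
    while not c.done:
      c.cond.wait(c.lock)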

About not processing all tasks in our queue when awaiting.

Note: astute readers will remark that, when awaiting, a worker will skip its own unrelated tasks (i.e. non-descendants of the awaited task) and instead directly steal from other workers:

if taskNode.parent != ctx.currentTask:
  debug: log("Worker %2d: sync 1 - skipping non-direct descendant task 0x%.08x (parent 0x%.08x, current 0x%.08x)\n", ctx.id, taskNode, taskNode.parent, ctx.currentTask)
  ctx.schedule(taskNode)
  break

Why?

  1. If we're awaiting a future and have no work to do, that future depends on computation in other workers' queues. (A dependency has to be spawned after our own task, so if it isn't enqueued at the tail of our deque, it's in another worker's queue.)
  2. This decreases latency for the awaited task (at the expense of throughput, since a worker with work and bandwidth stays idle), and once the awaited Flowvar is resolved, the unblocked thread can enqueue work earlier, avoiding starvation for other threads.
  3. The worker might otherwise process a task that is guaranteed not to help the awaited task make progress and that takes an indefinite amount of time to complete.
    The technique is called leapfrogging, see https://cseweb.ucsd.edu/~calder/papers/PPoPP-93.pdf

Note: regarding "Scheduling, load balancing and task priority" above, leapfrogging is "unfair" since user-awaited tasks get priority.

Backoff strategies

Alternatively, parking threads may prove inadequate:

  • sleep/wake is very expensive.
  • Condition variables and futexes don't allow precise control over which thread to wake when multiple are waiting, so we would need multiple CVs or futexes, 1 per thread. And assuming we want the number of threads to vary when some become blocked (see the "Dealing with blocked threads" paragraph), we need a thread-safe way to keep track of all active threads' CVs or futexes, with insertion and removal on thread creation and destruction.

We can consider backoff solutions, see https://github.com/mratsim/weave/blob/71dc2d7/weave/cross_thread_com/event_notifiers_and_backoff.md. This is an active area of research, in particular for WiFi and Bluetooth Low Energy: when a station is backed off it can't receive messages, so there are advanced techniques to ensure a station is awake when a message arrives, reducing latency and retries.

However, does Windows have an equivalent to usleep with microsecond resolution?
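
Regardless of the timer granularity available, a backoff policy could look like the following sketch; the thresholds and delays are illustrative, and cpuRelax is the same spin hint used in the snippets above:

import std/os

proc backoff(attempt: int) =
  ## Spin first, then yield, then sleep with a small growing delay.
  if attempt < 10:
    cpuRelax()                # spin hint, cheap but burns CPU
  elif attempt < 20:
    sleep(0)                  # give up the timeslice
  else:
    sleep(1 shl min(attempt - 20, 3))  # 1, 2, 4, then 8 ms

With millisecond-only sleep, the last tier is coarse; a sub-millisecond timer would allow a smoother ramp between yielding and sleeping.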
