
Offload long-running tasks to other OS threads #117

Open
Bulat-Ziganshin opened this issue Jun 15, 2022 · 8 comments

Comments

@Bulat-Ziganshin
Contributor

Bulat-Ziganshin commented Jun 15, 2022

Here I collect the ideas from our discussion on the subject. There are three things we should implement:

  • thread pool for CPU-heavy tasks
  • thread pool for I/O tasks (in particular, synchronous calls to OS)
  • how to integrate them into the Chronos async eventloop

Resources:

Communication

Since Chronos and the task-executing threads run in different OS threads, we need communication channels that work across thread boundaries.

We have several candidates:

Depending on compilation options, we may be able to pass only non-GC'd objects between threads; possible solutions are:

  • Isolated[T]
  • serialization
  • GC_unref in the sender thread + GC_ref in the receiver thread
  • allocShared, i.e. manual memory management
  • read data into raw buffers and deserialize them in receiver thread

To identify tasks, we can send a randomly generated id with each task and return that id together with the task's answer.
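The id-correlation idea can be sketched as follows. This is an illustrative Python sketch (not Chronos/Nim code); the names `pending`, `submit`, and `on_result` are hypothetical:

```python
import uuid

# Sketch: correlate task submissions with results via random ids.
pending = {}  # id -> the waiting task/caller (in practice, a Future)

def submit(task):
    """Assign a random id, remember the caller, return the id to send along."""
    task_id = uuid.uuid4().hex
    pending[task_id] = task
    return task_id

def on_result(task_id, value):
    """Receiver side: match the returned id back to the waiting task."""
    task = pending.pop(task_id)
    return task, value

tid = submit("hash-block")
task, value = on_result(tid, "ok")
```

Because the id travels with both the task and its answer, results can arrive out of order without ambiguity.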

Thread pool for CPU-heavy tasks

It can be implemented by nim-taskpools, Weave or internally:

proc cpuThread(pool: CPUThreadPool) =
  while pool.notFinished:
    let (id, task) = pool.taskQueue.recv()
    let res = task()
    pool.resultsQueue.send((id, res))
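The same worker loop, sketched here as runnable Python for illustration (the queue names mirror the pseudocode above and are not from any real library):

```python
import threading, queue

# Workers pull (id, task) pairs from a task queue and push (id, result)
# onto a results queue. A sentinel object replaces pool.notFinished.
task_queue = queue.Queue()
results_queue = queue.Queue()
STOP = object()

def cpu_thread():
    while True:
        item = task_queue.get()
        if item is STOP:
            break
        task_id, task = item
        results_queue.put((task_id, task()))

workers = [threading.Thread(target=cpu_thread) for _ in range(2)]
for w in workers:
    w.start()

task_queue.put((1, lambda: sum(range(100))))
task_queue.put((2, lambda: 6 * 7))
got = dict(results_queue.get() for _ in range(2))
for _ in workers:
    task_queue.put(STOP)
for w in workers:
    w.join()
```

For CPU-heavy work the pool size would normally be fixed near the core count, since extra threads only add contention.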

Thread pool for I/O tasks

It can be implemented internally. The main difference compared to the CPU-heavy thread pool is that we can run many more threads and create them on demand:

proc ioThread(pool: IOThreadPool) =
  while pool.notFinished:
    let (id, task) = pool.taskQueue.recv()
    pool.freeThreads -= 1
    # no idle threads left: start another one, up to a cap of 100
    if pool.freeThreads == 0 and pool.runningThreads < 100:
      pool.runningThreads += 1
      startThread(ioThread, pool)
    let res = task()
    pool.freeThreads += 1
    pool.resultsQueue.send((id, res))
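The grow-on-demand idea can be made concrete with a small Python sketch (illustrative only; the counters and cap mirror the pseudocode above and are not a production pool):

```python
import threading, queue

# If a task arrives while no worker is idle, spawn another worker thread,
# capped at MAX_THREADS. Counters are guarded by a lock.
MAX_THREADS = 100
task_queue = queue.Queue()
results_queue = queue.Queue()
lock = threading.Lock()
free_threads = 0
running_threads = 0
STOP = object()

def io_thread():
    global free_threads, running_threads
    while True:
        with lock:
            free_threads += 1
        item = task_queue.get()
        with lock:
            free_threads -= 1
            # no idle workers left: start another one, up to the cap
            if free_threads == 0 and running_threads < MAX_THREADS and item is not STOP:
                running_threads += 1
                threading.Thread(target=io_thread, daemon=True).start()
        if item is STOP:
            break
        task_id, task = item
        results_queue.put((task_id, task()))

with lock:
    running_threads = 1
threading.Thread(target=io_thread, daemon=True).start()
task_queue.put((1, lambda: "done"))
result = results_queue.get()
task_queue.put(STOP)
```

This works for I/O threads because they spend most of their time blocked in the OS, so oversubscribing the CPU is harmless.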

Integrate them into the Chronos async eventloop

This code:

await ioOperation()
step = 1

is translated by async macro into closure iterator with code:

let fut = ioOperation()
yield fut
step = 1

that's compiled by Nim into sort of:

let fut = ioOperation()
fut.then:
  step = 1
return fut

So, when ioOperation finishes, the future completes and runs its continuation, which holds all the code after the await ioOperation() line.
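The transformation described above can be demonstrated with Python's `concurrent.futures.Future` standing in for a Chronos future (an illustrative sketch, not the actual macro output):

```python
from concurrent.futures import Future

# The code after `await` becomes a continuation attached to the future;
# completing the future runs the continuation.
steps = []

def io_operation() -> Future:
    return Future()  # completed later by the "I/O" side

fut = io_operation()

def continuation(f: Future):
    # this is the `step = 1` part: everything after the await line
    steps.append(("step", f.result()))

fut.add_done_callback(continuation)
steps.append("before-complete")
fut.set_result(42)  # completing the future runs the continuation
```

Note that the caller regains control immediately after attaching the continuation, which is exactly what makes the event loop non-blocking.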

WIP: code sample

var ctx                 # WIP: context holding the spawned futures
var table: Table[ThreadId, Future[T]]

proc threadPoll() =
  for fut in ctx.futures:
    let flow = spawn fut.task
    if flow.isReady:
      let val = flow.tryRecv()
      let future = val.fut
      future.complete(val.value)

Candidates for the channel primitive:

  • AsyncQueue from Chronos
  • channels from taskpool
@emizzle
Contributor

emizzle commented Jun 16, 2022

We implemented some of this in https://github.com/status-im/nim-task-runner, including a thread pool. Check out the tests in https://github.com/status-im/nim-task-runner/tree/test/use-cases/test/use_cases.

@mratsim

mratsim commented Jun 17, 2022

> allocShared, i.e. manual memory management

Don't use allocShared: with --threads:on it takes a lock that is quite expensive. Using import system/ansi_c + c_malloc is better. It also avoids issues with reclaiming memory and is easier to instrument in valgrind/sanitizers.

@arnetheduck

arnetheduck commented Jun 17, 2022

These discussions tend to end up in fairly extensive and complex solutions that take a lot of time, are high on potential for issues and end up failing because of this.

There's a trivial solution that already works, which is basically to use a socket (or pipe) with chronos - this works without any changes to anything, without new libraries or compiler options and has numerous advantages: guarantees memory isolation between the threads, is "fast enough" for any reasonable long-term workloads etc - down the line, if ever a fancier solution is needed, it can easily be replaced as well - either by increasing isolation (separate processes instead of threads) or decreasing it (shared memory solutions).
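The socket/pipe approach can be sketched in a few lines of Python (illustrative only, using asyncio-free primitives rather than Chronos): a worker thread serializes its result and writes it over one end of a socketpair, and the event-loop side reads bytes from the other end.

```python
import socket, threading, pickle

# Only bytes cross the thread boundary, so memory stays isolated:
# the worker serializes, the loop side deserializes its own copy.
loop_side, worker_side = socket.socketpair()

def worker():
    result = sum(range(1000))  # stand-in for a long-running task
    worker_side.sendall(pickle.dumps(result))

t = threading.Thread(target=worker)
t.start()
t.join()
received = pickle.loads(loop_side.recv(4096))
```

In Chronos, the loop side would wrap its end of the pair in an async transport, so reads integrate with the event loop for free.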

@Bulat-Ziganshin
Contributor Author

Bulat-Ziganshin commented Jun 17, 2022

Thanks everyone for the help, it's invaluable!

@arnetheduck Is this a correct example of the solution you are proposing: start.nim ?

I have never programmed with pipes or sockets, but quick googling suggests that both are 1:1 connections. And we have one Chronos thread and N threads executing tasks. We can return answers through N separate pipes, but submitting tasks is necessarily an SPMC operation.

We can use a single thread dedicated to communication with Chronos, and then classic MPMC channels to connect this thread with all the worker threads. I have done that (MPMC channels for a tasking threadpool) in C++ and believe it can be ported to Nim.

Another solution proposed by dryajov is to use classic MPMC channels directly in Chronos loop with code like that:

while true:
  let (ok, task) = channel.tryRecv()
  if ok:
    task()
    await sleepAsync(0.milliseconds)   # yield to the event loop
  else:
    await sleepAsync(1.milliseconds)

I.e., don't try to integrate an AsyncChannel into the Chronos event loop, but instead poll a standard thread-based channel.
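The polling pattern translates directly to any async runtime. Here is an illustrative sketch using Python's asyncio in place of Chronos (a `None` sentinel is a hypothetical addition to make the loop terminate):

```python
import asyncio, queue, threading

# The async loop polls a plain thread-safe queue and sleeps briefly when
# it is empty, so no async-aware channel is needed.
tasks = queue.Queue()
ran = []

async def poll_loop():
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            await asyncio.sleep(0.001)  # ~ sleepAsync(1.milliseconds)
            continue
        if task is None:  # sentinel to stop the loop
            return
        task()
        await asyncio.sleep(0)  # yield to the event loop between tasks

def producer():
    tasks.put(lambda: ran.append("a"))
    tasks.put(lambda: ran.append("b"))
    tasks.put(None)

threading.Thread(target=producer).start()
asyncio.run(poll_loop())
```

The trade-off is latency versus CPU: the 1 ms sleep bounds idle wakeups but adds up to 1 ms of delay before an enqueued task is noticed.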

So, my question is: which of these approaches are wrong? This isn't the place for a long discussion, so I'm asking everyone which solution(s) they consider reliable and which ones they don't trust.

I would also be glad if you could provide any estimates of the speed of each solution, in messages/second (each message is just a pointer).

The solutions we have found so far are:

  1. AsyncChannel from nim-task-runner
  2. named pipe or socket + dispatcher thread with two MPMC thread-based channels
  3. polling with backpressure in Chronos loop + two MPMC thread-based channels

New solutions are also highly appreciated. We are looking for solutions that can handle up to 10K - 100K messages per second and minimize our coding effort.

@zah
Contributor

zah commented Jun 17, 2022

> These discussions tend to end up in fairly extensive and complex solutions that take a lot of time, are high on potential for issues and end up failing because of this.
>
> There's a trivial solution that already works, which is basically to use a socket (or pipe) with chronos - this works without any changes to anything, without new libraries or compiler options and has numerous advantages: guarantees memory isolation between the threads, is "fast enough" for any reasonable long-term workloads etc - down the line, if ever a fancier solution is needed, it can easily be replaced as well - either by increasing isolation (separate processes instead of threads) or decreasing it (shared memory solutions).

I find this fear of tackling the AsyncChannel problem puzzling. The expertise in this organisation is surely enough to develop a correct async channel implementation over time. Once we do, it would benefit the entire Nim community (async channels are one of the last missing pieces that would turn Chronos into a truly general-purpose library).

The "just use a socket" advice is somewhat problematic because the socket will be exposed to arbitrary software running on the same machine (potentially by less privileged users), which has various security implications. Using unix domain sockets is better, as these can be secured by file system rights, but you still might have a problem with multiple clients trying to connect to the same socket (should the software run an accept loop or not?).

An in-process AsyncChannel could be a much simpler facility that is cheap to create and doesn't bring any of these problems into the picture.
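The filesystem-rights point can be illustrated with a short Python sketch (the socket path is hypothetical): bind a unix domain socket, then restrict its permissions so only the owning user can connect.

```python
import os, socket, stat, tempfile

# A unix domain socket restricted to owner-only access: other local
# users cannot connect, unlike a TCP socket on localhost.
path = os.path.join(tempfile.mkdtemp(), "ctrl.sock")
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(path)
os.chmod(path, 0o600)  # owner read/write only
server.listen(1)
mode = stat.S_IMODE(os.stat(path).st_mode)
server.close()
os.remove(path)
```

Note there is a brief window between bind and chmod; setting a restrictive umask before binding, or binding inside a 0700 directory as above, closes it.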

@arnetheduck

> I find this fear of tackling the AsyncChannel problem puzzling.

It's not so much fear as a rational approach to product development. In this particular case, it's a comment on the fact that this problem has derailed several projects already, projects that were at a stage in their development where it would have been more valuable to use a simpler existing approach to prove the concept, and only later, if deemed necessary, develop more complex solutions.

If we expand on the causes, Nim in general has a very poor track record with threading: it is off by default, meaning there is a high risk of running into core/GC issues that take a significant amount of time to resolve at the expense of not shipping useful features. AsyncChannel is a means to an end, not a useful feature on its own, ergo it comes low on the priority list: it's not until you run into actual issues with existing, simpler approaches that you'd reach for writing frameworks. Product first, framework later also has the advantage that what you're building in the framework has a realistic use case, which in general makes for better frameworks in shorter time.

Similarly, the threading primitives in the language are not well developed and are known to be buggy: pretty much all primitives in the std lib are broken one way or another and can't practically be used (ie flowvars, spawn etc). This again speaks to the fact that a simple socket solution will be a time-saver overall, until there's an actual need. Codex in general has a long way to go before there's an actual need (ie it has 0 users as of now).

The 10k - 100k case is uninteresting if the project doesn't get to the point where it's able to handle 1 message, and we've been in this situation before; this issue description looks similar. My advice really is to focus development on other things and reuse simpler options, even if it means an extra copy here or there. Copying data in memory is incredibly fast, and it's extremely rare for well-written programs to have that as the bottleneck; such programs are usually doing something else wrong already, and a fancy threadpool / task scheduler might be able to shove the problem under the carpet for a while, but it won't address the more systemic issues.

> And we have one Chronos thread and N threads doing tasks. We can return answers with N separate pipes, but emplacing tasks is necessarily SPMC operation.

You can have the task scheduler sit on the S side and hand out tasks to workers that are free. This is of course slightly unusual as far as traditional SPMC solutions go, where workers pluck tasks off a shared queue, but it's adequate and would allow focusing on other things for the next 6 months :)

@mratsim

mratsim commented Jun 20, 2022

I don't think there is a need for a generic solution, and certainly not before Nim arc/orc works:

  • Nim's default GC is unsuitable for multithreading due to thread-local heaps. This is not a problem for compute tasks, because those usually own their data; async tasks, however, rely strongly on reference semantics.
  • A simple solution is to use createThread and have it run an event loop, waiting on an async lock. That lock protects something (a channel / memory location) used to pass tasks in, and there is another for passing results out.

This "producer-consumer" pattern is basically shared-memory services and the OS handles the scheduling.
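This producer-consumer setup can be sketched with a condition variable in Python (an illustrative stand-in for the async lock plus protected channel described above; all names are hypothetical):

```python
import threading

# A dedicated service thread waits on a condition variable guarding a
# shared inbox; results are published to an outbox under the same lock.
# The OS scheduler wakes the worker when work arrives.
inbox, outbox = [], []
cond = threading.Condition()

def service():
    while True:
        with cond:
            while not inbox:
                cond.wait()          # sleep until a task is posted
            task = inbox.pop(0)
            if task is None:         # sentinel: shut down
                return
            outbox.append(task())    # run the task, publish the result
            cond.notify_all()

t = threading.Thread(target=service)
t.start()
with cond:
    inbox.append(lambda: 2 + 2)      # post a task
    cond.notify_all()
with cond:
    while not outbox:
        cond.wait()                  # wait for the result
    result = outbox[0]
with cond:
    inbox.append(None)               # ask the service to stop
    cond.notify_all()
t.join()
```

In the Chronos version the waiting side would be the async event loop, so the "wait" must be an awaitable primitive rather than a blocking one.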

See examples here:

@dryajov
Contributor

dryajov commented Jun 21, 2022

I generally agree with the sentiment that this should be kept maximally simple. Both the pipe/socket approach as well as what @mratsim outlines above have been discussed to some extent. We should experiment with both, but I suspect that pipes/sockets will perform acceptably overall and might be less of a hassle to deal with.
