nng creates too many threads #1572
Comments
This is expected behavior. But I agree with you that nng should ideally have a way to set the number of threads at runtime. |
The iocp and expire threads are created one per core to provide good scalability. Having said that, I'm totally amenable to reducing this. Note that these threads should generally be sitting around idle unless you have extreme I/O pressure, in which case you actually probably want to have this many threads (although you wind up having a trade-off of application processing and NNG I/O processing.) |
@JaylinYu those tunables won't really help this -- because the threads involved aren't taskq threads. I'm also pretty opposed to using globals to tune that -- if we need tuning, then local variables (parameters on the taskq) should be set. We need to ensure that at least 2 threads are allocated for any taskq, as it's likely that deadlocks will occur if only one thread is allocated. I recommend 4 as a bare minimum actually. |
I think the IOCP and expire threads should have a similar default limiter as the tasks... I think we max this to 16. Arguably these could be even smaller -- especially the expire thread can safely be reduced all the way down to 1. (It used to be one, but we started creating a lot more of them to improve scalability by reducing contention on locks -- each thread has its own queue, and having one per thread tends to ensure we don't contend on the expiration queue locks, which were one of the larger areas of lock contention.) |
Full disclosure: Since filing this issue I've moved away from
I highly doubt this is true, over subscribing a machine like that is almost never useful and just introduces overhead. |
Idle (non-runnable) threads don't hurt anything at all apart from the stack space they consume. Where things get really ugly is when you have a large number of runnable threads. Then stuff falls apart. There are unfortunately also some limitations in userland thread frameworks that wind up using way too many resources when allocating a thread (thread stacks), which is why you can't create 10,000 threads in user space harmlessly. (On modern kernels like illumos you can, but only in kernel space. 10,000 threads in that situation use about 80MB of RAM for stack space, at 8K per thread.) |
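The stack-space figure above is simple arithmetic, sketched here in C. The helper name `total_stack_bytes` is made up for illustration, and the 1 MiB per-thread figure in the usage note is an assumption (a common default stack reservation on Windows), not something measured in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: total stack reservation, in bytes, for
 * nthreads threads that each reserve stack_bytes of stack. */
static size_t
total_stack_bytes(size_t nthreads, size_t stack_bytes)
{
	/* Refuse inputs whose product would overflow size_t. */
	assert(stack_bytes == 0 || nthreads <= SIZE_MAX / stack_bytes);
	return (nthreads * stack_bytes);
}
```

With 10,000 threads at 8 KiB each this gives 81,920,000 bytes, which matches the "like 80MB" estimate. For the 151 threads reported in this issue, an assumed 1 MiB reservation per thread would come to 158,334,976 bytes, most of which would typically be reserved address space rather than committed memory.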
Btw, the thread count we allocate is based on the CPU core count of the machine. |
Just a few more notes for anyone looking at this: The 64 expiration threads are almost always idle. They represent a timer thread, and the reason we have so many (one timer per CPU) is to reduce pressure on the central lock used for expirations (timers). We found that in some workloads the expiration queue was tragically contended, and this was a simple solution. We can probably look at alternative approaches to use fewer threads, since likely the expirations have some reasonable tolerances with respect to timing out. The IOCP thread list is this way on Windows, as it represents the actual threads doing I/O. Conceivably this could better be done with a Windows thread pool. Note that on other platforms we don't have nearly this many threads, but that comes at a scalability cost as well. The task threads are the ones doing real work, and likely running user callbacks. It might be worth looking for ways for user supplied thread(s) to be able to chew on this instead of fixed allocation threads. This could ultimately be used for some kind of cooperative multitasking, although running user callbacks is always a bit dodgy since we have no idea how long those will take to complete. |
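To make the lock-contention argument concrete, here is a minimal sketch of sharded expiration state, assuming POSIX mutexes. It is not nng's actual code: the `expire_shard` and `expire_set` types, the function names, and the choice of shard by `id % nshards` are all hypothetical. The point is only that each shard has its own lock, so two timers that land on different shards never contend.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SHARDS 64

/* One shard: its own lock guarding its own expiration state. */
typedef struct {
	pthread_mutex_t lock;
	uint64_t        earliest; /* earliest deadline, UINT64_MAX if none */
	size_t          pending;  /* number of scheduled deadlines */
} expire_shard;

typedef struct {
	expire_shard shards[MAX_SHARDS];
	size_t       nshards;
} expire_set;

static void
expire_set_init(expire_set *es, size_t nshards)
{
	assert(nshards >= 1 && nshards <= MAX_SHARDS);
	es->nshards = nshards;
	for (size_t i = 0; i < nshards; i++) {
		pthread_mutex_init(&es->shards[i].lock, NULL);
		es->shards[i].earliest = UINT64_MAX;
		es->shards[i].pending  = 0;
	}
}

/* Record a deadline for the object identified by id and return the
 * shard index used. Only that one shard's lock is taken. */
static size_t
expire_set_schedule(expire_set *es, uint64_t id, uint64_t deadline)
{
	size_t        idx = (size_t) (id % es->nshards);
	expire_shard *s   = &es->shards[idx];

	pthread_mutex_lock(&s->lock);
	s->pending++;
	if (deadline < s->earliest) {
		s->earliest = deadline;
	}
	pthread_mutex_unlock(&s->lock);
	return (idx);
}
```

In this sketch the number of shards is independent of the number of threads, so in principle a few threads could service many shards. Whether that trade-off works for nng's workloads is not something this thread establishes.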
The high number of threads is also a blocker for me. Basically I want to use Request/Reply. On my 6-core machine (Windows 10) this creates 64 threads… Patching the limits in source would also be fine for me. |
Is there a way to close redundant threads without modifying the source code? |
I've been evaluating nng for use in some tooling and this issue is a blocker for us as well. Much like @Jasper-Bekkers we often use machines with high core counts and the number of threads is a problem for debuggability and performance profiling since the threads cause lots of clutter as well as making the library needlessly resource intensive regardless of whether threads do actual work or not. In the case of IOCP especially there's typically no real need to have a huge number of threads as long as you only do I/O related work in the completion handlers. We are currently consuming nng via vcpkg and therefore using build-time defines to reconfigure nng is not really a great option. Ideally there would be an API to allow application-specific tuning. I'm currently not familiar enough with the project to propose what that would look like but might take a stab at it if there is no attractive alternative. |
I made a quick-and-dirty local patch to enable initialization-time tweaking of thread counts. It's still a bit rough around the edges but works well enough for me to evaluate. It adds the following API functions:
void nng_set_ncpu_max(int); // limit the number of CPUs reported
void nng_set_pool_thread_limit_min(int); // configure minimum number of threads in thread pools
void nng_set_pool_thread_limit_max(int); // configure maximum number of threads in thread pools
void nng_set_resolve_thread_max(int); // configure number of resolve threads
patch is here
I wasn't sure what would be a good naming scheme for these as I didn't see any similar function to mimic, feedback is welcomed! (I already know |
So initially I'm going to add compile-time defaults that are sensible. I may add runtime tunables for these as well. |
I'm using 1619 as the EXPIRE_THREADS tunable, and a PR is out... (I'm waiting to merge that as I've uncovered some other worse bugs that I want to ensure are fixed first.) I expect this will merge. I'll revisit the poller for Windows to do something similar for it as well. Longer term we need to look at ways to improve the parallelism in the pollers anyway -- in nanomq they are running into scalability issues beyond 64 cores with the single epoll() thread. |
did a fix ever merge? |
I hope these thread-count setting functions become available as soon as possible. Sincerely, with respect. |
Could you tell me how to find nng threads on Windows? I use Process Explorer but cannot find nng threads named as your table shows. THX. |
I've not merged anything... as the patch was not submitted in the form of a PR, I will need some time to look it over and get it into such form so that it can run in our CI. |
Btw, there are compile time fixes for these that *did* merge. Runtime is a different matter. Probably it will still need to be set up by the application before calling nng_init() or other functions, because the initial threads have to be configured at startup. |
I intend at some point to refactor the expiration subsystem entirely -- which will probably let me greatly reduce the need for lots of threads, by reducing the lock pressure caused by expiration. I am probably not going to make the number of expiration threads a runtime tunable. |
In head, we got rid of the nni_timer thread. It's only one thread, but still an improvement, if only a small one. |
I've decided to apply a sane limit for Windows by default. 64 was simply too big. This limits to 8 by default. Note that POSIX pollers are already single threaded right now, so this isn't necessarily a big deal (although we need to make them parallel too at some point.) It's compile-time tunable as well, but 8 is a lot nicer to work around than 64. It has to be at least 2 right now. (I probably could have allowed just one of these, but I'm paranoid about going too low.) |
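The limiting rule described here (one thread per core, but never fewer than 2 and never more than 8) amounts to a clamp. A minimal sketch follows; the function name `clamp_thread_count` and its parameters are hypothetical and are not taken from the nng source.

```c
#include <assert.h>

/* Hypothetical helper: derive a thread count from the core count,
 * bounded below by min_thr and above by max_thr. */
static int
clamp_thread_count(int ncpu, int min_thr, int max_thr)
{
	assert(min_thr >= 1 && min_thr <= max_thr);
	if (ncpu < min_thr) {
		return (min_thr);
	}
	if (ncpu > max_thr) {
		return (max_thr);
	}
	return (ncpu);
}
```

With the numbers from this comment, a 64-thread machine gets `clamp_thread_count(64, 2, 8)`, which is 8, a single-core machine gets 2, and a 6-core machine keeps 6.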
I think I am going to provide an API for this, that applications can call at start up time. Stay tuned. |
This further limits some of the thread counts, but principally it offers a new runtime facility, nng_init_set_parameter(), which can be used to set certain runtime parameters on the number of threads, provided it is called before the rest of application start up. This facility is quite intentionally "undocumented", at least for now, as we want to limit our commitment to it. Still this should be helpful for applications that need to reduce the number of threads that are created.
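The "must be called before the rest of application start up" constraint can be modeled with a small self-contained sketch. Everything in it is hypothetical: the `sketch_` names, the parameter enum, and the boolean return value are illustrative only and do not reproduce nng's declarations, which should be checked in the installed nng.h since the facility is intentionally undocumented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical parameter identifiers (not nng's). */
typedef enum {
	SKETCH_NUM_TASK_THREADS,
	SKETCH_NUM_EXPIRE_THREADS,
	SKETCH_NUM_POLLER_THREADS,
	SKETCH_PARAM_COUNT
} sketch_param;

static uint64_t sketch_params[SKETCH_PARAM_COUNT]; /* 0 = use default */
static bool     sketch_initialized = false;

/* Store a parameter; returns false if initialization already ran,
 * because the threads have been created by then. */
static bool
sketch_init_set_parameter(sketch_param p, uint64_t value)
{
	assert((int) p < SKETCH_PARAM_COUNT);
	if (sketch_initialized) {
		return (false);
	}
	sketch_params[p] = value;
	return (true);
}

/* Value that initialization would use: the override if one was set,
 * otherwise the built-in default. */
static uint64_t
sketch_effective(sketch_param p, uint64_t dflt)
{
	assert((int) p < SKETCH_PARAM_COUNT);
	return (sketch_params[p] != 0 ? sketch_params[p] : dflt);
}

/* Stand-in for library start-up: after this, parameters are frozen. */
static void
sketch_init(void)
{
	sketch_initialized = true;
}
```

An application following this pattern would set its overrides first, for example a poller thread count of 2, and only then open sockets or otherwise trigger initialization. In this model a later attempt to change the value is rejected rather than silently ignored, which is a design choice of the sketch, not a statement about nng's behavior.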
Describe the bug
I'm running a 32-core/64-thread threadripper machine, and it looks like in total nng creates 151 threads.
While this could be the wanted case for some apps, for the application that I'm writing it's all around way too much, and clutters up the profiling tools quite a bit.
Expected behavior
Ideally, thread creation is left to the application instead of being done by the nng framework. That way I can control scheduling of work and fit it into my async execution runtime, instead of having nng take up resources and (arguably more important) screen real estate in the profiler.
In our application we're trying to severely limit our number of threads and schedule work on our own.
Actual Behavior
See above
To Reproduce
Simply run nng on a high core-count machine.
Environment Details
nng-rs crate, I think it's a bit behind on latest nng
Additional context
I found an additional issue on limiting the number of threads being spun up in #769; this seems to require compile-time toggles? Ideally this is something that can be set at runtime (we have a few different use-cases all sharing the same codebase - one case where we run the service inproc and want a low thread count, one where we run the service as a daemon and we don't really care about thread count).