TBB parallelism #3595
Hi! Even for single operations it is common to see diminishing returns beyond a certain thread count. A typical reason is that operations are often partially or fully bound by memory bandwidth, so that limit may already be reached at 10-20 threads on a system with 50 or more threads. The details depend on several factors, such as the number of memory slots in use, whether it is a dual-socket system, and so on. I think Scipp should actually be "nicer" here and avoid using more threads than it needs, see #3565.

Your application may benefit from parallelism at a different level. You could try a solution such as Dask to perform trivially parallel computation steps. Note that all Scipp functions that call into the lower-level C++ operations release the GIL, so using Dask's "threaded" scheduler with Scipp can yield speedups. Since you mention trying to parallelize the loop over dims in curve_fit […]

Cheers,
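As a minimal sketch of what the Dask suggestion could look like (the toy DataArray, the `process_slice` helper, and the reduction it performs are illustrative assumptions, not code from this thread): each slice is wrapped in a delayed task and the threaded scheduler runs them concurrently, which works because the Scipp C++ kernels release the GIL.

```python
import dask
import scipp as sc

# Toy data standing in for a large DataArray.
da = sc.DataArray(sc.ones(dims=['x', 'y'], shape=[100, 10_000]))


@dask.delayed
def process_slice(data_slice):
    # Scipp calls that dispatch to C++ release the GIL, so threads can
    # overlap here even within a single process.
    return sc.sum(data_slice * data_slice, 'y')


tasks = [process_slice(da['x', i]) for i in range(da.sizes['x'])]
# The 'threads' scheduler keeps everything in one process, which is fine
# precisely because the GIL is released during the Scipp work.
results = dask.compute(*tasks, scheduler='threads')
```

If each task spends most of its time in pure-Python code instead of Scipp/C++ calls, a process-based scheduler would be the safer choice, since the GIL is then not released.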
Hello,
I am running non-linear regression over a very large DataArray and am confused about the implementation of TBB in scipp. I have run identical code on my personal laptop (10 threads) and on a single cluster node (112 threads), and the runtime is very similar (the small difference is likely due to CPU frequency, not thread count). In both cases, libtbbmalloc_proxy is activated, and all threads are running at full capacity. On the cluster node, I have even activated transparent HugePages, and I observe no noticeable difference.

This leads to the question: what is actually being parallelized? I injected a counter into the _curve_fit method and verified that the iteration across dims runs serially (this is expected, since the loop is not parallelized with njit or similar), so the TBB parallelism must be isolated to single scipp operations like slicing and broadcasting. If that is the case, what is the benefit of TBB parallelism? Furthermore, given the lack of speedup going from 10 threads to 112 threads, is there a way to parallelize further and take advantage of a large number of threads for embarrassingly parallel operations? I plan to test parallelizing the loop over dims in curve_fit using numba or tbb4py (parallel_for) and will report back.