TBB parallelism #3595
Hi! Even for single operations it is common to see diminishing returns beyond a certain thread count. A typical reason is that operations are often partially or fully bound by memory bandwidth, so that limit may already be reached at 10-20 threads on a system with 50 or more threads. The details depend on several factors, such as the number of memory slots in use, whether it is a dual-socket system, and so on. I think Scipp should actually be "nicer" here and avoid using more threads than it needs, see #3565.

Your application may benefit from parallelism at a different level. You could try a solution such as Dask to perform trivially parallel computation steps. Note that all Scipp functions that call into the lower-level C++ operations release the GIL, so using Dask's "threaded" scheduler with Scipp can yield speedups. Since you mention trying to parallelize the loop over dims in curve_fit […]

Cheers,
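As a minimal sketch of what the Dask suggestion could look like (the toy DataArray, the `process_slice` helper, and the reduction it performs are illustrative assumptions, not code from this thread): each slice is wrapped in a delayed task and the threaded scheduler runs them concurrently, which works because the Scipp C++ kernels release the GIL.

```python
import dask
import scipp as sc

# Toy data standing in for a large DataArray.
da = sc.DataArray(sc.ones(dims=['x', 'y'], shape=[100, 10_000]))


@dask.delayed
def process_slice(data_slice):
    # Scipp calls that dispatch to C++ release the GIL, so threads can
    # overlap here even within a single process.
    return sc.sum(data_slice * data_slice, 'y')


tasks = [process_slice(da['x', i]) for i in range(da.sizes['x'])]
# The 'threads' scheduler keeps everything in one process, which is fine
# precisely because the GIL is released during the Scipp work.
results = dask.compute(*tasks, scheduler='threads')
```

If each task spends most of its time in pure-Python code instead of Scipp/C++ calls, a process-based scheduler would be the safer choice, since the GIL is then not released.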
Hello,
I am running non-linear regression over a very large DataArray and am confused about the implementation of TBB in scipp. I have run identical code on my personal laptop (10 threads) and on a single cluster node (112 threads), and the runtime is very similar (the small difference is likely due to CPU frequency, not thread count). In both cases, libtbbmalloc_proxy is activated, and all threads are running at full capacity. On the cluster node, I have even activated transparent HugePages, and I observe no noticeable difference.

This leads to the question: what is actually being parallelized? I injected a counter into the _curve_fit method and verified that the iteration across dims runs serially (this is expected, since the loop is not parallelized with njit or similar), so the TBB parallelism must be isolated to single scipp operations like slicing and broadcasting. If that is the case, what is the benefit of TBB parallelism? Furthermore, given the lack of speedup going from 10 threads to 112 threads, is there a way to parallelize further and take advantage of a large number of threads for embarrassingly parallel operations? I plan to test parallelizing the loop over dims in curve_fit using numba or tbb4py (parallel_for) and will report back.