replace pthreads with openmp #37

Open
wants to merge 5 commits into
base: master

Conversation

mode89
Contributor

@mode89 mode89 commented Jul 30, 2017

No description provided.

@mode89 mode89 changed the title from "Replace pthread calls inside kernel with OpenMP directives. Add CMake option turning off multithreading. Remove pthread related code from kernel." to "replace pthreads with openmp" Jul 30, 2017
@coveralls

Coverage Status

Changes Unknown when pulling 1843574 on mode89:feat/openmp into UCI-CARL:master.

@hkashyap
Contributor

Why do you want to replace pthreads with OpenMP? pthreads is a lower-level multithreading API that lets us hard-assign threads to cores, and that is something we want.

Create two new branches:

  1. feat/benchmarkOpenmp from this branch and copy benchmark4 from feat/benchmark
  2. feat/benchmarkpthreads from master and copy benchmark4 from feat/benchmark

Run benchmark4 on both branches using multicore machines (1, 4, 8, 16, 32 cores) and share the comparison.
copy @tingshuc
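To make the "hard assign threads to cores" point concrete, here is a minimal sketch of pinning the calling thread to one CPU with the GNU extensions pthread_getaffinity_np and pthread_setaffinity_np. It is Linux/glibc only, and the helper names first_allowed_cpu and pin_current_thread are hypothetical; this is not the code CARLsim uses.

```cpp
#include <cassert>
#include <pthread.h>
#include <sched.h>

// Returns the lowest-numbered CPU the calling thread may run on, or -1 on error.
int first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (pthread_getaffinity_np(pthread_self(), sizeof(allowed), &allowed) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            return cpu;
        }
    }
    return -1;
}

// Restricts the calling thread to a single CPU. Returns true on success.
bool pin_current_thread(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t target;
    CPU_ZERO(&target);
    CPU_SET(cpu, &target);
    return pthread_setaffinity_np(pthread_self(), sizeof(target), &target) == 0;
}
```

A worker thread would call pin_current_thread with its assigned core index as the first thing in its start routine. Older glibc versions may need -pthread at link time.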

@mode89
Contributor Author

mode89 commented Jul 31, 2017

Hi, @hkashyap. I'll do benchmarking.

I did the replacement because:

  • the OpenMP-based implementation has cleaner code;
  • it is cross-platform;
  • I couldn't find a way to break into a pthreads thread in VS Code on Linux.

@mode89
Contributor Author

mode89 commented Jul 31, 2017

Hi, @hkashyap

I ran benchmark4 on a machine with 8 logical cores. I didn't want to wait too long, so I ran simulations with 300 and 400 synapses only. Here is the run_benchmark4 script that I used.

Here is the summary:

OpenMP

Partitions   Synapses    Setup time   Run time
1            300         74703        461210
1            400         84604        572805
2            300         72605        242761
2            400         82895        301566
4            300         70685        155766
4            400         81047        191754
8            300         62028        106303
8            400         77644        131458
16           300         59296        91215
16           400         75889        114351

Output and record.csv files.

pthreads

Partitions   Synapses    Setup time   Run time
1            300         75792        259754
1            400         80685        323523
2            300         73531        166340
2            400         83039        206210
4            300         72645        121236
4            400         80364        152389
8            300         69844        132340
8            400         72541        170894
16           300         64541        158897
16           400         73202        196628

Output and record.csv files.

I've created branches feat/benchmark-openmp and feat/benchmark-pthreads.
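One way to read the two tables is as speedup relative to the single-partition run of the same implementation. The helper below is only a sketch of that arithmetic; the name speedup is hypothetical and the time units are whatever benchmark4 reports.

```cpp
#include <cassert>

// Speedup of a parallel run relative to the single-partition run.
// Only the ratio matters, so the time unit does not need to be known.
double speedup(double single_partition_time, double parallel_time) {
    assert(parallel_time > 0.0);
    return single_partition_time / parallel_time;
}
```

For 300 synapses, speedup(461210, 91215) is about 5.06 for OpenMP at 16 partitions, while speedup(259754, 121236) is about 2.14 for pthreads at 4 partitions, its fastest configuration in the table. Note that the single-partition baselines differ (461210 for OpenMP against 259754 for pthreads), so a larger speedup does not by itself mean a shorter absolute run time.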

@tingshuc
Member

Hi Andrew @mode89,
First of all, thanks for helping out. The result is quite interesting. We'll double-check the computed results and discuss whether we should just move to OpenMP. In the meantime, please let us know the best e-mail address for contacting you. We are writing the CARLsim4 paper and I think we should at least acknowledge your contribution.

@mode89
Contributor Author

mode89 commented Jul 31, 2017

Hi Ting-Shuo @tingshuc

Yeah, the results are interesting. On Saturday I tried running some simulations on a 4-core machine and, as far as I remember, the OpenMP implementation outperformed pthreads. Later I will run the benchmark on that machine and share the results with you.

I've implemented building with OpenMP in CMake only. If you plan on using Make for building, you will probably have to enable OpenMP support by passing some additional compiler and linker flags. If you use GCC and the default GNU OpenMP implementation, then the compiler flag should be -fopenmp and the linker flag is -lgomp. On other toolchains the flags will be different.
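A quick way to confirm that a given build really has OpenMP enabled is to check the standard _OPENMP macro, which the compiler defines only when OpenMP support is switched on. The sketch below assumes nothing about the CARLsim build; the function names are hypothetical.

```cpp
#include <cassert>

// Returns the value of the _OPENMP macro (a yyyymm date identifying the
// supported OpenMP specification), or 0 when this translation unit was
// compiled without OpenMP support (for example, without -fopenmp on GCC).
long openmp_version_or_zero() {
#ifdef _OPENMP
    return _OPENMP;
#else
    return 0;
#endif
}

// Aborts in debug builds if OpenMP support was silently lost,
// e.g. because a Make-based build forgot the flag.
void require_openmp() {
    assert(openmp_version_or_zero() != 0);
}
```

Without OpenMP support the compiler simply ignores the #pragma omp lines and the program runs serially, so a check like this can catch a misconfigured build early.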

Thank you for the acknowledgment. I can be contacted via [email protected]

@hkashyap
Contributor

hkashyap commented Aug 1, 2017

Hi @mode89 thank you for the benchmark comparison. We will double check them using 500 and 600 synapses and with more cores (on a cluster) when we get some time. We will definitely do an analysis.

@hkashyap
Contributor

hkashyap commented Aug 1, 2017

Speaking of clusters, we run these multicore simulations on clusters with many nodes, not on one node with many cores. Using OpenMP means we will be restricted to a single node. If we really need to use many nodes in an HPC setting, don't we need something like MPI?

@tingshuc
Member

tingshuc commented Aug 1, 2017

Actually, we need OpenMP + MPI. If this is something @mode89 would help with, we can jump into CARLsim5.

@mode89
Contributor Author

mode89 commented Aug 1, 2017

Hey guys.

I've never worked with MPI, but it sounds interesting and I'm keen on helping you with this.

Hirak @hkashyap yes, you are right that OpenMP is bound to a single node, but I think it can work in conjunction with MPI: MPI launches a single process per node, and each process uses OpenMP to parallelize its jobs across all of that node's CPUs.

@mode89
Contributor Author

mode89 commented Aug 1, 2017

Here are the results of running benchmark4 on Intel Core i5-4210U with 4 logical cores.

Partitions   Synapses    Start time          Run time
                         OpenMP  Pthreads    OpenMP  Pthreads

1            100         23609   24318       155879  138548
             200         27572   28293       209926  197093
             300         31496   32510       269595  260106
             400         35558   36965       332930  326105
             500         40029   41647       394867  392823
             600         44297   45821       456115  456416

2            100         23483   23508       107391  123676
             200         26926   27563       142489  176900
             300         30862   31566       185877  231452
             400         34732   35655       226305  289984
             500         39243   39992       264346  346974
             600         43062   44348       307221  406263

4            100         22705   23011       87710   87466
             200         26339   27434       125000  126202
             300         30199   30719       160702  162898
             400         33768   34543       204995  201963
             500         37778   38578       243666  243010
             600         42162   42494       279010  284144

8            100         22257   22556       86465   91126
             200         25721   26193       120797  128545
             300         29192   29783       161095  172993
             400         32743   33737       194389  213762
             500         36507   37364       244109  248595
             600         40175   41572       268033  288087

16           100         21943   22150       84459   104359
             200         25443   25581       118988  140627
             300         29239   29094       155669  177737
             400         32016   32612       194259  224579
             500         35622   36251       234972  259011
             600         39170   39866       262310  298025

Output for OpenMP and pthreads
record.csv for OpenMP and pthreads

@hkashyap
Contributor

hkashyap commented Aug 2, 2017

Hi @mode89 what do you mean by running 16 core simulations using 4 logical cores? You need 16 physical cores (on one node in case of OpenMP) to run the SNN simulations on 16 cores.

@mode89
Contributor Author

mode89 commented Aug 2, 2017

Hi @hkashyap I meant 4 logical cores of the CPU. The word "cores" in the table means the number that is passed to the benchmark executable: it defines the number of partitions used. In the run_benchmark4 script it's referred to as "number of cores". I've replaced the word "Cores" with "Partitions" in the tables above.

@mode89
Contributor Author

mode89 commented Aug 2, 2017

My personal opinion on these numbers is that the difference is minor, and the more neurons we have, the smaller the difference is. Actually, pthreads is quite low-level and I believe the pthread-based CARLsim can be optimized to match the performance of OpenMP. I think GNU's OpenMP is even based on pthreads under the hood, because libgomp is linked against libpthread.

My main concerns were readability and maintainability. The pthread-based implementation requires more code, more variables to keep track of, and additional helper functions. Actually, the statistics of this pull request show 100 lines against 800 lines. Plus, OpenMP is cross-platform and all popular compilers support it out of the box.
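As a rough illustration of the readability argument, here is a sketch of running one step for every partition with a single OpenMP directive. It is not the code from this pull request (which uses other directives); step_partition and step_all_partitions are hypothetical stand-ins. A pthreads version of the same loop would typically also need a start routine, an argument struct per thread, and explicit pthread_create and pthread_join calls.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the per-partition work of one simulation step.
void step_partition(std::vector<long>& counters, int partition) {
    assert(partition >= 0 && partition < static_cast<int>(counters.size()));
    counters[partition] += partition + 1;  // each partition touches only its own slot
}

// Runs one step for every partition. Iterations are distributed across
// threads when compiled with OpenMP and run serially otherwise.
void step_all_partitions(std::vector<long>& counters) {
    const int n = static_cast<int>(counters.size());
    #pragma omp parallel for
    for (int p = 0; p < n; ++p) {
        step_partition(counters, p);
    }
}
```

Because each iteration writes to a distinct element, no locking is needed, and the result is the same with or without OpenMP enabled.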

@hkashyap
Contributor

hkashyap commented Aug 4, 2017

@mode89 now I see what's going on here. My best guess is that your CPU offers only 4-way core-level parallelism, so nothing improves beyond 4 cores. OpenMP automatically assigns the work to these four cores for any number of runtimes >= 4. On the other hand, since with pthreads we manually try to hard-assign 8 or 16 threads across four cores, the performance goes down.

Anyway, as I already explained, we need to assign threads to multiple physical cores on multiple cluster nodes, and that is the main focus above all. Thank you for the details.
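The oversubscription guess above can be made concrete with a little arithmetic. The sketch below assumes thread t is pinned to core t % num_cores; that round-robin rule is an assumption made for illustration and is not taken from the CARLsim source.

```cpp
#include <cassert>
#include <vector>

// Counts how many threads land on each core when thread t is pinned to
// core (t % num_cores). The round-robin rule is an illustrative assumption.
std::vector<int> threads_per_core(int num_threads, int num_cores) {
    assert(num_threads >= 0 && num_cores > 0);
    std::vector<int> load(num_cores, 0);
    for (int t = 0; t < num_threads; ++t) {
        ++load[t % num_cores];
    }
    return load;
}
```

Under that assumption, threads_per_core(16, 4) puts four pinned threads on every core, so each core has to time-slice between them.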

@hkashyap
Contributor

hkashyap commented Sep 26, 2017

Hi @mode89, I am finalizing this pull request by comparing against pthreads. Sorry for the delay, as I had a very busy summer. I have two questions:

  1. Did you run the simulations on Windows? If not, I will seek help from @tingshuc

  2. I re-ran benchmark4 on your openmp branch on a cluster node with 60 cores and saw no improvement beyond 4 cores. What may be the reason for this? Performance did improve in the case of pthreads, which is still running.

Openmp:

number of cores is 1 |   |   | |
2000 | 100 | 5 | 32346 | 205685
2000 | 200 | 5 | 41421 | 294648
2000 | 300 | 5 | 50009 | 386893
2000 | 400 | 5 | 60023 | 489918
2000 | 500 | 5 | 67978 | 592517
2000 | 600 | 5 | 77117 | 684760
number of cores is 2 |   | | |  
2000 | 100 | 5 | 31053 | 222891
2000 | 200 | 5 | 39513 | 330809
2000 | 300 | 5 | 47618 | 434174
2000 | 400 | 5 | 55777 | 541242
2000 | 500 | 5 | 64268 | 654405
2000 | 600 | 4 | 72789 | 763254
number of cores is 4 |   |  | |
2000 | 100 | 5 | 29730 | 227001
2000 | 200 | 5 | 37374 | 329766
2000 | 300 | 5 | 45064 | 437332
2000 | 400 | 5 | 53110 | 548943
2000 | 500 | 5 | 60992 | 654377
2000 | 600 | 5 | 69426 | 765715
number of cores is 8 |   |  | |
2000 | 100 | 5 | 29310 | 227971
2000 | 200 | 5 | 36186 | 331715
2000 | 300 | 4 | 43697 | 437471
2000 | 400 | 4 | 51079 | 551749
2000 | 500 | 4 | 58638 | 665500
2000 | 600 | 4 | 65833 | 782450
number of cores is 16 |   |  | |
2000 | 100 | 4 | 28665 | 228726
2000 | 200 | 5 | 35377 | 329576
2000 | 300 | 4 | 42417 | 444320
2000 | 400 | 5 | 49668 | 553423
2000 | 500 | 4 | 56639 | 664213
2000 | 600 | 6 | 63710 | 785869
number of cores is 32 |   |  | |
2000 | 100 | 6 | 28259 | 229111
2000 | 200 | 4 | 34707 | 331572
2000 | 300 | 5 | 41544 | 439790
2000 | 400 | 5 | 48300 | 553787
2000 | 500 | 5 | 55451 | 666177
2000 | 600 | 5 | 62638 | 780040

pthreads:

number of cores is 1 |   |   | |
2000 | 100 | 5 | 23949 | 243003
2000 | 200 | 5 | 30291 | 337753
2000 | 300 | 4 | 36714 | 528196
2000 | 400 | 5 | 59039 | 602546
2000 | 500 | 6 | 65720 | 718506
2000 | 600 | 5 | 75035 | 1014244
number of cores is 2 |   | | |  
2000 | 100 | 4 | 30486 | 433953
2000 | 200 | 5 | 38698 | 564945
2000 | 300 | 5 | 46680 | 719083
2000 | 400 | 11 | 54868 | 874514
2000 | 500 | 5 | 63284 | 1029743
2000 | 600 | 6 | 71487 | 1042554
number of cores is 4 |   |  | |
2000 | 100 | 5 | 29468 | 294977
2000 | 200 | 5 | 37193 | 386312
2000 | 300 | 6 | 44585 | 453801
2000 | 400 | 5 | 52434 | 504584
2000 | 500 | 5 | 60108 | 558175
2000 | 600 | 5 | 67862 | 642926
number of cores is 8 |   |  | |
2000 | 100 | 5 | 28790 | 239215
2000 | 200 | 4 | 35738 | 330791
2000 | 300 | 5 | 43251 | 391847
2000 | 400 | 5 | 50363 | 464413
2000 | 500 | 5 | 57080 | 476765
2000 | 600 | 5 | 64221 | 543255
number of cores is 16 |   |  | |
2000 | 100 | 5 | 27675 | 233491
2000 | 200 | 5 | 34664 | 291623
2000 | 300 | 5 | 41339 | 368996
2000 | 400 | 5 | 47832 | 420652
2000 | 500 | 6 | 55003 | 460946
2000 | 600 | 5 | 62438 | 495215

@mode89
Contributor Author

mode89 commented Sep 26, 2017

Hi @hkashyap, no problem!

  1. I haven't run the simulation on Windows. Visual Studio supports only OpenMP 2.0, which lacks the #pragma omp task directive. I think it should be possible to replace it with the #pragma omp section directive.
  2. As we can see, this time OpenMP performance doesn't change as the number of partitions increases, unlike in the previous posts, where I ran it on an 8-core CPU and a 4-core CPU and the run time changed drastically between the 1-partition and 2-partition configurations. My guess is that OpenMP might be disabled. Did you use Make or CMake to build the project? I haven't changed the Make scripts. The GCC compiler needs the option -fopenmp to generate OpenMP-compatible code, and the GCC linker needs to link against the GOMP library with the option -lgomp. Other toolchains, e.g. Visual Studio, require other flags. I think you can debug it by printing the output of omp_get_thread_num or omp_get_num_threads somewhere inside the #pragma omp blocks.
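Both points can be sketched in one small example: run two jobs as OpenMP sections (a construct that is already part of OpenMP 2.0) and print which thread executed each one. All function names here are hypothetical, and the jobs are placeholders rather than CARLsim partitions. The _OPENMP guards let the code build even when OpenMP is switched off, in which case it reports a single thread.

```cpp
#include <cassert>
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

// Id of the calling thread within its team, or 0 when OpenMP is disabled.
int thread_id() {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

// Number of threads in a parallel region; 1 when OpenMP is disabled.
int parallel_team_size() {
    int team = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        #pragma omp single
        team = omp_get_num_threads();
    }
#endif
    assert(team >= 1);
    return team;
}

// Runs two placeholder jobs as sections and returns how many completed.
int run_two_sections() {
    int done = 0;
    #pragma omp parallel sections reduction(+:done)
    {
        #pragma omp section
        {
            std::printf("job 0 on thread %d\n", thread_id());
            done += 1;
        }
        #pragma omp section
        {
            std::printf("job 1 on thread %d\n", thread_id());
            done += 1;
        }
    }
    return done;
}
```

If parallel_team_size() returns 1 on a many-core node, or both jobs always report thread 0, the build is most likely running without OpenMP or with the thread count limited (for example through the OMP_NUM_THREADS environment variable).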
