replace pthreads with openmp #37

mode89 · 2017-07-30T07:48:14Z

No description provided.

Compile with OpenMP enabled.

coveralls · 2017-07-30T07:55:03Z

Changes Unknown when pulling 1843574 on mode89:feat/openmp into UCI-CARL:master.

hkashyap · 2017-07-30T17:12:52Z

Why do you want to replace pthreads with openmp? pthreads is a lower-level multithreading API that can hard-assign threads to cores, and that is something we want.

Create two new branches:

- feat/benchmarkOpenmp from this branch, copying benchmark4 from feat/benchmark
- feat/benchmarkpthreads from master, copying benchmark4 from feat/benchmark

Run benchmark4 on both branches using multicore machines (1, 4, 8, 16, 32 cores) and share the comparison.
cc @tingshuc

mode89 · 2017-07-31T02:03:22Z

Hi, @hkashyap. I'll do the benchmarking.

I did the replacement because:

- the OpenMP-based implementation has cleaner code;
- it is cross-platform;
- I couldn't find a way to break into a pthreads thread in VS Code on Linux.

mode89 · 2017-07-31T06:27:16Z

Hi, @hkashyap

I ran benchmark4 on a machine with 8 logical cores. To keep the run short, I only ran simulations with 300 and 400 synapses. Here is the run_benchmark4 script that I used.

Here is the summary:

OpenMP

Partitions	Synapses	Setup time	Run time
1	300	74703	461210
1	400	84604	572805
2	300	72605	242761
2	400	82895	301566
4	300	70685	155766
4	400	81047	191754
8	300	62028	106303
8	400	77644	131458
16	300	59296	91215
16	400	75889	114351

Output and record.csv files.

pthreads

Partitions	Synapses	Setup time	Run time
1	300	75792	259754
1	400	80685	323523
2	300	73531	166340
2	400	83039	206210
4	300	72645	121236
4	400	80364	152389
8	300	69844	132340
8	400	72541	170894
16	300	64541	158897
16	400	73202	196628

Output and record.csv files.

I've created branches feat/benchmark-openmp and feat/benchmark-pthreads.

tingshuc · 2017-07-31T06:57:06Z

Hi Andrew @mode89,
First of all, thanks for helping out. The result is quite interesting. We'll double-check the computing results and discuss whether we just move to openmp. In the meantime, please let us know the best e-mail for contacting you. We are writing the CARLsim4 paper and I think we should at least acknowledge your contribution.

mode89 · 2017-07-31T07:29:47Z

Hi Ting-Shuo @tingshuc

Yeah, the results are interesting. On Saturday I tried to run some simulations on a 4-core machine and as far as I remember OpenMP's implementation outperformed pthreads. Later I will run the benchmark on that machine and share the results with you.

I've implemented building with OpenMP in CMake only. If you plan on using Make for building, then you will probably have to enable OpenMP support by passing some additional compiler/linker flags. If you use GCC and GNU's default OpenMP implementation, then the compiler flag should be -fopenmp and the linker flag is -lgomp. On other toolchains the flags will be different.

Thank you for the acknowledgment. I can be contacted via [email protected]

coveralls · 2017-07-31T17:04:51Z

Changes Unknown when pulling 1843574 on mode89:feat/openmp into UCI-CARL:master.

hkashyap · 2017-08-01T02:54:33Z

Hi @mode89 thank you for the benchmark comparison. We will double check them using 500 and 600 synapses and with more cores (on a cluster) when we get some time. We will definitely do an analysis.

hkashyap · 2017-08-01T03:01:08Z

Speaking of clusters, we run these multicore simulations on clusters with many nodes, not on one node with many cores. Using openmp would restrict us to a single node. If we really need to use many nodes in an HPC setting, don't we need something like MPI?

tingshuc · 2017-08-01T03:33:14Z

Actually we need OpenMP + MPI. If this is something @mode89 would help with, we can jump into CARLsim5.

mode89 · 2017-08-01T06:04:43Z

Hey guys.

I've never worked with MPI, but it sounds interesting and I'm keen on helping you with this.

Hirak @hkashyap yes, you are right that OpenMP is bound to a single node, but I think it can work in conjunction with MPI: MPI launches a single process per node, and each process parallelizes its jobs across all of the node's CPUs using OpenMP.

mode89 · 2017-08-01T15:24:22Z

Here are the results of running benchmark4 on Intel Core i5-4210U with 4 logical cores.

Partitions   Synapses    Start time          Run time
                         OpenMP  Pthreads    OpenMP  Pthreads

1            100         23609   24318       155879  138548
             200         27572   28293       209926  197093
             300         31496   32510       269595  260106
             400         35558   36965       332930  326105
             500         40029   41647       394867  392823
             600         44297   45821       456115  456416

2            100         23483   23508       107391  123676
             200         26926   27563       142489  176900
             300         30862   31566       185877  231452
             400         34732   35655       226305  289984
             500         39243   39992       264346  346974
             600         43062   44348       307221  406263

4            100         22705   23011       87710   87466
             200         26339   27434       125000  126202
             300         30199   30719       160702  162898
             400         33768   34543       204995  201963
             500         37778   38578       243666  243010
             600         42162   42494       279010  284144

8            100         22257   22556       86465   91126
             200         25721   26193       120797  128545
             300         29192   29783       161095  172993
             400         32743   33737       194389  213762
             500         36507   37364       244109  248595
             600         40175   41572       268033  288087

16           100         21943   22150       84459   104359
             200         25443   25581       118988  140627
             300         29239   29094       155669  177737
             400         32016   32612       194259  224579
             500         35622   36251       234972  259011
             600         39170   39866       262310  298025

Output for OpenMP and pthreads
record.csv for OpenMP and pthreads

hkashyap · 2017-08-02T05:45:02Z

Hi @mode89 what do you mean by running 16 core simulations using 4 logical cores? You need 16 physical cores (on one node in case of OpenMP) to run the SNN simulations on 16 cores.

mode89 · 2017-08-02T08:24:19Z

Hi @hkashyap, I meant 4 logical CPU cores. The word "cores" in the table is the number passed to the benchmark executable: it defines the number of partitions used. In the run_benchmark4 script it is referred to as "number of cores". I've replaced the word "Cores" with "Partitions" in the tables above.

mode89 · 2017-08-02T09:01:05Z

My personal opinion on these numbers is that the difference is minor, and the more neurons we have, the smaller the difference gets. pthreads is quite low-level, and I believe a pthread-based CARLsim can be optimized to match the performance of OpenMP. I think GNU's OpenMP is even based on pthreads under the hood, because libgomp is linked against libpthread.

My main concerns were readability and maintainability. The pthread-based implementation requires more code, more variables to keep track of, and additional helper functions. From the statistics of this pull request we can see 100 lines against 800 lines. Plus, OpenMP is cross-platform and all popular compilers support it out of the box.

hkashyap · 2017-08-04T01:58:35Z

@mode89 now I see what's going on here. My best guess is that your CPU offers only 4-way core-level parallelism, so you are not improving anything beyond 4 cores. OpenMP automatically assigns work to these four cores for any number of runtimes >= 4. On the other hand, since with pthreads we manually try to hard-assign 8/16 threads over four cores, the performance goes down.

Anyway, as I already explained, we need to assign work to multiple physical cores on multiple cluster nodes, and that is the main focus above all. Thank you for the details.

hkashyap · 2017-09-26T05:11:16Z

Hi @mode89, I am finalizing this pull request by comparing against pthreads. Sorry for the delay, as I had a very busy summer. I have two questions:

1. Did you run the simulations on Windows? If not, I will seek help from @tingshuc.
2. I re-ran benchmark4 on your openmp branch on a cluster node with 60 cores and saw no improvement beyond 4 cores. What may be the reason for this? Performance did improve in the case of pthreads, which is still running.

Openmp:

number of cores is 1 | | | |
2000 | 100 | 5 | 32346 | 205685
2000 | 200 | 5 | 41421 | 294648
2000 | 300 | 5 | 50009 | 386893
2000 | 400 | 5 | 60023 | 489918
2000 | 500 | 5 | 67978 | 592517
2000 | 600 | 5 | 77117 | 684760
number of cores is 2 | | | |
2000 | 100 | 5 | 31053 | 222891
2000 | 200 | 5 | 39513 | 330809
2000 | 300 | 5 | 47618 | 434174
2000 | 400 | 5 | 55777 | 541242
2000 | 500 | 5 | 64268 | 654405
2000 | 600 | 4 | 72789 | 763254
number of cores is 4 | | | |
2000 | 100 | 5 | 29730 | 227001
2000 | 200 | 5 | 37374 | 329766
2000 | 300 | 5 | 45064 | 437332
2000 | 400 | 5 | 53110 | 548943
2000 | 500 | 5 | 60992 | 654377
2000 | 600 | 5 | 69426 | 765715
number of cores is 8 | | | |
2000 | 100 | 5 | 29310 | 227971
2000 | 200 | 5 | 36186 | 331715
2000 | 300 | 4 | 43697 | 437471
2000 | 400 | 4 | 51079 | 551749
2000 | 500 | 4 | 58638 | 665500
2000 | 600 | 4 | 65833 | 782450
number of cores is 16 | | | |
2000 | 100 | 4 | 28665 | 228726
2000 | 200 | 5 | 35377 | 329576
2000 | 300 | 4 | 42417 | 444320
2000 | 400 | 5 | 49668 | 553423
2000 | 500 | 4 | 56639 | 664213
2000 | 600 | 6 | 63710 | 785869
number of cores is 32 | | | |
2000 | 100 | 6 | 28259 | 229111
2000 | 200 | 4 | 34707 | 331572
2000 | 300 | 5 | 41544 | 439790
2000 | 400 | 5 | 48300 | 553787
2000 | 500 | 5 | 55451 | 666177
2000 | 600 | 5 | 62638 | 780040

pthreads:

number of cores is 1 | | | |
2000 | 100 | 5 | 23949 | 243003
2000 | 200 | 5 | 30291 | 337753
2000 | 300 | 4 | 36714 | 528196
2000 | 400 | 5 | 59039 | 602546
2000 | 500 | 6 | 65720 | 718506
2000 | 600 | 5 | 75035 | 1014244
number of cores is 2 | | | |
2000 | 100 | 4 | 30486 | 433953
2000 | 200 | 5 | 38698 | 564945
2000 | 300 | 5 | 46680 | 719083
2000 | 400 | 11 | 54868 | 874514
2000 | 500 | 5 | 63284 | 1029743
2000 | 600 | 6 | 71487 | 1042554
number of cores is 4 | | | |
2000 | 100 | 5 | 29468 | 294977
2000 | 200 | 5 | 37193 | 386312
2000 | 300 | 6 | 44585 | 453801
2000 | 400 | 5 | 52434 | 504584
2000 | 500 | 5 | 60108 | 558175
2000 | 600 | 5 | 67862 | 642926
number of cores is 8 | | | |
2000 | 100 | 5 | 28790 | 239215
2000 | 200 | 4 | 35738 | 330791
2000 | 300 | 5 | 43251 | 391847
2000 | 400 | 5 | 50363 | 464413
2000 | 500 | 5 | 57080 | 476765
2000 | 600 | 5 | 64221 | 543255
number of cores is 16 | | | |
2000 | 100 | 5 | 27675 | 233491
2000 | 200 | 5 | 34664 | 291623
2000 | 300 | 5 | 41339 | 368996
2000 | 400 | 5 | 47832 | 420652
2000 | 500 | 6 | 55003 | 460946
2000 | 600 | 5 | 62438 | 495215

mode89 · 2017-09-26T07:17:56Z

Hi @hkashyap, no problem!

1. I haven't run the simulation on Windows. Visual Studio supports only OpenMP 2.0, which lacks the #pragma omp task directive. I think it should be possible to replace it with the #pragma omp sections / #pragma omp section directives.
2. As we can see, this time OpenMP performance doesn't change as the number of partitions increases, unlike in the previous posts where I ran it on an 8-core CPU and a 4-core CPU and the run time changed drastically between the 1-partition and 2-partition configurations. My guess is that OpenMP might be disabled. Did you use Make or CMake to build the project? I haven't changed the Make scripts. GCC requires the -fopenmp option to generate OpenMP-compatible code, and the linker has to link against the GOMP library with the -lgomp option. Other toolchains, e.g. Visual Studio, require other flags. I think you can debug it by printing the output of omp_get_thread_num or omp_get_num_threads somewhere inside the #pragma omp blocks.

mode89 added 5 commits July 30, 2017 14:48

replace pthread calls with openmp directives

419d3e5

Compile with OpenMP enabled.

add cmake option turning off multithreading

59f967e

remove pthread helper functions

52e4095

remove unused gpuAllocationLock

b5411d3

remove pthread related includes

1843574

mode89 changed the title ~~Replace pthread calls inside kernel with OpenMP directives. Add CMake option turning off multithreading. Remove pthread related code from kernel.~~ replace pthreads with openmp Jul 30, 2017

hkashyap requested review from tingshuc, hkashyap and staslist July 30, 2017 17:13

mode89 force-pushed the feat/openmp branch from 7537123 to 1843574 Compare July 31, 2017 16:59
