OpenMP locks instead of busy-waiting with NUM_PARALLEL #4503
Conversation
(force-pushed 7b302ac to 0f9ce40)
Hi @martin-frbg. Can you review the PR and share your insights?
Thank you, I see what you mean now (and why my own experiments did not look promising). Two quick comments - I would prefer not to have an
Thank you for your comments @martin-frbg
Multiple buffers (existing): We can have multiple buffers and no lock in gemm_driver, so when multiple independent OpenBLAS calls arrive, each gets its own buffer set (limited by NUM_PARALLEL) and the remaining calls busy-wait in the while loop inside exec_blas.
Single buffer with locks (proposed in this PR, and currently used by the Pthreads and Win32 backends): The idea is to take a lock in the initial phase of an OpenBLAS call, inside the gemm_driver function. When multiple independent OpenBLAS calls arrive, only one of them is granted the lock and the others sleep or busy-wait (depending on the inner implementation of the locking mechanism).
How the test was performed:
So, having a single buffer with a lock is fine, as only one OpenBLAS call is served at a time. It showed consistent results during my testing of multiple OpenBLAS executions in parallel. Edit: I have attached a screenshot of test.cpp.
Thank you - I agree that having a single set of buffers and locking out any competing caller will work; I am only wondering whether better performance could be achieved (in the future) by keeping NUM_PARALLEL as well and having as many buffer sets for parallel callers. The only drawback is that there probably needs to be more duplication of
Hi @martin-frbg I want your views on some points before moving forward with the modification. (Future performance gain with NUM_PARALLEL): I want to know your thoughts on how to utilize NUM_PARALLEL for better OpenBLAS efficiency, and about your future plans for NUM_PARALLEL. I also didn't want to remove an existing feature, but I failed to find any reason to expect future improvement from it.
(How to proceed): There are two ways we can keep NUM_PARALLEL with OpenMP locks.
Please let me know which design is aligned with your future plans for NUM_PARALLEL.
Sorry for the delay - some other things going on in real life - what I'm thinking about is OpenMP locks coexisting with NUM_PARALLEL (for some small number of NUM_PARALLEL, like 2 or 3) if that works out to be practical. I think the limitation to one caller is purely historical - most of the "threading infrastructure" part of OpenBLAS is still the same as the original GotoBLAS from close to twenty years ago, written from scratch by a single developer and appropriate for the hardware of the time. (It wasn't until about six years ago that it became apparent that the pthreads build was not thread-safe at all even for a single caller.)
Hi @martin-frbg
(force-pushed 0f9ce40 to 4f7cf07)
(force-pushed 80ce081 to 0ff8549)
(force-pushed 2f2a98e to 4c017a7)
(force-pushed 386283b to d49ebc5)
Hi @martin-frbg As discussed, I have made some changes. I have tested them and they seem to be working fine. Please review the changes and let me know if any modification is required.
Thank you - unfortunately it looks as if I am not getting much of anything done today, so I've only had a quick look so far
Hi @martin-frbg
sorry I'm currently sick, hope to get to this later in the week |
Oh, take your time. I hope you get well soon
Hi @martin-frbg
Looks good (as far as I can tell today), thank you very much. I intend to merge it tomorrow.
Brief description: This pull request proposes replacing the MAX_PARALLEL_NUMBER/NUM_PARALLEL feature with an OpenMP locking mechanism, giving a more refined implementation in line with the OpenMP specification and consistent with the Pthreads and Win32 designs.
In the current codebase, the exec_blas function in the OpenMP backend busy-waits, using the max_parallel_number feature to allow multiple instances of OpenBLAS to execute in parallel. With our changes:
We introduce OpenMP locks during initialization of the queue, removing the existing in-house spin lock in favor of omp_lock. This design is consistent with the existing Pthreads and Win32 threading backends.
This adjustment ensures that concurrent calls to BLAS functions do not utilize the same BLAS thread buffers.
OpenMP locks (currently implemented as spin locks) leave room for future improvement in the OpenMP runtime itself; any such performance gain will benefit the new implementation without requiring manual changes.
Kindly note:
We have eliminated the NUM_PARALLEL flag and its associated code. NUM_PARALLEL was an add-on to the existing OPENMP_NUM_THREADS and required recompiling OpenBLAS to use.
There is no performance impact relative to the default setting of NUM_PARALLEL=1. We use the same logic, now in a more refined manner aligned with the OpenMP specification, with scope for future improvement.
For further details and discussions on this issue, please refer to: #4418