Expose OpenMP backends to more analysis methods #3435
Coming up with a consistent way to enable OpenMP acceleration would be good. Note that we came across situations where numpy's OpenMP behaviour decreased performance (in the context of the on-the-fly transformations, IIRC).
Generally I think the OpenMP backends need reviewing (I don't even think they get properly tested in CI). The main question I have here is: would weak scaling be a better goal for future development? Breaking things down to one thread per frame seems more likely to get us good scaling than multiple threads per frame. I can see cases where allowing for both options could be useful, though (hbond analysis, maybe?).
I am adding a few related issues for context.
My experience is that there are limits to how well you can make "split-apply-combine" parallelization work. It's often better if you can get multiple nodes involved (on a parallel file system), and then it's quite useful if you can also make use of the cores on each node. Being able to do some heterogeneous parallelization is not a bad thing, in my opinion. Furthermore, even "normal" operations such as distance-based selections will benefit on modern multicore machines (essentially "for free").
I think OpenMP-based acceleration (and GPU acceleration) has a place in MDAnalysis. Per-frame analysis is harder to accommodate in a seamless manner, as we have seen with PMDA. In an ideal world, our analysis classes are automagically parallel, but we're not there yet. For the time being, using multiprocessing, dask, or MPI along the lines of the User Guide: parallelizing analysis and the PRACE Workshop: Day 3, Session 1 (pdf) / Practical: Parallelism is probably the easiest.
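For concreteness, here is a minimal split-apply-combine sketch over frames with multiprocessing, along the lines of the linked material; the file names and the radius-of-gyration task are placeholders, not anything from the original post:

```python
import multiprocessing as mp
import numpy as np
import MDAnalysis as mda

TOP, TRAJ = "topol.tpr", "traj.xtc"  # placeholder file names

def rgyr_block(bounds):
    """Compute radius of gyration for a contiguous block of frames."""
    start, stop = bounds
    u = mda.Universe(TOP, TRAJ)  # each worker opens its own trajectory reader
    protein = u.select_atoms("protein")
    return [protein.radius_of_gyration() for ts in u.trajectory[start:stop]]

if __name__ == "__main__":
    n_frames = len(mda.Universe(TOP, TRAJ).trajectory)
    edges = np.linspace(0, n_frames, 4 + 1, dtype=int)  # split into 4 blocks
    with mp.Pool(4) as pool:                            # apply in parallel
        blocks = pool.map(rgyr_block, list(zip(edges[:-1], edges[1:])))
    rgyr = np.concatenate(blocks)                       # combine
```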
Do you have opinions on the distance calculations and parallelization, @richardjgowers @hmacdope?
First up, thanks for having a look into this; improving performance is something we are really looking into at the moment. :)

I will also say that we are developing an intrinsics-based, explicitly SIMD-vectorised package for calculating distances (https://github.com/MDAnalysis/distopia), which we are hoping may eventually replace some (most) of the hot distance code. Combining a "split-apply-combine" approach with SIMD intrinsics would require a different division of labour, as each thread will need enough contiguous work to keep the SIMD lanes full. For this reason, I think parallelising across the frames axis, i.e. one thread per frame, is the way to move forward, but my parallel code experience is limited. I'm also unsure how this interacts with the related issues linked above.

I do think that leveraging OpenMP as much as possible is still a really good idea and a worthwhile goal moving forward, as there are so many analyses that can benefit. 👍
Thanks for all the responses!
Acknowledged - the few that I've been playing around with are covered, but I understand that if work is done there, there may be some groundwork/cleanup/test expansion to do first.
Awesome! I'm interested and will take a look to see if I can help out. I'd be very interested in GPU implementations, too; that's another side project I was looking into for the main codebase anyway.
Right, one of the big benefits of OpenMP is that it can really help local workloads while scaling reasonably well to HPC level, hopefully transparently to the user. Another use case for in-frame parallelization is analysis where frames aren't independent of each other, such as mean squared displacement.

It seems overall there are several overlapping endeavours here: multiprocessing, multithreading, and SIMD (with GPUs hovering in the background). These can all happily coexist if planned together but can clash if not, so I'm not actually sure what to take away from all of this (useful) information. The project could maybe benefit from some centralized structures defining the parallelism scheme? I'm imagining each analysis tool could configure these settings and choose from one or more compatible offload/parallelization schemes, overridable by users.
I agree that we could do with formalising where each level of the parallelism hierarchy fits into future plans. Would people be amenable to this, @MDAnalysis/coredevs? Perhaps something like this already exists that I'm not aware of.
Thanks for looking into this. I think maybe the benchmark is a little small (at 2k atoms?); we really need to be designing around large problem sizes, where smart algorithms (here, nsgrid) are required. That said...
In terms of backend selection, I think this probably belongs as a trait of the MDAnalysis package, something like how you can tell matplotlib what backend to use, rather than every single function call taking kwargs.
Agreed, but nsgrid was crashing for me at higher atom counts, including the highest count in the current benchmark (10k). That's also something I was planning to look into but haven't filed a bug for yet.
That can work, but we'd need to carefully define and document the way analysis tools interact with that trait. It's infeasible to have every tool implemented in every backend, so if the trait says "GPU", a serial-only tool either needs to fall back to its serial implementation or throw a useful error message. So at some level, tools or libraries have to state their capabilities.
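To illustrate, here is a minimal sketch of such a capability handshake; `set_backend`, `_BACKEND`, and `supported_backends` are hypothetical names for discussion, not existing MDAnalysis API:

```python
import warnings

_BACKEND = "serial"  # hypothetical package-level default

def set_backend(name):
    """Select the preferred compute backend, e.g. 'serial', 'OpenMP', 'GPU'."""
    global _BACKEND
    _BACKEND = name

class SomeAnalysis:
    # each tool advertises what it can actually run on
    supported_backends = ("serial",)

    def _resolve_backend(self):
        if _BACKEND in self.supported_backends:
            return _BACKEND
        # fall back loudly rather than failing or silently ignoring the trait
        warnings.warn(
            f"{type(self).__name__} has no {_BACKEND!r} implementation; "
            "falling back to 'serial'."
        )
        return "serial"
```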
Another question is: what's the team's priority right now? I started down the OpenMP improvement route, but I'd be happy to work on whichever mechanism has momentum. Is there a roadmap for integrating the SIMD libraries and/or implementing per-frame parallelization internally?
Honestly, the crash at 10k seems quite high priority to me :) The development of the SIMD code is happening in the "distopia" repo as it is very experimental. Once it replicates the contents of lib.distances, you could theoretically slide it (or any other backend...) under most analysis (and core) functions that use lib.distances: something like a BLAS for distance calculations.
Oh, and pmda is where (at least one direction of) per-frame parallelism development is happening; that's @orbeckst's initiative.
It looks like the crash already has a bug filed (assuming the same underlying issue) at #3183. I found the crashing line and can try to figure out what's going on; it seems like a good first issue. I did see pmda, but there didn't look to have been any active development in the last year or so, so I wasn't sure whether it's an active project.
Is your feature request related to a problem?
Some analysis tools rely on underlying libraries that have both OpenMP and serial implementations but only ever allow the serial implementation to run. InterRDF is a good example of this. In the main loop:
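The snippet in question follows this pattern (a hedged illustration rather than the verbatim InterRDF source; `g1`, `g2`, and `u` stand for the two AtomGroups and the Universe):

```python
from MDAnalysis.lib import distances

# capped_distance takes a `method` kwarg (bruteforce/nsgrid/pkdtree) but no
# `backend` kwarg, so only the serial distance kernels are ever reachable:
pairs, dist = distances.capped_distance(
    g1.positions, g2.positions, max_cutoff=15.0, box=u.dimensions
)

# the lower-level distance_array, by contrast, already accepts a backend:
d = distances.distance_array(
    g1.positions, g2.positions, box=u.dimensions, backend="OpenMP"
)
```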
Describe the solution you'd like
Allow users to accelerate RDF and other routines with existing parallel implementations. A demo implementation (not ready for submission) can be found on my fork here. Some local benchmarks on my Ryzen 5:
I tuned the brute-force thread count with the OMP_NUM_THREADS environment variable while running asv. Isolating a benchmark with 2000 atoms, we get linear scaling of performance per OpenMP thread. Also of interest is the very poor scaling of the nsgrid implementation, but that's another issue (and in smaller benchmarks, nsgrid outperforms brute force for shorter cutoffs).
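A rough sketch of this kind of measurement, using a synthetic 2000-atom system rather than the actual asv benchmark; note the thread count has to be set before the OpenMP runtime spins up:

```python
import os
os.environ["OMP_NUM_THREADS"] = "4"  # must be set before the first parallel call

import time
import numpy as np
from MDAnalysis.lib import distances

coords = (np.random.random((2000, 3)) * 100).astype(np.float32)
box = np.array([100, 100, 100, 90, 90, 90], dtype=np.float32)

start = time.perf_counter()
for _ in range(100):
    distances.self_distance_array(coords, box=box, backend="OpenMP")
print(f"100 iterations: {time.perf_counter() - start:.2f} s on 4 threads")
```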
Describe alternatives you've considered
There are two questions here:

- Testing: this could be done test-by-test (see the hacky example here), but it may be worth looking into something more standard.
- Enabling acceleration: OpenMP support is easily detected (as sketched below), and IIRC other MDAnalysis dependencies like numpy already implement transparent multithreading for some routines.
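For example, at the time of writing the OpenMP build status is recorded when the C extensions are compiled and can be queried; the module path below is as I understand the current MDAnalysis sources and is worth double-checking:

```python
# a hedged check; OPENMP_ENABLED is set at extension build time
from MDAnalysis.lib.c_distances_openmp import OPENMP_ENABLED

backend = "OpenMP" if OPENMP_ENABLED else "serial"
print(f"using the {backend} backend")
```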
Additional context
Looking for feedback / input.