Bad Performance of Parallelization with On-the-fly Transformation #144
Comments
Your results are interesting. What is the single blue bar "RMSD" result? Is this running the code in serial but allowing multi-threaded operations? I don't understand why it improves with n_cores. Interestingly, cupy does not do better than single-threaded numpy. This reminds me of results from an REU project where Robert Delgado found the same (http://doi.org/10.6084/m9.figshare.3823293.v1) because the data transfer to/from the GPU was expensive.
The blue bar is the RMSD without any Transformation, just as a reference.
So ideally we would like to use only one thread for
If you set it as an environment variable from Python, does it work? Is there a
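For reference, a minimal sketch of setting the thread limits as environment variables from within Python. It assumes the BLAS/OpenMP backend reads these variables when it is first loaded, so they have to be set before numpy is imported; the helper name `limit_numeric_threads` is hypothetical, and whether this alone fixes the scaling in PMDA is exactly the open question here.

```python
import os

# Variables read by common threaded backends (OpenMP, OpenBLAS, MKL) at load time.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def limit_numeric_threads(n_threads=1):
    """Set the thread-count variables (hypothetical helper).

    Must be called before numpy (or anything that loads BLAS) is imported,
    otherwise the already-initialised thread pools are not affected.
    """
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n_threads)
    return {name: os.environ[name] for name in THREAD_ENV_VARS}


settings = limit_numeric_threads(1)
# Import numpy only after this point, e.g.:
# import numpy as np
```

Worker processes started afterwards inherit the environment of the parent process, so the limit should also apply to them as long as it is set before they are spawned.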
Actually, there is one: threadpoolctl. For the
Also, I should double-check the benchmark with another benchmarking system.
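A rough sketch of how threadpoolctl could be used to keep a single `numpy.dot` call on one BLAS thread. `threadpool_limits` is the context manager provided by threadpoolctl; the `rotate` function and the fallback to a no-op context when threadpoolctl is not installed are illustrative assumptions, not part of MDAnalysis or PMDA.

```python
import contextlib

import numpy as np

try:
    from threadpoolctl import threadpool_limits
except ImportError:  # treated as optional in this sketch
    threadpool_limits = None


def single_threaded_blas():
    """Return a context manager that limits BLAS to one thread if possible."""
    if threadpool_limits is None:
        return contextlib.nullcontext()
    return threadpool_limits(limits=1, user_api="blas")


def rotate(coords, rotation):
    """Apply a rotation matrix to an (N, 3) coordinate array (hypothetical)."""
    with single_threaded_blas():
        # Only the BLAS-backed dot product runs under the thread limit.
        return np.dot(coords, rotation.T)


# 90 degree rotation about the z axis applied to a single point on the x axis.
rot_z = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
rotated = rotate(np.array([[1.0, 0.0, 0.0]]), rot_z)
```

Unlike environment variables, the limit here is applied at runtime and only inside the `with` block, so it does not depend on being set before numpy is imported.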
@yuxuanzhuang have you tried PMDA again since PR MDAnalysis/mdanalysis#2950 got merged?
EDIT: Sorry, I need to find some time to look into it. It is really weird to see that the parallel RMSD performs better with a single core... maybe I messed something up.

    import MDAnalysis as mda
    from MDAnalysis import transformations as trans
    from MDAnalysis.analysis.rms import RMSD as serial_rmsd
    from pmda.rms.rmsd import RMSD as parallel_rmsd

    # `files` (paths to the topology and trajectory) is defined elsewhere
    u = mda.Universe(files['PDB'], files['SHORT_TRAJ'])
    fit_trans = trans.fit_rot_trans(u.atoms, u.atoms)
    u.trajectory.add_transformations(fit_trans)

    # serial RMSD (%time is an IPython magic)
    rmsd = serial_rmsd(u.atoms, u.atoms)
    %time rmsd.run()
    # CPU times: user 20.6 s, sys: 961 ms, total: 21.5 s

    # PMDA RMSD with a single block and a single job
    rmsd = parallel_rmsd(u.atoms, u.atoms)
    %time rmsd.run(n_blocks=1, n_jobs=1)
    # CPU times: user 11.4 s, sys: 0 ns, total: 11.4 s
Which versions of PMDA and MDA did you use for the above benchmark that shows similar scaling for RMSD and RMSD+fit+rot trans?
The develop branch of MDA and the PR #132 branch of PMDA.
Expected behaviour
Analysis of a Universe with on-the-fly transformations scales reasonably well.

Actual behaviour
The scaling performance is really bad, even with two cores.

Code

Reason
Some Transformations include numpy.dot, which is itself multi-threaded, so the cores are oversubscribed.

Possible solution
- Limit the number of threads used by numpy (https://docs.dask.org/en/latest/array-best-practices.html#avoid-oversubscribing-threads), which is surprisingly faster even for serial (single-core) performance.
- Use cupy (https://cupy.dev/) to leverage the GPU power (only replacing the numpy.dot operation of the Transformation).

Benchmarking result
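To make the oversubscription described under "Reason" concrete, here is a small back-of-the-envelope sketch. It assumes that each worker process starts its own BLAS thread pool and that the pool defaults to one thread per core, which is common but depends on how numpy was built; all function names are hypothetical.

```python
def total_threads(n_jobs, threads_per_job):
    """Number of threads competing for the CPU when every job runs BLAS."""
    return n_jobs * threads_per_job


def oversubscription_factor(n_jobs, threads_per_job, n_cores):
    """Ratio of runnable threads to physical cores (1.0 means fully used)."""
    return total_threads(n_jobs, threads_per_job) / n_cores


def thread_budget_per_job(n_jobs, n_cores):
    """Largest per-job BLAS thread count that avoids oversubscription."""
    return max(1, n_cores // n_jobs)


# Example: 4 parallel jobs on an 8-core machine, BLAS defaulting to 8 threads.
n_cores, n_jobs, blas_threads = 8, 4, 8
threads = total_threads(n_jobs, blas_threads)
factor = oversubscription_factor(n_jobs, blas_threads, n_cores)
budget = thread_budget_per_job(n_jobs, n_cores)
print(f"{threads} threads on {n_cores} cores, factor {factor}, budget {budget}")
```

Under these assumptions the thread budget drops to 1 once the number of jobs reaches the number of cores, which is the setting the first possible solution aims for.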
Current version of MDAnalysis:
(run python -c "import MDAnalysis as mda; print(mda.__version__)") 2.0.0 dev
(run python -c "import pmda; print(pmda.__version__)")
(run python -c "import dask; print(dask.__version__)")
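The three one-liners above could also be collected into a single script, which makes it easier to paste all versions into a report at once. This is only a convenience sketch; the function name `report_versions` is hypothetical, and it simply reads each module's `__version__` attribute as the one-liners do.

```python
import importlib


def report_versions(module_names=("MDAnalysis", "pmda", "dask")):
    """Return a mapping of module name to its reported version string."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = "not installed"
        else:
            # Fall back to "unknown" if the module defines no __version__.
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version}")
```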