Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit numpy thread usage for Transformation classes #2950

Merged
merged 33 commits into from
Apr 10, 2021

Conversation

yuxuanzhuang
Copy link
Contributor

@yuxuanzhuang yuxuanzhuang commented Sep 19, 2020

Fixes #2996 and MDAnalysis/pmda#144

Changes made in this Pull Request:

  • add a TransformationBase class for handling thread limiting.
  • TransformationBase has a parallelizable to check if it can be used in the parallel analysis (split-apply-combine approach)

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting addition. I might be commenting prematurely but here are initial comments

  • I assume that under normal circumstances, multithreading is beneficial, i.e., whenever someone just runs MDA on a multicore machine without thinking about parallelization. Under these conditions we would not want to limit the threads, would we? Is there a sensible way by which we can make the thread limiting optional?
  • tests
  • docs
  • changes

Did you check the performance impact of limiting threads?

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

echoing @orbeckst's comment.

I realise this is linked to MDAnalysis/pmda#144, but it might be good to have an issue on MDAnalysis itself. In part because it would be to go through all the use cases & see where benchmarks vary between users (I wouldn't be surprised if you end up getting a lot of variance on whether or not enabling multithreading speeds/slows things down for a given use case).

.travis.yml Outdated
@@ -30,7 +30,7 @@ env:
- SETUP_CMD="${PYTEST_FLAGS}"
- BUILD_CMD="pip install -e package/ && (cd testsuite/ && python setup.py build)"
- CONDA_MIN_DEPENDENCIES="mmtf-python biopython networkx cython matplotlib scipy griddataformats hypothesis gsd codecov"
- CONDA_DEPENDENCIES="${CONDA_MIN_DEPENDENCIES} seaborn>=0.7.0 clustalw=2.1 netcdf4 scikit-learn joblib>=0.12 chemfiles tqdm>=4.43.0 tidynamics>=1.0.0 rdkit>=2020.03.1 h5py"
- CONDA_DEPENDENCIES="${CONDA_MIN_DEPENDENCIES} seaborn>=0.7.0 clustalw=2.1 netcdf4 scikit-learn joblib>=0.12 chemfiles tqdm>=4.43.0 tidynamics>=1.0.0 rdkit>=2020.03.1 h5py threadpoolctl"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

threadpoolctl isn't a core dependency, so as it is, it will fail when running on minimal dependencies.

@yuxuanzhuang
Copy link
Contributor Author

I am afraid that with the decorator (the current implementation), we cannot change the thread limit at runtime. (e.g. passing a self.n_thread arg). However, we can do it inside __init__ by self.__call__ = threadpool_limits_decorator(self.n_thread)(self.__call__), which I think is not that clear/pretty as with the decorator.

As for the impact on performance, I can reproduce the results (MDAnalysis/pmda#144) with another 6-core CPU. Sorry for the rough data, but you can see it's mainly the hyperthreading that is limiting the performance.
image
Note this is purely the serial code (no PMDA involved)

I will create an MDA issue when I have time.

@orbeckst
Copy link
Member

The decorator is pretty but if it does not provide enough flexibility, just use the context manager:

class RotateTransformation:
   def __init__(self, rotmat, max_threads=None):
       self.rotmat = rotmat
       self.n_thread = max_threads  # possibly needs some logic here to "do the best thing"
  
   def __call__(self, ts):
        with threadpool_limits(self.n_thread):
              ts.positions[:] = np.dot(ts.positions, self.rotmat)
        return ts

That still looks reasonable to me.

Or is there a deeper reason for preferring the function decorator?

The bigger question is what the default for max_threads ought to be and if all transformations need to have it. I think the common use case is serial on multicore so it should probably default to whatever the thread-parallel code wants to do (even though with hyper-threading this might be sub-optimal). We should then give guidance on how to tune the settings.

@yuxuanzhuang
Copy link
Contributor Author

Sorry for the radio silence...was pretty busy last week. I think we can port this function into PMDA if we decide serial code just goes into its default setting.

And in PMDA, say if the user is requiring all the cores, then we limit the self._single_frame to one thread so that everything inside, Transformation included, will be run with one thread. This is also the reason I prefer to use a decorator.

@orbeckst
Copy link
Member

orbeckst commented Oct 9, 2020

It looks like a worthwhile tuning knob to include in the MDAnalysis transformations. Then PMDA (and anyone else) can just use the optional max_threads as needed.

https://github.com/joblib/threadpoolctl/blob/32037cf43a61909282b5d07e6b21d9621fa03e25/threadpoolctl.py#L154 says that threadpool_limits(limits=None) does nothing so we can just use None as a default for max_threads=None.

Including threadpoolctl as dependency should be easy (conda-forge or pip).

@orbeckst
Copy link
Member

orbeckst commented Oct 9, 2020

I think we can port this function into PMDA if we decide serial code just goes into its default setting.

Do you mean then we don't have threadctl in MDA? Is this because then the user would have to limit the threads when they create the transformations, i.e., they need to know that they will use PMDA?

And in PMDA, say if the user is requiring all the cores, then we limit the self._single_frame to one thread so that everything inside, Transformation included, will be run with one thread.

I can see that it's easier to just apply the limits to everything in PMDA.

I don't think that it will be easy to figure out if there's any oversubscription of the CPU, at least not when someone is using distributed. For multiprocessing you could try to find out how many cores are available but this starts getting complicated, I feel.

This is also the reason I prefer to use a decorator.

Can you explain more? You mean so that you can decorate single_frame()?? Maybe an example would help me to understand.

@orbeckst
Copy link
Member

orbeckst commented Oct 9, 2020

@yuxuanzhuang in your performance comparison #2950 (comment), what is the default when you don't limit any threads? That number is missing from the comparison. If the default runs 12 threads (or enough to always reduce performance) then I better understand your point that you want to limit to 1 thread by default.

Did you check with threadpool_info() what the default settings were, i.e., how many threads np.dot uses by default?

@yuxuanzhuang
Copy link
Contributor Author

It turns out there's an issue with this decoration approach that doesn't seem to be easily solvable---the thread limit is applied globally, instead of only the decorated function. Here is an example. https://gist.github.com/yuxuanzhuang/05b0da16a51f567e54f7f3f22591e316

@yuxuanzhuang
Copy link
Contributor Author

@yuxuanzhuang in your performance comparison #2950 (comment), what is the default when you don't limit any threads? That number is missing from the comparison. If the default runs 12 threads (or enough to always reduce performance) then I better understand your point that you want to limit to 1 thread by default.

Did you check with threadpool_info() what the default settings were, i.e., how many threads np.dot uses by default?

The default is 12 threads (or whatever that computer has including hyper threads). I think for numpy, the threshold of asking for all the threads is quite low. So Transformation is always performing pretty bad in default.

@IAlibay
Copy link
Member

IAlibay commented Oct 12, 2020

@yuxuanzhuang apologies, I think I'm missing part of the conversation here 😅 From your test, it seems that the context manager method might work here right? Would that be suitable or is there a barrier to using it?

@yuxuanzhuang
Copy link
Contributor Author

@yuxuanzhuang apologies, I think I'm missing part of the conversation here 😅 From your test, it seems that the context manager method might work here right? Would that be suitable or is there a barrier to using it?

You are right, the context manager method should work. I was thinking that with that you have to add the code snippet everywhere needed.

@IAlibay
Copy link
Member

IAlibay commented Oct 12, 2020

You are right, the context manager method should work. I was thinking that with that you have to add the code snippet everywhere needed.

I'll admit I'm not super versed on the transformation code, but it seems like pretty much everything is a class with both an __init__ and a __call__. Is there any way we could create a base class for all these so that we only have to implement one __call__ that has the context manager, and then calls some other function (let's say _run or something)?

@jbarnoud
Copy link
Contributor

The transformations are any callable that takes and returns a timestep. In practice, callables are not always pickable, hence the use of classes. Since all the transformations we will ship will be such classes, they may as well pack some features and wrap the transformation logic with the core count one.

@orbeckst
Copy link
Member

orbeckst commented Oct 13, 2020

EDIT: The post below was written under the assumption that np.dot() and other external function calls that use OpenMP can limit MDAnalysis performance when too many OpenMP threads are enabled and oversubscription happens. However, that does not seem to be the situation #2950 (comment) so ignore what I wrote until we know better what is actually happening.


Given that OpenMP (and therefore the BLAS implementations in numpy) default to the maximum number of threads and therefore hurt MDAnalysis performance I would say we

  • add threadpoolctl as a dependency
  • make our transformation classes accept a max_threads kwarg in the constructor
  • use the threadpoolctl.threadpool_limits context manager to limit our transformations
  • change the transformations base class to have a _transform() method (similar to _single_frame() in AnalysisBase) that is called inside the context manager in the __call__() method

By the way, np.dot() shows up pretty frequently in the code (encore, diffusionmap, gnm, helix_analysis, rms, pca, atoms.transform). Is it always the case that the perform could be improved by limiting the thread number?

@yuxuanzhuang
Copy link
Contributor Author

yuxuanzhuang commented Oct 13, 2020

It turns out the speed limiting step is not np.dot (which is actually faster with multi-threads).
Due to the previously mentioned globally-applied thread limiting, the performance is actually enhanced by other parts of the code. (Sorry for the confusion, should do the line profiler earlier).
What confused me now is that such acceleration (or stall) only happens with Transformation present. (https://gist.github.com/yuxuanzhuang/c14f1889cd49c42fd661a860bf7e92d7) I will look into that later, but if you have any insight, please comment!

EDIT:
Updating gist
https://gist.github.com/yuxuanzhuang/17ed6def63b08248db59f2c44e3e0419

@orbeckst
Copy link
Member

orbeckst commented Oct 13, 2020

So if I read your gist correctly then the profiling shows that np.dot() is not a problem and even with 12 threads runs marginally faster:

class function n=1 n=6 n=12
_single_frame CalcRMSDRotationalMatrix() 295.9 309.6 515.7
mobile_atoms.center() 2860.4 3025.6 5498.6
self._mobile_coordinates64[:] = self.mobile_atoms.positions 1414.2 1481.7 3041.7
self._mobile_coordinates64 -= mobile_com 592.7 607.6 998.6
fit_rot_trans align.rotation_matrix() 1367.7 1364.2 1790.8
mobile.atoms.center() 2843.4 2849.7 5525.8
mobile_coordinates = self.mobile.atoms.positions - mobile_com 2149.6 2156.4 4640.1
np.dot(ts.positions, rotation.T) 1030.1 666.1 944.9

However, the big problem is the center of mass calculation in atoms.center() and the simple array creation/subtractions. These are all actually much slower than the rotation matrix calculation and they get much worse for higher thread numbers.

Is this run on 12 real cores or are you using hyperthreading?

It's a 6-core CPU. (https://gist.github.com/yuxuanzhuang/17ed6def63b08248db59f2c44e3e0419#gistcomment-3488152)

EDIT: The table shows the "Per hit" time in µs. Also note the huge amount of time it takes to create new arrays or just do an element-wise operation. Does the latter include array copies, i.e., atoms.positions with fancy indexing?

EDIT 2: updated table with n=6 values from https://gist.github.com/yuxuanzhuang/17ed6def63b08248db59f2c44e3e0419 and note that it's a 6-core cpu

@orbeckst
Copy link
Member

Judging from @yuxuanzhuang 's benchmarks, oversubscribing cores for OpenMP hurts performance in unexpected ways, even for code where there's no obvious parallelization going on (AtomGroup.center() takes twice the amount of time in his example!).

np.dot() shows some modest improvements in performance, as long as OpenMP does not oversubscribe cores.

@orbeckst
Copy link
Member

orbeckst commented Oct 13, 2020

It is also odd that oversubscription appears to affect the performance of the actual read step in the XDRReader, see https://gist.github.com/yuxuanzhuang/4918cd1b5d8d62de79eab9df40de4bb7#gistcomment-3488210 but only when transformations are added.

@yuxuanzhuang
Copy link
Contributor Author

Related gist:

The tests were down with

  • 6-core CPU
  • numpy 1.19.1
  • python 3.8
threadpool_info()
[{'filepath': '/home/scottzhuang/anaconda3/envs/gsoc/lib/python3.8/site-packages/numpy.libs/libopenblasp-r0-ae94cfde.3.9.dev.so',
  'prefix': 'libopenblas',
  'user_api': 'blas',
  'internal_api': 'openblas',
  'version': '0.3.9.dev',
  'num_threads': 12,
  'threading_layer': 'pthreads'},
 {'filepath': '/home/scottzhuang/anaconda3/envs/gsoc/lib/libgomp.so.1.0.0',
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'internal_api': 'openmp',
  'version': None,
  'num_threads': 12},
 {'filepath': '/home/scottzhuang/anaconda3/envs/gsoc/lib/libmkl_rt.so',
  'prefix': 'libmkl_rt',
  'user_api': 'blas',
  'internal_api': 'mkl',
  'version': '2020.0.1',
  'num_threads': 6,
  'threading_layer': 'intel'},
 {'filepath': '/home/scottzhuang/anaconda3/envs/gsoc/lib/libiomp5.so',
  'prefix': 'libiomp',
  'user_api': 'openmp',
  'internal_api': 'openmp',
  'version': None,
  'num_threads': 12}]

@richardjgowers
Copy link
Member

This fix looks good to me given the problem

@yuxuanzhuang
Copy link
Contributor Author

  • Make sure that threadpoolctl is always installed (setup.py and elsewhere... @IAlibay might know about some other files where it needs to be).

As far as I am aware, we'll need it in the gh actions workflow, the travis cron job, azure pipelines, and setup.py.

Thanks! Let me know if I have missed anywhere.

@IAlibay
Copy link
Member

IAlibay commented Apr 7, 2021

  • Make sure that threadpoolctl is always installed (setup.py and elsewhere... @IAlibay might know about some other files where it needs to be).

As far as I am aware, we'll need it in the gh actions workflow, the travis cron job, azure pipelines, and setup.py.

Thanks! Let me know if I have missed anywhere.

@lilyminium do we use the maintainer/conda/environment.yml for anything / should we update it too?

.travis.yml Show resolved Hide resolved
Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of small typos, a test and a question re: documenting thread limitations.

package/MDAnalysis/transformations/base.py Show resolved Hide resolved

To define a new Transformation, :class:`TransformationBase`
has to be subclassed.
``max_threads`` will be set to ``None`` in default,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
``max_threads`` will be set to ``None`` in default,
``max_threads`` will be set to ``None`` by default,

the environment variable :envvar:`OMP_NUM_THREADS`
(see the `OpenMP specification for OMP_NUM_THREADS <https://www.openmp.org/spec-html/5.0/openmpse50.html>`_)
are used.
``parallelizable`` will be set to ``True`` in default.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
``parallelizable`` will be set to ``True`` in default.
``parallelizable`` will be set to ``True`` by default.

package/MDAnalysis/transformations/fit.py Show resolved Hide resolved
Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You addressed all my comments but @IAlibay raised important points so please address these.

@orbeckst
Copy link
Member

orbeckst commented Apr 8, 2021

There's one RDKit test failing https://github.com/MDAnalysis/mdanalysis/pull/2950/checks?check_run_id=2297717464

_____________ TestRDKitConverter.test_mol_from_selection[ILE-38-1] _____________
[gw0] linux -- Python 3.8.8 /usr/share/miniconda/envs/test/bin/python

self = <MDAnalysisTests.coordinates.test_rdkit.TestRDKitConverter object at 0x7f20c185ec70>
peptide = <AtomGroup with 155 atoms>, resname = 'ILE', n_atoms = 38
n_fragments = 1

    @pytest.mark.parametrize("resname, n_atoms, n_fragments", [
        ("PRO", 14, 1),
        ("ILE", 38, 1),
        ("ALA", 20, 2),
        ("GLY", 21, 3),
    ])
    def test_mol_from_selection(self, peptide, resname, n_atoms, n_fragments):
        mol = peptide.select_atoms("resname %s" % resname).convert_to("RDKIT")
>       assert n_atoms == mol.GetNumAtoms()
E       assert 38 == 1
E        +  where 1 = <bound method GetNumAtoms of <rdkit.Chem.rdchem.Mol object at 0x7f20b9a6a280>>()
E        +    where <bound method GetNumAtoms of <rdkit.Chem.rdchem.Mol object at 0x7f20b9a6a280>> = <rdkit.Chem.rdchem.Mol object at 0x7f20b9a6a280>.GetNumAtoms

/home/runner/work/mdanalysis/mdanalysis/testsuite/MDAnalysisTests/coordinates/test_rdkit.py:186: AssertionError

I assume that this one has nothing to do with this PR?

I am listed as the Assignee. However, @IAlibay when you're happy with the PR I will not object to you merging. Or ping me when a final look-over is needed & I'll happily do the merge.

@lilyminium
Copy link
Member

@orbeckst it sounds like it's related to #2958 but I have not personally seen this failure before.

@IAlibay
Copy link
Member

IAlibay commented Apr 8, 2021

I'll restart CI, we really need to make @cbouy's PRs our next priority for 2.0.

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, I'll let you have a final look/merge @orbeckst

@IAlibay
Copy link
Member

IAlibay commented Apr 10, 2021

In the interest of progressing towards 2.0, I'll go ahead with the squash-merge.

@IAlibay IAlibay merged commit 13a5df0 into MDAnalysis:develop Apr 10, 2021
@orbeckst
Copy link
Member

orbeckst commented Apr 10, 2021 via email

@yuxuanzhuang
Copy link
Contributor Author

Thanks for the review!

freebsd-git pushed a commit to freebsd/freebsd-ports that referenced this pull request Jun 21, 2022
Version 0.19.2 is too old and does not support Python-3.9.

Remark: threadpoolctl is disabled in this port. See
<MDAnalysis/mdanalysis#2950> for the impacts on
performance.

Releases notes at <https://github.com/MDAnalysis/mdanalysis/releases>.

PR:		264716
Approved by:	yuri (maintainer)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PositionAverager Transformation ends up with wrong results with parallel analysis
8 participants