Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving the RDKitConverter caching system #2942

Merged
merged 12 commits into from
May 11, 2021

Conversation

cbouy
Copy link
Member

@cbouy cbouy commented Sep 13, 2020

The current "homemade" caching system in the RDKit converter only allows to store the most recent conversion.
This new version uses the functools.lru_cache which allows users to select how many molecules should be cached, and improves readability/maintainability IMO

Also, the new caching system retrieves the converted items from the hash of all the arguments passed to the decorated atomgroup_to_mol function, instead of the id of the atomgroup and the arguments, which makes more sense. I didn't know what a hash was until recently so please forgive me for the rookie mistake :D
Now if you successively run u.atoms.convert_to("RDKIT") it will benefit from the caching system.

I needed to convert two different atomgroups (protein and ligand) while iterating over a trajectory and the previous system would just rebuild the whole topology (which takes quite some time for a protein) for each molecule at every frame hence why I think this is necessary. Now it works like a breeze.

Changes made in this Pull Request:

  • Replaces the RDKitConverter cache system with functools.lru_cache
  • Adds the set_converter_cache_size(maxsize) function to modify how many items are retained in the cache
  • Moves atomgroup_to_mol outside of the RDKitConverter class (it's not really needed there anyway), otherwise I need to define hash and eq dunders for the caching to work
  • Changes the default behavior of the RDKitConverter when an AtomGroup with no hydrogen is being converted: the converter now raises an AttributeError. The error is not raised when NoImplicit=False
  • Adds the force parameter to the RDKitConverter to ignore the above AttributeError and continue the conversion, which is mostly useful for inorganic molecules, CO2 and so on.
  • Adds a few more tests to increase the coverage of the RDKitConverter

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@pep8speaks
Copy link

pep8speaks commented Sep 13, 2020

Hello @cbouy! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-05-11 10:08:59 UTC

@codecov
Copy link

codecov bot commented Sep 14, 2020

Codecov Report

Merging #2942 (2aceaf8) into develop (d0fc581) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff            @@
##           develop    #2942   +/-   ##
========================================
  Coverage    93.55%   93.56%           
========================================
  Files          176      176           
  Lines        22837    22837           
  Branches      3194     3195    +1     
========================================
+ Hits         21366    21368    +2     
+ Misses        1421     1418    -3     
- Partials        50       51    +1     
Impacted Files Coverage Δ
package/MDAnalysis/converters/RDKit.py 98.08% <100.00%> (+0.76%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d0fc581...2aceaf8. Read the comment docs.

@cbouy
Copy link
Member Author

cbouy commented Sep 22, 2020

ping @IAlibay @richardjgowers

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the use of the lru cache from the stdlib. Peripheral comments inline.

package/MDAnalysis/coordinates/RDKit.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/RDKit.py Outdated Show resolved Hide resolved
conversions in memory. Using ``maxsize=None`` will remove all limits
to the cache size, i.e. everything is cached.
"""
global atomgroup_to_mol
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

justified use of global ;-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably not thread-safe – not a big deal, though, and I don't have a better idea.

(Although, we don't really encourage use of threads for parallelization; multiprocessing should do just fine.)

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall lgtm! Just a few comments, mainly to do with tests & docs.

package/MDAnalysis/coordinates/RDKit.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/RDKit.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/RDKit.py Outdated Show resolved Hide resolved
testsuite/MDAnalysisTests/coordinates/test_rdkit.py Outdated Show resolved Hide resolved
Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cbouy, couple of near-final comments with very minimal changes (one of which probably can be just ignored). The main discussion point remains this implicit hydrogens thing.

testsuite/MDAnalysisTests/coordinates/test_rdkit.py Outdated Show resolved Hide resolved
testsuite/MDAnalysisTests/coordinates/test_rdkit.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/RDKit.py Outdated Show resolved Hide resolved
@cbouy
Copy link
Member Author

cbouy commented Nov 3, 2020

I updated my first post with the new changes.
The Azure fails seem to be related to H5MD, I'll let you investigate.
Also coverage is not 100% even with the new tests 😭

@cbouy
Copy link
Member Author

cbouy commented Nov 18, 2020

All tests are passing 💃 anything else ?

@IAlibay
Copy link
Member

IAlibay commented Nov 19, 2020

All tests are passing 💃 anything else ?

Apologies for taking so long here, I'll re-review over the weekend but I think we should be good.

@IAlibay IAlibay added this to the 2.0 milestone Mar 14, 2021
@IAlibay
Copy link
Member

IAlibay commented Apr 10, 2021

@cbouy if you want to update this against the current develop, it'll finally be on my list for the next thing I review.

@cbouy
Copy link
Member Author

cbouy commented Apr 10, 2021

Okay I think I finally managed to run a proper git rebase upstream/develop this time 💁‍♂️

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about the massive delay here @cbouy, overall lgtm!

Just a couple of small questions.

@orbeckst do you want to re-review or are we good to merge?

package/MDAnalysis/coordinates/RDKit.py Outdated Show resolved Hide resolved
testsuite/MDAnalysisTests/coordinates/test_rdkit.py Outdated Show resolved Hide resolved
@pytest.mark.parametrize("smi", ["[H]", "C", "O", "[He]"])
def test_single_atom_mol(self, smi):
u = mda.Universe.from_smiles(smi, addHs=False,
generate_coordinates=False)
mol = u.atoms.convert_to("RDKIT")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I think I'm just being silly and forgetting a very obvious thing. Could you remind me why these are all being switched away from convert_to?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convert_to doesn't pass arguments to the underlying converter, it was in a PR at some point though (#2882 )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this behaviour contradict the docstring? I.e. ":func:set_converter_cache_size. However, ag.convert_to("RDKIT")
followed by ag.convert_to("RDKIT", NoImplicit=False) will not use the"

Or was the argument that we would merge #2882 before this PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is the converter modules weren't really documented to be instantiated like c = mda.coordinates.RDKit.RDKitConverter(); c.convert(...) but usually go through the convert_to AtomGroup method.
So yeah I assumed 2882 would be merged before v2.0 comes out

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright, let's see if we can revive #2882 then

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched back to using convert_to now that it's merged!

testsuite/MDAnalysisTests/coordinates/test_rdkit.py Outdated Show resolved Hide resolved
@orbeckst
Copy link
Member

@IAlibay sorry, don't have time today to review — I'll leave it to you.

@IAlibay
Copy link
Member

IAlibay commented Apr 23, 2021

I think this PR is complete, but I want to hold off on merging before we have a clearer idea of what's going on with #2882.

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why I completely forgot 🙀 We need a changelog entry (and can you also add in an entry for #2926?).

@cbouy
Copy link
Member Author

cbouy commented Apr 24, 2021

I'm adding the missing changelog now. For the changelog of this PR though, do I mention the fixes/changes I made, or just the enhancements (set_converter_cache_size(maxsize) and the force parameter) ? The RDKit converter isn't released yet so it's a bit weird fixing something that isn't officially out...

@orbeckst
Copy link
Member

orbeckst commented Apr 24, 2021 via email

@IAlibay
Copy link
Member

IAlibay commented May 10, 2021

I'm adding the missing changelog now. For the changelog of this PR though, do I mention the fixes/changes I made, or just the enhancements (set_converter_cache_size(maxsize) and the force parameter) ? The RDKit converter isn't released yet so it's a bit weird fixing something that isn't officially out...

Sorry for the delayed response here @cbouy.

I'd add the following entries:

enhancements:

changes:

Fixes:

edit: once #2882 is done if you can add these and then update against develop I'll merge this.

@IAlibay
Copy link
Member

IAlibay commented May 10, 2021

RDKIT crashes are starting to happen too frequently for py3.6 + numpy 1.16 (see: #3287), I'm not sure if this is somehow linked to the new converter API, so I've updated this PR against the current develop to see if it fixes things.

@cbouy please do double check that I've not accidentally broken things!

edit: best way to check that this is fixed is just by re-running CI I guess -- number of successful CI runs: 3 (that should be enough)

@cbouy
Copy link
Member Author

cbouy commented May 10, 2021

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cbouy lgtm!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants