Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple RDKitConverter #2775

Merged
merged 107 commits into from
Aug 21, 2020
Merged

Simple RDKitConverter #2775

merged 107 commits into from
Aug 21, 2020

Conversation

cbouy
Copy link
Member

@cbouy cbouy commented Jun 19, 2020

Part of the fixes for #2468

Changes made in this Pull Request:

  • Added a converter that works for any input with all hydrogens explicit in the topology.

Inspired from the ParmEdConverter

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?
  • Merge tempfactors and bfactors
  • Check Codecov
  • Check pep8speaks

@codecov
Copy link

codecov bot commented Jun 19, 2020

Codecov Report

Merging #2775 into develop will increase coverage by 0.03%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2775      +/-   ##
===========================================
+ Coverage    92.93%   92.96%   +0.03%     
===========================================
  Files          187      187              
  Lines        24507    24813     +306     
  Branches      3185     3234      +49     
===========================================
+ Hits         22775    23067     +292     
- Misses        1686     1698      +12     
- Partials        46       48       +2     
Impacted Files Coverage Δ
package/MDAnalysis/coordinates/RDKit.py 97.31% <100.00%> (-2.69%) ⬇️
package/MDAnalysis/coordinates/H5MD.py 91.08% <0.00%> (-3.47%) ⬇️
util.py 89.47% <0.00%> (+0.07%) ⬆️
coordinates/base.py 95.12% <0.00%> (+0.26%) ⬆️
auxiliary/base.py 91.31% <0.00%> (+0.56%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5e68617...4e49616. Read the comment docs.

@IAlibay
Copy link
Member

IAlibay commented Jun 23, 2020

@cbouy since this seemd like WIP, I've been holding off on reviewing, do ping us when you'd like a review.

@cbouy
Copy link
Member Author

cbouy commented Jun 24, 2020

@IAlibay @richardjgowers @fiona-naughton

It's starting to look good!
image

It's still missing quite some tests but I guess you can start having a look now

@IAlibay
Copy link
Member

IAlibay commented Jun 24, 2020

It's starting to look good!

Oooo that's some really cool stuff! Do make sure to include this (or a similar image) in your next blog :)

@RMeli
Copy link
Member

RMeli commented Jun 24, 2020

This looks really cool! =)

Would it be easier to call the conversion method rdkit() or to_rdkit()? The reason I'm asking is that one could use code autocompletion to quickly see the conversion methods available, while convert_to() requires to look at the documentation (which might get quite hairy if multiple conversions are supported?). It's also less typing. ;)

This is just a quick thought I had looking at your code snippet, feel free to ignore!

EDIT: rdkit() is inspired by PyTorch's numpy() (converting torch.Tensor to np.ndarray) while to_rdkit() is inspired by Panda's to_numpy().

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On an initial read it looks good, here are a few initial comments & thoughts.

package/MDAnalysis/coordinates/RDKit.py Show resolved Hide resolved
package/MDAnalysis/coordinates/RDKit.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/RDKit.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/RDKit.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/RDKit.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/RDKit.py Show resolved Hide resolved
package/MDAnalysis/coordinates/RDKit.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/RDKit.py Outdated Show resolved Hide resolved
@richardjgowers
Copy link
Member

richardjgowers commented Jun 24, 2020 via email

@cbouy
Copy link
Member Author

cbouy commented Jun 24, 2020

Would it be easier to call the conversion method rdkit() or to_rdkit()? The reason I'm asking is that one could use code autocompletion to quickly see the conversion methods available, while convert_to() requires to look at the documentation (which might get quite hairy if multiple conversions are supported?). It's also less typing. ;)

I decided to follow what was already there for the parmed converter, but yes I completely agree that a to_rdkit method would be very nice, although I guess it should be in another PR so that we can add the to_parmed as well.
I prefer the pandas style to_rdkit so that you can quickly see which converters are available by autocompleting after typing to_.

@cbouy
Copy link
Member Author

cbouy commented Jun 24, 2020

The cis/trans is rotating, I’m guessing this is because we’re not labelling the bond?

Yes, although atom and bond stereochemistry can be inferred from the coordinates with Chem.AssignStereochemistryFrom3D and Chem.DetectBondStereoChemistry. But I'm not reading the coordinates yet, there's still a lot to do on the topology conversion first.

@RMeli
Copy link
Member

RMeli commented Jun 24, 2020

I decided to follow what was already there for the parmed converter, but yes I completely agree that a to_rdkit method would be very nice, although I guess it should be in another PR so that we can add the to_parmed as well.

I wasn't aware of the syntax for the parmed converter. +1 for a separate PR changing both and I also prefer the pandas style for the reason you mentioned.

@IAlibay
Copy link
Member

IAlibay commented Jun 24, 2020

I wasn't aware of the syntax for the parmed converter. +1 for a separate PR changing both and I also prefer the pandas style for the reason you mentioned.

pinging @lilyminium who will know better here, but I don't think there's a specific convention yet. Unless we have multiple converters already using the same naming style, I can't see a reason why we can't try to agree on a good naming scheme now.

@cbouy it might be a good idea to bring this up an issue on this. As we have more and more "converters" it would be good to have a unified API.

@lilyminium
Copy link
Member

I like the current model of the metaclass registering the format. We could do a similar thing to TopologyAttrs though, and get the metaclass to add methods to AtomGroups. e.g.

def __init__(cls, name, bases, classdict):
    method_name = f'to_{cls.lib.lower()}'
    setattr(AtomGroup, method_name, cls.convert)
    setattr(Universe, method_name, cls.convert)

The to_lib() method would show up in dir() etc and I think a fix to docstrings in sphinx is in the works.

This would require ParmEdWriter to change too, though, so it's probably best as a separate issue @cbouy

@cbouy cbouy mentioned this pull request Jun 25, 2020
@cbouy
Copy link
Member Author

cbouy commented Jun 25, 2020

Currently I'm automatically guessing the element attribute as well as bonds if they are not present, which is not necessarily a good idea, even if we are warning the user.
However making the converter fail on such cases would require them to add the missing attributes themselves, but they might not be too familiar with the topology API (although it's very well documented).
Would it be acceptable to add guess_elements=False and guess_bonds=False to the convert method, and pass these through **kwargs in the convert_to method of the AtomGroup ? So users have a simple way of doing it ?

@IAlibay
Copy link
Member

IAlibay commented Jun 25, 2020

Currently I'm automatically guessing the element attribute as well as bonds if they are not present, which is not necessarily a good idea, even if we are warning the user.
However making the converter fail on such cases would require them to add the missing attributes themselves, but they might not be too familiar with the topology API (although it's very well documented).
Would it be acceptable to add guess_elements=False and guess_bonds=False to the convert method, and pass these through **kwargs in the convert_to method of the AtomGroup ? So users have a simple way of doing it ?

So my worry here is that there are many ways in which elements can be guessed and hopefully in the future we'll want to provide guessers from things like mass, etc... instead of just atom names. Accounting for all of this should be outside the scope of the method in my opinion. Indeed, users not being aware of the API, and especially the assumptions our guessers make, might lead to a lot more unnoticed errors that will strongly impede scientific reproducibility.

For elements, I think it might be a better call to instead give an error message that points users to the right part of the documentation / user guide (I recently wrote an addition to the guessers docstring to show how one could guess elements [and the dangers of doing so], I could probably port that to the user guide).

Bonds are a different issue in my opinion, guessing should be rather consistent, but we should warn users that we are doing a guess.

@fiona-naughton, @richardjgowers might have different opinions here.

I'm also going to ping @jbarnoud and @lilyminium who have been very involved in the element guessing discussions.

@cbouy cbouy mentioned this pull request Jun 26, 2020
5 tasks
Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the one comment re: ordering of the docstring (plus maybe an extra line to clarify), otherwise assuming tests pass lgtm!

@richardjgowers and @fiona-naughton, ideally we should aim to get this merged soon. Could you have a look/final approval before then? Depending on how things go I'll squash merge on Monday evening.

package/MDAnalysis/coordinates/RDKit.py Outdated Show resolved Hide resolved
@IAlibay
Copy link
Member

IAlibay commented Aug 14, 2020

Just restarted the azure-pipelines CI, should be fixed now.
@cbouy looks like docs building is failing: "/home/travis/build/MDAnalysis/mdanalysis/package/MDAnalysis/coordinates/RDKit.py:docstring of MDAnalysis.coordinates.RDKit:36:Duplicate explicit target name: "rdkit""

Could you look into this & build things locally to see if there's any further issues?

@cbouy
Copy link
Member Author

cbouy commented Aug 14, 2020

@IAlibay should be good now 🤞

@fiona-naughton
Copy link
Contributor

Looks good to me!

@IAlibay IAlibay merged commit 6440873 into MDAnalysis:develop Aug 21, 2020
@IAlibay
Copy link
Member

IAlibay commented Aug 21, 2020

So somehow I managed to mess up the squash-merge on this, which has led to a couple of things:

  1. the merge commit didn't attribute to @cbouy
  2. the merge log defaulted to some non-sensical list of commits (which is especially saddening having spent a good while writing something nice 😢 )
  3. @cbouy now has 94 commits (see: https://github.com/MDAnalysis/mdanalysis/graphs/contributors)

@orbeckst
Copy link
Member

Were all of @cbouy 's previous merges squashed, i.e., would it be safe to remove the small commits visible in the history (eg Aug 14 in https://github.com/MDAnalysis/mdanalysis/commits/develop ) ? Answer: I think yes, because these are the same commits as the ones in this PR.

It seems that GitHub double-merged this PR, once as squash+merge, the other as full merge.

@IAlibay
Copy link
Member

IAlibay commented Aug 21, 2020

It seems that GitHub double-merged this PR, once as squash+merge, the other as full merge.

It does seem to be the case, although how this can be possible I have to admit goes way beyond my understanding of how git works.

@orbeckst
Copy link
Member

I think the easiest fix will be the following: First

git reset --hard 2f0381e58328e4b2d684315fa5721a6de9945512
git push -f 

This will reset develop.

It seems that 6440873 is actually a merge and not a squash, so not a double merge/squash.

We the need to work with the cbouy:rdkit-converter (or simply the diff 50cd6e7:6440873eeefa12078d07fc1eed61e2201df63841 ). I can then manually squash the diff.

@IAlibay could you please write the log message and post it here? Sorry to make you do the work again.

@orbeckst
Copy link
Member

orbeckst commented Aug 21, 2020

To recover this branch I do (after git pull and before resetting)

git checkout -b rdkit-converter 50cd6e756336e7e0e41048e7a0d09a6af06bb841   # anchor at branch point of @cbouy's branch
git pull https://github.com/cbouy/mdanalysis.git rdkit-converter
git rebase -i 50cd6e756336e7e0e41048e7a0d09a6af06bb841    # will do this with search+replace, maybe a smarter way is possible

Once I have your commit message I will amend the commit message and make sure that @cbouy is the author. Then I can merge it into develop and push.

Then I'll restore branch protection.

Sounds ok? Going for lunch now and continue in ~1.5h...

@IAlibay
Copy link
Member

IAlibay commented Aug 21, 2020

A slightly abbreviated version of the log message I originally wrote:

Towards #2468 (RDKit interoperability GSOC project)

## Overview of work done in PR
This PR adds the `RDKitConverter` class which converts MDAnalysis
`Universe`/`Atomgroup` objects to `RDKit rdchem.Mol` objects.

## Limitations
  - Bonds and elements must be present (the former will be inferred if not
    present).
  - This converter mainly aims at supporting cases where explicit hydrogens
    are present.

## Extra implementation details
  - Bond order and formal charges are inferred via atomic valencies & the
    number of unpaired electrons (see `_infer_bo_and_charges`).
  - In part due to the influence of atom read order on inferring, bond
    conjugation and functional groups are standardized using SMARTS
    reactions (see `_standardize_patterns`).

## Other changes
Also includes some PEP8 changes and some cleaning up of the tests for the
`RDKITReader`.

## References
For more information on the development process for this PR, see:
1) https://cbouy.github.io/blog/2020/07/01/rdkit-converter
2) https://cbouy.github.io/blog/2020/07/22/rdkit-converter-part2

@orbeckst
Copy link
Member

... it's not easy to rebase anything with lots of merges ... it certainly does not "just do it".

@MDAnalysis/coredevs I am going to reset develop now,

git reset --hard 2f0381e58328e4b2d684315fa5721a6de9945512
git push -f 

which will remove this PR but will still show this PR a merged. I am working on getting the code in this PR into a compact, mergeable state (i.e., not 100 commits). If anyone has any good ideas I am all ears.

@IAlibay
Copy link
Member

IAlibay commented Aug 21, 2020

I was discussing this with a friend and I think we came up with a simple-ish way of doing this:

  1. git rebase -i 50cd6e756336e7e0e41048e7a0d09a6af06bb841
  2. drop the 90 commits at the bottom of the list that belong to this PR
  3. git remote add rdkit [email protected]:cbouy/mdanalysis.git
  4. git merge --squash rdkit/rdkit-converter
  5. Fix the ~ 6 conflicts that pop up (mainly stuff relating to Yuxuan's latest PR)
  6. git commit --amend to change the author

I think that might work (at least the git log seems reasonable?) although I'm not sure I properly managed to transfer the author.

Edit: steps 1 & 2 made sense before the history re-write, I guess 4-6 might do the trick though.

@orbeckst
Copy link
Member

I just rewrote the develop history and removed all signs of PR #2775

@orbeckst
Copy link
Member

I'll try merge --squash. I've been looking at cherry-picking only @cbouy 's commits but there are also merges that mess it up.

I'll try your idea.

@orbeckst
Copy link
Member

With the clean develop

git checkout develop
git pull
git checkout -b rdkit-converter
git pull --squash https://github.com/cbouy/mdanalysis.git rdkit-converter

Looking good:

From https://github.com/cbouy/mdanalysis
 * branch                rdkit-converter -> FETCH_HEAD
Removing package/doc/sphinx/source/documentation_pages/topology/RDKitParser.rst
Auto-merging package/CHANGELOG
Squash commit -- not updating HEAD
Automatic merge went well; stopped before committing as requested
$ git status
On branch rdkit-converter
Cherry-pick currently in progress.
  (run "git cherry-pick --continue" to continue)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	modified:   package/CHANGELOG
	modified:   package/MDAnalysis/coordinates/RDKit.py
	modified:   package/doc/sphinx/source/conf.py
	modified:   package/doc/sphinx/source/documentation_pages/converters.rst
	new file:   package/doc/sphinx/source/documentation_pages/converters/RDKitParser.rst
	deleted:    package/doc/sphinx/source/documentation_pages/topology/RDKitParser.rst
	modified:   package/doc/sphinx/source/documentation_pages/topology_modules.rst
	modified:   testsuite/MDAnalysisTests/coordinates/test_rdkit.py

Check with

git diff --cached

Then commit with @IAlibay 's log message and @cbouy as the author and create PR #2916

git commit --author="Cédric Bouysset <[email protected]>"
git push --set-upstream origin rdkit-converter

Please review PR #2916. I wanted to make sure that CI is still ok.

IAlibay added a commit that referenced this pull request Aug 23, 2020
Towards #2468 (RDKit interoperability GSOC project)

PR #2916 -> Originally PR #2775

##Overview of work done in PR

This PR adds the RDKitConverter class which converts MDAnalysis
Universe/Atomgroup objects to RDKit rdchem.Mol objects.

##Limitations

  - Bonds and elements must be present (the former will be inferred if not
     present).
  - This converter mainly aims at supporting cases where explicit hydrogens
     are present.

##Extra implementation details

  - Bond order and formal charges are inferred via atomic valencies & the
     number of unpaired electrons (see _infer_bo_and_charges).
   - In part due to the influence of atom read order on inferring, bond
      conjugation and functional groups are standardized using SMARTS
      reactions (see _standardize_patterns).

##Other changes

Also includes some PEP8 changes and some cleaning up of the tests for the
RDKITReader.

##References

For more information on the development process for this PR, see:
https://cbouy.github.io/blog/2020/07/01/rdkit-converter
https://cbouy.github.io/blog/2020/07/22/rdkit-converter-part2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants