Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Erratic .convert_to("RDKIT") behaviour #3235

Closed
haroldgrosjean opened this issue Apr 19, 2021 · 5 comments
Closed

Erratic .convert_to("RDKIT") behaviour #3235

haroldgrosjean opened this issue Apr 19, 2021 · 5 comments

Comments

@haroldgrosjean
Copy link

Hello,

I have to experience a rather odd behaviour from .convert_to("RDKIT") and I can't really make sense of it. Hopefully, someone will be able to help me here. I am trying to convert groups to rdkit objects which is done successfully most of the time. On some occasions, the following error appears:

File "/anaconda3/envs/mdcontacts/lib/python3.7/site-packages/MDAnalysis/core/groups.py", line 3188, in convert_to
    return converter().convert(self.atoms)
File "/anaconda3/envs/mdcontacts/lib/python3.7/site-packages/MDAnalysis/coordinates/RDKit.py", line 333, in convert
    xyz = ag.positions[idx].astype(float)
IndexError: index 18 is out of bounds for axis 0 with size 18

Annoyingly, the error appears not to be reproducible as different atom groups induce the error on different runs and/or iteration. For example, the code worked on the attached PDB (example.pdb) in the two first iterations but not on the third. When I repeat the code different atom groups can lead to failure while the one that precedently bugged is correctly processed.

PDB: example.pdb.txt

Any help resolving this issue would be greatly appreciated. Maybe @cbouy would be able to help?

Many thanks in advance for your help.

Expected behavior

import MDAnalysis as mda
pdb = 'example.pdb.txt'
ag = u.select_atoms('protein or resname LIG').convert_to("RDKIT")
lig = u.select_atoms('resname LIG').convert_to("RDKIT")
prot = u.select_atoms('protein').convert_to("RDKIT")
print(ag)
<rdkit.Chem.rdchem.Mol object at 0x7f92e7669850>
print(lig)
<rdkit.Chem.rdchem.Mol object at 0x7f92e76693f0>
print(prot)
<rdkit.Chem.rdchem.Mol object at 0x7f92e76696c0>

Actual behavior

import MDAnalysis as mda
pdb = 'example.pdb'
ag = u.select_atoms('protein or resname LIG').convert_to("RDKIT")
lig = u.select_atoms('resname LIG').convert_to("RDKIT")
Traceback (most recent call last):
  [...]
    lig = u.select_atoms('resname LIG').convert_to("RDKIT")
  File "anaconda3/envs/mdcontacts/lib/python3.7/site-packages/MDAnalysis/core/groups.py", line 3188, in convert_to
    return converter().convert(self.atoms)
  File "/anaconda3/envs/mdcontacts/lib/python3.7/site-packages/MDAnalysis/coordinates/RDKit.py", line 333, in convert
    xyz = ag.positions[idx].astype(float)
IndexError: index 18 is out of bounds for axis 0 with size 18

Code to reproduce the behavior

As explained above, the error seems unreproducible.

Current version of MDAnalysis

  • Which version are you using? (run python -c "import MDAnalysis as mda; print(mda.__version__)"): 2.0.0-dev0
  • Which version of Python (python -V)? Python 3.7.9
  • Which operating system? macOS Catalina v10.15.7 AND Ubuntu 18.04.5 LTS
@haroldgrosjean
Copy link
Author

Update:

Splitting the code in the following way seems to have resolved the problem:

ag_mda = u.select_atoms('protein or resname LIG')
ag = ag_mda.convert_to("RDKIT")
lig_mda = u.select_atoms('resname LIG')
lig = lig_mda.convert_to("RDKIT")
prot_mda = u.select_atoms('protein')
prot = prot_mda.convert_to("RDKIT")

I remain unsure about the origin of the bug.

Bests,

Harold

@cbouy
Copy link
Member

cbouy commented Apr 19, 2021

Hi Harold,

I believe the bug you're experiencing is related to #2958 and is likely due to how the RDKitConverter caching system currently works. Basically, to avoid rebuilding the whole RDKit molecule from scratch while you iterate over timesteps in a trajectory, the converter has a caching system that will check if the id of the current atomgroup and that of the last converted atomgroup match, and if it does it will reuse the RDKit molecule linked to this id and just update the coordinates of atoms.
So I think what you're seeing is some rare occurrence of two different python objects (a previously converted atomgroup that doesn't exist anymore, and the one currently being converted) which end up with the same id. Because the topology/number of atoms is different, you end up with an index error.

The whole caching system has been reworked in #2942 (it will take into account the id of the Universe instead of the atomgroup, as well as the indices of selected atoms) so hopefully you won't see this error once this PR is merged.
As a workaround for now, you can completely bypass the cache system:

from functools import partial
mda_to_rdkit = partial(mda._CONVERTERS["RDKIT"]().convert, cache=False)
ag = mda_to_rdkit(u.select_atoms('protein or resname LIG'))
lig = mda_to_rdkit(u.select_atoms('resname LIG'))
prot = mda_to_rdkit(u.select_atoms('protein'))

Your solution (saving the atomgroup in a variable before converting it) also works as it will prevent the atomgroup (created from a selection) to be deleted/garbage-collected and so its id cannot be reused, but just to be safe I'd recommend using the cache=False solution.

Hope that answers your interrogations!
Best,
Cedric

@haroldgrosjean
Copy link
Author

Hi Cedric,

Many thanks for your response! I'll try that right away. Also, what would be the impact of your solution on the speed of the code?

Bests,

Harold

@cbouy
Copy link
Member

cbouy commented Apr 19, 2021

In the context of iterating over timesteps in a trajectory and doing the 3 conversions inside the loop, the impact will be pretty large (and increase with the size and complexity of the system) as you're rebuilding the whole RDKit molecule (and inferring bond orders/charges) every frame.
But even with the current develop branch and the trick that you use, you wouldn't benefit from the cache as it only keeps track of one single atomgroup at a time, so you'd have to break your code in 3 loops (one for each atomgroup in your example) if you really need to rely on caching.
The PR that I mentioned above will let you select how many atomgroups can be saved in the cache, in addition to fixing this issue, and I think it's somewhere on @IAlibay 's current todo list so it will likely be merged soon.

@haroldgrosjean
Copy link
Author

haroldgrosjean commented Apr 19, 2021

Ok - great. Thanks for the help!
Please, keep us posted of future updates related to this issue!
Bonne soirée,
Harold

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants