Erratic .convert_to("RDKIT") behaviour #3235

haroldgrosjean · 2021-04-19T12:31:27Z

Hello,

I have to experience a rather odd behaviour from .convert_to("RDKIT") and I can't really make sense of it. Hopefully, someone will be able to help me here. I am trying to convert groups to rdkit objects which is done successfully most of the time. On some occasions, the following error appears:

File "/anaconda3/envs/mdcontacts/lib/python3.7/site-packages/MDAnalysis/core/groups.py", line 3188, in convert_to
    return converter().convert(self.atoms)
File "/anaconda3/envs/mdcontacts/lib/python3.7/site-packages/MDAnalysis/coordinates/RDKit.py", line 333, in convert
    xyz = ag.positions[idx].astype(float)
IndexError: index 18 is out of bounds for axis 0 with size 18

Annoyingly, the error appears not to be reproducible as different atom groups induce the error on different runs and/or iteration. For example, the code worked on the attached PDB (example.pdb) in the two first iterations but not on the third. When I repeat the code different atom groups can lead to failure while the one that precedently bugged is correctly processed.

PDB: example.pdb.txt

Any help resolving this issue would be greatly appreciated. Maybe @cbouy would be able to help?

Many thanks in advance for your help.

Expected behavior

import MDAnalysis as mda
pdb = 'example.pdb.txt'
ag = u.select_atoms('protein or resname LIG').convert_to("RDKIT")
lig = u.select_atoms('resname LIG').convert_to("RDKIT")
prot = u.select_atoms('protein').convert_to("RDKIT")
print(ag)
<rdkit.Chem.rdchem.Mol object at 0x7f92e7669850>
print(lig)
<rdkit.Chem.rdchem.Mol object at 0x7f92e76693f0>
print(prot)
<rdkit.Chem.rdchem.Mol object at 0x7f92e76696c0>

Actual behavior

import MDAnalysis as mda
pdb = 'example.pdb'
ag = u.select_atoms('protein or resname LIG').convert_to("RDKIT")
lig = u.select_atoms('resname LIG').convert_to("RDKIT")
Traceback (most recent call last):
  [...]
    lig = u.select_atoms('resname LIG').convert_to("RDKIT")
  File "anaconda3/envs/mdcontacts/lib/python3.7/site-packages/MDAnalysis/core/groups.py", line 3188, in convert_to
    return converter().convert(self.atoms)
  File "/anaconda3/envs/mdcontacts/lib/python3.7/site-packages/MDAnalysis/coordinates/RDKit.py", line 333, in convert
    xyz = ag.positions[idx].astype(float)
IndexError: index 18 is out of bounds for axis 0 with size 18

Code to reproduce the behavior

As explained above, the error seems unreproducible.

Current version of MDAnalysis

Which version are you using? (run python -c "import MDAnalysis as mda; print(mda.__version__)"): 2.0.0-dev0
Which version of Python (python -V)? Python 3.7.9
Which operating system? macOS Catalina v10.15.7 AND Ubuntu 18.04.5 LTS

The text was updated successfully, but these errors were encountered:

haroldgrosjean · 2021-04-19T14:06:42Z

Update:

Splitting the code in the following way seems to have resolved the problem:

ag_mda = u.select_atoms('protein or resname LIG')
ag = ag_mda.convert_to("RDKIT")
lig_mda = u.select_atoms('resname LIG')
lig = lig_mda.convert_to("RDKIT")
prot_mda = u.select_atoms('protein')
prot = prot_mda.convert_to("RDKIT")

I remain unsure about the origin of the bug.

Bests,

Harold

cbouy · 2021-04-19T15:53:43Z

Hi Harold,

I believe the bug you're experiencing is related to #2958 and is likely due to how the RDKitConverter caching system currently works. Basically, to avoid rebuilding the whole RDKit molecule from scratch while you iterate over timesteps in a trajectory, the converter has a caching system that will check if the id of the current atomgroup and that of the last converted atomgroup match, and if it does it will reuse the RDKit molecule linked to this id and just update the coordinates of atoms.
So I think what you're seeing is some rare occurrence of two different python objects (a previously converted atomgroup that doesn't exist anymore, and the one currently being converted) which end up with the same id. Because the topology/number of atoms is different, you end up with an index error.

The whole caching system has been reworked in #2942 (it will take into account the id of the Universe instead of the atomgroup, as well as the indices of selected atoms) so hopefully you won't see this error once this PR is merged.
As a workaround for now, you can completely bypass the cache system:

from functools import partial
mda_to_rdkit = partial(mda._CONVERTERS["RDKIT"]().convert, cache=False)
ag = mda_to_rdkit(u.select_atoms('protein or resname LIG'))
lig = mda_to_rdkit(u.select_atoms('resname LIG'))
prot = mda_to_rdkit(u.select_atoms('protein'))

Your solution (saving the atomgroup in a variable before converting it) also works as it will prevent the atomgroup (created from a selection) to be deleted/garbage-collected and so its id cannot be reused, but just to be safe I'd recommend using the cache=False solution.

Hope that answers your interrogations!
Best,
Cedric

haroldgrosjean · 2021-04-19T16:16:42Z

Hi Cedric,

Many thanks for your response! I'll try that right away. Also, what would be the impact of your solution on the speed of the code?

Bests,

Harold

cbouy · 2021-04-19T16:51:54Z

In the context of iterating over timesteps in a trajectory and doing the 3 conversions inside the loop, the impact will be pretty large (and increase with the size and complexity of the system) as you're rebuilding the whole RDKit molecule (and inferring bond orders/charges) every frame.
But even with the current develop branch and the trick that you use, you wouldn't benefit from the cache as it only keeps track of one single atomgroup at a time, so you'd have to break your code in 3 loops (one for each atomgroup in your example) if you really need to rely on caching.
The PR that I mentioned above will let you select how many atomgroups can be saved in the cache, in addition to fixing this issue, and I think it's somewhere on @IAlibay 's current todo list so it will likely be merged soon.

haroldgrosjean · 2021-04-19T20:04:39Z

Ok - great. Thanks for the help!
Please, keep us posted of future updates related to this issue!
Bonne soirée,
Harold

cbouy mentioned this issue Apr 19, 2021

RDKIT tests sometimes fails #2958

Closed

haroldgrosjean closed this as completed Apr 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Erratic .convert_to("RDKIT") behaviour #3235

Erratic .convert_to("RDKIT") behaviour #3235

haroldgrosjean commented Apr 19, 2021

haroldgrosjean commented Apr 19, 2021

cbouy commented Apr 19, 2021

haroldgrosjean commented Apr 19, 2021

cbouy commented Apr 19, 2021

haroldgrosjean commented Apr 19, 2021 •

edited

Loading

Erratic .convert_to("RDKIT") behaviour #3235

Erratic .convert_to("RDKIT") behaviour #3235

Comments

haroldgrosjean commented Apr 19, 2021

Expected behavior

Actual behavior

Code to reproduce the behavior

Current version of MDAnalysis

haroldgrosjean commented Apr 19, 2021

cbouy commented Apr 19, 2021

haroldgrosjean commented Apr 19, 2021

cbouy commented Apr 19, 2021

haroldgrosjean commented Apr 19, 2021 • edited Loading

haroldgrosjean commented Apr 19, 2021 •

edited

Loading