Direct integration of common docking output files (#12)

- Adds the `Molecule.from_rdkit` classmethod to easily prepare RDKit molecules for ProLIF - Adds `prolif.Molecule` suppliers for common docking output files (PDBQT, SDF, MOL2) - Adds a new method (`Fingerprint.run_from_iterable` ) that computes a fingerprint using one of those suppliers
chemosim-lab · Feb 2, 2021 · 4231d29 · 4231d29
1 parent 67d5d69
commit 4231d29
Show file tree

Hide file tree

Showing 10 changed files with 3,550 additions and 104 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,17 +6,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 ### Added
-- Zenodo to automatically generate a DOI for new releases
+- Integration with Zenodo to automatically generate a DOI for new releases
 - Citation page
-- PDBQT section in the "How-to" notebook
-- Example PDBQT files from Vina in the data folder
+- Docking section in the Quickstart notebook (Issue #11)
+- PDBQT, MOL2 and SDF molecule suppliers to make it easier for users to use docking
+  results as input (Issue #11)
+- `Molecule.from_rdkit` classmethod to easily prepare RDKit molecules for ProLIF
 ### Changed
 - The visualisation notebook now displays the protein with py3Dmol. Some examples for
   creating and displaying a graph from the interaction dataframe have been added
 - Updated the installation instructions to show how to install a specific release
 - The previous repr method of `ResidueId` was easy to confuse with a string, especially
   when trying to access the `Fingerprint.ifp` results by string. The new repr method is
   now more explicit.
+- Added the `Fingerprint.run_from_iterable` method, which uses the new supplier functions
+  to quickly generate a fingerprint.
+- Sorted the output of `Fingerprint.list_available`
 ### Deprecated
 ### Removed
 ### Fixed

diff --git a/docs/notebooks/how-to.ipynb b/docs/notebooks/how-to.ipynb
@@ -6,7 +6,15 @@
    "source": [
     "# How-to\n",
     "\n",
-    "This notebook serves as a practical guide to common questions users might have."
+    "This notebook serves as a practical guide to common questions users might have.\n",
+    "\n",
+    "**Table of content**\n",
+    "\n",
+    "- [Changing the parameters for an interaction](#Changing-the-parameters-for-an-interaction)\n",
+    "- [Writing your own interaction](#Writing-your-own-interaction)\n",
+    "- [Working with docking poses instead of MD simulations](#Working-with-docking-poses-instead-of-MD-simulations)\n",
+    "- [Using PDBQT files](#Using-PDBQT-files)\n",
+    "\n"
    ]
   },
   {
@@ -17,6 +25,7 @@
    "source": [
     "import MDAnalysis as mda\n",
     "import prolif as plf\n",
+    "import pandas as pd\n",
     "import numpy as np"
    ]
   },
@@ -37,7 +46,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## 1. Changing the parameters for an interaction"
+    "## Changing the parameters for an interaction"
    ]
   },
   {
@@ -81,7 +90,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 1.a Overwriting the original interaction"
+    "### Overwriting the original interaction"
    ]
   },
   {
@@ -136,7 +145,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 1.b Reparameterizing an interaction with another name"
+    "### Reparameterizing an interaction with another name"
    ]
   },
   {
@@ -183,7 +192,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## 2. Writing your own interaction"
+    "## Writing your own interaction"
    ]
   },
   {
@@ -339,78 +348,93 @@
    ]
   },
   {
-   "cell_type": "markdown",
-   "metadata": {},
    "source": [
-    "## 3. Using PDBQT files\n",
+    "## Working with docking poses instead of MD simulations\n",
     "\n",
-    "The typical use case here is getting the IFP from AutoDock Vina's output. It requires a few additional steps and informations compared to other formats like MOL2, since the PDBQT format gets rid of most hydrogen atoms and doesn't contain bond order information.\n",
+    "ProLIF currently provides file readers for MOL2, SDF and PDBQT files. The API is slightly different compared to the quickstart example but the end result is the same.\n",
     "\n",
-    "The prerequisites for a successfull usage of ProLIF in this case is having external files (typically in the MOL2 format) that contain bond orders and formal charges for your ligand and protein, or at least a PDB file with explicit hydrogen atoms. \n",
+    "Please note that this part of the tutorial is only suitable for interactions between one protein and several ligands, or in more general terms, between one molecule with multiple residues and one molecule with a single residue. This is not suitable for protein-protein or DNA-protein interactions.\n",
     "\n",
-    "Please note that your PDBQT input must have a single model per file. Splitting a multi-model file can be done using the `vina_split` command-line tool that comes with AutoDock Vina: `vina_split --input vina_output.pdbqt`\n",
-    "\n",
-    "Let's start by loading our \"template\" file with bond orders. It can be a SMILES string, MOL2, SDF file or anything supported by RDKit."
-   ]
+    "Let's start by loading the protein. Here I'm using a PDB file but you can use any format supported by MDAnalysis as long as it contains explicit hydrogens."
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
-    "from rdkit import Chem\n",
-    "from rdkit.Chem import AllChem\n",
-    "\n",
-    "template = Chem.MolFromSmiles(\"C[NH+]1CC(C(=O)NC2(C)OC3(O)C4CCCN4C(=O)C\"\n",
-    "                              \"(Cc4ccccc4)N3C2=O)C=C2c3cccc4[nH]cc(c34)CC21\")\n",
-    "template"
+    "# load protein\n",
+    "prot = mda.Universe(plf.datafiles.datapath / \"vina\" / \"rec.pdb\")\n",
+    "prot = plf.Molecule.from_mda(prot)\n",
+    "prot.n_residues"
    ]
   },
   {
+   "source": [
+    "### Using an SDF file"
+   ],
    "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
    "metadata": {},
+   "outputs": [],
    "source": [
-    "Next, we'll define a function that loads a PDBQT file and assigns bond orders and charges using the template. The template and  PDBQT file must have the exact same atoms (even hydrogens) otherwise no match will be found. Since PDBQT files partially keep the hydrogen atoms, we have the choice between:\n",
+    "# load ligands\n",
+    "path = str(plf.datafiles.datapath / \"vina\" / \"vina_output.sdf\")\n",
+    "lig_suppl = list(plf.sdf_supplier(path))\n",
+    "# generate fingerprint\n",
+    "fp = plf.Fingerprint()\n",
+    "fp.run_from_iterable(lig_suppl, prot)\n",
+    "df = fp.to_dataframe()\n",
+    "df"
+   ]
+  },
+  {
+   "source": [
+    "Please note that converting the `lig_suppl` to a list is optionnal (and maybe not suitable for large files) as it will load all the ligands in memory, but it's nicer to track the progression with the progress bar.\n",
     "\n",
-    "* Manually selecting where to add the hydrogens on the template, do the matching, then add the remaining hydrogens\n",
-    "* Or just remove the hydrogens from the PDBQT file, do the matching, then add all hydrogens.\n",
+    "If you want to calculate the Tanimoto similarity between your docked poses and a reference ligand, here's how to do it.\n",
     "\n",
-    "This last option will delete the coordinates of your hydrogens atoms and replace them by the ones generated by RDKit, but unless you're working with an exotic system this should be fine."
-   ]
+    "We first need to generate the interaction fingerprint for the reference, and concatenate it to the previous one"
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
-    "mda_to_rdkit = mda._CONVERTERS[\"RDKIT\"]().convert\n",
-    "\n",
-    "def pdbqt_to_rdkit(pdbqt, template):\n",
-    "    pdbqt = mda.Universe(pdbqt)\n",
-    "    # set attributes needed by the converter\n",
-    "    elements = [mda.topology.guessers.guess_atom_element(x)\n",
-    "                for x in pdbqt.atoms.names]\n",
-    "    pdbqt.add_TopologyAttr(\"elements\", elements)\n",
-    "    pdbqt.add_TopologyAttr(\"chainIDs\", pdbqt.atoms.segids)\n",
-    "    pdbqt.atoms.types = pdbqt.atoms.elements\n",
-    "    # convert without infering bond orders and charges\n",
-    "    mol = mda_to_rdkit(pdbqt.atoms, NoImplicit=False)\n",
-    "    # assign BO from template then add hydrogens\n",
-    "    mol = Chem.RemoveHs(mol, sanitize=False)\n",
-    "    mol = AllChem.AssignBondOrdersFromTemplate(template, mol)\n",
-    "    mol = Chem.AddHs(mol, addCoords=True, addResidueInfo=True)\n",
-    "    return mol\n",
-    "\n",
-    "pdbqt_to_rdkit(plf.datafiles.datapath / \"vina/vina_output_ligand_1.pdbqt\", template)"
+    "# load the reference\n",
+    "ref = mda.Universe(plf.datafiles.datapath / \"vina\" / \"lig.pdb\")\n",
+    "ref = plf.Molecule.from_mda(ref)\n",
+    "# generate IFP\n",
+    "fp.run_from_iterable([ref], prot)\n",
+    "df0 = fp.to_dataframe()\n",
+    "df0.rename({0: \"ref\"}, inplace=True)\n",
+    "# drop the ligand level on both dataframes\n",
+    "df0.columns = df0.columns.droplevel(0)\n",
+    "df.columns = df.columns.droplevel(0)\n",
+    "# concatenate them\n",
+    "df = (pd.concat([df0, df])\n",
+    "        .fillna(False)\n",
+    "        .sort_index(axis=1, level=0,\n",
+    "                    key=lambda index: [plf.ResidueId.from_string(x) for x in index]))\n",
+    "df"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Now our ligand is ready to be used. For the protein, there's usually no need to load the PDBQT that was used by Vina. The original file that was used to generate the PDBQT can be used directly, but **it must contain explicit hydrogen atoms**:"
+    "Lastly, we can convert the dataframe to a list of RDKit bitvectors to finally compute the Tanimoto similarity between our reference pose and the docking poses generated by Vina:"
    ]
   },
   {
@@ -419,59 +443,59 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "prot = mda.Universe(plf.datafiles.datapath / \"vina/rec.pdb\")\n",
-    "prot = plf.Molecule.from_mda(prot)\n",
-    "prot.GetNumHeavyAtoms()"
+    "from rdkit import DataStructs\n",
+    "\n",
+    "bvs = plf.to_bitvectors(df)\n",
+    "for i, bv in enumerate(bvs[1:]):\n",
+    "    tc = DataStructs.TanimotoSimilarity(bvs[0], bv)\n",
+    "    print(f\"{i}: {tc:.3f}\")"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We are now ready to compute the IFP for our docking poses (we will also do it for the reference structure):"
+    "Interestingly, the best scored docking pose (#0) isn't the most similar to the reference (#5)"
    ]
   },
+  {
+   "source": [
+    "### Using a MOL2 file\n",
+    "\n",
+    "The input mol2 file can contain multiple ligands in different conformations."
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
+    "# load ligands\n",
+    "path = plf.datafiles.datapath / \"vina\" / \"vina_output.mol2\"\n",
+    "lig_suppl = list(plf.mol2_supplier(path))\n",
+    "# generate fingerprint\n",
     "fp = plf.Fingerprint()\n",
-    "ifp = []\n",
-    "\n",
-    "# equivalent to what happens in the fp.run method\n",
-    "def run_fp(lig, prot, frame=0):\n",
-    "    data = {\"Frame\": frame}\n",
-    "    for lres in lig:\n",
-    "        for presid in plf.get_residues_near_ligand(lres, prot):\n",
-    "            bv = fp.bitvector(lres, prot[presid])\n",
-    "            data[(lres.resid, presid)] = bv\n",
-    "    return data\n",
-    "\n",
-    "# original structure for reference\n",
-    "lig = mda.Universe(plf.datafiles.datapath / \"vina/lig.pdb\")\n",
-    "lig = plf.Molecule.from_mda(lig)\n",
-    "data = run_fp(lig, prot, frame=0)\n",
-    "ifp.append(data)\n",
-    "\n",
-    "# docking poses\n",
-    "pdbqt_files = sorted((plf.datafiles.datapath / \"vina\").glob(\"*.pdbqt\"))\n",
-    "for i, pdbqt in enumerate(pdbqt_files, start=1):\n",
-    "    lig = pdbqt_to_rdkit(pdbqt, template)\n",
-    "    lig = plf.Molecule(lig)\n",
-    "    data = run_fp(lig, prot, frame=i)\n",
-    "    ifp.append(data)\n",
-    "\n",
-    "df = plf.to_dataframe(ifp, fp.interactions.keys())\n",
+    "fp.run_from_iterable(lig_suppl, prot)\n",
+    "df = fp.to_dataframe()\n",
     "df"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Lastly, we can now compute the Tanimoto similarity between our reference pose and the docking poses generated by Vina:"
+    "### Using PDBQT files\n",
+    "\n",
+    "The typical use case here is getting the IFP from AutoDock Vina's output. It requires a few additional steps and informations compared to other formats like MOL2, since the PDBQT format gets rid of most hydrogen atoms and doesn't contain bond order information.\n",
+    "\n",
+    "The prerequisites for a successfull usage of ProLIF in this case is having external files that contain bond orders and formal charges for your ligand (like SMILES, SDF or MOL2), or at least a file with explicit hydrogen atoms. \n",
+    "\n",
+    "Please note that your PDBQT input must have a single model per file (this is required by MDAnalysis). Splitting a multi-model file can be done using the `vina_split` command-line tool that comes with AutoDock Vina: `vina_split --input vina_output.pdbqt`\n",
+    "\n",
+    "Let's start by loading our \"template\" file with bond orders. It can be a SMILES string, MOL2, SDF file or anything supported by RDKit."
    ]
   },
   {
@@ -480,24 +504,26 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "from rdkit import DataStructs\n",
+    "from rdkit import Chem\n",
+    "from rdkit.Chem import AllChem\n",
     "\n",
-    "bvs = plf.to_bitvectors(df)\n",
-    "DataStructs.BulkTanimotoSimilarity(bvs[0], bvs[1:])"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Interestingly, the best scored docking pose (#1) isn't the most similar to the reference (#6)!"
+    "template = Chem.MolFromSmiles(\"C[NH+]1CC(C(=O)NC2(C)OC3(O)C4CCCN4C(=O)C\"\n",
+    "                              \"(Cc4ccccc4)N3C2=O)C=C2c3cccc4[nH]cc(c34)CC21\")\n",
+    "template"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## 4. Ignoring the protein backbone when computing interactions"
+    "Next, we'll use the PDBQT supplier which loads each file from a list of paths, and assigns bond orders and charges using the template. The template and PDBQT file must have the exact same atoms, **even hydrogens**, otherwise no match will be found. Since PDBQT files partially keep the hydrogen atoms, we have the choice between:\n",
+    "\n",
+    "* Manually selecting where to add the hydrogens on the template, do the matching, then add the remaining hydrogens (not covered here)\n",
+    "* Or just remove the hydrogens from the PDBQT file, do the matching, then add all hydrogens.\n",
+    "\n",
+    "This last option will delete the coordinates of your hydrogens atoms and replace them by the ones generated by RDKit, but unless you're working with an exotic system this should be fine.\n",
+    "\n",
+    "For the protein, there's usually no need to load the PDBQT that was used by Vina. The original file that was used to generate the PDBQT can be used directly, but **it must contain explicit hydrogen atoms**:"
    ]
   },
   {
@@ -506,7 +532,14 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# not implemented yet"
+    "# load ligands\n",
+    "pdbqt_files = sorted(plf.datafiles.datapath.glob(\"vina/*.pdbqt\"))\n",
+    "lig_suppl = list(plf.pdbqt_supplier(pdbqt_files, template))\n",
+    "# generate fingerprint\n",
+    "fp = plf.Fingerprint()\n",
+    "fp.run_from_iterable(lig_suppl, prot)\n",
+    "df = fp.to_dataframe()\n",
+    "df"
    ]
   }
  ],
@@ -526,9 +559,9 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.5"
+   "version": "3.8.5-final"
   }
  },
  "nbformat": 4,
  "nbformat_minor": 2
-}
+}
diff --git a/docs/notebooks/quickstart.ipynb b/docs/notebooks/quickstart.ipynb
@@ -232,7 +232,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "You can also compute a Tanimoto similarity between each frames:"
+    "You can also compute a Tanimoto similarity between each frame:"
    ]
   },
   {
@@ -269,4 +269,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 2
-}
+}