Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empty dataframe #15

Closed
TheKipiDragon opened this issue Mar 9, 2021 · 33 comments
Closed

Empty dataframe #15

TheKipiDragon opened this issue Mar 9, 2021 · 33 comments

Comments

@TheKipiDragon
Copy link

Greetings.
I made a program following the instructions in the "How to" page, to obtain the descriptors of the interaction between ligand and protein, but all I get as a result are empty dataframes.
I'll attach the code that I'm using.
Code

Thank you for your attention.
Alberto Blanco, University Rovira i Virgili, Tarragona, Spain.

@cbouy
Copy link
Member

cbouy commented Mar 9, 2021

Hi @TheKipiDragon ,

I'm assuming that in your script rootdir corresponds to the prolif/data folder.
The vina subdirectory contains 2 PDB files (lig.pdb and rec.pdb) and 1 SDF file, so good ends up being equal to 3.
If you modify if file.endswith(".pdb"): with if file.endswith("rec.pdb"): your script should work fine!

Don't hesitate if there's another problem
Cédric

@TheKipiDragon
Copy link
Author

Sorry, I didnt specified the path.
rootdir corresponds to a directory in which I have several folders, that each include a pdb file for the protein and a sdf file for the ligand. What I try to do is go into each folder and get both files to execute prolif, thats why I used as a counter 2.
So the subdir variable is the path to the folder that contains both files.

@cbouy
Copy link
Member

cbouy commented Mar 10, 2021

Ok, then the problem probably comes from using plf.datafiles.datapath / subdir / protfile and plf.datafiles.datapath / subdir / ligfile when you're loading files. plf.datafiles.datapath points to the ProLIF installation dir (something like ~/miniconda3/envs/prolif/lib/python3.8/site-packages/prolif/data), so plf.datafiles.datapath / subdir / protfile should point to something that doesn't exist. But I'm surprised it doesn't even throw an error in that case.

Anyway, I would suggest you modify your script like this:

import pathlib
import prolif as plf
import MDAnalysis as mda

# path to your rootdir, containing all the folders with PDB and SDF files
rootdir = pathlib.Path("/home/cedric/projects/ProLIF/prolif/data")

for subdir in rootdir.glob("*"):
    # skip if subdir is not a directory
    if not subdir.is_dir():
        continue
    try:
       # search for pdb and sdf files in subdir
        protfile = next(subdir.glob("*.pdb"))
        ligfile = next(subdir.glob("*.sdf"))
    except StopIteration:
        # skip if subdir is missing either pdb or sdf file
        continue
    print(subdir.name, ligfile.name, protfile.name)
    # load protein
    prot = mda.Universe(protfile)
    prot = plf.Molecule.from_mda(prot)
    # load ligands
    lig_suppl = list(plf.sdf_supplier(str(ligfile)))
    # generate fingerprint
    fp = plf.Fingerprint()
    fp.run_from_iterable(lig_suppl, prot)
    df = fp.to_dataframe()
    print(df)

Tell me if that works

PS: if you want to share some code, put it between triple backticks like so:

```python
# Put your code in here
```

@TheKipiDragon
Copy link
Author

Alright, thank you!
I've tried it, but it still gives me empty dataframes. I've only modified

rootdir = pathlib.Path("/home/milax/Escriptori/test")

The terminal shows me the following:

Mpro-x10387_protein Mpro-x10387_ligand.sdf Mpro-x10387_protein.pdb 
/home/milax/anaconda3/envs/prolif/lib/python3.9/site-packages/MDAnalysis/coordinates/RDKit.py:492: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if isinstance(value, (float, np.float)):
/home/milax/anaconda3/envs/prolif/lib/python3.9/site-packages/MDAnalysis/coordinates/RDKit.py:494: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  elif isinstance(value, (int, np.int)):
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 38.67it/s]
Empty DataFrame
Columns: []
Index: [0]

Thank you, again, for all the help providen!

@cbouy
Copy link
Member

cbouy commented Mar 10, 2021

Ok, can you try to run this as well:

import MDAnalysis as mda
import prolif as plf

protfile = "path/to/Mpro-x10387_protein.pdb"
ligfile = "/path/to/Mpro-x10387_ligand.sdf"

prot = mda.Universe(protfile)
prot = plf.Molecule.from_mda(prot)
lig_suppl = plf.sdf_supplier(ligfile)
lig = next(lig_suppl)
pocket_residues = plf.get_residues_near_ligand(lig, prot, cutoff=6.0)
print(pocket_residues)

It should print a list of residues that are close to your ligand.

@TheKipiDragon
Copy link
Author

Alright, that works, but what I want to obtain are the descriptors for the ligand-protein interaction.
Is there any way I can obtain those, because the fingerprinting part didn't seem to work for me.

Thank you for your assistance!

@cbouy
Copy link
Member

cbouy commented Mar 11, 2021

Okay that's reassuring! Do both PDB and SDF file contain explicit hydrogens for all atoms ?
For the SDF file it's not problematic if it's not the case there's a quick workaround, but for the PDB file it's mandatory.

@TheKipiDragon
Copy link
Author

Yeah, they both contain the hydrogens.

Thanks for the help!

@cbouy
Copy link
Member

cbouy commented Mar 11, 2021

Mmh that's strange then... Have you tried running the quickstart tutorial ? Just to make sure that there's no problem with your installation.

If you get empty dataframes even with the tutorial, can you tell me which python version are you using, and the output of the following commands:

import prolif, MDAnalysis, rdkit
print("prolif", prolif.__version__, 
      "MDAnalysis", MDAnalysis.__version__, 
      "rdkit", rdkit.__version__)

If the tutorial worked, can you try running the code snippet from 2 messages ago (the one with the pocket_residues), then run the following:

fp = plf.Fingerprint()
for resid in pocket_residues:
    bv = fp.bitvector(lig, prot[resid])
    print(resid, bv)

If it prints some True values, then there's a bug somewhere in the fp.run_from_iterable method or when converting the results to a dataframe that I will track down.

@TheKipiDragon
Copy link
Author

I dont know if this prolif version is supposed to show like that, since i followed the tutorial to install it, if I installed it incorrectly, please do tell me.
This is from the first snippet:

prolif 0+unknown MDAnalysis 2.0.0-dev0 rdkit 2020.09.5

As for the second one, it didn't print any trues at all.

Thanks for all the help!

@cbouy
Copy link
Member

cbouy commented Mar 12, 2021

And if you run the quickstart tutorial (with the topology and trajectory available in plf.datafiles) do you get empty dataframes or not? If yes, can you also check if fp.ifp is empty?
Regarding the weird version string for prolif, I'll look into it

@TheKipiDragon
Copy link
Author

I was trying it, but when doing

fp.run(u.trajectory[::10], lig, prot)

I get the following error

Traceback (most recent call last):
  File "/home/milax/Escriptori/test/pfv2.py", line 25, in <module>
    fp.run(..., residues=["TYR38.A", "ASP129.A"])
TypeError: run() missing 2 required positional arguments: 'lig' and 'prot'

I copy pasted the code from the quickstart.

Thanks for the assistance!

@cbouy
Copy link
Member

cbouy commented Mar 12, 2021

It's actually the line just below that is failing: fp.run(..., residues=["TYR38.A", "ASP129.A"]) because you're supposed to replace ... with u.trajectory[::10], lig, prot I was just being lazy and wanted to put the emphasis on the residues argument but no worries we don't need it here.
You can stop at fp.run(u.trajectory[::10], lig, prot), then just do

df = fp.to_dataframe()
print(df)

If it's empty, also do print(fp.ifp)

@TheKipiDragon
Copy link
Author

Alright!
Tried it out, and dont get any empty dataframes, all seem to be working out fine!

@cbouy
Copy link
Member

cbouy commented Mar 15, 2021

Ok so the installation is working correctly which is reassuring. I have a few more ideas as to where the problem could come from, could you run the following code snippet please:

import MDAnalysis as mda
from rdkit import Chem
from rdkit.Chem import Draw
import prolif as plf

print(plf.__file__)

protfile = "path/to/Mpro-x10387_protein.pdb"
ligfile = "/path/to/Mpro-x10387_ligand.sdf"

# draw protein residues
prot = mda.Universe(protfile)
prot = plf.Molecule.from_mda(prot)
frags = []
for res in prot:
    mol = Chem.RemoveHs(res)
    mol.RemoveAllConformers()
    frags.append(mol)
img = Draw.MolsToGridImage(frags, legends=[str(res.resid) for res in prot], subImgSize=(200, 140), molsPerRow=5, maxMols=prot.n_residues)
img.save("prot.png")

# draw ligand
lig_suppl = plf.sdf_supplier(ligfile)
lig = next(lig_suppl)
lig = Chem.RemoveHs(lig)
lig.RemoveAllConformers()
img = Draw.MolToImage(lig, size=(500, 500))
img.save("lig.png")

The first print should show where prolif is installed, and include which version of prolif is installed (that's the only way to obtain it for now, until I find a way to fix it).
The rest of the script should save two images (prot and lig), could you have a look at them and make sure that there are no unexpected formal charges (typically (-) charges on carbons) for both the protein and the ligand ?

Thanks,
Cédric

@TheKipiDragon
Copy link
Author

Sorry for taking so long to answer!
Firstly, this is the path where prolif is installed

/home/milax/anaconda3/envs/prolif/lib/python3.9/site-packages/prolif/__init__.py

As for the images, I cant visualize them as I'm executing it in linux cmd. Also, it gave me a lot of lines as the following:

WARNING: not removing hydrogen atom without neighbors

And this error:

Traceback (most recent call last):
  File "/home/milax/Escriptori/test/pfv3.py", line 19, in <module>
    img = Draw.MolsToGridImage(frags, legends=[str(res.resid) for res in prot], subImgSize=(200, 140), molsPerRow=5, maxMols=prot.n_residues)
  File "/home/milax/anaconda3/envs/prolif/lib/python3.9/site-packages/rdkit/Chem/Draw/__init__.py", line 612, in MolsToGridImage
    return _MolsToGridImage(mols, molsPerRow=molsPerRow, subImgSize=subImgSize, legends=legends,
  File "/home/milax/anaconda3/envs/prolif/lib/python3.9/site-packages/rdkit/Chem/Draw/__init__.py", line 555, in _MolsToGridImage
    d2d.DrawMolecules(list(mols), legends=legends or None, highlightAtoms=highlightAtomLists,
Boost.Python.ArgumentError: Python argument types in
    MolDraw2D.DrawMolecules(MolDraw2DCairo, list)
did not match C++ signature:
    DrawMolecules(RDKit::MolDraw2D {lvalue} self, boost::python::api::object mols, boost::python::api::object highlightAtoms=None, boost::python::api::object highlightBonds=None, boost::python::api::object highlightAtomColors=None, boost::python::api::object highlightBondColors=None, boost::python::api::object highlightAtomRadii=None, boost::python::api::object confIds=None, boost::python::api::object legends=None)

Thanks for the help!

@cbouy
Copy link
Member

cbouy commented Mar 16, 2021

No worries take your time!
Regarding the version, I confused the path of installation with something else so it doesn't show the version in there, but I just realized you can simply run

pip list | grep prolif

in a bash shell and it should give you the full version string (something like prolif 0.3.1+2.gfb07487).

For the error with the images code, can you add the following import at the beginning of the script:
from rdkit.Chem.Draw import IPythonConsole
and run it again ? Hopefully it will fix the image issue

And I think the warning might be a clue as to why the fingerprint is empty but we'll see

@TheKipiDragon
Copy link
Author

Sorry for the wait again!
This is the version:

prolif          0.3.1+2.gfb07487

As for the image itself, i've gotten this error

Traceback (most recent call last):
  File "/home/milax/Escriptori/test/pfv3.py", line 21, in <module>
    img.save("prot.png")
AttributeError: 'Image' object has no attribute 'save'

Thanks for the help! I'll try not to take as long this time, sorry for any inconveniences again!

@cbouy
Copy link
Member

cbouy commented Mar 19, 2021

Okay so you're on the latest version.

For the error, try adding returnPNG=False inside Draw.MolsToGridImage as mentioned on this thread

@TheKipiDragon
Copy link
Author

I tried it, but still gives me the same error, 'Image' object has no attribute 'save'

@cbouy
Copy link
Member

cbouy commented Mar 22, 2021

Mmh ok let's try it slightly differently then:

import MDAnalysis as mda
from rdkit import Chem
from rdkit.Chem import Draw
import prolif as plf

protfile = "path/to/Mpro-x10387_protein.pdb"
ligfile = "/path/to/Mpro-x10387_ligand.sdf"

# draw protein residues
prot = mda.Universe(protfile)
prot = plf.Molecule.from_mda(prot)
for i in range(3):
    res = prot[i]
    mol = Chem.RemoveHs(res)
    mol.RemoveAllConformers()
    Draw.MolToImageFile(mol, f"{res.resid}.png", size=(500, 400))

# draw ligand
lig_suppl = plf.sdf_supplier(ligfile)
lig = next(lig_suppl)
lig = Chem.RemoveHs(lig)
lig.RemoveAllConformers()
Draw.MolToImageFile(lig, "lig.png", size=(500, 400))

This should save PNG images for the 3 first residues in your protein and 1 file for your ligand. Could you have a look at them and make sure that there are no unexpected formal charges (typically (-) charges on carbons) ?

Thanks!

@TheKipiDragon
Copy link
Author

Alright! It's working now, although it only prints one image file for the residues, not three (I've checked the protein, it's supposed to have 306 residues). The ligand structure is printed properly.

@cbouy
Copy link
Member

cbouy commented Mar 22, 2021

Okay I'm even more confused now 😆
What's the output of

import MDAnalysis as mda
import prolif as plf

protfile = "path/to/Mpro-x10387_protein.pdb"

prot = mda.Universe(protfile)
prot = plf.Molecule.from_mda(prot)
for i in range(10):
    res = prot[i]
    print(i, res.resid)

Actually is it possible for you to send me the protein file (my email is on my profile) ? Or is it sensitive data ? It will make it easier for me to debug as I'm starting to run out of ideas

@TheKipiDragon
Copy link
Author

This is the output of the code

0 ACE0.A
1 ACE0.A
2 ACE0.A
3 SER1.A
4 SER1.A
5 SER1.A
6 SER1.A
7 SER1.A
8 SER1.A
9 SER1.A

I dont think that the pdb is sensitive data, but I'll check with the director about it nonetheless.
Thanks for the help!

@cbouy
Copy link
Member

cbouy commented Apr 23, 2021

Hi @TheKipiDragon,

Just checking back on this issue, is your director ok with sharing the pdb file ? You could also just share a part of it, something like the first 5 residues should be enough for me to pinpoint the source of the problem. Just make sure the "reduced" file also reproduces the same behavior as the previous code snippet, i.e:

import MDAnalysis as mda
import prolif as plf

protfile = "path/to/first_5_residues.pdb"

prot = mda.Universe(protfile)
prot = plf.Molecule.from_mda(prot)
for i in range(5):
    res = prot[i]
    print(i, res.resid)

should output

0 ACE0.A
1 ACE0.A
2 ACE0.A
3 SER1.A
4 SER1.A
5 SER1.A

@soumyosen
Copy link

soumyosen commented Jun 3, 2021

Hi, I am recently trying to use prolif to get protein-protein interactions from a single pdb file. I also faced very similar problem, the dataframe is empty. All the test files are running properly, results as expected. I am going to share the 10 lines script. If you want, I can also give you the pdb file. I am not sure if the topology file is necessary for the analysis.

`import MDAnalysis as mda
import prolif as plf

u1 = mda.Universe("trial_part.pdb")
res1 = u1.select_atoms("protein and resid 63")
prot1 = u1.select_atoms("protein and not group res1", res1=res1)

fp = plf.Fingerprint()
fp.run(u1.trajectory[:], res1, prot1)
df = fp.to_dataframe()
df.head()
`

@cbouy
Copy link
Member

cbouy commented Jun 3, 2021

Hi @soumyosen,

I don't see any problem in your script, could you send me the pdb file ([email protected]) so that I can reproduce the bug and track down the issue ?

Thanks in advance!

@soumyosen
Copy link

Thank you @cbouy
I just sent you.

@cbouy
Copy link
Member

cbouy commented Jun 3, 2021

I managed to reproduce the issue, thanks!

Can you try this and tell me if it works ?:

u1 = mda.Universe("trial_part.pdb")

u1.atoms.guess_bonds()  # <-- fix bonds not being guessed properly from the pdb file

res1 = u1.select_atoms("protein and resid 63")
# ...

@soumyosen
Copy link

@cbouy Thanks it is working now. Just one more question. By default, I am supposed to get only the "True" results. Right?

@cbouy
Copy link
Member

cbouy commented Jun 3, 2021

Yes, by default fp.to_dataframe drops the columns where all values are False, but if you want to keep them you can use fp.to_dataframe(drop_empty=False)

@soumyosen
Copy link

OK, Thanks a lot for your time.

@akshay-sridhar
Copy link

I managed to reproduce the issue, thanks!

Can you try this and tell me if it works ?:

u1 = mda.Universe("trial_part.pdb")

u1.atoms.guess_bonds()  # <-- fix bonds not being guessed properly from the pdb file

res1 = u1.select_atoms("protein and resid 63")
# ...

Hello @cbouy . Thanks for this.
I just wanted to provide feedback to say I faced the same issue with my PDB file and your suggestion fixed it.

cbouy added a commit that referenced this issue Jul 15, 2021
@cbouy cbouy mentioned this issue Jul 15, 2021
cbouy added a commit that referenced this issue Aug 2, 2021
### Added
- Improved the documentation on how to properly restrict interactions to ignore the protein backbone (Issue #22), how to fix the empty dataframe issue when no bond information is present in the PDB file (Issue #15), how to save the LigNetwork diagram (Issue #21), and some clarifications on using `fp.generate`

### Fixed
- Mixing residue type with interaction type in the interactive legend of the LigNetwork would incorrectly display/hide some residues on the canvas
- MOL2 files starting with a comment (`#`) would lead to an error
@cbouy cbouy closed this as completed Aug 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants