Handling of differences between entry-page PDB downloads and bulk PDB downloads #3

Closed
piehld opened this issue Nov 17, 2021 · 14 comments

@piehld

piehld commented Nov 17, 2021

I was going to create a PR, but there are some other differences in my fork that shouldn't get merged, so I'll just describe the changes here.

See this commit for reference: rcsb@48c83e4

  1. Add alternate field names for retrieving quality score metrics, as used in some files from bulk data download:

        c.write_scores(
            self.remarks.get('TSVMOD METHOD'), self.remarks.get('TSVMOD RMSD'),
            self.remarks.get('TSVMOD NO35'),
            self.remarks.get('GA341 SCORE', self.remarks.get('MODEL SCORE')),
            self.remarks.get('zDOPE SCORE', self.remarks.get('ZDOPE SCORE')),
            self.remarks.get('MPQS', self.remarks.get('MODPIPE QUALITY SCORE')))

  2. Add "UNK: X" to three_to_one dictionary
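
The fallback lookup in the first item can be sketched in isolation. This is only an illustration, not the code from the referenced commit: `first_present` is a hypothetical helper, and the two sample dictionaries are made up to mimic the entry-page and bulk-download remark names.

```python
def first_present(remarks, *names):
    """Return the value of the first name found in `remarks`, else None."""
    for name in names:
        if name in remarks:
            return remarks[name]
    return None


# Hypothetical remark dictionaries mimicking the two naming styles
entry_page = {'GA341 SCORE': '1', 'MPQS': '1.00278'}
bulk = {'MODEL SCORE': '1', 'MODPIPE QUALITY SCORE': '1.00278'}

for remarks in (entry_page, bulk):
    ga341 = first_present(remarks, 'GA341 SCORE', 'MODEL SCORE')
    mpqs = first_present(remarks, 'MPQS', 'MODPIPE QUALITY SCORE')
    print(ga341, mpqs)
```

Compared with nested `dict.get` calls, a helper like this keeps the list of accepted names flat, which may be easier to extend if further variants turn up.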

Some of the other changes in that commit still need to be discussed within our team, so you can disregard those.

Thank you

@benmwebb benmwebb self-assigned this Nov 18, 2021
@benmwebb
Member

Ah, the code to generate PDBs for bulk download has always been separate from that used in the website. I hadn't noticed that some of the remarks it generates are subtly different. For the time being I fixed the bulk download code to more closely resemble the website code (although a proper fix would be to use the same code in both cases of course). But I'll incorporate your change to the PDB-to-mmCIF conversion script.

I'm curious to know why you needed the UNK to X mapping though. Do you have an example model where things failed without it? I thought that ModBase didn't generate models containing UNK (it is supposed to map it to GLY instead IIRC) but I may be mistaken - there are certainly some old models in there.

@piehld
Author

piehld commented Nov 18, 2021

Thanks for incorporating the additions. I know the appearance of 'X' was occurring in the alignment files, but I can't recall if it ever occurred in a PDB. One example from the Arabidopsis thaliana bulk download is NP_001030619.1_1.ali.xml.

Another issue I ran into and addressed in that same commit referenced above was that in some cases the TEMPLATE BEGIN and/or TEMPLATE END contained an insertion code, such as in the file NP_001030766.1_1.pdb. I addressed this by creating a new attribute (as you can see in the commit), but we haven't added that to our MA dictionary yet. When we do, I'll create another ticket.
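
For illustration, a value that carries an insertion code can be split into its numeric part and the code. The sketch below assumes the code is a single letter appended directly to the residue number (for example `42A`); that layout is an assumption, not something checked against NP_001030766.1_1.pdb, and `split_resnum` is a hypothetical helper.

```python
import re

# Optional sign, digits, then an optional one-letter insertion code (assumed layout)
_RESNUM = re.compile(r'^\s*(-?\d+)([A-Za-z]?)\s*$')


def split_resnum(text):
    """Split a residue field such as '42A' into (42, 'A')."""
    match = _RESNUM.match(text)
    if match is None:
        raise ValueError("not a residue number: %r" % text)
    return int(match.group(1)), match.group(2)


print(split_resnum('42'))
print(split_resnum('42A'))
```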

@benmwebb
Member

Another issue I ran into and addressed in that same commit referenced above was that in some cases the TEMPLATE BEGIN and/or TEMPLATE END contained an insertion code

If I understand @brindakv correctly, the residue numbers in the template table are supposed to be label_seq_id and so don't have or need an insertion code. So my script isn't right at the moment, since it uses PDB residue numbers (auth_seq_id) to fill these in. I think the proper fix here is to pull the pdbx_poly_seq_scheme mapping from the mmCIF version of the template and get rid of author-provided template numbering entirely. But that's a separate issue.
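
A rough sketch of what such a lookup could look like, assuming the pdbx_poly_seq_scheme rows have already been read into dictionaries keyed by item name. The rows below are made up for illustration, `build_auth_to_label` is a hypothetical helper, and reading the category from the template's mmCIF file is left out.

```python
def build_auth_to_label(scheme_rows):
    """Map (chain, PDB residue number, insertion code) to label_seq_id."""
    mapping = {}
    for row in scheme_rows:
        ins_code = row['pdb_ins_code']
        if ins_code in ('.', '?'):  # mmCIF placeholders for "no value"
            ins_code = ''
        key = (row['pdb_strand_id'], int(row['pdb_seq_num']), ins_code)
        mapping[key] = int(row['seq_id'])
    return mapping


# Made-up rows: residue 42A is an insertion after residue 42
rows = [
    {'pdb_strand_id': 'A', 'pdb_seq_num': '41', 'pdb_ins_code': '.', 'seq_id': '1'},
    {'pdb_strand_id': 'A', 'pdb_seq_num': '42', 'pdb_ins_code': '.', 'seq_id': '2'},
    {'pdb_strand_id': 'A', 'pdb_seq_num': '42', 'pdb_ins_code': 'A', 'seq_id': '3'},
    {'pdb_strand_id': 'A', 'pdb_seq_num': '43', 'pdb_ins_code': '.', 'seq_id': '4'},
]
auth_to_label = build_auth_to_label(rows)
print(auth_to_label[('A', 42, 'A')])
```

Because seq_id is a plain consecutive index into the polymer sequence, the values looked up this way carry no insertion code.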

@piehld
Author

piehld commented Nov 18, 2021

Oh right, I see. That sounds like the appropriate method then, thanks!

@brindakv

@benmwebb is correct. The residue numbers have to follow the label_seq_id and not auth_seq_id. Therefore, no insertion codes.

@benmwebb
Member

I know the appearance of 'X' was occurring in the alignment files, but I can't recall if it ever occurred in a PDB. One example from the Arabidopsis thaliana bulk download is NP_001030619.1_1.ali.xml.

Ah, the alignment file contains some X residues in the template but we need the 3-letter name to output entity_poly_seq. I opened a separate issue (#5) for this.
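
The direction needed here is one-letter to three-letter. A minimal sketch under stated assumptions: `ONE_TO_THREE` and `to_three_letter` are hypothetical names, only the 20 standard residues plus X are covered, and `-` is assumed to be the alignment gap character. It is not the conversion script's actual table.

```python
# Standard one-letter codes plus X for an unknown residue
ONE_TO_THREE = {
    'A': 'ALA', 'C': 'CYS', 'D': 'ASP', 'E': 'GLU', 'F': 'PHE',
    'G': 'GLY', 'H': 'HIS', 'I': 'ILE', 'K': 'LYS', 'L': 'LEU',
    'M': 'MET', 'N': 'ASN', 'P': 'PRO', 'Q': 'GLN', 'R': 'ARG',
    'S': 'SER', 'T': 'THR', 'V': 'VAL', 'W': 'TRP', 'Y': 'TYR',
    'X': 'UNK',
}


def to_three_letter(sequence):
    """Convert an aligned one-letter sequence to a list of 3-letter names."""
    names = []
    for code in sequence:
        if code == '-':  # assumed gap character; gaps are not residues
            continue
        names.append(ONE_TO_THREE[code.upper()])
    return names


print(to_three_letter('GX-A'))
```

An unrecognized code raises `KeyError` here, which makes unexpected residues visible instead of silently dropping them.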

@benmwebb
Member

benmwebb commented Dec 6, 2021

FWIW, the latest code should handle most modern ModBase models correctly. I ran it on the entire A. thaliana dataset and it succeeded without issues. Besides the residue numbering and the handling of UNK that you noted, I also had to add handling of ASX and GLX in e63df25.

@piehld
Author

piehld commented Dec 6, 2021

@benmwebb Cool, thanks! I also noticed that you added reference DB details (e.g., to parse the SEQDB remarks in the PDB). Since the bulk PDB downloads don't currently contain that remark, that would be another motivation for eventually obtaining a tarball of pre-converted mmCIF models (for at least certain species of interest), once all the remaining kinks are addressed. Not a pressing concern now though, of course.

@benmwebb
Member

benmwebb commented Dec 6, 2021

I also noticed that you added reference DB details (e.g., to parse the SEQDB remarks in the PDB). Since the bulk PDB downloads don't currently contain that remark, that would be another motivation for eventually obtaining a tarball of pre-converted mmCIF models

The latest human PDB files do contain SEQDB remarks IIRC. I'll be updating the other recent bulk downloads over the next few weeks.

@piehld
Author

piehld commented Dec 6, 2021

Ah OK, I didn't realize the Human bulk dataset had been updated. I've still been using the previous bulk download (from Sept. 10, 2020). I'll check it out, thanks. I don't believe the latest Arabidopsis thaliana or Panicum virgatum bulk downloads contain that remark yet, though.

@benmwebb
Member

benmwebb commented Dec 8, 2021

All the bulk downloads for 2019 or later should now be updated to contain at least rudimentary SEQDB remarks.

@piehld
Author

piehld commented Dec 8, 2021

Hmm, I suppose the NCBI ID in the filename (or GENSCAN, etc.) can serve as an alternative, but I don't see any SEQDB remarks in the Arabidopsis models...

Just one random example, NP_001030614.1_1.pdb:

HEADER    ModPipe Model of NP_001030614.1         2021-05-1
TITLE     Model of SubName: Full=Phosphoglycerate mutase-like protein 
AUTHOR     URSULA PIEPER, BENJAMIN WEBB, EASHWAR NARAYANAN, ANDREJ SALI        
REMARK 220 EXPERIMENTAL DETAILS                                       
REMARK 220 EXPERIMENT TYPE: THEORETICAL MODEL                         
REMARK 220 METHOD: HOMOLOGY MODELING                                  
REMARK 220 PROGRAM: MODPIPE 2.0                                       
REMARK 220 SEQUENCE IDENTITY:           59                            
REMARK 220 MODEL SCORE:                 1                             
REMARK 220 MODPIPE QUALITY SCORE:       1.00278                       
REMARK 220 ZDOPE:                       -0.5                          
REMARK 220 EVALUE:                      0                             
REMARK 220 TSVMOD METHOD:               MSALL                         
REMARK 220 TSVMOD RMSD:                 1.954                         
REMARK 220 TSVMOD NO35:                 0.916                         
REMARK 220 TEMPLATE PDB:                3t7a                          
REMARK 220 TEMPLATE CHAIN:              A                             
REMARK 220 TARGET LENGTH:               1050                          
REMARK 220 TARGET BEGIN:                12                            
REMARK 220 TARGET END:                  337                           
REMARK 220 TEMPLATE BEGIN:              42                            
REMARK 220 TEMPLATE END:                360                           
REMARK 220 MODPIPE RUN:                 MW-a_thaliana                 
REMARK 220 MODPIPE MODEL ID:            fecfefe11db81b3ec93bb0a21cf412b4
REMARK 220 MODPIPE ALIGN ID:            fa92f743b30e909252356a4f9bb3a2b4
REMARK 220 MODPIPE SEQUENCE ID:         129809a15860e9a56d6df8c21a9af8b1MEMENGRS
REMARK 220 TOTAL NUMBER OF MODELS FOR THIS SEQUENCE:   2         
EXPDTA    THEORETICAL MODEL, MODELLER SVN 2021/04/09 18:25:24
REMARK   6 MODELLER OBJECTIVE FUNCTION:      1577.3577
REMARK   6 MODELLER BEST TEMPLATE % SEQ ID:  58.934
REMARK   6 GENERATED BY MODPIPE VERSION SVN.r1703
REMARK   6 SEQUENCE: 129809a15860e9a56d6df8c21a9af8b1MEMENGRS
REMARK   6 ALIGNMENT: fa92f743b30e909252356a4f9bb3a2b4.ali
REMARK   6 SCRIPT: 129809a15860e9a56d6df8c21a9af8b1MEMENGRS-models.py
REMARK   6 GA341 score: 1.00000
REMARK   6 DOPE score: -34999.62500
REMARK   6 DOPE-HR score: -30525.31055
REMARK   6 Normalized DOPE score: -0.49556
REMARK   6 TEMPLATE: 3t7aA 42:A - 360:A MODELS 12:A - 337:A AT 58.9%
ATOM      1  N   GLU A  12      -8.112  47.537 -13.517  1.00 82.32           N
ATOM      2  CA  GLU A  12      -8.377  47.241 -12.095  1.00 82.32           C
ATOM      3  CB  GLU A  12      -9.767  46.596 -11.938  1.00 82.32           C
...
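
For reference, the `REMARK 220` block above can be collected into a name/value dictionary with a few lines of Python. This is a minimal sketch, not the parser used by the conversion script; `parse_remark_220` is a hypothetical helper, and the sample lines are a subset of the excerpt.

```python
def parse_remark_220(lines):
    """Collect 'REMARK 220 NAME: value' lines into a dictionary."""
    prefix = 'REMARK 220 '
    remarks = {}
    for line in lines:
        if not line.startswith(prefix):
            continue
        name, sep, value = line[len(prefix):].partition(':')
        if sep:  # skip lines without a colon, e.g. EXPERIMENTAL DETAILS
            remarks[name.strip()] = value.strip()
    return remarks


sample = [
    'REMARK 220 EXPERIMENTAL DETAILS',
    'REMARK 220 MODEL SCORE:                 1',
    'REMARK 220 MODPIPE QUALITY SCORE:       1.00278',
    'REMARK 220 ZDOPE:                       -0.5',
    'REMARK   6 GA341 score: 1.00000',
]
remarks = parse_remark_220(sample)
print('SEQDB' in remarks)
```

Checking for a `SEQDB` key in the result is a quick way to tell whether a given file predates the updated bulk downloads.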

@benmwebb
Member

benmwebb commented Dec 8, 2021

Hmm, I suppose the NCBI ID in the filename (or GENSCAN, etc.) can serve as an alternative, but I don't see any SEQDB remarks in the Arabidopsis models...

Just one random example, NP_001030614.1_1.pdb:

Are you sure you got the latest download from https://salilab.org/modbase-download/projects/genomes/A_thaliana/2021/ ? The file was updated on 12/06, file size 2172948480 bytes. That particular PDB you reference looks like this:

REMARK 220 EXPERIMENTAL DETAILS                                       
REMARK 220 EXPERIMENT TYPE: THEORETICAL MODEL 
REMARK 220 METHOD: HOMOLOGY MODELING
REMARK 220 PROGRAM: MODPIPE 2.0                                  
REMARK 220 SEQDB: RefSeq    NP_001030614.1 
REMARK 220 SEQUENCE IDENTITY:           59        
REMARK 220 GA341 SCORE:                 1   
REMARK 220 EVALUE:                      0
REMARK 220 MPQS:                        1.00278                       
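
A sketch of how the reference database details might be read, with the filename fallback mentioned earlier in the thread for files that lack the remark. It assumes the SEQDB value is a database name followed by an accession separated by whitespace, and that bulk filenames look like `<accession>_<index>.pdb`. Both are inferred from the examples in this thread, and `sequence_reference` is a hypothetical helper.

```python
import os


def sequence_reference(remarks, filename):
    """Return (database, accession); database is None when taken from the filename."""
    value = remarks.get('SEQDB')
    if value:
        fields = value.split()
        if len(fields) >= 2:
            return fields[0], fields[1]
    # Fallback: strip the extension and the trailing model index
    stem = os.path.splitext(os.path.basename(filename))[0]
    accession, _, _index = stem.rpartition('_')
    return None, accession or stem


print(sequence_reference({'SEQDB': 'RefSeq    NP_001030614.1'},
                         'NP_001030614.1_1.pdb'))
print(sequence_reference({}, 'NP_001030614.1_1.pdb'))
```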

@piehld
Author

piehld commented Dec 8, 2021

Ah, I didn't realize they were just updated. I just re-downloaded the latest Arabidopsis thaliana and Panicum virgatum data sets, and they both look good. Thank you for the update!
