Handling of differences between entry-page PDB downloads and bulk PDB downloads #3

piehld · 2021-11-17T18:52:59Z

I was going to create a PR, but there are some other differences in my fork that shouldn't get merged, so I'll just describe the changes here.

See this commit for reference: rcsb@48c83e4

Add alternate field names for retrieving quality score metrics, as used in some files from bulk data download

c.write_scores(
    self.remarks.get('TSVMOD METHOD'),
    self.remarks.get('TSVMOD RMSD'),
    self.remarks.get('TSVMOD NO35'),
    self.remarks.get('GA341 SCORE', self.remarks.get('MODEL SCORE')),
    self.remarks.get('zDOPE SCORE', self.remarks.get('ZDOPE SCORE')),
    self.remarks.get('MPQS', self.remarks.get('MODPIPE QUALITY SCORE')))

Add "UNK: X" to three_to_one dictionary
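
For illustration, the alternate-field-name lookup can be sketched on its own, outside the conversion script. This is only a minimal sketch: `parse_remark_220` and `first_present` are hypothetical helpers (not names from the script), and it assumes every remark of interest has the form `REMARK 220 KEY: value`.

```python
def parse_remark_220(lines):
    """Collect 'REMARK 220 KEY: value' lines into a dict (hypothetical helper)."""
    remarks = {}
    for line in lines:
        # Skip other records and REMARK 220 lines without a 'KEY: value' pair
        if not line.startswith('REMARK 220 ') or ':' not in line:
            continue
        key, _, value = line[11:].partition(':')
        remarks[key.strip()] = value.strip()
    return remarks


def first_present(remarks, *names):
    """Return the value of the first name present in remarks, else None."""
    for name in names:
        if name in remarks:
            return remarks[name]
    return None


# Field names as they appear in a bulk-download file
bulk = parse_remark_220([
    'REMARK 220 EXPERIMENTAL DETAILS',
    'REMARK 220 MODEL SCORE:                 1',
    'REMARK 220 MODPIPE QUALITY SCORE:       1.00278',
])
# Field names as they appear in an entry-page file
website = parse_remark_220([
    'REMARK 220 GA341 SCORE:                 1',
    'REMARK 220 MPQS:                        1.00278',
])

print(first_present(bulk, 'GA341 SCORE', 'MODEL SCORE'))
print(first_present(website, 'MPQS', 'MODPIPE QUALITY SCORE'))
```

`first_present` behaves like the nested `dict.get` calls above but reads more easily once there are more than two candidate names.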

Some of the other changes in that commit still need to be discussed within our team, so you can disregard those.

Thank you

benmwebb · 2021-11-18T01:39:39Z

Ah, the code to generate PDBs for bulk download has always been separate from that used in the website. I hadn't noticed that some of the remarks it generates are subtly different. For the time being I fixed the bulk download code to more closely resemble the website code (although a proper fix would be to use the same code in both cases of course). But I'll incorporate your change to the PDB-to-mmCIF conversion script.

I'm curious to know why you needed the UNK to X mapping though. Do you have an example model where things failed without it? I thought that ModBase didn't generate models containing UNK (it is supposed to map it to GLY instead IIRC) but I may be mistaken - there are certainly some old models in there.

piehld · 2021-11-18T14:35:15Z

Thanks for incorporating the additions. I know the appearance of 'X' was occurring in the alignment files, but I can't recall if it ever occurred in a PDB. One example from the Arabidopsis thaliana bulk download is NP_001030619.1_1.ali.xml.

Another issue I ran into and addressed in that same commit referenced above was that in some cases the TEMPLATE BEGIN and/or TEMPLATE END contained an insertion code, such as in the file NP_001030766.1_1.pdb. I addressed this by creating a new attribute (as you can see in the commit), but we haven't added that to our MA dictionary yet. When we do, I'll create another ticket.
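
As a rough sketch of the parsing side of this (not the code from the commit), a TEMPLATE BEGIN/END value could be split into a residue number and an optional insertion code as follows. The `42A` style of value and the helper name `split_residue_id` are assumptions made for illustration.

```python
import re

# Optional sign, digits, then at most one trailing letter (the insertion code)
_RESID = re.compile(r'^\s*(-?\d+)([A-Za-z]?)\s*$')


def split_residue_id(text):
    """Split a residue id such as '42' or '42A' into (number, insertion_code)."""
    m = _RESID.match(text)
    if m is None:
        raise ValueError(f'not a residue id: {text!r}')
    # An empty second group means no insertion code was present
    return int(m.group(1)), m.group(2) or None


print(split_residue_id('42'))
print(split_residue_id('42A'))
```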

benmwebb · 2021-11-18T17:37:23Z

> Another issue I ran into and addressed in that same commit referenced above was that in some cases the TEMPLATE BEGIN and/or TEMPLATE END contained an insertion code

If I understand @brindakv correctly the residue numbers in the template table are supposed to be label_seq_id and so don't have/need an insertion code. So my script isn't right at the moment since it's using PDB residue numbers (auth_seq_id) to fill these in. I think the proper fix here is to pull the pdbx_poly_seq_scheme mapping from the mmCIF version of the template to get rid of author-provided template numbering entirely. But that's a separate issue.
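
A minimal sketch of that mapping, assuming the pdbx_poly_seq_scheme rows of the template have already been read into dictionaries by some mmCIF reader (the reading step is omitted, and the rows below are made up for illustration rather than taken from a real entry). `build_auth_to_label` is a hypothetical helper name.

```python
def build_auth_to_label(scheme_rows):
    """Map (pdb_strand_id, pdb_seq_num, pdb_ins_code) to seq_id (label_seq_id)."""
    mapping = {}
    for row in scheme_rows:
        if row['pdb_seq_num'] in ('.', '?'):
            continue  # no author numbering available for this residue
        ins = row['pdb_ins_code']
        if ins in ('.', '?'):
            ins = None  # mmCIF placeholders for "no insertion code"
        key = (row['pdb_strand_id'], int(row['pdb_seq_num']), ins)
        mapping[key] = int(row['seq_id'])
    return mapping


# Made-up rows: residue 42A is an insertion between 42 and 43
rows = [
    {'seq_id': '1', 'pdb_strand_id': 'A', 'pdb_seq_num': '41', 'pdb_ins_code': '.'},
    {'seq_id': '2', 'pdb_strand_id': 'A', 'pdb_seq_num': '42', 'pdb_ins_code': '.'},
    {'seq_id': '3', 'pdb_strand_id': 'A', 'pdb_seq_num': '42', 'pdb_ins_code': 'A'},
    {'seq_id': '4', 'pdb_strand_id': 'A', 'pdb_seq_num': '43', 'pdb_ins_code': '.'},
]
auth_to_label = build_auth_to_label(rows)

print(auth_to_label[('A', 42, None)])
print(auth_to_label[('A', 42, 'A')])
```

Because seq_id is a plain running index, the insertion code disappears once the template range is expressed this way.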

piehld · 2021-11-18T17:50:53Z

Oh right, I see. That sounds like the appropriate method then, thanks!

brindakv · 2021-11-19T03:40:38Z

@benmwebb is correct. The residue numbers have to follow the label_seq_id and not auth_seq_id. Therefore, no insertion codes.

benmwebb · 2021-11-19T07:21:43Z

> I know the appearance of 'X' was occurring in the alignment files, but I can't recall if it ever occurred in a PDB. One example from the Arabidopsis thaliana bulk download is NP_001030619.1_1.ali.xml.

Ah, the alignment file contains some X residues in the template but we need the 3-letter name to output entity_poly_seq. I opened a separate issue (#5) for this.

benmwebb · 2021-12-06T19:30:03Z

FWIW the latest code should handle most modern ModBase models correctly - I ran it on the entire A. thaliana dataset and it succeeded without issues. Besides the residue numbering and handling of UNK you noted, I also had to add handling of ASX and GLX (in e63df25).
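
For reference, a lookup table covering these extra residue names might look like the sketch below. It uses the conventional one-letter codes (B for ASX, Z for GLX, X for UNK); the exact contents of the script's own three_to_one dictionary are not shown in this thread, so the table and the `to_one_letter` helper are assumptions.

```python
three_to_one = {
    'ALA': 'A', 'ARG': 'R', 'ASN': 'N', 'ASP': 'D', 'CYS': 'C',
    'GLN': 'Q', 'GLU': 'E', 'GLY': 'G', 'HIS': 'H', 'ILE': 'I',
    'LEU': 'L', 'LYS': 'K', 'MET': 'M', 'PHE': 'F', 'PRO': 'P',
    'SER': 'S', 'THR': 'T', 'TRP': 'W', 'TYR': 'Y', 'VAL': 'V',
    # Ambiguous or unknown residues
    'ASX': 'B', 'GLX': 'Z', 'UNK': 'X',
}

# Reverse table, e.g. to recover a 3-letter name for an 'X' in an alignment
one_to_three = {one: three for three, one in three_to_one.items()}


def to_one_letter(resnames):
    """Convert a list of 3-letter residue names to a one-letter sequence."""
    return ''.join(three_to_one[name] for name in resnames)


print(to_one_letter(['GLU', 'UNK', 'ASX', 'GLX']))
print(one_to_three['X'])
```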

piehld · 2021-12-06T20:29:54Z

@benmwebb Cool, thanks! I also noticed that you added reference DB details (e.g., to parse the SEQDB remarks in the PDB). Since the bulk PDB downloads don't currently contain that remark, that would be another motivation for eventually obtaining a tarball of pre-converted mmCIF models (for at least certain species of interest), once all the remaining kinks are addressed. Not a pressing concern now though, of course.

benmwebb · 2021-12-06T21:43:58Z

> I also noticed that you added reference DB details (e.g., to parse the SEQDB remarks in the PDB). Since the bulk PDB downloads don't currently contain that remark, that would be another motivation for eventually obtaining a tarball of pre-converted mmCIF models

The latest human PDB files do contain SEQDB remarks IIRC. I'll be updating the other recent bulk downloads over the next few weeks.

piehld · 2021-12-06T22:31:03Z

Ah OK, I didn't realize the Human bulk dataset had been updated. I've still been using the previous bulk download (from Sept. 10, 2020). I'll check it out, thanks. I don't believe the latest Arabidopsis thaliana or Panicum virgatum bulk downloads contain that remark yet, though.

benmwebb · 2021-12-08T19:55:27Z

All the bulk downloads for 2019 or later should now be updated to contain at least rudimentary SEQDB remarks.

piehld · 2021-12-08T20:22:22Z

Hmm, I suppose the NCBI ID in the filename (or GENSCAN, etc.) can serve as an alternative, but I don't see any SEQDB remarks in the Arabidopsis models...

Just one random example, NP_001030614.1_1.pdb:

HEADER    ModPipe Model of NP_001030614.1         2021-05-1
TITLE     Model of SubName: Full=Phosphoglycerate mutase-like protein 
AUTHOR     URSULA PIEPER, BENJAMIN WEBB, EASHWAR NARAYANAN, ANDREJ SALI        
REMARK 220 EXPERIMENTAL DETAILS                                       
REMARK 220 EXPERIMENT TYPE: THEORETICAL MODEL                         
REMARK 220 METHOD: HOMOLOGY MODELING                                  
REMARK 220 PROGRAM: MODPIPE 2.0                                       
REMARK 220 SEQUENCE IDENTITY:           59                            
REMARK 220 MODEL SCORE:                 1                             
REMARK 220 MODPIPE QUALITY SCORE:       1.00278                       
REMARK 220 ZDOPE:                       -0.5                          
REMARK 220 EVALUE:                      0                             
REMARK 220 TSVMOD METHOD:               MSALL                         
REMARK 220 TSVMOD RMSD:                 1.954                         
REMARK 220 TSVMOD NO35:                 0.916                         
REMARK 220 TEMPLATE PDB:                3t7a                          
REMARK 220 TEMPLATE CHAIN:              A                             
REMARK 220 TARGET LENGTH:               1050                          
REMARK 220 TARGET BEGIN:                12                            
REMARK 220 TARGET END:                  337                           
REMARK 220 TEMPLATE BEGIN:              42                            
REMARK 220 TEMPLATE END:                360                           
REMARK 220 MODPIPE RUN:                 MW-a_thaliana                 
REMARK 220 MODPIPE MODEL ID:            fecfefe11db81b3ec93bb0a21cf412b4
REMARK 220 MODPIPE ALIGN ID:            fa92f743b30e909252356a4f9bb3a2b4
REMARK 220 MODPIPE SEQUENCE ID:         129809a15860e9a56d6df8c21a9af8b1MEMENGRS
REMARK 220 TOTAL NUMBER OF MODELS FOR THIS SEQUENCE:   2         
EXPDTA    THEORETICAL MODEL, MODELLER SVN 2021/04/09 18:25:24
REMARK   6 MODELLER OBJECTIVE FUNCTION:      1577.3577
REMARK   6 MODELLER BEST TEMPLATE % SEQ ID:  58.934
REMARK   6 GENERATED BY MODPIPE VERSION SVN.r1703
REMARK   6 SEQUENCE: 129809a15860e9a56d6df8c21a9af8b1MEMENGRS
REMARK   6 ALIGNMENT: fa92f743b30e909252356a4f9bb3a2b4.ali
REMARK   6 SCRIPT: 129809a15860e9a56d6df8c21a9af8b1MEMENGRS-models.py
REMARK   6 GA341 score: 1.00000
REMARK   6 DOPE score: -34999.62500
REMARK   6 DOPE-HR score: -30525.31055
REMARK   6 Normalized DOPE score: -0.49556
REMARK   6 TEMPLATE: 3t7aA 42:A - 360:A MODELS 12:A - 337:A AT 58.9%
ATOM      1  N   GLU A  12      -8.112  47.537 -13.517  1.00 82.32           N
ATOM      2  CA  GLU A  12      -8.377  47.241 -12.095  1.00 82.32           C
ATOM      3  CB  GLU A  12      -9.767  46.596 -11.938  1.00 82.32           C
...
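
The filename fallback mentioned above could be sketched roughly as follows. This assumes a SEQDB remark, when present, has the form `REMARK 220 SEQDB: <database> <accession>`, and that bulk files are named `<accession>_<model index>.pdb`; `sequence_reference` is a hypothetical helper, not part of the conversion script.

```python
import os


def sequence_reference(pdb_path, lines):
    """Return (database, accession) from a SEQDB remark, else guess from the filename."""
    for line in lines:
        if line.startswith('REMARK 220 SEQDB:'):
            fields = line.split(':', 1)[1].split()
            if len(fields) >= 2:
                return fields[0], fields[1]
    # Fallback: strip the extension and the trailing model index
    stem = os.path.basename(pdb_path)
    if stem.endswith('.pdb'):
        stem = stem[:-4]
    accession, sep, _model_index = stem.rpartition('_')
    # The database is unknown when only the filename is available
    return (None, accession if sep else stem)


print(sequence_reference('NP_001030614.1_1.pdb',
                         ['REMARK 220 PROGRAM: MODPIPE 2.0']))
```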

benmwebb · 2021-12-08T21:43:18Z

> Hmm, I suppose the NCBI ID in the filename (or GENSCAN, etc.) can serve as an alternative, but I don't see any SEQDB remarks in the Arabidopsis models...
>
> Just one random example, NP_001030614.1_1.pdb:

Are you sure you got the latest download from https://salilab.org/modbase-download/projects/genomes/A_thaliana/2021/ ? The file was updated on 12/06, file size 2172948480 bytes. That particular PDB you reference looks like this:

REMARK 220 EXPERIMENTAL DETAILS                                       
REMARK 220 EXPERIMENT TYPE: THEORETICAL MODEL 
REMARK 220 METHOD: HOMOLOGY MODELING
REMARK 220 PROGRAM: MODPIPE 2.0                                  
REMARK 220 SEQDB: RefSeq    NP_001030614.1 
REMARK 220 SEQUENCE IDENTITY:           59        
REMARK 220 GA341 SCORE:                 1   
REMARK 220 EVALUE:                      0
REMARK 220 MPQS:                        1.00278

piehld · 2021-12-08T22:21:40Z

Ah, I didn't realize they were just updated. I just re-downloaded the latest Arabidopsis thaliana and Panicum virgatum data sets, and they both look good. Thank you for the update!

benmwebb self-assigned this Nov 18, 2021

benmwebb closed this as completed in a2812b8 Nov 18, 2021

benmwebb mentioned this issue Nov 18, 2021

Don't use author/PDB-provided residue numbers or asym_ids in template tables #4

Closed

benmwebb mentioned this issue Nov 19, 2021

Handle UNK residues in template sequences #5

Closed
