Handling of differences between entry-page PDB downloads and bulk PDB downloads #3
Comments
Ah, the code to generate PDBs for bulk download has always been separate from that used in the website. I hadn't noticed that some of the remarks it generates are subtly different. For the time being I fixed the bulk download code to more closely resemble the website code (although a proper fix would, of course, be to use the same code in both cases). But I'll incorporate your change to the PDB-to-mmCIF conversion script. I'm curious to know why you needed the UNK to X mapping though. Do you have an example model where things failed without it? I thought that ModBase didn't generate models containing UNK (it is supposed to map it to GLY instead, IIRC) but I may be mistaken; there are certainly some old models in there.
Thanks for incorporating the additions. I know the appearance of 'X' was occurring in the alignment files, but I can't recall if it ever occurred in a PDB. One example from the Arabidopsis thaliana bulk download is Another issue I ran into and addressed in that same commit referenced above was that in some cases the |
If I understand @brindakv correctly the residue numbers in the template table are supposed to be |
Oh right, I see. That sounds like the appropriate method then, thanks!
@benmwebb is correct. The residue numbers have to follow the |
Ah, the alignment file contains some X residues in the template but we need the 3-letter name to output |
FWIW the latest code should handle most modern ModBase models correctly. I ran it on the entire A. thaliana dataset and it succeeded without issues (besides the residue numbering and handling of UNK you noted, I also had to add handling of ASX and GLX in e63df25).
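For readers following the UNK, ASX and GLX discussion above, here is a minimal sketch of the kind of residue-name mapping being described. It is not the code from either commit mentioned in this thread: the names `THREE_TO_ONE`, `ONE_TO_THREE`, `three_to_one` and `one_to_three` are hypothetical, and falling back to `X`/`UNK` for unrecognized names is an assumption made for illustration. The B/Z/X codes for ASX/GLX/UNK are the standard one-letter ambiguity codes.

```python
# Sketch only: a three-letter <-> one-letter residue mapping that also
# covers the ambiguous/unknown names discussed in the thread.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
    "ASX": "B",  # ASP or ASN (ambiguous)
    "GLX": "Z",  # GLU or GLN (ambiguous)
    "UNK": "X",  # unknown residue
}

# Inverse table, e.g. to turn an 'X' found in an alignment back into a
# three-letter residue name.
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def three_to_one(resname: str) -> str:
    """Return the one-letter code; unrecognized names map to 'X' (assumption)."""
    return THREE_TO_ONE.get(resname.strip().upper(), "X")


def one_to_three(code: str) -> str:
    """Return the three-letter name; unrecognized codes map to 'UNK' (assumption)."""
    return ONE_TO_THREE.get(code.strip().upper(), "UNK")
```

Keeping both directions in one place means an `X` in an alignment file and an `UNK` in a PDB resolve to each other consistently.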
@benmwebb Cool, thanks! I also noticed that you added reference DB details (e.g., to parse the SEQDB remarks in the PDB). Since the bulk PDB downloads don't currently contain that remark, that would be another motivation for eventually obtaining a tarball of pre-converted mmCIF models (for at least certain species of interest), once all the remaining kinks are addressed. Not a pressing concern now though, of course.
The latest human PDB files do contain |
Ah OK, I didn't realize the Human bulk dataset had been updated. I've still been using the previous bulk download (from Sept. 10, 2020). I'll check it out, thanks. I don't believe the latest Arabidopsis thaliana or Panicum virgatum bulk downloads contain that remark yet, though.
All the bulk downloads for 2019 or later should now be updated to contain at least rudimentary |
Hmm, I suppose the NCBI ID in the filename (or GENSCAN, etc.) can serve as an alternative, but I don't see any SEQDB remarks in the Arabidopsis models... Just one random example,
Are you sure you got the latest download from https://salilab.org/modbase-download/projects/genomes/A_thaliana/2021/ ? The file was updated on 12/06, file size 2172948480 bytes. That particular PDB you reference looks like this:
Ah, I didn't realize they were just updated. I just re-downloaded the latest Arabidopsis thaliana and Panicum virgatum data sets, and they both look good. Thank you for the update!
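A quick way to tell whether a local copy of a bulk download predates the update is to scan a model for the SEQDB remark. The sketch below is only an illustration: the exact remark number and field layout are not shown in this thread, so it looks for any `REMARK` record that mentions `SEQDB` rather than parsing it, and the function name `has_seqdb_remark` is hypothetical.

```python
from typing import Iterable


def has_seqdb_remark(lines: Iterable[str]) -> bool:
    """Return True if any REMARK record mentions SEQDB.

    Assumption: the remark's exact number and layout are unknown here,
    so this is a plain substring check, not a parser.
    """
    for line in lines:
        if line.startswith("REMARK") and "SEQDB" in line:
            return True
    return False
```

An open file object can be passed directly, for example `with open(path) as fh: has_seqdb_remark(fh)`, where `path` points at one extracted model from the bulk tarball.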
I was going to create a PR, but there are some other differences in my fork that shouldn't get merged, so I'll just describe the changes here.
See this commit for reference: rcsb@48c83e4
three_to_one dictionary
Some of the other changes in that commit still need to be discussed within our team, so you can disregard those.
Thank you