Decide what to do when curators use different UniProt ids for the same gene in different publications #125
Comments
Once the curated data is live on PHI-base 5 (and updated regularly), we could advise curators to look up whether their gene has already been curated. If it has, they could use the same UniProt id. |
Can you provide the two UniProt IDs in question? If they've been curated with exactly the same allele name in PHI-Canto, then that could be one quick check we could do with the JSON export. Another option is checking if the gene names are identical in UniProtKB (in the case when there are gene names). |
We have already manually changed the UniProt id so that both sessions now use the same id. Finding the UniProt id was not straightforward because there were multiple entries, and I have noted the problems at the start of #118. Good idea about searching on allele name in PHI-Canto, but different papers may have made different mutations to the same gene. |
From #118 |
At PomBase we have lots of log files for various checks (like alleles with the same description and different names, same name different descriptions). PHI-base checks might be slightly different but a check for You might need to also check the sequences of both to be sure the allele descriptions match up correctly. You might be able to extend manus code to do this. This sort of use case would be useful to feed back to Maria when you next speak with her. |
It seems like this is more a problem with UniProtKB, so I'm not really sure what we can do about it. If we don't have firm guidelines about which UniProtKB accession should be preferred in every case (apparently 'use the reference proteome' isn't enough here), we can't expect any curator to make the right decision. Similarly, if our choice of UniProtKB accession is ultimately somewhat arbitrary, then it would be unfair to expect anything more from curators. I agree with Val that we'll need some pretty advanced QC checks. @CuzickA Out of curiosity, why did you choose the reference proteome for the T4 strain instead of the B05.10 strain? I would've expected (maybe naively) that strain B05.10 from Bf would be closer to B05.10 from Bc than T4 from Bf. |
Yes, I would have done that if it had been available. My note above in the screenshots says that I looked at each of the entries and none were for strain B05.10. |
And it's not straightforward when there are multiple reference proteomes: 3 in this case! |
I find this bizarre. There should only be a single reference proteome. Another question for the Maria list. |
Sorry, still struggling to follow this. What was the reason you chose Q873W7 over the other Erg27 entries? I can't see any obvious difference between them, and none of them seem to be linked to a reference proteome based on the Proteome column that I added to the search results. |
https://www.uniprot.org/uniprotkb/Q873W7/entry If you click on the EMBL link in this record, it shows that the strain is T4. My notes above suggest that I did this for all the UniProt entries, looking for strain B05.10, and didn't find any. Strain T4 was listed as one of the 3 reference proteomes, so I selected this UniProt entry. I'm still not sure if it's correct, but it seemed the best option. |
Thanks for clarifying. In terms of my decision, I think the best we can do is make automatic QC checks to track which annotations in PHI-Canto reference the same gene name or allele name but different UniProtKB accession numbers (also constrained by NCBI Taxonomy ID, as Val suggested). It looks like the choice of preferred accession number will be made on a case-by-case basis by admin curators, since we're already choosing based on what 'seems' correct. There doesn't seem to be a procedure that would work for all cases (or be easy enough for community curators to follow). |
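The QC check described above could be sketched roughly as follows. This is only an illustration under assumptions: the flat record layout (`session`, `taxon_id`, `gene_name`, `uniprot_id`) is hypothetical and is not the real PHI-Canto JSON export structure, and the values are placeholders rather than real data.

```python
from collections import defaultdict


def find_conflicting_accessions(records):
    """Report (taxon ID, gene name) groups that use more than one accession.

    `records` is a hypothetical flat list of dicts; the real export would
    need to be flattened into this shape first.
    """
    accessions = defaultdict(set)
    for rec in records:
        # Constrain by taxon ID so identical gene names in different
        # species are not flagged; compare gene names case-insensitively.
        key = (rec["taxon_id"], rec["gene_name"].strip().lower())
        accessions[key].add(rec["uniprot_id"])
    # Keep only the groups that reference two or more accessions.
    return {key: sorted(ids) for key, ids in accessions.items() if len(ids) > 1}


# Placeholder records (not real taxon IDs, gene names, or accessions).
records = [
    {"session": "session_a", "taxon_id": 1, "gene_name": "GeneA", "uniprot_id": "ACC_1"},
    {"session": "session_b", "taxon_id": 1, "gene_name": "genea", "uniprot_id": "ACC_2"},
    {"session": "session_c", "taxon_id": 2, "gene_name": "GeneA", "uniprot_id": "ACC_3"},
]

for (taxon_id, gene_name), ids in find_conflicting_accessions(records).items():
    print(f"taxon {taxon_id}, gene {gene_name}: {', '.join(ids)}")
```

A check like this only flags candidates for an admin curator to review. It cannot decide which accession is preferred, and it would miss genes that were curated under different names.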
Sounds good. Did we develop a method for changing a UniProt id in a curation session if required? I think we discussed this in the past. |
@CuzickA We haven't developed a solution yet since we were waiting to see how frequently this problem occurs and for a list of examples that need changing. The relevant issue is linked below: |
Can you pop the IDs in the ticket? It might be that the author has resequenced the gene and submitted it, which will create another TrEMBL entry. This will eventually be merged, with (I assume) the genome entry taking precedence. |
This is the Swiss-Prot entry https://www.uniprot.org/uniprotkb/O42772/entry This is the TrEMBL entry https://www.uniprot.org/uniprotkb/G8EI90/entry |
OK, if there is a Swiss-Prot and a TrEMBL entry, you should use the Swiss-Prot entry. Eventually the TrEMBL entry will merge into the Swiss-Prot entry and the TrEMBL ID will become a secondary ID. |
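As a minimal sketch of that rule of thumb: the function below returns the Swiss-Prot accession when exactly one candidate is reviewed, and otherwise returns `None` so the case goes to manual review. The `reviewed` flag is an assumption here; it would have to be looked up in UniProtKB beforehand, and the function name is hypothetical.

```python
def preferred_accession(candidates):
    """Return the single reviewed (Swiss-Prot) accession, or None.

    `candidates` is a list of dicts with an `accession` string and a
    `reviewed` boolean (True for Swiss-Prot, False for TrEMBL).
    """
    reviewed = [c["accession"] for c in candidates if c["reviewed"]]
    if len(reviewed) == 1:
        return reviewed[0]
    # Zero or several reviewed entries: no automatic choice is made.
    return None


# The two accessions mentioned in this thread.
candidates = [
    {"accession": "O42772", "reviewed": True},   # Swiss-Prot entry
    {"accession": "G8EI90", "reviewed": False},  # TrEMBL entry
]
print(preferred_accession(candidates))
```

This does not cover cases like Erg27, where none of the candidate entries could be clearly preferred, so those would still need a case-by-case decision.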
OK, thanks. Unfortunately a lot of genotypes and annotations were made to the TrEMBL ID, so it will take some work to switch over to the Swiss-Prot ID. @jseager7, would you be able to help with swapping this UniProt ID over please? |
@jseager7, would you be able to help with swapping this UniProt ID over please? Yes, I'll re-raise the issue of implementing a script to replace a UniProtKB ID in a session. |
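One possible shape for such a replacement script is sketched below. It assumes the session data is available as nested dicts and lists (for example, loaded from a JSON export); that layout, the example keys, and the function name are all hypothetical. It does not touch the curation database itself.

```python
import re


def replace_accession(data, old, new):
    """Return a copy of `data` with accession `old` replaced by `new`.

    Replaces whole-token matches in string values and in dict keys, so
    that a compound key such as 'G8EI90:allele-1' is updated as well.
    """
    pattern = re.compile(r"\b" + re.escape(old) + r"\b")

    def swap(value):
        if isinstance(value, str):
            return pattern.sub(lambda match: new, value)
        if isinstance(value, list):
            return [swap(item) for item in value]
        if isinstance(value, dict):
            return {swap(key): swap(item) for key, item in value.items()}
        return value  # numbers, booleans, None are left unchanged

    return swap(data)


# Hypothetical session fragment using the accessions from this thread.
session = {
    "genes": {"G8EI90": {"organism": "placeholder organism"}},
    "alleles": {"G8EI90:allele-1": {"gene": "G8EI90", "name": "placeholder"}},
}
updated = replace_accession(session, "G8EI90", "O42772")
print(updated)
```

A real script would also need to handle the case where the new accession already exists in the session, since merging two gene entries is more involved than renaming one.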
I guess this is a future activity for @jseager7 |
This happened recently for Erg27 in Botrytis cinerea; see #120 and #118.
The problem is that if two different UniProt ids are used, then the curated results for the same gene will be displayed on two separate PHI-base 5 gene-centric pages.
In this case I recognised the gene name and we made sure the same UniProt was used for both curation sessions. However, as the numbers of curated publications increase it would be very easy to miss something like this.
@ValWood @jseager7 any ideas?