Decide what to do when curators use different UniProt ids for the same gene in different publications #125

Open
CuzickA opened this issue May 24, 2023 · 22 comments

@CuzickA

CuzickA commented May 24, 2023

This happened recently for Erg27 in Botrytis cinerea; see #120 and #118.

The problem is that if two different UniProt ids are used, the curated results for the same gene will be displayed on two separate PHI-base 5 gene-centric pages.

In this case I recognised the gene name and we made sure the same UniProt id was used for both curation sessions. However, as the number of curated publications increases, it would be very easy to miss something like this.

@ValWood @jseager7 any ideas?

@CuzickA
Author

CuzickA commented May 24, 2023

Once the curated data is live on PHI-base 5 (and updated regularly), we could advise curators to look up whether their gene has already been curated. If it has, they could use the same UniProt id.

@jseager7

Can you provide the two UniProt IDs in question?

If they've been curated with exactly the same allele name in PHI-Canto, then that could be one quick check we could do with the JSON export. Another option is checking whether the gene names are identical in UniProtKB (for entries that have gene names).
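A minimal sketch of the allele-name check suggested above. It assumes the export has already been flattened into one record per curated allele; the field names `allele_name` and `uniprot_id`, the accessions, and the allele names are all hypothetical and are not taken from the actual PHI-Canto JSON schema.

```python
from collections import defaultdict


def shared_allele_names(records):
    """Return allele names that were curated against more than one accession.

    `records` is an iterable of dicts with the (hypothetical) keys
    'allele_name' and 'uniprot_id'.
    """
    accessions_by_name = defaultdict(set)
    for record in records:
        accessions_by_name[record["allele_name"]].add(record["uniprot_id"])
    # Keep only the names that point at two or more different accessions.
    return {
        name: sorted(accessions)
        for name, accessions in accessions_by_name.items()
        if len(accessions) > 1
    }


# Made-up records for illustration only.
records = [
    {"allele_name": "alleleA", "uniprot_id": "ACC_1"},
    {"allele_name": "alleleA", "uniprot_id": "ACC_2"},
    {"allele_name": "alleleB", "uniprot_id": "ACC_3"},
]
print(shared_allele_names(records))
```

As noted in the reply below, this only catches cases where both papers used an identical allele name, so it is a quick first filter rather than a complete check.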

@CuzickA
Author

CuzickA commented May 24, 2023

We already manually made the changes to the UniProt id so that both sessions now use the same id. It was not straightforward finding the UniProt id as there were multiple entries, and I have noted the problems at the start of #118.

Good idea about searching on allele name in PHI-Canto, but different papers may have made different mutations to the same gene.

@CuzickA
Author

CuzickA commented May 24, 2023

From #118
image
image
image

@ValWood

ValWood commented May 24, 2023

At PomBase we have lots of log files for various checks (like alleles with the same description and different names, same name different descriptions).
https://curation.pombase.org/dumps/latest_build/logs/
You should think about building up a suite of QC checks like this.

PHI-base checks might be slightly different, but a check for same species + same gene name + different UniProt ID would be very useful.

You might also need to check the sequences of both to be sure the allele descriptions match up correctly. You might be able to extend manus code to do this.

This sort of use case would be useful to feed back to Maria when you next speak with her.
Maybe they can help. Also, it might be possible to collaborate with UniProt to fast-track the curation of proteins of interest for pathogenicity into UniProt.
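A rough sketch of the "same species + same gene name, different UniProt ID" check, with a simple sequence comparison for any conflicting pair. The record layout (`taxon_id`, `gene_name`, `uniprot_id`), the case-insensitive name matching, and the example values are assumptions made for illustration, not the PHI-Canto export format or an existing PomBase check.

```python
from collections import defaultdict


def conflicting_accessions(genes):
    """Group genes by (taxon ID, lower-cased gene name).

    Returns only the groups that contain more than one accession.
    Entries without a gene name are skipped, since they cannot be
    compared by name.
    """
    groups = defaultdict(set)
    for gene in genes:
        name = gene.get("gene_name")
        if not name:
            continue
        key = (gene["taxon_id"], name.strip().lower())
        groups[key].add(gene["uniprot_id"])
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) > 1}


def sequences_match(seq_a, seq_b):
    """True if two protein sequences are identical, ignoring case and outer whitespace."""
    return seq_a.strip().upper() == seq_b.strip().upper()


# Made-up records: taxon IDs and accessions are placeholders.
genes = [
    {"taxon_id": 1, "gene_name": "erg27", "uniprot_id": "ACC_1"},
    {"taxon_id": 1, "gene_name": "Erg27", "uniprot_id": "ACC_2"},
    {"taxon_id": 2, "gene_name": "erg27", "uniprot_id": "ACC_3"},
    {"taxon_id": 1, "gene_name": None, "uniprot_id": "ACC_4"},
]
print(conflicting_accessions(genes))
```

Exact sequence identity is the strictest possible comparison; entries from different strains may differ by a few residues, so a real check would probably need an alignment-based threshold instead.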

@jseager7

It seems like this is more a problem with UniProtKB, so I'm not really sure what we can do about it. If we don't have firm guidelines about which UniProtKB accession should be preferred in every case (apparently 'use the reference proteome' isn't enough here), we can't expect any curator to make the right decision.

Similarly, if our choice of UniProtKB accession is ultimately somewhat arbitrary, then it would be unfair to expect anything more from curators. I agree with Val that we'll need some pretty advanced QC checks.

@CuzickA Out of curiosity, why did you choose the reference proteome for the T4 strain instead of the B05.10 strain? I would've expected (maybe naively) that strain B05.10 from Bf would be closer to B05.10 from Bc than T4 from Bf.

@CuzickA
Author

CuzickA commented May 24, 2023

> @CuzickA Out of curiosity, why did you choose the reference proteome for the T4 strain instead of the B05.10 strain? I would've expected (maybe naively) that strain B05.10 from Bf would be closer to B05.10 from Bc than T4 from Bf.

Yes, I would have done that if it had been available. My note in the screenshots above says that I looked at each of the entries and none were for strain B05.10.

@CuzickA
Author

CuzickA commented May 24, 2023

> 'use the reference proteome' isn't enough here

And it's not straightforward when there are multiple reference proteomes: 3 in this case!

@ValWood

ValWood commented May 24, 2023

> And it's not straightforward when there are multiple reference proteomes: 3 in this case!

I find this bizarre. There should only be a single reference proteome. Another question for the Maria list.

@jseager7

jseager7 commented May 24, 2023

> Yes, I would have done that if it had been available. My note in the screenshots above says that I looked at each of the entries and none were for strain B05.10.

Sorry, still struggling to follow this. What was the reason you chose Q873W7 over the other Erg27 entries? I can't see any obvious difference between them, and none of them seem to be linked to a reference proteome based on the Proteome column that I added to the search results.

@CuzickA
Author

CuzickA commented May 24, 2023

https://www.uniprot.org/uniprotkb/Q873W7/entry

If you click on the EMBL link in this record, it shows that the strain is T4.

image

image

My notes above suggest that I did this for all the UniProt entries looking for strain B05.10 and didn't find any.

Strain T4 was listed as one of the 3 reference proteomes so I selected this UniProt. I'm still not sure if it's correct but it seemed the best option.

@jseager7

> Strain T4 was listed as one of the 3 reference proteomes so I selected this UniProt. I'm still not sure if it's correct but it seemed the best option.

Thanks for clarifying. In terms of my decision, I think the best we can do is make automatic QC checks to track which annotations in PHI-Canto reference the same gene name or allele name but different UniProtKB accession numbers (also constrained by NCBI Taxonomy ID like Val suggested).

It looks like the choice of preferred accession number will be made on a case-by-case basis by admin curators. We're already choosing based on what 'seems' correct, so there doesn't seem to be a procedure that would work for all cases (or one easy enough for community curators to follow).

@CuzickA
Author

CuzickA commented Jun 5, 2023

Sounds good.

Did we develop a method for changing a UniProt id in a curation session if required? I think we discussed this in the past.

@jseager7

jseager7 commented Jun 5, 2023

> Did we develop a method for changing a UniProt id in a curation session if required?

@CuzickA We haven't developed a solution yet since we were waiting to see how frequently this problem occurs and for a list of examples that need changing. The relevant issue is linked below:

pombase/canto#2677 (comment)

@CuzickA
Author

CuzickA commented Jun 21, 2023

Also see #127 and #116 for another example of this problem.

@CuzickA
Author

CuzickA commented Jan 31, 2024

Also #126 and #207

@ValWood

ValWood commented Jan 31, 2024

Can you pop the IDs in the ticket?
I would ask UniProt to merge the entries into a single Swiss-Prot entry. I believe in this situation one ID will become the primary ID and one will become secondary.

It might be that the author has resequenced the gene and submitted it, which will create another TrEMBL entry. This will eventually be merged, with (I assume) the genome entry taking precedence.

@CuzickA
Author

CuzickA commented Jan 31, 2024

This is the Swiss-Prot entry https://www.uniprot.org/uniprotkb/O42772/entry

This is the TrEMBL entry https://www.uniprot.org/uniprotkb/G8EI90/entry

(For #126 and #207)

@ValWood

ValWood commented Jan 31, 2024

OK, if there is a Swiss-Prot and a TrEMBL entry you should use the Swiss-Prot entry. Eventually the TrEMBL entry will merge into the Swiss-Prot entry and the TrEMBL ID will become a secondary ID.
It is always worth letting UniProt know though, because they should be able to prioritise a merge if a publication is associated.
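The rule of thumb above (prefer the Swiss-Prot entry over the TrEMBL entry) could be sketched as follows. The `reviewed` flag is a hypothetical field standing in for the Swiss-Prot (reviewed) versus TrEMBL (unreviewed) status of each entry; the two accessions are the ones given in this thread.

```python
def preferred_accession(entries):
    """Pick the single reviewed (Swiss-Prot) accession from candidate entries.

    Returns None when there is no reviewed entry, or more than one, because
    those cases still need a manual decision by a curator.
    """
    reviewed = [entry["accession"] for entry in entries if entry["reviewed"]]
    if len(reviewed) == 1:
        return reviewed[0]
    return None


candidates = [
    {"accession": "O42772", "reviewed": True},   # Swiss-Prot entry from this thread
    {"accession": "G8EI90", "reviewed": False},  # TrEMBL entry from this thread
]
print(preferred_accession(candidates))
```

Returning `None` rather than guessing keeps cases like Erg27, where no candidate was clearly preferable, in the hands of admin curators.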

@CuzickA
Author

CuzickA commented Jan 31, 2024

Ok, thanks. Unfortunately a lot of genotypes and annotations were made to the TrEMBL ID so it will take some work to switch over to the Swiss-Prot ID.

@jseager7, would you be able to help with swapping this UniProt ID over please?

@jseager7

> @jseager7, would you be able to help with swapping this UniProt ID over please?

Yes, I'll re-raise the issue of implementing a script to replace a UniProtKB ID in a session.
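For illustration only, here is one way such a replacement could work on an exported session, assuming the session is available as nested dicts and lists and that the accession appears as whole dict keys or whole string values. The session layout shown is invented; this is not the script discussed in pombase/canto#2677, and it does not touch the Canto database.

```python
def replace_accession(node, old_id, new_id):
    """Return a copy of a nested dict/list structure with old_id swapped for new_id.

    Only exact matches are replaced (dict keys and scalar values); accessions
    embedded inside longer strings are left untouched.
    """
    if isinstance(node, dict):
        if old_id in node and new_id in node:
            # Both accessions already have data here; merging needs a human.
            raise ValueError(f"both {old_id} and {new_id} are present as keys")
        return {
            (new_id if key == old_id else key): replace_accession(value, old_id, new_id)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [replace_accession(item, old_id, new_id) for item in node]
    return new_id if node == old_id else node


# Invented session layout, using the two accessions from this thread.
session = {
    "genes": {"G8EI90": {"organism": "example organism"}},
    "alleles": [{"gene": "G8EI90", "name": "alleleA"}],
}
updated = replace_accession(session, "G8EI90", "O42772")
print(updated)
```

The function builds a new structure instead of editing in place, so the original export can be kept for comparison before anything is re-imported.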

@CuzickA
Author

CuzickA commented Sep 6, 2024

> I think the best we can do is make automatic QC checks to track which annotations in PHI-Canto reference the same gene name or allele name but different UniProtKB accession numbers (also constrained by NCBI Taxonomy ID like Val suggested).

I guess this is a future activity for @jseager7

@CuzickA CuzickA added the Future label Sep 6, 2024