Decide what to do when curators use different UniProt IDs for the same gene in different publications #125

CuzickA · 2023-05-24T13:38:29Z

This happened recently for Erg27 in Botrytis cinerea; see #120 and #118.

The problem is that if two different UniProt IDs are used, the curated results for the same gene will be displayed on two separate PHI-base 5 gene-centric pages.

In this case I recognised the gene name and we made sure the same UniProt ID was used for both curation sessions. However, as the number of curated publications increases, it would be very easy to miss something like this.

@ValWood @jseager7 any ideas?

CuzickA · 2023-05-24T13:40:07Z

Once the curated data is live on PHI-base 5 (and updated regularly), we could advise curators to look up whether their gene has already been curated. If it has, they could use the same UniProt ID.

jseager7 · 2023-05-24T13:42:42Z

Can you provide the two UniProt IDs in question?

If they've been curated with exactly the same allele name in PHI-Canto, then that could be one quick check we could do with the JSON export. Another option is checking if the gene names are identical in UniProtKB (in the case when there are gene names).

CuzickA · 2023-05-24T13:48:24Z

We already manually changed the UniProt ID so that both sessions now use the same one. Finding the UniProt ID was not straightforward as there were multiple entries; I have noted the problems at the start of #118.

Good idea about searching on allele name in PHI-Canto, but different papers may have made different mutations to the same gene.

CuzickA · 2023-05-24T13:50:48Z

From #118

ValWood · 2023-05-24T13:55:21Z

At PomBase we have lots of log files for various checks (like alleles with the same description and different names, same name different descriptions).
https://curation.pombase.org/dumps/latest_build/logs/
You should think about building up a suite of QC checks like this.

PHI-base checks might be slightly different, but a check for same species + same gene name + different UniProt ID would be very useful.

You might also need to check the sequences of both to be sure the allele descriptions match up correctly. You might be able to extend Manu's code to do this.
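As a rough illustration of the sequence check (not based on any existing PomBase code), the sketch below tests whether an amino acid substitution description fits a given protein sequence. The function name `substitution_fits`, the restriction to simple descriptions such as `T3A`, and the toy sequences are all assumptions made for the example.

```python
import re

# Matches simple amino acid substitutions such as "T3A" (hypothetical scope:
# deletions, insertions and multi-site alleles are not handled here).
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def substitution_fits(description, sequence):
    """Return True if the wild-type residue named in the description
    is found at the stated 1-based position of the sequence."""
    match = SUBSTITUTION.match(description)
    if match is None:
        raise ValueError(f"unsupported allele description: {description!r}")
    wild_type = match.group(1)
    position = int(match.group(2))
    if position < 1 or position > len(sequence):
        return False
    return sequence[position - 1] == wild_type


# Toy sequences (made up) standing in for two UniProtKB entries.
sequence_a = "MKTAY"
sequence_b = "MKSAY"

fits_a = substitution_fits("T3A", sequence_a)
fits_b = substitution_fits("T3A", sequence_b)
```

If a description fits one entry's sequence but not the other's, the two accessions cannot simply be swapped without reviewing the alleles.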

This sort of use case would be useful to feed back to Maria when you next speak with her.
Maybe they can help. Also, it might be possible to collaborate with UniProt to fast-track the curation of proteins of interest for pathogenicity into UniProt.

jseager7 · 2023-05-24T14:01:33Z

It seems like this is more a problem with UniProtKB, so I'm not really sure what we can do about it. If we don't have firm guidelines about which UniProtKB accession should be preferred in every case (apparently 'use the reference proteome' isn't enough here), we can't expect any curator to make the right decision.

Similarly, if our choice of UniProtKB accession is ultimately somewhat arbitrary, then it would be unfair to expect anything more from curators. I agree with Val that we'll need some pretty advanced QC checks.

@CuzickA Out of curiosity, why did you choose the reference proteome for the T4 strain instead of the B05.10 strain? I would've expected (maybe naively) that strain B05.10 from Bf would be closer to B05.10 from Bc than T4 from Bf.

CuzickA · 2023-05-24T14:05:49Z

> @CuzickA Out of curiosity, why did you choose the reference proteome for the T4 strain instead of the B05.10 strain? I would've expected (maybe naively) that strain B05.10 from Bf would be closer to B05.10 from Bc than T4 from Bf.

Yes, I would have done it if it was available. My notes above (with the screenshots) say that I looked at each of the entries and none were for strain B05.10.

CuzickA · 2023-05-24T14:07:29Z

> 'use the reference proteome' isn't enough here

and it's not straightforward when there are multiple reference proteomes - 3 in this case!

ValWood · 2023-05-24T14:10:19Z

> and it's not straightforward when there are multiple reference proteomes - 3 in this case!

I find this bizarre. There should only be a single reference proteome. Another question for the Maria list.

jseager7 · 2023-05-24T14:25:12Z

> Yes, I would have done it if it was available. My notes above (with the screenshots) say that I looked at each of the entries and none were for strain B05.10.

Sorry, still struggling to follow this. What was the reason you chose Q873W7 over the other Erg27 entries? I can't see any obvious difference between them, and none of them seem to be linked to a reference proteome based on the Proteome column that I added to the search results.

CuzickA · 2023-05-24T14:38:36Z

https://www.uniprot.org/uniprotkb/Q873W7/entry

If you click on the EMBL link in this record, it shows that the strain is T4.

My notes above suggest that I did this for all the UniProt entries looking for strain B05.10 and didn't find any.

Strain T4 was listed as one of the 3 reference proteomes, so I selected this UniProt entry. I'm still not sure if it's correct but it seemed the best option.

jseager7 · 2023-05-30T09:15:02Z

> Strain T4 was listed as one of the 3 reference proteomes so I selected this UniProt. I'm still not sure if it's correct but it seemed the best option.

Thanks for clarifying. In terms of my decision, I think the best we can do is make automatic QC checks to track which annotations in PHI-Canto reference the same gene name or allele name but different UniProtKB accession numbers (also constrained by NCBI Taxonomy ID like Val suggested).

It looks like the choice of preferred accession number will be made on a case-by-case basis by admin curators, since we're already choosing on what 'seems' correct, so it doesn't seem like there's a procedure that would work for all cases (or easy enough to be followed by community curators).
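A minimal sketch of such a QC check, assuming the annotations have already been flattened into simple records. The field names (`taxon_id`, `gene_name`, `accession`), the taxon IDs, and the placeholder accessions are hypothetical and are not the actual PHI-Canto JSON export format.

```python
from collections import defaultdict


def find_conflicting_accessions(records):
    """Group records by (taxon ID, gene name) and return the groups
    that reference more than one UniProtKB accession."""
    groups = defaultdict(set)
    for record in records:
        name = record.get("gene_name")
        if not name:
            continue  # entries without a gene name cannot be compared this way
        key = (record["taxon_id"], name.strip().lower())
        groups[key].add(record["accession"])
    return {key: sorted(accs) for key, accs in groups.items() if len(accs) > 1}


# Made-up records: taxon IDs and accessions are placeholders, not real data.
records = [
    {"taxon_id": 1001, "gene_name": "Erg27", "accession": "ACC_A"},
    {"taxon_id": 1001, "gene_name": "erg27", "accession": "ACC_B"},
    {"taxon_id": 1002, "gene_name": "erg27", "accession": "ACC_C"},
    {"taxon_id": 1001, "gene_name": None, "accession": "ACC_D"},
]

conflicts = find_conflicting_accessions(records)
```

Gene names are compared case-insensitively here, which is a design choice rather than a requirement. The same grouping could be repeated with allele names as the key, and the flagged groups would still need review by an admin curator.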

CuzickA · 2023-06-05T12:34:08Z

Sounds good.

Did we develop a method for changing a UniProt id in a curation session if required? I think we discussed this in the past.

jseager7 · 2023-06-05T12:48:58Z

> Did we develop a method for changing a UniProt id in a curation session if required?

@CuzickA We haven't developed a solution yet since we were waiting to see how frequently this problem occurs and for a list of examples that need changing. The relevant issue is linked below:

pombase/canto#2677 (comment)

CuzickA · 2023-06-21T09:17:03Z

Also see #127 and #116 for another example of this problem.

CuzickA · 2024-01-31T12:01:02Z

Also #126 and #207

ValWood · 2024-01-31T14:15:10Z

Can you pop the IDs in the ticket?
I would ask UniProt to merge the entries into a single Swiss-Prot entry. I believe in this situation one ID will become the primary ID and one will become secondary.

It might be that the author has resequenced the gene and submitted it, which will create another TrEMBL entry. This will eventually be merged with (I assume) the genome entry taking precedence.

CuzickA · 2024-01-31T14:28:20Z

This is the Swiss-Prot entry https://www.uniprot.org/uniprotkb/O42772/entry

This is the TrEMBL entry https://www.uniprot.org/uniprotkb/G8EI90/entry

(For #126 and #207)

ValWood · 2024-01-31T14:38:08Z

OK, if there is a Swiss-Prot and a TrEMBL entry you should use the Swiss-Prot entry. Eventually the TrEMBL entry will merge into the Swiss-Prot entry and the TrEMBL ID will become a secondary ID.
It is always worth letting UniProt know though, because they should be able to prioritise a merge if a publication is associated.
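The Swiss-Prot rule could be applied automatically along these lines. This is only a sketch: the `reviewed` flag (True for Swiss-Prot, False for TrEMBL) and the function name `preferred_accession` are assumptions, and the entries would have to be populated from UniProtKB beforehand.

```python
def preferred_accession(entries):
    """Return the single reviewed (Swiss-Prot) accession among the entries,
    or None when the choice is not clear-cut and needs a curator."""
    reviewed = [entry["accession"] for entry in entries if entry["reviewed"]]
    if len(reviewed) == 1:
        return reviewed[0]
    return None  # no reviewed entry, or several: decide case by case


# The two accessions given for #126 and #207.
entries = [
    {"accession": "O42772", "reviewed": True},   # Swiss-Prot
    {"accession": "G8EI90", "reviewed": False},  # TrEMBL
]

choice = preferred_accession(entries)
```

When every candidate is a TrEMBL entry, as with the Erg27 case earlier in this thread, the function returns None and the decision stays with an admin curator.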

CuzickA · 2024-01-31T14:42:12Z

OK, thanks. Unfortunately a lot of genotypes and annotations were made to the TrEMBL ID, so it will take some work to switch over to the Swiss-Prot ID.

@jseager7, would you be able to help with swapping this UniProt ID over please?

jseager7 · 2024-01-31T14:44:03Z

> @jseager7, would you be able to help with swapping this UniProt ID over please?

Yes, I'll re-raise the issue of implementing a script to replace a UniProtKB ID in a session.
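For illustration only, a replacement script might look something like the sketch below, which swaps one accession for another everywhere in a nested, JSON-like structure. The layout of `session` is invented for the example and is not the real Canto session schema, and `replace_accession` is a hypothetical helper, not an existing Canto function.

```python
import re


def replace_accession(data, old_id, new_id):
    """Return a copy of a nested dict/list structure with old_id replaced
    by new_id in every string key and string value (whole tokens only)."""
    pattern = re.compile(rf"\b{re.escape(old_id)}\b")

    def swap(text):
        return pattern.sub(lambda _match: new_id, text)

    def walk(node):
        if isinstance(node, dict):
            # Note: if a key for new_id already exists, it would be overwritten.
            return {(swap(k) if isinstance(k, str) else k): walk(v)
                    for k, v in node.items()}
        if isinstance(node, list):
            return [walk(item) for item in node]
        if isinstance(node, str):
            return swap(node)
        return node

    return walk(data)


# Invented session layout, using the two accessions from #126 and #207.
session = {
    "genes": {"G8EI90": {}},
    "alleles": {"G8EI90:allele-1": {"gene": "G8EI90"}},
    "notes": ["XG8EI90 is a different token and is left alone"],
}

updated = replace_accession(session, "G8EI90", "O42772")
```

A real script would also need to handle sessions where both IDs are already present, since the two gene entries would then have to be merged rather than renamed.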

CuzickA · 2024-09-06T13:42:25Z

> I think the best we can do is make automatic QC checks to track which annotations in PHI-Canto reference the same gene name or allele name but different UniProtKB accession numbers (also constrained by NCBI Taxonomy ID like Val suggested).

I guess this is a future activity for @jseager7

jseager7 added the discuss label May 24, 2023

CuzickA assigned jseager7 Sep 6, 2024

CuzickA added the Future label Sep 6, 2024
