UMLS_UniProtKB

Create a UMLS/UniProtKB table

The problem with UMLS Protein entities is that many of them only link to MeSH, and there are no links to other protein identifiers. So we want to find a way to generically connect UMLS to UniProtKB.

We're going to rely on some work that has been done in Babel, but this particular part doesn't slot into that work very well, because the workflow is a bit circular. To generate a UMLS/UniProt mapping, we're going to rely on already having a Gene/Protein conflation generated, as well as a Gene and Protein concordance.

We rely on the observation that, for most human gene/protein pairs, UMLS appears to contain a gene ID that has at least one lexical entry of " gene, human" along with a corresponding protein ID that has at least one lexical entry of " protein". We use these to join UMLS Proteins to Genes (at least for humans), then use Babel's Gene/Protein conflation to get back to UniProtKB identifiers.
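The join described above can be sketched roughly as follows. This is a minimal illustration, not the code in create_umls_uniprotkb.py: it assumes the two lexical patterns are suffixes on an otherwise shared name, that synonyms arrive as (ID, label) pairs, that the gene concordance is a dict from UMLS gene IDs to other gene identifiers, and that each conflation entry is a list mixing gene and UniProtKB CURIEs. The function names and input shapes are hypothetical.

```python
from collections import defaultdict

GENE_SUFFIX = " gene, human"
PROTEIN_SUFFIX = " protein"


def pair_umls_genes_and_proteins(synonyms):
    """Pair UMLS protein IDs with UMLS gene IDs whose labels share a stem.

    `synonyms` is an iterable of (umls_id, label) tuples (assumed shape).
    """
    genes_by_stem = defaultdict(set)
    proteins_by_stem = defaultdict(set)
    for umls_id, label in synonyms:
        lowered = label.lower()
        if lowered.endswith(GENE_SUFFIX):
            genes_by_stem[lowered[: -len(GENE_SUFFIX)]].add(umls_id)
        elif lowered.endswith(PROTEIN_SUFFIX):
            proteins_by_stem[lowered[: -len(PROTEIN_SUFFIX)]].add(umls_id)

    pairs = set()
    for stem, protein_ids in proteins_by_stem.items():
        for gene_id in genes_by_stem.get(stem, ()):
            for protein_id in protein_ids:
                pairs.add((protein_id, gene_id))
    return pairs


def map_umls_proteins_to_uniprotkb(pairs, gene_concordance, conflation):
    """Follow UMLS protein -> UMLS gene -> gene identifier -> UniProtKB."""
    # Index the conflation: every non-UniProtKB member points at the
    # UniProtKB members of the same entry.
    uniprot_by_gene = defaultdict(set)
    for clique in conflation:
        proteins = {c for c in clique if c.startswith("UniProtKB:")}
        for curie in clique:
            if not curie.startswith("UniProtKB:"):
                uniprot_by_gene[curie] |= proteins

    mapping = defaultdict(set)
    for protein_id, gene_id in pairs:
        for gene_curie in gene_concordance.get(gene_id, ()):
            mapping[protein_id] |= uniprot_by_gene.get(gene_curie, set())
    # Drop UMLS proteins that never reached a UniProtKB identifier.
    return {k: v for k, v in mapping.items() if v}
```

Because one gene can conflate with several proteins, the result maps each UMLS protein to a set of UniProtKB identifiers rather than a single one.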

Generate the UMLS/UniProtKB table

To run this, you need to put the following Babel files into /inputs, renaming them as follows:

File	local name
UMLS synonyms	UMLS_synonyms
UniProtKB labels	UniProtKB_labels
Gene Protein Conflation	Gene_Protein_Conflation
Gene Concordance	Gene_Concordance

If you are running at RENCI and have access to the translator-dev namespace on Sterling, you can run collect_data.sh to get the files you need.
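Before running the mapping script, it may help to confirm that all four renamed files are in place. The check below is a small sketch; the helper name is hypothetical, and it assumes the local names in the table are used exactly as written, with no file extension.

```python
from pathlib import Path

# Local names taken from the table above.
REQUIRED_INPUTS = (
    "UMLS_synonyms",
    "UniProtKB_labels",
    "Gene_Protein_Conflation",
    "Gene_Concordance",
)


def missing_inputs(input_dir):
    """Return the required input names that are not files in input_dir."""
    directory = Path(input_dir)
    return [name for name in REQUIRED_INPUTS if not (directory / name).is_file()]
```

An empty list means every expected input was found; otherwise the list names what still needs to be copied or renamed.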

Create the output mappings:

python create_umls_uniprotkb.py

This will create outputs/UMLS_UniProtKB.tsv, which contains the mappings described above. Note that there can be multiple UniProtKB mappings for a single UMLS Protein.
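Since one UMLS Protein can map to several UniProtKB entries, downstream code will usually want the rows grouped by UMLS identifier. The sketch below assumes a two-column, tab-separated layout with the UMLS identifier first and the UniProtKB identifier second; that layout is an assumption, so check the actual file before relying on it.

```python
import csv
from collections import defaultdict


def group_mappings(lines):
    """Group UniProtKB identifiers by UMLS identifier.

    `lines` is any iterable of tab-separated text lines, such as an open
    file handle. The column order is assumed, not confirmed.
    """
    grouped = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        grouped[row[0]].append(row[1])
    return dict(grouped)
```

For example, `group_mappings(open("outputs/UMLS_UniProtKB.tsv"))` would return a dict whose values are lists of one or more UniProtKB identifiers.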

QC

We generate mappings for over 13000 UMLS Proteins. To check the mappings, we use an LLM to inspect the labels of the mapped proteins.

First, convert the tsv to a jsonl and add the UMLS and UniProtKB labels:

python add_labels.py
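The conversion step amounts to attaching a label to each side of every pair and writing one JSON object per line. The sketch below illustrates the idea only: the function name and the record field names are hypothetical and are not taken from add_labels.py.

```python
import json


def to_jsonl(pairs, umls_labels, uniprot_labels):
    """Serialize (umls_id, uniprot_id) pairs as JSONL with labels attached.

    Labels are looked up in plain dicts; a missing label becomes an empty
    string so that every record has the same fields.
    """
    lines = []
    for umls_id, uniprot_id in pairs:
        record = {
            "umls_id": umls_id,
            "umls_label": umls_labels.get(umls_id, ""),
            "uniprotkb_id": uniprot_id,
            "uniprotkb_label": uniprot_labels.get(uniprot_id, ""),
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)
```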

Then, run the LLM:

export OPENAI_LITCOIN_KEY=<your key>
python run_qc.py
python parse_qc.py

This runs the QC on the mappings in batch mode using gpt-4o-mini. In addition to generating a call for each UMLS/UniProtKB pair, it also generates a specified fraction of calls by permuting UMLS and UniProtKB identifiers. This lets us see whether the QC can differentiate between (putatively) correct and (known) incorrect mappings.
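One way to build the permuted negative controls is to re-pair UMLS identifiers with UniProtKB identifiers drawn at random from the same set, discarding any pair that is already a real mapping. The sketch below shows that idea under stated assumptions; the function name, the `fraction` and `seed` parameters, and the sampling strategy are illustrative and may differ from what run_qc.py does.

```python
import random


def make_negative_controls(pairs, fraction, seed=0):
    """Create mismatched (umls_id, uniprot_id) pairs as known-bad controls.

    The number of controls is `fraction` times the number of real pairs,
    rounded down. Candidates that coincide with a real pair are rejected.
    """
    rng = random.Random(seed)
    pairs = list(pairs)
    known = set(pairs)
    n_wanted = int(len(pairs) * fraction)
    uniprot_ids = [uniprot_id for _, uniprot_id in pairs]

    negatives = set()
    attempts = 0
    # Bound the number of attempts so small or degenerate inputs cannot loop forever.
    while len(negatives) < n_wanted and attempts < 100 * max(n_wanted, 1):
        attempts += 1
        umls_id, _ = rng.choice(pairs)
        candidate = (umls_id, rng.choice(uniprot_ids))
        if candidate not in known:
            negatives.add(candidate)
    return sorted(negatives)
```

Fixing the seed keeps the control set reproducible between runs, which makes score distributions comparable across QC batches.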

Now the notebook analyze_qc.ipynb can be used to analyze the results of the QC. In particular, we can look at the distribution of scores for correct and incorrect mappings:

score	bad	good
0	846	22
1	375	64
2	34	45
3	12	25
4	17	113
5	23	13478

So the vast majority of putatively good mappings get a score of 5, and the vast majority of known bad mappings get a score of 0 or 1. On hand inspection, the low scores given to putatively good mappings appear to be due to the LLM failing rather than to the mapping procedure.
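The "vast majority" claims can be checked directly from the table. The short calculation below uses only the counts shown above; the helper name is hypothetical.

```python
# score -> (bad count, good count), copied from the table above
SCORES = {
    0: (846, 22),
    1: (375, 64),
    2: (34, 45),
    3: (12, 25),
    4: (17, 113),
    5: (23, 13478),
}
BAD, GOOD = 0, 1


def fraction(counts, scores, column):
    """Share of one column's total that falls in the given score values."""
    total = sum(row[column] for row in counts.values())
    selected = sum(counts[score][column] for score in scores)
    return selected / total


bad_low = fraction(SCORES, (0, 1), BAD)    # known bad mappings scoring 0 or 1
good_high = fraction(SCORES, (5,), GOOD)   # putatively good mappings scoring 5
print(f"bad scored 0-1: {bad_low:.1%}, good scored 5: {good_high:.1%}")
```

That works out to 1221 of 1307 known bad mappings (about 93.4%) scoring 0 or 1, and 13478 of 13747 putatively good mappings (about 98.0%) scoring 5.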

cbizon/UMLS_UniProtKB
