
Add Translator documentation #292

Draft: wants to merge 12 commits into base: master
31 changes: 24 additions & 7 deletions README.md
@@ -22,33 +22,50 @@ strong dependencies against the Babel code.

## Configuration

Before running, edit `config.json` and set the `babel_downloads` and `babel_output` directories. Do not edit the
remaining items, which are used to control the build process.
The [`./kubernetes`](./kubernetes/README.md) directory contains Kubernetes manifest files
that can be used to set up a Pod to run Babel in. They'll give you an idea of the disk
space and memory requirements needed to run this pipeline.

Also, if building the disease/phenotype compendia, there are two files that
must be obtained with the user's UMLS license. In particular `MRCONSO.RRF`
and `MRSTY.RRF` should be placed in `/babel/input_data/private`.
Before running, read through `config.json` and make sure that the settings look correct.
You may need to update the version numbers of some databases that need to be downloaded,
or change the download and output directories.

A UMLS API key is required in order to download the UMLS and RxNorm databases. You will need
to set the `UMLS_API_KEY` environment variable to a UMLS API key, which you can obtain
by creating a profile on the [UMLS Terminology Services website](https://uts.nlm.nih.gov/uts).
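Once you have a key, export it in your shell before running the pipeline (the value below is a placeholder, not a real key):

```shell
# Placeholder value: substitute the API key from your UTS profile.
export UMLS_API_KEY="0000-example-key"
```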

## Building Compendia

Compendia building is managed by Snakemake. To build, for example, the anatomy-related compendia, run

```snakemake --cores 1 anatomy```

Currently, the following targets build compendia:
Currently, the following targets build compendia and synonym files:
* anatomy
* chemical
* disease
* gene
* genefamily
* protein
* macromolecular_complex
* taxon
* process
* publications

And these two build conflations:
* geneprotein
* drugchemical

Each target builds one or more compendia corresponding to a biolink model category. For instance, the anatomy target
builds compendia for `biolink:AnatomicalEntity`, `biolink:Cell`, `biolink:CellularComponent`, and `biolink:GrossAnatomicalStructure`.

You can also just run:

```snakemake --cores 5```

without a target to create all the files that are produced as part of Babel, including all reports and
alternate exports.

## Build Process

The information contained here is not required to create the compendia, but may be useful to understand. The build process is
30 changes: 30 additions & 0 deletions docs/Conflation.md
@@ -0,0 +1,30 @@
# Babel Conflation

Babel is designed to produce cliques of _identical_ identifiers, but our users would sometimes like to combine
identifiers that are similar in some other way. Babel generates "conflations" to support this.

Babel currently generates two conflations:
1. GeneProtein conflates a gene with the protein it encodes.
   The gene identifier will always be returned.
2. DrugChemical conflates drugs with their active ingredients as chemicals.

For each conflation we attempt to determine a Biolink type, and we arrange the identifiers in order of (1) the
preferred prefix order for that Biolink type, followed by (2) numeric suffix, from the numerically smallest
to the numerically largest.
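The ordering rule above can be sketched as follows; the `prefix_order` ranking and the helper name are illustrative, not Babel's actual code:

```python
def order_identifiers(curies, prefix_order):
    """Sketch of the ordering above: sort by (1) preferred prefix order for
    the Biolink type, then (2) numeric suffix, smallest first. The ranking
    passed in below is illustrative, not Babel's actual configuration."""
    def sort_key(curie):
        prefix, _, suffix = curie.partition(":")
        rank = (prefix_order.index(prefix) if prefix in prefix_order
                else len(prefix_order))
        digits = "".join(ch for ch in suffix if ch.isdigit())
        # Identifiers with no numeric suffix sort after numeric ones.
        return (rank, int(digits) if digits else float("inf"), suffix)
    return sorted(curies, key=sort_key)

order_identifiers(
    ["PUBCHEM.COMPOUND:962", "CHEBI:15377", "CHEBI:5"],
    prefix_order=["CHEBI", "PUBCHEM.COMPOUND"],
)
# → ["CHEBI:5", "CHEBI:15377", "PUBCHEM.COMPOUND:962"]
```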

## How are conflations generated in Babel and used in NodeNorm?

Each conflation consists of a list of lists of clique identifiers, which are stored in Redis databases in NodeNorm.
If a particular conflation is turned on, NodeNorm will:
1. Normalize the input identifier to a clique identifier.
2. If the clique identifier is not part of any conflation, return it as-is.
3. If the clique identifier is part of a conflation, construct a new clique whose preferred identifier is the first
   identifier in the conflation entry, and which consists of all the identifiers from all the cliques included in
   that conflation.
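A minimal sketch of these steps, with illustrative data structures (not NodeNorm's internal representation):

```python
def normalized_clique(input_id, cliques, conflation):
    """Sketch of the steps above. `cliques` maps every identifier to its
    clique (a list whose first member is the preferred identifier);
    `conflation` is a list of lists of clique identifiers. These names are
    illustrative, not NodeNorm's internal API."""
    # 1. Normalize the input identifier to a clique identifier.
    clique = cliques[input_id]
    clique_id = clique[0]
    # 3. If the clique is part of the conflation, merge every member clique;
    #    the first identifier of the group becomes the preferred identifier.
    for group in conflation:
        if clique_id in group:
            merged = []
            for cid in group:
                merged.extend(cliques[cid])
            return merged
    # 2. Otherwise, return the clique as-is.
    return clique

# Illustrative data: an IL6 gene clique and its protein clique,
# joined by a GeneProtein-style conflation entry.
cliques = {
    "NCBIGene:3569": ["NCBIGene:3569", "HGNC:6018"],
    "HGNC:6018": ["NCBIGene:3569", "HGNC:6018"],
    "UniProtKB:P05231": ["UniProtKB:P05231"],
}
geneprotein = [["NCBIGene:3569", "UniProtKB:P05231"]]
normalized_clique("HGNC:6018", cliques, geneprotein)
# → ["NCBIGene:3569", "HGNC:6018", "UniProtKB:P05231"]
```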

## How are types handled for conflated cliques?

Babel does not assign a type to any conflations. When NodeNorm is called with a particular conflation turned on,
it determines the types of a conflated clique by:
1. Starting with the most specific type of the first identifier in the conflation.
2. Adding all the supertypes of that most specific type, as determined
   by the [Biolink Model Toolkit](https://github.com/biolink/biolink-model-toolkit).
3. Adding all the types and ancestors for all the other identifiers in the conflation, without duplication.
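These steps can be sketched as follows; the type hierarchy shown is simplified for illustration, not the real Biolink Model:

```python
def conflated_types(member_types, ancestors):
    """Sketch of the type assembly above. `member_types` lists the most
    specific Biolink type of each identifier in the conflated clique, in
    order; `ancestors` maps a type to its supertypes, most specific first,
    as the Biolink Model Toolkit would report them."""
    types = []
    for member_type in member_types:
        for t in [member_type] + ancestors.get(member_type, []):
            if t not in types:  # add without duplication
                types.append(t)
    return types

# Simplified, illustrative hierarchy.
ancestors = {
    "biolink:Gene": ["biolink:BiologicalEntity", "biolink:NamedThing"],
    "biolink:Protein": ["biolink:BiologicalEntity", "biolink:NamedThing"],
}
conflated_types(["biolink:Gene", "biolink:Protein"], ancestors)
# → ["biolink:Gene", "biolink:BiologicalEntity", "biolink:NamedThing", "biolink:Protein"]
```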
120 changes: 120 additions & 0 deletions docs/README.md
@@ -0,0 +1,120 @@
# Babel Documentation

This directory contains several pieces of Babel documentation.

Both [Node Normalization (NodeNorm)](https://github.com/TranslatorSRI/NodeNormalization) and
[Name Resolution (NameRes or NameLookup)](https://github.com/TranslatorSRI/NameResolution) have their own GitHub repositories
with their own documentation, but this directory is intended to include all the basic instructions
needed to work with Babel and its tools.

## What does Babel do?

Babel was built as part of the [NCATS Translator project](https://ui.transltr.io/) to solve the problem
of multiple databases using different identifiers (specifically, [CURIEs](https://en.wikipedia.org/wiki/CURIE)) to
refer to the same concept, such as [CHEBI:15377 "water"](https://www.ebi.ac.uk/chebi/searchId.do?chebiId=15377) and
[PUBCHEM.COMPOUND:962 "water"](https://pubchem.ncbi.nlm.nih.gov/compound/962). Babel downloads many online
databases of identifiers and uses their cross-reference information to identify
_cliques_ of identifiers that refer to the same concept. Each clique is assigned a
type from the [Biolink Model](https://github.com/biolink/biolink-model), which determines which identifier prefixes are
allowed and the order in which the identifiers are presented. One of these identifiers
is chosen to be the _preferred identifier_ for the clique. Within Translator, this
information is made available through the [Node Normalization service](https://github.com/TranslatorSRI/NodeNormalization).
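Clique construction from pairwise cross-references can be illustrated with a simple union-find sketch (an illustration of the idea, not Babel's actual algorithm):

```python
def build_cliques(xrefs):
    """Group identifiers into cliques using union-find. `xrefs` is a list of
    (curie_a, curie_b) equivalences harvested from source databases."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in xrefs:
        parent[find(a)] = find(b)  # union the two sets

    groups = {}
    for curie in list(parent):
        groups.setdefault(find(curie), set()).add(curie)
    return list(groups.values())

# Cross-references for water chain the three identifiers into one clique.
water_cliques = build_cliques([
    ("CHEBI:15377", "PUBCHEM.COMPOUND:962"),
    ("PUBCHEM.COMPOUND:962", "MESH:D014867"),
])
# → one clique: {"CHEBI:15377", "PUBCHEM.COMPOUND:962", "MESH:D014867"}
```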

In certain contexts, differentiating between some related cliques doesn't make sense:
for example, you might not want to differentiate between a gene and the product of that
gene, such as a protein. Babel provides different [conflations](./Conflation.md) that group cliques
on the basis of various criteria: for example, the GeneProtein conflation combines a
gene with the protein that that gene encodes.

While generating these cliques, Babel also collects all the synonyms for every clique,
which can then be used by tools like [Name Resolution (NameRes)](https://github.com/TranslatorSRI/NameResolution) to provide
name-based lookup of concepts.

## How can I access Babel cliques?

There are several ways of accessing Babel cliques:
* You can run the Babel pipeline to generate the cliques yourself. Note that Babel
currently has very high memory requirements -- it requires around 500G of memory
in order to generate the Protein clique. The [main Babel README](../README.md) has
information on running this pipeline.
* The NCATS Translator project provides the [Node Normalization](https://nodenorm.transltr.io/docs) frontend to
"normalize" identifiers -- any member of a particular clique will be normalized
to the same preferred identifier, and the API will return all the secondary
identifiers, Biolink type, description and other useful information.
You can find out more about this frontend on [its GitHub repository](https://github.com/TranslatorSRI/NodeNormalization).
* The NCATS Translator project also provides the [Name Lookup (Name Resolution)](https://name-lookup.transltr.io/)
frontend for searching for concepts by labels or synonyms. You can find out more
about this frontend at [its GitHub repository](https://github.com/TranslatorSRI/NameResolution).
* Members of the Translator consortium can also request access to the [Babel outputs](./BabelOutputs.md), which
are currently available in JSONL or [KGX](https://github.com/biolink/kgx) formats.
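As a quick illustration of the NodeNorm option, a lookup URL can be built like this; the `get_normalized_nodes` endpoint and its parameters are documented in NodeNorm's Swagger interface, so verify them there before relying on this sketch:

```python
from urllib.parse import urlencode

# Build a NodeNorm lookup URL for water (CHEBI:15377) with conflation on.
base = "https://nodenorm.transltr.io/get_normalized_nodes"
query = urlencode({"curie": "CHEBI:15377", "conflate": "true"})
url = f"{base}?{query}"
# url == "https://nodenorm.transltr.io/get_normalized_nodes?curie=CHEBI%3A15377&conflate=true"
# Fetching this URL with any HTTP client returns the clique: its preferred
# identifier, equivalent identifiers, Biolink types, and information content.
```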

## What is the Node Normalization service (NodeNorm)?

The Node Normalization service, Node Normalizer or [NodeNorm](https://github.com/TranslatorSRI/NodeNormalization) is an
NCATS Translator web service to normalize identifiers by returning a single preferred identifier for any identifier
provided.

In addition to returning the preferred identifier and all the secondary identifiers for a clique, NodeNorm will also
return its Biolink type and "information content" score, and optionally any descriptions we have for these identifiers.

It also includes some endpoints for normalizing an entire TRAPI message and other APIs intended primarily for
Translator users.

You can find out more about NodeNorm at its [Swagger interface](https://nodenormalization-sri.renci.org/docs)
or [in this Jupyter Notebook](https://github.com/TranslatorSRI/NodeNormalization/blob/master/documentation/NodeNormalization.ipynb).

## What is the Name Resolution service (NameRes)?

The Name Resolution service, Name Lookup or [NameRes](https://github.com/TranslatorSRI/NameResolution) is an
NCATS Translator web service for looking up preferred identifiers by search text. Although it is primarily
designed to power NCATS Translator's autocomplete text fields, it has also been used for
named-entity linking.

You can find out more about NameRes at its [Swagger interface](https://name-resolution-sri.renci.org/docs)
or [in this Jupyter Notebook](https://github.com/TranslatorSRI/NameResolution/blob/master/documentation/NameResolution.ipynb).

## What are "information content" values?

Babel obtains information content values for over 3.8 million concepts from
[Ubergraph](https://github.com/INCATools/ubergraph?tab=readme-ov-file#graph-organization), based on the number of
terms related to the specified term as either a subclass or via any existential relation. They are decimal values
that range from 0.0 (a high-level, broad term with many subclasses) to 100.0 (a very specific term with no subclasses).
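As a rough illustration (an assumed normalization, not necessarily Ubergraph's exact computation), a value in this range can be derived from subclass counts like so:

```python
import math

def information_content(n_subclasses, n_total_terms):
    """Assumed, illustrative normalization: broad terms with many subclasses
    score near 0.0; maximally specific terms with none score 100.0."""
    return 100.0 * (1.0 - math.log(n_subclasses + 1) / math.log(n_total_terms))

information_content(0, 3_800_000)  # a leaf term → 100.0
```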

## I've found a "split" clique: two identifiers that should be considered identical are in separate cliques.

Please report this as an issue to the [Babel GitHub repository](https://github.com/TranslatorSRI/Babel/issues).
At a minimum, please include the CURIEs of the identifiers that should be combined. Links to
a NodeNorm instance showing the two cliques are very helpful. Evidence supporting the lumping, such as a link to an
external database that makes it clear that these identifiers refer to the same concept, is also very helpful: while we
have some ability to combine cliques manually if needed urgently for some application, we prefer to find a source of
mappings that would combine the two identifiers, allowing us to improve cliquing across Babel.

## I've found a "lumped" clique: two identifiers that are combined in a single clique refer to different concepts.

Please report this as an issue to the [Babel GitHub repository](https://github.com/TranslatorSRI/Babel/issues).
At a minimum, please include the CURIEs of the identifiers that should be split. Links to
a NodeNorm instance showing the lumped clique are very helpful. Evidence, such as a link to an external database
that makes it clear that these identifiers refer to different concepts, is also very helpful: while we have some
ability to split cliques manually if needed urgently for some application, we prefer to find the source of the
mapping that incorrectly combined the two identifiers, allowing us to improve cliquing across Babel.

## Where do the clique descriptions come from?

Currently, all descriptions for NodeNorm concepts come from [UberGraph](https://github.com/INCATools/ubergraph/).
Descriptions are collected for every identifier within a clique, and then the description associated
with the most preferred identifier that has one is provided for the clique. Descriptions are not included in NameRes,
but the `description` flag can be used to include descriptions when returning cliques from NodeNorm.
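The selection logic amounts to something like this sketch (the function and data are hypothetical):

```python
def clique_description(preferred_order, descriptions):
    """Sketch of the rule above: return the description attached to the most
    preferred identifier that has one."""
    for curie in preferred_order:
        if curie in descriptions:
            return descriptions[curie]
    return None  # clique has no description at all

# Hypothetical clique: the preferred identifier has no description,
# so the second identifier's description is used.
clique_description(
    ["CHEBI:15377", "PUBCHEM.COMPOUND:962"],
    {"PUBCHEM.COMPOUND:962": "An example description."},
)
# → "An example description."
```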

## How can I build Babel?

Babel is difficult to build, primarily because of its inefficient memory handling -- we currently need around 500G of
memory to build the largest compendia (Protein and the conflated DrugChemical information), although the smaller
compendia should be buildable with far less memory. We are working on reducing these requirements as far as possible.
You can read more about [Babel's build process](./Build.md), and please do contact us if you run into any problems or
would like some assistance.

## Who should I contact for more information about Babel?

You can find out more about Babel by [opening an issue on this repository](https://github.com/TranslatorSRI/Babel/issues),
contacting one of the [Translator SRI PIs](https://ncats.nih.gov/research/research-activities/translator/projects), or
contacting the [NCATS Translator team](https://ncats.nih.gov/research/research-activities/translator/about).