This repository contains the code for the pre-training and inference procedures of Peptone Ltd's protein language models, built with the NVIDIA NeMo toolkit.
ProteoNeMo can be used to extract residue-level representations of proteins and to train related downstream tasks.
As a prerequisite, you must have NeMo 1.7 or later installed to use this repository.
Install the proteonemo package:
Clone the ProteoNeMo repository, change to the ProteoNeMo directory, and run:
python setup.py install
ProteoNeMo can be pre-trained on:
- UniRef
  - UniRef 50
  - UniRef 90
  - UniRef 100
- UniParc
- UniProtKB
  - UniProtKB Swiss-Prot
  - UniProtKB TrEMBL
  - UniProtKB isoform sequences
ProteoNeMo can be pre-trained on any of the datasets listed above; you can choose a single dataset or combine two or more of them.
Each dataset will be:
- Downloaded from UniProt and decompressed as a .fasta file
- Sharded into several smaller .txt sub-files, each containing a random subset of the related .fasta file, already split into training and evaluation samples
- Tokenized into several .hdf5 files, one for each sharded .txt file, where the masking procedure has already been applied
In the ProteoNeMo directory run:
export BERT_PREP_WORKING_DIR=<your_dir>
cd scripts
bash create_datasets_from_start.sh <to_download>
Where:
- BERT_PREP_WORKING_DIR defines the directory where the data will be downloaded and preprocessed
- <to_download> defines the datasets we want to download and preprocess, where uniref_50_only is the default

The outputs are the download, sharded and hdf5 directories under the $BERT_PREP_WORKING_DIR parent directory, containing the related files.
| To Download | Datasets |
|---|---|
| uniref_50_only | UniRef 50 |
| uniref_all | UniRef 50, 90 and 100 |
| uniparc | UniParc |
| uniprotkb_all | UniProtKB Swiss-Prot, TrEMBL and isoform sequences |
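If you want to sanity-check the preprocessing output, you can open one of the resulting shards under the hdf5 directory with h5py. This is only a sketch: the file name is illustrative and the dataset keys depend on the preprocessing scripts, so it simply lists whatever the shard contains.

```python
import h5py

def show(name, obj):
    # Print the shape and dtype of every dataset stored in the shard.
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)

# Illustrative path: one tokenized shard under $BERT_PREP_WORKING_DIR/hdf5.
with h5py.File("hdf5/uniref_50_shard_0.hdf5", "r") as f:
    f.visititems(show)
```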
Once the download and preprocessing procedure is complete, you're ready to pre-train ProteoNeMo.
The pre-training procedure exploits NeMo to solve the Masked Language Modeling (Masked LM) task. One training instance of Masked LM is a single modified protein sequence. Each token in the sequence has a 15% chance of being selected for masking; a selected token is replaced with [MASK] 80% of the time, replaced with a random token 10% of the time, and left unchanged the remaining 10% of the time. The task is then to predict the original token.
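For illustration only, here is a minimal sketch of that 80/10/10 masking scheme, assuming a simple residue-level tokenization (the function, vocabulary and example sequence are hypothetical; the actual masking is applied by the preprocessing scripts when the .hdf5 shards are built):

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    """BERT-style masking: each token has a 15% chance of being selected;
    a selected token becomes [MASK] 80% of the time, a random token 10%
    of the time, and stays unchanged the remaining 10% of the time."""
    masked, labels = [], []
    for token in tokens:
        if random.random() < mask_prob:
            labels.append(token)          # prediction target: the original token
            roll = random.random()
            if roll < 0.8:
                masked.append(mask_token)
            elif roll < 0.9:
                masked.append(random.choice(vocab))
            else:
                masked.append(token)
        else:
            labels.append(None)           # not selected: nothing to predict
            masked.append(token)
    return masked, labels

# Hypothetical example: a short protein sequence tokenized per residue.
amino_acids = list("ACDEFGHIKLMNPQRSTVWY")
print(mask_tokens(list("MKTAYIAKQR"), amino_acids))
```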
We have currently integrated BERT-like uncased models from HuggingFace.
The first thing you need to do is create a model_config.yaml file in the conf directory, specifying the related pre-training and model options. You can use this config as a template.
Take a look at these NeMo tutorials to get familiar with such options.
Secondly, you have to modify the config_name argument of the @hydra_runner decorator in bert_pretraining.py so that it points to your configuration file.
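For reference, the relevant part of bert_pretraining.py looks roughly like the sketch below, assuming NeMo's standard hydra_runner decorator (the function body is omitted and the config name is just an example):

```python
from nemo.core.config import hydra_runner

# config_name should match the .yaml file you created in the conf directory,
# e.g. conf/model_config.yaml -> config_name="model_config".
@hydra_runner(config_path="conf", config_name="model_config")
def main(cfg):
    ...  # build and train the model from cfg


if __name__ == "__main__":
    main()
```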
Lastly, in the ProteoNeMo directory run:
cd scripts
python bert_pretraining.py
The pre-training will start and a progress bar will appear.
Once the pre-training procedure has started, a nemo_experiments directory will be automatically created under the scripts directory.
Based on the name: <PretrainingModelName> parameter in the .yaml configuration file, a <PretrainingModelName> sub-directory containing all the related pre-training experiment logs will be created under nemo_experiments.
In the ProteoNeMo directory run:
tensorboard --logdir=scripts/nemo_experiments/<PretrainingModelName>
The TensorBoard UI will be available on port 6006.
Once a ProteoNeMo model has been pre-trained, you'll get a .nemo file, placed in the nemo_path you specified in the .yaml configuration file.
You're now ready to extract the residue-level representations of each protein in a .fasta file.
In the ProteoNeMo directory run:
cd scripts
python bert_eval.py --input_file <fasta_input_file> \
--vocab_file ../static/vocab.txt \
--output_dir <reprs_output_dir> \
--model_file <nemo_pretrained_model>
Where:
- --input_file defines the .fasta file containing the proteins for which you want to extract the residue-level representations
- --vocab_file defines the .txt file containing the vocabulary you want to use during the inference phase. We suggest you use the standard one
- --output_dir defines the output directory where the residue-level representations will be written. You'll get a .pt file for each protein sequence in the --input_file
- --model_file defines the .nemo file used to get the pre-trained weights needed to compute the residue-level representations
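As a quick check of the inference output, you can load one of the resulting .pt files with PyTorch. This is a minimal sketch, assuming one file per protein as described above; the file name and the exact layout of the stored object depend on your input and configuration.

```python
import torch

# Hypothetical output file written by bert_eval.py for one protein sequence.
reprs = torch.load("<reprs_output_dir>/P12345.pt")

# Inspect what was stored; for a tensor you would expect one embedding
# vector per residue, e.g. shape (sequence_length, hidden_size).
print(type(reprs))
if torch.is_tensor(reprs):
    print(reprs.shape)
```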
This source code is licensed under the Apache 2.0 license found in the LICENSE
file in the root directory of this source tree.