$ mamba create -n pangolin_train -c conda-forge -c pytorch --yes \
python=3.10 pytorch=1.13 torchvision torchaudio cudatoolkit=11.7 numpy=1.24 h5py
# Check GPU availability
$ python -c "import torch; print(torch.cuda.is_available())"
The steps described below reproduce the training and testing procedures in Zeng and Li, Genome Biology 2022. This is a work in progress: the pipeline is currently set up to produce the top-k and AUPRC metrics for Pangolin shown in Figure 1b.
Generate training and test datasets. To generate them from the intermediate files (`splice_table_{species}.txt`, present in the repository for `{species}` = Human, Macaque, Mouse, Rat), follow Step 2. To run the whole pipeline starting from RNA-seq reads, follow Step 1 and then Step 2.
Step 1: Run `snakemake -s Snakefile1 --config SPECIES={species}` and `snakemake -s Snakefile2 --config SPECIES={species}` for `{species}` = Human, Macaque, Mouse, Rat. You will probably need to adjust file paths. This will map RNA-seq reads for each species and tissue, quantify usage of splice sites, and output tables of splice sites for each gene.
Dependencies: Snakemake, Samtools, fastp, STAR, RSEM, MMR, Sambamba, RegTools, SpliSER, pybedtools
Inputs:
- Reference genomes and annotations from GENCODE and Ensembl
- RNA-seq reads from ArrayExpress (mouse, rat, macaque, human)
Outputs:
`splice_table_{species}.txt` for each species, used to generate training datasets, and `splice_table_Human.test.txt`, used to generate test datasets. Each line is formatted as:
gene_id paralog_marker chromosome strand gene_start gene_end splice_site_pos:heart_usage,liver_usage,brain_usage,testis_usage,...
# Note: paralog_marker is unused and set to 0 for all genes
# Note: See utils_multi.py for how genes with usage < 0 are interpreted
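For reference, a minimal sketch of how one line of this table could be parsed. This is a hypothetical helper, not code from the repository, and it assumes each `splice_site_pos:usages` entry is a separate whitespace-separated field:

```python
# Hypothetical parser for one line of splice_table_{species}.txt,
# based only on the format documented above (not repository code).
def parse_splice_table_line(line):
    fields = line.strip().split()
    gene_id, paralog_marker, chrom, strand, start, end = fields[:6]
    sites = {}
    for entry in fields[6:]:
        pos, usages = entry.split(":")
        # Per-tissue usage values (heart, liver, brain, testis, ...);
        # values < 0 are handled specially (see utils_multi.py).
        sites[int(pos)] = [float(u) for u in usages.split(",")]
    return {
        "gene_id": gene_id,
        "chrom": chrom,
        "strand": strand,
        "start": int(start),
        "end": int(end),
        "sites": sites,
    }
```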
Step 2: Run `./create_files.sh`. This will generate the `dataset*.h5` files, which are the training and test datasets (requires ~500 GB of space). These are used in the train and evaluate steps below.
Dependencies:
`conda create -c bioconda -n create_files_env python=2.7 h5py bedtools` or equivalent
Inputs:
- `splice_table_{species}.txt` for each species and `splice_table_Human.test.txt` (included in the repository or generated from Step 1)
- Reference genomes for each species from GENCODE and Ensembl
Outputs: `dataset_train_all.h5` (all species) and `dataset_test_1.h5` (human test sequences)
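As a quick sanity check, the generated files can be inspected with h5py. The internal layout of the files is not documented here, so this sketch just lists whatever is present:

```python
# Sketch: list the top-level contents of a generated HDF5 file.
# The internal layout is an assumption; adjust names to the real files.
import h5py

with h5py.File("dataset_train_all.h5", "r") as f:
    for name, item in f.items():
        print(name, getattr(item, "shape", None), getattr(item, "dtype", None))
```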
Run `train.sh` to train all models for the evaluations used in Figure 1b. Depending on your GPU, this may take a few weeks! I have uploaded models from running just the first two lines of `train.sh` to `train/models` for reference. (TODO: Add fine-tuning steps for the models used in later figures.)
Dependencies:
`conda create -c pytorch -n train_test_env python=3.8 pytorch torchvision torchaudio cudatoolkit=11.3 h5py` or equivalent
Inputs:
`dataset_train_all.h5` from the preprocessing steps
Outputs:
- Model checkpoints in `train/models`
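For reference, a minimal sketch for inspecting one of these checkpoints, assuming they are standard `torch.save()` outputs (the filename below is a placeholder):

```python
# Sketch: inspect a saved checkpoint (placeholder filename).
# Assumes files in train/models are standard torch.save() outputs.
import torch

state = torch.load("train/models/model.ckpt", map_location="cpu")
if isinstance(state, dict):
    # If this is a state dict, print parameter names and shapes.
    for key, value in state.items():
        print(key, getattr(value, "shape", None))
```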
Run `test.sh` to get top-k and AUPRC statistics for the test datasets. (TODO: Add additional evaluation metrics.)
Dependencies:
- Same as those for the training step, plus `sklearn` (scikit-learn)
Inputs:
- `dataset_test_1.h5` from the preprocessing steps
- Follow the training steps or clone https://github.com/tkzeng/Pangolin.git to get models
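For reference, a hedged sketch of how these two metrics can be computed from per-position labels and scores with scikit-learn; the arrays below are toy examples, not the repository's data:

```python
# Sketch: top-k accuracy and AUPRC for splice site predictions.
# `labels` and `scores` are toy arrays, not the repository's data.
import numpy as np
from sklearn.metrics import average_precision_score

labels = np.array([1, 0, 0, 1, 0, 1, 0, 0])  # 1 = true splice site
scores = np.array([0.9, 0.1, 0.3, 0.8, 0.2, 0.4, 0.7, 0.05])

# AUPRC: area under the precision-recall curve.
auprc = average_precision_score(labels, scores)

# Top-k accuracy: with k = number of true sites, the fraction of
# the k highest-scoring positions that are true sites.
k = int(labels.sum())
top_k = np.argsort(scores)[-k:]
top_k_acc = labels[top_k].mean()

print(f"AUPRC: {auprc:.3f}, top-k accuracy: {top_k_acc:.3f}")
```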