Modified Installation

$ mamba create -n pangolin_train -c conda-forge -c pytorch --yes \
  python=3.10 pytorch=1.13 torchvision torchaudio cudatoolkit=11.7 numpy=1.24 h5py
# Check GPU availability
$ python -c "import torch; print(torch.cuda.is_available())"

Pangolin_train

The steps described below recreate the training and testing steps in Zeng and Li, Genome Biology 2022. This is a work in progress: it is currently set up to produce top-k and AUPRC metrics for Pangolin as in Figure 1b.

preprocessing

Generate the training and test datasets. To generate them from the intermediate files (splice_table_{species}.txt, present in the repository for {species} = Human, Macaque, Mouse, Rat), follow Step 2 only. To run the whole pipeline starting from RNA-seq reads, follow Step 1 and then Step 2.

Step 1

Run snakemake -s Snakefile1 --config SPECIES={species} and snakemake -s Snakefile2 --config SPECIES={species} for {species} = Human,Macaque,Mouse,Rat. You will probably need to adjust file paths. This will map RNA-seq reads for each species and tissue, quantify usage of splice sites, and output tables of splice sites for each gene.

Dependencies: Snakemake, Samtools, fastp, STAR, RSEM, MMR, Sambamba, RegTools, SpliSER, pybedtools

Inputs:

  • RNA-seq reads for each species and tissue (file paths are set in the Snakefiles and will likely need adjusting)

Outputs:

  • splice_table_{species}.txt for each species, used to generate the training datasets, and splice_table_Human.test.txt, used to generate the test datasets. Each line is formatted as follows (a parsing sketch follows this list):
gene_id paralog_marker  chromosome  strand  gene_start  gene_end  splice_site_pos:heart_usage,liver_usage,brain_usage,testis_usage,...
# Note: paralog_marker is unused and set to 0 for all genes
# Note: See utils_multi.py for how usage values < 0 are interpreted
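
For reference, here is a minimal parsing sketch for one such line. It assumes the fields are tab-separated and that everything after gene_end is a series of pos:usages entries; the example line is made up, and utils_multi.py remains the authoritative parser.

# parse_splice_table.py -- illustrative only; see utils_multi.py for the real parsing
line = "GENE1\t0\tchr1\t+\t1000\t9000\t2500:0.9,0.8,-1,0.7\t4200:0.2,0.1,0.3,0.4"

fields = line.rstrip("\n").split("\t")
gene_id, paralog_marker, chrom, strand = fields[:4]
gene_start, gene_end = int(fields[4]), int(fields[5])
for entry in fields[6:]:
    pos, usages = entry.split(":")
    usage = [float(u) for u in usages.split(",")]  # heart, liver, brain, testis, ...
    print(gene_id, pos, usage)                     # values < 0 are handled specially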

Step 2

Run ./create_files.sh. This will generate dataset*.h5 files, which are the training and test datasets (they require ~500 GB of disk space). These are used in the training and evaluate steps below.

Dependencies:

  • conda create -c bioconda -n create_files_env python=2.7 h5py bedtools or equivalent

Inputs:

  • splice_table_{species}.txt for each species and splice_table_Human.test.txt (included in the repository or generated from Step 1)
  • Reference genomes for each species from GENCODE and Ensembl

Outputs:

  • dataset_train_all.h5 (all species) and dataset_test_1.h5 (human test sequences)
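
To sanity-check the generated files, a generic h5py listing works without assuming any of the internal dataset names:

# inspect_h5.py -- prints every dataset's path, shape, and dtype
import h5py

def show(name, obj):
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)

with h5py.File("dataset_train_all.h5", "r") as f:
    f.visititems(show)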

training

Run train.sh to train all models for the evaluations used in Figure 1b. Depending on your GPU, this may take a few weeks! For reference, I have uploaded the models produced by running just the first two lines of train.sh to train/models. (TODO: Add fine-tuning steps for the models used in later figures.)

Dependencies:

  • conda create -c pytorch -n train_test_env python=3.8 pytorch torchvision torchaudio cudatoolkit=11.3 h5py or equivalent

Inputs:

  • dataset_train_all.h5 from preprocessing steps

Outputs:

  • Model checkpoints in train/models
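
For a quick look at what was saved, the sketch below loads a checkpoint and lists its parameter shapes. The checkpoint path is a placeholder, and whether train.sh pickles a full module or just a state_dict is not assumed here, so both cases are handled (loading a full module also requires the model class to be importable):

# inspect_checkpoint.py -- the checkpoint path is hypothetical
import torch

ckpt = torch.load("train/models/checkpoint.pt", map_location="cpu")
state = ckpt.state_dict() if hasattr(ckpt, "state_dict") else ckpt
for name, tensor in state.items():
    print(name, tuple(tensor.shape))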

evaluate

Run test.sh to get top-k and AUPRC statistics for test datasets. (TODO: Add additional evaluation metrics.)

Dependencies:

  • Same as those for training, plus scikit-learn (sklearn)

Inputs:

  • dataset_test_1.h5 from the preprocessing steps and trained model checkpoints from train/models
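
For reference, a minimal sketch of the two metrics. It assumes y_true is a 0/1 array marking annotated splice sites and y_score holds the model's per-position probabilities (both arrays below are toy data); top-k here follows the convention used by SpliceAI and Pangolin, where k equals the number of true sites:

# metrics_sketch.py -- illustrative computation of top-k accuracy and AUPRC
import numpy as np
from sklearn.metrics import average_precision_score

def topk_accuracy(y_true, y_score):
    k = int(y_true.sum())                 # one prediction allowed per true site
    top_idx = np.argsort(y_score)[-k:]    # indices of the k highest scores
    return y_true[top_idx].mean()         # fraction of those that are true sites

y_true = np.array([0, 1, 0, 0, 1, 0, 1, 0])
y_score = np.array([0.10, 0.90, 0.20, 0.40, 0.80, 0.30, 0.35, 0.05])
print("top-k:", topk_accuracy(y_true, y_score))
print("AUPRC:", average_precision_score(y_true, y_score))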
