This repository contains the evidence-based Direct Inference pipeline for transcript assembly. The pipeline is fully automated; the user only needs to provide a list of NCBI-SRA Run Accessions. The pipeline infers genes directly from RNA-Seq evidence (neither ab initio predictions nor homology information are considered), and it can be flexibly modified and shared.
The pipeline steps are:
- Download raw RNA-Seq data from NCBI-SRA, given a list of NCBI-SRA Run Accessions.
- Perform transcript assembly using a choice of tools.
- Obtain splice junctions using portcullis.
- Perform meta-assembly using mikado.
- Identify ORFs using orfipy.
Steps 1 and 2 are implemented with pyrpipe. The whole pipeline is implemented in snakemake so that it can be parallelized over samples.
The minimum required input to the pipeline is a list of NCBI-SRA Run Accessions in the file `srrids.txt`. The final output of the pipeline is the file `mikado.loci.gff3`, containing the final transcript assembly.
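A minimal sketch of `srrids.txt`, assuming the common one-accession-per-line layout (the accessions shown are hypothetical placeholders; substitute the runs for your species of interest):

```bash
# Create srrids.txt with one NCBI-SRA Run Accession per line
# (the accessions shown are hypothetical placeholders).
cat > srrids.txt <<'EOF'
SRR0000001
SRR0000002
SRR0000003
EOF
```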
Please follow this section to correctly set up an environment for executing the pipeline.
- The pipeline dependencies are available via the bioconda channel. Please install conda if not already present.
- Make sure conda channels are set up correctly:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
- Build the conda environment from the `environment.yaml` file:
conda env create -f environment.yaml
Note: If portcullis fails to install as part of the environment, install it separately (a quick sanity check follows this list):
conda install -c bioconda portcullis
- If using `class2`, please execute the instructions in `class2_instructions.sh` to set up `class2`.
- Mikado and orfipy steps are executed in a separate conda environment. The corresponding conda environment file is located in the `envs/` directory.
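Once the steps above are complete, a quick sanity check is to activate the environment and confirm the key tools resolve; this sketch assumes the environment created from `environment.yaml` is named `orphan_prediction`, the name used later in this README. The separate Mikado/orfipy environment needs no manual setup: snakemake builds it from the file in `envs/` automatically when invoked with `--use-conda`.

```bash
# Activate the main environment and verify that core tools are on PATH.
conda activate orphan_prediction
which snakemake hisat2 portcullis
portcullis --version   # prints the installed portcullis version
```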
The main pipeline configuration file is `config.yaml`. Edit this file to change pipeline behaviour, such as the reference genome, hisat2 index, output directory, and the mikado and orfipy options.
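Before editing, it can help to inspect the shipped defaults. The keys sketched in the comments below are hypothetical examples of the kind of values `config.yaml` carries, not the actual key names:

```bash
# Inspect the shipped configuration before editing.
cat config.yaml

# Hypothetical sketch of the kind of values to set (key names are
# assumptions; use the actual keys present in the shipped config.yaml):
#   genome:  reference_data/genome.fa
#   index:   reference_data/hisat2_index
#   out_dir: results
```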
To test the pipeline with *Arabidopsis thaliana*, execute the script `prepare_data.sh` to download the reference genome and build a hisat2 index:
conda activate orphan_prediction
bash prepare_data.sh
The above commands will create a directory, `reference_data`.
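A quick way to confirm the script finished is to list the new directory; assuming default hisat2 behaviour, the index appears as a set of `.ht2` files alongside the genome FASTA:

```bash
# Confirm the reference genome and hisat2 index files were created.
ls -lh reference_data/
```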
To generate annotations for a species of your choice, download its genome and build a hisat2 index, then edit `config.yaml` to specify which genome and index to use.
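As a sketch, given a downloaded genome FASTA (file names below are placeholders), the index is built with `hisat2-build`:

```bash
# Build a hisat2 index from a genome FASTA (file names are placeholders).
# -p sets the number of threads used for index construction.
hisat2-build -p 8 my_genome.fa reference_data/my_genome_index
```

Point the genome and index entries in `config.yaml` at these paths.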
Edit the files under the `params` directory to specify tool parameters. These files are automatically parsed by pyrpipe.
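The file names and option format under `params` are defined by the repository and the pyrpipe documentation, so review the shipped defaults before changing anything:

```bash
# List the per-tool parameter files and review the shipped defaults.
ls params/
cat params/*
```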
The parameters required by Mikado and orfipy are tunable via `config.yaml` or the snakemake file.
To execute the pipeline on a single node, use the following command:
snakemake -j 2 --use-conda --conda-frontend conda
The above command will execute the pipeline using 2 cores, i.e., the transcript assembly step will run 2 samples in parallel.
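Before a full run, a snakemake dry run prints the jobs that would be executed without running anything:

```bash
# Dry run: show the planned jobs without executing them.
snakemake -n --use-conda --conda-frontend conda
```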
To execute the pipeline on an HPC with multiple nodes, execute:
snakemake -j 20 --profile snakemake_config/slurm --use-conda --conda-frontend conda
The above command will execute the pipeline and schedule up to 20 jobs in parallel. The snakemake profile, `snakemake_config/slurm`, can be modified to fine-tune resource usage. The above command targets the SLURM job scheduler; you may need to change the parameters for your system. Specifically, edit `snakemake_config/slurm/config.yaml` and `snakemake_config/slurm/cluster_config.yml`.
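As a starting point, review what the shipped profile contains; settings in such profiles typically cover the job submission command, partition, memory, and walltime, though the exact keys here are repository-specific:

```bash
# Inspect the shipped SLURM profile before editing.
cat snakemake_config/slurm/config.yaml
cat snakemake_config/slurm/cluster_config.yml
```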