MaxGreil/rnaseq

rnaseq

Proof of concept of an RNA-Seq pipeline from reads to count matrix (including quality control) with Nextflow, plus an example RNA-Seq analysis in R.

Prerequisites

  • Unix-like OS (Linux, macOS, etc.)
  • Java version 8
  • Docker engine 1.10.x (or later)
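Whether the prerequisites are available can be checked before installing anything else. The following is a minimal sketch, not part of the repository: the helper name missing_tools is hypothetical, and it only tests that the executables are on PATH, not that their versions are sufficient.

```shell
# Print the name of every given tool that is NOT found on PATH.
# missing_tools is a hypothetical helper, not part of this repository.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the two prerequisites listed above.
missing=$(missing_tools java docker)
if [ -n "$missing" ]; then
    echo "Missing prerequisites: $missing" >&2
fi
```

The installed versions can then be inspected by hand with java -version and docker --version.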

Necessary files

  • Reads to be mapped must be stored as gzip-compressed .fastq.gz files in the folder data

Additional necessary files

If the reads to be analyzed originate from a human RNA-Seq experiment, the following three additional files must be stored in the folder data:

  • Prebuilt HISAT2 index for H. sapiens, release GRCh38
https://genome-idx.s3.amazonaws.com/hisat/grch38_snptran.tar.gz
  • Gencode GTF file, release 38 (GRCh38.p13)
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.chr_patch_hapl_scaff.annotation.gtf.gz
  • UCSC BED file, assembly GRCh38/hg38, track GENCODE V38
http://genome.ucsc.edu/cgi-bin/hgTables

The BED file must be stored gzip-compressed under a file name matching *.annotation.bed.gz.

To analyze another species, download the corresponding files for that organism instead.
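The file layout described above can be verified before starting a run. The sketch below is an illustration only: the helper names count_matches and check_data_dir are hypothetical, and it checks just the two file name patterns stated in this README (*.fastq.gz and *.annotation.bed.gz). How the pipeline locates the HISAT2 index and the GTF file is not covered here.

```shell
# Count the files in a directory whose names match a glob pattern.
# Both helpers are hypothetical and not part of this repository.
count_matches() {
    dir=$1
    pattern=$2
    n=0
    for f in "$dir"/$pattern; do
        [ -e "$f" ] && n=$((n + 1))
    done
    echo "$n"
}

# Return non-zero if reads or the BED annotation are missing from the folder.
check_data_dir() {
    dir=${1:-data}
    status=0
    for pattern in '*.fastq.gz' '*.annotation.bed.gz'; do
        if [ "$(count_matches "$dir" "$pattern")" -eq 0 ]; then
            echo "no file matching $pattern in $dir" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage: check_data_dir data
```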

Quick start

Because this pipeline uses HISAT2 to align reads, it is suitable for short reads only!

Example run:

nextflow run main.nf

The example above uses the default value of the parameter params.reads, which is intended for single-end reads:

nextflow run main.nf --reads "data/*.fastq.gz"

For paired-end reads, the parameter params.singleEnd in nextflow.config must additionally be set to false. The command then becomes:

nextflow run main.nf --reads "data/*_{1,2}*.fastq.gz"
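The pattern is quoted so that the shell passes it to Nextflow unexpanded. Before a paired-end run it can help to confirm that every forward read file has a mate. The following sketch assumes files are named <sample>_1.fastq.gz and <sample>_2.fastq.gz, which is only one of the layouts the pattern above would match; the helper name missing_mates is hypothetical.

```shell
# Print every *_1.fastq.gz file in a directory that has no *_2.fastq.gz mate.
# Assumes the naming scheme <sample>_1.fastq.gz / <sample>_2.fastq.gz.
missing_mates() {
    dir=$1
    for r1 in "$dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue            # glob matched nothing
        r2="${r1%_1.fastq.gz}_2.fastq.gz"   # derive the expected mate name
        [ -e "$r2" ] || printf '%s\n' "$r1"
    done
}

# Usage: missing_mates data
```

An empty result means every forward file found has a matching reverse file.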

Optionally, you can specify the Nextflow output directory with the flag --outdir <folder>. By default, all resulting files are saved in the folder output, and the folder info contains information about the most recent Nextflow run.

Installation

Clone this repository with the following command:

git clone https://github.com/maxgreil/rnaseq && cd rnaseq

Then, install Nextflow by using the following command:

curl https://get.nextflow.io | bash

The above snippet creates the nextflow launcher in the current directory.

Finally, pull the following Docker image:

docker pull maxgreil/rnaseq

Alternatively, you can build the Docker image yourself using the following command:

cd docker && docker image build . -t maxgreil/rnaseq

Arguments

Optional Arguments

Argument    Usage       Description
--reads     <files>     Directory and glob pattern of input files
--outdir    <folder>    Directory to save output files
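Both arguments can be combined in one call. The wrapper below is a hypothetical convenience sketch, not part of the repository: run_pipeline and the DRY_RUN variable are invented for illustration, and the defaults simply mirror the ones described in this README (data/*.fastq.gz and output).

```shell
# Launch the pipeline with explicit --reads and --outdir values.
# run_pipeline and DRY_RUN are hypothetical names used for illustration.
run_pipeline() {
    reads=${1:-"data/*.fastq.gz"}
    outdir=${2:-output}
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Only print the command that would be executed.
        printf 'nextflow run main.nf --reads "%s" --outdir %s\n' "$reads" "$outdir"
    else
        # The pattern stays quoted so that Nextflow, not the shell, expands it.
        nextflow run main.nf --reads "$reads" --outdir "$outdir"
    fi
}

# Usage: DRY_RUN=1 run_pipeline "data/*.fastq.gz" results
```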

Documentation

This pipeline is designed to:

  • map given reads to a genome
  • create a count matrix of mapped reads for subsequent RNA-Seq analysis in R
  • perform quality control on the created files

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

  1. hisat2 - map given reads to genome
  2. samtools - create sorted BAM files from HISAT2 SAM files
  3. picard - mark duplicates in sorted BAM files
  4. featureCounts - count mapped reads to genomic features (exons)
  5. deeptools - create bigWig files from BAM files for viewing in IGV
  6. preseq - predict and estimate the complexity of the sequencing library
  7. rseqc - comprehensive evaluation of the RNA-Seq data
  8. FastQC - BAM file quality control
  9. MultiQC - aggregate report, describing results of the whole pipeline
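To give a feel for what each step does, the sketch below prints one plausible command per step for a single-end sample. It is an illustration under stated assumptions, not the pipeline's actual process definitions: the helper name print_steps, all file names, the chosen flags, the choice of infer_experiment.py as the RSeQC module, and the extra samtools index call are assumptions. The commands are printed rather than executed.

```shell
# Print (not run) one plausible command per pipeline step for one sample.
# print_steps, the file names and the flags are assumptions for illustration;
# the real pipeline may use different options.
print_steps() {
    sample=$1   # sample name without extension (hypothetical, e.g. "s1")
    index=$2    # HISAT2 index base name
    gtf=$3      # annotation in GTF format
    bed=$4      # annotation in BED format
    cat <<EOF
hisat2 -x ${index} -U data/${sample}.fastq.gz -S ${sample}.sam
samtools sort -o ${sample}.sorted.bam ${sample}.sam
picard MarkDuplicates I=${sample}.sorted.bam O=${sample}.dedup.bam M=${sample}.metrics.txt
samtools index ${sample}.dedup.bam
featureCounts -t exon -g gene_id -a ${gtf} -o counts.txt ${sample}.dedup.bam
bamCoverage -b ${sample}.dedup.bam -o ${sample}.bw
preseq lc_extrap -B -o ${sample}.lc_extrap.txt ${sample}.dedup.bam
infer_experiment.py -i ${sample}.dedup.bam -r ${bed}
fastqc ${sample}.dedup.bam
multiqc .
EOF
}

# Usage: print_steps s1 path/to/index annotation.gtf annotation.bed
```

In the actual pipeline these tools run inside the Docker image under Nextflow's control, so none of them need to be installed on the host.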
