
DNA Coverage Analysis

Snakemake pipelines for preprocessing, mapping, and coverage charts of bacterial DNA-Seq data

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage

About The Project

These pipelines visualize the coverage of DNA-Seq data on one or multiple reference genomes. A pipeline consists of the following steps:

  1. Quality control of the raw data with FastQC
  2. Preprocessing with fastp
  3. Quality control of the preprocessed data with FastQC
  4. rRNA filtering with SortMeRNA
  5. For each reference:
    1. Mapping with bowtie2
    2. Feature counting with featureCounts
    3. Coverage plots with bedtools and R-Sushi

Getting Started

Prerequisites

The only requirements are a working conda/mamba installation and Snakemake version 8 or newer.

  • mamba

  • snakemake

    mamba create -c conda-forge -c bioconda -n snakemake snakemake

Installation

git clone https://github.com/pblumenkamp/dna_coverage_analysis.git

Required files

  • DNA-Seq data in gzipped FASTQ format
  • One or multiple reference genomes in (gzipped or uncompressed) FASTA format
  • Reference Annotation for each genome in uncompressed GFF3 format

Usage

  1. Use the pipeline in paired_end for paired-end data and the pipeline in single_end for single-end data.

    # e.g.
    cd paired_end

  2. Adjust the settings in config.yaml. The most important settings are the input directory and the references to use.

  3. Start the snakemake pipeline locally or on a compute cluster.

    # Local
    snakemake --configfile config.yaml --use-conda --resources mem_mb=<max_ram_usage_in_mb>
    # Compute cluster 
    snakemake --configfile config.yaml --use-conda --profile <path_to_your_cluster_profile>/cluster_profile

Config.yaml

The config.yaml currently has four sections.

fastq_input_dir

This defines the directory where the DNA-Seq data is stored. As a naming convention, all single-end DNA-Seq files must end with fastq.gz, and all paired-end files must end with _R1.fastq.gz and _R2.fastq.gz.
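The naming convention for paired-end data can be checked before a run with a few lines of shell. This is a minimal sketch: the function name `find_unpaired` is hypothetical and not part of the pipeline, and it only reports `_R1.fastq.gz` files that have no `_R2.fastq.gz` mate in the same directory.

```shell
# Hypothetical helper, not part of the pipeline: print every *_R1.fastq.gz
# file in a directory that has no matching *_R2.fastq.gz mate.
find_unpaired() {
    dir="$1"
    for r1 in "$dir"/*_R1.fastq.gz; do
        # Skip the unexpanded pattern when no R1 file exists at all.
        [ -e "$r1" ] || continue
        # Derive the expected mate name by swapping the suffix.
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        [ -e "$r2" ] || echo "missing mate for: $r1"
    done
}
```

Call it with the directory set as fastq_input_dir, e.g. `find_unpaired /path/to/fastq_input_dir`; no output means every R1 file has a mate.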

coverage_resolution

Defines the resolution in base pairs (bp) for each bar in the final coverage bar plots. Multiple resolutions can be given as a comma-separated list; a separate folder of coverage plots is created for each resolution.
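To get a feeling for a sensible resolution, the number of bars per plot can be estimated from the genome length. The sketch below assumes that each bar covers exactly one window of the given size, which is an assumption about the plotting step; `bars_for` is a hypothetical helper and the genome length is only an example value.

```shell
# Hypothetical helper: number of bars needed to cover a genome when each
# bar spans `resolution` bp (integer ceiling division).
bars_for() {
    genome_len="$1"
    resolution="$2"
    echo $(( (genome_len + resolution - 1) / resolution ))
}

bars_for 4600000 1000    # prints 4600
bars_for 4600000 30000   # prints 154
```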

references

A list of all reference genomes for the coverage analysis. Each reference is analyzed separately. genome must be the path to the reference genome in gzipped or uncompressed FASTA format. annotation is the path to the reference annotation in uncompressed GFF3 format. gff_features is a list of GFF feature types, each of which is counted in a separate count table. Please verify that every listed feature type actually occurs in the GFF3 file.
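One way to verify the feature types is to list what the third column of the annotation contains. The following is a sketch using standard awk and sort; `list_gff_features` is a hypothetical name, and the file path is whatever you set as annotation.

```shell
# Hypothetical helper: count how often each feature type (GFF3 column 3)
# occurs in an uncompressed annotation file.
list_gff_features() {
    # Ignore comment/directive lines and anything that is not a
    # 9-column feature line (e.g. an embedded FASTA section).
    awk -F'\t' '!/^#/ && NF == 9 { n[$3]++ }
                END { for (t in n) print t "\t" n[t] }' "$1" | sort
}
```

Running `list_gff_features annotation.gff` prints one line per feature type with its count; every entry in gff_features should appear in that output.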

memory_usage_in_mb

List of pipeline steps with data-dependent memory usage. Please adjust these numbers if you use Snakemake on a compute cluster with memory limits and run into out-of-memory errors. These settings are also respected locally when the option --resources mem_mb=<max_ram_usage_in_mb> is set.

License

Distributed under the MIT License. See LICENSE.txt for more information.