Snakemake pipelines for preprocessing, mapping, and coverage charts of bacterial DNA-Seq data
These pipelines visualize the coverage of DNA-Seq data on one or multiple reference genomes. A pipeline consists of the following steps:
- Quality control of the raw data with FastQC
- Preprocessing with fastp
- Quality control of the preprocessed data with FastQC
- rRNA filtering with SortMeRNA
- For each reference:
  - Mapping with bowtie2
  - Feature counting with featureCounts
  - Coverage plots with bedtools and R-Sushi
The only requirements are a working conda or mamba installation and Snakemake version 8 or newer.
git clone https://github.com/pblumenkamp/dna_coverage_analysis.git
- DNA-Seq data in gzipped FASTQ format
- One or multiple reference genomes in (gzipped or uncompressed) FASTA format
- A reference annotation for each genome in uncompressed GFF3 format
- Use the pipeline in paired_end for paired-end data and the pipeline in single_end for single-end data.
  # e.g.
  cd paired_end
- Change the settings in config.yaml. The most important settings are the input directory and the references to use.
- Start the Snakemake pipeline locally or on a compute cluster.
  # Local
  snakemake --configfile config.yaml --use-conda --resources mem_mb=<max_ram_usage_in_mb>
  # Compute cluster
  snakemake --configfile config.yaml --use-conda --profile <path_to_your_cluster_profile>/cluster_profile
The config.yaml currently has four parts.
This defines the directory where the DNA-Seq data is stored. As a naming convention, all single-end DNA-Seq files must end with fastq.gz, and all paired-end files must end with _R1.fastq.gz and _R2.fastq.gz.
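The paired-end naming convention can be checked before starting a run. The following is a minimal sketch, not part of the pipeline: check_pairs is a hypothetical helper, and it assumes all FASTQ files sit directly in the input directory. It only looks for a missing _R2 mate for each _R1 file, not the reverse.

```shell
# Hypothetical helper: report every *_R1.fastq.gz without a matching *_R2.fastq.gz.
# Assumes the files are stored directly in the given directory (no subfolders).
check_pairs() {
    dir=$1
    status=0
    for r1 in "$dir"/*_R1.fastq.gz; do
        # If nothing matches, the glob stays literal; skip that case.
        [ -e "$r1" ] || continue
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "missing mate for $r1" >&2
            status=1
        fi
    done
    return $status
}

# Example call with a placeholder path:
# check_pairs /path/to/input_directory
```

The function returns 0 when every _R1 file has its mate and 1 otherwise, so it can be chained in front of the snakemake call.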
Defines the resolution in base pairs (bp) for each bar in the final coverage bar plots. Multiple resolutions can be given as a comma-separated list; the plots for each resolution are then written to a separate folder.
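To illustrate what a resolution means in practice, the sketch below groups per-base depth into windows of a fixed width. It rests on two assumptions: the input has the three-column layout written by bedtools genomecov -d (sequence name, 1-based position, depth), and each bar shows the mean depth of its window. The pipeline's own aggregation may differ, and bin_coverage is a hypothetical helper name.

```shell
# Hypothetical helper: average per-base depth in windows of <resolution> bp.
# stdin:  sequence<TAB>position (1-based)<TAB>depth
# stdout: sequence<TAB>window start<TAB>mean depth
bin_coverage() {
    awk -v res="$1" -F'\t' '
        {
            bin = int(($2 - 1) / res)        # 0-based window index
            key = $1 "\t" bin
            if (!(key in sum)) order[++n] = key   # remember first-seen order
            sum[key] += $3
            cnt[key]++
        }
        END {
            for (i = 1; i <= n; i++) {
                split(order[i], p, "\t")
                printf "%s\t%d\t%.2f\n", p[1], p[2] * res + 1, sum[order[i]] / cnt[order[i]]
            }
        }'
}

# Example call with placeholder file names:
# bin_coverage 1000 < per_base_depth.tsv > binned_depth.tsv
```

A smaller resolution gives more, narrower bars and a noisier plot; a larger one smooths the profile but hides short gaps in coverage.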
A list of all reference genomes for the coverage analysis. Each reference is analyzed separately. genome must be the path to the reference genome in (gzipped or uncompressed) FASTA format. annotation is the path to the reference annotation in uncompressed GFF3 format. gff_features is a list of GFF feature types, each of which is counted in a separate count table. Please verify that every listed feature type actually occurs in the GFF3 file.
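One way to check which feature types an annotation contains is to tabulate the third column of the GFF3 file. This is a minimal sketch with a hypothetical helper name, list_gff_features; it skips comment and directive lines and ignores any embedded FASTA section, because those lines do not have nine tab-separated columns.

```shell
# Hypothetical helper: print each feature type (GFF3 column 3) with its count.
list_gff_features() {
    awk -F'\t' '
        !/^#/ && NF >= 9 { n[$3]++ }                       # feature lines only
        END { for (t in n) printf "%s\t%d\n", t, n[t] }
    ' "$1" | sort
}

# Example call with a placeholder path:
# list_gff_features /path/to/annotation.gff3
```

Every entry in gff_features should appear in the first column of this output, with the same spelling and capitalization.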
List of pipeline steps with data-dependent memory usage. Adjust these numbers if you run Snakemake on a compute cluster with memory limits and run into out-of-memory errors. These settings can also be used locally with the option --resources mem_mb=<max_ram_usage_in_mb>.
Distributed under the MIT License. See LICENSE.txt for more information.