Continuous Integration and Evaluation for Variant Detection (CIEVaD). This repository provides a tool suite for simple, streamlined and rapid creation and evaluation of genomic variant callsets. It is primarily designed for continuous integration of variant detection software and for plain containment checks between sets of variants. The tool suite uses the conda package management system and the nextflow workflow language.
This tool suite was developed for Linux, which is the only officially supported operating system. Having any derivative of the conda package management system installed is the only strict system requirement. A recent version (≥20.04.0) of nextflow is required to execute the workflows, but it can easily be installed via conda. For instructions on installing nextflow via conda, see Installation.
🖥️ See list of tested setups:
Requirement | Tested with |
---|---|
64-bit Linux operating system | Ubuntu 20.04.5 LTS |
Conda | vers. 23.5.0, 24.1.2 |
Nextflow | vers. 20.04.0, 23.10.1 |
- Download the repository:
git clone https://github.com/rki-mf1/cievad.git
- [Optional] Install nextflow if it is not yet on your system. As good practice, use a new conda environment:
conda deactivate
conda create -n cievad -c bioconda nextflow
conda activate cievad
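Once the environment is active, it can be worth confirming that the installed nextflow meets the minimum version stated above (≥20.04.0). The following is a minimal sketch and not part of CIEVaD: `version_at_least` is a hypothetical helper that relies on GNU `sort -V`, and the installed version is a placeholder that is assumed to be copied by hand from nextflow's own version output.

```shell
# Hypothetical helper (not part of CIEVaD): succeeds if version $1 >= version $2.
# Assumes GNU coreutils, whose `sort -V` orders version strings numerically.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="20.04.0"    # minimum nextflow version required by the workflows
installed="23.10.1"   # placeholder: replace with the version on your system

if version_at_least "$installed" "$required"; then
  echo "nextflow $installed is recent enough (>= $required)"
else
  echo "nextflow $installed is too old (need >= $required)"
fi
```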
This tool suite provides multiple functional features to generate synthetic sequencing data, generate sets of ground truth variants (truthsets) and evaluate sets of predicted variants (callsets).
There are two main workflows, hap.nf and eval.nf.
Both workflows are executed via the nextflow command line interface (CLI).
⚠️ Run commands from the root directory:
Run the commands from a terminal in the top-level folder (root directory) of this repository.
Otherwise relative paths within the workflows might be invalid.
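A quick guard against starting a workflow from the wrong folder could look like the sketch below. It assumes that both workflow scripts, hap.nf and eval.nf, sit directly in the repository root; `is_repo_root` is a hypothetical helper and not part of CIEVaD.

```shell
# Hypothetical helper (not part of CIEVaD): succeeds if directory $1
# contains both workflow scripts (assumed to live in the repository root).
is_repo_root() {
  [ -f "$1/hap.nf" ] && [ -f "$1/eval.nf" ]
}

if is_repo_root "."; then
  echo "OK: running from the repository root"
else
  echo "Warning: hap.nf and eval.nf not found here; change to the repository root first" >&2
fi
```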
The minimal command to generate haplotype data is
nextflow run hap.nf -profile local,conda
This generates the following data within the <project_root>/results/ directory:
- a haplotype (FASTA), which is a copy of the provided reference sequence but deviates by a set of synthetic genomic variants
- the variant set (VCF) of synthetic genomic variants in the haplotype
- a set of reads (FASTQ) representing a sequencing experiment from the haplotype
The minimal command to evaluate the accordance between a truthset (generated data) and a callset is
nextflow run eval.nf -profile local,conda --callsets_dir <path/to/callsets>
where --callsets_dir is the parameter to specify a folder containing the callset VCF files.
Currently, a callset within this folder has to follow the naming convention callset_<X>.vcf[.gz], where <X> is the integer identifying the corresponding truthset.
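The naming convention can be checked before starting an evaluation. The sketch below is an illustration only: `is_valid_callset_name` is a hypothetical helper, not part of CIEVaD, and the file names in the loop are made-up examples. It assumes that <X> consists of one or more decimal digits.

```shell
# Hypothetical helper (not part of CIEVaD): succeeds if the file name $1
# matches callset_<X>.vcf or callset_<X>.vcf.gz, with <X> made of digits only.
is_valid_callset_name() {
  printf '%s\n' "$1" | grep -Eq '^callset_[0-9]+\.vcf(\.gz)?$'
}

# Made-up example names: two valid, two invalid.
for name in callset_1.vcf callset_2.vcf.gz sample_3.vcf callset_x.vcf; do
  if is_valid_callset_name "$name"; then
    echo "ok:      $name"
  else
    echo "invalid: $name"
  fi
done
```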
Alternatively, one can provide a sample sheet of comma-separated values (CSV file) with the columns "index", "truthset" and "callset", where "index" is an integer from 1 to n (number of samples) and "callset"/"truthset" are paths to the pairwise matching VCF files.
Callsets can optionally be gzip-compressed.
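For illustration, a sample sheet for three samples could be written as in the following sketch. It assumes a header row that names the three columns in the order given above. All VCF paths are placeholders rather than files shipped with CIEVaD, and the file name sample_sheet.csv is an arbitrary choice.

```shell
# Assumptions: a header row with the three column names in this order;
# all VCF paths below are placeholders, not files shipped with CIEVaD.
sheet="sample_sheet.csv"
n=3   # number of samples

printf '%s\n' "index,truthset,callset" > "$sheet"
i=1
while [ "$i" -le "$n" ]; do
  # One row per sample: index, path to truthset VCF, path to callset VCF.
  printf '%d,%s,%s\n' "$i" "path/to/truthset_${i}.vcf" "path/to/callset_${i}.vcf.gz" >> "$sheet"
  i=$((i + 1))
done

cat "$sheet"
```

Each row pairs one truthset with its matching callset, so the "index" column simply counts the rows from 1 to n.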
The command for the sample sheet input is
nextflow run eval.nf -profile local,conda --sample_sheet <path/to/sample_sheet>
This generates the following data within the <project_root>/results/ directory:
- a report (CSV, JSON) about accordance between the synthetic variant set and a given corresponding callset
- a report (CSV) with statistics across all tested individuals
CIEVaD exposes the vast majority of the internal software tools' parameters for fine-tuning.
The parameters to adjust the workflows are listed on their respective help pages.
To inspect the help pages, type --help after the script name, e.g. nextflow run hap.nf --help for the hap.nf workflow.
Parameters can be adjusted via the CLI or directly within the nextflow.config file.
Note that parameters provided via the CLI override those set in the config file.
More information about tuning crucial parameters, e.g. read quality and genome coverage, can be found in the Wiki.
Visit the project wiki for more detailed information on parameters, help and FAQs.
Please file issues, bug reports and questions in the issues section.
A manuscript describing CIEVaD is available. If you use CIEVaD, please cite:
@article{krannich2024cievad,
title={CIEVaD: A Lightweight Workflow Collection for the Rapid and On-Demand Deployment of End-to-End Testing for Genomic Variant Detection},
author={Krannich, Thomas and Ternovoj, Dmitrii and Paraskevopoulou, Sofia and Fuchs, Stephan},
journal={Viruses},
volume={16},
number={9},
pages={1444},
year={2024},
doi={10.3390/v16091444}
}