VStrains: De Novo Reconstruction of Viral Strains via Iterative Path Extraction From Assembly Graphs
- About VStrains
- Updates
- Installation
3.1. Option 1. Quick Install
3.2. Option 2. Manual Install
3.3. Download & Install VStrains
- Running VStrains
4.1. Quick Usage
4.2. Support SPAdes
4.3. Output
- Stand-alone binaries
- Experiment
- Citation
- Feedback and bug reports
VStrains is a de novo approach for reconstructing strains from viral quasispecies.
- Replaced the PE link inference module VStrains_Alignment.py with VStrains_PE_Inference.py. VStrains_PE_Inference.py implements a hash-table approach that provides efficient perfect-match lookup. The new module leads to consistent evaluation results and substantially decreases runtime and memory usage compared with the previous alignment approach.
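The hash-table idea mentioned above can be illustrated with a minimal sketch. This is not the code in VStrains_PE_Inference.py; the function names, the toy sequences, and the choice of k are assumptions made for illustration, and reverse complements are ignored to keep the example short. Every k-mer of every node sequence is stored in a dictionary, so each read k-mer is resolved by one exact lookup instead of an alignment.

```python
from collections import defaultdict


def build_kmer_index(node_seqs, k):
    """Map every k-mer to the set of node IDs whose sequence contains it."""
    index = defaultdict(set)
    for node_id, seq in node_seqs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(node_id)
    return index


def lookup_nodes(read, index, k):
    """Return node IDs sharing at least one exact k-mer with the read."""
    hits = set()
    for i in range(len(read) - k + 1):
        hits |= index.get(read[i:i + k], set())
    return hits


def count_pe_links(read_pairs, index, k):
    """Count read pairs whose two mates hit two different nodes."""
    links = defaultdict(int)
    for fwd, rev in read_pairs:
        for u in lookup_nodes(fwd, index, k):
            for v in lookup_nodes(rev, index, k):
                if u != v:
                    links[tuple(sorted((u, v)))] += 1
    return dict(links)


# Toy data (hypothetical node names and sequences).
node_seqs = {"n1": "ACGTACGTGG", "n2": "TTGACCATGA"}
index = build_kmer_index(node_seqs, k=5)
pairs = [("ACGTACG", "GACCATG"), ("CGTGG", "TTGAC"), ("ACGTA", "GTACG")]
pe_links = count_pe_links(pairs, index, k=5)
print(pe_links)
```

In this toy run the first two read pairs connect n1 and n2, while the third pair falls entirely inside n1 and contributes no link.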
VStrains requires a 64-bit Linux system or macOS and Python (supported versions are Python 3: 3.2 and higher).
Install (mini)conda as a lightweight package management tool. Run the following commands to initialize and set up the conda environment for VStrains:
# add channels
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
# create conda environment
conda create --name VStrains-env
# activate conda environment
conda activate VStrains-env
conda install -c bioconda -c conda-forge python=3 graph-tool minimap2 numpy gfapy matplotlib
Manually install dependencies:
And python modules:
After setting up the environment and dependencies, clone VStrains into your desired location.
git clone https://github.com/metagentools/VStrains.git
Install VStrains via pip:
cd VStrains; pip install .
Run the following command to ensure VStrains is correctly set up and installed.
vstrains -h
VStrains supports assembly results from SPAdes (including metaSPAdes and metaviralSPAdes) and may support other graph-based assemblers in the future.
usage: VStrains [-h] -a {spades} -g GFA_FILE [-p PATH_FILE] [-o OUTPUT_DIR] -fwd FWD -rve RVE
Construct full-length viral strains under de novo approach from contigs and assembly graph, currently supports
SPAdes
optional arguments:
-h, --help show this help message and exit
-a {spades}, --assembler {spades}
name of the assembler used. [spades]
-g GFA_FILE, --graph GFA_FILE
path to the assembly graph, (.gfa format)
-p PATH_FILE, --path PATH_FILE
contig file from SPAdes (.paths format), only required for SPAdes. e.g., contigs.paths
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
path to the output directory [default: acc/]
-fwd FWD, --fwd_file FWD
paired-end sequencing reads, forward strand (.fastq format)
-rve RVE, --rve_file RVE
paired-end sequencing reads, reverse strand (.fastq format)
VStrains takes as input an assembly graph in Graphical Fragment Assembly (GFA) Format and associated contig information, together with the raw reads in paired-end format (e.g., forward.fastq, reverse.fastq).
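To make the input format concrete, the sketch below reads the two GFA 1.0 record types that describe an assembly graph: S lines (segment name and sequence) and L lines (links between oriented segments). The function name parse_gfa and the toy records are assumptions for illustration; this is not how VStrains itself loads the graph, and optional tags such as coverage are ignored here.

```python
def parse_gfa(lines):
    """Collect segments (S lines) and links (L lines) from GFA 1.0 text."""
    segments = {}  # segment name -> sequence
    links = []     # (from, from_orient, to, to_orient, overlap)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            segments[fields[1]] = fields[2]
        elif fields[0] == "L" and len(fields) >= 6:
            links.append(tuple(fields[1:6]))
    return segments, links


# Toy GFA content (hypothetical segments).
gfa_text = [
    "H\tVN:Z:1.0\n",
    "S\t1\tACGTACGT\tKC:i:80\n",
    "S\t2\tCGTTTGA\tKC:i:35\n",
    "L\t1\t+\t2\t-\t3M\n",
]
segments, links = parse_gfa(gfa_text)
print(len(segments), "segments,", len(links), "links")
```

The same function accepts an open file handle, for example `with open("assembly_graph_after_simplification.gfa") as fh: parse_gfa(fh)`.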
When running SPAdes, we recommend using the --careful option for more accurate assembly results. For consistency, do not modify any contig/node names in the SPAdes assembly results. Please refer to SPAdes for further guidelines. Example usage is shown below:
# SPAdes assembler example, paired-end reads
python spades.py -1 forward.fastq -2 reverse.fastq --careful -t 16 -o output_dir
Both the assembly graph (assembly_graph_after_simplification.gfa) and the contig information (contigs.paths) can be found in the output directory after running the SPAdes assembler. Please use them together with the raw reads as inputs for VStrains, and set the -a flag to spades. Example usage is shown below:
vstrains -a spades -g assembly_graph_after_simplification.gfa -p contigs.paths -o output_dir -fwd forward.fastq -rve reverse.fastq
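When VStrains is run for many samples, it can help to assemble the command above programmatically. The following is a small sketch that only uses the flags shown in the usage text; the helper name build_vstrains_command is hypothetical, and shlex.join requires Python 3.8 or later.

```python
import shlex


def build_vstrains_command(gfa, paths, fwd, rev, output_dir):
    """Return the vstrains argument list for a SPAdes assembly."""
    return [
        "vstrains", "-a", "spades",
        "-g", gfa,
        "-p", paths,
        "-o", output_dir,
        "-fwd", fwd,
        "-rve", rev,
    ]


cmd = build_vstrains_command(
    gfa="assembly_graph_after_simplification.gfa",
    paths="contigs.paths",
    fwd="forward.fastq",
    rev="reverse.fastq",
    output_dir="output_dir",
)
# Print the command for inspection; to execute it, pass the list to
# subprocess.run(cmd, check=True) once the input files are in place.
print(shlex.join(cmd))
```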
VStrains stores all output files in <output_dir>, which is set by the user.
- <output_dir>/aln/ directory contains paired-end (PE) linkage information, which is stored in pe_info and st_info.
- <output_dir>/gfa/ directory contains iteratively simplified assembly graphs, where graph_L0.gfa contains the assembly graph produced by SPAdes after Strandedness Canonization, split_graph_final.gfa contains the assembly graph after Graph Disentanglement, and graph_S_final.gfa contains the assembly graph after Contig-based Path Extraction. The rest are intermediate results. All the assembly graphs are in GFA 1.0 format.
- <output_dir>/paf/ and <output_dir>/tmp/ are temporary directories and can be ignored.
- <output_dir>/strain.fasta contains the resulting strains in .fasta format. The header for each strain has the form NODE_<strain name>_<sequence length>_<coverage>, which is compatible with the SPAdes contigs format.
- <output_dir>/strain.paths contains the paths in the assembly graph (input GFA_FILE) corresponding to strain.fasta, which can be used with Bandage for further downstream analysis.
- <output_dir>/vstrains.log contains the VStrains log.
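For downstream scripts, the strain headers described above can be split back into their fields. The sketch below assumes that the sequence length is an integer and the coverage is a plain number, and it splits from the right so that a strain name containing underscores is kept intact. The function names and the example records are hypothetical.

```python
def parse_strain_header(header):
    """Split NODE_<strain name>_<sequence length>_<coverage> into fields."""
    text = header.lstrip(">").strip()
    prefix, _, rest = text.partition("_")
    if prefix != "NODE" or not rest:
        raise ValueError(f"unexpected header: {header!r}")
    # Split from the right: the last two fields are length and coverage.
    name, length, coverage = rest.rsplit("_", 2)
    return name, int(length), float(coverage)


def strain_summaries(fasta_lines):
    """Parse every header line (starting with '>') of a FASTA file."""
    return [parse_strain_header(l) for l in fasta_lines if l.startswith(">")]


# Toy FASTA content (hypothetical strains).
fasta_lines = [
    ">NODE_A1_8_1520.5\n",
    "ACGTACGT\n",
    ">NODE_B_2_6_310.0\n",
    "TTGACC\n",
]
summaries = strain_summaries(fasta_lines)
for name, length, coverage in summaries:
    print(name, length, coverage)
```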
evals/quast_evaluation.py is a wrapper script for strain-level experimental result analysis using MetaQUAST.
usage: quast_evaluation.py [-h] -quast QUAST [-cs FILES [FILES ...]] [-d IDIR] -ref REF_FILE -o OUTPUT_DIR
Use MetaQUAST to evaluate assembly result
options:
-h, --help show this help message and exit
-quast QUAST, --path_to_quast QUAST
path to MetaQuast python script, version >= 5.2.0
-cs FILES [FILES ...], --contig_files FILES [FILES ...]
contig files from different tools, separated by space
-d IDIR, --contig_dir IDIR
contig files from different tools, stored in the directory, .fasta format
-ref REF_FILE, --ref_file REF_FILE
ref file (single)
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
output directory
VStrains is evaluated on both simulated and real datasets under default settings, and the source of the datasets can be found in the links listed below:
- Simulated Dataset, can be found at savage-benchmark (no preprocessing is required)
  - 6 Poliovirus (20,000x)
  - 10 HCV (20,000x)
  - 15 ZIKV (20,000x)
- Real Dataset (please refer to Supplementary Material for preprocessing the real datasets)
  - 5 HIV labmix (20,000x) SRR961514, reference genome sequences are available at 5 HIV References
  - 2 SARS-COV-2 (4,000x) SRR18009684, SRR18009686, pre-processed reads and individually assembled ground-truth reference sequences can be found at 2 SARS-COV-2 Dataset
VStrains has been accepted at RECOMB 2023 and the manuscript is publicly available here.
If you use VStrains in your work, please cite the following publication.
Runpeng Luo and Yu Lin, VStrains: De Novo Reconstruction of Viral Strains via Iterative Path Extraction From Assembly Graphs
Thanks for using VStrains. If you experience any bugs during execution, please re-run the program with the additional -d flag and provide vstrains.log together with your use cases via Issues.