HAplotype and PHylodynamics pipeline for viral assembly, population genetics, and phylodynamics.
Our full User Guide is available here.
1. Create a conda environment with HAPHPIPE
conda create -n haphpipe haphpipe
2. Activate the environment
conda activate haphpipe
3. Install GATK
Due to license restrictions, bioconda cannot distribute and install GATK directly. To fully install GATK, you must download a licensed copy of GATK (version 3.8-0) from the Broad Institute: https://software.broadinstitute.org/gatk/download/archive.
Register the package using gatk3-register:
gatk3-register /path/to/GenomeAnalysisTK-3.8-0-ge9d806836.tar.bz2
This will copy GATK into your conda environment.
NOTE: HAPHPIPE was developed and tested using GATK 3.8.
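Before moving on, it can help to confirm that the commands installed in the steps above are visible in the active environment. The following is a minimal sketch, not part of HAPHPIPE itself; the helper name have_tool is hypothetical, and the check only looks for the executables on PATH without verifying their versions.

```shell
# Hypothetical helper: succeed (status 0) when a command is found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report the commands referred to in the installation and demo steps.
for tool in haphpipe hp_demo gatk3-register; do
    if have_tool "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done
```

If any command is reported as missing, check that the haphpipe conda environment is activated.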
After successful installation, the demo dataset can be run to ensure HAPHPIPE is installed and set up correctly.
Running the demo is simple and requires a single command:
hp_demo
A specific output directory can be set with:
hp_demo --outdir $outdir_name
The output of the entire demo is as such
If you do not want to run the entire demo, the following command pulls only the references included in HAPHPIPE into the specified directory (the default is the current directory, .):
hp_demo --refonly
The terminal output is shown below; the three HIV reference files are placed in the subdirectory refs. See the User Guide for more information regarding these reference files.
/base/directory/path/of/haphpipe)
Demo was run with --refonly. References are now in outdirectory: $outdir_name/haphpipe_demo/refs.
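The reference-only run can be wrapped in a small script that confirms the reference files arrived. This is a sketch under assumptions: the helper demo_refs_dir and the directory name demo_out are hypothetical, the refs path is taken from the terminal message above, and combining --outdir with --refonly is inferred from that message rather than confirmed.

```shell
# Hypothetical helper: path where the demo reports placing the reference files.
demo_refs_dir() {
    printf '%s/haphpipe_demo/refs\n' "${1:-.}"
}

outdir_name="demo_out"   # hypothetical output directory name

# Pull only the references when hp_demo is installed.
if command -v hp_demo >/dev/null 2>&1; then
    hp_demo --outdir "$outdir_name" --refonly
fi

# List the reference files, or say where they were expected.
refs="$(demo_refs_dir "$outdir_name")"
if [ -d "$refs" ]; then
    ls "$refs"
else
    echo "references not found in $refs"
fi
```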
This pipeline implements amplicon assembly using a de novo approach. Reads are error-corrected and used to refine the initial assembly, with up to 5 refinement steps.
This pipeline implements amplicon assembly using a reference-based mapping approach. Reads are error-corrected and used to refine the initial assembly, with up to 5 refinement steps.
See more information regarding the pipelines at the wiki.
Each stage can be run on its own. Stages are grouped into 5 categories: hp_reads, hp_assemble, hp_haplotype, hp_description, and hp_phylo. More detailed descriptions of the command-line options for each stage are available in the wiki. To view all available stages in HAPHPIPE, run:
haphpipe -h
Stages to manipulate reads and perform quality control. Input is reads in FASTQ format, output is modified reads in FASTQ format.
Subsample reads using seqtk (documentation). Input is reads in FASTQ format. Output is sampled reads in FASTQ format. Example to execute:
haphpipe sample_reads --fq1 read_1.fastq --fq2 read_2.fastq --nreads 1000 --seed 1234
Trim reads using Trimmomatic (documentation). Input is reads in FASTQ format. Output is trimmed reads in FASTQ format. Example to execute:
haphpipe trim_reads --fq1 read_1.fastq --fq2 read_2.fastq
Join reads using FLASH (paper). Input is reads in FASTQ format. Output is joined reads in FASTQ format. Example to execute:
haphpipe join_reads --fq1 trimmed_1.fastq --fq2 trimmed_2.fastq
Error correction using SPAdes (documentation). Input is reads in FASTQ format. Output is error-corrected reads in FASTQ format. Example to execute:
haphpipe ec_reads --fq1 trimmed_1.fastq --fq2 trimmed_2.fastq
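The read-processing stages above can be chained in a small script. The sketch below prints each command and executes it only when DRY_RUN=0, so the sequence can be reviewed first. The run and qc_reads helpers are hypothetical, and the assumption that trim_reads leaves trimmed_1.fastq and trimmed_2.fastq in the working directory is taken from the file names in the examples above; adjust the paths to wherever your outputs are written.

```shell
# Print each command; execute it only when DRY_RUN=0 (default is a dry run).
DRY_RUN="${DRY_RUN:-1}"
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Hypothetical wrapper: trim a read pair, then error-correct the trimmed reads.
# The trimmed_*.fastq names are assumed from the examples above.
qc_reads() {
    run haphpipe trim_reads --fq1 "$1" --fq2 "$2" &&
    run haphpipe ec_reads --fq1 trimmed_1.fastq --fq2 trimmed_2.fastq
}

qc_reads read_1.fastq read_2.fastq
```

Setting DRY_RUN=0 in the environment before running the script executes the stages instead of only printing them.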
Assemble consensus sequence(s). Input reads (in FASTQ format) are assembled using either de novo assembly or reference-based alignment. The resulting consensus can be further refined.
Assemble reads via de novo assembly using SPAdes (documentation). Input is reads in FASTQ format. Output is contigs in FNA format. Example to execute:
haphpipe assemble_denovo --fq1 corrected_1.fastq --fq2 corrected_2.fastq --outdir denovo_assembly --no_error_correction TRUE
Assemble contigs from de novo assembly using both a reference sequence and amplicon regions with MUMmer 3+ (documentation). Input is contigs and reference sequence in FASTA format and amplicon regions in GTF format. Example to execute:
haphpipe assemble_amplicons --contigs_fa denovo_contigs.fa --ref_fa refSequence.fasta --ref_gtf refAmplicons.gtf
Scaffold contigs against a reference sequence with MUMmer 3+ (documentation). Input is contigs in FASTA format and reference sequence in FASTA format. Output is scaffold assembly, aligned scaffold, imputed scaffold, and padded scaffold in FASTA format. Example to execute:
haphpipe assemble_scaffold --contigs_fa denovo_contigs.fa --ref_fa refSequence.fasta
Map reads to reference sequence (instead of running de novo assembly) using Bowtie2 (documentation) and Picard (documentation). Input is reads in FASTQ format and reference sequence in FASTA format. Example to execute:
haphpipe align_reads --fq1 corrected_1.fastq --fq2 corrected_2.fastq --ref_fa refSequence.fasta
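Because this stage needs both read files and a reference, a short pre-flight check can catch a mistyped file name before the alignment starts. This is only a sketch; require_files is a hypothetical helper and not part of HAPHPIPE, and the file names are the ones used in the example above.

```shell
# Hypothetical helper: fail and name the first input that is missing or empty.
require_files() {
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty input: $f" >&2
            return 1
        fi
    done
}

# Run the alignment only when every input is present.
if require_files corrected_1.fastq corrected_2.fastq refSequence.fasta; then
    haphpipe align_reads --fq1 corrected_1.fastq --fq2 corrected_2.fastq --ref_fa refSequence.fasta
fi
```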
Variant calling from alignment using GATK (documentation). Input is alignment file in BAM format and reference sequence in FASTA format (either reference from reference-based assembly or consensus final sequence from de novo assembly). Output is a Variant Call File (VCF) format file. Example to execute:
haphpipe call_variants --aln_bam alignment.bam --ref_fa refSequence.fasta
Generate a consensus sequence from a VCF file. Input is a VCF file. Output is the consensus sequence in FASTA format. Example to execute:
haphpipe vcf_to_consensus --vcf variants.vcf
Map reads to a de novo assembly or reference alignment. The assembly or alignment is iteratively updated. Input is reads in FASTQ format and reference sequence (assembly or reference alignment) in FASTA format. Output is refined assembly in FASTA format. Example to execute:
haphpipe refine_assembly --fq1 corrected_1.fastq --fq2 corrected_2.fastq --ref_fa refSequence.fasta
Finalize consensus, map reads to consensus, and call variants. Input is reads in FASTQ format and reference sequence in FASTA format. Output is finalized reference sequence, alignment, and variants (in FASTA, BAM, and VCF formats, respectively). Example to execute:
haphpipe finalize_assembly --fq1 corrected_1.fastq --fq2 corrected_2.fastq --ref_fa refined.fna
Haplotype assembly stages. HAPHPIPE implements PredictHaplo (paper), although other haplotype reconstruction programs can be utilized outside of HAPHPIPE using the final output of HAPHPIPE, typically with the final consensus sequence (FASTA) file, reads (raw, trimmed, and/or corrected), and/or final alignment (BAM) file as input.
Haplotype identification with PredictHaplo. Input is reads in FASTQ format and reference sequence in FASTA format. Output is the longest global haplotype file and corresponding HTML file. Note: PredictHaplo must be installed separately before running this stage. Example to execute:
haphpipe predict_haplo --fq1 corrected_1.fastq --fq2 corrected_2.fastq --ref_fa final.fna
Return PredictHaplo output as a correctly formatted FASTA file. Input is the output file from predict_haplo (longest global .fas file). Output is a correctly formatted FASTA file. Example to execute:
haphpipe ph_parser best.fas
Haplotype identification with CliqueSNV. Input is reads in FASTQ format and reference sequence in FASTA format. Output is a FASTA file containing haplotypes with frequencies, a TXT file with CliqueSNV parameters and output, and a parsed summary TXT file (similar to the output of ph_parser). The CliqueSNV JAR file must be downloaded before running this stage, available here. If the file is not located in the current directory, provide the path to its directory using the --jardir option.
Example to execute:
haphpipe cliquesnv --fq1 corrected_1.fastq --fq2 corrected_2.fastq --ref_fa final.fna
Stages to annotate and extract regions from sequences using a reference sequence and GTF file. Also includes a module that calculates summary statistics.
Apply correct coordinate system to final sequence(s) to facilitate downstream analyses. Input is the final sequence file in FASTA format, a reference sequence in FASTA format, and a reference GTF file. Output is a JSON file to be used in extract_pairwise.
Example to execute:
haphpipe pairwise_align --amplicons_fa final.fna --ref_fa refSequence.fasta --ref_gtf referenceSeq.gtf
Extract sequence regions from the pairwise alignment produced in pairwise_align. Input is the JSON file from pairwise_align. Output is either an unaligned nucleotide FASTA file, an aligned nucleotide FASTA file, an amino acid FASTA file, an amplicon GTF file, or a tab-separated values (TSV) file (default: nucleotide FASTA with regions of interest from the GTF file used in pairwise_align).
Example to execute:
haphpipe extract_pairwise --align_json pairwise_aligned.json --refreg HIV_B.K03455.HXB2:2085-5096
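The --refreg value in the example appears to follow a name:start-end pattern. The sketch below builds such a string from its parts; the refreg helper is hypothetical, and the pattern is inferred only from the example above, so check the wiki for the exact coordinate conventions before relying on it.

```shell
# Hypothetical helper: build a region string of the form name:start-end.
refreg() {
    if [ "$2" -gt "$3" ]; then
        echo "start must not exceed end" >&2
        return 1
    fi
    printf '%s:%s-%s\n' "$1" "$2" "$3"
}

# Reproduce the region used in the example above and show the full command.
region="$(refreg HIV_B.K03455.HXB2 2085 5096)"
echo "haphpipe extract_pairwise --align_json pairwise_aligned.json --refreg $region"
```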
Generates summary statistics for samples. Input is a TXT file with a list of sample directories. Output is a TXT file and a TSV file. Example to execute:
haphpipe summary_stats --dir_list demo_dir_list.txt --amplicons
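The directory list can be generated rather than typed by hand. The sketch below writes one directory path per line, which is an assumption about the expected layout of the TXT file; the helper make_dir_list and the parent directory name samples are hypothetical.

```shell
# Hypothetical helper: write the path of every sub-directory of a parent
# directory to a text file, one path per line (assumed list format).
make_dir_list() {
    parent="$1"
    out="$2"
    : > "$out"
    for d in "$parent"/*/; do
        if [ -d "$d" ]; then
            echo "${d%/}" >> "$out"
        fi
    done
}

# Build the list only when the (hypothetical) samples directory exists.
if [ -d samples ]; then
    make_dir_list samples demo_dir_list.txt
fi
```

The resulting file can then be passed to the --dir_list option.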
Phylogenetic stages, including multiple sequence alignment, determination of the best-fit model of evolution, and tree building.
Aligns sequences using MAFFT. Input is a FASTA file with the sequences to be aligned and/or a TXT file with a list of directories AND a reference GTF file. Example to execute:
haphpipe multiple_align --dir_list demo_dir_list.txt --ref_gtf referenceSeq.gtf
Determine best-fit evolutionary model with ModelTest-NG. Input is an aligned FASTA or PHYLIP file with sequences. Output is a text file with ModelTest output showing best-fit models of evolution. Example to execute:
haphpipe model_test --seqs alignment.fasta
Create a phylogenetic tree with RAxML-NG. Input is an aligned FASTA or PHYLIP file with sequences. Output is a set of tree files. Example to execute:
haphpipe build_tree_NG --seqs alignment.fasta --all --model GTR