-
Notifications
You must be signed in to change notification settings - Fork 45
Tutorial
Documentation current for HybPiper version 2.1.8
The purpose of this tutorial is to help familiarize you with the format of the input you need and output you should expect from running HybPiper. The tutorial uses a test dataset that is a subset of real Illumina data from an enriched library.
This tutorial assumes that you have some experience executing programs from the command line on a UNIX-like system. If you would like to learn more about the command line, or need a refresher, you can find a command line tutorial here.
Click this link to download the files for the test dataset. If you installed HybPiper by cloning the repository, the file test_dataset.tar.gz
will already be in the repository directory. Extract the file by either double-clicking on it, or via a terminal using the command:
tar -zxvf test_dataset.tar.gz
The extracted directory test_dataset
contains the following files:
-
test_reads.fastq.tar.gz
contains paired reads from nine samples chosen from the initial HybPiper manuscript. It includes six "ingroup" samples (genus Artocarpus) and three outgroup samples. Each sample has a pair of*.fastq
files, representing the forward and reverse reads, generated on an Illumina MiSeq 2x300 platform. -
test_targets.fasta
is a file containing the full coding sequence from 13 target genes based on the Artocarpus probe set described in the HybPiper manuscript. There are two "sources" of sequence for each target: Artocarpus (sequences from a draft genome in the target group) and Morus (a reference genome in the same family as Artocarpus). For example, both of these sequences representgene002
:>Artocarpus-gene002 ATGATGAAGCAGGACGCCACCAACGGCGGCGAGAACTGGGTGCGCGTGTGCGACACGTGC CGCTCGGCGGCGTGCACGGTTTACTGCCGCGCCGACTCGGCTTACCTCTGCGCCGGATGC >Morus-gene002 ATGATGAAGGAGGACACAAACGGGGGCAACTCCAGCAAGAACTGGGCGCGCGTGTGTGAC ACGTGCCGTTCCGCGGCGTGCGCGGTGTACTGCCGTGCCGACTCGGCGTACCTTTGCGCG
Having multiple sources for each gene increases the likelihood that reads will map to the targets during the first phase. HybPiper then chooses the version with the best overall mapping score to serve as a reference for extracting coding sequences.
-
namelist.txt
A file containing the list of sample names, one per line. While not required, this file will help run the main HybPiper script and post-processing scripts on multiple samples. -
run_hybpiper_test_dataset.sh
This shell script contains a few of the example commands we will cover in this tutorial. The script can be run from within thetest_dataset
directory simply by:./run_hybpiper_test_dataset.sh
All the commands in the bash script will be covered in more detail in the following sections.
Uncompress the read files using the tar
utility command:
tar -zxvf test_reads.fastq.tar.gz
This will generate 18 *.fastq
files in your current directory, representing the forward and reverse Illumina reads for nine samples.
The main command of HybPiper is hybpiper assemble
. HybPiper needs sequencing reads and a file containing the target coding sequences, in either amino-acid or nucleotide format.
HybPiper is run separately for each sample, and generates a directory that contains all of the output files, organized by gene.
Run this command from the test_dataset
directory:
hybpiper assemble -t_dna test_targets.fasta -r NZ281_R*_test.fastq --prefix NZ281 --bwa
Using the wild card (asterisk) saves some typing and instructs HybPiper to use both the R1 (forward) and R2 (reverse) read files. The --prefix
flag will be the name of the directory generated by HybPiper, as well as the identifier for all sequences generated. The --bwa
flag is required when the target file contains nucleotide sequences. Note: if you supply a target file containing nucleotide sequences using the t_dna
/--targetfile_dna
parameter but omit the --bwa
flag, HybPiper will translate the target file and map reads using BLASTx.
HybPiper will generate coding sequences and translated proteins from the sequencing reads in three phases:
- Read mapping. BLASTx or DIAMOND is used if the targets are amino-acid sequences, and BWA is used if the targets are nucleotide sequences. Sequencing reads that map to each gene are sorted into separate gene directories. Liberal parameters for mapping quality are used to ensure the maximum number of reads are used for contig assembly.
- Contig assembly. Each gene is assembled separately from the pool of reads identified in the first step. Assembly is conducted using SPAdes, which automatically detects the best k-mer values to use. If one or more k-mer values fails (usually due to lack of read depth), HybPiper re-runs SPAdes with smaller k-mer values.
- Coding sequence extraction First, HybPiper uses Exonerate to align the contigs to the appropriate target sequence, and coding sequences (i.e. exons) are extracted. Next, the coding sequences are sorted according to their alignment position. If there is one sequence that represents the entire aligned portion, it is chosen. Otherwise, a set of selection criteria are applied, including length of alignment to the target locus, percent identity between the coding sequence and the target, and depth of coverage.
Briefly, if two coding sequences have non-overlapping alignments to the reference, they are combined into a "stitched-contig". If two contigs have similar alignments to the target sequence, the contig with the longer alignment is chosen. If two contigs have identical alignment positions, but one contig has a much greater depth of coverage (10x more by default), it is chosen. If they both have similar depth, the contig with the greater percent identity to the target is chosen.
NOTE: In cases where HybPiper recovers sequence for multiple non-contiguous segments of a gene, the gaps between the segments will be padded with a number of 'N' characters. The number of Ns corresponds to the number of amino acids in the 'best' protein reference for that gene that do not have corresponding SPAdes contig hits, multiplied by 3 to convert to nucleotides. See this issue for some more details.
hybpiper assemble
Although HybPiper is set up to run on each sample separately, if the input files are organized and named appropriately, it is easy to set up and run HybPiper on multiple samples consecutively.
Here, we will employ a "while loop" to get the names of samples from the file namelist.txt
, and use that name as a "variable" to access the names of read files and set the --prefix
flag. From the test_dataset
directory, run the command:
while read name;
do hybpiper assemble -t_dna test_targets.fasta -r $name*.fastq --prefix $name --bwa;
done < namelist.txt
It should only take a few minutes to run through every sample. The "while loop" syntax and the namelist.txt
file will also be used for other post-processing commands later in the tutorial.
hybpiper stats
This command will summarize target enrichment and gene recovery efficiency for a set of samples. The output is two tables in tab-separated-values (*.tsv
) format. These are:
-
seq_lengths.tsv
. This table lists the length of coding sequence recovered by HybPiper for each gene and sample. The first line ofseq_lengths.tsv
has the column header 'Species' followed by the names of each gene. The second line has the column header 'MeanLength' followed by the length of the target gene, averaged over each "source" for that gene. The rest of the lines are the length of the sequence recovered by HybPiper for each gene. If there was no sequence for a gene, a 0 is present. -
hybpiper_stats.tsv
. This table lists one sample per line and the following statistics:
ColumnName | Description |
---|---|
Name | Sample name corresponding to the parameter --prefix or generated from read files. |
NumReads | Total number of input reads in the *.fastq files provided. |
ReadsMapped | Total number of input reads that mapped to sequences in the target file. |
PctOnTarget | Percentage of reads in the input *.fastq files that mapped to sequences in the target file. |
GenesMapped | Number of genes in the target file that had reads mapped to their representative sequence(s). |
GenesWithContigs | From genes with mapped reads, the number of genes for which SPAdes assembled contigs. |
GenesWithSeqs | From genes with SPAdes contigs, the number of genes that had coding sequences extracted after Exonerate searches of contigs vs the target file reference sequence. |
GenesAt25pct | Number of genes with sequences > 25% of the mean target length. |
GenesAt50pct | Number of genes with sequences > 50% of the mean target length. |
GenesAt75pct | Number of genes with sequences > 75% of the mean target length. |
GenesAt150pct | Number of genes with sequences > 150% of the mean target length. |
ParalogWarningsLong | Number of genes that had more than one coding sequence extracted from different contigs by Exonerate searches, each covering more than 75% (default) the length of the selected target file reference sequence. |
ParalogWarningsDepth | Number of genes where the coverage depth of coding sequences extracted by Exonerate was >1 for 75% (default) the length of the selected target file reference sequence. Coding sequences can be of any length. |
GenesWithoutStitchedContigs | Number of genes with sequence derived from a single SPAdes contig. |
GenesWithStitchedContigs | Number of genes with sequence derived from multiple SPAdes contigs. |
GenesWithStitchedContigsSkipped | Number of genes for which stitched contig creation was skipped (i.e. if the flag --no_stitched_contig was provided to the command hybpiper assemble ). |
TotalBasesRecovered | Total number of nucleotides summed from all gene sequences, not including any N characters. |
Example Command Line:
hybpiper stats -t_dna test_targets.fasta gene namelist.txt
hybpiper recovery_heatmap
A heatmap is one way to get a quick glance of the overall success of HybPiper in recovering coding sequences. To generate the heatmap, we can use the seq_lengths.tsv
produced by the previous hybpiper stats
command. Supply this file to the hybpiper recovery_heatmap
command using:
hybpiper recovery_heatmap seq_lengths.tsv
This will plot a heatmap representing HybPiper's success at recovering genes at each locus, across all the samples. An image file in *.png
format called recovery_heatmap.png
will be produced:
Each column is a gene, and each row is a sample. The darkness of shading in each cell represents the length of the sequence recovered, expressed as a percentage of the length of the target gene. In the test dataset, most genes worked very well, but it is easy to see at a glance that some samples did not work well (NZ874) and some genes did not work as well (gene030).
hybpiper retrieve_sequences
This command fetches the sequences recovered for the same gene from multiple samples and generates an unaligned multi-FASTA file for each gene. Alternatively, it can be used to recover sequences from a single sample in a single multi-FASTA file. In this tutorial we will carry out the former task.
Make sure all your sample runs are in the same directory. The command will retrieve all unique gene names from the target file used in the run of the pipeline.
Example command Line:
hybpiper retrieve_sequences dna -t_dna test_targets.fasta --sample_names namelist.txt
You must specify whether you want the protein (aa
) or nucleotide (dna
) sequences.
If you ran Intronerate on these samples (i.e. you did not use the flag --no_intronerate
when running the hybpiper assemble
command), you can also specify "supercontig
" or "intron
" to recover those sequences instead.
By default, the command will output unaligned fasta files, one per gene, to the current directory. Alternatively, you can specify an output directory by adding the parameter --fasta_dir <your_output_directory_name>
.
We have prepared more in-depth discussions about extracting introns or diagnosing putative paralogs on separate pages.