Skip to content
chrisjackson edited this page Oct 30, 2024 · 27 revisions

Documentation current for HybPiper version 2.3.1


Welcome to the HybPiper tutorial!

The purpose of this tutorial is to help familiarize you with the format of the input you need and output you should expect from running HybPiper. The tutorial uses a test dataset that is a subset of real Illumina data from an enriched library.

This tutorial assumes that you have some experience executing programs from the command line on a UNIX-like system. If you would like to learn more about the command line, or need a refresher, you can find a command line tutorial here.

Test dataset

Click this link to download the files for the test dataset. If you installed HybPiper by cloning the repository, the file test_dataset.tar.gz will already be in the repository directory. Extract the file by either double-clicking on it, or via a terminal using the command:

tar -zxvf test_dataset.tar.gz

The extracted directory test_dataset contains the following files:

  • test_reads.fastq.tar.gz contains paired reads from nine samples chosen from the initial HybPiper manuscript. It includes six "ingroup" samples (genus Artocarpus) and three outgroup samples. Each sample has a pair of *.fastq files, representing the forward and reverse reads, generated on an Illumina MiSeq 2x300 platform.

  • test_targets.fasta is a file containing the full coding sequence from 13 target genes based on the Artocarpus probe set described in the HybPiper manuscript. There are two "sources" of sequence for each target: Artocarpus (sequences from a draft genome in the target group) and Morus (a reference genome in the same family as Artocarpus). For example, both of these sequences represent gene002:

    >Artocarpus-gene002
    ATGATGAAGCAGGACGCCACCAACGGCGGCGAGAACTGGGTGCGCGTGTGCGACACGTGC
    CGCTCGGCGGCGTGCACGGTTTACTGCCGCGCCGACTCGGCTTACCTCTGCGCCGGATGC
    >Morus-gene002
    ATGATGAAGGAGGACACAAACGGGGGCAACTCCAGCAAGAACTGGGCGCGCGTGTGTGAC
    ACGTGCCGTTCCGCGGCGTGCGCGGTGTACTGCCGTGCCGACTCGGCGTACCTTTGCGCG
    

    Having multiple sources for each gene increases the likelihood that reads will map to the targets during the first phase. HybPiper then chooses the version with the best overall mapping score to serve as a reference for extracting coding sequences.

  • namelist.txt A file containing the list of sample names, one per line. While not required, this file will help run the main HybPiper script and post-processing scripts on multiple samples.

  • run_hybpiper_test_dataset.sh This shell script contains a few of the example commands we will cover in this tutorial. The script can be run from within the test_dataset directory simply by:

    ./run_hybpiper_test_dataset.sh
    

All the commands in the bash script will be covered in more detail in the following sections.

Running HybPiper

Uncompress the read files using the tar utility command:

tar -zxvf test_reads.fastq.tar.gz

This will generate 18 *.fastq files in your current directory, representing the forward and reverse Illumina reads for nine samples.

The main command of HybPiper is hybpiper assemble. HybPiper needs sequencing reads and a file containing the target coding sequences, in either amino-acid or nucleotide format.

HybPiper is run separately for each sample, and generates a directory that contains all of the output files, organized by gene.

Run this command from the test_dataset directory:

hybpiper assemble -t_dna test_targets.fasta -r NZ281_R*_test.fastq --prefix NZ281 --bwa

Using the wild card (asterisk) saves some typing and instructs HybPiper to use both the R1 (forward) and R2 (reverse) read files. The --prefix flag will be the name of the directory generated by HybPiper, as well as the identifier for all sequences generated. The --bwa flag is required when the target file contains nucleotide sequences. Note: if you supply a target file containing nucleotide sequences using the t_dna/--targetfile_dna parameter but omit the --bwa flag, HybPiper will translate the target file and map reads using BLASTx.

HybPiper will generate coding sequences and translated proteins from the sequencing reads in three phases:

  1. Read mapping. BLASTx or DIAMOND is used if the targets are amino-acid sequences, and BWA is used if the targets are nucleotide sequences. Sequencing reads that map to each gene are sorted into separate gene directories. Liberal parameters for mapping quality are used to ensure the maximum number of reads are used for contig assembly.
  2. Contig assembly. Each gene is assembled separately from the pool of reads identified in the first step. Assembly is conducted using SPAdes, which automatically detects the best k-mer values to use. If one or more k-mer values fails (usually due to lack of read depth), HybPiper re-runs SPAdes with smaller k-mer values.
  3. Coding sequence extraction First, HybPiper uses Exonerate to align the contigs to the appropriate target sequence, and coding sequences (i.e. exons) are extracted. Next, the coding sequences are sorted according to their alignment position. If there is one sequence that represents the entire aligned portion, it is chosen. Otherwise, a set of selection criteria are applied, including length of alignment to the target locus, percent identity between the coding sequence and the target, and depth of coverage.

Briefly, if two coding sequences have non-overlapping alignments to the reference, they are combined into a "stitched-contig". If two contigs have similar alignments to the target sequence, the contig with the longer alignment is chosen. If two contigs have identical alignment positions, but one contig has a much greater depth of coverage (10x more by default), it is chosen. If they both have similar depth, the contig with the greater percent identity to the target is chosen.

NOTE: In cases where HybPiper recovers sequence for multiple non-contiguous segments of a gene, the gaps between the segments will be padded with a number of 'N' characters. The number of Ns corresponds to the number of amino acids in the 'best' protein reference for that gene that do not have corresponding SPAdes contig hits, multiplied by 3 to convert to nucleotides. See this issue for some more details.


Running multiple samples

hybpiper assemble

Although HybPiper is set up to run on each sample separately, if the input files are organized and named appropriately, it is easy to set up and run HybPiper on multiple samples consecutively.

Here, we will employ a "while loop" to get the names of samples from the file namelist.txt, and use that name as a "variable" to access the names of read files and set the --prefix flag. From the test_dataset directory, run the command:

while read name; 
do hybpiper assemble -t_dna test_targets.fasta -r $name*.fastq --prefix $name --bwa;
done < namelist.txt

It should only take a few minutes to run through every sample. The "while loop" syntax and the namelist.txt file will also be used for other post-processing commands later in the tutorial.

Summary statistics

hybpiper stats

This command will summarize target enrichment and gene recovery efficiency for a set of samples. The output is two tables in tab-separated-values (*.tsv) format. These are:

  1. seq_lengths.tsv. This table lists the length of coding sequence recovered by HybPiper for each gene and sample. The first line of seq_lengths.tsv has the column header 'Species' followed by the names of each gene. The second line has the column header 'MeanLength' followed by the length of the target gene, averaged over each "source" for that gene. The rest of the lines are the length of the sequence recovered by HybPiper for each gene. If there was no sequence for a gene, a 0 is present.
  2. hybpiper_stats.tsv. This table lists one sample per line and the following statistics:
ColumnName Description
Name Sample name corresponding to the parameter --prefix or generated from read files.
NumReads Total number of input reads in the *.fastq files provided.
ReadsMapped Total number of input reads that mapped to sequences in the target file.
PctOnTarget Percentage of reads in the input *.fastq files that mapped to sequences in the target file.
GenesMapped Number of genes in the target file that had reads mapped to their representative sequence(s).
GenesWithContigs From genes with mapped reads, the number of genes for which SPAdes assembled contigs.
GenesWithSeqs From genes with SPAdes contigs, the number of genes that had coding sequences extracted after Exonerate searches of contigs vs the target file reference sequence.
GenesAt25pct Number of genes with sequences > 25% of the mean target length.
GenesAt50pct Number of genes with sequences > 50% of the mean target length.
GenesAt75pct Number of genes with sequences > 75% of the mean target length.
GenesAt150pct Number of genes with sequences > 150% of the mean target length.
ParalogWarningsLong Number of genes that had more than one coding sequence extracted from different contigs by Exonerate searches, each covering more than 75% (default) the length of the selected target file reference sequence.
ParalogWarningsDepth Number of genes where the coverage depth of coding sequences extracted by Exonerate was >1 for 75% (default) the length of the selected target file reference sequence. Coding sequences can be of any length.
GenesWithoutStitchedContigs Number of genes with sequence derived from a single SPAdes contig.
GenesWithStitchedContigs Number of genes with sequence derived from multiple SPAdes contigs.
GenesWithStitchedContigsSkipped Number of genes for which stitched contig creation was skipped (i.e. if the flag --no_stitched_contig was provided to the command hybpiper assemble).
TotalBasesRecovered Total number of nucleotides summed from all gene sequences, not including any N characters.

Example Command Line:

hybpiper stats -t_dna test_targets.fasta gene namelist.txt

Visualizing results

hybpiper recovery_heatmap

A heatmap is one way to get a quick glance of the overall success of HybPiper in recovering coding sequences. To generate the heatmap, we can use the seq_lengths.tsv produced by the previous hybpiper stats command. Supply this file to the hybpiper recovery_heatmap command using:

hybpiper recovery_heatmap seq_lengths.tsv

This will plot a heatmap representing HybPiper's success at recovering genes at each locus, across all the samples. An image file in *.png format called recovery_heatmap.png will be produced:

recovery_heatmap

Each column is a gene, and each row is a sample. The darkness of shading in each cell represents the length of the sequence recovered, expressed as a percentage of the length of the target gene. In the test dataset, most genes worked very well, but it is easy to see at a glance that some samples did not work well (NZ874) and some genes did not work as well (gene030).

Retrieving sequences

hybpiper retrieve_sequences

This command fetches the sequences recovered for the same gene from multiple samples and generates an unaligned multi-FASTA file for each gene. Alternatively, it can be used to recover sequences from a single sample in a single multi-FASTA file. In this tutorial we will carry out the former task.

Make sure all your sample runs are in the same directory. The command will retrieve all unique gene names from the target file used in the run of the pipeline.

Example command Line:

hybpiper retrieve_sequences dna -t_dna test_targets.fasta --sample_names namelist.txt

You must specify whether you want the protein (aa) or nucleotide (dna) sequences.

If you ran Intronerate on these samples (i.e. you did not use the flag --no_intronerate when running the hybpiper assemble command), you can also specify "supercontig" or "intron" to recover those sequences instead.

By default, the command will output unaligned fasta files, one per gene, to the current directory. Alternatively, you can specify an output directory by adding the parameter --fasta_dir <your_output_directory_name>.

Introns and Paralogs

We have prepared more in-depth discussions about extracting introns or diagnosing putative paralogs on separate pages.