chrisjackson edited this page Oct 30, 2024 · 18 revisions

Documentation current for HybPiper version 2.3.1


hybpiper assemble

Unless the flag --no_intronerate or --not_protein_coding is provided when running the command hybpiper assemble, the pipeline will attempt to identify introns (if present), and will also produce a 'supercontig' sequence for each gene/sample. These are defined as:

  • supercontig: A sequence containing all assembled SPAdes contigs that align uniquely to the reference target file sequence, concatenated into a single sequence. The supercontig sequence can contain both exon AND intron sequences. See note 1 below.

  • introns: Sequences in the supercontig that Exonerate annotates as 'intron'. See note 2 below.

Example command:

hybpiper assemble -t_dna test_targets.fasta -r NZ281_R*_test.fastq --prefix NZ281 --bwa

Alternatively, if you have already run the hybpiper assemble command with the --no_intronerate flag, you can use the following command to run only the final stage of the pipeline (including intron and supercontig recovery):

hybpiper assemble -t_dna test_targets.fasta -r NZ281_R*_test.fastq --prefix NZ281 --bwa --start_from exonerate_contigs
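
When several samples are processed (note 1 below recommends this for checking supercontigs), the example command can be generated once per sample. The following is a minimal sketch, not part of HybPiper: the helper name `print_assemble_cmds` and the name-list file are hypothetical, and it assumes reads are named `<sample>_R*_test.fastq` as in the example above and that sample names contain no spaces.

```shell
# Print one `hybpiper assemble` command per sample name (one name per line).
# Usage: print_assemble_cmds NAMELIST TARGETS
# Assumption: reads are named <sample>_R*_test.fastq, as in the example command.
print_assemble_cmds() {
    namelist=$1
    targets=$2
    while IFS= read -r name || [ -n "$name" ]; do
        # Skip blank lines in the name list.
        [ -z "$name" ] && continue
        # The glob is printed literally; it expands only when the command is run.
        printf 'hybpiper assemble -t_dna %s -r %s_R*_test.fastq --prefix %s --bwa\n' \
            "$targets" "$name" "$name"
    done < "$namelist"
}
```

Calling `print_assemble_cmds namelist.txt test_targets.fasta` prints the commands for inspection; piping the output to `sh` would run them. The supercontigs could then be collected with something like `hybpiper retrieve_sequences supercontig -t_dna test_targets.fasta --sample_names namelist.txt`; these options are an assumption about HybPiper 2 usage, so check `hybpiper retrieve_sequences --help` for your version.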

NOTES:

  1. The supercontig can contain multiple SPAdes contigs that have been concatenated. Ideally these will contain only genuine exon and intron (or intergenic) sequence. However, the supercontig might also partly comprise mis-assembled contigs. Because it may be difficult to tell from a single sample whether a sequence is "real", we recommend running Intronerate on several samples, then extracting the supercontig sequences with hybpiper retrieve_sequences and aligning them. Sequences that appear in only one sample are probably from mis-assembled contigs and may be trimmed, for example using the program trimAl.

  2. Intron sequences recovered via Intronerate correspond to regions in the supercontig that Exonerate has annotated as 'intron'. Because each Exonerate hit begins and ends with an exon hit against an exon-only target file sequence, only intron sequences that occur between identified exons will be recovered. That is, sequence upstream of the first exon and downstream of the last exon will not be annotated, and will not appear in the intronerate.gff file found within each gene directory. These unannotated regions might be exonic (i.e. if the target protein used in Exonerate searches was shorter than the exonic regions assembled in SPAdes contigs) or intronic (but not detected for the reasons described above).
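
To see which regions were annotated as 'intron' (note 2), the GFF file can be filtered by feature type. This is a rough sketch rather than a HybPiper command: the function name `list_introns` is hypothetical, and it assumes the file follows the usual GFF column layout (feature type in column 3, start in column 4, end in column 5).

```shell
# List features annotated as 'intron' in a GFF file.
# Prints: sequence name, start, end, length (tab-separated).
# Assumption: standard GFF columns (3 = feature, 4 = start, 5 = end).
list_introns() {
    awk '
        /^#/ { next }    # skip comment lines
        $3 == "intron" {
            printf "%s\t%d\t%d\t%d\n", $1, $4, $5, $5 - $4 + 1
        }
    ' "$1"
}
```

For example, `list_introns intronerate.gff` would print one line per annotated intron. Sequence before the first exon or after the last exon never appears in this listing, because it is not annotated at all.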
