Introns
Documentation current for HybPiper version 2.3.1
hybpiper assemble
Unless the flag --no_intronerate or --not_protein_coding is provided when running the command hybpiper assemble, the pipeline will attempt to identify introns (if present), and will also produce a 'supercontig' sequence for each gene/sample. These are defined as:
- supercontig: A sequence containing all assembled SPAdes contigs with a unique alignment to the reference target file sequence, concatenated into one sequence. The supercontig sequence can contain both exon AND intron sequences. See note 1 below.
- introns: Sequences in the supercontig that Exonerate annotates as 'intron'. See note 2 below.
Example command:
hybpiper assemble -t_dna test_targets.fasta -r NZ281_R*_test.fastq --prefix NZ281 --bwa
Alternatively, if you have already run the hybpiper assemble command with the --no_intronerate flag, you can use the following command to run only the final stage of the pipeline (including intron and supercontig recovery):
hybpiper assemble -t_dna test_targets.fasta -r NZ281_R*_test.fastq --prefix NZ281 --bwa --start_from exonerate_contigs
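When several samples need to be processed, the example command can be wrapped in a loop. The following is a minimal sketch, not part of HybPiper itself: the function name print_assemble_commands and the sample list file namelist.txt (one sample name per line) are hypothetical, and the read-file pattern is assumed to match the NZ281 example above. The function only prints the commands, so they can be checked before anything is run.

```shell
# Print (dry run) one "hybpiper assemble" command per sample name.
# $1: hypothetical text file with one sample name per line
# $2: target file passed to -t_dna
print_assemble_commands() {
    namelist="$1"
    targets="$2"
    while IFS= read -r name; do
        # Skip blank lines in the sample list
        [ -n "$name" ] || continue
        # Assumes read files are named <sample>_R*_test.fastq, as in the example
        printf 'hybpiper assemble -t_dna %s -r %s_R*_test.fastq --prefix %s --bwa\n' \
            "$targets" "$name" "$name"
    done < "$namelist"
}

# Usage: inspect the generated commands first, then pipe them to a shell
# print_assemble_commands namelist.txt test_targets.fasta
# print_assemble_commands namelist.txt test_targets.fasta | sh
```

Printing the commands rather than running them directly keeps the sketch safe to try; adjust the read-file pattern to your own naming scheme before piping the output to a shell.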
NOTES:
1. The supercontig can contain multiple SPAdes contigs that have been concatenated. Ideally these will contain only genuine exon and intron (or intergenic) sequences. However, the sequence might also partly comprise mis-assembled contigs. While it may be difficult to tell from a single sample whether the sequence is "real", we recommend running Intronerate on several samples, then extracting the supercontig sequences with hybpiper retrieve_sequences and aligning them. Sequences that appear in only one sample are probably from mis-assembled contigs and may be trimmed, for example using the program trimAl.
2. Intron sequences recovered via Intronerate correspond to regions in the supercontig that Exonerate has annotated as 'intron'. Because Exonerate hits begin and end with exon hits against an exon-only target file sequence, only intron sequences that occur between identified exons will be recovered. That is, sequence upstream of the first exon and downstream of the last exon will not be annotated, and will not appear in the intronerate.gff file found within each gene directory. These unannotated regions might be exonic (i.e. if the target protein used in Exonerate searches was shorter than the exonic regions assembled in SPAdes contigs) or intronic (but not detected for the reasons described above).
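To see which regions of a supercontig were annotated as 'intron', the GFF file mentioned in note 2 can be filtered by feature type. The sketch below assumes the file follows the usual tab-delimited GFF layout (sequence name in column 1, feature type in column 3, start and end in columns 4 and 5). The function name list_introns is hypothetical and is not a HybPiper command.

```shell
# List features annotated as 'intron' in a GFF file.
# Output columns (tab-separated): seqid, start, end, length
# Assumes tab-delimited GFF with the feature type in column 3.
list_introns() {
    awk -F'\t' '
        /^#/ { next }                    # skip comment and header lines
        $3 == "intron" {
            # GFF coordinates are 1-based and inclusive
            printf "%s\t%d\t%d\t%d\n", $1, $4, $5, $5 - $4 + 1
        }
    ' "$1"
}

# Usage, where $gff is the path to the GFF file in a gene directory:
# list_introns "$gff"
```

Sequence before the first exon or after the last exon will not show up in this listing, because it is not annotated in the GFF file at all.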