-
Notifications
You must be signed in to change notification settings - Fork 45
Paralogs
`Documentation current for HybPiper version 2.3.1
When running the hybpiper assemble
command, you may notice warnings about paralogs. These warnings are triggered in two scenarios:
- When HybPiper detects multiple contigs containing long coding sequences - by default at least 75% of the reference sequence. HybPiper will choose among these competing long contigs by first checking whether one of the contigs has coverage depth that greatly exceeds the others (10x by default). If all competing long contigs have similar depth, the sequence with the greatest percent identity to the reference is chosen.
- When the set of sequences extracted from contigs has a coverage of >1 for 75% (default) of the reference sequence length.
These warnings are collated and written to report files for each sample within the main sample directory e.g.:
EG98/EG98_genes_with_long_paralog_warnings.txt
EG98/EG98_genes_with_paralog_warnings_by_contig_depth.csv
In scenario 1 (long paralogs detected), the criteria used to select the main sequence may not be ideal in all cases. Choosing the appropriate gene copy to use for phylogenetics requires careful consideration, and so HybPiper flags the genes that may require further attention. Note that there may be other reasons for multiple long-length contigs: recent polyploidy, contamination, or even allelic variation may also result in this.
HybPiper includes the post-processing command hybpiper paralog_retriever
to recover coding sequences from alternative long paralogs. To run it on the test data, provide the namelist.txt
file and the target_file (used to recover gene names):
hybpiper paralog_retriever namelist.txt -t_dna test_targets.fasta
The command will produce a number of output files/folders. These are:
-
paralog_report.tsv
: a report file in tab-separated-values format. Tabulates the number of sequences for each gene and sample (i.e. potential paralogs if > 1) e.g.:Species gene030 gene111 gene293 EG30 1 1 1 EG98 5 2 2 MWL2 4 1 2 etc... -
paralog_heatmap.png
: a heatmap visualisation of theparalog_report.tsv
file, e.g.:
-
paralogs_above_threshold_report.txt
: a text report file that lists 1) The number and names of genes with paralogs in a minimum percentage of samples; 2) The number and names of samples that have paralogs in a minimum percentage of genes. By default, this percentage is set to zero, so all genes and samples with paralogs will be reported. -
paralogs_all
: a folder containing a*.fasta
file for each sample/gene, containing paralog sequences of present, or the*.FNA
sequence recovered by HybPiper if no paralogs were detected. -
paralogs_no_chimeras
: a folder containing a*.fasta
file for each sample/gene as above, but with any putative chimeric*.FNA
sequences removed.
If many samples have more than one copy for many genes, it may indicate an ancient gene duplication. If one sample tends to have many copies, it may indicate it is a polyploid.
Not all samples have paralog warnings, such as EG30. Some genes do not have any paralogs found, such as gene001. However, HybPiper recovered coding sequence from two contigs in the sample EG98 for gene002:
>EG98.main NODE_3_length_1380_cov_36.695132,Artocarpus-gene002,0,282,90.85,(1),17,869
>EG98.0 NODE_4_length_1359_cov_18.996753,Artocarpus-gene002,17,282,87.59,(1),0,786
The suffix .main
refers to the paralog that was chosen during the initial run of HybPiper. The original names of the contigs from SPAdes are also shown, along with some statistics from Exonerate searches of the contigs. The .main
contig had about twice as much depth of coverage as the other paralog (36x vs 18x), and also had a higher percent identity (90.85% vs 87.59%).
Are these two sequences paralogs or alleles? The best way to check is to use multiple samples and build gene trees from the sequences. The unaligned *.fasta
files generated by hybpiper paralog_retriever
can be used in a phylogenetics pipeline to align and reconstruct a phylogeny.
If you have mafft
and FastTree
installed, you can create the tree directly from a *.paralogs_all.fasta
file using pipes:
cat gene074_paralogs_all.fasta | mafft --auto - | FastTree -nt -gtr > gene074.paralogs.tre
From the tree above (plotted using FigTree), you can see that for species NZ874, the two sequences recovered for gene074 form a clade. This means there is no evidence of an ancient duplication event for this gene. There may be another explanation for why HybPiper found two competing sequences, such as alleles. For phylogenetics, choosing the main
sequence is sufficient.
Other genes have more complicated histories:
cat gene006_paralogs_all.fasta | mafft --auto - | FastTree -nt -gtr > gene006.paralogs.tre
We see that there are two distinct clades of sequences, sister to a clade of the outgroups (NZ874, EG30, and NZ281). The .main
sequences form one clade, suggesting that this gene had an ancestral duplication within the outgroup.
Cases like the one above were common in the Artocarpus data on which the test dataset is based. By investigating each gene tree, it is possible to delineate which sequence is appropriate for phylogenetics. However, some samples may be missing one or both copies of the duplicated gene. Although the genes may really be lost, it is also possible that HybPiper was not able to recover both copies.
To test this possibility, one option is to create a new target file for HybPiper that includes examples of both copies. HybPiper will map reads against each copy separately, treating them as separate loci. For the Artocarpus dataset, this allowed us to nearly double the number of "loci" for phylogenetics!
In scenario 2, a paralog warning is produced when the set of sequences extracted from SPAdes contigs has a depth greater than 1 across a given percentage length (default 75%) of the reference sequence for that gene/sample. This warning is useful as it provides an indication that paralogs might be present for the given gene/sample, without the requirement that SPAdes has assembled each paralog in to an almost full-length contig. The latter scenario is unlikely if your target genes contain long introns, meaning that long paralog warnings would not be produced even though paralogs are present in your dataset for that gene/sample.
Note that for paralog warning based on sequence depth, no sequences are written to file. This is because there is currently no way to definitively group contigs containing partial gene sequences into single-paralog clusters for concatenation. Please let us know if you can think of a way!