Full Stack Example
Before using epa-ng, it is important to know where it sits in the bigger picture. On this page you will find an example of what a full placement pipeline might look like.
There are two major components to phylogenetic placement:
- a set of sequences to place (called query sequences)
- a set of sequences that represent the context within which we want to place (called reference sequences)
If you've landed at this page, you probably have your query sequences already. Typically these come from metagenomic or metabarcoding sequencing, and are already filtered to belong to some genetic region (like 16S, 18S, or other common barcodes).
The most common question then is: for my query sequences, I want to know where they belong in terms of taxonomy. What is the taxonomic composition of my environmental sample?
epa-ng can answer this question, but it needs a handful of other programs to prepare the data and to perform any in-depth post-analysis you might require for your research.
Selecting the reference sequences is the most biologically involved step (apart from the wet-lab work), as it requires knowledge of the environment the query sequences were sampled from, the organisms that are suspected to inhabit it, possible contaminants, and so on.
Some general advice I can give here:
- we found it more robust to include a few representative sequences of a species, or even genus, rather than many, as too high a similarity will just produce many placements of low certainty, spread across many members of such a group. This also unnecessarily increases runtime.
- include as much diversity as possible
- keep in mind that the tree should still be small enough to visualize (unless you reduce its size during post processing), and there are limits to how big of a tree and alignment the programs involved can handle
The reference alignment and tree can be inferred using standard methods. The only requirement that epa-ng (currently) imposes is that you record or obtain the model parameters with which the tree was inferred, and supply them to epa-ng when calling the main placement routine (currently only the GTRGAMMA model is supported).
Now that you have a reference MSA, the queries need to be aligned against it. There are multiple tools to do this, but we usually recommend either hmmer/hmmalign or papara.
papara actually takes the reference tree into account when aligning sequences, and a call to it would look something like:
papara -t $TREE -s $REF_MSA -q $QRY -r -n some_name
(tree in newick, ref_msa in phylip and qry in fasta format)
Note the -r option: this is vital to ensure comparability between different query files for the same reference tree, as it forces papara not to add any sites to the original reference alignment!
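To illustrate the comparability point, here is a minimal sketch that aligns several query files against the same reference, passing -r every time. The function name align_queries, the PAPARA override variable, and the scheme of deriving the run name from the query file name are assumptions made for this example; the papara flags are only those shown in the call above.

```shell
# Align each query file against the same reference tree and alignment.
# PAPARA may be overridden (e.g. PAPARA=echo) to preview the commands
# without running them; this variable is a convenience of this sketch.
align_queries() {
    tree=$1
    ref_msa=$2
    shift 2
    for qry in "$@"; do
        # Run name derived from the query file name (hypothetical scheme).
        name=$(basename "$qry" .fasta)
        # -r keeps the reference alignment unchanged across all runs.
        "${PAPARA:-papara}" -t "$tree" -s "$ref_msa" -q "$qry" -r -n "$name" || return 1
    done
}
```

A call such as `align_queries "$TREE" "$REF_MSA" sample_a.fasta sample_b.fasta` would then produce one papara run per query file, all sharing an unmodified reference alignment.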
Before the actual placement can commence, we need to explicitly prepare the input, as epa-ng (currently) only accepts separate query and reference alignment files, both in fasta (or bfast) format. Papara, for example, outputs the aligned queries together with the reference MSA, in phylip format.
(there will be a convenience function for this shortly)
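Until such a convenience function is available, the split can be done by hand. The following is a minimal sketch, not part of epa-ng or papara. It assumes both the reference MSA and the combined alignment are in sequential relaxed phylip (a header line, then one "name sequence" record per line, with no whitespace inside the sequence), and that reference sequences keep their names in the combined file. The function name split_combined_msa and its argument order are made up for this example.

```shell
# Split a combined (reference + query) phylip alignment into two fasta files.
# Assumes sequential relaxed phylip: a header line "ntaxa nsites", then one
# "name sequence" record per line.
split_combined_msa() {
    ref_msa=$1      # original reference alignment (phylip)
    combined=$2     # combined alignment, e.g. the papara output (phylip)
    out_ref=$3      # reference sequences, written as fasta
    out_qry=$4      # query sequences, written as fasta
    # Create or empty both output files, so they exist even if nothing is written.
    : > "$out_ref"
    : > "$out_qry"
    awk -v out_ref="$out_ref" -v out_qry="$out_qry" '
        FNR == 1  { next }                      # skip the phylip header of each file
        NF < 2    { next }                      # skip blank or malformed lines
        NR == FNR { is_ref[$1] = 1; next }      # first file: remember reference names
        {
            # Second file: route each record by whether its name is a reference name.
            out = ($1 in is_ref) ? out_ref : out_qry
            printf(">%s\n%s\n", $1, $2) >> out
        }
    ' "$ref_msa" "$combined"
}
```

Because the reference sequences are written from the combined file, both outputs have the same alignment width, which is what the placement step expects. Interleaved phylip files would need to be converted to sequential form first.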
Finally, the actual call to epa-ng will look something like
epa-ng --tree $TREE --ref-msa $REF_MSA --query $QRY_MSA --out-dir $OUT --model $INFO
$INFO is used to pass the aforementioned model parameters of the reference tree to epa-ng. It may (currently) be one of two things: either a raxml-ng-style model descriptor, like so:
GTR{0.7/1.8/1.2/0.6/3.0/1.0}+FU{0.25/0.23/0.30/0.22}+G4{0.47}
or, alternatively, a RAxML_info file resulting from a call using its -f e option, looking something like
raxmlHPC-AVX -f e -s $REF_MSA -t $TREE -n info -m GTRGAMMAX
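For the first of the two options, the descriptor string can be assembled from values you have recorded. The sketch below follows the shape of the example descriptor above: six slash-separated GTR rates, four user-supplied base frequencies (FU), and the Gamma shape parameter with four rate categories (G4). The function name make_model_descriptor and the field-count check are assumptions made for this example; the order of rates and frequencies should be checked against the raxml-ng documentation.

```shell
# Assemble a raxml-ng-style GTR+Gamma descriptor from its three parts.
make_model_descriptor() {
    rates=$1   # six substitution rates, slash-separated
    freqs=$2   # four base frequencies, slash-separated
    alpha=$3   # shape parameter of the Gamma distribution
    # Count slash-separated fields to catch obviously malformed input.
    n_rates=$(printf '%s\n' "$rates" | awk -F/ '{ print NF }')
    n_freqs=$(printf '%s\n' "$freqs" | awk -F/ '{ print NF }')
    if [ "$n_rates" -ne 6 ] || [ "$n_freqs" -ne 4 ]; then
        echo "expected 6 rates and 4 frequencies" >&2
        return 1
    fi
    printf 'GTR{%s}+FU{%s}+G4{%s}\n' "$rates" "$freqs" "$alpha"
}
```

The result can be stored in the variable used in the epa-ng call, for example `INFO=$(make_model_descriptor 0.7/1.8/1.2/0.6/3.0/1.0 0.25/0.23/0.30/0.22 0.47)`.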