Full Stack Example
Before using epa-ng, it is important to know where it sits in the bigger picture. On this page you will find an example of what a full placement pipeline might look like.
There are two major components to phylogenetic placement:
- a set of sequences to place (called query sequences)
- a set of sequences that represent the context within which we want to place (called reference sequences)
If you've landed at this page, you probably have your query sequences already. Typically these come from metagenomic or metabarcoding sequencing, and are already filtered to belong to some genetic region (like 16S, 18S, or other common barcodes).
The most common question then is: for my query sequences, I want to know where they belong in terms of taxonomy. What is the taxonomic composition of my environmental sample?
epa-ng can answer this question, but it needs a handful of other programs to prepare the data and to perform any in-depth post analysis you might require for your research.
Selecting the reference sequences is the most biologically involved step (apart from the wet-lab work), as it requires knowledge of the environment the query sequences were sampled from, the organisms that are suspected to inhabit it, possible contaminants, and so on.
Some general advice I can give here:
- we found it's more robust to include a few representative sequences of a species, or even genus, rather than many: too high a similarity just produces many low-certainty placements, spread across the members of such a group, and unnecessarily increases runtime.
- include as much diversity as possible
- keep in mind that the tree should still be small enough to visualize (unless you reduce its size during post processing), and there are limits to how big of a tree and alignment the programs involved can handle
Aligning the reference sequences and inferring the reference tree can be done using standard methods. The only requirement that epa-ng (currently) imposes is that you remember or obtain the model parameters with which the tree was inferred, and supply them to epa-ng when calling the main placement routine (currently only explicit specification of model parameters is allowed).
Now that you have a reference MSA, we need to align our queries against it. There are multiple tools to do this, but we usually recommend either hmmer/hmmalign or papara.
papara actually takes the reference tree into account when aligning sequences, and a call to it would look something like
papara -t $TREE -s $REF_MSA -q $QRY -r -n some_name
(tree in newick, ref_msa in phylip and qry in fasta format)
Note the -r option: this is vital to ensure comparability between different query files for the same reference tree, as it forces papara not to add any sites to the original reference alignment!
Next, before the actual placement can commence, we need to explicitly prepare the input, as epa-ng (currently) only accepts separate query and reference alignment files, both in fasta (or bfast) format. Papara, for example, outputs the aligned queries together with the reference MSA, in phylip format.
This can be done directly with epa-ng now, using the --split function:
epa-ng --split ref_alignment query_alignments+
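Before moving on to placement, it can be worth confirming that the separated reference and query files share a single alignment width, since the queries must stay aligned to the original reference MSA. The following is a minimal sketch using only awk and sort. The function names fasta_widths and same_width are hypothetical helpers and not part of epa-ng.

```shell
# Print every distinct sequence length found in a FASTA file, one per line.
# Sequences wrapped over several lines are summed up before counting.
fasta_widths() {
  awk '
    /^>/ { if (seen) print len; len = 0; seen = 1; next }
         { gsub(/[ \t\r]/, ""); len += length($0) }
    END  { if (seen) print len }
  ' "$1" | sort -n -u
}

# Succeed only if both files contain exactly one width, and it is the same.
same_width() {
  ref_w=$(fasta_widths "$1")
  qry_w=$(fasta_widths "$2")
  set -- $ref_w   # split on whitespace: exactly one word means one width
  [ $# -eq 1 ] && [ "$ref_w" = "$qry_w" ]
}

# Example use, with the variables from the placement call:
#   same_width "$REF_MSA" "$QRY_MSA" || echo "alignment widths differ" >&2
```

If the check fails, a likely first thing to look at is whether the aligner was allowed to add sites to the reference alignment.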
Finally, the actual call to epa-ng will look something like
epa-ng --tree $TREE --ref-msa $REF_MSA --query $QRY_MSA --out-dir $OUT --model $INFO
$INFO is used to pass the aforementioned model parameters of the reference tree to epa-ng. It may (currently) be one of two things: either a raxml-ng-style model descriptor, like so:
GTR{0.7/1.8/1.2/0.6/3.0/1.0}+FU{0.25/0.23/0.30/0.22}+G4{0.47}
or, alternatively, a RAxML_info file resulting from a call to RAxML with its -f e option, which looks something like
raxmlHPC-AVX -f e -s $REF_MSA -t $TREE -n info -m GTRGAMMAX
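If you have the individual parameter values at hand, the raxml-ng-style descriptor can be assembled with a few lines of shell. This is only a sketch of the string format shown above: join_slash and gtr_model are hypothetical helper names, and the values are the ones from the example descriptor, not recommendations.

```shell
# Join all arguments with "/" (for example: 0.7 1.8 -> 0.7/1.8).
join_slash() {
  joined=$1
  shift
  for value in "$@"; do
    joined="$joined/$value"
  done
  printf '%s' "$joined"
}

# Build GTR{rates}+FU{freqs}+G4{alpha} from two space-separated lists
# (substitution rates, base frequencies) and the alpha value.
gtr_model() {
  rates=$(join_slash $1)   # unquoted on purpose: split the list into words
  freqs=$(join_slash $2)
  printf 'GTR{%s}+FU{%s}+G4{%s}\n' "$rates" "$freqs" "$3"
}

INFO=$(gtr_model "0.7 1.8 1.2 0.6 3.0 1.0" "0.25 0.23 0.30 0.22" 0.47)
echo "$INFO"
# prints GTR{0.7/1.8/1.2/0.6/3.0/1.0}+FU{0.25/0.23/0.30/0.22}+G4{0.47}
```

The resulting string can then be passed as $INFO in the epa-ng call above.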
Multiple options exist for post-analysis of the placement data produced by EPA-ng.
Guppy is the command line placement post-analysis tool by the developers of pplacer and is fully compatible with the output of EPA-ng. It offers features such as Squash Clustering, Edge-PCA, diversity analysis, splitting place files, and more.
However, it has not been in active development for some time, and it has some performance caveats that become noticeable with medium to large data.
gappa is a command line placement post-analysis tool akin to guppy that implements some of the most relevant features of guppy in a more performant way, as well as novel methods introduced here and here. It also includes an algorithm to perform taxonomic assignment based on the reference tree itself, described here.
genesis is the C++ library on which gappa is built. It includes a wide range of features to do just about anything regarding placement data, and phylogenetics in general. It comes with a convenient feature that allows you to simply write a script-like single .cpp file, drop it in genesis' apps folder, and have it compiled alongside the library itself, generating an executable per such app. This is what I use most often. (I have some of these apps available here, use at your own risk!)