Pierre Barbera edited this page Feb 16, 2018 · 9 revisions

Before using epa-ng, it is important to know where it sits in the bigger picture. On this page you will find an example of what a full placement pipeline might look like.

There are two major components to phylogenetic placement:

  • a set of sequences to place (called query sequences)
  • a set of sequences that represent the context within which we want to place (called reference sequences)

If you've landed at this page, you probably have your query sequences already. Typically these come from metagenomic or metabarcoding sequencing, and are already filtered to belong to some genetic region (like 16S, 18S, or other common barcodes).

The most common question then is: for my query sequences, I want to know where they belong in terms of taxonomy. What is the taxonomic composition of my environmental sample?

epa-ng can answer this question, but it needs a handful of other programs to prepare the data and to perform any in-depth post-analysis your research might require.

Step 1: Selecting the reference sequences

This is the most biologically involved step (apart from the wet-lab work), as it requires knowledge of the environment the query sequences were sampled from, the organisms that are suspected to inhabit it, possible contaminants, and so on.

Some general advice I can give here:

  • we found it is more robust to include only a few representative sequences per species, or even per genus, rather than many. Sequences that are too similar just produce many placements of low certainty, spread across the members of such a group, and they unnecessarily increase runtime.
  • include as much diversity as possible
  • keep in mind that the tree should still be small enough to visualize (unless you reduce its size during post-processing), and that there are limits to how large a tree and alignment the programs involved can handle
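
The first piece of advice above can be sketched in code. The following is a minimal illustration, not part of epa-ng: it keeps a fixed number of sequences per genus from a candidate reference set. The header layout (`<id> <Genus> <species>`), the helper names, and the choice of "longest sequence" as the selection criterion are all assumptions made for this example; in practice you would choose representatives using your own biological knowledge.

```python
from collections import defaultdict


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def genus_of(header):
    # ASSUMPTION: hypothetical header layout "<id> <Genus> <species> ...".
    # Adapt this to however your reference database labels its sequences.
    parts = header.split()
    return parts[1] if len(parts) > 1 else "unknown"


def pick_representatives(records, per_genus=1):
    """Keep the `per_genus` longest sequences of each genus.

    Using length as the criterion is an illustrative assumption only.
    """
    by_genus = defaultdict(list)
    for header, seq in records:
        by_genus[genus_of(header)].append((header, seq))
    chosen = []
    for genus in sorted(by_genus):
        ranked = sorted(by_genus[genus], key=lambda r: len(r[1]), reverse=True)
        chosen.extend(ranked[:per_genus])
    return chosen


if __name__ == "__main__":
    example = (
        ">a1 Escherichia coli\nACGTAC\n"
        ">a2 Escherichia fergusonii\nACG\n"
        ">b1 Bacillus subtilis\nACGTA\n"
    )
    for header, seq in pick_representatives(parse_fasta(example)):
        print(header, len(seq))
```

Raising `per_genus` trades robustness of the placements against diversity within a group, which is the balance the advice above describes.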

Step 2: Building a reference alignment and tree

This can be done using standard methods. The only requirement that epa-ng currently imposes is that you record (or later recover) the model parameters with which the tree was inferred, and supply them to epa-ng when calling the main placement routine (currently only the GTRGAMMA model is supported).

More on this later.

Step 3: Aligning the query sequences

Now that you have a reference MSA, the queries need to be aligned against it. There are multiple tools to do this, but we usually recommend either hmmer/hmmalign or papara.
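
Whichever aligner you use, the aligned queries must end up with the same number of columns as the reference MSA. A quick sanity check before placement can catch mistakes early. The sketch below is not part of epa-ng or of either aligner; the function name and the in-memory `(name, sequence)` representation are assumptions chosen for illustration.

```python
def check_alignment_widths(ref_records, query_records):
    """Return (reference_width, names_of_queries_with_a_different_width).

    Both arguments are lists of (name, aligned_sequence) tuples.
    Raises ValueError if the reference MSA itself is empty or ragged.
    """
    ref_widths = {len(seq) for _, seq in ref_records}
    if len(ref_widths) != 1:
        raise ValueError(
            "reference MSA is empty or has rows of differing length: %s"
            % sorted(ref_widths)
        )
    ref_width = ref_widths.pop()
    mismatched = [name for name, seq in query_records if len(seq) != ref_width]
    return ref_width, mismatched


if __name__ == "__main__":
    # Toy data: two reference rows of width 8, one good and one bad query.
    reference = [("ref1", "ACGT--AC"), ("ref2", "ACGTTTAC")]
    queries = [("q1", "--GT--AC"), ("q2", "ACGT")]
    width, bad = check_alignment_widths(reference, queries)
    print("reference width:", width)
    print("queries with a different width:", bad)
```

A non-empty list of mismatched names usually means the queries were aligned against a different version of the reference, or were not aligned at all.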

Step 4: Placing the query sequences

Step 5 and onward: Visualization, post-analysis
