Subcommand: extract
Extract placements from clades of the tree and write per-clade jplace files.
Usage: gappa prepare extract [options]
Input | |
---|---|
--jplace-path | Required. TEXT:PATH(existing)=[] ... List of jplace files or directories to process. For directories, only files with the extension .jplace[.gz] are processed. |
--clade-list-file | Required. TEXT:FILE File containing a tab-separated list of taxon to clade mapping. |
--fasta-path | TEXT:PATH(existing)=[] ... List of fasta files or directories to process. For directories, only files with one of the extensions .fasta, .fas, .fsa, .fna, .ffn, .faa, .frn (each optionally followed by .gz) are processed. |
Settings | |
--threshold | FLOAT:FLOAT in [0.5 - 1]=0.95 Threshold of how much placement mass needs to be in a clade for extracting a pquery. |
--point-mass | FLAG Treat every pquery as a point mass concentrated on the highest-weight placement. In other words, ignore all but the most likely placement location (the one with the highest LWR), and set its LWR to 1.0. |
Output | |
--color-tree-file | TEXT:PATH(non-existing) If a path is provided, an svg file with a tree colored by clades is written. |
--samples-out-dir | TEXT=samples Directory to write output samples files to. |
--samples-file-prefix | TEXT File prefix for samples files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
--samples-file-suffix | TEXT File suffix for samples files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
--sequences-out-dir | TEXT=sequences Directory to write output sequences files to. |
--sequences-file-prefix | TEXT File prefix for sequences files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
--sequences-file-suffix | TEXT File suffix for sequences files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
Global Options | |
--allow-file-overwriting | FLAG Allow to overwrite existing output files instead of aborting the command. |
--verbose | FLAG Produce more verbose output. |
--threads | UINT Number of threads to use for calculations. |
--log-file | TEXT Write all output to a log file, in addition to standard output to the terminal. |
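To make the effect of --point-mass concrete, here is a minimal Python sketch. It is not gappa's implementation: the representation of a pquery as a list of (edge, LWR) pairs and the function name to_point_mass are assumptions made for illustration only.

```python
def to_point_mass(placements):
    """Collapse a pquery to its most likely placement.

    `placements` is a hypothetical list of (edge_id, lwr) pairs.
    Only the pair with the highest LWR is kept, and its LWR is set to 1.0.
    """
    if not placements:
        return []
    best_edge, _ = max(placements, key=lambda pair: pair[1])
    return [(best_edge, 1.0)]


# Illustrative pquery with three placement locations.
pquery = [(12, 0.6), (13, 0.3), (40, 0.1)]
print(to_point_mass(pquery))  # prints [(12, 1.0)]
```

With a point mass, the whole mass of a pquery always falls on a single branch, so the pquery can never be spread across several clades.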
The command extracts the queries that are placed in specified clades of the reference tree and writes per-clade jplace files. It is mainly intended for the multilevel placement approach described in the second publication cited below. It can also be used for any other purpose where only the placements in a specific clade are of interest.
The command takes one or more jplace files as input, as well as a file describing clades of the reference tree used in the jplace files.
It then finds all placements in those clades and writes per-clade placement files, each of them containing only those placements that had more of their mass (likelihood weights) in that clade than specified by --threshold.
Furthermore, two special clades are produced: basal_branches, which collects all placements that have their mass on branches that do not belong to any clade, and uncertain, which collects placements where no clade (including the basal clade) has more than the threshold amount of the mass.
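The threshold rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not gappa's code: a pquery is modeled as a list of (edge, LWR) pairs, edge_to_clade is a hypothetical mapping from branch to clade, the LWRs are summed as given without normalization, and the comparison against the threshold is taken to be inclusive. The clade name Rhizaria is made up for the example.

```python
BASAL = "basal_branches"
UNCERTAIN = "uncertain"


def classify_pquery(placements, edge_to_clade, threshold=0.95):
    """Return the clade that a pquery would be extracted into.

    Branches missing from `edge_to_clade` count as basal branches.
    Assumption: the comparison is `>=`; the real tool may differ.
    """
    mass = {}
    for edge, lwr in placements:
        clade = edge_to_clade.get(edge, BASAL)
        mass[clade] = mass.get(clade, 0.0) + lwr
    if not mass:
        return UNCERTAIN
    best = max(mass, key=mass.get)
    return best if mass[best] >= threshold else UNCERTAIN


# Hypothetical branch-to-clade assignment; branch 9 belongs to no clade.
edge_to_clade = {1: "Alveolata", 2: "Alveolata", 5: "Rhizaria"}

print(classify_pquery([(1, 0.7), (2, 0.28), (9, 0.02)], edge_to_clade))
print(classify_pquery([(9, 0.97), (1, 0.03)], edge_to_clade))
print(classify_pquery([(1, 0.5), (5, 0.5)], edge_to_clade))
```

The three calls illustrate the three possible outcomes: a named clade, basal_branches, and uncertain. Because the threshold is at least 0.5, at most one clade can reach it unless two clades tie at exactly 0.5.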
Furthermore, if a set of fasta sequence files is provided, the command also creates per-clade fasta files, containing the sequences corresponding to the placements of the jplace files. This requires that the sequences have the same names as the placements, which is the case if the placement files are simply the result of placing these sequences on a reference tree.
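The name matching between sequences and placements can be illustrated with a small Python sketch. It is a simplification, not the actual implementation: the helper names read_fasta and split_by_clade are hypothetical, the sequence name is assumed to be the first whitespace-delimited token of the header line, and sequences without a placement are simply skipped.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of fasta lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Assumption: the name is the first token of the header.
            name, chunks = (header[0] if header else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_by_clade(records, name_to_clade):
    """Group sequences by the clade their placement was assigned to."""
    per_clade = {}
    for name, seq in records:
        clade = name_to_clade.get(name)
        if clade is None:
            continue  # no placement with this name
        per_clade.setdefault(clade, []).append((name, seq))
    return per_clade


# Illustrative input: q3 has no placement and is therefore dropped.
fasta = [">q1 some description", "ACGT", "AC", ">q2", "GGTT", ">q3", "TT"]
assignment = {"q1": "Alveolata", "q2": "uncertain"}
print(split_by_clade(read_fasta(fasta), assignment))
```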
The algorithm assigns a clade to each of the branches of the reference tree (either one of the specified ones, or the basal clade).
For terminal branches (leaves), the assigned clade is simply as specified in the --clade-list-file.
Inner branches are assigned to a clade if all leaves on one side of the split that is induced by the branch belong to the same clade.
In other words, all branches of a subtree that contains only taxa from one clade are assigned to that clade.
See the figure below for an example.
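The branch assignment rule can be sketched as follows. This is one possible reading of the description above, not the code used by gappa: the tree is given as a hypothetical dict from each node to its list of children, each branch is identified by the node at its lower end, and the node names (A to E, n1, n2, root) are invented for the example. For each branch, the leaves below it form one side of the split and all remaining leaves form the other side.

```python
from collections import Counter

BASAL = "basal_branches"


def assign_branch_clades(children, root, leaf_clade):
    """Map each non-root node to the clade of the branch above it."""
    below = {}  # node -> Counter of clades among the leaves below it

    def collect(node):
        kids = children.get(node, [])
        if not kids:
            # Leaves not listed in the clade file are basal.
            counts = Counter([leaf_clade.get(node, BASAL)])
        else:
            counts = Counter()
            for kid in kids:
                counts += collect(kid)
        below[node] = counts
        return counts

    total = collect(root)
    result = {}
    for node, inside in below.items():
        if node == root:
            continue
        outside = total - inside  # leaves on the other side of the split
        result[node] = BASAL
        for side in (inside, outside):
            if len(side) == 1:  # all leaves on this side share one clade
                result[node] = next(iter(side))
                break
    return result


# Hypothetical tree: ((A,B)n1,(C,D)n2,E)root
children = {"root": ["n1", "n2", "E"], "n1": ["A", "B"], "n2": ["C", "D"]}
leaf_clade = {"A": "Alveolata", "B": "Alveolata", "C": "Alveolata"}
for node, clade in sorted(assign_branch_clades(children, "root", leaf_clade).items()):
    print(node, clade)
```

In this example, the branches of A, B and n1 form a subtree assigned to Alveolata, while C is assigned to Alveolata as a single branch, because its sibling D is not part of the clade and the branch of n2 therefore stays basal.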
The file given via --clade-list-file describes which taxa of the reference tree are considered to belong to which clade. Each line of the file needs to contain a taxon name of the tree and the name of the clade it belongs to, separated by a tab:
AF401522_Carchesium_polypinum Alveolata
X56165_Tetrahymena_thermophila Alveolata
X03772_Paramecium_tetraurelia Alveolata
...
Not all taxa of the reference tree have to be part of the file; all missing ones are simply considered to be part of the special basal clade.
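For readers who want to check such a file before running the command, the format can be read with a short Python sketch. The function name parse_clade_list is hypothetical, and skipping blank lines and rejecting lines without exactly one tab are assumptions of this sketch rather than documented behavior of gappa.

```python
BASAL = "basal_branches"


def parse_clade_list(lines):
    """Parse 'taxon<TAB>clade' lines into a dict from taxon to clade."""
    taxon_to_clade = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 'taxon<TAB>clade'")
        taxon, clade = (field.strip() for field in fields)
        taxon_to_clade[taxon] = clade
    return taxon_to_clade


lines = [
    "AF401522_Carchesium_polypinum\tAlveolata\n",
    "X56165_Tetrahymena_thermophila\tAlveolata\n",
    "X03772_Paramecium_tetraurelia\tAlveolata\n",
]
clades = parse_clade_list(lines)

# Taxa that are missing from the file fall back to the basal clade.
print(clades.get("X03772_Paramecium_tetraurelia", BASAL))
print(clades.get("some_unlisted_taxon", BASAL))
```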
If --color-tree-file is given an output file name, an svg file is written that shows which branches of the tree were assigned to which clade.
This is useful to verify the process and to make sure that the correct branches were selected. In the figure, the basal branches are gray, while three exemplary clades are marked in color.
The behavior of selecting branches so that their subtrees are monophyletic with respect to a clade is visible here as well: For example, the green clade is split into two subtrees and a few single branches.
When using this method, please do not forget to cite:
Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070
Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Methods for Automatic Reference Trees and Multilevel Phylogenetic Placement. Bioinformatics, 2018. doi:10.1093/bioinformatics/bty767