Subcommand: clean tree
Clean a tree in Newick format by removing parts that other parsers have difficulties with.
Usage: gappa prepare clean-tree [options]
Input | |
---|---|
--tree-file | Required. TEXT:FILE Tree file in Newick format. |
Settings | |
--remove-inner-labels | FLAG Some Newick trees contain inner node labels, which can confuse some parsers. This option removes them. |
--replace-invalid-chars | FLAG Replace invalid characters in node labels ( ,:;"()[] ) with underscores. The Newick format requires node labels to be wrapped in double quotation marks if they contain these characters, but many parsers cannot handle this. For such cases, replacing the characters can help. |
--remove-comments-and-nhx | FLAG The Newick format allows for comments in square brackets [] , which are also often (mis-)used for ad-hoc and more established extensions such as the New Hampshire eXtended (NHX) format [&&NHX:key=value:...] . Many parsers cannot handle this; this option removes such annotations. |
--remove-extra-numbers | FLAG The Rich/Rice Newick format extension allows annotating bootstrap values and probabilities per branch by adding :[bootstrap]:[prob] fields after the branch length. Many parsers cannot handle this; this option removes such annotations. |
--remove-jplace-tags | FLAG The Jplace file format for phylogenetic placements also uses a custom Newick extension, introducing curly brackets to annotate edge numbers in the tree {1} . We are not aware of any other Newick extension that uses this style; nevertheless, this option removes all annotations in curly brackets. |
Output | |
--out-dir | TEXT=. Directory to write output files to. |
--file-prefix | TEXT File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
--file-suffix | TEXT File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
Global Options | |
--allow-file-overwriting | FLAG Allow overwriting existing output files instead of aborting the command. |
--verbose | FLAG Produce more verbose output. |
--threads | UINT Number of threads to use for calculations. |
--log-file | TEXT Write all output to a log file, in addition to standard output to the terminal. |
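As a rough illustration of how the options above combine, the following Python sketch assembles an argument list for `gappa prepare clean-tree` with all cleaning flags switched on. The helper name `clean_tree_command` and the file names `tree.newick` and `cleaned` are hypothetical; only the flags themselves are taken from the tables above. It is a sketch of one way to script the call, not part of gappa.

```python
import shlex


def clean_tree_command(tree_file, out_dir=".", file_prefix=None, allow_overwrite=False):
    """Build the argument list for `gappa prepare clean-tree` (hypothetical helper).

    All five cleaning flags documented above are enabled.
    """
    cmd = [
        "gappa", "prepare", "clean-tree",
        "--tree-file", tree_file,
        "--remove-inner-labels",
        "--replace-invalid-chars",
        "--remove-comments-and-nhx",
        "--remove-extra-numbers",
        "--remove-jplace-tags",
    ]
    if file_prefix:
        cmd += ["--file-prefix", file_prefix]
    if allow_overwrite:
        cmd.append("--allow-file-overwriting")
    # The output directory goes last so it is easy to spot in the printed command.
    cmd += ["--out-dir", out_dir]
    return cmd


if __name__ == "__main__":
    # Print the command as it could be pasted into a shell.
    print(shlex.join(clean_tree_command("tree.newick", out_dir="cleaned")))
```

Assuming gappa is installed and on the `PATH`, the returned list could be passed to `subprocess.run(cmd, check=True)` to actually run the command.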
The command cleans a tree in Newick format (and some of its extensions) by removing parts that might lead some downstream parsers to fail.
The Newick file format for phylogenetic trees in its original standard only supports node names (taxa names) and branch lengths. Over the years, many ad-hoc and custom extensions have been suggested and used in practice to compensate for the format's lack of flexibility. This, however, has led to many downstream parsers being unable to work with all those dialects of the format; see
A Critical Review on the Use of Support Values in Tree Viewers and Bioinformatics Toolkits.
Czech L, Huerta-Cepas J, Stamatakis A.
Molecular Biology and Evolution, 34(6), 2017.
https://doi.org/10.1093/molbev/msx055
for some of the issues that might arise.
This command cleans up some of those problematic extensions and annotations by simply removing them. It is meant as a preprocessing step for software packages that cannot read a given Newick tree. When all options are activated, all types of extra data (that we know of) are removed, leaving a tree with just node names at the terminal (leaf) nodes, and branch lengths. Note that branch lengths might change slightly even if nothing is removed, due to numerical rounding.
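To make the kinds of annotations concrete, here is a minimal Python sketch that mimics the removals on a Newick string using regular expressions. It is not gappa's implementation: the function names are hypothetical, the example tree is made up, and the regexes are deliberately naive (for instance, they ignore quoted labels that themselves contain brackets or colons).

```python
import re


def replace_invalid_chars(label):
    """Replace the characters ( ,:;"()[] ) in a single node label with underscores."""
    return re.sub(r'[ ,:;"()\[\]]', "_", label)


def strip_annotations(newick):
    """Naively strip comments/NHX, jplace tags, extra numbers, and inner labels."""
    s = re.sub(r"\[[^\]]*\]", "", newick)   # [comments] and [&&NHX:...] blocks
    s = re.sub(r"\{[^}]*\}", "", s)         # jplace edge numbers such as {1}
    # Keep the first ":value" (branch length), drop any further ":value" fields.
    s = re.sub(r"(:[^:,;()]+)(?::[^:,;()]*)+", r"\1", s)
    # Drop labels that directly follow a closing parenthesis (inner node labels).
    s = re.sub(r"\)[^:,;()]+", ")", s)
    return s


if __name__ == "__main__":
    tree = "((A:0.1{0},B:0.2{1})inner:0.3:95:0.9[&&NHX:S=x]{2},C:0.4{3})root;"
    print(strip_annotations(tree))                        # ((A:0.1,B:0.2):0.3,C:0.4);
    print(replace_invalid_chars("Homo sapiens (human)"))  # Homo_sapiens__human_
```

The exact output of gappa may differ in details, for example in how branch lengths are rounded, as noted above.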
When using this method, please do not forget to cite
Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070