Subcommand: phylogenetic kmeans
Run Phylogenetic k-means clustering on a set of samples.
Usage: gappa analyze phylogenetic-kmeans [options]
Input | |
---|---|
--jplace-path | Required. TEXT:PATH(existing)=[] ... List of jplace files or directories to process. For directories, only files with the extension .jplace[.gz] are processed. |

Settings | |
---|---|
--k | Required. TEXT Number of clusters to find. Can be a comma-separated list of multiple values or ranges for k, such as "1-5,8,10,12". |
--write-overview-file | FLAG If provided, a table file is written that summarizes the average distance and variance of the clusters for each k. Useful for elbow plots. |
--point-mass | FLAG Treat every pquery as a point mass concentrated on the highest-weight placement. In other words, ignore all but the most likely placement location (the one with the highest LWR), and set its LWR to 1.0. |
--ignore-multiplicities | FLAG Set the multiplicity of each pquery to 1.0. The multiplicity is the equivalent of abundances for placements, so abundances are ignored with this flag. |
--bins | UINT=0 Bin the masses per branch in order to save time and memory, with only minor differences in the cluster assignments. Default is 0, that is, no binning. If set, we recommend using 50 bins or more. |

Color | |
---|---|
--color-list | TEXT=BuPuBk List of colors to use for the palette. Can either be the name of a color list, a file containing one color per line, or an actual comma-separated list of colors. Colors can be specified in the format #rrggbb using hex values, or by web color names. |
--reverse-color-list | FLAG If set, the order of colors of the --color-list is reversed. |
--log-scaling | FLAG If set, the sequential color list is scaled logarithmically instead of linearly. |

Output | |
---|---|
--out-dir | TEXT=. Directory to write output files to. |
--file-prefix | TEXT=pkmeans_ File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
--file-suffix | TEXT File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |

Tree Output | |
---|---|
--write-newick-tree | FLAG If set, the tree is written to a Newick file. This format cannot store color information. |
--write-nexus-tree | FLAG If set, the tree is written to a Nexus file. This can for example be opened in FigTree. |
--write-phyloxml-tree | FLAG If set, the tree is written to a Phyloxml file. This can for example be used in Archaeopteryx. |
--write-svg-tree | FLAG If set, the tree is written to an SVG file. This gives a file for vector graphics editors. |

Newick Tree Output | |
---|---|
--newick-tree-branch-length-precision | INT=6 Needs: --write-newick-tree Number of digits to print for branch lengths in Newick format. |
--newick-tree-quote-invalid-chars | FLAG Needs: --write-newick-tree If set, node labels that contain characters that are invalid in the Newick format (i.e., spaces and :;()[],{}) are put into quotation marks. If not set (default), these characters are instead replaced by underscores, which changes the names, but works better with most downstream tools. |

Svg Tree Output | |
---|---|
--svg-tree-shape | TEXT:{circular,rectangular}=circular Needs: --write-svg-tree Shape of the tree. |
--svg-tree-type | TEXT:{cladogram,phylogram}=cladogram Needs: --write-svg-tree Type of the tree, either using branch lengths (phylogram), or not (cladogram). |
--svg-tree-stroke-width | FLOAT=5 Needs: --write-svg-tree Svg stroke width for the branches of the tree. |
--svg-tree-ladderize | FLAG Needs: --write-svg-tree If set, the tree is ladderized. |

Global Options | |
---|---|
--allow-file-overwriting | FLAG Allow overwriting existing output files instead of aborting the command. |
--verbose | FLAG Produce more verbose output. |
--threads | UINT Number of threads to use for calculations. |
--log-file | TEXT Write all output to a log file, in addition to standard output to the terminal. |
The command runs a Phylogenetic k-means clustering on a set of jplace files (called samples). The aim is to group samples that are similar to each other regarding the Phylogenetic KR distance. This is for example useful to find structure in a set of samples from different locations or points in time.
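As a minimal sketch of how a run could be assembled from a script, the following Python snippet builds the argument list for the command using only options documented above. The function name `build_pkmeans_command`, the input directory `samples/`, and the output directory `pkmeans_out` are hypothetical and only serve as illustration. The snippet prints the command line and does not execute gappa.

```python
import shlex


def build_pkmeans_command(jplace_paths, k_spec, out_dir=".", overview=False, svg_tree=False):
    """Assemble the argument list for `gappa analyze phylogenetic-kmeans`.

    Only options listed in the tables above are used. This helper is a
    hypothetical convenience wrapper, not part of gappa itself.
    """
    cmd = [
        "gappa", "analyze", "phylogenetic-kmeans",
        "--jplace-path", *jplace_paths,   # one or more jplace files or directories
        "--k", k_spec,                    # e.g. "1-6,10,15"
        "--out-dir", out_dir,
    ]
    if overview:
        cmd.append("--write-overview-file")
    if svg_tree:
        cmd.append("--write-svg-tree")
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_pkmeans_command(["samples/"], "1-6,10,15",
                            out_dir="pkmeans_out", overview=True, svg_tree=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once gappa is installed and the paths exist.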
It is often not obvious what the "natural" number of clusters of a set of samples is. It therefore makes sense to try different values for k and explore how the clustering changes. Techniques like the Elbow Method can then be used to estimate a reasonable number of clusters. See below for more on that.
To this end, the option --k accepts multiple values, separated by commas, as well as ranges of numbers, specified via a dash. This is similar to how specific pages can be selected in common software before printing.
Example: --k 1-6,10,15
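To make the list-and-range syntax concrete, here is a small Python sketch that expands such a specification into the individual values of k. It is an illustration of the syntax only, not gappa's own parser. How gappa treats duplicates, ordering, or malformed input is not stated here, so the sorting and deduplication below are assumptions.

```python
def expand_k_spec(spec):
    """Expand a spec such as "1-6,10,15" into a sorted list of unique integers.

    Illustrative only: sorting and deduplication are assumptions, not
    documented gappa behavior.
    """
    values = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            low, high = (int(x) for x in part.split("-", 1))
            if low > high:
                raise ValueError(f"invalid range: {part!r}")
            values.update(range(low, high + 1))  # ranges are inclusive
        else:
            values.add(int(part))
    return sorted(values)


print(expand_k_spec("1-6,10,15"))  # [1, 2, 3, 4, 5, 6, 10, 15]
```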
For each specified k, the result of the clustering is written to an assignment table, which lists for each sample the cluster number it was assigned to, as well as the distance (Phylogenetic KR distance) from the sample to the centroid of that cluster. The cluster numbers are zero-based, and thus span the range [0, k-1].
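The following Python sketch shows one way such an assignment table could be post-processed, by grouping samples per cluster. The exact file layout is not specified on this page, so the tab delimiter, the column names `Sample`, `Cluster`, and `Distance`, and the example rows are all assumptions for illustration. Adjust them to match the file that gappa actually writes.

```python
import csv
import io
from collections import defaultdict

# Hypothetical table contents: the real column names, delimiter, and values may differ.
EXAMPLE_TABLE = (
    "Sample\tCluster\tDistance\n"
    "sample_a\t0\t0.12\n"
    "sample_b\t1\t0.30\n"
    "sample_c\t0\t0.08\n"
)


def group_by_cluster(table_text, delimiter="\t"):
    """Map each zero-based cluster number to its (sample, distance) pairs."""
    clusters = defaultdict(list)
    for row in csv.DictReader(io.StringIO(table_text), delimiter=delimiter):
        clusters[int(row["Cluster"])].append((row["Sample"], float(row["Distance"])))
    return dict(clusters)


groups = group_by_cluster(EXAMPLE_TABLE)
for cluster_id in sorted(groups):
    print(cluster_id, groups[cluster_id])
```

For a run with a given k, every key of the resulting dictionary should lie in the range [0, k-1].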
If furthermore an output tree format is specified (via one of the --write-...-tree options), the centroids of each cluster are visualized as mass trees. That is, the average mass distribution of all samples that were assigned to a cluster is calculated and visualized on the tree. This is useful to explore what each cluster represents, that is, how the samples were clustered.
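The idea of a centroid as an average mass distribution can be sketched as follows. This is a deliberately simplified illustration and not gappa's implementation: it assumes each sample is reduced to a normalized mass per branch (a dictionary from a hypothetical branch index to mass) and ignores where along a branch the mass sits.

```python
def average_masses(samples):
    """Average per-branch masses over samples.

    Each sample maps a branch id to the mass on that branch; branches that a
    sample does not mention count as mass 0.0. Simplified illustration only.
    """
    if not samples:
        raise ValueError("need at least one sample")
    branches = set().union(*samples)  # all branch ids seen in any sample
    return {
        branch: sum(sample.get(branch, 0.0) for sample in samples) / len(samples)
        for branch in sorted(branches)
    }


# Two toy samples assigned to the same cluster (made-up masses).
cluster_samples = [{0: 0.5, 1: 0.5}, {1: 1.0}]
print(average_masses(cluster_samples))  # {0: 0.25, 1: 0.75}
```

If every input sample has a total mass of 1.0, the averaged distribution also sums to 1.0, which is what allows it to be drawn on the tree like a single sample.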
If multiple values for k are specified (see above), the option --write-overview-file can be used to write an overview table that lists for each value of k the average distance and variance from each sample to its assigned cluster centroid. This table can be plotted directly, for example as an elbow plot for the Elbow Method.
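As a rough sketch of how the overview values could be used, the Python snippet below picks an elbow from pairs of k and average distance. gappa itself does not select k. The heuristic shown (the point farthest from the straight line between the first and last point, after scaling both axes to [0, 1]) is one common choice among several, and the numbers are made up for illustration.

```python
def elbow_k(ks, avg_distances):
    """Return the k whose point lies farthest from the chord between the
    first and last (k, distance) points, after scaling both axes to [0, 1]."""
    if len(ks) != len(avg_distances) or len(ks) < 3:
        raise ValueError("need at least three (k, distance) pairs")
    pts = sorted(zip(ks, avg_distances))
    k_min, k_max = pts[0][0], pts[-1][0]
    d_vals = [d for _, d in pts]
    d_min, d_max = min(d_vals), max(d_vals)
    if k_max == k_min or d_max == d_min:
        raise ValueError("values must not be constant")
    norm = [((k - k_min) / (k_max - k_min), (d - d_min) / (d_max - d_min))
            for k, d in pts]
    (x0, y0), (x1, y1) = norm[0], norm[-1]

    def offset(point):
        # Proportional to the perpendicular distance from the chord.
        x, y = point
        return abs((x1 - x0) * (y0 - y) - (x0 - x) * (y1 - y0))

    best = max(range(len(pts)), key=lambda i: offset(norm[i]))
    return pts[best][0]


# Made-up overview values, for illustration only.
ks = [1, 2, 3, 4, 5, 6]
avg_distances = [10.0, 6.0, 3.0, 2.5, 2.25, 2.0]
print(elbow_k(ks, avg_distances))  # 3
```

Such a heuristic is best treated as a starting point and checked visually against the plotted curve.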
When using this method, please do not forget to cite:
Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070
Lucas Czech, Alexandros Stamatakis. Scalable Methods for Analyzing and Visualizing Phylogenetic Placement of Metagenomic Samples. PLOS ONE, 2019. doi:10.1371/journal.pone.0217050