Subcommand: phylogenetic kmeans

Run Phylogenetic k-means clustering on a set of samples.

Usage: gappa analyze phylogenetic-kmeans [options]

Options

Input
`--jplace-path`	Required. `TEXT:PATH(existing)=[] ...` List of jplace files or directories to process. For directories, only files with the extension `.jplace[.gz]` are processed.
Settings
`--k`	Required. `TEXT` Number of clusters to find. Can be a comma-separated list of multiple values or ranges for k, such as `"1-5,8,10,12"`
`--write-overview-file`	`FLAG` If provided, a table file is written that summarizes the average distance and variance of the clusters for each k. Useful for elbow plots.
`--point-mass`	`FLAG` Treat every pquery as a point mass concentrated on the highest-weight placement. In other words, ignore all but the most likely placement location (the one with the highest LWR), and set its LWR to 1.0.
`--ignore-multiplicities`	`FLAG` Set the multiplicity of each pquery to 1.0. The multiplicity is the equivalent of abundances for placements, and is hence ignored with this flag.
`--bins`	`UINT=0` Bin the masses per-branch in order to save time and memory, with only minor differences in the cluster assignments. Default is 0, that is, no binning. If set, we recommend using 50 bins or more.
Color
`--color-list`	`TEXT=BuPuBk` List of colors to use for the palette. Can either be the name of a color list, a file containing one color per line, or an actual comma-separated list of colors. Colors can be specified in the format `#rrggbb` using hex values, or by web color names.
`--reverse-color-list`	`FLAG` If set, the order of colors of the `--color-list` is reversed.
`--log-scaling`	`FLAG` If set, the sequential color list is logarithmically scaled instead of linearly.
Output
`--out-dir`	`TEXT=.` Directory to write output files to.
`--file-prefix`	`TEXT=pkmeans_` File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
`--file-suffix`	`TEXT` File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
Tree Output
`--write-newick-tree`	`FLAG` If set, the tree is written to a Newick file. This format cannot store color information.
`--write-nexus-tree`	`FLAG` If set, the tree is written to a Nexus file. This can for example be opened in FigTree.
`--write-phyloxml-tree`	`FLAG` If set, the tree is written to a Phyloxml file. This can for example be used in Archaeopteryx.
`--write-svg-tree`	`FLAG` If set, the tree is written to an SVG file. This gives a file for vector graphics editors.
Newick Tree Output
`--newick-tree-branch-length-precision`	`INT=6 Needs: --write-newick-tree` Number of digits to print for branch lengths in Newick format.
`--newick-tree-quote-invalid-chars`	`FLAG Needs: --write-newick-tree` If set, node labels that contain characters that are invalid in the Newick format (i.e., spaces and `:;()[],{}`) are put into quotation marks. If not set (default), these characters are instead replaced by underscores, which changes the names, but works better with most downstream tools.
Svg Tree Output
`--svg-tree-shape`	`TEXT:{circular,rectangular}=circular Needs: --write-svg-tree` Shape of the tree.
`--svg-tree-type`	`TEXT:{cladogram,phylogram}=cladogram Needs: --write-svg-tree` Type of the tree, either using branch lengths (`phylogram`), or not (`cladogram`).
`--svg-tree-stroke-width`	`FLOAT=5 Needs: --write-svg-tree` Svg stroke width for the branches of the tree.
`--svg-tree-ladderize`	`FLAG Needs: --write-svg-tree` If set, the tree is ladderized.
Global Options
`--allow-file-overwriting`	`FLAG` Allow overwriting existing output files instead of aborting the command.
`--verbose`	`FLAG` Produce more verbose output.
`--threads`	`UINT` Number of threads to use for calculations.
`--log-file`	`TEXT` Write all output to a log file, in addition to standard output to the terminal.

Description

The command runs a Phylogenetic k-means clustering on a set of jplace files (called samples). The aim is to group samples that are similar to each other regarding the Phylogenetic KR distance. This is for example useful to find structure in a set of samples from different locations or points in time.

Details

Values for `k`

It is often not obvious what the "natural" number of clusters of a set of samples is. To this end, it makes sense to try different values for k and explore how the clustering changes. Then, techniques like the Elbow Method can be used to estimate a reasonable number of clusters. See below for more on that.

For this purpose, the option --k accepts multiple values, separated by commas, as well as ranges of numbers, specified via a dash. This is similar to how specific pages can be selected in common software before printing.

Example: --k 1-6,10,15
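
To illustrate how such a specification expands into individual values of k, here is a minimal Python sketch. It is not gappa's actual parser: the function name `expand_k_spec` is hypothetical, and details such as removing duplicates, sorting, and tolerating whitespace are assumptions made for the illustration.

```python
def expand_k_spec(spec):
    """Expand a string such as "1-6,10,15" into a sorted list of unique k values.

    Hypothetical helper for illustration; not part of gappa.
    """
    values = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # assumption: silently skip empty entries
        if "-" in part:
            # A range such as "1-6" includes both end points.
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError("invalid range: " + part)
            values.update(range(first, last + 1))
        else:
            values.add(int(part))
    return sorted(values)


print(expand_k_spec("1-6,10,15"))
```

For the example above, this yields the k values 1 through 6, followed by 10 and 15.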

Output Format

For each specified k, the result of the clustering is written to an assignment table, which lists for each sample the cluster number it was grouped into, as well as the distance (Phylogenetic KR distance) from the sample to the centroid of the cluster. The cluster numbers are zero-based, and thus span the range [0, k-1].

Centroid Trees

If furthermore an output tree format is specified (via one of the --write-...-tree options), the centroids of each cluster are visualized as mass trees. That is, the average mass distribution of all samples that were assigned to a cluster is calculated and visualized on the tree. This is useful to explore what each cluster represents, that is, how the samples were clustered.
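
The idea of a centroid as an average mass distribution can be sketched as follows. This is a simplified illustration, not gappa's implementation: the function name `centroid_masses` and the branch names are hypothetical, each sample is assumed to be reduced to one normalized mass value per branch, and positions of masses along a branch are ignored.

```python
from collections import defaultdict


def centroid_masses(samples):
    """Average the per-branch masses of all samples in one cluster.

    Each sample is a dict mapping a branch identifier to the mass on that
    branch. Branches missing from a sample count as mass 0 for that sample.
    Hypothetical helper for illustration; not part of gappa.
    """
    if not samples:
        raise ValueError("cluster has no samples")
    totals = defaultdict(float)
    for sample in samples:
        for branch, mass in sample.items():
            totals[branch] += mass
    return {branch: total / len(samples) for branch, total in totals.items()}


# Made-up example data: two samples assigned to the same cluster.
cluster = [
    {"branch_a": 0.5, "branch_b": 0.5},
    {"branch_a": 1.0},
]
print(centroid_masses(cluster))
```

In this made-up example, the centroid carries three quarters of the mass on `branch_a` and one quarter on `branch_b`.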

Multiple `k` and Overview File

If multiple values for k are specified (see above), the option --write-overview-file can be used to write an overview table that lists for each value of k the average distance and variance from each sample to its assigned cluster centroid. This table can directly be visualized to create plots for techniques such as the Elbow Method.
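
As a rough illustration of how an elbow could be picked from such values, the following Python sketch selects the k at which the curve of average distances bends most sharply. This heuristic is only one of several possible ones and is not part of gappa. The function name `elbow_k` is hypothetical, the numbers are made up, and since the exact column layout of the overview file is not described here, the sketch takes plain lists rather than reading the file.

```python
def elbow_k(ks, avg_distances):
    """Return the k where the slope of the distance curve changes the most.

    ks must be strictly increasing and have the same length as
    avg_distances. Hypothetical helper for illustration; not part of gappa.
    """
    if len(ks) != len(avg_distances) or len(ks) < 3:
        raise ValueError("need at least three (k, distance) pairs")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        # Slopes of the segments to the left and right of point i.
        left = (avg_distances[i] - avg_distances[i - 1]) / (ks[i] - ks[i - 1])
        right = (avg_distances[i + 1] - avg_distances[i]) / (ks[i + 1] - ks[i])
        bend = right - left  # large when a steep drop flattens out
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Made-up values, not real gappa output.
ks = [1, 2, 3, 4, 5, 6]
avg_distances = [10.0, 6.0, 3.0, 2.5, 2.2, 2.0]
print(elbow_k(ks, avg_distances))
```

In practice, plotting the average distance against k and inspecting the curve by eye is often just as informative as an automatic heuristic.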

Citation

When using this method, please do not forget to cite

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070

Lucas Czech, Alexandros Stamatakis. Scalable Methods for Analyzing and Visualizing Phylogenetic Placement of Metagenomic Samples. PLOS ONE, 2019. doi:10.1371/journal.pone.0217050
