Subcommand: phat

Generate consensus sequences from a sequence database according to the PhAT method.

Usage: gappa prepare phat [options]

Options

Input
`--taxonomy-file`	Required. `TEXT:FILE` File that lists the taxa of the database.
`--sequence-file`	Required. `TEXT:FILE` Fasta file containing the sequences of the database.
Taxonomy Expansion
`--target-size`	Required. `UINT=0` Target size of how many taxa to select for building consensus sequences.
`--sub-taxonomy`	`TEXT` If a taxopath from the taxonomy is provided, only the respective sub-taxonomy is used.
`--min-subclade-size`	`UINT=0` Minimal size of sub-clades. Everything below is expanded.
`--max-subclade-size`	`UINT=0` Maximal size of a non-expanded sub-clade. Everything bigger is first expanded.
`--min-tax-level`	`UINT=0` Minimal taxonomic level. Taxa below this level are always expanded.
`--allow-approximation`	`FLAG` Allow expanding taxa that help to get closer to the --target-size, even if they are not the ones with the highest entropy.
`--no-taxa-selection`	`FLAG` If set, no taxa selection using entropy is performed. Instead, all taxa on all levels/ranks are used and consensus sequences for all of them are calculated. This is useful for testing and for trying out new ideas.
Consensus Method
`--consensus-method`	`TEXT:{majorities,cavener,threshold}=majorities` Consensus method to use for combining sequences.
`--consensus-threshold`	`FLOAT:FLOAT in [0 - 1]=0.5 Needs: --consensus-method` Threshold value to use with --consensus-method threshold. Has to be in [ 0.0, 1.0 ].
Output
`--out-dir`	`TEXT=.` Directory to write output files to.
`--file-prefix`	`TEXT` File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
`--file-suffix`	`TEXT` File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
`--write-info-files`	`FLAG` If set, two additional info files are written, containing the new pruned taxonomy, as well as the entropy of all clades of the original taxonomy.
Global Options
`--allow-file-overwriting`	`FLAG` Allow to overwrite existing output files instead of aborting the command.
`--verbose`	`FLAG` Produce more verbose output.
`--threads`	`UINT` Number of threads to use for calculations.
`--log-file`	`TEXT` Write all output to a log file, in addition to standard output to the terminal.

Description

Given a set of sequences and a fitting taxonomy, the command produces consensus sequences representing taxonomic clades, according to our PhAT method as described in the publication cited below (Czech et al., 2018). The main inputs are --sequence-file and --taxonomy-file, which provide the input data, and --target-size, which sets how many consensus sequences to build.

Figure: PhAT workflow.

After running the command, the resulting set of sequences can be used to infer a reference tree using any tree inference program.
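
As a rough illustration of how the main inputs fit together, the following Python sketch assembles a command line from the options listed above. The helper name `build_phat_command`, the file names, the output directory, and the target size of 500 are hypothetical placeholders, not defaults or recommendations of gappa.

```python
import shlex


def build_phat_command(taxonomy_file, sequence_file, target_size, out_dir="."):
    """Assemble the argument list for `gappa prepare phat` (hypothetical helper)."""
    return [
        "gappa", "prepare", "phat",
        "--taxonomy-file", taxonomy_file,
        "--sequence-file", sequence_file,
        "--target-size", str(target_size),
        "--out-dir", out_dir,
    ]


# Placeholder inputs; replace them with your own files and target size.
cmd = build_phat_command("taxonomy.tsv", "sequences.fasta", 500, "phat_out")

# Print the command as it would be typed in a shell.
print(shlex.join(cmd))

# To actually run it, gappa has to be installed and on the PATH:
# import subprocess
# subprocess.run(cmd, check=True)
```

Keeping the arguments in a list avoids quoting problems if a file name contains spaces.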

Details

`--taxonomy-file`

The taxonomy file needs to contain a list of the taxa used for the taxonomic expansion algorithm. Each line of the file lists one taxonomic clade as a semicolon-separated path. Everything after the first tab is ignored, such as the additional columns in the example below.

Example:

Eukaryota;	4	domain
Eukaryota;Amoebozoa;	4052	kingdom		119
Eukaryota;Amoebozoa;Myxogastria;	4094	phylum		119
Eukaryota;Amoebozoa;Myxogastria;Amaurochaete;	4095	genus		119
Eukaryota;Amoebozoa;Myxogastria;Badhamia;	4096	genus		119
Eukaryota;Amoebozoa;Myxogastria;Brefeldia;	4097	genus		119
Eukaryota;Amoebozoa;Myxogastria;Comatricha;	4098	genus		119
...
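
To make the line format concrete, here is a minimal Python sketch of how such a line could be read, assuming only the rules stated above: the path ends at the first tab, and taxa are separated by semicolons. The function name `parse_taxonomy_line` is hypothetical, and this is not gappa's own parser.

```python
def parse_taxonomy_line(line: str) -> list[str]:
    """Return the taxon names of one taxonomy-file line (illustrative only)."""
    # Only the part before the first tab is the taxonomic path.
    taxopath = line.rstrip("\n").split("\t", 1)[0]
    # Split on semicolons; the trailing semicolon yields an empty item, which is dropped.
    return [name.strip() for name in taxopath.split(";") if name.strip()]


example = "Eukaryota;Amoebozoa;Myxogastria;\t4094\tphylum\t\t119"
print(parse_taxonomy_line(example))
# → ['Eukaryota', 'Amoebozoa', 'Myxogastria']
```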

`--sequence-file`

The sequence file needs to be in fasta format, and contain sequences that are labelled with the taxonomic path that they belong to. This taxonomic path can either be the whole label, or everything after the first whitespace (space or tab). This allows sequences to carry a unique identifier as the first part of the label.

For example, sequences in the Silva database are labelled like this:

>AY842031.1.1855 Eukaryota;Amoebozoa;Myxogastria;Amaurochaete;Amaurochaete comata
>JQ031957.1.4380 Eukaryota;Amoebozoa;Myxogastria;Brefeldia;Brefeldia maxima

In the example, each label starts with a unique identifier, followed by a space and the taxonomic path the sequence belongs to. The path contains an additional taxonomic level which is not present in the taxonomy. If this occurs, the last level is assumed to be species level, and is removed from the path. The resulting taxonomic path is part of the taxonomy, and hence the sequence can be used.
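
The label handling described above can be sketched in Python as follows. The function names `split_label` and `resolve_path` are hypothetical, and the sketch only covers what is stated here: an identifier-first label is split at the first whitespace, a label without whitespace is taken as the path, and a path that is not in the taxonomy is retried without its last level. How gappa itself handles a label that is entirely a taxonomic path but contains a space is not specified here, so that case is left out.

```python
from typing import Optional


def split_label(label: str) -> tuple[str, str]:
    """Split a fasta label into (identifier, taxonomic path)."""
    parts = label.lstrip(">").strip().split(None, 1)
    if len(parts) == 2:
        return parts[0], parts[1]
    # No whitespace: treat the whole label as the taxonomic path.
    return "", (parts[0] if parts else "")


def resolve_path(taxopath: str, taxonomy: set) -> Optional[tuple]:
    """Map a sequence's path onto the taxonomy, dropping a trailing species level."""
    names = tuple(n.strip() for n in taxopath.split(";") if n.strip())
    if names in taxonomy:
        return names
    # Assume the last level is the species and retry without it.
    if names and names[:-1] in taxonomy:
        return names[:-1]
    return None  # the sequence cannot be assigned to any taxon


# Taxonomy as a set of name tuples, here with a single genus-level entry.
taxonomy = {("Eukaryota", "Amoebozoa", "Myxogastria", "Brefeldia")}
label = ">JQ031957.1.4380 Eukaryota;Amoebozoa;Myxogastria;Brefeldia;Brefeldia maxima"

identifier, path = split_label(label)
print(identifier, resolve_path(path, taxonomy))
```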

`--sub-taxonomy`

If provided with a semicolon-separated taxonomic path (e.g., Eukaryota;Amoebozoa;), only this subclade is used for the algorithm. That is, the algorithm behaves as if the --taxonomy-file and --sequence-file only contained the taxa and sequences of the provided clade.
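
The effect of this option can be pictured as a prefix filter on taxonomic paths. The Python sketch below is an illustration under that assumption, with hypothetical function names; it compares whole taxon names rather than raw strings, so that a name which merely starts with the same characters does not match. Treating the root of the given clade as part of the sub-taxonomy is also an assumption.

```python
def to_names(taxopath: str) -> list[str]:
    """Split a semicolon-separated path into its taxon names."""
    return [name.strip() for name in taxopath.split(";") if name.strip()]


def in_sub_taxonomy(taxopath: str, sub_taxonomy: str) -> bool:
    """True if taxopath lies within (or equals) the given sub-taxonomy."""
    names = to_names(taxopath)
    prefix = to_names(sub_taxonomy)
    return names[:len(prefix)] == prefix


paths = [
    "Eukaryota;",
    "Eukaryota;Amoebozoa;",
    "Eukaryota;Amoebozoa;Myxogastria;Badhamia;",
]
selected = [p for p in paths if in_sub_taxonomy(p, "Eukaryota;Amoebozoa;")]
print(selected)
# → ['Eukaryota;Amoebozoa;', 'Eukaryota;Amoebozoa;Myxogastria;Badhamia;']
```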

Citation

When using this method, please do not forget to cite

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Methods for Automatic Reference Trees and Multilevel Phylogenetic Placement. Bioinformatics, 2018. doi:10.1093/bioinformatics/bty767