
Subcommand: phat

Lucas Czech edited this page Jan 4, 2022 · 9 revisions

Generate consensus sequences from a sequence database according to the PhAT method.

Usage: gappa prepare phat [options]

Options

Input
--taxonomy-file Required. TEXT:FILE
File that lists the taxa of the database.
--sequence-file Required. TEXT:FILE
Fasta file containing the sequences of the database.
Taxonomy Expansion
--target-size Required. UINT=0
Target number of taxa to select for building consensus sequences.
--sub-taxonomy TEXT
If a taxopath from the taxonomy is provided, only the respective sub-taxonomy is used.
--min-subclade-size UINT=0
Minimal size of sub-clades. Everything below is expanded.
--max-subclade-size UINT=0
Maximal size of non-expanded sub-clades. Everything bigger is expanded first.
--min-tax-level UINT=0
Minimal taxonomic level. Taxa below this level are always expanded.
--allow-approximation FLAG
Allow expanding taxa that help to get closer to the --target-size, even if they are not the ones with the highest entropy.
--no-taxa-selection FLAG
If set, no taxa selection using entropy is performed. Instead, all taxa on all levels/ranks are used and consensus sequences for all of them are calculated. This is useful for testing and to try out new ideas.
Consensus Method
--consensus-method TEXT:{majorities,cavener,threshold}=majorities
Consensus method to use for combining sequences.
--consensus-threshold FLOAT:FLOAT in [0 - 1]=0.5 Needs: --consensus-method
Threshold value to use with --consensus-method threshold. Has to be in [ 0.0, 1.0 ].
Output
--out-dir TEXT=.
Directory to write output files to.
--file-prefix TEXT
File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
--file-suffix TEXT
File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
--write-info-files FLAG
If set, two additional info files are written, containing the new pruned taxonomy, as well as the entropy of all clades of the original taxonomy.
Global Options
--allow-file-overwriting FLAG
Allow overwriting existing output files instead of aborting the command.
--verbose FLAG
Produce more verbose output.
--threads UINT
Number of threads to use for calculations.
--log-file TEXT
Write all output to a log file, in addition to standard output to the terminal.

Description

Given a set of sequences and a fitting taxonomy, the command produces consensus sequences representing taxonomic clades, according to our PhAT method, as described in the publication on automatic reference trees listed under Citation. The main inputs are --sequence-file and --taxonomy-file, which provide the input data, and --target-size, which sets how many consensus sequences to build.

Figure: PhAT workflow.

After running the command, the resulting set of sequences can be used to infer a reference tree using any tree inference program.
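
As a minimal sketch of how the three required inputs fit together, the following Python snippet assembles the argument list for the command. The file names, the target size of 512, and the output directory are placeholders chosen for illustration only; the option names are the ones listed under Options.

```python
import shlex


def build_phat_command(taxonomy_file, sequence_file, target_size, out_dir="."):
    """Assemble the argument list for `gappa prepare phat`."""
    return [
        "gappa", "prepare", "phat",
        "--taxonomy-file", str(taxonomy_file),
        "--sequence-file", str(sequence_file),
        "--target-size", str(target_size),
        "--out-dir", str(out_dir),
    ]


# Placeholder inputs; replace them with your own files and target size.
cmd = build_phat_command("taxonomy.tsv", "sequences.fasta", 512, "phat_out")
print(shlex.join(cmd))

# To actually run the command (requires gappa to be installed and on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than as a single string avoids quoting problems when file paths contain spaces.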

Details

--taxonomy-file

The taxonomy file needs to contain a list of the taxa used for the taxonomic expansion algorithm. Each line of the file lists one taxonomic clade as a semicolon-separated taxonomic path. Everything after the first tab is ignored.

Example:

Eukaryota;	4	domain
Eukaryota;Amoebozoa;	4052	kingdom		119
Eukaryota;Amoebozoa;Myxogastria;	4094	phylum		119
Eukaryota;Amoebozoa;Myxogastria;Amaurochaete;	4095	genus		119
Eukaryota;Amoebozoa;Myxogastria;Badhamia;	4096	genus		119
Eukaryota;Amoebozoa;Myxogastria;Brefeldia;	4097	genus		119
Eukaryota;Amoebozoa;Myxogastria;Comatricha;	4098	genus		119
...
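
The line format described above can be illustrated with a short Python sketch. It is not the parser used by gappa; in particular, dropping empty elements (such as the one produced by the trailing semicolon) is an assumption made here for illustration, and `parse_taxonomy_line` is a hypothetical helper name.

```python
def parse_taxonomy_line(line):
    """Return the taxon names of one taxonomy line, or None for a blank line."""
    # Everything after the first tab is ignored.
    taxopath = line.rstrip("\n").split("\t", 1)[0].strip()
    if not taxopath:
        return None
    # Split at semicolons. Assumption: empty elements, such as the one
    # after the trailing semicolon, are dropped.
    return [name.strip() for name in taxopath.split(";") if name.strip()]


example = "Eukaryota;Amoebozoa;Myxogastria;\t4094\tphylum\t\t119"
print(parse_taxonomy_line(example))
# ['Eukaryota', 'Amoebozoa', 'Myxogastria']
```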

--sequence-file

The sequence file needs to be in fasta format and contain sequences that are labelled with the taxonomic path that they belong to. This taxonomic path can either be the whole label, or everything after the first whitespace (space or tab). This allows sequences to carry a unique identifier as the first part of the label.

For example, sequences in the Silva database are labelled like this:

>AY842031.1.1855 Eukaryota;Amoebozoa;Myxogastria;Amaurochaete;Amaurochaete comata
>JQ031957.1.4380 Eukaryota;Amoebozoa;Myxogastria;Brefeldia;Brefeldia maxima

In the example, each label starts with a unique identifier, followed by a space and the taxonomic path that the sequence belongs to. The path contains an additional taxonomic level that is not present in the taxonomy. If this occurs, the last level is assumed to be the species level and is removed from the path. The resulting taxonomic path is part of the taxonomy, and hence the sequence can be used.
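
The label handling described above can be sketched in Python as follows. This is an illustration under stated assumptions, not gappa's implementation: the helper names `split_path` and `taxopath_from_label` are hypothetical, the taxonomy is represented as a set of tuples, and the order in which the two label interpretations are tried is an assumption.

```python
def split_path(text):
    """Split a semicolon-separated taxonomic path into a tuple of names."""
    return tuple(name.strip() for name in text.split(";") if name.strip())


def taxopath_from_label(label, taxonomy):
    """Map a fasta label to a path contained in `taxonomy`, or return None.

    `taxonomy` is a set of tuples of taxon names.
    """
    label = label.lstrip(">").strip()
    # The path is either the whole label or everything after the first whitespace.
    candidates = [label]
    parts = label.split(None, 1)
    if len(parts) == 2:
        candidates.append(parts[1])
    for text in candidates:
        path = split_path(text)
        if path in taxonomy:
            return path
        # Otherwise, assume the last level is the species and remove it.
        if len(path) > 1 and path[:-1] in taxonomy:
            return path[:-1]
    return None


taxonomy = {
    ("Eukaryota",),
    ("Eukaryota", "Amoebozoa"),
    ("Eukaryota", "Amoebozoa", "Myxogastria"),
    ("Eukaryota", "Amoebozoa", "Myxogastria", "Amaurochaete"),
}
label = ">AY842031.1.1855 Eukaryota;Amoebozoa;Myxogastria;Amaurochaete;Amaurochaete comata"
print(taxopath_from_label(label, taxonomy))
# ('Eukaryota', 'Amoebozoa', 'Myxogastria', 'Amaurochaete')
```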

--sub-taxonomy

If provided with a semicolon-separated taxonomic path (e.g., Eukaryota;Amoebozoa;), only this subclade is used for the algorithm. That is, the algorithm behaves as if the --taxonomy-file and --sequence-file only contained the taxa and sequences of the provided clade.
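
Conceptually, this restriction is a prefix test on taxonomic paths, as the following Python sketch shows. The helper `in_sub_taxonomy` is hypothetical, the path `Eukaryota;Archaeplastida` is made up for illustration, and whether gappa keeps the root of the selected clade itself is an assumption here.

```python
def in_sub_taxonomy(path, sub_taxonomy):
    """Return True if `path` equals `sub_taxonomy` or lies below it."""
    return path[:len(sub_taxonomy)] == sub_taxonomy


# Corresponds to --sub-taxonomy "Eukaryota;Amoebozoa;"
sub = ("Eukaryota", "Amoebozoa")

paths = [
    ("Eukaryota",),
    ("Eukaryota", "Amoebozoa"),
    ("Eukaryota", "Amoebozoa", "Myxogastria"),
    ("Eukaryota", "Archaeplastida"),  # made-up sibling clade
]
kept = [p for p in paths if in_sub_taxonomy(p, sub)]
print(kept)
# [('Eukaryota', 'Amoebozoa'), ('Eukaryota', 'Amoebozoa', 'Myxogastria')]
```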

Citation

When using this method, please do not forget to cite:

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Methods for Automatic Reference Trees and Multilevel Phylogenetic Placement. Bioinformatics, 2018. doi:10.1093/bioinformatics/bty767
