Subcommand: phat
Generate consensus sequences from a sequence database according to the PhAT method.
Usage: gappa prepare phat [options]
Input | |
---|---|
--taxonomy-file | Required. TEXT:FILE File that lists the taxa of the database. |
--sequence-file | Required. TEXT:FILE Fasta file containing the sequences of the database. |

Taxonomy Expansion | |
---|---|
--target-size | Required. UINT=0 Target number of taxa to select for building consensus sequences. |
--sub-taxonomy | TEXT If a taxopath from the taxonomy is provided, only the respective sub-taxonomy is used. |
--min-subclade-size | UINT=0 Minimal size of sub-clades. Everything below is expanded. |
--max-subclade-size | UINT=0 Maximal size of a non-expanded sub-clade. Everything bigger is expanded first. |
--min-tax-level | UINT=0 Minimal taxonomic level. Taxa below this level are always expanded. |
--allow-approximation | FLAG Allow expanding taxa that help to get closer to the --target-size, even if they are not the ones with the highest entropy. |
--no-taxa-selection | FLAG If set, no taxa selection using entropy is performed. Instead, all taxa on all levels/ranks are used and consensus sequences for all of them are calculated. This is useful for testing and for trying out new ideas. |

Consensus Method | |
---|---|
--consensus-method | TEXT:{majorities,cavener,threshold}=majorities Consensus method to use for combining sequences. |
--consensus-threshold | FLOAT:FLOAT in [0 - 1]=0.5 Needs: --consensus-method Threshold value to use with --consensus-method threshold. Has to be in [ 0.0, 1.0 ]. |

Output | |
---|---|
--out-dir | TEXT=. Directory to write output files to. |
--file-prefix | TEXT File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
--file-suffix | TEXT File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
--write-info-files | FLAG If set, two additional info files are written, containing the new pruned taxonomy, as well as the entropy of all clades of the original taxonomy. |

Global Options | |
---|---|
--allow-file-overwriting | FLAG Allow overwriting existing output files instead of aborting the command. |
--verbose | FLAG Produce more verbose output. |
--threads | UINT Number of threads to use for calculations. |
--log-file | TEXT Write all output to a log file, in addition to standard output to the terminal. |

Given a set of sequences and a fitting taxonomy, the command produces consensus sequences representing taxonomic clades, according to our PhAT method as described in the second publication cited below.
The main inputs are --sequence-file and --taxonomy-file, which provide the input data, as well as the --target-size of how many consensus sequences to build.
After running the command, the resulting set of sequences can be used to infer a reference tree using any tree inference program.
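As a minimal sketch of a typical invocation, the following assembles a command line from the three main inputs plus --out-dir. Only the subcommand and option names are taken from this page; the file names, the target size of 512, and the output directory are hypothetical placeholders.

```python
import shlex

# Hypothetical file names and values; only the subcommand and option
# names come from the usage line and option tables on this page.
command = [
    "gappa", "prepare", "phat",
    "--taxonomy-file", "taxonomy.tsv",
    "--sequence-file", "sequences.fasta",
    "--target-size", "512",
    "--out-dir", "phat_out",
]

# Print a shell-ready command line. To actually run it, the list could be
# passed to subprocess.run(command, check=True) on a system with gappa installed.
print(shlex.join(command))
```

Building the command as a list avoids quoting problems if a file name contains spaces.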
The taxonomy file needs to contain a list of the taxa used for the taxonomic expansion algorithm. Each line of the file lists one taxonomic clade as a semicolon-separated path. Everything after the first tab is ignored.
Example:
Eukaryota; 4 domain
Eukaryota;Amoebozoa; 4052 kingdom 119
Eukaryota;Amoebozoa;Myxogastria; 4094 phylum 119
Eukaryota;Amoebozoa;Myxogastria;Amaurochaete; 4095 genus 119
Eukaryota;Amoebozoa;Myxogastria;Badhamia; 4096 genus 119
Eukaryota;Amoebozoa;Myxogastria;Brefeldia; 4097 genus 119
Eukaryota;Amoebozoa;Myxogastria;Comatricha; 4098 genus 119
...
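The line format described above can be illustrated with a short parsing sketch. This is not gappa's own parser: the function name is hypothetical, the sample lines assume that the extra columns are separated by tabs, and dropping the empty part after a trailing semicolon is an assumption as well.

```python
def parse_taxonomy_line(line):
    """Return the taxon names of one taxonomy line, or [] for a blank line."""
    # Everything after the first tab is ignored, as described above.
    taxopath = line.rstrip("\n").split("\t", 1)[0]
    # Split at semicolons; drop empty parts such as the one after a
    # trailing ';' (assumed behaviour, not confirmed by the source).
    return [name.strip() for name in taxopath.split(";") if name.strip()]


# Sample lines modelled on the example, assuming tab-separated extra columns.
lines = [
    "Eukaryota;\t4\tdomain",
    "Eukaryota;Amoebozoa;Myxogastria;\t4094\tphylum\t119",
]
for line in lines:
    print(parse_taxonomy_line(line))
```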
The sequence file needs to be in fasta format, and contain sequences that are labelled with the taxonomic path that they belong to. This taxonomic path can either be the whole label, or everything after the first whitespace (space or tab). This allows sequences to carry a unique identifier as the first part of the label.
For example, sequences in the Silva database are labelled like this:
>AY842031.1.1855 Eukaryota;Amoebozoa;Myxogastria;Amaurochaete;Amaurochaete comata
>JQ031957.1.4380 Eukaryota;Amoebozoa;Myxogastria;Brefeldia;Brefeldia maxima
In the example, each sequence label starts with a unique identifier, followed by a space and the taxonomic path the sequence belongs to. The path contains an additional taxonomic level which is not present in the database. If this occurs, the last level is assumed to be the species level, and is removed from the path. The resulting taxonomic path is part of the taxonomy, and hence the sequence can be used.
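The label handling described above can be sketched as follows. This is an illustration under stated assumptions, not gappa's implementation: the function names are hypothetical, the taxonomy is modelled as a set of tuples, and trying the whole label before the part after the first whitespace is an assumed order.

```python
def split_taxopath(text):
    """Split a semicolon-separated path into a tuple of taxon names."""
    return tuple(name.strip() for name in text.split(";") if name.strip())


def label_to_taxopath(label, taxonomy):
    """Map a fasta label to a path contained in `taxonomy`, or None."""
    label = label.lstrip(">").strip()
    # Candidate paths: the whole label, or everything after the first whitespace.
    candidates = [label]
    pieces = label.split(None, 1)
    if len(pieces) == 2:
        candidates.append(pieces[1])
    for candidate in candidates:
        path = split_taxopath(candidate)
        if path in taxonomy:
            return path
        # Otherwise assume the last level is the species and remove it.
        if len(path) > 1 and path[:-1] in taxonomy:
            return path[:-1]
    return None


taxonomy = {("Eukaryota", "Amoebozoa", "Myxogastria", "Amaurochaete")}
label = ">AY842031.1.1855 Eukaryota;Amoebozoa;Myxogastria;Amaurochaete;Amaurochaete comata"
print(label_to_taxopath(label, taxonomy))
```

Note that only the first whitespace separates the identifier from the path, so species names containing a space stay intact.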
If --sub-taxonomy is provided with a semicolon-separated taxonomic path (e.g., Eukaryota;Amoebozoa;), only this subclade is used for the algorithm. That is, the algorithm behaves as if the --taxonomy-file and --sequence-file only contained the taxa and sequences of the provided clade.
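The effect of restricting to a subclade can be sketched as a prefix filter on taxonomic paths. The function name is hypothetical, and whether the named clade itself is kept alongside its descendants is an assumption of this sketch.

```python
def in_sub_taxonomy(path, sub_taxonomy):
    """True if `path` equals `sub_taxonomy` or lies below it."""
    return path[:len(sub_taxonomy)] == sub_taxonomy


sub = ("Eukaryota", "Amoebozoa")
paths = [
    ("Eukaryota",),
    ("Eukaryota", "Amoebozoa"),
    ("Eukaryota", "Amoebozoa", "Myxogastria"),
    ("Bacteria",),
]
# Keep only the paths inside the chosen subclade.
kept = [path for path in paths if in_sub_taxonomy(path, sub)]
print(kept)
```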
When using this method, please do not forget to cite:
Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070
Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Methods for Automatic Reference Trees and Multilevel Phylogenetic Placement. Bioinformatics, 2018. doi:10.1093/bioinformatics/bty767