-
Notifications
You must be signed in to change notification settings - Fork 4
3. more info on zol
zol is the major program of the suite. In brief, zol takes a set of homologous (ideally orthologous) gene clusters and collapses it down into a tabular/spreadsheet report where each row corresponds to a distinct ortholog group. Ortholog groups are presented in a consensus order and the resulting report also features various evolutionary statistics and functional annotations for each ortholog group.
Column | Description | Notes |
---|---|---|
Ortholog Group (OG) ID | The identifier of the ortholog/homolog group. | |
OG is Single Copy? | Whether the ortholog/homolog group is single copy in the context of the gene-cluster. Evolutionary statistics should be evaluated carefully if False or multiple gene-clusters are from the same genome. | |
Proportion of Total Gene Clusters with OG | The proportion of input gene-clusters/GenBanks which feature the homolog group. | |
OG Median Length (bp) | The median length of the homolog group in basepairs. | |
OG Consensus Order | The consensus order of the homolog group across all gene clusters. | |
OG Consensus Direction | The consensus direction of the homolog group across all gene clusters. | |
Proportion of Focal Gene Clusters with OG | Only produced if comparative analysis is requested by user. | |
Proportion of Comparator Gene Clusters with OG | Only produced if comparative analysis is requested by user. | |
Fixation Index | Fst estimate based on measuring pairwise differences in codon alignments and the statistic developed by Hudson, Slatkin, and Maddison 1992 | Only produced if comparative analysis is requested by user. |
Upstream Region Fixation Index | Fst estimate based on upstream 100 bp nucleotide alignments. | |
Tajima's D | Tajima's D calculated using implementation described in lsaBGC based on statistic developed by Tajima 1989. | Interpret with care and consideration of divergence of genomes from which gene clusters were extracted. Calculation of statistic modified to account for the presence of gaps in alignments. Filtering of codon alignments in zol is currently different than what is applied in lsaBGC. |
Proportion of Filtered Codon Alignment is Segregating Sites | Proportion of sites in filtered codon alignment which correspond to segregating sites. | Note, segregating sites require two different non-gap alleles - gaps are not counted as a distinct allele. |
Entropy | Average entropy over largely non-ambiguous sites (<10% ambiguity) in codon alignments. | |
Upstream Region Entropy | Average entropy over largely non-ambiguous sites (<10% ambiguity) in nucleotide alignments of upstream regions. | |
Median Beta-RD-gc | The median Beta-RD statistic for ortholog group relative to the full gene-cluster. | Calculation is similar to what was described in the lsaBGC study, but expected divergence for ortholog group sequence between pair of gene-clusters/samples is not based on genome-wide divergence but gene-cluster divergence. |
Max Beta-RD-gc | The max Beta-RD statistic observed for the ortholog group between two gene-clusters. | |
Proportion of sites which are highly ambiguous in codon alignment | The proportion of sites which are ambiguous (e.g. feature gaps) in greater than 10% of the sequences of a codon alignments (before trimming/filtering). | |
Proportion of sites which are highly ambiguous in trimmed codon alignment | The proportion of sites which are ambiguous (e.g. feature gaps) in greater than 10% of the sequences of trimmed codon alignments (via trimal). | |
Median GC | The median GC% of genes belonging to the ortholog group. | |
Median GC Skew | The median GC skew (G-C)/(G+C) of genes belonging to the ortholog group. | |
GARD Partitions Based on Distinct Segments based on Recombination Breakpoints | Number of recombination segments detected by HyPhy's GARD method: Kosakovsky Pond et al. 2006. Not run by default due to time requirements. | |
Number of Sites Identified as Under Positive or Negative Selection | The number of sites inferred as under positive Prob[α<β] or negative selection Prob[α>β] based on FUBAR method: Not run by default due to time requirements. Uses HyPhy's FUBAR method: Murrell et al. 2013 | |
Average delta(Beta, Alpha) by FUBAR across sites | The average difference of β-α across sites in the codon alignment as calculated by FUBAR. | More negative values imply greater purifying selection whereas more positive values imply greater positive selection. |
Proportion of Sites Under Selection which are Positive | Proportion of the number of sites identified as under either positive or negative selection by FUBAR analysis which are under positive selection. | |
Custom Annotation (E-value) | Custom annotation based on user providing custom protein database. | |
KO Annotation (E-value) | Best KEGG ortholog annotation(s) (the HMMER3 E-value associated with the best score) | |
PGAP Annotation (E-value) | Best PGAP annotation(s) (the HMMER3 E-value associated with the best score) | |
PaperBLAST Annotation (E-value) | Best PaperBLAST annotation(s) (the DIAMOND E-value associated with the best bitscore). For associated papers BLAST the consensus sequence or the ID here to on the PaperBLAST webpage. | |
CARD Annotation (E-value) | Best CARD annotation(s) of antimicrobial resistance genes (the DIAMOND E-value associated with the best bitscore) | |
IS Finder (E-value) | Best ISFinder annotation(s) of IS elements / transposons (the DIAMOND E-value associated with the best bitscore) | |
MIBiG Annotation (E-value) | Best MIBiG annotation(s) for genes in characterized BGCs (the DIAMOND E-value associated with the best bitscore) | |
VOG Annotation (E-value) | Best VOG annotation(s) for viral/phage ortholog groups (the HMMER3 E-value associated with the best score) | |
Pfam Domains | Pfam domains with E-value < 1e-5 and meeting the "trusted" score thresholds. | |
CDS Locus Tags | Locus tag identifiers of genes belonging to the ortholog group. | |
Consensus Sequence | The consensus sequence for the ortholog group. |
Some of the 8 annotation databases are profile HMMs whereas others are DIAMOND databases:
profile HMMs: KO, PGAP, VOG, Pfam DIAMOND databases: PaperBLAST, CARD, IS Finder, and MIBiG
The consensus sequence of each ortholog group is used for annotations for computational efficiency and consolidation. For both profile HMMs (searched using hmmscan in HMMER3) and DIAMOND databases (searched via DIAMOND blastp) we require an E-value of 1e-5 to report the best scoring annotation(s) (based on score or bitscore).
For details on how the ortholog group consensus order and direction are calculated, please reference the description on the lsaBGC wiki. We use a similar approach in zol.
usage: zol [-h] -i INPUT [INPUT ...] -o OUTPUT_DIR [-sfp] [-it IDENTITY_THRESHOLD] [-ct COVERAGE_THRESHOLD] [-et EVALUE_THRESHOLD] [-fl] [-fd] [-r] [-d] [-ri] [-cdo] [-cdp CDHIT_PARAMS]
[-dt DEREP_IDENTITY] [-dc DEREP_COVERAGE] [-di DEREP_INFLATION] [-dsg] [-rp REINFLATE_PARAMS] [-ibc] [-ces] [-aec] [-qa] [-b] [-s] [-sg] [-cd CUSTOM_DATABASE] [-rgc] [-l LENGTH] [-w WIDTH]
[-fgl] [-f FOCAL_GENBANKS] [-fc COMPARATOR_GENBANKS] [-oo] [-po PRECOMPUTED_ORTHOGROUPS] [-sc] [-c THREADS] [-mm MAX_MEMORY] [-v]
Program: zol
Author: Rauf Salamzade
Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology
**************************************************************************************
oooooooooooo ooooo
d'''''''d888' `888'
.888P .ooooo. 888
d888' d88' `88b 888
.888P 888 888 888
d888' .P 888 888 888 o
.8888888888P `Y8bod8P' o888ooooood8
**************************************************************************************
zol is a lightweight software that can generate reports on conservation, annotation,
and evolutionary statistics for defined orthologous/homologous loci (e.g. BGCs, phages,
MGEs, or any genomic island / operon!).
CONSIDERATIONS:
---------------
* It is advised that multiple GenBanks from the same genome/sample be concatenated into
a multi-record GenBank to account for fragmentation of gene-clusters and properly
calculate copy count of ortholog groups.
* Locus tags cannot contain commas, if they do however, you can use the --rename_lt flag
to request new locus tags!
* Ortholog group and homolog group are/were used inter-changeably in the code/comments. We
recommend using the term ortholog group which is more commonly used in literature for
the type of protein clustering we perform in zol. Since v1.28 - result files and logging
messages should largely use "ortholog group" or "OG".
* Dereplication uses ANI & AF estimates by skani, which the author recommends should be
used on contigs (or gene-clusters in this case) greater than 10 kb for accurate
calculations.
options:
-h, --help show this help message and exit
-i INPUT [INPUT ...], --input INPUT [INPUT ...]
Either a directory or set of files with orthologous/homologous
locus-specific GenBanks.
Files must end with ".gbk", ".gbff", or ".genbank".
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
Parent output/workspace directory.
-sfp, --select-fai-params-mode
Determine statistics informative for selecting parameters for running
fai to find more instances of the gene cluster.
-it IDENTITY_THRESHOLD, --identity-threshold IDENTITY_THRESHOLD
Minimum identity coverage for an alignment between protein
pairs from two gene-clusters to consider in search for
orthologs. [Default is 30].
-ct COVERAGE_THRESHOLD, --coverage-threshold COVERAGE_THRESHOLD
Minimum query coverage for an alignment between protein
pairs from two gene-clusters to consider in search for
orthologs. [Default is 50].
-et EVALUE_THRESHOLD, --evalue-threshold EVALUE_THRESHOLD
Maximum E-value for an alignment between protein pairs from
two gene-clusters to consider in search for orthologs.
[Default is 0.001].
-fl, --filter-low-quality
Filter gene-clusters which feature alot of missing
bases (>10 percent).
-fd, --filter-draft-quality
Filter records of gene-clusters which feature CDS
features on the edge of contigs (those marked with
attribute near_contig_edge=True by fai) or which are
multi-record.
-r, --rename-lt Rename locus-tags for CDS features in GenBanks.
-d, --dereplicate Perform dereplication of input GenBanks using skani
and single-linkage clustering or MCL.
-ri, --reinflate Perform re-inflation with all gene-clusters of
ortho-groups identified via dereplicated analysis.
-cdo, --cdhit-orthogroup
Cluster proteins using CD-HIT instead of using the
standard InParanoid-like ortholog group prediction approach.
This approach is faster and uses less memory, but is less accurate.
-cdp CDHIT_PARAMS, --cdhit-params CDHIT_PARAMS
Parameters for performing CD-HIT based ortholog group
clustering if requested via --cdhit_orthogroup.
[Default is "-c 0.5 -aL 0.25 -aS 0.5 -n 3 -M 4000"]
-dt DEREP_IDENTITY, --derep-identity DEREP_IDENTITY
skani ANI threshold to use for dereplication. [Default is 99.0].
-dc DEREP_COVERAGE, --derep-coverage DEREP_COVERAGE
skani aligned fraction threshold to use for
dereplication. [Default is 95.0].
-di DEREP_INFLATION, --derep-inflation DEREP_INFLATION
Inflation parameter for MCL to use for dereplication of
gene-clusters. If not specified single-linkage clustering
will be used instead.
-dsg, --derep-small-genomes
Run skani with the --small-genomes preset for
dereplication - recommended if dealing
with lots of gene cluster instances that are < 20 kb
in length (requires skani version >0.2.2).
-rp REINFLATE_PARAMS, --reinflate-params REINFLATE_PARAMS
Parameters for running CD-HIT for re-inflation, please
surround argument input with double quotes.
[Default is "-c 0.98 -aL 0.95 -aS 0.95 -n 5 -M 4000"].
-ibc, --impute-broad-conservation
Impute weighted conservation stats based on cluster size associated
with dereplicated representatives.
-ces, --comprehensive-evo-stats
Allow computing of evolutionary statistics for non-single-copy genes.
-aec, --allow-edge-cds
Allow CDS within gene-cluster GenBanks with the attribute
"near_scaffold_edge=True", which is set by fai for features
within 2kb of contig edges.
-qa, --quality-align Use MUSCLE align instead of super5 for alignments - slower
but more accurate.
-b, --betard-analysis
Compute Beta-RD-gc statsitics - off by default because it requires
a lot of memory for large gene cluster sets.
-s, --selection-analysis
Run selection analysis using HyPhy's GARD
and FUBAR methods. These are turned off by default because
they are computationally intensive.
-sg, --skip-gard Skip GARD detection of recombination breakpoints
prior to running FUBAR selection analysis. Less
accurate than running with GARD preliminary analysis,
but much faster. Default is False because these are
computationally intensive.
-cd CUSTOM_DATABASE, --custom-database CUSTOM_DATABASE
Path to FASTA file of protein sequences corresponding to a
custom annotation database.
-rgc, --refine-gene-calling
Perform gene-calling refinement using custom database.
All ortholog groups which don't match to a protein in the
custom database will be ignored.
-l LENGTH, --length LENGTH
Specify the height/length of the heatmap plot [Default is 7].
-w WIDTH, --width WIDTH
Specify the width of the heatmap plot [Default is 14].
-fgl, --full-genbank-labels
Use full GenBank labels instead of just the first 20 characters.
-f FOCAL_GENBANKS, --focal-genbanks FOCAL_GENBANKS
File with focal genbank(s) listed (one per line).
-fc COMPARATOR_GENBANKS, --comparator-genbanks COMPARATOR_GENBANKS
Optional file with comparator genbank(s) listed.
Default is to use remaining GenBanks as comparators to focal listing.
-oo, --only-orthogroups
Only compute ortholog groups and stop (runs up to step 2).
-po PRECOMPUTED_ORTHOGROUPS, --precomputed-orthogroups PRECOMPUTED_ORTHOGROUPS
Path to two-column tab delimited file where the first column
corresponds to locus_tags and the second column to corresponding
orthogroup identifiers. Requires locus tags to be non-overlapping
across input gene cluster GenBank files and ortholog designations
for all CDS locus tags.
-sc, --skip-cleanup Whether to skip cleanup of temporary files in the
'Determine_Orthogroups/' subdirectory.
-c THREADS, --threads THREADS
The number of threads to use.
-mm MAX_MEMORY, --max-memory MAX_MEMORY
Uses resource module to set soft memory limit. Provide in Giga-bytes.
Configured in the shell environment [experimental; Default is None].
-v, --version Get version and exit.
Option (Short) | Option (Long) | Description |
---|---|---|
-i |
--input |
Either a directory or set of files with orthologous/homologous locus-specific GenBanks. Files must end with ".gbk", ".gbff", or ".genbank". |
-o |
--output-dir |
The output directory - there is a lot of output files zol produces - so it is recommended that this is a new, discrete directory. |
-sfp |
--select-fai-params-mode |
Determine statistics informative for selecting parameters for running fai to find more instances of the gene cluster. See this wiki page for more information. |
-it |
--identity-threshold |
Minimum identity coverage for an alignment between protein pairs from gene-clusters to consider in search for orthologs/in-paralogs [Default is 30]. |
-ct |
--coverage-threshold |
Minimum query coverage for an alignment between protein pairs from gene-clusters to consider in search for orthologs/in-paralogs [Default is 50]. |
-et |
--evalue-threshold |
Maximum E-value for an alignment between protein pairs from gene-clusters to consider in search for orthologs/in-paralogs [Default is 0.001]. |
-fl |
--filter-low-quality |
Discard gene clusters which feature a lot of missing bases (>10 percent). |
-fd |
--filter-draft-quality |
Discard gene clusters which feature CDS features on the edge of contigs (those marked with attribute near_contig_edge=True by fai) or which are multi-record. |
-r |
--rename-lt |
Rename locus-tags for CDS features in GenBanks. zol relies on locus tags as protein identifiers, so if all input gene clusters do not have locus_tag qualifiers for all CDS features use this option to ensure everything runs smoothly and zol doesn't complain. |
-cd |
--custom-database |
Path to FASTA file of protein sequences corresponding to a custom annotation database. This can be the query used for fai analysis - it helps you keep grounded and navigate the results mapping ortholog groups to known genes and highly recommended to use. |
-d |
--dereplicate |
Perform dereplication of input GenBanks using skani based pairwise ANI estimates with single-linkage clustering or MCL. |
-ri |
--reinflate |
If dereplication was requested, this flag further requests performing re-inflation of ortholog groups with proteins from non-representative gene cluster instances. |
-cdo |
--cdhit-orthogroup |
Cluster proteins using CD-HIT instead of using the standard InParanoid-like ortholog group prediction approach. This approach is faster and uses less memory, but is less accurate. |
-cdp |
--cdhit-params |
Parameters for performing CD-HIT based ortholog group clustering if requested via --cdhit_orthogroup [Default is "-c 0.5 -aL 0.25 -aS 0.5 -n 3 -M 4000"]. |
-dt |
--derep-identity |
skani ANI threshold between gene cluster pairs to use for dereplication [Default is 99.0]. |
-dc |
--derep-coverage |
skani aligned fraction threshold between gene cluster pairs to use for dereplication [Default is 95.0]. |
-di |
--derep-inflation |
Inflation parameter for MCL to use for dereplication of gene clusters. If not specified single-linkage clustering will be used instead. |
-dsg |
--derep-small-genomes |
Run skani with the --small-genomes preset for dereplication - recommended if dealing with gene clusters that are < 20 kb in length (requires skani version >0.2.2). |
-rp |
--reinflate-params |
Parameters for running CD-HIT for re-inflation, please surround argument input with double quotes. [Default is "-c 0.98 -aL 0.95 -aS 0.95 -n 5 -M 4000"]. |
-ibc |
--impute-broad-conservation |
Impute weighted conservation stats based on cluster size associated with dereplicated representatives - will essentially assume non-representative gene cluster instances have the same gene content profile as their closest representative. |
-ces |
--comprehensive-evo-stats |
Allow computing of evolutionary statistics for non-single-copy genes. Generally not recommended because paralogs can have unexpected effects on various evolutionary statistics. |
-aec |
--allow-edge-cds |
If draft gene clusters are included (-fd not specified) then include CDS features within gene cluster GenBanks with the attribute "near_scaffold_edge=True" in zol investigation, which is set by fai for features within 2kb of contig edges. |
-qa |
--quality-align |
Use MUSCLE align instead of super5 for alignments - slower but more accurate. |
-b |
--betard-analysis |
Compute Beta-RD-gc statistics - off by default because it requires a lot of memory for large gene cluster sets. Beta-RD is a simple statistic which indicates the similarity of pairs of genes belonging to an ortholog group relative to the overall similarity between gene cluster pairs. |
-s |
--selection-analysis |
Run selection analysis using HyPhy's GARD and FUBAR methods. These are turned off by default because they are computationally intensive. FUBAR also for inferring sites which are under positive/negative selection. |
-sg |
--skip-gard |
Skip GARD detection of recombination breakpoints prior to running FUBAR selection analysis. Less accurate than running with GARD preliminary analysis, but much faster. |
-rgc |
--refine-gene-calling |
Perform gene-calling refinement using custom database. All ortholog groups which don't match to a protein in the custom database will be ignored. |
-f |
--focal-genbanks |
File with focal GenBank(s) listed (one per line) for performing comparative analysis. If --comparator-genbanks is not specified then the comparator gene cluster set is just the complementary set of gene clusters to the focal set. E.g. this could be gene clusters from a particular lineage or species that you want to see how they differ to related gene cluster instances from other strains/species. |
-fc |
--comparator-genbanks |
Optional file with comparator genbank(s) listed. Note, focal gene clusters must have been specified. Default is to use remaining GenBanks as comparators to focal listing. |
-oo |
--only-orthogroups |
Only compute ortholog groups and stop (runs up to step 2). |
-po |
--precomputed-orthogroups |
Path to two-column tab delimited file where the first column corresponds to locus_tags and the second column to corresponding orthogroup identifiers. Requires locus tags to be non-overlapping across input gene cluster GenBank files and ortholog designations for all CDS locus tags. |