Skip to content

3. more info on zol

Rauf Salamzade edited this page Nov 23, 2024 · 21 revisions

zol is the major program of the suite. In brief, zol takes a set of homologous (ideally orthologous) gene clusters and collapses it down into a tabular/spreadsheet report where each row corresponds to a distinct ortholog group. Ortholog groups are presented in a consensus order and the resulting report also features various evolutionary statistics and functional annotations for each ortholog group.

Explanation of Report

Column Description Notes
Ortholog Group (OG) ID The identifier of the ortholog/homolog group.
OG is Single Copy? Whether the ortholog/homolog group is single copy in the context of the gene-cluster. Evolutionary statistics should be evaluated carefully if False or multiple gene-clusters are from the same genome.
Proportion of Total Gene Clusters with OG The proportion of input gene-clusters/GenBanks which feature the homolog group.
OG Median Length (bp) The median length of the homolog group in basepairs.
OG Consensus Order The consensus order of the homolog group across all gene clusters.
OG Consensus Direction The consensus direction of the homolog group across all gene clusters.
Proportion of Focal Gene Clusters with OG Only produced if comparative analysis is requested by user.
Proportion of Comparator Gene Clusters with OG Only produced if comparative analysis is requested by user.
Fixation Index Fst estimate based on measuring pairwise differences in codon alignments and the statistic developed by Hudson, Slatkin, and Maddison 1992 Only produced if comparative analysis is requested by user.
Upstream Region Fixation Index Fst estimate based on upstream 100 bp nucleotide alignments.
Tajima's D Tajima's D calculated using implementation described in lsaBGC based on statistic developed by Tajima 1989. Interpret with care and consideration of divergence of genomes from which gene clusters were extracted. Calculation of statistic modified to account for the presence of gaps in alignments. Filtering of codon alignments in zol is currently different than what is applied in lsaBGC.
Proportion of Filtered Codon Alignment is Segregating Sites Proportion of sites in filtered codon alignment which correspond to segregating sites. Note, segregating sites require two different non-gap alleles - gaps are not counted as a distinct allele.
Entropy Average entropy over largely non-ambiguous sites (<10% ambiguity) in codon alignments.
Upstream Region Entropy Average entropy over largely non-ambiguous sites (<10% ambiguity) in nucleotide alignments of upstream regions.
Median Beta-RD-gc The median Beta-RD statistic for ortholog group relative to the full gene-cluster. Calculation is similar to what was described in the lsaBGC study, but expected divergence for ortholog group sequence between pair of gene-clusters/samples is not based on genome-wide divergence but gene-cluster divergence.
Max Beta-RD-gc The max Beta-RD statistic observed for the ortholog group between two gene-clusters.
Proportion of sites which are highly ambiguous in codon alignment The proportion of sites which are ambiguous (e.g. feature gaps) in greater than 10% of the sequences of a codon alignments (before trimming/filtering).
Proportion of sites which are highly ambiguous in trimmed codon alignment The proportion of sites which are ambiguous (e.g. feature gaps) in greater than 10% of the sequences of trimmed codon alignments (via trimal).
Median GC The median GC% of genes belonging to the ortholog group.
Median GC Skew The median GC skew (G-C)/(G+C) of genes belonging to the ortholog group.
GARD Partitions Based on Distinct Segments based on Recombination Breakpoints Number of recombination segments detected by HyPhy's GARD method: Kosakovsky Pond et al. 2006. Not run by default due to time requirements.
Number of Sites Identified as Under Positive or Negative Selection The number of sites inferred as under positive Prob[α<β] or negative selection Prob[α>β] based on FUBAR method: Not run by default due to time requirements. Uses HyPhy's FUBAR method: Murrell et al. 2013
Average delta(Beta, Alpha) by FUBAR across sites The average difference of β-α across sites in the codon alignment as calculated by FUBAR. More negative values imply greater purifying selection whereas more positive values imply greater positive selection.
Proportion of Sites Under Selection which are Positive Proportion of the number of sites identified as under either positive or negative selection by FUBAR analysis which are under positive selection.
Custom Annotation (E-value) Custom annotation based on user providing custom protein database.
KO Annotation (E-value) Best KEGG ortholog annotation(s) (the HMMER3 E-value associated with the best score)
PGAP Annotation (E-value) Best PGAP annotation(s) (the HMMER3 E-value associated with the best score)
PaperBLAST Annotation (E-value) Best PaperBLAST annotation(s) (the DIAMOND E-value associated with the best bitscore). For associated papers BLAST the consensus sequence or the ID here to on the PaperBLAST webpage.
CARD Annotation (E-value) Best CARD annotation(s) of antimicrobial resistance genes (the DIAMOND E-value associated with the best bitscore)
IS Finder (E-value) Best ISFinder annotation(s) of IS elements / transposons (the DIAMOND E-value associated with the best bitscore)
MIBiG Annotation (E-value) Best MIBiG annotation(s) for genes in characterized BGCs (the DIAMOND E-value associated with the best bitscore)
VOG Annotation (E-value) Best VOG annotation(s) for viral/phage ortholog groups (the HMMER3 E-value associated with the best score)
Pfam Domains Pfam domains with E-value < 1e-5 and meeting the "trusted" score thresholds.
CDS Locus Tags Locus tag identifiers of genes belonging to the ortholog group.
Consensus Sequence The consensus sequence for the ortholog group.

Method of Annotation

Some of the 8 annotation databases are profile HMMs whereas others are DIAMOND databases:

profile HMMs: KO, PGAP, VOG, Pfam DIAMOND databases: PaperBLAST, CARD, IS Finder, and MIBiG

The consensus sequence of each ortholog group is used for annotations for computational efficiency and consolidation. For both profile HMMs (searched using hmmscan in HMMER3) and DIAMOND databases (searched via DIAMOND blastp) we require an E-value of 1e-5 to report the best scoring annotation(s) (based on score or bitscore).

Determination of Ortholog Group Consensus Order and Direction

For details on how the ortholog group consensus order and direction are calculated, please reference the description on the lsaBGC wiki. We use a similar approach in zol.

Usage

usage: zol [-h] -i INPUT [INPUT ...] -o OUTPUT_DIR [-sfp] [-it IDENTITY_THRESHOLD] [-ct COVERAGE_THRESHOLD] [-et EVALUE_THRESHOLD] [-fl] [-fd] [-r] [-d] [-ri] [-cdo] [-cdp CDHIT_PARAMS]
           [-dt DEREP_IDENTITY] [-dc DEREP_COVERAGE] [-di DEREP_INFLATION] [-dsg] [-rp REINFLATE_PARAMS] [-ibc] [-ces] [-aec] [-qa] [-b] [-s] [-sg] [-cd CUSTOM_DATABASE] [-rgc] [-l LENGTH] [-w WIDTH]
           [-fgl] [-f FOCAL_GENBANKS] [-fc COMPARATOR_GENBANKS] [-oo] [-po PRECOMPUTED_ORTHOGROUPS] [-sc] [-c THREADS] [-mm MAX_MEMORY] [-v]

	Program: zol
	Author: Rauf Salamzade
	Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology
		
	**************************************************************************************							
		
                      oooooooooooo           ooooo        
                     d'''''''d888'           `888'        
                           .888P    .ooooo.   888         
                          d888'    d88' `88b  888         
                        .888P      888   888  888         
                       d888'    .P 888   888  888       o 
                     .8888888888P  `Y8bod8P' o888ooooood8 
				
	**************************************************************************************
	
	zol is a lightweight software that can generate reports on conservation, annotation, 
	and evolutionary statistics for defined orthologous/homologous loci (e.g. BGCs, phages, 
	MGEs, or any genomic island / operon!).
	
	CONSIDERATIONS:
	---------------
	* It is advised that multiple GenBanks from the same genome/sample be concatenated into 
	  a multi-record GenBank to account for fragmentation of gene-clusters and properly 
	  calculate copy count of ortholog groups.
	* Locus tags cannot contain commas, if they do however, you can use the --rename_lt flag 
	  to request new locus tags!
	* Ortholog group and homolog group are/were used inter-changeably in the code/comments. We 
	  recommend using the term ortholog group which is more commonly used in literature for 
	  the type of protein clustering we perform in zol. Since v1.28 - result files and logging
	  messages should largely use "ortholog group" or "OG".
	* Dereplication uses ANI & AF estimates by skani, which the author recommends should be 
          used on contigs (or gene-clusters in this case) greater than 10 kb for accurate 
          calculations.
	

options:
  -h, --help            show this help message and exit
  -i INPUT [INPUT ...], --input INPUT [INPUT ...]
                        Either a directory or set of files with orthologous/homologous
                        locus-specific GenBanks.
                        Files must end with ".gbk", ".gbff", or ".genbank".
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Parent output/workspace directory.
  -sfp, --select-fai-params-mode
                        Determine statistics informative for selecting parameters for running
                        fai to find more instances of the gene cluster.
  -it IDENTITY_THRESHOLD, --identity-threshold IDENTITY_THRESHOLD
                        Minimum identity coverage for an alignment between protein
                        pairs from two gene-clusters to consider in search for
                        orthologs. [Default is 30].
  -ct COVERAGE_THRESHOLD, --coverage-threshold COVERAGE_THRESHOLD
                        Minimum query coverage for an alignment between protein
                        pairs from two gene-clusters to consider in search for
                        orthologs. [Default is 50].
  -et EVALUE_THRESHOLD, --evalue-threshold EVALUE_THRESHOLD
                        Maximum E-value for an alignment between protein pairs from
                        two gene-clusters to consider in search for orthologs.
                        [Default is 0.001].
  -fl, --filter-low-quality
                        Filter gene-clusters which feature alot of missing
                        bases (>10 percent).
  -fd, --filter-draft-quality
                        Filter records of gene-clusters which feature CDS
                        features on the edge of contigs (those marked with
                        attribute near_contig_edge=True by fai) or which are
                        multi-record.
  -r, --rename-lt       Rename locus-tags for CDS features in GenBanks.
  -d, --dereplicate     Perform dereplication of input GenBanks using skani
                        and single-linkage clustering or MCL.
  -ri, --reinflate      Perform re-inflation with all gene-clusters of
                        ortho-groups identified via dereplicated analysis.
  -cdo, --cdhit-orthogroup
                        Cluster proteins using CD-HIT instead of using the
                        standard InParanoid-like ortholog group prediction approach.
                        This approach is faster and uses less memory, but is less accurate.
  -cdp CDHIT_PARAMS, --cdhit-params CDHIT_PARAMS
                        Parameters for performing CD-HIT based ortholog group
                        clustering if requested via --cdhit_orthogroup.
                        [Default is "-c 0.5 -aL 0.25 -aS 0.5 -n 3 -M 4000"]
  -dt DEREP_IDENTITY, --derep-identity DEREP_IDENTITY
                        skani ANI threshold to use for dereplication. [Default is 99.0].
  -dc DEREP_COVERAGE, --derep-coverage DEREP_COVERAGE
                        skani aligned fraction threshold to use for
                        dereplication. [Default is 95.0].
  -di DEREP_INFLATION, --derep-inflation DEREP_INFLATION
                        Inflation parameter for MCL to use for dereplication of
                        gene-clusters. If not specified single-linkage clustering
                        will be used instead.
  -dsg, --derep-small-genomes
                        Run skani with the --small-genomes preset for
                        dereplication - recommended if dealing
                        with lots of gene cluster instances that are < 20 kb
                        in length (requires skani version >0.2.2).
  -rp REINFLATE_PARAMS, --reinflate-params REINFLATE_PARAMS
                        Parameters for running CD-HIT for re-inflation, please
                        surround argument input with double quotes.
                        [Default is "-c 0.98 -aL 0.95 -aS 0.95 -n 5 -M 4000"].
  -ibc, --impute-broad-conservation
                        Impute weighted conservation stats based on cluster size associated
                        with dereplicated representatives.
  -ces, --comprehensive-evo-stats
                        Allow computing of evolutionary statistics for non-single-copy genes.
  -aec, --allow-edge-cds
                        Allow CDS within gene-cluster GenBanks with the attribute
                        "near_scaffold_edge=True", which is set by fai for features
                        within 2kb of contig edges.
  -qa, --quality-align  Use MUSCLE align instead of super5 for alignments - slower
                        but more accurate.
  -b, --betard-analysis
                        Compute Beta-RD-gc statsitics - off by default because it requires
                        a lot of memory for large gene cluster sets.
  -s, --selection-analysis
                        Run selection analysis using HyPhy's GARD
                        and FUBAR methods. These are turned off by default because
                        they are computationally intensive.
  -sg, --skip-gard      Skip GARD detection of recombination breakpoints
                        prior to running FUBAR selection analysis. Less
                        accurate than running with GARD preliminary analysis,
                        but much faster. Default is False because these are
                        computationally intensive.
  -cd CUSTOM_DATABASE, --custom-database CUSTOM_DATABASE
                        Path to FASTA file of protein sequences corresponding to a
                        custom annotation database.
  -rgc, --refine-gene-calling
                        Perform gene-calling refinement using custom database.
                        All ortholog groups which don't match to a protein in the
                        custom database will be ignored.
  -l LENGTH, --length LENGTH
                        Specify the height/length of the heatmap plot [Default is 7].
  -w WIDTH, --width WIDTH
                        Specify the width of the heatmap plot [Default is 14].
  -fgl, --full-genbank-labels
                        Use full GenBank labels instead of just the first 20 characters.
  -f FOCAL_GENBANKS, --focal-genbanks FOCAL_GENBANKS
                        File with focal genbank(s) listed (one per line).
  -fc COMPARATOR_GENBANKS, --comparator-genbanks COMPARATOR_GENBANKS
                        Optional file with comparator genbank(s) listed.
                        Default is to use remaining GenBanks as comparators to focal listing.
  -oo, --only-orthogroups
                        Only compute ortholog groups and stop (runs up to step 2).
  -po PRECOMPUTED_ORTHOGROUPS, --precomputed-orthogroups PRECOMPUTED_ORTHOGROUPS
                        Path to two-column tab delimited file where the first column
                        corresponds to locus_tags and the second column to corresponding
                        orthogroup identifiers. Requires locus tags to be non-overlapping
                        across input gene cluster GenBank files and ortholog designations
                        for all CDS locus tags.
  -sc, --skip-cleanup   Whether to skip cleanup of temporary files in the
                        'Determine_Orthogroups/' subdirectory.
  -c THREADS, --threads THREADS
                        The number of threads to use.
  -mm MAX_MEMORY, --max-memory MAX_MEMORY
                        Uses resource module to set soft memory limit. Provide in Giga-bytes.
                        Configured in the shell environment [experimental; Default is None].
  -v, --version         Get version and exit.

Detailed Usage on Major Options:

Option (Short) Option (Long) Description
-i --input Either a directory or set of files with orthologous/homologous locus-specific GenBanks. Files must end with ".gbk", ".gbff", or ".genbank".
-o --output-dir The output directory - there is a lot of output files zol produces - so it is recommended that this is a new, discrete directory.
-sfp --select-fai-params-mode Determine statistics informative for selecting parameters for running fai to find more instances of the gene cluster. See this wiki page for more information.
-it --identity-threshold Minimum identity coverage for an alignment between protein pairs from gene-clusters to consider in search for orthologs/in-paralogs [Default is 30].
-ct --coverage-threshold Minimum query coverage for an alignment between protein pairs from gene-clusters to consider in search for orthologs/in-paralogs [Default is 50].
-et --evalue-threshold Maximum E-value for an alignment between protein pairs from gene-clusters to consider in search for orthologs/in-paralogs [Default is 0.001].
-fl --filter-low-quality Discard gene clusters which feature a lot of missing bases (>10 percent).
-fd --filter-draft-quality Discard gene clusters which feature CDS features on the edge of contigs (those marked with attribute near_contig_edge=True by fai) or which are multi-record.
-r --rename-lt Rename locus-tags for CDS features in GenBanks. zol relies on locus tags as protein identifiers, so if all input gene clusters do not have locus_tag qualifiers for all CDS features use this option to ensure everything runs smoothly and zol doesn't complain.
-cd --custom-database Path to FASTA file of protein sequences corresponding to a custom annotation database. This can be the query used for fai analysis - it helps you keep grounded and navigate the results mapping ortholog groups to known genes and highly recommended to use.
-d --dereplicate Perform dereplication of input GenBanks using skani based pairwise ANI estimates with single-linkage clustering or MCL.
-ri --reinflate If dereplication was requested, this flag further requests performing re-inflation of ortholog groups with proteins from non-representative gene cluster instances.
-cdo --cdhit-orthogroup Cluster proteins using CD-HIT instead of using the standard InParanoid-like ortholog group prediction approach. This approach is faster and uses less memory, but is less accurate.
-cdp --cdhit-params Parameters for performing CD-HIT based ortholog group clustering if requested via --cdhit_orthogroup [Default is "-c 0.5 -aL 0.25 -aS 0.5 -n 3 -M 4000"].
-dt --derep-identity skani ANI threshold between gene cluster pairs to use for dereplication [Default is 99.0].
-dc --derep-coverage skani aligned fraction threshold between gene cluster pairs to use for dereplication [Default is 95.0].
-di --derep-inflation Inflation parameter for MCL to use for dereplication of gene clusters. If not specified single-linkage clustering will be used instead.
-dsg --derep-small-genomes Run skani with the --small-genomes preset for dereplication - recommended if dealing with gene clusters that are < 20 kb in length (requires skani version >0.2.2).
-rp --reinflate-params Parameters for running CD-HIT for re-inflation, please surround argument input with double quotes. [Default is "-c 0.98 -aL 0.95 -aS 0.95 -n 5 -M 4000"].
-ibc --impute-broad-conservation Impute weighted conservation stats based on cluster size associated with dereplicated representatives - will essentially assume non-representative gene cluster instances have the same gene content profile as their closest representative.
-ces --comprehensive-evo-stats Allow computing of evolutionary statistics for non-single-copy genes. Generally not recommended because paralogs can have unexpected effects on various evolutionary statistics.
-aec --allow-edge-cds If draft gene clusters are included (-fd not specified) then include CDS features within gene cluster GenBanks with the attribute "near_scaffold_edge=True" in zol investigation, which is set by fai for features within 2kb of contig edges.
-qa --quality-align Use MUSCLE align instead of super5 for alignments - slower but more accurate.
-b --betard-analysis Compute Beta-RD-gc statistics - off by default because it requires a lot of memory for large gene cluster sets. Beta-RD is a simple statistic which indicates the similarity of pairs of genes belonging to an ortholog group relative to the overall similarity between gene cluster pairs.
-s --selection-analysis Run selection analysis using HyPhy's GARD and FUBAR methods. These are turned off by default because they are computationally intensive. FUBAR also for inferring sites which are under positive/negative selection.
-sg --skip-gard Skip GARD detection of recombination breakpoints prior to running FUBAR selection analysis. Less accurate than running with GARD preliminary analysis, but much faster.
-rgc --refine-gene-calling Perform gene-calling refinement using custom database. All ortholog groups which don't match to a protein in the custom database will be ignored.
-f --focal-genbanks File with focal GenBank(s) listed (one per line) for performing comparative analysis. If --comparator-genbanks is not specified then the comparator gene cluster set is just the complementary set of gene clusters to the focal set. E.g. this could be gene clusters from a particular lineage or species that you want to see how they differ to related gene cluster instances from other strains/species.
-fc --comparator-genbanks Optional file with comparator genbank(s) listed. Note, focal gene clusters must have been specified. Default is to use remaining GenBanks as comparators to focal listing.
-oo --only-orthogroups Only compute ortholog groups and stop (runs up to step 2).
-po --precomputed-orthogroups Path to two-column tab delimited file where the first column corresponds to locus_tags and the second column to corresponding orthogroup identifiers. Requires locus tags to be non-overlapping across input gene cluster GenBank files and ortholog designations for all CDS locus tags.