9. more info on abon
abon takes as input antiSMASH and/or GECCO results directories for a single sample, together with a prepTG (target genomes) database, to determine how unique the sample's BGC-ome is relative to the genomes in the database. Its development is inspired by studies showing that BGCs are often co-regulated (see: Beyond the Biosynthetic Gene Cluster Paradigm: Genome-Wide Coexpression Networks Connect Clustered and Unclustered Transcription Factors to Secondary Metabolic Pathways) and that secondary/specialized metabolites can be the product of additional genes across the genome (e.g. as described in these two nice studies: Kim and Lee 2012 & Mohite et al. 2022) and potentially of multiple BGCs.
Importantly, abon parses out "key" biosynthetic CDS features so that their presence can be required more stringently, while being more lenient about the presence of auxiliary BGC genes. For antiSMASH BGCs, these are CDS features marked with rule-based-clusters. For GECCO BGCs, these are CDS with domains bearing the most "weight" in the CRF detection of BGCs (see: https://github.com/Kalan-Lab/lsaBGC/pull/11 for more info).
The specific cutoffs used in fai for gene cluster detection in target genomes can be adapted as needed. Alternatively, a simple BLASTp search can be performed instead to determine all homologs of proteins for each BGC from the focal sample in target genomes regardless of whether they are similarly co-located or not.
Note, to assess how individual BGCs relate to cataloged/known BGCs or gene cluster families (GCFs), we recommend the awesome BiG-FAM webserver.
The following is a mini-tutorial on using abon to investigate the novelty of the BGC-ome of Bacillus subtilis st. 168 relative to representative Bacillales genomes, which we made available in a precompiled prepTG database.
First, let's download the query genome of interest.
# Download genome from NCBI
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/045/GCF_000009045.1_ASM904v1/GCF_000009045.1_ASM904v1_genomic.fna.gz
# Uncompress it & rename it
gunzip GCF_000009045.1_ASM904v1_genomic.fna.gz
mv GCF_000009045.1_ASM904v1_genomic.fna Bsubtilis_st168.fasta
Next, we can run antiSMASH and GECCO to call BGCs:
# in some conda environment or setting with antiSMASH available
antismash --output-dir Bsubtilis_st168_antiSMASH_Results/ --genefinding-tool prodigal Bsubtilis_st168.fasta
# in some conda environment or setting with GECCO available
gecco run --genome Bsubtilis_st168.fasta -o Bsubtilis_st168_GECCO_Results/
Next, we can set up the precompiled database of Bacillales representative genomes using prepTG:
# in zol's conda environment or via the Docker wrapper:
prepTG -d Bacillales -o Bacillales_Reps_prepTG_Database/
Now we are ready to run abon!
abon -tg Bacillales_Reps_prepTG_Database/ -a Bsubtilis_st168_antiSMASH_Results/ -g Bsubtilis_st168_GECCO_Results/ -o abon_Results/ -c 20
Note, this can take a while as it will involve running fai X times (where X is the number of BGCs in the focal sample of interest).
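To get a rough idea of X before starting, one option is to count the BGC GenBank files in each results directory. The sketch below assumes that antiSMASH writes one `*.region*.gbk` file per region and that GECCO writes one `*_cluster_*.gbk` file per cluster at the top level of its output directory. Those filename patterns and the `count_gbks` helper are assumptions for illustration and are not part of abon, so adjust them if your versions of the tools name their outputs differently.

```shell
# Hypothetical helper (not part of abon): count files matching a glob
# directly inside a results directory; prints 0 if the directory is missing.
count_gbks() {
  # $1 = results directory, $2 = filename glob (assumed naming pattern)
  if [ -d "$1" ]; then
    find "$1" -maxdepth 1 -type f -name "$2" | wc -l | tr -d ' '
  else
    echo 0
  fi
}

# Directory names follow the tutorial above.
n_antismash=$(count_gbks Bsubtilis_st168_antiSMASH_Results '*.region*.gbk')
n_gecco=$(count_gbks Bsubtilis_st168_GECCO_Results '*_cluster_*.gbk')

echo "antiSMASH BGCs: $n_antismash"
echo "GECCO BGCs:     $n_gecco"
echo "Approximate number of fai runs: $((n_antismash + n_gecco))"
```

This only estimates the workload. abon itself decides which files it treats as BGCs.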
Similar to fai and zol, abon's major result is an XLSX spreadsheet. The first tab gives an overview of the focal sample's antiSMASH and/or GECCO biosynthetic gene clusters.
The second tab shows the coverage of the focal sample's BGC-ome across the genomes in the target genomes database.
Checking for BGC-ome novelty is an exhaustive process, and in the above example we used a database of representative genomes (dereplicated at 99% average nucleotide identity). Therefore, we see that the B. subtilis st. 168 BGC-ome doesn't match any representative genome exactly; however, using a database of all Bacillus genomes present in GTDB release 214 (R214), we see that several Bacillus subtilis genomes are regarded as having all the BGCs predicted by antiSMASH & GECCO in strain 168. We provide comprehensive precompiled prepTG databases on Zenodo for the genera Bacillus, Streptomyces, and Micromonospora (featuring nearly all genomes belonging to these genera in GTDB R214) at: https://zenodo.org/records/10050207. To use these, you would just download and uncompress, e.g.:
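# Download the tarball (quote the URL so the shell does not interpret "?")
wget -O Micromonospora_prepTG_Database.tar.gz "https://zenodo.org/records/10050207/files/Micromonospora_prepTG_Database.tar.gz?download=1"
# Extract it with tar (gunzip alone cannot unpack a .tar.gz archive)
tar -zxvf Micromonospora_prepTG_Database.tar.gz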
Default parameters for fai-based detection of BGCs are: 50% of BGC genes and 75% of key BGC genes (see above) need to be identified, in whole or fragmented along scaffold edges, via DIAMOND BLASTp at an E-value threshold of 1e-10. A syntenic similarity of 0.6 is also required. Note that some BGCs might be highly paralogous and abon might not be able to resolve them reliably, e.g. if your sample has two paralogous BGCs, abon might report both as present in a target genome when only one is.
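As a worked illustration of these defaults, the sketch below applies the three thresholds (50% of all genes, 75% of key genes, syntenic similarity of 0.6) to made-up hit counts. The `bgc_passes_defaults` function is hypothetical and is not how fai is implemented. In particular, treating the cutoffs as inclusive (`>=`) is an assumption, and the E-value filtering and scaffold-edge handling are left out.

```shell
# Hypothetical helper, not part of abon or fai.
# Usage: bgc_passes_defaults GENES_TOTAL GENES_HIT KEY_TOTAL KEY_HIT SYNTENY
bgc_passes_defaults() {
  awk -v gt="$1" -v gh="$2" -v kt="$3" -v kh="$4" -v syn="$5" 'BEGIN {
    # Assumed inclusive cutoffs mirroring the documented defaults.
    ok = (gh / gt >= 0.5) && (kh / kt >= 0.75) && (syn >= 0.6)
    print (ok ? "present" : "absent")
  }'
}

# 11 of 20 genes (55%), 3 of 4 key genes (75%), synteny 0.72
bgc_passes_defaults 20 11 4 3 0.72   # prints "present"

# Same BGC, but only 2 of 4 key genes (50%) were found
bgc_passes_defaults 20 11 4 2 0.72   # prints "absent"
```

The second call shows why key genes matter: overall gene coverage is unchanged, yet the BGC is no longer counted because too few key genes were detected.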
usage: abon [-h] [-a ANTISMASH_RESULTS] [-g GECCO_RESULTS] -tg TARGET_GENOMES_DB [-fo FAI_OPTIONS] [-s] [-si SIMPLE_BLASTP_IDENTITY_CUTOFF] [-sc SIMPLE_BLASTP_COVERAGE_CUTOFF]
[-se SIMPLE_BLASTP_EVALUE_CUTOFF] [-sk SIMPLE_BLASTP_KEY_PROTEINS_PROPORTION_CUTOFF] [-sm SIMPLE_BLASTP_SENSITIVITY_MODE] -o OUTDIR [-c CPUS]
Program: abon
Author: Rauf Salamzade
Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology
abon - Assess Bgc-Ome Novelty
abon wraps fai to assess the novelty of a sample's BGC-ome relative to a set of target genomes.
Alternatively, it can run a simple DIAMOND BLASTp analysis to just assess the presence of BGC genes
individually - without the requirement they are co-located like in the focal sample's BGCs.
options:
-h, --help show this help message and exit
-a ANTISMASH_RESULTS, --antismash_results ANTISMASH_RESULTS
Path to antiSMASH BGC prediction results directory for a single sample/genome.
-g GECCO_RESULTS, --gecco_results GECCO_RESULTS
Path to GECCO BGC prediction results directory for a single sample/genome.
-tg TARGET_GENOMES_DB, --target_genomes_db TARGET_GENOMES_DB
prepTG database directory for target genomes of interest.
-fo FAI_OPTIONS, --fai_options FAI_OPTIONS
Provide fai options to run. Should be surrounded by quotes. [Default is "-kpm 0.75 -kpe 1e-10 -e 1e-10 -m 0.5 -dm -sct 0.6"]
-s, --use_simple_blastp
Use a simple DIAMOND BLASTp search with no requirement for co-localization of hits.
-si SIMPLE_BLASTP_IDENTITY_CUTOFF, --simple_blastp_identity_cutoff SIMPLE_BLASTP_IDENTITY_CUTOFF
If simple BLASTp mode requested : cutoff for identity between query proteins and matches in target genomes [Default is 40.0].
-sc SIMPLE_BLASTP_COVERAGE_CUTOFF, --simple_blastp_coverage_cutoff SIMPLE_BLASTP_COVERAGE_CUTOFF
If simple BLASTp mode requested : cutoff for coverage between query proteins and matches in target genomes [Default is 70.0].
-se SIMPLE_BLASTP_EVALUE_CUTOFF, --simple_blastp_evalue_cutoff SIMPLE_BLASTP_EVALUE_CUTOFF
If simple BLASTp mode requested : cutoff for E-value between query proteins and matches in target genomes [Default is 1e-10].
-sk SIMPLE_BLASTP_KEY_PROTEINS_PROPORTION_CUTOFF, --simple_blastp_key_proteins_proportion_cutoff SIMPLE_BLASTP_KEY_PROTEINS_PROPORTION_CUTOFF
If simple BLASTp mode requested : cutoff for proportion of key proteins needed to consider a BGC as present in a target genome [Default is 0.75].
-sm SIMPLE_BLASTP_SENSITIVITY_MODE, --simple_blastp_sensitivity_mode SIMPLE_BLASTP_SENSITIVITY_MODE
Sensitivity mode for DIAMOND BLASTp. [Default is "very-sensitive"].
-o OUTDIR, --outdir OUTDIR
Output directory.
-c CPUS, --cpus CPUS The number of CPUs to use.
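Based on the options above, here are two example invocations that deviate from the defaults. The first passes a stricter option string to fai via `-fo`, reusing only the fai flags that appear in the documented default string. Reading `-m` as the overall gene proportion and `-kpm` as the key gene proportion is inferred from the defaults described earlier, so check `fai -h` to confirm. The second switches to the simple DIAMOND BLASTp mode with custom identity and coverage cutoffs. The output directory names, the cutoff values, and the `run_or_print` wrapper are illustrative assumptions. The wrapper just prints the command when abon is not installed.

```shell
# Paths follow the tutorial above; adjust them to your own data.
TG_DB=Bacillales_Reps_prepTG_Database/
AS_DIR=Bsubtilis_st168_antiSMASH_Results/
GE_DIR=Bsubtilis_st168_GECCO_Results/

# Hypothetical convenience wrapper: run the command if abon is on PATH,
# otherwise only print what would have been run.
run_or_print() {
  if command -v abon >/dev/null 2>&1; then
    "$@"
  else
    echo "abon not found, would run: $*"
  fi
}

# 1) fai-based search with stricter cutoffs (values chosen for illustration).
#    The fai option string must be quoted so abon receives it as one argument.
run_or_print abon -tg "$TG_DB" -a "$AS_DIR" -g "$GE_DIR" \
  -o abon_Strict_Results/ -c 20 \
  -fo "-kpm 0.9 -kpe 1e-10 -e 1e-10 -m 0.7 -dm -sct 0.6"

# 2) Simple DIAMOND BLASTp mode: no co-localization requirement,
#    with stricter identity and coverage cutoffs than the defaults.
run_or_print abon -tg "$TG_DB" -a "$AS_DIR" -g "$GE_DIR" \
  -o abon_SimpleBLASTp_Results/ -c 20 \
  -s -si 50.0 -sc 80.0 -se 1e-10 -sk 0.75
```

Writing each run to its own output directory keeps the results of the two modes separate so their spreadsheets can be compared side by side.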