
4.1. selecting parameters for fai and zol

Rauf Salamzade edited this page Mar 24, 2024 · 1 revision

Selecting parameter values for fai

If the user has previously identified a handful of diverse instances of a gene cluster, they can provide them to zol with the --select_fai_params_mode option to obtain appropriate parameter values and recommended commands for fai.
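As a rough illustration, the call could be assembled from Python as sketched below. Only --select_fai_params_mode comes from this page. The -i (directory of GenBank files) and -o (output directory) flags and both directory names are assumptions, so check them against `zol --help` before use.

```python
import shlex


def build_zol_param_selection_command(genbank_dir, output_dir):
    """Build the argument list for a zol run in fai parameter selection mode.

    ASSUMPTION: -i and -o are zol's input and output directory flags;
    confirm with `zol --help`.
    """
    return ["zol", "-i", genbank_dir, "-o", output_dir, "--select_fai_params_mode"]


# Both directory names are placeholders for illustration.
cmd = build_zol_param_selection_command(
    "Known_Instances_GenBanks/", "zol_fai_param_selections/"
)
print(shlex.join(cmd))

# To actually run it (requires zol to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids shell quoting problems if the paths contain spaces.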

An example of the report produced looks like:

=============================================================
Recommendations for running fai to find additional instances of gene cluster:
-------------------------------------------------------------
Note, this functionality assumes that the known instances of the gene cluster
are representative of the gene cluster/taxonomic diversity you will be searching.
=============================================================
General statistics:
=============================================================
Maximum of maximum E-values observed for any OG 0.000868
Maximum of near-core OG E-values observed:      1.98e-05
Maximum distance between near-core OGs: 14
Median CDS count:       84.0
Median proportion of CDS which are near-core (conserved in 80 percent of gene-clusters):        0.3950617283950617
Best representative query gene-cluster instance to use: /home/salamzade/zol_development/showcase_examples/determine_phage_and_BGC_thresholds/phage/zol_fai_param_selections/PhamClust_3/Local_Modified_GenBanks/NC_023562.1.gbk
=============================================================
Parameter recommendations - CPUs set to 4 by default
please provide the path to the prepTG database yourself!
=============================================================
Lenient / Sensitive Recommendations for Exploratory Analysis:
fai --cpus 4 --output_dir fai_Search_Results/ --draft_mode --evalue_cutoff 0.000868 --min_prop 0.1 --syntenic_correlation_threshold 0.0 --max_genes_disconnect 17
-------------------------------------------------------------
Strict / Specific Recommendations:
fai --cpus 4 --output_dir fai_Search_Results/ --draft_mode --filter_paralogs --evalue_cutoff 0.000868 --min_prop 0.25 --syntenic_correlation_threshold 0.0 --max_genes_disconnect 14 --key_protein_queries /home/salamzade/zol_development/showcase_examples/determine_phage_and_BGC_thresholds/phage/zol_fai_param_selections/PhamClust_3/NearCore_Proteins_from_Representative.faa --key_protein_min_prop 0.5 --key_protein_evalue_cutoff 1.98e-05
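The recommended commands in the report do not include the prepTG database (the report asks you to supply it) or the query gene cluster. The sketch below shows one way to pull a recommended command out of the report text, inspect its options, and append the missing inputs. The function names are made up for illustration, and the -i and -tg flags are assumptions about fai's interface that should be confirmed with `fai --help`.

```python
import shlex

# Excerpt of the example report above (lenient recommendation only, for brevity).
REPORT = """\
Maximum distance between near-core OGs: 14
Median CDS count:       84.0
Lenient / Sensitive Recommendations for Exploratory Analysis:
fai --cpus 4 --output_dir fai_Search_Results/ --draft_mode --evalue_cutoff 0.000868 --min_prop 0.1 --syntenic_correlation_threshold 0.0 --max_genes_disconnect 17
"""


def recommended_commands(report):
    """Return every line of the report that is a recommended fai command."""
    return [line.strip() for line in report.splitlines() if line.startswith("fai ")]


def parse_options(command_line):
    """Map each --flag to its value, or to True for flags that take no value."""
    tokens = shlex.split(command_line)[1:]  # drop the program name
    options = {}
    for i, token in enumerate(tokens):
        if not token.startswith("--"):
            continue
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        has_value = following is not None and not following.startswith("--")
        options[token] = following if has_value else True
    return options


def complete_command(command_line, query_gbk, preptg_db):
    """Append the query and database arguments that the report leaves out.

    ASSUMPTION: -i (query GenBank) and -tg (prepTG database) are fai's flags
    for these inputs; confirm with `fai --help`.
    """
    return command_line + " " + shlex.join(["-i", query_gbk, "-tg", preptg_db])


lenient = recommended_commands(REPORT)[0]
options = parse_options(lenient)
print(options["--evalue_cutoff"], options["--max_genes_disconnect"])

# Both paths below are placeholders.
print(complete_command(lenient, "representative.gbk", "prepTG_Database/"))
```

In practice the query would be the "Best representative query gene-cluster instance" path that the report prints.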

Prior distributions for fai parameter values for gene cluster families (BiG-SCAPE GCFs) and phage clusters (PhamClust)

Characterized BGCs from MIBiG v3.1 were downloaded and clustered into gene cluster families using BiG-SCAPE. Phage clusters from PhamClust were also gathered. Each group of similar gene clusters (BGCs or phages) was then processed in batch through zol with --select_fai_params_mode. The results, shown below, can be used to set fai parameters when searching for BGCs or phages without better prior information:

[Image: distributions of fai parameter values for BiG-SCAPE GCFs and PhamClust phage clusters]

Selecting parameter values for zol

Some zol parameters control the granularity of ortholog group clustering. These include the percent identity and coverage thresholds that a pair of proteins must meet to be considered related prior to MCL clustering. Depending on the set of gene clusters being investigated, the default values of these parameters might be too stringent or too loose.

For best results with zol, if fai was used to identify the gene clusters, we advise users to inspect the spreadsheet fai produces to see what values for these thresholds might be appropriate.
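One way to make that assessment less manual is to summarize a numeric column of the spreadsheet, as in the minimal sketch below. It assumes a tab-separated export of the fai spreadsheet; the column name and the toy rows are hypothetical placeholders, so substitute a real column header from your own file. It only reports the range of observed values and does not choose a threshold for you.

```python
import csv
import io
import statistics


def summarize_column(tsv_handle, column):
    """Summarize the numeric values found in one column of a TSV file."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    values = []
    for row in reader:
        cell = (row.get(column) or "").strip()
        try:
            values.append(float(cell))
        except ValueError:
            continue  # skip blank or non-numeric cells such as "NA"
    if not values:
        raise ValueError(f"no numeric values found in column {column!r}")
    return {
        "n": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }


# Toy data with a HYPOTHETICAL column name, for illustration only.
toy_tsv = "sample\tpercent_identity\nA\t92.5\nB\t71.0\nC\tNA\nD\t84.0\n"
summary = summarize_column(io.StringIO(toy_tsv), "percent_identity")
print(summary)
```

With a real file, pass `open(path, newline="")` instead of the in-memory `io.StringIO` handle.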