Skip to content

5. tutorial ‐ a detailed walkthrough

Rauf Salamzade edited this page Oct 26, 2024 · 27 revisions

Contents

Note on re-running and writing to the sample results directory

If re-running fai and zol and writing to a previously existing results directory, they will "overwrite" results - but this is only for steps which have not been successfully run already. Checkpoint files are kept in the subdirectory Checkpoint_Files/. If users which to completely overwrite results, they can simply delete the resulting directory or if they want to rerun specific steps, they can delete the corresponding checkpoint files.

Overview: Investigation of the Enterococcal polysaccharide antigen (epa) in E. faecalis and E. faecium

This tutorial will simply walkthrough the test commands in the run_tests.sh script to test proper installation and showcase various features in the suite. The test data pertains to the enterococcal polysaccharide antigen encoding locus epa in Enterococcus faecalis and Enterococcus faecium. We perform a more thorough examination of this gene cluster in the zol manuscript.

Step 1: Download and setup workspace

Lets begin by downloading the test dataset:

wget https://github.com/Kalan-Lab/zol/raw/main/test_case.tar.gz

Next, we can uncompress it and change directories into it:

tar -zxvf test_case.tar.gz
cd test_case/

Step 2: Create a prepTG database of target genomes to search (or download a premade one) 📁

Before we use fai to identify homologous or orthologous instances of gene clusters in a set of target genomes, we must format the target genomes using prepTG.

prepTG takes a directory of genomic assemblies as input, in either FASTA or GenBank format. If GenBank format is provided, the expectation is that CDS features are included (in other words it is a full GenBank). If FASTA files are provided the default operation is to use pyrodigal for bacterial gene calling, however, options exist to use prodigal or minimap instead. minimap was incorporated specifically for eukaryotic genomes provided in FASTA format, where users can also provide a predicted proteome file (another FASTA file with protein sequences) of a reference genome to map those to the remainder of the genomes. We took such an approach in the zol manuscript to map high-quality coding sequence predictions from a reference Aspergillus flavus genome to the remainder of Aspergillus flavus genomes available in NCBI but lacking coding sequence predictions. For prodigal and pyrodigal, usage metagenomics gene calling mode is also available as a configurable option.

Our target genomes, where we will be searching for the epa locus can be found in the subdirectory: Target_Genomes/. Because we are dealing with bacterial genomes - we can simply run prepTG with default settings as such:

prepTG -i Target_Genomes/ -o prepTG_Database/ -c 4 -cst

The -c option in this case controls the number of parallel threads to use.

The -cst option (new in v1.3.7+) creates a species tree of the target genomes from skani-based ANI estimates (works best for within a species or less diverse genus) + neighbor-joining. This species tree can then be used for visualization purposes downstream in fai. The resulting file will be in the prepTG database folder and named Species_Tree.nwk.

Automatically setup prepTG for any genus or species in GTDB

Users can easily create a prepTG database directory for all genomes classified as a particular species in GTDB using the -g

prepTG -i Folder_with_Additional_User_Provided_Genomes/ -g "Enterococcus" -o prepTG_DB/

Caution

BE CAREFUL, WELL-SEQUENCED TAXA CAN RESULT IN LARGE PREPTG DATABASES AND LARGE FILES IN THE FAI RESULTS!!!

Downloading a premade database of representative genomes for select taxa

We have uploaded premade databases of representative genomes for select bacterial taxa (see the premade prepTG dbs wiki page for further info).

Starting in version 1.3.7 of the suite, users can simply issue the command:

prepTG -d Enterococcus -o prepTG_Database/

Note, databases are stored on Zenodo. Option should generally work, but download speeds might be slow at times.

Downloading custom genome listings of GCAs/GCFs from NCBI's RefSeq or GenBank databases

Users should also be familiar with the ever so useful ncbi-genome-download program which is included automatically in the zol conda environment.

GitHub link: https://github.com/kblin/ncbi-genome-download

# Example for downloading all Enterococcus genomes in RefSeq in FASTA format
ncbi-genome-download -F fasta -s refseq -g "Enterococcus" --flat-output -o RefSeq_Enterococcus_Genomes/ bacteria -p 4

Step 3: Search for your query gene cluster in target genomes with fai 🔎

Get your query gene cluster(s) from somewhere in FASTA or GenBank format

fai accepts query gene clusters in multiple formats. For a description of these please see the basic usage examples wiki page.

In the test dataset, we already have a known instance of the epa gene cluster from E. faecalis provided as a protein FASTA (Epa_Proteins_from_MIBiG_GenBank.faa) and in GenBank format (Epa_MIBiG_GBK/Epa_MIBiG_GenBank.gbk).

We could also manually assemble a protein FASTA file from literature or online database such as NCBI or KEGG. We could also download the GenBank for the gene cluster directly from MIBiG using wget: wget https://mibig.secondarymetabolites.org/repository/BGC0000792/BGC0000792.gbk. Alternatively, we could provide a coordinate for the epa along some reference genome, e.g. the E. faecalis* V583 genome (also provided in the workspace, Efaecalis_V583_Genome.fasta). These options enables users to reference the coordinates of interesting they find in literature or cool webservers, such ICEberg or IslandFinder, more easily.

Running fai

Here is a quick look at how using these options might look.

Using a query provided as a FASTA file of proteins

fai -pq Epa_Proteins_from_MIBiG_GenBank.faa -tg prepTG_Database/ -o fai_Results/

A special case is using a single gene as a query via the -sq option - which gains inspiration from CORASON/EvoMining - more support for this option might be developed in the future.

Using a query provided as a GenBank file

fai -i Epa_MIBiG_GBK/Epa_MIBiG_GenBank.gbk -tg prepTG_Database/ -o fai_Results/

Using coordinates along a reference genome

fai -r Efaecalis_V583_Genome.fasta -rc NC_004668.1 -rs 2083902 -re 2115174 -tg prepTG_Database/ -o fai_Results/

Where are the results? 💥

The set of homologous gene cluster instances identified in target genomes will be located in the subdirectory: fai_Results/Final_Results/Homologous_Gene_Cluster_GenBanks/

Key options in fai to consider:

Here is a quick overview of key options in fai a user should consider. Additional details around the algorithms fai uses to delineate gene can be found on the more info on fai wiki page.

  • -dm or --draft-mode : specifying this option will enable draft-assembly mode searching, which allows for looser search criteria and assumptions that the gene cluster might be split up across multiple scaffolds/contigs. False positives are also likely to be incurred.
  • -fp or --filter-paralogs : specifying this option will specify to filter secondary hits (which might be paralogous instances of the gene cluster) if multiple, overlapping hits for the gene cluster are found in individual target genomes.
  • -e or --evalue : The e-value cutoff for whether to consider a gene in a target genome exhibits similarity to a protein from the query gene cluster using DIAMOND blastp. Default value is 1e-10.
  • -m or --min-prop : The minimum proportion of genes from the query gene cluster needed to report a homologous instance of a query gene cluster. Note genes are actually collapsed into distinct alleles upfront, so this is the minimum proportion of distinct alleles. Default value is 0.5.
  • -mgd or --max-genes-disconnect : The maximum number of CDS features separating two genes in a target genome which exhibit sufficient similarity to a protein from the query gene cluster for them to be considered as part of the same cluster instance. Default is 5.
  • -kpq or --key-protein-queries : A FASTA protein file can be provided with individual proteins that are special and that users can specify separate e-value or conservation thresholds for gene cluster detection using (via the -kpe and -kpm options). E.g. are you looking for an NRPS-type BGC and there are three key NRPS genes, then you can specify them separately and make your search more stringent while loosening up criteria for detection of auxiliary gene cluster components.
  • -sct or --synteic-correlation-threshold : If a GenBank or coordinates along a reference genome are provided, a syntenic similarity assessment between detected gene clusters and the query gene cluster will be performed based on global correlation gene order similarity. Values closer to 1 are more stringent whereas a value of 0 implies no syntenic filter should be applied. The default value is 0.6.
  • -gdm or --gc-delineation-mode : The gene cluster delineation mode. There are basically two options, for most users we recommend the default setting.
  • -f or --flanking-context : The flanking of a homologous instance of the gene cluster identified in a target genome to include in the resulting GenBank output, useful to explore conservation and annotation of surrounding contexts of the gene cluster downstream in zol. Default is 1000 bp.
  • -gp or --generate-plots : Whether to generate a PDF with plots showcasing the similarity of detected homologous gene cluster instances from target genomes to the query gene cluster. Looks like the following, each bar is a gene along the target genome, the height corresponds to the ratio of the protein in the target genome to the best matching protein from the query gene cluster. The color corresponds to the percent identity (more red = higher identity).

image

  • -st or --species-tree : providing this option will allow using a species tree of the target genomes to construct a tree-heatmap figure showing whether the presence of the query gene cluster is widespread or clade specific.

Check out cblaster and CAGECAT as alternatives to fai for finding homologous instances of a query gene cluster

zol just takes a set of GenBanks as input and it is definitely possible to use cblaster by Gilchrist et al. 2021 instead of fai to gather gene-clusters in GenBank format; for instance:

# use cblaster to search for homologous co-clusters in NCBI genomes
cblaster search -qf queries.faa -s cblaster_results.json

# use cblaster to extract GenBannks of homologous gene-clusters detected
cblaster extract_clusters session.json -o example_directory/

CAGECAT by van den Belt et al. 2023 is an awesome webserver for running cblaster and clinker.

cblaster features the fantastic ability to use NCBI's computational infrastructure to perform alignment thereby minimizing the need for local resources.

Step 4: Manual selection of homologous/orthologous gene cluster instances identified by fai 👀

fai will automatically produce an XLSX spreadsheet which allows users to assess homologous hits to the query gene cluster at scale. Most columns feature automatic conditional formatting to ease user assessment of quantitative fields. It is inspired by the heatmap visuals from cblaster, but the spreadsheet format allows further flexibility for users to sort on various columns.

image

Users can more freely apply filters using this spreadsheet and select specific homologous gene cluster instances for follow-up analysis in zol. To further support such manual curation of fai's hits, we also include the script selectSpecificGeneClusters.py which can be provided with the fai results directory with a text file listing either (1) sample names (first column in the sheet "Genome Wide - Results") or (2) individual gene-cluster instances (second column in the sheet "Gene Cluster - Results").

selectSpecificGeneClusters.py -i fai_Results/ -s select_instances.txt -o select_instances_dir/  -t instance

Step 5: Generate a tabular report with evolutionary trends, annotation info, and conservation stats using zol 📃 📈

Finally, we can investigate the relationships between identified homologous instances of the query gene cluster using zol.

We can do this using the following basic command:

zol -i fai_Results/Final_Results/Homologous_Gene_Cluster_GenBanks/ -o zol_Results/ -c 10

Once more -c is just specifying the number of threads to use.

I highly recommend using the option -cd or --custom-database to provide proteins from the query gene cluster used for fai as an extra database for annotation. This will map these proteins to the ortholog groups so you have a point of reference to the original query gene cluster.

zol -i fai_Results/Final_Results/Homologous_Gene_Cluster_GenBanks/ -cd Epa_Proteins_from_MIBiG_GenBank.faa -o zol_Results/ -c 10

In the final Excel spreadsheet generated, one of the columns will now feature annotations from the custom database (protein FASTA file) provided.

Where are the results? 💥

The final results, once zol is done, can be found in the subdirectory at zol_Results/Final_Results/ with the major results file being the XLSX spreadsheet Consolidated_Report.xlsx.

Key options in zol to consider:

Here are a list and brief description of key options to consider in zol:

  • it or identity-threshold, ct or coverage-threshold, and et or evalue-threshold : These are key thresholds used for InParnoid-type ortholog grouping in zol. The default values might be appropriate for some gene clusters being studied but looser or more stringent criteria might be beneficial to your analysis for other gene clusters.
  • -fl or --filter-low-quality : Filter out gene clusters which feature alot of missing bases (>10%). I almost always issue this, but it is not turned on by default.
  • -fd or --filter-draft-quality : Filter out gene clusters which were found near scaffold edges. I also usually specify this, especially when interested in gene conservation. This is because it becomes tricky to assess whether a gene is missing simply because the gene cluster is fragmented due to assembly issues or because it is actually missing. If fai is used in "draft mode", users could provide multi-record GenBanks and choose to skip this argument, I have not tested the use of multi-record GenBanks much.
  • -r or --rename-lt : If CDS features in input GenBanks don't have locus_tag identifiers, just generate them from scratch.
  • -d or --dereplicate : Dereplicate gene clusters to remove highly similar versions. Can be used to reduce the complexity of the analysis. It is recommended to consider using dereplication if you are dealing with thousands of genomes to speed things up and keep the amount of harddisk space needed (albeit temporarily) down!!!.
  • -ri or --reinflate : This flag tells zol to reinflate the ortholog groups determined using a dereplicated/represntative set of gene clusters with proteins from the full set of gene clusters provided as input. Note, this reinflation currently works for grouping proteins which are very similar to an already (ortholog) grouped protein from the representative gene clusters. dereplication must be specified alongside this option. This approach is inspired by Roary.
  • -cd or --custom-database : See above section, basically it is usually nice to have proteins from the known gene cluster instance to reference instead of just arbitrary ortholog group identifiers, you can do this by providing a protein FASTA file for custom annotation of ortholog groups.
  • -dom or --domain-mode : Will "chop-up" protein coding CDS features into largely, non-overlapping sets of inter-domain/intra-domain regions. Useful for detecting signatures of adaptive evolution in individual domains when dealing with gene clusters featuring large, multi-domain enzymes.

Pair up zol with the visual capabilities of clinker

A key feature of zol is the option to dereplicate gene clusters - which can allow zol to pair nicely with interactive gene cluster visualization software, in particular clinker. This is because users can apply zol to first select representative gene clusters that are sufficiently distinct from each other and then visualize only this set using clinker (which also just takes GenBank files of gene clusters as input) to help with computational scalability and the responsiveness of the awesome HTML reports/figures.

Assuming dereplication is requested, representative gene cluster GenBank files will be found in the directory: zol_Results/Dereplicated_GenBanks/. Users can provide the smaller set of gene cluster instances in this format as input to clinker.

Step 6: Comparative analysis of gene clusters with zol 🍎 🍊

Users can perform comparative analyses between sets of gene clusters in zol. This is done by specifying a focal set of gene clusters by name in a text file and, optionally, a complementary set of gene clusters. These gene cluster sets could be instances belonging to a certain taxonomic group or associated with a certain environment.

We can demonstrate running a comparative analysis using the testing dataset. Specifically, let's say we want to compare instances of epa from E. faecalis to instances of epa from E. faecium. And our fai_Results/Final_Results/Homologous_Gene_Cluster_GenBanks/ has the following files as viewed using ls:

Efaecalis_B594_fai-gene-cluster-1.gbk   Efaecalis_V583_fai-gene-cluster-1.gbk      Efaecium_EnGen0013_fai-gene-cluster-1.gbk
Efaecalis_OG1RF_fai-gene-cluster-1.gbk  Efaecium_EnGen0003_fai-gene-cluster-1.gbk  Efaecium_EnGen0018_fai-gene-cluster-1.gbk

We can define the set of gene clusters belonging to E. faecalis by name, one per line, in a file to provide to zol's -f argument as such:

Efaecalis_B594_fai-gene-cluster-1.gbk
Efaecalis_OG1RF_fai-gene-cluster-1.gbk
Efaecalis_V583_fai-gene-cluster-1.gbk

Then, we can simply run zol in comparative mode using the following command. Because all other gene clusters are from E. faecium, we don't need to formally specify the comparing set via the -fc option. By default all gene clusters not defined in the focal gene cluster set will be used as the comparing set when only -f.

zol -i fai_Results/Final_Results/Homologous_Gene_Cluster_GenBanks/ -f E_faecalis_GeneCluster_Listing.txt -o zol_Results_with_Comparaitve_Analysis/ -c 10

Step 7: Create a pan-gene-cluster network visual using cgcg 📷

If tables are not your thing or you want to incorporate the zol data into a figure, you can use cgc or cgcg. In particular, cgcg will create an HTML file with an interactive network of ortholog groups with edges representative gene order information and coloring some quantitative evolutionary statistic:

cgcg -i zol_Results/ -o cgcg_Results/

Clone this wiki locally