Example workflow

milnus edited this page Nov 15, 2022 · 6 revisions

This is a short walk-through of producing a small pan-genome and running Corekaburra on it. It is not a 'best practice' for producing a pan-genome for your specific analysis, as that process depends on your question. If you find anything unclear, or see a way to improve this tutorial, let us know.

Genomes used

For this simple example we use four genomes of Streptococcus pyogenes. 5448.gff, NCTC8198.gff, and SF370.gff are all reference genomes for the S. pyogenes lineage called emm1. H293.gff is a reference for the emm12 lineage.
The genomes used can be found here, and should be downloaded if you want to follow this example.

Navigate to the folder in which you have downloaded the folder holding the Gff files. Let us inspect the folder:
$ ls Input_Gff_files/
5448.gff H293.gff NCTC8198.gff SF370.gff
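
Before building the pan-genome it can help to confirm that all the expected files are in place. The following is a minimal sketch; `count_gffs` is a hypothetical helper name and is not part of Panaroo or Corekaburra.

```shell
# Hypothetical helper: print the number of .gff files in a directory.
count_gffs() {
    local dir="$1" n=0 f
    for f in "$dir"/*.gff; do
        # An unmatched glob stays literal, so only count real files.
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

For this example, `count_gffs Input_Gff_files` should print 4.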

Build a pan-genome

We start by building a simple pan-genome from the four genomes using the Panaroo pipeline (version 1.3.0).
The command we use is:
$ panaroo --clean-mode strict -i Input_Gff_files/*.gff -o Simple_test_pangenome

Quick inspection of pan-genome

To get a quick overview of the pan-genome we can inspect the summary_statistics.txt file produced by Panaroo. In this case the soft core and cloud categories contain 0 genes, due to the low number of genomes used to construct the pan-genome. The thing to note here is the number of core genes: 1496.

$ cat Simple_test_pangenome/summary_statistics.txt
Core genes (99% <= strains <= 100%) 1496
Soft core genes (95% <= strains < 99%) 0
Shell genes (15% <= strains < 95%) 533
Cloud genes (0% <= strains < 15%) 0
Total genes (0% <= strains <= 100%) 2029
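
If you want the core gene count in a script, for example to compare it with Corekaburra's output later, it can be pulled from the summary file. This is a minimal sketch that assumes the file layout shown above, with the count as the last field on each row. `core_gene_count` is a hypothetical helper name.

```shell
# Hypothetical helper: print the count from the "Core genes" row.
core_gene_count() {
    # "Soft core genes" does not match because the pattern is anchored
    # to the start of the line; the count is the last field on the row.
    awk '/^Core genes/ { print $NF }' "$1"
}
```

For the file shown above, `core_gene_count Simple_test_pangenome/summary_statistics.txt` prints 1496.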

Correcting Gff files (only done for Panaroo)

As Panaroo tries to improve the annotation of genomes, some genes are refound during the run. These refound genes are not included in the Gff files by default, but Panaroo version 1.3.0 includes a script to add them. We run the panaroo-generate-gffs script to correct the Gff files:
$ panaroo-generate-gffs -i Input_Gff_files/*.gff -f prokka -o Simple_test_pangenome/

We now have corrected Gff files in the Panaroo output directory:
$ ls Simple_test_pangenome/postpanaroo_gffs/
5448_panaroo.gff H293_panaroo.gff NCTC8198_panaroo.gff SF370_panaroo.gff

To make the genome names match those in the gene_presence_absence_roary.csv file from Panaroo, we must rename the files by removing the _panaroo suffix:
$ for f in Simple_test_pangenome/postpanaroo_gffs/*.gff; do mv "$f" "${f/_panaroo.gff/.gff}"; done
(This is not the most elegant way to rename files, but it needs only mv; the rename command is not always available by default.)
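
To confirm that the file names line up with the pan-genome table, the genome names in the CSV header can be compared with the files on disk. This is a sketch under one stated assumption: that genome columns begin after 14 leading metadata columns, as in Roary-style presence/absence tables. Check your own file's header and pass a different number as the third argument if it differs. `missing_gffs` is a hypothetical helper name.

```shell
# Hypothetical helper: print each genome named in the CSV header that has
# no matching <name>.gff file in the given directory.
# Assumption: genome columns start after `meta` metadata columns (default 14).
missing_gffs() {
    local csv="$1" dir="$2" meta="${3:-14}" name
    # Take the header, drop quotes and carriage returns, put one column
    # name per line, then skip the metadata columns.
    head -n 1 "$csv" | tr -d '"\r' | tr ',' '\n' | tail -n +"$((meta + 1))" |
    while IFS= read -r name; do
        [ -f "$dir/$name.gff" ] || echo "$name"
    done
}
```

Running `missing_gffs Simple_test_pangenome/gene_presence_absence_roary.csv Simple_test_pangenome/postpanaroo_gffs` should print nothing once every genome has a matching file.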

Running Corekaburra

Complete genomes

As all of the genomes in this example are complete (closed) genomes, we create an input file for Corekaburra to indicate this:
$ ls Simple_test_pangenome/postpanaroo_gffs/* > Complete_genomes.txt

$ cat Complete_genomes.txt
Simple_test_pangenome/postpanaroo_gffs/5448.gff
Simple_test_pangenome/postpanaroo_gffs/H293.gff
Simple_test_pangenome/postpanaroo_gffs/NCTC8198.gff
Simple_test_pangenome/postpanaroo_gffs/SF370.gff
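
Because the list is built with ls, it is worth checking that every entry still points to a file, for instance after renaming or moving the Gff files. This is a minimal sketch; `missing_from_list` is a hypothetical helper name.

```shell
# Hypothetical helper: print every path in a list file that is not an
# existing regular file. Blank lines are skipped.
missing_from_list() {
    local path
    # The extra test keeps a final line that lacks a trailing newline.
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue
        [ -f "$path" ] || echo "$path"
    done < "$1"
}
```

`missing_from_list Complete_genomes.txt` should print nothing when all listed files exist.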

Corekaburra

Now we are ready to run Corekaburra on our pan-genome!

We use the following command:
$ Corekaburra -ig Simple_test_pangenome/postpanaroo_gffs/*.gff -ip Simple_test_pangenome/ -cg Complete_genomes.txt -o Simple_genomes_corekaburra

Both stdout and the .log file report the number of core genes and the other categories identified by Corekaburra (see below). In this case Corekaburra finds as many core genes with 'useful' syntenic information as Panaroo did: 1496. This may not always be the case, especially as datasets get larger and contain more fragmented genes.
A total of:
1496 core gene clusters were identified
324 low frequency gene clusters were identified
209 intermediate accessory gene clusters were identified
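
A quick sanity check is that the three categories add up to the total number of genes in the pan-genome. This is a minimal sketch that assumes the summary lines start with the count, as shown above, and that they have been saved to a file. `sum_clusters` is a hypothetical helper name.

```shell
# Hypothetical helper: add up the leading counts on all lines that report
# identified gene clusters.
sum_clusters() {
    awk '/gene clusters were identified/ { total += $1 }
         END { print total + 0 }' "$1"
}
```

For the numbers above, 1496 + 324 + 209 = 2029, which matches the total reported in Panaroo's summary_statistics.txt. The low frequency and intermediate accessory clusters together (324 + 209 = 533) equal the number of shell genes reported by Panaroo.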

Going further

With the output from Corekaburra, multiple questions can be explored. We have provided some examples, and ways of approaching them, in the Down stream analyses section of this wiki.