Example workflow
This is a short tutorial/walk-through guide on producing a small pan-genome and running Corekaburra. This is not a 'best practice' for producing a pan-genome for your specific analysis, as this process depends on your question. If you find anything unclear, or see a way to improve the following tutorial, let us know.
For this simple example we use four genomes of Streptococcus pyogenes. 5448.gff, NCTC8198.gff, and SF370.gff are all reference genomes for the lineage of S. pyogenes called emm1. H293.gff is a reference for the emm12 lineage.
The genomes used can be found here, and should be downloaded if you want to follow this example.
Navigate to the folder that contains the downloaded Input_Gff_files folder holding the Gff files.
Let us inspect the folder:
$ ls Input_Gff_files/
5448.gff H293.gff NCTC8198.gff SF370.gff
We start by building a pan-genome, in this case a simple one from four genomes, using the Panaroo pipeline (version 1.3.0).
The command we use is:
$ panaroo --clean-mode strict -i Input_Gff_files/*.gff -o Simple_test_pangenome
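To get a quick overview of the pan-genome we can inspect the summary_statistics.txt file produced by Panaroo. In this case the soft core and cloud categories have 0 genes, due to the low number of genomes used in the construction of the pan-genome: with four genomes a gene can only be present in 25%, 50%, 75%, or 100% of them, and none of these values falls into those two ranges. The thing to note here is the number of core genes: 1496.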
$ cat Simple_test_pangenome/summary_statistics.txt
Core genes (99% <= strains <= 100%) 1496
Soft core genes (95% <= strains < 99%) 0
Shell genes (15% <= strains < 95%) 533
Cloud genes (0% <= strains < 15%) 0
Total genes (0% <= strains <= 100%) 2029
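As a quick sanity check, the categories in the summary should add up to the reported total. The following is a minimal sketch, assuming every line of summary_statistics.txt ends with the gene count as in the output above; the function name check_summary_totals is hypothetical and not part of Panaroo.

```shell
# Hypothetical helper (not part of Panaroo): sum every category except
# "Total genes" and compare the sum with the reported total.
# Assumes the last whitespace-separated field on each line is the count.
check_summary_totals() {
    awk '
        /^Total genes/ { total = $NF; next }   # remember the reported total
        NF > 0         { sum += $NF }          # add up all other categories
        END {
            printf "categories sum to %d, reported total is %d\n", sum, total
            exit (sum == total ? 0 : 1)        # non-zero exit on a mismatch
        }
    ' "$1"
}
```

It can be called as `check_summary_totals Simple_test_pangenome/summary_statistics.txt`; for the numbers shown above, 1496 + 0 + 533 + 0 equals the reported 2029.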
Because Panaroo tries to improve the annotation of genomes, some genes are refound. These refound genes are not included in the Gff files by default, but Panaroo version 1.3.0 includes a script to add them. We run the panaroo-generate-gffs script to correct the Gff files:
$ panaroo-generate-gffs -i Input_Gff_files/*.gff -f prokka -o Simple_test_pangenome/
We now have corrected Gff files in the Panaroo output directory:
$ ls Simple_test_pangenome/postpanaroo_gffs/
5448_panaroo.gff H293_panaroo.gff NCTC8198_panaroo.gff SF370_panaroo.gff
To make the names of genomes match the gene_presence_absence_roary.csv file from Panaroo we must rename the files:
$ for f in Simple_test_pangenome/postpanaroo_gffs/*.gff; do mv "$f" "${f/_panaroo.gff/.gff}"; done
(This is not the most elegant way of renaming files, but it works with just mv, as rename is not always available by default.)
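The `${f/_panaroo.gff/.gff}` substitution used above is a bash feature and replaces the first match anywhere in the path. A slightly stricter variant is sketched below; it strips the suffix only at the end of the file name, using the POSIX `${f%suffix}` expansion. The function name strip_panaroo_suffix is hypothetical and introduced only for illustration.

```shell
# Hypothetical helper: rename every *_panaroo.gff file in a directory to *.gff.
# ${f%_panaroo.gff} removes the suffix only when it ends the file name.
strip_panaroo_suffix() {
    for f in "$1"/*_panaroo.gff; do
        [ -e "$f" ] || continue          # the glob matched nothing; skip
        mv -- "$f" "${f%_panaroo.gff}.gff"
    done
}
```

It would be called as `strip_panaroo_suffix Simple_test_pangenome/postpanaroo_gffs`. Running it a second time is harmless, because no file ends in _panaroo.gff any longer.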
As all of the genomes in this example are complete/closed genomes we create an input file for Corekaburra to indicate this:
$ ls Simple_test_pangenome/postpanaroo_gffs/* > Complete_genomes.txt
$ cat Complete_genomes.txt
Simple_test_pangenome/postpanaroo_gffs/5448.gff
Simple_test_pangenome/postpanaroo_gffs/H293.gff
Simple_test_pangenome/postpanaroo_gffs/NCTC8198.gff
Simple_test_pangenome/postpanaroo_gffs/SF370.gff
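Before passing the list to Corekaburra it can be worth confirming that every path in it points to an existing file, for example to catch a list written before the rename step. This is a minimal sketch, assuming one path per line as shown above; check_listed_files is a hypothetical helper and not part of Corekaburra.

```shell
# Hypothetical helper: report every path in a list file that does not exist.
# Assumes one path per line; empty lines are ignored.
check_listed_files() {
    missing=0
    while IFS= read -r path; do
        [ -n "$path" ] || continue           # skip empty lines
        if [ ! -f "$path" ]; then
            echo "missing: $path"
            missing=$((missing + 1))
        fi
    done < "$1"
    echo "$missing missing file(s)"
    [ "$missing" -eq 0 ]                     # non-zero exit if any are missing
}
```

It would be called as `check_listed_files Complete_genomes.txt`.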
Now we are ready to run Corekaburra on our pan-genome!
We use the following command:
$ Corekaburra -ig Simple_test_pangenome/postpanaroo_gffs/*.gff -ip Simple_test_pangenome/ -cg Complete_genomes.txt -o Simple_genomes_corekaburra
In the stdout and the .log file we get information on the numbers of core genes and other categories identified by Corekaburra (see below). In this case Corekaburra finds as many core genes with 'useful' syntenic information as Panaroo did, 1496. This may not always be the case, especially as datasets get larger and more complex in terms of fragmented genes.
A total of:
1496 core gene clusters were identified
324 low frequency gene clusters were identified
209 intermediate accessory gene clusters were identified
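The counts reported by Corekaburra can be compared against the Panaroo summary with simple shell arithmetic. The sketch below uses only the numbers printed in this example; that the two accessory categories add up to Panaroo's shell gene count is an observation for this dataset, not a general guarantee.

```shell
# Counts copied from the Corekaburra output above.
core=1496
low_freq=324
intermediate=209

# Accessory clusters, to compare with Panaroo's 533 shell genes in this example.
echo "accessory clusters: $((low_freq + intermediate))"         # prints 533
# All clusters, to compare with Panaroo's total of 2029 genes.
echo "total clusters: $((core + low_freq + intermediate))"      # prints 2029
```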
With the output from Corekaburra, multiple questions can be explored. We have provided some examples and ways of doing so in the Down stream analyses section of this wiki.