GO Enrichment analysis with panaroo output #319

Open
limrp opened this issue Jan 24, 2025 · 1 comment

limrp commented Jan 24, 2025

Hi!

I want to do a GO enrichment analysis of the accessory genes from my pangenome analysis, to see which functions are enriched in those genes. For that, I need to annotate all the protein-coding genes in the pangenome with eggNOG-mapper to get the GO terms.
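For context, the enrichment step itself is commonly framed as a one-sided hypergeometric test per GO term: given a background of annotated genes and a selected subset (here, the accessory genes), how surprising is the number of selected genes that carry the term? The sketch below is only an illustration of that test. The function name and the example counts are hypothetical, and a real analysis would also correct the p-values for multiple testing across GO terms.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_background_with_term,
                           n_selected, n_selected_with_term):
    """One-sided p-value P(X >= n_selected_with_term) under the hypergeometric null.

    n_background            genes in the annotated background (e.g. whole pangenome)
    n_background_with_term  background genes annotated with the GO term
    n_selected              genes in the set of interest (e.g. accessory genes)
    n_selected_with_term    selected genes annotated with the GO term
    """
    total = comb(n_background, n_selected)
    upper = min(n_selected, n_background_with_term)
    tail = sum(
        comb(n_background_with_term, i)
        * comb(n_background - n_background_with_term, n_selected - i)
        for i in range(n_selected_with_term, upper + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only: 200 background genes,
# 20 of them carry the term, 50 genes selected, 12 of those carry the term.
p_value = hypergeom_enrichment_p(200, 20, 50, 12)
print(f"p = {p_value:.3g}")
```

The background matters: it should contain only genes that could have received an annotation, not every gene cluster in the pangenome.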

I have all these files as the output of panaroo:

aligned_gene_sequences
alignment_entropy.csv
combined_DNA_CDS.fasta
combined_protein_CDS.fasta
combined_protein_cdhit_out.txt
combined_protein_cdhit_out.txt.clstr
core_alignment_filtered_header.embl
core_alignment_header.embl
core_gene_alignment.aln
core_gene_alignment_filtered.aln
final_graph.gml
gene_data.csv
gene_presence_absence.Rtab
gene_presence_absence.csv
gene_presence_absence_roary.csv
pan_genome_reference.fa
pre_filt_graph.gml
struct_presence_absence.Rtab
summary_statistics.txt
tmps2mr6iie

The content of my summary_statistics.txt file is:

summary_statistics.txt
Core genes	(99% <= strains <= 100%)	3460
Soft core genes	(95% <= strains < 99%)	0
Shell genes	(15% <= strains < 95%)	3025
Cloud genes	(0% <= strains < 15%)	380
Total genes	(0% <= strains <= 100%)	6865
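To pick out the accessory genes, the thresholds printed above can be applied to gene_presence_absence.Rtab. This is a minimal sketch, assuming the file is tab-separated with a header row, gene names in the first column and one 0/1 column per strain. The function name `classify_genes` is made up for this example, and whether Panaroo's own counting matches it exactly at the boundaries is an assumption.

```python
import csv
import io


def classify_genes(rtab_handle):
    """Map each gene name to core / soft_core / shell / cloud by strain frequency."""
    reader = csv.reader(rtab_handle, delimiter="\t")
    header = next(reader)
    n_strains = len(header) - 1  # first column holds the gene name
    categories = {}
    for row in reader:
        if not row:
            continue
        fraction = sum(int(value) for value in row[1:]) / n_strains
        if fraction >= 0.99:
            categories[row[0]] = "core"
        elif fraction >= 0.95:
            categories[row[0]] = "soft_core"
        elif fraction >= 0.15:
            categories[row[0]] = "shell"
        else:
            categories[row[0]] = "cloud"
    return categories


# Tiny in-memory example with 10 made-up strains.
demo = io.StringIO(
    "Gene\t" + "\t".join(f"strain{i}" for i in range(10)) + "\n"
    + "geneA\t" + "\t".join(["1"] * 10) + "\n"
    + "geneB\t" + "\t".join(["1"] * 5 + ["0"] * 5) + "\n"
)
print(classify_genes(demo))
```

The accessory set would then be every gene labelled shell or cloud (plus soft_core, if that category is treated as accessory).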

I was going to use combined_protein_CDS.fasta, but when I checked the number of sequences I found that this file has 49887.

I was also considering using a translated version of the pan_genome_reference.fa file. This one has 6834 sequences, which is closer to the total number of genes reported in my summary_statistics.txt file.

My gene_presence_absence.Rtab file has 6866 lines (for 6865 gene groups).
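For reference, counts like these can be reproduced with a few lines of Python. This is a small sketch: the helper names are made up, and it assumes the Rtab file has exactly one header line, which would explain 6866 lines for 6865 gene groups.

```python
def count_fasta_records(lines):
    """Count sequences in a FASTA file by counting header lines."""
    return sum(1 for line in lines if line.startswith(">"))


def count_rtab_genes(lines):
    """Count gene rows in an Rtab file: non-empty lines minus the header."""
    non_empty = sum(1 for line in lines if line.strip())
    return max(non_empty - 1, 0)


# Usage with the Panaroo output files (paths relative to the output directory):
# with open("pan_genome_reference.fa") as handle:
#     print(count_fasta_records(handle))
# with open("gene_presence_absence.Rtab") as handle:
#     print(count_rtab_genes(handle))
```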

In the documentation:

combined_protein_CDS.fasta
Similar to the combined_DNA_CDS.fasta file, this is a fasta file which includes all protein sequence for both the annotated genes and those refound by the program. The gene names are the internal ones used by Panaroo. These can be translated to the original names using the 'gene_data.csv' file.

About the pan_genome_reference.fa:

pan_genome_reference.fa
This is a similar output to that produced by Roary. It creates a linear reference genome of all the genes found in the dataset. The order of the genes in this reference are not significant. NOTE: to avoid issues with the multi-mapping of reads, paralogous gene clusters will only be represented once in this reference.

I'm still very new to this, so I wanted to ask if someone could advise me on which file I should use as the one that contains all the genes present in the pangenome or, better, all the proteins present in the pangenome.

I also wanted to ask: if the total number of genes in my pangenome is 6865 (according to my summary_statistics.txt file) and there are 6834 sequences in pan_genome_reference.fa, where and what are the other 31 missing sequences?
Are all the sequences in pan_genome_reference.fa protein-coding genes, or are there other kinds of genes such as rRNA or tRNA?
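One way to see which gene groups are absent from the reference is to compare the gene names in gene_presence_absence.Rtab with the FASTA headers. The sketch below assumes that the headers in pan_genome_reference.fa use the same names as the first column of the Rtab file. That assumption should be checked against the actual files, and the helper names are made up for this example.

```python
def fasta_ids(lines):
    """Return the first whitespace-delimited token of each FASTA header."""
    ids = set()
    for line in lines:
        if line.startswith(">") and len(line.strip()) > 1:
            ids.add(line[1:].split()[0])
    return ids


def rtab_gene_names(lines):
    """Return the gene names from the first column, skipping the header row."""
    rows = iter(lines)
    next(rows, None)  # skip header
    return {row.split("\t", 1)[0].strip() for row in rows if row.strip()}


def missing_from_reference(rtab_lines, fasta_lines):
    """Gene groups listed in the Rtab file that have no sequence in the FASTA."""
    return sorted(rtab_gene_names(rtab_lines) - fasta_ids(fasta_lines))


# Usage with the Panaroo output files:
# with open("gene_presence_absence.Rtab") as rtab, open("pan_genome_reference.fa") as fasta:
#     for name in missing_from_reference(rtab, fasta):
#         print(name)
```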

I'll be very grateful for your kind help :)

gtonkinhill (Owner) commented

Hi,

The pan_genome_reference.fa file contains a representative sequence for each gene cluster, with only one representative per paralogous family. This is why it typically has fewer genes compared to what's reported in the summary_statistics.txt file.

For annotating with eggNOG-mapper in this context, the pan_genome_reference.fa is likely your best option.
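Since pan_genome_reference.fa holds nucleotide sequences, one option is to translate it to protein before annotation. The following is a minimal pure-Python sketch, not part of Panaroo. It assumes every record is a coding sequence in frame and in the coding orientation, uses the standard codon assignments (which bacterial table 11 shares for elongation codons), and does not convert alternative start codons such as GTG or TTG to methionine. A library such as Biopython would be the more usual choice in practice.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame CDS, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X for ambiguous codons
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Usage (the output file name is hypothetical):
# with open("pan_genome_reference.fa") as src, open("pan_genome_reference.faa", "w") as dst:
#     for header, sequence in read_fasta(src):
#         dst.write(f">{header}\n{translate(sequence)}\n")
```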
