I want to do a GO enrichment analysis of the accessory genes from my pangenome analysis, to see which functions are enriched in those genes. For that, I need to annotate all the protein-coding genes in the pangenome with eggNOG-mapper to get the GO terms.
I was going to use combined_protein_CDS.fasta, but when I checked the number of sequences I found that this file has 49887.
I was also considering translating pan_genome_reference.fa and using that instead. It has 6834 sequences, which is much closer to the total number of genes reported in my summary_statistics.txt file.
My gene_presence_absence.Rtab file has 6866 lines (for 6865 gene groups).
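For anyone who wants to reproduce these counts, here is a minimal sketch of how they can be obtained. It assumes a FASTA record is any line starting with `>` and that the Rtab file has exactly one header line; the function names are made up for illustration and are not part of Panaroo.

```python
from typing import Iterable


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count FASTA records: one per header line starting with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def count_rtab_gene_groups(lines: Iterable[str]) -> int:
    """Count gene groups in a presence/absence table.

    Assumes one header line; blank lines are ignored.
    """
    n_lines = sum(1 for line in lines if line.strip())
    return max(n_lines - 1, 0)


# Usage with the file names from this thread (paths are assumptions):
# with open("pan_genome_reference.fa") as fh:
#     print(count_fasta_records(fh))
# with open("gene_presence_absence.Rtab") as fh:
#     print(count_rtab_gene_groups(fh))
```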
In the documentation:
combined_protein_CDS.fasta
Similar to the combined_DNA_CDS.fasta file, this is a fasta file which includes all protein sequence for both the annotated genes and those refound by the program. The gene names are the internal ones used by Panaroo. These can be translated to the original names using the 'gene_data.csv' file.
About the pan_genome_reference.fa:
pan_genome_reference.fa
This is a similar output to that produced by Roary. It creates a linear reference genome of all the genes found in the dataset. The order of the genes in this reference are not significant. NOTE: to avoid issues with the multi-mapping of reads, paralogous gene clusters will only be represented once in this reference.
I'm still very new to this, so I wanted to ask if someone could tell me which file I should use as the one that contains all the genes present in the pangenome or, better, all the proteins present in the pangenome.
I also wanted to ask: if the total number of genes in my pangenome is 6865 (according to my summary_statistics.txt file) and there are 6834 sequences in pan_genome_reference.fa, where and what are the other 31 missing sequences?
Are all the sequences in pan_genome_reference.fa protein-coding genes, or are there other kinds of genes, such as rRNA or tRNA?
The pan_genome_reference.fa file contains a representative sequence for each gene cluster, with only one representative per paralogous family. This is why it typically has fewer genes compared to what's reported in the summary_statistics.txt file.
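To see which gene groups are not represented, one option is to compare the names in the first column of gene_presence_absence.Rtab with the FASTA headers in pan_genome_reference.fa. The sketch below assumes that both files use the same cluster names, which you should verify on your own output before trusting the result; the helper names are hypothetical.

```python
from typing import Iterable, List, Set


def fasta_ids(lines: Iterable[str]) -> Set[str]:
    """Collect the first whitespace-delimited token of each FASTA header."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def rtab_gene_names(lines: Iterable[str]) -> Set[str]:
    """Collect the first column of a tab-separated table, skipping the header."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    return {row.split("\t")[0] for row in rows[1:]}


def missing_from_reference(
    rtab_lines: Iterable[str], fasta_lines: Iterable[str]
) -> List[str]:
    """Gene groups listed in the Rtab but absent from the reference FASTA.

    Assumption: FASTA headers and Rtab gene names share the same naming.
    """
    return sorted(rtab_gene_names(rtab_lines) - fasta_ids(fasta_lines))


# Usage (paths are assumptions):
# with open("gene_presence_absence.Rtab") as rtab, open("pan_genome_reference.fa") as fa:
#     print(missing_from_reference(rtab, fa))
```

If the names line up, the returned list should contain the groups that account for the difference between the two counts.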
For annotating with eggNOG-mapper in this context, the pan_genome_reference.fa is likely your best option.
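Since pan_genome_reference.fa holds nucleotide sequences, it needs to be translated before it can be used as protein input. Below is a rough sketch of a plain-Python translation, assuming every sequence is an in-frame CDS on the forward strand. It translates codons literally with the standard amino acid assignments, so an alternative start codon such as GTG comes out as V rather than M, and translation stops at the first stop codon. The function names are illustrative only.

```python
from itertools import product
from typing import Dict, Iterable, Iterator, Tuple

# Standard genetic code in NCBI ordering (bases T, C, A, G; first base slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def translate(dna: str) -> str:
    """Translate an in-frame CDS; unknown codons become 'X', stop ends translation."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Usage (file names are assumptions):
# with open("pan_genome_reference.fa") as src, open("pan_genome_reference.faa", "w") as dst:
#     for header, seq in read_fasta(src):
#         dst.write(f">{header}\n{translate(seq)}\n")
```

It is worth checking a few translated sequences for unexpectedly short proteins, which would suggest a sequence that is not in frame.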