Question #62

Closed
vicru93 opened this issue Aug 5, 2024 · 4 comments

vicru93 commented Aug 5, 2024

[Image: upsetr]

Sorry for this question, but could someone help me understand this graph?
P.S.: IDs with "_REF" are reference genomes

Best W,

Victor.

Contributor

hoelzer commented Aug 5, 2024

Hey @vicru93 thanks for using RIBAP!

The graph is a so-called UpSet plot. UpSet plots are a way to visualize the overlaps and differences between many sets. You might be more familiar with Venn diagrams, which work nicely for visualizing the overlaps between three or four sets but cannot handle more.

In the plot above you have your genomes as rows, and the blue bars show how many genes were annotated in each. The columns show how many genes are shared between different subsets of your input genomes. For example, all genomes except "A2_bin4" have 691 genes in common according to RIBAP's core gene calculation (first column in the plot). Next there is "A20_bin53", which has 431 unique genes not found in any of the other genomes. And so on.

"A2_bin4" has 17 unique genes not found in any other of your input genomes.
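
To make the column logic concrete, here is a minimal Python sketch of how such counts can be derived. It assumes each column counts the genes found in exactly that combination of genomes, which is how the explanation above reads. The genome names `G1` to `G3`, the gene IDs, and the function `upset_counts` are made up for illustration; this is not RIBAP's actual code.

```python
from collections import Counter


def upset_counts(genes_by_genome):
    """Count genes per exact combination of genomes (one UpSet column each)."""
    # Map every gene to the set of genomes it occurs in.
    membership = {}
    for genome, genes in genes_by_genome.items():
        for gene in genes:
            membership.setdefault(gene, set()).add(genome)
    # Each distinct genome combination corresponds to one column of the plot.
    return Counter(frozenset(genomes) for genomes in membership.values())


# Hypothetical toy input: three genomes with invented gene IDs.
toy = {
    "G1": {"a", "b", "c", "d"},
    "G2": {"a", "b", "e"},
    "G3": {"a", "f", "g"},
}

counts = upset_counts(toy)
for combo, n in sorted(counts.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(combo), n)
```

A column with a single filled dot then corresponds to a combination containing only one genome, i.e. the genes unique to that genome.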

Does that clarify?

May I also ask: are you running RIBAP on metagenome-assembled genomes? I am assuming this from the naming ("...bin"..) ;)

If so, that's really cool because this was something I wanted to try anyway. I'd be happy about any feedback you can provide.

Author

vicru93 commented Aug 5, 2024

Hi @hoelzer:
It is an efficient tool that fits an analysis I am doing of 2 inocula fed with 3 different substrates in an anaerobic digestion process on a time scale (0-16-42.62 and 93 days). My idea is to monitor microorganisms at the genus level (including the reference genomes given by GTDB-Tk) to determine how much my microbiome changes over 0 to 93 days, so your tool can be very useful.

The only thing is that I am still trying to understand the generated output. For example, the HTML report does not link me to the generated alignments, so it is difficult to trace results from this file.

Something I also see is the difficulty of the analysis when using MAGs. For example, "A2_bin4" has a completeness of 70% and a contamination below 6% according to CheckM2, but the annotation done by Prokka was deficient, which is worrying if you want to follow changes across different genomes.

I suppose everything I am writing here was predictable when studying environmental samples :(

For now, this tool seems to me to serve its purpose, and now that you have explained this graph to me, I can see the great utility it can have for my research, thank you.

Best W,

Victor.

Contributor

hoelzer commented Aug 6, 2024

Hey @vicru93, thanks for the details about your project.

That's great, and it's exactly what I also had in mind as a future application for RIBAP. However, we have not tested the pipeline on MAGs so far. My approach would also have been to produce MAGs, filter them based on completeness/contamination (CheckM), and throw them into RIBAP to get a core meta-pangenome. Then, monitor how this changes over time...
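
The filtering step described above could look roughly like this. It is a sketch only: the column names `Name`, `Completeness`, and `Contamination` are assumed to match the CheckM2 quality report, the thresholds are placeholders rather than recommendations, and the MAG names and values are invented.

```python
import csv
import io


def filter_mags(report_tsv, min_completeness, max_contamination):
    """Return names of MAGs that pass both quality thresholds."""
    reader = csv.DictReader(io.StringIO(report_tsv), delimiter="\t")
    keep = []
    for row in reader:
        complete_enough = float(row["Completeness"]) >= min_completeness
        clean_enough = float(row["Contamination"]) <= max_contamination
        if complete_enough and clean_enough:
            keep.append(row["Name"])
    return keep


# Invented example report; column names are an assumption about the real file.
example_report = (
    "Name\tCompleteness\tContamination\n"
    "mag_A\t95.2\t1.3\n"
    "mag_B\t55.0\t2.0\n"
    "mag_C\t88.7\t4.9\n"
    "mag_D\t91.0\t12.5\n"
)

# Placeholder thresholds, chosen only to demonstrate the filter.
passed = filter_mags(example_report, min_completeness=70.0, max_contamination=10.0)
print(passed)
```

Only the MAGs that pass would then be given to RIBAP as input.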

> The only thing is that I am still trying to understand the generated output. For example, the HTML report does not link me to the generated alignments, so it is difficult to trace results from this file.

Yes, I agree. The output can be improved; we already have an open issue #60. However, the HTML report should link you to the MSA (and tree) - does this not work for you? Also, in the output files, you should find a table with all RIBAP groups, which you can filter depending on your needs. For example, users might be interested not only in genes found in 100% of the input genomes but also in lower cutoffs, etc...
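
As an illustration of such a cutoff filter, here is a small sketch. It assumes the group table has already been loaded into a mapping from group ID to the set of genomes in which the group was found. The layout of RIBAP's real table may differ, and the group IDs, genome names, and `groups_at_cutoff` are hypothetical.

```python
def groups_at_cutoff(groups, n_genomes, min_fraction):
    """Return IDs of groups present in at least min_fraction of the genomes."""
    return sorted(
        group_id
        for group_id, genomes in groups.items()
        if len(genomes) / n_genomes >= min_fraction
    )


# Hypothetical groups for four input genomes G1..G4.
groups = {
    "group1": {"G1", "G2", "G3", "G4"},
    "group2": {"G1", "G2", "G3"},
    "group3": {"G4"},
}

core = groups_at_cutoff(groups, n_genomes=4, min_fraction=1.0)
soft_core = groups_at_cutoff(groups, n_genomes=4, min_fraction=0.75)
print(core, soft_core)
```

Lowering `min_fraction` is one way to keep groups that are missing from a few incomplete genomes.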

> Something I also see is the difficulty of the analysis when using MAGs. For example, "A2_bin4" has a completeness of 70% and a contamination below 6% according to CheckM2, but the annotation done by Prokka was deficient, which is worrying if you want to follow changes across different genomes.

I see. But, strangely, you get almost no annotated genes even though the MAG is 70% complete according to CheckM. How large is the MAG in nucleotides? Does it maybe belong to a very small bacterium? However, the smallest I know of have ~1 Mbp genomes, and if the MAG is still 70% complete I would expect Prokka to find at least a few hundred genes...
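
The expectation above follows from a back-of-the-envelope estimate, sketched here under the rough assumption of about one gene per kilobase, a common rule of thumb for bacterial genomes that is not stated in this thread. The function name and the default density are illustrative only.

```python
def expected_gene_count(genome_bp, completeness, genes_per_kb=1.0):
    """Rough number of genes expected in a partially recovered genome."""
    # genes_per_kb is an assumed average gene density, not a measured value.
    return round(genome_bp / 1000 * genes_per_kb * completeness)


# A ~1 Mbp genome recovered at 70% completeness.
estimate = expected_gene_count(1_000_000, 0.70)
print(estimate)
```

Under these assumptions the estimate lands in the hundreds, so a MAG with almost no annotated genes would be unexpected.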

> I suppose everything I am writing here was predictable when studying environmental samples :(

Yeah, wild west.

> For now, this tool seems to me to serve its purpose, and now that you have explained this graph to me, I can see the great utility it can have for my research, thank you.

Great, let me know if you have any further questions.

Contributor

hoelzer commented Oct 21, 2024

Closing because this seems to be solved. Please reopen if there is more.

hoelzer closed this as completed Oct 21, 2024