Question about usage of rpvg command #62

Open
CarlosAmadeo7 opened this issue Sep 7, 2024 · 11 comments

Comments

@CarlosAmadeo7

Hello rpvg team
I have a general question and I would appreciate your help.
I am trying to infer the haplotype expression of human-experiment reads using rpvg.
I have built a multipath alignment graph using previously built human graphs and spliced-junction graphs. Now I want to run rpvg with the command:

rpvg -g graph.xg -p paths.gbwt -a alignments.gamp -o rpvg_results -i

I understand that "paths.gbwt" is the pantranscriptome and that "alignments.gamp" is the multipath-alignment graph that I obtained with vg earlier, but I would like to know what "graph.xg" refers to in the command.
Is it the original human graph or is it the spliced-junction graph that I obtained before using vg?
I would appreciate your help
Best

@jeizenga
Collaborator

jeizenga commented Sep 7, 2024

You would use the spliced graph.

@CarlosAmadeo7
Author

It worked perfectly, thank you.
I am new to rpvg, and I just wanted to ask something general. After obtaining the haplotype expression values, what follow-up analyses would you recommend?

@jeizenga
Collaborator

That's an area where I think there's still some room for methods development. If you marginalize out the haplotypes, you can get transcript-level expression values that fit into standard pipelines. We also found that for reasonably highly expressed genes, you can typically resolve the sample's haplotypes and their expression values as the top 2 most confident haplotypes. I think we have yet to see an analysis that fully utilizes all of the uncertainty over haplotypes that is included in the posterior distribution that rpvg estimates.

@agolicz

agolicz commented Sep 18, 2024

Just following up on this. Do you think it's best to use the two most confident haplotypes (rpvg.txt) or the most confident diplotype (rpvg_joint.txt)? So far we've been using the most confident diplotype. I am re-running the analysis with the two most confident haplotypes, but so far I am getting pretty similar results.
We've been using rpvg quantification for eQTL analysis. This is an old version, but gives an idea: https://www.biorxiv.org/content/10.1101/2024.03.14.585028v1.abstract

Agnieszka

@jeizenga
Collaborator

The diplotype is what I meant. Sorry for the ambiguity. It makes sense that you get generally similar results, since the most likely diplotype will typically consist of the two most likely haplotypes.
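
To make the last point concrete, here is a toy sketch (not rpvg code, and not a claim about how rpvg.txt is computed). It assumes a made-up posterior over diplotypes for one transcript, derives each haplotype's marginal probability as the sum over the diplotypes that contain it, and compares the two most probable haplotypes with the most probable diplotype. The haplotype names and probabilities are invented for illustration.

```python
from collections import defaultdict


def haplotype_marginals(diplotype_probs):
    """Sum diplotype probabilities over every diplotype containing a haplotype."""
    marginals = defaultdict(float)
    for (hap_1, hap_2), prob in diplotype_probs.items():
        # A set avoids double counting a homozygous diplotype (hap_1 == hap_2).
        for hap in {hap_1, hap_2}:
            marginals[hap] += prob
    return dict(marginals)


def top_diplotype(diplotype_probs):
    """Return the haplotype pair with the highest posterior probability."""
    return max(diplotype_probs, key=diplotype_probs.get)


def top_two_haplotypes(diplotype_probs):
    """Return the two haplotypes with the highest marginal probability."""
    marginals = haplotype_marginals(diplotype_probs)
    ranked = sorted(marginals, key=marginals.get, reverse=True)
    return tuple(sorted(ranked[:2]))


# Invented posterior over diplotypes for a single transcript.
toy_posterior = {
    ("hapA", "hapB"): 0.7,
    ("hapA", "hapC"): 0.2,
    ("hapB", "hapC"): 0.1,
}

print(top_diplotype(toy_posterior))
print(top_two_haplotypes(toy_posterior))
```

In this toy case both approaches pick hapA and hapB. They can disagree when the posterior mass is spread over many diplotypes, which is one reason the two output files may differ for lowly expressed genes.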

@CarlosAmadeo7
Author

Thank you for the response. I would also like to know if you guys have a page on interpreting the results. I can understand most of the output, but I am a little confused about the meaning of the Cluster ID.

@jeizenga
Collaborator

We do not have an interpretation page, although that would be a good idea.

A cluster is roughly analogous to a gene. In particular, transcripts are joined into a cluster whenever there is a read that is simultaneously aligned to both of them. This typically means the transcripts have an overlapping exon. However, it's possible that transcripts that are considered to be from the same gene do not get placed into the same cluster, which can occur when their overlapping exons are not expressed.

We deal with clusters rather than genes internally to avoid the squishy conceptual ambiguities around gene definitions in complex regions. Clusters can reasonably well identify groups of transcripts whose expression needs to be co-inferred.
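
The clustering rule described above amounts to finding connected components, which can be sketched with a small union-find structure. This is only an illustration of the idea under simplifying assumptions, not rpvg's actual implementation: the function name `cluster_transcripts` and the input format (one list of transcript names per read) are hypothetical.

```python
def cluster_transcripts(read_alignments):
    """Group transcripts into clusters: two transcripts share a cluster
    if some read aligns to both, directly or through a chain of reads.

    read_alignments: iterable of lists of transcript names, one list per read.
    Returns a sorted list of sorted clusters.
    """
    parent = {}

    def find(name):
        # Register unseen transcripts as their own cluster root.
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for transcripts in read_alignments:
        transcripts = list(transcripts)
        for name in transcripts:
            find(name)
        # Joining everything to the first transcript links the whole read.
        for other in transcripts[1:]:
            union(transcripts[0], other)

    clusters = {}
    for name in list(parent):
        clusters.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical reads: tx1/tx2/tx3 are chained together by shared reads,
# while tx4 has no read in common with them and stays on its own.
reads = [["tx1", "tx2"], ["tx2", "tx3"], ["tx4"], ["tx5", "tx6"]]
print(cluster_transcripts(reads))
```

Here tx4 ends up alone even if an annotation places it in the same gene as tx1, which mirrors the case where the shared exons are not expressed.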

@agolicz

agolicz commented Sep 19, 2024

Yes, that would be fantastic. For us, graph-based expression quantification is shaping up to be one of the main current applications other than SV genotyping. Is this an area where the vg team sees future developments?

@CarlosAmadeo7
Author

Thank you. Would you recommend marginalizing over the haplotypes in the rpvg_joint.txt output to get transcript-level expression values and then running a differential expression analysis? In what cases should I use the single-haplotype expression file, and in what cases the diplotype expression file?
Thank you, I appreciate your patience and help.

@CarlosAmadeo7
Author

Hello rpvg team,
Following up on my previous question:
I was able to map RNA-seq reads and get the haplotype-specific transcript expression using rpvg. It gave me 994002 haplotype-specific transcripts. I am using the spliced-junction graph and human pantranscriptome from the paper "A Draft Human Pangenome Reference".
I am trying to compare my results (single haplotypes and the most probable haplotype combination, i.e. the diplotype) with the results from an article that reports 51135 genes and their respective counts.
I am not quite sure how to do that, because what I have is haplotype-specific transcript expression, and the number of entries is far higher than what they report.
Previously, you mentioned that marginalizing over the haplotypes can give transcript-level expression values that fit into standard pipelines. By marginalization, do you mean averaging over all the haplotype combinations in rpvg_joint.txt and then summing up all the ReadCounts and TPMs? I would appreciate more information about this, please.
In addition, I noticed that in my results the same haplotype is sometimes repeated in both positions (Name1 and Name2). What does this mean? In that case, should I sum their ReadCounts and TPMs for the marginalization?

image

I would appreciate information about it.
Thank you so much
Best

@jeizenga
Collaborator

To marginalize over haplotypes, you would essentially compute an average of the sum of TPM_1 and TPM_2 weighted by HaplotypingProbability across all transcripts with the same accession ID.
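
A minimal sketch of that weighted average follows. It assumes that each row of rpvg_joint.txt has already been parsed into a dict with the keys `Name_1`, `HaplotypingProbability`, `TPM_1` and `TPM_2` (the exact column names should be checked against your own file), and that both names in a row belong to the same transcript. The `transcript_of` mapping from haplotype-specific transcript name to accession ID is hypothetical and would have to be built from your pantranscriptome annotation.

```python
from collections import defaultdict


def marginalize_over_haplotypes(rows, transcript_of):
    """Probability-weighted average of TPM_1 + TPM_2 per transcript accession.

    rows: iterable of dicts with keys Name_1, HaplotypingProbability,
          TPM_1 and TPM_2 (assumed column names).
    transcript_of: dict mapping a haplotype-specific transcript name to
          its transcript accession ID (hypothetical helper mapping).
    """
    weighted_tpm = defaultdict(float)
    total_weight = defaultdict(float)
    for row in rows:
        # Assumption: Name_1 and Name_2 share one accession, so Name_1 suffices.
        accession = transcript_of[row["Name_1"]]
        prob = float(row["HaplotypingProbability"])
        weighted_tpm[accession] += prob * (float(row["TPM_1"]) + float(row["TPM_2"]))
        total_weight[accession] += prob
    return {
        accession: weighted_tpm[accession] / total_weight[accession]
        for accession in weighted_tpm
        if total_weight[accession] > 0
    }


# Invented rows for illustration only.
example_rows = [
    {"Name_1": "tx1_H1", "Name_2": "tx1_H2",
     "HaplotypingProbability": "0.75", "TPM_1": "4", "TPM_2": "6"},
    {"Name_1": "tx1_H1", "Name_2": "tx1_H3",
     "HaplotypingProbability": "0.25", "TPM_1": "2", "TPM_2": "0"},
    {"Name_1": "tx2_H1", "Name_2": "tx2_H1",
     "HaplotypingProbability": "1.0", "TPM_1": "3", "TPM_2": "0"},
]
example_mapping = {"tx1_H1": "tx1", "tx1_H2": "tx1", "tx1_H3": "tx1", "tx2_H1": "tx2"}

print(marginalize_over_haplotypes(example_rows, example_mapping))
```

The resulting transcript-level values could then be summed per gene, using a transcript-to-gene table, before comparing against gene-level counts from another study.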
