Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assertion `hasNodeId(node_id)' failed #43

Open
wtulle opened this issue Feb 24, 2023 · 20 comments
Open

Assertion `hasNodeId(node_id)' failed #43

wtulle opened this issue Feb 24, 2023 · 20 comments

Comments

@wtulle
Copy link

wtulle commented Feb 24, 2023

Hi, I need help with an error running rpvg command.
I'm getting "paths_index.cpp:73: uint32_t PathsIndex::nodeLength(uint32_t) const: Assertion `hasNodeId(node_id)' failed" error, but I have no idea what that means.
The full command is:
rpvg -t 2 -g ref.xg -p pantranscriptome.gbwt -a mpmap.gamp -o rpvg --inference-model haplotypes-transcripts

Any idea of what the problem could come from, the .xg, .gbwt or .gamp parameter?
Thanks!

@jonassibbesen
Copy link
Owner

Hi, from the error it looks like the pantranscriptome does not match the graph that was given as input (ref.xg). Could you share the command lines you used when running vg rna and vg mpmap? Thanks!

@wtulle
Copy link
Author

wtulle commented Feb 27, 2023

Hi, here are the full commands I used:

bgzip -c contigs.vcf > contigs.vcf.gz && tabix contigs.vcf.gz
vg construct -r ref.fa -v contigs.vcf.gz -a > ref.vg
vg index -x ref.xg -T ref.vg
vg autoindex --workflow mpmap -t 2 --prefix vg_rna --ref-fasta ref.fa --vcf contigs.vcf.gz --tx-gff kortha.gtf

vg mpmap -t 2 -x vg_rna.spliced.xg -g vg_rna.spliced.gcsa -d vg_rna.spliced.dist -f r1.fq.gz -f r2.fq.gz > mpmap.gamp
vg rna -n kortha.gtf ref.vg -i pantranscriptome.txt -b pantranscriptome.gbwt > pantranscriptome.txt.gz
rpvg -t 2 -g ref.xg -p pantranscriptome.gbwt -f pantranscriptome.txt -a mpmap.gamp -o rpvg --inference-model haplotype-transcripts

Thanks!

@jeizenga
Copy link
Collaborator

To co-generate the pantranscriptome and the spliced graph/indexes, you should use vg autoindex --workflow mpmap --workflow rpvg. That should guarantee that all of the indexes are consistent with each other. In general, I would recommend against mixing indexes from the automatic vg autoindex pipeline and manual pipelines, which could use different construction parameters.

@wtulle
Copy link
Author

wtulle commented Mar 1, 2023

Sorry to bother you again, I'm trying your answer but now I'm getting another error I can't resolve:
vg autoindex --workflow mpmap --workflow rpvg --prefix vg_rna --ref-fasta ref.fa --vcf contigs.vcf.gz --tx-gff kortha.gff

The output:
[IndexRegistry]: Checking for phasing in VCF(s).
error:[vg autoindex] Input is not sufficient to create indexes
Inputs
GTF/GFF
Reference FASTA
VCF
are insufficient to create target index Haplotype-Transcript GBWT

I've tried adding the parameter --gfa ref.gfa with similar result:

[IndexRegistry]: Checking for phasing in VCF(s).
[IndexRegistry]: Provided: VCF
[IndexRegistry]: Checking for haplotype lines in GFA.
[IndexRegistry]: Provided: Reference GFA
error:[vg autoindex] Input is not sufficient to create indexes

@jeizenga
Copy link
Collaborator

jeizenga commented Mar 1, 2023

The issue is that vg autoindex isn't finding phased variants in the VCF, which are required to form the haplotype-specific transcripts in the pantranscriptome. Is your VCF un-phased?

@ld9866
Copy link

ld9866 commented Mar 8, 2023

The issue is that vg autoindex isn't finding phased variants in the VCF, which are required to form the haplotype-specific transcripts in the pantranscriptome. Is your VCF un-phased?
Hello!
Have you solved the problem?
I had the same problem

@jeizenga
Copy link
Collaborator

jeizenga commented Mar 8, 2023

Have you already checked for phasing in your VCF?

@ld9866
Copy link

ld9866 commented Mar 8, 2023

Hello!
I would like to ask how to "phasing in your VCF", do you mean that this vcf needs to use bcftools for quality control?

@jeizenga
Copy link
Collaborator

jeizenga commented Mar 8, 2023

VCFs can express either phased or unphased genotypes. Phased genotypes link together the alleles at multiple loci as occurring on the same haplotype. Unphased genotypes simply assert the alleles at each locus without specifying what combination of alleles co-occur on each haplotype. The pantranscriptome is built from haplotype-specific transcripts, so you need phased genotypes in order to specify the haplotype sequences. There's more detail about phasing in the VCF format in section 1.4.2 of the file specification.

@ld9866
Copy link

ld9866 commented Mar 8, 2023

example.vcf.zip
I'm sorry to bother you again!
I have read the recommended document according to what you said, but I am not very clear about how to carry out the follow-up operation of my vcf. Could you please help me ?
Best wishes

@jeizenga
Copy link
Collaborator

jeizenga commented Mar 8, 2023

The most common way you'd see it expressed is if the genotypes were separated with a bar (e.g. 0|1) rather than a slash (e.g. 0/1), but it looks like your genotypes all have ploidy of 1. Does that make sense for your organism? It could be that this is an unhandled edge case in vg autoindex.

@ld9866
Copy link

ld9866 commented Mar 8, 2023

Thank for your reply!
First, l used the minigraph-cactus to establish a pan-genome which result in the primates-pg.vcf.gz primates-pg.gfa.gz.
Secondly, l used the code "vg autoindex --workflow mpmap -t 20 --prefix vg_rna_gfa2 --ref-fasta ref.fa --vcf primates-pg.vcf.gz --tx-gff ref.gtf" which works ok and vg mpmap works well. However, l found that l can not have the pantranscriptome.gbwt and pantranscriptome.txt.gz in your example file.
l found that you said the "vg autoindex --workflow mpmap --workflow rpvg" and l print the code "vg autoindex --workflow mpmap --workflow rpvg -t 20 --prefix vg_rna_gfa2 --ref-fasta ref.fa --vcf primates-pg.vcf.gz --tx-gff ref.gtf" but it showed that

[IndexRegistry]: Checking for haplotype lines in GFA.
error:[vg autoindex] Input is not sufficient to create indexes
Inputs
GTF/GFF
Reference FASTA
Reference GFA w/ Haplotypes
are insufficient to create target index Haplotype-Transcript GBWT

This is exactly the crux of my problem, and I find it very difficult to understand, again to bother you.
Best yours,

@jeizenga
Copy link
Collaborator

jeizenga commented Mar 8, 2023

I think the issue is further upstream in this pipeline: you're not getting a phased VCF from your Minigraph-Cactus workflow. I would think that Minigraph-Cactus could (in theory) produce phased VCFs, so I'm not sure why you're not getting one. I would recommend making a support request/issue to the Minigraph-Cactus developers to see if it's possible to get a phased VCF as output. It's also a bit puzzling to me that you're getting haploid genotypes on chromosome 1 in the example you gave me, since that's a diploid chromosome in primates.

@ld9866
Copy link

ld9866 commented Mar 8, 2023

Ok, thank you for your reply. Wish you all the best!
We will ask the developer about this, thanks.

@wtulle
Copy link
Author

wtulle commented Mar 8, 2023

Hello, sorry for the delay. I phased my vcf using vcf_phase.py (https://ppp.readthedocs.io/en/latest/PPP_pages/Functions/vcf_phase.html). With this phased vcf the autoindex command is working now.

@wtulle
Copy link
Author

wtulle commented Mar 8, 2023

Now, I'm getting an error running rpvg:
rpvg -t 2 -g ref.xg -p pantranscriptome.gbwt -f pantranscriptome.txt -a mpmap.gamp -o rpvg --inference-model haplotype-transcripts

Error:
Running rpvg (commit: d0478d0)
Random number generator seed: 1678281590
Fragment length distribution parameters found in alignment (mean: 774.566, standard deviation: 19.276)
Loaded graph and GBWT (9.37961 seconds, 6.22832 GB)
[E::bgzf_read_block] Failed to read BGZF block data at offset 110622063 expected 6159 bytes; hread returned 2687
terminate called after throwing an instance of 'std::runtime_error'
what(): [vg::io::MessageIterator] obsolete, invalid, or corrupt input at message 7248584834528 group 7242883472583

@jonassibbesen
Copy link
Owner

Now, I'm getting an error running rpvg: rpvg -t 2 -g ref.xg -p pantranscriptome.gbwt -f pantranscriptome.txt -a mpmap.gamp -o rpvg --inference-model haplotype-transcripts

Error: Running rpvg (commit: d0478d0) Random number generator seed: 1678281590 Fragment length distribution parameters found in alignment (mean: 774.566, standard deviation: 19.276) Loaded graph and GBWT (9.37961 seconds, 6.22832 GB) [E::bgzf_read_block] Failed to read BGZF block data at offset 110622063 expected 6159 bytes; hread returned 2687 terminate called after throwing an instance of 'std::runtime_error' what(): [vg::io::MessageIterator] obsolete, invalid, or corrupt input at message 7248584834528 group 7242883472583

Hi, I am not sure why this is happening. Would you be able to share the data? You can send it to [email protected] Thanks!

@jonassibbesen
Copy link
Owner

jonassibbesen commented Mar 8, 2023

Thank for your reply! First, l used the minigraph-cactus to establish a pan-genome which result in the primates-pg.vcf.gz primates-pg.gfa.gz. Secondly, l used the code "vg autoindex --workflow mpmap -t 20 --prefix vg_rna_gfa2 --ref-fasta ref.fa --vcf primates-pg.vcf.gz --tx-gff ref.gtf" which works ok and vg mpmap works well. However, l found that l can not have the pantranscriptome.gbwt and pantranscriptome.txt.gz in your example file. l found that you said the "vg autoindex --workflow mpmap --workflow rpvg" and l print the code "vg autoindex --workflow mpmap --workflow rpvg -t 20 --prefix vg_rna_gfa2 --ref-fasta ref.fa --vcf primates-pg.vcf.gz --tx-gff ref.gtf" but it showed that

[IndexRegistry]: Checking for haplotype lines in GFA. error:[vg autoindex] Input is not sufficient to create indexes Inputs GTF/GFF Reference FASTA Reference GFA w/ Haplotypes are insufficient to create target index Haplotype-Transcript GBWT

This is exactly the crux of my problem, and I find it very difficult to understand, again to bother you. Best yours,

Hi, see related issue: #46 (comment)

@ld9866
Copy link

ld9866 commented Mar 8, 2023

Hello, sorry for the delay. I phased my vcf using vcf_phase.py (https://ppp.readthedocs.io/en/latest/PPP_pages/Functions/vcf_phase.html). With this phased vcf the autoindex command is working now.

Thank you for your help!

@ld9866
Copy link

ld9866 commented Mar 8, 2023

Thank for your reply! First, l used the minigraph-cactus to establish a pan-genome which result in the primates-pg.vcf.gz primates-pg.gfa.gz. Secondly, l used the code "vg autoindex --workflow mpmap -t 20 --prefix vg_rna_gfa2 --ref-fasta ref.fa --vcf primates-pg.vcf.gz --tx-gff ref.gtf" which works ok and vg mpmap works well. However, l found that l can not have the pantranscriptome.gbwt and pantranscriptome.txt.gz in your example file. l found that you said the "vg autoindex --workflow mpmap --workflow rpvg" and l print the code "vg autoindex --workflow mpmap --workflow rpvg -t 20 --prefix vg_rna_gfa2 --ref-fasta ref.fa --vcf primates-pg.vcf.gz --tx-gff ref.gtf" but it showed that
[IndexRegistry]: Checking for haplotype lines in GFA. error:[vg autoindex] Input is not sufficient to create indexes Inputs GTF/GFF Reference FASTA Reference GFA w/ Haplotypes are insufficient to create target index Haplotype-Transcript GBWT
This is exactly the crux of my problem, and I find it very difficult to understand, again to bother you. Best yours,

Hi, see answer here in a related issue: #46 (comment)

Thank you!
l am trying it for your suggestion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants