vg view reports that a subgraph by vg chunk is invalid #2504

Open
subwaystation opened this issue Oct 11, 2019 · 24 comments

@subwaystation
Member

Hi vgteam :)
@superjox @trgibbons

  1. What you were trying to do:
    I tried to use https://github.com/vgteam/sequenceTubeMap to browse the region S288C_chrVII:95084-95584 of a 12-sample yeast pangenome. The graph was built with https://github.com/ekg/seqwish, whose .gfa output was converted to .vg. An .xg index was created, too.
    To confirm the issue, I also ran vg chunk and vg view manually, which produced the same problem. However, when I leave out -T, -b, and -E, the command runs through without issues. But as SequenceTubeMap requires these options, I am stuck here.

  2. What you wanted to happen:
    Take a look at the specified positions.

  3. What actually happened:
    SequenceTubeMap output:

http POST chr22_v4 received
nodeID = 95084
distance = 500
no gam index provided.
no gbwt file provided.
dataPath = ./exampleData/
[ 'chunk',
  '-x',
  './exampleData/joint_yeast_genomes-twelve.xg',
  '-c',
  '2',
  '-p',
  'S288C_chrVII:95084-95584',
  '-T',
  '-b',
  './tmp-152f2180-ec06-11e9-a890-7bfa71685467/chunk',
  '-E',
  './tmp-152f2180-ec06-11e9-a890-7bfa71685467/regions.tsv' ]
vg chunk exited with code 0
vg view err data: graph path 'DBVPG6765_chrVII' invalid: edge from 4218946 start to 1204907 start does not exist

vg view err data: [vg view] warning: graph is invalid!

vg view exited with code 0

And then SequenceTubeMap just crashes, of course.
Output, when I do the same thing solely on the command line:

graph path 'DBVPG6765_chrVII' invalid: edge from 4218946 start to 1204907 start does not exist
[vg view] warning: graph is invalid!
  4. What data and command line to use to make the problem recur, if applicable:
    vg: variation graph tool, version v1.19.0 "Tramutola"
time vg view --gfa-in /ctx/projects/Q2380-Pantograph/03_data_processing/10_seqwish/10_yeast/21_PacBio_twelve/joint_yeast_genomes-twelve.gfa --vg > joint_yeast_genomes-twelve.vg

vg index -x joint_yeast_genomes-twelve.xg -t 5 joint_yeast_genomes-twelve.vg

vg chunk -x joint_yeast_genomes-twelve.xg -c 2 -p S288C_chrVII:95084-95584 -T -b chunk -E regions.tsv | vg view -j - > S288C_chrVII:95084-95584.json

vg chunk -x joint_yeast_genomes-twelve.xg -c 2 -p S288C_chrVII:95084-95584  | vg view -j - > S288C_chrVII:95084-95584.json
  5. Provide links to data, if possible:
    joint_yeast_genomes-twelvexg.gz
    joint_yeast_genomes-twelvevg.gz
    S288C_chrVII:95084-95584.json.zip
@subwaystation changed the title from "vg view reports" to "vg view reports that a subgraph by vg chunk is invalid" on Oct 11, 2019
@subwaystation
Member Author

I also played around with the -c parameter [20, 100, 1000, 500, 50], but that did not solve the issue.

@ekg
Member

ekg commented Oct 11, 2019

This might have to do with the invalidity of paths in subgraphs. We have been talking about how to resolve this for some time without much progress. There are some simple hacks, like making new paths with a naming that relates them to the path range they are derived from.

@subwaystation
Member Author

Thanks for the feedback @ekg . I have to admit, this makes me kind of unhappy.
Can you point me to these hacks? Are there any examples?
Can I contribute somehow so that we can solve this issue in a foreseeable time?

I assume deleting the invalid paths would solve the issue. But then we wouldn't have a complete subgraph?

@glennhickey
Contributor

vg used to use the "rank" field to somewhat support disconnected paths. But we lost that when switching to the new API. The discussion on how to properly support subpaths with the new API is here: vgteam/libhandlegraph#29

vg chunk takes care to ensure that the reference path (S288C_chrVII) is not disconnected. And this has been sufficient for our VCF-based graphs. But now you have other assemblies in the graph and that's tripping up DBVPG6765_chrVII.

I think we probably need Erik's simple hack of renaming path chunks to get around this. It might go here:
https://github.com/vgteam/vg/blob/master/src/algorithms/subgraph.cpp#L292-L322

I'll try to take a shot at implementing it today. Sorry about this!
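
To make the disconnected-path problem concrete, here is a minimal Python sketch of the kind of check that the warning seems to describe: for each pair of consecutive path steps, the subgraph must contain the edge joining them. This is not vg's code. The names `edge_between` and `missing_edges`, and the representation of steps and edges, are assumptions made for illustration only.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Step = Tuple[int, bool]   # (node_id, is_reverse); hypothetical representation
Side = Tuple[int, str]    # (node_id, "start" or "end")
Edge = FrozenSet[Side]    # an edge joins two node sides


def edge_between(prev: Step, nxt: Step) -> Edge:
    """Return the edge that must exist to walk from prev to nxt."""
    # A forward step leaves through the node's end, a reversed step through its start.
    out_side = (prev[0], "start" if prev[1] else "end")
    # A forward step enters through the node's start, a reversed step through its end.
    in_side = (nxt[0], "end" if nxt[1] else "start")
    return frozenset([out_side, in_side])


def missing_edges(paths: Dict[str, List[Step]],
                  edges: Set[Edge]) -> Dict[str, List[Tuple[Step, Step]]]:
    """Map each broken path to the consecutive step pairs that lack an edge."""
    broken = {}
    for name, steps in paths.items():
        gaps = [(a, b) for a, b in zip(steps, steps[1:])
                if edge_between(a, b) not in edges]
        if gaps:
            broken[name] = gaps
    return broken


# Toy subgraph with nodes 1, 2 and 4. In the full graph, "alt" ran 1 -> 3 -> 4,
# but node 3 was left out of the chunk, so "alt" now jumps from 1 to 4.
edges = {
    frozenset([(1, "end"), (2, "start")]),
    frozenset([(2, "end"), (4, "start")]),
}
paths = {
    "ref": [(1, False), (2, False), (4, False)],
    "alt": [(1, False), (4, False)],
}
broken = missing_edges(paths, edges)
```

In this toy case only `alt` is reported, because the chunk keeps its first and last node but not the edge (or the intermediate node) that connected them in the full graph. The same situation would arise for any non-reference path that leaves the extracted region and re-enters it.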

@subwaystation
Member Author

Thanks for the prompt answer @glennhickey !

Ah, I see. So I am asking for a subgraph in which a path is no longer fully contained, because of the assembly-style graph, and vg chunk cannot handle that.

Cool, looking forward to that implementation ;)

@subwaystation
Member Author

If I can assist you at some point, let me know.

@subwaystation
Member Author

One edge case I can think of, having SequenceTubeMap and the current vg chunk in mind, is the following:
If I extract a subgraph by path_name:start_pos-end_pos, I will only get the paths running through the nodes of the subgraph. But there could be a path that does not contain any of the sequence represented by these nodes. Such a path is anchored in one node to the left and one node to the right of the subgraph. This might be a structural variant that I want to be able to show in, e.g., SequenceTubeMap.
Would it still be a valid subgraph if it contains a path that visits none of its nodes?

@glennhickey
Contributor

I just tried the data and can reproduce. But I'm curious to know why tubemaps is crashing though? vg view reporting the graph is invalid is just a warning. It still exits with code 0 (as per your output). Is it that tubemaps is looking for the edge that's missing in the path?
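
Since vg view exits with code 0 even when it prints the warning, a caller that wants to notice the problem has to look at stderr instead. Below is a minimal Python sketch that extracts the broken path edges from captured stderr text. The message format is copied from the output quoted earlier in this thread; that other vg versions print it identically, or that the "end" side appears in the same form, is an assumption. `BrokenPathEdge` and `find_broken_path_edges` are hypothetical names.

```python
import re
from typing import List, NamedTuple


class BrokenPathEdge(NamedTuple):
    path: str
    from_node: int
    from_side: str
    to_node: int
    to_side: str


# Pattern modelled on the message quoted in this thread, e.g.
# graph path 'DBVPG6765_chrVII' invalid: edge from 4218946 start to 1204907 start does not exist
_PATTERN = re.compile(
    r"graph path '(?P<path>[^']+)' invalid: "
    r"edge from (?P<from_node>\d+) (?P<from_side>start|end) "
    r"to (?P<to_node>\d+) (?P<to_side>start|end) does not exist"
)


def find_broken_path_edges(stderr_text: str) -> List[BrokenPathEdge]:
    """Return one record per 'graph path ... invalid' message found in the text."""
    return [
        BrokenPathEdge(
            m["path"],
            int(m["from_node"]),
            m["from_side"],
            int(m["to_node"]),
            m["to_side"],
        )
        for m in _PATTERN.finditer(stderr_text)
    ]
```

The stderr text could come from, for example, `subprocess.run(..., capture_output=True, text=True).stderr`. Any prefix that a wrapper adds in front of the message (such as "vg view err data:") is ignored, because the pattern is searched for anywhere in the text.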

@subwaystation
Member Author

subwaystation commented Oct 11, 2019

I suspect that it cannot deal with the fact that the *.annotate.txt file is empty. But I have to admit that I am not yet familiar enough with TubeMaps to test that.
I fiddled around in the code so that TubeMaps' built-in command line leaves out -T -b -E, and then it just breaks again without giving any helpful error. At least none that is helpful to me.

@ekg
Member

ekg commented Oct 11, 2019

@glennhickey that code snippet doesn't quite do what I'm suggesting.

My idea was to break the paths where they are discontinuous in the subgraph. For each broken path segment, we set a name that relates it to the path it was derived from.

The hack I wanted to implement was using naming convention to convey path ranges of the subgraph.

So if we had a path x that got split into pieces we might get paths like [x]:10-20, [x]:30-40. Then we could also make another subgraph of one, yielding [[x]:30-40]:3-6. Maybe we should be translating the positions from the original path, but that would be a bit more involved.
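
The naming idea above could be sketched in Python roughly as follows. This is only an illustration, not the implementation in vg. It assumes that ranges are 0-based, half-open base offsets along the parent path, that a path breaks only where one of its nodes was dropped from the subgraph, and that original path names neither start with "[" nor contain "]:". The names `split_path` and `resolve` are hypothetical.

```python
from typing import List, Set, Tuple

Step = Tuple[int, int]  # (node_id, node_length_in_bases); orientation omitted


def split_path(name: str, steps: List[Step],
               kept_nodes: Set[int]) -> List[Tuple[str, List[Step]]]:
    """Split a path into maximal runs of kept nodes, named [name]:start-end."""
    segments = []
    current: List[Step] = []
    seg_start = 0
    offset = 0  # base offset of the current step along the parent path
    for node_id, length in steps:
        if node_id in kept_nodes:
            if not current:
                seg_start = offset
            current.append((node_id, length))
        elif current:
            # The run ended at a dropped node: close the segment here.
            segments.append((f"[{name}]:{seg_start}-{offset}", current))
            current = []
        offset += length
    if current:
        segments.append((f"[{name}]:{seg_start}-{offset}", current))
    return segments


def resolve(name: str) -> Tuple[str, int, int]:
    """Translate a (possibly nested) segment name to original-path coordinates."""
    inner, sep, span = name.rpartition("]:")
    if not sep or not inner.startswith("["):
        raise ValueError(f"not a segment name: {name!r}")
    inner = inner[1:]  # strip the opening bracket
    start, end = (int(x) for x in span.split("-"))
    if inner.startswith("["):
        # Nested segment: shift by the start of the enclosing segment.
        base, base_start, _ = resolve(inner)
        return base, base_start + start, base_start + end
    return inner, start, end


# Path x visits five 10 bp nodes; the subgraph keeps only nodes 2 and 4.
x_steps = [(1, 10), (2, 10), (3, 10), (4, 10), (5, 10)]
names = [seg_name for seg_name, _ in split_path("x", x_steps, {2, 4})]
```

With this toy input the segment names come out as [x]:10-20 and [x]:30-40, and `resolve("[[x]:30-40]:3-6")` maps the nested name back to positions 33-36 on x. That second function is one way to get the translation to original positions without storing it in the name itself.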

@glennhickey
Contributor

@ekg which code snippet?

@ekg
Member

ekg commented Oct 11, 2019 via email

@subwaystation
Member Author

Any updates here? As far as I understand, #2506 did not pass Travis?

@glennhickey
Contributor

Are you able to try the branch from #2506 to see if it solves your problem? That PR's stuck on a unit test failure that occurs only on Mac that I'm having trouble reproducing.

@subwaystation
Member Author

I will try it out and report back here.

@subwaystation
Member Author

So I did:

git clone --recursive https://github.com/vgteam/vg.git
cd vg
git checkout glenn
. ./source_me.sh && make

And I ran into:

In file included from src/packed_path_position_overlays.cpp:1:
include/bdsg/packed_path_position_overlays.hpp:16:10: fatal error: BooPHF.h: No such file or directory
   16 | #include <BooPHF.h>
      |          ^~~~~~~~~~
compilation terminated.
make[1]: *** [Makefile:63: obj/packed_path_position_overlays.o] Error 1
make[1]: Leaving directory '/home/heumos/git/vg_2504/vg/deps/libbdsg'
make: *** [Makefile:618: lib/libbdsg.a] Error 2

@subwaystation
Member Author

Is there a dependency that is not installed on my machine? I have ArchLinux running.

@subwaystation
Member Author

The README of https://github.com/vgteam/libbdsg tells me that I need to have https://github.com/rizkg/BBHash/tree/alltypes installed in a place on the system where the compiler can find it.
But BBHash seems to be there:

[heumos@wave deps]$ ls /home/heumos/git/vg_2504/vg/deps/BBHash/
BooPHF.h         example.cpp                      LICENSE
bootest.cpp      example_custom_hash.cpp          makefile
bootestFile.cpp  example_custom_hash_strings.cpp  README.md

I would expect the Makefile to take care of the rest?

@glennhickey
Contributor

glennhickey commented Oct 29, 2019 via email

@subwaystation
Member Author

Thanks! On it again.

@subwaystation
Member Author

subwaystation commented Oct 30, 2019

I did:

git clone --recursive https://github.com/vgteam/vg.git --branch glenn
cd vg/
. ./source_me.sh && make

And I still get:

In file included from src/packed_path_position_overlays.cpp:1:
include/bdsg/packed_path_position_overlays.hpp:16:10: fatal error: BooPHF.h: No such file or directory
   16 | #include <BooPHF.h>
      |          ^~~~~~~~~~
compilation terminated.
make[1]: *** [Makefile:63: obj/packed_path_position_overlays.o] Error 1
make[1]: Leaving directory '/home/heumos/git/vg_2504/vg/deps/libbdsg'
make: *** [Makefile:618: lib/libbdsg.a] Error 2

Am I missing something?

The file exists in deps:

[heumos@wave vg]$ ls deps/BBHash/BooPHF.h 
deps/BBHash/BooPHF.h

@glennhickey
Contributor

glennhickey commented Oct 30, 2019 via email

@subwaystation
Member Author

I am not even able to build the master branch on my machine, see #2522. But it aborts with a different error. Puzzling.

I will try to build on a VM which hosts Ubuntu 18.04.
But I still want to be able to compile vg on my machine.

@subwaystation
Member Author

So I was able to build both the current master and @glennhickey's branch on Ubuntu 18.04.
Now I can test his implementation.

But it would make me really happy if I could compile vg on my machine.
