Skip to content

Programming with the vg API

Adam Novak edited this page Jan 9, 2023 · 9 revisions

You can use the graph data structures used by vg in your own programs. Different data structures, useful for different purposes, are packaged in different vg-ecosystem libraries.

Starting a libbdsg project

The most useful vg-ecosystem library is libbdsg, which provides the basic bi-directed sequence graph implementations that vg uses to store and manipulate graphs.

To use this library in a C++ project, you can add it as a Git submodule, and use it from your project's CMake-based build system with add_subdirectory:

mkdir myproject
cd myproject
git init

mkdir deps
cd deps
git submodule add https://github.com/vgteam/libbdsg.git
cd libbdsg
git checkout 2a513fb363f9f7c2f058faf9964b761e18e18310
git submodule update --init --recursive
cd ../..

cat >CMakeLists.txt <<'EOF'

cmake_minimum_required(VERSION 3.10)
project(myproject VERSION 0.0.1)

set(CMAKE_CXX_STANDARD 14)

# We will not need libbdsg's Python bindings
set(BUILD_PYTHON_BINDINGS OFF CACHE INTERNAL "" FORCE)
# We also will not need its Doxygen docs
set(RUN_DOXYGEN OFF CACHE INTERNAL "" FORCE)
add_subdirectory("${CMAKE_CURRENT_SOURCE_DIR}/deps/libbdsg")

EOF

Then you can build the project:

cd myproject
mkdir build
cd build
cmake ..
make -j $(nproc)

To actually use the library, you will need to define a CMake target for your code, and connect it to the libbdsg library target:

cd myproject
mkdir src
cat >src/main.cpp <<'EOF'

int main(int argc, char** argv) {
    return 0;
}

EOF

cat >>CMakeLists.txt <<'EOF'

add_executable(myproject ${CMAKE_CURRENT_SOURCE_DIR}/src/main.cpp)
target_link_libraries(myproject libbdsg)

EOF
(cd build && cmake ..)

Using libbdsg to load and access a graph

Let's access node and edge data from an example graph.

Converting formats

First, we need to get a graph in the correct format. Since libbdsg can't yet read GFA format, we can use vg to convert to the HashGraph format, which it can understand.

cd myproject
curl -O https://raw.githubusercontent.com/vgteam/vg/ffd491375f7e6d97c63598e35fefc653c9aea286/test/graphs/cactus-BRCA2.gfa
vg convert --hash-out cactus-BRCA2.gfa >cactus-BRCA2.vg

Then we can check to make sure we have the right format:

cd myproject
vg stats --format cactus-BRCA2.vg

Since we want to load the graph with a bdsg::HashGraph, make sure this says:

format: HashGraph

If you have a graph that isn't a HashGraph, you would need to either use a different libbdsg class (bdsg::PackedGraph), or else convert to a supported libbdsg format with vg, like we do above.

Loading a HashGraph

Once you have a HashGraph, you can load it like this:

cd myproject
cat >src/main.cpp <<'EOF'

#include <bdsg/hash_graph.hpp>
#include <iostream>

int main(int argc, char** argv) {
    // Make an empty graph
    bdsg::HashGraph graph;
    // Populate it by loading the graph from the given filename
    graph.deserialize("cactus-BRCA2.vg");
    
    // Report the number of nodes in the graph
    std::cout << "Graph has " << graph.get_node_count() << " nodes." << std::endl;
    return 0;
}
EOF

# Build the program
(cd build && make -j $(nproc))

When you run the program:

cd myproject
./build/myproject

You should get the right node count:

Graph has 1134 nodes.

Reading nodes, edges, and sequences

Coming soon! See also the Python tutorial.

Method Documentation

A bdsg::HashGraph implements handlegraph::MutablePathDeletableHandleGraph, a full-featured interface for a graph that stores embedded paths, and supports modification and deletion of graph elements.

Working with Reference Release Formats

References for vg often include haplotype information, which tracks how individual genomes are embedded in the graph, and what paths they take. This often comes in the form of a GBZ file (see File Types), which includes both the base graph nodes and edges, and the haplotype information, in a single file. To work with real-world data, you might need to be able to load and access a GBZ file.

Adding GBZ support

First, we need to get a GBZ file. The vg binary knows how to make one.

cd myproject
vg gbwt -G cactus-BRCA2.gfa --gbz-format -g cactus-BRCA2.gbz

The GBZ file format is implemented in the GBWTGraph library. So, the first step is to add that library and its dependencies to our project.

cd myproject
cd deps

# Install GBWT library
git submodule add https://github.com/jltsiren/gbwt.git
cd gbwt
git checkout 180e041a029af05f1d74e57058e3c16974c85362
cd ..

# Install gbwtgraph library
git submodule add https://github.com/jltsiren/gbwtgraph.git
cd gbwtgraph
git checkout 3cb45d5f0c9950d2db4493b2445acb60c8f24199
cd ..

cd ..

# Attach to CMake
cat >>CMakeLists.txt <<'EOF'

# SDSL library ("sdsl" target) already exists from libbdsg

# GBWT and GBWTGraph don't have CMake yet so we need to explain how to build them.

# GBWT dependency: OpenMP
if (${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
    # OpenMP may be installed by Homebrew in a place CMake can't find it by default.
    set (HOMEBREW_PREFIX "$ENV{HOMEBREW_PREFIX}")
    if ("${HOMEBREW_PREFIX}" STREQUAL "")
        # Ask Homebrew if it exists and where it is.
        execute_process(COMMAND brew --prefix OUTPUT_STRIP_TRAILING_WHITESPACE OUTPUT_VARIABLE HOMEBREW_PREFIX)
    endif()
    if ("${HOMEBREW_PREFIX}" STREQUAL "")
        # Default to somewhere.
        set (HOMEBREW_PREFIX "/opt/homebrew")
    endif()
    list(APPEND CMAKE_PREFIX_PATH "${HOMEBREW_PREFIX}")
    list(APPEND CMAKE_PREFIX_PATH "${HOMEBREW_PREFIX}/opt/libomp")
endif()
find_package( OpenMP REQUIRED )

# GBWT
foreach (gbwt_source IN ITEMS algorithms.cpp bwtmerge.cpp cached_gbwt.cpp dynamic_gbwt.cpp fast_locate.cpp files.cpp gbwt.cpp internal.cpp metadata.cpp support.cpp test.cpp utils.cpp variants.cpp)
    list(APPEND gbwt_libFiles "deps/gbwt/src/${gbwt_source}")
endforeach()
if(MSVC)
    file(GLOB gbwt_headerFiles RELATIVE ${CMAKE_CURRENT_SOURCE_DIR}  "${CMAKE_CURRENT_SOURCE_DIR}/deps/gbwt/include/gbwt/*.h")
endif()
set( gbwt_SRCS ${gbwt_libFiles} ${gbwt_headerFiles})
add_library( gbwt ${gbwt_SRCS} )
target_include_directories( gbwt PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/deps/gbwt/include")
target_link_libraries( gbwt PUBLIC sdsl OpenMP::OpenMP_CXX )
install(TARGETS gbwt
        LIBRARY DESTINATION lib
        ARCHIVE DESTINATION lib
        PUBLIC_HEADER DESTINATION include)
 
# GBWTGraph
foreach (gbwtgraph_source IN ITEMS algorithms.cpp cached_gbwtgraph.cpp gbwtgraph.cpp gbz.cpp gfa.cpp internal.cpp minimizer.cpp path_cover.cpp utils.cpp)
    list(APPEND gbwtgraph_libFiles "deps/gbwtgraph/src/${gbwtgraph_source}")
endforeach()
if(MSVC)
    file(GLOB gbwtgraph_headerFiles RELATIVE ${CMAKE_CURRENT_SOURCE_DIR}  "${CMAKE_CURRENT_SOURCE_DIR}/deps/gbwtgraph/include/gbwt/*.h")
endif()
set( gbwtgraph_SRCS ${gbwtgraph_libFiles} ${gbwtgraph_headerFiles})
add_library( gbwtgraph ${gbwtgraph_SRCS} )
target_include_directories( gbwtgraph PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/deps/gbwtgraph/include")
target_link_libraries( gbwtgraph PUBLIC gbwt handlegraph_shared )
install(TARGETS gbwtgraph
        LIBRARY DESTINATION lib
        ARCHIVE DESTINATION lib
        PUBLIC_HEADER DESTINATION include)
        
# Link gbwtgraph into our executable.
# Other libraries will come along
target_link_libraries(myproject gbwtgraph)
        
EOF

# Configure the build
(cd build && cmake ..)

Once that's done, we can load a GBZ file and interact with it using the normal handlegraph::HandleGraph interface.

cd myproject
cat >src/main.cpp <<'EOF'

#include <gbwtgraph/gbz.h>
#include <iostream>
#include <fstream>

int main(int argc, char** argv) {
    // Make an empty GBZ object, which handles deserialization.
    gbwtgraph::GBZ gbz;
    
    // Open a stream to read the GBZ from
    std::ifstream gbz_file("cactus-BRCA2.gbz");
    
    // Load from the stream
    gbz.simple_sds_load(gbz_file);
    
    // The GBZ object now owns a HandleGraph subclass (GBWTGraph) that we can use
    gbwtgraph::GBWTGraph& graph = gbz.graph;
    
    // Report the number of nodes in the graph
    std::cout << "Graph has " << graph.get_node_count() << " nodes." << std::endl;
    return 0;
}
EOF

# Build the program
(cd build && make -j $(nproc))

A GBZ file's graph only contains the nodes and edges that are visited by paths in the graph, so the node and edge counts may be different than those in the original graph used to generate the GBZ.

When you run the program:

cd myproject
./build/myproject

You should get the right node count:

Graph has 1134 nodes.

To recognize a GBZ file, you can use its leading magic number tag, 0x205A4247 little-endian, or GBZ .

Using predefined algorithms

In addition to directly working with the graph, you can use predefined algorithms from libhandlegraph, a dependency that libbdsg brings along. You can find a list of available algorithms in the Doxygen documentation.

Using Depth-First Search to find a subgraph around a node

Coming soon!

Cleaning up

If you are done with the tutorial and don't need your project anymore, you can get rid of it:

rm -Rf myproject
Clone this wiki locally