Extracting a FASTA from a Graph

Adam Novak edited this page Apr 13, 2023 · 2 revisions

Graph references often contain linear references within them. You might want copies of these, for example to call variants with a linear-reference-based caller like Google's DeepVariant.

If you don't already have a FASTA file for an assembly that is included in a graph, you can use vg to extract the assembly FASTA directly from the graph, like this:

vg paths --extract-fasta -x test/graphs/rgfa_with_reference.rgfa --paths-by GRCh38

Here, the argument to -x should be the graph file, in rGFA, GFA, .vg, .gbz, or any other graph file format that vg can read (see File Formats). The argument to --paths-by should be the prefix of the set of paths you would like to extract; generally you can use a sample or assembly name here. You can use vg paths --list -x <the graph> to get a list of all paths available.

This will produce a FASTA file on standard output:

>GRCh38#0#chr1
GGGGTACA

In most cases, the sequence names in the FASTA will be in PanSN format (see Path Metadata Model); these will match the names used by vg surject, and so a FASTA extracted like this is easy to use with a BAM file produced by vg surject.
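As a rough illustration of what a PanSN name contains, the sketch below splits FASTA headers on `#` into sample, haplotype, and contig fields. It assumes every header is in PanSN form; the temporary file and the `pansn_fields` variable are made up for this example, and the FASTA content is just the sample output shown above.

```shell
# Stand-in FASTA, copied from the sample output above.
fasta=$(mktemp)
printf '>GRCh38#0#chr1\nGGGGTACA\n' > "$fasta"

# For each header line, drop the leading '>' and print the three
# '#'-separated PanSN fields (sample, haplotype, contig) as tab-separated columns.
pansn_fields=$(awk -F'#' '/^>/ { sub(/^>/, "", $1); print $1 "\t" $2 "\t" $3 }' "$fasta")
printf '%s\n' "$pansn_fields"

rm -f "$fasta"
```

Names that are not in PanSN format would come out with empty haplotype and contig columns, so check the headers before relying on this split.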

To save the FASTA to a file instead, you can redirect standard output with >.
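One possible way to wrap the redirect is sketched below. The function name `extract_assembly_fasta` and the `.tmp` suffix are hypothetical, not part of vg; the vg command itself is the one shown earlier on this page. The sketch writes to a temporary file first, so a failed run does not leave a truncated FASTA at the final location.

```shell
# Hypothetical helper (not part of vg):
#   extract_assembly_fasta GRAPH PREFIX OUT_FASTA
extract_assembly_fasta() {
    # Same command as above, with standard output redirected to a temporary file.
    if vg paths --extract-fasta -x "$1" --paths-by "$2" > "$3.tmp"; then
        # Only move the file into place if vg exited successfully.
        mv "$3.tmp" "$3"
    else
        rm -f "$3.tmp"
        return 1
    fi
}

# Example call, using the graph from the command above and an arbitrary output name:
# extract_assembly_fasta test/graphs/rgfa_with_reference.rgfa GRCh38 GRCh38.fa
```

For a one-off extraction, appending `> GRCh38.fa` (or any other file name) to the original command does the same job.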

If you are interested in extracting haplotype paths from a .gbwt file, you can pass the .gbwt file with the -g option to vg paths, and the corresponding .gg file or any matching graph with -x.
