VG refresher

What to do during your down time

If you have have extra down time while you are waiting for tools to complete running, we have an optional activity that you can use it on--just for fun. This HPRC "scavenger hunt" has questions whose answers you can find the HPRC's marker paper.

Getting started

Since we have several VG versions installed, please execute the following to ensure you will run the correct VG version:

alias vg='/usr/local/bin/vg.1.48.0'

Once again, we're going to start with some toy data in the vg respository, so make sure you still have it checked out (if you still have this from day 1, you can omit this step):

cd ~
git clone https://github.com/vgteam/vg.git

Now create a directory to work on for this tutorial:

mkdir vg_refresher
cd vg_refresher
ln -s ~/vg/test/small

Review: using `vg` to build and view graphs

Exercise: Repeat the vg construct procedure from day 1 to construct a graph using small/x.fa and small/x.vcf.gz. Call the graph small.vg.

See answer

vg construct -r small/x.fa -v small/x.vcf.gz > small.vg

Exercise: Use vg view and dot to visualize the graph. Find nodes 56 and 57 in the visualization. Do you notice anything strange about them?

See answer

vg view -d small.vg | dot -Tpdf -o small.pdf

These nodes are separated by an edge, but not by any variation. It seems like it would be possible to have a longer single node in place of these two nodes.

This is in fact a default behavior in vg, which you can also see in nodes 101, 123, and 148.

Let's also revisit the GFA format, the text-based interchange format for graphs (similar to FASTA for sequences). Make a GFA for this graph.

# the -f flag indicates converting to GFA
vg convert -f small.vg > small.gfa

Since GFA is an interchange format for graphs, you can construct a graph by ingesting a GFA.

# -g indicates that the input is GFA, -p produces a .vg graph
vg convert -g -p small.gfa > small.copy.vg

Exercise: Convert small.copy.vg back into GFA format and confirm that the graphs are identical using diff. Do we get identical graphs?

See answer

The graphs are in fact identical, but you may still get a non-trivial diff. One reason is that the order of lines in a GFA is not fixed by the specification. Try using sort on both GFAs to put them in the same order and then comparing with diff again.

See answer

You may still not have an empty diff because of GFA's non-unique representation of edges. These two edge ("link") lines are equivalent:

L	1	+	2	+
L	2	-	1	-

The first connects the end of sequence 1 to the beginning of sequence 2, the second connects the beginning of sequence 2 to the end of sequence 1, i.e. they are the same.

Indexing and mapping in `vg`

`vg` mapping preliminaries

Although vg contains a number of tools for working with pangenome graphs, it is best-known for read mapping. This is ultimately what many of its users are interested in vg for. In fact, vg contains three mature short read mapping tools:

vg map: the original, highly accurate mapping algorithm
vg giraffe: the much faster and still accurate haplotype-based mapping algorithm
vg mpmap: the splice-aware RNA-seq mapping algorithm

Indexing a graph for mapping with `vg` (the hard way)

Like many other read mapping tools, each of these mapping tools requires specialized index data structures. Accordingly, the first step to mapping reads is to build and index a graph. We will start by indexing a graph for vg map, which has the simplest indexing pipeline of the three mapping tools.

We will start with some simple test data that is included in the VG source repository. Start by making a directory for us to work in today and then link the data into it.

cd ~
mkdir vg_mapping_exercises
cd vg_mapping_exercises
ln -s ~/vg/test/1mb1kgp
ln -s ~/vg/test/tiny

A `vg map` indexing pipeline

We begin the indexing pipeline by constructing a graph.

Excercise: Construct a graph using vg construct on reference sequence 1mb1kgp/z.fa and VCF 1mb1kgp/z.vcf.gz. Name the resulting file 1mb1kgp.vg.

See answer

You can use the following command:

vg construct -r 1mb1kgp/z.fa -v 1mb1kgp/z.vcf.gz > 1mb1kgp.vg

The graph we've just constructed is a mutable data structure---it allows for changes to its structure and contents. This is important for the vg construct algorithm, which builds the graph incrementally as it processes the input data. However, there are performance costs to maintaining mutability. A static graph can support a greater range of queries at faster speeds. When we are mapping reads, we will no longer need the graph to be mutable, so we can improve performance by converting to a static graph, which we call "XG format".

# -x indicates that the output is XG format
vg convert -x 1mb1kgp.vg > 1mb1kgp.xg

Next we want to index the graph for fast exact match queries. To do that vg map uses the GCSA2 index, which is a graph-specific generalization of the better known FM-index (used in many widely-used mapping tools, such as bwa mem and bowtie2). The risk with a GCSA2 index is that its size can grow exponentially with the size of the graph, and in fact we entirely expect it do so on genome-scale graphs. To prevent this from happening, we need to simplify the complicated regions of the graph by removing some nodes and edges using the vg prune subcommand.

vg prune 1mb1kgp.vg > 1mb1kgp.pruned.vg

Excercise: Try using 1mb1kgp.xg as input instead. What happens and why?

See answer

You should see an error about being unable to load a mutable graph. The vg prune algorithm will modify the graph by removing part of it, so the graph must be mutable. Recall that the XG is a static format.

Excercise: Confirm that the nodes in 1mb1kgp.pruned.vg have matching node IDs to the 1mb1kgp.vg.

See answer

This could be accomplished in multiple ways. One easy way is to match the S lines from their respective GFAs.

# grep out the sequence (i.e. node) lines and order them consistently
vg convert -f 1mb1kgp.vg | grep "^S" | sort > 1mb1kgp.nodes.txt
vg convert -f 1mb1kgp.pruned.vg | grep "^S" | sort > 1mb1kgp.nodes.pruned.txt

# count the nodes in 1mb1kgp.pruned.vg that have no pair in 1mb1kgp.vg (should be 0)
comm -23 1mb1kgp.nodes.pruned.txt 1mb1kgp.nodes.txt | wc -l

The fact that these graphs have the same node IDs is a subtle but key point. Different methods of processing graphs can produce different node IDs. To collate results across steps in a pipeline, we need matching node IDs---for example, between graph mappings and the graph. This is a frequent source of bugs in pangenomics pipelines.

We are now ready to construct the GCSA index.

vg index -g 1mb1kgp.gcsa -t 16 1mb1kgp.pruned.vg

Important note: The -t parameter in the command above sets the thread usage (to 16 threads). In most vg commands, the default number of threads is the number of cores on the machine, and vg will happily keep all of these cores busy. If you are working in a shared compute environment, you should usually use -t to be considerate of other users.

You should now see two files: 1mb1kgp.gcsa and 1mb1kgp.gcsa.lcp. These should be thought of as a single index spread out over two files. You need both of them for mapping.

Mapping with `vg map`

Now we are finally ready to map reads.

# -x and -g supply the XG and GCSA, -f is for FASTQ input, and the second 
# -f indicates that these are paired end FASTQs
vg map -x 1mb1kgp.xg -g 1mb1kgp.gcsa -f /opt/vg_data/1mb1kgp/1mb1kgp_1.fq -f /opt/vg_data/1mb1kgp/1mb1kgp_2.fq -t 16 > 1mb1kgp.mapped.gam

The default output format is GAM, which is based on Google's Protobuf serialization method. To view what's in one of these alignments, we can convert it into a text-based JSON record.

# -a indicates alignment input, -j is JSON output
vg view -a -j 1mb1kgp.mapped.gam | head -n 1

The output is a bit busy, though. Let's use the jq tool to format it nicely.

vg view -a -j 1mb1kgp.mapped.gam | head -n 1 | jq

What is the name of the field that in the GAM that expresses the alignment of the read to the graph?

See answer

The alignment is encoded in the path field. The walk through the graph is indicated by a list of mapping records, which each have a position that includes the node ID. The mappings also have a list of edits that expression the alignment of the read sequence to the node sequence.

Another common format that used to express graph alignments is GAF. We can use vg convert to convert between the formats.

vg convert -G 1mb1kgp.mapped.gam 1mb1kgp.xg > 1mb1kgp.mapped.gaf

GAF is a text-based, tab-separated format. Look at a few lines.

head 1mb1kgp.mapped.gaf | less -S

How does GAF show the alignment's walk through the graph?

See answer

The walk is expressed in column 6 by a series of node IDs with arrows to indicate their orientation.

Exercise: Look at the output to confirm that the first alignment has the same walk through the graph in both GAF and GAM format.

Indexing a graph for mapping with `vg` (the easy way)

We have seen that the indexing pipeline for vg map requires a number of steps to preprocess the data and construct indexes. These steps are spread out across multiple subcommands, each of which handles one or a few indexes. This is complicated, and it gets worse. Both vg giraffe and vg mpmap have more complicated indexing pipelines.

The vg autoindex subcommand is designed to alleviate this complexity by focusing on the mapping tool you want to use, rather than the indexes you need to create. The inputs are all common interchange formats like FASTA, VCF, and GFA. Power users might still find need for the individual indexing subcommands, but common use cases should be handled in a single command.

Let's re-do the indexing pipeline from the previous section in a single command.

# -p indicates the prefix for the output indexes
# -w indicates that we are indexing for the "map" workflow
vg autoindex -r 1mb1kgp/z.fa -v 1mb1kgp/z.vcf.gz -p 1mb1kgp_again -w map

See what got created:

ls 1mb1kgp_again*

Is there anything that you expected to see but do not?

See answer

You may have expected to see a .vg like the ones we made using vg construct and vg prune. However, these files are not needed for mapping with vg map. They are constructed as temporary files and then discarded after being converted to more efficient indexes.

Exercise: Look at the logging output (lines starting with [IndexRegistry]) and try to correspond the steps vg autoindex takes to the ones we performed in the previous exercises.

We will soon shift focus to the vg giraffe mapping tool, so let's make indexes for it. Start by looking at the help documentation.

vg autoindex --help

Exercise: Using vg autoindex, create the graph and indexes for vg giraffe, using tiny/tiny.fa and tiny/tiny.vcf.gz. Give the indexes the prefix giraffe_indexes.

What indexes did vg autoindex create?

See answer

You should see giraffe_indexes.dist, giraffe_indexes.giraffe.gbz, and giraffe_indexes.min. If you don't see these indexes, did you request the correct workflow with the -w parameter?

Calling structural variants using `vg`

Background

One of the greatest promises for pangenome graphs is in characterizing structural variants (SVs). Methods based on linear references must contend with the fact that SVs often lead to poor read mapping. In contrast, any SV that is present in the pangenome graph is easy to map to. This makes pangenome graphs very useful tools for SV characterization using short reads.

This use case has been a particular focus for the vg giraffe mapping tool. It has been designed so that its speed is comparable to linear reference mapping tools, provided that the graph is relatively simple. On graphs with more complex topologies, some of its specialized indexes can become inefficient or intractable.

Constructing (simple-enough) graphs with structural variation

Presently, there are two types of data sources for SVs. The first are VCFs containing SV calls, often from long read sequencing panels. The second is from phased assemblies. We will discuss both options.

VCF input

We have already seen how to construct graphs using VCF input.

Exercise: How would you construct indexes for vg giraffe using a reference sequence Reference.fasta and a SV call set SVs.vcf.gz?

See answer

The easiest method would use vg autoindex.

vg autoindex -r Reference.fasta -v SVs.vcf.gz -w giraffe -p index

Important considerations

vg can only use SVs that are fully base-resolved. It will ignore variants that have approximate breakpoints.
For vg giraffe in particular, the results will be best if the variants are phased. Much of its speed comes from focusing on the paths that haplotypes take through the graph, so it needs this haplotype information.

Phased assembly input

Presently, the most capable method of producing pangenome graphs from phased assemblies that are both a) fully base-resolved, and b) simple enough for vg giraffe is the Cactus-Minigraph pipeline. It incorporates components from the cactus whole genome aligner and the minigraph pangenome graph constructor, which does not produce fully detailed graphs on its own.

Minigraph-Cactus is shipped along with cactus, which coordinates among a large number of sub-tools. For this reason, cactus is typically run out of a Python virtualenv that has been set up to ensure that all of the sub-tools are on the Unix paths that they need to be (see here for instructions on how to set up this virtualenv). Accordingly, we start by activating the virtualenv, which has already been set up on your server.

source /opt/cactus-bin-v2.5.2/cactus_env/bin/activate

Now let's build a pangenome graph. For demonstration, we will use a small set of haplotypes from the human MHC region---an immune-related region with extremely high levels of genetic polymorphism. Let's take a look at them:

ls -l /opt/vg_data/mhc_haplotypes/

The primary command line input to Cactus-Minigraph is called the "seqFile", and it is essentially just a list of the sequences you are including. The format is:

seqName1    /location/of/seqName1
seqName2    /location/of/seqName2
...

The locations in the second column can be local filepaths, URLs, or addresses in AWS s3 storage.

Exercise: Construct a seqFile for the MHC haplotypes and call it seqFile.txt.

See answer

The contents of the seqFile should be:

MHC-GRCh38	/opt/vg_data/mhc_haplotypes/MHC-00GRCh38.fa
MHC-CHM13	/opt/vg_data/mhc_haplotypes/MHC-CHM13.0.fa
MHC-HG002.1	/opt/vg_data/mhc_haplotypes/MHC-HG002.1.fa
MHC-HG002.2	/opt/vg_data/mhc_haplotypes/MHC-HG002.2.fa
MHC-HG00438.1	/opt/vg_data/mhc_haplotypes/MHC-HG00438.1.fa
MHC-HG00438.2	/opt/vg_data/mhc_haplotypes/MHC-HG00438.2.fa
MHC-HG005.1	/opt/vg_data/mhc_haplotypes/MHC-HG005.1.fa
MHC-HG005.2	/opt/vg_data/mhc_haplotypes/MHC-HG005.2.fa

One drawback of Cactus-Minigraph is that it is reference-based. The graph starts out with one sequence, and then adds new sequences onto it. This allows it to use synteny as a hint for sequence orthology, which is a large part of why Minigraph-Cactus produces simple graphs. For our example, we will use the T2T haplotype of CHM13 (called MHC-CHM13 in this data set) as the reference.

We are now ready to execute Cactus-Minigraph.

cactus-pangenome ./working_dir seqFile.txt --outDir mhc_pangenome --outName mhc --reference MHC-CHM13 --vcf --giraffe --gfa --gbz

This may take several minutes to run. While you are waiting, we will note a few things about Cactus-Minigraph and this command line invocation.

Cactus is built on top of the workflow framework Toil. Toil provides it with methods to run on both clouds and clusters, although the complexity of running it increases. Our examples will all run locally. Running locally is sufficient for small inputs, or for larger inputs in a time frame of a week or two.
The arguments --vcf --giraffe --gfa --gbz each add additional outputs to the pipeline. As you may guess, these can be used to provide the indexing input for vg giraffe. In fact, Cactus-Minigraph will run vg autoindex in the background to make these indexes.
You will notice that the vg giraffe indexes' file names include d2. This refers to the fact that these indexes exclude non-reference nodes that are present on only 1 haplotype (i.e. do not have "depth 2"). In the current iteration of the pipeline, we have found that this improves mapping performance. You can use the argument --giraffe full to keep all sequences and --giraffe clip to remove all rare nodes below 10% frequency.

Aside: if you are still waiting for Minigraph-Cactus to complete, you can continue to the next section and return to the following exercise when it finishes. The next section does not depend on the output of this one.

Exercise: Once Minigraph-Cactus is done building the pangenome, use Bandage to visualize the graph.

See answer

Bandage takes GFA as input, so first we need to locate the GFA, which turns out to be gzipped by default.

gunzip mhc_pangenome/mhc.gfa.gz
Bandage load mhc_pangenome/mhc.gfa

You could also use scp or rsync to move the GFA to your local computer and then use Bandage locally.

A precaution: The GFAs produced by Minigraph-Cactus differ somewhat from the GFAs produced by PGGB. Specifically, Minigraph-Cactus uses the W ("walk") lines added in GFA v1.1 to indicate haplotypes and distinguish them from reference paths, which are still represented using P lines. In contrast, PGGB uses P lines for both.

Mapping to a pangenome with `vg giraffe`

We are now going to shift focus to a Drosophila melanogaster pangenome graph that was also constructed with Minigraph-Cactus. The D. melanogaster genome is small for a multicellular eukaryote (~180 Mbp), so we can work with the entire genome and real sequencing data and still have practically-sized exercises.

Find the indexes in /opt/vg_data/drosophila/. What is there?

Exercise: Look at the vg giraffe --help documentation to see which arguments to supply the indexes to. Then map the paired reads in /opt/vg_data/drosophila/SRR835025_chr2_1.fastq.gz and /opt/vg_data/drosophila/SRR835025_chr2_2.fastq.gz. Save the result in GAM format as SRR835025_chr2.gam. (Tip: vg can read gzipped FASTQ files directly.)

See answer

vg giraffe -Z /opt/vg_data/drosophila/16-fruitfly-mc-2022-05-26-d2.gbz -d /opt/vg_data/drosophila/16-fruitfly-mc-2022-05-26-d2.dist -m /opt/vg_data/drosophila/16-fruitfly-mc-2022-05-26-d2.min -f /opt/vg_data/drosophila/SRR835025_chr2_1.fastq.gz -f /opt/vg_data/drosophila/SRR835025_chr2_2.fastq.gz > SRR835025_chr2.gam

The mapping may take several minutes. This is a real 30x Illumina data set taken from SRA, although (as the name suggests) it has been subsetted to the reads that map to chromosome 2.

Calling structural variants with `vg call`

vg call is the subcommand in vg that genotypes variants from mapped reads. However, it does not operate directly on the GAM file. First, we must preprocess the mappings into coverage depth for each of the nodes in a format that is called a "pack".

# -x: graph
# -g: GAM (-a to use GAF)
# -Q: minimum mapping quality to include an alignment
# -t: number of threads
# -o: output name
vg pack -x /opt/vg_data/drosophila/16-fruitfly-mc-2022-05-26-d2.gbz -g SRR835025_chr2.gam -Q 5 -t 16 -o SRR835025_chr2.pack

Why do you think we exclude alignments with mapping quality < 5?

See answer

For these reads, we believe that there is at least a ~30% chance that they are mapped to the wrong location. Including these alignments can introduce noise that makes variant calling less accurate. The developers consider this mapping quality filter a best practice.

Now we can use the pack to call variants.

vg call /opt/vg_data/drosophila/16-fruitfly-mc-2022-05-26-d2.gbz -k SRR835025_chr2.pack -t 16 > SRR835025_chr2.vcf

Open the VCF file with a text viewer. Roughly speaking, what proportion of the called variants are SVs?

See answer

Very few of the variants are SVs. vg call calls variants at all "snarls" in the pangenome graph: bubble-like structures that usually indicate variation. There are many more point variants than SVs in most populations (even though SVs affect a large amount of total sequence). Accordingly, most of the of the called variants are point variants. To restrict to SVs, you may want to filter downstream using bcftools.

What is given as the variant ID (column 3)?

See answer

It is the node IDs of the snarl boundaries.

Look for homozygous reference allele calls (indicated by a 0/0 genotype) in the VCF. Roughly speaking, what proportion of the variants are homozygous reference?

See answer

There are none. By default, vg call excludes homozygous reference calls. However, you may want to include them in the VCF because, for example, you are combining individual VCFs into a larger panel VCF.

Exercise: Find the option that includes homozygous reference calls using vg call --help.

See answer

The option is -a.

Calling novel variants and fine-tuning SVs

The vg call algorithm calls variants only at the bubble-like "snarl" motifs in the graph. As a consequence, it can't on its own call any novel variants (those that are present in the sample but not the graph). To call novel variants, you have to add them into the graph as bubbles using the vg augment subcommand. This can also help "fine tune" SV sequences by, for example, correcting a breakpoint or adding a variant to an insertion.

To start we make a mutable graph from our static one.

vg convert /opt/vg_data/drosophila/16-fruitfly-mc-2022-05-26-d2.gbz > 16-fruitfly-mc-2022-05-26-d2.vg

And then we augment it.

vg augment 16-fruitfly-mc-2022-05-26-d2.vg SRR835025_chr2.gam -A SRR835025_chr2.aug.gam -m 2 -q 10 > 16-fruitfly-mc-2022-05-26-d2.aug.vg

The -A argument saves a new GAM file. Why is this necessary?

See answer

Adding new bubbles into the graph requires us to break existing nodes and add new nodes. This requires the existing node IDs to change in the augmented graph. Accordingly, we need to update the paths of the alignments if we want them to match the graph that we are going to call variants with.

The arguments -m 2 -q 10 filter out bubbles unless they are supported by at least 2 reads with an average base quality of 10. Why might we do this?

See answer

Putative variants that are only supported by a single read or by only bases of low quality are likely to be sequencing errors. Adding these candidate variants will expend unnecessary computational effort. In addition, it would increase the multiple testing burden, which can erode accuracy.

Exercise: Repeat the vg call pipeline with the augmented graph and augmented GAM. Name the resulting VCF SRR835025_chr2.aug.vcf.

See answer

vg pack -x 16-fruitfly-mc-2022-05-26-d2.aug.vg -g SRR835025_chr2.aug.gam -Q 5 -t 16 -o SRR835025_chr2.aug.pack
vg call 16-fruitfly-mc-2022-05-26-d2.aug.vg -k SRR835025_chr2.aug.pack -t 16 > SRR835025_chr2.aug.vcf

Exercise: Use a text viewer to look at the first handful of variants in SRR835025_chr2.aug.vcf and SRR835025_chr2.vcf. Try to locate the same variant in both VCFs. What do you notice about their IDs (column 3)?

See answer

Many of the variants will now have different IDs, because the node IDs in the graph have changed through the augmentation process.

Exercise: Find a variant that is present in the augmented VCF but not in the original VCF. (Hint: you can use grep -v "#" filename.vcf | cut -f 2 to extract the positions of the variants).

Calling small variants with `vg giraffe` and `DeepVariant`

Many state-of-the-art small variant calling algorithms are based on deep neural network technology, and Google's DeepVariant is among the most successful. It is fundamentally a linear reference method. DeepVariant uses the coordinates of a linear reference to make numerical vector inputs for the neural network.

Nevertheless, DeepVariant can benefit from pangenome read mapping. The pangenome can alleviate alignment issues caused by reference bias, which improves the accuracy of small variant calling. The combination of vg giraffe and DeepVariant can produce some of the best-in-class small variant calls.

To use vg mappings in a linear reference method, it is first necessary to convert them into linear mappings. This is accomplished with the vg surject subcommand, which realigns the reads to the path of the reference while preserving as much as possible of the graph alignment.

Exercise: We need to identify which paths in the graph are the reference. Use vg paths --list to compare the names of the graph's paths to /opt/vg_data/drosophila/reference_paths.txt.

See answer

You should see a very long list of paths if you execute

vg paths --list -x /opt/vg_data/drosophila/16-fruitfly-mc-2022-05-26-d2.gbz

However, the paths from /opt/vg_data/drosophila/reference_paths.txt are all present (try piping it through grep dm6). The other paths are regions of non-reference haplotypes added by Minigraph-Cactus during graph construction. We need the reference_paths.txt file to identify which of these paths we should be realigning to.

Now let's actually use vg surject.

# -b for BAM output
# -F is a list of the paths embedded in the graph that we project the alignments onto
vg surject -x /opt/vg_data/drosophila/16-fruitfly-mc-2022-05-26-d2.gbz -t 16 -F /opt/vg_data/drosophila/reference_paths.txt -b SRR835025_chr2.gam > SRR835025_chr2.bam

Don't feel any need to wait for the realignment to complete, we won't be using it. We only include this command to introduce it to you. In truth, the best practices pipeline to run DeepVariant with vg giraffe is quite complicated. Fortunately, it has already been packaged in a WDL (Workflow Description Language) script.

Download the vg WDL repository.

git clone https://github.com/vgteam/vg_wdl.git

WDL scripts can be run in cloud and cluster environments, but we will use the miniwdl, a WDL runner that uses local compute. We have already collected the inputs for the workflow. They are passed to miniwdl through an input file at /opt/vg_data/giraffe_deepvariant/inputs.json.

Exercise: Look at the contents of /opt/vg_data/giraffe_deepvariant/inputs.json. Do you recognize the inputs for the vg subcommands we have used?

See answer

We previously saw GBZ_FILE, MIN_FILE, and DIST_FILE as the outputs of vg autoindex. We also saw PATH_LIST_FILE as an input for vg surject in the previous exercise.

Let's run the pipeline.

miniwdl run vg_wdl/workflows/giraffe_and_deepvariant.wdl -i /opt/vg_data/giraffe_deepvariant/inputs.json

The WDL pipeline includes both mapping and variant calling, and you will be able to explore the outputs of every step. The outputs will be placed in a subdirectory of your current working directory when it completes.

Exercise: Find the file that describes the pipeline's outputs, and use it to find the output VCF.

See answer

The output directory should be named something like (current date and time)_GiraffeDeepVariant. It contains a file outputs.json which gives the path to the VCF under GiraffeDeepVariant.output_vcf.

Open-ended final exercise: What can you infer about the sample's blood type based on the results in this VCF?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Day_3a_vg_mapping_and_calling.md

Day_3a_vg_mapping_and_calling.md

VG refresher

What to do during your down time

Getting started

Review: using `vg` to build and view graphs

Indexing and mapping in `vg`

`vg` mapping preliminaries

Indexing a graph for mapping with `vg` (the hard way)

A `vg map` indexing pipeline

Mapping with `vg map`

Indexing a graph for mapping with `vg` (the easy way)

Calling structural variants using `vg`

Background

Constructing (simple-enough) graphs with structural variation

VCF input

Phased assembly input

Mapping to a pangenome with `vg giraffe`

Calling structural variants with `vg call`

Calling novel variants and fine-tuning SVs

Calling small variants with `vg giraffe` and `DeepVariant`

Files

Day_3a_vg_mapping_and_calling.md

Latest commit

History

Day_3a_vg_mapping_and_calling.md

File metadata and controls

VG refresher

What to do during your down time

Getting started

Review: using vg to build and view graphs

Indexing and mapping in vg

vg mapping preliminaries

Indexing a graph for mapping with vg (the hard way)

A vg map indexing pipeline

Mapping with vg map

Indexing a graph for mapping with vg (the easy way)

Calling structural variants using vg

Background

Constructing (simple-enough) graphs with structural variation

VCF input

Phased assembly input

Mapping to a pangenome with vg giraffe

Calling structural variants with vg call

Calling novel variants and fine-tuning SVs

Calling small variants with vg giraffe and DeepVariant

Review: using `vg` to build and view graphs

Indexing and mapping in `vg`

`vg` mapping preliminaries

Indexing a graph for mapping with `vg` (the hard way)

A `vg map` indexing pipeline

Mapping with `vg map`

Indexing a graph for mapping with `vg` (the easy way)

Calling structural variants using `vg`

Mapping to a pangenome with `vg giraffe`

Calling structural variants with `vg call`

Calling small variants with `vg giraffe` and `DeepVariant`