-
Notifications
You must be signed in to change notification settings - Fork 196
Mapping short reads with Giraffe
This tutorial will explain how to use vg giraffe
to map short reads to a pangenome graph. See also Giraffe best practices.
The easiest way to start mapping to your own data is to have a single-copy reference FASTA, and a phased VCF(s) on several samples describing the variation you want to include. Your FASTA should not include alternative loci, because to use them properly you would also need to provide an alignment of the alternative loci to the main chromosome contigs, and vg cannot make sue of such an alignment alongside a VCF. Your VCF file(s) should be self-consistent, and not include contradictory or overlapping variants. They should also be restricted to VCF 4.2 features; the *
allele of VCF 4.3 is not yet supported.
If you have the graph and the haplotypes already constructed in a GFA file, you can also use that as an input. With vg version 1.52.0 and earlier, the haplotypes should be stored as W-lines in the GFA file. Older versions of vg autoindex
do not recognize haplotypes stored as P-lines and will pick the wrong approach for building the graph. Starting from version 1.53.0, P-lines work as well, as long as vg understands their names as haplotypes.
You can also use unphased VCFs, or a GFA file without haplotype paths, as input, but in those cases Giraffe will not have real haplotype information and will instead map to greedily-selected covering paths of variable regions. This can produce bad mapping results when reads span multiple variants.
To turn your FASTA and VCFs into a graph, you can use vg autoindex
, passing the base FASTA with -r
and the VCFs each with -v
. You can set the base name of the output files with -p
; it will default to index
.
cp ./wiki-data/mock-hs37d5/data/hs37d5.fa ./wiki-data/mock-hs37d5/data/hs37d5.fa.fai .
cp ./wiki-data/mock-hs37d5/data/*.vcf.gz ./wiki-data/mock-hs37d5/data/*.vcf.gz.tbi .
VCF_ARGS=()
for CHROM in {1..22} X Y ; do
VCF_ARGS+=(-v chr${CHROM}.vcf.gz)
done
vg autoindex --workflow giraffe -r hs37d5.fa "${VCF_ARGS[@]}" -p hs37d5-pangenome
rm *.fa *.fa.fai *.vcf.gz *.vcf.gz.tbi
This will build all the files needed for Giraffe to run:
- GBZ graph
.giraffe.gbz
containing:- GBWT haplotype index, augmented with a path cover of graph components without haplotypes and/or subsampled to a reasonable number of local haplotypes.
- GBWTGraph that provides node sequences.
- Minimizer index for seed finding, annotated with positions in the distance index, in the
.min
file. - Minimum distance index in the
.dist
file.
Before version 1.34.0, the GBWT and the GBWTGraph were stored in separate .giraffe.gbwt
and .gg
files.
If you are working with a graph input in GFA format, instead of a FASTA and VCFs, you can use the -g
option to provide the GFA as input to vg autoindex
, instead of the -r
and -v
options.
When a linear reference sequence needed (e.g. for mapping to SAM/BAM/CRAM format), reference samples can be specified using the RS
tag in GFA header. See File Formats for further information. You can also set reference samples later using the VG GBWT Subcommand.
If you have a graph with haplotypes as P-lines, you must build the indexes manually. For example:
vg gbwt --gbz-format -g graph.gbz -G graph.gfa
vg index -j graph.dist graph.gfa
vg minimizer -d graph.dist -o graph.min graph.gbz
See VG GBWT Subcommand and Index Construction for further details.
If you have trouble building indexes, you may need more memory. You can control the amount of memory that the autoindexing process seeks to use at any given time with the --target-mem
option, but be aware that this is a target, and the heuristics used to estimate the memory requirements for building various partial indexes may not work well on your data.
Once your indexes are built, you can then map reads using vg giraffe
. For example, to map single-end reads into vg's GAM alignment format with vg 1.34.0 or later, you can do:
cp ./wiki-data/mock-hs37d5/data/sim.fq .
vg giraffe -Z hs37d5-pangenome.giraffe.gbz -m hs37d5-pangenome.min -d hs37d5-pangenome.dist -f sim.fq >mapped.gam
rm sim.fq
If you would prefer GAF-format output, add -o gaf
to the command line. It is also possible to project the alignments to a linear reference and output them in SAM/BAM/CRAM format. Before version 1.37.0, that required giving a separate graph with reference paths using -x
. Starting from 1.37.0, reference paths are also stored in GBWTGraphs and GBZ graphs.
If you have paired-end reads, add -i
if they are interleaved, or add a second FASTQ file with another -f
option.
Option -p
prints some progress information, which can be useful if something goes wrong.
If all indexes have the same base name, it is often enough to specify only the name of the GBZ file with -Z
, and Giraffe will guess the names of the minimizer index and the distance index. Specifying the names explicitly with -m
and -d
is still the best practice, as Giraffe will start building the indexes if it cannot find then.
To inspect the alignments for quality control, you can use:
vg stats -a mapped.gam
For this test data (simulated directly from the base FASTA with no variants), this will produce something like:
Total alignments: 20000
Total primary: 20000
Total secondary: 0
Total aligned: 20000
Total perfect: 20000
Total gapless (softclips allowed): 20000
Total paired: 0
Total properly paired: 0
Insertions: 0 bp in 0 read events
Deletions: 0 bp in 0 read events
Substitutions: 0 bp in 0 read events
Softclips: 0 bp in 0 read events
Total time: 0.912064 seconds
Speed: 21928.3 reads/second
In real data, you should expect Insertions
, Deletions
, Substitutions
, and Softclips
all to be nonzero, and if your reads are paired you would want to see almost all your reads under Total properly paired
When you are done following along, you can clean up after yourself:
rm mapped.gam hs37d5-pangenome.*