- Abstract
- Key Reference Material
- Part 1: Pangenome Graph Construction
- Part 2: Pangenome Graph Properties
- Part 3: Mapping Reads to the Graph
- Part 4: Genotyping and Variant Calling
This is a tutorial written to support the Reference Graph Pangenome Data Analysis Hackathon 2023 Nov. 13-17 in Cape Town, South Africa. The aim is to provide detailed instructions on how to create a pangenome reference graph with Minigraph-Cactus then use it for some downstream analysis like variant calling and genotyping.
Unlike some previous workshops, and most of the existing Cactus documentation, this tutorial will focus on whole-genome human data. As such, it will need to be run over a period of time longer than a typical workshop session. The running times and memory usage of each command will be given wherever possible.
Update: Many thanks to the workshop attendees for their patience and interest, and all credit to your feedback for helping to debug it! I will try to continue to support it at least for a little while. Feel free to reach out to me on github or otherwise (even if you weren't in the workshop) with any questions or problems.
Please visit these links for related material and background information before proceeding further. The first link is essential and should absolutely be consulted before continuing, and the rest are highly recommended.
- Minigraph-Cactus Manual: This is essential to read, and includes several small examples (with data) that should be run before tackling whole-genomes.
- Minigraph-Cactus Paper: The methods are described in detail here.
- HPRC v1.1 Minigraph-Cactus Instructions: Commands and explanations in order to exactly reproduce the latest released HPRC graphs. The commands themselves assume a Slurm cluster but can be trivially modified to run on a single computer (remove --batchSystem slurm).
- HPRC Graph Downloads: Get the HPRC graphs here.
- HPRC Paper: Detailed analysis of the HPRC graph, and examples of many downstream applications of Minigraph-Cactus pangenomes.
- @jeizenga's 2023 Memphis Workshop, which served as an inspiration for this tutorial.
Important: We will be using Cactus v2.6.13 for this tutorial. Be warned that some steps may not work for older (or newer) versions.
For simplicity, all Cactus commands will be run in "single-machine" mode via its docker image. Cactus also supports distributed computing environments via Slurm and AWS/Mesos. See the Running on Slurm section for more details about running on a cluster.
In order to make sure singularity is working, try running the following and verify that you do not get an error. If this step does not work, you will need to consult your local sysadmin.
singularity run docker://hello-world
As you've seen in the Minigraph-Cactus Manual (please go back and read it if you haven't already), the input is a list of sample name and genome assembly pairs. For diploid assemblies, the convention of SAMPLE.1 / SAMPLE.2 must be used (and dots otherwise avoided in sample names).
In addition to your samples of interest, you should include at least one reference genome. This will allow you to use reference coordinates to, for example, project variants on. In this example, which is based on a small subset of 4 samples of the HPRC data, we will use GRCh38 and CHM13.
Please copy-paste the following data into hprc10.seqfile
GRCh38 https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz
CHM13 https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz
HG00438.1 https://s3-us-west-2.amazonaws.com/human-pangenomics/working/HPRC/HG00438/assemblies/year1_f1_assembly_v2_genbank/HG00438.paternal.f1_assembly_v2_genbank.fa.gz
HG00438.2 https://s3-us-west-2.amazonaws.com/human-pangenomics/working/HPRC/HG00438/assemblies/year1_f1_assembly_v2_genbank/HG00438.maternal.f1_assembly_v2_genbank.fa.gz
HG00621.1 https://s3-us-west-2.amazonaws.com/human-pangenomics/working/HPRC/HG00621/assemblies/year1_f1_assembly_v2_genbank/HG00621.paternal.f1_assembly_v2_genbank.fa.gz
HG00621.2 https://s3-us-west-2.amazonaws.com/human-pangenomics/working/HPRC/HG00621/assemblies/year1_f1_assembly_v2_genbank/HG00621.maternal.f1_assembly_v2_genbank.fa.gz
HG00673.1 https://s3-us-west-2.amazonaws.com/human-pangenomics/working/HPRC/HG00673/assemblies/year1_f1_assembly_v2_genbank/HG00673.paternal.f1_assembly_v2_genbank.fa.gz
HG00673.2 https://s3-us-west-2.amazonaws.com/human-pangenomics/working/HPRC/HG00673/assemblies/year1_f1_assembly_v2_genbank/HG00673.maternal.f1_assembly_v2_genbank.fa.gz
HG00733.1 https://s3-us-west-2.amazonaws.com/human-pangenomics/working/HPRC_PLUS/HG00733/assemblies/year1_f1_assembly_v2_genbank/HG00733.paternal.f1_assembly_v2_genbank.fa.gz
HG00733.2 https://s3-us-west-2.amazonaws.com/human-pangenomics/working/HPRC_PLUS/HG00733/assemblies/year1_f1_assembly_v2_genbank/HG00733.maternal.f1_assembly_v2_genbank.fa.gz
If you are making a pangenome graph with your own data, this input listing should be the only part you need to change, but do see the explanation of the options below as some may require adjustments for different data sizes. Also, nothing changes if you want to use haploid assemblies -- just do not use the .1 and .2 suffixes (see CHM13 and GRCh38 above).
I am going to run on 32 cores in order to simulate my understanding of an "average" node on your cluster. As you've seen in the Minigraph-Cactus Manual (please go back and read it if you haven't already), the simplest way to build the graph is with the cactus-pangenome command.
Here it is, with an explanation of each option following below.
Update: Previous versions of this document used --giraffe clip filter below. Since Cactus v2.8.5, this is no longer necessary: you can use just --giraffe filter (or leave the --giraffe option out altogether to go all in on haplotype subsampling). Not using --giraffe clip drastically reduces peak memory usage.
rm -rf cactus-scratch && mkdir cactus-scratch
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
cactus-pangenome ./js ./hprc10.seqfile --outDir ./hprc10 --outName hprc10 --reference GRCh38 CHM13 \
--filter 2 --haplo --giraffe filter --viz --odgi --chrom-vg clip filter --chrom-og --gbz clip filter full \
--gfa clip full --vcf --vcfReference GRCh38 CHM13 --logFile ./hprc10.log --workDir ./cactus-scratch \
--consCores 8 --mgMemory 128Gi
For singularity exec:
- -H : Set the home/working directory to the current directory.
- docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 : the cactus docker image. It will be cached locally (probably in ~/.singularity as a .sif file).
For cactus-pangenome:
- ./js : Scratch directory that will be created for Toil's jobstore.
- ./hprc10.seqfile : The input samples and assemblies. This file was created above.
- --outDir ./hprc10 : The output directory. All results will be here.
- --outName hprc10 : This will be the prefix of all output files.
- --reference GRCh38 CHM13 : Specify these two samples as reference genomes. Reference samples are indexed a little differently in vg to make their coordinates easier to use. Also, the first reference given (GRCh38 in this case) is used to anchor the entire graph and is treated differently from the other samples. Please see here for more details.
- --filter 2 : Create an Allele-Frequency filtered (AF) graph that contains only nodes and edges supported by at least 2 haplotypes. This can lead to better mapping performance. --filter 9 was used for the 90-assembly HPRC graph.
- --haplo : We are actually phasing out the Allele-Frequency filtering described above in favour of dynamic creation of personal pangenomes. Using this option will create the necessary indexes for this functionality.
- --giraffe filter : Make giraffe indexes for the Allele-Frequency filtered graph.
- --viz : Make an ODGI 1D visualization image for each chromosome.
- --odgi : Make an ODGI-formatted whole-genome graph.
- --chrom-vg clip filter : Make VG-formatted chromosome graphs for both the AF-filtered and (default) clipped pangenome.
- --chrom-og : Make ODGI-formatted chromosome graphs for the full (unclipped) graph. Useful for visualization.
- --gbz clip filter full : Make GBZ-formatted whole-genome graphs for the AF-filtered, (default) clipped and full (containing unaligned centromeres) graphs.
- --gfa clip full : Make GFA-formatted whole-genome graphs for the (default) clipped and full graphs.
- --vcf : Make a VCF (based on the first reference) version of the graph.
- --vcfReference GRCh38 CHM13 : Specify that we want two VCFs, one for each reference.
- --logFile ./hprc10.log : All logging information will end up here in addition to stderr. Important to save!
- --consCores 8 : Specify 8 threads for each core Cactus job (cactus_consolidated). By default it will use all cores available on your system. By reducing to 8, we attempt to run up to 4 chromosomes at once to save time (assuming 32 cores total). Note that this will increase peak memory usage.
- --mgMemory 128Gi : Override Cactus's estimated memory limit for minigraph construction to make sure it does not exceed what's available. By default, Cactus is very conservative (estimates too much memory) in order to prevent jobs from being evicted from Slurm clusters. But we know here 128Gi is fine.
- --workDir cactus-scratch : Location for Cactus's temporary files.
All of the above is explained in more detail in the Minigraph-Cactus Manual. We are erring on the side of producing lots of different indexes, but it's usually easier than going back and regenerating any forgotten ones.
Here are some details about the resources used. I'm on a big shared server using docker run --cpus 32 --memory 250000000000 to emulate a smaller computer. This data is taken from hprc10.log, which lists the wall time and memory usage of each command in the pipeline. The log for the full 90-way HPRC graph can be found here.
- Minigraph Construction : 106Gi, ~3 hours
- Minigraph Mapping : ~200Gi (max per-job 40Gi), ~2 hours
- Cactus Alignment : ~65Gi (max per-job 16Gi), ~1 hour
- Normalization and Indexing : ~64Gi, ~3 hours
- Overall : 11 hours (in an environment with 32 cores / 256 Gb RAM)
Here are the output files:
4.0K chrom-alignments
4.0K chrom-subproblems
278M hprc10.CHM13.raw.vcf.gz
1.6M hprc10.CHM13.raw.vcf.gz.tbi
224M hprc10.CHM13.vcf.gz
1.6M hprc10.CHM13.vcf.gz.tbi
4.0K hprc10.chroms
773M hprc10.d2.dist
1.8G hprc10.d2.gbz
35G hprc10.d2.min
65M hprc10.d2.snarls
1.1G hprc10.dist
2.2G hprc10.full.gbz
1.5G hprc10.full.gfa.gz
9.7G hprc10.full.hal
9.1G hprc10.full.og
57M hprc10.full.snarls
104M hprc10.gaf.gz
1.8G hprc10.gbz
769M hprc10.gfa.fa.gz
1.4G hprc10.gfa.gz
1.1G hprc10.hapl
35G hprc10.min
450M hprc10.paf
7.3M hprc10.paf.filter.log
117M hprc10.paf.unfiltered.gz
279M hprc10.raw.vcf.gz
1.6M hprc10.raw.vcf.gz.tbi
907M hprc10.ri
1.7K hprc10.seqfile
52M hprc10.snarls
406K hprc10.stats.tgz
729M hprc10.sv.gfa.gz
226M hprc10.vcf.gz
1.6M hprc10.vcf.gz.tbi
4.0K hprc10.viz
There are four versions of the graph produced (please see here for more details), denoted by these prefixes:
- hprc10.sv : This is the output of minigraph and contains only structural variants. The input haplotypes are not embedded as paths.
- hprc10.full : This is a base-level graph containing all sequence that could be assigned to a reference chromosome. Centromeres are included but are unaligned.
- hprc10. (no suffix) : This is a subgraph of hprc10.full but with centromeres removed. This is usually the most relevant graph for analysis.
- hprc10.d2 : This is a subgraph of the default hprc10. graph but with nodes and edges supported by fewer than 2 haplotypes removed. This graph yields better results for read mapping with the original giraffe pipeline. We used allele frequency filtering in the HPRC paper (.d9 via --filter 9 for a 10% cutoff) but have recently changed giraffe so that it is no longer necessary (more details later in the mapping section).
The graphs themselves are present in .gfa.gz (standard, text-based), .gbz (highly compressed, vg) and .og (odgi) formats. vg giraffe mapping requires the .gbz and .hapl index (or .gbz, .dist and .min for the original pipeline).
There are four VCF files, two each for GRCh38 and CHM13:
- hprc10.raw.vcf.gz and hprc10.CHM13.raw.vcf.gz : Output of vg deconstruct for the GRCh38- and CHM13-based graphs, respectively. These VCFs contain nested variants, and need special attention when using.
- hprc10.vcf.gz and hprc10.CHM13.vcf.gz : "Flattened" versions of the above VCFs (using vcfbub) that do not contain nested variants and will be more useful for standard tools.
For the HPRC, we took some extra normalization steps using vcfwave to realign the variants. This gives a slightly cleaner VCF in some regions. See here for details.
There are four directories:
- chrom-subproblems : The per-chromosome inputs to cactus. This directory has some useful statistics about the chromosome decomposition, such as contig_sizes.tsv, which shows the amount of sequence from each sample for each reference chromosome, as well as minigraph.split.log, which lists which chromosome each contig gets assigned to and why, along with all contigs excluded from further analysis (counted as _AMBIGUOUS_) because they didn't align anywhere.
- chrom-alignments : The raw, per-chromosome output of cactus, including .vg and .hal files. Only useful for debugging and/or re-running the last part of the pipeline (indexing and normalization).
- hprc10.chroms : Chromosome graphs in .vg and .og format. Useful for debugging and visualization. If you are using GRCh38 as a reference, the unplaced contigs will all get lumped into the chrOther graph.
- hprc10.viz : ODGI 1D visualizations for each chromosome.
Cactus is a Python script that uses Toil to execute different programs in parallel. Toil supports distributed computing environments such as Slurm, and Slurm is now the best way to run Cactus at scale. The key challenge is to make sure that each Cactus job gets a good memory estimate. If the memory estimate is too high, then the job will use too many cluster resources or, even worse, fail completely because it asks for more resources than are available on the cluster. If the memory estimate is too low, then Slurm may evict the job for using too much memory. Both of these errors are very difficult to recover from, unfortunately. Cactus does its best to estimate the memory from the input data (and should do fine in the current tutorial), but there are three options to override the memory estimates of the bigger jobs:
- --mgMemory : Memory for minigraph construction.
- --consMemory : Memory for cactus alignment.
- --indexMemory : Memory for full-genome vg indexing.
See here for an example of how Cactus was run on Slurm to generate the v1.1 HPRC graphs.
To run the previous cactus-pangenome command on Slurm instead of locally, you must do the following.
First, install the Cactus virtual environment:
wget -q https://github.com/ComparativeGenomicsToolkit/cactus/releases/download/v2.6.13/cactus-bin-v2.6.13.tar.gz
Then follow the instructions described here.
Next, switch to the directory where your input data is and where you want to run Cactus, then run the following.
IMPORTANT Toil/Cactus do not (yet) understand cluster time limits. This will change soon (our cluster will be adopting time limits this month), but in the meantime, you need to make sure that the default time limit for all jobs is longer than the slowest job (which is almost always minigraph construction). One way to do this is with the TOIL_SLURM_ARGS environment variable. In general, this variable lets you add any options you want to every job submitted to the cluster by Cactus (see sbatch --help for a listing of possible options). If you do not specify this, jobs will be submitted with some default limit (3 hours, I think) and get evicted if they run longer. Thanks Mamana Mbiyavanga for helping to figure this out!!!
Update: Previous versions of this document used --giraffe clip filter below. Since Cactus v2.8.5, this is no longer necessary: you can use just --giraffe filter (or leave the --giraffe option out altogether to go all in on haplotype subsampling). Not using --giraffe clip drastically reduces peak memory usage.
export TOIL_SLURM_ARGS="-t 1440"
rm -rf slurm-logs ; mkdir -p slurm-logs
cactus-pangenome ./js ./hprc10.seqfile --outDir ./hprc10 --outName hprc10 --reference GRCh38 CHM13 \
--filter 2 --haplo --giraffe filter --viz --odgi --chrom-vg clip filter --chrom-og --gbz clip filter full \
--gfa clip full --vcf --vcfReference GRCh38 CHM13 --logFile ./hprc10.log \
--consCores 8 --mgMemory 128Gi --batchSystem slurm --batchLogsDir ./slurm-logs --binariesMode singularity
These are the differences from the previous example:
- --workDir not set: but you can put a location that's available on all worker nodes if you want.
- --batchSystem slurm : activates Slurm support.
- --batchLogsDir ./slurm-logs : improves logging of Slurm-specific issues by keeping a local copy of all Slurm logs.
- --binariesMode singularity : run all cactus binaries from inside the singularity container.
If you see jobs disappearing or mysteriously stopping and restarting, try running cat * in the --batchLogsDir directory to see if there are any errors. For example, if you are losing jobs due to exceeding the time limit (see the discussion of TOIL_SLURM_ARGS above), then you will see some logs with CANCELLED DUE TO TIME LIMIT in them.
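For example, a quick way to list any logs reporting a time-limit eviction (a minimal sketch; adjust the pattern to whatever error you suspect):

grep -il 'CANCELLED DUE TO TIME LIMIT' ./slurm-logs/*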
NOTE: You may not exactly reproduce the numbers below, even running on the same data with the same version, as Cactus is not deterministic due to how it is parallelized. Your numbers should still be extremely close, though.
The very first thing to check is the size of your graph. You can do this with vg stats -lz :
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
vg stats -lz ./hprc10/hprc10.gbz
It will show the number of nodes and edges in the graph along with the total sequence length over all nodes:
nodes 28195250
edges 38322112
length 3145521882
You can compare that to the length of GRCh38 in the graph
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
vg paths -x ./hprc10/hprc10.gbz -S GRCh38 -E | awk '{sum += $2} END {print sum}'
which is 3099922541. So ~45 Mbp of additional sequence (excluding most heterochromatin) has been added to the pangenome from CHM13 and the four samples. Something on the order of a few megabases per sample is reasonable. If your results are much different, that is a definite warning sign that something went very wrong.
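As a quick sanity check, the difference between the two totals above works out to ~45.6 Mbp:

echo $((3145521882 - 3099922541))
# 45599341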
Looking at the .full graph shows how much additional sequence is added by the centromeres.
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
vg stats -lz ./hprc10/hprc10.full.gbz
An extra gigabase in this case. We cannot effectively index or map to such graphs (centromere alignment is something we are actively working on, though!)
nodes 29372041
edges 39555335
length 4213877926
You can use vg paths to inspect the amount of sequence (total length of all embedded paths) of any given sample or haplotype. For example
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
vg paths -x ./hprc10/hprc10.gbz -S HG00438 -E | awk '{sum += $2} END {print sum}'
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
vg paths -x ./hprc10/hprc10.gbz -Q HG00438#1 -E | awk '{sum += $2} END {print sum}'
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
vg paths -x ./hprc10/hprc10.gbz -Q HG00438#2 -E | awk '{sum += $2} END {print sum}'
These show that there is 5679580423 bp for HG00438, with 2841204110 and 2838376313 in its first (paternal) and second (maternal) haplotype, respectively.
The aforementioned hprc10/chrom-subproblems/contig_sizes.tsv gives a breakdown of the length of each haplotype in each chromosome. It can be useful to load into a spreadsheet and/or graph in order to check that all input haplotypes are properly represented in the graph.
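If you just want a quick look on the command line rather than a spreadsheet, something like the following works (a minimal sketch; check the file's header for the exact column layout):

column -t ./hprc10/chrom-subproblems/contig_sizes.tsv | less -S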
minigraph-cactus graphs are linearized along the reference genome (GRCh38 in this case). There is exactly one graph component for each contig in GRCh38, and each component has exactly two tips or stubs (nodes with zero edges at one of their ends). You can count the tips in the graph with
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
vg stats -HT ./hprc10/hprc10.gbz | sed -e 's/heads//g' -e 's/tails//g' | wc -w
Giving a result of 390. This is two times the number of contigs in GRCh38, 195, which can be inspected with
wget -q https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz
zcat GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz | grep '>' | wc -l
To verify the number of graph components, you can use vg chunk to break up the graph by chromosome (there isn't really a practical use for this, as cactus-pangenome will have already output chromosome graphs; see above).
mkdir chrom-components
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
vg chunk -x ./hprc10/hprc10.gbz -C -b ./chrom-components/chunk -t 32
This will make 195 .vg files (one for each GRCh38 contig) in ./chrom-components
ls chrom-components/*.vg | wc -l
195
Working with whole-genome, or even chromosome, graphs can be unwieldy for many tools such as those used for visualization. You can use vg or odgi to extract smaller regions from a larger graph. This tutorial will focus on vg, but remember that the odgi commands you learned in the PGGB tutorial will apply to the .og output of cactus-pangenome as well.
Note: vg can read .gfa, .vg, .gbz and .xg files, though running times and memory usage can vary substantially.
The simplest way to extract a subgraph is by performing queries on GRCh38 coordinates using vg chunk on the .gbz file. For example, to extract the lrc_kir region for visualization with Bandage-NG (which expects .gfa), use
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
vg chunk -x ./hprc10.gbz -S ./hprc10.snarls -p GRCh38#0#chr19:54015634-55094318 -O gfa > ./lrc_kir.gfa
The command looks for the region chr19:54015634-55094318 in GRCh38, then pulls out the smallest site (aka bubble, aka snarl) in the graph that contains it (this is what -S ./hprc10.snarls is used for).
Instead of pulling out the site, you can use -c to specify the number of steps away from the selected region to extract. For example, -c 0 will only extract the given path with no added variation.
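For example, the same region with 10 steps of surrounding context would look something like this (a sketch mirroring the command above; the output filename is arbitrary):

singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
vg chunk -x ./hprc10.gbz -p GRCh38#0#chr19:54015634-55094318 -c 10 -O gfa > ./lrc_kir.c10.gfa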
Subgraph extraction from .gbz is rather slow, and does not return non-reference paths in the subgraphs. So in the example above, lrc_kir.gfa will only have paths for CHM13 and GRCh38. If you want to add in the other haplotypes, you can try adding -T to vg chunk.
If you will be making many queries and/or you want to query on non-reference genomes, the easiest thing may be to create an .xg index. This is how to make an .xg index of the full graph. This one will be more appropriate for querying samples that are not GRCh38, which will therefore potentially be fragmented in the default graph (unfortunately vg chunk does not yet transparently handle querying on path fragments). You can also use odgi extract with the hprc10.full.og graph that was already made.
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
bash -c "vg convert ./hprc10.full.gbz -x > ./hprc10.full.xg"
(we need to use bash -c "<command>" in order for the > redirect to work)
Use vg convert to convert between formats and vg stats -F to determine which format a graph is in.
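For example, to check the format of the .xg file created above (any of the graph files here could be substituted):

singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
vg stats -F ./hprc10.full.xg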
The most important formats for vg are:
- .gbz : Highly compressed haplotype paths (will scale to 1000s of samples). Read-only. Required for vg giraffe.
- .vg : Less compressed but can be modified (ie with vg mod).
- .xg : Indexed for fast lookup on any path. Read-only.
- .gfa : Standard text-based interchange. By default, vg stores paths as W-lines, but vg convert can be used to change to P-lines. vg cannot read .gfa.gz.
- .og : ODGI format, which combines the best properties of .xg and .vg but will not scale as well as .gbz.
In general, you will probably mostly use .gbz, .gfa and .og files for your graphs.
Panacus is a tool for making beautiful figures describing the coverage in a pangenome graph. It is not (yet) included in the Cactus Docker image, but you can locally install it as follows (see other options in its manual):
With Conda
mamba install -c conda-forge -c bioconda panacus
or manually with a Python virtualenv:
virtualenv -p python3 venv-panacus
. venv-panacus/bin/activate
pip install -U matplotlib numpy pandas scikit-learn scipy seaborn
wget --no-check-certificate -c https://github.com/marschall-lab/panacus/releases/download/0.2.3/panacus-0.2.3_linux_x86_64.tar.gz
tar -xzvf panacus-0.2.3_linux_x86_64.tar.gz
# suggestion: add tool to path in your ~/.bashrc
export PATH="$(readlink -f panacus-0.2.3_linux_x86_64/bin)":$PATH
And go through all the examples on the Panacus webpage using this graph as input. Note that even examples that use PGGB graphs can still be run on your graph.
IMPORTANT To exclude reference paths, use grep -ive 'grch38\|chm13' instead of grep -ve 'grch38\|chm13' and grep ^W instead of grep ^P, as well as add | sort | uniq when making path lists.
So to run the first example from the Panacus website, you would do the following:
gzip -d hprc10.gfa.gz
grep '^W' hprc10.gfa | cut -f2 | grep -ive 'grch38\|chm13' | sort | uniq > hprc10.paths.haplotypes.txt
RUST_LOG=info panacus histgrowth -t8 -l 1,2,1,1,1 -q 0,0,1,0.5,0.1 -S -a -s hprc10.paths.haplotypes.txt hprc10.gfa > hprc10.histgrowth.node.tsv
panacus-visualize -e hprc10.histgrowth.node.tsv > hprc10.histgrowth.node.pdf
Using panacus, especially with --count bp to chart sequence length instead of nodes, is a very good way to visualize the diversity of the samples in the graph.
The best way to map short-read (genomic) data to the pangenome is vg giraffe. vg giraffe supports single-end and paired-end inputs, and will output graph mappings in GAM or GAF format. See the surjecting section below for details on outputting BAM.
You can find some 30X paired-end reads from Genome in a Bottle's HG002 here. Each file is ~33Gb:
wget https://storage.googleapis.com/brain-genomics-public/research/sequencing/fastq/hiseqx/wgs_pcr_free/30x/HG002.hiseqx.pcr-free.30x.R1.fastq.gz
wget https://storage.googleapis.com/brain-genomics-public/research/sequencing/fastq/hiseqx/wgs_pcr_free/30x/HG002.hiseqx.pcr-free.30x.R2.fastq.gz
You can map the above reads with giraffe using this command (it assumes the reads are in the same location as the graph, but you can modify it accordingly, even adding another -v argument if necessary):
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
bash -c "vg giraffe -Z ./hprc10/hprc10.d2.gbz -f ./hprc10/HG002.hiseqx.pcr-free.30x.R1.fastq.gz -f ./hprc10/HG002.hiseqx.pcr-free.30x.R2.fastq.gz -o gaf | bgzip > ./hprc10/hprc10.hg002.gaf.gz"
This takes about 2.5 hours and 47 Gb of RAM.
Instead of mapping to the filtered graph, you can use the reads to extract a personal pangenome and map to that. This way you keep rare variants that are present in the sample, and exclude common ones that aren't. In most benchmarks so far, this helps accuracy of downstream applications. It does require one extra step, though, which is extracting the kmers from the input reads with kmc.
First you need to make a file containing the paths of your reads. Assuming it's called hg002.reads.txt in the current directory:
printf "./HG002.hiseqx.pcr-free.30x.R1.fastq.gz\n./HG002.hiseqx.pcr-free.30x.R2.fastq.gz\n" > hg002.reads.txt
Then you use kmc to make the kmers index (hg002.kff)
singularity exec -H $(pwd) docker://gregorysprenger/kmc:v3.2.2 \
kmc -k29 -m128 -okff -t32 @./hg002.reads.txt hg002 .
which takes about 15 minutes and 128Gb of memory. The . as the last argument tells kmc to use the current working directory as its working directory.
And you use this index to map to the unfiltered graph with vg giraffe.
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
bash -c "vg giraffe -Z ./hprc10/hprc10.gbz -f ./hprc10/HG002.hiseqx.pcr-free.30x.R1.fastq.gz -f ./hprc10/HG002.hiseqx.pcr-free.30x.R2.fastq.gz -o gaf --sample HG002 --progress --kff-name ./hg002.kff --haplotype-name ./hprc10/hprc10.hapl | bgzip > ./hprc10/hprc10.hg002.new.gaf.gz"
This takes about 2.75 hours and 64Gb of RAM, and also produces ./hprc10.HG002.gbz, which is the personal pangenome graph itself.
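You can run the same vg stats check from Part 2 on the personal pangenome to see how much smaller it is than the 10-haplotype graph (the path below assumes the .gbz landed where the command above wrote it; adjust if yours is elsewhere):

singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
vg stats -lz ./hprc10.HG002.gbz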
Note: this section is adapted from the methods for the MC paper.
vg giraffe will soon be able to map long reads, but this is not ready yet. For now, you should use GraphAligner.
In order for the resulting mapping to be compatible with the .gbz, we first convert the .gbz into a .gfa. You can run GraphAligner on the .gfa that comes out of cactus-pangenome, but the coordinates will be different from the .gbz (note the --vg-algorithm flag is important for ensuring this).
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
bash -c "vg convert ./hprc10/hprc10.gbz -f --vg-algorithm > ./hprc10/hprc10.gbz.gfa"
This takes 43 seconds and 10Gb RAM.
Now download some public GIAB HiFi reads (or use your own):
# log in as anonymous, put anything for password
ftp ftp-trace.ncbi.nlm.nih.gov
cd /giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb_20kb_chemistry2/reads/
get m64011_190830_220126.fastq.gz
Protip: or use aria2c to get the file way faster
Now run GraphAligner in pangenome mode. I'm using a relatively old version of GraphAligner here since the latest one, 1.0.17b--h21ec9f0_2, doesn't seem to work properly (tons of assert errors, about 100X slower than expected)
singularity exec -H $(pwd) docker://quay.io/biocontainers/graphaligner:1.0.13--he1c1bb9_0 \
GraphAligner -g ./hprc10/hprc10.gbz.gfa -f m64011_190830_220126.fastq.gz -a HG002.hifi.gam -x vg -t 32
This takes about 2.5 hours and 220Gb RAM.
You can now use the .gam and .gbz together with vg pack/call as described in the SV Genotyping with vg section (though it should be passed into vg pack with -g, not -a as you would for .gaf.gz). You do not need to further use hprc10.gbz.gfa.
IMPORTANT : This version of GraphAligner doesn't output mapping qualities. So if you use vg pack on it, make sure not to use a quality filter, ie use -Q0.
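Putting those two notes together, the pack step for this GAM would look something like the following (a sketch; the output filename is arbitrary, and -g is used because the input is GAM rather than GAF):

singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
vg pack -x ./hprc10/hprc10.gbz -Q0 -g ./HG002.hifi.gam -o ./hprc10/hprc10.hg002.hifi.pack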
If you had mapped to hprc10.gfa.gz instead of converting the .gbz, you can still use vg pack/call, but you would need to run them on hprc10.gfa.gz. Also note, vg call must always be run on the exact same graph as vg pack: even if the nodes are identical, if you change formats between these two tools, then vg call will crash.
You can project your read mappings from the graph to a linear reference with vg surject. This will let you output your mappings in BAM format, which can be used with non-pangenome tools like DeepVariant, samtools, GATK etc.
You can project your mappings to any reference path in the graph (as selected with --reference in cactus-pangenome), so GRCh38 or CHM13 in this example. You can in theory project reads onto any sample in the graph (even non-reference samples), but it is a little trickier and not covered here (it requires updating the .gbz with vg gbwt).
You must first create a list of reference paths. Note we're using the .full graph here just to make our path lists with the vg paths command. This is important when getting the CHM13 path list, as the paths will be fragmented otherwise. Important: you don't want to map or call with the .full graph -- it's just for getting this path list.
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
vg paths -x ./hprc10/hprc10.full.gbz -S GRCh38 -L > grch38.paths.txt
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
vg paths -x ./hprc10/hprc10.full.gbz -S CHM13 -L > chm13.paths.txt
To project your reads to GRCh38, do the following (use -F chm13.paths.txt to instead project to CHM13). If you don't supply a path list with -F, it will project to a mix of GRCh38 and CHM13, which is almost certainly not what you want.
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
bash -c "vg surject -x ./hprc10/hprc10.gbz -G ./hprc10/hprc10.hg002.new.gaf.gz --interleaved -F grch38.paths.txt -b -N HG002 -R 'ID:1 LB:lib1 SM:HG002 PL:illumina PU:unit1' > ./hprc10/hprc10.hg002.new.bam"
This takes about 4 hours and 64Gb RAM.
It's important to use --interleaved to tell surject that the reads are paired. The read group -R consists of boilerplate tags to help DeepVariant or other tools that expect this info in the BAM header. You can also modify the header yourself with samtools if needed.
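For example, if you forgot -R at mapping time, a read group can be added to the BAM afterwards with samtools addreplacerg (a minimal sketch; the tag values and output filename are placeholders):

singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
samtools addreplacerg -r ID:1 -r LB:lib1 -r SM:HG002 -r PL:illumina -r PU:unit1 -o ./hprc10/hprc10.hg002.new.rg.bam ./hprc10/hprc10.hg002.new.bam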
You should be able to use hprc10.gbz for surjection whether you initially aligned to hprc10.d2.gbz, hprc10.gbz, or the personalized pangenome.
If you are only ever going to use the BAM, you don't need to create the GAF with giraffe and then surject afterwards -- you can do both at once (use --ref-paths chm13.paths.txt to project to CHM13 instead):
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
bash -c "vg giraffe -Z ./hprc10/hprc10.gbz -f ./hprc10/HG002.hiseqx.pcr-free.30x.R1.fastq.gz -f ./hprc10/HG002.hiseqx.pcr-free.30x.R2.fastq.gz -o bam --sample HG002 --progress --kff-name ./hg002.kff --haplotype-name ./hprc10/hprc10.hapl -R 'ID:1 LB:lib1 SM:HG002 PL:illumina PU:unit1' --ref-paths ./grch38.paths.txt > ./hprc10/hprc10.hg002.new.bam"
This takes about 4 hours and 64 Gb RAM.
See the note below about fixing the BAM header if you have surjected onto a different reference than the one the graph is based on (ie you've surjected onto CHM13 with a GRCh38-based graph).
DeepVariant is a state-of-the-art variant caller. It does not use pangenome formats, and rather works on FASTA and BAM files, but it has been trained to support data from vg giraffe / surject.
First, make a FASTA file from your graph (it is generally best to make the FASTA from the graph, to make sure it matches up exactly. If you are using a different reference, ie CHM13, use the .full graph for this step):
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
bash -c "vg paths -x ./hprc10/hprc10.gbz -S GRCh38 -F > ./GRCh38.fa"
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
samtools faidx ./GRCh38.fa
Important: if you are surjecting onto CHM13 but using the GRCh38-based graph, you need to fix your BAM header as follows:
singularity exec -H $(pwd) docker://biocontainers/picard-tools:v2.18.25dfsg-2-deb_cv1 \
PicardCommandLine CreateSequenceDictionary R=./GRCh38.fa O=./GRCh38.dict
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
samtools reheader ./GRCh38.dict ./hprc10/hprc10.hg002.new.bam > ./hprc10/hprc10.hg002.new.fix.bam
mv ./hprc10/hprc10.hg002.new.fix.bam ./hprc10/hprc10.hg002.new.bam
Next, sort and index the BAM (indexing is a required step for almost any variant caller that takes BAM input)
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
samtools sort ./hprc10/hprc10.hg002.new.bam -O BAM -o ./hprc10/hprc10.hg002.new.sort.bam --threads 8
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
samtools index ./hprc10/hprc10.hg002.new.sort.bam -@ 8
These commands took about 30 minutes and very little memory (could be much faster on a local disk).
Finally, run DeepVariant to make a VCF from the BAM.
singularity exec -H $(pwd) docker://google/deepvariant:1.6.0 \
/opt/deepvariant/bin/run_deepvariant \
--model_type=WGS \
--ref=./GRCh38.fa \
--reads=./hprc10/hprc10.hg002.new.sort.bam\
--output_vcf=./hprc10/hprc10.hg002.new.dv.vcf.gz \
--output_gvcf=./hprc10/hprc10.hg002.new.dv.g.vcf.gz \
--make_examples_extra_args="min_mapping_quality=1,keep_legacy_allele_counter_behavior=true,normalize_reads=true" \
--num_shards=32
This took about 13 hours.
We make an important distinction between genotyping and calling:
- genotyping: Determine which variants in the graph are present in (each haplotype) of the sample.
- calling: Determine which variants in the reads are present in (each haplotype) of the sample. These variants may or may not be in the graph.
One strength of pangenome graphs is that they allow Structural Variants (SVs), which are normally difficult to determine from short reads, to be efficiently genotyped. One way to do this is with vg call. This is a two-step process, beginning with a graph alignment (GAF or GAM) from vg giraffe.
First, create a .pack coverage index (note, if you are using GraphAligner output, use -Q0 below instead):
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
vg pack -x ./hprc10/hprc10.gbz -Q5 -a ./hprc10/hprc10.hg002.new.gaf.gz -o ./hprc10/hprc10.hg002.pack
This takes about 1 hour and 60 Gb RAM.
Then, create the VCF with vg call:
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 \
bash -c "vg call ./hprc10/hprc10.gbz -r ./hprc10/hprc10.snarls -k ./hprc10/hprc10.hg002.pack -s HG002 -S GRCh38 -az | bgzip > ./hprc10/hprc10.call.vcf.gz"
This takes 30 minutes and 40 Gb RAM.
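If you only care about larger events, the genotyped VCF can be restricted to variants of at least 50 bp with bcftools (a sketch using the same pggb container as in the PanGenie section below; the 50 bp cutoff is just a common SV convention):

singularity exec -H $(pwd) docker://ghcr.io/pangenome/pggb:latest \
bash -c "bcftools view -i 'abs(ILEN)>=50' ./hprc10/hprc10.call.vcf.gz | bgzip > ./hprc10/hprc10.call.sv50.vcf.gz"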
Credit for this section: Matteo Ungaro
Along with vg call, another tool for Structural Variant detection that works on pangenome graphs is PanGenie; however, in contrast with the former, PanGenie does not leverage any information from read alignments. It uses an HMM to determine the best possible haplotype combination based on the paths in the graph and the short-read dataset for the sample in question. For this reason, PanGenie performs a genome inference for this sample, which is represented as a mosaic of haplotypes, and whose variants are limited to those present in the graph space.
PanGenie can be installed using a commonly used package manager such as Conda, or via Singularity. The Conda installation is the one used in this tutorial. First, download the tool repository with git clone
git clone https://github.com/eblerjana/pangenie.git
Afterwards, enter the ./pangenie directory, where it is necessary to create a custom Conda environment with all the dependencies needed to run the tool
cd pangenie
conda env create -f environment.yml
Finally, the environment has to be activated and the tool built
conda activate pangenie
mkdir build; cd build; cmake .. ; make
Importantly, the PanGenie pipeline introduced below makes use of Snakemake, which needs to be present as a separate Conda environment and activated when using the tool.
The first step in using PanGenie is to make sure the graph VCF has the following properties (more on this at the GitHub page for the tool):
- phased samples
- sequence-resolved
- non-overlapping variants
- diploid (important, since in our example CHM13 will be haploid in the VCF)
We need to make sure the VCF is simplified as much as possible and diploid. To that end, we start with the VCF output of MC, cut the other reference out (CHM13 in our example), then normalize it with vcfwave.
singularity exec -H $(pwd) docker://ghcr.io/pangenome/pggb:latest \
bash -c "bcftools view -s ^CHM13 ./hprc10/hprc10.vcf.gz \
| vcfwave -I 10000 -t 32 -n | bgzip > hprc10/hprc10.wave.vcf.gz \
&& tabix -fp vcf hprc10/hprc10.wave.vcf.gz"
PanGenie can be run from the command line, but it is recommended to use the available pipeline, because otherwise the output VCF for the sample will be unviable for conversion to biallelic sites, a post-processing step which improves performance.
The pipeline runs from a config.yaml file found, once the tool is installed, at this path: /path/to/pangenie/pipelines/run-from-callset/config.yaml
This is what it looks like:
# input vcf with variants to be genotyped (uncompressed)
vcf: /path/to/<decomposed_and_filtered>.vcf
# reference genome
reference: /path/to/<reference>.fna
# path to reads (FASTQ format, uncompressed)
# for each sample that shall be genotyped
reads:
sample1: /path/to/<sample_name>.fastq
#sample2: reads-sample2.fastq
# path to PanGenie executable
pangenie_genotype: /path/to/pangenie/build/src/PanGenie
pangenie_index: /path/to/pangenie/build/src/PanGenie-index
# name of the output directory (keep in mind PanGenie wants a specific directory where to save all the outputs;
# for instance, <sample_name>-genome_inference)
outdir: /path/to/outdir
Edit this file to put in the full paths of your reads and .wave VCF (IMPORTANT: PanGenie does not accept gzipped input: you must uncompress everything first).
Once done, PanGenie can be run interactively from the folder where the config.yaml lives or, alternatively, from a bash script (an example Slurm script is shown after the sed example below), simply by typing
snakemake --cores n_of_threads
Beware, PanGenie requires that the chromosome/contig names match between the reference genome and the graph VCF. Normally, it is good practice to keep the reference names unchanged and set the chromosome names in the VCF to match the ones in the reference; this can easily be done with sed. For a Minigraph-Cactus pangenome this is how it is done:
# example based on a GRCh38 graph
sed -i 's/GRCh38#0#//' <decomposed_and_filtered>.vcf
#!/bin/bash
#
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=<n_of_threads>
#SBATCH --time=<time_in_hh:mm:ss>
#SBATCH --mem=<allocated_ram>
#
#SBATCH --job-name=pangenie
#SBATCH --output=genome_inference.out
#
#SBATCH --partition=<partition_name>
source /path/to/anaconda3/etc/profile.d/conda.sh
conda activate snakemake
cd /path/to/pangenie/pipelines/run-from-callset/
snakemake --cores n_of_threads
PanGenie is very fast, but the process is driven by the graph complexity and the quality/size of the short-read dataset for the sample analyzed. A typical run can take ~1.5 hours and up to 80Gb of RAM. However, the filtering and decomposition step can be a fairly long process, mainly because vcfwave multithreads only per chromosome and not over the whole VCF file at once.
The results are found in a subfolder of the outdir called genotypes. In there, a single VCF file named sample1-genotypes.vcf will be present, which needs to be converted to biallelic sites. To do so it is best to use the script available at the GitHub page for the tool (convert-to-biallelic.py), which can be run as follows: