Construction Benchmarks

Hardware and software

One Amazon EC2 i3.8xlarge instance:

32 cores of Xeon E5-2686 v4
244 gigabytes of memory
/: EBS General Purpose SSD (gp2), 160 GB
/large_disk: 4x 1.9 TB NVMe SSD in RAID 0
Ubuntu 16.04 with Linux kernel 4.4.0
GCC 5.4.0
A recent version of VG

Directory structure

~/vg: VG
~/data: Downloaded data
~/graphs: VG graphs
~/indexes: GBWT/GCSA2 indexes
~/logs: Construction logs

Data

The benchmarks are based on 1000 Genomes Project phase 3 data. In order to download the data, run the following in ~/data:

#!/bin/bash

# Get the reference
REFERENCE=hs37d5.fa
rm -f ${REFERENCE} ${REFERENCE}.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/${REFERENCE}.gz
gunzip ${REFERENCE}

# Get the variants
VARIANTS=ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz
SHORTV=all.vcf.gz
rm -f ${SHORTV} ${SHORTV}.tbi
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${VARIANTS}
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${VARIANTS}.tbi
mv ${VARIANTS} ${SHORTV}
mv ${VARIANTS}.tbi ${SHORTV}.tbi

# Get the phasings for chromosomes 1-22
PREFIX=ALL.chr
SPREFIX=chr
SUFFIX=.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
SSUFFIX=.vcf.gz
for i in $(seq 1 22; echo X; echo Y)
do
  NAME=${PREFIX}${i}${SUFFIX}
  SHORT=${SPREFIX}${i}${SSUFFIX}
  rm -f ${SHORT} ${SHORT}.tbi
  wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${NAME}
  wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${NAME}.tbi
  mv ${NAME} ${SHORT}
  mv ${NAME}.tbi ${SHORT}.tbi
done

# Get the phasings for chromosomes X and Y
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz.tbi
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz chrX.vcf.gz
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz.tbi chrX.vcf.gz.tbi
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz.tbi
mv ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz chrY.vcf.gz
mv ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz.tbi chrY.vcf.gz.tbi

Graph construction

We build VG graphs with variants embedded as alt paths. Node size is limited to 32 bases for better GCSA2 query performance. In order to build the graphs, run the following in ~/graphs:

#!/bin/bash

VG=~/vg/bin/vg
DATA=~/data
REFERENCE=${DATA}/hs37d5.fa
MAPPING=node_mapping

# Build the VG graphs
(seq 1 22; echo X; echo Y) | parallel -j 24 \
    "$VG construct -r $REFERENCE -v ${DATA}/chr{}.vcf.gz -R {} -C -a > chr{}.vg"

# Harmonize the node ids
$VG ids -j -m $MAPPING  $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}.vg; done)
cp $MAPPING ${MAPPING}.backup

This will take several hours. The vg ids command produces an empty node mapping that is useful for some GCSA2 construction options.

GBWT construction

We run 12 construction jobs in parallel, starting from the largest chromosomes. The last jobs will finish soon after chromosome 2. Memory usage is roughly 1 GB for each 10 Mbp. Run the following in ~/indexes to build per-chromosome GBWT indexes.

#!/bin/bash

(echo X; seq 1 22; echo Y) | parallel -j 12 "./build_gbwt.sh {}"

Script build_gbwt.sh contains the following:

#!/bin/bash
# $1 - chromosome

VG=~/vg/bin/vg
DATA=~/data
GRAPHS=~/graphs
LOGS=~/logs
BASENAME=chr${1}

LOGFILE=${LOGS}/gbwt_${BASENAME}.log
rm -f $LOGFILE

START=$(date +%s)
echo "Start time: $START" >> $LOGFILE
echo >> $LOGFILE

$VG index -G ${BASENAME}.gbwt -v ${DATA}/${BASENAME}.vcf.gz \
    -F ${BASENAME}.threads -p ${GRAPHS}/${BASENAME}.vg 2>> $LOGFILE

STOP=$(date +%s)
echo >> $LOGFILE
echo "Finish time: $STOP" >> $LOGFILE
echo "Total time: $(($STOP-$START)) seconds" >> $LOGFILE

We can also merge the individual GBWT indexes into a single file. This takes around 10 minutes and less than 30 gigabytes of memory using a single thread.

#!/bin/bash

VG=~/vg/bin/vg
LOGS=~/logs
OUTPUT=all.gbwt

LOGFILE=${LOGS}/gbwt_merge.log
rm -f $LOGFILE

$VG gbwt -f -o $OUTPUT -p \
    $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}.gbwt; done) 2>> $LOGFILE

Chromosome	Time (h)	GBWT (MB)	`locate()` (MB)	Total (MB)
1	24.70	568	504	1072
2	27.12	603	554	1157
3	22.62	492	456	947
4	22.13	484	450	934
5	20.31	442	415	857
6	19.45	428	396	825
7	18.27	407	374	781
8	17.75	388	364	751
9	13.92	326	272	598
10	15.38	370	320	689
11	15.60	367	312	679
12	14.87	357	300	657
13	11.13	259	218	478
14	10.22	246	203	449
15	9.30	231	187	418
16	10.36	255	206	461
17	8.97	224	179	403
18	8.58	209	175	384
19	6.95	181	139	319
20	6.84	169	133	302
21	4.18	111	83	195
22	4.05	110	83	193
X	12.86	329	219	548
Y	0.06	6	1	7
Total	28.97	7561	6543	14104
Merged	0.14	7565	7379	14944

Provide feedback

Saved searches