Mike Lin edited this page Feb 11, 2020 · 14 revisions

For this exercise, we'll use GLnexus to merge gVCF files representing the ALDH2 locus from the 2019 resequencing of the 1000 Genomes Project phase 3 cohort (N=2,504).

Obtain glnexus_cli

  • static executable: for modern Linux x86-64 hosts, download glnexus_cli from the Releases page and chmod +x glnexus_cli
  • docker image: docker pull the image tag listed on the Releases page
  • from source: see the README instructions on compiling glnexus_cli

Additionally, have tabix and bcftools installed.
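Before continuing, it can help to confirm that the tools are on your PATH. The following is a minimal sketch: find_missing is a hypothetical helper name, and bgzip is included because it is used later in this walkthrough (it is distributed with HTSlib alongside tabix).

```shell
# Print the name of each requested tool that cannot be found on PATH.
# find_missing is a hypothetical helper, not part of GLnexus.
find_missing() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(find_missing tabix bgzip bcftools)
if [ -n "$missing" ]; then
    echo "not found on PATH:" $missing >&2
else
    echo "tabix, bgzip and bcftools are available"
fi
```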

Download example gVCF

In your working directory, download and extract the 2,504 example gVCF files for the ALDH2 locus, excerpted from a whole-genome call set generated using DeepVariant 0.8.0. The download is 93MB.

curl -LSs https://raw.githubusercontent.com/wiki/dnanexus-rnd/GLnexus/data/dv_1000G_ALDH2_gvcf.tar | tar xv
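To confirm the extraction completed, you can count the extracted files. This sketch assumes the archive unpacks into dv_1000G_ALDH2_gvcf/ with files ending in .g.vcf.gz, as implied by the glnexus_cli command in this walkthrough; count_gvcfs is a hypothetical helper name.

```shell
# Count the gVCF files under a directory (hypothetical helper).
count_gvcfs() {
    find "$1" -type f -name '*.g.vcf.gz' | wc -l | tr -d ' '
}

expected=2504   # cohort size stated above
if [ -d dv_1000G_ALDH2_gvcf ]; then
    found=$(count_gvcfs dv_1000G_ALDH2_gvcf)
    if [ "$found" -eq "$expected" ]; then
        echo "OK: $found gVCF files"
    else
        echo "expected $expected gVCF files, found $found" >&2
    fi
fi
```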

Run GLnexus

The glnexus_cli executable takes the gVCF files and a three-column BED file giving the genomic ranges to analyze. For exomes, the BED file might contain the exome capture targets with some padding, while for WGS you can simply list the full-length chromosomes.

echo -e "chr12\t111760000\t111820000" > ALDH2.bed
./glnexus_cli --config DeepVariant --bed ALDH2.bed \
    dv_1000G_ALDH2_gvcf/*.g.vcf.gz > dv_1000G_ALDH2.bcf
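For the WGS case, one way to build a BED file of full-length chromosomes is from a FASTA index (.fai), whose first two columns are the contig name and length. This is a sketch under assumptions: fai_to_bed is a hypothetical helper, and the reference index filename in the usage comment is a placeholder for whatever reference your gVCFs were called against.

```shell
# Convert a FASTA index (.fai) into a three-column BED covering each
# contig end to end. Reads the named file(s), or standard input if none.
fai_to_bed() {
    awk 'BEGIN { FS = OFS = "\t" } { print $1, 0, $2 }' "$@"
}

# Usage (reference.fa.fai is a placeholder filename):
#   fai_to_bed reference.fa.fai > wgs.bed
```

You may want to filter the result down to the primary chromosomes if the reference also lists decoy or alternate contigs.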

It's most efficient to provide the BED file with desired regions if possible, but you can omit it to process all contigs listed in the gVCF headers. For example with Docker:

docker run --rm -i -v "$(pwd)/dv_1000G_ALDH2_gvcf:/in" quay.io/mlin/glnexus:vX.Y.Z \
    bash -c 'glnexus_cli --config DeepVariant /in/*.g.vcf.gz' > dv_1000G_ALDH2.bcf

This should take just a few minutes. glnexus_cli writes an uncompressed, multi-sample BCF stream to its standard output. We can use bcftools and bgzip to convert this BCF to a bgzip-compressed VCF:

bcftools view dv_1000G_ALDH2.bcf | bgzip -@ 4 -c > dv_1000G_ALDH2.vcf.gz

You could put glnexus_cli, bcftools, and bgzip in a shell pipeline to automate the format conversion (but see Performance).
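A sketch of such a pipeline follows, reusing the flags from the commands above. merge_to_vcfgz is a hypothetical wrapper name; the pipefail line is an addition so that a failure in glnexus_cli is not masked by the later stages, and it is skipped on shells that lack the option.

```shell
# Merge gVCFs and write a bgzip-compressed VCF in one pass.
# $1: output .vcf.gz path; remaining arguments: input gVCF files.
# The body runs in a subshell so settings and variables do not leak out.
merge_to_vcfgz() (
    # Enable pipefail only where the shell supports it.
    (set -o pipefail) 2>/dev/null && set -o pipefail
    out=$1
    shift
    ./glnexus_cli --config DeepVariant --bed ALDH2.bed "$@" \
        | bcftools view - \
        | bgzip -@ 4 -c > "$out"
)

# Usage:
#   merge_to_vcfgz dv_1000G_ALDH2.vcf.gz dv_1000G_ALDH2_gvcf/*.g.vcf.gz
```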

To process gVCFs from other variant callers, change the --config flag appropriately; run glnexus_cli -h to list the available configuration presets. The Configuration page discusses customizing them if needed.

glnexus_cli leaves behind a subdirectory GLnexus.DB used for external sorting of the gVCF data. You can delete this directory when you're done; glnexus_cli currently has no way to do anything further with it.

Scaling up

For large projects, GLnexus is designed to keep a powerful server fully utilized, but achieving that requires several of the tuning steps described on the Performance page.

If you have too many gVCFs to enumerate on the command line, you can make a manifest file with one gVCF filename per line. Then pass the filename of this manifest to GLnexus along with the --list flag, instead of the individual gVCF filenames.
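As a minimal sketch, the manifest can be built with find, which avoids expanding a very long glob on a command line. write_manifest and the manifest filename are hypothetical; the glnexus_cli invocation in the usage comment reuses the flags from the example above with --list added.

```shell
# Write one gVCF path per line, sorted, into a manifest file.
# $1: directory containing the gVCFs; $2: manifest file to create.
write_manifest() {
    find "$1" -type f -name '*.g.vcf.gz' | sort > "$2"
}

# Usage (gvcf_manifest.txt is a placeholder name):
#   write_manifest dv_1000G_ALDH2_gvcf gvcf_manifest.txt
#   ./glnexus_cli --config DeepVariant --bed ALDH2.bed \
#       --list gvcf_manifest.txt > dv_1000G_ALDH2.bcf
```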

glnexus_cli does not use tabix indices for the input gVCFs. If you need to process only a few selected genomic ranges, then it may be advantageous to slice your gVCFs beforehand.
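One possible way to slice is with tabix itself. The sketch below assumes each input gVCF already has a tabix index (one can be built with tabix -p vcf), and both helper names are hypothetical. Because BED intervals are 0-based and half-open while tabix region strings are 1-based and inclusive, the start coordinate is incremented by one.

```shell
# Convert BED lines (chrom, start, end) into tabix region strings.
# Reads the named file(s), or standard input if none are given.
bed_to_regions() {
    awk 'BEGIN { FS = "\t" } NF >= 3 { printf "%s:%d-%d\n", $1, $2 + 1, $3 }' "$@"
}

# Extract the BED regions from one indexed gVCF, keeping the header.
# $1: input .g.vcf.gz; $2: output .g.vcf.gz; $3: BED file.
slice_gvcf() {
    # The region list is left unquoted on purpose so that each region
    # becomes a separate argument to tabix.
    tabix -h "$1" $(bed_to_regions "$3") | bgzip -c > "$2"
}

# Usage (sliced/ is a placeholder output directory):
#   mkdir -p sliced
#   for f in dv_1000G_ALDH2_gvcf/*.g.vcf.gz; do
#       slice_gvcf "$f" "sliced/$(basename "$f")" ALDH2.bed
#   done
```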

On DNAnexus

We also have a DNAnexus platform applet to wrap the open-source executable, which we can use for the same exercise. A copy of this applet resides in the public project GLnexus_Getting_Started along with example gVCFs we've generated from the Platinum Genomes BAMs on chromosome 21, using DeepVariant 0.5.1.

Copy these all to your own project, and run the applet like so:

echo -e "chr21\t0\t48129895" | dx upload -o hg19_chr21.bed -
dx run GLnexus -i config=DeepVariant -i bed_ranges_to_genotype=hg19_chr21.bed -i output_name=dv_platinum6_chr21 \
	-i gvcf=dv_platinum6_chr21_gvcf/NA12877.chr21.gvcf.gz \
	-i gvcf=dv_platinum6_chr21_gvcf/NA12878.chr21.gvcf.gz \
	-i gvcf=dv_platinum6_chr21_gvcf/NA12891.chr21.gvcf.gz \
	-i gvcf=dv_platinum6_chr21_gvcf/NA12890.chr21.gvcf.gz \
	-i gvcf=dv_platinum6_chr21_gvcf/NA12889.chr21.gvcf.gz \
	-i gvcf=dv_platinum6_chr21_gvcf/NA12892.chr21.gvcf.gz \
	-y --watch

This will output a bgzip VCF file on job completion.

dx cat dv_platinum6_chr21.vcf.gz | zless
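For a handful of samples, the repeated -i gvcf=... arguments in the dx run command above can also be generated rather than typed out. This is only a sketch: gvcf_inputs is a hypothetical helper, and it assumes the NAxxxxx.chr21.gvcf.gz naming used in this example and paths without whitespace.

```shell
# Print one "-i gvcf=<folder>/<sample>.chr21.gvcf.gz" line per sample ID.
# $1: folder holding the gVCFs; remaining arguments: sample IDs.
gvcf_inputs() {
    folder=$1
    shift
    for sample in "$@"; do
        printf '%s gvcf=%s/%s.chr21.gvcf.gz\n' -i "$folder" "$sample"
    done
}

# Usage (the substitution is unquoted so it splits into separate arguments):
#   dx run GLnexus -i config=DeepVariant -i bed_ranges_to_genotype=hg19_chr21.bed \
#       -i output_name=dv_platinum6_chr21 \
#       $(gvcf_inputs dv_platinum6_chr21_gvcf NA12877 NA12878 NA12889 NA12890 NA12891 NA12892) \
#       -y --watch
```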

If you have too many gVCF files to enumerate on the command line, the applet can instead take a file containing a list of gVCF file IDs:

dx find data --brief --folder dv_platinum6_chr21_gvcf/ | dx upload -o platinum6_gvcfs.txt -
dx run GLnexus -i config=DeepVariant -i bed_ranges_to_genotype=hg19_chr21.bed -i output_name=dv_platinum6_chr21 -i gvcf_manifest=platinum6_gvcfs.txt

To process gVCFs from other variant callers, change the config input appropriately.

The source code for this applet is in this repository under cli/dxapplet. Beyond this little applet wrapping the open-source executable, we have a cloud-native framework for GLnexus enabling giant projects to scale out on many compute nodes, and reprocess incrementally as new samples are sequenced. Contact the DNAnexus science team to discuss such requirements. (The open-source version produces identical scientific results.)