Getting Started
For this exercise, we'll use GLnexus to merge gVCF files representing the ALDH2 locus from the 2019 resequencing of the 1000 Genomes Project phase 3 cohort (N=2,504).
Obtain glnexus_cli in one of the following ways:
- static executable: for modern Linux x86-64 hosts, download glnexus_cli from the Releases page and chmod +x glnexus_cli
- docker image: docker pull the image tag listed on the Releases page
- from source: see the README instructions on compiling glnexus_cli
Additionally, have tabix and bcftools installed.
In your working directory, download and extract the 2,504 example gVCF files for the ALDH2 locus, excerpted from a whole-genome call set generated using DeepVariant 0.8.0. The download is 93MB.
curl -LSs https://raw.githubusercontent.com/wiki/dnanexus-rnd/GLnexus/data/dv_1000G_ALDH2_gvcf.tar | tar xv
The glnexus_cli executable consumes the gVCF files and a three-column BED file giving the genomic ranges to analyze. For exomes, the BED file might contain the exome capture targets with some padding, while for WGS you can just give the full-length chromosomes.
echo -e "chr12\t111760000\t111820000" > ALDH2.bed
./glnexus_cli --config DeepVariant --bed ALDH2.bed \
dv_1000G_ALDH2_gvcf/*.g.vcf.gz > dv_1000G_ALDH2.bcf
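For the WGS case, one way to produce a BED file of full-length chromosomes is to derive it from a FASTA index. The following is only a sketch: the helper name fai_to_bed and the filename GRCh38.fa.fai are assumptions for illustration, and it presumes a samtools-style .fai index whose first two columns are contig name and contig length.

```shell
# Hypothetical helper (not part of GLnexus): convert a samtools-style .fai
# index into a three-column BED spanning each contig from 0 to its length.
fai_to_bed() {
    # $1 = path to the .fai file (column 1 = contig name, column 2 = length)
    awk 'BEGIN { OFS = "\t" } { print $1, 0, $2 }' "$1"
}

# Example usage (GRCh38.fa.fai is an assumed filename):
# fai_to_bed GRCh38.fa.fai > whole_genome.bed
```

You may want to filter the index to the primary chromosomes first, since a reference index often lists many decoy or unplaced contigs as well.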
It's most efficient to provide the BED file with desired regions if possible, but you can omit it to process all contigs listed in the gVCF headers. For example with Docker:
docker run --rm -i -v $(pwd)/dv_1000G_ALDH2_gvcf:/in quay.io/mlin/glnexus:vX.Y.Z \
bash -c 'glnexus_cli --config DeepVariant /in/*.g.vcf.gz' > dv_1000G_ALDH2.bcf
This should take just a few minutes. glnexus_cli emits an uncompressed, multi-sample BCF stream to its standard output. We can use bcftools to convert this BCF to bgzip-compressed VCF:
bcftools view dv_1000G_ALDH2.bcf | bgzip -@ 4 -c > dv_1000G_ALDH2.vcf.gz
You could put glnexus_cli, bcftools, and bgzip in a shell pipeline to automate the format conversion (but see Performance).
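Such a pipeline might look like the sketch below, which chains the same commands used in this exercise. The function name merge_to_vcf_gz and the GLNEXUS_CLI override variable are assumptions added for illustration; they are not part of GLnexus.

```shell
# Sketch: merge gVCFs and write a bgzip-compressed VCF in a single pass,
# without the intermediate BCF file.
# GLNEXUS_CLI is a hypothetical override for the executable's location.
merge_to_vcf_gz() {
    # $1 = BED file, $2 = output .vcf.gz, remaining arguments = gVCF files
    bed="$1"; out="$2"; shift 2
    "${GLNEXUS_CLI:-./glnexus_cli}" --config DeepVariant --bed "$bed" "$@" \
        | bcftools view - \
        | bgzip -@ 4 -c > "$out"
}

# Example usage:
# merge_to_vcf_gz ALDH2.bed dv_1000G_ALDH2.vcf.gz dv_1000G_ALDH2_gvcf/*.g.vcf.gz
```

In this arrangement the downstream compression can become the bottleneck, which is why the Performance page is worth consulting before using it on a large cohort.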
To process gVCFs from other variant callers, change the --config flag appropriately; run glnexus_cli -h to list the available configuration presets. The Configuration page discusses customizing them if needed.
glnexus_cli leaves behind a subdirectory GLnexus.DB used for external sorting of the gVCF data. You can delete this directory when you're done; glnexus_cli currently has no way to do anything further with it.
For large projects GLnexus is designed to utilize a powerful server flat-out, but there are several Performance tuning tricks needed to achieve that.
If you have too many gVCFs to enumerate on the command line, you can make a manifest file with one gVCF filename per line. Then pass the filename of this manifest to GLnexus along with the --list flag, instead of the individual gVCF filenames.
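As a sketch, the manifest can be built with find, which avoids expanding thousands of filenames on a command line. The helper name make_gvcf_manifest and the manifest filename gvcf_manifest.txt are assumptions for illustration.

```shell
# Hypothetical helper: write one gVCF path per line, in sorted order.
make_gvcf_manifest() {
    # $1 = directory holding the gVCFs, $2 = manifest file to write
    find "$1" -name '*.g.vcf.gz' | sort > "$2"
}

# Example usage with the files from this exercise:
# make_gvcf_manifest dv_1000G_ALDH2_gvcf gvcf_manifest.txt
# ./glnexus_cli --config DeepVariant --bed ALDH2.bed --list gvcf_manifest.txt > dv_1000G_ALDH2.bcf
```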
glnexus_cli does not use tabix indices for the input gVCFs. If you need to process only a few selected genomic ranges, then it may be advantageous to slice your gVCFs beforehand.
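One possible way to do the slicing is with tabix itself, assuming each input gVCF is bgzip-compressed and has a .tbi index alongside it. The helper name slice_gvcf and the output filenames below are assumptions for illustration; check that the sliced files still carry everything your variant caller's gVCF convention requires before merging them.

```shell
# Hypothetical helper: keep the header plus only the records overlapping the
# regions in a BED file, and recompress the result with bgzip.
slice_gvcf() {
    # $1 = BED file of regions, $2 = indexed input gVCF, $3 = output gVCF
    tabix -h -R "$1" "$2" | bgzip -c > "$3"
}

# Example usage (the sliced/ directory is an assumed location):
# mkdir -p sliced
# slice_gvcf ALDH2.bed dv_1000G_ALDH2_gvcf/HG00096.g.vcf.gz sliced/HG00096.g.vcf.gz
```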
We also have a DNAnexus platform applet to wrap the open-source executable, which we can use for the same exercise. A copy of this applet resides in the public project GLnexus_Getting_Started along with example gVCFs we've generated from the Platinum Genomes BAMs on chromosome 21, using DeepVariant 0.5.1.
Copy these all to your own project, and run it like so:
echo -e "chr21\t0\t48129895" | dx upload -o hg19_chr21.bed -
dx run GLnexus -i config=DeepVariant -i bed_ranges_to_genotype=hg19_chr21.bed -i output_name=dv_platinum6_chr21 \
-i gvcf=dv_platinum6_chr21_gvcf/NA12877.chr21.gvcf.gz \
-i gvcf=dv_platinum6_chr21_gvcf/NA12878.chr21.gvcf.gz \
-i gvcf=dv_platinum6_chr21_gvcf/NA12891.chr21.gvcf.gz \
-i gvcf=dv_platinum6_chr21_gvcf/NA12890.chr21.gvcf.gz \
-i gvcf=dv_platinum6_chr21_gvcf/NA12889.chr21.gvcf.gz \
-i gvcf=dv_platinum6_chr21_gvcf/NA12892.chr21.gvcf.gz \
-y --watch
This will output a bgzip VCF file on job completion.
dx cat dv_platinum6_chr21.vcf.gz | zless
If you have too many gVCF files to enumerate on the command line, the applet can instead take a file containing a list of gVCF file IDs:
dx find data --brief --folder dv_platinum6_chr21_gvcf/ | dx upload -o platinum6_gvcfs.txt -
dx run GLnexus -i config=DeepVariant -i bed_ranges_to_genotype=hg19_chr21.bed -i output_name=dv_platinum6_chr21 -i gvcf_manifest=platinum6_gvcfs.txt
To process gVCFs from other variant callers, change the config input appropriately.
The source code for this applet is in this repository under cli/dxapplet. Beyond this little applet wrapping the open-source executable, we have a cloud-native framework for GLnexus enabling giant projects to scale out on many compute nodes, and reprocess incrementally as new samples are sequenced. Contact the DNAnexus science team to discuss such requirements. (The open-source version produces identical scientific results.)